## Import libraries

In [14]:
import pandas as pd 
import numpy as np # linear algebra
import matplotlib.pyplot as plt # plotting
import seaborn as sns

from sklearn import preprocessing

from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.tag import pos_tag, pos_tag_sents
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 

import re
import string

## Read in the file

In [10]:
df = pd.read_csv('PubmedAbstracts1519(10000).csv')
df.reset_index(drop=True, inplace=True)

### About this file (Description from Kaggle)
The dataset is abstracts of research articles between 2015 and 2019. 
The search keywords are "tobacco", "alcohol". There are a total of 10 thousand abstracts.
- It contains two columns, document ID and Abstract.

## Preliminary EDA

### General Inspection

#### Look at the top 5 rows

In [17]:
df.head()

Unnamed: 0.1,Unnamed: 0,x
0,1,This case report investigated the transactiona...
1,2,Does it matter what we eat for our mental heal...
2,3,Pubertal timing matters for psychological deve...
3,4,"Plasminogen activator inhibitor 1 (PAI-1), whi..."
4,5,The adolescent developmental stage appears to ...


#### Rename column titles

In [22]:
# Column Titles
df.columns

# Rename columns
df.rename({'Unnamed: 0': 'PaperID', 'x': 'Abstract'}, axis=1, inplace=True)

df.head()

Unnamed: 0,PaperID,Abstract
0,1,This case report investigated the transactiona...
1,2,Does it matter what we eat for our mental heal...
2,3,Pubertal timing matters for psychological deve...
3,4,"Plasminogen activator inhibitor 1 (PAI-1), whi..."
4,5,The adolescent developmental stage appears to ...


#### There are 9,719 observations (each represents one PubMed article)

In [23]:
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')

There are 9719 rows and 2 columns


#### Looking at dataframe datatypes

In [25]:
df.dtypes

PaperID      int64
Abstract    object
dtype: object

#### Look at the first few abstracts to see how the text should be handled

In [26]:
df['Abstract'][0]

"This case report investigated the transactional model of stress and coping as guidance for nursing care of an adolescent patient with thalassemia.A case study of a 15-year-old female patient with β-thalassemia major. Data were collected using patient medical records, an interview with the patient and physical examination.Four issues related to coping were isolated: Worsening physical symptoms; psychosocial consequences, coping process, and building supportive networks. These issues and the patient's adaption are explored via the transactional model.Having thalassemia was cognitively appraised by the patient as a stressful and taxing situation with detrimental consequences, such as changes in physical appearance, stigmatization, and depression. Nurses should evaluate each patient's physical and psychosocial needs, utilizing appropriate theoretical models for designing a suitable care plan. As the case study demonstrates, the transactional model was an effective guide for nurses in plan

This paper appears to be about coping in adolescent patients with a blood disorder

In [27]:
df['Abstract'][1]

'Does it matter what we eat for our mental health? Accumulating data suggests that this may indeed be the case and that diet and nutrition are not only critical for human physiology and body composition, but also have significant effects on mood and mental wellbeing. While the determining factors of mental health are complex, increasing evidence indicates a strong association between a poor diet and the exacerbation of mood disorders, including anxiety and depression, as well as other neuropsychiatric conditions. There are common beliefs about the health effects of certain foods that are not supported by solid evidence and the scientific evidence demonstrating the unequivocal link between nutrition and mental health is only beginning to emerge. Current epidemiological data on nutrition and mental health do not provide information about causality or underlying mechanisms. Future studies should focus on elucidating mechanism. Randomized controlled trials should be of high quality, adequa

This paper appears to be about the effects of diet and nutrition on mental health

In [28]:
df['Abstract'][2]

'Pubertal timing matters for psychological development. Early maturation in girls is linked to risk for depression and externalizing problems in adolescence and possibly adulthood, and early and late maturation in boys are linked to depression. It is unclear whether pubertal timing uniquely predicts problems; it might instead mediate the continuity of behavior problems from childhood to adolescence or create psychological risk specifically in youth with existing problems, thus moderating the link. We investigated these issues in 534 girls and 550 boys, measuring pubertal timing by a logistic model fit to annual self-report measures of development and, in girls, age at menarche. Prepuberty internalizing and externalizing behavior problems were reported by parents. Adolescent behavior problems were reported by parents and youth. As expected, behavior problems were moderately stable. Pubertal timing was not predicted by childhood problems, so it did not mediate the continuity of behavior 

This paper appears to be about early maturation (puberty) and behavioral problems

In [31]:
df['Abstract'][9718]

'Cognitive fatigue is among the most profound and disabling sequelae of pediatric acquired brain disorders, however the neural correlates of these symptoms in children remains unexplored. One hypothesis suggests that cognitive fatigue may arise from dysfunction of cortico-striatal networks (CSNs) implicated in effort output and outcome valuation. Using pediatric traumatic brain injury (TBI) as a model, this study investigated (i) the sub-acute effect of brain injury on CSN\xa0volume; and (ii) potential relationships between cognitive fatigue and sub-acute volumetric abnormalities of the CSN. 3D T1 weighted magnetic resonance imaging sequences were acquired sub-acutely in 137 children (TBI: n\xa0=\xa0103; typically developing - TD\xa0children: n\xa0=\xa034). 67 of the original 137 participants (49%) completed measures of cognitive fatigue and psychological functioning at 24-months post-injury. Results showed that compared to TD controls and children with milder injuries, children with s

This paper appears to be about cognitive fatigue and brain disorders
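The abstracts printed above contain non-breaking spaces (`\xa0`) and sentences that run together with no space after the period (for example "thalassemia.A case study"). The following is a minimal sketch of one possible cleanup, assuming NFKC normalization and whitespace collapsing are acceptable for later tokenization. `normalize_abstract` is a hypothetical helper, not part of the original notebook, and the sentence-splitting rule is a heuristic that may misfire on some abbreviations.

```python
import re
import unicodedata


def normalize_abstract(text):
    """Hypothetical helper: tidy whitespace in a single abstract string."""
    # NFKC maps compatibility characters such as the non-breaking space (\xa0) to plain equivalents
    text = unicodedata.normalize("NFKC", text)
    # Heuristic: add a space where "lowercase letter + period" is directly followed by an uppercase letter
    text = re.sub(r"(?<=[a-z]\.)(?=[A-Z])", " ", text)
    # Collapse any run of whitespace (including \r\n) into a single space
    return re.sub(r"\s+", " ", text).strip()


sample = "CSN\xa0volume was measured.Results were\r\nmixed."
cleaned = normalize_abstract(sample)
print(cleaned)
```

On the real data this could be applied with `df['Abstract'].dropna().apply(normalize_abstract)`, though whether to do so before or after keyword matching is a design choice left open here.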

**Observation: Although these abstracts are supposedly PubMed search results for "tobacco" and "alcohol", they seem only loosely connected to either topic, and none of the four inspected above contains either keyword. Relevance may have declined because 10,000 abstracts were scraped, but even the first few rows of the dataframe appear irrelevant, so the rows may not be in search-result order. This may pose problems for topic modeling.**

### Inspecting content and relevance of abstracts more closely

#### Of the 9,719 abstracts in the dataframe, 726 contain the words "tobacco" or "alcohol"

In [41]:
abstracts_lower = df['Abstract'].str.lower()

In [42]:
abstracts_lower

0       this case report investigated the transactiona...
1       does it matter what we eat for our mental heal...
2       pubertal timing matters for psychological deve...
3       plasminogen activator inhibitor 1 (pai-1), whi...
4       the adolescent developmental stage appears to ...
                              ...                        
9714    several studies have identified associations b...
9715    obsessive-compulsive disorder (ocd) has a chro...
9716    this pilot randomized controlled trial (rct) i...
9717    few studies have explored stress and coping am...
9718    cognitive fatigue is among the most profound a...
Name: Abstract, Length: 9719, dtype: object

In [43]:
alcohol_series = abstracts_lower.str.contains('alcohol', regex=False)

In [44]:
alcohol_series

0       False
1       False
2       False
3       False
4       False
        ...  
9714    False
9715    False
9716    False
9717    False
9718    False
Name: Abstract, Length: 9719, dtype: object

In [45]:
alcohol_series.sum()

670

In [46]:
tobacco_series = abstracts_lower.str.contains('tobacco', regex=False)

In [47]:
tobacco_series.sum()

135

In [48]:
keywords_series = abstracts_lower.str.contains('tobacco|alcohol', regex=True)

In [49]:
len(keywords_series)

9719

In [50]:
keywords_series.sum()

726
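The separate `str.contains` calls above could be folded into one helper. This is a minimal sketch, assuming that missing abstracts should count as non-matches (`na=False`), which also gives a boolean result rather than the `object` dtype seen above. `keyword_mask` and the toy data are hypothetical and only illustrate the idea. As in the notebook, matching is by substring, so "alcohol" also matches "alcoholic".

```python
import re

import pandas as pd


def keyword_mask(abstracts, keywords):
    """Hypothetical helper: True where an abstract contains any keyword (case-insensitive substring)."""
    pattern = "|".join(re.escape(k) for k in keywords)
    # na=False treats missing abstracts as non-matches
    return abstracts.str.contains(pattern, case=False, regex=True, na=False)


# Toy data standing in for df['Abstract']
toy = pd.Series(["Alcohol use in teens", "Sleep and mood", None, "Tobacco and alcohol policy"])
mask = keyword_mask(toy, ["tobacco", "alcohol"])
print(mask.sum())   # number of matching abstracts
print(toy[mask])    # the mask can index the Series directly
```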

#### Inspect a few of the abstracts that actually contain the keywords

In [51]:
keywords_series[keywords_series == True]

8       True
12      True
15      True
43      True
44      True
        ... 
9692    True
9693    True
9694    True
9708    True
9713    True
Name: Abstract, Length: 726, dtype: object

In [52]:
df['Abstract'][8]

'Background and Objectives: Psychological outcomes following termination of wanted pregnancies have not previously been studied. Does excluding such abortions affect estimates of psychological distress following abortion? To address this question this study examines long-term psychological outcomes by pregnancy intention (wanted or unwanted) following induced abortion relative to childbirth in the United States. Materials and Methods: Panel data on a nationally-representative cohort of 3935 ever-pregnant women assessed at mean age of 15, 22, and 28 years were examined from the National Longitudinal Survey of Adolescent to Adult Health (Add Health). Relative risk (RR) and incident rate ratios (IRR) for time-dynamic mental health outcomes, conditioned by pregnancy intention and abortion exposure, were estimated from population-averaged longitudinal logistic and Poisson regression models, with extensive adjustment for sociodemographic differences, pregnancy and mental health history, and 

This abstract is about the psychological effects of terminating wanted pregnancies, including substance abuse disorders

In [53]:
df['Abstract'][12]

'Anxiety disorders in adolescence have been associated with several psychiatric outcomes. We sought to describe the prospective relationship between various levels of adolescent anxiety and psychiatric diagnoses (anxiety-, bipolar/psychotic-, depressive-, and alcohol and drug misuse disorders) and suicidal ideation in early adulthood while adjusting for childhood attention-deficit/hyperactivity disorder (ADHD), autism spectrum disorder (ASD), and developmental coordination disorder (DCD). Furthermore, we aimed to estimate the proportion attributable to the various anxiety levels for the outcomes.We used a nation-wide population-based Swedish twin study comprising 14,106 fifteen-year-old twins born in Sweden between 1994 and 2002 and a replication sample consisting of 9211 Dutch twins, born between 1985 and 1999. Adolescent anxiety was measured with parental and self-report. Psychiatric diagnoses and suicidal ideation were retrieved from the Swedish National Patient Register and via sel

This abstract is about anxiety disorders in adolescents and alcohol and drug misuse.

In [54]:
df['Abstract'][9692]

'Few studies have evaluated exercise interventions for smokers with depression or other psychiatric comorbidities. This pilot study evaluated the potential role of supervised vigorous exercise as a smoking cessation intervention for depressed females.Thirty adult women with moderate-severe depressive symptoms were enrolled and randomly assigned to 12 weeks of thrice weekly, in person sessions of vigorous intensity supervised exercise at a YMCA setting (EX; n = 15) or health education (HE; n = 15). All participants received behavioral smoking cessation counseling and nicotine patch therapy. Assessments were done in person at baseline, at the end of 12 weeks of treatment, and at 6 months post-target quit date. Primary end points were exercise adherence (proportion of 36 sessions attended) and biochemically confirmed 7-day point prevalence abstinence at Week 12. Biomarkers of inflammation were explored for differences between treatment groups and between women who smoked and those abstine

This abstract is about exercise intervention for smokers with depression

#### After subsetting on the abstracts only explicitly containing the keywords, the papers appear to be more relevant, and may lend themselves to topic modeling and a recommender system

### Inspecting abstract lengths of entire dataset (have not yet excluded irrelevant articles)

#### Finding longest, shortest, and mean abstract length

In [56]:
# Longest abstract length - characters
max_abstract_c = df['Abstract'].str.len().max()
# Longest abstract length - words 
max_abstract_w = df['Abstract'].str.split().str.len().max()

# Shortest abstract length - characters
min_abstract_c = df['Abstract'].str.len().min()
# Shortest abstract length - words
min_abstract_w = df['Abstract'].str.split().str.len().min()

# Mean abstract length - characters
mean_abstract_c = df['Abstract'].str.len().mean()
# Mean abstract length - words
mean_abstract_w = df['Abstract'].str.split().str.len().mean()

print("Max abstract length - characters: ", max_abstract_c)
print("Max abstract length - words: ", max_abstract_w)
print("Min abstract length - characters: ", min_abstract_c)
print("Min abstract length - words: ", min_abstract_w)
print("Mean abstract length - characters: ", mean_abstract_c)
print("Mean abstract length - words: ", mean_abstract_w)

Max abstract length - characters:  13897.0
Max abstract length - words:  2084.0
Min abstract length - characters:  5.0
Min abstract length - words:  1.0
Mean abstract length - characters:  1673.9989706639217
Mean abstract length - words:  237.31168296448791


Observation: That's interesting. The mean of about 237 words fits the usual 200 to 250 words for a scientific abstract, but the extremes (1 word and 2,084 words) do not.
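The six statistics above can also be read from a single word-count Series. The sketch below uses toy data and is only an illustration. It also counts missing values, since the float outputs above (for example `5.0`) suggest that at least one abstract is missing, although that is an inference rather than something the notebook checks.

```python
import pandas as pd

# Toy data standing in for df['Abstract']; None mimics a missing abstract
toy = pd.Series(["one two three", "No abstract available.", None, "a b c d e"])

# Compute word counts once and reuse them
word_counts = toy.str.split().str.len()

print(word_counts.describe())                  # count, mean, std, min, quartiles, max
print("Missing abstracts:", toy.isna().sum())
```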

#### Looking at abstracts with 10 words or fewer

In [57]:
mask = (df['Abstract'].str.split().str.len() <= 10)
df.loc[mask]

Unnamed: 0,PaperID,Abstract
2858,2859,Adolescents and millennials show the highest r...
3804,3805,"NCT03022370. Registered 13 January 2017, retro..."
4624,4625,[This corrects the article DOI: 10.1371/journa...
6054,6055,No abstract available.
8335,8336,<p/>.
9149,9150,© Georg Thieme Verlag KG Stuttgart · New York.


In [58]:
df['Abstract'][6054]

'No abstract available.'

In [59]:
df['Abstract'][2858]

'Adolescents and millennials show the highest rates of increase.'

In [90]:
df['Abstract'][3804]

'NCT03022370. Registered 13 January 2017, retrospectively registered.'

In [91]:
df['Abstract'][4624]

'[This corrects the article DOI: 10.1371/journal.pone.0161062.].'

In [92]:
df['Abstract'][8335]

'<p/>.'

In [93]:
df['Abstract'][9149]

'© Georg Thieme Verlag KG Stuttgart · New York.'

One sentence is not a conventional abstract!

We will need to throw these out since they are not useful
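One way to flag such entries is by pattern rather than by length alone. The sketch below is an assumption-laden illustration: the patterns are modeled only on the short entries shown above and are not an exhaustive list, and `is_placeholder` is a hypothetical helper. A one-sentence fragment like the "Adolescents and millennials" entry is not caught by these patterns and would still need a length filter.

```python
import re

# Patterns modeled on the short entries shown above (an assumption, not an exhaustive list)
PLACEHOLDER_PATTERNS = [
    r"^no abstract available",         # 'No abstract available.'
    r"^\[this corrects the article",   # correction notices
    r"^©",                             # copyright-only lines
    r"^<p/>",                          # leftover markup
    r"^nct\d+\.",                      # trial-registration-only entries
]
_placeholder_re = re.compile("|".join(PLACEHOLDER_PATTERNS), flags=re.IGNORECASE)


def is_placeholder(text):
    """Hypothetical helper: True if the text looks like a non-abstract placeholder."""
    return bool(_placeholder_re.match(text.strip()))


examples = [
    "No abstract available.",
    "<p/>.",
    "© Georg Thieme Verlag KG Stuttgart · New York.",
    "NCT03022370. Registered 13 January 2017, retrospectively registered.",
    "Adolescents and millennials show the highest rates of increase.",
]
flags = [is_placeholder(e) for e in examples]
for flag, e in zip(flags, examples):
    print(flag, "-", e)
```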

#### Looking at abstracts with 100 or fewer words

In [65]:
mask = (df['Abstract'].str.split().str.len() <= 100)
df.loc[mask]

Unnamed: 0,PaperID,Abstract
24,25,The mental health of children and young people...
86,87,[This corrects the article DOI: 10.2196/14734....
95,96,Axon guidance molecules direct growing axons t...
98,99,This review examines sex differences in sleep ...
119,120,The impacts of body dissatisfaction have been ...
...,...,...
9149,9150,© Georg Thieme Verlag KG Stuttgart · New York.
9438,9439,The Socio-Economic Panel (SOEP) has outstandin...
9534,9535,Emetophobia is the specific fear of vomiting t...
9557,9558,We don't know which selective serotonin reupta...


In [94]:
df['Abstract'][98]

'This review examines sex differences in sleep disturbance and risk for psychopathology, with a particular focus on the emergence of insomnia and risk for depression among adolescents. Possible explanations for the female preponderance of adolescent insomnia is discussed. The significance of the temporal relationship between adolescent insomnia and depression is discussed, as the extant literature suggests that insomnia tends to precede depression more than the inverse. Whether a causal relationship may exist between the two conditions and possible mechanisms underlying sex differences are highlighted along with important areas for future research.Copyright © 2019 Elsevier Ltd. All rights reserved.'

In [95]:
df['Abstract'][9534]

'Emetophobia is the specific fear of vomiting that usually commences during childhood and adolescence. Cognitive behavioral therapy aims to expose patients to vomiting. In this paper, a newly developed metacognitive concept and treatment approach to this disorder is illustrated within a small case series.Three adolescent girls with emetophobia were treated with metacognitive therapy (MCT). Measures of anxiety, worry, depression, and metacognitions before and after the treatment were documented.All patients recovered during the course of 8 to 11 sessions, and measurements of anxiety, worry, depression, and metacognitions dropped markedly.MCT presents a valuable treatment option for emetophobia in adolescents.'

In [96]:
df['Abstract'][9557]

"We don't know which selective serotonin reuptake inhibitors (SSRIs) are the most effective and safe because no studies have compared these antidepressants with each other. Three SSRI antidepressant medications--fluoxetine, sertraline, and escitalopram--produce modest improvements (about 5% to 10%) in standardized depression scores without a significant increase in the risk of suicide-related outcomes (suicidal behavior or ideation) in adolescent patients with major depression of moderate severity."

In [97]:
df['Abstract'][9706]

"Many African Americans (AAs) use clergy as their primary source of help for depression, with few being referred to mental health providers. This study used face-to-face workshops to train AA clergy to recognize the symptoms and levels of severity of depression. A pretest/posttest format was used to test knowledge (N = 42) about depression symptoms. Results showed that the participation improved the clergy's ability to recognize depression symptoms. Faith community nurses can develop workshops for clergy to improve recognition and treatment of depression. "

**Now, we should either make a new dataframe with abstracts above a minimum length and check how many of the remaining ones contain the keywords,
or make a new dataframe with abstracts containing the keywords and then explore their lengths.**

**Objective: see how many abstracts containing the keywords "tobacco" or "alcohol" are valid abstracts that are long enough for topic modeling**
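Either order gives the same result, because the objective amounts to combining two boolean masks. The following is a minimal sketch on toy data. `valid_keyword_abstracts` is a hypothetical helper, and the `min_words` value used here is a placeholder, not a threshold chosen by the notebook.

```python
import pandas as pd


def valid_keyword_abstracts(df, min_words, pattern="tobacco|alcohol"):
    """Hypothetical helper: rows whose abstract has more than min_words words and matches pattern."""
    # Missing abstracts give NaN word counts, which compare as False
    long_enough = df["Abstract"].str.split().str.len() > min_words
    has_keyword = df["Abstract"].str.contains(pattern, case=False, regex=True, na=False)
    return df.loc[long_enough & has_keyword]


toy_df = pd.DataFrame({
    "PaperID": [1, 2, 3, 4],
    "Abstract": [
        "Alcohol use among adolescents was assessed in a cohort study.",
        "Tobacco.",
        "Sleep quality and mood were measured over twelve months.",
        None,
    ],
})
subset = valid_keyword_abstracts(toy_df, min_words=5)
print(len(subset), "of", len(toy_df), "abstracts pass both filters")
```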

#### Abstracts with 20 words or fewer

In [70]:
mask = (df['Abstract'].str.split().str.len() <= 20)
df.loc[mask]

Unnamed: 0,PaperID,Abstract
2858,2859,Adolescents and millennials show the highest r...
3804,3805,"NCT03022370. Registered 13 January 2017, retro..."
4624,4625,[This corrects the article DOI: 10.1371/journa...
6054,6055,No abstract available.
8335,8336,<p/>.
9149,9150,© Georg Thieme Verlag KG Stuttgart · New York.


All invalid.

#### Abstracts with 30 words or fewer

In [71]:
mask = (df['Abstract'].str.split().str.len() <= 30)
df.loc[mask]

Unnamed: 0,PaperID,Abstract
86,87,[This corrects the article DOI: 10.2196/14734....
1526,1527,The original version of this paper [1] did not...
2858,2859,Adolescents and millennials show the highest r...
3804,3805,"NCT03022370. Registered 13 January 2017, retro..."
4624,4625,[This corrects the article DOI: 10.1371/journa...
6054,6055,No abstract available.
8335,8336,<p/>.
9021,9022,Adolescent girls taking combined oral contrace...
9149,9150,© Georg Thieme Verlag KG Stuttgart · New York.


All invalid.

#### Abstracts with 40 words or fewer

In [72]:
mask = (df['Abstract'].str.split().str.len() <= 40)
df.loc[mask]

Unnamed: 0,PaperID,Abstract
86,87,[This corrects the article DOI: 10.2196/14734....
309,310,[This corrects the article DOI: 10.2196/13628....
1526,1527,The original version of this paper [1] did not...
2858,2859,Adolescents and millennials show the highest r...
3252,3253,"Trichotillomania, or hair-pulling disorder, is..."
3804,3805,"NCT03022370. Registered 13 January 2017, retro..."
4624,4625,[This corrects the article DOI: 10.1371/journa...
6054,6055,No abstract available.
8335,8336,<p/>.
8809,8810,King's College London researchers have called ...


All invalid or irrelevant.

#### Abstracts with 50 words or fewer

In [73]:
mask = (df['Abstract'].str.split().str.len() <= 50)
df.loc[mask]

Unnamed: 0,PaperID,Abstract
24,25,The mental health of children and young people...
86,87,[This corrects the article DOI: 10.2196/14734....
309,310,[This corrects the article DOI: 10.2196/13628....
1526,1527,The original version of this paper [1] did not...
2121,2122,"This study tested the feasibility, acceptabili..."
2858,2859,Adolescents and millennials show the highest r...
3252,3253,"Trichotillomania, or hair-pulling disorder, is..."
3804,3805,"NCT03022370. Registered 13 January 2017, retro..."
4624,4625,[This corrects the article DOI: 10.1371/journa...
5633,5634,Neurofibromatosis type 1 (NF1) is an inherited...


All irrelevant or invalid.

#### Many more abstracts have 60 or fewer words than 50 or fewer

In [77]:
mask = (df['Abstract'].str.split().str.len() <= 60)
df.loc[mask]

Unnamed: 0,PaperID,Abstract
24,25,The mental health of children and young people...
86,87,[This corrects the article DOI: 10.2196/14734....
309,310,[This corrects the article DOI: 10.2196/13628....
771,772,Adolescent depression is a prevalent disorder ...
1526,1527,The original version of this paper [1] did not...
2121,2122,"This study tested the feasibility, acceptabili..."
2858,2859,Adolescents and millennials show the highest r...
3252,3253,"Trichotillomania, or hair-pulling disorder, is..."
3654,3655,We would like to respond to some concerns rais...
3804,3805,"NCT03022370. Registered 13 January 2017, retro..."


In [104]:
df['Abstract'][771]

"Adolescent depression is a prevalent disorder that increases risk for significant functional impairment and suicidality.1-3 Several psychotherapies are available, and it has been widely assumed that failure to complete these therapies will undermine benefit. The important study by O'Keeffe et\xa0al. raises questions about that assumption.4.Copyright © 2019. Published by Elsevier Inc."

In [105]:
df['Abstract'][1526]

'The original version of this paper [1] did not specify that a website was used in the final year of recruitment, in addition to the other stated recruitment methods.'

In [106]:
df['Abstract'][2858]

'Adolescents and millennials show the highest rates of increase.'

In [107]:
df['Abstract'][3654]

'We would like to respond to some concerns raised by Dr.\xa0McClellan in his editorial comment1 on our article that reported the results of a placebo-controlled study of lurasidone for the treatment of children and adolescents with bipolar I depression.2.Copyright © 2018 American Academy of Child and Adolescent Psychiatry. Published by Elsevier Inc. All rights reserved.'

In [108]:
df['Abstract'][5053]

'Bladder cancer is one of the most common cancers among urologic cancers.\r\nIntravesical instillation following transurethral resection of bladder tumor\r\n(TURBT) is used as a treatment of bladder cancer. According to results of this\r\nstudy, before and after intravesical instillations following TURBT have no effect\r\non symptom outcomes of patients with superficial bladder cancer.'

In [109]:
df['Abstract'][5633]

'Neurofibromatosis type 1 (NF1) is an inherited disorder often associated with optic nerve gliomas, low-grade brain tumors, and readily visible signs. Though these features are frequently emphasized, the psychosocial and emotional morbidities are often overlooked. We present a patient with depressive disorder resulting in suicide in a patient with NF1.'

In [110]:
df['Abstract'][5908]

'Psychoneurologic symptoms commonly reported by adolescents and young adults (AYAs) following hematopoetic stem cell transplantation (HSCT) include anxiety, depression, fatigue, and pain. Complementary and alternative medicine (CAM) appeals to AYAs as a means of coping with these symptoms. One example of CAM is a publicly available illness blog authored by a young adult woman undergoing HSCT. \u2029.'

In [111]:
df['Abstract'][8809]

"King's College London researchers have called for a greater focus on the mental health of young people with chronic liver conditions, after a study showed an increased risk of anxiety and depression."

In [112]:
df['Abstract'][8828]

'Care for patients with eating disorders is complex and plurimodal. Care plans need to be adapted in order to take into account the body in crisis. A series of hospital admissions combined with specific psychomotor approaches, can contribute to the patient being reappropriated with their own body.Copyright Â© 2016 Elsevier Masson SAS. All rights reserved.'

In [113]:
df['Abstract'][8939]

'Although research papers show antidepressants have been used as first-line therapies for children and young people, in this article Edward Freshwater examines non-pharmacological therapies and suggests they can be effective when used as preventative interventions. He urges nurses to develop knowledge in lifestyle interventions such as diet and exercise to inform health promotion.'

None of the above are relevant to tobacco or alcohol.
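The repeated threshold cells in this part of the notebook show only the first rows of each table, so they do not reveal how many abstracts fall under each cutoff. A single loop over cutoffs can report those counts directly. The sketch below uses toy data and illustrative thresholds, and is not the notebook's own method.

```python
import pandas as pd

# Toy data standing in for df['Abstract']: abstracts of 1, 8, 15, 25, 25 and 40 words
toy = pd.Series(["w " * n for n in (1, 8, 15, 25, 25, 40)])
word_counts = toy.str.split().str.len()

# Number of abstracts at or below each word-count threshold
thresholds = [10, 20, 30, 40]
counts = {t: int((word_counts <= t).sum()) for t in thresholds}
for t, c in counts.items():
    print(f"<= {t} words: {c}")
```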

#### Abstracts with 70 words or fewer

In [78]:
mask = (df['Abstract'].str.split().str.len() <= 70)
df.loc[mask]

Unnamed: 0,PaperID,Abstract
24,25,The mental health of children and young people...
86,87,[This corrects the article DOI: 10.2196/14734....
309,310,[This corrects the article DOI: 10.2196/13628....
728,729,This article summarizes current knowledge and ...
771,772,Adolescent depression is a prevalent disorder ...
1015,1016,Amenorrhea is one of the clinical consequences...
1526,1527,The original version of this paper [1] did not...
2121,2122,"This study tested the feasibility, acceptabili..."
2858,2859,Adolescents and millennials show the highest r...
3252,3253,"Trichotillomania, or hair-pulling disorder, is..."


#### Abstracts with 80 words or fewer

In [115]:
mask = (df['Abstract'].str.split().str.len() <= 80)
df.loc[mask]

Unnamed: 0,PaperID,Abstract
24,25,The mental health of children and young people...
86,87,[This corrects the article DOI: 10.2196/14734....
309,310,[This corrects the article DOI: 10.2196/13628....
728,729,This article summarizes current knowledge and ...
730,731,Depression risk is 2 to 3 times higher in medi...
771,772,Adolescent depression is a prevalent disorder ...
1015,1016,Amenorrhea is one of the clinical consequences...
1526,1527,The original version of this paper [1] did not...
1939,1940,Major depression is a common illness that seve...
1959,1960,Transcranial direct current stimulation (tDCS)...


#### Abstracts with 90 words or fewer

In [117]:
mask = (df['Abstract'].str.split().str.len() <= 90)
df.loc[mask]

Unnamed: 0,PaperID,Abstract
24,25,The mental health of children and young people...
86,87,[This corrects the article DOI: 10.2196/14734....
309,310,[This corrects the article DOI: 10.2196/13628....
728,729,This article summarizes current knowledge and ...
730,731,Depression risk is 2 to 3 times higher in medi...
771,772,Adolescent depression is a prevalent disorder ...
1015,1016,Amenorrhea is one of the clinical consequences...
1069,1070,Levetiracetam is an antiepileptic agent that i...
1526,1527,The original version of this paper [1] did not...
1580,1581,Psychological treatments for adolescent depres...


We should inspect these and check for relevance
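When many abstracts are inspected in a single notebook cell, only the value of the last bare expression is displayed. A minimal sketch of printing each selected abstract with its index label follows. The toy frame and the index list are stand-ins for `df` and for whichever rows are under review.

```python
import pandas as pd

# Toy frame standing in for df; the index labels are arbitrary
toy_df = pd.DataFrame(
    {"Abstract": ["First short abstract.", "Second short abstract.", "Third short abstract."]},
    index=[24, 86, 309],
)

# Print every selected abstract, not just the last one
lines = []
for idx in [24, 86, 309]:
    line = f"[{idx}] {toy_df.loc[idx, 'Abstract']}"
    lines.append(line)
    print(line)
```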

#### Let's inspect the abstracts above and see if any are valid length/relevant for keeping.

In [180]:
df['Abstract'][24]
df['Abstract'][86]
df['Abstract'][309]
df['Abstract'][728] #valid but irrelevant
df['Abstract'][730] #valid but irrelevant
df['Abstract'][771] #valid but irrelevant
df['Abstract'][1015] #valid but irrelevant
df['Abstract'][1069] #valid but irrelevant
df['Abstract'][1526]
df['Abstract'][1580] #valid but irrelevant
df['Abstract'][1939] #valid but irrelevant
df['Abstract'][1959] #valid but irrelevant
df['Abstract'][1974] #valid but irrelevant
df['Abstract'][2121] #valid but irrelevant
df['Abstract'][2628] #valid but irrelevant
df['Abstract'][2691] #valid but irrelevant
df['Abstract'][2858]
df['Abstract'][2917] #valid but irrelevant
df['Abstract'][3252] #valid but irrelevant
df['Abstract'][3317] #valid but irrelevant
df['Abstract'][3365] #valid but irrelevant
df['Abstract'][3574] #valid but irrelevant
df['Abstract'][3654]
df['Abstract'][3655]
df['Abstract'][3804]
df['Abstract'][3998] #valid but irrelevant
df['Abstract'][4131] #valid but irrelevant
df['Abstract'][4624]
df['Abstract'][4626] #valid but irrelevant
df['Abstract'][4877] #somewhat relevant; mental health issues for college students who may benefit from online depression info
df['Abstract'][4978] #valid but irrelevant
df['Abstract'][5053] #valid but irrelevant
df['Abstract'][5239] #valid but irrelevant
df['Abstract'][5315] #somewhat relevant; grief in adolescents
df['Abstract'][5353] #valid but irrelevant
df['Abstract'][5549] #somewhat relevant; anxiety and depression in youth
df['Abstract'][5633] #valid but irrelevant
df['Abstract'][5816] #somewhat relevant; child and adolescent depression
df['Abstract'][5908]
df['Abstract'][6363] #somewhat relevant; depression and suicide ideation
df['Abstract'][6469] #somewhat relevant; risk management, signs and symptoms of depression in children
df['Abstract'][6913]
df['Abstract'][6957]
df['Abstract'][7581]
df['Abstract'][7606]
df['Abstract'][7676] #somewhat relevant; hopelessness and suicide risk in college students
df['Abstract'][8645]
df['Abstract'][8647]
df['Abstract'][8667] #somewhat relevant; childhood predictors of adult physical morbidity and mortality
df['Abstract'][8809]
df['Abstract'][8828]
df['Abstract'][8915] #valid but irrelevant
df['Abstract'][8939]
df['Abstract'][9021]
df['Abstract'][9149]
df['Abstract'][9557] #valid but irrelevant
df['Abstract'][9706] #valid but irrelevant

"Many African Americans (AAs) use clergy as their primary source of help for depression, with few being referred to mental health providers. This study used face-to-face workshops to train AA clergy to recognize the symptoms and levels of severity of depression. A pretest/posttest format was used to test knowledge (N = 42) about depression symptoms. Results showed that the participation improved the clergy's ability to recognize depression symptoms. Faith community nurses can develop workshops for clergy to improve recognition and treatment of depression. "

#### New dataframe with only abstracts having more than 90 words

In [186]:
# New dataframe with only abstracts having more than 90 words
mask = (df['Abstract'].str.split().str.len() > 90)
above_90w = df.loc[mask]

In [187]:
above_90w

Unnamed: 0,PaperID,Abstract
0,1,This case report investigated the transactiona...
1,2,Does it matter what we eat for our mental heal...
2,3,Pubertal timing matters for psychological deve...
3,4,"Plasminogen activator inhibitor 1 (PAI-1), whi..."
4,5,The adolescent developmental stage appears to ...
...,...,...
9714,9715,Several studies have identified associations b...
9715,9716,Obsessive-compulsive disorder (OCD) has a chro...
9716,9717,This pilot randomized controlled trial (RCT) i...
9717,9718,Few studies have explored stress and coping am...


#### Inspect some abstracts that have 91-100 words

In [188]:
# Show the abstracts with 91-100 words (above_90w already excludes those with 90 or fewer)
mask = (above_90w['Abstract'].str.split().str.len() <= 100)
above_90w.loc[mask]

Unnamed: 0,PaperID,Abstract
95,96,Axon guidance molecules direct growing axons t...
98,99,This review examines sex differences in sleep ...
119,120,The impacts of body dissatisfaction have been ...
627,628,Depression is the number one cause of disabili...
678,679,Adverse childhood experiences (ACE) are associ...
734,735,There are special considerations when treating...
1009,1010,This cross-sectional German study examined the...
1424,1425,Transdiagnostic approach has a long history in...
1823,1824,Postpartum depression (PPD) is a common and di...
1857,1858,Depression is one of the most frequent mood di...


On a quick scan, these all appear to be valid abstracts.
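
The checks in this section filter one word-count cutoff at a time. As a rough sketch, the same overview can be obtained in a single pass by binning the counts with `pd.cut`. The toy `demo` frame and the bin edges are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

# Toy stand-in for above_90w (hypothetical data with the same two columns)
demo = pd.DataFrame({
    "PaperID": [1, 2, 3, 4],
    "Abstract": ["word " * 95, "word " * 120, "word " * 260, "word " * 450],
})

# Count whitespace-separated tokens once and reuse the result
word_counts = demo["Abstract"].str.split().str.len()

# Illustrative bin edges; intervals are right-inclusive, e.g. (90, 100]
bins = [90, 100, 250, 300, 400, 600, 1000, float("inf")]
length_bins = pd.cut(word_counts, bins=bins)

# Number of abstracts per length band, in bin order
bin_counts = length_bins.value_counts().sort_index()
print(bin_counts)
```

On the real data, `demo` would be replaced by `above_90w`, and the edges could be adjusted to whichever cutoffs are of interest.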

#### Inspect abstracts that have more than 250 words

In [190]:
# Now see how many have more than 250 words
mask = (above_90w['Abstract'].str.split().str.len() > 250)
above_90w.loc[mask]

Unnamed: 0,PaperID,Abstract
5,6,Adolescence is a neuroplastic period for self-...
7,8,To assess the prevalence of psychiatric disord...
8,9,Background and Objectives: Psychological outco...
11,12,Depression represents a major public health co...
13,14,Minimal/non-response to antipsychotic treatmen...
...,...,...
9705,9706,"Collaborative, nurse-led care is a well-establ..."
9709,9710,Does mode of conception [spontaneous/after inf...
9712,9713,Depression is one of the most common mental he...
9714,9715,Several studies have identified associations b...


#### Inspect abstracts that have more than 300 words

In [191]:
# Now see how many have more than 300 words
mask = (above_90w['Abstract'].str.split().str.len() > 300)
above_90w.loc[mask]

Unnamed: 0,PaperID,Abstract
8,9,Background and Objectives: Psychological outco...
26,27,Antenatal maternal psychological distress is c...
30,31,Background: Depression is common among women i...
34,35,"Over the past decade, more and more children a..."
44,45,To measure health-related behaviors and risk f...
...,...,...
9694,9695,Pregnancy is a strong motivator to quit smokin...
9699,9700,Patients with affective disorders of different...
9701,9702,"As data on the phenotype, characteristics and ..."
9705,9706,"Collaborative, nurse-led care is a well-establ..."


In [192]:
above_90w['Abstract'][9709]

"Does mode of conception [spontaneous/after infertility treatment (IT)], type of pregnancy (singleton/twin) and parent gender have an effect on anxiety and depression levels and trajectories during pregnancy and the post-partum period?Conception after IT was associated with a transitory increase in anxiety during the perinatal period for parents of singletons, while for IT parents of twins higher levels of psychopathological symptoms tended to persist during pregnancy and the post-partum period.Most previous studies have shown that successful IT is not associated with poor psychological well-being during pregnancy and the post-partum period, but there is also some evidence for heightened pregnancy-related anxiety, lower self-esteem and lower self-efficacy. Parents of twins experience increased postnatal anxiety and depression.This prospective longitudinal study assessed 267 couples (N\xa0=\xa0534) at each trimester of pregnancy, after childbirth and at 3 months post-partum.The sample c

In [193]:
above_90w['Abstract'][9694]

'Pregnancy is a strong motivator to quit smoking, yet postpartum relapse rates are high. Growing evidence suggests a role of sex hormones in drug abuse behavior and given the precipitous drop in sex hormones at delivery, they may play a role in postpartum relapse. This pilot study evaluates the feasibility and potential role of exogenous progesterone in postpartum smoking relapse.This 12-week double-blind placebo-controlled randomized pilot trial randomized 46 abstinent postpartum women to active progesterone (PRO; 200mg twice a day) versus placebo (PBO) for 4 weeks. Participants were followed for relapse for 12 weeks. Main study outcomes include abstinence (point prevalence), feasibility (compliance per number of clinic visits attended, pill counts and Electronic Data Capture [EDC] completed) and self-reported acceptability. Safety was also measured by depressive symptom scores, adverse events, and breastfeeding.Overall retention rate was 87% at week 12. At week 4, abstinence rates we

In [194]:
above_90w['Abstract'][8]

'Background and Objectives: Psychological outcomes following termination of wanted pregnancies have not previously been studied. Does excluding such abortions affect estimates of psychological distress following abortion? To address this question this study examines long-term psychological outcomes by pregnancy intention (wanted or unwanted) following induced abortion relative to childbirth in the United States. Materials and Methods: Panel data on a nationally-representative cohort of 3935 ever-pregnant women assessed at mean age of 15, 22, and 28 years were examined from the National Longitudinal Survey of Adolescent to Adult Health (Add Health). Relative risk (RR) and incident rate ratios (IRR) for time-dynamic mental health outcomes, conditioned by pregnancy intention and abortion exposure, were estimated from population-averaged longitudinal logistic and Poisson regression models, with extensive adjustment for sociodemographic differences, pregnancy and mental health history, and 

These are long but still appear to be valid, single-paragraph abstracts.
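
The printed abstracts above contain non-breaking spaces (`\xa0`) and sentences fused without a space (for example `relapse.This`). The following is a minimal sketch of one possible clean-up, assuming a regex heuristic is acceptable; `normalize_abstract` is a hypothetical helper, not part of the original notebook, and the sentence-splitting rule can misfire on abbreviations such as `Ph.D`.

```python
import re

def normalize_abstract(text: str) -> str:
    """Collapse Unicode whitespace and re-space fused sentences (heuristic)."""
    # For str patterns, \s also matches Unicode whitespace such as \xa0 and \u202f
    text = re.sub(r"\s+", " ", text).strip()
    # Insert a space where a lowercase letter plus sentence punctuation
    # is immediately followed by an uppercase letter, e.g. "relapse.This"
    return re.sub(r"(?<=[a-z][.?!])(?=[A-Z])", " ", text)

sample = "post-partum period?Conception after IT (N\xa0=\xa0534) relapse.This 12-week trial"
cleaned = normalize_abstract(sample)
print(cleaned)
```

It could be applied to the whole column with `above_90w['Abstract'].map(normalize_abstract)`.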

#### Inspect abstracts that have greater than 400 words

In [195]:
# Now see how many have more than 400 words
mask = (above_90w['Abstract'].str.split().str.len() > 400)
above_90w.loc[mask]

Unnamed: 0,PaperID,Abstract
69,70,Childhood stress exposure is associated with i...
74,75,Mood and psychotic syndromes most often emerge...
84,85,An increasing prevalence of adult attention-de...
139,140,Bipolar disorder is a severe and common mental...
170,171,National longitudinal studies that examine the...
...,...,...
9577,9578,Many mental health disorders emerge in late ch...
9614,9615,Mental health of migrant populations has becom...
9664,9665,"With increasing age at pregnancy, the likeliho..."
9671,9672,Depression disorder may become the first cause...


In [200]:
# Only the last expression in the cell is displayed
above_90w['Abstract'][9709]
above_90w['Abstract'][84]  # Contains stray characters and nonsense words, so it should be removed
above_90w['Abstract'][9577]  # Also contains stray characters and nonsense words; some text appears to be corrupted
above_90w['Abstract'][170]


# Note: regex could probably be used to remove text enclosed in parentheses, since it looks unnecessary for this analysis

'National longitudinal studies that examine the linkages between early family experiences and sex-specific development of depression across the life course are lacking despite the urgent need for interventions in family settings to prevent adult depression.To examine whether positive adolescent family relationships are associated with reduced depressive symptoms among women and men as they enter midlife.This study analyzed data from the National Longitudinal Study of Adolescent to Adult Health, which used a multistage, stratified school-based design to select a prospective cohort of 20\u202f745 adolescents in grades 7 to 12 from January 3, 1994, to December 26, 1995 (wave 1). Respondents were followed up during 4 additional waves from April 14 to September 9, 1996 (wave 2); April 2, 2001, to May 9, 2002 (wave 3); April 3, 2007, to February 1, 2009 (wave 4); and March 3, 2016, to May 8, 2017 (sample 1, wave 5), when the cohort was aged 32 to 42 years. The study sample of 8952 male adole
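
The note about stripping parenthesised text could be sketched as follows. This is an assumption about how the clean-up might be done, not the notebook's actual method. The pattern handles only non-nested round brackets and leaves square brackets untouched. The toy sentences are adapted from an abstract shown earlier.

```python
import pandas as pd

# Matches one non-nested "(...)" group plus any whitespace directly before it
PAREN_PATTERN = r"\s*\([^()]*\)"

demo = pd.Series([
    "Many African Americans (AAs) use clergy as their primary source of help.",
    "A pretest/posttest format was used to test knowledge (N = 42) about depression.",
])

no_parens = demo.str.replace(PAREN_PATTERN, "", regex=True)
print(no_parens.tolist())
```

Some parentheticals carry real information, such as dosages or sample sizes, so whether to drop them all is a judgment call for the downstream topic model.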

#### Inspect abstracts that have greater than 600 words

In [201]:
# Now see how many have more than 600 words
mask = (above_90w['Abstract'].str.split().str.len() > 600)
above_90w.loc[mask]

Unnamed: 0,PaperID,Abstract
74,75,Mood and psychotic syndromes most often emerge...
139,140,Bipolar disorder is a severe and common mental...
220,221,"Depression is usually managed in primary care,..."
594,595,Substance use disorders (SUDs) can have a deva...
633,634,Previous research has documented differences i...
865,866,The term “ptosis” is derived from the Greek wo...
898,899,This is the first update of a review published...
984,985,Psychological therapies for parents of childre...
1308,1309,Anaemia is a condition in which the number of ...
1376,1377,Clinical guidelines recommend outpatient care ...


In [205]:
# Randomly inspect some
above_90w['Abstract'][9124]
above_90w['Abstract'][1308]
above_90w['Abstract'][3838] #This looks like a long literature review
above_90w['Abstract'][9232]

'Poor medication adherence contributes to negative treatment response, symptom relapse, and hospitalizations in schizophrenia. Many health plans use claims-based measures like medication possession ratios or proportion of days covered (PDC) to measure patient adherence to antipsychotics. Classifying patients solely on the basis of a single average PDC measure, however, may mask clinically meaningful variations over time in how patients arrive at an average PDC level.To model patterns of medication adherence evolving over time for patients with schizophrenia who initiated treatment with an oral atypical antipsychotic and, based on these patterns, to identify groups of patients with different adherence behaviors.We analyzed health insurance claims for patients aged ≥ 18 years with schizophrenia and newly prescribed oral atypical antipsychotics in 2007-2013 from 3 U.S. insurance claims databases: Truven MarketScan (Medicaid and commercial) and Humana (Medicare). Group-based trajectory mod

#### Inspect abstracts that have greater than 1000 words

In [206]:
# Inspect abstracts that have greater than 1000 words
mask = (above_90w['Abstract'].str.split().str.len() > 1000)
above_90w.loc[mask]

Unnamed: 0,PaperID,Abstract
74,75,Mood and psychotic syndromes most often emerge...
865,866,The term “ptosis” is derived from the Greek wo...
2059,2060,L'objectif est de guider les femmes enceintes ...
2172,2173,The objective is to provide guidance for pregn...
2392,2393,"Poverty has significant, detrimental, and long..."
3838,3839,Scoliosis is a lateral curvature of the spine ...


In [209]:
above_90w['Abstract'][2059]  # This abstract is in French, so it should be deleted
above_90w['Abstract'][865]  # Far too long; it would carry too much information for topic modelling
above_90w['Abstract'][2172]

'The objective is to provide guidance for pregnant women, and obstetric care and exercise professionals, on prenatal physical activity.The outcomes evaluated were maternal, fetal, or neonatal morbidity or fetal mortality during and following pregnancy.Literature was retrieved through searches of Medline, EMBASE, PsycINFO, Cochrane Database of Systematic Reviews, Cochrane Central Register of Controlled Trials, Scopus and Web of Science Core Collection, CINAHL Plus with Full-text, Child Development & Adolescent Studies, ERIC, Sport Discus, ClinicalTrials.gov, and the Trip Database from database inception up to January 6, 2017. Primary studies of any design were eligible, except case studies. Results were limited to English, Spanish, or French language materials. Articles related to maternal physical activity during pregnancy reporting on maternal, fetal, or neonatal morbidity or fetal mortality were eligible for inclusion. The quality of evidence was rated using the Grading of Recommenda
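
Non-English abstracts such as the French one above could be flagged automatically rather than by eye. The sketch below uses a crude heuristic, the share of tokens that are common English function words. The word list, the helper name `english_word_ratio`, and the `0.1` cutoff are all illustrative assumptions; a dedicated language-detection library would likely be more robust.

```python
import re

# Small illustrative set of English function words (an assumption, not exhaustive)
ENGLISH_FUNCTION_WORDS = {
    "the", "of", "and", "to", "in", "is", "for",
    "with", "that", "was", "were", "by", "this",
}

def english_word_ratio(text: str) -> float:
    """Fraction of alphabetic tokens that are common English function words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(token in ENGLISH_FUNCTION_WORDS for token in tokens)
    return hits / len(tokens)

french = "L'objectif est de guider les femmes enceintes"
english = "The objective is to provide guidance for pregnant women"

french_ratio = english_word_ratio(french)
english_ratio = english_word_ratio(english)

# Hypothetical cutoff: abstracts below it are candidates for manual review
MIN_ENGLISH_RATIO = 0.1
print(french_ratio < MIN_ENGLISH_RATIO, english_ratio >= MIN_ENGLISH_RATIO)
```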

#### Inspect how many abstracts with more than 90 words contain the keywords tobacco or alcohol

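In [None]:
# Lowercase once so that keyword matching is case-insensitive;
# this must be defined before the keyword checks below can run
abstracts_lower = above_90w['Abstract'].str.lower()

alcohol_series = abstracts_lower.str.contains('alcohol', regex=False)
tobacco_series = abstracts_lower.str.contains('tobacco', regex=False)
keywords_series = abstracts_lower.str.contains('tobacco|alcohol', regex=True)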

In [211]:
abstracts_lower = above_90w['Abstract'].str.lower()

In [219]:
alcohol_series = abstracts_lower.str.contains('alcohol', regex=False)
print(f'Num abstracts containing alcohol keyword: {alcohol_series.sum()}')

Num abstracts containing alcohol keyword: 670


In [233]:
tobacco_series = abstracts_lower.str.contains('tobacco', regex=False)
print(f'Num abstracts containing tobacco keyword: {tobacco_series.sum()}')

Num abstracts containing tobacco keyword: 135


In [225]:
keywords_series = abstracts_lower.str.contains('alcohol|tobacco', regex=True)

In [226]:
keywords_series.sum()

726

In [224]:
670 + 135

805

In [227]:
mask = above_90w['Abstract'].str.lower().str.contains('tobacco', regex=False)
above_90w[mask]

Unnamed: 0,PaperID,Abstract
15,16,"Cannabis use patterns vary considerably, with ..."
43,44,"To date, studies have highlighted cross-sectio..."
161,162,"In the United States (US), rates of teenage pr..."
235,236,The quality of the mother-child relationship i...
322,323,Perinatal depression affects 21-50% of women i...
...,...,...
9601,9602,Perinatal smoking is associated with a wide ra...
9692,9693,Few studies have evaluated exercise interventi...
9693,9694,American Indians and Alaska Natives (AI/AN) ha...
9694,9695,Pregnancy is a strong motivator to quit smokin...


In [228]:
mask = above_90w['Abstract'].str.lower().str.contains('alcohol', regex=False)
above_90w[mask]

Unnamed: 0,PaperID,Abstract
8,9,Background and Objectives: Psychological outco...
12,13,Anxiety disorders in adolescence have been ass...
15,16,"Cannabis use patterns vary considerably, with ..."
43,44,"To date, studies have highlighted cross-sectio..."
44,45,To measure health-related behaviors and risk f...
...,...,...
9627,9628,Previous research has shown racial/ethnic diff...
9635,9636,A young refugee woman attended antenatal clini...
9654,9655,To investigate whether mental health services ...
9674,9675,Desvenlafaxine is used to treat major depressi...


In [232]:
mask = above_90w['Abstract'].str.lower().str.contains('tobacco|alcohol')
above_90w[mask]

Unnamed: 0,PaperID,Abstract
8,9,Background and Objectives: Psychological outco...
12,13,Anxiety disorders in adolescence have been ass...
15,16,"Cannabis use patterns vary considerably, with ..."
43,44,"To date, studies have highlighted cross-sectio..."
44,45,To measure health-related behaviors and risk f...
...,...,...
9692,9693,Few studies have evaluated exercise interventi...
9693,9694,American Indians and Alaska Natives (AI/AN) ha...
9694,9695,Pregnancy is a strong motivator to quit smokin...
9708,9709,Depressive symptoms and drinking to cope with ...


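When I use regex to check for alcohol OR tobacco, I get 726 results. However, when I check for tobacco and alcohol separately, I get 135 and 670, which sum to 805. The difference of 79 corresponds to abstracts that mention both keywords and are therefore counted twice in the sum.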
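
One way to reconcile the two totals is to count the abstracts that match both keywords and apply inclusion-exclusion. The sketch below does this on a toy series; `demo_lower` is a hypothetical stand-in for `abstracts_lower`.

```python
import pandas as pd

# Toy stand-in for abstracts_lower (hypothetical data)
demo_lower = pd.Series([
    "alcohol use among adolescents",
    "tobacco and alcohol co-use",
    "tobacco cessation in pregnancy",
    "sleep and depression",
])

has_alcohol = demo_lower.str.contains("alcohol", regex=False)
has_tobacco = demo_lower.str.contains("tobacco", regex=False)

n_either = (has_alcohol | has_tobacco).sum()  # matches at least one keyword
n_both = (has_alcohol & has_tobacco).sum()    # matches both keywords

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
assert n_either == has_alcohol.sum() + has_tobacco.sum() - n_both
print(n_either, n_both)
```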

In [235]:
# Dataframe with abstracts containing 'tobacco' keyword
mask = above_90w['Abstract'].str.lower().str.contains('tobacco', regex=False)
above_90w_tobacco = above_90w[mask]
above_90w_tobacco

Unnamed: 0,PaperID,Abstract
15,16,"Cannabis use patterns vary considerably, with ..."
43,44,"To date, studies have highlighted cross-sectio..."
161,162,"In the United States (US), rates of teenage pr..."
235,236,The quality of the mother-child relationship i...
322,323,Perinatal depression affects 21-50% of women i...
...,...,...
9601,9602,Perinatal smoking is associated with a wide ra...
9692,9693,Few studies have evaluated exercise interventi...
9693,9694,American Indians and Alaska Natives (AI/AN) ha...
9694,9695,Pregnancy is a strong motivator to quit smokin...


In [236]:
# Dataframe with abstracts containing 'alcohol' keyword
mask = above_90w['Abstract'].str.lower().str.contains('alcohol', regex=False)
above_90w_alcohol = above_90w[mask]
above_90w_alcohol

Unnamed: 0,PaperID,Abstract
8,9,Background and Objectives: Psychological outco...
12,13,Anxiety disorders in adolescence have been ass...
15,16,"Cannabis use patterns vary considerably, with ..."
43,44,"To date, studies have highlighted cross-sectio..."
44,45,To measure health-related behaviors and risk f...
...,...,...
9627,9628,Previous research has shown racial/ethnic diff...
9635,9636,A young refugee woman attended antenatal clini...
9654,9655,To investigate whether mental health services ...
9674,9675,Desvenlafaxine is used to treat major depressi...


### Concatenate tobacco and alcohol dataframes

In [242]:
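# Note: abstracts containing both keywords are in both frames, so they appear twice after concatenation
above_90_keywords = pd.concat([above_90w_tobacco, above_90w_alcohol], join="inner")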

In [247]:
above_90_keywords

Unnamed: 0,PaperID,Abstract
15,16,"Cannabis use patterns vary considerably, with ..."
43,44,"To date, studies have highlighted cross-sectio..."
161,162,"In the United States (US), rates of teenage pr..."
235,236,The quality of the mother-child relationship i...
322,323,Perinatal depression affects 21-50% of women i...
...,...,...
9627,9628,Previous research has shown racial/ethnic diff...
9635,9636,A young refugee woman attended antenatal clini...
9654,9655,To investigate whether mental health services ...
9674,9675,Desvenlafaxine is used to treat major depressi...
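
Because an abstract can mention both keywords, concatenating the two filtered frames repeats those rows. Two possible ways to avoid the repeats are sketched below on toy data: dropping duplicates on `PaperID` after concatenating, or filtering once with a combined pattern. The `demo` frame is hypothetical, and applying either option to the real data would change the row count and the summary statistics computed afterwards.

```python
import pandas as pd

# Toy stand-in for above_90w (hypothetical data)
demo = pd.DataFrame({
    "PaperID": [1, 2, 3, 4],
    "Abstract": [
        "Alcohol use among adolescents",
        "Tobacco and alcohol co-use",
        "Tobacco cessation in pregnancy",
        "Sleep and depression",
    ],
})
lower = demo["Abstract"].str.lower()

tobacco_df = demo[lower.str.contains("tobacco", regex=False)]
alcohol_df = demo[lower.str.contains("alcohol", regex=False)]

# Plain concatenation repeats PaperID 2, which matches both keywords
concatenated = pd.concat([tobacco_df, alcohol_df])

# Option 1: drop the repeated rows after concatenating
deduplicated = concatenated.drop_duplicates(subset="PaperID")

# Option 2: filter once with a pattern that matches either keyword
either = demo[lower.str.contains("tobacco|alcohol", regex=True)]

print(len(concatenated), len(deduplicated), len(either))
```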


### Inspect abstract word counts now that only abstracts containing the keywords remain

In [248]:
# Word count of each abstract, computed once and reused below
word_counts = above_90_keywords['Abstract'].str.split().str.len()

# Longest abstract length - words
max_abstract_w = word_counts.max()

# Shortest abstract length - words
min_abstract_w = word_counts.min()

# Mean abstract length - words
mean_abstract_w = word_counts.mean()

print("Max abstract length - words: ", max_abstract_w)
print("Min abstract length - words: ", min_abstract_w)
print("Mean abstract length - words: ", mean_abstract_w)

Max abstract length - words:  1392
Min abstract length - words:  101
Mean abstract length - words:  256.15403726708075


In [251]:
# Now see how many have more than 1000 words
mask = (above_90_keywords['Abstract'].str.split().str.len() > 1000)
above_90_keywords.loc[mask]

Unnamed: 0,PaperID,Abstract
74,75,Mood and psychotic syndromes most often emerge...


In [252]:
above_90_keywords['Abstract'][74] #relevant but really long

"Mood and psychotic syndromes most often emerge during adolescence and young adulthood, a period characterised by major physical and social change. Consequently, the effects of adolescent-onset mood and psychotic syndromes can have long term consequences. A key clinical challenge for youth mental health is to develop and test new systems that align with current evidence for comorbid presentations and underlying neurobiology, and are useful for predicting outcomes and guiding decisions regarding the provision of appropriate and effective care. Our highly personalised and measurement-based care model includes three core concepts: ▶ A multidimensional assessment and outcomes framework that includes: social and occupational function; self-harm, suicidal thoughts and behaviour; alcohol or other substance misuse; physical health; and illness trajectory. ▶ Clinical stage. ▶ Three common illness subtypes (psychosis, anxious depression, bipolar spectrum) based on proposed pathophysiological mec

In [254]:
# Now see how many have more than 800 words
mask = (above_90_keywords['Abstract'].str.split().str.len() > 800)
above_90_keywords.loc[mask]

Unnamed: 0,PaperID,Abstract
74,75,Mood and psychotic syndromes most often emerge...
4729,4730,Preconception health is a broad term that enco...
8661,8662,Automated telephone communication systems (ATC...


In [256]:
above_90_keywords['Abstract'][4729] #relevant but really long

"Preconception health is a broad term that encompasses the overall health of nonpregnant women during their reproductive years (defined here as aged 18-44 years). Improvement of both birth outcomes and the woman's health occurs when preconception health is optimized. Improving preconception health before and between pregnancies is critical for reducing maternal and infant mortality and pregnancy-related complications. The National Preconception Health and Health Care Initiative's Surveillance and Research work group suggests ten prioritized indicators that states can use to monitor programs or activities for improving the preconception health status of women of reproductive age. This report includes overall and stratified estimates for nine of these preconception health indicators.2013-2015.Survey data from two surveillance systems are included in this report. The Behavioral Risk Factor Surveillance System (BRFSS) is an ongoing state-based, landline and cellular telephone survey of non

### Save dataframe with abstracts containing keywords and more than 90 words to CSV

In [257]:
above_90_keywords.to_csv('abstracts_keywords.csv')
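
By default `to_csv` also writes the index, which comes back as an `Unnamed: 0` column when the file is read again, much like the unnamed columns seen when the source file was first loaded. The sketch below shows the difference that `index=False` makes, using an in-memory buffer and a hypothetical two-row frame in place of the real file.

```python
import io
import pandas as pd

# Toy frame with a non-default index, as left behind by filtering (hypothetical data)
demo = pd.DataFrame(
    {"PaperID": [16, 44], "Abstract": ["first abstract", "second abstract"]},
    index=[15, 43],
)

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the named columns
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

print(cols_with_index)
print(reloaded.columns.tolist())
```

Whether to keep the index depends on whether the original row positions are needed later; `PaperID` already identifies each abstract.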