Plan of attack: 

-Subset the demographics dataset by: 
Gender (Male/Female)
Race 

-Merge the demographics data and the START FEIS data by patient ID # 
-Clean the data so only relevant columns are left (demographic data + family input)

We first plan to look at the responses about available services and client mental health. Because these answers are given on a scale, we can convert them to numeric scores and use those scores to quantify the quality of care each demographic subset received.
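As a rough sketch of the scale-to-number step, the snippet below maps response labels to ordinal scores with pandas. The label order and the 0-3 scores are assumptions for illustration ("Some" in particular is a guess at a middle category), and "Did not know/answer" is treated as missing rather than as a low score. The column name `night_weekend_help` and the toy rows are made up.

```python
import pandas as pd

# Assumed ordering of the response scale; adjust to the survey's real categories.
SCALE = {
    "None at all": 0,
    "Very little": 1,
    "Some": 2,  # assumed middle category
    "All that was wanted/needed": 3,
}

def scale_to_score(responses: pd.Series) -> pd.Series:
    """Map scale labels to numbers; unmapped labels (e.g. 'Did not know/answer') become NaN."""
    return responses.map(SCALE)

# Toy data standing in for one scale question in the merged dataset.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male"],
    "night_weekend_help": ["None at all", "All that was wanted/needed",
                           "Very little", "Did not know/answer"],
})
toy["help_score"] = scale_to_score(toy["night_weekend_help"])

# Group means skip NaN, so non-answers do not pull a group's score down.
mean_by_gender = toy.groupby("Gender")["help_score"].mean()
print(mean_by_gender)
```

Keeping non-answers as missing is a design choice; scoring them as 0 would mix "no help" with "no information".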

We then plan to conduct topic modeling on the column in which families describe services that were needed but unavailable, in order to find the most commonly requested kinds of care that START did not provide. 

We also plan to conduct topic modeling and sentiment analysis on the column in which families offer advice to service planners, in order to form a rough idea of the quality of care and how it may vary across demographic groups. We are also interested in whether the sentiment scores of these responses trend in a specific direction, which could indicate bias in who actually responded to the survey.
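A minimal sketch of the per-group sentiment comparison follows. The helper `mean_sentiment_by_group` and the toy lexicon scorer are stand-ins invented for this example so that it runs without extra installs. In the notebook, the scorer could instead be a function that returns VADER's `polarity_scores(text)["compound"]` from a single `SentimentIntensityAnalyzer` instance.

```python
import pandas as pd

# Toy stand-in lexicon, NOT VADER: used only so the sketch is self-contained.
TOY_LEXICON = {"supported": 1.0, "helpful": 1.0,
               "overwhelming": -1.0, "uninformed": -1.0}

def toy_scorer(text: str) -> float:
    """Average the lexicon values of the words found in text (0.0 if none match)."""
    hits = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def mean_sentiment_by_group(df, text_col, group_cols, scorer):
    """Score each non-missing response, then average the scores within each group."""
    scored = df.dropna(subset=[text_col]).copy()
    scored["sentiment"] = scored[text_col].map(scorer)
    return scored.groupby(group_cols)["sentiment"].agg(["mean", "count"])

# Made-up rows; the real input would be the merged demographics + FEIS data.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male"],
    "Advice": ["I feel supported by the team", "It is overwhelming",
               "Very helpful staff", None],
})
summary = mean_sentiment_by_group(toy, "Advice", "Gender", toy_scorer)
print(summary)
```

Reporting the response count next to the mean matters for the bias question: a group mean built on a handful of answers says more about who responded than about the care received.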


In [97]:
# Importing modules
## helpful packages
import pandas as pd
import numpy as np
import random
import re

## nltk imports
import nltk
### uncomment and run these lines if you haven't downloaded relevant nltk add-ons yet
#nltk.download('averaged_perceptron_tagger')
#nltk.download('stopwords')
#nltk.download('punkt')  # tokenizer models needed by word_tokenize
from nltk import pos_tag
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

## spacy imports
import spacy
### uncomment and run the below line if you haven't loaded the en_core_web_sm library yet
#! python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

## vectorizer
from sklearn.feature_extraction.text import CountVectorizer

## sentiment
#!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## lda
from gensim import corpora
import gensim

## repeated printouts and wide-format text
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_colwidth', None)

In [74]:
demo_df = pd.read_excel(r"../files/Dartmouth_Data_Set.xlsx")
FEIS_df = pd.read_excel(r"../files/START_FEIS_Data.xlsx")
time_df = pd.read_excel(r"../files/Dartmouth_Time_Data.xlsx")
dict_df = pd.read_excel(r"../files/Final SIRS_Data_Dictionary_V13.1 October 2020.xlsx")

In [75]:
# Cleaning the demographics dataset
demographics = demo_df[['Local ID', 'Region', 'Date Enrolled in START', 'Gender', 'Race', 'Date of birth', 'Ethnicity',
                              'Level of Intellectual Disability', 'Psychiatric diagnoses', 'Medical diagnoses', 'Other Disabilities',
                              'Funding']]

In [76]:
# Merging datasets
merged = pd.merge(demographics, FEIS_df, how = 'inner', left_on = ['Local ID'], right_on = ['Respondent ID #  (SIRS Local ID)'])

# Inspect the result of the join (join type was changed): unique IDs and shape

merged['Local ID'].nunique()
merged.shape

(1097, 69)
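One way to check what the inner join drops, sketched on toy data: an outer merge with `indicator=True` counts the IDs that appear in only one table. Casting both keys to stripped strings first is an assumption, motivated by IDs such as `907533C` that cannot be stored as integers. The frames, the `key` column, and all values below are made up.

```python
import pandas as pd

demo_toy = pd.DataFrame({"Local ID": [101, 102, "103C"],
                         "Gender": ["Male", "Female", "Female"]})
feis_toy = pd.DataFrame({"Respondent ID": ["101", "103C ", "999"],
                         "Advice": ["a", "b", "c"]})

# Normalise the key columns so 101 and "101" (or "103C " with a stray space) match.
demo_toy["key"] = demo_toy["Local ID"].astype(str).str.strip()
feis_toy["key"] = feis_toy["Respondent ID"].astype(str).str.strip()

# The _merge column labels each row as both, left_only, or right_only.
check = pd.merge(demo_toy, feis_toy, how="outer", on="key", indicator=True)
match_counts = check["_merge"].value_counts()
print(match_counts)
```

Rows marked `left_only` are enrolled clients with no survey response, and `right_only` rows are responses whose ID did not match any client.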

In [150]:
# Subsetting by gender
demographics_male = merged.loc[merged['Gender']=='Male']
demographics_female = merged.loc[merged['Gender']=='Female']

# Subsetting by race
male_white = demographics_male[demographics_male['Race'] == "White"]
male_nonwhite = demographics_male[demographics_male['Race'] != "White"]

female_white = demographics_female[demographics_female['Race'] == "White"]
female_nonwhite = demographics_female[demographics_female['Race'] != "White"]


#male_white.shape
#male_nonwhite.shape

female_white.head()
#female_nonwhite.shape


Unnamed: 0,Local ID,Region,Date Enrolled in START,Gender,Race,Date of birth,Ethnicity,Level of Intellectual Disability,Psychiatric diagnoses,Medical diagnoses,...,"In\nthe past year, did your family member use in-patient psychiatric services?","If\nyes, were the inpatient services that your family member received helpful to\nhim/her in your opinion? ?",How\nmuch help was available to you at night or on weekends if your family member\nhad a crisis?,Are\nthere options outside of the hospital for individuals experiencing a crisis to\ngo for help (i.e. crisis/hospital diversion beds)?,Who\nwas the primary source of information about your family member’s mental health\nservices?,"If other, please describe..2","During the past year, how much involvement\ndid you want to have in your family member’s treatment plan?",Was there any particular service that your\nfamily member needed that was not available?,"If yes, please describe the service.",What\nadvice would you give to service planners regarding the mental health service\nneeds of persons with IDD and their families?
2,434021,New York : Region 3,2020-12-28,Female,White,2001-12-24,Not of Hispanic origin,Borderline,"Attention-Deficit/Hyperactivity Disorder, Autism Spectrum Disorder, Oppositional Defiant Disorder, Social Anxiety Disorder",Gastro/Intestinal,...,No,,None at all,None at all,His/her psychiatrist,,A lot,Yes,In-home behavior support,
9,21347,Texas : Tarrant County,2020-12-14,Female,White,1991-09-15,Hispanic - specific origin not specified,Moderate,,Endocrine,...,No,None at all,All that was wanted/needed,Did not know/answer,Other,Casa,,No,,
10,8146562,California : CA START East Bay,2020-12-09,Female,White,2006-10-03,Not of Hispanic origin,Mild,Autism Spectrum Disorder,,...,No,,Very little,None at all,His/her psychiatrist,,A lot,Yes,"Good psychiatry, crisis help that was more hands on, caregivers/ respite workers, etc.",
21,7697408,California : CA START Westside,2020-11-20,Female,White,1993-03-16,"Unknown, not collected",Borderline,"Attention-Deficit/Hyperactivity Disorder, Autism Spectrum Disorder","Obesity, Other: Asthma; had asthma when she was younger",...,Yes,Very little,All that was wanted/needed,All that was wanted/needed,Your family member him/herself,,A lot,Yes,"Mental Health Services, individual therapy",Harlee can create fabricated stories based off information she received from mental health providers.
33,359313,New York : Region 3,2020-11-16,Female,White,2008-12-03,Not of Hispanic origin,Normal intelligence,"Attention-Deficit/Hyperactivity Disorder, Autism Spectrum Disorder, Oppositional Defiant Disorder",Pulmonary disorders,...,No,,None at all,Did not know/answer,Other,Grandmother,A lot,No,,Family had no information form their assigned social worker from a community agency and felf uninformed.
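A caveat on the `!= "White"` filters, shown on toy data: a missing `Race` value compares as not equal to "White", so rows with no recorded race end up in the non-white subsets. Whether that is intended is an analysis decision. The sketch below only makes the difference visible, and the frame is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Race": ["White", "Black or African American", np.nan, "Asian"]})

# Same style of filter as the notebook: NaN != "White" evaluates to True.
nonwhite_incl_missing = toy[toy["Race"] != "White"]

# Alternative that keeps only rows with a recorded, non-White race.
nonwhite_recorded = toy[toy["Race"].notna() & (toy["Race"] != "White")]

print(len(nonwhite_incl_missing), len(nonwhite_recorded))
```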


In [159]:
# female_white.head()
female_white_subset = female_white[['Local ID','What\nadvice would you give to service planners regarding the mental health service\nneeds of persons with IDD and their families?', "Was there any particular service that your\nfamily member needed that was not available?", "If yes, please describe the service."]]
female_white_subset.columns = ['ID', 'Advice', 'Missing Service', 'Service Needed']

# copy the slice so adding a column later does not trigger SettingWithCopyWarning
advice = female_white_subset[["ID", "Advice"]].dropna().copy()
#advice.head()
# female_white_subset.head()



stop_words = set(stopwords.words('english'))

snowball = SnowballStemmer(language="english")

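def process(string):
    """Lowercase, tokenize, drop stop words, strip non-letters, and stem a string."""
    string_lower = string.lower()
    tokens = word_tokenize(string_lower)
    # remove English stop words (tokens are already lowercase)
    tokenize_string = [s for s in tokens if s not in stop_words]
    # keep letters only; punctuation-only tokens become empty strings
    alpha_string = [re.sub('[^A-Za-z]+', '', s) for s in tokenize_string]
    # reduce each token to its Snowball stem
    stem_string = [snowball.stem(s) for s in alpha_string]
    # rejoin into one space-separated string for the vectorizer
    final_string = " ".join(stem_string)
    return final_string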

advice['processed_text'] = [process(string) for string in advice["Advice"]]
advice



# female_white_subset['Advice']

# female_white["What\nadvice would you give to service planners regarding the mental health service\nneeds of persons with IDD and their families?"]
# female_white["Was there any particular service that your\nfamily member needed that was not available?"]
# female_white["If yes, please describe the service."]
# # what advice would you give and 

# if services are not easy to access, 

Unnamed: 0,ID,Advice,processed_text
21,7697408,Harlee can create fabricated stories based off information she received from mental health providers.,harle creat fabric stori base inform receiv mental health provid
33,359313,Family had no information form their assigned social worker from a community agency and felf uninformed.,famili inform form assign social worker communiti agenc felf uninform
41,907533C,"""Think outside the box""",think outsid box
44,240792,"Remember how overwhelming it is for the family, it never ends.",rememb overwhelm famili never end
45,136986,"none, I feel supported by all the team",none feel support team
...,...,...,...
1079,1056248,HAVE LITERATURE TO HELP FAMILIES UNDERSTAND WHAT THE DISABILITY IS.,literatur help famili understand disabl
1080,39089,,none
1083,11144011,Therapist should encourage and foster input from the team members,therapist encourag foster input team member
1088,704085W,"People with IDD/MH behavioral dysregulation being confusing to systems. Parents/caregivers request medication to address the behavioral dysregulation displayed by the person supported; however, most times the behavioral dysregulation is related to the inability to communicate/frustration and IDD. There is not a medication to treat IDD; conveying this to families and systems can be challenging. There are only medications to treat symptoms.",peopl iddmh behavior dysregul confus system parentscaregiv request medic address behavior dysregul display person support howev time behavior dysregul relat inabl communicatefrustr idd medic treat idd convey famili system challeng medic treat symptom


In [160]:
# Creating the document-term matrix 

def create_dtm(list_of_strings, metadata):
    # count how often each term appears in each document
    vectorizer = CountVectorizer(lowercase = True)
    dtm_sparse = vectorizer.fit_transform(list_of_strings)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    dtm_dense_named = pd.DataFrame(dtm_sparse.toarray(),
                columns=vectorizer.get_feature_names_out())
    # prefix metadata columns on a copy so the caller's DataFrame is not modified
    metadata = metadata.add_prefix("metadata_")
    dtm_dense_named_withid = pd.concat([metadata.reset_index(), 
                                        dtm_dense_named], axis = 1)
    return dtm_dense_named_withid


In [165]:
# Build the document-term matrix from the processed advice text

dtm_nopre = create_dtm(list_of_strings= advice['processed_text'],
                metadata = 
                advice[["ID"]])

dtm_nopre.head()

Unnamed: 0,index,metadata_ID,abil,abl,access,address,advoc,agenc,also,alway,...,way,weekend,well,whole,will,work,worker,workshop,would,written
0,21,7697408,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,33,359313,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
2,41,907533C,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,44,240792,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,45,136986,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [172]:

def get_topwords(dtm): 
    topdtm = dtm[[col for col in dtm.columns
               if 'metadata' not in col and col != 'index']].sum(axis=0)
    return topdtm.sort_values(ascending=False).head(30)


print("Top words for Advice")
get_topwords(dtm_nopre)

# Justifying not dropping named entities - none really came up

Top words for Advice


famili        20
servic        19
need          14
provid        14
help           9
inform         8
find           7
medic          6
none           6
program        6
support        6
health         6
better         6
member         5
make           5
peopl          5
understand     4
idd            4
system         4
client         4
get            4
mental         4
avail          4
assist         4
answer         4
sure           3
time           3
nt             3
look           3
advoc          3
dtype: int64
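The `nt` entry in the list above most likely comes from contractions: `word_tokenize` splits "don't" into "do" and "n't", and stripping non-letters leaves "nt", while punctuation-only tokens become empty strings. A possible refinement, sketched here on already-tokenized input so that it runs without NLTK data, drops empty and residue tokens before stemming. The function name `clean_tokens`, the tiny stop-word set, and the residue list are illustrative assumptions, not part of the notebook.

```python
import re

STOP_WORDS = {"the", "do", "is", "it"}              # tiny stand-in for NLTK's stop-word list
RESIDUE = {"nt", "s", "ll", "ve", "re", "d", "m"}   # assumed contraction leftovers

def clean_tokens(tokens):
    """Lowercase, strip non-letters, then drop stop words, empties, and contraction residue."""
    cleaned = []
    for tok in tokens:
        tok = re.sub(r"[^a-z]+", "", tok.lower())
        if tok and tok not in STOP_WORDS and tok not in RESIDUE:
            cleaned.append(tok)
    return cleaned

# Tokens as word_tokenize would split "They don't listen, it's hard."
tokens = ["They", "do", "n't", "listen", ",", "it", "'s", "hard", "."]
result = clean_tokens(tokens)
print(result)
```

Dropping "nt" also discards the negation itself, which matters little for topic modeling but could matter for sentiment analysis, so the raw text is the safer input there.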

In [178]:
text_raw_tokens = [wordpunct_tokenize(s) 
                for s in 
                advice['processed_text']]

text_raw_dict = corpora.Dictionary(text_raw_tokens)

corpus_fromdict = [text_raw_dict.doc2bow(s) 
                   for s in text_raw_tokens]

ldamod = gensim.models.ldamodel.LdaModel(corpus_fromdict, 
                                num_topics = 3, id2word=text_raw_dict, 
                                passes=6, alpha = 'auto',
                                per_word_topics = True, random_state = 2)

topics = ldamod.print_topics(num_words = 30)
for topic in topics:
    print(topic)


(0, '0.027*"servic" + 0.019*"provid" + 0.019*"famili" + 0.014*"work" + 0.014*"team" + 0.014*"need" + 0.014*"help" + 0.010*"idd" + 0.010*"peopl" + 0.010*"support" + 0.010*"better" + 0.010*"baselin" + 0.010*"communic" + 0.010*"child" + 0.010*"pay" + 0.010*"listen" + 0.010*"hard" + 0.010*"advoc" + 0.010*"like" + 0.010*"therapist" + 0.010*"find" + 0.006*"system" + 0.006*"none" + 0.006*"choos" + 0.006*"challeng" + 0.006*"time" + 0.006*"nt" + 0.006*"individu" + 0.006*"lot" + 0.006*"understand"')
(1, '0.033*"famili" + 0.033*"need" + 0.029*"servic" + 0.029*"provid" + 0.025*"help" + 0.021*"program" + 0.017*"find" + 0.017*"client" + 0.013*"health" + 0.013*"better" + 0.013*"assist" + 0.013*"understand" + 0.013*"disabl" + 0.009*"look" + 0.009*"s" + 0.009*"support" + 0.009*"get" + 0.009*"way" + 0.009*"staff" + 0.009*"go" + 0.009*"person" + 0.009*"take" + 0.005*"member" + 0.005*"inform" + 0.005*"enough" + 0.005*"mental" + 0.005*"avail" + 0.005*"medic" + 0.005*"nt" + 0.005*"fund"')
(2, '0.031*"famili

In [180]:
## Visualize - may not work on jhub yet
#!pip install pyLDAvis
import pyLDAvis
# pyLDAvis.gensim is a submodule of pyLDAvis, not a separate pip package, and it was
# renamed to pyLDAvis.gensim_models in pyLDAvis 3.3; importing the old name on a
# newer install raises ModuleNotFoundError: No module named 'pyLDAvis.gensim'
try:
    import pyLDAvis.gensim_models as gensimvis
except ModuleNotFoundError:
    import pyLDAvis.gensim as gensimvis

pyLDAvis.enable_notebook()
lda_display = gensimvis.prepare(ldamod, corpus_fromdict, text_raw_dict)
pyLDAvis.display(lda_display)