# About this notebook
### Approach for data management:
The source of every data set is documented below. Each file is downloaded once and then read from a local copy for ease of use.
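
A minimal sketch of the "download once, then work locally" approach, using only the standard library. The helper name `fetch_if_missing` and the `data` default directory are illustrative assumptions, not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_if_missing(url, dest_dir="data"):
    """Download url into dest_dir unless a file of that name is already there.

    Returns the local path. The function name and the "data" default are
    illustrative choices, not something the original notebook defines.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    local_path = dest_dir / url.rsplit("/", 1)[-1]
    if not local_path.exists():
        # Only hit the network when there is no local copy yet.
        urlretrieve(url, str(local_path))
    return local_path
```

Calling it with one of the "Dataset here" URLs listed below would leave the zip file in `data/`; later runs would reuse that copy instead of downloading again.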



## Data Source 1 - National Mental Health Services Survey (N-MHSS)
Mental Health Facilities Data

https://www.datafiles.samhsa.gov/dataset/national-mental-health-services-survey-2020-n-mhss-2020-ds0001

Codebook here: https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/studies/N-MHSS-2020/N-MHSS-2020-datasets/N-MHSS-2020-DS0001/N-MHSS-2020-DS0001-info/N-MHSS-2020-DS0001-info-codebook.pdf

Dataset here: https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/studies/N-MHSS-2020/N-MHSS-2020-datasets/N-MHSS-2020-DS0001/N-MHSS-2020-DS0001-bundles-with-study-info/N-MHSS-2020-DS0001-bndl-data-csv_v1.zip
        
Some information on this source dataset:

N-MHSS is an annual survey that collects data on the services and characteristics of all known mental health treatment facilities in the 50 states, the District of Columbia, and the U.S. territories and jurisdictions. Every other year (since 2014), the survey also collects data on the number and demographics of people served in these facilities as of a specified survey reference date.

N-MHSS is the only source of national and state-level data on the mental health service delivery system reported by both public and private specialty mental health treatment facilities, including:

* Public and private psychiatric hospitals
* Nonfederal general hospitals with separate psychiatric units
* U.S. Department of Veterans Affairs medical centers
* Residential treatment centers for children and adults
* Community mental health centers
* Outpatient, day treatment, or partial hospitalization mental health facilities
* Multi-setting (nonhospital) mental health facilities

N-MHSS complements the information collected through SAMHSA’s National Survey of Substance Abuse Treatment Services (N-SSATS). Treatment facility information from N-MHSS is used to populate the mental health component of SAMHSA’s Behavioral Health Treatment Services Locator.

## Data Source 2 - National Survey on Drug Use and Health (NSDUH) 

https://www.datafiles.samhsa.gov/dataset/national-survey-drug-use-and-health-2020-nsduh-2020-ds0001
    
Codebook here: https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/studies/NSDUH-2020/NSDUH-2020-datasets/NSDUH-2020-DS0001/NSDUH-2020-DS0001-info/NSDUH-2020-DS0001-info-codebook.pdf
        
Dataset here: https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/studies/NSDUH-2020/NSDUH-2020-datasets/NSDUH-2020-DS0001/NSDUH-2020-DS0001-bundles-with-study-info/NSDUH-2020-DS0001-bndl-data-tsv_v1.zip
        
Population Data

The NSDUH series, formerly the National Household Survey on Drug Abuse, is the leading source of statistical information on the use of illicit drugs, alcohol, and tobacco and mental health issues in the United States. The survey tracks trends in specific substance use and mental illness measures and assesses substance use disorders and treatment for these disorders.

The population of the NSDUH series is the general civilian population aged 12 and older. Questions include age at first use, as well as lifetime, annual, and past-month use of the following drugs: alcohol, marijuana, cocaine (including crack), hallucinogens, heroin, inhalants, tobacco, pain relievers, tranquilizers, stimulants, and sedatives. The survey covers substance use treatment history and perceived need for treatment, and it includes questions from the “Diagnostic and Statistical Manual of Mental Disorders” (DSM) that allow diagnostic criteria to be applied.

Respondents are also asked about personal and family income, health care access and coverage, illegal activities and arrest records, problems resulting from the use of drugs, and perceptions of risks. Demographic data include gender, race, age, ethnicity, educational level, employment status, income level, veteran status, household composition, and population density.

## Data Source 3 - Mental Health Client-Level Data (MH-CLD) 

https://www.datafiles.samhsa.gov/dataset/mental-health-client-level-data-2019-mh-cld-2019-ds0001
    
Codebook here: https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/studies/MH-CLD-2019/MH-CLD-2019-datasets/MH-CLD-2019-DS0001/MH-CLD-2019-DS0001-info/MH-CLD-2019-DS0001-info-codebook.pdf
        
Dataset here: https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/studies/MH-CLD-2019/MH-CLD-2019-datasets/MH-CLD-2019-DS0001/MH-CLD-2019-DS0001-bundles-with-study-info/MH-CLD-2019-DS0001-bndl-data-csv_v1.zip
        
Client-Level Mental Health Data

MH-CLD and the Mental Health Treatment Episode Data Set (MH-TEDS) provide information on mental health diagnoses and the mental health treatment services, outcomes, and demographic and substance use characteristics of people in mental health treatment facilities. This information comes from facilities that report to individual state administrative data systems.    
        

In [2]:
!pip3 install pandas
!pip3 install numpy
!pip3 show pandas

Name: pandas
Version: 1.3.5
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: pandas-dev@python.org
License: BSD-3-Clause
Location: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages
Requires: numpy, python-dateutil, pytz
Required-by: 


## PART A 

Goal:
In this section, Client-Level Mental Health Data (MH-CLD) for two years, 2014 and 2019, is used.
This data supports exploratory analysis in the following areas:
* Understanding the distribution of mental health disorders by type and by state
* Comparing the number of mental health cases (MHC) and substance abuse cases (SAC)

Further analysis is done to 
* Check if there is a correlation between MHC and SAC
* Compare the numbers of MHC and SAC between the years 2014 and 2019.
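
One way the correlation check might look, assuming MHC and SAC have already been counted per state. The numbers below are made up purely to show the call, and the `counts` frame and its column names are illustrative.

```python
import pandas as pd

# Made-up per-state counts, only to demonstrate the call; not real results.
counts = pd.DataFrame(
    {"MHC": [120, 80, 45, 60], "SAC": [30, 22, 9, 18]},
    index=["CA", "NY", "TX", "WA"],
)

# Series.corr uses the Pearson coefficient by default.
r = counts["MHC"].corr(counts["SAC"])
print(round(r, 3))
```

A coefficient near 1 would mean states with more mental health cases also tend to have more substance abuse cases. With raw counts, much of that may simply reflect state size, so rates per population are probably a fairer basis.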

These years were picked because I have mental health services facility data for the same years,
which supports further analysis in the next part.

Approach:
Since the two files for 2014 and 2019 are quite large and the JupyterLab environment struggles with them,
I trim each data set before merging them so that there is less data to work with, which JupyterLab should handle better.
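
A sketch of one way to trim while loading, using the `usecols` and `chunksize` parameters of `pandas.read_csv` so that unused columns never reach memory. The `KEEP` list and the `read_trimmed` helper are assumptions for illustration; the notebook itself loads the full file and filters afterwards.

```python
import io

import pandas as pd

# Columns assumed to be needed downstream; adjust to the analysis at hand.
KEEP = ["YEAR", "AGE", "GENDER", "MH1", "MH2", "MH3", "SUB", "STATEFIP"]


def read_trimmed(source, usecols=KEEP, chunksize=500_000):
    """Read a large CSV in chunks, keeping only usecols, and return one frame."""
    chunks = pd.read_csv(source, usecols=usecols, chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)


# Tiny in-memory stand-in for the real file, just to show the call.
demo_csv = io.StringIO(
    "YEAR,AGE,GENDER,MH1,MH2,MH3,SUB,STATEFIP,CASEID\n"
    "2019,5,1,7,-9,-9,-9,6,1\n"
    "2019,8,2,2,7,-9,4,36,2\n"
)
demo = read_trimmed(demo_csv, chunksize=1)
print(demo.shape)
```

For the real files, `source` would be the CSV path, and row filters could be applied to each chunk before concatenating.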

In [None]:
%%bash
wget https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/studies/MH-CLD-2019/MH-CLD-2019-datasets/MH-CLD-2019-DS0001/MH-CLD-2019-DS0001-bundles-with-study-info/MH-CLD-2019-DS0001-bndl-data-csv_v1.zip
unzip MH-CLD-2019-DS0001-bndl-data-csv_v1.zip
wget https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/studies/MH-CLD-2014/MH-CLD-2014-datasets/MH-CLD-2014-DS0001/MH-CLD-2014-DS0001-bundles-with-study-info/MH-CLD-2014-DS0001-bndl-data-csv_1.zip
unzip MH-CLD-2014-DS0001-bndl-data-csv_1.zip


In [8]:
%%bash
python3 --version

In [1]:
import pandas as pd

## Data Access and Data Load - 2019 MH-CLD data load
file1 = "mhcld-puf-2019-csv.csv"
mh2019 = pd.read_csv(file1)


In [2]:
mh2019.shape

(6362044, 40)

### Remove rows with unknown or out-of-scope information, where
1. AGE is -9
2. GENDER is -9
3. IJSSERVICE is 1 (served in a justice system institution)
4. SMISED is -9
5. STATEFIP is 99
6. REGION is 0

Rows where MH1, MH2, MH3, or SUB is -9 are kept; the derivation steps further down label them "None".

In [3]:
mh2019.dtypes

YEAR           int64
AGE            int64
EDUC           int64
ETHNIC         int64
RACE           int64
GENDER         int64
SPHSERVICE     int64
CMPSERVICE     int64
OPISERVICE     int64
RTCSERVICE     int64
IJSSERVICE     int64
MH1            int64
MH2            int64
MH3            int64
SUB            int64
MARSTAT        int64
SMISED         int64
SAP            int64
EMPLOY         int64
DETNLF         int64
VETERAN        int64
LIVARAG        int64
NUMMHS         int64
TRAUSTREFLG    int64
ANXIETYFLG     int64
ADHDFLG        int64
CONDUCTFLG     int64
DELIRDEMFLG    int64
BIPOLARFLG     int64
DEPRESSFLG     int64
ODDFLG         int64
PDDFLG         int64
PERSONFLG      int64
SCHIZOFLG      int64
ALCSUBFLG      int64
OTHERDISFLG    int64
STATEFIP       int64
DIVISION       int64
REGION         int64
CASEID         int64
dtype: object

### Removing rows that have missing data in fields I consider important
By the end of the next cell, the data is cleaned.

In [4]:
# Drop rows whose key fields are unknown (-9) or out of scope.
mh2019 = mh2019[mh2019['AGE'] != -9]
mh2019 = mh2019[mh2019['GENDER'] != -9]
mh2019 = mh2019[mh2019['IJSSERVICE'] != 1]   # served in a justice system institution
mh2019 = mh2019[mh2019['SMISED'] != -9]
mh2019 = mh2019[mh2019['STATEFIP'] != 99]    # the list above gives 99 (not -99) as the code to exclude
mh2019 = mh2019[mh2019['REGION'] != 0]
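
The same filters can also be written as one boolean mask, which avoids building an intermediate frame per condition. This is a sketch on a toy frame standing in for `mh2019`; the codes follow the list above (including 99 for STATEFIP), and the toy values are invented.

```python
import pandas as pd

# Toy frame standing in for mh2019; every row except the first trips a filter.
toy = pd.DataFrame({
    "AGE":        [5, -9, 7, 3, 4, 6],
    "GENDER":     [1, 1, -9, 2, 2, 1],
    "IJSSERVICE": [2, 2, 2, 1, 2, 2],
    "SMISED":     [1, 1, 1, 1, -9, 2],
    "STATEFIP":   [6, 6, 6, 6, 6, 99],
    "REGION":     [4, 4, 4, 4, 4, 0],
})

keep = (
    toy["AGE"].ne(-9)
    & toy["GENDER"].ne(-9)
    & toy["IJSSERVICE"].ne(1)
    & toy["SMISED"].ne(-9)
    & toy["STATEFIP"].ne(99)
    & toy["REGION"].ne(0)
)
# .copy() gives an independent frame, so adding columns later does not warn.
cleaned = toy[keep].copy()
print(len(toy), "->", len(cleaned))
```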

### Deriving data from MH1, MH2, MH3 (as per the codebook)
* If first, second, or third mental health diagnosis is 1 trauma- or stressor-related disorder, then trauma- or stressor-related disorder flag is 1;
* if first, second, or third mental health diagnosis is 2 anxiety disorder, then anxiety disorder flag is 1;
* if first, second, or third mental health diagnosis is 3 attention deficit/hyperactivity disorder, then attention deficit/hyperactivity disorder flag is 1;
* if first, second, or third mental health diagnosis is 4 conduct disorder, then conduct disorder flag is 1;
* if first, second, or third mental health diagnosis is 5 delirium/dementia disorder, then delirium/dementia disorder flag is 1;
* if first, second, or third mental health diagnosis is 6 bipolar disorder, then bipolar disorder flag is 1;
* if first, second, or third mental health diagnosis is 7 depressive disorder, then depressive disorder flag is 1;
* if first, second, or third mental health diagnosis is 8 oppositional defiant disorder, then oppositional defiant disorder flag is 1;
* if first, second, or third mental health diagnosis is 9 pervasive developmental disorder, then pervasive developmental disorder flag is 1;
* if first, second, or third mental health diagnosis is 10 personality disorder, then personality disorder flag is 1;
* if first, second, or third mental health diagnosis is 11 schizophrenia or other psychotic disorder, then schizophrenia or other psychotic disorder flag is 1;
* if first, second, or third mental health diagnosis is 12 alcohol or substance use disorder, then alcohol or substance use disorder flag is 1;
* if first, second, or third mental health diagnosis is 13 other mental disorder, then other mental disorder flag is 1.
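
The codebook rule above can be sketched in plain Python as follows. The pairing of codes 1 to 13 with the thirteen `*FLG` column names is inferred from their order in the `dtypes` listing, so treat it as an assumption, and `diagnosis_flags` is a hypothetical helper name.

```python
# Flag column names from the 2019 file, keyed by diagnosis code.
# The code-to-column pairing is inferred from the column order, not confirmed.
FLAG_BY_CODE = {
    1: "TRAUSTREFLG", 2: "ANXIETYFLG", 3: "ADHDFLG", 4: "CONDUCTFLG",
    5: "DELIRDEMFLG", 6: "BIPOLARFLG", 7: "DEPRESSFLG", 8: "ODDFLG",
    9: "PDDFLG", 10: "PERSONFLG", 11: "SCHIZOFLG", 12: "ALCSUBFLG",
    13: "OTHERDISFLG",
}


def diagnosis_flags(mh1, mh2, mh3):
    """Return {flag_name: 0 or 1} for one client, following the rules above."""
    reported = {mh1, mh2, mh3}
    return {flag: int(code in reported) for code, flag in FLAG_BY_CODE.items()}


flags = diagnosis_flags(7, 2, -9)  # depressive + anxiety, third diagnosis missing
print([name for name, value in flags.items() if value])
```

Note that one client can carry several flags at once, whereas a single-label summary keeps only one diagnosis per client.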

This section defines a "disorder" function and applies it to every row. MH1, MH2, and MH3 are reduced to a
single label per client; when several diagnoses are present, the one with the highest code takes precedence.


In [5]:
def disorder(vals):
    """Series -> str
    vals is a Pandas Series with 3 values:
    MH1, MH2, MH3.
    The 3 values are used to derive a single condition label,
    checking the code book codes above from 13 down to 1."""
    mh1 = vals.MH1
    mh2 = vals.MH2
    mh3 = vals.MH3
    if (mh1 == 13) or (mh2 == 13) or (mh3 == 13):
        return "OTHER"
    if (mh1 == 12) or (mh2 == 12) or (mh3 == 12):
        return "SA" # Substance Abuse
    if (mh1 == 11) or (mh2 == 11) or (mh3 == 11):
        return "PSYCH" # Psychotic or Schizophrenia
    if (mh1 == 10) or (mh2 == 10) or (mh3 == 10):
        return "PERSONALITY" # Personality
    if (mh1 == 9) or (mh2 == 9) or (mh3 == 9):
        return "PERVASIVE" # Pervasive Developmental
    if (mh1 == 8) or (mh2 == 8) or (mh3 == 8):
        return "OPPOSITIONAL" # Oppositional Defiant 
    if (mh1 == 7) or (mh2 == 7) or (mh3 == 7):
        return "DEPRESSIVE" # Depressive
    if (mh1 == 6) or (mh2 == 6) or (mh3 == 6):
        return "BIPOLAR" # Bipolar
    if (mh1 == 5) or (mh2 == 5) or (mh3 == 5):
        return "DEL/DEM" # Delirium, Dementia
    if (mh1 == 4) or (mh2 == 4) or (mh3 == 4):
        return "CONDUCT" # Conduct
    if (mh1 == 3) or (mh2 == 3) or (mh3 == 3):
        return "ADHD" # ADHD
    if (mh1 == 2) or (mh2 == 2) or (mh3 == 2):
        return "ANXIETY" 
    if (mh1 == 1) or (mh2 == 1) or (mh3 == 1):
        return "TRAUMA"
    return "None"

def anydisorder(vals):
    """Series -> str
    vals is a Pandas Series with 1 value:
    Disorder.
    Returns 'No' when the label is "None", otherwise 'Yes'.
    """
    disorder = vals.Disorder
    if disorder == "None":
        return 'No'
    else:
        return 'Yes'

In [6]:
mh2019['Disorder'] = mh2019[['MH1', 'MH2', 'MH3']].apply(disorder, axis=1)
mh2019['AnyDisorder'] = mh2019[['Disorder']].apply(anydisorder, axis=1)
mh2019['Disorder'].value_counts()  # last expression in the cell, so the counts are displayed
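
Row-wise `apply` over several million rows is slow. A vectorized sketch that should give the same labels, assuming MH1 to MH3 only ever hold -9 or a code from 1 to 13: since `disorder` returns the label of the highest code present, a row-wise maximum followed by a dictionary lookup is equivalent. The `LABELS` dictionary and the toy frame are illustrative.

```python
import pandas as pd

# Same labels as the disorder() function, keyed by diagnosis code.
LABELS = {13: "OTHER", 12: "SA", 11: "PSYCH", 10: "PERSONALITY", 9: "PERVASIVE",
          8: "OPPOSITIONAL", 7: "DEPRESSIVE", 6: "BIPOLAR", 5: "DEL/DEM",
          4: "CONDUCT", 3: "ADHD", 2: "ANXIETY", 1: "TRAUMA"}

toy = pd.DataFrame({"MH1": [7, 2, -9, 1],
                    "MH2": [-9, 12, -9, 13],
                    "MH3": [-9, 7, -9, -9]})

# Highest code per row; -9 (no diagnosis in any slot) has no label and becomes "None".
highest = toy[["MH1", "MH2", "MH3"]].max(axis=1)
toy["Disorder"] = highest.map(LABELS).fillna("None")
toy["AnyDisorder"] = toy["Disorder"].ne("None").map({True: "Yes", False: "No"})
print(toy)
```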

In [7]:
### Deriving data for SUBSTANCE Abuse Information
substancetable = {
  1: "ALCOHOL_IND",
  2: "ALCOHOL_INTOX",
  3: "SUB_IND",
  4: "ALCOHOL_DEP",
  5: "COCAINE_DEP",
  6: "CANNABIS_DEP",
  7: "OPIOID_DEP",
  8: "OTHER",
  9: "ALCOHOL_ABUSE",
  10: "COCAINE_ABUSE",
  11: "CANNABIS_ABUSE",
  12: "OPIOID_ABUSE",
  13: "OTHER_RELATED",
  -9: "None"
}

def substancelookup(vals):
    """Series -> str
    vals is a Pandas Series with 1 value: SUB.
    Returns the label for that code from substancetable."""
    sub = vals.SUB
    return substancetable[sub]

def anysubstance(vals):
    """Series -> str
    vals is a Pandas Series with 1 value: SUB.
    Returns 'No' when SUB is -9 (nothing reported), otherwise 'Yes'."""
    sub = vals.SUB
    if (sub == -9):
        return 'No'
    else:
        return 'Yes'

mh2019['SubstanceAbuse'] = mh2019[['SUB']].apply(substancelookup, axis=1)
mh2019['AnySubstance'] = mh2019[['SUB']].apply(anysubstance, axis=1)
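
The two substance columns can likewise be derived without a row-wise `apply`, using `Series.map` and `numpy.where`. This is a sketch on a toy frame; only part of `substancetable` is repeated here for brevity.

```python
import numpy as np
import pandas as pd

# Subset of the notebook's substancetable, enough for the toy rows below.
substancetable = {4: "ALCOHOL_DEP", 7: "OPIOID_DEP", -9: "None"}

toy = pd.DataFrame({"SUB": [4, -9, 7]})
toy["SubstanceAbuse"] = toy["SUB"].map(substancetable)
toy["AnySubstance"] = np.where(toy["SUB"].eq(-9), "No", "Yes")
print(toy)
```

One behavioral difference: `map` yields NaN for a code missing from the table, while the dictionary lookup in `substancelookup` raises a `KeyError`.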

### Have a more readable form of "STATEFIP"

In [8]:
import states
mh2019['State'] = mh2019[['STATEFIP']].apply(states.statelookupbycode, axis=1)
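
The `states` module is a local helper that is not shown in this notebook. A hypothetical reconstruction of what `statelookupbycode` might look like, with only a handful of FIPS codes filled in and an assumed "Unknown" fallback:

```python
# states.py (hypothetical reconstruction; the real module is not shown here)
from types import SimpleNamespace

# A few FIPS state codes for illustration; a real table would list every state.
STATE_BY_FIPS = {1: "Alabama", 2: "Alaska", 4: "Arizona", 6: "California",
                 36: "New York", 48: "Texas"}


def statelookupbycode(vals):
    """Row with a STATEFIP field -> state name, or "Unknown" if not in the table."""
    return STATE_BY_FIPS.get(int(vals.STATEFIP), "Unknown")


row = SimpleNamespace(STATEFIP=6)  # stands in for one row passed by DataFrame.apply
print(statelookupbycode(row))
```

Whether the real helper returns full names or postal abbreviations is not known from the notebook.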

### Columns that are no longer needed are dropped in this section to keep the data frame relevant
1. MH1, MH2, MH3 (superseded by the newly created "Disorder" column)
2. MARSTAT (marital status)

In [9]:
columnstodrop = ['MARSTAT', 'MH1', 'MH2', 'MH3']
mh2019 = mh2019.drop(columns=columnstodrop)

In [10]:
mh2019.shape

(5892510, 41)

### Let's study the disorder occurrence per state in a few aspects
1. AnyDisorder per State
2. Type of Disorder in the state with the highest number of disorders
3. Any substance per State
4. Type of substance abuse in the state with the highest number of substance abuse cases
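
A sketch of the first two aspects on a toy frame: count clients with any disorder per state, then break down the disorder types in the state with the highest count. The state codes and rows are invented for illustration; aspects 3 and 4 would follow the same pattern with `AnySubstance` and `SubstanceAbuse`.

```python
import pandas as pd

toy = pd.DataFrame({
    "State":       ["CA", "CA", "CA", "NY", "NY", "TX"],
    "AnyDisorder": ["Yes", "Yes", "No", "Yes", "No", "Yes"],
    "Disorder":    ["DEPRESSIVE", "ANXIETY", "None", "DEPRESSIVE", "None", "ADHD"],
})

# 1. Clients with any disorder, per state.
with_disorder = toy[toy["AnyDisorder"] == "Yes"]
per_state = with_disorder.groupby("State").size()

# 2. Disorder mix inside the state with the most such clients.
top_state = per_state.idxmax()
mix = with_disorder.loc[with_disorder["State"] == top_state, "Disorder"].value_counts()
print(top_state)
print(mix)
```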

In [None]:
summary2019_disorder = mh2019[['YEAR','State', 'AnyDisorder', 'Disorder', 'AnySubstance']]
#summary_disorder.groupby(['State', 'AnyDisorder']).agg(Count = ('AnyDisorder', 'count')).reset_index()
#summary_disorder
summary2019_disorder

In [None]:
import pandas as pd
file2 = "./data/mhcld2014/mhcld-puf-2014-csv.csv"
mh2014 = pd.read_csv(file2)
mh2014.head()

In [None]:
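# mh2014 needs the same cleaning and derived columns as mh2019 before this selection works.
summary2014_disorder = mh2014[['YEAR','State', 'AnyDisorder', 'Disorder', 'AnySubstance']]
summary_disorder = pd.concat([summary2019_disorder, summary2014_disorder], ignore_index=True)  # assumed: both years stacked
summary_disorder.groupby(['State', 'AnyDisorder']).agg(Count = ('AnyDisorder', 'count')).reset_index()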

In [None]:
# import libraries
import seaborn as sns
import matplotlib.pyplot as plt

# set plot style: grey grid in the background:
sns.set(style="darkgrid")

# Set the figure size
plt.figure(figsize=(20, 10))

# grouped barplot
ax = sns.countplot(data=summary_disorder, x='State', hue='AnyDisorder')
#ax.set(ylim=(0, 100))

ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)


In [None]:
# import libraries
import seaborn as sns
import matplotlib.pyplot as plt

# set plot style: grey grid in the background:
sns.set(style="darkgrid")

# Set the figure size
plt.figure(figsize=(20, 10))

# grouped barplot
ax = sns.countplot(data=summary_disorder, x='State', hue='AnySubstance')
#ax.set(ylim=(0, 100))

ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)


In [None]:
summary_disorder.groupby(['State', 'AnySubstance']).agg(Count = ('AnySubstance', 'count')).reset_index()


In [None]:
### Join this data with number of clinics for that state

In [None]:
### Join this data with population of the state