# Predicting if People Have Tried Crack/Cocaine 
## Exploratory Data Analysis

You can also check out my Crack/Cocaine Usage Prediction repository on my GitHub!

https://github.com/bgallamoza/Cocaine_Usage_Classification

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

# Importing our Data

We have a large dataset, so we're going to choose specific columns to extract. Each column is a code in the NSDUH Survey Documentation, which can be found here:

https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/studies/NSDUH-2002-2018/NSDUH-2002-2018-datasets/NSDUH-2002-2018-DS0001/NSDUH-2002-2018-DS0001-info/NSDUH-2002-2018-DS0001-info-codebook.pdf

A variety of continuous, ordinal, and categorical data was chosen, such as sex, highest completed education, total days consumed alcohol, etc. The result of each survey question is represented as a code. For example, "newrace2" contains the following codes to represent categorical data:

1. 1 = NonHisp White
2. 2 = NonHisp Black/Afr Am
3. 3 = NonHisp Native Am/AK Native
4. 4 = NonHisp Native HI/Other Pac Isl
5. 5 = NonHisp Asian
6. 6 = NonHisp more than one race
7. 7 = Hispanic
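
When reading the coded columns during exploration, it can help to translate codes back into labels. The following is a minimal sketch using a small made-up frame; the names `race_labels`, `sample`, and `race_label` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Code-to-label mapping copied from the list above
race_labels = {
    1: "NonHisp White",
    2: "NonHisp Black/Afr Am",
    3: "NonHisp Native Am/AK Native",
    4: "NonHisp Native HI/Other Pac Isl",
    5: "NonHisp Asian",
    6: "NonHisp more than one race",
    7: "Hispanic",
}

# Hypothetical coded responses (not real survey rows)
sample = pd.DataFrame({"newrace2": [1, 7, 5, 2]})

# Series.map looks each code up in the dictionary; codes missing from it become NaN
sample["race_label"] = sample["newrace2"].map(race_labels)
print(sample)
```

Keeping the numeric codes in the main dataframe and mapping to labels only for display avoids changing the data used later for modeling.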

In [None]:
# Columns we will be extracting
cols = [
    "crkever", # Ever used crack
    "cocever", # Ever used cocaine
    "iralcfy", # Number of days used alcohol in past year
    "catag3", # Age group
    "health", # Health condition
    "irwrkstat", # Work status
    "ireduhighst2", # Highest completed education
    "newrace2", # Race/Ethnicity
    "irsex", # Sex
    "irpinc3", # Income range
    "irki17_2", # Number of kids <18 y/o
    "irmjfy", # Number of days used marijuana in past year
    "wrkdhrswk2", # Number of hours worked in past week
    "irhhsiz2", # Number of people in household
    "cig30use", # Number of days smoked cigarettes in past month
    "irherfy", # Number of days used heroin in past year
    "irmethamyfq", # Number of days used methamphetamine in past year
    "year" # Survey year
]

In [None]:
# Because our dataset is large, we will read our data in chunks. Each chunk is stored in chunk_list
chunk_list = []

# Read the csv we exported from our R script into a pandas dataframe
for chunk in pd.read_csv("../input/national-survey-of-drug-use-and-health-20152019/NSDUH_2015-2019.csv", index_col=False, usecols=cols, chunksize=1000):
    chunk_list.append(chunk)

# Concatenate all chunks, then delete our chunk_list to preserve memory
df = pd.concat(chunk_list, axis=0)
del chunk_list
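
The chunked-read pattern above can be tried on a tiny in-memory file. This is only a sketch: the CSV text, column subset, and chunk size below are made up for illustration and stand in for the real survey file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real survey export
csv_text = "cocever,irsex,extra\n1,1,a\n2,2,b\n2,1,c\n1,2,d\n2,2,e\n"

# chunksize makes read_csv return an iterator of DataFrames;
# usecols means the unwanted "extra" column is never loaded
reader = pd.read_csv(io.StringIO(csv_text), usecols=["cocever", "irsex"], chunksize=2)
chunks = list(reader)

# Five rows in chunks of two gives three pieces, which concat stitches back together
small_df = pd.concat(chunks, axis=0)
print(len(chunks), small_df.shape)
```

Each chunk keeps its position in the overall row index, so the concatenated frame has the same index it would have had if read in one pass.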

In [None]:
# Take a brief look at our data and its shape
df

In [None]:
# Check if there are currently any NaN values in our dataset
df.isna().sum()

# Data Cleaning

No dataset is perfect. Our NSDUH documentation shows that several of our columns contain special codes that are equivalent to NaN (such as the 85 == BAD DATA code in "cocever") or essentially equate to 0 (such as 991 == "Have never drank alcohol" for "iralcfy"). We will therefore have to clean our data according to the NSDUH documentation codes if we want to explore trends in the data.

In [None]:
# Create a copy df that we will modify for cleaning
df2 = df.copy(deep=True)

# Look at the unique values in each column, confirming that special codes exist in our data
for col in cols:
    print(col)
    print(df2[col].unique())
    print("========================")

The data codes for certain columns differ from others. Continuous data columns tend to use three-digit special codes, whereas ordinal and categorical variables use two-digit codes. We will create two separate functions to avoid cleaning meaningful data from the continuous columns.
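
To see why one rule cannot serve both kinds of column, consider the value 91: in a day-count column it is a legitimate answer (91 days), while in a two-digit-coded column it is a special code. The sketch below uses simplified, hypothetical versions of the two rules (`two_digit_rule` and `three_digit_rule` are illustrative names) on a few made-up values.

```python
import numpy as np

def two_digit_rule(x):
    # Simplified rule for ordinal/categorical columns
    if x in (85, 89) or 94 <= x < 100:
        return np.nan
    if x in (91, 93):
        return 0
    return x

def three_digit_rule(x):
    # Simplified rule for continuous day-count columns
    if x in (985, 989) or 994 <= x < 1000:
        return np.nan
    if x in (991, 993):
        return 0
    return x

# Hypothetical day counts: 91 and 97 are real answers here, 991 is a special code
days = [91, 97, 991, 365]
print([two_digit_rule(d) for d in days])    # wrongly alters 91 and 97
print([three_digit_rule(d) for d in days])  # keeps 91 and 97, zeroes 991
```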

In [None]:
cont_cols = [
    "iralcfy",
    "irmjfy",
    "wrkdhrswk2",
    "irherfy",
    "irmethamyfq"
]

# The remaining (non-continuous) columns are ordinal or categorical
ord_cols = [x for x in cols if x not in cont_cols]

# Function that cleans continuous numerical data counting by year
def cont_clean_data(x):

    # Survey codes for "Bad Data", Don't Know, Skip, Refused, or Blank
    if x in (-9, 985, 989) or 994 <= x < 1000:
        return np.nan

    # Codes for "Have never done..." or "Have not done in the past X days"
    # Equivalent to 0 for numbered questions
    if x in (991, 993):
        return 0

    # Leave the value unchanged if no special code matches
    return x

# Function that cleans all other special data codes
def ord_clean_data(x):

    # Survey codes for "Bad Data", Don't Know, Skip, Refused, or Blank
    if x in (-9, 85, 89) or 94 <= x < 100:
        return np.nan

    # Codes for "Have never done..." or "Have not done in the past X days"
    # Equivalent to 0 for numbered questions
    if x in (91, 93):
        return 0

    # Leave the value unchanged if no special code matches
    return x
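
Element-wise functions are easy to read but can be slow on a large frame. As an alternative sketch (not the approach used in this notebook), the same idea can be written with vectorized pandas operations. The helper name `clean_codes` and the code lists are assumptions for illustration, and the sketch assumes the special codes are whole numbers.

```python
import numpy as np
import pandas as pd

def clean_codes(s, missing, zero):
    """Return a float copy of s with `missing` codes as NaN and `zero` codes as 0."""
    s = s.astype("float64")
    s = s.mask(s.isin(missing))        # mask() replaces matching entries with NaN
    return s.mask(s.isin(zero), 0)     # then the "never / not in period" codes become 0

# Three-digit codes for the continuous columns, written out explicitly
cont_missing = [-9, 985, 989, 994, 995, 996, 997, 998, 999]
cont_zero = [991, 993]

# Hypothetical day counts mixed with special codes
sample = pd.Series([12, 991, 985, 997, 365, 993])
cleaned = clean_codes(sample, cont_missing, cont_zero)
print(cleaned.tolist())
```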

There are some special situations that need to be accounted for before applying our cleaning functions. These situations are as follows:

### **wrkdhrswk2**

Does not specify "0 hours worked", but we have irwrkstat, which specifies "Unemployed" or "Not in work force" for irwrkstat = 3 or 4. We can use irwrkstat to create 0 values for wrkdhrswk2, since someone who is unemployed or not in the work force logically works 0 hours per week.

Additionally, people who work 61 hours or more are all pooled into wrkdhrswk2 = 61. If we want to maintain the continuous structure of wrkdhrswk2, we cannot include this top-coded group, so we'll remove it by converting 61 into np.nan values.

### **impweeks**

We will assume that if a respondent has skipped this question, it's because it does not apply to them. Therefore, skip codes 89 and 99 should be converted to 0, for 0 weeks having difficulties with mental health.

### **Binary Categorical Variables**

We have three binary categorical variables that need editing:

1. **cocever**: "Have you ever used cocaine before?"

2. **crkever**: Ever used crack

3. **irsex**: Sex of respondent

Right now, the usage questions are coded 1 = Yes, 2 = No (irsex uses 1 = Male, 2 = Female). We want to change the 2s to 0s, so that these columns follow the same 0/1 convention as the categorical variables we will dummify later.

We will do this by matching these situations using df2.loc.
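
As a small illustration of the pattern, `.loc` takes a boolean row mask and a column name, and assigns only to the matching cells. The frame below is made up; the values in it are placeholders rather than real survey rows.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the coding conventions described above
toy = pd.DataFrame({
    "irwrkstat": [1, 3, 4, 2],
    "wrkdhrswk2": [40.0, 985.0, 989.0, 61.0],
    "cocever": [1, 2, 2, 1],
})

# Unemployed / not in work force -> 0 hours worked
toy.loc[toy["irwrkstat"].isin([3, 4]), "wrkdhrswk2"] = 0
# Top-coded "61 or more hours" -> missing
toy.loc[toy["wrkdhrswk2"] == 61, "wrkdhrswk2"] = np.nan
# Recode No from 2 to 0
toy.loc[toy["cocever"] == 2, "cocever"] = 0
print(toy)
```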

In [None]:
# Changes for wrkdhrswk2
df2.loc[(df2.irwrkstat == 3) | (df2.irwrkstat == 4), "wrkdhrswk2"] = 0
df2.loc[df2.wrkdhrswk2 == 61, "wrkdhrswk2"] = np.nan

# Changes for binary categorical variables
df2.loc[(df2.cocever == 2), "cocever"] = 0
df2.loc[(df2.crkever == 2), "crkever"] = 0
df2.loc[(df2.irsex == 2), "irsex"] = 0

# Apply clean_data functions
df2[cont_cols] = df2[cont_cols].applymap(cont_clean_data)
df2[ord_cols] = df2[ord_cols].applymap(ord_clean_data)

df2

In [None]:
# Observe unique values in each column to ensure our changes
# are correct
for col in cols:
    print(col)
    print(df2[col].unique())
    print("========================")

In [None]:
df2.isna().sum()

# Creating our Target Column

In [None]:
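# Target is 1 if the respondent has ever used cocaine or crack, otherwise 0
df2['coccrkever'] = np.zeros(df2.shape[0])
df2.loc[(df2.cocever == 1) | (df2.crkever == 1), "coccrkever"] = 1

# NaN never compares equal to NaN, so use isna() to find rows where both answers are missing
df2.loc[df2.cocever.isna() & df2.crkever.isna(), "coccrkever"] = np.nan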
df2
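
When flagging rows with missing answers, note that NaN is not equal to anything, including itself, so an equality test against `np.nan` never matches. A minimal illustration on a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 0.0])

# Equality against NaN is False for every element, even the missing one
eq_result = (s == np.nan).tolist()
# isna() is the reliable way to detect missing values
isna_result = s.isna().tolist()

print(eq_result)    # [False, False, False]
print(isna_result)  # [False, True, False]
```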

In [None]:
df2[['cocever','crkever','coccrkever']].sum()

In [None]:
# Pickle our data
df2.to_pickle("./NSDUH_cleaned_2016-2019.pkl")
# df2 = pd.read_pickle("./NSDUH_cleaned_2016-2019.pkl")

# Exploratory Data Analysis

### Are the years significantly different from each other?

We can test this with a Chi-Squared test for homogeneity. To train our model, we want data that does not significantly differ between the years. First, we need to state our hypotheses:

**Null:** There is no difference in the distribution of people who have/have not used cocaine in the years 2015-2019.

**Alternative:** There is a difference in the distribution of people who have/have not used cocaine in the years 2015-2019.

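We will be using an **$\alpha$ = 0.10**. The chi-squared test is an upper-tail test, so the critical value comes from chi2.ppf(1 - $\alpha$, dof), and we reject the null when the test statistic exceeds it (equivalently, when the p-value is below $\alpha$).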

Additionally, we will drop all rows with NaN so that we only do our analysis with completely valid observations.

In [None]:
from scipy.stats import chi2_contingency, chi2

In [None]:
# df2_clean will be df2 with NaN values removed
df2_clean = df2.copy(deep=True)
df2_clean = df2_clean.drop(['irwrkstat'], axis=1).dropna()
df2_clean.isna().sum()

In [None]:
import seaborn as sns
sns.set(font_scale=1.2)

In [None]:
plot = sns.catplot(data=df2_clean, x='year', y='coccrkever', kind='bar', estimator=(lambda x: sum(x)/len(x)), legend=True)

In [None]:
# Observe the total "Yes" and "No" answers for crack/cocaine users by year
df2_clean.groupby('coccrkever').year.value_counts()

In [None]:
no_values = []
yes_values = []

# Count respondents per (target, year) once, then reuse the result
counts = df2_clean.groupby('coccrkever').year.value_counts()

# Append lists with yes/no values, where indices correspond to a given year
for year in range(2015, 2020):
    no_values.append(counts[0][year])
    yes_values.append(counts[1][year])

# Create 2D matrix of values
chi_matrix = [no_values, yes_values]
chi_matrix
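
A contingency table like this can also be built in one step with `pd.crosstab`. The sketch below uses a small made-up frame rather than the survey data, and `toy` and `table` are illustrative names.

```python
import pandas as pd

# Hypothetical respondents: survey year and 0/1 usage flag
toy = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2016, 2017],
    "coccrkever": [0, 1, 0, 0, 1, 1],
})

# Rows are the target classes (0 = No, 1 = Yes); columns are the years
table = pd.crosstab(toy["coccrkever"], toy["year"])
print(table)

# Nested lists in the same [no_values, yes_values] layout
matrix = table.to_numpy().tolist()
```

A year with no respondents in one class shows up as a 0 count rather than raising a lookup error.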

In [None]:
# Chi-squared tests are upper-tailed, so the critical value is the (1 - alpha) quantile
significance = 0.10
stat, p, dof, expected = chi2_contingency(chi_matrix)
critical = chi2.ppf(1 - significance, dof)
print("P-value = %f\nChi-Squared Stat = %f\nCritical Value = %f" %(p, stat, critical))

We fail to reject the null hypothesis, so we have no evidence that the distribution of crack/cocaine use differs across 2015-2019 and will treat the years as comparable.
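
The decision rule for a chi-squared test can be sketched on made-up counts. The `observed` table below is hypothetical and is not the survey result; the point is only that comparing the statistic to the upper-tail critical value and comparing the p-value to alpha give the same decision.

```python
from scipy.stats import chi2, chi2_contingency

# Hypothetical counts: rows = No/Yes, columns = three survey years
observed = [[900, 880, 910],
            [100, 120, 90]]

alpha = 0.10
stat, p, dof, expected = chi2_contingency(observed)

# Upper-tail critical value: the (1 - alpha) quantile of the chi-squared distribution
critical = chi2.ppf(1 - alpha, dof)

reject_by_stat = stat > critical
reject_by_p = p < alpha
print(dof, reject_by_stat, reject_by_p)
```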

## Creating Figures

Now, we can explore how our features interact with our target through graphs. NaN values interfere with our graph estimators, so we will continue using df2_clean for graphs.

In [None]:
# Function for easily plotting sns barplots on a grid
# Note: uses the global `fig`, which must be created before this is called
def plot_bar(data, grid, x, y, xlabel, ylabel, title, xticklabels, rotation=0):
    ax = fig.add_subplot(grid[0], grid[1], grid[2])
    sns.barplot(data=data, x=x, y=y, 
    estimator=(lambda x: sum(x)/len(x)), ax=ax).set_title(title)
    ax.set(xlabel=xlabel, ylabel=ylabel)
    ax.set_xticklabels(xticklabels, rotation=rotation)
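
The estimator `sum(x)/len(x)` is simply the mean, and for a 0/1 column the mean within a group is the proportion of that group coded 1. A small sketch on made-up rows (the frame `toy` is illustrative) shows the two calculations agree:

```python
import pandas as pd

# Hypothetical rows: race/ethnicity code and 0/1 usage flag
toy = pd.DataFrame({
    "newrace2": [1, 1, 1, 7, 7],
    "coccrkever": [1, 0, 0, 1, 1],
})

grouped = toy.groupby("newrace2")["coccrkever"]

# Built-in mean and the hand-written estimator give the same per-group proportion
props = grouped.mean()
manual = grouped.apply(lambda x: sum(x) / len(x))
print(props)
```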

In [None]:
# Set figure parameters
plt.rcParams['figure.figsize'] = [16, 12]
plt.rcParams['figure.subplot.wspace'] = 0.3
plt.rcParams['figure.subplot.hspace'] = 0.7
fig = plt.figure()

# Call plot_bar to plot bar graphs for various variables
plot_bar(df2_clean,[2, 2, 1], 'coccrkever', 'iralcfy', 'Crk/Coc Usage', 'Days Consumed Alcohol', 
"Average # Days Consumed Alcohol\nin a Year vs Crk/Coc Usage", ["Never Used", "Used"])

plot_bar(df2_clean,[2, 2, 2], 'ireduhighst2', 'coccrkever', 'Highest Completed Education', 'Proportion of People Who\nHave Used Crk/Cocaine', 
"Proportion of People who have\nUsed Crk/Coc by Highest Completed Education",
["5th or less", "6th", "7th", "8th", "9th", "10th", "11th/12th,\nno diploma", 
"High school\ndiploma/GED", "Some college,\nno degree", "Associate's Deg.", "College Grad\nor Higher"], 90)

plot_bar(df2_clean,[2, 2, 3], 'newrace2', 'coccrkever', 'Race/Ethnicity', 'Proportion of People Who\nHave Used Crk/Cocaine', 
"Proportion of People who have\nUsed Crk/Coc by Race/Ethnicity", 
["White", "Black/Afr Am", "Native Am/\nAK Native", "Pacific Isl/\nNative HI", "Asian", "Multiracial", "Hispanic"], 90)

plot_bar(df2_clean,[2, 2, 4], 'coccrkever', 'irmethamyfq', 'Crk/Coc Usage', 'Days Used Meth', 
"Average # Days Used Meth\nin a Year vs Crk/Coc Usage", ["Never Used", "Used"])

plt.show()

In [None]:
# Pickle our data
df2_clean.to_pickle("./NSDUH_cleaned_dropna_2016-2019.pkl")
# df2_clean = pd.read_pickle("./NSDUH_cleaned_dropna_2016-2019.pkl")