# Opioid Addiction Project
## Notebook 01: Load Data

This notebook loads the data, codes the misuse variable, and saves it to a pickle file.

Note: a separate file on Colab (originally local) takes the raw NSDUH data, removes rows for individuals who did not take pain killers, and concatenates the three survey years, 2015-2017.
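That preprocessing file is not reproduced in this notebook. As a rough sketch of the two steps it performs (filter, then concatenate), the example below uses tiny made-up frames in place of the real yearly NSDUH files. The column names `USED_PAIN_RELIEVER` and `YEAR` are hypothetical placeholders, not NSDUH variables.

```python
import pandas as pd

# Toy stand-ins for the three yearly files; all values are made up.
# USED_PAIN_RELIEVER is a hypothetical column name, not an NSDUH variable.
yearly_frames = {
    2015: pd.DataFrame({'USED_PAIN_RELIEVER': [1, 0, 1]}),
    2016: pd.DataFrame({'USED_PAIN_RELIEVER': [0, 1]}),
    2017: pd.DataFrame({'USED_PAIN_RELIEVER': [1, 1, 0]}),
}

filtered = []
for year, frame in yearly_frames.items():
    # Keep only respondents flagged as having used pain relievers
    kept = frame[frame['USED_PAIN_RELIEVER'] == 1].copy()
    kept['YEAR'] = year  # hypothetical column recording the survey year
    filtered.append(kept)

# Stack the three filtered years into one frame with a fresh index
combined = pd.concat(filtered, ignore_index=True)
print(combined)
```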

### W210, Capstone
Summer 2019

Team:  Cameron Kennedy, Aditi Khullar, Rachel Kramer, Sharad Varadarajan

# 0. Load Libraries and Set Global Variables
This section imports pandas, sets display options, and defines the data directory.

In [1]:
#Import Required Libraries
import pandas as pd

#Set initial parameter(s)
pd.set_option('display.max_rows', 200)
pd.options.display.max_columns = 20
dataDir = './data/'

print('Pandas Version', pd.__version__)

Pandas Version 0.24.2


# 1. Load Data

This step loads the data from the file `data.pickle.zip`.

In [2]:
#Load Data
df = pd.read_pickle('data.pickle.zip')
df

Unnamed: 0,AALTMDE,ABODALC,ABODCOC,ABODHER,ABODMRJ,ABPYILANAL,ABPYILLALC,ABUPOSHAL,ABUPOSINH,ABUPOSMTH,...,YUTPOTPP,YUTPSCHL,YUTPSOR,YUTPSTN2,YUTPSTYR,YUTPSUIC,ZALEPDAPYU,ZOHYANYYR2,ZOLPPDAPYU,ZOLPPDPYMU
19,,0,0,0,0,0,0,91.0,91.0,91.0,...,99,99,99,999,2,99,0,0,0,0
20,0.0,1,0,0,0,0,1,93.0,91.0,91.0,...,99,99,99,999,99,99,0,0,0,0
21,,0,0,0,0,0,0,91.0,93.0,91.0,...,99,99,99,999,99,99,0,0,0,0
23,,0,0,0,0,0,0,91.0,91.0,91.0,...,99,99,99,999,99,99,0,0,0,0
33,,0,0,0,0,0,0,91.0,91.0,91.0,...,99,99,99,999,99,99,0,0,0,0
34,,0,0,0,0,0,0,91.0,91.0,91.0,...,99,99,99,999,99,99,0,0,0,0
36,,0,0,0,0,0,0,91.0,91.0,91.0,...,99,99,99,999,99,99,0,0,0,0
46,,0,0,0,0,0,0,0.0,93.0,91.0,...,99,99,99,999,99,99,0,0,0,0
50,,0,0,0,0,0,0,91.0,93.0,91.0,...,99,99,99,999,99,99,0,0,0,0
52,,0,0,0,0,0,0,91.0,91.0,91.0,...,99,99,99,999,2,99,0,0,0,0


# 2. Code Misuse Variable

In [3]:
#Set MISUSE variable: 1 if PNRNMREC is one of the misuse codes, else 0
misuseCodes = [1, 2, 8]
df['MISUSE'] = df['PNRNMREC'].isin(misuseCodes).astype(int)

df

Unnamed: 0,AALTMDE,ABODALC,ABODCOC,ABODHER,ABODMRJ,ABPYILANAL,ABPYILLALC,ABUPOSHAL,ABUPOSINH,ABUPOSMTH,...,YUTPSCHL,YUTPSOR,YUTPSTN2,YUTPSTYR,YUTPSUIC,ZALEPDAPYU,ZOHYANYYR2,ZOLPPDAPYU,ZOLPPDPYMU,MISUSE
19,,0,0,0,0,0,0,91.0,91.0,91.0,...,99,99,999,2,99,0,0,0,0,0
20,0.0,1,0,0,0,0,1,93.0,91.0,91.0,...,99,99,999,99,99,0,0,0,0,0
21,,0,0,0,0,0,0,91.0,93.0,91.0,...,99,99,999,99,99,0,0,0,0,0
23,,0,0,0,0,0,0,91.0,91.0,91.0,...,99,99,999,99,99,0,0,0,0,0
33,,0,0,0,0,0,0,91.0,91.0,91.0,...,99,99,999,99,99,0,0,0,0,0
34,,0,0,0,0,0,0,91.0,91.0,91.0,...,99,99,999,99,99,0,0,0,0,1
36,,0,0,0,0,0,0,91.0,91.0,91.0,...,99,99,999,99,99,0,0,0,0,0
46,,0,0,0,0,0,0,0.0,93.0,91.0,...,99,99,999,99,99,0,0,0,0,1
50,,0,0,0,0,0,0,91.0,93.0,91.0,...,99,99,999,99,99,0,0,0,0,0
52,,0,0,0,0,0,0,91.0,91.0,91.0,...,99,99,999,2,99,0,0,0,0,0
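
To illustrate the coding rule on a small scale: rows whose `PNRNMREC` value is in `misuseCodes` get `MISUSE = 1`, and all other rows get 0. The sketch below applies the rule to a toy frame in two ways, a vectorized `isin` test and a row-wise `apply`, and checks that they agree. Only the codes 1, 2, and 8 come from this notebook; the `PNRNMREC` values are made up.

```python
import pandas as pd

misuseCodes = [1, 2, 8]  # codes used in this notebook

# Made-up PNRNMREC values, for illustration only
toy = pd.DataFrame({'PNRNMREC': [1, 2, 3, 8, 91]})

# Vectorized: test membership for the whole column at once
toy['MISUSE'] = toy['PNRNMREC'].isin(misuseCodes).astype(int)

# Row-wise equivalent, evaluated one row at a time
rowwise = toy.apply(lambda row: 1 if row.PNRNMREC in misuseCodes else 0,
                    axis=1)

print(toy)
print((toy['MISUSE'] == rowwise).all())
```

On a frame with tens of thousands of rows and many columns, the vectorized form is usually much faster because it avoids building a Series for every row.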


In [4]:
#Determine classification imbalance
print('COUNT of MISUSE')
print(df.groupby(['MISUSE'])['MISUSE'].count())
print('\nPERCENT MISUSE')
#Note: the value printed is a proportion between 0 and 1 (about 16%), not a percentage
print(sum(df.MISUSE) / len(df))

COUNT of MISUSE
MISUSE
0    44675
1     8683
Name: MISUSE, dtype: int64

PERCENT MISUSE
0.16273098691855017
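
The same counts and class shares can also be read off with `value_counts`. This is a minimal sketch on a made-up `MISUSE` column, not the project data.

```python
import pandas as pd

# Made-up labels: six 0s and two 1s
misuse = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name='MISUSE')

# Absolute counts per class, ordered by class label
counts = misuse.value_counts().sort_index()

# Share of each class (sums to 1)
shares = misuse.value_counts(normalize=True).sort_index()

print(counts)
print(shares)

# For a 0/1 column, the mean equals the share of the positive class
print(misuse.mean())
```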


# 3. Save to Pickle

In [5]:
#Save to pickle
#Note: .to_pickle INFERS zip compression from the '.zip' extension.
#Changing the extension will result in a 1GB file instead of a
#compressed file.
df.to_pickle(dataDir+'misuse.pickle.zip')
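
The extension-based compression can be checked with a small round trip. The sketch below writes a made-up frame to a temporary directory, so the file name `toy.pickle.zip` and the data are illustrative only.

```python
import os
import tempfile
import zipfile

import pandas as pd

toy = pd.DataFrame({'MISUSE': [0, 1, 0]})  # made-up data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pickle.zip')

    # compression defaults to 'infer', so '.zip' selects zip compression
    toy.to_pickle(path)

    # Confirm the file on disk really is a zip archive
    is_zip = zipfile.is_zipfile(path)

    # read_pickle infers the compression from the extension in the same way
    restored = pd.read_pickle(path)

print(is_zip)
print(restored.equals(toy))
```

Passing `compression='zip'` explicitly would remove the dependence on the file name, at the cost of repeating the setting at every read and write.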