# Drug Consumption Final Project for COMP 562 - Random Forest Methods
#### by Samantha Anthony

## Import Libraries and Data
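- X and y, with their train/val/test splits, hold the full dataset: individual drug columns, no schedule columns
- _sch splits use the schedule indicator columns instead of the individual drug columns
- X2 splits (X_train2, X_val2, X_test2) one-hot encode the categorical features
- X3 splits (X_train3, X_val3, X_test3) label encode the categorical features
- y_bin_train/val/test are the binary (user vs. non-user) targets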

In [150]:
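import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # used for the train/val/test splits below
from sklearn.metrics import confusion_matrix  # used to evaluate every model below
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import CategoricalNB
from sklearn.inspection import permutation_importance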

drug_labels = ['Alcohol','Amphet','Amyl','Benzos','Caff','Choc','Coke','Crack','Ecstasy','Heroin','Ketamine','Legalh','LSD','Meth','Mushrooms','Nicotine','VSA']
sched_1 = ['Heroin','LSD','Ecstasy','Mushrooms']
sched_2 = ['Amphet','Coke','Crack','Meth']
sched_3 = ['Ketamine']
sched_4 = ['Benzos']
not_controlled_substance = ['Alcohol','Amyl','Caff','Choc','Legalh','Nicotine','VSA']
schedules = ['Sch1','Sch2','Sch3','Sch4', 'SchNA']
personality_labels = ['Nscore','Escore','Oscore','Ascore','Cscore','Impulsive','SS']
demographic_labels = ['Age','Gender','Education','Country','Ethnicity']
target_label = ['Cannabis']

In [3]:
drugs = pd.read_csv('drugs.csv')
print(drugs.shape)
drugs.head()

(1877, 35)


Unnamed: 0,Age,Gender,Education,Country,Ethnicity,Nscore,Escore,Oscore,Ascore,Cscore,...,LSD,Meth,Mushrooms,Nicotine,VSA,Sch1,Sch2,Sch3,Sch4,SchNA
0,35-44,1,Professional certificate/ diploma,UK,Mixed-White/Asian,39.0,36.0,42.0,37.0,42.0,...,0,0,0,1,0,0,1,0,1,1
1,25-34,0,Doctorate degree,UK,White,29.0,52.0,55.0,48.0,41.0,...,1,1,0,1,0,1,1,1,0,1
2,35-44,0,Professional certificate/ diploma,UK,White,31.0,45.0,40.0,32.0,34.0,...,0,0,1,0,0,1,0,0,0,1
3,18-24,1,Masters degree,UK,White,34.0,34.0,46.0,47.0,46.0,...,0,0,0,1,0,0,1,1,1,1
4,35-44,1,Doctorate degree,UK,White,43.0,28.0,43.0,41.0,50.0,...,0,0,1,1,0,1,1,0,0,1


In [99]:
y = drugs['Cannabis']
# Drop the target and the schedule indicators; columns= avoids the deprecated positional axis argument
X = drugs.drop(columns=['Cannabis', 'Sch1', 'Sch2', 'Sch3', 'Sch4', 'SchNA'])


In [155]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1/0.9, random_state=1) # 0.1/0.9 x 0.9 = 0.1

In [156]:
X_train.shape, X_val.shape, X_test.shape

((1501, 29), (188, 29), (188, 29))
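The two-stage split above could be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the function name `three_way_split` and the toy data are assumptions for illustration. Because `train_test_split` shuffles by row position, reusing the same seed on differently encoded versions of the same rows selects matching rows, which is what lets one set of targets be paired with several feature sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(X, y, holdout=0.1, seed=1):
    """Hypothetical helper: val and test are each `holdout` of the full data."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=holdout, random_state=seed)
    # Rescale so the validation set is `holdout` of the ORIGINAL row count
    val_fraction = holdout / (1 - holdout)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_fraction, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Toy data with the same number of rows as drugs.csv (1877)
X_demo = pd.DataFrame({'feature': np.arange(1877)})
y_demo = pd.Series(np.arange(1877) % 2)

parts = three_way_split(X_demo, y_demo)
split_sizes = [len(p) for p in parts[:3]]
print(split_sizes)

# Same seed and same row count: a differently shaped frame gets the same rows
other = three_way_split(X_demo.assign(extra=0), y_demo)
same_rows = parts[0].index.equals(other[0].index)
print(same_rows)
```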

The _sch splits keep the demographic and personality features but use the schedule indicators in place of the individual drug columns.

In [157]:
X_train_sch, X_test_sch, y_train_sch, y_test_sch = train_test_split(drugs[demographic_labels + personality_labels + schedules], y, test_size=0.1, random_state=1)
X_train_sch, X_val_sch, y_train_sch, y_val_sch = train_test_split(X_train_sch, y_train_sch, test_size=0.1/0.9, random_state=1) # 0.1/0.9 x 0.9 = 0.1

X2 denotes the datasets whose categorical features are one-hot encoded.

In [103]:
X2 = pd.get_dummies(X, columns=(demographic_labels))
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y, test_size=0.1, random_state=1)
X_train2, X_val2, y_train2, y_val2 = train_test_split(X_train2, y_train2, test_size=0.1/0.9, random_state=1) # 0.1/0.9 x 0.9 = 0.1

X3 denotes the datasets whose categorical features are label encoded.

In [104]:
X3 = X.copy()
for i in demographic_labels:
    X3[i] = X3[i].astype('category')
    X3[i] = X3[i].cat.codes
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y, test_size=0.1, random_state=1)
X_train3, X_val3, y_train3, y_val3 = train_test_split(X_train3, y_train3, test_size=0.1/0.9, random_state=1) # 0.1/0.9 x 0.9 = 0.1
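To make the difference between the two encodings concrete, here is a small sketch on a toy column. The toy frame is an assumption for illustration; only the `Education` values are taken from the data preview. One-hot encoding produces one indicator column per category, while `cat.codes` assigns integers in sorted category order, and `GaussianNB` will then treat those integers as numeric values.

```python
import pandas as pd

# Toy column for illustration only
toy = pd.DataFrame({'Education': ['Masters degree',
                                  'Doctorate degree',
                                  'Professional certificate/ diploma',
                                  'Doctorate degree']})

# One-hot encoding: one indicator column per distinct category
one_hot = pd.get_dummies(toy, columns=['Education'])
print(one_hot.columns.tolist())

# Label encoding: integer codes follow the sorted order of the category names
label_codes = toy['Education'].astype('category').cat.codes
print(label_codes.tolist())
```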

y_bin denotes the targets where the output is divided into the binary categories of user and non-user.

In [105]:
# Class 'CL0' is encoded as 0 (non-user); every other class is encoded as 1 (user).
# Only 'CL0' maps to 0: the variant that also maps 'CL1' to 0 is not used.
y_bin_train = np.where(y_train == 'CL0', 0, 1)
y_bin_val = np.where(y_val == 'CL0', 0, 1)
y_bin_test = np.where(y_test == 'CL0', 0, 1)
y_bin_train2 = np.where(y_train2 == 'CL0', 0, 1)
y_bin_val2 = np.where(y_val2 == 'CL0', 0, 1)
y_bin_test2 = np.where(y_test2 == 'CL0', 0, 1)
y_bin_train3 = np.where(y_train3 == 'CL0', 0, 1)
y_bin_val3 = np.where(y_val3 == 'CL0', 0, 1)
y_bin_test3 = np.where(y_test3 == 'CL0', 0, 1)
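The choice of threshold matters for what "user" means. The sketch below compares the rule used here (only `CL0` maps to 0) with a looser variant that also maps `CL1` to 0. The toy labels and the names `strict` and `lenient` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy labels for illustration only
usage = pd.Series(['CL0', 'CL1', 'CL3', 'CL6', 'CL0'])

# Rule used in this notebook: only CL0 counts as non-user
strict = np.where(usage == 'CL0', 0, 1)

# Looser variant: CL0 and CL1 both count as non-user
lenient = np.where(usage.isin(['CL0', 'CL1']), 0, 1)

print(strict.tolist())
print(lenient.tolist())
```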

## Naive Bayes Models

#### This first model uses the entire dataset (one-hot encoded X2) to predict which of the 7 classes of weed user (y) each participant is
Accuracy = 0.28

In [135]:
gnb = GaussianNB().fit(X_train2, y_train2)
gnb_predictions = gnb.predict(X_test2)
accuracy = gnb.score(X_test2, y_test2)
print(accuracy)
cm = confusion_matrix(y_test2, gnb_predictions)
cm

0.28191489361702127


array([[27,  6,  0,  1,  0,  2,  0],
       [ 8,  5,  0,  2,  0,  2,  0],
       [ 5,  6,  0,  2,  2, 14,  0],
       [ 1,  2,  0,  4,  0, 15,  0],
       [ 0,  2,  0,  0,  2, 11,  0],
       [ 0,  1,  1,  2,  0, 15,  0],
       [ 0,  1,  1,  2,  7, 39,  0]], dtype=int64)

A Gaussian Naive Bayes model with one-hot encoded features does not appear to be an accurate predictor of the 7 weed user types.
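The confusion matrix printed above can be unpacked a little further. The sketch below re-enters the matrix by hand and derives per-class recall and the number of predictions per class. It assumes scikit-learn's convention (rows are true classes, columns are predicted classes) and that the classes sort as CL0 through CL6; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the first model's output
cm_7 = np.array([[27,  6,  0,  1,  0,  2,  0],
                 [ 8,  5,  0,  2,  0,  2,  0],
                 [ 5,  6,  0,  2,  2, 14,  0],
                 [ 1,  2,  0,  4,  0, 15,  0],
                 [ 0,  2,  0,  0,  2, 11,  0],
                 [ 0,  1,  1,  2,  0, 15,  0],
                 [ 0,  1,  1,  2,  7, 39,  0]])

true_counts = cm_7.sum(axis=1)   # test samples per true class
pred_counts = cm_7.sum(axis=0)   # predictions per class

accuracy_7 = np.trace(cm_7) / cm_7.sum()
recall_7 = np.diag(cm_7) / true_counts
majority_share = true_counts.max() / cm_7.sum()
never_predicted = np.flatnonzero(pred_counts == 0)

print(round(accuracy_7, 4), round(majority_share, 4))
print(np.round(recall_7, 2))
print(never_predicted)
```

In this run the largest true class is never predicted at all, and the overall accuracy is only slightly above the share of that class in the test set.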

#### This second model uses the entire dataset (label encoded X3) to predict which of the 7 classes of weed user (y) each participant is
Accuracy = 0.28

In [137]:
gnb = GaussianNB().fit(X_train3, y_train3)
gnb_predictions = gnb.predict(X_test3)
accuracy = gnb.score(X_test3, y_test3)
print(accuracy)
cm = confusion_matrix(y_test3, gnb_predictions)
cm

0.2765957446808511


array([[36,  0,  0,  0,  0,  0,  0],
       [ 9,  2,  1,  0,  5,  0,  0],
       [ 6,  5,  1,  0, 17,  0,  0],
       [ 1,  2,  0,  0, 19,  0,  0],
       [ 2,  0,  0,  0, 13,  0,  0],
       [ 0,  1,  0,  0, 18,  0,  0],
       [ 0,  2,  1,  0, 47,  0,  0]], dtype=int64)

#### This third model looks at the label encoded dataset (X3) to predict which of two classes of weed user (y_bin) each participant is
Accuracy = 0.85

In [108]:
gnb = GaussianNB().fit(X_train3, y_bin_train3)
gnb_predictions = gnb.predict(X_test3)
accuracy = gnb.score(X_test3, y_bin_test3)
print(accuracy)
cm = confusion_matrix(y_bin_test3, gnb_predictions)
cm

0.8457446808510638


array([[ 36,   0],
       [ 29, 123]], dtype=int64)

#### This fourth model uses only the other drugs (X[drug_labels]) to predict which of two classes of weed user (y_bin) each participant is
Accuracy = 0.86

In [148]:
clf = CategoricalNB().fit(X_train[drug_labels], y_bin_train)
clf_predictions = clf.predict(X_test[drug_labels])
accuracy = clf.score(X_test[drug_labels], y_bin_test)
print(accuracy)
cm = confusion_matrix(y_bin_test, clf_predictions)
cm

0.8563829787234043


array([[ 36,   0],
       [ 27, 125]], dtype=int64)

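The fourth model drops the demographic and personality features yet scores slightly higher than the third (0.86 vs. 0.85). This suggests that, for Naive Bayes, those features contribute little and the other drugs are the better predictor. The two models also differ in type (CategoricalNB vs. GaussianNB), so the comparison is not fully controlled.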

#### This fifth model uses only the schedule indicators (X_sch[schedules]) to predict which of two classes of weed user (y_bin) each participant is
Accuracy = 0.86

In [173]:
clf = CategoricalNB().fit(X_train_sch[schedules], y_bin_train)
clf_predictions = clf.predict(X_test_sch[schedules])
accuracy = clf.score(X_test_sch[schedules], y_bin_test)
print(accuracy)
cm = confusion_matrix(y_bin_test, clf_predictions)
cm

0.8617021276595744


array([[ 36,   0],
       [ 26, 126]], dtype=int64)
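Accuracy alone hides how the errors are distributed, so it can help to read a few more numbers off the matrix above. The sketch below re-enters the fifth model's confusion matrix by hand, assuming scikit-learn's convention of true classes in rows and predicted classes in columns (0 = non-user, 1 = user). The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the fifth model's output
cm_sched = np.array([[36,   0],
                     [26, 126]])

total = cm_sched.sum()
accuracy_sched = np.trace(cm_sched) / total

# Accuracy of always predicting the larger true class (here: user)
majority_baseline = cm_sched.sum(axis=1).max() / total

# Share of true users that the model labels as users
user_recall = cm_sched[1, 1] / cm_sched[1].sum()

# Share of "non-user" predictions that are actually non-users
non_user_precision = cm_sched[0, 0] / cm_sched[:, 0].sum()

print(round(accuracy_sched, 4), round(majority_baseline, 4))
print(round(user_recall, 4), round(non_user_precision, 4))
```

Because 152 of the 188 test participants are users, a model that always predicts "user" would already reach about 0.81 accuracy, which is a useful reference point when reading the 0.86 figure.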

#### Overall, the most accurate Naive Bayes model used the drug schedule indicators to predict the binary output of whether or not a person is a weed user (~0.86 accuracy). It will be used as the baseline for comparing the effectiveness of the other models.

In [179]:
imps = permutation_importance(clf, X_train_sch[schedules], y_bin_train)
importances = imps.importances_mean
std = imps.importances_std
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for f in range(X_train_sch[schedules].shape[1]):
    print("%d. %s (%f)" % (f + 1, schedules[indices[f]], importances[indices[f]]))

Feature ranking:
1. Sch1 (0.075017)
2. Sch2 (0.059161)
3. Sch3 (0.013458)
4. SchNA (0.000000)
5. Sch4 (0.000000)


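Permutation importance on this model ranks Schedule 1 drug use as the most important feature for predicting weed use, followed by Schedule 2 and Schedule 3. Shuffling Sch4 or SchNA did not change the model's accuracy, so both score 0. Note that these importances are computed on the training data.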
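For intuition, permutation importance can be computed by hand: shuffle one column, re-score the fitted model, and record how far the accuracy drops. The sketch below does this on synthetic data. The data, the 10% label noise, and the helper name `manual_permutation_importance` are all assumptions for illustration; it is a simplified version of the idea, not scikit-learn's implementation.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
n = 1000

# Synthetic binary features: column 0 drives the label, column 1 is pure noise
informative = rng.integers(0, 2, size=n)
noise = rng.integers(0, 2, size=n)
X_toy = np.column_stack([informative, noise])

# Label equals the informative feature, flipped about 10% of the time
flip = rng.random(n) < 0.1
y_toy = np.where(flip, 1 - informative, informative)

toy_model = CategoricalNB().fit(X_toy, y_toy)

def manual_permutation_importance(model, X, y, rng, n_repeats=5):
    """Mean drop in accuracy when each column is shuffled in turn."""
    base = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Break the link between this column and the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops[col, r] = base - model.score(X_perm, y)
    return drops.mean(axis=1)

toy_importances = manual_permutation_importance(toy_model, X_toy, y_toy, rng)
print(toy_importances)
```

A feature whose values never change the model's prediction gets an importance near zero, which is one way to read the 0.000000 entries in the ranking above.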

## Random Forest