# Is there a relationship between a person's personality and 'drug name' consumption?

According to the National Institute on Drug Abuse, the costs related to drug use can reach or even exceed $740 billion annually in the USA. These costs come from accidents caused by driving under the influence, drug-related crime, healthcare, and people dropping out of the workforce. The number of deaths caused by drug overdose is increasing steadily every year.

The purpose of this study is to identify the groups of people who are more likely to become users of a certain drug, so that we can reach them with preventive programs or targeted education that can keep them from becoming drug users.

**[Caffeine](https://adf.org.au/drug-facts/caffeine/#wheel)**

## Import Packages

In [None]:
import pandas as pd
import numpy as np
import pickle

import warnings
warnings.filterwarnings("ignore")

# libraries for cleaning and preprocessing data
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# libraries for modeling
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import lightgbm as lgb

# libraries for evaluating models
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score

# libraries for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from helper import *
from visualizations import *
from modeling import *

%load_ext autoreload
%autoreload 2

## Reading Data

In [None]:
drugs = pd.read_csv('data/drug_consumption.data', header=None, index_col=0)

In [None]:
# rename columns
drugs.columns = ['Age', 'Gender', 'Education', 'Country', 'Ethnicity',
                 'Neuroticism', 'Extraversion', 'Openness-to-experience',
                 'Agreeableness', 'Conscientiousness', 'Impulsive',
                 'Sensation-seeking', 'Alcohol', 'Amphet', 'Amyl', 'Benzos',
                 'Caff', 'Cannabis', 'Choc', 'Coke', 'Crack', 'Ecstasy',
                 'Heroin', 'Ketamine', 'Legalh', 'LSD', 'Meth', 'Mushrooms',
                 'Nicotine', 'Semer', 'VSA']

In [None]:
personality_cols = ['Neuroticism', 'Extraversion', 'Openness-to-experience',
                    'Agreeableness', 'Conscientiousness', 'Impulsive',
                    'Sensation-seeking']

In [None]:
# Convert standardized values into categories
category_converter(drugs)

In [None]:
# plot distribution of personalities for each class for <drug name>
plot_personality(drugs, personality_cols, '<drug name>')

In [None]:
# define drug columns
drug_cols = ['Alcohol', 'Amphet', 'Amyl', 'Benzos',
             'Caff', 'Cannabis', 'Choc', 'Coke', 'Crack', 'Ecstasy',
             'Heroin', 'Ketamine', 'Legalh', 'LSD', 'Meth', 'Mushrooms',
             'Nicotine', 'Semer', 'VSA']

# define user and non-user for each drug:
# classes CL0 and CL1 count as non-users, everything else as users
for col in drug_cols:
    drugs[f"{col}_User"] = [0 if x in ('CL0', 'CL1') else 1
                            for x in drugs[col]]
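
The user/non-user rule in the loop above can be checked on a handful of labels without loading the dataset. This is a minimal sketch: the sample labels are made up, and the helper name `to_user_flag` is hypothetical. In the UCI Drug Consumption dataset, `CL0` stands for "never used" and `CL1` for "used over a decade ago", which is why both are treated as non-users here.

```python
# Made-up consumption labels in the dataset's CL0..CL6 coding
labels = ['CL0', 'CL1', 'CL2', 'CL6', 'CL0', 'CL4']

# Classes treated as non-users, matching the loop in the notebook cell
NON_USER_CLASSES = {'CL0', 'CL1'}


def to_user_flag(label):
    """Return 0 for a non-user class and 1 for any other class."""
    return 0 if label in NON_USER_CLASSES else 1


user_flags = [to_user_flag(x) for x in labels]
print(user_flags)  # [0, 0, 1, 1, 0, 1]
```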

In [None]:
plot_feature_dist(drugs, '<drug name>')

In [None]:
drugs = encoding(drugs)

In [None]:
# filter for just <drug name> Users; Country and Ethnicity are not used
# as features, so they are left out of the selection
<drug name> = drugs[['<drug name>_User', 'Age', 'Gender', 'Education']
             + personality_cols].copy()

<drug name>.rename(columns={'<drug name>_User': 'User'}, inplace=True)

# save <drug name> DataFrame for future use
<drug name>.to_csv('data/<drug name>.csv', index=False)

In [None]:
sns.reset_orig()

In [None]:
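# pass the column as x explicitly; recent seaborn versions treat the
# first positional argument as `data`
sns.countplot(x=<drug name>['User'])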
plt.title(f'Number of Users vs Non-Users of <drug name> \n',
          fontsize=14)
plt.xticks(np.arange(len(['Non-user', 'User'])),
           labels=['Non-user', 'User'],
           fontsize=14)
plt.ylabel('count', fontsize=14)
plt.xlabel('')
sns.despine(left=False, bottom=False)
plt.savefig('img/users_vs_nonusers - <drug name>.png',
            bbox_inches='tight');

## Preprocessing Data

In [None]:
<drug name> = pd.read_csv('data/<drug name>.csv')
<drug name>.head()

## Modeling

In [None]:
# define feature and target variables
y = <drug name>['User']
X = <drug name>.drop('User', axis=1)

In [None]:
# train-test split for modeling; split before oversampling so that the
# test set contains only real observations
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=220)

# oversample the minority class in the training set only
sm = SMOTE(random_state=220)
X_train, y_train = sm.fit_resample(X_train, y_train)
X_train = pd.DataFrame(X_train, columns=X.columns)

In [None]:
# scale data for Logistic Regression, KNN, and SVM
X_train_scale = X_train.copy()
X_test_scale = X_test.copy()

scale = StandardScaler()

X_train_scale.loc[:, ['Age', 'Education']] = scale.fit_transform(
    X_train_scale.loc[:, ['Age', 'Education']])
X_test_scale.loc[:, ['Age', 'Education']] = scale.transform(
    X_test_scale.loc[:, ['Age', 'Education']])
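
The scaling cell above fits the scaler on the training data and reuses the fitted parameters for the test data. The arithmetic behind that can be sketched in plain Python. The numbers below are made up, and the sketch assumes the default `StandardScaler` behaviour of subtracting the mean and dividing by the population standard deviation.

```python
from statistics import mean, pstdev

# Made-up values standing in for one feature such as 'Age'
train_age = [1.0, 2.0, 3.0, 4.0]
test_age = [2.0, 5.0]

# "fit": learn the parameters from the training data only
mu = mean(train_age)
sigma = pstdev(train_age)  # population standard deviation (ddof=0)

# "transform": apply the same parameters to both sets
train_scaled = [(v - mu) / sigma for v in train_age]
test_scaled = [(v - mu) / sigma for v in test_age]

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally do not, because no information from the test set is used to compute `mu` and `sigma`.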

### Logistic Regression

In [None]:
grid_log = {'C': [0.001, 0.01, 10, 100],
            'penalty': ['l1', 'l2'],
            # the default 'lbfgs' solver does not support the 'l1' penalty
            'solver': ['liblinear']}

gs_log = run_gridsearch_scaled(LogisticRegression, grid_log,
                               X_train_scale, X_test_scale,
                               y_train, y_test, random_state=220)
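
`run_gridsearch` and `run_gridsearch_scaled` are imported from the local `modeling` module and are not shown in this notebook. The following is a hypothetical sketch of what such a helper might do, assuming it wraps scikit-learn's `GridSearchCV`, forwards extra keyword arguments such as `random_state` to the estimator, and reports train and test accuracy. The name `run_gridsearch_sketch`, the 5-fold setting, the accuracy scoring, and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split


def run_gridsearch_sketch(estimator_cls, param_grid, X_train, X_test,
                          y_train, y_test, **estimator_kwargs):
    """Hypothetical stand-in for the notebook's helper: tune, then report."""
    search = GridSearchCV(estimator_cls(**estimator_kwargs), param_grid,
                          scoring='accuracy', cv=5)
    search.fit(X_train, y_train)
    print('Best params:', search.best_params_)
    print('Train accuracy:', search.score(X_train, y_train))
    print('Test accuracy:', search.score(X_test, y_test))
    return search


# Synthetic stand-in data, not the drug consumption dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=220)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=220)

gs_demo = run_gridsearch_sketch(LogisticRegression, {'C': [0.01, 1, 100]},
                                X_tr, X_te, y_tr, y_te, random_state=220)
```

Under this design, estimators without a `random_state` parameter, such as `KNeighborsClassifier`, would be called without that keyword.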

### Random Forest

In [None]:
grid_forest = {'n_estimators': [120, 500, 1200],
               'max_depth': [5, 25, None],
               'min_samples_split': [2, 10, 100],
               'min_samples_leaf': [1, 5, 10],
               'max_features': ['log2', 'sqrt', None]}

gs_forest = run_gridsearch(RandomForestClassifier, grid_forest,
                           X_train, X_test, y_train,
                           y_test, random_state=220)
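
The Random Forest grid above is the largest in this notebook, so it helps to estimate its cost before running it. The sketch below counts the parameter combinations an exhaustive grid search would try. The 5-fold cross-validation setting is an assumption, since the number of folds used inside `run_gridsearch` is not shown here.

```python
from itertools import product

grid_forest = {'n_estimators': [120, 500, 1200],
               'max_depth': [5, 25, None],
               'min_samples_split': [2, 10, 100],
               'min_samples_leaf': [1, 5, 10],
               'max_features': ['log2', 'sqrt', None]}

# An exhaustive grid search tries every combination of the listed values
combinations = list(product(*grid_forest.values()))
n_candidates = len(combinations)

cv_folds = 5  # assumption: the helper's actual fold count is not shown
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # 243 1215
```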

### LightGBM

In [None]:
grid_lgb = {'learning_rate': [0.01, 0.025, 0.1],
            'max_depth': [3, 12, 25],
            'min_child_weight': [1, 5, 7],
            'subsample': [0.1, 0.6, 1]}

gs_lgb = run_gridsearch(lgb.LGBMClassifier, grid_lgb,
                        X_train, X_test, y_train,
                        y_test, random_state=220)

# pickle.dump(gs_<model>, open('models/<drug name>.sav', 'wb'))

### KNN

In [None]:
grid_knn = {'n_neighbors': [2, 16, 64]}

gs_knn = run_gridsearch_scaled(KNeighborsClassifier, grid_knn,
                               X_train_scale, X_test_scale,
                               y_train, y_test)

### SVM

In [None]:
grid_svm = {'C': [0.001, 10, 1000],
            'class_weight': ['balanced', None],
            'kernel': ['linear', 'rbf']}

gs_svm = run_gridsearch_scaled('SVM', grid_svm,
                               X_train_scale, X_test_scale,
                               y_train, y_test, random_state=220)

## Findings

### ROC Curve

We used Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) scores to compare the classification methods. The ROC curve plots the True Positive rate against the False Positive rate as the decision threshold varies. The perfect model (red dotted line) would have an AUC of 1 and a ROC curve that looks like an upside-down 'L', because it would reach a 100% True Positive rate with no False Positives. The black dotted line shows the ROC curve of a random guess, which has an AUC of 0.5.
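
To make the ROC and AUC definitions concrete, here is a minimal plain-Python sketch that sweeps the decision threshold over a few made-up scores and integrates the resulting curve with the trapezoidal rule. The function names and data are illustrative only and are not part of the notebook's `visualizations` module.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, highest first
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [0, 0, 1, 1]
perfect_scores = [0.1, 0.2, 0.8, 0.9]      # every user outranks every non-user
imperfect_scores = [0.1, 0.4, 0.35, 0.8]   # one non-user outranks one user

print(auc(roc_points(y_true, perfect_scores)))    # 1.0
print(auc(roc_points(y_true, imperfect_scores)))  # 0.75
```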

In [None]:
models = [gs_forest, gs_lgb, gs_log, gs_knn, gs_svm]

model_names = ['RandomForest', 'LightGBM',
               'Logistic Regression', 'KNN', 'SVM']

plot_roc_curve(models, model_names, X_test, y_test, '<drug name>', X_test_scale)

### Interpreting 'Model Name' Results

We focus on the 'model name' model to analyze the accuracy of its predictions and to identify which features are most important in predicting 'drug name' users.

In [None]:
# load the saved model; the context manager closes the file afterwards
with open('models/<drug name>.sav', 'rb') as f:
    gs_<model> = pickle.load(f)

#### Confusion Matrix

The confusion matrix below shows the percentage of correct and incorrect predictions for each class. The 'model name' model was #% accurate in predicting whether a person was a user (#%) or a non-user (#%).
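
The three percentages in the paragraph above can be derived from the four cells of a binary confusion matrix. The sketch below uses made-up labels and assumes that the per-class figures are the share of actual users and actual non-users that were predicted correctly. Whether `plot_confusion_matrix` normalizes in exactly this way is not shown in this notebook.

```python
def confusion_counts(y_true, y_pred):
    """Count true negatives, false positives, false negatives, true positives."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Made-up labels: 1 = user, 0 = non-user
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
user_rate = tp / (tp + fn)           # share of actual users predicted as users
non_user_rate = tn / (tn + fp)       # share of actual non-users predicted as non-users

print(f'{accuracy:.0%} overall, {user_rate:.0%} users, {non_user_rate:.0%} non-users')
```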

In [None]:
plot_confusion_matrix(y_test, X_test_scale, gs_<model>, '<drug name>')
plt.savefig('img/<drug name>_matrix.png', bbox_inches='tight');

#### Important Features in Predicting 'drug name' Users

In [None]:
plot_feat_imp(gs_<model>, X_train, '<drug name>')

## Conclusion

Across the 5 classification models, a person's demographics (age, gender, and education level) and personality traits predicted 'drug name' use with accuracy levels of #-#%. feat1 and feat2 were the 2 most influential features in predicting 'drug name' consumption.