# Predicting Underage Drinking in High School Students Using Ensemble Machine Learning Methods
## Harnessing the Power of Ensemble Learning to Detect and Intervene in Underage Drinking Cases
### Pablo X Zumba

In this exercise, we will focus on underage drinking. The data set contains data about high school students. Each row represents a single student. The columns include the characteristics of deidentified students. This is a binary classification task: predict whether a student drinks alcohol or not (this is the **alc** column: 1=Yes, 0=No). This is an important prediction task to detect underage drinking and deploy intervention techniques. 

## Description of Variables

The variables are described in "Alcohol - Data Dictionary.docx".

## Goal

Use the **alcohol.csv** data set and build a model to predict **alc**. 

# Read and Prepare the Data

In [1]:
# Common imports

import pandas as pd
import numpy as np

np.random.seed(42)

# Get the data

In [2]:
#We will predict the "alc" value in the data set:

alcohol = pd.read_csv("alcohol.csv")
alcohol.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,health,absences,gender,alc
0,18,2,1,4,2,0,5,4,2,5,2,M,1
1,18,4,3,1,0,0,4,4,2,3,9,M,1
2,15,4,3,2,3,0,5,3,4,5,0,F,0
3,15,3,3,1,4,0,4,3,3,3,10,F,0
4,17,3,2,1,2,0,5,3,5,5,2,M,1


# Split data (train/test)

In [3]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(alcohol, test_size=0.3)

# Data Prep

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer

## Separate the target variable 

In [5]:
train_target = train['alc']
test_target = test['alc']

train_inputs = train.drop(['alc'], axis=1)
test_inputs = test.drop(['alc'], axis=1)

## Feature Engineering: Derive a new column

Examples:
- Ratio of study time to travel time
- Student is younger than 18 or not
- Average of father's and mother's level of education
- (etc.)
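
The three examples listed above could be derived roughly as follows. This is a minimal sketch, not the notebook's own feature set: the function name `derive_features` and the new column names (`study_travel_ratio`, `under_18`, `parent_edu_avg`) are made up for illustration, and the toy rows are copied from the `alcohol.head()` preview.

```python
import numpy as np
import pandas as pd

def derive_features(df):
    """Return a copy of df with three illustrative derived columns."""
    out = df.copy()
    # Ratio of study time to travel time; a zero travel time becomes NaN instead of inf
    out["study_travel_ratio"] = out["studytime"] / out["traveltime"].replace(0, np.nan)
    # 1 if the student is younger than 18, else 0
    out["under_18"] = (out["age"] < 18).astype(int)
    # Average of mother's and father's education level
    out["parent_edu_avg"] = out[["Medu", "Fedu"]].mean(axis=1)
    return out

# Toy rows (index 0, 2 and 4) copied from the alcohol.head() preview
sample = pd.DataFrame({
    "age": [18, 15, 17],
    "Medu": [2, 4, 3],
    "Fedu": [1, 3, 2],
    "traveltime": [4, 2, 1],
    "studytime": [2, 3, 2],
})
features = derive_features(sample)
print(features[["study_travel_ratio", "under_18", "parent_edu_avg"]])
```

To use a function like this inside a `FunctionTransformer`, it would need to return only the new columns, since the original columns are already handled by the other transformers.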

In [6]:
def new_col(df):
    #Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()
    
    df1['studytime_binary'] = np.where(df1['studytime'] > 2, 1, 0)
    
    return df1[['studytime_binary']]
    # You can use this to check whether the calculation is made correctly:
    #return df1

In [7]:
#Let's test the new function:

# Send train set to the function we created
new_col(train)

Unnamed: 0,studytime_binary
12759,0
4374,1
8561,0
10697,0
19424,1
...,...
16850,0
6265,0
11284,1
860,1


##  Identify the numeric, binary, and categorical columns

In [8]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [9]:
numeric_columns

['age',
 'Medu',
 'Fedu',
 'traveltime',
 'studytime',
 'failures',
 'famrel',
 'freetime',
 'goout',
 'health',
 'absences']

In [10]:
categorical_columns

['gender']

In [11]:
feat_eng_columns = ['studytime']

# Pipeline

In [12]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [13]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [14]:
# Create a pipeline for the transformed column here
my_new_column = Pipeline(steps=[('studytime', FunctionTransformer(new_col))])


In [15]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),        
        ('trans', my_new_column, feat_eng_columns)],
        remainder='passthrough')

#remainder='passthrough' is optional: it keeps any columns that are not listed above.
#Here every input column is already listed, so it has no effect.

# Transform: fit_transform() for TRAIN

In [16]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[ 0.66643886,  0.96597412,  0.90362635, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.66643886, -0.93881619, -1.68666277, ...,  0.        ,
         1.        ,  1.        ],
       [ 0.66643886,  0.33104402,  0.04019664, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [ 0.66643886, -2.20867639, -2.55009248, ...,  1.        ,
         0.        ,  1.        ],
       [ 1.6195814 , -0.30388608, -1.68666277, ...,  0.        ,
         1.        ,  1.        ],
       [ 1.6195814 , -0.30388608, -2.55009248, ...,  0.        ,
         1.        ,  0.        ]])

In [17]:
train_x.shape

(23800, 14)
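
The 14 output columns are consistent with 11 scaled numeric columns, one derived `studytime_binary` column, and two one-hot columns for `gender` (assuming `gender` takes two values in the training set). The toy sketch below shows the same counting on a smaller, made-up frame; `studytime_flag`, `toy`, and `toy_preprocessor` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer

def studytime_flag(df):
    # Same rule as new_col: 1 when studytime > 2, else 0
    flag = np.where(df["studytime"] > 2, 1, 0)
    return pd.DataFrame({"studytime_binary": flag}, index=df.index)

# Made-up rows with two numeric columns and one categorical column
toy = pd.DataFrame({
    "age": [18, 15, 17, 16],
    "studytime": [2, 3, 4, 1],
    "gender": ["M", "F", "M", "F"],
})

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["age", "studytime"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ("trans", FunctionTransformer(studytime_flag), ["studytime"]),
])

toy_x = toy_preprocessor.fit_transform(toy)
# 2 scaled numeric columns + 2 one-hot columns (F, M) + 1 derived flag = 5 columns
print(toy_x.shape)
```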

# Transform: transform() for TEST

In [18]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[-1.23984621,  0.33104402,  1.76705606, ...,  1.        ,
         0.        ,  0.        ],
       [-1.23984621, -0.30388608,  0.04019664, ...,  0.        ,
         1.        ,  0.        ],
       [-0.28670367,  0.33104402,  0.04019664, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [ 0.66643886, -0.30388608,  0.04019664, ...,  1.        ,
         0.        ,  0.        ],
       [-1.23984621, -0.93881619,  0.04019664, ...,  0.        ,
         1.        ,  1.        ],
       [-1.23984621,  0.96597412,  0.04019664, ...,  1.        ,
         0.        ,  0.        ]])

In [19]:
test_x.shape

(10200, 14)

# Calculate the Baseline

In [20]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")

dummy_clf.fit(train_x, train_target)

DummyClassifier(strategy='most_frequent')

In [21]:
from sklearn.metrics import accuracy_score

In [22]:
#Baseline Train Accuracy
dummy_train_pred = dummy_clf.predict(train_x)

baseline_train_acc = accuracy_score(train_target, dummy_train_pred)

print('Baseline Train Accuracy: {}' .format(baseline_train_acc))

Baseline Train Accuracy: 0.5234873949579832


In [23]:
#Baseline Test Accuracy
dummy_test_pred = dummy_clf.predict(test_x)

baseline_test_acc = accuracy_score(test_target, dummy_test_pred)

print('Baseline Test Accuracy: {}' .format(baseline_test_acc))

Baseline Test Accuracy: 0.5194117647058824
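
The `most_frequent` baseline always predicts the majority class, so its accuracy on a set of labels equals the share of that class in the set. A minimal sketch with made-up labels (the array `y` is a stand-in, not the real `train_target`):

```python
import numpy as np

# Hypothetical labels standing in for train_target (1 = drinks, 0 = does not)
y = np.array([1, 1, 0, 1, 0, 1, 0, 1])

# Count each class and pick the most frequent one
classes, counts = np.unique(y, return_counts=True)
majority_class = classes[np.argmax(counts)]

# Predicting the majority class for every row is correct exactly when the label is that class
baseline_pred = np.full_like(y, majority_class)
baseline_acc = np.mean(baseline_pred == y)

print(majority_class, baseline_acc)
```

Any model worth keeping should beat this number by a clear margin.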


# Train a voting classifier 

In [24]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.linear_model import SGDClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier


dtree_clf = DecisionTreeClassifier(max_depth=20)
log_clf = LogisticRegression(multi_class='auto', solver = 'lbfgs', C=10, max_iter=1000)

In [25]:
#Each model needs a predict_proba() method; otherwise it can't be used for soft voting.
#SGDClassifier with its default hinge loss has no predict_proba(), so we leave it out.

voting_clf = VotingClassifier(
            estimators=[('dt', dtree_clf), 
                        ('lr', log_clf)],
            voting='soft')

voting_clf.fit(train_x, train_target)

VotingClassifier(estimators=[('dt', DecisionTreeClassifier(max_depth=20)),
                             ('lr', LogisticRegression(C=10, max_iter=1000))],
                 voting='soft')
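
Soft voting averages the class probabilities of the fitted members and predicts the class with the highest average. The sketch below checks this by hand on synthetic data; `make_classification` stands in for the real `train_x`/`train_target`, and `demo_voting` is an illustrative name with a shallower tree than the one used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for train_x / train_target
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

demo_voting = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft")
demo_voting.fit(X, y)

# Average the class probabilities of the fitted members by hand
member_probas = [est.predict_proba(X) for est in demo_voting.estimators_]
manual_proba = np.mean(member_probas, axis=0)

# The predicted class is the one with the highest average probability
manual_pred = demo_voting.classes_[np.argmax(manual_proba, axis=1)]

print(np.allclose(manual_proba, demo_voting.predict_proba(X)))
print(np.array_equal(manual_pred, demo_voting.predict(X)))
```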

In [26]:
#Train accuracy

train_y_pred = voting_clf.predict(train_x)

train_acc = accuracy_score(train_target, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.9641176470588235


In [27]:
#Test accuracy

test_y_pred = voting_clf.predict(test_x)

test_acc = accuracy_score(test_target, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.7599019607843137


# Train a bagging classifier

In [28]:
from sklearn.ensemble import BaggingClassifier 


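#If you want to do pasting, set bootstrap=False
#n_jobs=-1 means use all CPU cores
#bagging uses soft voting only when the base estimator has predict_proba();
#SGDClassifier() with its default hinge loss does not, so this ensemble uses hard voting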

bag_clf = BaggingClassifier( 
            SGDClassifier(), n_estimators=50, 
            max_samples=1000, bootstrap=True, n_jobs=-1) 

bag_clf.fit(train_x, train_target)

BaggingClassifier(base_estimator=SGDClassifier(), max_samples=1000,
                  n_estimators=50, n_jobs=-1)

In [29]:
#Train accuracy

train_y_pred = bag_clf.predict(train_x)

train_acc = accuracy_score(train_target, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.82109243697479


In [30]:
#Test accuracy

test_y_pred = bag_clf.predict(test_x)

test_acc = accuracy_score(test_target, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.8214705882352941


Good model. Train and test accuracy are nearly identical (about 0.821 each), so there is no sign of overfitting, and both are well above the 0.52 baseline.
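
Whether a `BaggingClassifier` averages probabilities or counts votes depends on the base estimator: it averages `predict_proba()` outputs when that method is available and falls back to majority voting otherwise. A quick check, as a sketch, of which `SGDClassifier` configurations expose probabilities (`hinge_sgd` and `proba_sgd` are illustrative names):

```python
from sklearn.linear_model import SGDClassifier

# Default loss is "hinge" (a linear SVM), which produces no class probabilities
hinge_sgd = SGDClassifier()
# "modified_huber" is one loss for which SGDClassifier exposes predict_proba
proba_sgd = SGDClassifier(loss="modified_huber")

print(hasattr(hinge_sgd, "predict_proba"))
print(hasattr(proba_sgd, "predict_proba"))
```

Passing a probability-capable base estimator to `BaggingClassifier` would switch the ensemble to probability averaging; whether that improves accuracy on this data set is not something the notebook tests.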

# Train a random forest classifier

In [31]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1) 

rnd_clf.fit(train_x, train_target)

RandomForestClassifier(n_estimators=500, n_jobs=-1)

In [32]:
#Train accuracy

train_y_pred = rnd_clf.predict(train_x)

train_acc = accuracy_score(train_target, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.9829831932773109


In [33]:
#Test accuracy

test_y_pred = rnd_clf.predict(test_x)

test_acc = accuracy_score(test_target, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.8042156862745098


# Train an adaboost classifier

In [34]:
from sklearn.ensemble import AdaBoostClassifier 

#Create Adaptive Boosting with Decision Stumps (depth=1)
ada_clf = AdaBoostClassifier( 
            DecisionTreeClassifier(max_depth=1), n_estimators=500, 
            learning_rate=0.1) 


ada_clf.fit(train_x, train_target)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.1, n_estimators=500)

In [35]:
#Train accuracy

train_y_pred = ada_clf.predict(train_x)

train_acc = accuracy_score(train_target, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.8209243697478992


In [36]:
#Test accuracy

test_y_pred = ada_clf.predict(test_x)

test_acc = accuracy_score(test_target, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.8195098039215686


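Good model. Train accuracy (0.821) and test accuracy (0.820) are nearly equal, so there is no sign of overfitting.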

# Train a gradient boosting classifier

In [37]:
#Use GradientBoosting

from sklearn.ensemble import GradientBoostingClassifier

gbclf = GradientBoostingClassifier(max_depth=2, n_estimators=100, learning_rate=0.1) 

gbclf.fit(train_x, train_target)

GradientBoostingClassifier(max_depth=2)

In [38]:
#Train accuracy

train_y_pred = gbclf.predict(train_x)

train_acc = accuracy_score(train_target, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.8190336134453782


In [39]:
#Test accuracy

test_y_pred = gbclf.predict(test_x)

test_acc = accuracy_score(test_target, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.8116666666666666
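
To compare the five ensembles side by side, the accuracies printed above can be collected in one place. The sketch below uses values copied from the cell outputs and rounded to four decimals; the dictionary `results` and the variable names are illustrative, and a rerun with a different split would give slightly different numbers.

```python
# Accuracies copied from the cell outputs above, rounded to four decimals
results = {
    "voting": (0.9641, 0.7599),
    "bagging (SGD)": (0.8211, 0.8215),
    "random forest": (0.9830, 0.8042),
    "adaboost": (0.8209, 0.8195),
    "gradient boosting": (0.8190, 0.8117),
}
baseline_test_acc = 0.5194

# Sort by test accuracy and print the train/test gap and the lift over the baseline
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)
for name, (train_acc, test_acc) in ranked:
    gap = train_acc - test_acc
    lift = test_acc - baseline_test_acc
    print(f"{name:18s} test={test_acc:.4f} gap={gap:+.4f} lift={lift:+.4f}")

best_model = max(results, key=lambda name: results[name][1])
largest_gap = max(results, key=lambda name: results[name][0] - results[name][1])
```

On these numbers the bagged SGD ensemble has the highest test accuracy, while the voting classifier and the random forest show the largest train/test gaps, which points to overfitting in those two.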
