In this notebook I examine the inverse of the usual question: which predictors, if any, identify heavy drinking? If we assume a causal connection in the forward direction (heavy weekday drinking leads to lower scores), then it is also worth examining whether grades have any predictive power for finding heavy drinkers, for example to target an intervention.

To do this, I first need to establish, at least roughly, that drinking has a statistically significant association with grades. Only then does it make sense to look in the reverse direction and predict drinking from grades.

For this I use simple regression techniques (OLS, Ridge, and Lasso) to look for any connection.

Lasso is used because it can set coefficients exactly to zero; a good look through the correlation matrices does not hurt either.
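
To illustrate the zeroing behaviour, here is a minimal sketch on synthetic data (not the student dataset). The number of features, the `alpha=0.1` penalty, and the random seed are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): 5 features, only the first two matter.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Lasso's L1 penalty can drive irrelevant coefficients exactly to zero;
# Ridge's L2 penalty only shrinks coefficients towards zero.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Exact zeros (Lasso, Ridge):", n_zero_lasso, n_zero_ridge)
```

Counting exact zeros in `coef_` is one simple way to see which features Lasso has dropped for a given penalty.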

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import pandas as pd
from pandas.plotting import parallel_coordinates,andrews_curves  # pandas.tools.plotting was removed in later pandas releases
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import scale
from sklearn.linear_model import LassoCV,RidgeCV,Lasso,Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in later scikit-learn releases
%matplotlib inline

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
df_mat = pd.read_csv("../input/student-mat.csv")
df_mat.info()

Looking through the summaries of the dataset reveals that the vast majority of students do not drink much, if at all, on weekdays or at the weekend, though they drink more at the weekend. Here I'll look only at the variable most likely to affect behavior and grades: Dalc, the weekday alcohol intake.

In [None]:
df_mat[['Dalc','Walc','G1','G2','G3']].describe()

In [None]:
df_mat[['Dalc','Walc','G1','G2','G3','Medu','Fedu']].corr()

##Plotting
This dataset is fairly wide: it has many features relative to the number of samples. Reading through an entire description or correlation matrix would take considerably longer than using the graphing tools we have on hand.

In [None]:
parallel_coordinates(df_mat[['Dalc','Walc','G1','G2','G3','Medu','Fedu','health','absences']],'Dalc')

In [None]:
andrews_curves(df_mat[['Dalc','Walc','G1','G2','G3','Medu','Fedu','health','absences','studytime','traveltime','goout','freetime','famrel']],'Dalc')

The parallel coordinates plot tells a similar story to the Andrews curves. There are a handful of outliers among the light drinkers, likely the students who failed their second and third exams. The plots are also dominated by bands of light drinkers, which could mean that the data is quite noisy, without good separation between the classes at first glance. However, the heavier drinkers (the 3s and 4s) are banded relatively close together, which suggests that some separation should be possible.

##Basic Regression

In [None]:
basic_ols = smf.ols(formula="G3 ~ G1 + G2",data=df_mat)
basic_ols.fit().summary()

G1 and G2 alone explain about 82% of the variance in the final grade, so any major factor that influences those variables should also have some effect on G3. To tease this out, we'll bin the drinking classes into two groups to make classification easier down the road.
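
The R-squared figure reported in an OLS summary can be reproduced by hand as one minus the ratio of residual to total sum of squares. The sketch below does this with plain NumPy on synthetic stand-ins for the grades; the generated values and noise levels are assumptions, not taken from the dataset.

```python
import numpy as np

# Synthetic stand-ins for the grades (assumed values, not the real data).
rng = np.random.RandomState(1)
g1 = rng.uniform(0, 20, size=100)
g2 = g1 + rng.normal(scale=1.5, size=100)
g3 = 0.2 * g1 + 0.8 * g2 + rng.normal(scale=2.0, size=100)

# Ordinary least squares with an intercept: G3 ~ G1 + G2.
X = np.column_stack([np.ones_like(g1), g1, g2])
beta, *_ = np.linalg.lstsq(X, g3, rcond=None)

residuals = g3 - X @ beta
ss_res = np.sum(residuals ** 2)          # variation left unexplained by the fit
ss_tot = np.sum((g3 - g3.mean()) ** 2)   # total variation around the mean
r_squared = 1.0 - ss_res / ss_tot
print("R-squared:", round(r_squared, 3))
```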

In [None]:
df_mat_heavy_drinking = df_mat[df_mat['Dalc']>=3]
df_mat_light_drinking = df_mat[df_mat['Dalc']<3]

In [None]:
basic_hd_ols = smf.ols(formula="G3 ~ G1 + G2",data=df_mat_heavy_drinking)
basic_hd_ols.fit().summary()

In [None]:
basic_ld_ols = smf.ols(formula="G3 ~ G1 + G2",data=df_mat_light_drinking)
basic_ld_ols.fit().summary()

It appears that the Andrews curves provided some useful insight: splitting Dalc at 3 was a good choice. The light-drinking subset, which is the vast majority of the dataset, remains very similar to our naive regression. The heavy-drinking subset, however, shows some degradation. Of particular interest is the diminishing role of G2: an increase in the G2 score is reflected less in the final score. Put differently, for heavy drinkers G3 appears to be shaped more by influences beyond what G2 tells us.

##Plotting Again

In [None]:
andrews_curves(df_mat_heavy_drinking[['Dalc','Walc','G1','G2','G3','Medu','Fedu','health','absences','studytime','traveltime','goout','freetime','famrel']],'Dalc')

In [None]:
parallel_coordinates(df_mat_heavy_drinking[['Dalc','Walc','G1','G2','G3','Medu','Fedu','health','absences']],'Dalc')

The data is now much noisier than anticipated, but the parallel coordinates plot is indeed easier to read. Notably, the 3 class fares much better across the G1-G3 scores than the other two classes. This lightly supports the intuition that moderate to heavy weekday drinking decreases academic performance.

With the data narrowed down to this subset, it makes sense to me to dive into a correlation matrix and scan for factors influencing the grades.

In [None]:
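# Restrict to numeric columns first; recent pandas versions raise an error on string columns in corr()
df_mat_heavy_drinking.select_dtypes(include='number').corr()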

Perhaps most interesting is that study time is not heavily correlated with the final grade, while failures, absences, famrel, and freetime all correlate much more strongly. I'll use these in another OLS regression later, to see whether Lasso and/or Ridge choose similar variables and whether they confirm or conflict with intuition.

In [None]:
# Drop the categorical columns (and age) so that only the numeric predictors and grades remain
df_regressor_mat_hd = df_mat_heavy_drinking.drop(['school','sex','age','address','famsize','Pstatus','Fjob','Mjob','reason','guardian','schoolsup','famsup','paid','activities','nursery','higher','internet','romantic'],axis=1)
# Standardize every remaining column; scale() returns a plain array, so rebuild the DataFrame with the same column names
df_regressor_mat_hd = pd.DataFrame(scale(df_regressor_mat_hd),columns=df_regressor_mat_hd.columns)
X_train, X_test, y_train, y_test = train_test_split(df_regressor_mat_hd.drop('G3',axis=1),df_regressor_mat_hd.G3,random_state=42)
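
The cell above scales the whole frame before splitting, so the test rows contribute to the scaling statistics. One possible alternative, sketched here on a synthetic frame, is to split first and fit a `StandardScaler` on the training rows only. This is a different preprocessing choice from the one used in this notebook, and the frame and its values are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder frame (assumed values) standing in for the numeric student columns.
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.randint(0, 21, size=(60, 3)), columns=["G1", "G2", "G3"])

X_tr, X_te, y_tr, y_te = train_test_split(
    df.drop("G3", axis=1), df["G3"], random_state=42
)

# Fit the scaler on the training rows only, then apply the same transform to the test rows.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)
print(X_tr_scaled.describe())
```

With this ordering the training columns have zero mean and unit variance, while the test columns generally do not, which is the expected behaviour.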

In [None]:
alphas = 10**np.linspace(-4,4,150)

In [None]:
lasso_coefs = []
lasso_mse = []
lasso = Lasso()
for alpha in alphas:
    lasso.set_params(alpha=alpha)
    lasso.fit(X_train,y_train)
    lasso_coefs.append(lasso.coef_)
    lasso_mse.append(mean_squared_error(y_test,lasso.predict(X_test)))

In [None]:
lasso_cv = LassoCV(alphas=alphas)
lasso_cv.fit(X_train,y_train)

In [None]:
plt.plot(alphas,lasso_coefs)
plt.xscale("log")
plt.axvline(lasso_cv.alpha_,linestyle="dashed",color='g',alpha=0.8)
plt.xlim(0.001,1)

In [None]:
plt.plot(alphas,lasso_mse)
plt.xscale("log")
plt.axvline(lasso_cv.alpha_,alpha=0.8,linestyle="dashed")

In [None]:
lasso_coefficients = pd.Series(lasso_cv.coef_,index=['Medu','Fedu','traveltime','studytime','failures','famrel','freetime','goout','Dalc','Walc','health','absences','G1','G2'])
lasso_coefficients

Lasso picked out alcohol immediately, though (perhaps counter-intuitively) it also says that weekend drinking is associated with higher G3 scores. Still, the impact is small; what matters more is doing well midway through the semester. It is a good thing the heavy-drinking class was separated out from the rest of the noise.

In [None]:
ridge_coefs = []
ridge_mse = []
ridge = Ridge()
for alpha in alphas:
    ridge.set_params(alpha=alpha)
    ridge.fit(X_train,y_train)
    ridge_coefs.append(ridge.coef_)
    ridge_mse.append(mean_squared_error(y_test,ridge.predict(X_test)))

In [None]:
ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train,y_train)

In [None]:
plt.plot(alphas,ridge_coefs)
plt.xscale("log")
plt.axvline(ridge_cv.alpha_,linestyle="dashed",color='g',alpha=0.8)


In [None]:
plt.plot(alphas,ridge_mse)
plt.xscale("log")
plt.axvline(ridge_cv.alpha_,alpha=0.8,linestyle="dashed")

In [None]:
ridge_coefficients = pd.Series(ridge_cv.coef_,index=['Medu','Fedu','traveltime','studytime','failures','famrel','freetime','goout','Dalc','Walc','health','absences','G1','G2'])
ridge_coefficients

Ridge produces similar coefficients, which is expected but heartening to see. And since this is a much more complex and noisy system, where Ridge tends to perform better, I'll continue using Ridge to reverse the equation and predict Dalc.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_regressor_mat_hd.drop('Dalc',axis=1),df_regressor_mat_hd.Dalc,random_state=42)

In [None]:
ridge_dalc_coefs = []
ridge_dalc_mse = []
ridge = Ridge()
for alpha in alphas:
    ridge.set_params(alpha=alpha)
    ridge.fit(X_train,y_train)
    ridge_dalc_coefs.append(ridge.coef_)
    ridge_dalc_mse.append(mean_squared_error(y_test,ridge.predict(X_test)))

ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train,y_train)

plt.plot(alphas,ridge_dalc_coefs)
plt.xscale("log")
plt.axvline(ridge_cv.alpha_,linestyle="dashed",color='g',alpha=0.8)
plt.xlim(0.01,10000)

In [None]:
plt.plot(alphas,ridge_dalc_mse)
plt.xscale("log")
plt.axvline(ridge_cv.alpha_,alpha=0.8,linestyle="dashed")

In [None]:
ridge_coefficients = pd.Series(ridge_cv.coef_,index=['Medu','Fedu','traveltime','studytime','failures','famrel','freetime','goout','Walc','health','absences','G1','G2','G3'])
ridge_coefficients

Surprisingly, the fifth-largest predictor of Dalc is the final grade. While the other coefficients shrink steadily as the penalty grows, G3 stays in contention. Given this, and the other predictors such as Walc, freetime, famrel, and studytime, there is enough to pursue the hypothesis that the relationship between alcohol and final grade does not run in just one direction, but can be used predictively too.
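
A ranking like this is easiest to read off by sorting the coefficients by absolute value, since the predictors were standardized. The sketch below uses made-up coefficient values purely for illustration; they are hypothetical and are not the results of this notebook.

```python
import pandas as pd

# Hypothetical standardized coefficients (illustrative only, not the notebook's output).
coefs = pd.Series(
    {"Walc": 0.40, "freetime": -0.25, "famrel": 0.20,
     "studytime": -0.15, "G3": 0.10, "health": 0.02}
)

# Rank by absolute size so that large negative effects are not pushed to the bottom.
ranked = coefs.abs().sort_values(ascending=False)
rank_of_g3 = list(ranked.index).index("G3") + 1
print(ranked)
print("Rank of G3:", rank_of_g3)
```

The same two lines can be applied to any coefficient Series built from a fitted model's `coef_` attribute.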