Introduction
============

In this notebook we'll look for associations between school grades in the math and Portuguese language courses and the students' habits. We have a small sample of 648 students from 2 different secondary schools, with a mean age of ~17 years. All the individuals are alcohol consumers (they drink alcohol at least once a week).


----------
Table of contents
=================

 - [Set up][1]
 - [Data cleaning][2]
 - [Exploratory Data Analysis][3]
  - [Grades Distribution][4]
  - [Correlations][5]
 - [Regression Modeling][6]
 - [Conclusions][7]


----------


Set up
======


  [1]: https://www.kaggle.io/svf/430451/eb99792180875fdccf8096248730fc1c/__results__.html#Set-up
  [2]: https://www.kaggle.io/svf/430451/eb99792180875fdccf8096248730fc1c/__results__.html#Data-cleaning
  [3]: https://www.kaggle.io/svf/430451/eb99792180875fdccf8096248730fc1c/__results__.html#Exploratory-Data-Analysis
  [4]: https://www.kaggle.io/svf/430451/eb99792180875fdccf8096248730fc1c/__results__.html#Grades-Distribution
  [5]: https://www.kaggle.io/svf/430451/eb99792180875fdccf8096248730fc1c/__results__.html#Correlations
  [6]: https://www.kaggle.io/svf/430451/eb99792180875fdccf8096248730fc1c/__results__.html#Regression-Modeling
  [7]: https://www.kaggle.io/svf/430451/eb99792180875fdccf8096248730fc1c/__results__.html#Conclusions

In [None]:
#Packages
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
import scipy.stats  # explicit submodule import, needed for scipy.stats.normaltest below
import matplotlib.pyplot as plt

#Data import & Fusion
data1 = pd.read_csv("../input/student-mat.csv",sep=",")
data2 = pd.read_csv("../input/student-por.csv",sep=",")
data = [data1,data2]
data=pd.concat(data)

----------


Data cleaning
=============

We've combined the datasets from the math and Portuguese language courses, but some students (382) appear in both courses, so we have to remove the duplicated cases:

In [None]:
#Cleaning duplicates
data=data.drop_duplicates(["school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet"])
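
As a minimal sketch of what the cell above does, here is the same pattern on a hypothetical toy frame (the rows and the shortened column list are made up for illustration). It counts the repeated students before dropping them and shows which copy survives.

```python
import pandas as pd

# Hypothetical toy records: two course tables that share one student.
mat = pd.DataFrame({"school": ["GP", "GP"], "sex": ["F", "M"],
                    "age": [17, 18], "G3": [10, 12]})
por = pd.DataFrame({"school": ["GP", "MS"], "sex": ["F", "M"],
                    "age": [17, 16], "G3": [14, 11]})

combined = pd.concat([mat, por], ignore_index=True)

# Students are identified by demographic columns only, not by their grades.
id_cols = ["school", "sex", "age"]

# duplicated() flags rows that repeat an earlier row on the identifying columns.
n_repeated = int(combined.duplicated(subset=id_cols).sum())

# keep="first" (the default) retains the first occurrence of each student.
deduped = combined.drop_duplicates(subset=id_cols, keep="first")

print(n_repeated, len(deduped))
print(deduped)
```

Because the math table comes first in the concatenation, the surviving row for a student present in both courses carries the math grades. The same default applies to the cell above.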

----------


Exploratory Data Analysis
=========================

First we need to see how our target variable is distributed over the sample, and test whether the data appears to follow any known pattern.

Grades Distribution
-------------------

In [None]:
# Test whether the distribution of each grade is normal
grades=["G1","G2","G3"]
norm=[]
for i in grades:
    norm.append(scipy.stats.normaltest(data[i]))
for i in range(0,len(grades)):
    print(grades[i])
    print(norm[i])
    print('----------')

Of the three, the first period grade (G1) appears to be the closest to a normal distribution. We can also see this by plotting the 3 distributions:

In [None]:
# One histogram per grade, stacked vertically
for i, w in enumerate(grades, start=1):
    plt.subplot(3, 1, i)
    plt.hist(data[w])
    plt.title(w)
plt.tight_layout()

In the 2nd and 3rd periods, low grades in the 0-4 range become more frequent and stand out as outliers.
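
To help read the `normaltest` output printed earlier in this section: the test combines skewness and kurtosis, its null hypothesis is that the sample comes from a normal distribution, and a larger statistic (with a smaller p-value) means stronger evidence against normality. The sketch below uses synthetic samples, not the student data, to show how a cluster of zeros moves the result. The sample sizes and parameters are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic stand-ins for grades: a bell-shaped sample, and the same
# sample with a cluster of zeros appended.
bell = rng.normal(loc=11, scale=3, size=500)
with_zeros = np.concatenate([bell, np.zeros(40)])

results = {}
for label, sample in [("bell", bell), ("with_zeros", with_zeros)]:
    statistic, pvalue = stats.normaltest(sample)
    results[label] = (statistic, pvalue)
    print(f"{label}: statistic={statistic:.2f}, p-value={pvalue:.3g}")
```

The sample with the extra zeros should produce the larger statistic, which is the same effect the low-grade outliers have on G2 and G3.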

Correlations
------------

Let's explore whether any pairs of variables show significant correlations.

In [None]:
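# Correlate numeric columns only; text columns such as 'school' cannot be correlated
corr = data.select_dtypes(include='number').corr()
for a in corr.columns:
    for b in corr.index:
        if (a != b) and (abs(corr.loc[b, a]) >= 0.75):
            print(a, b, '-->', corr.loc[b, a])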

This is a bad sign. Only the G1, G2 and G3 variables (the grades themselves) show strong correlations (above 0.75). This rules out factor-analysis approaches such as PCA. Instead, we can explore regression models and see whether they suit this problem.
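
The pairwise scan above prints every strong pair twice (once as `a b` and once as `b a`). A minimal sketch of one alternative is to mask everything except the upper triangle of the correlation matrix, so each pair is listed once. The toy frame is hypothetical and only stands in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "G2" tracks "G1" closely, "absences" does not.
toy = pd.DataFrame({
    "G1": [5, 8, 10, 12, 15, 18],
    "G2": [6, 8, 11, 12, 14, 19],
    "absences": [4, 0, 10, 2, 6, 3],
})

corr = toy.corr()

# True only above the diagonal, so each pair of variables appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Keep pairs whose absolute correlation reaches the 0.75 threshold.
strong = pairs[pairs.abs() >= 0.75]
print(strong)
```

The result is a Series indexed by variable pairs, which is easier to sort or filter than printed lines.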


----------


Regression Modeling
===================

I'll drop the variables 'G1', 'G2' and 'G3' from the predictors, because they are the target variables themselves. As a starting point I've taken [the analysis from Dmitriy Batogov][1] (which I really recommend taking a look at!), and I've wrapped the cross-validation scoring in a function so the code stays cleaner if I need to repeat the analysis later on.


  [1]: https://www.kaggle.com/dmitriy19/d/uciml/student-alcohol-consumption/basic-eda-and-final-grade-prediction

In [None]:
def regression_explore(y,drop):
    Y = data[y]
    X = data.drop(drop, axis=1)
    X = pd.get_dummies(X)

    names = ['DecisionTreeRegressor', 'LinearRegression', 'Ridge', 'Lasso']

    clf_list = [DecisionTreeRegressor(),
            LinearRegression(),
            Ridge(),
            Lasso()]
            
    print('Models performance in: ' + str(y))
    print('------------------------')
    for name, clf in zip(names, clf_list):
        print(name, end=': ')
        print(cross_val_score(clf, X, Y, cv=5).mean())
        
a = ['G3','G2','G1']
for i in a:
    regression_explore(i, a)
    print('\n')

Bad news again. Apparently, G1, G2 and G3 do not show strong relationships with the rest of the variables we have. This could be due to the small sample size, or we may need to collect data on other variables. Either way, we'll have to change our approach to this dataset if we want to extract a model.
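
For context on the numbers printed above: when `cross_val_score` is called on a regressor without a `scoring` argument, it uses the estimator's own `score` method, which is R². A value near 1 means the predictors explain most of the variance, while a value near 0 (or below it) means the model does no better than predicting the mean. The sketch below illustrates both ends on synthetic data. The shapes and coefficients are assumptions for illustration, not taken from the student dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 5 features.
X = rng.normal(size=(200, 5))

# Target unrelated to X, and target that is almost a linear function of X.
y_noise = rng.normal(size=200)
y_signal = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# No `scoring` argument: each fold is scored with LinearRegression.score, i.e. R^2.
r2_noise = cross_val_score(LinearRegression(), X, y_noise, cv=5).mean()
r2_signal = cross_val_score(LinearRegression(), X, y_signal, cv=5).mean()

print("unrelated target:", r2_noise)
print("linear target:   ", r2_signal)
```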

In the correlation analysis we noticed that the grades (G1, G2 and G3) are strongly correlated with each other. This suggests that students who get high grades in the first period (G1) tend to get high grades in the second period (G2) too. I'll check whether the grades from the previous period are a useful predictor of the grades in the next period:

In [None]:
variables_explore = ['G2','G3']
for b in variables_explore:
    regression_explore(b,b)
    print('\n')

This supports the idea that the grades of the previous period are a good indicator of the grades in the next period. Almost all of the models perform well, but I will choose simple linear regression for this model:

In [None]:
y = data['G2']
X = data['G1']
X = pd.DataFrame(X)
y = pd.DataFrame(y)
linreg = LinearRegression()
linreg.fit(X,y)
y_pred = linreg.predict(X)
fig = plt.figure()
plt.scatter(X,y, color='black', alpha=.1)
fig.suptitle('Relationship Grades G1 - Grades G2', fontsize=12)
plt.xlabel('Grades G1', fontsize=12)
plt.ylabel('Grades G2', fontsize=12)
plt.xlim(0, 20)
plt.ylim(0, 20)
plt.plot(X,y_pred, color='red')
# The mean squared error (computed on the same rows the model was fitted on)
print("Mean squared error: %.2f"
      % np.mean((y_pred - y.values) ** 2))
# R^2 (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % linreg.score(X, y))

The model looks fairly accurate at predicting the 2nd period grades from the 1st period grades. The darker a dot, the more frequent that result is. The red line is our prediction model, and we can see that the darker dots lie closer to it.
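
The error and score printed above are computed on the same rows the model was fitted on. One way to check that the fit carries over to unseen students is to hold out part of the sample, as in the minimal sketch below. It uses synthetic grades rather than the real data, and the noise level and 25% test size are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for (G1, G2): second grade = first grade plus noise,
# clipped to the 0-20 grading scale.
g1 = rng.integers(0, 21, size=300).astype(float)
g2 = np.clip(g1 + rng.normal(scale=1.5, size=300), 0, 20)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    g1.reshape(-1, 1), g2, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)
print("held-out MSE:", test_mse)
print("held-out R^2:", test_r2)
```

With a single predictor and a few hundred rows, the held-out score is usually close to the in-sample one, so this mostly serves as a sanity check.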

Let's see if this regression works for predicting the 3rd period too:

In [None]:
y = data['G3']
X = data['G2']
X = pd.DataFrame(X)
y = pd.DataFrame(y)
linreg = LinearRegression()
linreg.fit(X,y)
y_pred = linreg.predict(X)
fig = plt.figure()
plt.scatter(X,y, color='black', alpha=.1)
fig.suptitle('Relationship Grades G2 - Grades G3', fontsize=12)
plt.xlabel('Grades G2', fontsize=12)
plt.ylabel('Grades G3', fontsize=12)
plt.xlim(0, 20)
plt.ylim(0, 20)
plt.plot(X,y_pred, color='red')
# The mean squared error (computed on the same rows the model was fitted on)
print("Mean squared error: %.2f"
      % np.mean((y_pred - y.values) ** 2))
# R^2 (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % linreg.score(X, y))

It works even better for the 3rd period grades. This is because the correlation between the 2nd and 3rd period grades (~0.90) is stronger than the correlation between the 1st and 2nd period grades (~0.84).
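
The link between correlation and fit quality can be made explicit: for a linear regression with one predictor and an intercept, the in-sample R² equals the squared Pearson correlation between predictor and target. The sketch below first squares the two approximate correlations quoted above, then checks the identity on synthetic data (the slope and noise level are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Approximate correlations quoted in the text; squaring gives the expected R^2.
for quoted_r in (0.84, 0.90):
    print(quoted_r, "->", round(quoted_r ** 2, 2))

# Check of the identity on synthetic data (not the student dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.9 * x + rng.normal(scale=0.5, size=500)

r = np.corrcoef(x, y)[0, 1]
X = x.reshape(-1, 1)
r2 = LinearRegression().fit(X, y).score(X, y)

print("r squared:", r ** 2)
print("R^2:      ", r2)
```

So correlations of roughly 0.84 and 0.90 correspond to about 71% and 81% of the variance explained, respectively.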


----------


Conclusions
============

We have not been able to find any significant correlations between the students' alcohol habits and their grades in the math and Portuguese courses. This could be because we need a bigger sample, or because the grades are influenced by other variables that are not in this dataset. There is also good news: the grades that students get across the three periods follow a similar pattern, showing continuity in their school results. Further research will be needed to get clearer results.