# Student Grades and Alcohol Consumption

This dataset contains information about students taking a Portuguese language course.
We are going to tidy the data, carry out a social analysis, and then build a model to predict the final grade.

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# read the dataset (read_csv is equivalent to read_table with sep=',')
student = pd.read_csv("../input/student-alcohol-consumption/student-por.csv")

student

Before proceeding with the analysis, it is useful to know what the columns and their values refer to.

1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2. sex - student's sex (binary: 'F' - female or 'M' - male)
3. age - student's age (numeric: from 15 to 22)
4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9. Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10. Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
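13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)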
15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16. schoolsup - extra educational support (binary: yes or no)
17. famsup - family educational support (binary: yes or no)
18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. activities - extra-curricular activities (binary: yes or no)
20. nursery - attended nursery school (binary: yes or no)
21. higher - wants to take higher education (binary: yes or no)
22. internet - Internet access at home (binary: yes or no)
23. romantic - with a romantic relationship (binary: yes or no)
24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health - current health status (numeric: from 1 - very bad to 5 - very good)
30. absences - number of school absences (numeric: from 0 to 93)

These grades relate to the course subject, Math or Portuguese:

31. G1 - first period grade (numeric: from 0 to 20)
32. G2 - second period grade (numeric: from 0 to 20)
33. G3 - final grade (numeric: from 0 to 20, output target)

In [None]:
student.info()

In [None]:
student.describe()

What we can see at first glance is that:
- there are no NaN values
- the features are on different scales
- many of them are categorical
- the mean grade is roughly constant across the three periods (G1, G2, G3), at around 11
- on average, mothers and fathers have an education between the 5th and 9th grade
- the mean of studytime is approximately 2; note that this is a level on the column's coded scale rather than a number of hours
- students report good health on average and do not report excessive alcohol consumption during the week (both are self-assessed ratings)
- G1 and G2 are very similar to G3
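
The first three observations can also be checked in code instead of being read off the printed output. This is a minimal sketch on a tiny hand-made frame called toy (the values are invented for illustration, only the column names come from the dataset); in the notebook the same calls would be applied to student.

```python
import pandas as pd

# Toy stand-in for `student`; the values are invented for illustration only.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS", "MS"],
    "sex": ["F", "M", "F", "M"],
    "age": [15, 16, 17, 18],
    "absences": [0, 4, 2, 10],
    "G3": [11, 12, 9, 14],
})

# Total number of missing cells across the whole frame.
n_missing = int(toy.isna().sum().sum())

# Non-numeric columns are the categorical ones that need encoding later.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Min and max per numeric column make the difference in scales explicit.
ranges = toy[numeric_cols].agg(["min", "max"])

print("missing cells:", n_missing)
print("categorical:", categorical_cols)
print(ranges)
```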

Now let's look at the correlations between the numeric features.

In [None]:
plt.figure(figsize=(15,15))
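# numeric_only=True skips the text columns, which recent pandas versions refuse to correlate
sns.heatmap(student.corr(numeric_only=True), annot=True)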

This graph gives us a general picture of how everything affects grades.

Before going for a more in-depth analysis, however, we note other things:
- G1, G2 and G3 are closely correlated with each other. This means that G1 and G2 alone would be almost enough to predict G3, so I could consider dropping G1 and G2 from the predictors.
- Past failures are also clearly (negatively) related to grades: students with good grades are less likely to have failed a class.
- Study time and weekend alcohol consumption are also negatively related. A student who drinks a lot (and so probably goes out more often) tends to study fewer hours, and vice versa.
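
The heatmap is dense, so it can help to pull out just the G3 column and sort it. The sketch below does this on an invented toy frame (column names match the dataset, values do not); the variable names are my own.

```python
import pandas as pd

# Toy stand-in for `student`; the values are invented for illustration only.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M"],
    "failures": [0, 0, 1, 2, 0, 3],
    "studytime": [3, 2, 2, 1, 4, 1],
    "G1": [14, 12, 9, 7, 16, 5],
    "G3": [15, 12, 10, 8, 17, 6],
})

corr_with_g3 = (
    toy.corr(numeric_only=True)["G3"]       # text columns such as 'sex' are skipped
    .drop("G3")                             # a column always correlates 1.0 with itself
    .sort_values(key=abs, ascending=False)  # strongest relationships first, either sign
)
print(corr_with_g3.round(2))
```

Sorting by absolute value keeps strong negative relationships (such as failures) near the top instead of at the bottom of the list.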

In [None]:
sns.countplot(x='age', data=student, hue='sex')  # recent seaborn versions require x= as a keyword
plt.ylabel('Number of students')
plt.title('Age of students by sex')

For my analysis, since the grades are closely related to each other, I will only consider G3.

In [None]:
sns.boxplot(x='G3', y='sex', data=student)
plt.title('Final grade (G3) by sex')

Boys have a lower median grade than girls. Now we will see whether the feature 'age' is relevant to G3.

In [None]:
plt.figure(figsize=(6,6))
sns.boxplot(x="age",y="G3",data=student)
plt.title('G3 by age')

The final grade decreases slightly with increasing age. Many factors could explain this, for example older students being more mature and having more freedom in how they spend their time.
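
A box plot shows medians rather than means, so a small table per age group is a useful cross-check. This is a sketch on invented values (toy is a stand-in for student); the count column matters because age groups with very few students give unreliable averages.

```python
import pandas as pd

# Toy stand-in for `student`; the values are invented for illustration only.
toy = pd.DataFrame({
    "age": [15, 15, 16, 16, 17, 17, 18],
    "G3":  [13, 12, 12, 11, 11, 10, 9],
})

# One row per age with the mean, the median and the number of students behind them.
by_age = toy.groupby("age")["G3"].agg(["mean", "median", "count"])
print(by_age)
```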

Let's do a check.

First we check how many students have failed, and how many times.

In [None]:
plt.figure(figsize=(6,6))
sns.countplot(x='age', data=student, hue='failures')
plt.ylabel('n students')
plt.title('failures per age')

As we can see, many students have failed and repeated a year. These students tend to finish school later and may also lose focus on their studies, which would explain their lower average.

In [None]:
plt.figure(figsize=(6,6))
sns.barplot(x="age",y="Walc",data=student)
plt.ylabel('Weekend alcohol consumption')
plt.title('Alcohol consumption based on age')

Average student alcohol consumption does not appear to be high, except for older students.

With the next two graphs we will see whether outings also increase with age.

In [None]:
# lmplot is figure-level and creates its own figure, so size it with height= instead of plt.figure
sns.lmplot(x="age", y="goout", data=student, height=6)

plt.title('Go out based on age')

In [None]:
plt.figure(figsize=(6,6))

sns.barplot(x="age",y="goout",data=student)
plt.title('Go out based on age')

As we can see, there is a slightly negative relationship between age and going out. Students between the ages of 16 and 19 go out the most, and, as the failures-per-age graph above shows, most failures also fall in that age group.

Now let's see if the weekly study hours affect the final grade.

In [None]:
sns.boxplot(x='studytime', y='G3', data=student)
plt.xlabel("Weekly study hours")
plt.title('Influence of weekly study hours')

The hours of study per week influence the average grade of students. The greater the dedication, the greater the final grade.

But how is the average affected if we also consider extracurricular activities and a romantic relationship?

In [None]:
sns.violinplot(x='activities', y='G3', data=student, hue='romantic', color='r')
plt.title('Influence of extracurricular activities and \nromantic relationship on final grades.')

Most students have a final grade of around 11, although some have slightly higher grades. Extracurricular activities and romantic relationships do not appear to affect G3.

In [None]:
sns.boxplot(x='Pstatus', y='G3', data=student, hue='famsize')
plt.xlabel('Parents apart (A) or together (T)')
plt.title('Final grades based on family')

Family status does not appear to influence school performance.

In [None]:
student['walcool'] = student['Dalc'] + student['Walc']  # temporary column: workday + weekend alcohol consumption
sns.boxplot(x='walcool', y='G3', data=student)
plt.title('Alcohol consumption influences')
plt.xlabel('Alcohol consumption')

In [None]:
student = student.drop('walcool', axis=1)  # delete the temporary column

Alcohol consumption is not strongly related to students' grades, at least not as much as you might expect. Still, a student who drinks more tends to have a somewhat lower G3.
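
The same comparison can be summarised as numbers without touching the original frame. The sketch below uses DataFrame.assign, which returns a new frame, so there is no temporary column to drop afterwards. The values are invented and total_alc is a hypothetical column name chosen for this example.

```python
import pandas as pd

# Toy stand-in for `student`; the values are invented for illustration only.
toy = pd.DataFrame({
    "Dalc": [1, 1, 2, 3, 5],
    "Walc": [1, 2, 3, 4, 5],
    "G3":   [14, 12, 11, 10, 8],
})

# assign() works on a copy, so `toy` itself never gains the helper column.
summary = (
    toy.assign(total_alc=toy["Dalc"] + toy["Walc"])  # hypothetical helper column
    .groupby("total_alc")["G3"]
    .agg(["median", "count"])
)
print(summary)
```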

In [None]:
sns.boxplot(x='G3', y='nursery', data=student)
plt.title('G3 based on nursery')

This last graph compares G3 for students who attended nursery school and those who did not. There is no big difference between the two groups, except in the median grade, which is higher for students who attended nursery school.

### Cleaning data

To get the best prediction we first have to clean the data, that is, transform the categorical features into numbers.

In [None]:
# transform binary values into 0 and 1
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

binary_cols = ['nursery', 'internet', 'schoolsup', 'activities', 'paid', 'higher', 'school',
               'address', 'sex', 'Pstatus', 'famsize', 'famsup', 'romantic']
for col in binary_cols:
    # classes are sorted alphabetically, e.g. 'no' -> 0 and 'yes' -> 1
    student[col] = le.fit_transform(student[col])


# the remaining non-binary categorical columns get a one-hot encoding
student = pd.get_dummies(student, columns = ['Mjob', 'Fjob', 'reason', 'guardian'])
student

student.head()
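
To make the result of the encoding easier to read, here is a small sketch on invented rows showing which label ends up as 0 or 1 and which columns one-hot encoding creates. The mappings dictionary is my own addition for inspection and is not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for `student`; the values are invented for illustration only.
toy = pd.DataFrame({
    "famsize": ["LE3", "GT3", "GT3"],
    "internet": ["yes", "no", "yes"],
    "guardian": ["mother", "father", "other"],
})

mappings = {}
for col in ["famsize", "internet"]:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    # classes_ is sorted alphabetically; the position of a label is its code.
    mappings[col] = {str(label): code for code, label in enumerate(le.classes_)}

# One new indicator column per category of 'guardian'.
toy = pd.get_dummies(toy, columns=["guardian"])

print(mappings)
print(toy.columns.tolist())
```

Keeping the mapping around is handy later, for example to remember that famsize 1 means 'LE3' when reading feature importances.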

### Grades prediction

We are going to predict a student's final grade with several machine learning models and evaluate which one performs best.
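
When comparing models it helps to know what a trivial predictor scores. The sketch below is not part of the original analysis: it fits scikit-learn's DummyRegressor, which always predicts the training mean, on randomly generated stand-in data. Any real model should beat its MAE, and its R2 on the test set is at most 0.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Random stand-in data; in the notebook this would be X and y.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = rng.integers(0, 21, size=200)  # grades between 0 and 20

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30, random_state=10)

# Always predicts the mean of the training target, ignoring the features.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
pred = baseline.predict(X_te)

base_mae = mean_absolute_error(y_te, pred)
base_r2 = r2_score(y_te, pred)
print("baseline MAE: %.2f" % base_mae)
print("baseline R2: %.2f" % base_r2)
```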

In [None]:
#import models and metrics
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, GridSearchCV, StratifiedKFold
from sklearn.metrics import r2_score, mean_absolute_error

In [None]:
# separate features and target
X = student.drop(['school', 'failures', 'absences', 'G1', 'G2', 'G3'], axis=1)
y = student.G3

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=10)
skf = StratifiedKFold(n_splits=10, random_state=90, shuffle=True)  # 10-fold splitter; note the grid searches below use cv=5 rather than skf

To predict the final grade I want to evaluate three ML models. I start with models that don't need scaled features, DecisionTreeRegressor and RandomForestRegressor, and then I take SVR into consideration.

To make my analysis more realistic I dropped G1 and G2, because they are too close to G3, and school, failures and absences, because I want to predict G3 based on social status. So my prediction will be based on the remaining features.

#### DecisionTree

In [None]:
#defining parameters for decision tree regressor
params= { 'max_features':[0.5], #[0.1,0.2,0.3,0.4,0.5],
    'min_samples_split':[0.1], #5,6,7,8, 0.1,0.2
    'min_samples_leaf':[0.1], #2,3,4,5,6,7,8,0.1,0.2,0.3
        }

reg_tree = DecisionTreeRegressor()
gs = GridSearchCV(estimator=reg_tree, param_grid=params, cv=5, n_jobs=-1) #cross-validate the model over the parameter grid
gs.fit(X_train, y_train) #fitting training set

reg_tree = gs.best_estimator_
print(reg_tree) #printing best estimator values

pred_tree = reg_tree.predict (X_test)

#printing scores
dt_score = r2_score(y_test, pred_tree)
dt_mae = mean_absolute_error(y_test, pred_tree)
print('MAE: %.2f' %dt_mae)
print('Score: %.2f' %dt_score)

importances = reg_tree.feature_importances_
indices = np.argsort(importances)[::-1]  # feature indices, most important first
# summarize feature importance
for rank, idx in enumerate(indices, start=1):
    print("%d. Feature %s(%.3f)" % (rank, X.columns.values[idx], importances[idx]))

#### RandomForest Regressor

In [None]:
params = {'n_estimators':[1000],
        'min_samples_leaf': [4], #2,3,4,5,6,7
        'min_samples_split': [4], #2,3,4,5,6
          'max_features': [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]

             }


reg_forest = RandomForestRegressor()
gs = GridSearchCV(reg_forest,params, cv=5, n_jobs=-1) #validation for Random Forest
gs.fit (X_train, y_train)
reg_forest=gs.best_estimator_
print(reg_forest)

pred_forest = reg_forest.predict (X_test)
rf_score = r2_score(y_test, pred_forest) 
rf_mae = mean_absolute_error(y_test, pred_forest)
print('MAE: %.2f' %rf_mae)
print('Score: %.2f' %rf_score)

# feature importances
importances = reg_forest.feature_importances_
indices = np.argsort(importances)[::-1]  # feature indices, most important first
# summarize feature importance
for rank, idx in enumerate(indices, start=1):
    print("%d. Feature %s(%.3f)" % (rank, X.columns.values[idx], importances[idx]))

#### Scaling 

In [None]:
#scaling data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) #fitting and transform training data
X_test = scaler.transform(X_test) #transform test data

I scaled the data because I now want to try SVR, which is sensitive to the scale of the features.
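
A common alternative, which is not what this notebook does, is to put the scaler and the SVR in a scikit-learn Pipeline. The scaler is then refit on the training part of every cross-validation fold, so the validation part never influences the scaling. A minimal sketch on randomly generated stand-in data (the parameter grid is only an example):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Random stand-in data with three features on very different scales.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(120, 3))
y_toy = 10 + 2 * (X_toy[:, 0] - 5.0) + rng.normal(scale=0.5, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30, random_state=10)

pipe = Pipeline([
    ("scaler", StandardScaler()),  # fitted only on the training part of each fold
    ("svr", SVR(kernel="rbf")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
grid = {"svr__C": [0.1, 1.0, 10.0], "svr__epsilon": [0.1, 0.5]}
search = GridSearchCV(pipe, grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("test R2: %.2f" % search.score(X_te, y_te))
```

With this layout X_tr and X_te stay unscaled, so the same split can still be passed to the tree-based models.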

#### SVR

In [None]:
params= {
    'kernel':['rbf'], #linear
    'C':[0.9], #[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9],
    'epsilon':[1.2], #[0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,2.0,3.0,1.1,1.2,1.3,1.4],
    'gamma':[0.01], #[0.1,0.2,0.3,0.4,0.5]
        }

reg_svr = SVR()
gs = GridSearchCV(reg_svr,params, cv=5, n_jobs=-1) #validation for SVR
gs.fit (X_train, y_train)
reg_svr=gs.best_estimator_
print(reg_svr)

pred_svr = reg_svr.predict (X_test)
svr_score = r2_score(y_test, pred_svr)
svr_mae = mean_absolute_error(y_test, pred_svr)
print('MAE: %.2f' %svr_mae)
print('Score: %.2f' %svr_score)


#feature importance is only available for the linear kernel
#importance = reg_svr.coef_
#indices= np.argsort(importance)[::-1]
# summarize feature importance
#for i,v in enumerate(importance):
#    print("%d. Feature %s(%.2f)" % (i+1, X.columns.values[indices[i]], importance[indices[i]]))

##### Best scoring

In [None]:
from tabulate import tabulate
data=[[svr_score, svr_mae],
      [rf_score, rf_mae],
      [dt_score, dt_mae ]]
index = ['SVR','Random Forest Regressor', 'Decision Tree Regressor']
tab = pd.DataFrame(data, index=index, columns=['R2 score', 'MAE']).sort_values('R2 score',ascending = False).round(2)

   

print(tabulate(tab, headers= ['Model', 'R2 score', 'MAE'],tablefmt='fancy_grid'))
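
For readers less familiar with the two metrics in the table, this is a small worked example on four invented grades. It computes MAE and R2 by hand and compares them with the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Invented true and predicted grades.
y_true = np.array([10.0, 12.0, 14.0, 8.0])
y_pred = np.array([11.0, 12.0, 13.0, 10.0])

# MAE: average absolute error, (1 + 0 + 1 + 2) / 4
mae = np.mean(np.abs(y_true - y_pred))

# R2: 1 - (squared error of the model) / (squared error of predicting the mean)
ss_res = np.sum((y_true - y_pred) ** 2)         # 1 + 0 + 1 + 4 = 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # mean is 11: 1 + 1 + 9 + 9 = 20
r2 = 1 - ss_res / ss_tot

print("MAE: %.2f" % mae)
print("R2: %.2f" % r2)
```

MAE is in grade points, so it is the easier number to interpret; R2 says how much better the model is than always predicting the mean grade.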


#### Important features

In our models, the most important and influential features are higher and age.
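
The importances printed above come from the tree-based models only, and SVR with an rbf kernel exposes none. One model-agnostic way to cross-check such a ranking is permutation importance, available in scikit-learn as sklearn.inspection.permutation_importance. The sketch below is an assumption about how one might do it, shown on random stand-in data where only the first column matters; the feature names are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Random stand-in data: only the first column drives the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)
names = ["informative", "noise_a", "noise_b"]  # hypothetical feature names

# Fit on the first 200 rows, evaluate importance on the held-out 100.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_toy[:200], y_toy[:200])

# Shuffle one column at a time and measure how much the score drops.
result = permutation_importance(
    model, X_toy[200:], y_toy[200:], n_repeats=5, random_state=0
)

ranking = [names[i] for i in np.argsort(result.importances_mean)[::-1]]
print(ranking)
```

Because it only needs predictions, the same call would also work with the fitted SVR and the scaled test set.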