Useful libraries

In [None]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, explained_variance_score, max_error, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

matplotlib.rcParams['figure.figsize'] = (10,10)
sns.set_style('whitegrid')

Load the dataset

In [None]:
df = pd.read_csv('../input/student-grade-prediction/student-mat.csv')

In [None]:
df.columns

# About the data
1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2. sex - student's sex (binary: 'F' - female or 'M' - male)
3. age - student's age (numeric: from 15 to 22)
4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9. Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10. Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
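13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)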
15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16. schoolsup - extra educational support (binary: yes or no)
17. famsup - family educational support (binary: yes or no)
18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. activities - extra-curricular activities (binary: yes or no)
20. nursery - attended nursery school (binary: yes or no)
21. higher - wants to take higher education (binary: yes or no)
22. internet - Internet access at home (binary: yes or no)
23. romantic - with a romantic relationship (binary: yes or no)
24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health - current health status (numeric: from 1 - very bad to 5 - very good)
30. absences - number of school absences (numeric: from 0 to 93)

In addition to these features, there are the columns G1, G2 and G3:
1. G1 - first period grade (numeric: from 0 to 20)
2. G2 - second period grade (numeric: from 0 to 20)
3. G3 - final grade (numeric: from 0 to 20, output target)

Our target is therefore the G3 column, the final grade out of 20.

In [None]:
df.shape

Weakness of this dataset: it contains few rows.

In [None]:
df.columns

In [None]:
df.info()

I will convert the object (string) columns to integers so that all the data can be used.

In [None]:
num_features = [name for name in df.columns if df[name].dtype in ['int64', 'float64']]
cat_features = [name for name in df.columns if df[name].dtype == 'object']

In [None]:
cat_features

In [None]:
for x in cat_features:
  print(x," = ",df[x].unique())

We can see that each string can be mapped to an integer. For example, in "internet", 'no' becomes 0 and 'yes' becomes 1.
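
As a minimal sketch of that mapping, the toy column below (made up for illustration, not taken from the dataset) shows how `LabelEncoder` assigns codes: labels are sorted, and each label's position becomes its integer. One caveat worth keeping in mind is that for nominal columns with more than two categories (such as `Mjob`), the resulting order is alphabetical and carries no real meaning.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for a binary categorical feature such as 'internet'.
toy = pd.DataFrame({"internet": ["no", "yes", "yes", "no"]})

encoder = LabelEncoder()
toy["internet_code"] = encoder.fit_transform(toy["internet"])

# classes_ holds the original labels in sorted order; the index is the code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy)
```

Inspecting `encoder.classes_` right after fitting is a cheap way to check which integer each category received before the encoder is refitted on the next column.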

In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
for i in list(cat_features):
    df[i]=le.fit_transform(df[i])

In [None]:
for x in cat_features:
  print(x," = ",df[x].unique())

In [None]:
df.head()

**Some statistics:**

Legend:
* Orange: Female
* Blue: Male

In [None]:
sns.kdeplot(df.groupby('sex').get_group(1)['age'], shade = True,label = 1)
sns.kdeplot(df.groupby('sex').get_group(0)['age'], shade = True, label = 0)
plt.xlabel('data range')
plt.ylabel('% data distribution')
plt.show()

In [None]:
sns.kdeplot(df.groupby('sex').get_group(1)['studytime'], shade = True,label = 1)
sns.kdeplot(df.groupby('sex').get_group(0)['studytime'], shade = True, label = 0)
plt.xlabel('data range')
plt.ylabel('% data distribution')
plt.show()

In [None]:
sns.kdeplot(df.groupby('sex').get_group(1)['G1'], shade = True,label = 1)
sns.kdeplot(df.groupby('sex').get_group(0)['G1'], shade = True, label = 0)
plt.xlabel('data range')
plt.ylabel('% data distribution')
plt.show()

In [None]:
plt.figure(figsize = (20,50))
for i,item in enumerate(['school', 'sex', 'famsize', 'Pstatus', 'Mjob', 'Fjob',
       'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities',
       'nursery', 'higher', 'internet', 'romantic']):
    plt.subplot(9,2,i+1)
    sns.countplot(x=df[item])  # pass the column explicitly as x; recent seaborn versions no longer accept it positionally
    plt.title(item)

plt.show()   

In [None]:
plt.figure(figsize= (15,10))
plt.subplot(1,2,1)
order_by = df.groupby('Fjob')['G1'].median().sort_values(ascending = False).index
sns.boxplot(x = df['Fjob'], y = df['G1'],order = order_by)
plt.xticks(rotation = 90)
plt.title('Fjob v/s G1')

plt.subplot(1,2,2)
order_by = df.groupby('Mjob')['G1'].median().sort_values(ascending = False).index
sns.boxplot(x = df['Mjob'], y = df['G1'],order = order_by)
plt.xticks(rotation = 90)
plt.title('Mjob v/s G1')

plt.show()

In [None]:
plt.figure(figsize= (15,5))
plt.subplot(1,2,1)
order_by = df.groupby('Fedu')['G1'].median().sort_values(ascending = False).index
sns.boxplot(x = df['Fedu'], y = df['G1'],order = order_by)
plt.xticks(rotation = 90)
plt.title('Fedu v/s G1')

plt.subplot(1,2,2)
order_by = df.groupby('Medu')['G1'].median().sort_values(ascending = False).index
sns.boxplot(x = df['Medu'], y = df['G1'],order = order_by)
plt.xticks(rotation = 90)
plt.title('Medu v/s G1')

plt.show()

In [None]:
plt.figure(figsize = (15,15))
for i, item in enumerate(['schoolsup', 'famsup', 'paid', 'activities',
       'nursery', 'higher', 'internet', 'romantic']):
    plt.subplot(4,2,i+1)
    order_by = df.groupby(item)['G1'].median().sort_values(ascending = False).index
    sns.boxplot(x = df[item], y = df['G1'],order = order_by)
    plt.xticks(rotation = 90)
    plt.title(item+' v/s G1')

We can now look at the correlations.

In [None]:
corr = df.corr()
plt.figure(figsize=(25,25))
sns.heatmap(corr, annot=True)

In [None]:
matrix_corr = df.corr()
matrix_corr.G3.sort_values()

In [None]:
sns.clustermap(abs(corr), cmap="coolwarm")

# Machine learning for the G3 grade

In [None]:
X = df.drop('G3',axis=1)
y = df['G3']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Multiple linear regression

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

In [None]:
lr.fit(X_train,y_train)
y_lr = lr.predict(X_test)
print(lr.intercept_)

In [None]:
predictions = lr.predict(X_test)  
plt.scatter(y_test,predictions)

In [None]:
sns.distplot((y_test-predictions)); 

In [None]:
from sklearn import metrics
from sklearn.metrics import mean_squared_error, r2_score
print('Mean absolute error:', metrics.mean_absolute_error(y_test, predictions))
print('Mean squared error:', metrics.mean_squared_error(y_test, predictions))
scoreR2 = r2_score(y_test, predictions)
print('Score R2 : ',scoreR2)

In [None]:
plt.figure(figsize=(12,12))
plt.scatter(y_test, predictions)
plt.plot([y_test.min(),y_test.max()],[y_test.min(),y_test.max()], color='red', linewidth=3)
plt.xlabel("Grade")
plt.ylabel("Predicted grade")
plt.title("Actual grade vs. prediction")

This method gives fairly good results, with a mean absolute error of about 1.65 points per grade and an R2 score of 0.81.
However, there are clearly several problems with grades of 0.
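
To make the metrics above concrete, here is a small sketch on made-up grades (the arrays are illustrative, not notebook outputs). It computes the mean absolute error, the median absolute error, the mean squared error and R2, and shows how a single badly predicted grade of 0 inflates the mean-based errors while leaving the median untouched.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)

# Hypothetical true and predicted grades; the third student scored 0.
y_true = np.array([10, 12, 0, 15, 8])
y_pred = np.array([11, 12, 6, 14, 9])

abs_err = np.abs(y_true - y_pred)               # per-student absolute errors
mae = mean_absolute_error(y_true, y_pred)       # average of abs_err
medae = median_absolute_error(y_true, y_pred)   # middle value of abs_err
mse = mean_squared_error(y_true, y_pred)        # average of squared errors
r2 = r2_score(y_true, y_pred)                   # 1 - SS_res / SS_tot

print("absolute errors:", abs_err)
print("MAE:", mae, "MedAE:", medae, "MSE:", mse, "R2:", round(r2, 3))
```

Because the squared error weights large misses heavily, the zero grades are likely to dominate the MSE reported for this dataset.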

# Random forest regression

In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
from sklearn import ensemble
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score,auc, accuracy_score

rf = ensemble.RandomForestRegressor()
rf.fit(X_train, y_train)
y_rf = rf.predict(X_test)
print(rf.score(X_test,y_test))

Random forest regression gives an improvement.

In [None]:
plt.figure(figsize=(12,12))
plt.scatter(y_test, y_rf)
plt.plot([y_test.min(),y_test.max()],[y_test.min(),y_test.max()], color='red', linewidth=3)
plt.xlabel("Grade")
plt.ylabel("Predicted grade")
plt.title("Actual grade vs. prediction")

In [None]:
sns.distplot(y_test-y_rf)  # residuals of the random forest, not of the linear model

The mean squared error is halved here compared with linear regression.

# XGBoost

In [None]:
import xgboost as XGB
xgb  = XGB.XGBRegressor()
xgb.fit(X_train, y_train)
y_xgb = xgb.predict(X_test)
print(xgb.score(X_test,y_test))

plt.figure(figsize=(12,12))
plt.scatter(y_test, y_xgb)
plt.plot([y_test.min(),y_test.max()],[y_test.min(),y_test.max()], color='red', linewidth=3)
plt.xlabel("Grade")
plt.ylabel("Predicted grade")
plt.title("Actual grade vs. prediction")

XGBoost performs well, but not as well as random forests.

New goals: try to predict the G1 and G2 grades, because G1 and G2 contribute heavily to the prediction of G3.
Why not also try to predict G3 without G1 or G2?


# Predicting G1

To predict G3 I used G1 and G2; my goal now is to predict the G1 grade.

In [None]:
# Drop all three grade columns so that no grade leaks into the features
X = df.drop(['G1', 'G2', 'G3'], axis=1)
y = df['G1']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Multiple linear regression

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(X_train,y_train)
y_lr = lr.predict(X_test)
print(lr.intercept_)

In [None]:
predictions = lr.predict(X_test)  
plt.scatter(y_test,predictions)

In [None]:
sns.distplot((y_test-predictions)); 

In [None]:
print('Mean absolute error:', metrics.mean_absolute_error(y_test, predictions))
print('Mean squared error:', metrics.mean_squared_error(y_test, predictions))
scoreR2 = r2_score(y_test, predictions)
print('Score R2 : ',scoreR2)

In [None]:
plt.figure(figsize=(12,12))
plt.scatter(y_test, predictions)
plt.plot([y_test.min(),y_test.max()],[y_test.min(),y_test.max()], color='red', linewidth=3)
plt.xlabel("Grade")
plt.ylabel("Predicted grade")
plt.title("Actual grade vs. prediction")

# Random forest regression

In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
rf = ensemble.RandomForestRegressor()
rf.fit(X_train, y_train)
y_rf = rf.predict(X_test)
print(rf.score(X_test,y_test))

In [None]:
plt.figure(figsize=(12,12))
plt.scatter(y_test, y_rf)
plt.plot([y_test.min(),y_test.max()],[y_test.min(),y_test.max()], color='red', linewidth=3)
plt.xlabel("Grade")
plt.ylabel("Predicted grade")
plt.title("Actual grade vs. prediction")

The results are clearly very disappointing.
I expect the same outcome for G2, so I will change the output as follows:
* If a student scores at least 10/20, they pass and the output is 1
* Otherwise, the output is 0
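
The rule above can be sketched in one vectorized step. The grades below are made up for illustration; the comparison returns booleans, and `astype(int)` turns them into 0 and 1.

```python
import pandas as pd

# Hypothetical grades out of 20, including the boundary values 9 and 10.
grades = pd.Series([0, 9, 10, 11, 20], name="G3")

# 1 when the grade is at least 10 (pass), 0 otherwise (fail).
passed = (grades >= 10).astype(int)
print(passed.tolist())
```

Since the grades are integers, `>= 10` and `> 9` select the same students.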

# Training sets with a boolean output

We apply the transformation.

In [None]:
df.head()

In [None]:
df.loc[df.G3 <= 9, 'G3'] = 0
df.loc[df.G3 > 9, 'G3'] = 1

df.loc[df.G2 <= 9, 'G2'] = 0
df.loc[df.G2 > 9, 'G2'] = 1

df.loc[df.G1 <= 9, 'G1'] = 0
df.loc[df.G1 > 9, 'G1'] = 1

In [None]:
df.G3

We can now start predicting G1, to see whether we do better than 0.20.

In [None]:
df.head(10)

In [None]:
# Drop all three grade columns so that no grade leaks into the features
X = df.drop(['G1', 'G2', 'G3'], axis=1)
y = df['G1']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score,auc, accuracy_score
lr = LogisticRegression()
lr.fit(X_train,y_train)
y_lr = lr.predict(X_test)

In [None]:
print(confusion_matrix(y_test,y_lr))

In [None]:
print(accuracy_score(y_test,y_lr))

In [None]:
print(classification_report(y_test, y_lr))
probas = lr.predict_proba(X_test)

The accuracy is not very good, but it is better than the prediction of the grade itself.
The problem comes from grades below 10, in particular 0/20: a student can receive 0/20 despite being in a favorable situation (education, family, etc.), and the algorithm cannot detect that.
One solution would be to remove all the 0/20 grades.
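
A minimal sketch of that idea, on a made-up frame rather than the real data: rows whose final grade is exactly 0 are filtered out with a boolean mask. In this notebook the filter would have to run before the pass/fail transformation, because after it a 0 in `G3` means "fail" rather than "grade of 0".

```python
import pandas as pd

# Hypothetical rows; two students have a final grade of 0/20.
toy = pd.DataFrame({
    "studytime": [2, 1, 3, 2],
    "G3": [12, 0, 15, 0],
})

# Keep only the rows whose final grade is not 0, then renumber the index.
without_zeros = toy[toy["G3"] != 0].reset_index(drop=True)
print(without_zeros)
```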

# Random forests

In [None]:
from sklearn import ensemble
rf = ensemble.RandomForestClassifier()
rf.fit(X_train, y_train)
y_rf = rf.predict(X_test)

In [None]:
print(classification_report(y_test, y_rf))

In [None]:
cm = confusion_matrix(y_test, y_rf)
print(cm)

In [None]:
rf1 = ensemble.RandomForestClassifier(n_estimators=10, min_samples_leaf=10, max_features=3)
rf1.fit(X_train, y_train)
y_rf1 = rf1.predict(X_test)  # predict with the tuned model rf1, not the earlier rf
print(classification_report(y_test, y_rf1))

The random forest method performs better than logistic regression.
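
With so few rows, a comparison based on a single train/test split and an unseeded forest can shift from run to run. One possible way to make it more stable, sketched here on synthetic data from `make_classification` (an assumption for illustration, not the student dataset), is to use 5-fold cross-validation with fixed seeds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# One accuracy score per fold for each model.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and the spread across folds makes it easier to tell whether a gap between two models is larger than the noise.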

In [None]:
xgb  = XGB.XGBClassifier()
xgb.fit(X_train, y_train)
y_xgb = xgb.predict(X_test)
cm = confusion_matrix(y_test, y_xgb)
print(cm)
print(classification_report(y_test, y_xgb))

XGBoost does not seem to work well on this dataset.

# Predicting G3 as a boolean, without G1 or G2

In [None]:
# Drop all three grade columns so that no grade leaks into the features
X = df.drop(['G1', 'G2', 'G3'], axis=1)
y = df['G3']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
from sklearn import ensemble
rf = ensemble.RandomForestClassifier()
rf.fit(X_train, y_train)
y_rf = rf.predict(X_test)

In [None]:
print(classification_report(y_test, y_rf))

We reach an accuracy of 0.76, which is a moderate score but clearly better than the other prediction scores.
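
One way to put an accuracy figure in context is to compare it with a model that always predicts the most frequent class. The sketch below uses made-up labels (not the notebook's split) and scikit-learn's `DummyClassifier`; on the real data, the same comparison would show how much of the 0.76 is explained by class imbalance alone.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical pass/fail labels; most students pass.
y_train_demo = np.array([1, 1, 1, 0, 1, 0, 1, 1])
y_test_demo = np.array([1, 0, 1, 1])

# The dummy model ignores the features, so placeholder columns are enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print("majority-class baseline accuracy:", baseline_acc)
```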