I am wondering which key factors affect students' grades. This kernel explores and answers the question in 3 steps:

* Explore the datasets (EDA)
* Modeling with SVM, perceptron, decision tree, BaggingClassifier, KNN, logistic regression, random forest and extra trees
* Analyze feature importance

Some of the EDA and modeling techniques are borrowed from this kernel: https://www.kaggle.com/nirajvermafcb/comparing-various-ml-models-roc-curve-comparison

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn import tree

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn import svm

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

# 1 Explore datasets and preprocess

Load the datasets as dataframes and combine them into one, adding a new attribute "course" to record where each row came from.

* data_mat["course"]="math": records from student-mat.csv
* data_por["course"]="portuguese": records from student-por.csv

In [None]:
data_mat=pd.read_csv("../input/student-mat.csv")
data_mat["course"]="math"
data_por=pd.read_csv("../input/student-por.csv")
data_por["course"]="portuguese"
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
data=pd.concat([data_mat,data_por])

In [None]:
data.info()
data.head(3)

From the output above, the dataframe has 1044 records and 34 columns, with no missing values. Each student has 3 course grades: G1, G2 and G3. We now introduce two new attributes: "Gmean", the average of G1, G2 and G3, and "Glevel", which bins that average into levels.

* data["Glevel"]='fail': average(G1,G2,G3)<12
* data["Glevel"]='pass': average(G1,G2,G3)>=12 and average(G1,G2,G3)<=16
* data["Glevel"]='good': average(G1,G2,G3)>16
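
The three rules above can also be written as a single `np.select` call. The sketch below uses made-up averages (not values from the dataset) to check the boundary cases 12 and 16; the names `gmean` and `glevel` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical averages chosen to hit the boundaries 12 and 16
gmean = pd.Series([8.0, 12.0, 14.5, 16.0, 16.3, 19.0])

# Conditions are checked in order; anything matching neither falls back to "pass"
conditions = [gmean < 12, gmean > 16]
labels = ["fail", "good"]
glevel = pd.Series(np.select(conditions, labels, default="pass"), index=gmean.index)

print(glevel.tolist())
```

An average of exactly 12 or exactly 16 lands in 'pass', which matches the two `np.where` calls used in this kernel.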

In [None]:
data["Gmean"]=data.loc[:,("G1","G2","G3")].mean(1)
data["Glevel"]=np.where(data["Gmean"]<12,'fail','pass')
data["Glevel"]=np.where(data["Gmean"]>16,'good',data['Glevel'])

del data["G1"]
del data["G2"]
del data["G3"]

We explore the value distribution of each feature to get a basic understanding of the dataframe. For the sake of precision, I prefer tables to pictures for one-dimensional EDA, though they are not as intuitive as bar charts.

The next script prints the percentage of each value in every feature.

In [None]:
for i in data.columns:
    print("=====Attr:",i,"=====")
    print(100*data[i].value_counts()/len(data[i]))
    print("\n")

It is really a surprise that <b>56% OF STUDENTS FAIL THEIR COURSES</b>. Why does this happen? Are the courses too difficult? Do students need to study harder? Or should parents spend more time with their kids?

From above, there are some clues for deeper analysis:

1. 73% of students are from school GP. Are grades at school GP bad?
2. 62% of students take the Portuguese course, while only 38% take math. Is the math course very difficult?
3. 99% of students are below 19 years old. Are they too young for these exams?
4. 11% of students drink alcohol frequently on workdays (Dalc>=3), and 40% on weekends. Do they drink too much alcohol?
5. 91% of students want to pursue higher education.

## 1) EDA with school and course

First, students in different schools and courses may have different grades. Let's use boxplots to show these differences.

In [None]:
plt.figure(figsize=(12,12))
plt.subplot(221)
sns.boxplot(x="course",y="Gmean",data=data)
plt.subplot(222)
sns.boxplot(x="school",y="Gmean",data=data)
plt.subplot(223)
sns.boxplot(x="course",y="Gmean",hue="school",data=data)
plt.subplot(224)
sns.boxplot(x="school",y="Gmean",hue="course",data=data)
plt.show()

As shown above, math grades are about the same in the two schools, but Portuguese grades clearly differ. One possible explanation is that the math exam is objective, while the Portuguese exam is subjective to some degree.

## 2) EDA with alcohol consumption

In [None]:
plt.figure(figsize=(12,12))
plt.subplot(221)
sns.violinplot(x="Dalc",y="Gmean",data=data)
plt.subplot(222)
sns.swarmplot(x="Walc",y="Gmean",data=data)
plt.subplot(223)
sns.boxplot(x="Dalc",y="Gmean",hue="course",data=data)
plt.subplot(224)
sns.boxplot(x="Walc",y="Gmean",hue="course",data=data)
plt.show()

Students who drink too much have poor grades in both the math and Portuguese courses. Next we use a cross table to analyse grade levels at different ages.

## 3) EDA with age

In [None]:
print("=====For math course=====")
agegrade=pd.crosstab(data["age"][data["course"]=="math"],data["Glevel"][data["course"]=="math"])
agegrade["sum"]=agegrade.sum(1)
agegrade["fail%"]=agegrade["fail"]/agegrade["sum"]*100
agegrade["good%"]=agegrade["good"]/agegrade["sum"]*100
agegrade["pass%"]=agegrade["pass"]/agegrade["sum"]*100
del agegrade["fail"]
del agegrade["good"]
del agegrade["pass"]
del agegrade["sum"]
print(agegrade)
print("\n")

print("=====For portuguese course=====")
agegrade=pd.crosstab(data["age"][data["course"]=="portuguese"],data["Glevel"][data["course"]=="portuguese"])
agegrade["sum"]=agegrade.sum(1)
agegrade["fail%"]=agegrade["fail"]/agegrade["sum"]*100
agegrade["good%"]=agegrade["good"]/agegrade["sum"]*100
agegrade["pass%"]=agegrade["pass"]/agegrade["sum"]*100
del agegrade["fail"]
del agegrade["good"]
del agegrade["pass"]
del agegrade["sum"]
print(agegrade)

Students aren't too young to pass exams. Students below 18 years old have similar grades, but grades are worse at ages 18, 21 and 22, with age 20 as an exception. This raises another question: why do older students have worse grades? We don't have enough information here to answer it.
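
The row percentages printed above can also be obtained in one step with the `normalize="index"` option of `pd.crosstab`. This is a minimal sketch on a made-up frame (`toy` is hypothetical, not the student data):

```python
import pandas as pd

# Hypothetical rows, only to show the shape of the result
toy = pd.DataFrame({
    "age": [15, 15, 15, 15, 16, 16, 18, 18],
    "Glevel": ["fail", "pass", "pass", "good", "fail", "pass", "fail", "fail"],
})

# normalize="index" divides each row by its own total, so every age sums to 100
pct = pd.crosstab(toy["age"], toy["Glevel"], normalize="index") * 100
pct = pct.add_suffix("%")
print(pct.round(1))
```

This avoids creating and then deleting the intermediate count and "sum" columns, and it does not fail when one level is missing for a course.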

## 4) Preprocessing

Before going deeper, we need to transform the categorical features into numeric values. There are 2 ways to do this:

1. using the sklearn library (LabelEncoder)
2. encoding with pandas categories

We choose the latter; the sklearn version is kept below as a comment.
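
Both approaches give the same integer codes, because each sorts the distinct values before numbering them. The sketch below checks this on a made-up column (`toy` is hypothetical) and also shows one-hot encoding with `pd.get_dummies`, which is an alternative not used in this kernel; it may be worth considering for nominal features such as Mjob, where integer codes impose an arbitrary order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column with two school names
toy = pd.Series(["GP", "MS", "GP", "MS", "GP"], name="school")

# Way 1: sklearn LabelEncoder
codes_sklearn = LabelEncoder().fit_transform(toy)

# Way 2: pandas category codes
codes_pandas = toy.astype("category").cat.codes.astype("int")

# Alternative (assumption, not in the kernel): one column per category
onehot = pd.get_dummies(toy, prefix="school")

print(list(codes_sklearn), list(codes_pandas), list(onehot.columns))
```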

In [None]:
#from sklearn.preprocessing import LabelEncoder
#labelencoder=LabelEncoder()
#for col in ["school", "sex", "address", "famsize", "Pstatus", "Mjob", "Fjob", "reason", "guardian", "schoolsup", 
#        "famsup", "paid", "activities", "nursery", "higher", "internet", "romantic", "course","Glevel"]:
#    data[col] = labelencoder.fit_transform(data[col])

for i in ["school","sex","address","famsize","Pstatus","Mjob","Fjob","reason","guardian","schoolsup","famsup",
          "paid","activities","nursery","higher","internet","romantic","course","Glevel"]:
    # Replace each category by its integer code (0..n-1, in sorted category order)
    data[i]=data[i].astype("category").cat.codes.astype("int")

data.head()

In [None]:
cor=data.corr()
plt.figure(figsize=(12,12))
sns.heatmap(cor,annot=False)
plt.show()

In [None]:
trainfeatures=["school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "reason", "guardian", "traveltime", "studytime", "failures", "schoolsup", "famsup", "paid", "activities", "nursery", "higher", "internet", "romantic", "famrel", "freetime", "goout", "Walc", "health", "absences", "course","Glevel"]

In [None]:
cor["Glevel"].drop("Glevel").drop("Gmean").sort_values()

# 2 Modeling

Here we use different supervised classification methods for modeling. First, split the dataset into a training set and a testing set.
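
Because more than half of the records are 'fail', a stratified split may be worth trying so that both sets keep roughly the same class proportions. The sketch below is an illustration on a made-up frame: only the 56% share of class 0 mirrors the figure quoted earlier, and the other shares and feature values are assumptions. It also selects the target by name instead of by position.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 100 rows, three encoded grade levels
toy = pd.DataFrame({
    "studytime": rng.integers(1, 5, size=100),
    "absences": rng.integers(0, 30, size=100),
    "Glevel": [0] * 56 + [1] * 9 + [2] * 35,
})

X_toy = toy.drop(columns="Glevel")   # features, selected by name
y_toy = toy["Glevel"]                # target

# stratify=y_toy keeps the class shares similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print(y_tr.value_counts(normalize=True).round(2).to_dict())
print(y_te.value_counts(normalize=True).round(2).to_dict())
```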

In [None]:
training, testing = train_test_split(data[trainfeatures], test_size=0.3, random_state=0)

X=training.iloc[:,0:27]
y=training.iloc[:,27]
xtest=testing.iloc[:,0:27]
ytest=testing.iloc[:,27]

Then use SVM, perceptron, decision tree, BaggingClassifier, KNN, logistic regression, random forest and extra trees for modeling.
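
KNN, SVM and the perceptron depend on distances or margins, so they can be sensitive to features on different scales (for example, absences can span a much wider range than 1-to-5 scales such as famrel). Scaling is not part of this kernel; the sketch below only shows, on synthetic data, how a `StandardScaler` could be chained in front of a classifier so that it is re-fitted inside every cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the student features (assumption, not the real data)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
X_toy[:, 0] *= 100.0  # give one feature a much larger scale than the others

# The scaler is fitted on the training part of each fold only
knn_scaled = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=10, weights="distance"),
)
score_scaled = cross_val_score(knn_scaled, X_toy, y_toy, cv=5).mean()
print(round(score_scaled, 3))
```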

In [None]:
## Logistic Regression
clf_log = LogisticRegression()
clf_log = clf_log.fit(X,y)
score_log = cross_val_score(clf_log, xtest, ytest, cv=5).mean()
print(score_log)

In [None]:
## Perceptron
clf_pctr = Perceptron(
    class_weight='balanced'
    )
clf_pctr = clf_pctr.fit(X,y)
score_pctr = cross_val_score(clf_pctr, xtest, ytest, cv=5).mean()
print(score_pctr)

In [None]:
## Kneighbor
clf_knn = KNeighborsClassifier(
    n_neighbors=10,
    weights='distance'
    )
clf_knn = clf_knn.fit(X,y)
score_knn = cross_val_score(clf_knn, xtest, ytest, cv=5).mean()
print(score_knn)

In [None]:
## SVM
clf_svm = svm.SVC(
    class_weight='balanced'
    )
clf_svm.fit(X, y)
score_svm = cross_val_score(clf_svm, xtest, ytest, cv=5).mean()
print(score_svm)

In [None]:
## Bagging
bagging = BaggingClassifier(
    KNeighborsClassifier(
        n_neighbors=5,
        weights='distance'
        ),
    oob_score=True,
    max_samples=0.5,
    max_features=1.0
    )
clf_bag = bagging.fit(X,y)
score_bag = clf_bag.oob_score_
print(score_bag)

In [None]:
## Decision Tree
clf_tree = tree.DecisionTreeClassifier(
    #max_depth=3,
    class_weight="balanced",
    min_weight_fraction_leaf=0.01
    )
clf_tree = clf_tree.fit(X,y)
score_tree = cross_val_score(clf_tree, xtest, ytest, cv=5).mean()
print(score_tree)

In [None]:
## Random Forest
clf_rf = RandomForestClassifier(
    n_estimators=1000,
    n_jobs=-1
    )
clf_rf = clf_rf.fit(X,y)
score_rf = cross_val_score(clf_rf, xtest, ytest, cv=5).mean()
print(score_rf)

In [None]:
## Extra Tree
clf_ext = ExtraTreesClassifier(
    max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers and is rejected by recent scikit-learn
    bootstrap=True,
    oob_score=True,
    n_estimators=1000,
    max_depth=None,
    min_samples_split=10
    #class_weight="balanced",
    #min_weight_fraction_leaf=0.02
    )
clf_ext = clf_ext.fit(X,y)
score_ext = cross_val_score(clf_ext, xtest, ytest, cv=5).mean()
print(score_ext)

In [None]:
## Summary of each classifier
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Perceptron','BaggingClassifier','Random Forest','Decision Tree','Extra Tree'],
    'Score': [score_svm, score_knn, score_log,score_pctr, score_bag,score_rf, score_tree, score_ext]})
print(models.sort_values("Score",ascending=False))

None of these classifiers performs well. This may be due to insufficient data and features.
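
One more thing to keep in mind when reading these scores: `cross_val_score` clones the estimator and refits it on folds of whatever data it receives. Calling it on `xtest, ytest` therefore scores models trained on parts of the test set, and the earlier `fit(X, y)` does not enter the result. A common alternative, sketched here on synthetic data (not the student data), is to cross-validate on the training set and use the held-out set once through `score`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic 3-class problem standing in for the student data (assumption)
X_all, y_all = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000)

# Model selection: 5-fold cross-validation on the training set only
cv_score = cross_val_score(clf, X_tr, y_tr, cv=5).mean()

# Final check: fit on all training rows, score once on the held-out rows
clf.fit(X_tr, y_tr)
test_score = clf.score(X_te, y_te)

print(round(cv_score, 3), round(test_score, 3))
```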

# 3 Feature importance

In [None]:
## Importance of each feature
importances = clf_ext.feature_importances_
# Use the columns the model was trained on, so names line up with importances
features = X.columns
sort_indices = np.argsort(importances)[::-1]
sorted_features = [features[idx] for idx in sort_indices]
plt.figure(figsize=(14,14))
plt.bar(range(len(importances)), importances[sort_indices], align='center')
plt.xticks(range(len(importances)), sorted_features, rotation='vertical')
plt.xlim([-1, len(importances)])
plt.grid(False)
plt.show()

result=pd.DataFrame({'factor':sorted_features,'weight':importances[sort_indices]})
print(result.sort_values("weight",ascending=False))

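The top 10 features reported for students' grades were guardian, paid, Dalc, famrel, Medu, reason, school, freetime, Fedu and address. However, Dalc and Fedu are not in trainfeatures, so these names were read from data.columns rather than from the 27 columns the model was trained on. The ranking has to be matched against X.columns before drawing conclusions.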
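
One way to keep names and importances aligned is to index the importances by the columns of the training frame. The sketch below uses synthetic data and hypothetical column names `f0` to `f5`; with the real data, the index would be the columns of `X`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the training data (assumption)
X_arr, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# Index by the training columns: each weight keeps its own feature name
ranking = pd.Series(model.feature_importances_, index=X_toy.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking.head(10))
```

Since the Series carries its own labels, sorting cannot separate a weight from its feature name.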