# Machine Learning in Python

by [Piotr Migdał](http://p.migdal.pl/) & Dominik Krzemiński

for El Passion, 2017

## 7. Random Forest Classification

We use the same dataset as before: https://archive.ics.uci.edu/ml/datasets/Student+Performance

* [Random Forests in Python](http://blog.yhat.com/posts/random-forests-in-python.html)
* [Random Forest](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)
* [sklearn.ensemble.RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)


In [None]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

In [None]:
students = pd.read_csv("data/students_cleaner.csv")

In [None]:
# total grade (G1 + G2 + G3) and a binary target: is it above the mean?
students["G"] = students["G1"] + students["G2"] + students["G3"]
students["good_G"] = students["G"] > students["G"].mean()

In [None]:
X = students.drop(['G', 'good_G', 'G1', 'G2', 'G3'], axis='columns')
Y = students['good_G']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

In [None]:
# fraction of students with a good grade; more or less 50-50, so the classes are roughly balanced
Y.mean()

In [None]:
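# random_state is fixed so that the forest (and the scores below) are reproducible
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)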
rf_clf.fit(X_train, Y_train)

In [None]:
# accuracy on the training dataset
rf_clf.score(X_train, Y_train)

In [None]:
# accuracy on the test dataset
rf_clf.score(X_test, Y_test)

In [None]:
# confusion matrix: rows are true labels, columns are predicted labels
sns.heatmap(confusion_matrix(Y_test, rf_clf.predict(X_test)), annot=True, fmt='d')
plt.xlabel("prediction")
plt.ylabel("ground truth")

In [None]:
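# feature importances, sorted so that the most important feature ends up at the top of the plot
# (kind has to be passed as a keyword; recent pandas versions reject positional arguments in Series.plot)
pd.Series(rf_clf.feature_importances_, index=X.columns).sort_values().plot(kind='barh', figsize=(6, 8))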

In [None]:
# Talk also about:
# Cross Validation
# Grid Search
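
A minimal sketch of both topics, not a tuned recipe. To keep it self-contained it runs on a synthetic dataset from `make_classification` instead of the students table; the names `X_demo`, `Y_demo`, `cv_scores`, `param_grid` and `grid`, as well as the grid values themselves, are illustrative assumptions. Cross-validation gives a less noisy accuracy estimate than a single train/test split, and grid search uses it internally to compare hyperparameter settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for the students data (an assumption, not the real dataset)
X_demo, Y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=5, random_state=42)
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo,
                                          test_size=0.25, random_state=42)

# Cross-validation: 5 train/validation splits of the training set,
# one accuracy score per fold
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=42),
                            X_tr, Y_tr, cv=5)
print("CV accuracy: mean", cv_scores.mean(), "std", cv_scores.std())

# Grid search: try every combination from param_grid,
# scoring each one with 5-fold cross-validation
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}
grid = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=42),
                    param_grid, cv=5)
grid.fit(X_tr, Y_tr)

print("best parameters:", grid.best_params_)
print("best CV accuracy:", grid.best_score_)

# The held-out test set is used only once, for the final estimate
print("test accuracy:", grid.score(X_te, Y_te))
```

On the students data the same calls should work with `X_train`, `Y_train`, `X_test` and `Y_test` from the cells above. The test set stays out of the search so that the final accuracy is not biased by the tuning.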