# Analysis of 2019 National High School Exam Data (ENEM) in Brazil (2)

This notebook is my personal playground for applying some machine learning techniques to predict student outcomes on the 2019 National High School Exam (Exame Nacional do Ensino Médio, or ENEM). The dataset is from <a href="https://www.kaggle.com/saraivaufc/enem-2019">Kaggle</a>.

This notebook focuses on modeling exercises. See the other notebook in the same repo for the data setup and feature creation.

In [1]:
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.5f' % x)
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from sklearn.metrics import roc_auc_score
data = pd.read_csv("cleaned.csv")

This dataset is huge, so let us work with a 2% random sample (about 100k records).

In [2]:
data = data.sample(frac = 0.02)
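Since `sample` is called without a seed, each run draws a different subset, so the numbers further down will shift a little between runs. A minimal sketch of a seeded draw, using a small synthetic frame as a stand-in for `cleaned.csv` (the frame contents and the seed value are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned ENEM data (hypothetical values).
toy = pd.DataFrame({"prouni_pass": np.arange(1000) % 2,
                    "feature": np.arange(1000)})

# Passing random_state makes the 2% draw repeatable across runs.
sample_a = toy.sample(frac=0.02, random_state=3453)
sample_b = toy.sample(frac=0.02, random_state=3453)

print(len(sample_a))              # 2% of 1000 rows
print(sample_a.equals(sample_b))  # same seed, same rows
```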

In [3]:
# target label
y = data['prouni_pass'].values
# drop the target and the raw exam scores from the feature set
data.drop(columns = ['score_science', 'score_humanities', 
                     'score_language', 'score_math',
                     'score_essay', 'prouni_pass'], inplace = True)

In [4]:
X = data.values

In [5]:
# encoding categoricals
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0, 6, 7])],
                        sparse_threshold = 0, remainder = 'passthrough')
X = np.array(ct.fit_transform(X))
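To make the encoding step concrete, here is a small sketch of what `ColumnTransformer` with `remainder = 'passthrough'` returns. The toy matrix and its values are made up; only the transformer settings mirror the cell above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy matrix: column 0 is categorical, column 1 is numeric (made-up values).
X_toy = np.array([["SP", 1.0],
                  ["RJ", 2.0],
                  ["SP", 3.0]], dtype=object)

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    sparse_threshold=0,        # always return a dense array
    remainder="passthrough",   # keep the other columns unchanged
)
X_enc = ct_toy.fit_transform(X_toy)

# The one-hot columns come first (categories sorted: RJ, SP),
# followed by the passthrough column.
print(X_enc.shape)
print(X_enc)
```

One consequence worth keeping in mind: the column positions after encoding no longer match the positions in the original frame, because the encoded columns are moved to the front.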

In [6]:
# split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 3453)
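The split above is not stratified. With about 100k rows the class shares in train and test will be close anyway, but `stratify` makes them match by construction. A sketch on synthetic data (the arrays and the 55% positive share are assumptions, not values from the ENEM sample):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (np.arange(1000) < 550).astype(int)   # 55% positives (made up)

# stratify=y_toy keeps the positive share the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=3453, stratify=y_toy)

print(y_tr.mean(), y_te.mean())
```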

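Next, I try out some standard classifiers. Overall, there is not much difference between the baseline models: apart from KNN (test ROC AUC of about 0.61), all of them reach a test ROC AUC between 0.67 and 0.68 and an accuracy between 0.61 and 0.63. Logistic regression has a very slight edge on AUC (0.676), while CatBoost has the highest accuracy (0.629).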
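Every model below is scored the same way (confusion matrix, accuracy, ROC AUC), so the repeated fit-and-print code could be folded into one helper. This is only a sketch: `evaluate` is a hypothetical name, and the data comes from `make_classification` rather than the ENEM sample.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model, then return test-set confusion matrix, accuracy, and ROC AUC."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]   # probability of class 1
    return {
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }


# Synthetic stand-in data (not the ENEM sample).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
splits = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

results = {
    "logit": evaluate(LogisticRegression(max_iter=1000), *splits),
    "tree": evaluate(DecisionTreeClassifier(max_depth=3, random_state=0), *splits),
}
for name, res in results.items():
    print(name, round(res["accuracy"], 4), round(res["roc_auc"], 4))
```

`train_test_split` returns `X_train, X_test, y_train, y_test` in that order, which is why the helper takes its arguments in the same order and can be called with `*splits`.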

In [7]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# grid search over tree depth and the number of features considered per split
# (GridSearchCV scores classifiers on accuracy unless `scoring` is set)
from sklearn.model_selection import GridSearchCV
tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 3452)
tree_params = {'max_depth': range(1,20),
               'max_features': range(1,20)}

tree_grid = GridSearchCV(tree, tree_params,
                         cv=5, n_jobs=-1, verbose=True)

tree_grid.fit(X_train, y_train)
print(tree_grid.best_params_) 

Fitting 5 folds for each of 361 candidates, totalling 1805 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1805 out of 1805 | elapsed:  3.1min finished

{'max_depth': 11, 'max_features': 12}
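The search in this cell picks parameters by cross-validated accuracy, which is the default scorer for classifiers. Since the models are also compared on ROC AUC, one option is to tune on that metric directly. A sketch on synthetic data, with a much smaller grid than the one used in the cell (data and grid sizes are assumptions to keep it quick):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the ENEM sample).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=3452),
    param_grid={"max_depth": range(1, 6), "max_features": range(1, 5)},
    scoring="roc_auc",   # rank candidates by cross-validated ROC AUC
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)            # mean cross-validated AUC of the best candidate
best_tree = search.best_estimator_   # already refit on all of X_demo (refit=True by default)
```

Because `refit=True` is the default, `best_estimator_` can be used for prediction directly, without building a second classifier from the printed parameters.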


In [8]:
# refit with the best grid-search parameters and evaluate on the test set
classifier = DecisionTreeClassifier(max_depth = 11, max_features = 12, criterion = 'entropy', random_state = 3452)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
print(roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1]))

[[ 6923  6859]
 [ 4515 12274]]
0.6279480553465703
0.6734664277308604
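The confusion matrix carries more than the accuracy figure. Here is the arithmetic for the decision tree output above, using scikit-learn's layout (rows are true classes, columns are predicted classes) and assuming `prouni_pass == 1` marks a pass:

```python
# Decision tree confusion matrix from the output above.
tn, fp = 6923, 6859     # true class 0: predicted 0, predicted 1
fn, tp = 4515, 12274    # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
recall = tp / (tp + fn)         # share of actual class 1 predicted as 1
specificity = tn / (tn + fp)    # share of actual class 0 predicted as 0
precision = tp / (tp + fp)      # share of predicted 1 that are actually 1
majority_baseline = max(tp + fn, tn + fp) / total   # always predict the larger class

print(f"accuracy={accuracy:.4f} recall={recall:.4f} specificity={specificity:.4f}")
print(f"precision={precision:.4f} majority_baseline={majority_baseline:.4f}")
```

The tree recovers about 73% of class 1 but only about 50% of class 0, and always predicting the larger class would already give an accuracy of about 0.55, so the 0.628 accuracy is a modest gain over that baseline.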


In [9]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

# grid search over forest size and features per split (max_depth stays fixed at 3)
rfc = RandomForestClassifier(n_estimators = 10, max_features = 10, max_depth = 3,
                                    criterion = 'gini', random_state = 3452)
tree_params = {'n_estimators': range(10,100),
               'max_features': range(1,20)}

tree_grid = GridSearchCV(rfc, tree_params,
                         cv=5, n_jobs=-1, verbose=True)

tree_grid.fit(X_train, y_train)
print(tree_grid.best_params_) 

Fitting 5 folds for each of 1710 candidates, totalling 8550 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8550 out of 8550 | elapsed: 103.8min finished


{'max_features': 2, 'n_estimators': 45}
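The exhaustive grid here covers 1,710 candidates and took over 100 minutes. A randomized search evaluates a fixed number of sampled candidates instead, which caps the cost. This is a sketch of that alternative, not what the notebook ran; the data is synthetic and `n_iter` is kept tiny so it finishes quickly:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data with 20 features (not the ENEM sample).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(max_depth=3, criterion="gini", random_state=3452),
    param_distributions={
        "n_estimators": randint(10, 100),   # integers 10..99, same span as range(10, 100)
        "max_features": randint(1, 20),     # integers 1..19, same span as range(1, 20)
    },
    n_iter=5,             # number of sampled candidates
    cv=3,
    random_state=3452,    # makes the sampled candidates repeatable
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The trade-off is that the best sampled candidate may not be the best point on the full grid, although with differences as small as the ones in this notebook that is unlikely to change the overall picture.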


In [10]:
# refit with the best grid-search parameters and evaluate on the test set
classifier = RandomForestClassifier(n_estimators = 45, max_features = 2, max_depth = 3,
                                    criterion = 'gini', random_state = 3452)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
print(roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1]))

[[ 6014  7768]
 [ 3773 13016]]
0.6224853619443264
0.6754424634631522


In [11]:
# Logit
from sklearn.linear_model import LogisticRegression

# fit
classifier = LogisticRegression(random_state = 3452)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
print(roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1]))

[[ 6793  6989]
 [ 4384 12405]]
0.6279807660855059
0.6762080391744361


In [12]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB

# fit
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
print(roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1]))

[[10468  3314]
 [ 8565  8224]]
0.611429132184096
0.6726992875342439


In [13]:
# KNN with k = 5 and Euclidean distance (Minkowski with p = 2)
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
print(roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1]))

[[ 6603  7179]
 [ 5523 11266]]
0.5845081940401033
0.6113054018938517
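KNN is distance-based, so features on larger numeric scales dominate the Euclidean distance. I have not checked whether that explains the weaker result here, but standardizing inside a pipeline is a cheap thing to try. A sketch on synthetic data, where one feature is deliberately put on a much larger scale (both the data and that rescaling are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the ENEM sample).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 1000.0   # put one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3453)

# The scaler is fit on the training part only, then applied to the test part.
knn_scaled = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
)
knn_scaled.fit(X_tr, y_tr)
auc_scaled = roc_auc_score(y_te, knn_scaled.predict_proba(X_te)[:, 1])
print(auc_scaled)
```

Wrapping the scaler and the classifier in one pipeline also keeps the scaling statistics out of the test set, which matters if the pipeline is later passed to a cross-validated search.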


In [14]:
# CatBoost
from catboost import CatBoostClassifier
classifier = CatBoostClassifier(random_seed = 3452, silent = True)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
print(roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1]))

[[ 6994  6788]
 [ 4555 12234]]
0.6289620882535737
0.6751796191228477

