### Predicting Math Grades for the Brazilian ENEM

In Brazil, there is a national high school exam called ENEM (Exame Nacional do Ensino Médio), which takes place yearly and whose grades are used for admission to most universities and colleges. The goal here is to predict the math grade (`NU_NOTA_MT`) from the other exam grades using regression models.

Two models were implemented, **Linear Regression** and **Random Forest**, and their performance was measured with the R² metric.

Let's check it!

### Importing libraries

In [None]:
import pandas as pd
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

### Loading Dataset

In [None]:
train_dataset = pd.read_csv('../input/codenation-enem2/train.csv', index_col=0)
train_dataset.head()

In [None]:
train_dataset['NU_NOTA_MT'].head()

In [None]:
train_dataset.shape

In [None]:
train_dataset.info()

In [None]:
train_dataset.describe()

In [None]:
test_dataset = pd.read_csv('../input/codenation-enem2/test.csv')
test_dataset.head()

In [None]:
test_dataset.shape

In [None]:
test_dataset.info()

In [None]:
test_dataset.describe()

### Data Analysis

In [None]:
# Generating the answer dataframe with 'NU_INSCRICAO' variable
answer = pd.DataFrame()
answer['NU_INSCRICAO'] = test_dataset['NU_INSCRICAO']
answer.head()

In [None]:
answer.shape

#### - Testing hypotheses to select features (based on D-Tale report)

a) **First hypothesis**: `NU_IDADE` and `IN_TREINEIRO` are weakly correlated with other features.

In [None]:
var = ['NU_IDADE','IN_TREINEIRO','NU_NOTA_CN','NU_NOTA_CH','NU_NOTA_LC','NU_NOTA_REDACAO']
train_dataset[var].corr()


b) **Second hypothesis**: `NU_IDADE` and `IN_TREINEIRO` can be dropped so that they do not interfere with the model's predictions.


In [None]:
features = ['NU_NOTA_CN','NU_NOTA_CH','NU_NOTA_LC','NU_NOTA_REDACAO']

In [None]:
train_dataset[features].corr()

In [None]:
plt.figure(figsize=(9,6))
plt.title('Train Features')
sns.heatmap(train_dataset[features].corr(), annot=True, cmap='Reds')
plt.xticks(rotation=70)
plt.show()

In [None]:
test_dataset[features].corr()

In [None]:
plt.figure(figsize=(9,6))
plt.title('Test Features')
sns.heatmap(test_dataset[features].corr(), annot=True, cmap='Reds')
plt.xticks(rotation=70)
plt.show()
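
A complementary check, not part of the original analysis, is to look at how strongly each candidate feature correlates with the target itself rather than with the other features. The sketch below uses synthetic data with hypothetical column names (`strong`, `weak`, `target`); in this notebook the equivalent would be computed over `features + ['NU_NOTA_MT']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
strong = rng.normal(500, 100, n)              # hypothetical informative feature
weak = rng.normal(20, 3, n)                   # hypothetical uninformative feature
target = 0.8 * strong + rng.normal(0, 30, n)  # target driven by `strong` plus noise

demo = pd.DataFrame({'strong': strong, 'weak': weak, 'target': target})

# Absolute Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    demo.corr()['target']
    .drop('target')
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Features at the bottom of this ranking are candidates for removal, although a low linear correlation alone does not prove a feature is useless.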

### Data Preprocessing

Since the dataset contains null values, three approaches can be taken:

   > 1) Drop the rows with null values. This could drastically reduce the number of samples available to train the model.

   > 2) Fill null values with zeros. This keeps the number of samples in the dataset.

   > 3) Fill null values with the average value of each feature. This also keeps the number of samples in the dataset.

Here, the null values will be filled with zeros (second approach). 
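
To make the trade-off concrete, here is a minimal sketch of the three approaches on a toy frame. The values are made up for illustration and are not taken from the ENEM data.

```python
import numpy as np
import pandas as pd

# Made-up grades; NaN marks a missing score
toy = pd.DataFrame({
    'NU_NOTA_CN': [500.0, np.nan, 600.0, 400.0],
    'NU_NOTA_CH': [550.0, 450.0, np.nan, 650.0],
})

dropped = toy.dropna()                # 1) drop rows with any null
zero_filled = toy.fillna(0)           # 2) fill nulls with zeros
mean_filled = toy.fillna(toy.mean())  # 3) fill nulls with each column's mean

# Row counts: only the first approach loses samples
print(len(toy), len(dropped), len(zero_filled), len(mean_filled))

# Column means: zero filling pulls them down, mean filling preserves them
print(zero_filled.mean(), mean_filled.mean(), sep='\n')
```

On this toy frame, dropping keeps 2 of the 4 rows, while both filling strategies keep all 4. Zero filling lowers the `NU_NOTA_CN` mean from 500 to 375, whereas mean filling leaves it at 500.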

- Checking for null values

In [None]:
train_dataset[features].isnull().sum()

In [None]:
train_dataset['NU_NOTA_MT'].isnull().sum()

In [None]:
test_dataset[features].isnull().sum()

- Filling null values with zeros

In [None]:
# Fill missing grades with 0 by assignment instead of in-place fillna on a
# single column, which newer pandas versions warn about and may not apply
train_cols = features + ['NU_NOTA_MT']
train_dataset[train_cols] = train_dataset[train_cols].fillna(0)
test_dataset[features] = test_dataset[features].fillna(0)

- Confirming that null values were filled

In [None]:
train_dataset[features].isnull().sum()

In [None]:
train_dataset['NU_NOTA_MT'].isnull().sum()

In [None]:
test_dataset[features].isnull().sum()

### Splitting dataset

In [None]:
X = train_dataset[features]
X.head()

In [None]:
y = train_dataset['NU_NOTA_MT']
y.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

### Feature scaling (standardization)

In [None]:
sc = StandardScaler()
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
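
The scaler is fitted on the training split only, and the same statistics are then applied to both splits, so no information from the held-out data leaks into preprocessing. The sketch below reproduces that idea by hand with NumPy on synthetic matrices (`train_demo` and `test_demo` are illustrative stand-ins, not the notebook's data). It mirrors what `StandardScaler` computes with its default settings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the train/test feature matrices
train_demo = rng.normal(loc=500.0, scale=100.0, size=(200, 4))
test_demo = rng.normal(loc=500.0, scale=100.0, size=(50, 4))

# "fit": statistics come from the training split only
mean_ = train_demo.mean(axis=0)
scale_ = train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": the same statistics are applied to both splits
train_scaled = (train_demo - mean_) / scale_
test_scaled = (test_demo - mean_) / scale_

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the scaled test columns are only approximately standardized, which is expected.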

### Modeling

- **Linear Regression**

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
# Getting predictions
y_pred = lr.predict(X_test)

In [None]:
# Getting r2 score
r2_score(y_test, y_pred)
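
For reference, R² is defined as 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below computes it by hand on made-up grades; the helper name `r2` and the sample values are illustrative only.

```python
def r2(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up grades, for illustration only
y_true_demo = [400.0, 500.0, 600.0, 700.0]

perfect = r2(y_true_demo, y_true_demo)                    # exact predictions
baseline = r2(y_true_demo, [550.0] * 4)                   # always predicting the mean
close = r2(y_true_demo, [410.0, 490.0, 610.0, 690.0])     # small errors of 10 points

print(perfect, baseline, close)
```

A perfect model scores 1, a model that always predicts the mean scores 0, and a model can score below 0 if it does worse than that baseline.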

- **Random Forest with Cross-Validation**

In [None]:
# Perform Grid-Search
gsc = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid={'max_depth': range(3,7), 
                'n_estimators': (50, 100, 500, 1000),
    },
    cv=10, scoring='r2', verbose=0, n_jobs=-1)

grid_result = gsc.fit(X, y)
best_params = grid_result.best_params_
# random_state and verbose expect integers; the original False was read as 0
rfr = RandomForestRegressor(max_depth=best_params["max_depth"],
                            n_estimators=best_params["n_estimators"],
                            random_state=0, verbose=0)

# Perform K-Fold CV
scores = cross_val_score(rfr, X, y, cv=10, scoring='r2')
scores

In [None]:
scores.mean() * 100
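
As an aside, with its default `refit=True`, `GridSearchCV` already refits the best parameter combination on all the data it was given and exposes the result as `best_estimator_`, with the mean cross-validated score in `best_score_`. The sketch below shows this on synthetic data with a deliberately small grid; `X_demo`, `y_demo`, and the grid values are assumptions chosen to keep the run fast, not the notebook's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the ENEM features and target
X_demo, y_demo = make_regression(n_samples=200, n_features=4,
                                 noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),  # fixed seed for repeatability
    param_grid={'max_depth': [3, 5], 'n_estimators': [10, 20]},
    cv=3,
    scoring='r2',
)
search.fit(X_demo, y_demo)

print(search.best_params_)  # best combination found on the grid
print(search.best_score_)   # mean cross-validated R² of that combination

# best_estimator_ is already refit on all of X_demo, so it can predict directly
demo_pred = search.best_estimator_.predict(X_demo[:5])
```

Fixing `random_state` on the estimator passed to the search makes the selected parameters reproducible between runs.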

In [None]:
rfr.fit(train_dataset[features], train_dataset['NU_NOTA_MT'])

In [None]:
y_pred = rfr.predict(test_dataset[features])
y_pred

In [None]:
answer['NU_NOTA_MT'] = y_pred
answer.head()

In [None]:
answer.describe()

In [None]:
answer.to_csv('answer.csv', index=False, float_format='%.1f')