# Codenation Challenge
---
This challenge is about predicting math scores. I used a simple linear regression model to predict ENEM math grades, selecting the features by their Pearson correlation before fitting the model. Since I can't submit predictions to get an official score, I used cross-validation to evaluate the model.

## Importing Libraries and Dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.metrics import make_scorer, r2_score, mean_absolute_error
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df_test = pd.read_csv('/kaggle/input/codenation-enem2/test.csv')
df_train = pd.read_csv('/kaggle/input/codenation-enem2/train.csv')

## Analysing the Database

### Some info about the Database

Here are the columns chosen to make the prediction:

- TP_ANO_CONCLUIU -> year of high school completion
- TP_PRESENCA_CN -> attendance at the CN (Natural Sciences) exam
- TP_PRESENCA_CH -> attendance at the CH (Human Sciences) exam
- TP_PRESENCA_LC -> attendance at the LC (Languages and Codes) exam
- NU_NOTA_CN -> CN exam grade
- NU_NOTA_CH -> CH exam grade
- NU_NOTA_LC -> LC exam grade
- NU_NOTA_REDACAO -> essay grade
- NU_NOTA_MT -> math grade
- Essay competency grades
  - NU_NOTA_COMP1
  - NU_NOTA_COMP2
  - NU_NOTA_COMP3
  - NU_NOTA_COMP4
  - NU_NOTA_COMP5

In [None]:
use_list = ['NU_NOTA_COMP1', 'NU_NOTA_COMP2', 'NU_NOTA_COMP3', 'NU_NOTA_COMP4', 'NU_NOTA_COMP5', 'NU_NOTA_CN', 'NU_NOTA_CH', 
            'NU_NOTA_LC', 'NU_NOTA_REDACAO', 'NU_NOTA_MT']

df_train = df_train[use_list]  # keep every selected column, including the target
df_test = df_test[use_list[:-1]]  # keep the same columns except the target 'NU_NOTA_MT'

In [None]:
df_train.head()

In [None]:
df_test.head()

As we can see above, there are some NaN values that need to be handled. I will replace them with 0, because a missing grade most likely means the student gave up on ENEM or did not take that exam.

In [None]:
print(df_train.isna().sum() / df_train.shape[0] * 100)

In [None]:
print(df_test.isna().sum() / df_test.shape[0] * 100)

In [None]:
df_train_filled = df_train.fillna(0, axis=0)
df_test_filled = df_test.fillna(0, axis=0)

Now let's see if there is any correlation between the data. This helps to choose the best features to train the model.

In [None]:
correlacao_notas = df_train_filled.corr()

plt.figure(figsize=(10, 6))
sns.heatmap(correlacao_notas, annot=True, cmap="BrBG", vmin=-1, vmax=1)
plt.xticks(rotation=45)
plt.show()

I'll keep the columns whose correlation with the target is above 0.75, which may improve the model.
Now let's split the df_train data into X and y for training.

In [None]:
X = df_train_filled.drop(columns=['NU_NOTA_COMP5', 'NU_NOTA_MT'])
y = df_train_filled['NU_NOTA_MT']
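
The same threshold rule can also be applied programmatically instead of reading the heatmap by eye. The following is only a minimal sketch on synthetic data: the frame `toy` and its column names are hypothetical stand-ins for `df_train_filled`, and the 0.75 cut-off is the one mentioned above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train_filled (hypothetical column names)
rng = np.random.default_rng(0)
target = rng.normal(500, 100, size=200)
toy = pd.DataFrame({
    "strong_feature": target + rng.normal(0, 20, size=200),  # closely tied to the target
    "weak_feature": rng.normal(0, 1, size=200),              # unrelated noise
    "target": target,
})

# Pearson correlation of each column with the target, leaving the target itself out
corr_with_target = toy.corr()["target"].drop("target")

# Keep only the columns whose absolute correlation exceeds the threshold
threshold = 0.75
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()
print(selected)
```

On the real data, the same two lines would be applied to `df_train_filled.corr()['NU_NOTA_MT']`, assuming the target column keeps that name.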

First, we need to split the data into training and test sets so that the model can be trained on one part and checked on the other.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
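
As a side note, `train_test_split` shuffles the rows randomly, so the score changes from run to run unless a seed is fixed. Here is a small sketch of that behavior on toy arrays; `X_toy`, `y_toy`, and the seed value 42 are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows with 2 features each (hypothetical values)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# With a fixed random_state, two calls produce exactly the same split
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

print(X_tr.shape, X_te.shape)        # 75% of the rows for training, 25% for testing
print(np.array_equal(X_te, X_te2))   # same test rows on both calls
```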

## Training the Base Linear Model

This model will be used as a baseline to compare with the other models below.

In [None]:
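# The `normalize` argument was removed in scikit-learn 1.2; for plain least squares
# it should not change the predictions, so the default estimator is used here
lr = LinearRegression()
lr.fit(X_train, y_train)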

In [None]:
lr.score(X_test, y_test)
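
For reference, the `score` method of a scikit-learn regressor returns the coefficient of determination, R². The sketch below, on synthetic data with hypothetical names, checks that it matches both `r2_score` and the formula computed by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression problem (hypothetical data, not the ENEM dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X_toy, y_toy)
pred = model.predict(X_toy)

score_from_method = model.score(X_toy, y_toy)   # what lr.score(...) reports
score_from_metric = r2_score(y_toy, pred)       # the same quantity via sklearn.metrics

# R² by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = ((y_toy - pred) ** 2).sum()
ss_tot = ((y_toy - y_toy.mean()) ** 2).sum()
score_by_hand = 1 - ss_res / ss_tot

print(score_from_method, score_from_metric, score_by_hand)
```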

The score was good. However, a single train/test split can give a noisy estimate, so it's important to use cross-validation for a more reliable evaluation.

In [None]:
mae = make_scorer(mean_absolute_error)
r2 = make_scorer(r2_score)

# `normalize=True` is dropped because the argument was removed in scikit-learn 1.2
cvs = cross_validate(estimator=LinearRegression(), X=X, y=y, cv=10, verbose=10,
                     scoring={'mae': mae, 'r2': r2})
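
To make the cross-validation step more concrete, here is a rough sketch of what a 10-fold evaluation does under the hood, written as an explicit loop on synthetic data. The arrays `X_toy` and `y_toy` are hypothetical; for a regressor, an integer `cv` corresponds to an unshuffled `KFold`, which is what the loop uses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import KFold

# Synthetic regression problem (hypothetical data, not the ENEM dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.5, size=200)

fold_r2, fold_mae = [], []
for train_idx, test_idx in KFold(n_splits=10).split(X_toy):
    # Fit on nine folds, evaluate on the held-out fold
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    pred = model.predict(X_toy[test_idx])
    fold_r2.append(r2_score(y_toy[test_idx], pred))
    fold_mae.append(mean_absolute_error(y_toy[test_idx], pred))

# Summarize each metric by its mean and spread across the folds
print("R2:  mean %.3f, std %.3f" % (np.mean(fold_r2), np.std(fold_r2)))
print("MAE: mean %.3f, std %.3f" % (np.mean(fold_mae), np.std(fold_mae)))
```

Every row is used for testing exactly once, so the mean across folds depends less on one lucky or unlucky split than a single hold-out score does.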

In [None]:
print("The mean R2 across folds is %.3f" % (cvs['test_r2'].mean()))
print("The standard deviation of R2 is %.3f" % (cvs['test_r2'].std()))

In [None]:
print("The mean MAE across folds is %.3f" % (cvs['test_mae'].mean()))
print("The standard deviation of MAE is %.3f" % (cvs['test_mae'].std()))