# Training & Validation
This **Jupyter Notebook** trains *Machine Learning* models on one or more *features* and checks how well each model is learning, using the [Mean Absolute Error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) as the validation metric.
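
To make the metric concrete, here is a minimal sketch of MAE written by hand: the average of the absolute differences between true and predicted values. The function name `mean_absolute_error_manual` and the salary figures are made up for illustration; the notebook itself uses `sklearn.metrics.mean_absolute_error`.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative salaries only, not taken from the dataset.
actual = [30000, 45000, 25000]
predicted = [28000, 50000, 25000]

mae = mean_absolute_error_manual(actual, predicted)
print("MAE:", mae)
```

Because the errors are not squared, MAE is expressed in the same unit as the target (here, salary) and is less sensitive to a few very large misses than squared-error metrics.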

---

In [None]:
import py7zr

with py7zr.SevenZipFile("../datasets/Train_rev1.7z", mode='r') as archive:
  archive.extractall(path="/tmp") # For Linux users.

In [None]:
import pandas as pd

full_df = pd.read_csv("/tmp/Train_rev1.csv")
df_SalaryNormalized = full_df[["SalaryNormalized"]]
df_SalaryNormalized.info()
df_SalaryNormalized.head()

In [None]:
import scipy.sparse
df_title_vectorized = scipy.sparse.load_npz('df_title_vectorized.npz')

In [None]:
df_title_vectorized

In [None]:
x = df_title_vectorized
y = df_SalaryNormalized

In [None]:
x

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.3, random_state=42)

In [None]:
from sklearn.linear_model import LinearRegression
lrModel = LinearRegression() # Instance.
lrModel.fit(x_train, y_train) # Training.

salary_predicted_ln = lrModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_ln)

from sklearn.metrics import mean_absolute_error
mae_ln = mean_absolute_error(y_valid, salary_predicted_ln)
print('MAE for Linear Regression Model: ', mae_ln)

In [None]:
from sklearn.linear_model import Ridge
ridgeModel = Ridge() # Instance.
ridgeModel.fit(x_train, y_train) # Training.

salary_predicted_ridge = ridgeModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_ridge)

from sklearn.metrics import mean_absolute_error
mae_ridge = mean_absolute_error(y_valid, salary_predicted_ridge)
print('MAE for Ridge Model: ', mae_ridge)

In [None]:
from sklearn.linear_model import Lasso
lassoModel = Lasso() # Instance.
lassoModel.fit(x_train, y_train) # Training.

salary_predicted_lasso = lassoModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_lasso)

from sklearn.metrics import mean_absolute_error
mae_lasso = mean_absolute_error(y_valid, salary_predicted_lasso)
print('MAE for Lasso Model: ', mae_lasso)

In [None]:
from sklearn.linear_model import ElasticNet
elasticNetModel = ElasticNet() # Instance.
elasticNetModel.fit(x_train, y_train) # Training.

salary_predicted_elasticNet = elasticNetModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_elasticNet)

from sklearn.metrics import mean_absolute_error
mae_elasticNet = mean_absolute_error(y_valid, salary_predicted_elasticNet)
print('MAE for Elastic Net Model: ', mae_elasticNet)

In [None]:
from sklearn.ensemble import RandomForestRegressor
import numpy as np

RandomFRModel = RandomForestRegressor(n_jobs=-1) # Instance.

In [None]:
RandomFRModel.fit(x_train, np.ravel(y_train)) # Training.

In [None]:

salary_predicted_RandomFR = RandomFRModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_RandomFR)

from sklearn.metrics import mean_absolute_error
mae_RandomFR = mean_absolute_error(y_valid, salary_predicted_RandomFR.ravel())
print('MAE for RandomForestRegressor Model: ', mae_RandomFR)

---
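
The model cells above all follow the same fit, predict, score pattern. As a possible alternative, the sketch below wraps that pattern in one helper that keeps each score tied to its model name. `evaluate_models`, `MeanPredictor`, and `mae` are hypothetical names introduced here for illustration; `MeanPredictor` is only a stand-in for any estimator exposing `fit` and `predict`.

```python
def evaluate_models(models, x_train, y_train, x_valid, y_valid, metric):
    """Fit every model and return a {name: validation score} dict."""
    scores = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        scores[name] = metric(y_valid, model.predict(x_valid))
    return scores


class MeanPredictor:
    """Hypothetical stand-in estimator: always predicts the training mean."""

    def fit(self, x, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, x):
        return [self.mean_] * len(x)


def mae(y_true, y_pred):
    """Mean absolute error for plain Python sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Tiny made-up data, just to exercise the helper.
scores = evaluate_models(
    {"mean": MeanPredictor()},
    x_train=[[0], [1], [2]], y_train=[10, 20, 30],
    x_valid=[[3], [4]], y_valid=[20, 40],
    metric=mae,
)
print(scores)
```

In this notebook the same helper could presumably receive a dict such as `{"Ridge": Ridge(), "Lasso": Lasso()}` together with `mean_absolute_error`, `x_train`, `x_valid`, and the targets. Passing the targets through `np.ravel` first, as the random forest cell does, would keep the call uniform across models.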

# 01 - Installing & importing the required libraries

First, let's install the libraries this analysis needs. (I already have them all in my virtual environment, but you can uncomment the line below to install them on your local machine or in a virtual environment.)

In [None]:
# !pip install --upgrade -r ../requirements.txt

# 02 - Extracting the test dataset

> The first thing we'll do is get the test dataset provided by **Adzuna**.

**NOTE:**  
Keep in mind that this dataset does not have the **"SalaryRaw"** and **"SalaryNormalized"** columns (features).
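
A quick way to confirm this is to compare a file's column names against the two target columns. The sketch below is an illustration only: `missing_targets` is a hypothetical helper, and the column lists are shortened, made-up subsets rather than the real headers.

```python
TARGET_COLUMNS = {"SalaryRaw", "SalaryNormalized"}


def missing_targets(columns):
    """Return the target columns absent from `columns`, sorted by name."""
    return sorted(TARGET_COLUMNS - set(columns))


# Shortened, illustrative column lists (assumption, not the full headers).
train_columns = ["Id", "Title", "SalaryRaw", "SalaryNormalized"]
test_columns = ["Id", "Title"]

print(missing_targets(train_columns))
print(missing_targets(test_columns))
```

Once the test file is loaded into a DataFrame, its `.columns` can be passed to the same function. Since the targets are absent, MAE cannot be computed on this file, which is why the notebook holds out part of the training data for validation.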

In [None]:
with py7zr.SevenZipFile("../datasets/Test_rev1.7z", mode='r') as archive:
  archive.extractall(path="/tmp") # For Linux users.

In [None]:
testing_df = pd.read_csv("/tmp/Test_rev1.csv")

In [None]:
testing_df.info()

# 03 - Training and validating load-v1
As we know, in **load-v1** only the **"SalaryNormalized"** column was passed to the *training* and *validation* stage.

**NOTE:**  
This will be our **baseline model**.
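
One common way to get a reference score from the target column alone is a constant predictor. The sketch below assumes that reading, which may differ from how load-v1 was actually built: it predicts the training median for every validation row, since the median is the constant that minimizes MAE on the data it is computed from. `constant_baseline_mae` and the salary values are hypothetical.

```python
from statistics import median


def constant_baseline_mae(y_train, y_valid):
    """Predict the training median for every validation row; return (constant, MAE)."""
    constant = median(y_train)
    errors = [abs(y - constant) for y in y_valid]
    return constant, sum(errors) / len(errors)


# Made-up salaries, for illustration only.
y_train_demo = [20000, 25000, 30000, 60000]
y_valid_demo = [22000, 40000]

constant, baseline_mae = constant_baseline_mae(y_train_demo, y_valid_demo)
print("Constant prediction:", constant)
print("Baseline MAE:", baseline_mae)
```

Any model trained on real features should beat this number to be worth keeping.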

**Getting the dataset:**

**Overview:**

**Selecting the independent and dependent (target) variables:**