# Treinamento & Validação
Esse **Jupyter Notebook** tem como objetivo treinar modelos de *Machine Learning* com uma ou mais *features* e verificar quão bem esses modelos estão aprendendo com base na métrica de validação - [Erro Médio Absoluto](https://en.wikipedia.org/wiki/Mean_absolute_error).

---

In [1]:
import py7zr

with py7zr.SevenZipFile("../datasets/Train_rev1.7z", mode='r') as archive:
  archive.extractall(path="/tmp") # For Linux users.

In [2]:
import pandas as pd

full_df = pd.read_csv("/tmp/Train_rev1.csv")
df_SalaryNormalized = full_df[["SalaryNormalized"]]
df_SalaryNormalized.info()
df_SalaryNormalized.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   SalaryNormalized  244768 non-null  int64
dtypes: int64(1)
memory usage: 1.9 MB


Unnamed: 0,SalaryNormalized
0,25000
1,30000
2,30000
3,27500
4,25000


In [3]:
import scipy.sparse
df_title_vectorized = scipy.sparse.load_npz('df_title_vectorized.npz')

In [4]:
df_title_vectorized

<244768x14917 sparse matrix of type '<class 'numpy.int64'>'
	with 923171 stored elements in Compressed Sparse Row format>

In [5]:
x = df_title_vectorized
y = df_SalaryNormalized

In [6]:
x

<244768x14917 sparse matrix of type '<class 'numpy.int64'>'
	with 923171 stored elements in Compressed Sparse Row format>

In [7]:
y

Unnamed: 0,SalaryNormalized
0,25000
1,30000
2,30000
3,27500
4,25000
...,...
244763,22800
244764,22800
244765,22800
244766,22800


In [8]:
from sklearn.model_selection import train_test_split
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.3, random_state=42)

In [9]:
from sklearn.linear_model import LinearRegression
lrModel = LinearRegression() # Instance.
lrModel.fit(x_train, y_train) # Training.

salary_predicted_ln = lrModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_ln)

from sklearn.metrics import mean_absolute_error
mae_ln = mean_absolute_error(y_valid, salary_predicted_ln)
print('MAE for Linear Regression Model: ', mae_ln)

Predictions:
  [[36329.3401252 ]
 [24994.45892553]
 [30252.71791482]
 ...
 [21808.86286281]
 [30802.06643447]
 [34720.7647732 ]]
MAE for Linear Regression Model:  8657.710037667131


In [10]:
from sklearn.linear_model import Ridge
ridgeModel = Ridge() # Instance.
ridgeModel.fit(x_train, y_train) # Training.

salary_predicted_ridge = ridgeModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_ridge)

from sklearn.metrics import mean_absolute_error
mae_ridge = mean_absolute_error(y_valid, salary_predicted_ridge)
print('MAE for Ridge Model: ', mae_ridge)

Predictions:
  [[36282.66675248]
 [17353.9188599 ]
 [30256.92827149]
 ...
 [23060.25942236]
 [28012.35552722]
 [34769.28500259]]
MAE for Ridge Model:  8616.753786449399


In [11]:
from sklearn.linear_model import Lasso
lassoModel = Lasso() # Instance.
lassoModel.fit(x_train, y_train) # Training.

salary_predicted_lasso = lassoModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_lasso)

from sklearn.metrics import mean_absolute_error
mae_lasso = mean_absolute_error(y_valid, salary_predicted_lasso)
print('MAE for Lasso Model: ', mae_lasso)

Predictions:
  [36249.43889789 20163.56243617 30465.88439844 ... 32526.98079274
 25040.10621769 39581.57774515]
MAE for Lasso Model:  8834.294987558691


In [12]:
from sklearn.linear_model import ElasticNet
elasticNetModel = ElasticNet() # Instance.
elasticNetModel.fit(x_train, y_train) # Training.

salary_predicted_elasticNet = elasticNetModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_elasticNet)

from sklearn.metrics import mean_absolute_error
mae_elasticNet = mean_absolute_error(y_valid, salary_predicted_elasticNet)
print('MAE for Elastic Net Model: ', mae_elasticNet)

Predictions:
  [35811.18015834 32896.16153646 35328.01934386 ... 33810.29474899
 33205.52933158 35922.86842709]
MAE for Elastic Net Model:  12888.346719289186


In [13]:
from sklearn.ensemble import RandomForestRegressor
import numpy as np

RandomFRModel = RandomForestRegressor(n_jobs=-1) # Instance.

In [14]:
RandomFRModel.fit(x_train, np.ravel(y_train)) # Training.

RandomForestRegressor(n_jobs=-1)

In [15]:

salary_predicted_RandomFR = RandomFRModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_RandomFR)

from sklearn.metrics import mean_absolute_error
mae_RandomFR = mean_absolute_error(y_valid, salary_predicted_RandomFR.ravel())
print('MAE for RandomForestRegressor Model: ', mae_lasso)

Predictions:
  [32479.58454046 30664.11458944 27278.71369819 ... 20893.28166667
 27511.01195345 24248.0786062 ]
MAE for RandomForestRegressor Model:  8834.294987558691


---

# 01 - Baixando & Importando as bibliotecas necessárias

Inicialmente vamos baixar as bibliotecas necessárias para nossa análise (Eu já tenho todas baixadas no meu ambiente virtual mas você pode remover o comentário e baixar para sua máquina local ou Ambiente Virtual).

In [None]:
# !pip install --upgrade -r ../requirements.txt

# 02 - Extraindo o conjunto de dados de teste

> Inicialmente, a primeira coisa que vamos fazer é pegar o conjunto de dados de teste disponibilizado pelo **Adzuna**.

**NOTE:**  
Vale lembrar que nesse conjunto de dados não tem as colunas (feature) **"SalaryRaw"** e **"SalaryNormalized"**.

In [None]:
with py7zr.SevenZipFile("../datasets/Test_rev1.7z", mode='r') as archive:
  archive.extractall(path="/tmp") # For Linux users.

In [None]:
testing_df = pd.read_csv("/tmp/Test_rev1.csv")

In [None]:
testing_df.info()

# 03 - Treinando e validando o load-v1
Bem, como nós sabemos no **load-v1** foi apenas passado para a etapa de *treinamento* e *validação* a coluna **"SalaryNormalized"**.

**NOTE:**  
Esse vai ser o nosso **baseline model**.

**Pegando o conjunto de dados:**

**Overview:**

**Selecionando as variáveis independente e dependente (target):**