# Training & Validation
This **Jupyter Notebook** trains *Machine Learning* models on one or more *features* and checks how well those models are learning, using the [Mean Absolute Error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) as the validation metric.
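
To make the metric concrete, the sketch below computes the MAE by hand (the mean of the absolute differences between true and predicted values) and compares it with scikit-learn's `mean_absolute_error`. The salary values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical salaries, for illustration only.
y_true = np.array([25000, 30000, 27500])
y_pred = np.array([27000, 29000, 27500])

# MAE = mean(|y_true - y_pred|)
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Because the MAE is expressed in the same unit as the target, a value of 8000 here would mean the predictions miss the normalized salary by 8000 on average.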

---

# 01 - Installing & Importing the required libraries

First, let's install the libraries needed for the analysis. (I already have them all installed in my virtual environment, but you can uncomment the line below to install them on your local machine or in a virtual environment.)

In [1]:
# !pip install --upgrade -r ../requirements.txt

# 02 - Extracting the test dataset

> The first thing we'll do is get the test dataset provided by **Adzuna**.

**NOTE:**  
Keep in mind that this dataset does not have the **"SalaryRaw"** and **"SalaryNormalized"** columns (features).

In [2]:
import py7zr

with py7zr.SevenZipFile("../datasets/Test_rev1.7z", mode='r') as archive:
  archive.extractall(path="/tmp") # For Linux users.
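
The cell above hard-codes `/tmp`, which only exists on Linux-like systems. A portable variant could look like the sketch below, assuming the standard library's `tempfile.gettempdir()` is an acceptable target. The helper name `extract_archive` is hypothetical and not part of the notebook.

```python
import tempfile
from pathlib import Path


def extract_archive(archive_path, target_dir=None):
    """Extract a .7z archive and return the directory it was extracted to."""
    import py7zr  # imported lazily so the helper can be defined without it

    target = Path(target_dir) if target_dir else Path(tempfile.gettempdir())
    target.mkdir(parents=True, exist_ok=True)
    with py7zr.SevenZipFile(archive_path, mode="r") as archive:
        archive.extractall(path=target)
    return target


# Platform-specific temporary directory ("/tmp" on most Linux systems).
tmp_dir = Path(tempfile.gettempdir())
csv_path = tmp_dir / "Test_rev1.csv"
print(csv_path)
```

With this, `extract_archive("../datasets/Test_rev1.7z")` followed by `pd.read_csv(csv_path)` would play the role of the two cells in this section.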

In [3]:
import pandas as pd
testing_df = pd.read_csv("/tmp/Test_rev1.csv")

In [4]:
testing_df.info()
testing_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122463 entries, 0 to 122462
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Id                  122463 non-null  int64 
 1   Title               122463 non-null  object
 2   FullDescription     122463 non-null  object
 3   LocationRaw         122463 non-null  object
 4   LocationNormalized  122463 non-null  object
 5   ContractType        33013 non-null   object
 6   ContractTime        90702 non-null   object
 7   Company             106202 non-null  object
 8   Category            122463 non-null  object
 9   SourceName          122463 non-null  object
dtypes: int64(1), object(9)
memory usage: 9.3+ MB


| | Id | Title | FullDescription | LocationRaw | LocationNormalized | ContractType | ContractTime | Company | Category | SourceName |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11888454 | Business Development Manager | The Company: Our client is a national training... | Tyne Wear, North East | Newcastle Upon Tyne | | permanent | Asset Appointments | Teaching Jobs | cv-library.co.uk |
| 1 | 11988350 | Internal Account Manager | The Company: Founded in **** our client is a U... | Tyne and Wear, North East | Newcastle Upon Tyne | | permanent | Asset Appointments | Consultancy Jobs | cv-library.co.uk |
| 2 | 12612558 | Engineering Systems Analysts | Engineering Systems Analysts Surrey ****K Loca... | Surrey, South East, South East | Surrey | | permanent | Gregory Martin International | Engineering Jobs | cv-library.co.uk |
| 3 | 12613014 | CIS Systems Engineering Consultant | CIS Systems Engineering Consultant Bristol So... | Bristol, South West, South West | Bristol | | permanent | Gregory Martin International | Engineering Jobs | cv-library.co.uk |
| 4 | 22454872 | CNC Miller / Programmer Fanac | CNC Miller / Programmer Fanac Fleet, Hampshire... | Fleet, Hampshire | Fleet | | permanent | Gregory Martin International | Manufacturing Jobs | cv-library.co.uk |
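
The `info()` output above shows that **ContractType**, **ContractTime** and **Company** have fewer non-null values than the 122463 rows (roughly 73%, 26% and 13% missing, respectively). A quick way to get those shares directly is sketched below on a toy frame whose values are illustrative; on the real data the same expression would be applied to `testing_df`.

```python
import pandas as pd

# Toy frame mimicking three of the columns above; values are illustrative.
toy_df = pd.DataFrame({
    "Title": ["Business Development Manager", "Internal Account Manager",
              "CNC Miller", "Systems Analyst"],
    "ContractType": [None, "full_time", None, None],
    "ContractTime": ["permanent", "permanent", None, "contract"],
})

# Fraction of missing values per column, largest first.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```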


# 03 - Training & Validating the Loads
> In the **Training & Validation** stage we use the columns (features) already pre-processed in each **Load** to train several regression models and try to find the best one as new columns (features) are pre-processed.

## 03.1 - Training & Validating Load-v1
As we know, in **load-v1** only the **"Title"** and **"SalaryNormalized"** columns (features) were passed on to the **training & validation** stage.

So we have the following variables (features) for our model:

 - **Independent variable:**
   - Title *(with CountVectorizer)*
 - **Dependent variable:**
   - SalaryNormalized (normalized by Adzuna)

**NOTE:**  
This will be our **baseline model**.
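
The vectorized **Title** matrix is loaded from disk further down, so the vectorization step itself is not shown here. As a rough sketch of what `CountVectorizer` does, assuming default settings (the notebook's actual pre-processing parameters are not visible in this section), three toy titles become a sparse matrix of token counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy titles for illustration; the real matrix comes from the pre-processing step.
titles = [
    "Business Development Manager",
    "Internal Account Manager",
    "Engineering Systems Analysts",
]

vectorizer = CountVectorizer()
title_matrix = vectorizer.fit_transform(titles)  # sparse matrix, one row per title

print(title_matrix.shape)
print(sorted(vectorizer.vocabulary_))  # learned tokens, lower-cased by default
```

Each column corresponds to one token of the learned vocabulary, and each cell holds how many times that token appears in the title.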

### Getting the dependent variable (target):

In [15]:
import py7zr

with py7zr.SevenZipFile("../datasets/Train_rev1.7z", mode='r') as archive:
  archive.extractall(path="/tmp") # For Linux users.

In [16]:
import pandas as pd

full_df = pd.read_csv("/tmp/Train_rev1.csv")
df_SalaryNormalized = full_df[["SalaryNormalized"]]
df_SalaryNormalized.info()
df_SalaryNormalized.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   SalaryNormalized  244768 non-null  int64
dtypes: int64(1)
memory usage: 1.9 MB


| | SalaryNormalized |
|---|---|
| 0 | 25000 |
| 1 | 30000 |
| 2 | 30000 |
| 3 | 27500 |
| 4 | 25000 |


### Getting the independent variable(s):

In [17]:
import scipy.sparse
df_title_vectorized = scipy.sparse.load_npz('df_title_vectorized.npz')

In [18]:
df_title_vectorized

<244768x14917 sparse matrix of type '<class 'numpy.int64'>'
	with 923171 stored elements in Compressed Sparse Row format>
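
The output above describes 244768 rows (job ads), 14917 columns (vocabulary terms) and 923171 stored entries, which is roughly 3.8 stored counts per title. The sketch below shows how such figures can be derived from any SciPy sparse matrix; the small `toy_matrix` is a stand-in, and in the notebook the same lines would be applied to `df_title_vectorized`.

```python
import scipy.sparse

# Small stand-in matrix; in the notebook this would be df_title_vectorized.
toy_matrix = scipy.sparse.csr_matrix([
    [1, 0, 0, 2],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
])

n_rows, n_cols = toy_matrix.shape
density = toy_matrix.nnz / (n_rows * n_cols)  # share of stored (nonzero) cells
avg_per_row = toy_matrix.nnz / n_rows         # average stored entries per row

print(f"density={density:.4f}, avg stored entries per row={avg_per_row:.2f}")
```

A very low density is the reason the features are kept in a sparse format instead of a dense array.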

### Splitting the variables into "x" and "y":

In [19]:
x = df_title_vectorized
y = df_SalaryNormalized

### Splitting the data into training and validation sets:

In [20]:
from sklearn.model_selection import train_test_split
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.3, random_state=42)
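
With `test_size=0.3`, 30% of the rows are held out for validation; for the 244768 rows here that is roughly 171,000 training rows and 73,000 validation rows. The sketch below illustrates the same call on ten dummy samples (the `x_toy` and `y_toy` names are illustrative), where the split sizes are easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten dummy samples stand in for the real feature matrix and target.
x_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

x_tr, x_va, y_tr, y_va = train_test_split(
    x_toy, y_toy, test_size=0.3, random_state=42
)

print(x_tr.shape, x_va.shape)
```

Fixing `random_state` keeps the split identical across runs, so the MAE values of different models are computed on the same validation rows.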

### Training regression models:

#### Linear Regression:

In [46]:
# Linear Regression.
from sklearn.linear_model import LinearRegression
lrModel = LinearRegression() # Instance.
lrModel.fit(x_train, y_train) # Training.

salary_predicted_ln = lrModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_ln)

from sklearn.metrics import mean_absolute_error
mae_ln = mean_absolute_error(y_valid, salary_predicted_ln)
print('MAE for Linear Regression Model:', round(mae_ln))

Predictions:
  [[36329.20127848]
 [24914.92159844]
 [30252.21978656]
 ...
 [21815.84460272]
 [30797.87181137]
 [34741.34393378]]
MAE for Linear Regression Model: 8658


#### Ridge Regression (L2):

In [48]:
from sklearn.linear_model import Ridge
ridgeModel = Ridge() # Instance.
ridgeModel.fit(x_train, y_train) # Training.

salary_predicted_ridge = ridgeModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_ridge)

from sklearn.metrics import mean_absolute_error
mae_ridge = mean_absolute_error(y_valid, salary_predicted_ridge)
print('MAE for Ridge Model:', round(mae_ridge))

Predictions:
  [[36282.66675248]
 [17353.9188599 ]
 [30256.92827149]
 ...
 [23060.25942236]
 [28012.35552722]
 [34769.28500259]]
MAE for Ridge Model: 8617


#### Lasso Regression (L1):

In [49]:
from sklearn.linear_model import Lasso
lassoModel = Lasso() # Instance.
lassoModel.fit(x_train, y_train) # Training.

salary_predicted_lasso = lassoModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_lasso)

from sklearn.metrics import mean_absolute_error
mae_lasso = mean_absolute_error(y_valid, salary_predicted_lasso)
print('MAE for Lasso Model:', round(mae_lasso))

Predictions:
  [36249.43889789 20163.56243617 30465.88439844 ... 32526.98079274
 25040.10621769 39581.57774515]
MAE for Lasso Model: 8834


#### Elastic Net:

In [50]:
from sklearn.linear_model import ElasticNet
elasticNetModel = ElasticNet() # Instance.
elasticNetModel.fit(x_train, y_train) # Training.

salary_predicted_elasticNet = elasticNetModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_elasticNet)

from sklearn.metrics import mean_absolute_error
mae_elasticNet = mean_absolute_error(y_valid, salary_predicted_elasticNet)
print('MAE for Elastic Net Model:', round(mae_elasticNet))

Predictions:
  [35811.18015834 32896.16153646 35328.01934386 ... 33810.29474899
 33205.52933158 35922.86842709]
MAE for Elastic Net Model: 12888


#### Random Forest Regressor:

In [40]:
from sklearn.ensemble import RandomForestRegressor
import numpy as np

RandomFRModel = RandomForestRegressor(n_jobs=-1) # Instance.
RandomFRModel.fit(x_train, np.ravel(y_train)) # Training.

salary_predicted_RandomFR = RandomFRModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_RandomFR)

from sklearn.metrics import mean_absolute_error
mae_RandomFR = mean_absolute_error(y_valid, salary_predicted_RandomFR.ravel())
print('MAE for RandomForestRegressor Model:', round(mae_RandomFR))

Predictions:
  [31102.85583861 30417.05190476 27328.22505411 ... 19981.93197619
 27823.39497785 23481.06557147]
MAE for RandomForestRegressor Model:  6703.093421570222
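
The five cells above repeat the same fit, predict and score pattern. One way to condense it is a small helper like the sketch below. The function name `evaluate_models` and the synthetic data are assumptions made so the example runs on its own; in the notebook the existing `x_train`, `x_valid`, `y_train` and `y_valid` would be passed instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def evaluate_models(models, x_train, x_valid, y_train, y_valid):
    """Fit each model and return {name: validation MAE}, lowest MAE first."""
    scores = {}
    for name, model in models.items():
        model.fit(x_train, np.ravel(y_train))
        predictions = model.predict(x_valid)
        scores[name] = mean_absolute_error(np.ravel(y_valid), predictions)
    return dict(sorted(scores.items(), key=lambda item: item[1]))


# Synthetic data so the sketch runs on its own; swap in the notebook's split.
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(200, 5))
y_demo = x_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)
x_tr, x_va, y_tr, y_va = train_test_split(x_demo, y_demo, test_size=0.3, random_state=42)

demo_scores = evaluate_models(
    {"Linear Regression": LinearRegression(), "Ridge": Ridge()},
    x_tr, x_va, y_tr, y_va,
)
for name, mae in demo_scores.items():
    print(f"MAE for {name} Model: {mae:.3f}")
```

Flattening the target with `np.ravel` also makes every model return one-dimensional predictions, which avoids the mixed output shapes seen in the cells above.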


---

## 03.2 - Training & Validating Load-v2
For our **2nd Load (Load-v2)** we have one more feature to work with during the **training & validation** stage.

We now have the following variables (features) for our models:

 - **Independent variables:**
   - Title *(with CountVectorizer)*
   - FullDescription *(with CountVectorizer)*
 - **Dependent variable:**
   - SalaryNormalized (normalized by Adzuna)

### Getting the dependent variable (target):

In [5]:
import py7zr

with py7zr.SevenZipFile("../datasets/Train_rev1.7z", mode='r') as archive:
  archive.extractall(path="/tmp") # For Linux users.

In [6]:
import pandas as pd

full_df = pd.read_csv("/tmp/Train_rev1.csv")
df_SalaryNormalized = full_df[["SalaryNormalized"]]
df_SalaryNormalized.info()
df_SalaryNormalized.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   SalaryNormalized  244768 non-null  int64
dtypes: int64(1)
memory usage: 1.9 MB


| | SalaryNormalized |
|---|---|
| 0 | 25000 |
| 1 | 30000 |
| 2 | 30000 |
| 3 | 27500 |
| 4 | 25000 |


### Getting the independent variable(s):

In [7]:
import scipy.sparse
df_title_vectorized = scipy.sparse.load_npz('df_title_vectorized.npz')
df_fulldescription_vectorized = scipy.sparse.load_npz('df_fulldescription_vectorized.npz')

### Splitting the variables into "x" and "y":

**NOTE:**  
We start by training our models with only the **"FullDescription"** variable (feature), just to test whether they learn better from it than from the **"Title"** variable.

In [8]:
x = df_fulldescription_vectorized
y = df_SalaryNormalized
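
This first run uses **FullDescription** alone. If both features were to be used together later, one common option (an assumption, not something shown in this notebook) is to concatenate the two sparse matrices column-wise with `scipy.sparse.hstack`. The toy matrices below stand in for `df_title_vectorized` and `df_fulldescription_vectorized`.

```python
import scipy.sparse

# Toy stand-ins for df_title_vectorized and df_fulldescription_vectorized.
title_part = scipy.sparse.csr_matrix([[1, 0], [0, 1], [1, 1]])
description_part = scipy.sparse.csr_matrix([[0, 2, 0], [1, 0, 0], [0, 0, 3]])

# Column-wise concatenation: same rows (job ads), features side by side.
x_combined = scipy.sparse.hstack([title_part, description_part], format="csr")

print(x_combined.shape)
```

Both matrices must have the same number of rows in the same order, which holds here as long as they were built from the same training file.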

### Splitting the data into training and validation sets:

In [9]:
from sklearn.model_selection import train_test_split
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.3, random_state=42)

### Training regression models:

#### Linear Regression:

In [9]:
# Linear Regression.
from sklearn.linear_model import LinearRegression
lrModel = LinearRegression() # Instance.
lrModel.fit(x_train, y_train) # Training.

salary_predicted_ln = lrModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_ln)

from sklearn.metrics import mean_absolute_error
mae_ln = mean_absolute_error(y_valid, salary_predicted_ln)
print('MAE for Linear Regression Model:', round(mae_ln))

Predictions:
  [[23032.87085644]
 [33610.17330689]
 [24136.00315204]
 ...
 [21469.48352848]
 [38877.12291253]
 [33263.10392749]]
MAE for Linear Regression Model: 10844


#### Ridge Regression (L2):

In [10]:
from sklearn.linear_model import Ridge
ridgeModel = Ridge() # Instance.
ridgeModel.fit(x_train, y_train) # Training.

salary_predicted_ridge = ridgeModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_ridge)

from sklearn.metrics import mean_absolute_error
mae_ridge = mean_absolute_error(y_valid, salary_predicted_ridge)
print('MAE for Ridge Model:', round(mae_ridge))

Predictions:
  [[26456.86360084]
 [23387.94309581]
 [25311.32956758]
 ...
 [21352.64221595]
 [31799.60241475]
 [37834.19762852]]
MAE for Ridge Model: 8191


#### Lasso Regression (L1):

In [11]:
from sklearn.linear_model import Lasso
lassoModel = Lasso() # Instance.
lassoModel.fit(x_train, y_train) # Training.

salary_predicted_lasso = lassoModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_lasso)

from sklearn.metrics import mean_absolute_error
mae_lasso = mean_absolute_error(y_valid, salary_predicted_lasso)
print('MAE for Lasso Model:', round(mae_lasso))

Predictions:
  [28389.19530136 18018.2873558  30094.66685536 ... 18804.51829832
 27585.07234606 37469.51276253]
MAE for Lasso Model: 8338


#### Elastic Net:

In [12]:
from sklearn.linear_model import ElasticNet
elasticNetModel = ElasticNet() # Instance.
elasticNetModel.fit(x_train, y_train) # Training.

salary_predicted_elasticNet = elasticNetModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_elasticNet)

from sklearn.metrics import mean_absolute_error
mae_elasticNet = mean_absolute_error(y_valid, salary_predicted_elasticNet)
print('MAE for Elastic Net Model:', round(mae_elasticNet))

Predictions:
  [31254.34235054 24438.25307445 37428.63777898 ... 29305.71927444
 28438.02246171 32729.01976647]
MAE for Elastic Net Model: 9812


#### Random Forest Regressor:

In [None]:
from sklearn.ensemble import RandomForestRegressor
import numpy as np

RandomFRModel = RandomForestRegressor(n_jobs=-1, max_features=1.0) # Instance. 1.0 uses all features, like the old "auto" (removed in scikit-learn 1.3).
RandomFRModel.fit(x_train, np.ravel(y_train)) # Training.

salary_predicted_RandomFR = RandomFRModel.predict(x_valid)
print('Predictions:\n ', salary_predicted_RandomFR)

from sklearn.metrics import mean_absolute_error
mae_RandomFR = mean_absolute_error(y_valid, salary_predicted_RandomFR.ravel())
print('MAE for RandomForestRegressor Model:', round(mae_RandomFR))

# Summary

 - **In *Load-v1* we had the following setup:**
   - **Variables (features):**
     - Independent:
       - Title *(with CountVectorizer)*
     - Dependent:
       - SalaryNormalized (normalized by Adzuna)
   - **Using MAE as the evaluation metric, we got the following results (sorted from lowest to highest):**
     - MAE for RandomForestRegressor Model: 6703
     - MAE for Ridge Model: 8617
     - MAE for Linear Regression Model: 8658
     - MAE for Lasso Model: 8834
     - MAE for Elastic Net Model: 12888
 - **In *Load-v2* we had the following setup:**
   - **Variables (features):**
     - Independent:
       - Title *(with CountVectorizer)*
       - FullDescription *(with CountVectorizer)*
     - Dependent:
       - SalaryNormalized (normalized by Adzuna)
   - **The first Load-v2 training run used only the "FullDescription" variable, to check whether our models learn better from "FullDescription" alone than from "Title". The results were as follows (sorted from lowest to highest):**
     - MAE for Ridge Model: 8191
     - MAE for Lasso Model: 8338
     - MAE for Elastic Net Model: 9812
     - MAE for Linear Regression Model: 10844
     - MAE for RandomForestRegressor Model: did not finish.
       - This model ran for more than 24 hours (almost 40) without finishing training. Because of this, I will look for another approach to reduce the time cost.
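
One possible direction for cutting the Random Forest training time is to bound the work done per tree and in total. The sketch below is an assumption, not the approach the notebook ends up using: the parameter values are illustrative and untuned, and the data is synthetic so the example finishes quickly.

```python
import numpy as np
import scipy.sparse
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic sparse data so the sketch runs quickly; not the Adzuna data.
rng = np.random.default_rng(0)
x_demo = scipy.sparse.random(300, 50, density=0.1, format="csr", random_state=0)
y_demo = np.asarray(x_demo.sum(axis=1)).ravel() + rng.normal(scale=0.1, size=300)

# Illustrative settings that bound the work done per tree and in total.
cheap_forest = RandomForestRegressor(
    n_estimators=50,      # fewer trees than the default of 100
    max_depth=12,         # cap tree depth
    max_features="sqrt",  # consider only a subset of features per split
    max_samples=0.5,      # each tree sees half of the training rows
    n_jobs=-1,
    random_state=42,
)
cheap_forest.fit(x_demo[:200], y_demo[:200])

mae_demo = mean_absolute_error(y_demo[200:], cheap_forest.predict(x_demo[200:]))
print("MAE on synthetic data:", round(mae_demo, 3))
```

Each of these limits trades some accuracy for speed, so the resulting MAE would need to be compared against the Ridge result of 8191 before preferring this model.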