# Model Training

In [1]:
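import ast
import os
import sys
import warnings

import pandas as pd

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Make the project root importable so that `utils` can be found
sys.path.append("../")
from utils.machine_learning import DataPreprocess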

In [2]:
preprocess = DataPreprocess()

# Data

In [3]:
df = pd.read_csv("./Data/model_training_payments.csv")
print(f"Shape: {df.shape}")
df.head()

Shape: (77414, 13)


Unnamed: 0,VALOR_A_PAGAR,TAXA,DDD,SEGMENTO_INDUSTRIAL,DOMINIO_EMAIL,PORTE,CEP_2_DIG,RENDA_MES_ANTERIOR,NO_FUNCIONARIOS,INADIMPLENTE,month_vencimento,days_until_due,days_since_registration
0,35516.41,6.99,Nordeste,Serviços,YAHOO,PEQUENO,Nordeste,252109.0,99.0,0,9,36,1821
1,17758.21,6.99,Nordeste,Serviços,YAHOO,PEQUENO,Nordeste,252109.0,99.0,0,9,40,1823
2,17431.96,6.99,Nordeste,Serviços,YAHOO,PEQUENO,Nordeste,252109.0,99.0,0,9,47,1830
3,1341.0,6.99,Nordeste,Serviços,YAHOO,PEQUENO,Nordeste,252109.0,99.0,1,10,65,1834
4,21309.85,6.99,Nordeste,Serviços,YAHOO,PEQUENO,Nordeste,252109.0,99.0,0,9,50,1835


# Preprocessing the Data

### Based on the information gathered during **Feature Engineering**:
- `VALOR_A_PAGAR`, `RENDA_MES_ANTERIOR`, `days_until_due` = Cube root transformation
- `NO_FUNCIONARIOS`, `days_since_registration` = No transformation needed
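The cube root transformation listed above can be illustrated with a minimal sketch. The column names come from this notebook, but the toy values are made up, and the actual `DataPreprocess` class may apply the transformation differently.

```python
import numpy as np
import pandas as pd

# Toy frame reusing the notebook's column names; the values are made up.
sample = pd.DataFrame({
    "VALOR_A_PAGAR": [1000.0, 8000.0, 27000.0],
    "RENDA_MES_ANTERIOR": [125000.0, 216000.0, 343000.0],
    "days_until_due": [8, 27, 64],
})

cbrt_columns = ["VALOR_A_PAGAR", "RENDA_MES_ANTERIOR", "days_until_due"]

# np.cbrt compresses large values and, unlike a log transform,
# is also defined for zero and negative inputs.
transformed = sample.assign(**{col: np.cbrt(sample[col]) for col in cbrt_columns})

print(transformed)
```

The cube root shrinks the long right tail of skewed columns while keeping the ordering of the values intact.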

### For the preprocessing step, the data is split into training and test sets. The numerical columns are transformed as listed above, and all numerical columns are then scaled with the Robust Scaler, both because of the outliers and because `TAXA` and `month_vencimento` do not go through the statistical transformation step. The categorical columns are one-hot encoded. The fitted preprocessor is saved in the **Artifacts** folder.
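The steps described above could be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the code inside `DataPreprocess`: the helper `build_preprocessor`, the toy data (including the `GRANDE` and `Comércio` values), the reduced column set, and the temporary file path are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, RobustScaler


def build_preprocessor(cbrt_cols, plain_num_cols, cat_cols):
    """Hypothetical helper: cube root + robust scaling, plus one-hot encoding."""
    cbrt_then_scale = Pipeline([
        ("cbrt", FunctionTransformer(np.cbrt)),
        ("scale", RobustScaler()),
    ])
    return ColumnTransformer(
        [
            ("cbrt_num", cbrt_then_scale, cbrt_cols),
            ("plain_num", RobustScaler(), plain_num_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ],
        sparse_threshold=0,  # always return a dense array
    )


# Made-up toy data that only borrows the notebook's column names.
toy = pd.DataFrame({
    "VALOR_A_PAGAR": [35516.41, 17758.21, 17431.96, 1341.0, 21309.85,
                      5200.0, 980.5, 44100.0, 12750.3, 8300.0],
    "RENDA_MES_ANTERIOR": [252109.0] * 5 + [98000.0] * 5,
    "days_until_due": [36, 40, 47, 65, 50, 30, 28, 61, 45, 33],
    "NO_FUNCIONARIOS": [99.0] * 5 + [12.0] * 5,
    "TAXA": [6.99] * 5 + [5.99] * 5,
    "PORTE": ["PEQUENO"] * 5 + ["GRANDE"] * 5,
    "SEGMENTO_INDUSTRIAL": ["Serviços"] * 5 + ["Comércio"] * 5,
    "INADIMPLENTE": [0, 0, 0, 1, 0, 1, 1, 0, 1, 1],
})

X = toy.drop(columns="INADIMPLENTE")
y = toy["INADIMPLENTE"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

preprocessor = build_preprocessor(
    cbrt_cols=["VALOR_A_PAGAR", "RENDA_MES_ANTERIOR", "days_until_due"],
    plain_num_cols=["NO_FUNCIONARIOS", "TAXA"],
    cat_cols=["PORTE", "SEGMENTO_INDUSTRIAL"],
)

# Fit on the training split only, then reuse the fitted object on the test split.
X_tr_t = preprocessor.fit_transform(X_tr)
X_te_t = preprocessor.transform(X_te)

# Persist the fitted preprocessor; a temporary directory stands in for Artifacts.
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = os.path.join(tmp_dir, "preprocessor.joblib")
    joblib.dump(preprocessor, artifact_path)
    restored = joblib.load(artifact_path)
    X_te_restored = restored.transform(X_te)
```

Fitting the scaler and encoder on the training split alone keeps test-set statistics from leaking into the model, which is why the test split only goes through `transform`.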
### For the test dataset, the **test_data** parameter flags whether the data being preprocessed is the training set or the test set. In the test dataset the column `DATA_CADASTRO` contains **Unknown** values, and this column was used together with `DATA_EMISSAO_DOCUMENTO` to create a new feature. The **Unknown** values of `DATA_CADASTRO` are therefore replaced with the corresponding `DATA_EMISSAO_DOCUMENTO` values, so that the new feature `days_since_registration` is **0** wherever `DATA_CADASTRO` was **Unknown**.
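The handling of **Unknown** registration dates might look like the sketch below. It assumes that `days_since_registration` is the number of days between `DATA_CADASTRO` and `DATA_EMISSAO_DOCUMENTO` and that both columns hold ISO-formatted date strings. The function name `add_days_since_registration` and the sample rows are hypothetical.

```python
import pandas as pd


def add_days_since_registration(frame):
    """Hypothetical helper: derive days_since_registration, mapping Unknown to 0."""
    out = frame.copy()
    # Where DATA_CADASTRO is "Unknown", fall back to the document issue date.
    registration = out["DATA_CADASTRO"].where(
        out["DATA_CADASTRO"] != "Unknown", out["DATA_EMISSAO_DOCUMENTO"]
    )
    issued = pd.to_datetime(out["DATA_EMISSAO_DOCUMENTO"])
    registered = pd.to_datetime(registration)
    # Assumed definition: issue date minus registration date, in whole days.
    out["days_since_registration"] = (issued - registered).dt.days
    return out


# Made-up rows: one known registration date and one Unknown.
raw_test = pd.DataFrame({
    "DATA_CADASTRO": ["2015-01-01", "Unknown"],
    "DATA_EMISSAO_DOCUMENTO": ["2015-01-11", "2021-06-30"],
})

result = add_days_since_registration(raw_test)
print(result)
```

Because the fallback date equals the issue date, the difference for an **Unknown** row is zero days.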

In [4]:
X_train, X_test, y_train, y_test = preprocess.preprocess_data(df, target_name="INADIMPLENTE", test_size=0.3)