<h1><center>Predict with a Trained Model</center></h1>

In this tutorial we look at the simplest way of making predictions inside the Carol platform. This notebook fetches data from Carol, makes predictions and then sends them back to Carol.

## 0. Installing required packages

Apart from the well-known pandas and numpy libraries, we are going to use:
 - sklearn: popular machine learning library comprising datasets, preprocessing and machine learning models.
 - pycarol: TOTVS library developed to assist with data management on the Carol platform.

If you have not yet installed these libraries, just uncomment and run the cell below.

In [None]:
#!pip install pycarol
#!pip install scikit-learn

## 1. Fetching data from Carol

We start by defining a connection to the Carol platform. To make the connection, though, we need to set up the security authorization for the environment, which is done through the __access token__.

In this example the credentials are loaded from a `.env` file at run time instead of being passed directly in the code. Hard-coding credentials is not a good approach for a long-term solution, especially if the code needs to go through version control servers. It is better to store them in external files, preferably encrypted, and keep those files out of the repository.
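As a minimal sketch of the run-time loading idea, the helper below reads settings from environment variables and fails early when one is missing. The helper `read_credentials` and the variable names `EXAMPLE_TENANT` and `EXAMPLE_ACCESS_TOKEN` are hypothetical and are not the names pyCarol itself expects.

```python
import os

# Hypothetical variable names, used only for this illustration.
REQUIRED_VARS = ("EXAMPLE_TENANT", "EXAMPLE_ACCESS_TOKEN")


def read_credentials(env=os.environ, required=REQUIRED_VARS):
    """Return the required settings from `env`, raising if any is missing or empty."""
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

A check like this would typically run right after the `.env` file has been loaded, so that a missing value is reported before any request is made.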

In [1]:
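from pycarol import Carol, Staging, Storage
from dotenv import load_dotenv

# Load the connection settings from a local .env file (adjust the path to your own).
load_dotenv("/home/jro/wk/totvs/pyCarol/.env")
# With no arguments, Carol() is expected to pick up its settings from the environment.
login = Carol()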

Now we use the authenticated connection to fetch the data from the staging table.

In [5]:
staging = Staging(login)

conn = "boston_house_price"
stag = "samples"

X_cols = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "sample"]
roi_cols = X_cols

data = staging.fetch_parquet(
    staging_name=stag,
    connector_name=conn,
    cds=True,
    columns=roi_cols,
)

100%|██████████| 1/1 [00:01<00:00,  1.46s/it]


Reviewing some sample records:

In [6]:
data.sample(3)

Unnamed: 0,AGE,B,CHAS,CRIM,DIS,INDUS,LSTAT,NOX,PTRATIO,RAD,RM,TAX,ZN,sample
222,8.4,396.9,0.0,0.36894,8.9067,5.86,3.54,0.431,19.1,7.0,8.259,330.0,22.0,253.0
348,82.8,393.39,0.0,0.26938,3.2628,9.9,7.9,0.544,18.4,4.0,6.266,304.0,0.0,313.0
485,95.4,352.58,0.0,8.05579,2.4298,18.1,18.14,0.584,20.2,24.0,5.427,666.0,0.0,474.0


## 2. Predicting

Splitting the dataset into training and test parts. Only the test part is kept for prediction.

In [23]:
from sklearn.model_selection import train_test_split

_, X_test = train_test_split(data[X_cols], test_size=0.20, random_state=1)
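The split is only useful here if it is reproducible. The sketch below uses a toy list instead of the Carol data to show that a fixed `random_state` yields the same partition on every call. Whether this matches the split used when the model was trained is an assumption about the training notebook, which is not shown here.

```python
from sklearn.model_selection import train_test_split

rows = list(range(10))  # toy stand-in for the rows of `data`

# Two independent calls with the same seed and test size.
_, test_a = train_test_split(rows, test_size=0.20, random_state=1)
_, test_b = train_test_split(rows, test_size=0.20, random_state=1)

same_split = test_a == test_b  # the partition is deterministic for a fixed seed
```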

Download the trained model from Carol storage:

In [13]:
stg = Storage(login)
mlp_model = stg.load("bhp_mlp_regressor", format='pickle')
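To make the `format='pickle'` argument more concrete, here is a local round trip with Python's standard `pickle` module. It is only a sketch: the dictionary is a toy stand-in for a fitted estimator, and how `Storage` transfers the bytes is not covered.

```python
import pickle

# Toy stand-in for a fitted estimator; any picklable Python object works.
model_like = {"name": "bhp_mlp_regressor", "hidden_layer_sizes": (100,)}

payload = pickle.dumps(model_like)  # bytes that could be stored remotely
restored = pickle.loads(payload)    # rebuild the object from those bytes
```

Unpickling executes code embedded in the payload, so only load pickles from a source you trust.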



In [14]:
mlp_model

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(100,), learning_rate='constant',
             learning_rate_init=0.001, max_fun=15000, max_iter=500,
             momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
             power_t=0.5, random_state=1, shuffle=True, solver='adam',
             tol=0.0001, validation_fraction=0.1, verbose=False,
             warm_start=False)

Running predictions on the test set.

In [24]:
test_cols = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
y_pred = mlp_model.predict(X_test[test_cols])
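`test_cols` repeats `X_cols` without the `sample` column, which appears to act as a record identifier rather than a model input. One way to avoid maintaining two lists by hand is to derive the feature list, as sketched below. The names `ID_COL` and `feature_cols` are introduced for illustration.

```python
X_cols = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
          "RAD", "TAX", "PTRATIO", "B", "LSTAT", "sample"]

ID_COL = "sample"  # assumed to identify a record, not to describe it

# Keep the original column order and drop only the identifier.
feature_cols = [col for col in X_cols if col != ID_COL]
```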

In [30]:
import pandas as pd
predictions = X_test[["sample"]].copy()
predictions["predicted_value"] = y_pred
predictions["prediction_date"] = pd.Timestamp.now()

In [31]:
predictions

Unnamed: 0,sample,predicted_value,prediction_date
307,387.0,11.091209,2021-08-26 18:08:01.413006
343,308.0,33.918616,2021-08-26 18:08:01.413006
47,51.0,27.171708,2021-08-26 18:08:01.413006
67,97.0,34.670554,2021-08-26 18:08:01.413006
362,327.0,22.916432,2021-08-26 18:08:01.413006
...,...,...,...
92,79.0,25.738413,2021-08-26 18:08:01.413006
224,251.0,29.765805,2021-08-26 18:08:01.413006
110,166.0,41.352617,2021-08-26 18:08:01.413006
426,422.0,21.862447,2021-08-26 18:08:01.413006


## 3. Saving the predictions to Carol

In [32]:
staging = Staging(login)
staging.send_data(
    "predictions",
    data=predictions,
    connector_name="model",
)

fetched crosswalk  ['sample']
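Before sending data, it can help to confirm that the key column is unique and that no values are missing. The sketch below uses a hypothetical helper, `check_predictions`, and a small toy frame. It assumes `sample` is the record key, which the `fetched crosswalk ['sample']` message suggests but does not prove.

```python
import pandas as pd


def check_predictions(df, key="sample"):
    """Return a list of problems found in a predictions frame (empty if none)."""
    problems = []
    if df[key].duplicated().any():
        problems.append(f"duplicate values in '{key}'")
    if df.isna().any().any():
        problems.append("missing values present")
    return problems


# Toy data shaped like the predictions frame; the values are illustrative only.
toy = pd.DataFrame({
    "sample": [387.0, 308.0, 51.0],
    "predicted_value": [11.09, 33.92, 27.17],
    "prediction_date": pd.Timestamp("2021-08-26 18:08:01"),
})
issues = check_predictions(toy)
```

An empty list means both checks passed. Otherwise the messages say what to fix before calling `send_data`.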
