<h1><center>Data Ingestion</center></h1>

In this tutorial, we explore how to import the Boston House Prices dataset from scikit-learn and load it into the Carol platform.

## 0. Installing required packages

Apart from the well-known pandas and NumPy libraries, we are going to use:
 - sklearn (installed as `scikit-learn`): a popular machine learning library comprising datasets, preprocessing tools and machine learning models.
 - pycarol: a TOTVS library developed to assist with data management on the Carol platform.

In [None]:
!pip install pycarol

In [None]:
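# The PyPI name is scikit-learn; load_boston was removed in version 1.2, so an older release is pinned
!pip install "scikit-learn<1.2"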

## 1. Fetching data from source

Loading data from scikit-learn and storing it in a dataframe.

In [5]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston

# Loading the dataset from scikit-learn
boston_dataset = load_boston()

# Composing the column names
column_names = list(boston_dataset['feature_names']) + ['target']

# Creating a dataframe
boston_dataframe = pd.DataFrame(
    data=np.c_[boston_dataset['data'], boston_dataset['target']],
    columns=column_names)

# Using the row number as a primary key
boston_dataframe["sample"] = boston_dataframe.index

Checking a sample of the data to make sure it looks correct.

In [6]:
boston_dataframe.sample(1)

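| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 349 | 0.02899 | 40.0 | 1.25 | 0.0 | 0.429 | 6.939 | 34.5 | 8.7921 | 1.0 | 335.0 | 19.7 | 389.85 | 5.89 | 26.6 | 349 |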
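Beyond eyeballing one random row, a few automated checks can help confirm the data is in good shape before sending it anywhere. The following is a minimal sketch, not part of the original notebook: the helper name `summarize_quality` and the small stand-in dataframe are assumptions made for illustration.

```python
import pandas as pd


def summarize_quality(df):
    """Return a few basic quality indicators for a dataframe."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # Total number of missing cells across all columns
        "missing_values": int(df.isna().sum().sum()),
        # Columns that are not numeric (unexpected for this dataset)
        "non_numeric_columns": list(df.select_dtypes(exclude="number").columns),
    }


# Small stand-in for boston_dataframe (hypothetical values, one missing cell)
toy = pd.DataFrame({
    "CRIM": [0.02899, 0.10000],
    "ZN": [40.0, None],
    "target": [26.6, 21.0],
    "sample": [0, 1],
})

report = summarize_quality(toy)
print(report)
```

Calling `summarize_quality(boston_dataframe)` in the notebook would give the same kind of summary for the real dataset.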


## 2. Sending data to Carol

We start by defining a connection to the Carol platform. To make the connection, though, we need to set up the security authorization for the environment, which is done through the __access token__.

In this example we simply pass the credentials directly in the code, which is not the best approach for a long-term solution, especially if this code needs to go through version control servers. A better solution is to store these credentials in external files, preferably encrypted, and load them at run time.

In [8]:
from pycarol import Carol, Staging, ApiKeyAuth

# =================== AUTHENTICATION ON CAROL ===================
# Currently set to: Juvenal José Duarte
connectors = {"mltutorial": 'd6XXXXXeX6X3XXXXX75aXXX54XXXX1Xb'}
conn_tokens = {"mltutorial": 'XXXX40566adcXXXXXX54XXaX8X8XXXa2'}
# ===============================================================

login = Carol(domain="mltutorial",
        app_name="bostonhouseprice",
        organization='datascience',
        auth=ApiKeyAuth(conn_tokens["mltutorial"]),
        connector_id=connectors["mltutorial"])
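As an alternative to the hard-coded credentials above, the values can be read at run time from outside the code. The sketch below uses environment variables as one possible approach. The variable names (`CAROL_<DOMAIN>_CONNECTOR_ID`, `CAROL_<DOMAIN>_API_TOKEN`) and the helper `load_carol_credentials` are hypothetical; they are not defined by PyCarol.

```python
import os


def load_carol_credentials(domain, environ=os.environ):
    """Read the connector ID and API token for `domain` from environment variables.

    The variable naming scheme is an assumption made for this example.
    """
    prefix = f"CAROL_{domain.upper()}"
    try:
        return {
            "connector_id": environ[f"{prefix}_CONNECTOR_ID"],
            "api_token": environ[f"{prefix}_API_TOKEN"],
        }
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demonstration with an in-memory mapping instead of the real environment
fake_environ = {
    "CAROL_MLTUTORIAL_CONNECTOR_ID": "dummy-connector-id",
    "CAROL_MLTUTORIAL_API_TOKEN": "dummy-api-token",
}
credentials = load_carol_credentials("mltutorial", environ=fake_environ)
print(sorted(credentials))
```

The returned values could then replace the literals in the cell above, for example `ApiKeyAuth(credentials["api_token"])` and `connector_id=credentials["connector_id"]`.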

__Note__: When running similar code inside a Carol App, the authentication may be omitted, since it can be inferred from the user running the app. The simplified code for apps would be as below. Even the domain, app_name and organization parameters can be suppressed when running inside apps; in that case they are retrieved from the environment the app is running on.

In [None]:
''' Simplified version for when running it inside a Carol app
login = Carol(domain="mltutorial",
        app_name="bostonhouseprice",
        organization='datascience')
'''

Now we can go on and send the table stored in the dataframe to a Carol staging.

In [9]:
staging = Staging(login)

staging.send_data(
    # The dataset itself
    data=boston_dataframe, 
    
    # The fields which compose the primary key
    crosswalk_auto_create=['sample'],
    
    # Where to write the data
    staging_name="samples", 
    connector_name="boston_house_price", 
    
    # If the table doesn't exist, auto create the schema
    auto_create_schema=True,
    flexible_schema=False,
    
    # Other options
#    async_send=True,
#    storage_only=True, 
    force=False,
    gzip=True,
)

provided crosswalk  ['sample']
506/506 sent

If everything went well with your data ingestion, you should now be able to see your staging in the user interface by clicking on Connectors in the left panel and selecting your connector. You can view sample records by clicking on View sample data.

__Note__: The same approach can be used to load any tabular data into Carol: just make sure your dataset is stored in a dataframe and use PyCarol to help with the loading process.
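To illustrate with a different source, the sketch below reads a small CSV into a dataframe and adds a row-number key, mirroring what was done with the `sample` column earlier. The CSV content, the helper `add_row_key` and the column name `row_id` are made up for this example.

```python
import io

import pandas as pd


def add_row_key(df, key_name="row_id"):
    """Return a copy of `df` with a new column holding the row number as a key."""
    if key_name in df.columns:
        raise ValueError(f"Column {key_name!r} already exists")
    keyed = df.reset_index(drop=True).copy()
    keyed[key_name] = keyed.index
    return keyed


# Inline CSV standing in for a real file (hypothetical data)
csv_text = "city,price\nBoston,26.6\nCambridge,31.2\n"
table = add_row_key(pd.read_csv(io.StringIO(csv_text)))
print(table)
```

The resulting `table` could then be passed as `data` to `staging.send_data(...)` with `crosswalk_auto_create=['row_id']`, following the call shown earlier.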

## Troubleshooting

A couple of well-known problems when loading data into Carol are given below:

- **Bad crosswalk**: This happens when the columns provided as the crosswalk parameter do not uniquely identify the records.

- **Inconsistent schema**: If the staging was created previously with different columns or data types, loading new data may result in conflicts. In that case it is recommended, whenever possible, to drop the previous staging and reload the full data.
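Both problems can often be caught locally with pandas before any data is sent. The following is a minimal sketch under that assumption; the helper names `find_crosswalk_duplicates` and `schema_differences` are hypothetical and are not part of PyCarol.

```python
import pandas as pd


def find_crosswalk_duplicates(df, crosswalk):
    """Return the rows whose crosswalk values occur more than once."""
    missing = [column for column in crosswalk if column not in df.columns]
    if missing:
        raise KeyError(f"Crosswalk columns not in dataframe: {missing}")
    # keep=False marks every member of a duplicated group, not only the repeats
    return df[df.duplicated(subset=crosswalk, keep=False)]


def schema_differences(previous, current):
    """Map each column that differs between two dataframes to (old dtype, new dtype).

    A column present on one side only shows up with None on the other side.
    """
    old = previous.dtypes.astype(str).to_dict()
    new = current.dtypes.astype(str).to_dict()
    return {
        column: (old.get(column), new.get(column))
        for column in sorted(set(old) | set(new))
        if old.get(column) != new.get(column)
    }


# Toy data: "sample" is repeated, so it is not a valid crosswalk here
first_load = pd.DataFrame({"sample": [0, 1, 1], "target": [24.0, 21.6, 34.7]})
duplicates = find_crosswalk_duplicates(first_load, ["sample"])
print(duplicates)

# Toy data: "sample" changed from integers to strings between loads
second_load = pd.DataFrame({"sample": ["0", "1", "2"], "target": [24.0, 21.6, 34.7]})
differences = schema_differences(first_load, second_load)
print(differences)
```

An empty result from both helpers suggests the dataframe is safe to send with the chosen crosswalk; a non-empty one points at the rows or columns to fix first.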