# Titanic Example

In [1]:
import pandas as pd
import numpy as np
from wrangler import Wrangler
from wrangler.data import PandasDataset, CSVDataset
import wrangler.transformers as tr
import wrangler.transformers.text as text_tr
import wrangler.transformers.ml as ml_tr
from wrangler import logger as wrangler_logger
from sklearn.ensemble import RandomForestClassifier

# Enable console logging (comment this line out to disable it)
wrangler_logger.enable()

# Uncomment for file output logging
# wrangler_logger.enable_file(filename='logfile.log')

## Initialize Wrangler and add Datasets

To create datasets we can use any of the classes available in the module ``wrangler.data``. There are datasets for multiple purposes, from in-memory data to AWS S3 data. Each has its own parameters, but all share the same API:

- A ``load()`` method, which does not take any parameters.
- A ``save()`` method, which takes the new data as a parameter.

To add a new dataset to the ``Wrangler`` we simply call ``add_dataset()`` and pass the dataset object.
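
The shared interface can be pictured with a small stand-in class. This is only a sketch, not the library's implementation: ``InMemoryDataset`` is a hypothetical name, and only the ``load()`` / ``save(data)`` shape is taken from the description above.

```python
class InMemoryDataset:
    """Hypothetical stand-in that mimics the load()/save() interface."""

    def __init__(self, name, data=None):
        self.name = name
        self._data = data

    def load(self):
        # load() takes no parameters and returns the stored data
        return self._data

    def save(self, data):
        # save() takes the new data as its only parameter
        self._data = data


dataset = InMemoryDataset(name="example", data=[1, 2, 3])
dataset.save([4, 5, 6])
loaded = dataset.load()
```

Because every dataset exposes the same two methods, the code that consumes a dataset does not need to know where the data actually lives.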

In [2]:
wrangler = Wrangler()

titanic_dataset = CSVDataset(name = 'titanic', filename='../data/titanic_train.csv', save_params={'index':False})

wrangler.add_dataset(titanic_dataset)

2022-02-22 at 14:45:11 | INFO | catalog | Adding dataset: intermediate
2022-02-22 at 14:45:11 | INFO | catalog | Adding dataset: titanic


## Create pipeline of nodes

The main components of the ``Wrangler`` are the ``Nodes`` of transformations. Each one is defined by four parameters:

- The ``name``: It's an optional parameter but strongly recommended for better documentation of the process.
- The ``transformer``: It's the object in charge of the transformations made in this node. There is a set of transformers already defined in the module ``wrangler.transformers`` and in the submodules:

    - ``wrangler.transformers.text``
    - ``wrangler.transformers.numeric``
    - ``wrangler.transformers.ml``
    - ``wrangler.transformers.date``
    
- The ``inputs``: The name (or list of names) of the datasets, already registered in the wrangler, to read from and pass as input to the transformer.
- The ``outputs``: The name (or list of names) of the datasets to write the outputs of the transformer to. In this case, the datasets do not need to be registered beforehand.

Each transformer can be used and tried outside of the Wrangler by simply calling the ``fit`` and ``transform`` methods.

The process of creating a ``Pipeline`` consists of populating the wrangler with nodes by calling the ``add_node`` method.
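
To make the role of ``inputs`` and ``outputs`` concrete, the sketch below runs a list of nodes against a plain dictionary acting as the catalog. It is an illustration only: ``run_nodes``, the tuple layout, and the toy functions are hypothetical and do not reflect how ``Wrangler`` is implemented.

```python
def run_nodes(catalog, nodes):
    """Run each node in order, reading inputs from and writing outputs to catalog."""
    for name, func, inputs, outputs in nodes:
        args = [catalog[key] for key in inputs]    # inputs must already exist
        results = func(*args)
        if len(outputs) == 1:
            results = (results,)
        for key, value in zip(outputs, results):   # outputs are created on demand
            catalog[key] = value
    return catalog


catalog = {
    "raw": [
        {"id": 1, "age": 22, "label": 0},
        {"id": 2, "age": 38, "label": 1},
    ]
}

nodes = [
    (
        "drop id",
        lambda rows: [{k: v for k, v in r.items() if k != "id"} for r in rows],
        ["raw"],
        ["clean"],
    ),
    (
        "split target",
        lambda rows: ([{"age": r["age"]} for r in rows], [r["label"] for r in rows]),
        ["clean"],
        ["X", "y"],
    ),
]

catalog = run_nodes(catalog, nodes)
```

Each node only knows dataset names, so the order of the nodes and the names they share define the whole data flow.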


In [3]:

wrangler.add_node(
    name='drop id y textos',
    transformer=tr.ColumnDropper(columns=['PassengerId','Name','Ticket','Cabin','Embarked']),
    inputs='titanic',
    outputs='titanic_pro',
)

wrangler.add_node(
    name='one hot encode sex',
    transformer=text_tr.OneHotEncoderTransformer(column='Sex'),
    inputs='titanic_pro',
    outputs='titanic_pro',
)

def fillna_numerics(df):
    # Replace missing values with 0 in every numeric column
    numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    for col in df.columns:
        if df[col].dtype in numeric_dtypes:
            df[col] = df[col].fillna(0)
    return df

wrangler.add_node(
    name='fillna numericos',
    transformer=tr.DataframeTransformer(fillna_numerics),
    inputs='titanic_pro',
    outputs='titanic_pro',
)


wrangler.add_node(
    name='split target',
    transformer=ml_tr.SplitFeaturesTarget(target='Survived'),
    inputs='titanic_pro',
    outputs=['titanic_X','titanic_y'],
)


wrangler.add_node(
    name='evaluacion',
    transformer=ml_tr.ClassificationModelEvaluator(
        model=RandomForestClassifier(), 
        ),
    inputs=['titanic_X','titanic_y'],
    outputs=['model_report'],
)

wrangler.add_node(
    name='prediccion',
    transformer=ml_tr.SklearnModelTransformer(model=RandomForestClassifier(),
                                              name='rf_clf',
                                              filename= '../data/titanic_rf_clf'),
    inputs=['titanic_X','titanic_y'],
    outputs=['titanic_pred']
)

2022-02-22 at 14:45:11 | INFO | pipeline | Node drop id y textos added to Pipeline 
2022-02-22 at 14:45:11 | INFO | pipeline | Node one hot encode sex added to Pipeline 
2022-02-22 at 14:45:11 | INFO | pipeline | Node fillna numericos added to Pipeline 
2022-02-22 at 14:45:11 | INFO | pipeline | Node split target added to Pipeline 
2022-02-22 at 14:45:11 | INFO | pipeline | Node evaluacion added to Pipeline 
2022-02-22 at 14:45:11 | INFO | pipeline | Node prediccion added to Pipeline 


## Fit the Wrangler

To execute the Wrangler we have two options:

- Calling the ``fit_transform`` method: It calls the ``fit`` and the ``transform`` methods of all the nodes in the wrangler.
- Calling the ``transform`` method: It calls only the ``transform`` method of the nodes.

This is very useful for separating the training process from the testing or production process.
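
The separation matters because anything learned during ``fit`` must be reused unchanged on new data. The sketch below illustrates this with a hypothetical ``TinyOneHot`` class (not part of ``wrangler``): the categories are learned once from the training values and then applied to unseen values without refitting.

```python
class TinyOneHot:
    """Hypothetical one-hot encoder used only to illustrate fit vs. transform."""

    def fit(self, values):
        # Learn the category list from the training data only
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # Reuse the learned categories; unseen values become all zeros
        return [[int(v == c) for c in self.categories_] for v in values]


encoder = TinyOneHot().fit(["male", "female", "female"])    # training step
train_encoded = encoder.transform(["male", "female", "female"])
new_encoded = encoder.transform(["female", "other"])        # production step, no refit
```

If the encoder were refitted on the new values, the output columns could change and no longer match what a downstream model was trained on.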

In [4]:
wrangler.fit_transform()

2022-02-22 at 14:45:11 | INFO | node | Running Node: drop id y textos
2022-02-22 at 14:45:11 | INFO | catalog | Loading dataset: titanic
2022-02-22 at 14:45:11 | DEBUG | base | Loading CSVDataset(name='titanic', filename='../data/titanic_train.csv', save_params=dict)
2022-02-22 at 14:45:11 | DEBUG | base | Fitting ColumnDropper(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked'])
2022-02-22 at 14:45:11 | DEBUG | base | Transforming ColumnDropper(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked'])
2022-02-22 at 14:45:11 | INFO | catalog | Saving dataset: titanic_pro
2022-02-22 at 14:45:11 | INFO | catalog | Adding dataset: titanic_pro
2022-02-22 at 14:45:11 | INFO | node | Running Node: one hot encode sex
2022-02-22 at 14:45:11 | INFO | catalog | Loading dataset: titanic_pro
2022-02-22 at 14:45:11 | DEBUG | base | Loading PandasDataset(name='titanic_pro', data=DataFrame)
2022-02-22 at 14:45:11 | DEBUG | base | Fitting OneHotEncoderTransformer(column='Sex')
2022-02

## Reading the Results

The ``Wrangler`` saves all its inputs and outputs in a data catalog. We can retrieve that data by calling the ``data_catalog.load`` method.


In [5]:
print(wrangler.data_catalog)

{'intermediate': PandasDataset(name='intermediate', data=DataFrame), 'titanic': CSVDataset(name='titanic', filename='../data/titanic_train.csv', save_params=dict), 'titanic_pro': PandasDataset(name='titanic_pro', data=DataFrame), 'titanic_X': PandasDataset(name='titanic_X', data=DataFrame), 'titanic_y': PandasDataset(name='titanic_y', data=DataFrame), 'model_report': PandasDataset(name='model_report', data=DataFrame), 'titanic_pred': PandasDataset(name='titanic_pred', data=DataFrame)}


In [6]:
wrangler.data_catalog.load("model_report")

2022-02-22 at 14:45:14 | INFO | catalog | Loading dataset: model_report
2022-02-22 at 14:45:14 | DEBUG | base | Loading PandasDataset(name='model_report', data=DataFrame)


| | fit_time | score_time | test_accuracy | test_precision | test_recall | test_f1 | test_roc_auc |
|---|---|---|---|---|---|---|---|
| 0 | 0.496035 | 0.095066 | 0.759777 | 0.685714 | 0.695652 | 0.690647 | 0.846904 |
| 1 | 0.268788 | 0.047642 | 0.820225 | 0.790323 | 0.720588 | 0.753846 | 0.812968 |
| 2 | 0.246877 | 0.05085 | 0.876404 | 0.870968 | 0.794118 | 0.830769 | 0.90127 |
| 3 | 0.252931 | 0.053372 | 0.775281 | 0.75 | 0.617647 | 0.677419 | 0.847928 |
| 4 | 0.239624 | 0.045305 | 0.831461 | 0.774648 | 0.797101 | 0.785714 | 0.896822 |
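
Each row of the report looks like one cross-validation fold. That is an inference from the five rows and the ``fit_time`` / ``score_time`` columns, not something the notebook states. Under that assumption, a single headline figure is the mean of a metric across rows. The sketch below does this in plain Python with the ``test_accuracy`` values copied from the output above. With the actual DataFrame held in a (hypothetical) variable ``report``, ``report["test_accuracy"].mean()`` would be the pandas equivalent.

```python
from statistics import mean, stdev

# test_accuracy values copied from the model_report output above
test_accuracy = [0.759777, 0.820225, 0.876404, 0.775281, 0.831461]

mean_accuracy = mean(test_accuracy)
spread = stdev(test_accuracy)  # sample standard deviation across rows

print(f"accuracy: {mean_accuracy:.3f} +/- {spread:.3f}")
```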


## Save the Wrangler

To save and pickle the fitted wrangler, we simply call the ``save`` method with the path and the name of the file.

In [7]:
wrangler.save('../data/titanic_wrangler')
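
How ``save`` pickles the wrangler internally is not shown in this notebook. As a generic illustration of why a pickled object keeps its fitted state, the following sketch round-trips a small hypothetical state dictionary through a file with the standard ``pickle`` module.

```python
import os
import pickle
import tempfile

# Hypothetical fitted state; a fitted pipeline object would be pickled the same way
fitted_state = {"categories": ["female", "male"], "fill_value": 0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fitted_state.pkl")
    with open(path, "wb") as handle:
        pickle.dump(fitted_state, handle)   # write the object to disk
    with open(path, "rb") as handle:
        restored = pickle.load(handle)      # read it back, e.g. in a later session
```

Pickle files can execute arbitrary code when loaded, so they should only be loaded from trusted sources.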

## Load a Fitted Wrangler

To load a fitted wrangler there are two steps:

- Create a new empty wrangler.
- Call ``load`` with the path to the fitted wrangler.

In order to apply transformations to new data, we need to add it to the ``data_catalog``; otherwise the wrangler will maintain the reference to the data it was fitted on.

For this task we have two options:

- Creating a new dataset with the new reference and calling ``add_dataset``.
- Calling the ``datasets_from_config`` method to load the new references.

In [8]:
loaded_wrangler = Wrangler()
loaded_wrangler.load('../data/titanic_wrangler')

2022-02-22 at 14:45:15 | INFO | catalog | Adding dataset: intermediate


### Adding datasets with ``add_dataset``

In [9]:
titanic_dataset = CSVDataset(name = 'titanic', filename='../data/titanic_test.csv')

loaded_wrangler.add_dataset(titanic_dataset)

2022-02-22 at 14:45:15 | INFO | catalog | Adding dataset: titanic


### Adding datasets with ``datasets_from_config``

- We call the ``datasets_to_config`` method of the fitted wrangler to generate a .yml file with the current references of the datasets. This step can be done before saving the wrangler.
- Then we can modify this .yml file and pass it to ``datasets_from_config``.


Example of .yml file:

    inputs:
        titanic:
            filename: ../data/titanic_train.csv
            type: CSVDataset
    outputs:
        model_report:
            data: DataFrame
            type: PandasDataset
        titanic_pred:
            data: DataFrame
            type: PandasDataset
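
The notebook does not show how the modified file (``titanic_datasets_new.yml`` in a later cell) was produced. One possible way, sketched here with plain text replacement so that no YAML library is needed, is to copy the generated content and swap the input filename. The YAML text is the example above, and the temporary directory is only there to keep the sketch self-contained.

```python
import os
import tempfile

# Content mirroring the example .yml shown above
config_text = """inputs:
    titanic:
        filename: ../data/titanic_train.csv
        type: CSVDataset
outputs:
    model_report:
        data: DataFrame
        type: PandasDataset
    titanic_pred:
        data: DataFrame
        type: PandasDataset
"""

# Point the 'titanic' input at the test file instead of the training file
new_config_text = config_text.replace(
    "../data/titanic_train.csv", "../data/titanic_test.csv"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    new_path = os.path.join(tmp_dir, "titanic_datasets_new.yml")
    with open(new_path, "w", encoding="utf-8") as handle:
        handle.write(new_config_text)
    with open(new_path, encoding="utf-8") as handle:
        written = handle.read()
```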

In [10]:
wrangler.datasets_to_config("../data/titanic_datasets.yml")

<wrangler.wrangler.Wrangler at 0x7f53e9594df0>

In [11]:
loaded_wrangler_from_config = Wrangler()
loaded_wrangler_from_config.load('../data/titanic_wrangler')
loaded_wrangler_from_config.datasets_from_config("../data/titanic_datasets_new.yml")

2022-02-22 at 14:45:16 | INFO | catalog | Adding dataset: intermediate
2022-02-22 at 14:45:16 | INFO | catalog | Adding dataset: titanic
2022-02-22 at 14:45:16 | INFO | catalog | Adding dataset: model_report
2022-02-22 at 14:45:16 | INFO | catalog | Adding dataset: titanic_pred


<wrangler.wrangler.Wrangler at 0x7f539d947a00>

## Transform new Dataset

In [12]:
loaded_wrangler.transform()

2022-02-22 at 14:45:16 | INFO | node | Running Node: drop id y textos
2022-02-22 at 14:45:16 | INFO | catalog | Loading dataset: titanic
2022-02-22 at 14:45:16 | DEBUG | base | Loading CSVDataset(name='titanic', filename='../data/titanic_test.csv')
2022-02-22 at 14:45:16 | DEBUG | base | Transforming ColumnDropper(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked'])
2022-02-22 at 14:45:16 | INFO | catalog | Saving dataset: titanic_pro
2022-02-22 at 14:45:16 | INFO | catalog | Adding dataset: titanic_pro
2022-02-22 at 14:45:16 | INFO | node | Running Node: one hot encode sex
2022-02-22 at 14:45:16 | INFO | catalog | Loading dataset: titanic_pro
2022-02-22 at 14:45:16 | DEBUG | base | Loading PandasDataset(name='titanic_pro', data=DataFrame)
2022-02-22 at 14:45:16 | DEBUG | base | Transforming OneHotEncoderTransformer(column='Sex', encoder=OneHotEncoder, one_hot_columns=ndarray)
2022-02-22 at 14:45:16 | INFO | catalog | Saving dataset: titanic_pro
2022-02-22 at 14:45:16 | DEBU

In [13]:
loaded_wrangler.data_catalog.load('titanic_pred')

2022-02-22 at 14:45:17 | INFO | catalog | Loading dataset: titanic_pred
2022-02-22 at 14:45:17 | DEBUG | base | Loading PandasDataset(name='titanic_pred', data=DataFrame)


| | rf_clf_prediction |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
| ... | ... |
| 413 | 0 |
| 414 | 1 |
| 415 | 0 |
| 416 | 0 |


## Extras

``wrangler`` has some extra features:

- The ``plot_wrangler``: it generates a graph plot of the wrangler with its inputs, outputs and nodes.

In [14]:
from wrangler.extras import plot_wrangler

plot_wrangler(loaded_wrangler)

ModuleNotFoundError: No module named 'graphviz'
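
The traceback indicates that the ``graphviz`` Python package was not installed in the environment where the notebook ran. A small guard like the one below, using only the standard library, can check for the package before attempting the plot. That ``plot_wrangler`` needs nothing beyond ``graphviz`` is an assumption based on the error message. Note also that the ``graphviz`` Python package relies on the Graphviz system binaries being installed for rendering.

```python
import importlib.util


def is_module_available(module_name):
    """Return True if the module can be imported in the current environment."""
    return importlib.util.find_spec(module_name) is not None


has_graphviz = is_module_available("graphviz")
if not has_graphviz:
    print("graphviz is not installed; install it (e.g. 'pip install graphviz') before plotting")
```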