<a id="top"></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Table of Contents</h3>

* [1. Reading the Data](#1)
* [2. EDA: Exploring Insights](#2)
    - [2.1 Survival Rate](#2.1)
    - [2.2 Gender Analysis](#2.2)
    - [2.3 Social Class Influence](#2.3)
    - [2.4 Companion on the Ship](#2.4)
    - [2.5 The Age Factor](#2.5)
    - [2.6 Fare Paid by Passengers](#2.6)
    - [2.7 Simultaneous Analysis](#2.7)
* [3. Prep: Building Pipelines](#3) 
    - [3.1 Initial Pipeline](#3.1)
        - [3.1.1 Custom Features](#3.1.1)
        - [3.1.2 Candidate Features](#3.1.2)
        - [3.1.3 Duplicated Data](#3.1.3)
        - [3.1.4 Modifying Dtypes](#3.1.4)
        - [3.1.5 Training and Validation Data](#3.1.5)
    - [3.2 Numerical Pipeline](#3.2)
        - [3.2.1 Null Data](#3.2.1)
        - [3.2.2 Log Transformation](#3.2.2)
        - [3.2.3 Normalization](#3.2.3)
    - [3.3 Categorical Pipeline](#3.3)
        - [3.3.1 Encoding](#3.3.1)
    - [3.4 Complete Pipelines](#3.4)
* [4. Modeling: Survival Prediction](#4)
    - [4.1 Structuring Variables](#4.1)
    - [4.2 Training Models](#4.2)
    - [4.3 Evaluating Performance](#4.3)
    - [4.4 Training Flow and Visual Analysis](#4.4)
* [5. Tuning Hyperparameters](#5)
* [6. Submitting: Prediction Pipeline](#6)
* [7. References](#7)    

This notebook presents an exploratory analysis of the [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic) dataset from the Kaggle platform, developed as practice in Data Science and Machine Learning. It also uses the tools in the [xplotter](https://github.com/ThiagoPanini/xplotter) and [mlcomposer](https://github.com/ThiagoPanini/mlcomposer) Python packages, which I wrote and published on PyPI. Both packages are an effort to collect useful functions that make Exploratory Data Analysis and the Machine Learning workflow much easier for data scientists and data analysts, delivering customized matplotlib/seaborn charts in just a few lines of code. I really hope you enjoy it!

<div align="center">
    <img src="https://i.imgur.com/5XFP1Ha.png" height=300 width=200 alt="xplotter Logo">
    <img src="https://i.imgur.com/MIcPH8g.png" width=450 height=450 alt="mlcomposer logo">
</div>

___
**_Description and context:_**
_The sinking of the Titanic is one of the most well-known events in the world. [...] While some factors related to "luck" were present among the survivors, apparently some specific groups of passengers and crew were more likely to survive than others. In this challenge, it is proposed to create a Machine Learning model capable of answering the following question: "Which groups of people could be more able to survive the shipwreck?"_

In [None]:
!pip install xplotter --upgrade
!pip install mlcomposer --upgrade

In [None]:
# Project libraries
import pandas as pd
import os
from warnings import filterwarnings
filterwarnings('ignore')

# Project variables
DATA_PATH = '../input/titanic'
TRAIN_FILENAME = 'train.csv'
TEST_FILENAME = 'test.csv'

<a id="1"></a>
<font color="darkslateblue" size=+2.5><b>1. Reading the Data</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

After importing the main libraries and defining the variables needed to read the data, we can take a first look at the dataset available for the task.

In [None]:
# Reading training data
df = pd.read_csv(os.path.join(DATA_PATH, TRAIN_FILENAME))
df.head()

In case you are wondering about the [metadata](https://www.kaggle.com/c/titanic/data), here are some useful points to consider:

- **_PassengerId:_** id of the passenger registered on the trip;
- **_Survived:_** target variable indicating the passenger's survival (1=yes, 0=no);
- **_Pclass:_** passenger's social class, taken from the ticket class (1=high, 2=medium or 3=low);
- **_Name:_** passenger's name;
- **_Sex:_** passenger's gender;
- **_Age:_** passenger's age;
- **_SibSp:_** number of the passenger's siblings / spouses aboard the ship;
- **_Parch:_** number of the passenger's parents / children aboard the ship;
- **_Ticket:_** passenger's ticket reference number;
- **_Fare:_** amount the passenger paid for the ticket;
- **_Cabin:_** passenger's cabin reference number;
- **_Embarked:_** port of embarkation (C=Cherbourg, Q=Queenstown, S=Southampton).

<a id="2"></a>
<font color="darkslateblue" size=+2.5><b>2. EDA: Exploring Insights</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

At this point, the project's objective is well defined and the data has been read into a pandas DataFrame. From here on, we scan the data thoroughly and apply a detailed descriptive analysis to gather insights relevant to the business context.

This analysis uses the homemade package [xplotter](https://github.com/ThiagoPanini/xplotter), which was built precisely to make the insight-gathering and exploratory analysis work of data scientists easier. The next steps rely on the tools provided by the xplotter library to produce charts that give a deep understanding of the data.

<div align="center">
    <img src="https://i.imgur.com/5XFP1Ha.png" alt="xplotter logo">
</div>

<a id="2.1"></a>
<font color="dimgrey" size=+2.0><b>2.1 Survival Rate</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In [None]:
# Importing libraries
from xplotter.insights import *

# Survival rate
survived_map = {1: 'Survived', 0: 'Not Survived'}
survived_colors = ['crimson', 'darkslateblue']
plot_donut_chart(df=df, col='Survived', label_names=survived_map, colors=survived_colors,
                 title='Absolute Total and Percentual of Passengers \nwho Survived Titanic Disaster')

The chart above shows the proportion of survivors and victims among the 891 passengers in the dataset available for analysis. Victims of the shipwreck are the larger group, totaling 549 passengers or 61.6% of the total. Survivors are the smaller group, with 342 passengers or 38.4% of the total.

The `Survived` variable is the target for the predictive model trained later in this notebook. The main objective is to identify the factors, or the groups of passengers, associated with a greater chance of survival, given the demographic and social characteristics available in the data.
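
The counts and percentages quoted above can be reproduced with plain pandas. The following is a minimal sketch, not the xplotter implementation: it rebuilds a synthetic target column from the totals stated in the text (549 and 342) instead of reading the real file, so the `survived` series here is only a stand-in for `df['Survived']`.

```python
import pandas as pd

# Synthetic target column rebuilt from the counts quoted in the text
# (549 victims, 342 survivors); a stand-in for df['Survived'].
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# Absolute totals and percentage share of each class
counts = survived.value_counts()
pct = (survived.value_counts(normalize=True) * 100).round(1)

summary = pd.DataFrame({"total": counts, "pct": pct})
print(summary)
```

On the real data, the same two `value_counts()` calls applied to `df['Survived']` should give the figures shown in the donut chart.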

<a id="2.2"></a>
<font color="dimgrey" size=+2.0><b>2.2 Gender Analysis</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Building on the survival analysis above, we can expand the view using other variables in the dataset. In this section, we check whether passenger gender had any influence on survival.

In [None]:
# Countplot for gender
gender_colors = ['lightskyblue', 'lightcoral']
gender_map = {'male': 'Male', 'female': 'Female'}
plot_countplot(df=df, col='Sex', palette=gender_colors, label_names=gender_map,
               title='Total Passengers by Its Gender')

From the chart above, roughly 65% of the travelers on the Titanic were men and 35% were women, which can probably be considered typical for the time (after all, the voyage took place in 1912).

Keeping the total counts by gender in mind is essential for the survival-rate analysis by gender that follows.

In [None]:
# Survival rate by gender
plot_countplot(df=df, col='Survived', hue='Sex', label_names=survived_map, palette=gender_colors,
               title="Could gender have had some influence on\nsurviving the Titanic shipwreck?")

The graph above shows that, of the 549 recorded victims (61.6% of the total), we have 468 (or 52.5%) males and only 81 (or 9.1%) females. On the other hand, of the 342 (or 38.4%) survivors, we have 109 (or 12.2%) males and 233 (or 26.2%) females.

In other words, the chart suggests that women were given priority in the rescue, since their share is much larger among survivors than among victims. A plausible explanation is a rescue protocol that prioritized women and children in emergencies.
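
The comparison becomes sharper when each gender is normalized by its own total. The sketch below rebuilds a synthetic table from the four counts quoted above (468, 81, 109 and 233), so `toy` is only a stand-in for the real `df[['Sex', 'Survived']]` columns, and computes the share of each outcome within each gender using `pd.crosstab`.

```python
import pandas as pd

# Synthetic rows rebuilt from the counts quoted in the text;
# a stand-in for the real df[['Sex', 'Survived']] columns.
rows = (
    [("male", 0)] * 468 + [("female", 0)] * 81
    + [("male", 1)] * 109 + [("female", 1)] * 233
)
toy = pd.DataFrame(rows, columns=["Sex", "Survived"])

# Share of each outcome within each gender (each row sums to 100%)
rate_by_gender = (
    pd.crosstab(toy["Sex"], toy["Survived"], normalize="index") * 100
).round(1)
print(rate_by_gender)
```

With these counts, about 74% of the women survived, against about 19% of the men.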

Another way to analyze this representativeness of survival by gender can be found in the double donut chart below:

In [None]:
# Plotting a double donut chart
plot_double_donut_chart(df=df, col1='Survived', col2='Sex', label_names_col1=survived_map, 
                        colors1=['crimson', 'navy'], colors2=['lightcoral', 'lightskyblue'],
                        title="Did the passenger's gender influence \nsurvival rate?")

Both charts point the same way: female passengers appear to have had priority in the rescue during the shipwreck.

<a id="2.3"></a>
<font color="dimgrey" size=+2.0><b>2.3 Social Class Influence</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

The next topic of analysis takes into account the values contained in the `Pclass` variable, which, in turn, brings information regarding the social class of each of the passengers present on the vessel.

In [None]:
# Number of passengers for each class
pclass_map = {1: 'Upper Class', 2: 'Middle Class', 3: 'Lower Class'}
plot_pie_chart(df=df, col='Pclass', colors=['brown', 'gold', 'darkgrey'],
               explode=(0.03, 0, 0), label_names=pclass_map,
               title="Total Passengers by Social Class")

The pie chart above shows that most of the ship's passengers belonged to the lower class, representing just over 55% of the total. Next, we will analyze whether this variable could, in some way, have influenced passengers' survival.

In [None]:
# Relationship between survival and social class
plot_countplot(df=df, col='Pclass', hue='Survived', label_names=pclass_map, palette=survived_colors,
               title="Survival Analysis by Social Class")

From reading the bar graph above, it is possible to see how the red (victims) and blue (survivors) bars have different proportions for each of the social classes analyzed in the base. In general, we have:

* Upper Class: there is a higher number of survivors than victims;
* Middle Class: a more equal balance of survivors and victims;
* Lower Class: formed mostly by victims.

This scenario indicates a clear situation: members of the so-called "Upper Class" had a greater chance of surviving the shipwreck than, for example, members of the "Lower Class". Probably some privilege was given to these passengers during the rescue or, in some way, they may have been accommodated in more comfortable or safer parts of the ship, which made rescue easier.

Another way to analyze this survival breakdown by social class is the grouped bar chart below:

In [None]:
plot_pct_countplot(df=df, col='Pclass', hue='Survived', palette='rainbow_r',
                   title='Social Class Influence on Survival Rate')

<a id="2.4"></a>
<font color="dimgrey" size=+2.0><b>2.4 Companion on the Ship</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

There are two variables in the dataset that indicate the presence of passengers' companions on the ship, namely `SibSp` (number of siblings or spouses) and `Parch` (number of parents or children). In this section, we analyze whether the presence of companions influenced the survival of these passengers in any way.

In [None]:
# Relationship between SibSp and Survived
plot_countplot(df=df, col='SibSp', hue='Survived', orient='v', palette=survived_colors,
               title='Survival Analysis by SibSp')

In [None]:
# Relationship between Parch and Survived
plot_countplot(df=df, col='Parch', hue='Survived', orient='v', palette=survived_colors,
               title='Survival Analysis by Parch')

Analyzing the two bar charts above, it is possible to infer that passengers accompanied by 1 or 2 people, be they siblings, spouses, parents or children, present a more favorable survival scenario than the others.

It is likely that, in the chaotic rescue scenario, having a moderate number of companions helped through mutual assistance.

<a id="2.5"></a>
<font color="dimgrey" size=+2.0><b>2.5 The Age Factor</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

So far, the exploratory analysis has only covered categorical attributes. However, the dataset has at least two numerical features of probable importance for the project objective: `Age` and `Fare`.

First, we analyze how age may have influenced the survival of the ship's passengers.

In [None]:
# Distribution of age variable
plot_distplot(df=df, col='Age', title="Passenger's Age Distribution", hist=True)

In the above distribution, it is possible to have a general idea of the audience present on the ship in terms of age. The peak density occurs at approximately 25 years of age, indicating that the largest range of passengers was formed by a relatively young audience.

We will repeat the density analysis above and build a curve for survivors and another for victims.

In [None]:
plot_distplot(df=df, col='Age', hue='Survived', kind='kde', color_list=['crimson', 'darkslateblue'],
              title="Is there any relationship between age distribution\n from survivors and non survivors passengers?")

Overall, the graph above does not show a clear relationship between the influence of age on passenger survival. However, it is possible to highlight some subtle points on the graph, for example, the elevation of the curve at ages "close to 0", thus indicating a possible rescue priority given to the youngest children.

This pattern also appears, even more subtly, among the oldest passengers, at around 80 years of age.

<a id="2.6"></a>
<font color="dimgrey" size=+2.0><b>2.6 Fare Paid by Passengers</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

The `Fare` variable in the data set brings the amount paid by each passenger to board the Titanic. We saw, in some previous stages, that social class may have had a certain influence on the passenger's chance of survival. Does this scenario repeat itself for the amount paid for the entry ticket?

In [None]:
plot_distplot(df=df, col='Fare', title='Fare Distribution', hist=True)

Here we can see the distribution of Fare variable with some outliers on the right side of the curve (people who paid a really high amount for the ticket).

In [None]:
# Fare distribution by social class
plot_distplot(df=df, col='Fare', hue='Pclass', kind='strip', label_names=pclass_map,
              palette=['gold', 'silver', 'brown'],
              title="Fare Distribution by Social Class")

The chart above shows that, in general, the higher the passenger's social class, the higher the fare paid. This is expected, but seeing it in the data is still valuable for this study.

In [None]:
plot_distplot(df=df, col='Fare', hue='Sex', kind='boxen', palette=gender_colors,
              title="What's the Fare distribution by Gender?")

It is interesting to see that, in general, women paid higher amounts than men.

In [None]:
plot_distplot(df=df, col='Fare', hue='Survived', kind='kde', color_list=['crimson', 'darkslateblue'],
              title="What's the relationship between the \nfare paid and survivor?")

Finally, the chart above shows a density distribution relating the fare paid by each passenger to the survival indicator. The curve for survivors (Survived = 1) has a higher concentration at high values of `Fare`, indicating that the ticket price may have had some influence on passengers' survival.

<a id="2.7"></a>
<font color="dimgrey" size=+2.0><b>2.7 Simultaneous Analysis</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

The `insights` module of the `xplotter` package includes some functions built to analyze several columns of a dataset at once. For example, to view the counts of several categorical variables together, we can call the function `plot_multiple_countplots()`, as shown below:

In [None]:
cat_cols = ['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']
plot_multiple_countplots(df=df, col_list=cat_cols, orient='v')

Another possibility is to analyze a categorical column against a numerical aggregation column. For this, the `xplotter` package provides the `plot_cat_aggreg_report()` function. The block below analyzes the `Embarked` and `Fare` variables together, answering questions such as "what is the volume of passengers per port of embarkation?" or "what are the mean, median and standard deviation of the fare paid?"

In [None]:
plot_cat_aggreg_report(df=df, cat_col='Embarked', value_col='Fare', title3='Statistical Analysis', 
                       desc_text=f'A statistical approach for Fare \nusing the data available',
                       stat_title_mean='Mean', stat_title_median='Median', stat_title_std='Std', 
                       stat_title_x_pos=.3, stat_x_pos=.3, inc_x_pos=10)

<a id="3"></a>
<font color="darkslateblue" size=+2.5><b>3. Prep: Building Pipelines</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

After a long journey through the exploratory analysis section, we have gathered valuable insights that can make modeling work much easier and more intuitive. Thus, the next topics in this notebook deal with an extremely important step in training a machine learning model: preparing the data.

This section uses the homemade package `mlcomposer`, built precisely to make the prep and modeling work of data scientists easier. By the end of it, we expect a full understanding of the available data and a clear idea of the steps required in prep and modeling. To facilitate this work, we will use a very powerful tool from the `mlcomposer` package: the `mlcomposer.transformers` module, which provides ready-made classes that perform a series of transformations on a dataset and integrate fully with data-prep `Pipelines`.

<div align="center">
    <img src="https://i.imgur.com/MIcPH8g.png" width=450 height=450 alt="mlcomposer logo">
</div>

Before starting the steps regarding these transformations, let's apply another equally powerful function of the `xplotter.insights` module capable of returning an overview of a database.

In [None]:
# Data overview from xplotter
data_overview(df=df)

<a id="3.1"></a>
<font color="dimgrey" size=+2.0><b>3.1 Initial Pipeline</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

<a id="3.1.1"></a>
<font color="dimgrey" size=+1.0><b>3.1.1 Custom Features</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In this first section, we build custom features for our dataset, extracting value from columns that, in their raw format, probably bring no relevant signal to a predictive model. The following code blocks extract new features that may or may not be used in the final model:

* **_name_title_**: column that extracts the "title" information from the passenger's name (Mr, Mrs, etc.)
* **_cabin_class_**: column holding the "category" of the cabin (A, B, C, etc.)
* **_ticket_class_**: column that extracts categorized information from the passenger's ticket
* **_name_length_**: column with the number of characters in the passenger's name
* **_age_cat_**: column for categorizing age ranges
* **_fare_cat_**: column for categorizing `Fare` ranges
* **_family_size_**: column with the passenger's total family size

To extract these new features, we build a class called `CustomFeaturesTitanic`, which holds the extraction logic, based on `RegEx` and other rules, in a structure that can be included as a step in a data preparation pipeline. The main objective is to iterate over this and other transformation classes in search of the combinations that work best for a classification model.
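
The title extraction idea can be sketched with Python's `re` module alone. This is a minimal illustration, assuming names follow a "Surname, Title. Given names" layout; the sample strings, the `extract_title` helper and the `COMMON_TITLES` set are illustrative names and not part of the class.

```python
import re

# Pattern for a run of letters followed by a dot, e.g. "Mr." or "Miss."
TITLE_RE = r"([a-zA-Z]+\.)"
COMMON_TITLES = {"Mr.", "Miss.", "Mrs.", "Master."}


def extract_title(name, other_tag="OTHER"):
    """Return the first 'Word.' token in the name, or other_tag when the
    token is missing or is not one of the common titles."""
    match = re.search(TITLE_RE, name)
    title = match.group(1) if match else None
    return title if title in COMMON_TITLES else other_tag


# Illustrative name strings in the assumed "Surname, Title. Given names" layout
samples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Example, Col. John",
    "No Title Here",
]
titles = [extract_title(name) for name in samples]
print(titles)
```

Grouping rare titles under a single tag keeps the number of categories small, which matters later when the column is encoded.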

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import re

class CustomFeaturesTitanic(BaseEstimator, TransformerMixin):
    
    def __init__(self, name_title=True, ticket_class=True, cabin_class=True, name_length=True,
                 age_cat=True, fare_cat=True, family_size=True, name_title_re=r'([a-zA-Z]+\.)', 
                 ticket_class_re=r'[A-Z]+([/]{0,})([^\s.]+)', cabin_class_re=r'\w',
                 age_bins=[0, 10, 20, 40, 60, 999], age_labels=['0_10', '10_20', '20_40', '40_60', 'greater_60'],
                 fare_bins=[0, 8, 15, 25, 50, 99999], fare_labels=['0_8', '8_15', '15_25', '25_50', 'greater_50']):
        self.name_title = name_title
        self.ticket_class = ticket_class
        self.cabin_class = cabin_class
        self.name_length = name_length
        self.age_cat = age_cat
        self.fare_cat = fare_cat
        self.family_size = family_size
        
        self.age_bins = age_bins
        self.age_labels = age_labels
        self.fare_bins = fare_bins
        self.fare_labels = fare_labels
        self.name_title_re = name_title_re
        self.ticket_class_re = ticket_class_re
        self.cabin_class_re = cabin_class_re
        
    def fit(self, df, y=None):
        return self
    
    def transform(self, df, y=None):
        
        # Extracting the title feature from the passenger's name
        if self.name_title:
            df['name_match'] = df['Name'].apply(lambda x: re.search(self.name_title_re, x))
            df['name_regex'] = df['name_match'].apply(lambda x: ''.join(x.groups()) if x is not None else np.nan)
            
            def extract_name_title(x, other_tag='OTHER'):
                if x not in ['Mr.', 'Miss.', 'Mrs.', 'Master.']:
                    x = other_tag
                return x
                    
            df['name_title'] = df['name_regex'].apply(lambda x: extract_name_title(x))
            df.drop('name_match', axis=1, inplace=True)
            df.drop('name_regex', axis=1, inplace=True)
        
        # Extracting the ticket class feature
        if self.ticket_class:
            df['ticket_match'] = df['Ticket'].apply(lambda x: re.search(self.ticket_class_re, x))
            df['ticket_regex'] = df['ticket_match'].apply(lambda x: x.group() if x is not None else 'all_numbers')
            
            # Helper that groups ticket prefixes into classes
            def extract_ticket_class(x, other_tag='OTHER'):
                if x in ['A/5', 'A/4', 'A4', 'A/S']:
                    x = 'A'
                elif x in ['STON/O', 'SOTON/O', 'SOTON/OQ', 'STON/O2', 'SOTON/O2', 'SOTON']:
                    x = 'STON_SOTON'
                elif x in ['SC/PARIS', 'SC/Paris', 'SC/AH', 'SC']:
                    x = 'SC'
                elif x == 'PC':
                    x = 'PC'
                else:
                    x = other_tag
                return x

            df['ticket_class'] = df['ticket_regex'].apply(lambda x: extract_ticket_class(x))
            df.drop('ticket_match', axis=1, inplace=True)
            df.drop('ticket_regex', axis=1, inplace=True)
            
        # Extracting the cabin class feature
        if self.cabin_class:
            df['cabin_match'] = df['Cabin'].apply(lambda x: re.search(self.cabin_class_re, x) if pd.notna(x) else np.nan)
            df['cabin_regex'] = df['cabin_match'].apply(lambda x: x.group() if pd.notna(x) else np.nan)
            
            # Helper that groups the cabin classes F, G and T together
            def extract_cabin_class(x):
                if x in ['F', 'G', 'T']:
                    x = 'FGT'
                return x

            df['cabin_class'] = df['cabin_regex'].apply(lambda x: extract_cabin_class(x))
            df.drop('cabin_match', axis=1, inplace=True)
            df.drop('cabin_regex', axis=1, inplace=True)
            
        # Extracting the name length feature
        if self.name_length:
            df['name_length'] = df['Name'].apply(lambda x: len(x))
            
        # Extracting the age category feature
        if self.age_cat:
            df['age_cat'] = pd.cut(df['Age'], bins=self.age_bins, labels=self.age_labels)
        
        # Extracting the fare category feature
        if self.fare_cat:
            df['fare_cat'] = pd.cut(df['Fare'], bins=self.fare_bins, labels=self.fare_labels)
            
        # Extracting the family size feature
        if self.family_size:
            df['family_size'] = df['Parch'] + df['SibSp'] + 1
            
        return df

In [None]:
# Creating object and applying transformation
feature_adder = CustomFeaturesTitanic(name_title=True, cabin_class=True, ticket_class=True)
df_custom = feature_adder.fit_transform(df)
df_custom.head()

After applying the transformation function, it would be interesting to see the results obtained from the new features created.

In [None]:
# New features
custom_features = ['name_title', 'ticket_class', 'cabin_class', 'age_cat', 'fare_cat', 'family_size']
plot_multiple_countplots(df=df, col_list=custom_features)

After executing the `fit_transform()` method of the `feature_adder` object, the resulting dataset contains the new features described earlier. When training a predictive model, we will verify whether these features are relevant to the modeling as a whole.

In addition, from a predictive point of view, not all columns of the dataset will be used to predict the survival of the ship's passengers. In practice, columns such as `Cabin`, `PassengerId`, `Ticket` and `Name` have little direct value for predictive modeling in their raw form, serving only as passenger identifiers or unique record keys. Thus, as a next step, we apply a feature selection process to keep only the candidate variables for modeling.
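
One detail of the binned features is worth checking. By default, `pd.cut` builds right-closed intervals and leaves the lowest edge out, so a value exactly equal to the first bin edge gets no category. The sketch below uses the fare bins and labels from the class defaults on a few illustrative fares (hypothetical values, not rows from the dataset) and compares the default behavior with `include_lowest=True`.

```python
import pandas as pd

# Bins and labels copied from the class defaults
fare_bins = [0, 8, 15, 25, 50, 99999]
fare_labels = ['0_8', '8_15', '15_25', '25_50', 'greater_50']

# Illustrative fares (hypothetical values), including the edge case 0.0
fares = pd.Series([0.0, 7.25, 8.0, 30.0, 512.0])

# Default: intervals are (0, 8], (8, 15], ... so 0.0 falls outside every bin
default_cut = pd.cut(fares, bins=fare_bins, labels=fare_labels)

# include_lowest=True closes the first interval on the left: [0, 8]
inclusive_cut = pd.cut(fares, bins=fare_bins, labels=fare_labels,
                       include_lowest=True)

print(pd.DataFrame({'fare': fares, 'default': default_cut,
                    'include_lowest': inclusive_cut}))
```

If the data contains fares of exactly 0, the default call leaves them as missing values in `fare_cat`, which the later imputation step would then have to handle.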

<a id="3.1.2"></a>
<font color="dimgrey" size=+1.0><b>3.1.2 Candidate Features</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Having defined the variables present in the initial set of features (variable `INITIAL_FEATURES`), it is possible to use the `ColumnSelection()` class of the `mlcomposer.transformers` module to apply the feature selection process to pre-selected attributes.

In [None]:
# Importing class
from mlcomposer.transformers import ColumnSelection

# Initial features
TARGET = 'Survived'
TO_DROP = ['PassengerId', 'Name', 'Ticket', 'Cabin']
INITIAL_FEATURES = list(df.drop(TO_DROP, axis=1).columns)

# Applying transformer
selector = ColumnSelection(features=INITIAL_FEATURES)
df_slct = selector.fit_transform(df)

# Results
print(f'Shape of original dataset: {df.shape}')
print(f'Shape of dataset after selection: {df_slct.shape}')
df_slct.head()

From the resulting dimensions, it can be seen that applying the `ColumnSelection` class removed 4 columns from the original dataset.

<a id="3.1.3"></a>
<font color="dimgrey" size=+1.0><b>3.1.3 Duplicated Data</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Handling duplicate data is an important step in preparing the dataset for model training. Eliminating duplicates removes redundancy from the data, which can help machine learning models converge faster to local or global minima.

To accomplish this task, we use the `DropDuplicates` class from the `mlcomposer.transformers` module, which simply removes duplicate records from the dataset passed as input.
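
For intuition, the same operation can be sketched with plain pandas. This assumes the transformer behaves like `DataFrame.drop_duplicates()`, which is an assumption about its internals; the `toy` frame below holds hypothetical rows, not records from the dataset.

```python
import pandas as pd

# Hypothetical rows: the first two are identical once identifiers are gone
toy = pd.DataFrame({
    "Pclass": [3, 3, 1, 3],
    "Sex": ["male", "male", "female", "male"],
    "Fare": [8.05, 8.05, 71.28, 7.25],
})

# Count fully repeated rows, then keep only the first occurrence of each
n_before = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)
n_after = int(deduped.duplicated().sum())

print(f"Duplicates before: {n_before}, after: {n_after}")
```

Once identifier columns are dropped, two different passengers with identical remaining attributes look like duplicates, so it is worth inspecting the count before removing them.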

In [None]:
# Importing class
from mlcomposer.transformers import DropDuplicates

# Applying transformer
dup_dropper = DropDuplicates()
df_nodup = dup_dropper.fit_transform(df_slct)

# Results
print(f'Total of duplicates before: {df_slct.duplicated().sum()}')
print(f'Total of duplicates after: {df_nodup.duplicated().sum()}')

<a id="3.1.4"></a>
<font color="dimgrey" size=+1.0><b>3.1.4 Modifying Dtypes</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

During the exploratory phase of the project, we noticed some numerical columns with categorical meaning in the dataset. In other words, these columns are stored with numeric types but, in practice, represent categories, as we can see in `Pclass`, `SibSp` and `Parch`. This applies most clearly to `Pclass`, so a first version transforms only the `Pclass` column.

Thus, an important step for the preparation processes to be applied correctly is converting these numerical columns into strings.
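
The effect of this conversion can be sketched with `DataFrame.astype`, using a column-to-type dictionary of the same shape as the `mod_dict` built in the next cell. The `toy` frame is hypothetical, and whether the transformer calls `astype` internally is an assumption.

```python
import pandas as pd

# Hypothetical rows with a numeric column that really encodes a category
toy = pd.DataFrame({"Pclass": [1, 3, 2], "Fare": [71.28, 7.25, 13.0]})

# Same shape as the mod_dict used with DtypeModifier: {column: target type}
mod_dict = {"Pclass": str}
converted = toy.astype(mod_dict)

# Pclass now holds strings, so it is no longer treated as numeric
print(converted["Pclass"].tolist())
print(pd.api.types.is_numeric_dtype(converted["Pclass"]))
print(pd.api.types.is_numeric_dtype(converted["Fare"]))
```

After the conversion, dtype-based column selection routes `Pclass` to the categorical pipeline instead of the numerical one.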

In [None]:
# Importing class
from mlcomposer.transformers import DtypeModifier

# Initial definitions
cat_custom_features = ['Pclass']
mod_dict = {col: str for col in cat_custom_features}
print(f'Selected columns dtype before transformation:\n')
print(df_nodup.dtypes[cat_custom_features])

# Creating object and applying transformation
dtype_mod = DtypeModifier(mod_dict=mod_dict)
df_mod = dtype_mod.fit_transform(df_nodup)
print(f'Selected columns dtype after transformation:\n')
print(df_mod.dtypes[cat_custom_features])

<a id="3.1.5"></a>
<font color="dimgrey" size=+1.0><b>3.1.5 Training and Validation Data</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Finishing what we could call the initial pipeline of the project, we have an important step responsible for splitting the dataset into training and validation sets. Thinking about a future modeling step, evaluating results on separate data is extremely important for deciding on the best practical solution to put into production.

For this, we use the `DataSplitter` class, also from the `mlcomposer.transformers` module, which applies this split and returns properly separated training and validation data.
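
As a point of reference, a hold-out split of this kind can be sketched with scikit-learn's `train_test_split`. Whether `DataSplitter` wraps this function, and which split ratio it uses by default, are assumptions here; the `toy` frame, the 80/20 ratio, the stratification and the `random_state` are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical rows: one feature plus the target column
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Separate features from the target before splitting
X = toy.drop(columns="Survived")
y = toy["Survived"]

# 80/20 split, stratified so both sets keep the same class balance
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
```

Stratifying on the target keeps the survivor/victim ratio similar in both sets, which makes validation metrics easier to compare with training metrics.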

In [None]:
# Importing class
from mlcomposer.transformers import DataSplitter

# Applying transformer
splitter = DataSplitter(target='Survived')
X_train, X_val, y_train, y_val = splitter.fit_transform(df_mod)

# Results
print(f'Shape of X_train: {X_train.shape}')
print(f'Shape of X_val: {X_val.shape}')
print(f'Shape of y_train: {y_train.shape}')
print(f'Shape of y_val: {y_val.shape}')

With that, we end the first stage regarding the analysis of transformers to be applied in an initial pipeline of the base. The idea is to apply these common steps to the base as a whole.

Next, specific pipelines will be built according to the primitive type of each column. Thus, we will have a numerical pipeline and a categorical pipeline that will compose the second transformation phase in the prep process.

<a id="3.2"></a>
<font color="dimgrey" size=+2.0><b>3.2 Numerical Pipeline</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Before starting the construction of a numerical pipeline, it is important to separate the attributes of our database according to their respective primitive types. For this, we will create two lists, one with the numerical columns of the base and one with the categorical columns.

In [None]:
# Splitting numerical features with categorical meaning
cat_custom_features = ['Pclass']

# Splitting features by dtype
num_features = list(X_train.select_dtypes(exclude=['object', 'category']).columns)
cat_features = list(X_train.select_dtypes(include=['object', 'category']).columns)

# Selecting datasets
X_train_num = X_train[num_features]
X_train_cat = X_train[cat_features]

print(f'Numerical features:\n{num_features}')
print(f'\nCategorical features:\n{cat_features}')

In this way, we can then begin the steps of building specific transformers for the numerical and categorical types of the base.

<a id="3.2.1"></a>
<font color="dimgrey" size=+1.0><b>3.2.1 Null Data</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

We saw, at the beginning of section 3, that some variables in the database have null data. These values must be treated in some way before the data is fed to predictive models. We will repeat the base overview analysis, now using only the training data (with attributes filtered and treated in the initial pipeline).

In [None]:
# New overview (computed once and filtered to the numerical features)
df_overview = data_overview(df=X_train)
print('Numerical data overview:')
df_overview.query('feature in @num_features')

The table above shows the need for handling null data in the `Age` column. Since this section concentrates the transformations on numeric attributes, we will fill in the null data of `Age` using sklearn's `SimpleImputer` class, with the median of the variable as the statistical strategy.

In [None]:
# Importing library
from sklearn.impute import SimpleImputer

# Applying transformer
imputer = SimpleImputer(strategy='median')
X_train_num_imp = imputer.fit_transform(X_train_num)
X_train_num_imp = pd.DataFrame(X_train_num_imp, columns=num_features)

# Results
print(f'Null data before imputer: {X_train_num.isnull().sum().sum()}')
print(f'Null data after imputer: {X_train_num_imp.isnull().sum().sum()}')
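One detail worth making explicit is that the median is learned from the training data and then reused on any other split. The sketch below shows this on two tiny hypothetical frames (`train` and `val` are illustrative names, not variables of this notebook).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits with missing ages
train = pd.DataFrame({'Age': [22.0, np.nan, 38.0, 26.0, 35.0]})
val = pd.DataFrame({'Age': [np.nan, 54.0]})

# The median is learned on the training split only
imputer = SimpleImputer(strategy='median')
imputer.fit(train)

# The learned statistic is then applied to the validation split
val_imp = imputer.transform(val)

print(f'Median learned on train: {imputer.statistics_[0]}')
print(f'Validation data after imputation:\n{val_imp}')
```

Here the missing validation age is filled with the training median, so no information from the validation split is used during fitting.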

Great! With the null data in the numeric attributes handled, we can move on to the optional transformations of the numerical pipeline.

<a id="3.2.2"></a>
<font color="dimgrey" size=+1.0><b>3.2.2 Log Transformation</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

To validate the impact of the logarithmic transformation on candidate predictive models, we will optionally propose a step in the pipeline that applies this procedure to the numerical features present in the base. With this, we can validate whether the final performance of the model is sensitive to this type of transformation.

In [None]:
# Example
log_ex = 'Fare'
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(17, 7))
plot_distplot(df=X_train_num, col=log_ex, ax=axs[0], hist=True,
              title=f'Original {log_ex} Distribution')

tmp_data = X_train_num.copy()
tmp_data[log_ex] = np.log1p(tmp_data[log_ex])
plot_distplot(df=tmp_data, col=log_ex, ax=axs[1], color='mediumseagreen', hist=True, 
              title=f'{log_ex} After Log Transformation')

Two highly relevant statistical measures for distribution analysis are `skew` and `kurtosis`. This [article](https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa) gives a clear idea of what each of these measures is and how to interpret continuous distributions through their values.

The logarithmic transformation tends to help with distributions that have positive skewness, that is, a long tail on the right with most values concentrated on the left. Thus, we will analyze the numerical features again and rank the features that offer the greatest opportunity for improvement through this type of transformation.
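To make the effect concrete before ranking the real features, here is a small sketch on synthetic data. The `fare_like` sample is drawn from a log-normal distribution purely as an assumption to mimic a right-skewed variable; it is not the Titanic `Fare` column.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample (an assumption, not the real Fare column)
rng = np.random.default_rng(42)
fare_like = rng.lognormal(mean=2.5, sigma=1.0, size=1000)

# Skewness before and after applying log(1 + x)
skew_before = skew(fare_like)
skew_after = skew(np.log1p(fare_like))

print(f'Skew before log1p: {skew_before:.2f}')
print(f'Skew after log1p: {skew_after:.2f}')
```

The skewness after the transformation should be much closer to zero, which is the behavior we hope to see on strongly right-skewed features.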

In [None]:
from scipy.stats import skew, kurtosis

tmp_ov = data_overview(df=X_train_num_imp)
tmp_ov['skew'] = tmp_ov.query('feature in @num_features')['feature'].apply(lambda x: skew(X_train_num_imp[x]))
tmp_ov['kurtosis'] = tmp_ov.query('feature in @num_features')['feature'].apply(lambda x: kurtosis(X_train_num_imp[x]))
tmp_ov[~tmp_ov['skew'].isnull()].sort_values(by='skew', ascending=False).loc[:, ['feature', 'skew', 'kurtosis']]

The table above ranks the features by their skewness and also shows their kurtosis (a measure of how heavy the tails of the distribution are). In the code block below, we will execute the `DynamicLogTransformation` class, which applies the logarithmic transformation to a dataset within a preparation pipeline. The advantage of this class is that the user defines beforehand the list of features to which the transformation will be applied.

In [None]:
# Importing class
from mlcomposer.transformers import DynamicLogTransformation

# Setting parameters
COLS_TO_LOG = ['Fare']
log_tr = DynamicLogTransformation(num_features=num_features, cols_to_log=COLS_TO_LOG)
X_train_num_ori = X_train_num_imp.copy()
X_train_num_log = log_tr.fit_transform(X_train_num_imp)

# Plotting results
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(17, 7))

plot_distplot(df=X_train_num_ori, col=COLS_TO_LOG[0], ax=axs[0], hist=True,
              title=f'{COLS_TO_LOG[0]} Distribution $Before$ \nLog Transformation')
plot_distplot(df=X_train_num_log, col=COLS_TO_LOG[0], ax=axs[1], hist=True, color='mediumseagreen',
              title=f'{COLS_TO_LOG[0]} Distribution $After$ \nLog Transformation')    

plt.tight_layout()

Additionally, it is worth mentioning that the `DynamicLogTransformation` class has a Boolean attribute called `application` that can be used in the future as a search parameter in `GridSearchCV` or `RandomizedSearchCV`. Its objective is to enable performance analysis of models **with** or **without** the logarithmic transformation.
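To illustrate how such an on/off switch can work, here is a minimal sketch of a sklearn-compatible transformer. The `OptionalLogTransformer` class is hypothetical and written only for this example; it is not the source code of `DynamicLogTransformation`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class OptionalLogTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: applies log1p to selected columns when enabled."""

    def __init__(self, application=True, cols_to_log=None):
        self.application = application
        self.cols_to_log = cols_to_log

    def fit(self, X, y=None):
        # Nothing to learn: the transformation is stateless
        return self

    def transform(self, X):
        X = X.copy()
        # Only transform when the flag is on and columns were provided
        if self.application and self.cols_to_log:
            X[self.cols_to_log] = np.log1p(X[self.cols_to_log])
        return X


toy = pd.DataFrame({'Fare': [0.0, 7.25, 512.33], 'Age': [22.0, 38.0, 26.0]})
logged = OptionalLogTransformer(application=True, cols_to_log=['Fare']).fit_transform(toy)
untouched = OptionalLogTransformer(application=False, cols_to_log=['Fare']).fit_transform(toy)
print(logged)
print(untouched)
```

Because the flag is a constructor argument of a `BaseEstimator`, a search object can toggle it like any other hyperparameter, for example with a key such as `log_transformer__application` when the step is named `log_transformer` in a `Pipeline`.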

<a id="3.2.3"></a>
<font color="dimgrey" size=+1.0><b>3.2.3 Normalization</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Another procedure that can help a given predictive model converge to the optimal value more quickly is the `normalization` of the data. In the context of machine learning, it is possible to use ready-made sklearn classes such as `MinMaxScaler` or `StandardScaler`.

This type of standardization / normalization can optionally be applied directly to the numerical pipeline. Below, an example of how this transformation can be applied to our numerical database will be demonstrated.

In [None]:
# Importing class
from mlcomposer.transformers import DynamicScaler

scaler = DynamicScaler(scaler_type='Standard')
X_train_num_scaled = scaler.fit_transform(X_train_num_log)
X_train_num_scaled = pd.DataFrame(X_train_num_scaled, columns=num_features)
X_train_num_scaled.head()
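For comparison, the two sklearn scalers mentioned above behave as follows on a tiny hypothetical column (the array `X` is illustrative only).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Single toy feature with five observations
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# StandardScaler: zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler: values rescaled to the [0, 1] interval
X_mm = MinMaxScaler().fit_transform(X)

print(f'Standard scaled: {X_std.ravel()}')
print(f'MinMax scaled: {X_mm.ravel()}')
```

Tree-based models are usually insensitive to this kind of rescaling, which is one reason to keep the step optional in the pipeline.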

With that, we end the preparation steps in the numerical pipeline of the project. Later, we will consolidate each of these _steps_ into a single preparation block using the `sklearn` class `Pipeline`. Next, let's look at the categorical part of the set.

<a id="3.3"></a>
<font color="dimgrey" size=+2.0><b>3.3 Categorical Pipeline</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

For this section, we already know that the null data in the `Embarked` column needs to be filled in. In the analysis carried out in section 3.2.1 above, we saw that there are only 2 null records for the column that informs the port of departure of each passenger.

Although the impact of so few records is small, we can use the same `SimpleImputer` applied previously, but with a strategy that fills in nulls with the most common entry present in the column.
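A minimal sketch of this strategy is shown below on a hypothetical `Embarked` sample (the `embarked` frame is illustrative, not the real column).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy Embarked column with two missing entries (hypothetical values)
embarked = pd.DataFrame(
    {'Embarked': pd.Series(['S', 'C', np.nan, 'S', 'Q', np.nan], dtype=object)}
)

# 'most_frequent' fills nulls with the mode of the column
cat_imputer = SimpleImputer(strategy='most_frequent')
embarked_imp = pd.DataFrame(cat_imputer.fit_transform(embarked),
                            columns=embarked.columns)

print(f'Most frequent entry: {cat_imputer.statistics_[0]}')
print(embarked_imp)
```

An alternative, used by encoders that accept a `dummy_na` option, is to keep the nulls and represent them as their own dummy column.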

<a id="3.3.1"></a>
<font color="dimgrey" size=+1.0><b>3.3.1 Encoding</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

An extremely important step involving categorical attributes of the database is the application of a process known as data encoding. The importance of this step is due to the inability of most machine learning models to read categorical entries from a database provided as input. This is because, in the background of the most common models, several numerical calculations are performed in order to minimize a cost function and reach a minimum error, which is not possible when the inputs are raw categorical values.

The encoding process converts categorical data into numbers, often separating them into different columns, one for each distinct entry. In this context, we can use the `DummiesEncoding` class of the `mlcomposer.transformers` module, which is responsible for applying this process automatically.

In [None]:
# Importing class
from mlcomposer.transformers import DummiesEncoding

# Applying transformer
encoder = DummiesEncoding(cat_features_ori=cat_features, dummy_na=True)
X_train_cat_enc = encoder.fit_transform(X_train_cat)

# Results
print(f'Shape before encoding: {X_train_cat.shape}')
print(f'Shape after encoding: {X_train_cat_enc.shape}')
X_train_cat_enc.head()
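The same kind of result can be reproduced with plain pandas, which helps to understand where column names ending in `_nan` come from. Whether `DummiesEncoding` relies on `pd.get_dummies` internally is an assumption here; the sketch only shows the technique on a toy frame.

```python
import numpy as np
import pandas as pd

# Toy categorical data with one missing port of embarkation
toy_cat = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                        'Embarked': ['S', np.nan, 'C']})

# dummy_na=True adds one extra indicator column per feature for nulls
toy_enc = pd.get_dummies(toy_cat, dummy_na=True)

print(f'Shape before encoding: {toy_cat.shape}')
print(f'Shape after encoding: {toy_enc.shape}')
print(toy_enc.columns.tolist())
```

Each original column becomes one indicator column per category plus a `_nan` indicator, so the number of columns grows with the cardinality of the features.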

<a id="3.4"></a>
<font color="dimgrey" size=+2.0><b>3.4 Complete Pipelines</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

After going through 3 large blocks involving the configuration and application of specific classes for the preparation of our database, it is possible to consolidate these steps in native objects from sklearn's Pipeline class.

In this section, we will build pipelines capable of consolidating all the steps detailed so far into single preparation objects. Such objects will be extremely important for reproducing this flow and for applying the same steps, in a standardized way, to new data (test data, for example). So, considering the previous topics, we will build:

* **_initial_pipeline:_** pipeline responsible for receiving a raw database and applying transformations common to all attributes;
* **_prep_pipeline:_** pipeline responsible for consolidating the transformations applied to numerical and categorical data in a single object.

In [None]:
# Importing libraries
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Setting global variables
TARGET = 'Survived'

INITIAL_FEATURES = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'name_title', 
                    'ticket_class', 'cabin_class', 'name_length', 'age_cat', 'fare_cat', 'family_size']
INITIAL_PRED_FEATURES = [col for col in INITIAL_FEATURES if col != TARGET]

DTYPE_MODIFICATION_DICT = {'Pclass': str}

NUM_FEATURES = ['Age', 'SibSp', 'Parch', 'Fare', 'name_length', 'family_size']
CAT_FEATURES = ['Pclass', 'Sex', 'Embarked', 'name_title', 'ticket_class', 'cabin_class', 'age_cat', 'fare_cat']

CAT_FEATURES_FINAL = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Pclass_nan', 'Sex_female', 'Sex_male', 'Sex_nan',
                      'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Embarked_nan', 'name_title_Master.', 
                      'name_title_Miss.', 'name_title_Mr.', 'name_title_Mrs.', 'name_title_OTHER',
                      'name_title_nan', 'ticket_class_A', 'ticket_class_OTHER', 'ticket_class_PC', 
                      'ticket_class_SC', 'ticket_class_STON_SOTON', 'ticket_class_nan', 'cabin_class_A', 
                      'cabin_class_B', 'cabin_class_C', 'cabin_class_D', 'cabin_class_E', 'cabin_class_FGT', 
                      'cabin_class_nan', 'age_cat_0_10', 'age_cat_10_20', 'age_cat_20_40', 'age_cat_40_60',
                      'age_cat_greater_60', 'age_cat_nan', 'fare_cat_0_8', 'fare_cat_15_25', 
                      'fare_cat_25_50', 'fare_cat_8_15', 'fare_cat_greater_50', 'fare_cat_nan']

MODEL_FEATURES = NUM_FEATURES + CAT_FEATURES_FINAL

# Variables for numerical pipeline
NUM_STRATEGY_IMPUTER = 'median'
SCALER_TYPE = None
LOG_APPLICATION = False
COLS_TO_LOG = ['Fare', 'Age']

# Variables for categorical pipeline
ENCODER_DUMMY_NA = True
NAME_TITLE = True
CABIN_CLASS = True
TICKET_CLASS = True
NAME_LENGTH = True
AGE_CAT = True
FARE_CAT = True
FAMILY_SIZE = True

# Building initial pipelines (train and prediction)
initial_train_pipeline = Pipeline([
    ('feature_adder', CustomFeaturesTitanic(name_title=NAME_TITLE, cabin_class=CABIN_CLASS, 
                                            ticket_class=TICKET_CLASS, name_length=NAME_LENGTH,
                                            age_cat=AGE_CAT, fare_cat=FARE_CAT, family_size=FAMILY_SIZE)),
    ('col_filter', ColumnSelection(features=INITIAL_FEATURES)),
    ('dtype_modifier', DtypeModifier(mod_dict=DTYPE_MODIFICATION_DICT)),
    ('dup_dropper', DropDuplicates())
])

initial_pred_pipeline = Pipeline([
    ('feature_adder', CustomFeaturesTitanic(name_title=NAME_TITLE, cabin_class=CABIN_CLASS, 
                                            ticket_class=TICKET_CLASS, name_length=NAME_LENGTH,
                                            age_cat=AGE_CAT, fare_cat=FARE_CAT, family_size=FAMILY_SIZE)),
    ('col_filter', ColumnSelection(features=INITIAL_PRED_FEATURES)),
    ('dtype_modifier', DtypeModifier(mod_dict=DTYPE_MODIFICATION_DICT))
])

# Building a numerical pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy=NUM_STRATEGY_IMPUTER)),
    ('log_transformer', DynamicLogTransformation(application=LOG_APPLICATION, num_features=NUM_FEATURES, 
                                                 cols_to_log=COLS_TO_LOG)),
    ('scaler', DynamicScaler(scaler_type=SCALER_TYPE))
])

# Building a categorical pipeline
cat_pipeline = Pipeline([
    ('encoder', DummiesEncoding(dummy_na=ENCODER_DUMMY_NA, cat_features_final=CAT_FEATURES_FINAL))
])

# Building a complete pipeline
prep_pipeline = ColumnTransformer([
    ('num', num_pipeline, NUM_FEATURES),
    ('cat', cat_pipeline, CAT_FEATURES)
])

In [None]:
# Reading raw data
df = pd.read_csv(os.path.join(DATA_PATH, TRAIN_FILENAME))

# Executing initial training pipeline
df_prep = initial_train_pipeline.fit_transform(df)

# Splitting data into training and testing
X_train, X_val, y_train, y_val = train_test_split(df_prep.drop(TARGET, axis=1), df_prep[TARGET].values,
                                                  test_size=.20, random_state=42)

# Executing preparation pipeline (fit on training data only, then apply to validation data)
X_train_prep = prep_pipeline.fit_transform(X_train)
X_val_prep = prep_pipeline.transform(X_val)

# Results
print(f'Shape of X_train_prep: {X_train_prep.shape}')
print(f'Shape of X_val_prep: {X_val_prep.shape}')
print(f'\nTotal features considered: {len(MODEL_FEATURES)}')
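A point worth keeping in mind with preparation pipelines is that they should learn their statistics from the training data and only be applied to validation data. The sketch below shows the pattern with sklearn-only components on hypothetical toy frames; `OneHotEncoder` is used here as a stand-in and is not the encoder used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy splits
train = pd.DataFrame({'Age': [22.0, np.nan, 38.0, 26.0],
                      'Sex': ['male', 'female', 'female', 'male']})
val = pd.DataFrame({'Age': [np.nan, 54.0], 'Sex': ['female', 'male']})

# Numerical and categorical branches combined in one object
prep = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median'))]), ['Age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex']),
], sparse_threshold=0)  # always return a dense array

# Statistics are learned on train and reused on validation
train_prep = prep.fit_transform(train)
val_prep = prep.transform(val)

print(f'Shape of train_prep: {train_prep.shape}')
print(f'Shape of val_prep: {val_prep.shape}')
```

The missing validation age receives the median computed on the training split, and both splits end up with the same columns in the same order.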

Perfect! At this point in the project, we are ready to start the predictive modeling stage in search of building a model capable of returning the Titanic passengers' probability of survival against the variables considered for analysis. Before diving into the topics related to survival prediction, let's use the final base prepared to plot a correlation matrix in relation to the model's target variable (`Survived`). The purpose of this study is to have a prior idea of the most important variables present in the database.

In [None]:
# Preparing a final DataFrame after transformation
df_prep = pd.DataFrame(X_train_prep, columns=MODEL_FEATURES)
df_prep['Survived'] = y_train

# Plotting a correlation matrix
plot_corr_matrix(df=df_prep, corr_col='Survived', figsize=(12, 12), cbar=False, n_vars=15)

In [None]:
plot_corr_matrix(df=df_prep, corr='negative', corr_col='Survived', figsize=(12, 12), cbar=False, n_vars=15)

<a id="4"></a>
<font color="darkslateblue" size=+2.5><b>4. Modeling: Predicting Survival</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Finally, we reach the long-awaited section where we train machine learning models in search of the best algorithm capable of returning the probability of survival of the ship's passengers.

To reach this stage, we went through a dense exploratory analysis section that helped us greatly in gathering insights and in understanding the available database, in addition to a rich section where it was possible to build definitive pipelines for preparing the dataset.

In this way, we can separate the modeling step into:

1. **_Initial definitions_**, where we will define the fundamental blocks for the beginning of the training, as for example, the models used and their respective search hyperparameters;
2. **_Training_**, where we will actually train the models selected in the previous step. For this, we will use, once again, features of the `mlcomposer` package from its `mlcomposer.trainer` module;
3. **_Evaluation_**, where, after due training of the candidate models, we will analyze the individual performances of each one against the proposed prediction task.

<a id="4.1"></a>
<font color="dimgrey" size=+2.0><b>4.1 Structuring Variables</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

As mentioned earlier, this is the stage where we prepare the structures for the start of training. Here, the objective is to import and prepare the classification models to be used to predict the survival of passengers.

In [None]:
# Importing models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Model objects
dtree = DecisionTreeClassifier()
forest = RandomForestClassifier()
lgbm = LGBMClassifier()
xgb = XGBClassifier()
adaboost = AdaBoostClassifier()
gradboost = GradientBoostingClassifier()

# Creating set_classifiers dict
model_obj = [dtree, forest, lgbm, xgb, adaboost, gradboost]
model_names = [type(model).__name__ for model in model_obj]
set_classifiers = {name: {'model': obj, 'params': {}} for (name, obj) in zip(model_names, model_obj)}

print(f'Classifiers that will be trained on next steps: \n\n{model_names}')

<a id="4.2"></a>
<font color="dimgrey" size=+2.0><b>4.2 Training Models</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Once the modeling structure has been prepared from specific objects, such as the `set_classifiers` dictionary, it is now possible to import the `BinaryClassifier` class present in the `mlcomposer.trainer` module to perform all training and evaluation of the candidate models.

This class was developed in order to greatly facilitate the work of the analyst / scientist in terms of implementing codes to train, evaluate and optimize predictive models for binary classification. Its methods include powerful features that perform various actions with just one call.

In [None]:
# Importing class
from mlcomposer.trainer import BinaryClassifier

# Creating an object and starting training
trainer = BinaryClassifier()
trainer.fit(set_classifiers, X_train_prep, y_train)

The `fit()` method of the created trainer object is responsible for training the models encapsulated in the `set_classifiers` dictionary created in the initial definitions stage.

By configuring the method to also apply the `RandomizedSearchCV` process (random search for the best hyperparameters of each algorithm), it is possible to build models optimized according to the search space passed in the `set_classifiers` dictionary.
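As a reference for what a random search does, here is a minimal sketch using sklearn directly on synthetic data. The search space in `param_distributions` is a hypothetical example, and this is not how the `fit()` method of the trainer is implemented.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (not the Titanic base)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hypothetical search space for a random forest
param_distributions = {
    'n_estimators': [20, 50],
    'max_depth': [3, 5, None],
    'min_samples_leaf': [1, 3, 5],
}

# Samples 4 combinations and evaluates each one with 3-fold cross validation
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions, n_iter=4, scoring='accuracy',
                            cv=3, random_state=42)
search.fit(X, y)

print(f'Best params: {search.best_params_}')
print(f'Best CV accuracy: {search.best_score_:.3f}')
```

In the `set_classifiers` dictionary, the `params` key of each model plays the role of this search space.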

<a id="4.3"></a>
<font color="dimgrey" size=+2.0><b>4.3 Evaluating Performance</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Once the candidate models are trained through the `fit()` method, it is then possible to evaluate the performance obtained in each case, thus returning the main classification metrics capable of indicating the best direction for the given task.

To perform this process, we can use the `evaluate_performance()` or `plot_metrics()` methods of the trainer object. In the first case, the return is an analytical DataFrame containing the result of the evaluation of each model against the main metrics. In the second case, the return is a visual analysis of the metrics for each of the models.

In [None]:
# Analytical of training results
metrics = trainer.evaluate_performance(X_train_prep, y_train, X_val_prep, y_val)
metrics

As mentioned earlier, the `evaluate_performance()` method returns an analytical table containing the performance of each model (in training and testing) for the main metrics for evaluating classification models. From this table, it is possible to point out that, in terms of accuracy, the `RandomForest` model performed slightly better than the others, despite the high time required to perform the calculations.
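For reference, the usual classification metrics found in this kind of table can be computed directly with sklearn. The labels below are hypothetical and serve only to show the calls.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f'Accuracy: {acc:.2f} | Precision: {prec:.2f} | Recall: {rec:.2f} | F1: {f1:.2f}')
```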

Thinking about setting, in fact, an optimization objective to choose the best predictive model, let's consider the accuracy as the metric to be used for this decision. Another way to analyze the performance of candidate models is from the `plot_metrics()` method. Its result can be seen below:

In [None]:
# Graphical analysis of performance
trainer.plot_metrics()

Graphically, it is possible to visualize the superiority of the `RandomForest` model in terms of accuracy. In the first row, there is a boxplot analysis considering the `k` folds used in the cross validation and, from there, it is possible to visualize the dispersion of each evaluation round for each candidate model. In the second row of the plot, there is an analysis of the average per metric and per candidate model.

___
* **_Most relevant features_**
___

Going further in the analysis of the results, it is possible to return the importance that each feature of the base had in the final result of the prediction. Let's look at the most important features considered by each model using the `plot_feature_importance()` method.

In [None]:
# Feature importances
trainer.plot_feature_importance(features=MODEL_FEATURES)

Note that this analysis depends on the `feature_importances_` attribute, which tree-based estimators expose; a model without it, such as `LogisticRegression`, cannot have this analysis produced for it. Considering the `DecisionTree` and `RandomForest` models, the bar graphs above indicate the most relevant features for predicting passenger survival in the sinking of the Titanic.
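The attribute behind this kind of plot can be inspected directly in sklearn. The sketch below uses synthetic data and hypothetical feature names (`f0` to `f4`) to show where the values come from and why some estimators cannot be ranked this way.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical feature names
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           random_state=42)
names = [f'f{i}' for i in range(5)]

# Tree-based models expose feature_importances_ after fitting
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)

# Linear models expose coef_ instead
logreg = LogisticRegression().fit(X, y)
has_attr = hasattr(logreg, 'feature_importances_')
print(f'LogisticRegression has feature_importances_: {has_attr}')
```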

___
* **_Confusion matrix_**
___

In binary classification models, it is common to analyze True Positives, True Negatives, False Positives and False Negatives individually, that is, the indicators that make up the main classification metrics seen previously. For this, the tool used is the confusion matrix, which consolidates all these indicators in a visual matrix format.

To perform this analysis, it is possible to execute the trainer object's `plot_confusion_matrix()` method. The result will be given by two confusion matrices for each model used (one for the training data and the other using the test data). Let's see:

In [None]:
# Plotting confusion matrix for the models
trainer.plot_confusion_matrix()
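The numbers behind each matrix can also be obtained with sklearn's `confusion_matrix`. The labels below are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes and columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
```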

___
* **_ROC Curve_**
___

In some cases, datasets with an unbalanced target variable, that is, an extremely small amount of records for a given class, can make the classical metrics for evaluating classification models misleading. Thus, a possible solution is to analyze the results in terms of `score` or `probability` instead of metrics computed at a fixed threshold.

In these cases, it is possible to analyze the ROC curve of the models to see if they are performing well in view of the business objective to be achieved. For that, we can execute the `plot_roc_curve()` method of the `trainer` object.

In [None]:
# Plotting ROC Curve
trainer.plot_roc_curve()

Analyzing the above training and test curves, as well as the metric `roc_auc` (area under the ROC curve) in the legend of each series, it is noticed that the `RandomForest` and `LogisticRegression` models present a better performance in relation to the model `DecisionTrees`. In the future, if the metric `roc_auc` becomes the optimization goal for this task, we can take the above analysis into consideration to decide the best algorithm.
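As a small reference, the points of a ROC curve and its area can be computed from scores with sklearn. The scores below are hypothetical.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate for each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(f'FPR: {fpr}')
print(f'TPR: {tpr}')
print(f'ROC AUC: {auc}')
```

An AUC of 0.5 corresponds to random ranking of the classes, while 1.0 means every positive record received a higher score than every negative one.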

___
* **_Probabilities Distribution_**
___

Complementing the analysis and comments above, in some cases it is extremely relevant to observe how the scores (or probabilities) of the models are distributed in relation to the whole. Such an analysis is important to verify whether the classifiers are, in fact, separating the positive and negative classes well, thus providing a clear view of the possible errors in each score range returned.

This view can be returned using the `plot_score_distribution()` method of the `trainer` object.

In [None]:
# Plotting score distribution
trainer.plot_score_distribution()

The curves above represent the distribution of the scores of each of the trained models, considering the training and test data. The break by the positive and negative classes present in the base (`y = 1` and `y = 0`) indicates how each algorithm separated the scores and how good each one is at this differentiation, considering the probability of each class between 0 and 1.

In practice, the curves are read by the density of elements of each of the classes when the probability score is in the ranges determined by the x-axis. Taking the curves for the `RandomForest` model as an example (last row of the figure), it is possible to notice that, when the probability returned by the model is between 0 and 0.4, there is a high density of elements of the negative class (blue curve), which is a good sign. On the other hand, when the model returns high probabilities, the density of elements of the positive class is much higher (red curve).

___
* **_Probabilities Distribution on Bins_**
___

Another way to analyze the distribution of the probability score of the models is through their separation into specific analysis ranges. Thus, it is possible to analyze the volumetry of each score range considering the different classes present in the base, opening the possibility to verify, in fact, if the models are separating the classes in a discrete approach.

For this, we can use the `plot_score_bins()` method of the `trainer` object as shown below:

In [None]:
# Plotting score distribution on bins
trainer.plot_score_bins()

As with the continuous distribution seen previously, the purpose of this band analysis is to verify that the models are properly separating the classes across score bands. Reading the result for the `RandomForest` model (bottom row of the figure), the volume of records belonging to the positive class (`y = 1`) increases along the x-axis bands, indicating that high scores are, in general, related to the positive class (the expected scenario).
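The counting behind this kind of chart can be sketched with pandas. The `scores` frame below is hypothetical, and the bin edges are an arbitrary choice for the example.

```python
import pandas as pd

# Hypothetical true classes and predicted probabilities
scores = pd.DataFrame({
    'y': [0, 0, 0, 1, 0, 1, 1, 1],
    'proba': [0.05, 0.15, 0.30, 0.35, 0.55, 0.62, 0.81, 0.95],
})

# Split the probability into five equal-width bands
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
scores['bin'] = pd.cut(scores['proba'], bins=edges, include_lowest=True)

# Count records of each class inside each band
counts = pd.crosstab(scores['bin'], scores['y'])
print(counts)
```

In this toy example the negative class dominates the lower bands and the positive class dominates the upper ones, which is the pattern expected from a model that separates the classes well.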

___
* **_Learning Curves_**
___

An extremely powerful tool to check if the models used suffer from some kind of problem related to bias or variance (underfitting or overfitting) is the "learning curve". With it, it is possible to follow the evolution of the errors obtained by the models, in training and in validation, as the number of samples used for training grows.

In practice, a sample evolution "step" is defined and, sequentially, the model is trained using an N + "step" number of samples and the respective errors are computed in training and validation. The results are evolution curves showing how the error behaves as the number of samples increases. To perform this analysis, we have the `plot_learning_curve()` method of the `trainer` object.

In [None]:
# Plotting learning curves
trainer.plot_learning_curve()

Analyzing the curves, it is possible to conclude that the models do not suffer from serious overfitting problems, given that the error in the training and validation data is very similar at the end of the 500 samples. To analyze other possible scenarios, the reference link can be very useful: https://www.dataquest.io/blog/learning-curves-machine-learning/
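The values plotted in a learning curve can be computed with sklearn's `learning_curve` function. The sketch below uses synthetic data and a shallow decision tree chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the Titanic base)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Train on 20%, 40%, ..., 100% of the available training folds
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=3, random_state=42), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring='accuracy')

# Average error over the folds for each training size
train_error = 1 - train_scores.mean(axis=1)
val_error = 1 - val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_error, val_error):
    print(f'{n} samples | train error: {tr:.3f} | validation error: {va:.3f}')
```

A large and persistent gap between the two errors suggests overfitting, while two high and similar errors suggest underfitting.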

___
* **_Shap Analysis_**
___

Finally, concluding the performance analysis block of the trained models, there is the `shap` analysis as a powerful way to analyze the impact of each of the base features on a given predictive model. Unlike the feature importances analysis performed earlier, the shap analysis allows you to view the impact of features according to their absolute value. In other words, it can be seen whether a high (or extremely low) value for a given feature can significantly impact the model's output.

For this rich analysis, we can use the `plot_shap_analysis()` method of the `trainer` object, passing, as main argument, the name of a model already trained by the class and present in the `classifiers_info` attribute.

In [None]:
# Shap Analysis
trainer.plot_shap_analysis(model_name='LGBMClassifier', features=MODEL_FEATURES)

As mentioned earlier, the shap analysis makes it possible to cross the value of each feature with its respective impact on the final answer of the model. As a reading example, the result obtained for the variable `Sex_male` indicates that high values (red dots) are associated with a negative impact on the model output (left side of the x axis). In other words, when `Sex_male` is high (that is, the passenger is male), the predicted probability is pushed down (lower chance of survival).

Similarly, the reading for the variable `Sex_female` indicates the opposite: female passengers had a greater chance of survival, which, in fact, was investigated and surveyed in the exploratory phase of the project.

<a id="4.4"></a>
<font color="dimgrey" size=+2.0><b>4.4 Training Flow and Visual Analysis</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Throughout the entire project, and especially in this modeling session, the powerful tools of the `mlcomposer` package were used, thus allowing a quick and rich analysis of candidate models, from training to performance evaluation and several other relevant parameters for a more assertive decision on the best model for a given task.

Throughout sections 4.2 and 4.3, each method of the `mlcomposer.trainer` module was executed individually in order to provide a specific analysis on each of the fronts involved. In practice, it is not always desirable to run each of the available methods and charts separately; often it is more convenient to perform all analyses at once.

With that in mind, two methods that carry out all the steps above in a consolidated manner were built within the `BinaryClassifier` class: `training_flow()` and `visual_analysis()`. The main difference is that only a few essential arguments are needed for training, evaluation and generation of the analysis charts to be carried out. Results can be saved in specific user-defined directories.

In the cell below, we will simulate the execution of these two methods.

In [None]:
# Creating a new object
full_trainer = BinaryClassifier()

# Training and evaluating models all at once
full_trainer.training_flow(set_classifiers, X_train_prep, y_train, X_val_prep, y_val, 
                           features=MODEL_FEATURES, random_search=True, scoring='accuracy', n_jobs=3)

# Generating visual analysis for the models all at once
full_trainer.visual_analysis(features=MODEL_FEATURES, model_shap='LGBMClassifier')

After executing the methods `training_flow()` and `visual_analysis()` specified above, there is a series of generated outputs that can be accessed in the future for a more detailed analysis.

<a id="5"></a>
<font color="darkslateblue" size=+2.5><b>5. Tuning Hyperparameters</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

After a long exploratory and preparation journey in the database available for this task, we have enough inputs to decide between some candidate models in order to propose an improvement in their performances through the tuning of hyperparameters.

So far, training has been carried out with the default configuration of every algorithm, which does not extract the full potential of the models. In this stage, we will select some candidate models that performed reasonably well in the initial stage or that may become promising after hyperparameter tuning. After this initial analysis, search dictionaries will be defined with specific hyperparameters for each of these pre-selected models. At the end, a `RandomizedSearchCV` will be applied to each candidate model; it evaluates a random sample of the possible combinations and returns the best one found among them.

Thus, we will then review the results obtained by the candidate models.

In [None]:
# Model metrics
metrics = pd.read_csv('output/metrics/metrics.csv')
metrics

Overall, taking `accuracy` as the optimization objective, the `RandomForest`, `LightGBM` and `XGBoost` models can be considered good candidates for hyperparameter tuning (the search below also includes `AdaBoost` and `GradientBoosting`). In practice, there is a good chance of improving the performance of these models, so we will build some blocks to assist us in this search.

The first step is to define dictionaries containing the search hyperparameters for each of the selected models. With these dictionaries in hand, we will build a new pipeline with:

1. Data preparation with `prep_pipeline`;
2. Application of feature selection with `FeatureSelection`;
3. Model training.

Thus, it will be possible to use the hyperparameter dictionaries to apply a random search in order to return the best possible combination, be it in relation to the modeling parameters or the base preparation parameters (such as the imputer strategy, the logarithm transformation application, the number of features to be considered, among others).
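
As a rough illustration of how a single random search can cover both preparation and modeling parameters, the sketch below tunes a small scikit-learn `Pipeline` on synthetic data. It is a simplified stand-in and not the notebook's pipeline: `SimpleImputer` and `StandardScaler` replace `prep_pipeline`, there is no `FeatureSelection` step, and every `toy_` name is hypothetical. The point to notice is the `<step name>__<parameter>` convention used in the dictionary keys.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the prepared Titanic features (assumption)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

# Hypothetical pipeline: preparation steps followed by the model
toy_pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42)),
])

# Each key addresses a parameter of one pipeline step
toy_param_grid = {
    'imputer__strategy': ['mean', 'median'],
    'scaler__with_mean': [True, False],
    'model__n_estimators': [20, 40],
    'model__max_depth': [3, 5, 7],
}

# Only n_iter combinations are sampled and cross-validated
toy_search = RandomizedSearchCV(toy_pipeline, toy_param_grid, n_iter=5,
                                scoring='accuracy', cv=3, random_state=42)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(round(toy_search.best_score_, 4))
```

Because the preparation steps live inside the pipeline, they are refit on each training fold, so choices such as the imputation strategy are evaluated without leaking information from the held-out fold.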

___

In [None]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Search parameters for the RandomForest model
forest_tunning_grid = {
    'bootstrap': [True, False],
    'class_weight': [None, 'balanced'],
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 6, 7, 9, 10],
    #'max_features': [None, 'auto', 'sqrt', 'log2'],
    #'max_leaf_nodes': np.arange(3, 50, 2),
    #'min_impurity_decrease': np.linspace(0, 1, 50),
    #'min_samples_leaf': np.arange(1, 100, 1),
    #'min_samples_split': np.arange(2, 100, 1),
    #'min_weight_fraction_leaf': np.linspace(0, 1, 50),
    'n_estimators': np.arange(300, 600, 50),
    #'oob_score': [True, False],
    'random_state': [42]
}

# Search parameters for the LightGBM model
lgbm_tunning_grid = {
    'boosting_type': ['gbdt'],
    'class_weight': [None, 'balanced'],
    #'colsample_bytree': np.linspace(.5, 1, 50),
    #'importance_type': ['split', 'gain'],
    'learning_rate': [0.003, 0.01, 0.03, 0.1, 0.3, 1, 3],
    'max_depth': np.arange(-1, 100, 2),
    #'min_child_samples': np.arange(10, 50, 1),
    #'min_child_weight': np.linspace(1e-4, 1, 100),
    'n_estimators': np.arange(300, 700, 50),
    'num_leaves': [5, 10, 15, 20],
    'objective': ['binary'],
    'random_state': [42],
    'reg_alpha': np.linspace(.0, 25, 15)
}

# Search parameters for the XGBoost model
xgboost_tunning_grid = {
    #'base_score': np.linspace(0.3, 0.7, 50),
    'booster': ['gbtree'],
    'max_depth': [3, 4, 6, 7],
    'learning_rate': [0.003, 0.01, 0.03, 0.1, 0.3, 1, 3],
    'n_estimators': np.arange(300, 700, 50),
    'objective': ['binary:logistic'],
    'seed': [42],
    'reg_alpha': np.linspace(.0, 25, 15),
    'reg_lambda': np.linspace(.0, 25, 15),
    'colsample_bylevel': [0.5, 0.7, 0.9]
}

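# Search parameters for the AdaBoost model
# Note: scikit-learn 1.2 renamed 'base_estimator' to 'estimator'; adjust the key on newer versions
adaboost_tunning_grid = {
    'base_estimator': [DecisionTreeClassifier(max_depth=7)],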
    'n_estimators': np.arange(50, 700, 50),
    'learning_rate': [0.003, 0.01, 0.03, 0.1, 0.3, 1, 3],
    'algorithm': ['SAMME', 'SAMME.R'],
    'random_state': [42]
}

# Search parameters for the GradientBoosting model
gradboost_tunning_grid = {
    'n_estimators': np.arange(50, 700, 50),
    'max_depth': [3, 6, 7, 8, 9, 10, 15],
    'max_features': [None, 'sqrt', 'log2'],  # 'auto' (same as 'sqrt' for this classifier) was removed in scikit-learn 1.3
    'max_leaf_nodes': np.arange(3, 50, 2),
    #'min_impurity_decrease': np.linspace(0, 1, 50),
    #'min_samples_leaf': np.arange(1, 100, 1),
    #'min_samples_split': np.arange(2, 100, 1),
    #'min_weight_fraction_leaf': np.linspace(0, 1, 50),
    'random_state': [42]
}
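
To get a feeling for how much of a search space a random search actually covers, the sketch below counts the combinations of the uncommented `RandomForest` entries above and draws a few random candidates from them. It uses plain lists instead of NumPy arrays and a simple sampler with replacement, so it only approximates what `RandomizedSearchCV` does internally; `forest_space` and `sampled` are hypothetical names.

```python
import math
import random

# Mirror of the uncommented RandomForest search space, written with plain lists
forest_space = {
    'bootstrap': [True, False],
    'class_weight': [None, 'balanced'],
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 6, 7, 9, 10],
    'n_estimators': list(range(300, 600, 50)),
    'random_state': [42],
}

# Size of the exhaustive grid: product of the number of options per parameter
total_combinations = math.prod(len(options) for options in forest_space.values())
print(total_combinations)  # 240

# RandomizedSearchCV evaluates n_iter candidates (10 by default)
n_iter = 10
rng = random.Random(42)
sampled = [{name: rng.choice(options) for name, options in forest_space.items()}
           for _ in range(n_iter)]

print(f'{n_iter / total_combinations:.1%} of the model grid is explored')  # 4.2%
```

The share becomes even smaller once preparation parameters are added to the same search, which is why the result of a random search is better read as "the best combination among those sampled" than as a global optimum.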

In [None]:
import time
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import roc_auc_score
from datetime import datetime

def clf_cv_performance(model, X, y, cv=5, model_name=None):
    """
    Computes the main metrics of a classification model
    using cross validation
    
    Parameters
    ----------
    :param model: estimator of the predictive model [type: estimator]
    :param X: input data of the model [type: np.array]
    :param y: target array of the model [type: np.array]
    :param cv: number of k-folds used in cross validation [type: int, default=5]
    :param model_name: label for the model in the results; defaults to the class name [type: string, default=None]
        
    Return
    ------
    :return df_performance: DataFrame with the main classification metrics [type: pd.DataFrame]
    
    Application
    -----------
    results = clf_cv_performance(model=model, X=X, y=y)
    """

    # Computing metrics using cross validation
    t0 = time.time()
    accuracy = cross_val_score(model, X, y, cv=cv, scoring='accuracy').mean()
    precision = cross_val_score(model, X, y, cv=cv, scoring='precision').mean()
    recall = cross_val_score(model, X, y, cv=cv, scoring='recall').mean()
    f1 = cross_val_score(model, X, y, cv=cv, scoring='f1').mean()

    # Probas for calculating AUC
    try:
        y_scores = cross_val_predict(model, X, y, cv=cv, method='decision_function')
    except AttributeError:
        # Some estimators (e.g. random forests) expose 'predict_proba()' but not 'decision_function()'
        y_probas = cross_val_predict(model, X, y, cv=cv, method='predict_proba')
        y_scores = y_probas[:, 1]
    auc = roc_auc_score(y, y_scores)

    # Creating a DataFrame with metrics
    t1 = time.time()
    delta_time = t1 - t0
    train_performance = {}
    if model_name is None:
        train_performance['model'] = model.__class__.__name__
    else:
        train_performance['model'] = model_name
    train_performance['approach'] = 'Final Model'
    train_performance['acc'] = round(accuracy, 4)
    train_performance['precision'] = round(precision, 4)
    train_performance['recall'] = round(recall, 4)
    train_performance['f1'] = round(f1, 4)
    train_performance['auc'] = round(auc, 4)
    train_performance['total_time'] = round(delta_time, 3)
    # Single-row DataFrame holding the metrics
    df_performance = pd.DataFrame([train_performance])

    # Adding information of measuring and execution time
    cols_performance = list(df_performance.columns)
    df_performance['anomesdia'] = datetime.now().strftime('%Y%m%d')
    df_performance['anomesdia_datetime'] = datetime.now()
    df_performance = df_performance.loc[:, ['anomesdia', 'anomesdia_datetime'] + cols_performance]

    return df_performance
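
The function above feeds `roc_auc_score` with either `decision_function()` margins or `predict_proba()` probabilities. Both work because the AUC depends only on how the scores rank positives against negatives. The sketch below is a plain-Python version of that pairwise definition, written for illustration only (`pairwise_auc` is a hypothetical helper, not part of scikit-learn or `mlcomposer`).

```python
def pairwise_auc(y_true, y_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [score for label, score in zip(y_true, y_scores) if label == 1]
    negatives = [score for label, score in zip(y_true, y_scores) if label == 0]
    if not positives or not negatives:
        raise ValueError('Both classes are needed to compute the AUC')
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]           # probability-like scores
margins = [10 * s - 3 for s in scores]   # same ranking on an unbounded scale

print(pairwise_auc(y_true, scores))   # 0.75
print(pairwise_auc(y_true, margins))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75, and rescaling the scores does not change the result.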

In [None]:
# Importing libraries
from mlcomposer.transformers import FeatureSelection
#from pycomp.ml.trainer import clf_cv_performance
from sklearn.model_selection import RandomizedSearchCV

# Preparing variables for a complete search
tunning_models_keys = ['RandomForestClassifier', 'LGBMClassifier', 'XGBClassifier', 'AdaBoostClassifier',
                       'GradientBoostingClassifier']
tunning_param_grids = [forest_tunning_grid, lgbm_tunning_grid, xgboost_tunning_grid, adaboost_tunning_grid,
                       gradboost_tunning_grid]
tunned_pipelines = {}
general_metrics = pd.DataFrame({})
pipe_model_key = 'model'

# Preparing dataset for training (train + validation)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
X = pd.concat([X_train, X_val])
y = np.concatenate((y_train, y_val))

# Iterating over each model and param grid
for model_name, param_grid in zip(tunning_models_keys, tunning_param_grids):
    
    # Retrieving information about the model
    baseline_model = trainer.get_estimator(model_name)
    feature_importance = baseline_model.feature_importances_
    general_metrics = pd.concat([general_metrics, trainer.get_metrics(model_name)])
    
    # Creating a pipeline for preparation and prediction
    tunning_pipeline = Pipeline([
        ('prep', prep_pipeline),
        ('selector', FeatureSelection(feature_importance, k=len(MODEL_FEATURES))),
        (pipe_model_key, baseline_model)
    ])
    
    # Preparing model grid for joining with pipeline grid
    model_param_grid = {pipe_model_key + '__' + k: v for k, v in param_grid.items()}
    tunning_param_grid = {
        'prep__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
        'prep__num__log_transformer__application': [True, False],
        'prep__num__scaler__scaler_type': [None, 'Standard', 'MinMax'],
        'selector__k': np.arange(5, len(MODEL_FEATURES) + 1, 2)
    } 
    tunning_param_grid.update(model_param_grid)
    
    # Applying search with RandomizedSearch
    tunning_search = RandomizedSearchCV(tunning_pipeline, tunning_param_grid, scoring='accuracy', cv=5,
                                        n_jobs=-1, verbose=0, random_state=42)
    tunning_search.fit(X, y)
    print(f'\nBest hyperparameters for {model_name} found by RandomizedSearch:\n')
    for k, v in tunning_search.best_params_.items():
        print(f'{k}: {v}')

    # Returning objects for the best combination found
    final_pipeline = tunning_search.best_estimator_
    tunned_pipelines[model_name] = final_pipeline
    
    # Computing metrics with the best combination pipeline
    final_metrics = clf_cv_performance(final_pipeline, X, y, model_name=model_name)
    metrics_cols = general_metrics.columns
    final_metrics = final_metrics.loc[:, metrics_cols]
    general_metrics = pd.concat([general_metrics, final_metrics])
    
# Final result
general_metrics

Finally, with the best combination found for each model and the resulting metrics in hand, we can make a better-informed decision about the final model used to predict survival on the official test set. The metrics table shows that, in practically all cases, the performance of the candidate models increased after the hyperparameter search. Note also that this search used the complete training data (training and validation sets combined), with performance estimated by 5-fold cross validation.

As the final model, we will select the `RandomForestClassifier` due to the good accuracy obtained in the cross validation. In the future, other more robust algorithms can be considered to check for a possible improvement in performance.

In [None]:
# Returning final pipeline
FINAL_MODEL = 'RandomForestClassifier'
final_pipeline = tunned_pipelines[FINAL_MODEL]

<a id="6"></a>
<font color="darkslateblue" size=+2.5><b>6. Submitting: Prediction Pipeline</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Finally, we can read the official test set available for the final validation of the predictive model. The goal is to simulate running the data preparation pipelines and then consuming the trained model. Additionally, we can consolidate a single pipeline that receives a raw dataset, runs the prep pipelines and applies the final model, producing an output dataset with the score and any other fields required.

In [None]:
# Reading test dataset
df_test = pd.read_csv(os.path.join(DATA_PATH, TEST_FILENAME))
print(f'Shape of test dataset: {df_test.shape}')
df_test.head()

After reading the official test set, we will import the `ModelResults` class, which receives an input dataset and enriches it with the results of an already trained predictive model. With the best pipeline found for the `Random Forest` model in hand, we can pass it as an input argument to this class, which makes it possible to build a definitive prediction pipeline to be applied to test datasets.

In [None]:
# Importing class
from mlcomposer.transformers import ModelResults

# Creating object and building a prediction pipeline
model_consumer = ModelResults(model=final_pipeline, features=INITIAL_PRED_FEATURES)

prediction_pipeline = Pipeline([
    ('initial', initial_pred_pipeline),
    ('prediction', model_consumer)
])

# Executing pipeline
df_pred = prediction_pipeline.fit_transform(df_test)
df_pred.head()

As a last step, we will prepare the final prediction base to submit the results in the Kaggle competition.

In [None]:
# Preparing submission dataset
df_sub = df_test.merge(df_pred, how='left', left_index=True, right_index=True)
df_sub = df_sub.loc[:, ['PassengerId', 'y_pred']]
df_sub.columns = ['PassengerId', 'Survived']
df_sub.to_csv('output/submission.csv', index=False)
df_sub.head()
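
Since the submission is built with a left merge on the index, a quick sanity check before uploading can catch misaligned or missing predictions. The sketch below is one possible check, assuming the two-column layout used above; `check_submission` is a hypothetical helper and the small frames are toy data, not the real predictions.

```python
import pandas as pd


def check_submission(df_sub, expected_rows):
    """Return a list with the problems found in a submission DataFrame."""
    problems = []
    if list(df_sub.columns) != ['PassengerId', 'Survived']:
        # Remaining checks rely on these two columns
        return ['unexpected columns']
    if len(df_sub) != expected_rows:
        problems.append('unexpected number of rows')
    if df_sub['PassengerId'].duplicated().any():
        problems.append('duplicated PassengerId')
    if df_sub['Survived'].isna().any():
        problems.append('missing predictions')
    elif not df_sub['Survived'].isin([0, 1]).all():
        problems.append('non-binary predictions')
    return problems


# Toy frames standing in for df_sub
toy_sub = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
broken_sub = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, None, 1]})

print(check_submission(toy_sub, expected_rows=3))     # []
print(check_submission(broken_sub, expected_rows=3))  # ['missing predictions']
```

In the notebook, the same function could be called as `check_submission(df_sub, expected_rows=len(df_test))` right before writing the CSV file.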

<a id="7"></a>
<font color="darkslateblue" size=+2.5><b>7. References</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

One great inspiration for creating custom features in this project came from the excellent notebook by Awwal Malhi: https://www.kaggle.com/awwalmalhi/titanic-eda-and-feature-engineering

GÉRON, A. Hands-On Machine Learning with Scikit-Learn and TensorFlow. [S.l.]: O’Reilly, 2017. ISBN 978149196229.

NG, A. Machine Learning. Available at: <https://www.coursera.org/learn/machine-learning/home/welcome>

Please tell me what you think about the `xplotter` and `mlcomposer` packages by leaving a comment or an upvote. Your opinion is really important and I'm really excited to show you new implementations on those packages.

* **xplotter on Github:** https://github.com/ThiagoPanini/xplotter
* **xplotter on PyPI:** https://pypi.org/project/xplotter/


* **mlcomposer on Github:** https://github.com/ThiagoPanini/mlcomposer
* **mlcomposer on PyPI:** https://pypi.org/project/mlcomposer/
___

<font size="+1" color="black"><b>You can also visit my other kernels by clicking on the buttons</b></font><br>

<a href="https://www.kaggle.com/thiagopanini/presenting-xplotter-and-mlcomposer-on-tps-may21" class="btn btn-primary" style="color:white;">TPS May 2021</a>
<a href="https://www.kaggle.com/thiagopanini/pycomp-exploring-and-modeling-housing-prices" class="btn btn-primary" style="color:white;">Housing Prices</a>
<a href="https://www.kaggle.com/thiagopanini/predicting-restaurant-s-rate-in-bengaluru" class="btn btn-primary" style="color:white;">Bengaluru's Restaurants</a>
<a href="https://www.kaggle.com/thiagopanini/sentimental-analysis-on-e-commerce-reviews" class="btn btn-primary" style="color:white;">Sentimental Analysis E-Commerce</a>