In [32]:
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn Pipelines

## Manual Definition of a Pipeline

### Loading the dataset

We use the Titanic dataset, available on [Kaggle](https://www.kaggle.com/c/titanic/data?select=train.csv).

In [None]:
!pip install gdown

In [None]:
!gdown https://drive.google.com/uc?id=1nUojuf_X8r3MMEpa60PZkes5rk1eCueI

In [35]:
import pandas as pd
from category_encoders import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer

## Defining a workflow with a Scikit-learn pipeline
* The same analysis as before, but with each step wrapped in a pipeline object
* At the end, a pipeline composed of other pipelines is executed
* A pipeline chains one or more transformations, such as normalization or encoding (objects that implement `transform()`), and ends with a classifier (which implements `fit()`)
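
The structure described above can be sketched on synthetic data. This is a minimal illustration, not the notebook's Titanic workflow: the toy arrays and the step names `scaler` and `clf` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, linearly separable toy data (not the Titanic dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),   # transformer: fit() + transform()
    ("clf", LogisticRegression()),  # final estimator: fit() + predict()
])

# fit() fits and applies each transformer in order, then fits the classifier.
pipe.fit(X, y)

# predict() reuses the fitted transformers before calling the classifier,
# so it matches applying the two fitted steps by hand.
predictions = pipe.predict(X)
manual = pipe.named_steps["clf"].predict(pipe.named_steps["scaler"].transform(X))
```

Here `predictions` and `manual` hold the same labels, which is the practical benefit of a pipeline: the preprocessing learned during training is replayed automatically at prediction time.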

In [36]:
titanic_data = pd.read_csv('../datasets/pipelines/train_titanic.csv')

# drop the name, ticket, and cabin columns
titanic_data.drop(["Name", "Ticket", "Cabin"], axis=1, inplace=True)

Splitting the data into training and test sets

In [37]:
from sklearn.model_selection import train_test_split
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    titanic_data.drop(['Survived'], axis=1),
    titanic_data['Survived'],
    test_size=0.3,
    random_state=42)

### Creating the pipelines

Pipeline for preprocessing the `Age` and `Fare` variables

In [38]:
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])
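
To make the imputation step concrete, here is a small sketch of what `SimpleImputer(strategy='median')` does to a column with a missing value. The `ages` array is made up for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value (illustrative, not Titanic data).
ages = np.array([[22.0], [38.0], [np.nan], [26.0], [35.0]])

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(ages)

# The median is learned from the observed values only: 22, 26, 35, 38.
learned_median = imputer.statistics_[0]  # 30.5
```

Inside a pipeline the median is learned on the training split during `fit` and then reused unchanged on the test split, which avoids leaking test statistics into preprocessing.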

Pipeline for preprocessing the `Sex` and `Embarked` variables

In [39]:
cat_transformer = Pipeline(steps=[
    ('one-hot encoder', OneHotEncoder())
])

Combining the preprocessors

In [40]:
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, ['Age', 'Fare']),
    ('cat', cat_transformer, ['Sex', 'Embarked'])
])
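
A point worth checking is what happens to columns that are not listed in any transformer. The sketch below uses a made-up four-row frame and, as an assumption, scikit-learn's own `OneHotEncoder` instead of the `category_encoders` one used in this notebook. With the default `remainder='drop'`, unlisted columns are discarded, so in the Titanic data columns such as `Pclass`, `SibSp`, and `Parch` would not reach the classifier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Made-up frame with one column (PassengerId) that no transformer selects.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "male"],
})

ct = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
    ("cat", OneHotEncoder(), ["Sex"]),
])

# Outputs are concatenated side by side: 2 numeric + 2 one-hot columns.
out = ct.fit_transform(toy)
n_rows, n_cols = out.shape
```

Passing `remainder='passthrough'` to `ColumnTransformer` would keep the unlisted columns instead of dropping them.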

### Defining the complete pipeline

Create the models by combining the preprocessor with a `LogisticRegression` classifier and a `GaussianNB` classifier

In [41]:
model_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('lr', LogisticRegression())
])

model_nb = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('nb', GaussianNB())
])
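
Both pipelines above receive the same `preprocessor` object. Assuming the default `memory=None`, `Pipeline` does not copy its steps, so fitting one model refits the preprocessor that the other model also holds. That is harmless here because both are fitted on the same training data, but the sketch below shows how `sklearn.base.clone` could give each pipeline an independent copy. `StandardScaler` is only a stand-in for the notebook's `preprocessor`.

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

shared = StandardScaler()  # stand-in for the notebook's `preprocessor`

# Same object placed in two pipelines: both refer to one transformer.
pipe_a = Pipeline(steps=[("preprocessor", shared), ("lr", LogisticRegression())])
pipe_b = Pipeline(steps=[("preprocessor", shared), ("nb", GaussianNB())])
same_object = pipe_a.named_steps["preprocessor"] is pipe_b.named_steps["preprocessor"]

# clone() builds a new, unfitted estimator with the same parameters.
pipe_c = Pipeline(steps=[("preprocessor", clone(shared)), ("nb", GaussianNB())])
independent = pipe_c.named_steps["preprocessor"] is not shared
```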


Fit the models

In [42]:
model_lr.fit(X_train, y_train)
model_nb.fit(X_train, y_train)

### Evaluating the models defined in the pipelines

In [43]:
print("Score Logistic Regression: {:.2f}".format(model_lr.score(X_test, y_test)))
print("Score Naive Bayes: {:.2f}".format(model_nb.score(X_test, y_test)))

Score Logistic Regression: 0.78
Score Naive Bayes: 0.71
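
For classifiers, `score` reports mean accuracy on the data passed in. The following sketch, on synthetic data with assumed step names, checks that a pipeline's `score` agrees with `accuracy_score` applied to its predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a simple split (not the Titanic dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("lr", LogisticRegression())])
pipe.fit(X_tr, y_tr)

# score() transforms X_te with the fitted steps, predicts, and
# returns the fraction of correct labels.
score = pipe.score(X_te, y_te)
accuracy = accuracy_score(y_te, pipe.predict(X_te))
```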


## Exercise

### Tasks
* Add a [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for the `Fare` variable
* Add a [discretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer) for the `Age` variable

In [51]:
age_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("discretizer", KBinsDiscretizer(n_bins=5, encode="onehot-dense"))
])


preprocessor2 = ColumnTransformer(transformers=[
    # Impute and discretize 'Age' only
    ('age_bin', age_pipeline, ['Age']),

    # Apply StandardScaler to 'Fare' only
    ('fare_scaler', StandardScaler(), ['Fare']),

    # cat_transformer is the one-hot encoding pipeline defined earlier
    ('cat', cat_transformer, ['Sex', 'Embarked'])
])
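
To see what the discretizer produces, here is a small sketch on made-up ages. It uses `strategy="uniform"` and `n_bins=3` as assumptions to keep the bins easy to read, whereas the cell above relies on the default strategy with five bins.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up ages spanning 2 to 68 (illustrative, not Titanic data).
ages = np.array([[2.0], [15.0], [24.0], [31.0], [47.0], [68.0]])

# Three equal-width bins, returned as dense one-hot columns.
disc = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform")
binned = disc.fit_transform(ages)

# One column per bin; each row has a single 1 marking its bin.
row_sums = binned.sum(axis=1)
edges = disc.bin_edges_[0]
```

Each age becomes a row with exactly one active column, so the single `Age` feature is replaced by as many indicator features as there are bins.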

In [52]:
model_lr2 = Pipeline(steps=[
    ('preprocessor', preprocessor2),
    ('lr', LogisticRegression())
])

model_nb2 = Pipeline(steps=[
    ('preprocessor', preprocessor2),
    ('nb', GaussianNB())
])

In [53]:
model_lr2.fit(X_train, y_train)
model_nb2.fit(X_train, y_train)

In [55]:
print("Score Logistic Regression: {:.2f}".format(model_lr.score(X_test, y_test)))
print("Score Naive Bayes: {:.2f}".format(model_nb.score(X_test, y_test)))

print("Score Logistic Regression 2: {:.2f}".format(model_lr2.score(X_test, y_test)))
print("Score Naive Bayes 2: {:.2f}".format(model_nb2.score(X_test, y_test)))

Score Logistic Regression: 0.78
Score Naive Bayes: 0.71
Score Logistic Regression 2: 0.78
Score Naive Bayes 2: 0.63
