# Pipelines
**`sklearn.pipeline.Pipeline`**

sklearn has excellent [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).


A Pipeline chains sequential data transformations (scaling, etc.) with a final model.
The intermediate steps of a Pipeline must be "transformers", that is, they must implement the `fit` and `transform` methods. The final model only needs to implement `fit`.
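
To make the transformer contract concrete, here is a minimal sketch of a custom intermediate step. The class name `Log1pTransformer` and the toy data are hypothetical, introduced only for illustration. Inheriting from `TransformerMixin` and `BaseEstimator` is the usual convention, and `TransformerMixin` supplies `fit_transform` on top of `fit` and `transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical intermediate step: applies log(1 + x) element-wise."""

    def fit(self, X, y=None):
        # Nothing to learn here, but fit must still return self.
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


# Toy data (made up for the example): y is exactly 2 * log(1 + x).
X_toy = np.array([[0.0], [1.0], [3.0], [7.0]])
y_toy = 2.0 * np.log1p(X_toy).ravel()

toy_pipe = Pipeline([("log", Log1pTransformer()), ("lr", LinearRegression())])
toy_pipe.fit(X_toy, y_toy)
toy_preds = toy_pipe.predict(X_toy)
print(toy_preds)
```

Because the target is linear in the transformed feature, the linear model at the end of the pipeline should reproduce `y_toy` up to floating-point error.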


In [1]:
from sklearn.pipeline import Pipeline

In [2]:
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))


**California Housing Dataset**

Features:
* MedInc - median income in block
* HouseAge - median house age in block
* AveRooms - average number of rooms
* AveBedrms - average number of bedrooms
* Population - block population
* AveOccup - average house occupancy
* Latitude - house block latitude
* Longitude - house block longitude

Target variable: the house price (median house value for the block, in units of $100,000).

In [3]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
bunch = fetch_california_housing()
df = pd.DataFrame(bunch['data'], columns=bunch['feature_names'])
df['target'] = bunch['target']
df.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   target      20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


**Train_test_split**

In [5]:
from sklearn.model_selection import train_test_split, cross_val_score
X = df.drop('target', axis=1)
Y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)

## Basic Pipeline example

**`fit`**

Fits the model: each transformer is fitted in turn and applied to the data, and then the final model is fitted on the transformed data.
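
As a small check of what `fit` does, the sketch below fits a pipeline on synthetic data and confirms that the scaler inside the pipeline learned its statistics from the data passed to `fit`. The arrays and the `random_state` values are assumptions made for reproducibility; they are not part of the original notebook.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(size=100)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("dt", DecisionTreeRegressor(random_state=0)),
])
demo_pipe.fit(X_demo, y_demo)

# After fit, each step is a fitted estimator reachable via named_steps.
fitted_scaler = demo_pipe.named_steps["scaler"]
means_match = np.allclose(fitted_scaler.mean_, X_demo.mean(axis=0))
tree_depth = demo_pipe.named_steps["dt"].get_depth()
print(means_match, tree_depth)
```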


In [6]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
# Apply a scaler, then a Decision Tree Regressor
# each step is a (name, estimator) tuple
pipe = Pipeline([('scaler', StandardScaler()), ('dt', DecisionTreeRegressor())])

# The pipeline can be used as any other estimator
# and avoids leaking the test set into the train set
pipe.fit(X_train, y_train)


**`predict`**

Applies the fitted transformations to the data in sequence, then calls the `predict` method of the final model.
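
The same thing can be done by hand, which makes the order of operations explicit. The sketch below uses synthetic data (an assumption for self-containment) and compares the manual route with `predict` on the pipeline.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_fit = rng.normal(size=(80, 2))
y_fit = X_fit[:, 0] ** 2 + X_fit[:, 1]
X_new = rng.normal(size=(20, 2))

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("dt", DecisionTreeRegressor(random_state=0)),
]).fit(X_fit, y_fit)

# Manual equivalent of pipe_demo.predict(X_new):
# transform with every intermediate step, then predict with the last one.
X_new_scaled = pipe_demo.named_steps["scaler"].transform(X_new)
manual_preds = pipe_demo.named_steps["dt"].predict(X_new_scaled)

same = np.allclose(manual_preds, pipe_demo.predict(X_new))
print(same)
```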

In [7]:
preds = pipe.predict(X_test)
print('R2: ', r2_score(y_test, preds))
print('RMSE: ', rmse(y_test, preds))

R2:  0.6026718472205614
RMSE:  0.725087446691389
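
The code comment above notes that a pipeline avoids leaking the test set into the training set. This matters most under cross-validation, where the pipeline is refit on each training fold. The sketch below illustrates this with `cross_val_score` on synthetic data; the data, `max_depth=5`, and `cv=5` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_cv = rng.normal(size=(200, 4))
y_cv = 3.0 * X_cv[:, 0] + rng.normal(scale=0.1, size=200)

cv_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("dt", DecisionTreeRegressor(max_depth=5, random_state=0)),
])

# For each of the 5 folds the whole pipeline (scaler included) is refit
# on the training part only, so the held-out part never influences scaling.
scores = cross_val_score(cv_pipe, X_cv, y_cv, cv=5, scoring="r2")
print(scores.shape, scores.mean())
```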


**`get_params`**

Returns the parameters of every step of the Pipeline.



In [8]:
pipe.get_params()

{'memory': None,
 'steps': [('scaler', StandardScaler()), ('dt', DecisionTreeRegressor())],
 'verbose': False,
 'scaler': StandardScaler(),
 'dt': DecisionTreeRegressor(),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'dt__ccp_alpha': 0.0,
 'dt__criterion': 'squared_error',
 'dt__max_depth': None,
 'dt__max_features': None,
 'dt__max_leaf_nodes': None,
 'dt__min_impurity_decrease': 0.0,
 'dt__min_samples_leaf': 1,
 'dt__min_samples_split': 2,
 'dt__min_weight_fraction_leaf': 0.0,
 'dt__random_state': None,
 'dt__splitter': 'best'}

**`make_pipeline`**

`make_pipeline` is a convenient helper for building a Pipeline: it takes the steps as positional arguments and returns a ready-to-use pipeline, generating the step names automatically.

In [9]:
from sklearn.pipeline import make_pipeline
make_pipeline(StandardScaler(), DecisionTreeRegressor())
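
To see which names were generated, inspect `steps`. A short sketch follows; the `max_depth=3` value is an arbitrary choice for illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

auto_pipe = make_pipeline(StandardScaler(), DecisionTreeRegressor())

# Step names are the lowercased class names.
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)  # ['standardscaler', 'decisiontreeregressor']

# The generated names act as prefixes for the parameters of each step.
auto_pipe.set_params(decisiontreeregressor__max_depth=3)
print(auto_pipe.get_params()["decisiontreeregressor__max_depth"])  # 3
```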

In [10]:
pipe = make_pipeline(StandardScaler(), DecisionTreeRegressor())
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
print('R2: ', r2_score(y_test, preds))
print('RMSE: ', rmse(y_test, preds))

R2:  0.6145827346788482
RMSE:  0.7141366139745605


## Pipelines and composite estimators

Example from the [documentation](https://scikit-learn.org/stable/modules/compose.html#pipeline).

**Accessing individual steps**

In [11]:
pipe = Pipeline([('scaler', StandardScaler()), ('dt', DecisionTreeRegressor())])
pipe.steps[0]

('scaler', StandardScaler())

In [12]:
pipe[0]

In [13]:
pipe['scaler']

In [14]:
pipe['dt']

A slice of the pipeline can be obtained with standard slicing notation:

In [15]:
pipe[:1]

In [16]:
pipe[-1:]
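
The difference between indexing and slicing is easy to miss, so here is a small sketch that prints the types involved. The variable names `p` and `head` are only for this example.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

p = Pipeline([("scaler", StandardScaler()), ("dt", DecisionTreeRegressor())])

# Integer or name indexing returns the estimator itself.
first_type = type(p[0]).__name__
same_object = p["dt"] is p.steps[1][1]
print(first_type, same_object)  # StandardScaler True

# Slicing returns a new Pipeline that contains the selected steps.
head = p[:1]
print(type(head).__name__, len(head))  # Pipeline 1
```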

## Nested parameters

The parameters of each pipeline step can be accessed with the following syntax: `<estimator>__<parameter>`

In [17]:
pipe

In [18]:
# set additional parameters

pipe.set_params(scaler__with_mean=False)

This is especially convenient for grid search.

In [19]:
from sklearn.model_selection import GridSearchCV
param_grid = dict(scaler__with_mean=[True,False],
                    dt__max_depth=[2, 5, 10])
grid_search = GridSearchCV(pipe, param_grid=param_grid, verbose = True)

In [20]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


In [21]:
grid_search.best_estimator_

In [22]:
grid_search.best_params_

{'dt__max_depth': 10, 'scaler__with_mean': False}

## ColumnTransformer
[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer).

Applies transformers to columns of an array or pandas DataFrame.

Many datasets contain features of different types (categorical, numerical, text). ColumnTransformer applies a different transformation to each group of columns and concatenates the results into a single feature matrix.

In [48]:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
ct = ColumnTransformer(
     [("norm1", Normalizer(norm='l1'), [0, 1]),  # column indices
      ("norm2", Normalizer(norm='l2'), slice(2, 4))])
X = np.array([[0., 1., 2., 2.],
               [1., 1., 0., 1.]])
X

array([[0., 1., 2., 2.],
       [1., 1., 0., 1.]])

In [49]:
# Normalizer scales each row of X to unit norm. A separate scaling
# is applied for the two first and two last elements of each
# row independently.
ct.fit_transform(X)


array([[0.        , 1.        , 0.70710678, 0.70710678],
       [0.5       , 0.5       , 0.        , 1.        ]])
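
The numbers in this output can be reproduced by hand. The sketch below uses only NumPy to redo the two normalizations on the same matrix, as a check on what `Normalizer` computes per row.

```python
import numpy as np

X_norm = np.array([[0., 1., 2., 2.],
                   [1., 1., 0., 1.]])

left, right = X_norm[:, :2], X_norm[:, 2:4]

# L1: divide each row by the sum of its absolute values.
left_l1 = left / np.abs(left).sum(axis=1, keepdims=True)
# L2: divide each row by its Euclidean length.
right_l2 = right / np.sqrt((right ** 2).sum(axis=1, keepdims=True))

by_hand = np.hstack([left_l1, right_l2])
print(by_hand)
```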

**`make_column_transformer`**

Builds a ColumnTransformer from the given transformers, generating the step names automatically.



In [50]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
     (StandardScaler(), ['numerical_column']),
    (OneHotEncoder(), ['categorical_column']))
ct

In [51]:
X = pd.DataFrame([[0, 1, 6], ['a','b', 'b']], index = ['numerical_column', 'categorical_column']).T
X

Unnamed: 0,numerical_column,categorical_column
0,0,a
1,1,b
2,6,b


In [52]:
ct.fit_transform(X)

array([[-0.88900089,  1.        ,  0.        ],
       [-0.50800051,  0.        ,  1.        ],
       [ 1.3970014 ,  0.        ,  1.        ]])
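
As a sanity check, the first column of this output can be recomputed with plain NumPy. The sketch assumes nothing beyond the three values in the table above. The key detail is that `StandardScaler` divides by the population standard deviation (`ddof=0`).

```python
import numpy as np

values = np.array([0.0, 1.0, 6.0])

# StandardScaler uses the population standard deviation (ddof=0).
mean = values.mean()
std = values.std(ddof=0)
scaled = (values - mean) / std
print(scaled)

# One-hot encoding of ['a', 'b', 'b'] with categories sorted as ['a', 'b'].
categories = np.array(["a", "b"])
column = np.array(["a", "b", "b"])
one_hot = (column[:, None] == categories[None, :]).astype(float)
print(one_hot)
```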

## More complex examples

[Examples from the documentation](https://scikit-learn.org/stable/modules/compose.html#make-column-transformer).

In [53]:
import pandas as pd
X = pd.DataFrame(
     {'city': ['London', 'London', 'Paris', 'Sallisaw'],
      'title': ["His Last Bow", "How Watson Learned the Trick",
                "A Moveable Feast", "The Grapes of Wrath"],
      'expert_rating': [5, 3, 4, 5],
      'user_rating': [4, 5, 4, 3]})
X

Unnamed: 0,city,title,expert_rating,user_rating
0,London,His Last Bow,5,4
1,London,How Watson Learned the Trick,3,5
2,Paris,A Moveable Feast,4,4
3,Sallisaw,The Grapes of Wrath,5,3


For this data, we encode the `city` variable as categorical using [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) and apply [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) to the `title` column. By default, the remaining columns are dropped (`remainder='drop'`).

In [54]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
column_trans = ColumnTransformer(
     [('city_category', OneHotEncoder(dtype='int'),['city']),
      ('title_bow', CountVectorizer(), 'title')],
     remainder='drop')

column_trans.fit(X)

In the example above, CountVectorizer expects a 1D array as input, so its column is specified as a string (`'title'`). OneHotEncoder, like most transformers, expects 2D data, so its column is specified as a list (`['city']`).
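
The string-versus-list distinction mirrors ordinary pandas indexing. A small sketch follows; the `books` frame is a cut-down copy of the data above, created only for this illustration.

```python
import pandas as pd

books = pd.DataFrame({
    "city": ["London", "London", "Paris", "Sallisaw"],
    "title": ["His Last Bow", "How Watson Learned the Trick",
              "A Moveable Feast", "The Grapes of Wrath"],
})

# A string selects a Series (1D); a list of names selects a DataFrame (2D).
one_dim = books["title"]
two_dim = books[["city"]]
print(type(one_dim).__name__, one_dim.shape)
print(type(two_dim).__name__, two_dim.shape)
```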

In [58]:
column_trans.get_feature_names_out()

array(['city_category__city_London', 'city_category__city_Paris',
       'city_category__city_Sallisaw', 'title_bow__bow',
       'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
       'title_bow__how', 'title_bow__last', 'title_bow__learned',
       'title_bow__moveable', 'title_bow__of', 'title_bow__the',
       'title_bow__trick', 'title_bow__watson', 'title_bow__wrath'],
      dtype=object)

In [59]:
column_trans.transform(X).toarray()

array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]], dtype=int64)

In [61]:
pd.DataFrame(column_trans.transform(X).toarray(), columns = column_trans.get_feature_names_out())

Unnamed: 0,city_category__city_London,city_category__city_Paris,city_category__city_Sallisaw,title_bow__bow,title_bow__feast,title_bow__grapes,title_bow__his,title_bow__how,title_bow__last,title_bow__learned,title_bow__moveable,title_bow__of,title_bow__the,title_bow__trick,title_bow__watson,title_bow__wrath
0,1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,1,0,1,0,0,1,1,1,0
2,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0
3,0,0,1,0,0,1,0,0,0,0,0,1,1,0,0,1


**`make_column_selector`**

`make_column_selector` is used to select columns by dtype or by name pattern:

In [62]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_selector
ct = ColumnTransformer([
       ('scale', StandardScaler(),
       make_column_selector(dtype_include=np.number)),
       ('onehot',
       OneHotEncoder(),
       make_column_selector(pattern='city', dtype_include=object))])
res = ct.fit_transform(X)
res

array([[ 0.90453403,  0.        ,  1.        ,  0.        ,  0.        ],
       [-1.50755672,  1.41421356,  1.        ,  0.        ,  0.        ],
       [-0.30151134,  0.        ,  0.        ,  1.        ,  0.        ],
       [ 0.90453403, -1.41421356,  0.        ,  0.        ,  1.        ]])
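
A selector is simply a callable that takes a DataFrame and returns a list of column names, so it can be tried on its own. In the sketch below the `ratings` frame is a hypothetical two-row example, and the second selector uses only a name pattern.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

ratings = pd.DataFrame({
    "city": ["London", "Paris"],
    "expert_rating": [5, 4],
    "user_rating": [4, 4],
})

numeric_selector = make_column_selector(dtype_include=np.number)
city_selector = make_column_selector(pattern="city")

# Calling a selector on a DataFrame returns the matching column names.
numeric_cols = numeric_selector(ratings)
city_cols = city_selector(ratings)
print(numeric_cols)  # ['expert_rating', 'user_rating']
print(city_cols)     # ['city']
```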

If we want to keep the rating columns, we need to set `remainder='passthrough'`. Their values are appended at the end, as the last columns.

In [63]:
column_trans = ColumnTransformer(
     [('city_category', OneHotEncoder(dtype='int'),['city']),
      ('title_bow', CountVectorizer(), 'title')],
     remainder='passthrough')

column_trans.fit_transform(X)

array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 4],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 3, 5],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 5, 3]],
      dtype=int64)

In [64]:
X

Unnamed: 0,city,title,expert_rating,user_rating
0,London,His Last Bow,5,4
1,London,How Watson Learned the Trick,3,5
2,Paris,A Moveable Feast,4,4
3,Sallisaw,The Grapes of Wrath,5,3


In [65]:
from sklearn.preprocessing import MinMaxScaler
column_trans = ColumnTransformer(
     [('city_category', OneHotEncoder(), ['city']),
      ('title_bow', CountVectorizer(), 'title')],
     remainder=MinMaxScaler())
## Scaled expert_rating and user_rating
column_trans.fit_transform(X)[:, -2:]

array([[1. , 0.5],
       [0. , 1. ],
       [0.5, 0.5],
       [1. , 0. ]])
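
The scaled ratings above follow from the min-max formula `(x - min) / (max - min)` applied to each column. A NumPy-only sketch that recomputes them from the rating values in the table:

```python
import numpy as np

# Columns: expert_rating, user_rating (values copied from the table above).
ratings = np.array([[5, 4],
                    [3, 5],
                    [4, 4],
                    [5, 3]], dtype=float)

col_min = ratings.min(axis=0)
col_max = ratings.max(axis=0)
scaled = (ratings - col_min) / (col_max - col_min)
print(scaled)
```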

## Pipeline together with a column transformer

Let's load the [Titanic](https://gist.github.com/michhar/2dfd2de0d4f8727f873422c5d959fff5) passenger data.

    VARIABLE DESCRIPTIONS:
    survival        Survival
                (0 = No; 1 = Yes)
    pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
    name            Name
    sex             Sex
    age             Age
    sibsp           Number of Siblings/Spouses Aboard
    parch           Number of Parents/Children Aboard
    ticket          Ticket Number
    fare            Passenger Fare
    cabin           Cabin
    embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

    SPECIAL NOTES:
    Pclass is a proxy for socio-economic status (SES)
     1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

    Age is in Years; Fractional if Age less than One (1)
     If the Age is Estimated, it is in the form xx.5

In [66]:
df = pd.read_csv('https://grantmlong.com/data/titanic.csv')
#df = pd.read_csv('titanic.csv')
df.Age = df.Age.fillna(0)
df = df.drop(['PassengerId','Ticket','Name', 'Cabin'], axis = 1)
df = df.dropna()
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [67]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  889 non-null    int64  
 1   Pclass    889 non-null    int64  
 2   Sex       889 non-null    object 
 3   Age       889 non-null    float64
 4   SibSp     889 non-null    int64  
 5   Parch     889 non-null    int64  
 6   Fare      889 non-null    float64
 7   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 62.5+ KB


In [68]:
X = df.drop(['Survived'], axis = 1)
Y = df.Survived
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)

In [69]:
X.describe()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare
count,889.0,889.0,889.0,889.0,889.0
mean,2.311586,23.740349,0.524184,0.382452,32.096681
std,0.8347,17.562609,1.103705,0.806761,49.697504
min,1.0,0.0,0.0,0.0,0.0
25%,2.0,6.0,0.0,0.0,7.8958
50%,3.0,24.0,0.0,0.0,14.4542
75%,3.0,35.0,1.0,0.0,31.0
max,3.0,80.0,8.0,6.0,512.3292


In [70]:
X.head(1)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.25,S


In [71]:
from sklearn.preprocessing import OrdinalEncoder
ct = ColumnTransformer([
       ('scale', StandardScaler(), make_column_selector(dtype_include=np.number)),
       ('onehot', OneHotEncoder(), ['Sex']),
       ('ordinal', OrdinalEncoder(), ['Embarked'])
        ])

In [72]:
ct.fit(X_train)

In [73]:
pd.DataFrame(ct.transform(X_train)).head(2)

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.815528,-0.111904,-0.474917,-0.480663,-0.500108,1.0,0.0,2.0
1,-0.386113,1.462037,-0.474917,-0.480663,-0.435393,1.0,0.0,2.0


In [74]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score
pipe = Pipeline([('ct', ct), ('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
print('Accuracy', accuracy_score(y_test,pred))
print('F1', f1_score(y_test,pred))

Accuracy 0.7892376681614349
F1 0.718562874251497


## ColumnTransformer + Pipeline + GridSearch

In [75]:
param_grid = dict(scaler__with_mean=[True, False],
                  knn__n_neighbors=np.arange(1, 30))
gs = GridSearchCV(pipe, param_grid=param_grid, verbose = True)
gs.fit(X_train, y_train)
best_params = gs.best_params_
gs.best_params_

Fitting 5 folds for each of 58 candidates, totalling 290 fits


{'knn__n_neighbors': 4, 'scaler__with_mean': True}

In [76]:
pred = gs.predict(X_test)
print('Accuracy', accuracy_score(y_test,pred))
print('F1', f1_score(y_test,pred))

Accuracy 0.7937219730941704
F1 0.7088607594936709


**Setting best parameters**

In [77]:
pipe = Pipeline([('ct', ct), ('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
pipe.set_params(**gs.best_params_)
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
print('Accuracy', accuracy_score(y_test,pred))
print('F1', f1_score(y_test,pred))

Accuracy 0.7937219730941704
F1 0.7088607594936709
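
The two runs above give identical scores because `GridSearchCV`, with its default `refit=True`, already refits the best configuration on the full training set. A minimal sketch on synthetic data shows the equivalence; the data from `make_classification` and the small parameter grid are assumptions for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

base_pipe = Pipeline([("scaler", StandardScaler()),
                      ("knn", KNeighborsClassifier())])
search = GridSearchCV(base_pipe, {"knn__n_neighbors": [1, 3, 5, 7]})
search.fit(X_tr, y_tr)

# Manual route: a fresh copy of the pipeline with the best parameters.
manual_pipe = clone(base_pipe).set_params(**search.best_params_)
manual_pipe.fit(X_tr, y_tr)

# With refit=True, best_estimator_ is already fitted on all of X_tr.
identical = np.array_equal(search.best_estimator_.predict(X_te),
                           manual_pipe.predict(X_te))
print(identical)
```

In practice this means `gs.predict` or `gs.best_estimator_` can be used directly, and setting the parameters by hand is only needed when the pipeline has to be retrained on different data.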
