**What is pipelining?**

Pipelining chains the multiple steps of a machine learning workflow into a single estimator.
Preprocessing steps, such as scaling, can be combined this way.
The model itself can be chained after the preprocessing steps as the final step.

*Imports and fetching data*

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data=pd.read_csv('../input/real-estate-price-prediction/Real estate.csv',index_col='No')
data.head()

**Here we combine a StandardScaler and a Ridge regression model into a single estimator.**


How to make a pipeline:

Build a Pipeline object and provide it with a list of steps. Each step is a tuple consisting of the name of the step (any string of your choice) and an instance of an estimator.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

pipe=Pipeline([("StandardScaler",StandardScaler()),("ridge",Ridge())])

We now have a pipeline object, which is itself an estimator, so we can call its fit method.
When fit is called, the train data is first used to fit the standard scaler, then the train data is transformed by the fitted scaler, and finally the transformed data is used to fit our Ridge regression model.
We can also call the score method on the pipeline object.
When score is called, the test data goes through the same preprocessing: it is first transformed by the standard scaler (which was fitted on the train data), and the transformed test data is then passed to our Ridge regression model to calculate the score.
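As a rough illustration of what fit and score do internally, the sketch below performs the same steps by hand and compares the result with the pipeline. It uses synthetic data from `make_regression` as a stand-in (an assumption, not the real-estate dataset), and the `demo_`/`manual_` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's real-estate data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

# Pipeline version: one fit call, one score call
demo_pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
demo_pipe.fit(Xd_train, yd_train)
pipe_score = demo_pipe.score(Xd_test, yd_test)

# Manual version: fit the scaler on train data, transform, then fit the model
manual_scaler = StandardScaler().fit(Xd_train)
manual_ridge = Ridge().fit(manual_scaler.transform(Xd_train), yd_train)
# The test data is only transformed, never used to fit the scaler
manual_score = manual_ridge.score(manual_scaler.transform(Xd_test), yd_test)

print(pipe_score, manual_score)
```

The two scores should agree, since the pipeline runs the same sequence of fit and transform calls.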

In [None]:
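# Split the data into features and target, then into train and test sets
from sklearn.model_selection import train_test_split
target_col = 'Y house price of unit area'
# Keep the target as a one-column DataFrame named 'Price'
y = data[[target_col]].rename(columns={target_col: 'Price'})
# Drop the target without modifying data in place, so the cell can be re-run
X = data.drop(columns=target_col)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)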

In [None]:
# Fit our pipeline with the train data
pipe.fit(X_train,y_train)

In [None]:
# R^2 score on the test data, then on the train data
print(pipe.score(X_test,y_test))
print(pipe.score(X_train,y_train))


Just as we can pass a model instance to GridSearchCV, we can pass a pipeline object too.
This helps with parameter optimization. Grid search with a pipeline works the same way as with any other estimator: we define a parameter grid and then build a GridSearchCV from the pipeline and the parameter grid.

The only difference between GridSearchCV with a model and GridSearchCV with a pipeline is in how the parameter grid is defined. For each parameter, we need to specify which step it belongs to.
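One way to find the valid parameter names for a grid is to list them with `get_params`. The sketch below does this for a small pipeline; the step names `step1` and `step2` and the variable names are chosen for illustration only.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

named_pipe = Pipeline([("step1", StandardScaler()), ("step2", Ridge())])

# Parameters of the inner steps are exposed as "<step name>__<parameter name>"
step_params = sorted(name for name in named_pipe.get_params() if "__" in name)
print(step_params)

# set_params accepts the same naming scheme that a parameter grid uses
named_pipe.set_params(step2__alpha=10)
print(named_pipe.named_steps["step2"].alpha)
```

Any name in the printed list can be used as a key in the parameter grid.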


In [None]:
# Key format: step name, then a double underscore, then the name of the parameter to search over
param={"step2__alpha":[0.01,0.1,1,10,100]}

In [None]:
from sklearn.model_selection import GridSearchCV
pipe2=Pipeline([('step1',StandardScaler()),('step2',Ridge())])
grid=GridSearchCV(pipe2,param_grid=param,cv=5)
grid.fit(X_train,y_train)
print("Best CV score (mean R^2) : ",grid.best_score_)
print("Best parameter : ",grid.best_params_)
print("Train score : ",grid.score(X_train,y_train))
print("Test score : ",grid.score(X_test,y_test))

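During cross-validation, the StandardScaler in step 1 is fitted only on the training folds of each split. After the search, the best pipeline is refit on the whole train data. The test data is only used for calculating the score, which for a regressor is R^2 rather than accuracy.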
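A quick way to check this is to compare the statistics stored in the refit scaler with those of the train data. The sketch below does so on synthetic data (an assumption, not the real-estate dataset); the `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's real-estate data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

demo_grid = GridSearchCV(
    Pipeline([("step1", StandardScaler()), ("step2", Ridge())]),
    param_grid={"step2__alpha": [0.01, 0.1, 1, 10, 100]},
    cv=5,
)
demo_grid.fit(Xd_train, yd_train)

# With the default refit=True, best_estimator_ is the whole pipeline refit on the train data
refit_scaler = demo_grid.best_estimator_.named_steps["step1"]
# The stored means match the train data, so the test data played no part in fitting
print(np.allclose(refit_scaler.mean_, Xd_train.mean(axis=0)))
```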

A pipeline can contain steps such as feature extraction, scaling of data, feature selection, and regression or classification.
The only requirement is that every step except the last must have a transform method, so that each of these steps produces a new representation of the data that the next step in the pipeline can use.
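As an illustration of this requirement, the sketch below builds a three-step pipeline in which the first two steps are transformers and the last is the model. The choice of `SelectKBest` with `f_regression` and the synthetic data are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's real-estate data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=10.0, random_state=0
)

multi_pipe = Pipeline([
    ("scale", StandardScaler()),                            # has transform
    ("select", SelectKBest(score_func=f_regression, k=3)),  # has transform
    ("ridge", Ridge()),                                     # final step: only fit/predict needed
])
multi_pipe.fit(X_demo, y_demo)

# Slicing a pipeline returns a sub-pipeline; [:-1] keeps only the transformers
X_transformed = multi_pipe[:-1].transform(X_demo)
print(X_transformed.shape)

# Every step except the last exposes transform; Ridge does not
intermediate_ok = all(hasattr(est, "transform") for _, est in multi_pipe.steps[:-1])
last_has_transform = hasattr(multi_pipe.steps[-1][1], "transform")
print(intermediate_ok, last_has_transform)
```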

**make_pipeline**

The make_pipeline function creates a pipeline and automatically names each step after its class, in lowercase.

In [None]:
from sklearn.pipeline import make_pipeline
pipe3=make_pipeline(StandardScaler(),Ridge())

# The names of the steps can be seen using the steps attribute
# If several steps have the same class, a number is appended to their names
print(pipe3.steps)
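To see the automatic numbering, here is a small sketch with a class that appears twice. The `PCA` step and the `repeat_pipe` name are included only for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler appears twice, so both of its step names get a numeric suffix
repeat_pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
step_names = [name for name, _ in repeat_pipe.steps]
print(step_names)
```

The step that appears only once keeps its plain lowercase class name.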

In [None]:
#fitting the pipeline using train data
pipe3.fit(X_train,y_train)

To access the steps in a pipeline, we can use the **named_steps** attribute.
It is a dictionary-like object with keys equal to the step names and values equal to the estimators.

In [None]:
print(pipe3.named_steps.keys())
print(pipe3.named_steps.values())
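A common use of named_steps is to inspect what a fitted step has learned. The sketch below reads the Ridge coefficients and the scaler means from a fitted pipeline; the synthetic data and the `inspect_pipe` name are assumptions for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's real-estate data)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
inspect_pipe = make_pipeline(StandardScaler(), Ridge()).fit(X_demo, y_demo)

# Look up a fitted step by its automatically generated name
ridge_step = inspect_pipe.named_steps["ridge"]
print(ridge_step.coef_)

# Indexing the pipeline by step name returns the same object
print(inspect_pipe["ridge"] is ridge_step)

# The scaler's learned statistics are available the same way
print(inspect_pipe.named_steps["standardscaler"].mean_)
```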

**GridSearchCV with make_pipeline**

In [None]:
print(pipe3.steps)

In [None]:
# Same alpha grid as before; the key now uses the step name generated by make_pipeline
param={"ridge__alpha":[0.01,0.1,1,10,100]}

In [None]:
grid2=GridSearchCV(pipe3,param,cv=5)
grid2.fit(X_train,y_train)

In [None]:
print("best estimator ",grid2.best_estimator_)
print("best param ",grid2.best_params_)
print("test score ",grid2.score(X_test,y_test))
print("train score ",grid2.score(X_train,y_train))