# Introduction to Pipelines

Pipelines are a simple way to keep your data preprocessing and modeling code organized.

Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Benefits of pipelines:

1. Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
2. Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
3. Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
4. More Options for Model Validation: You will see an example in the next tutorial, which covers cross-validation.

With a pipeline, it is easy to deal with both categorical and numerical data.


## Purpose of using a pipeline

1. If you want to apply different preprocessing to different columns of the same dataset, you can do it in one step by creating a pipeline.

2. It lets you cross-validate the entire pipeline, i.e., the model together with its preprocessing. Because the preprocessing is refit on each training fold, the cross-validation results are more reliable.

3. When you predict on out-of-sample data, you do not need to repeat the preprocessing steps that you applied to the training data before feeding the data to the model; the pipeline does that for you.

4. If your test data contains categories that are not present in the training data (say the training data contains 'B' and 'C' while the test data contains 'B', 'C', and 'D'), the pipeline can handle that too, provided the encoder is configured for it, as with OneHotEncoder(handle_unknown='ignore') used later in this tutorial.
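
As a rough illustration of point 4, here is a minimal sketch on made-up data (the column name grade and its values are hypothetical, not part of the Melbourne dataset). It assumes the encoder is created with handle_unknown='ignore'; in that case an unseen category is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the training set only contains 'B' and 'C'
train = pd.DataFrame({"grade": ["B", "C", "B"]})
# The test set additionally contains the unseen category 'D'
test = pd.DataFrame({"grade": ["B", "C", "D"]})

# handle_unknown='ignore' tells the encoder not to fail on unseen categories
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# One column per category learned from the training data ('B', 'C')
encoded_test = encoder.transform(test).toarray()
print(encoder.categories_)
print(encoded_test)  # the row for 'D' contains only zeros
```

With the default handle_unknown='error', the same transform call would raise an error on 'D', so this behaviour comes from the encoder setting rather than from the Pipeline class itself.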





We construct a pipeline in three steps.

In [11]:
# importing dataset to work on
import pandas as pd

df = pd.read_csv('datasets/melb_data.csv')
df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

As we can see, the dataset has both categorical (object) and numerical (int64, float64) columns.
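
If you prefer to list the two groups programmatically instead of reading them off df.info(), one possible approach is pandas' select_dtypes. The sketch below uses a tiny hand-made frame (a few values copied from the first rows shown above) so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Tiny stand-in for the real dataset, just to keep the example self-contained
toy = pd.DataFrame({
    "Rooms": [2, 2],
    "Price": [1480000.0, 1035000.0],
    "Suburb": ["Abbotsford", "Abbotsford"],
})

# Numerical columns: any integer or float dtype
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical in this sketch
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```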

In [13]:
# checking null values

df.isnull().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                62
Landsize            0
BuildingArea     6450
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64

In [27]:
# creating training and testing data

from sklearn.model_selection import train_test_split

X = df.drop(['Price'], axis=1)
y = df.Price

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [28]:
X_train.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
12167,St Kilda,11/22 Charnwood Cr,1,u,S,hockingstuart,29/07/2017,5.0,3182.0,1.0,1.0,1.0,0.0,,1940.0,Port Phillip,-37.85984,144.9867,Southern Metropolitan,13240.0
6524,Williamstown,18 James St,2,h,SA,Hunter,17/09/2016,8.0,3016.0,2.0,2.0,1.0,193.0,,,Hobsons Bay,-37.858,144.9005,Western Metropolitan,6380.0
8413,Sunshine,10 Dundalk St,3,h,S,Barry,8/04/2017,12.6,3020.0,3.0,1.0,1.0,555.0,,,Brimbank,-37.7988,144.822,Western Metropolitan,3755.0
2919,Glenroy,1/2 Prospect St,3,u,SP,Brad,18/06/2016,13.0,3046.0,3.0,1.0,1.0,265.0,,1995.0,Moreland,-37.7083,144.9158,Northern Metropolitan,8870.0
6043,Sunshine North,35 Furlong Rd,3,h,S,First,22/05/2016,13.3,3020.0,3.0,1.0,2.0,673.0,673.0,1970.0,Brimbank,-37.7623,144.8272,Western Metropolitan,4217.0


In [29]:
y_train.head()

12167    481000.0
6524     895000.0
8413     651500.0
2919     482500.0
6043     591000.0
Name: Price, dtype: float64

In [30]:
# from data preprocessing to pipeline


# Step 1: Define Preprocessing Steps

Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps. The code below:

1. imputes missing values in numerical data, and


2. imputes missing values and applies a one-hot encoding to categorical data.

In [31]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

In [32]:
# preprocessing for numerical data

# defining how to preprocess

# start for numerical
# strategy='constant' with no fill_value replaces missing numerical values with 0
numerical_transformer = SimpleImputer(strategy='constant')
# end for numerical

# strategy : str, default='mean'
#     The imputation strategy.

#     - If "mean", then replace missing values using the mean along
#       each column. Can only be used with numeric data.
#     - If "median", then replace missing values using the median along
#       each column. Can only be used with numeric data.
#     - If "most_frequent", then replace missing using the most frequent
#       value along each column. Can be used with strings or numeric data.
#       If there is more than one such value, only the smallest is returned.
#     - If "constant", then replace missing values with fill_value. Can be
#       used with strings or numeric data.

#     .. versionadded:: 0.20
#        strategy="constant" for fixed value imputation.
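
To make the strategies described above concrete, the following sketch applies each of them to one made-up column with a single missing value and records what the missing entry becomes. The array X_toy is hypothetical; the constant strategy is used without fill_value, which for numerical data falls back to 0.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column; the fourth entry is missing
X_toy = np.array([[1.0], [2.0], [2.0], [np.nan], [7.0]])

filled = {}
for strategy in ["mean", "median", "most_frequent", "constant"]:
    imputer = SimpleImputer(strategy=strategy)
    # Keep only the value that replaced the missing entry (row index 3)
    filled[strategy] = float(imputer.fit_transform(X_toy)[3, 0])

print(filled)
# {'mean': 3.0, 'median': 2.0, 'most_frequent': 2.0, 'constant': 0.0}
```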

In [33]:
# preprocessing for categorical data

# defining how to preprocess

# start for categorical
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)
# end for categorical

In [34]:
# combining the above two ===> Bundle preprocessing for numerical and categorical data
# Columns not listed below are dropped (ColumnTransformer's default is remainder='drop')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, ['Car', 'BuildingArea', 'YearBuilt']),
        ('onehot', categorical_transformer, ['CouncilArea'])
    ]
)
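
To see what such a preprocessor produces, here is a small sketch on a made-up four-row frame. The frame toy and the name toy_preprocessor are illustrative only; the column names mirror the ones used above. It assumes the default remainder='drop', so the Address column, which is not listed in any transformer, does not appear in the output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with one missing value per preprocessed column
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0, 0.0],
    "CouncilArea": pd.Series(["Yarra", np.nan, "Moreland", "Yarra"], dtype=object),
    "Address": ["a", "b", "c", "d"],  # not listed below, so it is dropped
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Missing numbers become 0 (default fill_value for numerical data)
        ("num", SimpleImputer(strategy="constant"), ["Car"]),
        # Missing categories become the most frequent one, then get one-hot encoded
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["CouncilArea"]),
    ]
)

out = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; normalize to dense
out = out.toarray() if hasattr(out, "toarray") else out

# Columns: Car, CouncilArea=Moreland, CouncilArea=Yarra
print(out.shape)
print(out)
```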

# Step 2: Define the model

In [35]:
# Defining the model to train and predict

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

# Step 3: Create and Evaluate the Pipeline

Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to notice:

1. With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables!)


2. With the pipeline, we supply the unprocessed features in X_test to the predict() command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the test data before making predictions.)
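
The contrast in the two points above can be sketched on toy data. This is only an illustration: the arrays are made up, and LinearRegression is used as a lightweight stand-in for the random forest so that the two approaches can be compared exactly.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data with missing values in both training and new data
X_tr = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_tr = np.array([2.0, 4.0, 6.0, 8.0])
X_new = np.array([[np.nan], [3.0]])

# Without a pipeline: every preprocessing step is called by hand,
# and the new data must be transformed with the already-fitted imputer
imputer = SimpleImputer(strategy="mean")
manual_model = LinearRegression()
manual_model.fit(imputer.fit_transform(X_tr), y_tr)
manual_preds = manual_model.predict(imputer.transform(X_new))

# With a pipeline: one fit call and one predict call on raw data
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])
pipe.fit(X_tr, y_tr)
pipe_preds = pipe.predict(X_new)

# Both approaches give the same predictions
print(np.allclose(manual_preds, pipe_preds))  # True
```

The manual version is short here because there is only one preprocessing step; with separate handling for numerical and categorical columns, the number of intermediate objects to keep in sync grows quickly.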

In [36]:
# Bundle preprocessing and modeling code in a pipeline

my_pipeline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ]
)

In [37]:
# preprocessing of training data and fitting model

my_pipeline.fit(X_train, y_train)

# preprocessing of test data and get predictions

preds = my_pipeline.predict(X_test)

In [38]:
# Evaluation of model
from sklearn.metrics import mean_absolute_error

score = mean_absolute_error(y_test, preds)
print("MAE: ", score)

MAE:  287656.91673759103
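
Because a pipeline behaves like a single estimator, it can also be passed directly to scikit-learn's cross_val_score, so that the preprocessing is refit inside every fold. The sketch below runs on synthetic data (the columns rooms and area, the target formula, and the name demo_pipeline are all invented for illustration) and is not meant to reproduce the score above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic, hypothetical data: 60 rows, two numerical features
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "rooms": rng.integers(1, 6, size=60).astype(float),
    "area": rng.normal(120, 30, size=60),
})
# Build the target before introducing missing values
y_demo = 100_000 * X_demo["rooms"] + 1_000 * X_demo["area"]
# Knock out every 7th 'area' value so the imputer has something to do
X_demo.loc[::7, "area"] = np.nan

demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# scikit-learn reports negative MAE, so multiply by -1 to get MAE per fold
scores = -1 * cross_val_score(
    demo_pipeline, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
print("MAE per fold:", scores)
print("Average MAE:", scores.mean())
```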


Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.

#### Author: Piyush Kumar
[Github](https://github.com/styles3544)