# What is a Pipeline in Machine Learning?

A pipeline contains multiple transformers (or even models!) and performs operations on data IN SEQUENCE.  
Compare this to a ColumnTransformer, which applies different transformers to different columns IN PARALLEL.  
When a pipeline is fit on data, all of the transformers inside it are fit.  
When data is transformed using a pipeline, the data is transformed by the first transformer first, the second transformer second, etc.  
A pipeline can contain any number of transformers, called 'steps', as long as each one has .fit() and .transform() methods.  
If needed, a single estimator (a model) can be placed at the end of a pipeline as its final step.  
The important thing to remember is that pipelines are ordered, so the order you use to build them matters.  
Pipelines can even contain ColumnTransformers, AND ColumnTransformers can contain pipelines.
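
To make the "in sequence" idea concrete, here is a minimal sketch on a small made-up array (`X_toy` is invented for illustration and is not the lesson's dataset). It compares a two-step pipeline with running the same two transformers by hand, one after the other.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two columns, one missing value.
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [4.0, 40.0]])

# Steps run in the order given: impute first, then scale.
toy_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# make_pipeline names each step after its class, in lowercase.
step_names = [name for name, _ in toy_pipeline.steps]
print(step_names)

# fit_transform fits every step and passes the data through them in sequence.
X_via_pipeline = toy_pipeline.fit_transform(X_toy)

# The same work done by hand, one transformer at a time.
X_imputed = SimpleImputer(strategy="median").fit_transform(X_toy)
X_by_hand = StandardScaler().fit_transform(X_imputed)

# Both routes should give the same array.
print(np.allclose(X_via_pipeline, X_by_hand))
```

Swapping the two steps would fail or change the result, since the scaler would then see the missing value before it was imputed. That is why the order of the steps matters.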

# Why Should I Use Pipelines for Machine Learning?

Reasons to use pipelines:

1. Pipelines use less code than applying each transformer individually. Every transformer in the pipeline is fit in a single .fit() call, and the data passes through all of them in a single .transform() call.
2. Pipelines make your preprocessing workflow easier to understand. By reducing the code and displaying a diagram of the pipeline you can show your readers clearly how your data is being transformed before modeling.
3. Pipelines are easy to use in production models. When you are ready to deploy your model on new data, a preprocessing pipeline can ensure that the new data is quickly and easily preprocessed for modeling.
4. Pipelines can help prevent data leakage. Fitting the whole pipeline on the training data only means that no statistics from the test data (such as medians or means) reach the preprocessing steps. Later you will learn a technique called 'cross-validation', and pipelines will simplify performing it without leaking data.
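
As a preview of reason 4, the sketch below puts a model at the end of a pipeline and passes the whole pipeline to `cross_val_score`. The data is randomly generated for illustration, and `LinearRegression` is just an assumed stand-in model; neither comes from this lesson. Because the imputer and scaler live inside the pipeline, they are re-fit on the training portion of each fold only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Knock out roughly 10% of the feature values to mimic missing data.
missing_mask = rng.random(X_demo.shape) < 0.1
X_demo[missing_mask] = np.nan

# Preprocessing steps followed by a single estimator at the end.
model_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LinearRegression(),
)

# Each fold fits the imputer, scaler, and model on that fold's training rows only,
# then scores on the held-out rows (R^2 by default for regressors).
scores = cross_val_score(model_pipeline, X_demo, y_demo, cv=5)
print(scores)
```

Had the imputer and scaler been fit on all of `X_demo` before cross-validation, each fold's held-out rows would already have influenced the medians and means used for preprocessing.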


In [1]:
# Import Libraries
import pandas as pd
import numpy as np

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import set_config

set_config(display="diagram")

In [7]:
# Load the data
path = "C:/Users/User/Desktop/Life Expectancy Data (all_numeric).csv"
df = pd.read_csv(path, index_col="CountryYear")
df.head()

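# Inspect the data
# df.info() prints its own report and returns None, so call it directly
# instead of wrapping it in print() (which would also print "None").
df.info()
print()
print(df.isna().sum())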


# We can see that several columns are missing data. We want to impute the missing data before we scale it, so our pipeline will be ordered as:
# Step 1. Imputer
# Step 2. Scaler
# All of our data is numeric, so we don't need to one-hot encode anything. We can also use either median or mean imputation on all of the columns.
# If we wanted to, we COULD use ColumnTransformer to split the columns into integers and floats, apply mean imputation to the floats and
# median imputation to the integers, and then scale them all. You'll learn how to combine ColumnTransformer and Pipelines in a future lesson.
# For this lesson, we will just use a median imputer for all of the columns.

<class 'pandas.core.frame.DataFrame'>
Index: 2928 entries, Afghanistan2015 to Zimbabwe2000
Data columns (total 20 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Status                           2928 non-null   int64  
 1   Life expectancy                  2928 non-null   float64
 2   Adult Mortality                  2928 non-null   int64  
 3   infant deaths                    2928 non-null   int64  
 4   Alcohol                          2735 non-null   float64
 5   percentage expenditure           2928 non-null   float64
 6   Hepatitis B                      2375 non-null   float64
 7   Measles                          2928 non-null   int64  
 8   BMI                              2896 non-null   float64
 9   under-five deaths                2928 non-null   int64  
 10  Polio                            2909 non-null   float64
 11  Total expenditure                2702 non-null   float64
 12  Dip
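
The ColumnTransformer alternative mentioned in the comments above could look roughly like the sketch below. It is only an illustration on a tiny made-up DataFrame (`toy_df` and its column names are hypothetical), and it assumes the columns are stored as `int64` and `float64`, as in the `df.info()` output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up frame with one integer column and one float column.
toy_df = pd.DataFrame({
    "int_col": pd.Series([1, 2, 3, 4], dtype="int64"),
    "float_col": pd.Series([0.5, np.nan, 1.5, 2.5], dtype="float64"),
})

# Apply a different imputer to each group of columns, selected by dtype.
impute_by_dtype = ColumnTransformer([
    ("float_mean", SimpleImputer(strategy="mean"),
     make_column_selector(dtype_include="float64")),
    ("int_median", SimpleImputer(strategy="median"),
     make_column_selector(dtype_include="int64")),
])

# The ColumnTransformer is itself one step in a pipeline; scaling comes second.
combined_pipeline = make_pipeline(impute_by_dtype, StandardScaler())
result = combined_pipeline.fit_transform(toy_df)

# Output columns follow the transformer order: floats first, then integers.
print(result.shape)
print(np.isnan(result).sum(), "missing values")
```

Note that the output column order follows the order of the transformers in the ColumnTransformer, not the original DataFrame, so here the float column comes first.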

In [14]:
# We will be predicting 'Life expectancy' so we will set that as our y target.

# divide features and target and perform a train/test split.
X = df.drop(columns=["Life expectancy"])
y = df["Life expectancy"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# instantiate an imputer and a scaler
median_imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

# combine the imputer and the scaler into a pipeline
preprocessing_pipeline = make_pipeline(median_imputer, scaler)
preprocessing_pipeline

# fit pipeline on training data
preprocessing_pipeline.fit(X_train)

# transform train and test sets
X_train_processed = preprocessing_pipeline.transform(X_train)
X_test_processed = preprocessing_pipeline.transform(X_test)

# inspect the result of the transformation
print(np.isnan(X_train_processed).sum().sum(), "missing values \n")
X_train_processed

# By default, scikit-learn transformers and pipelines return NumPy arrays, not Pandas dataframes.
# Arrays have no .isna() method, so we use np.isnan(array) to find the missing values in the result.
# np.isnan(array).sum() already counts across the whole array, so the second .sum() is harmless but not required.
# We can see that there are no remaining missing values and all of the values seem to be scaled.

0 missing values 



array([[ 0.        , -0.81229166, -0.26366021, ..., -0.87868801,
         1.19451878,  1.92222335],
       [ 0.        ,  1.43809769,  0.15576412, ...,  0.58477555,
         0.22791761,  0.08271906],
       [ 0.        ,  2.02690924, -0.18501814, ...,  0.87303352,
        -0.68443553, -0.80637468],
       ...,
       [ 0.        , -1.10266448, -0.11511409, ..., -0.10260885,
        -0.88170108, -1.17427554],
       [ 0.        , -0.73163255, -0.24618419, ..., -0.96738278,
         0.97259504,  0.87983758],
       [ 0.        ,  1.43003177, -0.20249416, ...,  1.07259673,
        -3.11080174, -2.24731971]])
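
If the column names and index are needed after preprocessing, one option is to wrap the returned array back into a DataFrame. The sketch below uses a small made-up frame (`toy_X` and its index labels are hypothetical). It relies on an assumption that holds for this particular imputer-plus-scaler pipeline on this data: no step adds, drops, or reorders columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small made-up frame with missing values and a string index.
toy_X = pd.DataFrame(
    {"Alcohol": [0.5, np.nan, 1.5, 2.5], "BMI": [20.0, 25.0, np.nan, 30.0]},
    index=["A2015", "B2015", "C2015", "D2015"],
)

toy_preprocessor = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# The pipeline returns a plain NumPy array by default.
processed_array = toy_preprocessor.fit_transform(toy_X)
print(type(processed_array))

# Re-attach the original column names and index.
# Assumption: the pipeline kept every column, in the same order.
processed_df = pd.DataFrame(
    processed_array, columns=toy_X.columns, index=toy_X.index
)
print(processed_df.isna().sum().sum(), "missing values")
```

This shortcut would not be safe if a step changed the set of columns; for example, SimpleImputer drops a column that is entirely missing in the training data.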