# ML Pipeline (Putting it all together)
----
- A pipeline sequentially applies a list of transforms (removing NaNs, converting to a standard format, imputation, feature selection, etc.) followed by a final estimator (the ML algorithm).
- The last step in the pipeline must be an estimator.
- The purpose of the pipeline is to **assemble** several steps into one ordered object that can be fit, cross-validated, and tuned as a single unit.
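
To make "sequentially apply" concrete, here is a minimal sketch comparing a hand-chained imputer and classifier with the equivalent `Pipeline`. The toy arrays `X_toy` and `y_toy` are made up for illustration, and `np.nan` is assumed as the missing-value marker (the notebook's real data uses `"?"` instead).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: np.nan marks a missing entry
X_toy = np.array([[0, 1], [1, np.nan], [0, 1],
                  [1, 0], [np.nan, 0], [0, 1]], dtype=float)
y_toy = np.array(["a", "b", "a", "b", "b", "a"])

# Manual chaining: fit the transform, transform the data, then fit the estimator
imputer = SimpleImputer(strategy="most_frequent")
X_imputed = imputer.fit_transform(X_toy)
svc = SVC().fit(X_imputed, y_toy)
manual_pred = svc.predict(imputer.transform(X_toy))

# The same two steps assembled into one object
pipe = Pipeline([("imputation", SimpleImputer(strategy="most_frequent")),
                 ("SVM", SVC())])
pipe_pred = pipe.fit(X_toy, y_toy).predict(X_toy)

# Both routes should give identical predictions
print((manual_pred == pipe_pred).all())
```

The pipeline version keeps the fitted imputer and classifier together, so the same imputation learned on the training data is reused automatically at prediction time.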



## Step 1: Import necessary modules

In [3]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer  # transformer
from sklearn.svm import SVC  # estimator
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [4]:
Pipeline?

Init signature: Pipeline(steps, memory=None, verbose=False)
Docstring:
Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator.
Intermediate steps of the pipeline must be 'transforms', that is, they
must implement fit and transform methods.
The final estimator only needs to implement fit.
The transformers in the pipeline can be cached using ``memory`` argument.

The purpose of the pipeline is to assemble several steps that can be
cross-validated together while setting different parameters.
For this, it enables setting parameters of the various steps using their
names and the parameter name separated by a '__', as in the example below.
A step's estimator may be replaced entirely by setting the parameter
with its name to another estimator, or a transformer removed 
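
The docstring's `'__'` naming convention can be sketched as follows. This is an illustrative example, not the one from the scikit-learn docstring; the step names `imputation` and `SVM` mirror the ones used later in this notebook, and swapping in `LogisticRegression` is only an assumption chosen to show step replacement.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("imputation", SimpleImputer(strategy="most_frequent")),
                 ("SVM", SVC())])

# <step name>__<parameter name> addresses a parameter of one step
pipe.set_params(SVM__C=10, imputation__strategy="median")
c_before = pipe.get_params()["SVM__C"]
print(c_before, pipe.named_steps["imputation"].strategy)

# Passing the bare step name replaces the whole step with another estimator
pipe.set_params(SVM=LogisticRegression())
print(type(pipe.named_steps["SVM"]).__name__)
```

The same `step__param` keys are what a parameter grid uses when the pipeline is tuned as a whole.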

In [2]:
#from sklearn.feature_selection import SelectKBest

In [7]:
#SelectKBest?

## Step 2: Import Data

In [4]:
df = pd.read_csv("house-votes-84.csv")
df.head()

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [5]:
# Encode votes as numbers: 'n' -> 0, 'y' -> 1 ('?' stays as the missing-value marker)
df.replace(["n","y"],[0,1],inplace=True)
df.head()

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0,1,0,1,1,1,0,0,0,1,?,1,1,1,0,1
1,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,?
2,democrat,?,1,1,?,1,1,0,0,0,0,1,0,1,1,0,0
3,democrat,0,1,1,0,?,1,0,0,0,0,1,0,1,0,0,1
4,democrat,1,1,1,0,1,1,0,0,0,0,1,?,1,1,1,1


In [6]:
# Features: every column after 'Class Name' (the 16 vote columns)
X = df[df.columns[1:]].values
# Target: party affiliation
y = df['Class Name'].values

In [7]:
X.shape

(435, 16)

In [8]:
y.shape

(435,)

In [9]:
X[:5]

array([[0, 1, 0, 1, 1, 1, 0, 0, 0, 1, '?', 1, 1, 1, 0, 1],
       [0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, '?'],
       ['?', 1, 1, '?', 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
       [0, 1, 1, 0, '?', 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1],
       [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, '?', 1, 1, 1, 1]], dtype=object)

In [10]:
y[:10]

array(['republican', 'republican', 'democrat', 'democrat', 'democrat',
       'democrat', 'democrat', 'republican', 'republican', 'democrat'],
      dtype=object)

## Step 3: Split into train and test sets

In [11]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=42)

## Step 4: Create Pipeline and fit the model

In [13]:
#SimpleImputer?

In [14]:
# Set up the pipeline steps: replace each '?' with the column's most frequent value,
# then classify with a support vector machine
steps = [('imputation', SimpleImputer(missing_values="?", strategy='most_frequent')),
         ('SVM', SVC())]

In [15]:
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

In [16]:
# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('imputation',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values='?', strategy='most_frequent',
                               verbose=0)),
                ('SVM',
                 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                     decision_function_shape='ovr', degree=3,
                     gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)
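
Because the imputer and the classifier live in one object, the whole pipeline can be cross-validated and tuned together, which is the purpose the docstring describes. The sketch below is an assumption-laden illustration: `X_demo` and `y_demo` are synthetic, `np.nan` stands in for the `"?"` marker, and the `C` grid is arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic binary "votes"; the label depends only on the first column
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 2, size=(60, 4)).astype(float)
y_demo = np.where(X_demo[:, 0] == 1, "yes", "no")
# Blank out roughly 10% of the entries to simulate missing votes
X_demo[rng.rand(60, 4) < 0.1] = np.nan

pipe = Pipeline([("imputation", SimpleImputer(strategy="most_frequent")),
                 ("SVM", SVC())])

# The imputer is refit inside every fold, so no statistics leak from held-out data
grid = GridSearchCV(pipe, param_grid={"SVM__C": [0.1, 1, 10]}, cv=3)
grid.fit(X_demo, y_demo)
print(grid.best_params_, grid.best_score_)
```

Tuning the pipeline rather than the bare `SVC` means each candidate `C` is scored with imputation statistics computed only from that fold's training portion.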

## Step 5: Predict with the pipeline

In [17]:
# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

In [23]:
y_pred[:5]

array(['democrat', 'democrat', 'republican', 'republican', 'republican'],
      dtype=object)

## Step 6: Observe Metrics

In [18]:
accuracy_score(y_test,y_pred)

0.9694656488549618

In [19]:
confusion_matrix(y_test,y_pred)

array([[82,  3],
       [ 1, 45]], dtype=int64)

In [20]:
# Compute metrics
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    democrat       0.99      0.96      0.98        85
  republican       0.94      0.98      0.96        46

    accuracy                           0.97       131
   macro avg       0.96      0.97      0.97       131
weighted avg       0.97      0.97      0.97       131
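
The accuracy, confusion matrix, and classification report above all describe the same 131 test predictions. As a sanity check, the sketch below recomputes the headline numbers from the confusion matrix alone, relying on scikit-learn's convention that rows are true labels and columns are predicted labels, with classes in sorted order (`democrat`, `republican`).

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[82, 3],
               [1, 45]])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Recall: correct predictions divided by the true count of each class (row sums)
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: correct predictions divided by the predicted count of each class (column sums)
precision = np.diag(cm) / cm.sum(axis=0)

f1 = 2 * precision * recall / (precision + recall)

print(accuracy)
print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
```

Here accuracy is (82 + 45) / 131, and the rounded per-class values line up with the `democrat` and `republican` rows of the report.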



https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976