In [None]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/Pipelines.git cloned-repo
%cd cloned-repo
!ls

# **Pipelines**

The purpose of the pipeline is to assemble **several steps that can be cross-validated together** while setting different parameters (hyperparameter tuning or optimization).

A pipeline can be used to chain multiple estimators into one. <br>
This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification.<br>

Pipeline serves multiple purposes here:<br>

- Convenience and encapsulation
You only have to call fit and predict once on your data to fit a whole sequence of estimators.

- Joint parameter selection
You can grid search over parameters of all estimators in the pipeline at once.

- Safety
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
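
As a minimal sketch of the safety point, the example below cross-validates a scaler and a classifier as a single pipeline, so the scaler is re-fitted on the training portion of each fold only. The iris data, the `StandardScaler`/`SVC` pairing, the name `leak_free`, and `cv=5` are illustrative assumptions, not choices made elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scaling and classification are wrapped in one estimator
leak_free = Pipeline([('scaler', StandardScaler()), ('clf', SVC())])

# cross_val_score clones and fits the whole pipeline on each training fold,
# so the held-out fold never contributes to the scaler's mean and variance.
scores = cross_val_score(leak_free, X, y, cv=5)
print(scores.shape)  # one score per fold
```

Scaling the full data set first and cross-validating only the classifier would let statistics from the held-out fold influence the training step, which is the leak the pipeline is meant to prevent.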

**All estimators in a pipeline, except the last one, must be transformers** (i.e., must have a `transform` method). <br>
The last estimator may be any type (transformer, classifier, etc.).

[Pipeline User Guide](https://scikit-learn.org/stable/modules/compose.html#pipeline)


The Pipeline is built using a list of (key, value) pairs, where <br>
- the key is a string containing the name you want to give this step and <br>
- value is an estimator object:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA

This pipeline performs Principal Component Analysis, then passes the result to a Support Vector Machine model.

In [None]:
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
pipe

The utility function make_pipeline is a shorthand for constructing pipelines;<br>

it takes a variable number of estimators and returns a pipeline, filling in the names automatically:

Import the `make_pipeline` function

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer

The pipeline first binarizes the data, then passes the result to the Naive Bayes model.

Binarize data (set feature values to 0 or 1) according to a threshold.


Binarization divides data into two groups and assigns one of two values to all members of the same group. This is usually done by defining a threshold t and assigning 0 to every value at or below the threshold and 1 to every value above it.
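
The thresholding rule can be sketched by hand with NumPy and compared against `Binarizer`. The array `X_demo` and the threshold `t = 0.5` are made-up values for illustration only.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical data and threshold, chosen only for this illustration
X_demo = np.array([[0.5, -1.0, 2.0],
                   [1.5,  0.0, 0.2]])
t = 0.5

# Values strictly greater than t become 1; everything else becomes 0
manual = (X_demo > t).astype(float)

# Binarizer applies the same rule
from_sklearn = Binarizer(threshold=t).fit_transform(X_demo)

print(manual)
print(np.array_equal(manual, from_sklearn))
```

Note that a value exactly equal to the threshold (0.5 in the first row) maps to 0.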

In [None]:
make_pipeline(Binarizer(), MultinomialNB())
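
To see the automatically generated names, the step names can be read back from the pipeline's `steps` attribute. This is a small sketch; the variable names `auto_pipe` and `step_names` are illustrative.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer

auto_pipe = make_pipeline(Binarizer(), MultinomialNB())

# Each step name is the lowercased class name of the estimator
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)  # ['binarizer', 'multinomialnb']
```

These generated names are the ones to use as prefixes when setting nested parameters on a pipeline built with `make_pipeline`.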

Example of [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html)

In [None]:
from sklearn.preprocessing import Binarizer
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
transformer = Binarizer().fit(X)
transformer

transformer.transform(X)

**Assignment**<br>
Manually binarize the following data<br>
>X=<br>
[[1.0, 2.0, 3.0, 4.0],<br>
   [2.0, 3.0, 4.0, 5.0]]<br>

**Assignment**<br>
Use the `Binarizer` class to binarize X

In [None]:
X= [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]
transformer = Binarizer(threshold=3).fit(X)
transformer
transformer.transform(X)

**To access steps in the pipeline**

In [None]:
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
print("Pipe",pipe)
print("Pipe steps",pipe.steps[0], pipe.steps[1])
print("Pipe[0]=",pipe[0])
print("By step name:",pipe['reduce_dim'])
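
Besides integer and name indexing, a pipeline also exposes `named_steps` and supports slicing. The sketch below rebuilds the same two-step pipeline under the illustrative name `demo_pipe` so that it runs on its own.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

demo_pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# named_steps maps each step name to its estimator
pca_step = demo_pipe.named_steps['reduce_dim']

# Negative indices work as they do for lists
last_step = demo_pipe[-1]

# Slicing returns a new Pipeline holding only the selected steps
front = demo_pipe[:1]

print(type(pca_step).__name__, type(last_step).__name__, type(front).__name__)
```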

**Nested Parameters**<br>
Use the `<estimator>__<parameter>` syntax (step name, two underscores, parameter name) to access the parameters of an estimator inside the pipeline

In [None]:
#set the 'C' parameter for the SVM
pipe.set_params(clf__C=10)
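
A short sketch of reading nested parameters back: `get_params()` returns a dictionary whose keys use the same double-underscore naming. The pipeline is rebuilt here as `demo_pipe` (an illustrative name) so the cell is self-contained.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

demo_pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# Several nested parameters can be set in one call
demo_pipe.set_params(clf__C=10, reduce_dim__n_components=2)

# get_params() exposes them under the same step__parameter keys
params = demo_pipe.get_params()
print(params['clf__C'], params['reduce_dim__n_components'])  # 10 2

# The change is applied to the estimator object inside the pipeline
print(demo_pipe.named_steps['clf'].C)  # 10
```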

**Setting parameters for Grid Searches**

In this example, the grid search tries 2, 5, and 10 components for the PCA and 0.1, 10, and 100 for the SVM's `C` parameter, giving 9 combinations in total.

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = dict(reduce_dim__n_components=[2, 5, 10],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)
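
The grid search object does nothing until it is fitted. The following sketch fits the same grid on synthetic data; the data set from `make_classification`, its size, `cv=3`, and the names `search_pipe`, `grid`, and `search` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic data with 20 features so that n_components=10 is valid
X_syn, y_syn = make_classification(n_samples=200, n_features=20, random_state=0)

search_pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
grid = dict(reduce_dim__n_components=[2, 5, 10],
            clf__C=[0.1, 10, 100])

search = GridSearchCV(search_pipe, param_grid=grid, cv=3)
search.fit(X_syn, y_syn)

print(len(search.cv_results_['params']))  # 9 combinations (3 x 3)
print(search.best_params_)
```

After fitting, `search.best_estimator_` holds a pipeline refitted on all of the data with the best combination.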

**Getting the feature names out of a pipeline**

In [None]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest

In [None]:
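from sklearn.linear_model import LogisticRegression

# Whole steps can be grid-searched too: swap one estimator for another,
# or skip a step by setting it to 'passthrough'.
param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(10)],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)

# Separate example: fit a feature-selection pipeline on the iris data.
iris = load_iris()
pipe = Pipeline(steps=[
   ('select', SelectKBest(k=2)),
   ('clf', LogisticRegression())])
pipe.fit(iris.data, iris.target)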

pipe[:-1].get_feature_names_out()

**Pipeline Example**

**Import the libraries**

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import pprint
import time 
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

**Start the timer**

In [None]:
startT = time.time()

**Get the data**

https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

In [None]:
winedf = pd.read_csv('winequality-red.csv',sep=';')
# print(winedf.isnull().sum())  # check for missing data
print (winedf.head())

**Check the distribution of the wine quality**

In [None]:
print (winedf.shape)
ylab = winedf[['quality']]
print (ylab.shape)
print (winedf['quality'].value_counts()) 

**Plot the correlation of the features**

In [None]:
winecorr = winedf.corr()
s=sns.heatmap(winecorr)
s.set_yticklabels(s.get_yticklabels(),rotation=30,fontsize=7)
s.set_xticklabels(s.get_xticklabels(),rotation=30,fontsize=7)

**Create scatter plots of fixed acidity against pH and against residual sugar**

In [None]:
plt.show()  # heatmap: as expected, fixed acidity and pH are strongly (negatively) correlated

# individual scatter plots; marker size scales with wine quality
plt.subplot(1,2,1)
plt.scatter(winedf['fixed acidity'], winedf['pH'], s=winedf['quality']*5, color='magenta', alpha=0.3)
plt.xlabel('Fixed Acidity')
plt.ylabel('pH')
plt.subplot(1,2,2)
plt.scatter(winedf['fixed acidity'], winedf['residual sugar'], s=winedf['quality']*5, color='purple', alpha=0.3)
plt.xlabel('Fixed Acidity')
plt.ylabel('Residual Sugar')
plt.tight_layout()
plt.show()

**The data should contain only the features; the labels should contain only the quality**

In [None]:
X=winedf.drop(['quality'],axis=1)
Y=winedf['quality']

In [None]:
print (type(X), type(Y))
print (X.head(3))

**Create the pipeline**

**StandardScaler**: Standardize features by removing the mean and scaling to unit variance.



In [None]:
#Example of Standard Scaler
data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler()
print("fit:",scaler.fit(data))
print("mean:",scaler.mean_)
print("transform:\n",scaler.transform(data))
print("transform [2,2]:",scaler.transform([[2, 2]]))
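
The standardization can also be reproduced by hand as z = (x - mean) / std, which makes the numbers above easier to check. This is a sketch on the same toy data; `manual_z` and `sk_z` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0, 0], [0, 0], [1, 1], [1, 1]], dtype=float)

mean = data.mean(axis=0)   # [0.5, 0.5]
std = data.std(axis=0)     # population std (ddof=0), as StandardScaler uses
manual_z = (data - mean) / std

sk_z = StandardScaler().fit_transform(data)
print(np.allclose(manual_z, sk_z))

# A new point is scaled with the statistics learned from the fitted data
new_z = (np.array([[2.0, 2.0]]) - mean) / std
print(new_z)  # [[3. 3.]]
```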

In [None]:
steps = [('scaler', StandardScaler()), ('SVM', SVC())]
pipeline = Pipeline(steps)

In [None]:
parameters = {'SVM__C':[0.001,0.1,10,100,10e5], 'SVM__gamma':[0.1,0.01]}
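
To get a feel for the cost of this search, the number of candidate settings can be counted with `ParameterGrid`. The sketch below assumes the 5-fold cross-validation used in the grid search cell of this notebook and the default `refit=True`; `n_candidates`, `n_folds`, and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'SVM__C': [0.001, 0.1, 10, 100, 10e5],
              'SVM__gamma': [0.1, 0.01]}

# Every combination of C and gamma is one candidate
n_candidates = len(ParameterGrid(parameters))

# Each candidate is fitted once per fold, plus one final refit on all training data
n_folds = 5
n_fits = n_candidates * n_folds + 1

print(n_candidates, n_fits)  # 10 51
```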

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=30, stratify=Y)
print (X_test.shape)
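
The `stratify` argument keeps the class proportions the same in the training and test sets, which matters when some quality grades are rare. Here is a toy sketch with made-up labels (90 samples of class 0 and 10 of class 1); `X_toy`, `y_toy`, and `y_te` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 90% class 0, 10% class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

_, _, _, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=30, stratify=y_toy)

# The 20-sample test set keeps the 90/10 ratio
print(np.bincount(y_te))  # [18  2]
```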

In [None]:
grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
grid.fit(X_train, y_train)
print ("score = %3.2f" %(grid.score(X_test,y_test)))
print (grid.best_params_)
endT = time.time()
print ("total time elapsed = %3.3f"%(endT-startT))

**Assignment**<br>
1. Why, even after Grid Search, does the model have a score of 67%?
2. Look at the winequality-white.csv file. If we use the data as is, will we get a better score than the red wine data? Check to see if your answer is correct.<br>
The white wine data set is about 3 times the size of the red wine data set. How much time do you think it will take to grid search the white wine dataset?