## Pipelines in Sklearn
_In this notebook, I try out pipelines from sklearn._

Pipelines are a convenient way to chain data-processing steps while keeping the code clean. Here we build a `Pipeline` on the **Iris dataset**.

In [1]:
#import the libs 
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

I have imported everything needed to load the data, preprocess it, and train the classifiers.

In [2]:
# load the data (load_iris returns a Bunch with .data and .target, not a DataFrame)
iris_df = load_iris()

In [3]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(
    iris_df.data,     # X: feature matrix
    iris_df.target,   # y: class labels
    test_size=0.30,   # 30% of the samples go to the test set
    train_size=0.70,  # 70% of the samples go to the training set
    random_state=0,   # fixed seed so the split is reproducible across runs
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((105, 4), (45, 4), (105,), (45,))

Now that we have the data, let's create the pipeline. It has the following stages:

- Data pre-processing using `StandardScaler`.
- Dimensionality reduction using PCA.
- Classification.

Let's dig in!


In [4]:
# each pipeline step is defined as a (name, estimator) tuple
# pipeline for Logistic Regression
pipeline_lr = Pipeline([
    ('scalar1', StandardScaler()),
    ('pca1',PCA(n_components=2)),
    ('lr_classifier', LogisticRegression(random_state=0))
])
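To make clear what a pipeline like the one above does, here is a minimal sketch that runs the same three stages by hand and compares the result with a `Pipeline`. The variable names and step names (`scaler`, `pca`, `clf`) are my own choices for illustration, and the data is reloaded so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0
)

# Manual version: fit each transformer on the training data only,
# then reuse the fitted transformers on the test data.
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression(random_state=0)

X_train_reduced = pca.fit_transform(scaler.fit_transform(X_train))
clf.fit(X_train_reduced, y_train)
manual_pred = clf.predict(pca.transform(scaler.transform(X_test)))

# Pipeline version: the same three stages behind one fit/predict call.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression(random_state=0)),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Both approaches should produce the same predictions.
same_predictions = np.array_equal(manual_pred, pipe_pred)
print(same_predictions)
```

The pipeline version is shorter, and it also removes the risk of accidentally fitting the scaler or PCA on the test data.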

In [5]:
#pipeline for Decision Trees
pipeline_dt = Pipeline([
    ('scalar2', StandardScaler()),
    ('pca2',PCA(n_components=2)),
    ('dt_classifier', DecisionTreeClassifier(random_state=0))
])

In [6]:
#pipeline for Random Forest
pipeline_rf = Pipeline([
    ('scalar3', StandardScaler()),
    ('pca3',PCA(n_components=2)),
    ('rf_classifier', RandomForestClassifier(n_estimators=1000,random_state=1, max_depth=10))
])
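Each pipeline keeps only 2 PCA components. One way to check how much of the variance those components retain is to look inside a fitted pipeline through `named_steps`. This is only a sketch: the step names (`scaler`, `pca`, `clf`) are illustrative, and the data is reloaded so the block is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', DecisionTreeClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)

# named_steps gives access to each fitted step by the name used in its tuple.
fitted_pca = pipe.named_steps['pca']
variance_ratio = fitted_pca.explained_variance_ratio_
retained = variance_ratio.sum()
print('Variance retained by 2 components: {:.3f}'.format(retained))
```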

Next, I put the pipelines in a list and loop through it to find the best pipeline/algorithm for this use case.

In [7]:
#list of pipelines
pipelines = [pipeline_lr,pipeline_dt, pipeline_rf]

In [8]:
# let's fit the pipelines
for pipe in pipelines:
    pipe.fit(X_train, y_train)

# map each pipeline's index to a readable name for printing
pipe_dict = {
    0: 'Logistic Regression',
    1: 'Decision Trees',
    2: 'Random Forest Classifier'
}

# let's print the test accuracy of each pipeline
for i, model in enumerate(pipelines):
    print('{} accuracy on test data is {}'.format(pipe_dict[i], model.score(X_test, y_test)))

Logistic Regression accuracy on test data is 0.8666666666666667
Decision Trees accuracy on test data is 0.9111111111111111
Random Forest Classifier accuracy on test data is 0.9111111111111111
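The loop above prints the scores but leaves picking the winner to the reader. A possible way to select the best pipeline in code is sketched below. It uses a name-to-estimator dictionary instead of an index lookup, and it leaves out the random forest only to keep the example fast. The names `candidates`, `scores`, and `best_name` are illustrative, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0
)

# Candidate classifiers keyed by a readable name.
candidates = {
    'Logistic Regression': LogisticRegression(random_state=0),
    'Decision Trees': DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, estimator in candidates.items():
    # Same preprocessing for every candidate; only the last step changes.
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=2)),
        ('classifier', estimator),
    ])
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)

# max() returns the first name with the highest score, so ties go to the
# candidate listed first in the dictionary.
best_name = max(scores, key=scores.get)
print('Best pipeline: {} ({:.4f})'.format(best_name, scores[best_name]))
```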


That's it! We have pre-processed the data and evaluated several models, with each pipeline handling its whole workflow in a single step.
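As a small illustration of that "single step", a fitted pipeline can take raw, unscaled measurements and apply scaling and PCA internally before predicting. The sketch below assumes a hypothetical new flower (`new_flower`) with made-up measurements, and reloads the data so it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression(random_state=0)),
])
pipe.fit(X_train, y_train)

# Hypothetical raw measurements in cm:
# sepal length, sepal width, petal length, petal width.
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

# No manual scaling or PCA needed; the pipeline applies both before predicting.
predicted_class = pipe.predict(new_flower)[0]
predicted_name = iris.target_names[predicted_class]
print(predicted_name)
```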

# That's all, folks!