# Automating Supervised Learning using Machine Learning Pipelines in scikit-learn

Machine Learning (ML) pipelines help automate machine learning workflows. They chain a sequence of data transformations together with a model, so that the whole workflow can be trained, tested, and evaluated as a single unit.

A pipeline consists of several steps that lead up to training a model. Building one is iterative: each step is revisited and tuned to keep improving the accuracy of the model.

1. The main objective of a pipeline for any ML model is control: every step of the workflow is defined explicitly and applied in the same order each time.
2. The learning algorithm finds patterns in the training data that map the input attributes to the target, and it outputs an ML model that captures these patterns.
3. A pipeline consists of a sequence of components, each of which performs a computation. Data is passed through these components and transformed along the way.

Note that the classifiers I am importing here are supervised machine learning algorithms: the training set that I feed them includes the desired solutions (labels).

For example, a typical supervised learning task is **classification**. Take the spam filter: it is trained with many example emails along with their class (spam or not), and it must learn how to classify new emails.

Another typical supervised ML task is **regression**, where we want to predict a numeric target value given a set of features called predictors. For example, let's say we want to predict the price of a car given a set of features, such as the mileage, age, brand, etc. To train the system, I need to give it many examples of cars, including both their predictors and their labels (the prices).

Examples of the most important supervised learning algorithms: 
* k-Nearest Neighbors
* Linear Regression
* Logistic Regression
* Support Vector Machines
* Decision Trees & Random Forests
* Neural Networks (these can also be unsupervised)

#### Load the famous iris dataset

In [54]:
from sklearn.datasets import load_iris

#### Split arrays or matrices into random train and test subsets.

In [55]:
from sklearn.model_selection import train_test_split

#### Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

In [56]:
from sklearn.preprocessing import StandardScaler
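
To make the formula above concrete, here is a minimal sketch that applies z = (x - u) / s by hand to one made-up feature column. The `standardize` helper and the `sample` values are hypothetical and only for illustration. It uses the population standard deviation, which matches the biased estimator (ddof=0) that StandardScaler uses.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z = (x - u) / s for each x, using the population standard deviation."""
    u = fmean(values)   # mean of the samples
    s = pstdev(values)  # population standard deviation (ddof=0)
    return [(x - u) / s for x in values]


sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical feature column
z_scores = standardize(sample)
print([round(z, 3) for z in z_scores])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After standardizing, the column has mean 0 and standard deviation 1. Note that this sketch would divide by zero on a constant column.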

#### Principal component analysis (PCA).

Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.

In [57]:
from sklearn.decomposition import PCA
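
As a rough illustration of "centered but not scaled", the sketch below projects the iris data onto two components and then reproduces the projection by hand from the fitted `mean_` and `components_` attributes. The variable names are my own, and this runs on the raw (unstandardized) features purely to show what PCA does internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # shape (150, 4)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # project onto the first two components

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component

# PCA subtracts the per-feature mean learned during fit,
# but it does not divide by the standard deviation.
centered = X - pca.mean_
manual = centered @ pca.components_.T
print(np.allclose(manual, X_reduced))
```

Because PCA does not rescale features, placing a StandardScaler in front of it (as the pipelines later in this notebook do) keeps features with large numeric ranges from dominating the components.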

#### Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement the fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. 

In [58]:
from sklearn.pipeline import Pipeline
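
A minimal sketch of what "cross-validated together while setting different parameters" can look like in practice. The step names (`scaler`, `pca`, `clf`), the 5-fold setting, and `max_iter=1000` are choices made for this example, not something the rest of the notebook depends on.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Step names are arbitrary labels chosen for this sketch.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The whole pipeline is refit on each training fold, so the scaler and PCA
# never see the held-out fold.
scores_2 = cross_val_score(pipe, X, y, cv=5)

# Parameters of any step can be changed with the <step name>__<parameter> syntax.
pipe.set_params(pca__n_components=3)
scores_3 = cross_val_score(pipe, X, y, cv=5)

print(scores_2.mean(), scores_3.mean())
```

Treating the preprocessing and the classifier as one estimator is what keeps information from the validation folds out of the fitted scaler and PCA.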

#### Logistic Regression Classifier

Logistic regression is named for the logistic function, also called the sigmoid function. It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

1 / (1 + e^-value)

In [59]:
from sklearn.linear_model import LogisticRegression
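
The logistic function above is short enough to write out directly. This is a plain-Python sketch for intuition only (the `sigmoid` name is mine), not how scikit-learn computes it internally.

```python
import math


def sigmoid(value):
    """Logistic function: 1 / (1 + e^-value)."""
    return 1.0 / (1.0 + math.exp(-value))


for v in (-5, 0, 5):
    print(v, round(sigmoid(v), 4))
# -5 0.0067
# 0 0.5
# 5 0.9933
```

Large negative inputs map close to 0, large positive inputs close to 1, and 0 maps to exactly 0.5. This simple version can raise OverflowError for very large negative inputs, so treat it as a teaching aid.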

#### Decision Tree Classifier

Use the dataset features to create yes/no questions and continually split the dataset until you isolate all data points belonging to each class.

With this process you’re organizing the data in a tree structure. Every time you ask a question you’re adding a node to the tree. The first node is called the root node. The result of asking a question splits the dataset based on the value of a feature, and creates new nodes. If you decide to stop the process after a split, the last nodes created are called leaf nodes.

In [60]:
from sklearn.tree import DecisionTreeClassifier
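
To see the yes/no questions a tree actually learns, the sketch below fits a deliberately shallow tree on iris and prints its rules with `export_text`. The `max_depth=2` and `random_state=0` settings are illustrative choices to keep the output small and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=2 stops the splitting early so the printed tree stays small.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each "<=" line is a yes/no question on one feature; "class:" lines are leaf nodes.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

The first question printed is the root node, and each level of indentation corresponds to one more split.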

#### Random Forest Classifier

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. 

In [61]:
from sklearn.ensemble import RandomForestClassifier
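
The "averaging" in the description above can be checked by hand. In this sketch (with an illustrative `n_estimators=10` and `random_state=0`), the class probabilities of the individual trees are averaged and compared with what the forest itself returns.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators=10 and random_state=0 are illustrative choices for this sketch.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)

# Each fitted tree lives in forest.estimators_; average their class probabilities by hand.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(len(forest.estimators_))
print(np.allclose(averaged, forest.predict_proba(X)))
```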

#### Load Iris Dataset

In [75]:
iris_df = load_iris()
iris_df

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [63]:
# Hold out 30% of the samples for testing; random_state=0 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(iris_df.data, iris_df.target, test_size=0.3, random_state=0)
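
A quick check of what the split produces, as a self-contained sketch (it reloads the data under the name `iris` so it can run on its own):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```

With only 45 test samples, each prediction is worth 1/45 (about 0.022) of test accuracy, so small accuracy differences between models correspond to just one or two samples.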

# Pipeline Creation
1. Data Preprocessing by using Standard Scaler
2. Reduce Dimension using PCA
3. Apply  Classifier

#### Logistic Regression:

In [64]:
pipeline_lr = Pipeline([('scaler1', StandardScaler()),
                        ('pca1', PCA(n_components=2)),
                        ('lr_classifier', LogisticRegression(random_state=0))])

#### Decision Tree:

In [65]:
pipeline_dt = Pipeline([('scaler2', StandardScaler()),
                        ('pca2', PCA(n_components=2)),
                        ('dt_classifier', DecisionTreeClassifier())])

#### Random Forest:

In [66]:
pipeline_randomforest = Pipeline([('scaler3', StandardScaler()),
                                  ('pca3', PCA(n_components=2)),
                                  ('rf_classifier', RandomForestClassifier())])

A list of pipelines:

In [67]:
pipelines = [pipeline_lr, pipeline_dt, pipeline_randomforest]

Initialize the variables used to track the best pipeline:

In [68]:
best_accuracy=0.0
best_classifier=0
best_pipeline=""

A dictionary mapping each pipeline's index to its classifier type, for ease of reference:

In [69]:
pipe_dict = {0: 'Logistic Regression', 1: 'Decision Tree', 2: 'RandomForest'}

Fit the pipelines:

In [70]:
for pipe in pipelines:
    pipe.fit(X_train, y_train)

In [71]:
for i,model in enumerate(pipelines):
    print("{} Test Accuracy: {}".format(pipe_dict[i],model.score(X_test,y_test)))

Logistic Regression Test Accuracy: 0.8666666666666667
Decision Tree Test Accuracy: 0.9111111111111111
RandomForest Test Accuracy: 0.9111111111111111


Awesome! Decision Tree and Random Forest report the same test accuracy (0.911), ahead of Logistic Regression (0.867). Let's pick the best pipeline programmatically:

In [72]:
for i, model in enumerate(pipelines):
    score = model.score(X_test, y_test)  # score each pipeline once
    if score > best_accuracy:  # strict '>' keeps the first pipeline on a tie
        best_accuracy = score
        best_pipeline = model
        best_classifier = i
print('Classifier with best accuracy:{}'.format(pipe_dict[best_classifier]))

Classifier with best accuracy:Decision Tree


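### The Decision Tree and Random Forest tie for the best accuracy in this example

Both reach a test accuracy of 0.911. The loop reports the Decision Tree only because it comes first in the list of pipelines. Neither tree-based classifier was given a random_state, so the exact numbers may differ from run to run.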
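
A minimal sketch of one way to report every top-scoring classifier rather than a single winner. The accuracies are copied from the output printed earlier in this notebook; the `test_accuracy` and `best_classifiers` names are my own.

```python
# Test accuracies copied from the output printed earlier in this notebook.
test_accuracy = {
    'Logistic Regression': 0.8666666666666667,
    'Decision Tree': 0.9111111111111111,
    'RandomForest': 0.9111111111111111,
}

best_score = max(test_accuracy.values())
# Collect every classifier that reaches the top score instead of only the first.
best_classifiers = [name for name, score in test_accuracy.items() if score == best_score]

print(best_classifiers)  # ['Decision Tree', 'RandomForest']
```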