<h1 style="text-align: center;"> Introduction to Scikit Learn and Scikit Learn Pipelines </h1>


### Important note: use scikit-learn with NumPy arrays only. Pass pandas DataFrames to scikit-learn at your own risk!

### The basic workflow: import the data and do the data munging in pandas, convert all the data to numeric form, convert it to a NumPy array, and only then hand it to scikit-learn
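The workflow above can be sketched on a tiny example. The DataFrame, its column names, and the label mapping are all made up for illustration, and the `_demo` names are chosen so they do not clash with variables used later in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the columns are invented for this sketch.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "city": ["Paris", "Oslo", "Paris"],
    "bought": ["yes", "no", "yes"],
})

# Data munging in pandas: one-hot encode the categorical column
# and map the text labels to integers.
features = pd.get_dummies(df[["age", "city"]], columns=["city"])
labels = df["bought"].map({"no": 0, "yes": 1})

# Convert to plain NumPy arrays before handing the data to scikit-learn.
X_demo = features.to_numpy(dtype=float)
y_demo = labels.to_numpy()

print(X_demo.shape, y_demo.shape)
```

After this step `X_demo` and `y_demo` are ordinary NumPy arrays, which is the form the rest of this notebook assumes.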

__The scikit-learn library lets us do the following data science operations and more__
* Feature extraction
* Classification
* Regression
* Clustering
* Dimension reduction
* Model selection
* Pipelines and Feature Unions

__Today we will build a scikit-learn pipeline that helps us automate the data science process__

## Two main ways to interface with scikit-learn: estimators and transformers

* __The general API pattern for estimators is as follows__

![](img/sk_estimator_interface.jpg)

* __The main difference between the transformer and predictor APIs is the method each one exposes after fit(): transform() versus predict()__

![](img/sk_transformer.jpg)
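As a rough sketch of the two interfaces, the snippet below defines a toy transformer and then uses an existing scikit-learn classifier as the predictor. `MeanCenterer` is a hypothetical class written only for this illustration; it is not part of scikit-learn. `BaseEstimator`, `TransformerMixin`, and `KNeighborsClassifier` are real scikit-learn classes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        # Learn state from the training data and return self, as every estimator does.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Transformers return a modified copy of the data.
        return np.asarray(X, dtype=float) - self.mean_


demo = np.array([[1.0, 10.0], [3.0, 30.0]])

# Transformer: fit() then transform().
centered = MeanCenterer().fit(demo).transform(demo)

# Predictor: fit() then predict().
clf = KNeighborsClassifier(n_neighbors=1).fit(centered, [0, 1])
pred = clf.predict(centered)

print(centered)
print(pred)
```

Because `MeanCenterer` inherits from `TransformerMixin`, it also gets a `fit_transform()` method that combines the two calls.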

# Scikit Learn Pipelines

* __For now, remember to use NumPy arrays with scikit-learn pipelines. Packages such as sklearn-pandas are not production-ready yet, and the unexpected behaviour of pandas inside pipelines makes them extremely unreliable__

![](img/sk_pipelines.jpeg)

![](img/engineering-pipelines.jpeg)

In [22]:
from sklearn.datasets import load_digits

# return_X_y=True returns the (data, target) pair directly, so the dataset is loaded only once
X, Y = load_digits(return_X_y=True)

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [24]:
from sklearn.decomposition import PCA

pca = PCA(n_components=20)

In [25]:
pca.fit(X_train)

PCA(copy=True, iterated_power='auto', n_components=20, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [26]:
X_new_train = pca.transform(X_train)

X_new_test = pca.transform(X_test)

In [27]:
from sklearn.linear_model import LogisticRegression

predictiveModel = LogisticRegression()

In [28]:
predictiveModel.fit(X_new_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [29]:
Y_preds = predictiveModel.predict(X_new_test)

In [30]:
from sklearn.metrics import confusion_matrix

In [31]:
confusion_matrix(Y_test, Y_preds)

array([[52,  0,  0,  0,  0,  0,  0,  1,  0,  0],
       [ 0, 46,  1,  0,  0,  0,  0,  0,  2,  1],
       [ 0,  0, 47,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0, 52,  0,  0,  0,  0,  2,  0],
       [ 0,  0,  0,  0, 60,  0,  0,  0,  0,  0],
       [ 0,  0,  1,  0,  0, 62,  1,  0,  0,  2],
       [ 0,  0,  0,  0,  1,  0, 52,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 54,  0,  1],
       [ 0,  0,  0,  0,  0,  1,  0,  0, 42,  0],
       [ 0,  0,  0,  2,  0,  1,  0,  0,  1, 55]], dtype=int64)

In [32]:
from sklearn.metrics import accuracy_score

In [33]:
accuracy_score(Y_test, Y_preds)

0.96666666666666667
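The accuracy above can be cross-checked by hand from the confusion matrix: the diagonal holds the correctly classified samples, so accuracy is the diagonal sum divided by the total count. The sketch below copies the matrix from the output above; the variable names are chosen only for this illustration.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true digit, columns: predicted digit).
cm = np.array([
    [52,  0,  0,  0,  0,  0,  0,  1,  0,  0],
    [ 0, 46,  1,  0,  0,  0,  0,  0,  2,  1],
    [ 0,  0, 47,  0,  0,  0,  0,  0,  0,  0],
    [ 0,  0,  0, 52,  0,  0,  0,  0,  2,  0],
    [ 0,  0,  0,  0, 60,  0,  0,  0,  0,  0],
    [ 0,  0,  1,  0,  0, 62,  1,  0,  0,  2],
    [ 0,  0,  0,  0,  1,  0, 52,  0,  0,  0],
    [ 0,  0,  0,  0,  0,  0,  0, 54,  0,  1],
    [ 0,  0,  0,  0,  0,  1,  0,  0, 42,  0],
    [ 0,  0,  0,  2,  0,  1,  0,  0,  1, 55],
])

correct = np.trace(cm)      # sum of the diagonal: correct predictions
total = cm.sum()            # all test samples
accuracy = correct / total  # 522 / 540

print(correct, total, accuracy)
```

This gives 522 correct predictions out of 540 test samples, which matches the value returned by `accuracy_score`.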

In [34]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

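# Note: Pipeline does not copy its steps, so pipe.fit() refits the pca and predictiveModel objects created above in place
pipe = Pipeline(steps=[("scale", StandardScaler()), ("pca", pca), ("logreg", predictiveModel)])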

In [35]:
pipe.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=20, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('logreg', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [36]:
Y_pipe_preds = pipe.predict(X_test)

In [37]:
accuracy_score(Y_test, Y_pipe_preds)

0.94074074074074077
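Note that this pipeline includes a `StandardScaler` step that the manual workflow did not have, so the two accuracy figures are not a like-for-like comparison.

One benefit of wrapping the steps in a pipeline is that the whole chain can be tuned as a single estimator. The sketch below is one possible way to do that with `GridSearchCV`. The grid values, `cv=3`, and `max_iter=1000` are illustrative assumptions, not settings used elsewhere in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same data and split as in the cells above.
X, Y = load_digits(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Fresh, unfitted steps so nothing from earlier cells is reused.
tuned_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as "<step name>__<parameter name>".
param_grid = {"pca__n_components": [10, 20]}  # illustrative values only

# Each candidate is cross-validated on the training set; the scaler and PCA
# are refit inside every fold, so the held-out fold is never used for fitting.
search = GridSearchCV(tuned_pipe, param_grid, cv=3)
search.fit(X_train, Y_train)

best_n = search.best_params_["pca__n_components"]
test_accuracy = search.score(X_test, Y_test)  # accuracy of the refit best pipeline
print(best_n, test_accuracy)
```

The fitted steps of the best pipeline remain accessible through `search.best_estimator_.named_steps`, for example `named_steps["pca"]`.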

<img src='img/scikit_cheat_sheet.jpg' />