This week's content index:
1. Pipelining / FeatureUnion
2. Ensembles (Bagging/Boosting/Random Forest)
3. Parameter Tuning
4. Saving models to disk and loading models from disk

## 1. Pipelining `Pipeline`

* Often used to mitigate **data leakage** (i.e. accidentally sharing information between the training and testing datasets)
* It works by ensuring that data preparation steps such as standardisation are fitted separately within each fold of the cross-validation procedure.

In [6]:
# Pipelining example
# step 1: standardise the data
# step 2: learn  a Linear Discriminant Analysis model

import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


# load data
data = pd.read_csv("diabetes.csv")
X = data.values[:,0:8]
Y = data.values[:,8]

# create pipeline
estimators = []
estimators.append(('standardize',StandardScaler()))
estimators.append(('lda',LinearDiscriminantAnalysis()))
model = Pipeline(estimators)

# evaluate pipeline
# random_state dropped: it has no effect without shuffle=True and newer scikit-learn rejects it
kfold = KFold(n_splits = 10)
results = cross_val_score(model, X, Y, cv = kfold)
print(results.mean())

0.773462064251538
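
To make the "constrained to each fold" idea concrete, here is a minimal sketch of what the pipeline does inside cross-validation, written out by hand. It assumes synthetic data from `make_classification` as a stand-in for the diabetes file, and the `_demo` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
kfold_demo = KFold(n_splits=5)

# Manual version: the scaler only ever sees the training part of each fold
manual_scores = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    scaler = StandardScaler().fit(X_demo[train_idx])
    lda = LinearDiscriminantAnalysis()
    lda.fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    manual_scores.append(lda.score(scaler.transform(X_demo[test_idx]), y_demo[test_idx]))

# Pipeline version: cross_val_score refits every step on each training fold
pipeline_demo = Pipeline([
    ("standardize", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis()),
])
pipeline_scores = cross_val_score(pipeline_demo, X_demo, y_demo, cv=kfold_demo)

# The two sets of per-fold accuracies should agree
print(np.allclose(manual_scores, pipeline_scores))
```

Scaling the whole of `X_demo` once before cross-validation would instead let test-fold statistics influence the training data, which is the leakage the pipeline avoids.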


## `FeatureUnion` : combining estimators to create even more powerful ones

Feature extraction is another procedure that is susceptible to data leakage. Like data preparation, feature extraction procedures must be restricted to the data in your training dataset. The pipeline module provides a handy tool called `FeatureUnion`, which allows the results of multiple feature selection and extraction procedures to be combined into a larger dataset on which a model can be trained. Importantly, all the feature extraction and the feature union occur within each fold of the cross-validation procedure. The example below defines a pipeline with four steps:

1. Feature Extraction with Principal Component Analysis (3 features).
2. Feature Extraction with Statistical Selection (6 features).
3. Feature Union.
4. Learn a Logistic Regression Model.

In [8]:
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# load data
data = pd.read_csv("diabetes.csv")
X = data.values[:,0:8]
Y = data.values[:,8]

# create feature union
features = []
features.append(('pca',PCA(n_components=3))) # feature extraction with PCA (3 features)
features.append(('select_best',SelectKBest(k=6)))
feature_union = FeatureUnion(features)

estimators = []
estimators.append(('feature_union',feature_union))
estimators.append(('logistic',LogisticRegression()))
model = Pipeline(estimators)

# evaluate pipeline
# random_state dropped: it has no effect without shuffle=True and newer scikit-learn rejects it
kfold = KFold(n_splits = 10)
results = cross_val_score(model, X, Y, cv = kfold)
print(results.mean())

0.7760423786739576
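
As a quick check of what the union produces, the sketch below fits the same two transformers on synthetic data (an assumption; `make_classification` stands in for the diabetes file) and prints the shape before and after. The 3 PCA components and the 6 selected features are placed side by side, giving 9 columns.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in data with 8 input features
X_demo, y_demo = make_classification(
    n_samples=100, n_features=8, n_informative=4, random_state=0
)

union_demo = FeatureUnion([
    ("pca", PCA(n_components=3)),          # 3 extracted components
    ("select_best", SelectKBest(k=6)),     # 6 of the original columns
])

# y is needed by SelectKBest to score the features; PCA ignores it
X_combined = union_demo.fit_transform(X_demo, y_demo)
print(X_demo.shape, "->", X_combined.shape)
```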


## 2. Ensembles

* **Bagging** (Bootstrap Aggregation). Building multiple models (typically of the same type) from different subsamples of the training dataset.
* **Boosting**. Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the sequence of models.
* **Voting**. Building multiple models (typically of differing types) and combining their predictions with simple statistics (like calculating the mean).

### Bagging `BaggingClassifier`

Bootstrap Aggregation (or Bagging) involves taking multiple samples from your training dataset (with replacement) and training a model for each sample. The final output prediction is averaged across the predictions of all of the sub-models. The two bagging models covered in this section are as follows:
* Bagged Decision Trees
* Random Forest

Bagging performs best with algorithms that have **high variance**. A popular example is **decision trees**, often constructed without pruning.
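
The procedure described above can be sketched by hand in a few lines. This is a simplified illustration on synthetic data, assuming 0/1 labels and a majority vote; it is not how `BaggingClassifier` is implemented internally, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 0/1 labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(7)

n_trees = 25  # odd, so a majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# One row of predictions per tree, shape (n_trees, n_samples)
all_votes = np.stack([tree.predict(X_demo) for tree in trees])

# Aggregate: the class predicted by more than half of the trees
bagged_pred = (all_votes.mean(axis=0) >= 0.5).astype(int)
print(bagged_pred[:10])
```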

In [9]:
# Bagging decision trees for classification
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("diabetes.csv")
X = data.values[:,0:8]
Y = data.values[:,8]

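# random_state dropped from KFold: it has no effect without shuffle=True and newer scikit-learn rejects it
kfold = KFold(n_splits=10)
cart = DecisionTreeClassifier()
num_trees = 100
# base estimator passed positionally: the keyword was renamed from base_estimator to estimator in newer scikit-learn
model = BaggingClassifier(cart, n_estimators = num_trees, random_state=7)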
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.770745044429255


### Random Forest `RandomForestClassifier`

Random Forest is an extension of bagged decision trees. Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between individual classifiers. Specifically, rather than greedily choosing the best split point across all features when constructing each tree, only a random subset of features is considered for each split.
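
A small sketch of how the feature-subset size shows up in a fitted forest, assuming synthetic data and an illustrative `forest_demo` name. Each fitted tree in `estimators_` records the number of features it considers per split in its `max_features_` attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 8 features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest_demo = RandomForestClassifier(n_estimators=10, max_features=3, random_state=7)
forest_demo.fit(X_demo, y_demo)

# One fitted decision tree per estimator
print(len(forest_demo.estimators_))

# Every tree looks at only 3 randomly chosen features at each split
print({tree.max_features_ for tree in forest_demo.estimators_})
```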

In [11]:
# Random Forest Classification
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("diabetes.csv")
X = data.values[:,0:8]
Y = data.values[:,8]

# random_state dropped: it has no effect without shuffle=True and newer scikit-learn rejects it
kfold = KFold(n_splits=10)
num_trees = 100
max_features = 3

model = RandomForestClassifier( n_estimators = num_trees, max_features = max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7694805194805194


### Boosting Algorithms

**AdaBoost** `AdaBoostClassifier`

AdaBoost works by weighting instances in the dataset by how easy or difficult they are to classify, allowing the algorithm to pay more or less attention to them in the construction of subsequent models. You can construct an AdaBoost model for classification using the `AdaBoostClassifier` class.
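
The re-weighting idea can be illustrated with a single boosting round in NumPy. This follows the classic textbook form of discrete AdaBoost on made-up predictions; it is a sketch of the idea, not necessarily the exact update used inside `AdaBoostClassifier`.

```python
import numpy as np

# Made-up labels and weak-learner predictions for five samples
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])  # the sample at index 1 is misclassified

# Start with uniform instance weights
weights = np.full(len(y_true), 1 / len(y_true))

missed = y_pred != y_true
error = weights[missed].sum()               # weighted error rate of this learner
alpha = 0.5 * np.log((1 - error) / error)   # how much say this learner gets

# Increase weights of misclassified samples, decrease the rest, then renormalise
weights = weights * np.exp(np.where(missed, alpha, -alpha))
weights = weights / weights.sum()
print(weights)
```

After this round the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.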

In [12]:
# AdaBoost Classification
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import AdaBoostClassifier

data = pd.read_csv("diabetes.csv")
X = data.values[:,0:8]
Y = data.values[:,8]

# random_state dropped: it has no effect without shuffle=True and newer scikit-learn rejects it
kfold = KFold(n_splits=10)
num_trees = 30

model = AdaBoostClassifier( n_estimators = num_trees,random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.760457963089542


### Voting Ensemble

**Voting** `VotingClassifier`

Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms. It works by first creating two or more standalone models from your training dataset. A Voting Classifier can then be used to wrap your models and combine the predictions of the sub-models (by majority vote or by averaging predicted probabilities) when asked to make predictions for new data. The predictions of the sub-models can be weighted, but specifying the weights for classifiers manually or even heuristically is difficult. More advanced methods can learn how best to weight the predictions from sub-models; this is called stacking (stacked generalisation), which scikit-learn provides as `StackingClassifier` from version 0.22 onwards.
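
`VotingClassifier` has a `voting` parameter that can be `'hard'` (majority vote on predicted labels, the default) or `'soft'` (average of predicted probabilities). The NumPy sketch below uses made-up probabilities from three hypothetical sub-models to show that the two rules can disagree.

```python
import numpy as np

# Hypothetical class-1 probabilities: one row per sub-model, one column per sample
proba = np.array([
    [0.9, 0.2, 0.6, 0.4],
    [0.4, 0.3, 0.7, 0.4],
    [0.4, 0.1, 0.3, 0.9],
])

# Hard voting: each model casts a 0/1 vote, the majority of three wins
hard_votes = (proba >= 0.5).astype(int)
hard_pred = (hard_votes.sum(axis=0) >= 2).astype(int)

# Soft voting: average the probabilities, then threshold
soft_pred = (proba.mean(axis=0) >= 0.5).astype(int)

print("hard:", hard_pred)
print("soft:", soft_pred)
```

In the first and last columns a single confident model outweighs two mildly negative ones under soft voting, while hard voting sides with the majority.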


In [13]:
# An example of combining the predictions of logistic regression, classification 
# and regression trees and support vector machines (SVC) together for a classification problem
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier


data = pd.read_csv("diabetes.csv")
X = data.values[:,0:8]
Y = data.values[:,8]

# random_state dropped: it has no effect without shuffle=True and newer scikit-learn rejects it
kfold = KFold(n_splits=10)

# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart',model2))
model3 = SVC()
estimators.append(('svm',model3))

# create the ensemble model
ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())



0.7343301435406698



## 3. Tuning (Hyperparameter optimisation)

Algorithm tuning is a final step in the process of applied machine learning before finalising your model. It is sometimes called hyperparameter optimisation where the algorithm parameters are referred to as hyperparameters, whereas the coefficients found by the machine learning algorithm itself are referred to as parameters. Optimisation suggests the search-nature of the problem. Phrased as a search problem, you can use different search strategies to find a good and robust parameter or set of parameters for an algorithm on a given problem. Python scikit-learn provides two simple methods for algorithm parameter tuning:
1. Grid Search Parameter Tuning.
2. Random Search Parameter Tuning.

**Grid Search** `GridSearchCV`

Grid search is an approach to parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid. You can perform a grid search using the `GridSearchCV` class.
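
To see what "each combination" means, the sketch below expands a small grid with `ParameterGrid` from `sklearn.model_selection`. The two-parameter grid is a made-up example (3 values of `alpha` times 2 values of `fit_intercept`), so the search would evaluate 6 candidate settings.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: every value of alpha is paired with every value of fit_intercept
demo_grid = {"alpha": [1, 0.1, 0.01], "fit_intercept": [True, False]}

combos = list(ParameterGrid(demo_grid))
print(len(combos))
for params in combos:
    print(params)
```

The number of candidates is the product of the list lengths, which is why grids grow quickly as parameters are added.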

In [14]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge


data = pd.read_csv("diabetes.csv")
X = data.values[:,0:8]
Y = data.values[:,8]

alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
param_grid = dict(alpha=alphas)
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid.fit(X,Y)
print(grid.best_score_)  # mean cross-validated R^2 (Ridge's default score) for the best alpha
print(grid.best_estimator_.alpha)

0.2796175593129722
1.0


**Random Search** `RandomizedSearchCV`

Random search is an approach to parameter tuning that will sample algorithm parameters from a random distribution (e.g. uniform) for a fixed number of iterations. A model is constructed and evaluated for each combination of parameters chosen. You can perform a random search for algorithm parameters using the `RandomizedSearchCV` class.
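
The sampling step on its own can be previewed with `ParameterSampler` from `sklearn.model_selection`. This sketch assumes the same `uniform()` distribution from `scipy.stats`, which by default draws values between 0 and 1; the `sampled` name and the 5 draws are illustrative.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Draw 5 candidate settings; each alpha is sampled from the uniform distribution on [0, 1]
sampled = list(ParameterSampler({"alpha": uniform()}, n_iter=5, random_state=7))

for params in sampled:
    print(params)
```

Unlike a grid, the number of candidates is fixed by `n_iter` regardless of how many parameters are searched.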

In [17]:
import pandas as pd
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import Ridge

data = pd.read_csv("diabetes.csv")
X = data.values[:,0:8]
Y = data.values[:,8]

param_grid = {"alpha":uniform()}
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions = param_grid, n_iter = 100, random_state = 7)
rsearch.fit(X,Y)
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

0.27961712703051084
0.9779895119966027


## 4. Saving and Loading Models

### Serialise object in Python using `pickle`


Pickle is the standard way of serialising objects in Python. You can use the `pickle` module to serialise your trained machine learning models and save the serialised format to a file.

The example below demonstrates how you can train a logistic regression model on the Pima Indians onset of diabetes dataset, save the model to file and load it to make predictions on the unseen test set.

In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from pickle import dump, load

data = pd.read_csv("diabetes.csv")
X = data.values[:,0:8]
Y = data.values[:,8]

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.33, random_state=7)

# fit the model on the 67% training split (33% is held out for testing)
model = LogisticRegression()
model.fit(X_train, Y_train)

# save the model to disk
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:  # the with-block closes the file after writing
    dump(model, f)

# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = load(f)
result = loaded_model.score(X_test,Y_test)
print(result)

0.7559055118110236


### `Joblib` Library

The `joblib` library is part of the SciPy ecosystem and provides utilities for pipelining Python jobs.

It provides utilities for efficiently saving and loading Python objects that make use of NumPy data structures. This can be useful for some machine learning algorithms that require a lot of parameters or store the entire dataset (e.g. k-Nearest Neighbors). The example below demonstrates how you can train a logistic regression model on the Pima Indians onset of diabetes dataset, save the model to file using Joblib and load it to make predictions on the unseen test set.


In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from joblib import dump, load  # sklearn.externals.joblib was removed in newer scikit-learn

data = pd.read_csv("diabetes.csv")
X = data.values[:,0:8]
Y = data.values[:,8]

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.33, random_state=7)

# fit the model on the 67% training split (33% is held out for testing)
model = LogisticRegression()
model.fit(X_train, Y_train)

# save the model to disk
filename = 'finalised_model.sav'
dump(model, filename)

# load the model from disk
loaded_model = load(filename)
result = loaded_model.score(X_test, Y_test)
print(result)

0.7559055118110236


### Notes on saving and loading models

* **Python Version** Take note of the Python version. You almost certainly require the same major (and maybe minor) version of Python used to serialise the model when you later load it and deserialise it.

* **Library Versions** The versions of all major libraries used in your machine learning project almost certainly need to be the same when deserialising a saved model. This includes, but is not limited to, the versions of NumPy and scikit-learn.

* **Manual Serialisation** You might like to manually output the parameters of your learned model so that you can use them directly in scikit-learn or another platform in the future. Often the techniques used internally by machine learning algorithms to make predictions are a lot simpler than those used to learn the parameters and can be easy to implement in custom code that you have control over.

* Take note of the Python and library versions so that you can re-create the environment if for some reason you cannot reload your model on another machine or another platform at a later time.
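
The notes on manual serialisation and version tracking can be combined in one small sketch. It assumes a binary logistic regression trained on synthetic data, and the JSON field names are made up for illustration. The learned coefficients are stored together with the versions that produced them, then predictions are rebuilt with NumPy alone.

```python
import json
import platform

import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and a fitted binary classifier
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Learned parameters plus the environment they came from (field names are illustrative)
record = {
    "coef": clf.coef_.ravel().tolist(),
    "intercept": float(clf.intercept_[0]),
    "python": platform.python_version(),
    "numpy": np.__version__,
    "sklearn": sklearn.__version__,
}
payload = json.dumps(record)  # this string could be written to a file

# Later, possibly on another platform: predict without scikit-learn
restored = json.loads(payload)
z = X_demo @ np.array(restored["coef"]) + restored["intercept"]
manual_pred = (z > 0).astype(int)  # sigmoid(z) > 0.5 is equivalent to z > 0

# Fraction of samples where the manual prediction matches the original model
agreement = (manual_pred == clf.predict(X_demo)).mean()
print(agreement)
```

A plain-text record like this is easier to inspect and port than a pickle, at the cost of having to re-implement the prediction step yourself.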

### From the lecture

* **Sigmoid functions** are not commonly used nowadays in deep neural networks or RNNs. In an RNN, long-term information has to travel sequentially through all cells before reaching the current processing cell.
* This means that it can easily be corrupted by being multiplied many times by small numbers (magnitude < 1). This is the cause of vanishing gradients.
* Alternatives to the sigmoid function:
  + tanh (hyperbolic tangent): outputs between -1 and 1, centred on 0. It is easier to train and often performs better than the sigmoid function.
  + ReLU (Rectified Linear Unit)
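
A rough numerical illustration of the point about repeated multiplication, using NumPy only. The 20-step depth is an arbitrary assumption, and the sketch compares the largest possible derivative of each activation rather than simulating a real network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 1001)

# Derivatives of the three activations over the same inputs
sigmoid_grad = sigmoid(x) * (1 - sigmoid(x))  # never exceeds 0.25
tanh_grad = 1 - np.tanh(x) ** 2               # reaches 1 near x = 0
relu_grad = (x > 0).astype(float)             # exactly 1 for positive inputs

print(sigmoid_grad.max(), tanh_grad.max(), relu_grad.max())

# Best case for the sigmoid after 20 chained multiplications (roughly 9e-13)
depth = 20
best_case_sigmoid = 0.25 ** depth
print(best_case_sigmoid)
```

Even in the most favourable case the sigmoid's gradient signal shrinks by many orders of magnitude over 20 steps, whereas ReLU passes a gradient of 1 through every active unit.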