# Pipelines
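Pipelines are useful for avoiding data leakage between the training dataset and the test dataset, especially during the data preparation stage (e.g. when applying normalisation or standardisation to the data). Leakage occurs when information from the test data, such as its mean and standard deviation, influences how the model is prepared or trained.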

Pipelines help to prevent data leakage by ensuring that each data preparation step is fit only on the training portion of each cross validation fold, and is then applied unchanged to the held-out portion of that fold.
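The difference between a leaky and a leak-free evaluation can be sketched as follows. This is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the Pima dataset, a 5-fold split, and names such as `X_demo`, `leaky_scores` and `clean_scores` that are hypothetical. The two mean scores may be very close on data like this; the point is the procedure, not the size of the gap.

```python
# Illustrative sketch on synthetic data (an assumption, not the Pima dataset).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the scaler is fit on every row, including rows later used for testing.
X_scaled_all = StandardScaler().fit_transform(X_demo)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled_all, y_demo, cv=cv)

# Leak-free: inside the pipeline the scaler is refit on the training rows of each fold.
pipe = Pipeline([
    ('standardize', StandardScaler()),
    ('logistic', LogisticRegression()),
])
clean_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)

print(leaky_scores.mean(), clean_scores.mean())
```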

In [1]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest



## Load data
This example uses the Pima Indians diabetes dataset.

In [2]:
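# path to a local copy of the dataset (the file has no header row)
url = "pima-indians-diabetes.csv"
# column names to assign, since the file does not provide them
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
# convert to a NumPy array for slicing into features and label
array = dataframe.values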

In [None]:
# separate array into features (X) and label (y) parts
X = array[:,0:8]
y = array[:,8]

## Create pipeline
For this example:
1. standardize the data
2. train a Linear Discriminant Analysis (LDA) model

In [None]:
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
model = Pipeline(estimators)
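To see what the pipeline does when it is fit, the sketch below fits the same two steps on a training split and inspects the fitted scaler. It assumes synthetic data and a hypothetical 75/25 split; `demo_model`, `X_train` and `X_test` are illustrative names that are not part of the example above.

```python
# Illustrative sketch on synthetic data (an assumption, not the Pima dataset).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

demo_model = Pipeline([
    ('standardize', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis()),
])

# fit() fits the scaler on X_train, transforms X_train, then fits LDA on the result
demo_model.fit(X_train, y_train)

# the scaler's statistics come from the training rows only
fitted_scaler = demo_model.named_steps['standardize']
scaler_uses_train_only = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))
print(scaler_uses_train_only)

# predict() reuses the already-fitted scaler on X_test before calling LDA
predictions = demo_model.predict(X_test)
```

Steps are looked up by the names given in the `(name, estimator)` tuples, which is why each step in the pipeline needs a unique name.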

## Evaluate pipeline
This example evaluates the pipeline using 10-fold cross validation.

In [None]:
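from sklearn.model_selection import KFold, cross_val_score

num_folds = 10
seed = 8

# shuffle=True is needed for random_state to have an effect on the splits
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
# the whole pipeline is refit on the training part of each fold
results = cross_val_score(model, X, y, cv=kfold)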
print(results.mean())
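Roughly, the cross validation above amounts to refitting a fresh copy of the pipeline on each training fold and scoring it on the matching held-out fold. The loop below is a minimal sketch of that idea, not scikit-learn's actual implementation; it assumes synthetic data, and `demo_model`, `manual_scores` and `auto_scores` are hypothetical names.

```python
# Illustrative sketch on synthetic data (an assumption, not the Pima dataset).
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_model = Pipeline([
    ('standardize', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis()),
])
cv = KFold(n_splits=10, shuffle=True, random_state=8)

manual_scores = []
for train_idx, test_idx in cv.split(X_demo):
    # a fresh, unfitted copy, so nothing carries over between folds
    fold_model = clone(demo_model)
    # scaler and LDA are both fit on the training rows of this fold only
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # accuracy on the held-out rows, scaled with the training-fold statistics
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

auto_scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv)
print(np.mean(manual_scores), auto_scores.mean())
```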

## Feature extraction as a pipeline
For this example:
1. extract 3 components using Principal Component Analysis (PCA)
2. select the 6 best original features using a univariate statistical test (SelectKBest)
3. combine both sets of features with a feature union
4. build a Logistic Regression model

In [None]:
# create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)

# create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
# a higher max_iter helps the solver converge on these unscaled features
estimators.append(('logistic', LogisticRegression(max_iter=1000)))
model = Pipeline(estimators)
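A feature union runs its transformers side by side on the same input and concatenates their outputs column-wise, so 3 PCA components plus 6 selected features give 9 columns. The sketch below checks this on synthetic data with 8 input features; the data and the names `demo_union` and `combined` are assumptions made for illustration.

```python
# Illustrative sketch on synthetic data (an assumption, not the Pima dataset).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_union = FeatureUnion([
    ('pca', PCA(n_components=3)),
    ('select_best', SelectKBest(k=6)),
])

# y is passed through because SelectKBest scores features against the label
combined = demo_union.fit_transform(X_demo, y_demo)

# columns 0-2 are the PCA components, columns 3-8 are the selected features
print(combined.shape)
```

Because the union is itself a step in the pipeline, both transformers are refit on the training portion of each fold during cross validation, just like the scaler in the earlier pipeline.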

## Evaluate pipeline
Evaluate the feature union pipeline with the same 10-fold cross validation procedure used for the first pipeline.

In [None]:
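# evaluate pipeline
from sklearn.model_selection import KFold, cross_val_score

num_folds = 10
seed = 8

# shuffle=True is needed for random_state to have an effect on the splits
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
# PCA, SelectKBest and the classifier are all refit on each training fold
results = cross_val_score(model, X, y, cv=kfold)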
print(results.mean())