# Feature Extraction and Modeling Pipeline
The following example builds a pipeline in four steps:

1. Feature Extraction with Principal Component Analysis
2. Feature Extraction with Statistical Selection
3. Feature Union
4. Learn a Logistic Regression Model

The pipeline is then evaluated using 10-fold cross validation.

Reference: http://machinelearningmastery.com/

In [1]:
# Create a pipeline that extracts features from the data then creates a model
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

In [2]:
# Load the Pima Indians Diabetes dataset
url = 'https://goo.gl/vhm1eU'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
dataframe.shape

(768, 9)

In [3]:
dataframe.head()

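|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |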


In [5]:
array = dataframe.values
X = array[:, :8]
y = array[:, 8]
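
The positional slicing above depends on `class` being the last column. As a hedged alternative, the sketch below selects features and target by column name. The frame `demo` and the variables `X_demo` and `y_demo` are illustrative names, not part of the original notebook; the three rows are copied from the `head()` output above.

```python
import pandas as pd

# Miniature frame with the same column layout as the dataset above.
# `demo` is a hypothetical stand-in so the example runs without a download.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
]
demo = pd.DataFrame(rows, columns=names)

# Select features and target by name rather than by position
X_demo = demo.drop(columns='class').to_numpy()
y_demo = demo['class'].to_numpy()

print(X_demo.shape, y_demo.shape)
```

Selecting by name keeps the split correct if columns are later reordered or added.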

In [6]:
# create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)
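
To make the effect of the union concrete, here is a minimal sketch that applies the same two transformers to synthetic data and checks the shape of the result. The synthetic dataset from `make_classification` and the names `X_demo`, `y_demo`, `union`, and `combined` are assumptions for illustration only; they stand in for the 8-feature dataset used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in with 8 features (an assumption, not the notebook's data)
X_demo, y_demo = make_classification(
    n_samples=100, n_features=8, n_informative=4, random_state=0
)

union = FeatureUnion([
    ('pca', PCA(n_components=3)),
    ('select_best', SelectKBest(k=6)),
])

# Each transformer is fitted on the same input; outputs are stacked column-wise
combined = union.fit_transform(X_demo, y_demo)

# 3 principal components + 6 selected original columns = 9 columns
print(combined.shape)
```

The union does not remove overlap: the six selected columns also contribute to the principal components, so the downstream model sees partly redundant information.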

In [7]:
# create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression()))
model = Pipeline(estimators)
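
One reason to wrap the steps in a `Pipeline` is that the feature extraction is fitted only on the data passed to `fit`, and the fitted transforms are then reused when predicting. The sketch below illustrates this with a single train/test split on synthetic data. The dataset, the split sizes, the names `demo_model` and `accuracy`, and the `max_iter=1000` setting are all assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic stand-in with 8 features (an assumption, not the notebook's data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_model = Pipeline([
    ('feature_union', FeatureUnion([
        ('pca', PCA(n_components=3)),
        ('select_best', SelectKBest(k=6)),
    ])),
    # max_iter raised as a precaution against convergence warnings (assumption)
    ('logistic', LogisticRegression(max_iter=1000)),
])

# fit() learns the PCA projection, the selected columns, and the
# regression coefficients from the training rows only
demo_model.fit(X_train, y_train)

# score() applies the already-fitted transforms to the held-out rows,
# then reports classification accuracy
accuracy = demo_model.score(X_test, y_test)
print(list(demo_model.named_steps))
print(accuracy)
```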

In [9]:
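# evaluate pipeline with 10-fold cross-validation
n_splits = 10
# random_state is omitted: it only takes effect when shuffle=True, and recent
# scikit-learn versions raise a ValueError if it is set without shuffling
kfold = KFold(n_splits=n_splits)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())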

0.776042378674
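
To show what the mean score summarizes, the following sketch writes out the 10-fold loop by hand and compares it with `cross_val_score`. It uses synthetic data and a plain `LogisticRegression` to stay short; the dataset, the `max_iter=1000` setting, and the names `estimator`, `kfold_demo`, `manual_scores`, and `cv_scores` are assumptions for illustration, so the numbers it prints are unrelated to the result above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with 8 features (an assumption, not the notebook's data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
estimator = LogisticRegression(max_iter=1000)
kfold_demo = KFold(n_splits=10)

# Manual loop: train on 9 folds, score accuracy on the remaining fold
manual_scores = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    fold_model = clone(estimator)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# cross_val_score performs the same procedure in one call
cv_scores = cross_val_score(estimator, X_demo, y_demo, cv=kfold_demo)

print(manual_scores.mean(), cv_scores.mean())
print(cv_scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how much the accuracy varies from fold to fold.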
