# Agenda
- StandardScaler
- Pipeline

## How does StandardScaler solve the problem?
[StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) is used for the "standardization" of features, also known as "center and scale" or "z-score normalization": each feature is shifted to zero mean and rescaled to unit standard deviation.
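
As a rough illustration of what standardization does, the sketch below applies the z-score formula by hand with NumPy. The toy matrix is made up for this example and is not the wine data; `ddof=0` is used because StandardScaler works with the population standard deviation.

```python
import numpy as np

# Toy feature matrix (made-up values): column 0 is small-scale, column 1 is large-scale.
X_toy = np.array([[5.0, 1000.0],
                  [4.0,  700.0],
                  [8.0, 1500.0],
                  [3.0,  400.0]])

# Standardization: subtract each column's mean, then divide by its standard deviation.
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0, ddof=0)
X_toy_z = (X_toy - col_mean) / col_std

print(X_toy_z.mean(axis=0))  # approximately 0 for every column
print(X_toy_z.std(axis=0))   # approximately 1 for every column
```

After this step both columns live on the same scale, so neither one dominates a distance calculation simply because of its units.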


In [1]:
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
col_names = ['label', 'color', 'proline']
wine = pd.read_csv(url, header=None, names=col_names, usecols=[0, 10, 13])

In [2]:
wine.head()

,label,color,proline
0,1,5.64,1065
1,1,4.38,1050
2,1,5.68,1185
3,1,7.8,1480
4,1,4.32,735


In [3]:
wine.describe()

,label,color,proline
count,178.0,178.0,178.0
mean,1.938202,5.05809,746.893258
std,0.775035,2.318286,314.907474
min,1.0,1.28,278.0
25%,1.0,3.22,500.5
50%,2.0,4.69,673.5
75%,3.0,6.2,985.0
max,3.0,13.0,1680.0


In [4]:
feature_cols=['color','proline']
X = wine[feature_cols]
y = wine.label

In [5]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

**fit**: Compute the mean and std to be used for later scaling.

**fit_transform**: Fit to data, then transform it.
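
A minimal sketch of how the two methods relate, using synthetic arrays (the `_demo` names and values are assumptions for illustration, not the wine data). It checks that `fit` followed by `transform` matches `fit_transform`, and that a test set is scaled with the training mean and std rather than its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Synthetic stand-ins for a train/test split.
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(30, 2))

# Two-step version: learn the mean/std, then apply them.
two_step = StandardScaler().fit(X_train_demo).transform(X_train_demo)

# One-step version: fit_transform does both on the same data.
one_step = StandardScaler().fit_transform(X_train_demo)
same_result = np.allclose(two_step, one_step)

# The test set is only transformed, reusing the statistics learned from training data.
scaler_demo = StandardScaler().fit(X_train_demo)
X_test_demo_scaled = scaler_demo.transform(X_test_demo)
manual_test_scaled = (X_test_demo - scaler_demo.mean_) / scaler_demo.scale_

print(same_result)
print(np.allclose(X_test_demo_scaled, manual_test_scaled))
```

Because the test set reuses the training statistics, its scaled mean and std will usually be close to, but not exactly, 0 and 1.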

In [6]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

In [7]:
print(X_train_scaled[:,0].mean())
print(X_train_scaled[:,0].std())
print(X_train_scaled[:,1].mean())
print(X_train_scaled[:,1].std())

-3.90664944003e-16
1.0
1.6027279754e-16
1.0


In [8]:
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled[:,0].mean())
print(X_test_scaled[:,0].std())
print(X_test_scaled[:,1].mean())
print(X_test_scaled[:,1].std())

0.0305898576303
0.866822198488
0.0546533341088
1.14955947533


In [9]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
# Baseline: KNN on the unscaled features
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred_class = knn.predict(X_test)
metrics.accuracy_score(y_test, y_pred_class)

0.64444444444444449

In [10]:
# Same model, now trained and evaluated on the standardized features
knn.fit(X_train_scaled, y_train)
y_pred_class = knn.predict(X_test_scaled)
metrics.accuracy_score(y_test, y_pred_class)

0.8666666666666667

In [11]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score
cross_val_score(knn, X, y, cv=5, scoring="accuracy").mean()

0.71983168041991563

In [12]:
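# Caution: the scaler is fit on all of X here, so each CV test fold
# has already influenced the mean/std used to scale its training folds.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
cross_val_score(knn, X_scaled, y, cv=5, scoring="accuracy").mean()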

0.90104247104247115

# Pipeline
## How does Pipeline solve the problem?
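
One reading of "the problem" here is that scaling all of `X` before cross-validation lets each test fold influence the scaler. The sketch below, under that assumption, shows what a pipeline does inside cross-validation: the scaler is re-fit on the training part of every fold. It uses scikit-learn's bundled `load_wine` data as a stand-in for the UCI download (columns 9 and 12 are `color_intensity` and `proline`); the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine data stands in for the UCI file used above.
wine_demo = load_wine()
X_demo = wine_demo.data[:, [9, 12]]  # color_intensity, proline
y_demo = wine_demo.target

cv_demo = StratifiedKFold(n_splits=5)

# Manual version: the scaler only ever sees the training part of each fold.
manual_scores = []
for train_idx, test_idx in cv_demo.split(X_demo, y_demo):
    fold_scaler = StandardScaler().fit(X_demo[train_idx])
    fold_knn = KNeighborsClassifier(n_neighbors=3)
    fold_knn.fit(fold_scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    fold_acc = fold_knn.score(fold_scaler.transform(X_demo[test_idx]), y_demo[test_idx])
    manual_scores.append(fold_acc)

# Pipeline version: cross_val_score re-fits scaler and model together on each training fold.
pipe_demo = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=cv_demo, scoring='accuracy')

print(np.allclose(manual_scores, pipe_scores))
```

The pipeline gives the same per-fold scores as the hand-written loop, with far less bookkeeping.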



In [13]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
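# Score the pipeline, not the bare knn, so the scaler is re-fit on each training fold.
# (Passing knn with the raw X would just repeat the unscaled 0.7198 from In [11].)
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()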

Pipeline can also be used with [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) for parameter searching:


In [14]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Try every n_neighbors from 1 to 21 with 5-fold cross-validation.
param_grid = [{'n_neighbors': list(range(1, 22))}]
grid = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid.fit(X_scaled, y)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]}],
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)

In [15]:
grid.best_score_

0.9101123595505618

In [16]:
grid.best_params_

{'n_neighbors': 1}
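
The search above runs on data that was scaled once, up front. A sketch of searching over the pipeline itself follows, so that scaling happens inside each fold. It again assumes the bundled `load_wine` data as a stand-in for the UCI file, and the `_demo` names are hypothetical. Parameters of a pipeline step are addressed as `<step name>__<parameter>`, and `make_pipeline` names each step after its lowercased class name.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine data stands in for the UCI file used above.
wine_demo = load_wine()
X_demo = wine_demo.data[:, [9, 12]]  # color_intensity, proline
y_demo = wine_demo.target

pipe_demo = make_pipeline(StandardScaler(), KNeighborsClassifier())

# The KNN step is named 'kneighborsclassifier', so its parameter gets that prefix.
param_grid_demo = {'kneighborsclassifier__n_neighbors': list(range(1, 22))}

grid_demo = GridSearchCV(pipe_demo, param_grid_demo, cv=5, scoring='accuracy')
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)
print(grid_demo.best_score_)
```

The best score and parameters from this search need not match the ones above, since the scaler no longer sees the held-out fold.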