# Centering and scaling
## Why scale your data?
- Many models use some form of distance
- Features on large scales can unduly influence the model
- **Example:** k-NN uses distance explicitly when making predictions
- Thus, we want features to be on a **similar scale**
- This process is called **normalizing** (scaling and centering)

## Ways to normalize data
- **Standardization: Subtract the mean and divide by the standard deviation**
 - All features are centered around zero and have variance one
- **Subtract the minimum and divide by the range**
 - Minimum zero and maximum one
- **Can also normalize so the data ranges from -1 to 1**
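The three options above can be sketched with plain NumPy. This is a minimal illustration on a made-up array (`X` here is hypothetical toy data, not the wine dataset); note that dividing by the standard deviation is what yields unit variance.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardization: subtract the column mean, divide by the column standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: subtract the column minimum, divide by the column range.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Rescale the min-max result so each column runs from -1 to 1.
X_pm1 = 2 * X_minmax - 1

print(X_std.mean(axis=0), X_std.std(axis=0))        # per-column mean and std
print(X_minmax.min(axis=0), X_minmax.max(axis=0))   # per-column min and max
print(X_pm1.min(axis=0), X_pm1.max(axis=0))
```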

## CV and scaling in a pipeline

#### Hyperparameters with pipeline
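```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, estimator) pair; the name prefixes hyperparameter keys
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]

pipeline = Pipeline(steps)

# Key format: <step name>__<parameter name>; the values to try are elided here
parameters = {'knn__n_neighbors': ...}
```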
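Keys of the hyperparameter dictionary have three parts:
- `knn`: the name of the pipeline step
- `__`: a double underscore as separator
- `n_neighbors`: the name of the parameter to tune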
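One way to check which keys a pipeline accepts is to list them with `get_params()`. The sketch below rebuilds the same two-step pipeline; the value `7` passed to `set_params` is an arbitrary choice for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('knn', KNeighborsClassifier())])

# Every tunable parameter is exposed as '<step name>__<parameter name>'.
param_names = sorted(pipeline.get_params().keys())
knn_keys = [name for name in param_names if name.startswith('knn__')]
print(knn_keys)

# The same key can be used to set the parameter on the nested estimator.
pipeline.set_params(knn__n_neighbors=7)  # 7 is an arbitrary illustrative value
print(pipeline.named_steps['knn'].n_neighbors)
```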


In [63]:
import numpy as np
import pandas as pd

df = pd.read_csv('datasets/white-wine.csv')
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [73]:
df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0
mean,6.854788,0.278241,0.334192,6.391415,0.045772,35.308085,138.360657,0.994027,3.188267,0.489847,10.514267,5.877909
std,0.843868,0.100795,0.12102,5.072058,0.021848,17.007137,42.498065,0.002991,0.151001,0.114126,1.230621,0.885639
min,3.8,0.08,0.0,0.6,0.009,2.0,9.0,0.98711,2.72,0.22,8.0,3.0
25%,6.3,0.21,0.27,1.7,0.036,23.0,108.0,0.991723,3.09,0.41,9.5,5.0
50%,6.8,0.26,0.32,5.2,0.043,34.0,134.0,0.99374,3.18,0.47,10.4,6.0
75%,7.3,0.32,0.39,9.9,0.05,46.0,167.0,0.9961,3.28,0.55,11.4,6.0
max,14.2,1.1,1.66,65.8,0.346,289.0,440.0,1.03898,3.82,1.08,14.2,9.0


- Scaling doesn't always improve accuracy. For example, in the Congressional voting records dataset, all of the features are binary, so they already share a scale and scaling has minimal impact.

- `density` ranges from about 0.987 to 1.039
- `total sulfur dioxide` ranges from 9 to 440
- `quality` defines the target: if quality <= 5, the target variable is 1; otherwise it is 0
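To see why these ranges matter for a distance-based model, here is a rough sketch that compares how much each of the two features contributes to the squared Euclidean distance between rows 0 and 1 of `df.head()`, before and after standardizing. The means and standard deviations are copied from `df.describe()`; pandas reports the sample standard deviation, which differs only negligibly from what a scaler would use on this many rows.

```python
import numpy as np

# Rows 0 and 1 from df.head(): [density, total sulfur dioxide]
a = np.array([1.001, 170.0])
b = np.array([0.994, 132.0])

# Column means and standard deviations copied from df.describe()
mean = np.array([0.994027, 138.360657])
std = np.array([0.002991, 42.498065])

# Squared per-feature contributions to the Euclidean distance, unscaled
raw_sq = (a - b) ** 2
raw_share = raw_sq / raw_sq.sum()

# The same contributions after standardizing both rows
scaled_sq = ((a - mean) / std - (b - mean) / std) ** 2
scaled_share = scaled_sq / scaled_sq.sum()

print("Unscaled share [density, total SO2]:", raw_share)
print("Scaled share   [density, total SO2]:", scaled_share)
```

Unscaled, the distance is determined almost entirely by total sulfur dioxide; after standardizing, the difference in density is no longer drowned out.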

In [75]:
X = df.drop('quality', axis=1).values
y = df['quality'].apply(lambda x: 1 if x <= 5 else 0).values

In [76]:
from sklearn.preprocessing import scale

X_scaled = scale(X)

# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X))) 
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))

# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled))) 
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))

Mean of Unscaled Features: 18.432687072460002
Standard Deviation of Unscaled Features: 41.54494764094571
Mean of Scaled Features: 2.7314972981668206e-15
Standard Deviation of Scaled Features: 0.9999999999999999
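The figures above are computed over the whole array at once. Since `scale` standardizes each column separately, a per-column check with `axis=0` can be more informative. The sketch below uses hypothetical random data (`X_demo`) rather than the wine features, and also compares `scale` against the manual formula.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical data: three columns with different means and spreads.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.99, 138.0, 10.5], scale=[0.003, 42.0, 1.2], size=(500, 3))

X_demo_scaled = scale(X_demo)

# axis=0 computes one statistic per column instead of one for the whole array
print(X_demo.mean(axis=0), X_demo.std(axis=0))
print(X_demo_scaled.mean(axis=0), X_demo_scaled.std(axis=0))

# scale() matches the manual formula using the population standard deviation
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
matches_manual = np.allclose(X_demo_scaled, manual)
print(matches_manual)
```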


## Centering and scaling in a pipeline
In this exercise, we'll see whether scaling the features has any impact on the performance of a k-NN classifier. We fit k-NN as part of a pipeline that includes scaling, then compare its accuracy with that of k-NN trained on the unscaled features.

In [77]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [87]:
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]

pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set
knn_scaled = pipeline.fit(X_train, y_train)

print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))

Accuracy with Scaling: 0.7700680272108843


In [88]:
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))

Accuracy without Scaling: 0.6979591836734694
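A pipeline like the one used above fits the scaler on the training set only and reuses those statistics when transforming the test set. The following sketch, on hypothetical synthetic data from `make_classification` (not the wine data), checks that a pipeline gives the same predictions as doing those steps by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the wine features
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo[:, 0] *= 1000  # put one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Pipeline version: scaling and k-NN are fitted together on the training set
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual version: scaler statistics come from the training set only
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier().fit(scaler.transform(X_tr), y_tr)
manual_pred = knn.predict(scaler.transform(X_te))

same_predictions = np.array_equal(pipe_pred, manual_pred)
print(same_predictions)
```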


## Scaling and CV in a pipeline

In [105]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

In [None]:
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]

pipeline = Pipeline(steps)

parameters = dict(knn__n_neighbors = np.arange(1, 50))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

cv = GridSearchCV(pipeline, param_grid=parameters, cv=5)

cv.fit(X_train, y_train)

In [None]:
# predict() uses the estimator refit with the best parameters found
y_pred = cv.predict(X_test)

print(cv.best_params_)

In [103]:
print(cv.score(X_test, y_test))

0.7761904761904762


In [106]:
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.85      0.82      0.83       997
          1       0.64      0.69      0.66       473

avg / total       0.78      0.78      0.78      1470
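The `avg / total` row appears to be the support-weighted average of the per-class rows. A quick arithmetic check, using only the rounded values printed in the report above, reproduces it:

```python
# Per-class values copied from the report above: (precision, recall, f1, support)
report = {
    0: (0.85, 0.82, 0.83, 997),
    1: (0.64, 0.69, 0.66, 473),
}

total_support = sum(row[3] for row in report.values())

def weighted_avg(column):
    """Average one metric column, weighting each class by its support."""
    return sum(row[column] * row[3] for row in report.values()) / total_support

avg_precision = round(weighted_avg(0), 2)
avg_recall = round(weighted_avg(1), 2)
avg_f1 = round(weighted_avg(2), 2)

print(total_support, avg_precision, avg_recall, avg_f1)  # 1470 0.78 0.78 0.78
```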

