## Advanced Machine Learning with Andreas Mueller 

# Transformers
* Transformers let you change the representation of your data
* Many transformers are used for preprocessing
* Split the data into training and test sets first, so that transformers are fit on the training set only

### Unsupervised Transformations for Preprocessing

In [1]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [3]:
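# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn version.
from sklearn.datasets import load_boston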
boston = load_boston()
boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR'])

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target)

### Look at the Training Set
* Each feature is a floating-point value, and the features are on very different scales: some are close to zero, others are in the hundreds
* This is a poor format for many linear models, which assume that all features are on the same scale
* A simple fix is to scale each feature using its mean and standard deviation
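The scaling described above can be sketched by hand with NumPy: subtract each column's mean and divide by its standard deviation. This is only an illustration under assumptions: the array `X` below is synthetic stand-in data with three features on different scales, not the Boston data.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical stand-in for X_train: three features on very different scales
X = rng.normal(loc=[0.5, 10.0, 400.0], scale=[0.1, 5.0, 150.0], size=(100, 3))

# Per-feature (column-wise) statistics
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize: each column ends up with mean 0 and standard deviation 1
X_manual = (X - mean) / std

print("mean : %s " % X_manual.mean(axis=0).round(6))
print("standard deviation : %s " % X_manual.std(axis=0).round(6))
```

`StandardScaler`, used in the cells that follow, performs the same computation while storing the statistics so they can be reused on new data.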

In [5]:
np.set_printoptions(suppress=True)
print(X_train)

[[   0.03961    0.         5.19    ...,   20.2      396.9        8.01   ]
 [   0.14476    0.        10.01    ...,   17.8      391.5       13.61   ]
 [   2.36862    0.        19.58    ...,   14.7      391.71      29.53   ]
 ..., 
 [   0.5405    20.         3.97    ...,   13.       390.3        3.16   ]
 [   4.03841    0.        18.1     ...,   20.2      395.33      12.87   ]
 [   0.0795    60.         1.69    ...,   18.3      370.78       5.49   ]]


In [6]:
print("mean : %s " % X_train.mean(axis=0))
print("standard deviation : %s " % X_train.std(axis=0))

mean : [   3.64755971   11.05145119   10.96757256    0.06596306    0.55184881
    6.30326385   67.80448549    3.79751398    9.40105541  404.9525066
   18.45382586  359.93559367   12.30408971] 
standard deviation : [   9.32755385   22.75394638    6.71348231    0.24821752    0.11150906
    0.70102993   27.92897713    2.04025314    8.61841919  164.80258555
    2.16664688   86.53846767    6.83761438] 


In [9]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [10]:
scaler.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

### Transform data into a scaled version using scaler.transform
* Supervised models use predict to produce outcomes for new data
* Here, we instead use transform to get a new view of the data
* After transforming, each feature of the training data has a mean of zero and a standard deviation of 1
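To make the fit/transform split concrete, here is a minimal sketch showing that `fit` stores per-feature statistics (in the `mean_` and `scale_` attributes) and `transform` applies them. The names `X_demo` and `scaler_demo` are hypothetical, and the data is random rather than the Boston set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical demo data: 50 samples, 4 features
X_demo = rng.uniform(0, 100, size=(50, 4))

# fit() only learns the statistics; it does not change the data
scaler_demo = StandardScaler().fit(X_demo)

# transform() applies the stored statistics
X_demo_scaled = scaler_demo.transform(X_demo)

# The same result, computed by hand from the fitted attributes
manual = (X_demo - scaler_demo.mean_) / scaler_demo.scale_
print(np.allclose(X_demo_scaled, manual))
```

Keeping the two steps separate is what allows the statistics learned from one dataset to be applied to another.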

In [11]:
X_scaled = scaler.transform(X_train)

In [12]:
print(X_scaled.shape)

(379, 13)


In [13]:
print("mean : %s " % X_scaled.mean(axis=0))
print("standard deviation : %s " % X_scaled.std(axis=0))

mean : [-0.  0.  0. -0. -0. -0.  0.  0. -0. -0.  0.  0. -0.] 
standard deviation : [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 
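As a follow-up sketch, the scaler fitted on the training set can be applied to the test set. This example uses synthetic data and hypothetical names (`X_all`, `X_tr`, `X_te`, `scaler_tr`), so it is an illustration rather than a continuation of the cells above. Because the test set is scaled with the training statistics, its mean and standard deviation are typically close to, but not exactly, 0 and 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical data: 200 samples, 3 features on different scales
X_all = rng.normal(loc=[0.5, 10.0, 400.0], scale=[0.1, 5.0, 150.0], size=(200, 3))

X_tr, X_te = train_test_split(X_all, random_state=0)

# Fit on the training data only, then reuse the same scaler on the test data
scaler_tr = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_tr.transform(X_tr)
X_te_scaled = scaler_tr.transform(X_te)

print("train mean : %s " % X_tr_scaled.mean(axis=0).round(3))
print("test mean : %s " % X_te_scaled.mean(axis=0).round(3))
print("test standard deviation : %s " % X_te_scaled.std(axis=0).round(3))
```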
