**Applying data transformations**


Now that we have seen what the different kinds of transformations do, let's apply them
using scikit-learn. We will use the cancer dataset. Preprocessing methods like the
scalers are usually applied before a supervised machine learning algorithm. As an
example, say we want to apply the kernel SVM (SVC) to the cancer dataset, and use
MinMaxScaler for preprocessing the data. We start by loading the dataset and splitting
it into a training set and a test set. We need separate training and test sets to
evaluate the supervised model we will build after the preprocessing:

In [6]:
# Importing the dataset
from sklearn.datasets import load_breast_cancer
import numpy as np

# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
 random_state=1)
print(X_train.shape)
print(X_test.shape)

(426, 30)
(143, 30)


As a reminder, the data contains 569 data points, each represented by 30 measurements.
We split the dataset into 426 samples for the training set and 143 samples for
the test set.

As with the supervised models we built earlier, we first import the class implementing
the preprocessing, and then instantiate it:

In [7]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()


We then fit the scaler using the fit method, applied to the training data. For the
MinMaxScaler, the fit method computes the minimum and maximum value of each feature
on the training set. In contrast to the classifiers and regressors of Chapter 2, the
scaler is only provided with the data X_train when fit is called, and y_train is not
used:

In [8]:
scaler.fit(X_train)

MinMaxScaler()
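
To make concrete what fit stores, here is a minimal sketch on a small made-up array. The names X_toy and toy_scaler are illustrative only and are not part of the example above. After fitting, the per-feature statistics can be read from the data_min_, data_max_, and data_range_ attributes of the scaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up array standing in for X_train (3 samples, 2 features)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 30.0],
                  [4.0, 20.0]])

# fit only receives the data; no target values are passed
toy_scaler = MinMaxScaler().fit(X_toy)

# Statistics learned by fit, one value per feature
print(toy_scaler.data_min_)    # per-feature minimum
print(toy_scaler.data_max_)    # per-feature maximum
print(toy_scaler.data_range_)  # per-feature maximum minus minimum
```

The same attributes are available on the scaler fitted to X_train, with one entry for each of the 30 features.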

To apply the transformation that we just learned, that is, to actually scale the training
data, we use the transform method of the scaler. The transform method is used in
scikit-learn whenever a model returns a new representation of the data:


In [9]:
np.set_printoptions(suppress=True, precision=2)
# transform data
X_train_scaled = scaler.transform(X_train)
# print data set properties before and after scaling
print("transformed shape: %s" % (X_train_scaled.shape,))
print("per-feature minimum before scaling:\n %s" % X_train.min(axis=0))
print("per-feature maximum before scaling:\n %s" % X_train.max(axis=0))
print("per-feature minimum after scaling:\n %s" % X_train_scaled.min(axis=0))
print("per-feature maximum after scaling:\n %s" % X_train_scaled.max(axis=0))

transformed shape: (426, 30)
per-feature minimum before scaling:
 [  6.98   9.71  43.79 143.5    0.05   0.02   0.     0.     0.11   0.05
   0.12   0.36   0.76   6.8    0.     0.     0.     0.     0.01   0.
   7.93  12.02  50.41 185.2    0.07   0.03   0.     0.     0.16   0.06]
per-feature maximum before scaling:
 [  28.11   39.28  188.5  2501.      0.16    0.29    0.43    0.2     0.3
    0.1     2.87    4.88   21.98  542.2     0.03    0.14    0.4     0.05
    0.06    0.03   36.04   49.54  251.2  4254.      0.22    0.94    1.17
    0.29    0.58    0.15]
per-feature minimum after scaling:
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]
per-feature maximum after scaling:
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]


The transformed data has the same shape as the original data; the features are simply
shifted and scaled.
You can see that all of the features are now between zero and one, as desired.
To apply the SVM to the scaled data, we also need to transform the test set. This is
done by again calling the transform method, this time on X_test:

In [10]:
# transform test data
X_test_scaled = scaler.transform(X_test)
# print test data properties after scaling
print("per-feature minimum after scaling: %s" % X_test_scaled.min(axis=0))
print("per-feature maximum after scaling: %s" % X_test_scaled.max(axis=0))

per-feature minimum after scaling: [ 0.03  0.02  0.03  0.01  0.14  0.04  0.    0.    0.15 -0.01 -0.    0.01
  0.    0.    0.04  0.01  0.    0.   -0.03  0.01  0.03  0.06  0.02  0.01
  0.11  0.03  0.    0.   -0.   -0.  ]
per-feature maximum after scaling: [0.96 0.82 0.96 0.89 0.81 1.22 0.88 0.93 0.93 1.04 0.43 0.5  0.44 0.28
 0.49 0.74 0.77 0.63 1.34 0.39 0.9  0.79 0.85 0.74 0.92 1.13 1.07 0.92
 1.21 1.63]


Maybe somewhat surprisingly, you can see that for the test set, after scaling, the
minimum and maximum are not zero and one. Some of the features are even outside the
0-1 range!
The explanation is that the MinMaxScaler (and all the other scalers) always applies
exactly the same transformation to the training and the test set. So the transform
method always subtracts the training set minimum and divides by the training set
range, which might be different from the minimum and range for the test set.
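
The effect can be reproduced by hand. The following is a minimal sketch in plain NumPy with made-up one-feature arrays (train and test are hypothetical, not the cancer data); it assumes the default feature range of 0 to 1 and applies the training set minimum and range to both sets:

```python
import numpy as np

# Made-up data with a single feature
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[0.0], [4.0], [7.0]])

# Statistics come from the training set only
train_min = train.min(axis=0)
train_range = train.max(axis=0) - train_min

# The same shift and scale is applied to both sets
train_scaled = (train - train_min) / train_range
test_scaled = (test - train_min) / train_range

print(train_scaled.ravel())  # values 0, 0.5 and 1
print(test_scaled.ravel())   # values -0.25, 0.75 and 1.5
```

Because the test values 0 and 7 lie outside the training range of 1 to 5, they map to values below zero and above one, which is the behavior seen in the test set output above.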