In [1]:
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

# Scale

When the columns of your data have different ranges, or even different units of measurement, they can be difficult to compare. How do kilograms compare to meters? Or altitude to time?

The answer to this problem is scaling. We can scale data into new values that are easier to compare.

Take a look at the table below. It is the same data set that we used in the [Multiple Regression](#multiple-regression) chapter, but this time the `Volume` column contains values in liters instead of cm³ (1.0 instead of 1000).

In [2]:
df = pandas.read_csv('../../data-nstd.csv')

df

Unnamed: 0,Car,Model,Volume,Weight,CO2
0,Toyoty,Aygo,1.0,790,99
1,Mitsubishi,Space Star,1.2,1160,95
2,Skoda,Citigo,1.0,929,95
3,Fiat,500,0.9,865,90
4,Mini,Cooper,1.5,1140,105
5,VW,Up!,1.0,929,105
6,Skoda,Fabia,1.4,1109,90
7,Mercedes,A-Class,1.5,1365,92
8,Ford,Fiesta,1.5,1112,98
9,Audi,A1,1.6,1150,99


It can be difficult to compare the volume 1.0 with the weight 790, but if we scale them both into comparable values, we can easily see how far each value lies from the average of its own column.

There are different methods for scaling data. In this tutorial we will use a method called standardization.

The standardization method uses this formula:

$z = (x - u) / s$

where $z$ is the new value, $x$ is the original value, $u$ is the mean of the column and $s$ is its standard deviation.

If you take the `Weight` column from the data set above, the first value is `790`, and the scaled value will be:

$(790 - 1292.23) / 238.74 \approx -2.1$

If you take the `Volume` column from the data set above, the first value is `1.0`, and the scaled value will be:

$(1.0 - 1.61) / 0.38 \approx -1.59$

The mean and standard deviation shown here are rounded to two decimals, so working the second formula by hand gives about $-1.61$; the unrounded values give $-1.59$.

Now you can compare $-2.1$ with $-1.59$ instead of comparing `790` with `1.0`.
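The formula can be tried out with nothing but the standard library. The sketch below is illustrative only: the `standardize` helper is a hypothetical name, and it uses just the first five `Weight` values from the table, so its results will not match the figures above, which are computed over the whole column. It divides by the population standard deviation, which is the convention `StandardScaler` follows.

```python
import statistics

def standardize(values):
    """Return the z-score of every value: (x - mean) / population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (divides by n)
    return [(x - mean) / std for x in values]

# First five Weight values from the table above (a sample, not the full column),
# so the results differ from the figures computed over the whole data set.
weights = [790, 1160, 929, 865, 1140]
scaled_weights = standardize(weights)

for original, z in zip(weights, scaled_weights):
    print(f"{original:>5} -> {z:+.2f}")
```

After standardization the values have a mean of 0 and a standard deviation of 1, which is what makes columns with different units comparable.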

You do not have to do this manually. The Python **sklearn** module provides a class called `StandardScaler`; calling `StandardScaler()` returns a scaler object with methods for transforming data sets.

In [3]:
X = df[[ 'Weight', 'Volume' ]]
y = df['CO2']

y

0      99
1      95
2      95
3      90
4     105
5     105
6      90
7      92
8      98
9      99
10     99
11    101
12     99
13     94
14     97
15     97
16     99
17    104
18    104
19    105
20     94
21     99
22     99
23     99
24     99
25    102
26    104
27    114
28    109
29    114
30    115
31    117
32    104
33    108
34    109
35    120
Name: CO2, dtype: int64

In [4]:
scale = StandardScaler()
scaledX = scale.fit_transform(X.values)

scaledX

array([[-2.10389253, -1.59336644],
       [-0.55407235, -1.07190106],
       [-1.52166278, -1.59336644],
       [-1.78973979, -1.85409913],
       [-0.63784641, -0.28970299],
       [-1.52166278, -1.59336644],
       [-0.76769621, -0.55043568],
       [ 0.3046118 , -0.28970299],
       [-0.7551301 , -0.28970299],
       [-0.59595938, -0.0289703 ],
       [-1.30803892, -1.33263375],
       [-1.26615189, -0.81116837],
       [-0.7551301 , -1.59336644],
       [-0.16871166, -0.0289703 ],
       [ 0.14125238, -0.0289703 ],
       [ 0.15800719, -0.0289703 ],
       [ 0.3046118 , -0.0289703 ],
       [-0.05142797,  1.53542584],
       [-0.72580918, -0.0289703 ],
       [ 0.14962979,  1.01396046],
       [ 1.2219378 , -0.0289703 ],
       [ 0.5685001 ,  1.01396046],
       [ 0.3046118 ,  1.27469315],
       [ 0.51404696, -0.0289703 ],
       [ 0.51404696,  1.01396046],
       [ 0.72348212, -0.28970299],
       [ 0.8281997 ,  1.01396046],
       [ 1.81254495,  1.01396046],
       [ 0.96642691,

> **Note:** The first two values are -2.1 and -1.59, which correspond to our calculations.
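To see what the scaler actually stores, a fitted `StandardScaler` exposes the per-column mean and standard deviation as `mean_` and `scale_`. The sketch below is a minimal check, assuming only the first five rows of the table as sample data (the name `X_sample` is made up for illustration), that the transform is the formula applied column by column and that `inverse_transform` recovers the original values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five rows of Weight and Volume from the table above (a sample only).
X_sample = np.array([
    [790, 1.0],
    [1160, 1.2],
    [929, 1.0],
    [865, 0.9],
    [1140, 1.5],
])

scaler = StandardScaler()
scaled_sample = scaler.fit_transform(X_sample)

# Reproduce the transform by hand from the fitted statistics.
manual = (X_sample - scaler.mean_) / scaler.scale_

# Both checks compare arrays element by element within floating-point tolerance.
print(np.allclose(scaled_sample, manual))
print(np.allclose(scaler.inverse_transform(scaled_sample), X_sample))
```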

### Predict $CO_2$ Values

The task in the **Multiple Regression** chapter was to predict the $CO_2$ emission from a car when you only knew its weight and volume.

> **When the model is trained on scaled data, new inputs must be transformed with the same fitted scaler before you predict:**

In [5]:
regr = linear_model.LinearRegression()
regr.fit(scaledX, y.values)

In [6]:
# Weight = 2300, Volume = 1.3 (liters), in the same column order as X
scaled = scale.transform([[2300, 1.3]])

In [7]:
predictCO2 = regr.predict(scaled)
predictCO2[0]

107.20873279892223
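One way to avoid forgetting the scaling step at prediction time is to bundle the scaler and the regression into a single scikit-learn pipeline. This is an alternative sketch, not the approach used in the cells above, and it assumes only the first ten rows of the table as sample data (`X_sample` and `y_sample` are illustrative names), so its prediction will differ from the value shown above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# First ten rows of the table above (a sample, not the full data set).
X_sample = np.array([
    [790, 1.0], [1160, 1.2], [929, 1.0], [865, 0.9], [1140, 1.5],
    [929, 1.0], [1109, 1.4], [1365, 1.5], [1112, 1.5], [1150, 1.6],
])
y_sample = np.array([99, 95, 95, 90, 105, 105, 90, 92, 98, 99])

# The pipeline fits the scaler and the regression together ...
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_sample, y_sample)

# ... and scales new inputs automatically inside predict().
new_car = [[2300, 1.3]]
pipeline_prediction = model.predict(new_car)[0]

# The same result, scaling by hand as in the cells above.
scaler = StandardScaler()
regr = LinearRegression().fit(scaler.fit_transform(X_sample), y_sample)
manual_prediction = regr.predict(scaler.transform(new_car))[0]

print(np.isclose(pipeline_prediction, manual_prediction))
```

Both routes apply the same fitted mean and standard deviation to the new input; the pipeline simply makes it impossible to skip that step.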