# Scaling

When the features in your data cover very different ranges of values, or are even measured in different units, they are hard to compare directly. How do kilograms compare to meters? Or altitude to time?

The answer to this problem is scaling. We can scale data into new values that are easier to compare.

For example, after scaling you can compare -2.1 with -1.59 instead of comparing 790 with 1.0, the weight and volume of the first car in the data set below. (W3Schools)
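One common scaling method is standardization, which is what `StandardScaler` applies: each value becomes `z = (x - mean) / std`, where `mean` and `std` are the mean and the population standard deviation of the column. The following is a minimal sketch of that formula using only the standard library. It assumes just the ten `Weight` values shown in the table below, so its results will not match the -2.1 quoted above, which comes from the full data set. The helper name `standardize` is illustrative only.

```python
from statistics import fmean, pstdev

# Weight values (kg) of the first ten cars shown in the table below
weights = [790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150]


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std (divides by n, not n - 1)
    return [(x - mean) / std for x in values]


z_weights = standardize(weights)

# Each z-score says how many standard deviations a value lies from the mean
print([round(z, 2) for z in z_weights])
```

After standardization the column has a mean of 0 and a standard deviation of 1, whatever its original unit was.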

Case:
- The file [unscaled_cars.csv](unscaled_cars.csv) contains the same data set that we used in the multiple regression chapter, but this time the Volume column contains values in liters instead of cm³ (1.0 instead of 1000).

In [1]:
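import pandas

# Load the car data; Volume is in liters, Weight in kilograms
df = pandas.read_csv("unscaled_cars.csv")

# Features (Weight, Volume) and target (CO2)
X = df[["Weight", "Volume"]]
Y = df["CO2"]

df.head(10)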

Unnamed: 0,Car,Model,Volume,Weight,CO2
0,Toyota,Aygo,1.0,790,99
1,Mitsubishi,Space Star,1.2,1160,95
2,Skoda,Citigo,1.0,929,95
3,Fiat,500,0.9,865,90
4,Mini,Cooper,1.5,1140,105
5,VW,Up!,1.0,929,105
6,Skoda,Fabia,1.4,1109,90
7,Mercedes,A-Class,1.5,1365,92
8,Ford,Fiesta,1.5,1112,98
9,Audi,A1,1.6,1150,99


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Standardize each feature column of X to zero mean and unit variance
X_scaled = scaler.fit_transform(X)

print(X_scaled)

[[-2.10389253 -1.59336644]
 [-0.55407235 -1.07190106]
 [-1.52166278 -1.59336644]
 [-1.78973979 -1.85409913]
 [-0.63784641 -0.28970299]
 [-1.52166278 -1.59336644]
 [-0.76769621 -0.55043568]
 [ 0.3046118  -0.28970299]
 [-0.7551301  -0.28970299]
 [-0.59595938 -0.0289703 ]
 [-1.30803892 -1.33263375]
 [-1.26615189 -0.81116837]
 [-0.7551301  -1.59336644]
 [-0.16871166 -0.0289703 ]
 [ 0.14125238 -0.0289703 ]
 [ 0.15800719 -0.0289703 ]
 [ 0.3046118  -0.0289703 ]
 [-0.05142797  1.53542584]
 [-0.72580918 -0.0289703 ]
 [ 0.14962979  1.01396046]
 [ 1.2219378  -0.0289703 ]
 [ 0.5685001   1.01396046]
 [ 0.3046118   1.27469315]
 [ 0.51404696 -0.0289703 ]
 [ 0.51404696  1.01396046]
 [ 0.72348212 -0.28970299]
 [ 0.8281997   1.01396046]
 [ 1.81254495  1.01396046]
 [ 0.96642691 -0.0289703 ]
 [ 1.72877089  1.01396046]
 [ 1.30990057  1.27469315]
 [ 1.90050772  1.01396046]
 [-0.23991961 -0.0289703 ]
 [ 0.40932938 -0.0289703 ]
 [ 0.47215993 -0.0289703 ]
 [ 0.4302729   2.31762392]]


In [4]:
from sklearn import linear_model

# Fit a linear regression on the standardized features
regress = linear_model.LinearRegression()
regress.fit(X_scaled, Y)

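Case: Predict the CO2 emission of a car with a 1.3 liter engine that weighs 2300 kilograms. Because the model was trained on scaled values, the new values must be scaled with the same fitted scaler before predicting: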

In [9]:
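# Column order must match the training data: Weight, then Volume
x_predict_unscaled = pandas.DataFrame([[2300, 1.3]], columns=["Weight", "Volume"])
# Reuse the fitted scaler: transform, not fit_transform
x_predict = scaler.transform(x_predict_unscaled)

y_predict = regress.predict(x_predict)

y_predict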



array([107.2087328])
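The prediction step calls `transform` rather than `fit_transform` because a new car has to be scaled with the mean and standard deviation learned from the training data, not with statistics of its own. The sketch below illustrates this under stated assumptions: it fits a separate scaler, here called `demo_scaler`, on only the ten rows shown in the table above, so its numbers differ from those of the full data set. `mean_` and `scale_` are the fitted attributes of `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Weight (kg) and Volume (liters) of the first ten cars from the table above
X_train = np.array([
    [790, 1.0], [1160, 1.2], [929, 1.0], [865, 0.9], [1140, 1.5],
    [929, 1.0], [1109, 1.4], [1365, 1.5], [1112, 1.5], [1150, 1.6],
])

# Learn the per-column mean and standard deviation from the training rows
demo_scaler = StandardScaler().fit(X_train)

# New, unseen car: 2300 kg, 1.3 liters
x_new = np.array([[2300, 1.3]])

# transform() applies the training statistics to the new row
scaled_by_sklearn = demo_scaler.transform(x_new)
scaled_by_hand = (x_new - demo_scaler.mean_) / demo_scaler.scale_

print(scaled_by_sklearn)
print(np.allclose(scaled_by_sklearn, scaled_by_hand))
```

Calling `fit_transform` on the single new row instead would recompute the statistics from that row alone and discard the scale the regression was trained on.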