# Feature scaling

Some algorithms, particularly distance-based ones, are sensitive to the scale of the input features.
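As a minimal sketch of why scale matters for distances, the snippet below computes the Euclidean distance between two made-up districts described by income and population. The values and variable names are assumptions chosen for illustration, not taken from any dataset.

```python
import numpy as np

# Two hypothetical districts described by (median income, population).
# These values are illustrative assumptions.
a = np.array([8.3, 322.0])
b = np.array([3.8, 565.0])

# Raw Euclidean distance between the two points.
raw_distance = np.linalg.norm(a - b)

# Share of the squared distance contributed by each feature.
share = (a - b) ** 2 / np.sum((a - b) ** 2)

print(raw_distance)
print(share)  # the population column accounts for almost all of the distance
```

Because the population gap is measured in hundreds and the income gap in single digits, the distance is determined almost entirely by population, regardless of which feature is more informative.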

Consider the California housing dataset:

In [1]:
%matplotlib inline
import pandas as pd

from sklearn import datasets

data = datasets.fetch_california_housing()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


In [2]:
df.describe()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31


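Note how widely the scales differ between columns: `Population` has a standard deviation above 1,000, while `AveBedrms` has one below 0.5. If we read the magnitude of linear regression coefficients as feature importance, these differences in scale can over- or under-weight certain features: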

In [3]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df, data.target)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [4]:
imp = pd.Series(model.coef_, index=df.columns)
imp.abs().sort_values(ascending=False)

AveBedrms     0.645066
MedInc        0.436693
Longitude     0.434514
Latitude      0.421314
AveRooms      0.107322
HouseAge      0.009436
AveOccup      0.003787
Population    0.000004
dtype: float64
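The ranking above depends on the units of each feature. A minimal sketch on synthetic data (the data and variable names are assumptions for illustration) shows that rescaling a column changes its coefficient without changing the information it carries:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y depends equally on both features.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 3.0 * X[:, 1]

coef_original = LinearRegression().fit(X, y).coef_

# Express the second feature in units that make its values 1000 times larger
# (for example metres instead of kilometres). The information is unchanged.
X_rescaled = X.copy()
X_rescaled[:, 1] *= 1000.0

coef_rescaled = LinearRegression().fit(X_rescaled, y).coef_

print(coef_original)  # both coefficients are close to 3
print(coef_rescaled)  # the second coefficient is 1000 times smaller
```

Both features are equally important by construction, yet a ranking by raw coefficient magnitude would place the rescaled feature far below the other.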

We can standardize the input features so that each has zero mean and unit variance, which puts their coefficients on a comparable scale:

In [5]:
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df)
pd.DataFrame(X).describe()

,0,1,2,3,4,5,6,7
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,6.6097e-17,5.508083e-18,6.6097e-17,-1.060306e-16,-1.101617e-17,3.442552e-18,-1.079584e-15,-8.526513e-15
std,1.000024,1.000024,1.000024,1.000024,1.000024,1.000024,1.000024,1.000024
min,-1.774299,-2.19618,-1.852319,-1.610768,-1.256123,-0.229,-1.447568,-2.385992
25%,-0.6881186,-0.8453931,-0.3994496,-0.1911716,-0.5638089,-0.06171062,-0.7967887,-1.113209
50%,-0.1767951,0.02864572,-0.08078489,-0.101065,-0.2291318,-0.02431585,-0.6422871,0.5389137
75%,0.4593063,0.6643103,0.2519615,0.006015869,0.2644949,0.02037453,0.9729566,0.7784964
max,5.858286,1.856182,55.16324,69.57171,30.25033,119.4191,2.958068,2.62528
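The `std` row reads 1.000024 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`); with 20640 rows the ratio is sqrt(20640/20639) ≈ 1.000024. The sketch below reproduces the transform by hand on synthetic data, which stands in for the real feature matrix and is an assumption for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature matrix whose columns have very different scales.
X = rng.normal(loc=[4.0, 1400.0], scale=[2.0, 1100.0], size=(1000, 2))

X_scaled = StandardScaler().fit_transform(X)

# The same transform by hand: subtract each column's mean and divide by its
# population standard deviation (NumPy's default, ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))

# The sample standard deviation (ddof=1) of the scaled data is slightly above 1.
print(X_scaled.std(axis=0, ddof=1))
```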


In [6]:
model = LinearRegression()
model.fit(X, data.target)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [7]:
imp = pd.Series(model.coef_, index=df.columns)
imp.abs().sort_values(ascending=False)

Latitude      0.899886
Longitude     0.870541
MedInc        0.829619
AveBedrms     0.305696
AveRooms      0.265527
HouseAge      0.118752
AveOccup      0.039326
Population    0.004503
dtype: float64

After scaling, `Latitude`, `Longitude` and `MedInc` have the largest coefficients, while `AveBedrms` drops from first place to fourth. The signed coefficients show the direction of each effect:

In [8]:
imp

MedInc        0.829619
HouseAge      0.118752
AveRooms     -0.265527
AveBedrms     0.305696
Population   -0.004503
AveOccup     -0.039326
Latitude     -0.899886
Longitude    -0.870541
dtype: float64
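For ordinary least squares with an intercept, each standardized coefficient equals the raw coefficient multiplied by that feature's standard deviation. The outputs above agree with this: for `AveBedrms`, 0.645066 × 0.473911 ≈ 0.3057. A minimal sketch that checks the relationship on synthetic data (the data and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales, plus a noisy linear target.
X = rng.normal(loc=[4.0, 1400.0], scale=[2.0, 1100.0], size=(500, 2))
y = 0.4 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(scale=0.1, size=500)

scaler = StandardScaler().fit(X)

raw_coef = LinearRegression().fit(X, y).coef_
scaled_coef = LinearRegression().fit(scaler.transform(X), y).coef_

# scaler.scale_ holds the per-column standard deviation used for scaling.
print(np.allclose(scaled_coef, raw_coef * scaler.scale_))
```

Standardizing therefore does not change the fitted predictions of a plain linear regression; it only re-expresses each coefficient as the effect of a one-standard-deviation change in that feature.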

# Lab

Open the [Feature Scaling Lab](feature-scaling-lab.ipynb)