# **Comparing feature magnitude**

Many machine learning algorithms are sensitive to the scale of the features.

In this notebook, we will learn to inspect feature magnitudes using the most common summary statistics.

In [1]:
import pandas as pd

# the dataset for the demo
from sklearn.datasets import fetch_california_housing

In [2]:
# load the California housing dataset from sklearn
fetch_california = fetch_california_housing()

# create a dataframe with the independent variables
data = pd.DataFrame(fetch_california.data,
                    columns=fetch_california.feature_names)

data.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


In [3]:
# let's have a look at the values of those variables
# to get an idea of the feature magnitudes

data.describe()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31


The table shows the main summary statistics of each variable: the mean, the standard deviation, the minimum and maximum values, and the 25th, 50th and 75th percentiles. Comparing these statistics quickly tells us whether the features are on a similar scale. In this case, they clearly are not.

MedInc takes values from about 0.5 to 15, HouseAge from 1 to 52, and Population from 3 to 35,682.
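To illustrate why these differences matter, here is a minimal sketch of how a distance-based comparison behaves on unscaled features. It uses the first two rows shown by `data.head()` and only three of the columns; the choice of Euclidean distance and the variable names (`row_a`, `row_b`, `shares`) are assumptions made for illustration, not part of the original analysis.

```python
import math

# First two rows from data.head(): (MedInc, HouseAge, Population)
feature_names = ("MedInc", "HouseAge", "Population")
row_a = (8.3252, 41.0, 322.0)
row_b = (8.3014, 21.0, 2401.0)

# Euclidean distance using all three features
full_distance = math.dist(row_a, row_b)

# The same distance with Population left out
distance_without_population = math.dist(row_a[:2], row_b[:2])

# Share of the squared distance contributed by each feature
squared_diffs = [(a - b) ** 2 for a, b in zip(row_a, row_b)]
total = sum(squared_diffs)
shares = [d / total for d in squared_diffs]

print(f"distance with all features:  {full_distance:.2f}")
print(f"distance without Population: {distance_without_population:.2f}")
for name, share in zip(feature_names, shares):
    print(f"{name}: {share:.6f} of the squared distance")
```

In this sketch, almost all of the distance comes from Population, simply because its values are orders of magnitude larger than those of the other two features.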

In [4]:
# let's now calculate the range of the variables

data.max() - data.min()

MedInc           14.500200
HouseAge         51.000000
AveRooms        141.062937
AveBedrms        33.733333
Population    35679.000000
AveOccup       1242.641026
Latitude          9.410000
Longitude        10.040000
dtype: float64

As expected, the ranges of the variables are quite different: from about 9.4 for Latitude to 35,679 for Population.
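One way to put a number on this difference is to sort the ranges and compare the largest with the smallest. The sketch below copies the values from the output above into a plain dictionary so that it runs on its own; the names `ranges`, `sorted_ranges` and `ratio` are illustrative only.

```python
# Ranges copied from the output of data.max() - data.min()
ranges = {
    "MedInc": 14.5002,
    "HouseAge": 51.0,
    "AveRooms": 141.062937,
    "AveBedrms": 33.733333,
    "Population": 35679.0,
    "AveOccup": 1242.641026,
    "Latitude": 9.41,
    "Longitude": 10.04,
}

# Sort the features from the widest to the narrowest range
sorted_ranges = sorted(ranges.items(), key=lambda item: item[1], reverse=True)

widest_name, widest_range = sorted_ranges[0]
narrowest_name, narrowest_range = sorted_ranges[-1]

# How many times wider the widest range is than the narrowest one
ratio = widest_range / narrowest_range

for name, value in sorted_ranges:
    print(f"{name:<12}{value:>14.4f}")
print(f"{widest_name} range is about {ratio:.0f} times the {narrowest_name} range")
```

With the dataframe loaded in this notebook, the same ordering should be obtainable with `(data.max() - data.min()).sort_values(ascending=False)`.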