# Preprocessing California Housing Dataset with Feature Scaling

We will use the California housing dataset to illustrate how feature scaling works. The dataset was derived from the 1990 U.S. census, and each row represents one census block group.

## Import and Load the Dataset

In [1]:
from sklearn.datasets import fetch_california_housing

california_housing = fetch_california_housing(as_frame=True)
df = california_housing.frame

## Explore the Dataset

Take a peek at the first few rows of data with ``head()``.

In [2]:
import pandas as pd

# Display the first five rows of the DataFrame
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


## Preprocessing the Dataset

The goal is to predict the median house value. To do so, we assign ``MedHouseVal`` to ``y`` and all other columns to ``X`` by dropping ``MedHouseVal``.

In [3]:
y = df['MedHouseVal']
X = df.drop(['MedHouseVal'], axis=1)

## Splitting Data into Train and Test Sets

Sample 75% of the data for training and 25% for testing. To ensure a reproducible evaluation, set ``random_state`` to the provided ``SEED``.

In [4]:
from sklearn.model_selection import train_test_split

SEED = 42
# Hold out 25% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)

# Check the split by printing the lengths of the full dataset and of each subset
print(len(X))
print(len(X_train))
print(len(X_test))

20640
15480
5160
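The effect of ``test_size`` and ``random_state`` can be checked on a small example. The sketch below uses synthetic stand-in arrays (``X_demo`` and ``y_demo`` are illustrative names, not part of the housing data) to show that a 25% test fraction yields the expected sizes and that reusing the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the housing dataset)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

SEED = 42

# Split twice with the same seed
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=SEED)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=SEED)

print(len(a_train), len(a_test))        # 75 25
print(np.array_equal(a_test, b_test))   # True: same seed, same rows in the test set
```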

## Feature Scaling both Train and Test Sets

Feature scaling takes four steps: import ``StandardScaler``, instantiate it, fit it on the training data only (so that no information from the test set leaks into the scaling statistics), and transform both the training and test sets.

In [5]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
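To make the fit/transform distinction concrete, here is a minimal sketch on synthetic data (``train_demo`` and ``test_demo`` are made-up arrays, assumed only for illustration). It shows that transforming the test set is equivalent to subtracting the training mean and dividing by the training standard deviation, so the test data never influences the scaling statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Fit on the training data only
scaler_demo = StandardScaler().fit(train_demo)

# Reproduce the transformation by hand using training statistics
train_mean = train_demo.mean(axis=0)
train_std = train_demo.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual_test = (test_demo - train_mean) / train_std

# The scaler's output on the test set matches the manual z-scores
matches = np.allclose(scaler_demo.transform(test_demo), manual_test)
print(matches)
```

Because the test set is scaled with training statistics, its columns will generally not have a mean of exactly 0 or a standard deviation of exactly 1.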

Let's put the scaled training data back into a ``DataFrame`` with column names and call ``describe()`` to check the resulting mean and standard deviation of each feature.

In [6]:
col_names = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
scaled_df = pd.DataFrame(X_train, columns=col_names)
scaled_df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MedInc | 15480.0 | 2.074711e-16 | 1.000032 | -1.774632 | -0.688854 | -0.175663 | 0.46445 | 5.842113 |
| HouseAge | 15480.0 | -1.232434e-16 | 1.000032 | -2.188261 | -0.840224 | 0.032036 | 0.666407 | 1.855852 |
| AveRooms | 15480.0 | -1.620294e-16 | 1.000032 | -1.877586 | -0.407008 | -0.08394 | 0.257082 | 56.357392 |
| AveBedrms | 15480.0 | 7.435912e-17 | 1.000032 | -1.740123 | -0.205765 | -0.108332 | 0.007435 | 55.925392 |
| Population | 15480.0 | -8.996536e-17 | 1.000032 | -1.246395 | -0.558886 | -0.227928 | 0.262056 | 29.971725 |
| AveOccup | 15480.0 | 1.055716e-17 | 1.000032 | -0.201946 | -0.056581 | -0.024172 | 0.014501 | 103.737365 |
| Latitude | 15480.0 | 7.890329e-16 | 1.000032 | -1.451215 | -0.79982 | -0.645172 | 0.971601 | 2.953905 |
| Longitude | 15480.0 | 2.206676e-15 | 1.000032 | -2.380303 | -1.106817 | 0.536231 | 0.785934 | 2.633738 |
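
The means are zero up to floating-point error, but the reported std is 1.000032 rather than exactly 1. A likely explanation is a difference in conventions: ``StandardScaler`` divides by the population standard deviation (``ddof=0``), while pandas' ``describe()`` reports the sample standard deviation (``ddof=1``), which differs by a factor of sqrt(n / (n - 1)). The sketch below checks this on synthetic data with the same number of rows as the training set (the random values are an assumption for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 15480  # number of training rows, taken from the count column above

# Synthetic single-feature data (an assumption for illustration)
rng = np.random.default_rng(0)
values = rng.normal(size=(n, 1))

scaled = StandardScaler().fit_transform(values)
col = pd.Series(scaled[:, 0])

population_std = col.std(ddof=0)   # convention used by StandardScaler
sample_std = col.std()             # pandas default (ddof=1), as shown by describe()
expected_ratio = np.sqrt(n / (n - 1))

print(np.isclose(population_std, 1.0))         # True
print(np.isclose(sample_std, expected_ratio))  # True
print(round(expected_ratio, 6))                # 1.000032
```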
