# California Housing Dataset Analysis and Preprocessing

This notebook performs an initial exploratory data analysis (EDA) of the California Housing dataset: loading the data, inspecting its structure and summary statistics, checking for missing values, engineering ratio features, and standardizing the numeric columns. The goal is to prepare the dataset for downstream machine learning tasks.


In [None]:
# pandas to load and manipulate tabular data
import pandas as pd

# Load California housing training dataset provided by the environment
df = pd.read_csv('/content/sample_data/california_housing_train.csv')

In [None]:
# Show first few rows of the dataset
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


The dataset contains eight features covering location (`longitude`, `latitude`), housing stock (`housing_median_age`, `total_rooms`, `total_bedrooms`), and demographics (`population`, `households`, `median_income`), plus the target variable `median_house_value`.


In [None]:
# Check data types and non-null counts for understanding the dataset completeness
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB


In [None]:
# Display statistical summary for numerical features
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
mean,-114.496,33.898,17.0,3387.4,804.2,723.2,308.0,2.01624,74320.0
std,0.112161,0.372384,2.54951,3063.093502,754.374377,337.366566,155.114474,0.677342,8611.155555
min,-114.57,33.57,14.0,720.0,174.0,333.0,117.0,1.4936,65500.0
25%,-114.57,33.64,15.0,1454.0,326.0,515.0,226.0,1.6509,66900.0
50%,-114.56,33.69,17.0,1501.0,337.0,624.0,262.0,1.82,73400.0
75%,-114.47,34.19,19.0,5612.0,1283.0,1015.0,463.0,1.925,80100.0
max,-114.31,34.4,20.0,7650.0,1901.0,1129.0,472.0,3.1917,85700.0


`df.info()` reports 17,000 non-null values in each of the nine columns, all of type `float64`, so no column has missing entries. `df.describe()` summarizes each feature's range, mean, and spread.

In [None]:
# Check if there are any missing values in the dataset
df.isnull().sum()

Unnamed: 0,0
longitude,0
latitude,0
housing_median_age,0
total_rooms,0
total_bedrooms,0
population,0
households,0
median_income,0
median_house_value,0
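
No values are missing here, so nothing needs to be imputed. If a column did contain gaps, one common option is to fill them with the column median. The sketch below illustrates that on a small hypothetical frame (the `NaN` is invented for the example and does not come from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing entry in total_bedrooms (not from the real dataset)
toy = pd.DataFrame({
    "total_rooms": [720.0, 1501.0, 1454.0],
    "total_bedrooms": [174.0, np.nan, 326.0],
})

# Count missing values per column, as df.isnull().sum() does above
missing_before = toy.isnull().sum()

# Fill each column's gaps with that column's median (the median skips NaN by default)
toy_filled = toy.fillna(toy.median())

missing_after = toy_filled.isnull().sum()
```

Median imputation is only one choice; dropping the affected rows or using a model-based imputer may suit other situations better.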


In [None]:
# Create additional ratio features that may be helpful for prediction
df['rooms_per_household'] = df['total_rooms'] / df['households']
df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']
df['population_per_household'] = df['population'] / df['households']

In [None]:
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,rooms_per_household,bedrooms_per_room,population_per_household
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,11.889831,0.228617,2.150424
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,16.522678,0.248497,2.438445
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,6.153846,0.241667,2.846154
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,6.641593,0.224517,2.278761
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,5.549618,0.224209,2.381679
...,...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,6.008130,0.177718,2.457995
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,5.051613,0.224777,2.567742
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,5.870614,0.198356,2.728070
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,5.589958,0.206587,2.715481


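These new features express raw counts as ratios: `rooms_per_household` is the average number of rooms per household, `bedrooms_per_room` is the share of rooms that are bedrooms, and `population_per_household` is the average household size. Unlike the raw totals, these ratios do not grow with the number of households in a row, so they may better capture housing characteristics relevant to price.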
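
The ratios divide by `households` and `total_rooms`, so a zero in either column would produce `inf`. A minimal sketch of a guarded version follows; the helper name `add_ratio_features` and the second sample row (zero households) are hypothetical, added only for illustration:

```python
import numpy as np
import pandas as pd

def add_ratio_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with the three ratio columns added."""
    out = frame.copy()
    # Replace zero denominators with NaN so the ratio becomes NaN rather than inf
    households = out["households"].replace(0, np.nan)
    rooms = out["total_rooms"].replace(0, np.nan)
    out["rooms_per_household"] = out["total_rooms"] / households
    out["bedrooms_per_room"] = out["total_bedrooms"] / rooms
    out["population_per_household"] = out["population"] / households
    return out

# Row 0 of the dataset shown above, plus a hypothetical row with zero households
sample = pd.DataFrame({
    "total_rooms": [5612.0, 100.0],
    "total_bedrooms": [1283.0, 20.0],
    "population": [1015.0, 50.0],
    "households": [472.0, 0.0],
})
result = add_ratio_features(sample)
```

For the first row this reproduces the values in the table above; for the hypothetical second row the per-household ratios come out as `NaN`, which `isnull()` can then flag.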


In [None]:
df.head()


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,rooms_per_household,bedrooms_per_room,population_per_household
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,11.889831,0.228617,2.150424
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0,16.522678,0.248497,2.438445
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,6.153846,0.241667,2.846154
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,6.641593,0.224517,2.278761
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0,5.549618,0.224209,2.381679


## Summary

So far, this notebook has loaded the California housing dataset, explored its structure, confirmed that no values are missing, and created three ratio features that could improve model performance. The next steps are feature scaling, model building, and evaluation.


## Feature Scaling with StandardScaler

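Feature scaling is a common preprocessing step before fitting a machine learning model. `StandardScaler` standardizes each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up with zero mean and unit variance. This matters most for algorithms that are sensitive to the scale of their inputs, such as k-nearest neighbors, support vector machines, and models trained by gradient descent.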
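
To make the transformation concrete, the sketch below standardizes one column by hand with NumPy. It uses the five `housing_median_age` values from `df.head()`, so the resulting z-scores are relative to those five rows only and will not match values computed over the full dataset:

```python
import numpy as np

# housing_median_age for the five rows shown by df.head()
age = np.array([15.0, 19.0, 17.0, 14.0, 20.0])

mean = age.mean()        # 17.0
std = age.std(ddof=0)    # population standard deviation, the convention StandardScaler uses
z = (age - mean) / std   # standardized values: zero mean, unit variance

# pandas' Series.std() and describe() default to ddof=1, which gives a slightly larger value
sample_std = age.std(ddof=1)
```

The `ddof` difference explains why a standard deviation reported by `df.describe()` will not exactly equal the one `StandardScaler` divides by.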


In [None]:
# Import StandardScaler from sklearn
from sklearn.preprocessing import StandardScaler

# Select every float64 column; this includes the target and the engineered ratios
num_cols = df.select_dtypes(include=['float64']).columns

# Initialize the scaler
scaler = StandardScaler()

# Fit on the numeric columns and overwrite them in place with the standardized values
df[num_cols] = scaler.fit_transform(df[num_cols])


In [None]:
# Display the first few rows of scaled dataset
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,rooms_per_household,bedrooms_per_room,population_per_household
0,2.619365,-0.67152,-1.079671,1.361695,1.764204,-0.361184,-0.075998,-1.252543,-1.210558,2.540559,0.274246,-0.204549
1,2.539569,-0.573264,-0.761872,2.296608,3.230441,-0.261865,-0.099404,-1.081483,-1.096745,4.365146,0.618055,-0.133217
2,2.494683,-0.905463,-0.920772,-0.882462,-0.866956,-0.955354,-0.999252,-1.170105,-1.048461,0.281516,0.499931,-0.032242
3,2.489696,-0.928857,-1.159121,-0.524186,-0.48023,-0.796793,-0.715774,-0.3626,-1.154514,0.473608,0.203333,-0.172765
4,2.489696,-0.961609,-0.682422,-0.545747,-0.506328,-0.70183,-0.622148,-1.026454,-1.222629,0.043548,0.198008,-0.147276


All numeric columns, including the target `median_house_value`, now have a mean of 0 and a standard deviation of 1 over the full dataset. Putting features on a common scale keeps those with large raw ranges, such as `total_rooms`, from dominating distance-based or gradient-based models.
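
The cell above fits the scaler on every `float64` column of the whole dataset, target included. When a model will be evaluated on held-out data, a common alternative is to leave the target unscaled and fit the scaler on the training split only, so that no statistics from the test rows leak into preprocessing. The sketch below shows that pattern on a hypothetical random stand-in for `df` (the values, the 80/20 split, and the seeds are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for df: random values for two features plus the target
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=200),
    "housing_median_age": rng.uniform(1.0, 52.0, size=200),
    "median_house_value": rng.uniform(15000.0, 500000.0, size=200),
})

# Separate the features from the target so the target is not scaled
X = demo.drop(columns=["median_house_value"])
y = demo["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

With this pattern the training columns have exactly zero mean and unit variance, while the test columns are only approximately standardized, which is expected.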
