### Housing Dataset

Suppose we want to buy a house in a neighborhood, and we have data describing the general characteristics of each neighborhood, its houses, and its population. To temper our expectations, we want to predict the median house value.

In [2]:
from pathlib import Path

import numpy as np
import pandas as pd

# Build the path with pathlib so it works on any operating system
file_path = Path('..') / 'datasets'
housing = pd.read_csv(file_path / 'housing.csv')
housing.head()


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [3]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [4]:
housing.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


We want to predict the `median_house_value` column. First, we separate this target column from the feature columns, the possible determinants we will use for the prediction. Then we split the data into a training set and a test set.

In [5]:
from sklearn.model_selection import train_test_split

target_cols = ['median_house_value']
feature_cols = [col for col in housing.columns if col not in target_cols]

x_full = housing[feature_cols]
y = housing[target_cols]

x_train, x_test, y_train, y_test = train_test_split(x_full, y, train_size = 0.8, random_state = 0)
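
To make the effect of `train_size = 0.8` concrete, here is a minimal sketch on a made-up 10-row frame. The `toy` data and its values are hypothetical stand-ins, not part of the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row stand-in for the housing data
toy = pd.DataFrame({
    'median_income': np.arange(10, dtype=float),
    'median_house_value': np.arange(10, dtype=float) * 1000,
})

toy_x = toy[['median_income']]
toy_y = toy[['median_house_value']]

# 80% of the rows go to the training set, the remaining 20% to the test set
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, train_size=0.8, random_state=0
)

print(toy_x_train.shape, toy_x_test.shape)  # (8, 1) (2, 1)
# Features and target are split on the same rows, so their indexes match
print(toy_x_test.index.equals(toy_y_test.index))  # True
```

Fixing `random_state` means the same rows land in each set every time the cell is run.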

It is important to check which feature columns contain blank cells so we can deal with them later.

In [6]:
null_cols = [col for col in x_full.columns if x_full[col].isnull().any()]
null_cols

['total_bedrooms']

Check the number of rows with missing entries.

In [7]:
nan_count = x_full[null_cols].isnull().sum().sum()
print('There are {} rows with NaN values'.format(nan_count))

There are 207 rows with NaN values
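
When more than one column has gaps, a per-column view is usually more informative than a single total. The following is a small sketch on a made-up frame (the `toy` values are hypothetical) showing the count and share of missing cells per column, and the number of rows with at least one missing cell.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing entries in one column
toy = pd.DataFrame({
    'total_rooms': [880.0, 7099.0, 1467.0, 1274.0],
    'total_bedrooms': [129.0, np.nan, 190.0, np.nan],
})

missing_per_col = toy.isnull().sum()    # number of NaN cells per column
missing_share = toy.isnull().mean()     # fraction of NaN cells per column
rows_with_missing = int(toy.isnull().any(axis=1).sum())  # rows with any NaN

print(missing_per_col)
print(missing_share)
print(rows_with_missing)
```

Summing cell counts across columns equals the row count only when a single column has missing values, which is the case for `total_bedrooms` here.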


We will list the numerical and categorical columns.

In [8]:
num_cols = [col for col in feature_cols if x_full[col].dtype in ['int64', 'float64']]
categorical_cols = [col for col in feature_cols if x_full[col].dtype in ['object']]

print('The numerical columns are: {}'.format(num_cols))
print('The categorical columns are: {}'.format(categorical_cols))

The numerical columns are: ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
The categorical columns are: ['ocean_proximity']


As we can see, the only column with null cells is numerical. We can preprocess the data by filling the null cells with the column's median value. This is better than simply recording 0 total bedrooms for a community.

In [9]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

num_transformer = SimpleImputer(strategy = 'median')
cat_transformer = OneHotEncoder(handle_unknown = 'ignore')
preprocess = ColumnTransformer(transformers = [('num', num_transformer, num_cols), ('cat', cat_transformer, categorical_cols)])
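
To see how the `strategy` argument of `SimpleImputer` changes the filled value, here is a minimal sketch on a made-up, skewed column. The numbers are hypothetical and chosen only so that one large value pulls the mean away from the median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical skewed column: one very large value and one missing entry
bedrooms = np.array([[100.0], [200.0], [300.0], [6000.0], [np.nan]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(bedrooms)
median_filled = SimpleImputer(strategy='median').fit_transform(bedrooms)

# The last row held the NaN, so it now holds the imputed value
print(mean_filled[-1, 0])    # mean of the observed values: 1650.0
print(median_filled[-1, 0])  # median of the observed values: 250.0
```

On a column with a long right tail, the median stays closer to a typical entry than the mean does.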

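Next, we will use a Random Forest Regressor and try ten randomly chosen maximum depths between 1 and 49, keeping the depth with the lowest error. Because the depth is chosen using the test set, the reported error is likely to be somewhat optimistic.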

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import random

random.seed(0)  # make the sampled depths reproducible

depth = {}
# Sample 10 distinct depths from 1..49 (random.sample avoids duplicate fits)
for max_depth_val in random.sample(range(1, 50), k = 10):
    model = RandomForestRegressor(max_depth = max_depth_val, random_state = 0)
    pipeline = Pipeline(steps = [('preprocessor', preprocess), ('model', model)])
    pipeline.fit(x_train, y_train.values.ravel())
    predicted_val = pipeline.predict(x_test)
    error = mean_squared_error(y_test, predicted_val)
    depth[max_depth_val] = error
# Keep the depth with the lowest mean squared error
optim_depth = min(depth, key = depth.get)
error = depth[optim_depth]
# Report the best depth, not the last depth visited by the loop
print('The root-mean-square error for a maximum depth of {} is {}.'.format(optim_depth, np.sqrt(error)))
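
As a worked illustration of the error metric, here is a short sketch that computes the mean squared error and its square root on three made-up house values. The `actual` and `predicted` numbers are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted median house values, in dollars
actual = np.array([200000.0, 150000.0, 300000.0])
predicted = np.array([210000.0, 140000.0, 330000.0])

# Errors are 10000, 10000 and 30000, so MSE = (1e8 + 1e8 + 9e8) / 3
mse = mean_squared_error(actual, predicted)
# The square root brings the error back to the target's units (dollars)
rmse = np.sqrt(mse)

print(mse)
print(rmse)
```

The root-mean-square error is easier to interpret than the mean squared error because it is expressed in the same units as `median_house_value`.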

For comparison, we can check the actual and predicted values side-by-side.

In [None]:
model = RandomForestRegressor(max_depth = optim_depth, random_state = 0)
pipeline = Pipeline(steps = [('preprocessor', preprocess), ('model', model)])
pipeline.fit(x_train, y_train.values.ravel())
predicted_val = pipeline.predict(x_test)
predicted_cols = pd.DataFrame(predicted_val, columns = ['predicted'], index = y_test.index)
comparison_table = y_test.join(predicted_cols)
comparison_table.head()

We can search for parameter values that lower this error. For example, we can write loops that compute the error for each value of a parameter and keep the value that minimizes it. Doing this for every parameter may be computationally expensive. I want to try hyperparameter tuning as described in [this article](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74), but as we can see, the calculation for a maximum depth of 30 already takes more than 10 seconds, so a full search may consume a lot of time. We will stop here for now.
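
One common way to do this kind of tuning in scikit-learn is `RandomizedSearchCV`, which samples a fixed number of parameter combinations and scores each with cross-validation. The sketch below is an illustration only: it runs on small synthetic data from `make_regression` rather than the housing set, and the parameter ranges, `n_iter = 5`, and `cv = 3` are assumptions chosen to keep it fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (a stand-in, not the housing data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Assumed search ranges: depths 1..49 and 10..39 trees
param_distributions = {
    'max_depth': randint(1, 50),
    'n_estimators': randint(10, 40),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                               # number of sampled combinations
    cv=3,                                   # 3-fold cross-validation
    scoring='neg_root_mean_squared_error',  # higher (less negative) is better
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best combination
```

The total cost is roughly `n_iter` times `cv` model fits, so both values can be lowered when a single fit is slow. Because the search scores candidates with cross-validation, the test set can be kept aside for a final check.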