### Housing Dataset

Suppose we want to buy a house in a neighborhood, and we have data describing the general characteristics of the neighborhood, its houses, and its population. To temper our expectations, we want to predict the median house value.

In [None]:
import pandas as pd
import numpy as np

file_path = 'Datasets\\'
housing = pd.read_csv(file_path + 'housing.csv')
housing.head()

In [None]:
housing.info()

In [None]:
housing.describe()

We want to predict the `median_house_value` column. First, we separate the column we want to predict (the target column) from the possible determinants that we will use for the prediction (the feature columns). Then, we split the data into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split

target_cols = ['median_house_value']
feature_cols = [col for col in housing.columns if col not in target_cols]

x_full = housing[feature_cols]
y = housing[target_cols]

x_train, x_test, y_train, y_test = train_test_split(x_full, y, train_size = 0.8, random_state = 0)
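As a quick sanity check of what the split does, here is a minimal sketch on a hypothetical ten-row table (the `toy_*` names and the `rooms` column are made up for illustration and are not part of the housing data). With `train_size = 0.8`, eight rows should go to training and two to testing, and the feature and target rows should keep matching index labels.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table standing in for the housing data
toy = pd.DataFrame({'rooms': range(10), 'median_house_value': range(100, 110)})
toy_x = toy[['rooms']]
toy_y = toy[['median_house_value']]

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, train_size=0.8, random_state=0)

# 80% of 10 rows go to training, the remaining rows to testing
print(toy_x_train.shape, toy_x_test.shape)

# Features and target are shuffled together, so their row labels still match
print((toy_x_train.index == toy_y_train.index).all())
```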

It is important to check whether there are blank cells, and which feature columns contain them, so that we can deal with them later.

In [None]:
null_cols = [col for col in x_full.columns if x_full[col].isnull().any()]
null_cols

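Count the rows that contain at least one missing entry.

In [None]:
# any(axis=1) flags each row with at least one NaN, so the sum is a row count, not a cell count
nan_count = x_full[null_cols].isnull().any(axis=1).sum()
print('There are {} rows with NaN values'.format(nan_count))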
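Counting missing cells and counting rows with missing values give the same number only when no row has more than one missing entry. The sketch below illustrates the difference on a hypothetical toy table (the column names and values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with NaNs spread over two columns
toy_missing = pd.DataFrame({
    'total_rooms': [10.0, 20.0, 30.0, 40.0],
    'total_bedrooms': [2.0, np.nan, 6.0, np.nan],
    'households': [np.nan, 5.0, 6.0, np.nan],
})

# Every NaN cell is counted once: 2 + 2 = 4
cells_missing = toy_missing.isnull().sum().sum()

# A row is counted once even if several of its cells are NaN: rows 0, 1, 3
rows_missing = toy_missing.isnull().any(axis=1).sum()

# Per-column breakdown, useful for deciding how to impute each column
per_column = toy_missing.isnull().sum()

print(cells_missing, rows_missing)
print(per_column)
```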

We will list the numerical and categorical columns.

In [None]:
num_cols = [col for col in feature_cols if x_full[col].dtype in ['int64', 'float64']]
categorical_cols = [col for col in feature_cols if x_full[col].dtype in ['object']]

print('The numerical columns are: {}'.format(num_cols))
print('The categorical columns are: {}'.format(categorical_cols))

As we can see, the column with null cells is numerical. We can preprocess the data by filling each null cell with the column's median value, which is better than simply putting in 0 total bedrooms for a community.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

num_transformer = SimpleImputer(strategy = 'median')
cat_transformer = OneHotEncoder(handle_unknown = 'ignore')
preprocess = ColumnTransformer(transformers = [('num', num_transformer, num_cols), ('cat', cat_transformer, categorical_cols)])
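To see what this kind of transformer produces, here is a small sketch on a hypothetical two-column table (the `bedrooms` and `region` columns and their values are invented). It uses the same two steps, median imputation and one-hot encoding, and passes `sparse_threshold=0` only so that the result prints as a plain dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy table: one numerical column with a gap, one categorical column
toy_features = pd.DataFrame({
    'bedrooms': [1.0, np.nan, 3.0, 100.0],
    'region': ['INLAND', 'NEAR BAY', 'INLAND', 'NEAR BAY'],
})

toy_preprocess = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), ['bedrooms']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['region']),
    ],
    sparse_threshold=0,  # always return a dense array (for readability only)
)

# Output columns: bedrooms, region == 'INLAND', region == 'NEAR BAY'
toy_out = toy_preprocess.fit_transform(toy_features)
print(toy_out)

# The gap is filled with the median of [1, 3, 100], which is 3;
# the outlier 100 does not pull it upward as it would pull a mean
print(toy_out[1, 0])

# A category not seen during fitting is encoded as all zeros
unseen = pd.DataFrame({'bedrooms': [np.nan], 'region': ['ISLAND']})
unseen_out = toy_preprocess.transform(unseen)
print(unseen_out)
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it for skewed columns.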

Next, we will use a Random Forest Regressor. We try ten randomly chosen maximum depths between 1 and 49 and keep the one with the lowest error on the test set.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import random

# Try ten randomly drawn maximum depths and record the test-set MSE of each
# (random.choices samples with replacement, so a depth may be drawn twice)
depth = {}
for max_depth_val in random.choices([*range(1, 50, 1)], k = 10):
    model = RandomForestRegressor(max_depth = max_depth_val, random_state = 0)
    pipeline = Pipeline(steps = [('preprocessor', preprocess), ('model', model)])
    pipeline.fit(x_train, y_train.values.ravel())
    predicted_val = pipeline.predict(x_test)
    error = mean_squared_error(y_test, predicted_val)
    depth[max_depth_val] = error
# Keep the depth with the lowest error
optim_depth = min(depth, key = depth.get)
error = depth[optim_depth]
# Report the best depth found, not the last depth tried in the loop
print('The root-mean-square error for a maximum depth of {} is {}.'.format(optim_depth, np.sqrt(error)))

For comparison, we can check the actual and predicted values side-by-side.

In [None]:
model = RandomForestRegressor(max_depth = optim_depth, random_state = 0)
pipeline = Pipeline(steps = [('preprocessor', preprocess), ('model', model)])
pipeline.fit(x_train, y_train.values.ravel())
predicted_val = pipeline.predict(x_test)
predicted_cols = pd.DataFrame(predicted_val, columns = ['predicted'], index = y_test.index)
comparison_table = y_test.join(predicted_cols)
comparison_table.head()
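A side-by-side table is easier to read with an explicit error column. The sketch below shows the idea on a hypothetical three-row table shaped like `comparison_table` (the numbers are invented, not model output): it adds the absolute error per row and recomputes the root-mean-square error from it.

```python
import numpy as np
import pandas as pd

# Hypothetical table with the same two columns as comparison_table
toy_comparison = pd.DataFrame({
    'median_house_value': [100.0, 200.0, 300.0],
    'predicted': [110.0, 190.0, 330.0],
})

# Absolute difference between prediction and actual value, per row
toy_comparison['abs_error'] = (
    toy_comparison['predicted'] - toy_comparison['median_house_value']).abs()

# Root-mean-square error: square the errors, average them, take the root
toy_rmse = np.sqrt((toy_comparison['abs_error'] ** 2).mean())

print(toy_comparison)
print(toy_rmse)
```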

We can search for parameter values that lower this error. For example, we can loop over candidate values of a parameter, calculate the error for each, and keep the value that minimizes it. Doing this for every parameter may be computationally expensive. I want to try hyperparameter tuning as described in [this article](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74), but as we can see, the calculation for a maximum depth of 30 already takes more than 10 seconds. This may consume a lot of time, so we will stop here for now.
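For reference, one common way to run such a search is scikit-learn's `RandomizedSearchCV`, which samples parameter combinations and scores each with cross-validation instead of the test set. The sketch below is only an illustration under assumptions: it runs on small synthetic data from `make_regression` rather than the housing data, and the parameter ranges, `n_iter`, and `cv` values are arbitrary choices kept small so that it finishes quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption, not the housing dataset)
search_x, search_y = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0)

# Candidate values to sample from (arbitrary illustrative ranges)
param_distributions = {
    'max_depth': list(range(1, 50)),
    'n_estimators': [10, 20, 50],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                               # number of sampled combinations
    cv=3,                                   # 3-fold cross-validation per combination
    scoring='neg_root_mean_squared_error',  # higher (closer to 0) is better
    random_state=0,
)
search.fit(search_x, search_y)

print(search.best_params_)
# The score is a negated RMSE, so flip the sign to read it as an error
print(-search.best_score_)
```

The cost grows with `n_iter` times `cv` times the cost of one fit, so on the full housing data these settings would need to be chosen with the fitting time above in mind.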