In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Amsterdam House Prices
In this notebook, I will predict house prices in the Amsterdam House Prices Predictions dataset using the regression models that I am familiar with.

Table of Contents:
- Data Preprocessing
- Data Cleaning
- Feature Engineering
- Optional: Suggestions
- Further Feature Engineering and Data Processing
- Modeling (Train Test Split + Model Cross Validation)
- Parameter Tuning & Final Model


# Data Preprocessing
Okay - first things first. Let's take a small look at the data.

In [None]:
df = pd.read_csv('../input/amsterdam-house-price-prediction/HousingPrices-Amsterdam-August-2021.csv')
df.head()

We can tell that there aren't many features to begin with, but we can definitely create more, namely from the Address and Zip columns. Let's look at the features in more depth!

In [None]:
fig_dims = (12,8)
for i in df.columns:
    # One histogram per column, each drawn on its own figure
    fig, ax = plt.subplots(figsize = fig_dims)
    sns.histplot(x=i, data = df, ax = ax)

We can tell that there are outliers in some of the features - we might want to remove those outliers. Let's decide later on and look at the correlation of the features first.

# Correlation of features
Let's do a heatmap to see the correlation between these features. 

In [None]:
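# Correlate the numeric columns only; Address and Zip are still strings at this point
sns.heatmap(df.select_dtypes(include='number').corr())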

We can tell that there is a strong correlation between the price and the area of the house, and a slightly weaker (but still strong) correlation between price and rooms and between rooms and area.
Alright, let's start the data cleaning and feature engineering stages.

# Data Cleaning
Let's start off by checking whether there is any missing data.

In [None]:
df.info()

There are missing data points - mainly in the price column. I wouldn't fill in the missing prices, as price is the target we are predicting, so we will remove the rows with no price instead.

In [None]:
# Drop every row that contains a missing value
df = df.dropna()

In [None]:
df.info()

Great! We now have a complete dataset. However, if you remember, we were looking at the outliers and considering removing them. Let's take a look at the boxplot first before we decide on anything!

In [None]:
sns.boxplot(x='Price', data = df)

Oh wow - there are a lot of outliers in this dataset. I would remove most of these data points to have a more accurate regression, but just for fun, let's see how much of the dataset we're removing.

In [None]:
df.describe()

In [None]:
# Finding the maximum tolerance for the boxplot (upper whisker: Q3 + 1.5 * IQR)
q1 = df['Price'].quantile(0.25)
q3 = df['Price'].quantile(0.75)
iqr = q3 - q1
max_price = q3 + 1.5 * iqr

In [None]:
# We create an outliers dataset so that we can find the count of outliers
# (strictly above max_price, so that it matches the rows that get dropped later)
outliers = df[df['Price'] > max_price]

# Outlier and dataset count followed by percentage of dataset removed
outliers_count = outliers['Price'].count()
df_count = df['Price'].count()
print('Percentage removed: ' + str(round(outliers_count/df_count * 100, 2)) + '%')

To be honest, that's only a small chunk of the data being removed. Since linear regressions are rather sensitive to outliers, it is best that we remove them. That said, if we had more data on houses priced close to the outliers, we would be able to train the model better.
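The cut-off used here is the usual boxplot rule: anything above Q3 + 1.5 × IQR counts as an outlier. Here is a minimal sketch of that rule on a handful of made-up prices (the numbers and variable names are purely illustrative and are not taken from the dataset):

```python
import pandas as pd

# Made-up prices, for illustration only
prices = pd.Series([100, 200, 300, 400, 5000])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Upper whisker of the boxplot: values above this are treated as outliers
upper_fence = q3 + 1.5 * iqr

outliers = prices[prices > upper_fence]
share_removed = len(outliers) / len(prices) * 100
print(upper_fence, outliers.tolist(), share_removed)
```

With so few points a single extreme value makes up a large share of the data, which is why it is worth checking the percentage before dropping anything.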

In [None]:
# Replacing the old dataframe with the new one
df = df[df['Price'] <= max_price]

There we go!


# Feature Engineering
From the data analysis above, you can tell that there are only 5 features, so we definitely need to do some feature engineering to create more features, which is important for training the model. 

In [None]:
df['Zip No'] = df['Zip'].apply(lambda x:x.split()[0])
df['Letters'] = df['Zip'].apply(lambda x:x.split()[-1])

If we look at the Zip column, we realise that there are 4 digits in the front, and 2 letters at the back. We can split it such that we get 2 new features from the Zip column. 
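As a small aside, the same split can also be done in one vectorised step with pandas string methods. This is only a sketch on made-up zip codes in the same "4 digits + 2 letters" shape, not a replacement for the cell above:

```python
import pandas as pd

# Made-up zip codes in the "4 digits, space, 2 letters" format
zips = pd.Series(["1091 CR", "1059 EL", "1097 SM"])

# Split on whitespace and spread the pieces over two columns
parts = zips.str.split(expand=True)
parts.columns = ["Zip No", "Letters"]
print(parts)
```

Note that the digits are still strings after the split, just like in the `apply` version.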

In [None]:
df['Address']

We know that the back part of the address isn't important as it is just stating that the address is in Amsterdam. We replace the address column with a less redundant version instead :) 

In [None]:
df['Address'] = df['Address'].apply(lambda x:x.split(',')[0]) 

This is definitely not enough as the addresses are too varied. I decided that I will take the street of the address itself as a feature instead. 
However, the separation is more complicated than it seems, so I have created a function that allows me to extract the street name from the address itself. 

In [None]:
def word_separator(string):
    # Collect the leading purely alphabetic words and stop at the first
    # token that contains a digit or symbol (usually the house number)
    words = []
    for element in string.split():
        if element.isalpha():
            words.append(element)
        else:
            break
    return ' '.join(words)

In [None]:
df['Street'] = df['Address'].apply(word_separator)

In [None]:
df.head()

Looks great to me, don't you think?
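To get a feel for how this kind of street extraction behaves, here is a small standalone sketch of the same idea. The helper name `leading_words` and the example addresses are my own illustrations, not values checked against the dataset:

```python
def leading_words(address):
    """Return the leading alphabetic words of an address, stopping at the first other token."""
    words = []
    for token in address.split():
        if not token.isalpha():
            break
        words.append(token)
    return " ".join(words)


# Illustrative addresses only
examples = ["Blasiusstraat 8 2", "Kromme Leimuidenstraat 13 H", "2e Jan Steenstraat 5"]
results = [leading_words(address) for address in examples]
for address, street in zip(examples, results):
    print(repr(address), "->", repr(street))
```

The last example shows a limitation: a street that starts with a token like "2e" yields an empty string, because the loop stops at the very first non-alphabetic token.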


# Optional: Suggestions
The above features are ones that I thought of myself, but there are definitely other features you could consider that I believe would make the model more accurate!
- I personally did not think of using [price per square metre](https://www.kaggle.com/lennarthaupts/prediction-based-on-the-10-closest-neighbors), but Lennart did, which I thought was pretty impressive!
- I also did not think of [putting the districts into bins](https://www.kaggle.com/laetitiafrost/amsterdam-house-price-linreg-randomforest-knn), but Letitia did, which I thought was a really innovative idea as well.

# Further Feature Engineering and Data Processing
We split the features into numerical and categorical features so that we are able to convert the categorical features into numerical ones, before training the model with it.

In [None]:
numerical = ['Price', 'Area', 'Room', 'Lon', 'Lat']
categorical = ['Address', 'Zip No', 'Letters', 'Street']

There are a few encoders that I considered using:
- Label Encoding
- One Hot Encoding
- Ordinal Encoding

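One Hot Encoding would not be effective when a feature has many distinct categories (as Address and Street do here), since it creates one column per category. Ordinal Encoding is useful if you have to preserve some order in the categorical data, but not otherwise. Therefore, I went with Label Encoding here. Keep in mind that it assigns arbitrary integer codes, which tree-based models tend to tolerate better than linear models do.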
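Here is a rough sketch of the difference between the two approaches on a toy column. The street names are just illustrative values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column, for illustration only
streets = pd.Series(["Kalverstraat", "Damrak", "Kalverstraat", "Prinsengracht"])

# Label encoding: one integer per category (assigned in sorted order)
encoder = LabelEncoder()
codes = encoder.fit_transform(streets)
print(list(encoder.classes_), codes.tolist())

# One hot encoding: one column per category
one_hot = pd.get_dummies(streets)
print(one_hot.shape)
```

With only three categories the one hot version is still small, but it gains a column for every extra street, while the label-encoded version always stays a single column.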

In [None]:
from sklearn.preprocessing import LabelEncoder
for c in categorical:
    # Map each distinct category in the column to an integer code
    df[c] = LabelEncoder().fit_transform(df[c])

We drop the more obvious 'features' that we do not need, as they are either an index to the dataset or columns whose information has already been extracted into other features.

In [None]:
df.drop(['Zip', 'Unnamed: 0', 'Address'], axis =1, inplace = True)

Let's look at the correlation between our new features!

In [None]:
sns.heatmap(df.corr())

There is now a strong negative correlation between the Zip numbers and the Latitude feature!

# Train Test Split
We will split the dataset into a train dataset and a test dataset. We will then use cross-validation with negative mean squared error as the scoring metric. Finally, we flip the sign and take the square root to get the Root Mean Squared Error (RMSE), which is in the same units as the price and therefore easier to interpret.
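The conversion from negative MSE to RMSE can be sketched as follows. This uses a synthetic dataset from `make_regression` rather than the housing data, and the variable names are my own:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

# scikit-learn maximises scores, so the MSE comes back negated
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5,
                          scoring="neg_mean_squared_error")

# Square root of the mean MSE across folds (the approach used in this notebook)
rmse_pooled = np.sqrt(-neg_mse.mean())

# Alternative: RMSE of each fold, then averaged
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_pooled, rmse_per_fold.mean())
```

The two numbers are usually close, but the first is never smaller than the second, so it is worth sticking to one of them when comparing models.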

In [None]:
from sklearn.model_selection import train_test_split
X = df.drop('Price', axis =1)
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.4)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Linear Regression Cross Validation

In [None]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
predictions = linreg.predict(X_test)

In [None]:
cv = cross_val_score(linreg, X_train, y_train, cv=20, scoring = 'neg_mean_squared_error')
print(cv)
print(abs(cv.mean())**0.5)

# Lasso Regression Cross Validation
Lasso regression is a type of regression that only uses L1 regularisation.


In [None]:
from sklearn.linear_model import Lasso
lasso = Lasso()
lasso.fit(X_train, y_train)
predictions = lasso.predict(X_test)

In [None]:
cv = cross_val_score(lasso, X_train, y_train, cv=20, scoring = 'neg_mean_squared_error')
print(cv)
print(abs(cv.mean())**0.5)

# Elastic Net Regression Cross Validation
Elastic Net Regression is a type of regression that combines L1 and L2 regularisation, with the mix between the two set by a ratio.

In [None]:
from sklearn.linear_model import ElasticNet
elasticnet = ElasticNet()
elasticnet.fit(X_train, y_train)
predictions = elasticnet.predict(X_test)

In [None]:
cv = cross_val_score(elasticnet, X_train, y_train, cv=20, scoring = 'neg_mean_squared_error')
print(cv)
print(abs(cv.mean())**0.5)

# Ridge Regression Cross Validation
Ridge regression is a type of regression that only uses L2 regularisation.
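To make the difference between the L1 and L2 penalties a bit more concrete, here is a small sketch on synthetic data where only 3 of the 10 features actually matter. The dataset and the `alpha` values are illustrative assumptions, not settings tuned for the housing data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data: 10 features, only 3 of which influence the target
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=5.0, random_state=0)

models = {
    "Lasso (L1)": Lasso(alpha=1.0),
    "ElasticNet (L1 + L2)": ElasticNet(alpha=1.0, l1_ratio=0.5),
    "Ridge (L2)": Ridge(alpha=1.0),
}

# Count how many coefficients each model sets exactly to zero
zero_counts = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(name, zero_counts[name])
```

The L1 penalty tends to push the coefficients of unhelpful features all the way to zero, whereas the L2 penalty only shrinks them towards zero.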


In [None]:
from sklearn.linear_model import Ridge
ridge = Ridge()
ridge.fit(X_train, y_train)
predictions = ridge.predict(X_test)

In [None]:
cv = cross_val_score(ridge, X_train, y_train, cv=20, scoring = 'neg_mean_squared_error')
print(cv)
print(abs(cv.mean())**0.5)

# Random Forest Cross Validation
Since this is a regression task, we'll be using the Random Forest regressor rather than the classifier.

In [None]:
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor()
random_forest.fit(X_train, y_train)
predictions = random_forest.predict(X_test)

In [None]:
cv = cross_val_score(random_forest, X_train, y_train, cv=20, scoring = 'neg_mean_squared_error')
print(cv)
print(abs(cv.mean())**0.5)

# XGBoost Cross Validation


In [None]:
from xgboost import XGBRegressor
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
predictions = xgb.predict(X_test)

In [None]:
cv = cross_val_score(xgb, X_train, y_train, cv=20, scoring = 'neg_mean_squared_error')
print(cv)
print(abs(cv.mean())**0.5)

From the cross-validation results, we can deduce that the Random Forest Regression model is the best model for this dataset. XGBoost is also worth considering, since its RMSE differs by only approximately 2000. We will continue with hyperparameter tuning, using a random search first and then GridSearchCV.

# Parameter Tuning
- We first use RandomizedSearchCV to get a rough estimate of a good range for each parameter.
- After that, we use GridSearchCV to search more finely around those values before taking the best parameters for the model.
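The two-step idea can be sketched on a small synthetic dataset like this. The parameter ranges here are deliberately tiny so that it runs quickly, and all names and values are illustrative assumptions rather than the settings used on the housing data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Small synthetic dataset, for illustration only
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(random_state=0)

# Step 1: coarse random search over a wide range
coarse = {"n_estimators": [10, 20, 40],
          "max_depth": [2, 4, 8, None],
          "min_samples_leaf": [1, 2, 4]}
random_search = RandomizedSearchCV(forest, param_distributions=coarse, n_iter=5, cv=3,
                                   scoring="neg_mean_squared_error", random_state=0)
random_search.fit(X_demo, y_demo)
best = random_search.best_params_

# Step 2: fine grid centred on the best values from step 1
fine = {"n_estimators": [max(5, best["n_estimators"] - 5),
                         best["n_estimators"],
                         best["n_estimators"] + 5],
        "max_depth": [best["max_depth"]],
        "min_samples_leaf": sorted({max(1, best["min_samples_leaf"] - 1),
                                    best["min_samples_leaf"],
                                    best["min_samples_leaf"] + 1})}
grid_search_demo = GridSearchCV(forest, param_grid=fine, cv=3,
                                scoring="neg_mean_squared_error")
grid_search_demo.fit(X_demo, y_demo)

tuned_rmse = (-grid_search_demo.best_score_) ** 0.5
print(best, grid_search_demo.best_params_, tuned_rmse)
```

Because the fine grid contains the best combination from the random search, the grid search can only match or improve on it here. The cost grows with every extra value in the grid, so it pays to keep the second step narrow.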

# RandomizedSearchCV Tuning 
We will start with a RandomizedSearchCV so that we are able to get a general direction for the parameters.

In [None]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

random_grid = {'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
 'max_features': [1.0, 'sqrt'],  # 1.0 = use all features ('auto' meant this for regressors and was removed in scikit-learn 1.3)
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

random_cv = RandomizedSearchCV(estimator = random_forest, param_distributions = random_grid, n_iter = 100, cv = 10, verbose = 2, n_jobs = -1)
random_cv.fit(X_train, y_train)

In [None]:
random_cv.best_params_ 

Great! We've got a slight idea of what parameters would optimize the model itself.

# GridSearchCV 
Now, we will use GridSearchCV to get a more specific set of parameters for the model.

In [None]:
param_grid = {'bootstrap': [True, False],
'max_depth': [60,65,70,75,80],
'min_samples_leaf':[1,2,3],
'min_samples_split': [2,3,4],  # an integer min_samples_split must be at least 2
'n_estimators': [1750,1760,1770,1780,1790,1800,1810,1820,1830,1840,1850]}
grid_search = GridSearchCV(estimator = random_forest, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)

grid_search.fit(X_train,y_train)

In [None]:
grid_search.best_params_

And there we have it! The best parameters for the model itself. 

# Final Model
We've come to the final step of the notebook. We will plug the best parameters into the model and train it with the training data. We will then use cross validation again to get an RMSE value and see whether the model has improved.

In [None]:
tuned_random_forest = RandomForestRegressor(n_estimators = 1750, max_depth = 80, min_samples_leaf = 1, min_samples_split = 2)
# Fit and predict with the tuned model (not the untuned random_forest)
tuned_random_forest.fit(X_train, y_train)
predictions = tuned_random_forest.predict(X_test)

In [None]:
cv = cross_val_score(tuned_random_forest, X_train, y_train, cv=20, scoring = 'neg_mean_squared_error')
print("The Random Forest Regressor with tuned parameters has a RMSE of: " + str(abs(cv.mean())**0.5))

Seems like we've got a slight improvement! We could optimise the model even further, but given how long GridSearchCV took, it might not be worth it.

# The End
Thank you so much for reading my notebook! I appreciate it :) 
