<h1 align="center">Machine Learning regressions to predict Buenos Aires City housing prices</h1>  

<font color="green">Upvotes and suggestions are highly appreciated :)</font>  
***Note: sections 4 and 5 are still under development; any advice on them is much appreciated.***

***Table of contents***  
1. [Exploratory data analysis](#s1)  
    1. [A first look at the columns, column selection by relevance](#s1p1) 
    
        * [Temporal columns](#tcols)
        * [Geospatial columns](#gcols)
        * [Other columns](#ocols)
        
    2. [Feature plots and distributions](#s1p2)  
    
        * [Publication density map](#pdm)  
        * [Median prices diagram](#mpd)  
        * [Price distribution](#pd)  
        * [<font color="red">(to-do)Boxplots (useful for imputation criteria)</font>](#b)
    
2. [Feature engineering](#s2)   

    * [A second column selection](#ascs)  
    * [Core training data](#ctd)   
    * [<font color="green">First model training and results</font>](#fmt)
    * [Feature creation](#fc)
    * [Feature importance measurement](#fem)
    * [<font color="red">Detailed Spatial Clustering</font>](#dsc)
    * [Imputation](#idt)
    
    
3. [Model selection](#s3)
    * [XGBoost](#xgb)
    * [Random Forest Regressor](#rfr)
    * [LGBMRegressor](#lgbm)
    * [CatBoostRegressor](#cbr)  
    * [Model saving](#ms)

4. [Model optimization](#mo)

5. [Applications and conclusions](#s4)
    * [Predictions on real-world properties](#prwp)
    * [Conclusions](#conclusions)


In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

###### This notebook is part of a wider project that aims to build property price predictors. The previous notebook tackled this within <a href="https://www.kaggle.com/msorondo/property-price-predictions-great-buenos-aires-n">northern Greater Buenos Aires</a>. This notebook builds on that work, reusing some of its insights to filter and select the features and models. Still, it introduces many modifications in order to improve the performance of the trained models. It will also be used to refine the previous one.


## 1. Exploratory Data Analysis (EDA)<a id="s1"></a>

In [None]:
pd.plotting.register_matplotlib_converters()
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

We'll first take a look at the dataset to get an idea of the data in it...

In [None]:
df_crude = pd.read_csv("../input/argentina-venta-de-propiedades/ar_properties_crude.csv", index_col="id")

In [None]:
df_crude.head(5)

Keep only Buenos Aires City properties priced in US dollars (the Argentine peso, ARS, has been heavily devalued, so peso prices are hard to compare).

In [None]:
#CABA = Ciudad Autónoma de Buenos Aires = Buenos Aires City
df_CABA_dolar = df_crude[(df_crude["l1"]=="Argentina") & (df_crude["l2"]=="Capital Federal") & (df_crude["currency"]=="USD")]
df_CABA_dolar

### 1.1. A first look at the columns, column selection by relevance<a id="s1p1"></a>

In [None]:
df_CABA_dolar.columns

In [None]:
df_CABA_dolar["ad_type"].unique()

We'll drop this one, as it carries no relevant information.

In [None]:
df_CABA_dolar = df_CABA_dolar.drop(columns=["ad_type"])

<a id="tcols"></a>
### Temporal columns
Start date, end date and creation date carry some information, but they are poor predictors of price: they depend on when the owner chose to publish, which is unlikely to be meaningfully correlated with the price. The same goes for the price period.

In [None]:
df_CABA_dolar = df_CABA_dolar.drop(columns=["start_date", "end_date", "created_on", "price_period"])

### Geospatial columns <a id="gcols"></a>

We'll also drop "l1", "l2" and "currency" (they were already used above to filter by country, city and currency).

In [None]:
df_CABA_dolar = df_CABA_dolar.drop(columns=["l1", "l2" ,"currency"])
missing_percentage = df_CABA_dolar.isnull().sum()*100/len(df_CABA_dolar.index)
missing_percentage

"l3" gives the neighborhood. "l4", "l5" and "l6" give even finer detail, but almost all of their values are missing, and we still have "lat" and "lon", which locate each property with high precision and could help improve the accuracy of the predictions. We'll keep "lat", "lon" and "l3". 

In [None]:
df_CABA_dolar = df_CABA_dolar.drop(columns=["l4","l5","l6"])
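
The drop above follows from the share of missing values per column. As a rough sketch of that criterion, columns can be flagged automatically once the share passes a threshold. The toy frame `demo` and the 0.7 threshold are assumptions for illustration only, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking the location columns.
demo = pd.DataFrame({
    "l3": ["Palermo", "Boedo", None, "Recoleta"],
    "l4": [None, None, None, "Palermo Soho"],
    "l5": [None, None, None, None],
    "lat": [-34.58, -34.63, np.nan, -34.59],
})

# Fraction of missing values per column (0.0 = complete, 1.0 = empty).
missing_share = demo.isnull().mean()

# Assumed cut-off: flag columns with more than 70% missing values.
threshold = 0.7
mostly_missing = missing_share[missing_share > threshold].index.tolist()
print(mostly_missing)  # ['l4', 'l5']
```

A fixed threshold is only a starting point: a sparse column can still be worth keeping when no other column carries the same information.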

<a id="ocols"></a>
### Other columns
Even though we could use NLP techniques to analyze the "title" and "description" columns, this would greatly increase the length and complexity of this notebook for probably little benefit, so we'll drop them. Rent operations will be excluded as well.

In [None]:
df_CABA_dolar["operation_type"].value_counts()

In [None]:
df_CABA_dolar = df_CABA_dolar[df_CABA_dolar["operation_type"]=="Venta"]
df_CABA_dolar = df_CABA_dolar.drop(columns=["title", "description","operation_type"])
df_CABA_dolar

<a id="s1p2"></a>
## 1.2. Plots and distributions

In [None]:
import folium
from folium import Marker
from folium.plugins import HeatMap

<a id="pdm"></a>
### Publication density map (excepting publications with missing geolocation)

In [None]:
# The keyword is `tiles` and the tileset name is "cartodbpositron"
map_2 = folium.Map(width = 700, height = 500, location=[-34.586662, -58.436620], tiles="cartodbpositron", zoom_start=12)
df_CABA_dolar_noLatNorLonMissing = df_CABA_dolar[df_CABA_dolar["lat"].notnull() & df_CABA_dolar["lon"].notnull()]
HeatMap(data=df_CABA_dolar_noLatNorLonMissing[["lat","lon"]], radius=12).add_to(map_2)

In [None]:
map_2

The published properties look quite well distributed across the city. Let's look at the value counts of "l3"...

In [None]:
df_CABA_dolar["l3"].value_counts().head(8)

<a id="mpd"></a>
### Median prices diagram 

In [None]:
# axis=0 is the default (and the keyword is deprecated in recent pandas), so it is omitted
df_CABA_dolar.groupby(by=["l3"])["price"].median().sort_values(ascending=False).head(7)

In [None]:
df_CABA_dolar.groupby(by=["l3"])["price"].median().sort_values(ascending=False).tail(7)

<a id="pd"></a>
### Price distribution

In [None]:

plt.figure(figsize=(10,7))
plt.ticklabel_format(style='plain', axis='x')
sns.distplot(df_CABA_dolar["price"])

plt.ylim(0,10**-7)
plt.xlim(0,4000000)

In [None]:
df_CABA_dolar.describe()


Property type histogram

In [None]:
plt.figure(figsize=(15,10))
plt.hist(x=df_CABA_dolar["property_type"])

As expected, most of the properties are apartments.

<a id="s2"></a>
# 2. Feature engineering

Let's start by giving the DataFrame a shorter name.

In [None]:
properties = df_CABA_dolar
current_missing_percentages = (properties.isnull().sum()/properties.shape[0]).sort_values(ascending=False)
current_missing_percentages

In [None]:
print(properties[properties["surface_covered"].isnull() & properties["surface_total"].isnull()].shape[0])
properties[properties["surface_covered"].isnull() & properties["surface_total"].isnull()].shape[0]/properties.shape[0]

<a id="ascs"></a>
## A second column selection

Some features add important information for prediction but still have a large percentage of missing values. We'll examine their correlation with price to decide how to deal with them.

In [None]:
def columns_correlation_with_target(df,target):
    for feature in df.select_dtypes(exclude=["object"]).columns:
        if feature!=target:
            print("Correlation between ", feature, " and ", target, ": ", df[target].corr(df[feature]))

In [None]:
columns_correlation_with_target(properties,"price")

Lat and lon have not been clustered yet, so it is not a problem that they show little linear correlation with price.  
Rooms, bedrooms and bathrooms are all significantly correlated with price, bedrooms the least of the three.  
The bedrooms column will be dropped: it does correlate with price, but it has a huge share of missing values and is not worth imputing (mainly because "bedrooms" isn't commonly used as a reference in Argentina, and rooms and bathrooms already give substantial information).

In [None]:
properties = properties.drop(columns=["bedrooms"])

In [None]:
properties

<a id="ctd"></a>
## Core training data
Before dealing with missing values and creating features, we'll start by testing how a basic, lightly processed DataFrame performs in a model. It will include all of the remaining features except for lat and lon, and we'll remove rows with missing values.
Then we'll compare this approach with one that imputes the missing values.

In [None]:
core_properties = properties.drop(columns=["lat","lon"])
core_properties = core_properties.dropna(axis=0)
core_properties.shape

Note that it is still a huge dataset.  
<a id="fmt"></a>
## First model training + cross validation

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
X_core = core_properties.drop(columns=["price"])
y_core = core_properties["price"]


In [None]:
X_core_train, X_core_valid, y_core_train, y_core_valid = train_test_split(X_core,y_core)

OHE4 = OneHotEncoder(handle_unknown="ignore",sparse=False)

obj_cols = [col for col in X_core_valid.columns if X_core_valid[col].dtype=="object"]

OHE4_cat_train = pd.DataFrame(OHE4.fit_transform(X_core_train[obj_cols]))
OHE4_cat_valid = pd.DataFrame(OHE4.transform(X_core_valid[obj_cols]))

OHE4_cat_train.index = X_core_train.index
OHE4_cat_valid.index = X_core_valid.index

num_cols = [col for col in X_core_valid.columns if X_core_valid[col].dtype=="float64"]
num_cols
encoded_X_core_train = pd.concat([OHE4_cat_train,X_core_train[num_cols]],axis=1)
encoded_X_core_valid = pd.concat([OHE4_cat_valid, X_core_valid[num_cols]], axis=1)



In [None]:
RFR2 = RandomForestRegressor(random_state=4).fit(encoded_X_core_train,y_core_train)
RFR2_preds = RFR2.predict(encoded_X_core_valid)
RFR2_MAE = mean_absolute_error(RFR2_preds,y_core_valid)
RFR2_MAE

Not bad for a first attempt.
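
The imports above include `Pipeline`, `ColumnTransformer` and `cross_val_score`, which the manual encoding does not use. Here is a minimal sketch of how they could fit together, run on synthetic data: the `demo` frame, its prices and the number of folds are all made up for illustration, and the real `core_properties` is not touched.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for core_properties (hypothetical values, not the real data).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "l3": rng.choice(["Palermo", "Recoleta", "Boedo"], size=n),
    "property_type": rng.choice(["Departamento", "Casa"], size=n),
    "rooms": rng.integers(1, 6, size=n).astype(float),
    "surface_total": rng.uniform(30, 200, size=n),
})
demo["price"] = (2000 * demo["surface_total"] + 10000 * demo["rooms"]
                 + rng.normal(0, 5000, size=n))

X_demo = demo.drop(columns=["price"])
y_demo = demo["price"]
cat_cols = ["l3", "property_type"]

# One-hot encode the categorical columns, pass the numeric ones through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=50, random_state=4)),
])

# scikit-learn maximizes scores, so MAE is exposed as a negative value; flip the sign.
scores = -cross_val_score(pipeline, X_demo, y_demo, cv=5,
                          scoring="neg_mean_absolute_error")
print("MAE per fold:", scores.round(0))
print("Mean MAE:", scores.mean().round(0))
```

Because the encoder lives inside the pipeline, it is refitted on the training part of every fold, and the averaged MAE is less sensitive to one lucky or unlucky partition than a single `train_test_split`.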

<a id="fc"></a>
## Feature Creation

We'll create and test new features to improve the performance...

In [None]:
bath_rooms_ratio = core_properties["bathrooms"]/core_properties["rooms"]
surf_covered_by_total= core_properties["surface_covered"]/core_properties["surface_total"]
l3_type = core_properties["l3"]+core_properties["property_type"]

newFeatureTester function

In [None]:
def newFeatureTester(df, new_column):
    # Build the features from df itself, so the global X_core is not modified between calls
    X = df.drop(columns=["price"]).copy()
    y = df["price"]
    X["new_feature"] = new_column
    
    X_train, X_test, y_train, y_test = train_test_split(X,y)
    
    objec_cols = [col for col in X_test.columns if X_test[col].dtype=="object"]
    
    OHE = OneHotEncoder(handle_unknown="ignore", sparse=False)
    OHEncoded_cats_train = pd.DataFrame(OHE.fit_transform(X_train[objec_cols]))
    OHEncoded_cats_test = pd.DataFrame(OHE.transform(X_test[objec_cols]))
    
    # OneHotEncoder drops the index, put it back so the concat aligns rows
    OHEncoded_cats_train.index = X_train.index
    OHEncoded_cats_test.index = X_test.index
    
    numericals_train = X_train.select_dtypes(exclude=["object"])
    numericals_test = X_test.select_dtypes(exclude=["object"])
    
    OHEncoded_train = pd.concat([OHEncoded_cats_train,numericals_train], axis=1)
    OHEncoded_test = pd.concat([OHEncoded_cats_test, numericals_test], axis=1) 
    
    model = RandomForestRegressor(random_state=3).fit(OHEncoded_train,y_train)
    preds = model.predict(OHEncoded_test)
    
    mae = mean_absolute_error(y_test,preds)
    
    print("MAE with ", new_column.name,": " , mae)
    
    # MAE relative to the mean price of the whole set
    mae_avg_price = mae/(y.mean())
    
    print("MAE/AVG PRICE: ", new_column.name,": ",  mae_avg_price)
    
    return [mae,mae_avg_price]
    

With this function, we'll try to measure separately the impact of each new feature on the model's predictions.
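
A single random split gives a noisy estimate, so a small change in MAE may come from the split rather than from the feature. One way to reduce that noise, sketched below, is to score the baseline and the augmented feature set on the same rows. The helper `paired_feature_test` and the synthetic data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def paired_feature_test(X, y, new_column, seed=0):
    """Return (baseline MAE, MAE with the new feature) on one shared split."""
    # Split the index labels once so both models see exactly the same rows.
    idx_train, idx_test = train_test_split(X.index.to_numpy(), random_state=seed)
    results = []
    for frame in (X, X.assign(new_feature=new_column)):
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(frame.loc[idx_train], y.loc[idx_train])
        preds = model.predict(frame.loc[idx_test])
        results.append(mean_absolute_error(y.loc[idx_test], preds))
    return tuple(results)


# Synthetic numeric data (made up) where the ratio feature drives the price.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "rooms": rng.integers(1, 6, 300).astype(float),
    "bathrooms": rng.integers(1, 4, 300).astype(float),
})
ratio = X_demo["bathrooms"] / X_demo["rooms"]
y_demo = 50000 * ratio + rng.normal(0, 1000, 300)

baseline_mae, new_mae = paired_feature_test(X_demo, y_demo, ratio)
print("Baseline MAE:", baseline_mae)
print("MAE with new feature:", new_mae)
```

Repeating the comparison over a few seeds, or over cross-validation folds, would show whether a difference is stable.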

In [None]:
res = pd.DataFrame({"bath_rooms_ratio" : newFeatureTester(core_properties,bath_rooms_ratio),
"surf_covered_by_total" : newFeatureTester(core_properties,surf_covered_by_total),
"l3_type":newFeatureTester(core_properties,l3_type)},index=["MAE","MAE/AVG Price"])

In [None]:
res

There don't seem to be substantial improvements (if any). Let's put them all into one training set and measure feature importance to choose with more confidence.
<a id="fem"></a>
## Feature importance

In [None]:
dict_new_features = {"bath_rooms_ratio":bath_rooms_ratio, "surf_covered_by_total":surf_covered_by_total, "l3_type":l3_type}
df_new_features = pd.DataFrame(dict_new_features)
df_new_features

X_core_plus_new = pd.concat([X_core,df_new_features], axis=1)
X_core_plus_new.isnull().sum()

One-hot encode the categorical columns...

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_core_plus_new,y_core)
object_cols = [col for col in X_train.columns if X_train[col].dtype=="object"]

OHE2 = OneHotEncoder(handle_unknown="ignore", sparse=False)

labeled_obj_cols_train = pd.DataFrame(OHE2.fit_transform(X_train[object_cols]))
labeled_obj_cols_test = pd.DataFrame(OHE2.transform(X_test[object_cols]))


In [None]:
#OneHotEncoder removed indexes, put them back...
labeled_obj_cols_train.index = X_train.index
labeled_obj_cols_test.index= X_test.index

numeric_X_train = X_train.select_dtypes(exclude=["object"])
numeric_X_test = X_test.select_dtypes(exclude=["object"])

In [None]:
labeled_X_train  =  pd.concat([labeled_obj_cols_train,numeric_X_train], axis=1)
labeled_X_test = pd.concat([labeled_obj_cols_test,numeric_X_test], axis=1)

In [None]:
RFReg = RandomForestRegressor(random_state=7).fit(labeled_X_train,y_train)

permutator = PermutationImportance(RFReg,random_state=2).fit(labeled_X_test,y_test)

In [None]:
permutator.estimator

In [None]:
colnames_labeled = labeled_X_train.columns.tolist()
colnames_labeled_all_as_strings = [str(name) for name in colnames_labeled]

In [None]:
eli5.show_weights(permutator, feature_names=colnames_labeled_all_as_strings, top=len(colnames_labeled_all_as_strings))

In [None]:
preds = permutator.predict(labeled_X_test)
mae = mean_absolute_error(preds,y_test)
# Multiply by 100 so the printed value really is a percentage
print("Mean absolute error: ",mae, ". Error relative to mean price: ", 100*mae/y_test.mean(), "%.")

The error seems to be considerably lower (~USD 5K less) when all of the new features are combined.  
Some of the columns produced by the one-hot encoding hurt the prediction, but taken together they improve it quite a bit. The only feature that doesn't seem worth using is "surf_covered_by_total".
We'll update the DataFrame to remove this feature.

In [None]:
X_core_plus_new = X_core_plus_new.drop(columns=["surf_covered_by_total"])
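
The ranking above relies on eli5. As an alternative sketch, scikit-learn ships its own `permutation_importance` in `sklearn.inspection`, which can score directly in MAE terms. The data below is synthetic and the column names are only placeholders; the `noise` column is deliberately unrelated to the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (made up): price depends on surface only.
rng = np.random.default_rng(2)
X_demo = pd.DataFrame({
    "surface_total": rng.uniform(30, 200, 400),
    "rooms": rng.integers(1, 6, 400).astype(float),
    "noise": rng.normal(size=400),
})
y_demo = 2000 * X_demo["surface_total"] + rng.normal(0, 5000, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=7).fit(X_tr, y_tr)

# Shuffle each column of the held-out set and measure how much the MAE worsens.
result = permutation_importance(
    model, X_te, y_te,
    scoring="neg_mean_absolute_error", n_repeats=5, random_state=2,
)
importances = pd.Series(result.importances_mean, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With this scoring, each importance reads as the average increase in MAE (in price units) caused by shuffling that column, so values near zero or below suggest the feature adds little.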

<a id="dsc"></a>
## (for later stages of development) Detailed Spatial Clustering
We'll use the DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) to find, without supervision, fine-grained clusters that relate the numeric geospatial columns to price. I chose it over the alternatives because it can detect multiple clusters in high-density maps without fixing the number of clusters in advance (as K-Means requires).
Another advantage is that it is not limited to two dimensions, which lets us go further and add price as an extra component, provided the features are scaled to comparable ranges. Note that density-based distances become less informative as the number of dimensions grows, so the feature set should stay small.
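
As a minimal sketch of the clustering idea on coordinates alone, scikit-learn's `DBSCAN` can work with the haversine metric on latitude and longitude expressed in radians. The points, the 1 km radius and `min_samples=5` are assumptions chosen for illustration; the neighborhood names only label the made-up point clouds.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)

# Two made-up clouds of listings (lat, lon in degrees) plus one isolated point.
palermo = rng.normal([-34.58, -58.43], 0.002, size=(50, 2))
caballito = rng.normal([-34.62, -58.44], 0.002, size=(50, 2))
isolated = np.array([[-34.70, -58.50]])
coords_deg = np.vstack([palermo, caballito, isolated])

# Haversine distances are angles in radians, so the radius is divided by Earth's radius.
earth_radius_km = 6371.0
eps_km = 1.0  # assumed neighborhood radius
clusterer = DBSCAN(eps=eps_km / earth_radius_km, min_samples=5,
                   metric="haversine", algorithm="ball_tree")
labels = clusterer.fit_predict(np.radians(coords_deg))

# Label -1 marks noise, i.e. points that belong to no dense region.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Label of the isolated point:", labels[-1])
```

The resulting labels could then be used as a categorical feature, in the same way as "l3". Adding price as a third dimension would require a different metric and scaled features, which is left for the later stage mentioned above.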

<a id="idt"></a>
## Imputation

### On numerical columns only

We'll compare the following model with the one trained on the core DataFrame...

In [None]:
from sklearn.impute import SimpleImputer
properties_with_missing = properties.drop(columns=["lat","lon"])
properties_with_missing_numericals = properties_with_missing.dropna(subset=["l3","property_type"])
properties_with_missing_numericals

This way we almost double the number of properties available to the model.

In [None]:
X = properties_with_missing_numericals.drop(columns=["price"])
y=properties_with_missing_numericals["price"]
X_train, X_test, y_train, y_test = train_test_split(X,y)
t_test = y_test  # alias: the evaluation cell below refers to the test target as t_test

In [None]:
obj_cols = [col for col in X_test.columns if X_test[col].dtype=="object"]
OHE3 = OneHotEncoder(handle_unknown="ignore",sparse=False)

OHE3_cat_X_train = pd.DataFrame(OHE3.fit_transform(X_train[obj_cols]))
OHE3_cat_X_test = pd.DataFrame(OHE3.transform(X_test[obj_cols]))
#One Hot Encoder lost indexes, put them back...
OHE3_cat_X_train.index = X_train.index
OHE3_cat_X_test.index = X_test.index

numerical_X_train = X_train.select_dtypes(exclude=["object"])
numerical_X_test = X_test.select_dtypes(exclude=["object"])

OHE3_X_train = pd.concat([OHE3_cat_X_train,numerical_X_train], axis=1)
OHE3_X_test = pd.concat([OHE3_cat_X_test,numerical_X_test],axis=1)

Given the vast number of outliers across all columns, we'll impute with the median.
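
To see why the median is the safer choice here, consider a tiny made-up column with one extreme value: the mean is dragged towards the outlier, while the median stays with the bulk of the data. The numbers below are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical surfaces in square meters: one outlier and one missing value.
surface = np.array([[40.0], [55.0], [60.0], [70.0], [5000.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(surface)
median_filled = SimpleImputer(strategy="median").fit_transform(surface)

print("Imputed with mean:  ", mean_filled[-1, 0])    # 1045.0
print("Imputed with median:", median_filled[-1, 0])  # 60.0
```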

In [None]:
imputer = SimpleImputer(strategy="median")

imputed_X_train = pd.DataFrame(imputer.fit_transform(OHE3_X_train))
imputed_X_test = pd.DataFrame(imputer.transform(OHE3_X_test))
#imputer removed column names, put them back    
imputed_X_train.columns = OHE3_X_train.columns
imputed_X_test.columns = OHE3_X_test.columns

In [None]:
RFR_new = RandomForestRegressor(random_state = 1).fit(imputed_X_train,y_train)

In [None]:
predictions_RFR_new = RFR_new.predict(imputed_X_test)
mae_RFR_new = mean_absolute_error(predictions_RFR_new,t_test)
mae_RFR_new

In [None]:
from joblib import dump

In [None]:
dump(RFR_new,"imputed_RFR.joblib")

The imputation increased the error, so we'll avoid it.

<a id="s3"></a>
# 3. Model selection

I'll first train a gradient boosting regressor, compare it with the untuned random forest, and then select the best performer for further optimization.

<a id="xgb"></a>
## XGBoost Regressor

In [None]:
"""X_core_train, X_core_valid, y_core_train, y_core_valid = train_test_split(labeled_X_train,y_train)
OHE4 = OneHotEncoder(handle_unknown="ignore",sparse=False)

OHE4_cat_train = pd.DataFrame(OHE4.fit_transform(X_core_train[obj_cols]))
OHE4_cat_valid = pd.DataFrame(OHE4.transform(X_core_valid[obj_cols]))

OHE4_cat_train.index = X_core_train.index
OHE4_cat_valid.index = X_core_valid.index

num_cols = [col for col in X_core_valid.columns if X_core_valid[col].dtype=="float64"]
num_cols
encoded_X_core_train = pd.concat([OHE4_cat_train,X_core_train[num_cols]],axis=1)
encoded_X_core_valid = pd.concat([OHE4_cat_valid, X_core_valid[num_cols]], axis=1)"""

In [None]:
from xgboost import XGBRegressor
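# eval_set must be a list of (X, y) tuples and belongs in fit(), not in the constructor.
# early_stopping_rounds is a constructor argument in xgboost >= 1.6; older versions take it in fit().
XGBR2 = XGBRegressor(random_state=6,n_estimators=900,early_stopping_rounds=10)
XGBR2.fit(encoded_X_core_train,y_core_train,
          eval_set=[(encoded_X_core_valid,y_core_valid)],verbose=False)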

In [None]:
XGBR2_preds = XGBR2.predict(encoded_X_core_valid)
XGBR2_MAE = mean_absolute_error(y_core_valid,XGBR2_preds)
XGBR2_MAE

In [None]:
dump(XGBR2,"XGBR2.joblib")

<a id="rfr"></a>
## Random Forest Regressor

In [None]:
encoded_X_core_train

In [None]:
labeled_X_train

Training WITHOUT the encoded "l3_type" column

In [None]:
RFR2 = RandomForestRegressor(random_state=4).fit(encoded_X_core_train,y_core_train)
RFR2_preds = RFR2.predict(encoded_X_core_valid)
RFR2_MAE = mean_absolute_error(RFR2_preds,y_core_valid)
RFR2_MAE

Training WITH the encoded "l3_type" column

In [None]:
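# labeled_X_train/labeled_X_test come from the feature-importance split, so take the matching
# targets from y_core by index (assumes unique ids) instead of reusing y_core_train/y_core_valid.
RFR3 = RandomForestRegressor(random_state=4).fit(labeled_X_train,y_core.loc[labeled_X_train.index])
RFR3_preds = RFR3.predict(labeled_X_test)
RFR3_MAE = mean_absolute_error(y_core.loc[labeled_X_test.index],RFR3_preds)
RFR3_MAE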

The difference isn't worth the extra training time.

In [None]:
from joblib import dump, load
dump(RFR2, 'baseline_random_forest.joblib')#37.6k

<a id="lgbm"></a>
## LGBMRegressor

In [None]:
for col in X_train.columns:
    if X_train[col].dtype=="object":
        X_train[col] = X_train[col].astype('category')
        X_test[col] = X_test[col].astype('category')

In [None]:
from lightgbm import LGBMRegressor  # missing import
LGBM = LGBMRegressor(random_state=12).fit(X_train,y_train)
preds_LGBM = LGBM.predict(X_test)
mean_absolute_error(preds_LGBM,y_test)

Let's try with one-hot encoded data...

In [None]:
LGBM2 = LGBMRegressor(random_state=12).fit(labeled_X_train,y_train)
preds_LGBM2 = LGBM2.predict(labeled_X_test)
mean_absolute_error(preds_LGBM2,y_test)

<a id="cbr"></a>
## CatBoostRegressor

In [None]:
from catboost import CatBoostRegressor

CBR = CatBoostRegressor(random_state=9,cat_features=["l3",'l3_type', 'property_type']).fit(X_train,y_train)
preds = CBR.predict(X_test)
MAE = mean_absolute_error(preds,y_test)

In [None]:
MAE

One hot encoded version...

In [None]:
CBR2 = CatBoostRegressor(random_state=9).fit(labeled_X_train,y_train)
preds = CBR2.predict(labeled_X_test)
MAE = mean_absolute_error(preds,y_test)

In [None]:
MAE

All of the gradient boosting models were outperformed by the random forest regressor, yet XGBoost came quite close and has more room for tuning. We'll optimize both.

<a id="ms"></a>
## Model saving

In [None]:
dump(RFR2, 'baseline_random_forest.joblib')#37.6k
dump(RFR3, 'l3_types_random_forest.joblib')#37.6k

In [None]:
dump(XGBR2,"XGBR2.joblib")#40k

In [None]:
dump(RFR_new,"imputed_RFR.joblib")#70k

<a id="mo"></a>
# 4. Model Optimization

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
parameters_forsearch = {
    "n_estimators" : [100,250,500,750]
}
search = GridSearchCV(RandomForestRegressor(),parameters_forsearch,cv=2)
search.fit(encoded_X_core_train,y_core_train)

In [None]:
print(search.best_params_)

We'll keep the RFR2 model for now, since it already uses the best n_estimators found by the search.
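
By default `GridSearchCV` ranks regressors by R², whereas the rest of this notebook compares models by MAE. A possible sketch of a search aligned with that metric is shown below, on synthetic data. The extra `max_depth` grid and the fold count are assumptions added for illustration, not settings tested on the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (made up).
rng = np.random.default_rng(5)
X_demo = rng.uniform(0, 1, size=(150, 3))
y_demo = 100 * X_demo[:, 0] + rng.normal(0, 1, 150)

search_demo = GridSearchCV(
    RandomForestRegressor(random_state=4),   # fixed seed so runs are comparable
    {"n_estimators": [10, 30], "max_depth": [3, None]},
    scoring="neg_mean_absolute_error",       # select by MAE instead of the default R^2
    cv=3,
)
search_demo.fit(X_demo, y_demo)

print("Best parameters:", search_demo.best_params_)
print("Best cross-validated MAE:", -search_demo.best_score_)
```

Fixing `random_state` in the estimator keeps the comparison between parameter values from being blurred by the forest's own randomness.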

<a id="s4"></a>
# 5. Applications and conclusions.

<a id="prwp"></a>
## (Applications) Predictions on real-world properties


In [None]:
current_model = load("baseline_random_forest.joblib")
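
The saved random forest expects the same one-hot encoded columns it was trained on, so a new property has to pass through the fitted encoder first. One way to guarantee that, sketched here with made-up listings, is to save the encoder and the model together as a single pipeline. The file name `demo_pipeline.joblib` and all values are hypothetical.

```python
import os
import tempfile

import pandas as pd
from joblib import dump, load
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up training set.
train = pd.DataFrame({
    "l3": ["Palermo", "Boedo", "Palermo", "Recoleta", "Boedo", "Recoleta"],
    "rooms": [2.0, 3.0, 4.0, 2.0, 1.0, 3.0],
    "surface_total": [50.0, 70.0, 120.0, 45.0, 30.0, 80.0],
    "price": [150000.0, 120000.0, 380000.0, 160000.0, 60000.0, 250000.0],
})

pipeline = Pipeline([
    ("preprocess", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["l3"])],
        remainder="passthrough")),
    ("model", RandomForestRegressor(n_estimators=20, random_state=4)),
])
pipeline.fit(train.drop(columns=["price"]), train["price"])

# Persist encoder and model together, then reload them as one object.
path = os.path.join(tempfile.mkdtemp(), "demo_pipeline.joblib")
dump(pipeline, path)
restored = load(path)

# A new listing is described with the raw columns; encoding happens inside the pipeline.
new_listing = pd.DataFrame({"l3": ["Palermo"], "rooms": [3.0], "surface_total": [75.0]})
predicted_price = restored.predict(new_listing)[0]
print("Predicted price (USD):", predicted_price)
```

With `handle_unknown="ignore"`, a neighborhood that never appeared in training is encoded as all zeros instead of raising an error.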

<a id="conclusions"></a>
# Conclusions