#### Author: Victor Diallen Andrade do Amaral

# Table of Contents
* [1. Introduction](#section1)
* [2. Importing Libraries](#section2)
* [3. Loading Datasets](#section3)
* [4. Data Analysis](#section4)
* [5. Data Visualization](#section5)
* [6. Machine Learning](#section6)
* [7. Conclusion](#section7)

<a id="section1"></a>
# Introduction

## Kaggle Dataset Link

https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland?select=apartments_pl_2023_10.csv

## About Dataset

The dataset contains apartment offers from the 15 largest cities in Poland (including Warsaw, Lodz, Krakow, Wroclaw, Poznan, Gdansk, Szczecin, Bydgoszcz, Lublin, Katowice, Bialystok, and Czestochowa). The data comes from local websites listing apartments for sale. To better capture the neighborhood of each apartment, each offer was extended with OpenStreetMap data giving distances to points of interest (POI). The data is collected monthly; this notebook uses the August, September, and October 2023 files.

## Variables Description

- **city** - the name of the city where the property is located
- **type** - type of the building
- **squareMeters** - the size of the apartment in square meters
- **rooms** - number of rooms in the apartment
- **floor / floorCount** - the floor where the apartment is located and the total number of floors in the building
- **buildYear** - the year when the building was built
- **latitude, longitude** - geographic coordinates of the property
- **centreDistance** - distance from the city centre in km
- **poiCount** - number of points of interest within a 500 m range of the apartment (schools, clinics, post offices, kindergartens, restaurants, colleges, pharmacies)
- **[poiName]Distance** - distance to the nearest point of interest (schools, clinics, post offices, kindergartens, restaurants, colleges, pharmacies)
- **ownership** - the type of property ownership
- **condition** - the condition of the apartment
- **has[features]** - whether the property has key features such as assigned parking space, balcony, elevator, security, storage room
- **price** - offer price in Polish Zloty

<a id="section2"></a>
# Importing Libraries

In [None]:
# Python version used
from platform import python_version
print('Python Version Used in this Jupyter Notebook:', python_version())

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from scipy.stats import skew
from sklearn.linear_model import LinearRegression, Ridge, LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import log_loss, accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import RepeatedKFold
from numpy import absolute
import graphviz
import xgboost as xgb

<a id="section3"></a>
# Loading Datasets

In [None]:
df_august = pd.read_csv('/kaggle/input/apartment-prices-in-poland/apartments_pl_2023_08.csv')
df_september = pd.read_csv('/kaggle/input/apartment-prices-in-poland/apartments_pl_2023_09.csv')
df_october = pd.read_csv('/kaggle/input/apartment-prices-in-poland/apartments_pl_2023_10.csv')

In [None]:
df_august.head()

In [None]:
df_september.head()

In [None]:
df_october.head()

In [None]:
print(df_august.shape)
print(df_september.shape)
print(df_october.shape)

## Concatenating Datasets

In [None]:
df_august['Month'] = 0
df_september['Month'] = 1
df_october['Month'] = 2

In [None]:
frames = [df_august, df_september, df_october]
df = pd.concat(frames)
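
`pd.concat` keeps each frame's original row labels, so the combined frame starts with a repeated index (0, 1, 2, ... once per file). A minimal sketch of one way to get a clean index at this step, using small made-up frames (the names `aug`, `sep` and `combined` are hypothetical stand-ins for the monthly files):

```python
import pandas as pd

# Hypothetical miniature monthly frames standing in for the Kaggle CSVs.
aug = pd.DataFrame({"city": ["warszawa", "krakow"], "price": [800000, 650000]})
sep = pd.DataFrame({"city": ["gdansk"], "price": [700000]})

# Tag each frame with its month before stacking them.
aug["Month"] = 0
sep["Month"] = 1

# ignore_index=True rebuilds a 0..n-1 index instead of repeating 0, 1, ... per file.
combined = pd.concat([aug, sep], ignore_index=True)
print(combined)
```

In this notebook the index is rebuilt later by `reset_index(drop=True)`, so this is a convenience rather than a required change.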

<a id="section4"></a>
# Data Analysis

In [None]:
# Shape of dataframe
df.shape

In [None]:
# Checking for missing values
df.isna().sum()

In [None]:
# Dropping id and the columns with too many missing values to impute reliably
df.drop(['id','type', 'floor', 'buildYear', 'floorCount', 'condition', 'buildingMaterial'], axis=1, inplace=True)
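
The choice of columns to drop can be backed by the share of missing values per column. The sketch below runs on a made-up toy frame; the 50% cut-off is an assumption for illustration, not the rule used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "squareMeters": [50.0, 62.5, 48.0, 71.0],
    "condition": [np.nan, np.nan, "premium", np.nan],
    "buildYear": [1990.0, np.nan, 2005.0, 2012.0],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Assumed cut-off: flag columns with more than half of their values missing.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print("Columns to drop:", to_drop)
```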

In [None]:
df['Month'] = df['Month'].astype(str)

In [None]:
# Dropping remaining missing values
df_clean = df.dropna()

In [None]:
# Dropping duplicates if any
df_clean = df_clean.drop_duplicates().reset_index(drop=True)

In [None]:
df_clean.head(5)

In [None]:
df_clean.shape

In [None]:
# Uncomment to save the cleaned dataframe to CSV.
# df_clean.to_csv('df_all_clean.csv', index=False)

In [None]:
df_clean.info()

In [None]:
df_clean.describe()

## Selecting Numerical and Categorical Columns

In [None]:
num_cols = df_clean.select_dtypes([np.number]).columns
df_nums = df_clean[num_cols].reset_index(drop=True)

In [None]:
cat_cols = df_clean.select_dtypes(['object']).columns
df_cats = df_clean[cat_cols].reset_index(drop=True)

<a id="section5"></a>
# Data Visualization

## Univariate Analysis - Box Plots

In [None]:
features = num_cols.to_list()
plt.figure(figsize=(15, 5))
for i, feature in enumerate(features):
    plt.subplot(2, 7, i + 1)
    sns.boxplot(y=df_clean[feature], color='magenta', orient='v')
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()

- Although the box plots flag many points as outliers, some of them appear to be genuine values (for example, large or expensive apartments) and are left as they are.
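
The points a box plot draws as outliers are those beyond 1.5 times the interquartile range (IQR) from the quartiles, which is the default whisker rule. A minimal sketch on made-up prices shows how a plausible but expensive apartment gets flagged:

```python
import pandas as pd

# Made-up prices in PLN; the last value is an expensive but plausible flat.
prices = pd.Series([400_000, 450_000, 500_000, 550_000, 600_000, 2_500_000])

# Quartiles and interquartile range
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Default box plot whiskers: 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = prices[(prices < lower) | (prices > upper)]
print(f"bounds: [{lower:.0f}, {upper:.0f}]")
print(flagged)
```

Whether a flagged value is an error or a genuine observation still has to be judged from domain knowledge.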

## Univariate Analysis - Dist Plots

In [None]:
features = num_cols.to_list()
plt.figure(figsize=(20, 10))
for i, feature in enumerate(features):
    plt.subplot(5, 3, i + 1)
    sns.histplot(x=df_clean[feature], kde=True, color='green')
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()

- Most numerical columns appear to be skewed, which may need to be corrected before modeling.
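
Skewness can be quantified, and a log transform is one common way to reduce right skew. The sketch below uses a synthetic log-normal sample as a stand-in for a column such as `price`; the sample and its parameters are assumptions, not values from the dataset:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal), a stand-in for a column such as price.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=13.0, sigma=0.6, size=2000))

# Sample skewness before and after a log1p transform
skew_before = values.skew()
skew_after = np.log1p(values).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

If the target itself is transformed this way, predictions have to be mapped back with `np.expm1` before computing errors in zloty.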

## Univariate Analysis - Violin Plots

In [None]:
plt.figure(figsize=(15, 10))
features = num_cols.to_list()
for i, feature in enumerate(features):
    plt.subplot(3, 5, i + 1)
    sns.violinplot(y=df_clean[feature], color='yellow', orient='v')
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()

## Univariate Analysis - Count Plot (Categorical)

In [None]:
plt.figure(figsize=(15, 10))
for i, col in enumerate(df_cats.columns):
    plt.subplot(4, 3, i + 1)
    # Horizontal bars: categories on the y-axis, counts on the x-axis
    ax = sns.countplot(y=df_clean[col], palette='BuPu')
    # Leave 20% headroom so the bar labels fit inside the axes
    ax.set_xlim(0, df_clean[col].value_counts().max() * 1.2)
    ax.bar_label(ax.containers[0])
plt.tight_layout()

- Some of the categorical columns are heavily imbalanced.

## Bivariate Analysis - Correlation Map

In [None]:
df_nums.corr()

In [None]:
corr_df = df_nums.corr()

In [None]:
plt.figure(figsize = (15, 8))
sns.heatmap(corr_df, cmap = 'Blues', annot = True, fmt = '.2f')
plt.xticks(rotation=45);

In [None]:
df_clean[df_cats.columns] = df_clean[df_cats.columns].apply(LabelEncoder().fit_transform)
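
`LabelEncoder` maps each category to an integer following the sorted order of the category labels, so the resulting numbers carry no real ranking. A small sketch with made-up city values (the names `cities`, `encoder` and `codes` are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample of the city column
cities = pd.Series(["warszawa", "krakow", "gdansk", "krakow"])

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# Classes are stored in sorted order, so the integer order has no real meaning.
print(encoder.classes_.tolist())  # ['gdansk', 'krakow', 'warszawa']
print(codes.tolist())             # [2, 1, 0, 1]

# Keeping the fitted encoder allows mapping the codes back to labels.
print(encoder.inverse_transform(codes).tolist())
```

Calling `LabelEncoder().fit_transform` through `apply` discards the fitted encoders, so keeping one encoder per column is worth considering if the original labels are needed later.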

In [None]:
df_clean.head()

In [None]:
plt.figure(figsize = (15, 8))
sns.heatmap(df_clean.corr(), cmap = 'Blues', annot = True, fmt = '.2f')
plt.xticks(rotation=45);

<a id="section6"></a>
# Machine Learning

## Feature Selection

In [None]:
# Dropping low correlation columns (interestingly, some of them have high multicollinearity)
df_clean.drop(['schoolDistance','latitude', 'postOfficeDistance',
               'kindergartenDistance', 'collegeDistance', 'pharmacyDistance', 'hasBalcony','Month'], axis=1, inplace=True)

In [None]:
# Dropping the rooms column because it is highly collinear with the squareMeters column
df_clean.drop(['rooms'], axis=1, inplace=True)
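
Strongly correlated feature pairs can also be listed programmatically instead of being read off the heatmap. A minimal sketch on synthetic data where `rooms` is built to grow with `squareMeters`; the 0.8 cut-off is an assumed value:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
square_meters = rng.uniform(25, 120, size=500)
toy = pd.DataFrame({
    "squareMeters": square_meters,
    # rooms grows with area by construction, so the pair is strongly correlated
    "rooms": np.round(square_meters / 25 + rng.normal(0, 0.3, size=500)),
    "centreDistance": rng.uniform(0.5, 15, size=500),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # assumed cut-off for "high" correlation
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```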

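### Using Random Forest to Check the Most Important Features

In [None]:
X = df_clean.loc[:, df_clean.columns != 'price']
y = df_clean['price'].values

In [None]:
# price is continuous, so a regressor (not a classifier) is the appropriate estimator;
# the name clf is kept because the following cells refer to it
clf = RandomForestRegressor(n_estimators=10, random_state=0, max_depth=9, n_jobs=-1)

clf.fit(X, y)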

In [None]:
feature_scores = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)

feature_scores

In [None]:
plt.barh(X.columns, clf.feature_importances_)
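
Impurity-based `feature_importances_` are computed on the training data and can favor features with many distinct values. Permutation importance on held-out data (already imported above as `permutation_importance`) is a common cross-check. A sketch on synthetic data, where the first feature dominates by construction:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: feature 0 matters most, feature 1 a little, feature 2 not at all.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = 5.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out set and measure the drop in R2.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("features ranked by importance:", ranking.tolist())
print("mean importances:", result.importances_mean.round(3))
```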

## Linear Regression

### Using X1

In [None]:
X1 = df_clean.loc[:, df_clean.columns != 'price']
y1 = df_clean['price'].values

In [None]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size = 0.25, random_state = 42)

In [None]:
scaler = StandardScaler()

In [None]:
# Fit the scaler on the training data only, then apply the same transformation to the test data
X1_train_scaled = scaler.fit_transform(X1_train)
X1_test_scaled = scaler.transform(X1_test)
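
One way to guarantee that scaling statistics come from the training data only is to bundle the scaler and the model with `make_pipeline` (imported above). This is a sketch on synthetic data, not the workflow used in the rest of this notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and a linear target with some noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(20, 150, size=(300, 2))
y_demo = 10_000 * X_demo[:, 0] - 2_000 * X_demo[:, 1] + rng.normal(scale=5_000, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# fit() learns the scaling from X_tr only; score() reuses it on X_te.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
r2_demo = pipe.score(X_te, y_te)
print(f"R2 on the held-out split: {r2_demo:.4f}")
```

The same pipeline object can be passed to `cross_val_score` or `GridSearchCV`, which then refit the scaler inside every fold.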

In [None]:
model = linear_model.LinearRegression(fit_intercept = True)

In [None]:
model_v1_lm = model.fit(X1_train_scaled, y1_train)

In [None]:
# Calculating the R2 metric of the model (already fitted above, so no refit is needed)
r2_score(y1_test, model_v1_lm.predict(X1_test_scaled))

### Using X2

In [None]:
X2 = df_clean[['squareMeters', 'longitude', 'centreDistance']]
y2 = df_clean['price']

In [None]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size = 0.25, random_state = 42)

In [None]:
# Fit the scaler on the training data only, then apply the same transformation to the test data
X2_train_scaled = scaler.fit_transform(X2_train)
X2_test_scaled = scaler.transform(X2_test)

In [None]:
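# Use a new estimator so that the model fitted on X1 is not overwritten
model_v2_lm = linear_model.LinearRegression(fit_intercept = True).fit(X2_train_scaled, y2_train)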

In [None]:
# Calculating the R2 metric
r2_score(y2_test, model_v2_lm.predict(X2_test_scaled))

- X2 had a lower R2 score, but it is a much simpler model (with fewer predictor variables), which may make it generalize better.
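
A single train/test split gives only one estimate of how each feature set generalizes. Cross-validation with `cross_val_score` (imported above) gives several. The sketch below compares a full and a reduced feature set on synthetic data; the arrays `X_full` and `X_small` are hypothetical stand-ins for X1 and X2:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 3 informative columns plus 7 pure-noise columns.
rng = np.random.default_rng(7)
n = 500
informative = rng.normal(size=(n, 3))
noise_cols = rng.normal(size=(n, 7))
target = informative @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=n)

X_full = np.hstack([informative, noise_cols])   # stand-in for X1
X_small = informative                           # stand-in for X2

# 5-fold cross-validated R2 for each feature set
scores_full = cross_val_score(LinearRegression(), X_full, target, cv=5, scoring="r2")
scores_small = cross_val_score(LinearRegression(), X_small, target, cv=5, scoring="r2")

print(f"full:  {scores_full.mean():.3f} +/- {scores_full.std():.3f}")
print(f"small: {scores_small.mean():.3f} +/- {scores_small.std():.3f}")
```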

## Random Forest

In [None]:
# Creating Random Forest Model
rf1 = RandomForestRegressor(n_estimators = 100, min_samples_leaf = 10, random_state = 101, n_jobs=-1, max_depth=9, oob_score=True)

### Using X1

In [None]:
model_rf_v1 = rf1.fit(X1_train_scaled, y1_train)

In [None]:
prediction = model_rf_v1.predict(X1_test_scaled)

In [None]:
mean_squared_error(y1_test, prediction)

In [None]:
mean_absolute_error(y1_test, prediction)

In [None]:
r2_score(y1_test, prediction)
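
For reference, the three metrics used above can be written out by hand. MSE is expressed in squared zloty, so its square root (RMSE) is often easier to read next to MAE. A small worked sketch with made-up numbers:

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

errors = y_pred - y_true                    # [10, -10, 30]

mse = np.mean(errors ** 2)                  # (100 + 100 + 900) / 3
rmse = np.sqrt(mse)                         # same unit as the target
mae = np.mean(np.abs(errors))               # (10 + 10 + 30) / 3

# R2 = 1 - residual sum of squares / total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")
```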

### Using X2

In [None]:
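# Fit a fresh copy so that model_rf_v1 is not overwritten (rf1.fit returns rf1 itself)
from sklearn.base import clone
model_rf_v2 = clone(rf1).fit(X2_train_scaled, y2_train)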

In [None]:
prediction = model_rf_v2.predict(X2_test_scaled)

In [None]:
mean_squared_error(y2_test, prediction)

In [None]:
mean_absolute_error(y2_test, prediction)

In [None]:
r2_score(y2_test, prediction)

- X1 and X2 performed similarly. Since X2 is the simpler model, it is the one chosen.

### Finding the best n_estimators

In [None]:
N_estimators = [5,50,100,200,500,1000]
R2_score = []
for n_estimator in N_estimators:
    # Fixed random_state so that differences come from the number of trees, not the seed
    rf_model = RandomForestRegressor(n_estimators = n_estimator, max_depth = 9, random_state = 101, n_jobs = -1)
    rf_model.fit(X2_train_scaled, y2_train)
    prediction = rf_model.predict(X2_test_scaled)
    r2_calc = r2_score(y2_test, prediction)
    R2_score.append(r2_calc)
    print(f'For n_estimators = {n_estimator}, the R2 score is: {r2_calc}')


In [None]:
fig, ax = plt.subplots()
ax.plot(N_estimators, R2_score, c='g')
# Label each point with its (n_estimators, R2) pair
for n, score in zip(N_estimators, R2_score):
    ax.annotate(f'({n}, {score:.5f})', (n, score))
ax.grid()
ax.set_title("R2 Score for Each Number of Estimators")
ax.set_xlabel("Number of estimators")
ax.set_ylabel("R2 score")
plt.show()
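
The number of trees can also be compared without touching the test set by using the out-of-bag (OOB) score, which the `oob_score=True` option enables. A sketch on synthetic data; the candidate values and the data are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 10, size=(600, 3))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.3, size=600)

oob_scores = {}
for n_trees in [25, 50, 100]:
    forest = RandomForestRegressor(
        n_estimators=n_trees, max_depth=9, oob_score=True, random_state=0
    )
    forest.fit(X_demo, y_demo)
    # R2 computed on the samples each tree did not see during bootstrapping
    oob_scores[n_trees] = forest.oob_score_

for n_trees, score in oob_scores.items():
    print(f"n_estimators={n_trees}: OOB R2 = {score:.4f}")
```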

### Using GridSearchCV for Hyperparameter Tuning

In [None]:
#param_grid = { 
#    'n_estimators': [25, 50, 100, 150, 200], 
#    'max_features': ['sqrt', 'log2', None], 
#    'max_depth': [3, 6, 9], 
#    'max_leaf_nodes': [3, 6, 9], 
#}


#grid_search = GridSearchCV(RandomForestRegressor(), 
#                           param_grid=param_grid) 
#grid_search.fit(X1_train_scaled, y1_train) 
#print(grid_search.best_estimator_) 

In [None]:
#model_grid = RandomForestRegressor(max_depth=9,
#                                   max_features=1.0,
#                                   max_leaf_nodes=None,
#                                   n_estimators=200,
#                                   random_state=10) 

In [None]:
#model_grid.fit(X2_train_scaled, y2_train) 
#y_pred_grid = model_grid.predict(X2_test_scaled) 

In [None]:
#r2_score(y2_test, y_pred_grid)
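
The grid search above is commented out, presumably because it is slow on the full data. A reduced, runnable sketch of the same idea on synthetic data is shown here; the smaller grid, `cv=3` and the data are assumptions chosen to keep the run short:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data
rng = np.random.default_rng(5)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 2] + rng.normal(scale=0.5, size=300)

# Deliberately small grid: 4 combinations x 3 folds
param_grid = {"n_estimators": [25, 50], "max_depth": [3, 6]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best cross-validated R2:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is already refit on the whole training set and can be used for prediction directly.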

## XGBRegressor

In [None]:
# rmse and mae are regression metrics; auc and error apply to classification only
model = XGBRegressor(n_estimators=200, max_depth=6, eval_metric=["rmse", "mae"])

In [None]:
model.fit(X2_train_scaled, y2_train)

In [None]:
prediction = model.predict(X2_test_scaled)

In [None]:
print(r2_score(y2_test, prediction))

In [None]:
print(mean_squared_error(y2_test, prediction))

print(mean_absolute_error(y2_test, prediction))

<a id="section7"></a>
# Conclusion

- The reduced X2 feature set did not always yield better evaluation metrics, but because it is a much simpler model it is the preferred choice, as it is likely to generalize better.
- Among all the algorithms tested, the XGBoost Regressor had the best results.
- Further hyperparameter tuning might improve the results.