# Project Report

Our main objective for this project was to estimate the rent price of a property in Doha with reasonable accuracy, given a set of input features such as:

- `Type`: segment of the establishment - Apartment, Villa, Penthouse, etc. 

- `Area(sqm)`: area of establishment in square meters - already preprocessed by the web scraping script

- `Bedrooms`/`Bathrooms`: number of bedrooms/bathrooms

- `Location`: location of establishment - the zone/neighborhood

- `Amenities`: amenities corresponding to the establishment - Balcony/Security/View at water, etc

- `Furnishing`: level of furnishing in establishment - Furnished/Partly Furnished/Unfurnished

The main question is: given features `x1, x2, ..., xn`, can we derive an equation of the form `y = a0 + a1 * x1 + a2 * x2 + ... + an * xn` that predicts the price `y`?
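
As a minimal sketch of what such an equation looks like in practice, the snippet below fits a `LinearRegression` on synthetic data (not our dataset) and reads back the intercept `a0` and the coefficients `a1`, `a2`. The two features, their ranges and the "true" coefficients are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for two features (hypothetical values, not the scraped data)
area = rng.uniform(50, 500, size=200)
bedrooms = rng.integers(1, 6, size=200)
X = np.column_stack([area, bedrooms])

# Assumed ground truth: y = 2000 + 20 * area + 1000 * bedrooms + noise
y = 2000 + 20 * area + 1000 * bedrooms + rng.normal(0, 100, size=200)

model = LinearRegression().fit(X, y)
a0 = model.intercept_      # the constant term a0
a1, a2 = model.coef_       # one coefficient per feature column
print(f"y = {a0:.0f} + {a1:.1f} * area + {a2:.0f} * bedrooms")
```

The fitted values should land close to the assumed coefficients, which is the sense in which the estimator "derives" the equation from data.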


We ran many trial-and-error tests, as previously mentioned in the presentation and the notebooks, to tune the data cleaning and wrangling process and the model for the best results. Below we go through a few of the tests that determined which estimator to use.

In [1]:
import pandas as pd 
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import itertools
import random
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor
import sklearn.metrics as sm
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score,explained_variance_score
from sklearn.preprocessing import PolynomialFeatures


In [4]:
dataset = pd.read_csv("data_ready.csv")
data = dataset
x = data.drop(columns = ['Price'])
y = data['Price']

### Linear Regression

In [10]:

x_train, x_test, y_train, y_test = train_test_split(x.values, y.values, test_size=0.2, shuffle=True)

est = LinearRegression()
est.fit(x_train, y_train) 
y_pred = est.predict(x_test)

print("R2 on train: ", est.score(x_train, y_train))
print("R2 on test:", est.score(x_test, y_test))
print("RMSE: ", int(np.sqrt(sm.mean_squared_error(y_test, y_pred)))) 


R2 on train:  0.7109591543880578
R2 on test: 0.686721532212042
RMSE:  2894


Not bad for such a limited dataset, given that most of the observations we collected are clustered in a few hotspots - apartments in The Pearl, for example.

We interpret these results as follows: the model explains roughly 70% of the variance in price from the values of the features. Nevertheless, trying more estimators might give better results.
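
To make the R2 reading concrete, here is a small sketch that computes it by hand as `1 - SS_res / SS_tot` and checks it against scikit-learn's `r2_score` (the same quantity that `est.score` reports for regressors). The price values are made up for illustration, and `r2_manual` is a helper name introduced here.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up prices and predictions (hypothetical, not taken from data_ready.csv)
y_true = np.array([5000, 7500, 12000, 9000, 15000])
y_pred = np.array([5500, 7000, 11000, 10000, 14000])

print("manual :", r2_manual(y_true, y_pred))
print("sklearn:", r2_score(y_true, y_pred))
```

A value of 0 means the model does no better than always predicting the mean price, and the value becomes negative when it does worse than that.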

### SGD Linear Regression

In [13]:
x_train, x_test, y_train, y_test = train_test_split(x.values, y.values, test_size=0.2, shuffle=True)

est = SGDRegressor()
est.fit(x_train, y_train) 
y_pred = est.predict(x_test)

print("R2 on train: ", est.score(x_train, y_train))
print("R2 on test:", est.score(x_test, y_test))
print("RMSE: ", int(np.sqrt(sm.mean_squared_error(y_test, y_pred)))) 


R2 on train:  -2.160787973618269e+23
R2 on test: -2.1433485516220804e+23
RMSE:  2270972043837340


The results are disastrous, but it makes sense why. First and foremost, the stochastic gradient descent (SGD) regressor is sensitive to feature scale, so let's repeat the test with standardized features:


In [14]:
x_train, x_test, y_train, y_test = train_test_split(x.values, y.values, test_size=0.2, shuffle=True)

est = SGDRegressor()
scaler = StandardScaler()

x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
est.fit(x_train, y_train) 
y_pred = est.predict(x_test)

print("R2 on train: ", est.score(x_train, y_train))
print("R2 on test:", est.score(x_test, y_test))
print("RMSE: ", int(np.sqrt(sm.mean_squared_error(y_test, y_pred)))) 

R2 on train:  0.6988120679406404
R2 on test: 0.7207864390054183
RMSE:  2583


That's more like it! We get results very similar to the classic Linear Regression, and even slightly better when predicting on `test`. However, when we tried it in the actual program by inputting features manually, as a user would, our extensive tests showed that it tends to overestimate the prices a bit, while the linear regressor stays closer to the real ones. Moving on...
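
The scaling step can also be bundled with the estimator, so that the scaler is always fitted on the training split only and applied automatically at prediction time. The following is a minimal sketch using scikit-learn's `make_pipeline` on synthetic data; the feature ranges, coefficients and `random_state` values are assumptions for illustration and do not come from our dataset.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two synthetic features on very different scales (hypothetical values)
area = rng.uniform(50, 500, size=500)
bedrooms = rng.integers(1, 6, size=500).astype(float)
X = np.column_stack([area, bedrooms])
y = 20 * area + 1000 * bedrooms + rng.normal(0, 500, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline standardizes the features, then feeds them to SGD
pipe = make_pipeline(StandardScaler(), SGDRegressor(random_state=42))
pipe.fit(X_train, y_train)

test_r2 = pipe.score(X_test, y_test)
print("R2 on test:", test_r2)
```

Fixing `random_state` in both the split and the estimator also makes the scores repeatable from run to run, which makes comparisons between estimators easier.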

### Linear Regression with Polynomial Features

What if the relation between the features and the price is not actually linear, but polynomial? We can test this with a Polynomial Regression, which models the relation between the dependent and independent variables as an nth-degree polynomial function, and see whether we get better results.

In [17]:
est = LinearRegression()
poly = PolynomialFeatures(degree=2)
x_train, x_test, y_train, y_test = train_test_split(x.values, y.values, test_size=0.2, shuffle=True)

# Expand to degree-2 polynomial terms: fit on train, reuse the same transformer on test
x_train = poly.fit_transform(x_train)
x_test = poly.transform(x_test)

est.fit(x_train, y_train)
y_pred = est.predict(x_test)

print("R2 on train: ", est.score(x_train, y_train))
print("R2 on test:", est.score(x_test, y_test))
print("RMSE: ", int(np.sqrt(sm.mean_squared_error(y_test, y_pred)))) 

R2 on train:  0.8326745416081861
R2 on test: 0.6014560514145142
RMSE:  3200


The comparison is straightforward: the results on train are much better than with the plain linear model (R2 of 0.83 versus 0.71), but the results on test are noticeably worse (0.60 versus 0.69), which is a typical sign of overfitting. Let's see what happens if we change the degree of the polynomial to 3. Will it even run?

In [18]:
est = LinearRegression()
poly = PolynomialFeatures(degree=3)
x_train, x_test, y_train, y_test = train_test_split(x.values, y.values, test_size=0.2, shuffle=True)

# Expand to degree-3 polynomial terms: fit on train, reuse the same transformer on test
x_train = poly.fit_transform(x_train)
x_test = poly.transform(x_test)

est.fit(x_train, y_train)
y_pred = est.predict(x_test)

print("R2 on train: ", est.score(x_train, y_train))
print("R2 on test:", est.score(x_test, y_test))
print("RMSE: ", int(np.sqrt(sm.mean_squared_error(y_test, y_pred)))) 

KeyboardInterrupt: 

And it didn't! The cell was still running after a long wait and had to be interrupted: a degree-3 expansion produces far too many features for us to test this.
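
The blow-up can be estimated before running anything. With the bias column included, `PolynomialFeatures` produces `C(n + d, d)` columns for `n` input features and degree `d`. The sketch below checks that formula against scikit-learn on a tiny input and then evaluates it for a hypothetical 100 input columns; the actual column count of `data_ready.csv` is not stated in this report, so that figure is an assumption, and `poly_feature_count` is a helper name introduced here.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def poly_feature_count(n_features, degree):
    """Number of columns PolynomialFeatures produces (bias column included)."""
    return comb(n_features + degree, degree)

# Cross-check the formula against scikit-learn on a small dummy input
small = np.zeros((1, 5))
for d in (2, 3):
    n_out = PolynomialFeatures(degree=d).fit(small).n_output_features_
    assert n_out == poly_feature_count(5, d)

# Hypothetical column count, for illustration only
n_columns = 100
for d in (1, 2, 3):
    print(f"degree {d}: {poly_feature_count(n_columns, d)} columns")
```

Under that assumption, going from degree 2 to degree 3 multiplies the number of columns by more than thirty, and the least squares fit has to solve for one coefficient per column.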

## Conclusion & Future Prospects

It is safe to say that some features have a greater influence than others. Intuitively, the number of bedrooms or the level of furnishing of the establishment will have a greater impact on the price than amenities such as a "Walk-in closet" or a "Concierge", and our model does capture these variations. In terms of numbers, our Linear Regression estimator consistently scores an R2 of about 0.70, meaning it explains roughly 70% of the variance in price, across the many feature engineering variations and combinations we tried. That stability is the reason why we chose to approach this numerical estimation problem with a classic linear regressor.

However, although a score of roughly 70% does not sound too bad, it is critical to mention the impact of the human factor in this scenario. While more or less predictable, these listings come from different agencies, with different commissions and with different extras that we were not able to collect in a straightforward manner, such as utilities included or "2 months free" (prices are usually higher in such cases, although in theory they should not be).

Some very important points that would help improve the model in the future:

- A closer, more detailed review of the importance and magnitude of individual features, to achieve better results
- Parsing the listing descriptions with NLP to collect new features such as "utilities included" or "2 months free", which would help deal with the human factor in more depth
- Testing more feature encoding combinations: what we have achieved so far has come a long way in terms of performance, but there is always room for improvement
- Last but not least, more observations. We would need far more data than we collected to obtain more accurate estimations. This could be done by obtaining databases from different rent advertisement providers, or by collecting more historical data
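
As a starting point for the feature importance review, one possible approach (a sketch, not something we ran on the real data) is to fit the linear model on standardized features and rank the columns by the absolute value of their coefficients. The columns and the relationship below are synthetic assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300

# Hypothetical columns standing in for the encoded features
X = pd.DataFrame({
    "Area(sqm)": rng.uniform(50, 500, n),
    "Bedrooms": rng.integers(1, 6, n),
    "Concierge": rng.integers(0, 2, n),
})

# Assumed relationship, for illustration only
y = (20 * X["Area(sqm)"] + 1000 * X["Bedrooms"]
     + 100 * X["Concierge"] + rng.normal(0, 300, n))

# Standardizing puts all coefficients on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

importance = pd.Series(np.abs(model.coef_), index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Such a ranking is only a rough guide: when features are strongly correlated with each other, individual coefficients can be misleading.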