## <b>Bucharest: House price prediction</b>

![](https://resources.stuff.co.nz/content/dam/images/1/k/f/w/7/n/image.related.StuffLandscapeSixteenByNine.710x400.1oq9kk.png/1520294653850.jpg?format=pjpg&optimize=medium)

# Dataset description

The file contains sale prices of real estate listings in Bucharest, Romania, collected in March 2019.

The data set is composed of 7 variables: 
- number of rooms
- surface 
- floor 
- total number of floors in the building
- area location of the dwelling
- score of the location
- price 

The data comes mainly from www.imobiliare.ro, the most popular real estate website in Romania.

# Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LassoLars, BayesianRidge, SGDRegressor
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings("ignore")

# Data reading and checking

In [None]:
df = pd.read_csv('../input/bucharest-house-price-dataset/Bucharest_HousePriceDataset.csv')
print(df.shape)
df.head()

In [None]:
# Translate the column names into English
df.rename(columns={'Nr Camere':'Rooms', 'Suprafata':'Surface', 'Etaj':'Floor',
                   'Total Etaje': 'NumberOfFloors', 'Sector':'AreaLocation',
                   'Scor':'Rank', 'Pret':'Price'}, inplace = True)

In [None]:
df.info()

As can be observed, the data does not have missing values.
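
The `df.info()` output shows non-null counts per column, but the same check can be made explicit. The following is a minimal sketch on a hypothetical toy frame (the values are invented for illustration, not taken from the dataset); on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real listings data
toy = pd.DataFrame({
    'Rooms': [2, 3, np.nan, 4],
    'Surface': [50.0, 72.5, 40.0, np.nan],
    'Price': [60000, 95000, 48000, 120000],
})

# Count missing values per column; all zeros would mean nothing is missing
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single boolean answer for the whole frame
has_missing = bool(toy.isna().any().any())
print(has_missing)
```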

Let's convert 'AreaLocation' to the 'category' dtype.

In [None]:
df['AreaLocation'] = df['AreaLocation'].astype('category')

In [None]:
df.describe()

As can be observed from the 'Floor' and 'NumberOfFloors' features, most of the listings are actually flats rather than houses.

Furthermore, the minimum 'Floor' value of -1 suggests that the data also includes basement flats.

Let's check for rows with 'NumberOfFloors' = 0. Those would be treated as houses rather than flats.

In [None]:
houses = df[df['NumberOfFloors']==0]
houses

There are no houses: every listing has at least 1 floor. All listings are therefore treated as flats.

Considering the columns 'Floor' and 'NumberOfFloors', let's make sure that there are no erroneous entries in the data, such as 'Floor' values higher than 'NumberOfFloors'.

In [None]:
df[df['Floor'] > df['NumberOfFloors']]
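
If the filter above returns any rows, one possible way to handle them is to count and drop them. This is only a sketch on a hypothetical toy frame (the third row is deliberately inconsistent); whether dropping is appropriate for the real data is a judgment call.

```python
import pandas as pd

# Hypothetical rows; the third one is deliberately inconsistent
toy = pd.DataFrame({
    'Floor': [1, 4, 9, -1],
    'NumberOfFloors': [4, 4, 8, 10],
})

# Boolean mask of rows where the flat sits above the top floor
invalid = toy['Floor'] > toy['NumberOfFloors']
print('Inconsistent rows:', int(invalid.sum()))

# Keep only the consistent rows (one possible handling, not the only one)
clean = toy.loc[~invalid].reset_index(drop=True)
print(clean)
```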

# Exploration

In [None]:
sns.pairplot(df)
plt.show()

The 'NumberOfFloors' and 'Surface' columns seem to have a few extreme values (outliers).

Let's have a closer look.

In [None]:
inspect = ['NumberOfFloors', 'Surface']
for c in inspect:
    plt.scatter(df[c], df['Price'], color='salmon')
    plt.xlabel(c)
    plt.ylabel('Price')
    plt.grid()
    plt.show()

As can be observed, the discrepancy is not that high. We will keep these values.
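
The visual judgment can be complemented with a numeric rule. The sketch below uses the common 1.5 × IQR fence; the helper name `iqr_outliers` and the sample values are hypothetical, and this rule is an assumption of the example, not something the original analysis applied.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical surface values with one extreme entry
surface = pd.Series([45, 50, 52, 55, 60, 62, 65, 70, 300], name='Surface')

mask = iqr_outliers(surface)
print('Flagged values:', surface[mask].tolist())
```

On the real data, `iqr_outliers(df['Surface']).sum()` would give the number of flagged listings, which helps decide whether keeping them is reasonable.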

In [None]:
plt.figure(figsize=(15,12))
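# Correlations are only defined for numerical columns, so 'AreaLocation' is excluded
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='YlGnBu')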
plt.show()

Let's have a look at the distribution of the columns.

We will use a box plot for the numerical features and a bar plot for the categorical one.

In [None]:
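# Box plots for the numerical columns only ('AreaLocation' is a category)
for c in df.select_dtypes(include='number').columns:
    df[c].plot(kind='box', color='salmon')
    plt.ylabel('value')  # the y-axis shows the feature values, not counts
    plt.grid()
    plt.show()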

In [None]:
plt.figure(figsize=(8,5))
df['AreaLocation'].value_counts().sort_index().plot(kind='bar', color='salmon')
plt.xlabel('AreaLocation')
plt.ylabel('count')
plt.title('# of listings by AreaLocation')
plt.grid()
plt.show()

Let's create a function to easily explore the relation between the categorical 'AreaLocation' and numerical features.

In [None]:
def area_info(feature, aggregation):

    plt.figure(figsize=(8,5))
    if aggregation == 'sum':
        x = df.groupby('AreaLocation')[feature].sum()
        x.plot(kind='bar', color='salmon')
        plt.title('Sum of {} per AreaLocation'.format(feature))
        plt.ylabel('sum')
        plt.grid()
        plt.show()

    elif aggregation == 'mean':
        x = df.groupby('AreaLocation')[feature].mean()
        x.plot(kind='bar', color='salmon')
        plt.title('Mean {} per AreaLocation'.format(feature))
        plt.ylabel('mean')
        plt.grid()
        plt.show()

    elif aggregation == 'min':
        x = df.groupby('AreaLocation')[feature].min()
        x.plot(kind='bar', color='salmon')
        plt.title('Min {} per AreaLocation'.format(feature))
        plt.ylabel('min')
        plt.grid()
        plt.show()

    elif aggregation == 'max':
        x = df.groupby('AreaLocation')[feature].max()
        x.plot(kind='bar', color='salmon')
        plt.title('Max {} per AreaLocation'.format(feature))
        plt.ylabel('max')
        plt.grid()
        plt.show()

    else:
        print('The aggregation you chose is not supported.')
        print('Please choose from the following features: Rooms, Surface, Floor, NumberOfFloors, Rank, Price.')
        print('Please choose from the following list of aggregations: sum, mean, min, max')
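
The four branches above differ only in the aggregation name, so the same idea can be expressed more compactly. The following is a sketch of an alternative, not the notebook's own function: `area_summary` and `ALLOWED_AGGREGATIONS` are hypothetical names, the toy data is invented, and the function returns the aggregated values instead of plotting them.

```python
import pandas as pd

ALLOWED_AGGREGATIONS = ('sum', 'mean', 'min', 'max')

def area_summary(data, feature, aggregation):
    """Aggregate one numerical feature per AreaLocation."""
    if aggregation not in ALLOWED_AGGREGATIONS:
        raise ValueError('aggregation must be one of {}'.format(ALLOWED_AGGREGATIONS))
    if feature not in data.columns:
        raise ValueError('unknown feature: {}'.format(feature))
    # groupby(...).agg accepts the aggregation name as a string
    return data.groupby('AreaLocation')[feature].agg(aggregation)

# Hypothetical toy data
toy = pd.DataFrame({
    'AreaLocation': [1, 1, 2, 2],
    'Price': [50000, 70000, 90000, 110000],
})

summary = area_summary(toy, 'Price', 'mean')
print(summary)
```

Calling `summary.plot(kind='bar', color='salmon')` on the result would draw the same kind of bar chart as `area_info`.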


In [None]:
area_info('Rooms', 'mean')

In [None]:
area_info('Surface', 'mean')

In [None]:
area_info('Floor', 'mean')

In [None]:
area_info('NumberOfFloors', 'mean')

In [None]:
area_info('Rank', 'mean')

In [None]:
area_info('Price', 'mean')

# Data preparation

Encode the categorical variable 'AreaLocation'

In [None]:
encoded = pd.get_dummies(df['AreaLocation'], prefix='Area')
new_data = pd.concat([df, encoded], axis=1)
new_data.drop('AreaLocation', axis = 1, inplace = True)
print(new_data.shape)
new_data.head()

In [None]:
change = ['Area_1','Area_2','Area_3','Area_4','Area_5','Area_6']

for c in change:
    new_data[c] = new_data[c].astype('int64')
new_data.info()
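
The encoding, concatenation, column drop, and integer cast can also be done in a single call. A minimal sketch on a hypothetical toy frame, assuming a pandas version in which `get_dummies` accepts the `columns` and `dtype` arguments:

```python
import pandas as pd

# Hypothetical toy frame with the same kind of categorical column
toy = pd.DataFrame({
    'Rooms': [2, 3, 1],
    'AreaLocation': pd.Categorical([1, 3, 1]),
    'Price': [60000, 95000, 48000],
})

# Encode 'AreaLocation' in place of the original column, directly as integers
encoded_toy = pd.get_dummies(toy, columns=['AreaLocation'], prefix='Area', dtype='int64')
print(encoded_toy.columns.tolist())
print(encoded_toy.dtypes)
```

Only categories present in the data get a column, so the toy example yields `Area_1` and `Area_3` rather than all six areas.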

In [None]:
target = new_data['Price'].values.reshape(-1,1)
features = new_data.drop('Price', axis = 1).values

print(features.shape, target.shape)

Standardize the features (the target is left in its original units)

In [None]:
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
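
Fitting the scaler on the full feature matrix lets statistics from the future test rows influence the training data. A common alternative is to split first and wrap the scaler and the model in a pipeline, so the scaler is fitted on the training rows only. The sketch below uses synthetic data (not the Bucharest listings) and `Ridge` purely as an example estimator.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, generated only for this illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# The scaler is fitted inside the pipeline, on the training rows only
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to the test rows before predicting
test_r2 = pipe.score(X_test, y_test)
print('Test R^2:', test_r2)
```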

# Modelling & Evaluation

In [None]:
class Model:
  
  def __init__(self, model, test_size, features, target):
    self.model = model
    self.test_size = test_size
    self.features = features
    self.target = target

  

  def Fit(self):
    X_train, X_test, y_train, y_test = train_test_split(self.features, self.target, test_size = self.test_size, random_state = 123, shuffle = True)
    self.model.fit(X_train, y_train)
    preds = self.model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    # score() returns R^2 for regressors; here it is computed on the training split
    train_r2 = self.model.score(X_train, y_train)
    print("Root Mean Squared Error (test): {}".format(rmse))
    print('R^2 (train): {}'.format(train_r2))

In [None]:
regressors = [LinearRegression(), Ridge(), Lasso(), ElasticNet(), LassoLars(), BayesianRidge(), SGDRegressor()]

for r in regressors:
    print(r)
    selected = Model(r, 0.2, scaled_features, target)
    selected.Fit()
    print('\n')

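As can be observed, the model with the best performance is <b>SGDRegressor</b>, with an RMSE of 30691.773 on the test set and a score of about 0.76. That score is the R² returned by `score()` on the training data, not a classification accuracy, and since SGDRegressor is fitted without a fixed `random_state`, the exact figures can vary slightly between runs.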
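
Because the comparison above rests on a single train/test split, a cross-validated RMSE gives a steadier basis for ranking the regressors. The following sketch runs on synthetic data rather than the listings; the candidate subset, the 5-fold setup, and the fixed `random_state` values are assumptions of the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, generated only for this illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([5.0, 1.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=123)
candidates = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(),
    'SGDRegressor': SGDRegressor(random_state=123),  # fixed seed for repeatability
}

results = {}
for name, estimator in candidates.items():
    # Scaling happens inside each fold, so no test rows leak into the scaler
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X, y, cv=cv,
                             scoring='neg_root_mean_squared_error')
    results[name] = -scores.mean()  # flip the sign back to a positive RMSE
    print('{}: mean RMSE = {:.3f}'.format(name, results[name]))
```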