## Yo!
So I've done some simple work with this dataset and achieved an average MAE of around 12000 for sale prices and around 65 for rent prices.

Here I want to show you what you can find with EDA, which variables might be worth engineering, and how the GeoPy library can help you find coordinates and addresses to obtain more information and improve your model.

### Some preparation
Importing libraries:

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, LinearRegression, ElasticNet,  HuberRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_squared_log_error
from xgboost import XGBRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.utils import resample
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import KFold


import re

from geopandas.tools import geocode
import warnings
warnings.filterwarnings("ignore")

seed = 42

Introducing functions for EDA:

In [None]:
def missing(df):
    df_missing = pd.DataFrame(df.isna().sum().sort_values(ascending = False), columns = ['missing_count'])
    df_missing['missing_share'] = df_missing.missing_count / len(df)
    return df_missing

In [None]:
def simple_chart(df, x, title = None, hue = None):
    plt.figure(figsize = (10, 6))
    plt.title(title, fontsize=14)
    ax = sns.countplot(x = x, hue = hue, data = df)

In [None]:
def factor_chart(df, x, y, hue = None):
    # sns.factorplot was renamed to catplot (and its `size` argument to `height`) in seaborn 0.9
    ax = sns.catplot(x = x, y = y, data = df, hue = hue, kind = 'box', height = 6, aspect = 2)

In [None]:
def scatter(df, x, y, hue = None):
    plt.figure(figsize = (20, 10))
    ax = sns.scatterplot(x = x, y = y, data = df, hue = hue)
    plt.show()

A bit of styling

In [None]:
sns.set(style="darkgrid")

And there we go!

### Raw data exploration 

In [None]:
df = pd.read_csv("../input/riga-real-estate-dataset/riga_re.csv")

In [None]:
df.head()

In [None]:
missing(df)

In [None]:
df[df.price.isna()]

We see that **10%** of the dataset has a missing price, which is our target variable.

Moreover, if we take a closer look at the observations with a missing price, we will see that they lack most of the other variables as well and contribute to the number of missing values in other columns.

As we do not have a separate test dataset, they are of no use, so I **remove them**.

In [None]:
df_all = df[~df.price.isna()].reset_index(drop = True).copy()

In [None]:
missing(df_all) 

**Wow**, this action has actually helped us get rid of most of the other missing values as well!

Now none of the variables has a share of missing values higher than 5%, so all of them are worth exploring.

In [None]:
df_all.dtypes

Unexpectedly, the rooms feature is categorical; we will take a closer look at it later on.

In [None]:
df_all.describe()

The variance of some numeric variables is very high.

Floor and total floors values are reasonable, but the area, price, lat and lon features have some weird values and definite mistakes, so cleaning will be necessary.

In [None]:
print('Number of observations:', len(df_all), '\n')
print('Unique values:')
print(df_all.nunique().sort_values(ascending = False))

We can see that not all street addresses are unique, so some groups of apartments are in the same building. This can help us when imputing floors and areas.

### Data cleaning and missing imputation

I want to use street addresses to impute missing lat and lon values.

So let's first try to fill in the missing streets from their coordinates.

In [None]:
df_all[df_all.street.isna()]

Unfortunately, we see that the lat and lon values for these 5 observations are simply incorrect: if you put them on the map, you'll find yourself in the Italian mountains (I wish it could happen for real).

Therefore, finding addresses for them is impossible. Considering that these observations have a lot of other missing features as well, we'd better drop them.

In [None]:
df_all = df_all.drop(df_all[df_all.street.isna()].index).reset_index(drop = True)

In [None]:
missing(df_all) 

**Twofer**! Thanks to this drop, we now have to care about imputation for far fewer features.

Now I want to extract the actual street names from the 'street' feature in order to use them as a categorical variable and to impute missing districts.

In [None]:
# Function for removing digits from a string

def no_digits(text):
    return ''.join([i for i in text if not i.isdigit()])

In [None]:
# Raw string for the regex so that '\W' is not treated as an (invalid) string escape
df_all['street_name_0'] = df_all['street'].apply(lambda x: no_digits(re.sub(r'\W+', ' ', str(x))).strip())

In [None]:
df_all.head(3)

In [None]:
# set(df_all.street_name_0.values)

If you uncomment and run the line above, you'll see that some useless leftover letters remain, so we will fix them as well.

In [None]:
df_all['st_n'] = None
for i in range(len(df_all)):
    if ((df_all.loc[i, 'street_name_0'][:3] != 'St ') & (df_all.loc[i, 'street_name_0'][:2] != 'J ') & 
        (df_all.loc[i, 'street_name_0'][:2] != 'M ')):
        df_all.loc[i, 'st_n'] = df_all.loc[i, 'street_name_0'].split(' ')[0]
    elif (df_all.loc[i, 'street_name_0'][:3] != 'St '):
         df_all.loc[i, 'st_n'] = df_all.loc[i, 'street_name_0'].split(' ')[0] + ' ' + df_all.loc[i, 'street_name_0'].split(' ')[1]
    else:
        df_all.loc[i, 'st_n'] = 'St ' + df_all.loc[i, 'street_name_0'].split(' ')[1]

In [None]:
#set(df_all.st_n.values)

Now it's better!

In [None]:
df_all.drop(['street_name_0'], axis = 1, inplace = True)

Now let's impute **districts**.

In [None]:
df_all[df_all.district.isna()]

In [None]:
df_all[df_all.st_n == 'Ogļu'].groupby('district').count()

We see that Ogļu street belongs strictly to the Ķīpsala district, so we impute it first.

In [None]:
# Impute the district name (Ķīpsala), not the street name
df_all.loc[1107, 'district'] = 'Ķīpsala'

In [None]:
df_all[df_all.st_n == 'Pupuku iela'].groupby('district').count()

Oh, but it looks like Pupuku iela is unique, so we use Google Maps instead and find that this street belongs to the Bišumuiža district.

In [None]:
df_all.loc[3172, 'district'] = 'Bišumuiža'

Now let's impute lat and lon.

For this purpose I will use the **GeoPy** library.

In [None]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here")

In [None]:
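def lat(add):
    # Return the latitude for an address, or None if the lookup fails or finds nothing
    try:
        return geolocator.geocode(add).latitude
    except Exception:
        return None

def lon(add):
    # Return the longitude for an address, or None if the lookup fails or finds nothing
    try:
        return geolocator.geocode(add).longitude
    except Exception:
        return None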

But as we've already noticed, some lat and lon values are **wrong**, so we have to blank them out and impute them using their street addresses and GeoPy.

In [None]:
scatter(df_all, x = 'lon', y = 'lat')

In [None]:
scatter(df_all[(df_all.lat>56.88)&(df_all.lat<57.1)&(df_all.lon>20)], x = 'lon', y = 'lat')

Now it's better

In [None]:
df_all.loc[~((df_all.lat>56.88)&(df_all.lat<57.1)&(df_all.lon>20)), ['lat', 'lon']] = None

And now we come to the most painful part of this project for me.

The problem with finding coordinates with geopy (and any other geocoding library) is that the address has to be spelled exactly the way the geocoder expects in order to return the desired coordinates.

For example, if you request coordinates for 'Jūrmalas g. 15', you won't get any result, because the geocoder does not understand 'g.', so we have to replace every instance of 'Jūrmalas g.' with 'Jūrmalas gatve'.

Or if you request 'Skolas 38', it will send you to the Norwegian hinterland.

Fixing all of this was painful, but I did it, and below is what I had to do to get the result.

Also important to note: the geocoding service behind geopy sometimes throws **connection errors**.  
I made my coordinate-finding functions robust to this, but you still have to execute the line with the function **several times**.
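
If re-running cells by hand gets tiring, the retries can be automated. Below is a minimal sketch of that idea, not the code used for this notebook: `geocode_with_retry`, `FakeLocation` and `FlakyGeocoder` are hypothetical names, and the flaky geocoder only simulates connection errors so the example runs offline. With geopy you would pass `geolocator.geocode` as `geocode_fn` (geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that may be worth a look).

```python
import time


def geocode_with_retry(geocode_fn, address, attempts=3, wait_seconds=1.0):
    """Call geocode_fn(address), retrying on errors; return (lat, lon) or None."""
    for _ in range(attempts):
        try:
            location = geocode_fn(address)
        except Exception:
            # Assumption: any exception is a transient failure worth retrying.
            time.sleep(wait_seconds)
            continue
        if location is None:
            return None  # the service answered but found nothing
        return (location.latitude, location.longitude)
    return None  # every attempt failed


class FakeLocation:
    """Stand-in for a geocoding result (hypothetical, for the offline demo)."""

    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


class FlakyGeocoder:
    """Fails the first `failures` calls, then returns a fixed location."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, address):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated connection error")
        return FakeLocation(56.9496, 24.1052)


flaky = FlakyGeocoder(failures=2)
coords = geocode_with_retry(flaky, "Skolas iela 38, Rīga", wait_seconds=0.0)
print(coords, "after", flaky.calls, "calls")
```

A pause between attempts also keeps you polite towards the free geocoding service.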

In [None]:
df_all['district'] = df_all["district"].replace('Krasta r-ns', 'Krasta masīvs')

In [None]:
df_all['Street_New'] = df_all['street']

In [None]:
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' g.', ' gatve'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' k-1', '').strip())
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' k-2', '').strip())
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' k1', '').strip())
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' k 1', '').strip())
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' k-1', '').strip())
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' k-3', '').strip())
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('-k-3', '').strip())
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' k-4', '').strip())
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' k. 1', '').strip())
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('k5', '').strip())
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('krastm.', 'krastmala'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' pr.', ' prospekts'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('Pulkv.', 'Pulkveža'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('bulv.', 'bulvāris'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('šķ. l.', 'šķērslīnija'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('šķ l.', 'šķērslīnija'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' l. ', ' līnija '))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' d. ', ' dambis '))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('J. Daliņa', 'Jāņa Daliņa'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('J. Vācieša', 'Jukuma Vācieša'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' g. ', ' gatve '))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' lauk.', ' laukums'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('k1', '').strip())
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('k2', '').strip())
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('-13d', '-13'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('-36d', '-36'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('-45d', '-45'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('-94b', '-94'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' 19/1', ' 19'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('M. Balasta', 'Mazais Balasta'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('M. Kuldīgas', 'Mazā Kuldīgas'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('M. Nometņu', 'Mazā Nometņu'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace('Asteres', 'Aisteres'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' 17 a', ' 17'))
df_all['Street_New'] = df_all["Street_New"].apply(lambda x: str(x).replace(' š. ', ' šoseja '))
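
The long chain of replacements above can also be driven by a lookup table, which is easier to extend. This is only a sketch under my own assumptions: `ABBREVIATIONS`, `BUILDING_SUFFIX` and `normalize_street` are hypothetical names, the table holds just a few of the rules from the cell above, and the regex for building suffixes ("k-1", "k. 1", "-k-3") is my guess at a general pattern, so check it against your own addresses.

```python
import re

# A few of the abbreviation rules from the cell above (not the full list).
ABBREVIATIONS = {
    " g.": " gatve",
    " pr.": " prospekts",
    " lauk.": " laukums",
    "bulv.": "bulvāris",
    "krastm.": "krastmala",
}

# Trailing building-block suffix such as " k-1", " k. 1" or "-k-3".
BUILDING_SUFFIX = re.compile(r"(?:\s+|-)k[-. ]*\d+$")


def normalize_street(street):
    """Expand known abbreviations and drop a trailing building suffix."""
    text = str(street)
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    text = BUILDING_SUFFIX.sub("", text)
    return text.strip()


for raw in ["Jūrmalas g. 15", "Dzelzavas 74 k-1", "Ilūkstes 103-k-3"]:
    print(raw, "->", normalize_street(raw))
```

On a DataFrame this would be applied once, for example with `df_all['street'].apply(normalize_street)`.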

In [None]:
df_all['Street_Full'] = df_all.apply(lambda x: str(x['Street_New']).split(' ')[0] + ' iela ' + str(x['Street_New']).split(' ')[1] +
                                    ', ' + str(x['district']) + ', ' + 'Rīga' if 
                                    len(x['Street_New'].split(' ')) == 2 else str(x['Street_New']) + ', ' + 
                                    'Rīga', axis = 1)
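
The one-liner above is hard to read, so here is the same rule spelled out as a plain function. It is a sketch that mirrors my reading of the lambda: `full_address` is a hypothetical name, and the district values in the demo are made up.

```python
def full_address(street_new, district, city="Rīga"):
    """Build the geocoder query from a cleaned street string and a district."""
    parts = str(street_new).split(" ")
    if len(parts) == 2:
        # "Name Number": insert 'iela' (street) and add the district.
        name, number = parts
        return f"{name} iela {number}, {district}, {city}"
    # Anything longer already carries its street type (gatve, bulvāris, ...).
    return f"{street_new}, {city}"


print(full_address("Skolas 38", "centrs"))
print(full_address("Jūrmalas gatve 15", "Imanta"))
```

The first call adds 'iela' and the district, the second keeps the address as it is and only appends the city.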

**ATTENTION!**

The following code uses the GeoPy coordinate functions, but they **do not work well** if you execute them in the Kaggle environment!

Therefore, if you want to perform the same coordinate imputation on your data, please copy the commented code lines below and execute them in your own environment (Jupyter, for example).

Here I will instead use a preloaded dataset that is the result of the commented code lines below.

In [None]:
# These lines request the full address that is stored in the Street_Full feature.
# They have to be run 3-4 times, until the number of missing values stops decreasing (it reaches 24 for both lat and lon in this case)

#df_all['lat'] = df_all.apply(lambda x: lat(str(x['Street_Full'])) if np.isnan(x['lat']) == True else x['lat'], axis=1)
#df_all['lon'] = df_all.apply(lambda x: lon(str(x['Street_Full'])) if np.isnan(x['lon']) == True else x['lon'], axis=1)

In [None]:
# However, some full addresses do not work with the district name, so for the remaining missing values we use only the street name and 'Rīga'
# Also run this 2-3 times (until 1 missing value is left for both lat and lon).

#df_all['lat'] = df_all.apply(lambda x: lat(str(x['Street_Full'].split(',')[0]) + str(x['Street_Full'].split(',')[-1])) if np.isnan(x['lat']) == True else x['lat'], axis=1)
#df_all['lon'] = df_all.apply(lambda x: lon(str(x['Street_Full'].split(',')[0]) + str(x['Street_Full'].split(',')[-1])) if np.isnan(x['lon']) == True else x['lon'], axis=1)

In [None]:
# The remaining missing value did not work with the full address, but the street name alone was enough.
# 1 execution is enough here

#df_all['lat'] = df_all.apply(lambda x: lat(str(x['Street_Full'].split(',')[0])) if np.isnan(x['lat']) == True else x['lat'], axis=1)
#df_all['lon'] = df_all.apply(lambda x: lon(str(x['Street_Full'].split(',')[0])) if np.isnan(x['lon']) == True else x['lon'], axis=1)

Here is the preloaded dataset that I use for the rest of the analysis.

However, **DO NOT FORGET** to remove the 2 following cells if you want to use geopy in your own environment.

In [None]:
riga_fixed_coordinates = pd.read_csv('../input/riga-fixed-coordinates/riga_fixed_coordinates.csv')
missing(riga_fixed_coordinates)

In [None]:
df_all = riga_fixed_coordinates.copy()

In [None]:
# Parenthesise the whole condition: without it, ~ negates only the first comparison
df_all[~((df_all.lat>56.88)&(df_all.lat<57.1)&(df_all.lon>20))]

Phew, that was tiresome!  
  
But as we see, we successfully imputed all the coordinates and none of them are out of Riga boundaries now.  

Let's check it on 'map' again.

In [None]:
scatter(df_all, x = 'lon', y = 'lat')

**Splendid!**  

Now it's time to impute the remaining missing values.

Let's start with **area**.

In [None]:
df_all[df_all.area.isna()]

Let's see if there are other flats at the same address.

I assume that flats in the same building have more or less the same area and number of rooms, so we can use them as a proxy.
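
With only one missing value I fill it by hand below, but for a larger dataset the same idea can be automated. Here is a minimal sketch on made-up numbers (the `flats` frame and the `area_filled` column are hypothetical, not part of the real data): each missing area is replaced by the median area of the other flats at the same address.

```python
import pandas as pd

# Toy data: the areas are invented for illustration only.
flats = pd.DataFrame({
    "street": ["Slokas 130", "Slokas 130", "Slokas 130", "Zentenes 18"],
    "area": [78.0, 82.0, None, 50.0],
})

# Median area per address, broadcast back to every row (NaN is ignored).
building_median = flats.groupby("street")["area"].transform("median")

# Keep known areas, fill the gaps with the median of the same building.
flats["area_filled"] = flats["area"].fillna(building_median)
print(flats)
```

If an address has no other flats with a known area, the value stays missing and needs another fallback.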

In [None]:
df_all[df_all.street == 'Slokas 130']

In [None]:
# Therefore
df_all.loc[3981, 'area'] = 80.0

Now go with imputing rooms and use the same approach.

In [None]:
df_all[df_all.rooms.isna()]

In [None]:
df_all[df_all.street == 'Dārzaugļu 1']

The area is double, but I hesitate to put 8 rooms here.

Let's look at the median area by number of rooms:

In [None]:
df_all.groupby(['rooms']).area.median()

Now we can see why rooms feature is categorical.  
  
For some reason there is a **'Citi'** value in rooms that means 'other'.  
 
Let's try to figure out what it can give us.

In [None]:
df_all[df_all.rooms == 'Citi']

As we see, there are only 12 observations with this value of rooms and they vary a lot, but most of them have a comparatively large area, which makes them outliers.

I would assume that these observations have some unique specifics that can affect their price, but there are very few of them, so I just decided to drop them.

In [None]:
df_all = df_all.drop(df_all[df_all.rooms == 'Citi'].index, axis = 0).reset_index(drop = True)

In [None]:
# Look at the missing value again, as the index was reset
df_all[df_all.rooms.isna()]

So from what's left, we can impute '6' for our missing room. 

In [None]:
df_all.loc[1610, 'rooms'] = '6'

And now introduce numeric feature of rooms:

In [None]:
df_all['rooms_num']= df_all['rooms'].astype('int64')

Imputing total floors now.  
  
And again, same addresses should help.

In [None]:
df_all[df_all.total_floors.isna()]

In [None]:
df_all[df_all.street == 'Zentenes 18']

Value for imputation is clearly visible.

In [None]:
df_all.loc[1902, 'total_floors'] = 9.0

And now we're done with imputation!  
  
No missings left!

In [None]:
missing(df_all)

### EDA

In [None]:
plt.figure(figsize = (10, 6))
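# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
ax = sns.histplot(df_all.price, bins = 20, kde = True)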

Distribution range of price is wild. Why is it so?  
  
Let's try to find some clues in our features.

**Op_type**

In [None]:
simple_chart(df_all, x = 'op_type')
factor_chart(df_all, x = 'op_type', y = 'price', hue = None)

And here we found it!

Obviously, the monthly rent is much smaller than the sale price of a whole apartment, so these are clearly 2 different targets.

Therefore, as we have enough observations for both major categories, it is worth splitting the dataset into 2 parts for further analysis and modelling later on.

But before that, what about the other op_type values?

In [None]:
df_all[~df_all.op_type.isin(['For rent', 'For sale'])]

We can directly notice by the price value, which of the observations with other op_type actually represent either for rent or for sale types.  
  
So we fix them and leave only 2 major op_types in order to split the dataset later.

In [None]:
df_all.loc[~df_all.op_type.isin(['For rent', 'For sale']) & (df_all.price < 1000), 'op_type'] = 'For rent'

In [None]:
# >= so that a price of exactly 1000 is not left without a type
df_all.loc[~df_all.op_type.isin(['For rent', 'For sale']) & (df_all.price >= 1000), 'op_type'] = 'For sale'

In [None]:
simple_chart(df_all, x = 'op_type')

**Area**  
  
  
Expectedly, one of the most important features.

In [None]:
scatter(df_all[df_all.op_type == 'For sale'], x = 'area', y = 'price', hue = None)
scatter(df_all[df_all.op_type == 'For rent'], x = 'area', y = 'price', hue = None)

No surprise, a strong positive correlation.

However, we can notice that the dots are highly dispersed and there is definitely a heteroskedasticity problem here.

One of the ways to handle it is to use log1p of the target variable instead of the target itself. This reduces the influence of very large prices on the model and often improves accuracy.

Nevertheless, in my case it did not help as much as outlier removal, so here I will use only the latter.
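
For reference, here is what the log1p idea looks like on a handful of made-up prices (the numbers are invented, and this is only an illustration, not something used in the models below). The transform squeezes the range of the target, and `expm1` brings predictions back to the original scale.

```python
import math

# Invented prices spanning rent-like and sale-like magnitudes.
prices = [35.0, 450.0, 52000.0, 310000.0]

# Train on log1p(price) ...
log_prices = [math.log1p(p) for p in prices]

# ... and map predictions back with the inverse, expm1.
restored = [math.expm1(v) for v in log_prices]

spread_raw = max(prices) / min(prices)
spread_log = max(log_prices) / min(log_prices)
print("max/min on the raw scale:", round(spread_raw, 1))
print("max/min on the log scale:", round(spread_log, 1))
```

Keep in mind that a model trained on the log scale minimises errors on that scale, so its MAE in the original units should still be measured after converting predictions back.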

**Condition**

In [None]:
simple_chart(df_all, x = 'condition')
factor_chart(df_all[df_all.op_type == 'For sale'], x = 'condition', y = 'price', hue = None)
factor_chart(df_all[df_all.op_type == 'For rent'], x = 'condition', y = 'price', hue = None)

Apartments with all amenities have a higher price than those with partial or no amenities.

At the same time, the number of observations without amenities is negligibly small, so I just introduce a dummy for 'All amenities' to represent this difference in condition in the model.

In [None]:
df_all['All_Amen'] = 0
df_all.loc[df_all.condition == 'All amenities', 'All_Amen'] = 1

**Rooms**

In [None]:
simple_chart(df_all, x = 'rooms') 
factor_chart(df_all[df_all.op_type == 'For sale'], x = 'rooms', y = 'price', hue = None)
factor_chart(df_all[df_all.op_type == 'For rent'], x = 'rooms', y = 'price', hue = None)

Alright, no surprise that a higher number of rooms is correlated with higher prices.

It is surprising, though, that 6-room flats for rent are cheaper than 5-room ones; the impact of other features might be involved.

**Floor**

In [None]:
simple_chart(df_all, x = 'floor')
factor_chart(df_all[df_all.op_type == 'For sale'], x = 'floor', y = 'price', hue = None)
factor_chart(df_all[df_all.op_type == 'For rent'], x = 'floor', y = 'price', hue = None)

It looks like floor is correlated with price, so it will be included in the model.

Some observations seem to be outliers, but this may be due to the impact of other features, so we will have to keep an eye on it when removing outliers.

**Total floors**

In [None]:
simple_chart(df_all, x = 'total_floors')
factor_chart(df_all[df_all.op_type == 'For sale'], x = 'total_floors', y = 'price', hue = None)
factor_chart(df_all[df_all.op_type == 'For rent'], x = 'total_floors', y = 'price', hue = None)

Same conclusion here as for floors. Seems important, should be included, might have outliers.

**House seria**

In [None]:
simple_chart(df_all, x = 'house_seria')
factor_chart(df_all[df_all.op_type == 'For sale'], x = 'house_seria', y = 'price', hue = None)
factor_chart(df_all[df_all.op_type == 'For rent'], x = 'house_seria', y = 'price', hue = None)

Some house series have varying price levels, so I will turn them into dummies and track their impact in the models.

Same conclusion goes for **house type**

In [None]:
simple_chart(df_all, x = 'house_type')
factor_chart(df_all[df_all.op_type == 'For sale'], x = 'house_type', y = 'price', hue = None)
factor_chart(df_all[df_all.op_type == 'For rent'], x = 'house_type', y = 'price', hue = None)

Districts and streets might have their own specific price levels.

Some are prestigious and some are poor, and this can affect the price strongly, while area or distance from the center will not capture it.

Therefore, this categorical information might be very important, so I will turn them into dummies as well.

In [None]:
simple_chart(df_all, x = 'district')
factor_chart(df_all[df_all.op_type == 'For sale'], x = 'district', y = 'price', hue = None)
factor_chart(df_all[df_all.op_type == 'For rent'], x = 'district', y = 'price', hue = None)
scatter(df_all[df_all.op_type == 'For sale'], x = 'lon', y = 'lat', hue = 'district')
scatter(df_all[df_all.op_type == 'For rent'], x = 'lon', y = 'lat', hue = 'district')

**lat and lon**

In [None]:
scatter(df_all[df_all.op_type == 'For sale'], x = 'lat', y = 'price', hue = None)
scatter(df_all[df_all.op_type == 'For rent'], x = 'lat', y = 'price', hue = None)

In [None]:
scatter(df_all[df_all.op_type == 'For sale'], x = 'lon', y = 'price', hue = None)
scatter(df_all[df_all.op_type == 'For rent'], x = 'lon', y = 'price', hue = None)

Notice that the closer a dot is to the center, the higher the price.

Therefore, knowing the coordinates of the center of Riga, we can calculate each observation's distance from the center and use it as a feature.

In [None]:
Riga_Center_Lat = 56.949600
Riga_Center_Lon = 24.105200

In [None]:
import geopy.distance

In [None]:
def center_dist(lat_i, lon_i):
    # geopy.distance.vincenty was removed in geopy 2.0; geodesic is its replacement
    return geopy.distance.geodesic((Riga_Center_Lat, Riga_Center_Lon), (lat_i, lon_i)).km

In [None]:
df_all['center_dist'] = df_all.apply(lambda x: center_dist(x['lat'], x['lon']), axis = 1)
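
If you cannot install geopy, the distance can be approximated with the haversine formula. This is a swapped-in technique, not what the cell above does: it treats the Earth as a sphere (the mean radius below is my assumption), so it can differ from an ellipsoidal distance by a fraction of a percent, which should hardly matter for a within-city feature. `haversine_km` is a hypothetical helper name.

```python
import math

RIGA_CENTER = (56.949600, 24.105200)  # same center point as above
EARTH_RADIUS_KM = 6371.0088           # mean Earth radius (spherical model)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def center_dist_haversine(lat_i, lon_i):
    """Distance from the Riga center point, in kilometres."""
    return haversine_km(RIGA_CENTER[0], RIGA_CENTER[1], lat_i, lon_i)


# One degree of latitude is a handy sanity check.
one_degree = haversine_km(56.0, 24.0, 57.0, 24.0)
print(round(one_degree, 2), "km per degree of latitude")
```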

In [None]:
scatter(df_all[df_all.op_type == 'For sale'], x = 'center_dist', y = 'price', hue = None)
scatter(df_all[df_all.op_type == 'For rent'], x = 'center_dist', y = 'price', hue = None)

As expected, the correlation of center_dist with price is negative and clearly visible, so it will be included in the models.

Now let's introduce some other features that can further improve our predictions.

**Area_Room_Ratio** reflects how big the rooms in the apartment are. The assumption here is that people might prefer larger rooms.

In [None]:
df_all['Area_Room_Ratio'] = df_all.area / df_all.rooms_num

In [None]:
scatter(df_all[df_all.op_type == 'For sale'], x = 'Area_Room_Ratio', y = 'price', hue = None)
scatter(df_all[df_all.op_type == 'For rent'], x = 'Area_Room_Ratio', y = 'price', hue = None)

It seems my assumption was more or less correct and this feature is worth using. However, some outliers and strong heteroskedasticity remain a problem.

**Floor_Ratio**  
The second assumption is that people prefer flats located relatively higher up in a building, so such flats cost more regardless of how tall the building is.

In [None]:
df_all['Floor_Ratio'] = df_all.floor / df_all.total_floors 

In [None]:
scatter(df_all[df_all.op_type == 'For sale'], x = 'Floor_Ratio', y = 'price', hue = None)
scatter(df_all[df_all.op_type == 'For rent'], x = 'Floor_Ratio', y = 'price', hue = None)

And here we have an unexpected data mistake!  
  
Obviously, Floor_Ratio cannot be higher than 1, but we have such values

In [None]:
df_all[df_all['Floor_Ratio'] > 1]

It seems like floor and total_floors values are switched for these observations.  
  
Let's switch 'em back then.

In [None]:
df_all.loc[(df_all['Floor_Ratio'] > 1), 'floor'] = df_all['floor'] / df_all['Floor_Ratio']
df_all.loc[(df_all['Floor_Ratio'] > 1), 'total_floors'] = df_all['Floor_Ratio'] * df_all['total_floors']

In [None]:
df_all.loc[(df_all['Floor_Ratio'] > 1), 'Floor_Ratio'] = df_all['floor'] / df_all['total_floors']
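
The cells above restore the values by multiplying and dividing by the ratio, which works but is not obvious at first sight. A more direct way to express the same swap is sketched below on a toy frame (the `flats` data is made up): select the rows where floor exceeds total_floors and exchange the two columns.

```python
import pandas as pd

# Toy data: the second row has floor and total_floors switched.
flats = pd.DataFrame({
    "floor": [2.0, 9.0, 5.0],
    "total_floors": [5.0, 3.0, 5.0],
})

swapped = flats["floor"] > flats["total_floors"]

# .to_numpy() drops the column labels, so values are assigned by position
# and the two columns really trade places.
flats.loc[swapped, ["floor", "total_floors"]] = (
    flats.loc[swapped, ["total_floors", "floor"]].to_numpy()
)

flats["Floor_Ratio"] = flats["floor"] / flats["total_floors"]
print(flats)
```

Swapping directly also avoids the small floating-point errors that dividing and multiplying by the ratio can introduce.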

In [None]:
df_all[df_all['Floor_Ratio'] > 1]

Now we're good!

In [None]:
scatter(df_all[df_all.op_type == 'For sale'], x = 'Floor_Ratio', y = 'price', hue = None)
scatter(df_all[df_all.op_type == 'For rent'], x = 'Floor_Ratio', y = 'price', hue = None)

Some positive correlation is visible, so this feature is worth trying in the models.

OK, we're almost done with cleaning! What's left is only...

### Dropping outliers

First, it is time to split our dataset into 2 independent sets and work with them separately from now on.

In [None]:
df_sale = df_all[df_all.op_type == 'For sale'].reset_index(drop = True).copy()
df_rent = df_all[df_all.op_type == 'For rent'].reset_index(drop = True).copy()

Now we go for outlier detection.

I did it by looking at scatterplots and removing the dots that deviate strongly.

I decided not to use quantile-based outlier detection here: since the values are highly dispersed and our datasets are not really big, a big share of the information would be lost and the models would not really describe our data.

But if this dataset is expanded later on, I would definitely try a Box-Cox transformation to reduce the influence of outliers and build a more robust model.
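
For comparison, this is roughly what the quantile-based approach I decided against looks like. It is a sketch of the common 1.5 x IQR rule on invented rent prices (`iqr_bounds` is a hypothetical helper, and the multiplier 1.5 is the usual convention, not something tuned for this data).

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k interquartile ranges beyond Q1 and Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Invented monthly rents with one obvious outlier.
rents = [250, 300, 320, 350, 380, 400, 420, 450, 500, 2500]

low, high = iqr_bounds(rents)
kept = [r for r in rents if low <= r <= high]
print("fences:", low, high)
print("kept", len(kept), "of", len(rents), "observations")
```

On dispersed data like ours the fences cut off a lot of legitimate expensive flats, which is why I went with manual thresholds instead.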

In [None]:
scatter(df_sale, x = 'area', y = 'price', hue = None)
scatter(df_rent, x = 'area', y = 'price', hue = None)

In [None]:
scatter(df_rent[df_rent.price < 300], x = 'area', y = 'price', hue = 'center_dist')

Notice that in the rent dataset there is a relatively big chunk of dots below price = 100 that follow their own price pattern.

However, playing with the 'hue' parameter of the chart, I could not find an explanation for this group's deviant behavior.

Therefore, since the available features do not explain it, I treat these dots as outliers so they do not harm our model.

If you find an explanation for this, please let me know.

After some boring iterations of charts I ended up with the datasets below.

Unusually expensive or cheap flats surely have some other explanation for their deviant prices (such as the interior, for example), so they were removed together with the other highly deviating dots.

In [None]:
df_sale_clean = df_sale[(df_sale.price < 300000) & (df_sale.area <160)  
                  & (~((df_sale.price < 50000) &(df_sale.area > 80))) 
                 & (~((df_sale.price < 100000)&(df_sale.area > 130)))
                  & (df_sale.Area_Room_Ratio<80)
                 ].copy()

In [None]:
df_rent_clean = df_rent[(df_rent.price < 1390) & (df_rent.area <125) & (df_rent.price > 60) 
                  & (~((df_rent.price < 110) &(df_rent.area > 40))) 
                 & (~((df_rent.price < 400)&(df_rent.area > 100)))
                  & (~((df_rent.price > 1000)&(df_rent.area < 70)))
                  &(df_rent.Area_Room_Ratio < 65)
                 ].copy()

Now the charts look neater and the impact of outliers is highly reduced.

In [None]:
scatter(df_sale_clean, x = 'area', y = 'price', hue = None)
scatter(df_rent_clean, x = 'area', y = 'price', hue = None)

Drop useless columns

In [None]:
df_sale_clean.columns

In [None]:
df_sale_clean = df_sale_clean.drop(['op_type', 'street', 'rooms', 'condition', 'Street_New', 'Street_Full'], axis = 1)
df_rent_clean = df_rent_clean.drop(['op_type', 'street', 'rooms', 'condition', 'Street_New', 'Street_Full'], axis = 1)

Splitting into train and test sets

In [None]:
def get_splits(df):
    X_train, X_test, y_train, y_test = train_test_split(df.drop(['price'], axis = 1), 
                                                          df['price'], train_size=0.8, test_size=0.2, 
                                                          random_state = seed)
    return X_train, X_test, y_train, y_test

Get dummies for categorical features

In [None]:
OH_sale_clean = pd.get_dummies(df_sale_clean, drop_first = True)
OH_rent_clean = pd.get_dummies(df_rent_clean, drop_first = True)

In [None]:
OH_sale_train, OH_sale_test, OH_y_sale_train, OH_y_sale_test = get_splits(OH_sale_clean)
OH_rent_train, OH_rent_test, OH_y_rent_train, OH_y_rent_test = get_splits(OH_rent_clean)

Drop dummy columns that are all zeros in the train sets (their categories appear only in the test sets, so a model cannot learn anything from them)

In [None]:
cols_to_drop_sale = OH_sale_train.columns[(OH_sale_train == 0).all()]
OH_sale_train = OH_sale_train.drop(cols_to_drop_sale, axis = 1)
OH_sale_test = OH_sale_test.drop(cols_to_drop_sale, axis = 1)

In [None]:
cols_to_drop_rent = OH_rent_train.columns[(OH_rent_train == 0).all()]
OH_rent_train = OH_rent_train.drop(cols_to_drop_rent, axis = 1)
OH_rent_test = OH_rent_test.drop(cols_to_drop_rent, axis = 1)

And finally...

### Modelling

First, let's choose the most appropriate models for tuning

In [None]:
models = [RandomForestRegressor(random_state = seed), 
          Ridge(random_state = seed), 
          RidgeCV(), 
          Lasso(random_state = seed), 
          LassoCV(random_state = seed), 
          ElasticNet(random_state = seed),
          HuberRegressor(), 
          KernelRidge(), 
          GradientBoostingRegressor(random_state = seed), 
          ExtraTreesRegressor(random_state = seed), 
          XGBRegressor(random_state = seed)]

In [None]:
models_names = [str(i).split('(')[0] for i in models]

In [None]:
def models_summary(train, test, y_train, y_test):
    models_MAE = []
    models_RMSE = []
    models_RMSLE = []
    for model in models:
        model.fit(train, y_train)
        preds = model.predict(test)
        models_MAE.append(mean_absolute_error(y_test, preds))
        models_RMSE.append(np.sqrt(mean_squared_error(y_test, preds)))
        models_RMSLE.append(np.sqrt(mean_squared_log_error(y_test, abs(preds))))
    return pd.DataFrame(list(zip(models_names, models_MAE, models_RMSE, models_RMSLE)),
              columns=['models','MAE', 'RMSE', 'RMSLE']).sort_values(by = 'MAE').set_index('models')
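
In case the three metrics in the summary table are unfamiliar, here they are written out in plain Python on made-up numbers. This is only an illustration of the formulas; the notebook itself uses the scikit-learn functions. RMSLE is computed on log(1 + y), which is how `mean_squared_log_error` defines it.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in price units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalises large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmsle(y_true, y_pred):
    """Root mean squared log error: measures relative rather than absolute misses."""
    return math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred))
        / len(y_true)
    )


# Invented prices and predictions.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

print("MAE  :", round(mae(y_true, y_pred), 3))
print("RMSE :", round(rmse(y_true, y_pred), 3))
print("RMSLE:", round(rmsle(y_true, y_pred), 4))
```

RMSLE needs non-negative inputs, which is why the summary function above passes `abs(preds)` to it.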

For sale:

In [None]:
models_sale_res = models_summary(OH_sale_train, OH_sale_test, OH_y_sale_train, OH_y_sale_test)
models_sale_res

For rent:

In [None]:
models_rent_res = models_summary(OH_rent_train, OH_rent_test, OH_y_rent_train, OH_y_rent_test)
models_rent_res

It seems that **ExtraTrees** has the lowest MAE, RMSE and RMSLE for both sets and is the best fit for our dispersed data.

Of course, these results do not take tuning into account, and the famous XGBoost or GradientBoosting could potentially beat ExtraTrees.

However, I've tuned them myself and ExtraTrees was still the best for my datasets.

Therefore, I will show results only for this model, but if you manage to beat my scores with XGB, please let me know!

### Tuning ExtraTrees
For this purpose I will use the combined dataset and run GridSearch with 5-fold cross-validation. This gives an average error score over splits with the same train/test proportion as we set before (80/20).

Below I have already selected the combination of parameters that I consider the best trade-off between accuracy and run time, but you can play with your own parameters here.

Note that you can achieve better scores by setting a higher n_estimators, but you'd have to wait longer.

In [None]:
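# random_state only has an effect with shuffle=True (recent scikit-learn versions raise an error otherwise)
kf = KFold(n_splits=5, shuffle=True, random_state=seed)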

In [None]:
ET_model = ExtraTreesRegressor(random_state = seed,
                                n_estimators=400, 
                                min_samples_split=2,
                                min_samples_leaf=1, 
                                max_features=200,
                              )

params_grid = {#'n_estimators': range(50,201,50),
               #'max_features': range(50,401,50),
               #'min_samples_split': range(2,5),
               #'min_samples_leaf': range(1,4)
              }  

ET_grid = GridSearchCV(estimator = ET_model, param_grid = params_grid, n_jobs = -1,
                               cv = kf, scoring = 'neg_mean_absolute_error')

In [None]:
ET_grid_sale = ET_grid.fit(OH_sale_clean.drop(['price'], axis = 1), OH_sale_clean.price)
print(ET_grid_sale.best_params_)
print(ET_grid_sale.best_score_)  # negative MAE, because scoring = 'neg_mean_absolute_error'

In [None]:
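# fit() returns the same object, so fit a fresh copy to keep ET_grid_sale intact
from sklearn.base import clone
ET_grid_rent = clone(ET_grid).fit(OH_rent_clean.drop(['price'], axis = 1), OH_rent_clean.price)
print(ET_grid_rent.best_params_)
print(ET_grid_rent.best_score_)  # negative MAE, because scoring = 'neg_mean_absolute_error'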

So for the sale dataset we can expect an **MAE** around **12200** and for rent an **MAE** around **66.32** on average across the folds.  
 
By increasing n_estimators, the MAE can be reduced by ~150 and ~4 for the sale and rent datasets respectively.
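One detail that is easy to trip over: with `scoring = 'neg_mean_absolute_error'` the grid search reports a negative number, and the MAE quoted in the text is its absolute value. Here is a small sketch on synthetic data (everything named `*_demo`, the `make_regression` data and the tiny grid are my assumptions for illustration, not the notebook's real setup).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression problem standing in for the housing data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=42)

demo_grid = GridSearchCV(
    estimator=ExtraTreesRegressor(random_state=42),
    param_grid={'n_estimators': [20, 50]},  # tiny illustrative grid
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring='neg_mean_absolute_error',
)
demo_grid.fit(X_demo, y_demo)

# Flip the sign to read the scores as ordinary (positive) MAE values
cv_mae = -demo_grid.best_score_
per_candidate_mae = -demo_grid.cv_results_['mean_test_score']
print(demo_grid.best_params_, round(cv_mae, 2))
```

The best candidate is the one with the highest (least negative) score, which is the same as the lowest MAE after flipping the sign.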

Let's calculate our results for our selected train and test sets:

In [None]:
ET_sale = ExtraTreesRegressor(random_state = seed,
                                n_estimators=400, 
                                min_samples_split=2,
                                min_samples_leaf=1, 
                                max_features=200,
                              )
ET_rent = ExtraTreesRegressor(random_state = seed,
                                n_estimators=400, 
                                min_samples_split=2,
                                min_samples_leaf=1, 
                                max_features=200,
                              )

In [None]:
def score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)

In [None]:
score(ET_sale, OH_sale_train, OH_sale_test, OH_y_sale_train, OH_y_sale_test)

In [None]:
score(ET_rent, OH_rent_train, OH_rent_test, OH_y_rent_train, OH_y_rent_test)

Here we achieve **MAE 11473** for sale houses and **64.62** for rent, so tuning has slightly improved our score compared to the baseline model, where we had MAE 11501 for sale and 65.25 for rent.
 
I think it's not so bad, but I can't say for sure yet (waiting for your kernel here!)

Now I'd like to take a closer look at the performance of our models.

In [None]:
def results(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    res_tab = pd.DataFrame({'y_test': y_test, 'preds': preds, 'error': (preds - y_test),
             'error_share': abs(y_test - preds)/y_test}).sort_values(by = 'error_share', ascending = False)
    return res_tab

In [None]:
sale_results = results(ET_sale, OH_sale_train, OH_sale_test, OH_y_sale_train, OH_y_sale_test)
rent_results = results(ET_rent, OH_rent_train, OH_rent_test, OH_y_rent_train, OH_y_rent_test)

In [None]:
sale_results[:10]

In [None]:
rent_results[:10]

To make everything visible, let's plot error and error_share sorted from highest to lowest.

In [None]:
def error_lines(df, y):
    plt.figure(figsize = (10, 6))
    ax = sns.lineplot(x = range(len(df)), y = y, data = df.sort_values(by = [y], ascending = False))

For sale:

In [None]:
error_lines(sale_results, 'error')

In [None]:
error_lines(sale_results, 'error_share')

The right and left sides of the error graph are symmetric and the graph is centered around 0.  
 
This suggests that our model does not systematically over- or underestimate the price and that most of the large errors are explained by outliers.  
 
The other graph shows that the error is below 10% for half of the set and below 25% for nearly 80% of the set.  
  
So yeah, not so bad! 
  
I guess...
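If you prefer numbers to eyeballing the plots, the same checks can be computed directly from a results table. This is only a sketch on simulated data: `demo_results` mimics the columns returned by `results()`, but the prices and the 15% noise level are invented for illustration.

```python
import numpy as np
import pandas as pd

# Simulated prices and predictions with symmetric relative noise (assumption)
rng = np.random.default_rng(42)
y_true = rng.uniform(50_000, 300_000, size=1_000)
preds = y_true * (1 + rng.normal(0, 0.15, size=1_000))

# Same columns as the table built by results()
demo_results = pd.DataFrame({
    'y_test': y_true,
    'preds': preds,
    'error': preds - y_true,
    'error_share': np.abs(y_true - preds) / y_true,
})

# Bias check: a median error near 0 and ~50% overestimates mean no systematic bias
median_error = demo_results['error'].median()
share_over = (demo_results['error'] > 0).mean()

# Share of observations whose relative error stays under a threshold
share_below_10 = (demo_results['error_share'] < 0.10).mean()
share_below_25 = (demo_results['error_share'] < 0.25).mean()
print(median_error, share_over, share_below_10, share_below_25)
```

Running the last few lines on `sale_results` or `rent_results` instead of `demo_results` should give the exact percentages behind the statements above.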

For rent:

In [None]:
error_lines(rent_results, 'error')

In [None]:
error_lines(rent_results, 'error_share')

Here we see that the model for rental apartments tends to underestimate the y value, so more careful work with low-price outliers can help to reduce the error rate greatly.  
 
Still, the model performs quite well, with an error share below 20% for 80% of the observations. 

### CONCLUSION 

Ok, so far I've managed to build a model that is seemingly not bad.  
 
To get better scores I've tried to tune all of the mentioned models, used log1p to handle the heteroscedasticity problem that our values clearly have, tried different combinations of the mentioned features, etc.  
 
Here I present you the best of mine, but I believe you can do it better!  
 
To give you some ideas:  
 
1) A Box-Cox transformation can dampen more of the outliers, so the model will be more robust to them (but it may lose some explanatory power)  
 
2) Try other 'fashionable' models like LightGBM and CatBoost. Due to some problems with my python packages I could not install them properly, so maybe they will outperform ExtraTrees here  
 
3) My model clearly has too many features, so I suspect it suffers from the curse of dimensionality. 
Most of them definitely come from the huge number of street dummies. Not using them would hurt my model's performance, but maybe there is a way to cleverly choose or group some of them.  

4) To tackle heteroscedasticity you can apply log1p to features like area, the Area/Room ratio and the Floor/Max floor ratio, or to the target variable itself. This can mitigate the negative effect of outliers, which are definitely worsening my score a lot.
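For idea 4, one convenient way to apply log1p to the target is scikit-learn's `TransformedTargetRegressor`, which fits on the transformed target and converts predictions back automatically. The sketch below uses synthetic, right-skewed data (the `make_regression` data, the `exp` trick to make the target skewed, and all variable names are my assumptions for illustration); it only shows the mechanics and makes no claim about how much it would help on this dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic features with a strictly positive, right-skewed target (assumption)
X_demo, y_raw = make_regression(n_samples=300, n_features=8,
                                noise=5.0, random_state=42)
y_demo = np.exp(y_raw / y_raw.std())

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# The regressor is trained on log1p(y); predictions are mapped back with expm1
log_model = TransformedTargetRegressor(
    regressor=ExtraTreesRegressor(n_estimators=100, random_state=42),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_tr, y_tr)
log_preds = log_model.predict(X_te)

# MAE is computed on the original price scale, so it stays comparable
print(mean_absolute_error(y_te, log_preds))
```

Because the predictions come back on the original scale, the MAE can be compared directly with the scores of a model trained on the raw target.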

That's all from me, guys!  
 
Let me know if you find any better solutions or spot my mistakes, and if you have any advice on my model, stats and code overall (I know it's not state of the art at all, so your criticism is highly welcome) 
 
Peace!