# Introduction

We are going to explore the 'Madrid real estate market' dataset. The goal is to select and clean the appropriate features for a machine learning project, namely predicting the price of a house.

We will learn techniques for dealing with missing values and preparing data for the algorithm.

This is a long notebook starting with over 50 attributes.

In [None]:
#Import libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

pd.set_option("display.max_columns", None)    

Let's check the structure of the file

In [None]:
!head -2 '../input/madrid-real-estate-market/houses_Madrid.csv'

The first line contains the columns' names. The first column, with no name, is the index. Columns are separated by commas. We include all this information when reading the file and we take a look at the data.

In [None]:
data = pd.read_csv('../input/madrid-real-estate-market/houses_Madrid.csv', sep=',', header=0, index_col=0)
data.head()

Let's find out the size of the table and learn a bit about the columns and their elements.

In [None]:
print("The number of rows is {} and the number of columns is {}".format(data.shape[0], data.shape[1]))
print("---------------------------------------------------------------")
print(data.info(verbose=True))


We have 21742 rows (houses) and 57 columns (attributes). Only 13 columns have no missing values, and 10 columns contain *only* missing values. We'll get rid of those first and analyse the rest afterwards.
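As a quick illustration of how such counts can be obtained, here is a minimal sketch on a made-up table. The `toy` DataFrame and its column names are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one complete column, one partly missing, one empty
toy = pd.DataFrame({
    "complete": [1, 2, 3],
    "partial": [1.0, np.nan, 3.0],
    "empty": [np.nan, np.nan, np.nan],
})

missing = toy.isnull().sum()             # missing values per column
n_complete = (missing == 0).sum()        # columns without missing values
n_empty = (missing == len(toy)).sum()    # columns with only missing values

print(missing)
print("Complete columns:", n_complete, "- empty columns:", n_empty)
```

The same three lines should work on the real table by replacing `toy` with `data`.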

The data types are object, boolean and numeric. Judging by the column names, some types look wrong: for example, all the 'has_*something*' columns are object when they should be boolean. We'll look into that too.

Let's check if there are duplicates.

In [None]:
data.duplicated().any()

'is_exact_address_hidden', 'street_name' and 'street_number' are not necessary. We already have that information in 'raw_address'.

'rent_price' is not useful for this analysis because we'll be focusing on buying.

'is_rent_price_known' and 'is_buy_price_known' don't give any interesting information because they only have one value.

We'll eliminate them, as well as columns filled with missing values.


In [None]:
data.drop(columns=['is_exact_address_hidden', 'street_name', 'street_number', 'is_rent_price_known', 'is_buy_price_known', 'rent_price'], inplace=True)
#Drop columns with missing values in every row
data.dropna(axis=1, how='all', inplace=True)
print("The new number of columns is {}".format(data.shape[1]))

Let's take a look at the correlation between numerical features, knowing that our target is 'buy_price'.

In [None]:
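#Keep only numeric and boolean columns; recent pandas versions raise an error on text columns
corr = data.iloc[:,1:].select_dtypes(include=['number', 'bool']).corr()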
f, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(corr ,annot=True, fmt='.1g',center=0) 
plt.show()

## First impressions:
There are no significant negative correlations: they are all positive or close to 0.

There's a perfect correlation between 'sq_mt_built' and 'sq_mt_useful', which makes sense because they are two similar ways of measuring a house.

'buy_price' is positively correlated with these two attributes, which shows that the bigger the house, the more expensive it is.

However, 'buy_price' has practically no correlation (-0.03) with 'sq_mt_allotment', the size of the whole plot (garden, pool, etc.), which is mainly used for detached houses. Bear in mind that this column has very few non-null values.

'buy_price' is also correlated with the number of rooms and bathrooms, which are in turn correlated with the size of the house.

In [None]:
y = data['buy_price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=stats.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=stats.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=stats.lognorm)
plt.show()


## Now, we'll analyse the columns.

'title' includes the full address and the type (piso, casa,...) and 'subtitle' describes the area in Madrid, both with no missing values. We'll leave them like that for the moment while we study the rest of the data.

'sq_mt_built' and 'sq_mt_useful' are two ways of measuring the size of a house. They are equivalent and 'sq_mt_built' has fewer missing values (126 vs over 13000). The first option would be to drop the second column, but let's check the three attributes related to size.

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(10,2.5), dpi=100, sharex=True, sharey=True)

ax[0].hist(data.sq_mt_built, bins=100, color='b')
ax[0].set_title('Built')
ax[1].hist(data.sq_mt_useful, bins=100, color='g')
ax[1].set_title('Useful')
ax[2].hist(data.sq_mt_allotment, bins=100, color='r')
ax[2].set_title('Allotment')
plt.show()

m² built and m² useful have similar right-skewed distributions with many small houses. It's more difficult to see the shape for m² allotment.
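Right skew can also be checked numerically. The sketch below uses invented sizes (not values from the dataset) and shows the two usual symptoms: a mean well above the median and a positive skewness coefficient.

```python
import pandas as pd

# Hypothetical sizes in m²: many small homes and a few very large ones
sizes = pd.Series([45, 50, 55, 60, 65, 70, 80, 90, 400, 900])

print("mean:  ", sizes.mean())
print("median:", sizes.median())
print("skew:  ", sizes.skew())   # positive for a right-skewed sample
```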

Let's try another type of plot.

In [None]:
%matplotlib inline
f, ax = plt.subplots(figsize=(8, 6.5))

data.boxplot(column=['sq_mt_built', 'sq_mt_useful', 'sq_mt_allotment'])
plt.show()

Boxplots show more information, although they may be harder to read. Let's take it one step at a time.

* The boxes for m² built and m² useful are shorter, which means that their central values are grouped together (their interquartile ranges are smaller), with the median closer to the first quartile. We saw this in the previous plot with their high and narrow peaks.
* However, m² allotment's box is bigger, showing that its values are more scattered (the interquartile range is bigger), and the median is closer to the third quartile.
* This also shows that the majority of the values for m² built and m² useful are between 100 and almost 200 m² (more or less), while the majority of the values for m² allotment are in a wider range, between 0 and almost 400 m².
* The longer upper whisker for m² allotment shows a very long right tail.
* All of them have outliers (the circles). These are values that lie more than 1.5 times the interquartile range (the distance between the third and the first quartile) beyond the edges of the box.
* All of this shows (for the first two) that while the majority of the houses are small (75% are less than 200 m²), there are still many items (25%) that are much bigger.
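The outlier rule behind the circles can be sketched in a few lines. The sizes below are invented for the example, and the 1.5 factor is the usual boxplot convention (matplotlib's default whisker setting, as far as I know).

```python
import numpy as np

# Hypothetical sizes in m²
sizes = np.array([60, 70, 80, 90, 100, 110, 120, 130, 600])

q1, q3 = np.percentile(sizes, [25, 75])   # first and third quartiles
iqr = q3 - q1                             # interquartile range (height of the box)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences would be drawn as a circle
outliers = sizes[(sizes < lower_fence) | (sizes > upper_fence)]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```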

Let's take a closer look at the limits of these three attributes.

In [None]:
data.agg({'sq_mt_useful': ['min', 'max'], 'sq_mt_built': ['min', 'max'], 'sq_mt_allotment': ['min', 'max']})

Wow! There is something weird here. Houses with 1 m² useful and 1 m² allotment?

In [None]:
data[['sq_mt_useful', 'sq_mt_built', 'sq_mt_allotment']].describe()

It looks like values of 1.000 or more were cut off at the thousands separator. Let's see the smallest values of m² built:

In [None]:
data.query('sq_mt_built<23.0')[['title','sq_mt_built', 'sq_mt_useful', 'n_rooms']]

The first one is weird: a 16 m² flat with 3 rooms. The rest make sense: they are small flats with no rooms ('Estudio') or just one room.

Let's see the smallest values for square meter useful:

In [None]:
data.query('sq_mt_useful<15.0')[['title','sq_mt_built', 'sq_mt_useful', 'n_rooms', 'n_bathrooms']]

There are certainly some odd relationships between these two columns, with very small numbers for m² useful. For the houses ('Casa o chalet'), judging by the number of rooms and bathrooms, the real values may be in the thousands.

We can use two other attributes to find the right size: 'buy_price' and 'buy_price_by_area'. Neither has missing values.

In [None]:
f, ax = plt.subplots(figsize=(8, 6.5))

sns.scatterplot(data=data, x='buy_price', y='buy_price_by_area', hue='sq_mt_built', palette='crest', ax=ax)
plt.show()

In most cases, prices are under 300000€.

Let's fill in the missing values of 'sq_mt_built' by dividing 'buy_price' by 'buy_price_by_area', confirm that none are left, and then get rid of 'sq_mt_useful':

In [None]:
rel_built = data['sq_mt_built'].isnull()
data.loc[(rel_built), 'sq_mt_built'] = (data['buy_price'] / data['buy_price_by_area'])

print('The final number of missing values for sq_mt_built is: {}'.format(data.sq_mt_built.isnull().sum()))

data.drop(columns=["sq_mt_useful"], inplace=True)

Now every single house has its size. Next one:

## sq_mt_allotment

Let's see the relationship between 'sq_mt_allotment' and the house's type:

In [None]:
data.house_type_id = data.house_type_id.astype('str')

In [None]:
allotment_notnull = ((data.sq_mt_allotment.notnull())& (data.house_type_id.str.contains('Casa'))).sum()
print("The number of houses ('Casa') with a value in 'sq_mt_allotment' is {} out of a total of 1432 non-null values".format(allotment_notnull))

Which means that this column, m² allotment, is almost exclusively used for detached houses. Let's see the other three rows.

In [None]:
data[(data.sq_mt_allotment.notnull())& (~data.house_type_id.str.contains('Casa'))][['title', 'sq_mt_built', 'sq_mt_allotment', 'n_floors', 'house_type_id']]

It's a country house and two fields.

These three rows could be dropped if there are no more elements of the same type. We'll check 'house_type_id' later.

Square meter allotment is a challenge because it's difficult to tell the right size for many houses. 

The histogram shows almost 500 houses with very small plots. Checking in more detail, we find that the majority are between 1 and 5 square meters, which makes no sense.

In [None]:
data['sq_mt_allotment'].plot.hist(bins=40)
plt.show()

In [None]:
print("Number of houses with less than 6 m²: {}".format(data.query('sq_mt_allotment<6').shape[0]))
print("Number of houses between 4 and 10 m²: {}".format(data.query('4<sq_mt_allotment<10').shape[0]))
print("Number of houses between 9 and 15 m²: {}".format(data.query('9<sq_mt_allotment<15').shape[0]))
print("Number of houses between 15 and 20 m²: {}".format(data.query('15<sq_mt_allotment<20.0').shape[0]))
print("Number of houses between 19 and 25 m²: {}".format(data.query('19<sq_mt_allotment<25').shape[0]))
print("Number of houses between 29 and 35 m²: {}".format(data.query('29<sq_mt_allotment<35').shape[0]))

In [None]:
data.query('sq_mt_allotment<6')[['title', 'subtitle', 'sq_mt_built', 'n_rooms', 'n_bathrooms', 'n_floors', 'sq_mt_allotment', 'buy_price']].head()

It seems weird for a house of over 500 m² and over a million euros to have 3 m² of garden/land. So we may hypothesize that plots of a thousand square meters or more lost everything after the thousands separator (the point), as happened with square meters built.
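We don't know how the original file was produced, so the following is only a sketch of one way this kind of loss could happen: a number written with "." as thousands separator is read as a decimal. The strings are invented for the example.

```python
import pandas as pd

# Hypothetical raw strings that use "." as thousands separator
raw = pd.Series(["3.000", "250", "1.200"])

# Read naively, the dot is taken as a decimal point
as_decimal = raw.astype(float)

# Removing the separator before converting recovers the intended sizes
as_thousands = raw.str.replace(".", "", regex=False).astype(int)

print(pd.DataFrame({"raw": raw, "as_decimal": as_decimal, "as_thousands": as_thousands}))
```

In this sketch a 3.000 m² plot ends up as 3 m², which would match the pattern seen in the table.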

However, the solution here is not easy, because we may have big, expensive houses with plots of over 20.000 m² and others with just a small inner courtyard.

So, in this case, we'll apply two strategies:

1. Houses with less than 10 m² allotment: multiply the size by 1000.
2. Houses with between 10 and 30 m² allotment: we need two parameters, the plot's size and the price of the house. We multiply by 1000 only when the price is over a million euros.

We'll set the limit at 30 m² because no house with over 30 m² allotment is expensive enough to have a plot of 30.000, 40.000 or more square meters.

In [None]:
rel_allotment1 = (data['sq_mt_allotment'] <10)
data.loc[(rel_allotment1), 'sq_mt_allotment'] = (data['sq_mt_allotment'] * 1000)

In [None]:
rel_allotment2 = (data['sq_mt_allotment'] <30) & (data['buy_price'] > 1000000)
data.loc[(rel_allotment2), 'sq_mt_allotment'] = (data['sq_mt_allotment'] * 1000)

In [None]:
print(data.sq_mt_allotment.describe())
data.query('sq_mt_allotment<30')[['title', 'subtitle', 'sq_mt_built', 'n_rooms', 'n_bathrooms', 'n_floors', 'sq_mt_allotment', 'buy_price']].head()

Now we can see that the smallest value is 10 and the biggest is 21.000 m². (Checking real estate web pages, I found houses with these characteristics.)

The size and price of the houses still under 30 m² justify leaving their 'sq_mt_allotment' unchanged.

For the rest of the houses, we'll replace the missing value NaN with 0, because we assume they don't have gardens of any kind.

In [None]:
data['sq_mt_allotment'] = data['sq_mt_allotment'].fillna(0)
print("Number of missing values in 'sq_mt_allotment': {}".format(data.sq_mt_allotment.isnull().sum()))

## house_type_id

Let's deal with 'house_type_id' to add a new class we'll need.

In [None]:
data.house_type_id.value_counts() 

In [None]:
print(data[(data.title.str.contains('Estudio')) & (data.house_type_id.str.contains('nan'))].shape[0])
rel_housetype = ((data.title.str.contains('Estudio')) & (data.house_type_id.str.contains('nan')))

So 388 out of these 391 missing values contain the word 'Estudio' in 'title'. Let's change the value by adding a new type: "HouseType 3: Estudio".

The other 3 are the ones we found earlier and we can eliminate.

In [None]:
data.loc[(rel_housetype), 'house_type_id'] = "HouseType 3: Estudio"

data.drop(index=[7578, 8400, 8423], inplace=True)

Let's check if all these 'Estudios' have 0 rooms:

In [None]:
data[(data.house_type_id.str.contains('Estudio') & (data.n_rooms == 0))].shape[0]

No, there are 3 houses with at least one room. Let's see them.

In [None]:
rel_estudios = (data.house_type_id.str.contains('Estudio') & (data.n_rooms > 0))
data[rel_estudios]

Let's change these three into flats ('Pisos').

In [None]:
data.loc[(rel_estudios), 'house_type_id'] = "HouseType 1: Pisos"

In [None]:
print("Data on 'Estudios'", data.loc[data.house_type_id.str.contains('Estudio')][['sq_mt_built', 'buy_price']].describe(), sep="\n") 
print("Data on 'Pisos' flats", data.loc[data.house_type_id.str.contains('Piso')][['sq_mt_built', 'buy_price']].describe(), sep="\n")

Surprise, surprise. A 300 m² studio! This is weird. Let's take a look.

In [None]:
data[(data.house_type_id.str.contains('Estudio')) & (data.floor == 'Bajo') & (data.sq_mt_built > 100)][['title', 'subtitle', 'sq_mt_built', 
                                                                                                        'n_bathrooms', 'buy_price']].sort_values(by='sq_mt_built')

Aha! There are two types of flats in this category: the small ones and the big ones, which may come from a former shop converted into an apartment. All of them have the same number of bathrooms.

Finally, this is the distribution of house types versus prices.

In [None]:
f, ax = plt.subplots(figsize=(9, 6))

sns.stripplot(y='house_type_id', x='buy_price', data=data, ax=ax)
plt.show()

## n_rooms

Let's continue with the number of rooms and bathrooms.

'n_rooms' doesn't have any missing values and it's of type integer.

In [None]:
f, ax = plt.subplots(figsize=(8, 6.5))

sns.scatterplot(data=data, x='n_rooms', y='buy_price', hue='house_type_id', style='house_type_id', palette='crest', ax=ax)
plt.show()

The correlation between rooms and price was 0.6, and the plot helps explain this value: the price increases with the number of rooms, but only up to a point. Beyond 5 rooms there isn't a significant increase.

House types 2 and 5 with 4 to 10 rooms are the most expensive.

In [None]:
data.n_rooms.value_counts()

Three houses with 24 rooms. Wow!

There are 439 houses with no rooms which should be 'Estudios' but we only had 385 'Estudios'.

Let's check the rest of these houses with no rooms.

In [None]:
rel_noestudios = (data.n_rooms == 0) & (~data.house_type_id.str.contains('Estudio'))
data.loc[(rel_noestudios)]['house_type_id'].value_counts()

There's something wrong here. It's very weird for detached houses or duplex not to have rooms.

In [None]:
rel_casa_norooms = data.loc[(rel_noestudios) & (data.house_type_id == "HouseType 2: Casa o chalet")]
rel_casa_norooms

Let's check similar houses in Fuencarral to impute the missing values.

In [None]:
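#Both words must appear in the title (a second positional argument to str.contains would be read as the 'case' flag)
data[(data.subtitle.str.contains('Fuencarral')) & (data.title.str.contains('Casa')) & (data.title.str.contains('independiente')) & 
     (250 < data.sq_mt_built) & (data.sq_mt_built < 350)]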

We'll use the median for number of rooms, because the mode, 5, may be a bit too much. For bathrooms, we'll use the mode.
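The difference between the two statistics can be seen on a small invented sample (these are not the actual Fuencarral values):

```python
import pandas as pd

# Hypothetical room counts of comparable houses
rooms = pd.Series([3, 3, 4, 4, 5, 5, 5])

median_rooms = rooms.median()        # middle value of the sorted sample
mode_rooms = rooms.mode().iloc[0]    # most frequent value

print("median:", median_rooms, "- mode:", mode_rooms)
```

Here the mode is pulled to the most repeated value, while the median stays in the middle of the sample, which is why the median can be the more conservative choice.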

In [None]:
data.loc[9347,'n_rooms'] = 4
data.loc[9347,'n_bathrooms'] = 3

There are no other 'chalets' in Chamartin as big as this one, so we'll go for size instead of location.

In [None]:
data[(data.title.str.contains('pareado')) & (700 < data.sq_mt_built) & (data.sq_mt_built < 800)]

In [None]:
#We'll use the median for all three.
data.loc[14045, 'n_rooms'] = 6
data.loc[14045, 'n_bathrooms'] = 7
data.loc[14045, 'n_floors'] = 4

In [None]:
rel_atico_norooms = data.loc[(rel_noestudios) & (data.house_type_id == "HouseType 5: Áticos")]
rel_atico_norooms

All of them, except one, are under 65 m². They are small flats with one bathroom.

Let's check all of this kind.

In [None]:
data[(data.title.str.contains('Ático')) & (data.sq_mt_built < 60)]['n_rooms'].value_counts()

Considering their size, these could be 'Estudios' with no rooms on the top floor of a building. We'll leave them like that.

In [None]:
rel_pisos_norooms = data.loc[(rel_noestudios) & (data.house_type_id == "HouseType 1: Pisos")]
rel_pisos_norooms

Some look like 'Estudios' on the ground floor, others are small... 

It's difficult to tell if these values are wrong or not, so for the moment, we'll leave them like that.

In [None]:
rel_duplex_norooms = data.loc[(rel_noestudios) & (data.house_type_id == "HouseType 4: Dúplex")]
rel_duplex_norooms

Looking at real estate web pages, it's possible to have big flats with no rooms. They are normally lofts, recently converted or in need of repairs. So we'll leave them like that, with no rooms.

## n_bathrooms

Let's check the bathrooms. 'n_bathrooms' is missing a few values, and it's of type float (pandas stores an integer column with NaN as float), although a number of bathrooms should be an integer.


In [None]:
f, ax = plt.subplots(figsize=(8, 6.5))

sns.scatterplot(data=data, x='n_bathrooms', y='buy_price', hue='house_type_id', style='house_type_id', palette='crest', ax=ax)
plt.show()

In [None]:
data.n_bathrooms.value_counts()

In [None]:
data.loc[data.n_bathrooms.isnull(),['title', 'subtitle','sq_mt_built', 'n_rooms','n_floors', 'floor', 'buy_price', 'house_type_id']]

The majority of the missing values belong to flats, plus 1 duplex and 2 studios.

Let's see what is the most common number for each type.

In [None]:
data[data.house_type_id.str.contains('Piso') | data.house_type_id.str.contains('Estudio') | 
     data.house_type_id.str.contains('Dúplex')].groupby(['house_type_id', 'n_bathrooms']).agg({'n_bathrooms': ['count']}).unstack()

We'll assign 1 to flats, 2 to duplexes and 1 to studios.
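As an aside, the same idea (fill each gap with the most common value of its own category) can be sketched without writing the values by hand. This is an alternative on a made-up table, not the method used in this notebook; only the two column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same two column names as the notebook
toy = pd.DataFrame({
    "house_type_id": ["Pisos", "Pisos", "Pisos", "Dúplex", "Dúplex", "Dúplex"],
    "n_bathrooms": [1, 1, np.nan, 2, np.nan, 2],
})

# Most frequent number of bathrooms per house type (mode ignores NaN)
mode_by_type = toy.groupby("house_type_id")["n_bathrooms"].agg(lambda s: s.mode().iloc[0])

# Fill each missing value with the mode of its own house type
toy["n_bathrooms"] = toy["n_bathrooms"].fillna(toy["house_type_id"].map(mode_by_type))
print(toy)
```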

We'll define a function to fill missing values of every house's type.

In [None]:
def fill_missing(column_to_change, column_ref, **kwargs):
    '''Fill missing values in a column by grouping them to categories in another column.
       Parameters: column to change; column used as reference; dictionary with pairs category:new_value
       Returns: Nothing, the changes are done in place. Outputs progress.'''
    
    for type_house, new_number in kwargs.items():
        #Select the null rows to change of a specific category
        rel_no = (data[column_ref].str.contains(type_house)) & (data[column_to_change].isnull())
        #Apply the new value
        data.loc[(rel_no), column_to_change] = new_number
        print('Done ' + type_house)

In [None]:
bathrooms_data = {'Dúplex': 2, 'Piso': 1, 'Estudio': 1}
fill_missing('n_bathrooms', 'house_type_id', **bathrooms_data)

In [None]:
#Finally, we change the type
data = data.astype({'n_bathrooms' : 'int64'})

## n_floors

Next one: 'n_floors' is useful when it's a house not a flat.

In [None]:
print(data.n_floors.notnull().sum())
data[(data.n_floors.notnull())]['house_type_id'].value_counts()

The only ones with values are houses. We'll have to add 1 floor for 'Estudios', top floors and flats, and 2 for duplex.

In [None]:
floors_data = {'Dúplex': 2, 'Piso': 1, 'Estudio': 1, 'Ático': 1}
fill_missing('n_floors', 'house_type_id', **floors_data)

In [None]:
data[(data.n_floors.isnull())]['house_type_id'].value_counts()

We are left with 502 houses without number of floors. Let's take a look at them.

In [None]:
print("Data from houses without number of floors: ")
print(data.loc[data.n_floors.isnull(),['sq_mt_built', 'n_rooms', 'n_bathrooms', 'sq_mt_allotment', 'buy_price']].describe())
print("Data from the rest of the houses: ")
print(data.loc[data.n_floors.notnull() & data.house_type_id.str.contains('Casa'),['sq_mt_built', 'n_rooms', 'n_bathrooms', 'sq_mt_allotment', 'n_floors',
                                                                                  'buy_price']].describe())
print("Number of floors: ")
floor_number = data[(data.n_floors.notnull()) & (data.house_type_id.str.contains('Casa'))]['n_floors'].value_counts()
print(floor_number)

Comparing both groups we can see that they have similar values. So we can assume that the number of floors would be similar.

Checking the number of floors for the second group we can see that mean and median are 3. Investigating a bit more, we can see that the most frequent number is 4.

However, we can't give 3 floors to a 30 m² built house.

We'll need more than their size because the values overlap, so we'll add rooms and check them in more detail.

In [None]:
print("Houses without number of floors")
data[data.n_floors.isnull()].groupby('n_rooms').agg({'sq_mt_built': ['min','max', 'mean'], 'id':'count'})

In [None]:
print("Houses with number of floors")
data[(data.n_floors.notnull()) & (data.house_type_id.str.contains('Casa'))].groupby(['n_floors', 'n_rooms']).agg({'sq_mt_built': ['min','max', 'mean'], 
                                                                                                                  'id':'count'})

After studying both statistics these are the actions we'll take:
* Those with just one room will get 2 floors. There are more houses with 1 room and 2 floors.
* Those with 2 rooms will get also 2 floors. They fit better.
* Those with 3 rooms will get 3 floors. It's the mode.
* Those with 4 rooms will get 3 floors too. Again, it's the mode and the means are similar.
* Those with 5 rooms will get 4 floors. Following the mode.
* The rest will get 4 floors. It's the most common number.
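The rules above can be sketched as a lookup table with a default value. The table below is invented for the example, and note one assumption: here "the rest" includes any number of rooms not listed, even 0.

```python
import numpy as np
import pandas as pd

# Hypothetical houses, most of them without a number of floors
toy = pd.DataFrame({
    "n_rooms": [1, 3, 5, 8, 4],
    "n_floors": [np.nan, np.nan, np.nan, np.nan, 2.0],
})

rooms_to_floors = {1: 2, 2: 2, 3: 3, 4: 3, 5: 4}   # rules from the list above
default_floors = 4                                  # "the rest"

# Suggested floors for every row; rooms not in the table fall back to the default
suggested = toy["n_rooms"].map(rooms_to_floors).fillna(default_floors)

# Only rows with a missing value are changed; existing values are kept
toy["n_floors"] = toy["n_floors"].fillna(suggested).astype("int64")
print(toy)
```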

We'll define a new function to fill them.

In [None]:
def fill_nfloors(column_to_change, column_ref, **kwargs):
    '''Fill number of floors' column grouping them by number of rooms.
       Parameters: column to change; column used as reference; dictionary with pairs number_of_rooms: number_of_floors
       Return: Nothing, changes are done in place. Outputs progress.'''
    
    for nrooms, new_number in kwargs.items():
        #Select null rows with correct number of rooms
        rel_no = (data[column_ref] == int(nrooms)) & (data[column_to_change].isnull())
        #Apply new number of floors
        data.loc[(rel_no), column_to_change] = new_number
        print('Done ' + nrooms)

In [None]:
n_floors_data = {'1':2, '2':2, '3':3, '4':3, '5':4, '6':4, '7':4, '8':4, '9':4, '10':4, '11':4, '12':4, '13':4, '18':4}
fill_nfloors('n_floors', 'n_rooms', **n_floors_data)

In [None]:
data = data.astype({'n_floors' : 'int64'})

## floor - height
Let's check the 'floor' column.

In [None]:
print("Number of null entries: {}".format(data.floor.isnull().sum()))
print("---------Types of floors-----------")
print(data.floor.value_counts())

There are three variants of the name (plain, exterior and interior) for the mezzanine, semi-basement and basement levels. Normally, we would expect some price difference between an exterior flat and an interior one.

Let's see if this is true.

In [None]:
print(data.loc[(data.floor.notnull()) & (data.floor.str.contains('Sótano$')), ['sq_mt_built', 'n_rooms', 'n_bathrooms', 'buy_price']].describe())
print(data.loc[(data.floor.notnull()) & (data.floor.str.contains('Sótano interior')), ['sq_mt_built', 'n_rooms', 'n_bathrooms', 'buy_price']].describe())
print(data.loc[(data.floor.notnull()) & (data.floor.str.contains('Sótano exterior')), ['sq_mt_built', 'n_rooms', 'n_bathrooms', 'buy_price']].describe())

There is no significant difference between the three kinds of basement flats, and the same happens with the other two levels.

We are going to group together the same levels and turn them all into numbers.

In [None]:
data['floor'] = data['floor'].replace({'Bajo': -1,'Entreplanta exterior': 0, 'Entreplanta interior': 0, 'Entreplanta' : 0, 'Semi-sótano exterior' : -2, 'Semi-sótano interior': -2,
                   'Semi-sótano': -2, 'Sótano interior' : -3, 'Sótano' : -3, 'Sótano exterior': -3})

In [None]:
#Let's check the missing values.
print("Not null values: ")
print(data.loc[(data.floor.notnull()),'house_type_id'].value_counts())
print("Null values: ")
print(data.loc[(data.floor.isnull()),'house_type_id'].value_counts())

We'll give 'Áticos' the highest number: 10. We don't know their exact height, but this marks them as the top floor.

For detached houses we'll use an arbitrary value, -5, because height doesn't matter there.

Let's check the others:

In [None]:
def check_values(column_ref, house_type):
    '''Find out how many unique values each house's category has in a column.
       Parameters: column to check; type of house
       Returns: a Series with the unique values for a house's category'''
    
    return data.loc[(data[column_ref].notnull()) & (data.house_type_id.str.contains(house_type)), column_ref].value_counts()

In [None]:
check_values('floor', 'Estudio')

In [None]:
floor_heigth = {'Áticos': 10, 'Casa': -5, 'Estudio': -1}

In [None]:
check_values('floor', 'Dúplex')

In [None]:
#We'll give the most common value
floor_heigth['Dúplex'] = -1
fill_missing('floor', 'house_type_id', **floor_heigth)

In [None]:
pisos_floor = check_values('floor', 'Piso')
pisos_floor_total = pisos_floor.sum()
pisos_floor

We are going to try something different here, as the proportions aren't very different.

Instead of choosing the most common value, we are going to select randomly among the 4 most common.
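A minimal sketch of this kind of weighted random imputation, with invented counts (not the real figures) and a fixed seed so that the example is repeatable. Each missing row gets its own independent draw.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed only to make the sketch repeatable

# Hypothetical counts of flats per floor
counts = pd.Series({1: 400, 2: 300, 3: 200, 4: 100})
probs = counts / counts.sum()    # weights normalised to add up to 1

floors = pd.Series([2, np.nan, 1, np.nan, np.nan])
missing = floors.isnull()

# One independent draw per missing row, weighted by how common each floor is
floors.loc[missing] = rng.choice(counts.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy())
print(floors)
```

Drawing with weights keeps the overall proportions roughly as they were, instead of inflating a single floor.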

In [None]:
floor1 = pisos_floor.loc['1']/pisos_floor_total*100
floor2 = pisos_floor.loc['2']/pisos_floor_total*100
floor3 = pisos_floor.loc['3']/pisos_floor_total*100
floor4 = pisos_floor.loc['4']/pisos_floor_total*100
floor_total = floor1 + floor2 + floor3 + floor4

#Floors 1, 2, 3 and 4 contain 72% of all flats.

In [None]:
rng = np.random.default_rng()

def random_number(options, size):
    '''Choose integers among the ones given, based on their probabilities
       Parameters: list of integers; how many values to draw
       Return: An array of integers'''
    
    return rng.choice(options, size=size, p=[floor1/floor_total,floor2/floor_total,floor3/floor_total,floor4/floor_total])

#Draw one independent value for every null row in floor's column
#(calling the function only once would give the same floor to all of them)
rel_floor_null = data.floor.isnull()
data.loc[rel_floor_null, 'floor'] = random_number([1,2,3,4], rel_floor_null.sum())

data = data.astype({'floor' : 'int64'})

'is_floor_under' is true when it's a ground floor or basement. It's not necessary. We already have that information in 'floor'.


In [None]:
data.drop(columns=['is_floor_under'], inplace=True)

## Neighborhood_id

In [None]:
data.neighborhood_id.iloc[0]

'neighborhood_id' consists of a number, a name and the mean price by neighborhood. Also, the district's number and its name.

We can use this to help locate the houses but we'll only keep the numbers and separate them in two new columns.
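A minimal sketch of this kind of extraction, run on a single made-up string. The sample value is an assumption written to match the description above; the real strings may differ in detail.

```python
import pandas as pd

# Hypothetical value, written to follow the description above
sample = pd.Series(["Neighborhood 135: San Cristóbal (1308.89 €/m2) - District 21: Villaverde"])

# First run of digits that is followed by a colon
neighborhood = sample.str.extract(r"(\d+):", expand=False)
# Run of digits right after the word 'District '
district = sample.str.extract(r"District (\d+)", expand=False)

print(neighborhood.iloc[0], district.iloc[0])
```

The extracted values are still strings, so they need a conversion to integers afterwards.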

In [None]:
data['neighborhood'] = data.neighborhood_id.str.extract(r'(\d+):', expand=True)
data['district'] = data.neighborhood_id.str.extract(r'District (\d+)', expand=True)

data.drop(columns=['neighborhood_id'], inplace=True)

data = data.astype({'neighborhood' : 'int64'})
data = data.astype({'district' : 'int64'})

The next columns, 'operation' and 'buy_price_by_area' are not needed. We'll only keep buy_price.

rent_price could be useful for another project!

In [None]:
data.drop(columns=['operation', 'buy_price_by_area'], inplace=True)

We have worked with 'house_type_id' many times, but let's see the elements again.

In [None]:
data.house_type_id.unique()

In [None]:
#We only need the number so:
data['house_type'] = data.house_type_id.str.extract(r'(\d)', expand=True)

data.drop(columns=['house_type_id'], inplace=True)

data = data.astype({'house_type' : 'int64'})

'is_new_development' and 'built_year' are a bit confusing because we don't know how recent a house must be to count as new. Also, 'built_year' has some values like 2022, so we'll drop both.

In [None]:
data.drop(columns=['is_new_development', 'built_year'], inplace=True)

The next 13 columns are boolean. They describe interesting features, but many of them have more than 50% of missing values, which is too many to fill reliably. We'll eliminate them, except 'has_lift' and 'is_exterior', which have fewer missing values.
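The 50% criterion can be checked with the share of missing values per column. The table below is invented; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical flags with different amounts of missing data
toy = pd.DataFrame({
    "has_lift": [True, False, True, np.nan],
    "has_pool": [True, np.nan, np.nan, np.nan],
    "has_garden": [np.nan, np.nan, np.nan, True],
})

missing_share = toy.isnull().mean()   # fraction of NaN per column
too_sparse = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print("Columns above the 50% threshold:", too_sparse)
```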

The type of heating (central or individual) is not very relevant, so we'll eliminate those two columns too.


In [None]:
data.drop(columns=['has_central_heating', 'has_individual_heating'], inplace=True)

## has_lift

In [None]:
data[data.has_lift.notnull()].groupby('house_type').house_type.count()

All type 2 houses 'Casa o chalet' have missing values. This is fine because these houses don't have lifts (normally), so we'll give them a False value.

In [None]:
rel_casa_nolift = (data.house_type == 2)
data.loc[(rel_casa_nolift), 'has_lift'] = False

In [None]:
data[data.has_lift.isnull()].groupby('house_type').house_type.count()

We have a few houses left with no values, so we'll give them False, too.

In [None]:
data['has_lift'] = data['has_lift'].fillna(False)

In [None]:
f, ax = plt.subplots(figsize=(8, 6.5))

sns.violinplot(x='house_type', y='buy_price', hue='has_lift', data=data, split=True, scale='count')
ax.set_ylim(0,3000000)
plt.title("House type vs Price with lift", size=20)
plt.show()

## is_exterior

Let's do the same with 'is_exterior'

In [None]:
data[data.is_exterior.isnull()].groupby('house_type').house_type.count()

In [None]:
rel_casa_exterior = (data.house_type == 2)
data.loc[(rel_casa_exterior), 'is_exterior'] = True

There are no hints about the remaining missing values, so we'll assign them according to the proportions of the known ones.

We'll define a function to fill the null 'exterior' cells with a random option.

We also define a slightly different version of the previous function 'random_number'.

In [None]:
rng = np.random.default_rng()

def random_number(options, prob, size):
    '''Select values based on their given probabilities
       Parameters: a list of values; a list with the probability of each value, in the same order; how many values to draw
       Returns: an array of values'''
    
    #select the options based on their probabilities
    return rng.choice(options, size=size, p=prob)

In [None]:
def ext_prob(house_type):
    '''Fill the null values in the "exterior" column according to the probabilities of each house's type
       Parameters: the house's type
       Returns: Nothing, changes are done in place'''
    
    #Calculate the share of trues among the houses of a specific type without null values
    known = data.loc[(data.is_exterior.notnull()) & (data.house_type == house_type), 'is_exterior'].astype(bool)
    p_true = known.mean()
    #Find out houses with null values in 'exterior'
    mask_ex = (data.is_exterior.isnull()) & (data.house_type == house_type)
    #Draw one option, True or False, per null row according to the probabilities of this house's type
    if mask_ex.any():
        data.loc[mask_ex, 'is_exterior'] = random_number([True, False], [p_true, 1 - p_true], mask_ex.sum())

In [None]:
exterior_null = [1,3,4,5]
#map() is lazy in Python 3, so we use a loop to make sure the function runs for every type
for house_type in exterior_null:
    ext_prob(house_type)

In [None]:
data = data.astype({'is_exterior' : 'bool'})

'is_renewal_needed' is a basic feature that will impact the final price.

We'll keep 'has_parking' because it always affects the price and remove 'is_parking_included_in_price' and 'parking_price' because they have too many missing values and don't add anything.

In [None]:
columns_todrop = ['has_ac', 'has_fitted_wardrobes', 'has_garden', 'has_pool','has_terrace', 'has_balcony', 'has_storage_room', 'is_accessible', 'has_green_zones',
                 'is_parking_included_in_price', 'parking_price','is_orientation_north', 'is_orientation_west', 'is_orientation_south', 'is_orientation_east']

data.drop(columns=columns_todrop, inplace=True)

## energy_certificate

An energy certificate has been compulsory in Spain when selling a flat for several years now.

In [None]:
data.energy_certificate.value_counts()

We'll replace letters with numbers.

In [None]:
data['energy_certificate'] = data['energy_certificate'].replace({'en trámite': 0,'no indicado': 0, 'inmueble exento': 0, 'G' : 1, 'F' : 2, 'E': 3, 'D': 4, 'C':5, 'B':6, 'A':7})

In [None]:
data.drop(columns=['title', 'subtitle', 'raw_address'], inplace=True)

And, finally, this is the result: 16 columns with no missing values.

In [None]:
print(data.info(verbose=True))

Let's see what has happened to the correlation matrix

In [None]:
corr = data.iloc[:,1:].corr()
f, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(corr ,annot=True, fmt='.1g',center=0) 
plt.show()

## Last impressions

There's more colour in this plot, with more positive and negative correlations than before.

'buy_price' keeps the same values for m² built, number of rooms and bathrooms. Number of floors and m² allotment increase their correlations, although they aren't significant enough.

The floor's height has a correlation of -0.2, which may be explained by the -5 we gave to the 'Casas' category.

The rest of the columns have very small values, showing a small direct effect on our target variable and among them. 

However, houses' prices depend on more than linear correlations (or so I hypothesize), but testing that will have to wait for another notebook.


In [None]:
corr = data.iloc[:,1:].corr("spearman")
f, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(corr ,annot=True, fmt='.1g',center=0) 
plt.show()

Calculating the Spearman correlation shows small differences. While the first one (Pearson, the default) searches for linear relationships, this one looks for monotonic relationships.
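The difference is easier to see on an invented example where the relationship is perfectly monotonic but not linear. Here the Spearman coefficient is computed as the Pearson correlation of the ranks, which is its usual definition.

```python
import pandas as pd

# Hypothetical monotonic but non-linear relationship
x = pd.Series(range(1, 11))
y = x ** 3

pearson = x.corr(y)                  # linear association between the raw values
spearman = x.rank().corr(y.rank())   # Pearson correlation of the ranks

print("Pearson: ", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

Because `y` always grows when `x` grows, the ranks match exactly and the Spearman coefficient reaches 1, while the Pearson coefficient stays below 1.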

In [None]:
data.to_csv('madrid_houses_clean.csv')

Saving the dataframe to a file was the last step. Now we can use it in other notebooks.

My next step will be to explore this data a bit more by placing each house on a map. This will allow us to see the relationships between variables better, and how they are distributed across Madrid.

Stay tuned!