## 1. Data collection

In [3]:
import pandas as pd

Let's look at the dataset.

In [7]:
data = pd.read_csv('rent_apartments.csv')
data

Unnamed: 0,address,area,constraction_year,rooms,bedrooms,bathrooms,balcony,storage,parking,furnished,garage,garden,energy,facilities,zip,neighborhood,rent
0,1071 HN Amsterdam (Cornelis Schuytbuurt),167.0,1870,3,2,2,yes,no,no,yes,no,Not present,D,Roof terrace,1071 HN,Cornelis Schuytbuurt,4500
1,1071 HK Amsterdam (Concertgebouwbuurt),150.0,1890,3,2,2,yes,no,yes,yes,no,Not present,A,"Cable TV, Internet connection, Fireplace, Bath...",1071 HK,Concertgebouwbuurt,3450
2,1071 HK Amsterdam (Concertgebouwbuurt),150.0,1890,3,2,2,yes,no,yes,yes,no,Not present,A,"Cable TV, Internet connection, Fireplace, Bath...",1071 HK,Concertgebouwbuurt,3450
3,1071 WV Amsterdam (Hondecoeterbuurt),90.0,1923,3,2,1,yes,no,no,yes,no,Not present,,"Shower, Toilet",1071 WV,Hondecoeterbuurt,2000
4,1071 WV Amsterdam (Hondecoeterbuurt),104.0,1923,3,2,1,no,no,no,no,no,Present (47 m²),D,"Shower, Bath, Toilet",1071 WV,Hondecoeterbuurt,3250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1723,1033 DL Amsterdam (Terrasdorp),75.0,1990,3,2,1,no,no,no,yes,no,Not present,C,,1033 DL,Terrasdorp,1450
1724,1033 DZ Amsterdam (Terrasdorp),75.0,1990,3,2,1,yes,no,no,yes,no,Not present,C,Shower,1033 DZ,Terrasdorp,1500
1725,1021 NX Amsterdam (IJplein e.o.),74.0,1986,2,1,1,no,no,no,yes,no,Not present,,,1021 NX,IJplein e.o.,1400
1726,1021 EC Amsterdam (Vogelbuurt Zuid),118.0,1920,5,4,1,yes,yes,yes,yes,no,Not present,G,"Storage space, Shower, Toilet",1021 EC,Vogelbuurt Zuid,2650


This table contains apartment properties, like address, area, construction year, number of rooms, number of bedrooms, etc.

The last column, rent, holds the rental price of each apartment.

Our objective is to build a model that would predict the rental price of an apartment based on these parameters.

## 2. Data preparation

Before we do so, let's start with pre-processing that dataset.

Data preparation is about transforming your dataset into a form that a machine learning algorithm can work with.

The first thing we need to do is to check the data types of our features.

In [6]:
data.dtypes

address               object
area                 float64
constraction_year      int64
rooms                  int64
bedrooms               int64
bathrooms              int64
balcony               object
storage               object
parking               object
furnished             object
garage                object
garden                object
energy                object
facilities            object
zip                   object
neighborhood          object
rent                   int64
dtype: object

We can see that we have a bunch of features here.
Some of them are numerical, and some of them are categorical.
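
As a small sketch of how the numerical and categorical columns could be listed programmatically, the snippet below uses select_dtypes() on a toy frame. The frame `toy` and its values are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame standing in for the rental data (values are illustrative only).
toy = pd.DataFrame({
    "area": [167.0, 90.0],
    "rooms": [3, 3],
    "balcony": ["yes", "no"],
    "garden": ["Not present", "Present (47 m²)"],
    "rent": [4500, 2000],
})

# Columns with a numeric dtype (int64, float64, ...).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else: here, the string-valued columns.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

The same two calls on `data` would give the lists of columns that can be used directly and those that still need encoding.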

Let's start with transforming categorical features into numerical form.

I'd like to start with these five features:
* balcony
* storage
* parking
* furnished
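* garage

I'd like to encode them using the get_dummies() function from pandas. With drop_first=True, each yes/no column becomes a single boolean column such as balcony_yes.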

In [8]:
data_encoded = pd.get_dummies(data, columns=['balcony', 'parking', 'furnished', 'garage', 'storage'], drop_first=True)

In [9]:
data_encoded

Unnamed: 0,address,area,constraction_year,rooms,bedrooms,bathrooms,garden,energy,facilities,zip,neighborhood,rent,balcony_yes,parking_yes,furnished_yes,garage_yes,storage_yes
0,1071 HN Amsterdam (Cornelis Schuytbuurt),167.0,1870,3,2,2,Not present,D,Roof terrace,1071 HN,Cornelis Schuytbuurt,4500,True,False,True,False,False
1,1071 HK Amsterdam (Concertgebouwbuurt),150.0,1890,3,2,2,Not present,A,"Cable TV, Internet connection, Fireplace, Bath...",1071 HK,Concertgebouwbuurt,3450,True,True,True,False,False
2,1071 HK Amsterdam (Concertgebouwbuurt),150.0,1890,3,2,2,Not present,A,"Cable TV, Internet connection, Fireplace, Bath...",1071 HK,Concertgebouwbuurt,3450,True,True,True,False,False
3,1071 WV Amsterdam (Hondecoeterbuurt),90.0,1923,3,2,1,Not present,,"Shower, Toilet",1071 WV,Hondecoeterbuurt,2000,True,False,True,False,False
4,1071 WV Amsterdam (Hondecoeterbuurt),104.0,1923,3,2,1,Present (47 m²),D,"Shower, Bath, Toilet",1071 WV,Hondecoeterbuurt,3250,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1723,1033 DL Amsterdam (Terrasdorp),75.0,1990,3,2,1,Not present,C,,1033 DL,Terrasdorp,1450,False,False,True,False,False
1724,1033 DZ Amsterdam (Terrasdorp),75.0,1990,3,2,1,Not present,C,Shower,1033 DZ,Terrasdorp,1500,True,False,True,False,False
1725,1021 NX Amsterdam (IJplein e.o.),74.0,1986,2,1,1,Not present,,,1021 NX,IJplein e.o.,1400,False,False,True,False,False
1726,1021 EC Amsterdam (Vogelbuurt Zuid),118.0,1920,5,4,1,Not present,G,"Storage space, Shower, Toilet",1021 EC,Vogelbuurt Zuid,2650,True,True,True,False,True
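
To make the effect of drop_first=True concrete, here is a minimal sketch on a toy frame (the frame `toy` is made up for illustration). With two categories, 'no' and 'yes', pandas drops the first one alphabetically and keeps a single balcony_yes column.

```python
import pandas as pd

# Toy frame: one yes/no feature and the target (illustrative values).
toy = pd.DataFrame({"balcony": ["yes", "no", "yes"], "rent": [4500, 3250, 2000]})

# Without drop_first we would get balcony_no and balcony_yes, which are redundant.
encoded = pd.get_dummies(toy, columns=["balcony"], drop_first=True)

print(encoded.columns.tolist())
print(encoded["balcony_yes"].tolist())
```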


Next, let's look at the garden variable.

As you can see, we have a bunch of different strings here.

Our goal is to extract numerical values from those strings.

We can do this with re, the regular expressions module in Python's standard library.

In [10]:
data_encoded.garden.unique()

array(['Not present', 'Present (47 m²)', 'Present (29 m²)',
       'Present (75 m²)', 'Present (40 m², located on the north)',
       'Present (50 m²)', 'Present (20 m², located on the south)',
       'Present (1 m²)', 'Present (15 m²)', 'Present (25 m²)',
       'Present (12 m²)', 'Present (45 m², located on the south)',
       'Present (26 m², located on the south-east)',
       'Present (20 m², located on the north-east)',
       'Present (42 m², located on the west)', 'Present (46 m²)',
       'Present (45 m², located on the south-west)',
       'Present (60 m², located on the south-west)',
       'Present (50 m², located on the south)',
       'Present (40 m², located on the north-east)', 'Present (16 m²)',
       'Present (60 m²)', 'Present (65 m², located on the south)',
       'Present (90 m²)', 'Present (85 m²)',
       'Present (85 m², located on the south-west)',
       'Present (500 m², located on the west)',
       'Present (45 m², located on the west)',
       'Present (1

In [11]:
import re

Let's look at the value at index 4 of the garden column.

Our goal is to extract 47.

This is done with the findall() function from the re module, where we explicitly tell it to find every run of digits inside that string.

In [12]:
data_encoded.garden[4]

'Present (47 m²)'

findall() finds the 47, but it returns it as a string inside a list, so it's still not in numerical form.

To make it numerical, we have to take the first element out of the list and convert it to an integer.

Now we can write a simple for loop in Python to extract all the numerical values from those strings across the entire dataset.

The only thing to keep in mind is the 'Not present' value, which doesn't contain any digits.

For instance, if the garden column contains 'Not present' and we try to extract a number from that string, findall() returns an empty list, and taking its first element raises an error.
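
The failure mode can be reproduced in isolation. This short sketch uses only the standard library and the literal string from the garden column:

```python
import re

# No digits in the string, so findall() returns an empty list.
no_match = re.findall(r"\d+", "Not present")
print(no_match)

# Indexing the empty list is what actually raises the error.
raised = False
try:
    int(no_match[0])
except IndexError as exc:
    raised = True
    print("IndexError:", exc)
```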

To extract all the numerical values from those strings, we're going to use some basic logic.

For every row in the dataset, if the value of the garden variable is 'Not present', we assign zero.

Otherwise, we use the regular expression.

In [20]:
int(re.findall(r'\d+', data_encoded.garden[4])[0])

47

In [24]:
for index, row in data_encoded.iterrows():
    garden = row['garden']
    if garden == 'Not present':
        garden = 0
    else:
        garden = int(re.findall(r'\d+', garden)[0])

    # Assign the modified value back to the DataFrame
    data_encoded.at[index, 'garden'] = garden

In [25]:
data_encoded

Unnamed: 0,address,area,constraction_year,rooms,bedrooms,bathrooms,garden,energy,facilities,zip,neighborhood,rent,balcony_yes,parking_yes,furnished_yes,garage_yes,storage_yes
0,1071 HN Amsterdam (Cornelis Schuytbuurt),167.0,1870,3,2,2,0,D,Roof terrace,1071 HN,Cornelis Schuytbuurt,4500,True,False,True,False,False
1,1071 HK Amsterdam (Concertgebouwbuurt),150.0,1890,3,2,2,0,A,"Cable TV, Internet connection, Fireplace, Bath...",1071 HK,Concertgebouwbuurt,3450,True,True,True,False,False
2,1071 HK Amsterdam (Concertgebouwbuurt),150.0,1890,3,2,2,0,A,"Cable TV, Internet connection, Fireplace, Bath...",1071 HK,Concertgebouwbuurt,3450,True,True,True,False,False
3,1071 WV Amsterdam (Hondecoeterbuurt),90.0,1923,3,2,1,0,,"Shower, Toilet",1071 WV,Hondecoeterbuurt,2000,True,False,True,False,False
4,1071 WV Amsterdam (Hondecoeterbuurt),104.0,1923,3,2,1,47,D,"Shower, Bath, Toilet",1071 WV,Hondecoeterbuurt,3250,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1723,1033 DL Amsterdam (Terrasdorp),75.0,1990,3,2,1,0,C,,1033 DL,Terrasdorp,1450,False,False,True,False,False
1724,1033 DZ Amsterdam (Terrasdorp),75.0,1990,3,2,1,0,C,Shower,1033 DZ,Terrasdorp,1500,True,False,True,False,False
1725,1021 NX Amsterdam (IJplein e.o.),74.0,1986,2,1,1,0,,,1021 NX,IJplein e.o.,1400,False,False,True,False,False
1726,1021 EC Amsterdam (Vogelbuurt Zuid),118.0,1920,5,4,1,0,G,"Storage space, Shower, Toilet",1021 EC,Vogelbuurt Zuid,2650,True,True,True,False,True


Yes, we have converted the strings to integer values (although, as the output below shows, the column's dtype is still object).

We can also check it by calling the unique() method.

In [26]:
data_encoded['garden'].unique()

array([0, 47, 29, 75, 40, 50, 20, 1, 15, 25, 12, 45, 26, 42, 46, 60, 16,
       65, 90, 85, 500, 30, 49, 51, 80, 27, 56, 9, 200, 32, 100, 34],
      dtype=object)
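
The dtype=object above means the column still holds generic Python objects. As an alternative to the loop, the following is a sketch of a vectorized extraction that ends with a true integer column. It is not the notebook's method; the series `garden` below is a small stand-in built from values seen in the output above.

```python
import pandas as pd

# A few representative strings from the garden column.
garden = pd.Series([
    "Not present",
    "Present (47 m²)",
    "Present (40 m², located on the north)",
])

# str.extract returns the first run of digits, or NaN when there is none.
digits = garden.str.extract(r"(\d+)", expand=False)

# to_numeric turns the strings into numbers; missing values become 0.
garden_m2 = pd.to_numeric(digits).fillna(0).astype(int)

print(garden_m2.tolist())
print(garden_m2.dtype)
```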

## 3. Model Building

### 3.1. Defining X and y

To build a model, we first need to identify independent variables and dependent variables.

I'll start with independent variables.

The variables I'd like to use to train our model are:
* area
* construction year
* bedrooms
* garden
* balcony
* parking
* furnished
* garage
* storage

The dependent variable is the rental price, as discussed above.

We can easily look at the dependent variable as well as independent variables.

In [31]:
X = data_encoded[['area', 'constraction_year', 'bedrooms', 'garden', 'balcony_yes', 'parking_yes', 'furnished_yes', 'garage_yes', 'storage_yes']]
X

Unnamed: 0,area,constraction_year,bedrooms,garden,balcony_yes,parking_yes,furnished_yes,garage_yes,storage_yes
0,167.0,1870,2,0,True,False,True,False,False
1,150.0,1890,2,0,True,True,True,False,False
2,150.0,1890,2,0,True,True,True,False,False
3,90.0,1923,2,0,True,False,True,False,False
4,104.0,1923,2,47,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
1723,75.0,1990,2,0,False,False,True,False,False
1724,75.0,1990,2,0,True,False,True,False,False
1725,74.0,1986,1,0,False,False,True,False,False
1726,118.0,1920,4,0,True,True,True,False,True


In [30]:
y = data_encoded['rent']
y

0       4500
1       3450
2       3450
3       2000
4       3250
        ... 
1723    1450
1724    1500
1725    1400
1726    2650
1727    2600
Name: rent, Length: 1728, dtype: int64

### 3.2. Split the dataset

The next step is to split the dataset.

Here we're going to use train_test_split from the scikit-learn library.

Let's import it.

In [34]:
from sklearn.model_selection import train_test_split

We're going to split our dataset into train and test sets.

We're also going to set the test size to 20% of the data.

After that, we're ready to build a model.

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
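
The split above is random, so the rows in each set (and every score computed later) change from run to run. One common option, shown here as a sketch on made-up data, is to pass random_state so the split is repeatable. The value 42 is an arbitrary choice, not something the notebook uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows: one feature and a target.
X_toy = pd.DataFrame({"area": range(50, 150, 10)})
y_toy = pd.Series(range(1000, 2000, 100))

# Fixing random_state makes the shuffle deterministic.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(len(X_tr), len(X_te))                       # 80% / 20% of 10 rows
print(X_te.index.tolist() == X_te2.index.tolist())  # same rows both times
```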

### 3.3. Model Building

To build a model, we're going to use a random forest regressor.

Let's first import that class.

Then we're going to initialize our model.

In [36]:
from sklearn.ensemble import RandomForestRegressor

In [37]:
rf = RandomForestRegressor()

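Second, we're going to train it on our training set.

Finally, we can extract the score of the model that we just trained. For a regressor, score() returns the coefficient of determination (R²), here computed on the test set.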

In [38]:
rf.fit(X_train, y_train)

Now we have a model that we can use to predict the rental price of an apartment.

In [39]:
rf.score(X_test, y_test)

0.8137825604385918
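
To see what that number measures, here is a worked R² calculation in plain Python on three made-up rents and predictions. R² is 1 minus the ratio of the squared prediction errors to the squared deviations from the mean.

```python
y_true = [1000, 2000, 3000]  # made-up actual rents
y_pred = [1100, 1900, 3200]  # made-up predictions

mean_true = sum(y_true) / len(y_true)

# Residual sum of squares: how far predictions are from the actual values.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Total sum of squares: how far actual values are from their mean.
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot
print(round(r2, 2))  # 0.97
```

A score of 1.0 would mean perfect predictions, and 0.0 would mean the model does no better than always predicting the mean rent.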

### 3.4. Predicting

Let's make some predictions.


Our independent variables were area, construction year, bedrooms, garden, and so on.

I would like to predict the rent of an apartment with an area of 85 m², built in 2015, with 2 bedrooms, a 20 m² garden, a balcony, parking, no furniture, no garage, and storage.

In [40]:
X.head()

Unnamed: 0,area,constraction_year,bedrooms,garden,balcony_yes,parking_yes,furnished_yes,garage_yes,storage_yes
0,167.0,1870,2,0,True,False,True,False,False
1,150.0,1890,2,0,True,True,True,False,False
2,150.0,1890,2,0,True,True,True,False,False
3,90.0,1923,2,0,True,False,True,False,False
4,104.0,1923,2,47,False,False,False,False,False


The model predicts a rent of about 2,034 (in the same units as the rent column).

In [42]:
rf.predict([[85, 2015, 2, 20, 1, 1, 0, 0, 1]])



array([2033.9825])
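
Passing a bare list works, but the values have to be in exactly the same order as the training columns, and scikit-learn may warn that feature names are missing. A sketch of a more explicit variant, assuming a small stand-in model trained on a few rows taken from the table above (with the boolean columns written as 0/1), is to pass a one-row DataFrame with named columns:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

features = ["area", "constraction_year", "bedrooms", "garden", "balcony_yes",
            "parking_yes", "furnished_yes", "garage_yes", "storage_yes"]

# Four rows from the table above, used only to have something to fit.
train = pd.DataFrame(
    [[167.0, 1870, 2, 0, 1, 0, 1, 0, 0],
     [150.0, 1890, 2, 0, 1, 1, 1, 0, 0],
     [90.0, 1923, 2, 0, 1, 0, 1, 0, 0],
     [104.0, 1923, 2, 47, 0, 0, 0, 0, 0]],
    columns=features,
)
rent = [4500, 3450, 2000, 3250]

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(train, rent)

# The new apartment as a one-row frame; column names document each value.
new_apartment = pd.DataFrame([[85, 2015, 2, 20, 1, 1, 0, 0, 1]], columns=features)
pred = model.predict(new_apartment)
print(pred)
```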

### 3.5. Tuning Hyperparameters

We can also tune hyperparameters of the model to see if it gets better.

To do so, we're going to use grid search method.

So let's first import grid search.

In [43]:
from sklearn.model_selection import GridSearchCV

Then we have to define the grid space.

Let's say we would like to tune the number of estimators.

We would also like to tune the maximum depth.

In [44]:
grid_space = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 6, 9, 12]}

The next thing is to initialize the grid search.

In [47]:
grid = GridSearchCV(RandomForestRegressor(), param_grid=grid_space, cv=5, scoring='r2')

In [48]:
model_grid = grid.fit(X_train, y_train)

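Now let's print the best parameters that the grid search found. Note that best_score_ is the mean cross-validated R² over the training folds, so it isn't directly comparable to the test-set score from section 3.3.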

In [49]:
print(f"Best hyperparameters are {model_grid.best_params_}, score = {model_grid.best_score_}")

Best hyperparameters are {'max_depth': 12, 'n_estimators': 300}, score = 0.7072638894875193
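
To compare the tuned model with the untuned one on equal terms, it would need to be scored on the same held-out test set. The sketch below shows the pattern on synthetic data from make_regression, with a deliberately small grid so it runs quickly. The data, the grid values, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the rental features.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [20, 40], "max_depth": [3, 6]},
    cv=3,
    scoring="r2",
)
search.fit(X_tr, y_tr)

cv_r2 = search.best_score_                           # mean R² across CV folds
test_r2 = search.best_estimator_.score(X_te, y_te)   # R² on held-out data

print(search.best_params_)
print(f"CV R2 = {cv_r2:.3f}, test R2 = {test_r2:.3f}")
```

By default GridSearchCV refits the best configuration on the whole training set, which is why best_estimator_ can be scored directly.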


## 4. Model Management

The last step in developing a machine learning model is to save the trained model.

Python's standard library includes a module called 'pickle' that lets you save trained models together with their configuration parameters.

In [50]:
import pickle as pk

To save with pickle, we call the dump method and pass it the model that we would like to save, which is the random forest, along with the file we would like to write it to.

Let's say I would like to save it to the models folder (which must already exist) and name it 'rf_v1'.

In [52]:
with open('models/rf_v1', 'wb') as f:  # the file is closed once the model is written
    pk.dump(rf, f)

To use this trained model, all you have to do is call the load method in pickle.

I'm also assigning the result to the name 'rf_v1'.

In [56]:
with open('models/rf_v1', 'rb') as f:  # only unpickle files from a trusted source
    rf_v1 = pk.load(f)

Now I can use random forest version 1 to make predictions.

Just like here.

In [57]:
rf_v1.predict([[85, 2015, 2, 20, 1, 1, 0, 0, 1]])



array([2033.9825])

The prediction is the same, so I don't need to retrain the model.
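
The whole save-and-reload round trip can be sketched end to end as follows. This is a self-contained illustration, not the notebook's code: it fits a tiny stand-in model on made-up numbers, writes it to a temporary folder (so the folder is created explicitly), and checks that the restored model predicts the same value.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestRegressor

# Tiny stand-in model: area -> rent, with made-up numbers.
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit([[50], [75], [100], [125]], [1000, 1500, 2000, 2500])

# Hypothetical location; makedirs ensures the 'models' folder exists.
models_dir = os.path.join(tempfile.mkdtemp(), "models")
os.makedirs(models_dir, exist_ok=True)
path = os.path.join(models_dir, "rf_v1")

with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original prediction exactly.
same = bool((model.predict([[85]]) == restored.predict([[85]])).all())
print(same)
```

A pickled model should be loaded with the same scikit-learn version that saved it, since the format is not guaranteed to be compatible across versions.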