## Predicting Rent in Toronto

In this project I will predict apartment rent in Toronto from the number of rooms and the location. The data was acquired from
https://www.kaggle.com/rajacsp/toronto-apartment-price.

First, I will import the packages I need.

In [17]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import re

Below I load the data and strip the dollar signs and thousands separators from the prices so that the column can be converted to floats.

In [18]:
all_data_df = pd.read_csv('Toronto_apartment_rentals_2018.csv')
# Remove dollar signs and commas, then convert the prices to float
all_data_df['Price'] = all_data_df['Price'].replace(r'[\$,]', '', regex=True).astype(float)
all_data_df.head()

Unnamed: 0,Bedroom,Bathroom,Den,Address,Lat,Long,Price
0,2,2.0,0,"3985 Grand Park Drive, 3985 Grand Park Dr, Mis...",43.581639,-79.648193,2450.0
1,1,1.0,1,"361 Front St W, Toronto, ON M5V 3R5, Canada",43.643051,-79.391643,2150.0
2,1,1.0,0,"89 McGill Street, Toronto, ON, M5B 0B1",43.660605,-79.378635,1950.0
3,2,2.0,0,"10 York Street, Toronto, ON, M5J 0E1",43.641087,-79.381405,2900.0
4,1,1.0,0,"80 St Patrick St, Toronto, ON M5T 2X6, Canada",43.652487,-79.389622,1800.0
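The cleaning step above can be tried in isolation on a few values. This is a minimal sketch: the sample strings are made up and only assume that prices look like `$2,450.00`, and `pd.to_numeric` is used here in place of `astype(float)`.

```python
import pandas as pd

# Made-up sample values, assumed to match the format of the Price column.
prices = pd.Series(['$2,450.00', '$1,950.00', '$800.00'])

# Strip dollar signs and thousands separators, then convert to numbers.
cleaned = pd.to_numeric(prices.str.replace(r'[$,]', '', regex=True))
print(cleaned.tolist())
```

The raw-string prefix (`r'...'`) keeps Python from interpreting the backslashes in a regular expression as string escapes.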


Next I'll check for missing values. There are none.

In [19]:
all_data_df.isna().sum()

Bedroom     0
Bathroom    0
Den         0
Address     0
Lat         0
Long        0
Price       0
dtype: int64

For feature engineering, I will focus on the location. Latitude and longitude are far too specific, so instead I will use
the first three characters of the postal code (the forward sortation area), which tell us
what neighbourhood the home is located in. Addresses without a matching code are labelled 'NA'.



In [20]:
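# One or more L/M letters, one or more digits, then one non-digit, e.g. 'M5V'
reg = re.compile(r'[LMlm]+\d+\D')
post_firstthree = []
for add in all_data_df['Address']:
    ft_list = reg.findall(add)
    if not ft_list:
        # No postal code found in this address
        post_firstthree.append('NA')
    else:
        post_firstthree.append(ft_list[0])
all_data_df['postal'] = post_firstthree
all_data_df['postal'] = all_data_df['postal'].str.upper()
# Display the data without the raw location columns. drop returns a new
# DataFrame, so all_data_df itself still contains Address, Lat and Long.
all_data_df.drop(['Address', 'Lat', 'Long'], axis=1)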

# TODO: city indicator (in or out of Toronto?)



Unnamed: 0,Bedroom,Bathroom,Den,Price,postal
0,2,2.0,0,2450.0,L5B
1,1,1.0,1,2150.0,M5V
2,1,1.0,0,1950.0,M5B
3,2,2.0,0,2900.0,M5J
4,1,1.0,0,1800.0,M5T
...,...,...,...,...,...
1119,3,1.0,0,3000.0,L7S
1120,1,1.0,0,1200.0,L6M
1121,1,1.0,0,1800.0,M4C
1122,2,1.0,0,2200.0,M5B
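The pattern used above accepts any run of L or M followed by digits and one non-digit character. A stricter variant is sketched below. It assumes each address contains a full Canadian postal code (letter, digit, letter, optional space, digit, letter, digit) and, unlike the original, it does not restrict the first letter to L or M. The helper name `extract_fsa` and the sample addresses are hypothetical and not part of the notebook.

```python
import re

# Assumed format: a full postal code such as "M5V 3R5" somewhere in the address.
POSTAL_RE = re.compile(r'\b([A-Za-z]\d[A-Za-z])\s?\d[A-Za-z]\d\b')

def extract_fsa(address):
    """Return the upper-cased first three postal characters, or None if absent."""
    match = POSTAL_RE.search(address)
    return match.group(1).upper() if match else None

# Illustrative addresses, loosely based on the rows shown earlier.
for address in [
    '361 Front St W, Toronto, ON M5V 3R5, Canada',
    '89 McGill Street, Toronto, ON, m5b 0b1',
    '10 York Street, Toronto',
]:
    print(extract_fsa(address))
```

Returning `None` rather than the string `'NA'` lets pandas treat the value as missing, which is a design choice rather than something the notebook does.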


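Next I'll encode the postal codes as dummy variables. Note that `pd.get_dummies` encodes every string column it finds. Because the `drop` in the previous cell was only displayed and not assigned, `Address` is still in the DataFrame and gets encoded as well, which inflates the number of features in the next step.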

In [21]:
all_data_df = pd.get_dummies(all_data_df)
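To encode only the postal prefix, the location columns have to be dropped with an assignment first. The sketch below shows this on a hypothetical three-row miniature of the data; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the DataFrame after the postal column is added.
df = pd.DataFrame({
    'Bedroom': [2, 1, 1],
    'Address': ['address a', 'address b', 'address c'],
    'Lat': [43.58, 43.64, 43.66],
    'Long': [-79.65, -79.39, -79.38],
    'Price': [2450.0, 2150.0, 1950.0],
    'postal': ['L5B', 'M5V', 'M5V'],
})

# drop returns a new DataFrame, so the result has to be assigned back.
df = df.drop(columns=['Address', 'Lat', 'Long'])

# Encode only the postal column; the other columns pass through unchanged.
encoded = pd.get_dummies(df, columns=['postal'])
print(list(encoded.columns))
```

Passing `columns=['postal']` makes the encoding explicit, so a stray text column cannot be turned into hundreds of dummy features by accident.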

Now I'll split the data into training and test sets.

In [22]:
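# Hold out the default 25% of the rows as the test set
trainX, testX, trainY, testY = train_test_split(
    all_data_df.drop(['Price'], axis=1), all_data_df['Price'])
# Only the last expression is displayed, so the output below is testX.shape
trainX.shape
testX.shape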


(281, 768)
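The split above is random, so the rows in each set change from run to run. A minimal sketch of a reproducible split follows, using made-up arrays; the seed value 42 and the explicit `test_size` are assumptions for illustration, not settings from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up feature matrix (8 rows, 2 columns) and target vector.
X = np.arange(16).reshape(8, 2)
y = np.arange(8)

# A fixed random_state makes the shuffle, and so the split, repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)

# Splitting again with the same seed selects the same test rows.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.25, random_state=42)
print(np.array_equal(X_test, X_test_again))
```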

## The model
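Now it's time to create the model. I'll use a RandomForestRegressor with its default settings and check its score on the test set. For a regressor, `score` returns the coefficient of determination (R²), not a classification accuracy.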

In [23]:
model = RandomForestRegressor()
model.fit(trainX, trainY)
predictions = model.predict(testX)
# R² (coefficient of determination) on the test set
model.score(testX, testY)

0.9876227122880562

The model's R² on the test set is about 0.9876.
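R² has no units, so an error expressed in dollars can be easier to interpret. The notebook imports `mean_squared_error` and computes `predictions` but does not use either, so the sketch below shows how a root mean squared error could be computed from them. The numbers are made up for illustration and are not results from the model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up rents and predictions, for illustration only.
actual = np.array([2000.0, 2500.0, 1800.0])
predicted = np.array([2100.0, 2400.0, 1800.0])

# RMSE is the square root of the mean squared error, in dollars.
rmse = np.sqrt(mean_squared_error(actual, predicted))
print(round(float(rmse), 2))
```

In the notebook, the same call would take `testY` and `predictions` as its two arguments.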