# Group Project Report


## Introduction

As first-year students seeking an affordable place to stay during the school year, we find it useful to have predictions of housing prices, both to get a better overview of living expenses in Vancouver and other Canadian cities and to infer rental prices from them. Indeed, higher costs of homeownership affect rental markets, as landlords tend to raise rents to cover the mortgage payments on their rental properties (Hirota et al., 2020). Predicting house prices not only helps students choose which province to live and study in, but also helps people plan their finances around the expected price range and price trend at a given location. With this in mind, this study compares two models, k-nearest neighbours (kNN) regression and linear regression, to discover which provides better predictions of housing prices, using a dataset of housing listings from the 45 most populous cities in Canada. We hypothesize that the kNN regression model will predict housing prices better than the linear regression model.

kNN regression is a non-parametric, local estimator: it predicts from the observations in the neighbourhood of a query point and so produces a flexible fit that follows the distribution of the data. Linear regression, on the other hand, is a global estimator that assumes a linear relationship between the predictors and the response and produces a straight line. We consider a model the better of the two if its Root Mean Squared Error (RMSE) is lower than that of the other model, making it the more suitable model for predicting housing prices.
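
For reference, RMSE on a set of $n$ test listings is conventionally defined as below, where $y_i$ is the observed price and $\hat{y}_i$ is the model's prediction (the symbols are our own notation, not names from the dataset):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because the errors are squared before averaging, a few badly mispredicted expensive homes can dominate the value. RMSE is expressed in the same unit as the response, here CAD.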

## Literature Review

### 1. The Relationship between Purchase Price and Rent

According to economic theory, an asset's price is "the sum of the discounted value of expected future cash flow" (Hirota, Suzuki-Löffelholz, & Udagawa, 2019). Therefore, when we discuss the value of a house, we need to examine its rents, since they are the future cash flows being discounted. Under this theory, the purchase price is not a determinant of future rent. However, Hirota, Suzuki-Löffelholz, and Udagawa applied behavioral economics to examine the relationship between purchase price and rent. They hypothesized that the owner's sunk cost, namely the purchase price, does not affect the rent the owner will later charge, but it turned out that there is an underlying correlation between the two: even though the purchase price is not directly related to the rent offered, property owners are likely to charge a higher rent if the purchase price was high (2019).
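
The quoted definition can be written as a present-value sum. As a sketch, assuming a constant discount rate $r$ and writing $R_t$ for the rent received in period $t$ (both are our own notation and simplifying assumptions, not taken from the cited study):

$$
P_0 = \sum_{t=1}^{\infty} \frac{\mathbb{E}[R_t]}{(1+r)^{t}}
$$

Under this reading, the dependence runs from expected rents to price, which is why the classical theory does not treat the purchase price as a driver of rent.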

### 2. Quality of Life

Dimitrios and Sfakianaki argue that "quality of life" significantly impacts rental prices, which in turn influence housing costs (2014). They explain that a high quality of life, which includes factors like clean water and safe neighborhoods, is itself a type of good (2019). Every individual has to pay for traditional economic goods, such as food, shelter, and clothing. However, "tangible goods" also appear in the market, including clean water and safe neighborhoods (2014). The consumption of such goods implies that the consumer is pursuing a higher quality of living.

## Methods & Results

In [1]:
library(ggplot2)
library(tidymodels)
library(tidyverse)
library(repr)
library(janitor)
library(GGally)
library(readr)
library(dplyr)
library(ISLR)
library(gridExtra)
library(kknn)
set.seed(1234)


We first read the data and wrangled it by keeping only listings with more than zero beds and baths and by removing some mislabelled houses (listings in Saskatoon, Winnipeg, Nanaimo, and Regina that were recorded as being in Ontario). For the linear regression model, we found that `number_baths` was the best single predictor of price, so we used it as the sole predictor. One plausible reason is that a house with many more beds than baths tends to be valued lower than a house with a more balanced number of each.

For the kNN regression model, we used median family income and the number of beds and baths as predictors, since all three are important factors in predicting house value.

Table 1. Clean dataset with number of beds and baths greater than 0

In [None]:
url <- "https://raw.githubusercontent.com/slappyslop/dsci-100-002-033/main/data/HouseListings-Top45Cities-10292023-kaggle.csv"
# Create the target folder first; download.file() fails if it does not exist.
dir.create("data", showWarnings = FALSE)
download.file(url, "data/HouseListings-Top45Cities-10292023-kaggle.csv")
housing_raw <- read_csv("data/HouseListings-Top45Cities-10292023-kaggle.csv") |> clean_names()
# Keep only listings with at least one bed and one bath.
housing_filter <- housing_raw |> filter(number_beds > 0 & number_baths > 0)
# Drop mislabelled rows: these four cities are not in Ontario.
housing_clean <- housing_filter |> filter(!(city %in% c("Saskatoon", "Winnipeg", "Nanaimo", "Regina") & province == "Ontario"))
housing_clean |> head()
housing_clean |> tail()

We then split the cleaned dataset into a training set (75%) and a testing set (25%), stratifying on price.

In [None]:
set.seed(1234)
housing_split <- initial_split(housing_clean, prop = 0.75, strata = price)
training <- training(housing_split)
testing <- testing(housing_split)

Using the training data, we constructed three graphs illustrating the relationship between price and the number of beds, the number of baths, and median family income, respectively, to get a better sense of which factor has the strongest influence on housing prices.

In [None]:
options(repr.plot.height = 4, repr.plot.width = 4)
bed_plot <- training|>
            ggplot(aes(x = number_beds, y = price))+
            geom_point(alpha=0.4)+
            labs(x = "Number of Beds", y = "Price(CAD)", title = "Graph 1. The Relationship between Number of Beds and Price")+
            xlim(c(0, 30))
bath_plot <- training|>
            ggplot(aes(x = number_baths, y = price))+
            geom_point(alpha=0.4)+
            labs(x = "Number of Baths", y = "Price(CAD)", title = "Graph 2. The Relationship between Number of Baths and Price")+
            xlim(c(0, 20))
income_plot <- training|>
            ggplot(aes(x = median_family_income, y = price))+
            geom_point(alpha=0.4)+
            labs(x = "Median Family Income", y = "Price(CAD)", title = "Graph 3. The Relationship between Median Family Income and Price")


In [None]:
options(repr.plot.height = 8, repr.plot.width = 6)
# grid.arrange() draws the combined figure immediately, so no separate print call is needed.
final_plot <- grid.arrange(bed_plot, bath_plot, income_plot, nrow = 3)

### 1. Linear Regression

We first fit a linear regression of price against a single predictor. In addition to the original columns, we created two candidate predictors: the ratio of beds to baths and the sum of beds and baths. To judge which predictor was most suitable, we used a pair plot, shown later in this section.

In [None]:
training <- training |> select(-longitude, -latitude, -address, -city, -province)
testing <- testing |> select(-longitude, -latitude, -address, -city, -province)

First, we removed the columns that we thought would have little effect on the prediction, such as latitude and longitude, as well as non-numeric columns like address, city, and province. We felt that `number_beds`, `number_baths`, and `median_family_income` would be the best predictors for our problem.



As observed from the dataset, some houses have many more beds than baths. We therefore created a column called `ratio` (beds divided by baths) to capture whether the balance of beds and baths is reasonable, and a column called `sum` that adds up the beds and baths as a proxy for the total number of rooms. We expected both columns to be informative about house value: a high ratio means many more beds than baths, which is not ideal for a large family, and a house with more rooms is generally larger and therefore more expensive.


Table 2. Training dataset with new predictors

In [None]:
training_full <- training |> mutate(sum = number_beds + number_baths, ratio = number_beds / number_baths)
training_full |> head(5)
training_full |> tail(5)

Table 3. Testing dataset with new predictors
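
No cell defining `testing_full` appears at this point, although later cells use it. A minimal sketch, assuming it is built with the same `mutate()` call as `training_full`; the small `testing` data frame below is made-up stand-in data so that the snippet runs on its own:

```r
library(dplyr)

# Hypothetical stand-in for the real `testing` split (values are invented).
testing <- data.frame(
  price        = c(500000, 750000, 1200000),
  number_beds  = c(2, 3, 5),
  number_baths = c(1, 2, 4)
)

# Assumption: the testing set gets the same engineered columns as training.
testing_full <- testing |>
  mutate(sum = number_beds + number_baths,
         ratio = number_beds / number_baths)

testing_full
```

Applying the identical transformation to both splits keeps the column sets aligned, which `predict()` needs when the model is evaluated on the testing data.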

We then used the following code to draw a pair plot and see which variables had the strongest linear correlation with price. Unfortunately, this code crashes the notebook kernel, so it was run in RStudio on Shravan's local machine.

```
# Makes the kernel crash!
price_pairplot <- training_full |>
  ggpairs(lower = list(continuous = wrap("points", alpha = 0.4)), diag = list(continuous = "barDiag")) +
  theme(text = element_text(size = 20))
price_pairplot
```


We found that price was most strongly correlated with `sum` and `number_baths` (correlation coefficients of `0.423` and `0.471`, respectively), making them the best linear predictors of price. However, `sum`, `ratio`, and `number_beds` were collinear with `number_baths`, so we did not use them alongside it. The only predictors not collinear with `number_baths` were `median_family_income` and `population`, but their correlation coefficients with price were only `0.053` and `0.075`, so we did not use them either. For the linear regression, we therefore used `number_baths` as the only predictor.
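
Since the pair plot was too heavy for the notebook kernel, a lighter check is a plain correlation matrix. This is a sketch on invented data, not the computation we actually ran; the `toy` data frame and its values are hypothetical:

```r
# Hypothetical toy listings; these are not rows from the real dataset.
toy <- data.frame(
  price        = c(300000, 450000, 600000, 900000, 1500000),
  number_beds  = c(2, 3, 3, 4, 6),
  number_baths = c(1, 2, 2, 3, 5)
)
toy$sum   <- toy$number_beds + toy$number_baths
toy$ratio <- toy$number_beds / toy$number_baths

# Pearson correlation of every numeric column with every other column.
cor_matrix <- cor(toy)

# Correlation of each candidate predictor with price, strongest first.
price_cor <- sort(cor_matrix["price", colnames(cor_matrix) != "price"],
                  decreasing = TRUE)
round(price_cor, 3)
```

The off-diagonal entries between predictors (for example `cor_matrix["sum", "number_baths"]`) are the ones to inspect for collinearity.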

We considered taking the logarithm of price, as it would produce more interpretable visualizations and a lower RMSE, but chose to predict price directly. This is because a small difference in RMSE for `log(price)` corresponds to a large difference in prediction error for `price`. In addition, a prediction of `log(price)` that is a given amount above the true value produces a much larger error in `price` than a prediction the same amount below the true value. This would make the model hard to interpret and evaluate.
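
The asymmetry can be seen with a small arithmetic sketch. The true price and the size of the log-scale error are made-up numbers chosen only for illustration:

```r
# Hypothetical true price and an equal-sized error on the natural-log scale.
true_price <- 1000000
log_error  <- 0.5

# Back-transform predictions that are 0.5 above and 0.5 below log(true_price).
pred_over  <- exp(log(true_price) + log_error)
pred_under <- exp(log(true_price) - log_error)

# Errors on the original CAD scale are not symmetric.
error_over  <- pred_over - true_price    # error when overpredicting
error_under <- true_price - pred_under   # error when underpredicting

c(over = error_over, under = error_under)
```

With these numbers the overprediction is off by roughly 649,000 CAD, while the underprediction is off by roughly 393,000 CAD, even though both are equally far from the truth on the log scale.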

In [None]:
lm_spec <- linear_reg() |> set_engine("lm") |> set_mode("regression")
lm_recipe <- recipe(price ~ number_baths, data = training_full)
lm_fit <-  workflow() |> add_recipe(lm_recipe) |> 
  add_model(lm_spec) |> 
  fit(data = training_full)
lm_fit


In [None]:
lm_test_results <- lm_fit |>
  predict(testing_full) |>
  bind_cols(testing_full) |>
  metrics(truth = price, estimate = .pred) |> filter(.metric == "rmse")
lm_test_results

This approach gives us an RMSE value of `809492.1`.
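
As a cross-check on what the `rmse` metric reports, the same quantity can be computed by hand, and the fitted intercept and slope can be read from the model with `coef()` instead of being typed in manually. This sketch uses base R's `lm()` on invented data rather than our tidymodels workflow; `toy_train`, `toy_test`, and their values are hypothetical:

```r
# Hypothetical training and testing listings (values are invented).
toy_train <- data.frame(
  number_baths = c(1, 2, 2, 3, 4),
  price        = c(350000, 700000, 800000, 1100000, 1500000)
)
toy_test <- data.frame(
  number_baths = c(1, 3, 5),
  price        = c(400000, 1000000, 2000000)
)

# Fit price ~ number_baths on the training rows only.
toy_fit <- lm(price ~ number_baths, data = toy_train)

# Intercept and slope of the fitted line.
toy_coefs <- coef(toy_fit)

# Predict the held-out rows and compute the root mean squared error.
toy_pred <- predict(toy_fit, newdata = toy_test)
toy_rmse <- sqrt(mean((toy_test$price - toy_pred)^2))

toy_coefs
toy_rmse
```

Reading the coefficients from the fit keeps any plotted regression line in sync with the model if the training data or seed changes.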

In [None]:
options(repr.plot.height = 4, repr.plot.width = 4)
lm_plot <- ggplot(testing_full, aes(x = number_baths , y = price)) + 
  geom_point()+ 
  geom_abline(intercept = -17159, slope =  374799 ,linetype = "dashed", color = "blue", size = 1) +
  labs(y = "Home price (CAD)", x = "Number of baths in home") +
  ggtitle("Graph 4. Regression Visualization")+
  theme(text = element_text(size = 12))
lm_plot

From the scatterplot above, we can see that the points cluster in the left corner, and there is an outlier on the right at x = 60. We decided to take a closer look at the larger cluster.

In [None]:
lm_plot_2 <- ggplot(testing_full, aes(x = number_baths , y = price)) + 
  geom_point()+ 
  geom_abline(intercept = -17159, slope =  374799 , linetype = "dashed", color = "blue", size = 1) +
  labs(y = "Home price (CAD)", x = "Number of baths in home") + xlim(0, 15)+
  ggtitle("Graph 5. Regression Visualization")

lm_plot_2

### 2. kNN Regression

We then wanted to test how a kNN regression model would fare on the same task. Here we decided to use all the columns as predictors, so that the algorithm can select the set of houses closest to the one being predicted. It is important to note that both `population` and `median_family_income` are essentially numeric stand-ins for `city`, which we removed from the training set earlier. However, after a lot of testing, we found the best results when both were present in the dataset. We think this is because they help the kNN model select comparison houses from the same city.
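
To make the mechanism concrete, here is a minimal hand-rolled sketch of kNN regression with equal ("rectangular") weights in base R. It is not the `kknn` implementation; the function `knn_predict` and the toy data are hypothetical names and values introduced only for illustration:

```r
# Hypothetical toy listings (values are invented).
toy <- data.frame(
  number_beds  = c(2, 3, 3, 4, 5, 6),
  number_baths = c(1, 1, 2, 3, 3, 5),
  price        = c(300000, 380000, 500000, 800000, 900000, 1600000)
)

# Predict the price of one new listing as the mean price of its k nearest
# training listings, using Euclidean distance on standardized predictors.
knn_predict <- function(train, new_row, predictors, k) {
  x <- as.matrix(train[, predictors])
  centers <- colMeans(x)
  scales  <- apply(x, 2, sd)
  # Standardize the training predictors and the new listing the same way.
  x_std   <- scale(x, center = centers, scale = scales)
  new_std <- (unlist(new_row[predictors]) - centers) / scales
  # Distance from the new listing to every training listing.
  dists   <- sqrt(rowSums(sweep(x_std, 2, new_std)^2))
  nearest <- order(dists)[seq_len(k)]
  mean(train$price[nearest])
}

new_listing <- data.frame(number_beds = 3, number_baths = 2)
knn_predict(toy, new_listing, c("number_beds", "number_baths"), k = 2)
```

With `k = 2` the two nearest rows are the exact 3-bed, 2-bath match and the 3-bed, 1-bath listing, so the prediction is their mean price, 440000. Standardizing first matters because a predictor measured in large units, such as income or population, would otherwise dominate the distance.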

In [None]:
head(training_full)
tail(training_full)

We set the seed, created the kNN model, and used 5-fold cross-validation to find the best value of k.

In [None]:
set.seed(1234)
housing_recipe <- recipe(price ~ ., data = training_full) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
housing_recipe
housing_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

housing_vfold <- vfold_cv(training_full, v = 5, strata = price)

housing_workflow <- workflow() |>
  add_recipe(housing_recipe) |>
  add_model(housing_spec)

tuned_housing <- housing_workflow |>
  tune_grid(resamples = housing_vfold, grid = tibble(neighbors = seq(from = 1, to = 41, by = 5))) |>
  collect_metrics()|>
  filter(.metric == "rmse")

tuned_housing

Table 4. Number of neighbours and their corresponding RMSE values

In [None]:
# Set the plot size before drawing so that it applies to this figure.
options(repr.plot.height = 6, repr.plot.width = 7.5)
ggplot(tuned_housing, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of neighbours", y = "RMSE") +
  ggtitle("Graph 6. Relationship between RMSE and number of neighbours")
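
A natural next step is to pick the `neighbors` value with the lowest cross-validated RMSE. A minimal sketch, assuming a results table shaped like `tuned_housing` (with columns `neighbors` and `mean`); the numbers in `toy_tuned` are invented placeholders, not our results:

```r
# Hypothetical stand-in for `tuned_housing`; the RMSE values are invented.
toy_tuned <- data.frame(
  neighbors = seq(from = 1, to = 41, by = 5),
  mean      = c(980000, 870000, 845000, 850000, 858000,
                866000, 871000, 879000, 884000)
)

# Row with the smallest mean RMSE across folds.
best_row <- toy_tuned[which.min(toy_tuned$mean), ]
best_k   <- best_row$neighbors
best_k
```

In this toy table the minimum falls at `neighbors = 11`. The chosen value would then be used to refit the model on the full training set before evaluating it once on the testing set.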