## Group Project Report


In [3]:
library(ggplot2)
library(tidymodels)
library(tidyverse)
library(repr)
library(janitor)
library(GGally)
library(readr)
library(dplyr)
library(ISLR)
library(gridExtra)
library(kknn)
set.seed(1234)


## Introduction

As first-year students looking for an affordable place to live during the school year, we find it useful to have predictions of housing prices, both in Vancouver and in other Canadian cities, to get a better overview of living expenses; rental prices can then be inferred from them. With this need in mind, this study compares two models, k-nearest neighbours (kNN) regression and linear regression, to find out which one predicts housing prices better, using a dataset of house listings from the 45 most populous cities in Canada.

kNN regression is a non-parametric, local estimator: it predicts from the observations in the neighbourhood of a query point, and so produces a flexible curve that follows the distribution of the data. Linear regression, on the other hand, is a global estimator that assumes a linear relationship between the predictors and the response and produces a straight line describing that relationship.
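To make the local versus global distinction concrete, here is a minimal base R sketch. The toy data and the helper `knn_predict()` are hypothetical and are not part of the analysis in this report; the sketch only assumes one numeric predictor and an unweighted average of the k nearest points.

```r
# Toy data (hypothetical, not the housing listings): a curved relationship
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- x^2

# kNN regression: average the responses of the k closest training points
knn_predict <- function(x_new, x_train, y_train, k) {
  nearest <- order(abs(x_train - x_new))[seq_len(k)]
  mean(y_train[nearest])
}

# Linear regression: one straight line fitted to all points at once
line_fit <- lm(y ~ x)

# Predictions at x = 2 from the local and the global estimator
knn_at_2 <- knn_predict(2, x, y, k = 3)
lm_at_2 <- unname(predict(line_fit, newdata = data.frame(x = 2)))

print(c(knn = knn_at_2, lm = lm_at_2))
```

The kNN prediction only depends on the three points closest to `x = 2`, while the straight line is pulled by every point, including those far away.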

We expect the kNN regression model to predict housing prices better than the linear regression model.

## Methods & Results

We first read the data and wrangled it by keeping only listings with more than 0 beds and more than 0 baths. For the linear regression model, we found that `number_baths` was the best single predictor, so we used it to predict the housing price. This makes sense, since a house with a high ratio of beds to baths would be valued lower than one with a roughly equal number of beds and baths. Although log-transforming the price would produce more interpretable visualizations and a lower RMSE, we kept the price on its original scale: an RMSE computed on the log scale has to be exponentiated to get back to dollars, so a small difference in the log-scale RMSE corresponds to a large difference in the prediction error.
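The point about log-scale errors can be illustrated with a small sketch. The log-scale RMSE values and the reference price below are hypothetical, chosen only for illustration; they are not results from this report.

```r
# Hypothetical log-scale RMSE values (natural log of price); not results from this report
log_rmse <- c(0.4, 0.5)

# exp(RMSE) is roughly the typical multiplicative error on the original scale
mult_error <- exp(log_rmse)

# Approximate dollar error for a hypothetical home priced at 1,000,000 CAD
reference_price <- 1e6
dollar_error <- reference_price * (mult_error - 1)

# Gap in dollars caused by a 0.1 difference in log-scale RMSE
dollar_gap <- diff(dollar_error)

print(round(dollar_error))
print(round(dollar_gap))
```

Under these assumptions, a difference of 0.1 in the log-scale RMSE already translates into a six-figure difference in dollars for a home at the reference price.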

For the kNN regression model, we used the median family income, the number of beds, and the number of baths as predictors, since all three are important factors in predicting a home's value.

Table 1. Clean dataset with number of beds and baths greater than 0

In [4]:
url <- "https://raw.githubusercontent.com/slappyslop/dsci-100-002-033/main/data/HouseListings-Top45Cities-10292023-kaggle.csv"
# Create the target folder first; download.file() fails if it does not exist
dir.create("data", showWarnings = FALSE)
download.file(url, "data/HouseListings-Top45Cities-10292023-kaggle.csv")
housing_raw <- read_csv("data/HouseListings-Top45Cities-10292023-kaggle.csv") |> clean_names()
# Keep listings with at least one bed and one bath
housing_filter <- housing_raw |> filter(number_beds > 0 & number_baths > 0)
# Drop rows where these non-Ontario cities are labelled as being in Ontario
housing_clean <- housing_filter |> filter(!(city %in% c("Saskatoon", "Winnipeg", "Nanaimo", "Regina") & province == "Ontario"))
housing_clean

Rows: 35768 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): City, Address, Province
dbl (7): Price, Number_Beds, Number_Baths, Population, Latitude, Longitude, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| city | price | address | number_beds | number_baths | province | population | latitude | longitude | median_family_income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Toronto | 779900 | #318 -20 SOUTHPORT ST | 3 | 2 | Ontario | 5647656 | 43.7417 | -79.3733 | 97000 |
| Toronto | 799999 | #818 -60 SOUTHPORT ST | 3 | 1 | Ontario | 5647656 | 43.7417 | -79.3733 | 97000 |
| Toronto | 799900 | #714 -859 THE QUEENSWAY | 2 | 2 | Ontario | 5647656 | 43.7417 | -79.3733 | 97000 |
| Toronto | 1200000 | 275 MORTIMER AVE | 4 | 2 | Ontario | 5647656 | 43.7417 | -79.3733 | 97000 |
| Toronto | 668800 | #420 -388 RICHMOND ST | 1 | 1 | Ontario | 5647656 | 43.7417 | -79.3733 | 97000 |
| Toronto | 669900 | #817 -151 DAN LECKIE WAY | 2 | 1 | Ontario | 5647656 | 43.7417 | -79.3733 | 97000 |
| Toronto | 699000 | #1107 -438 KING ST W | 2 | 2 | Ontario | 5647656 | 43.7417 | -79.3733 | 97000 |
| Toronto | 978000 | #2708 -20 EDWARD ST | 3 | 2 | Ontario | 5647656 | 43.7417 | -79.3733 | 97000 |
| Toronto | 958000 | #4616 -386 YONGE ST | 2 | 2 | Ontario | 5647656 | 43.7417 | -79.3733 | 97000 |
| Toronto | 1899000 | #2713 -155 YORKVILLE AVE | 2 | 3 | Ontario | 5647656 | 43.7417 | -79.3733 | 97000 |


We then split the cleaned dataset into a training set (75%) and a testing set (25%), stratified by price.

In [5]:
set.seed(1234)
housing_split <- initial_split(housing_clean, prop = 0.75, strata = price)
training <- training(housing_split)
testing <- testing(housing_split)

In [4]:
options(repr.plot.height = 4, repr.plot.width = 4)
bed_plot <- training|>
            ggplot(aes(x = number_beds, y = price))+
            geom_point(alpha=0.4)+
            labs(x = "Number of Beds", y = "Price(CAD)", title = "The Relationship between Number of Beds and Price")+
            xlim(c(0, 30))
bath_plot <- training|>
            ggplot(aes(x = number_baths, y = price))+
            geom_point(alpha=0.4)+
            labs(x = "Number of Baths", y = "Price(CAD)", title = "The Relationship between Number of Baths and Price")+
            xlim(c(0, 20))
income_plot <- training|>
            ggplot(aes(x = median_family_income, y = price))+
            geom_point(alpha=0.4)+
            labs(x = "Median Family Income", y = "Price(CAD)", title = "The Relationship between Median Family Income and Price")

In [None]:
options(repr.plot.height = 6, repr.plot.width = 4)
final_plot <- grid.arrange (bed_plot, bath_plot, income_plot, nrow = 3)

“Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).”
“Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).”


### 1. Linear Regression

We first attempted a linear regression of price against a single variable. In addition to the original columns, we created two predictors of our own: the ratio of beds to baths and the sum of beds and baths. To judge how feasible each candidate was, we drew a pair plot with the following code. Unfortunately, this code crashes the kernel, so it was run in RStudio on a local machine (Shravan's).

In [6]:
training <- training |> select(-longitude, -latitude, -address, -city, -province)
testing <- testing |> select(-longitude, -latitude, -address, -city, -province)



In [7]:
training_full <- training |> mutate(sum = number_beds + number_baths, ratio = number_beds / number_baths)
testing_full <- testing |> mutate(sum = number_beds + number_baths, ratio = number_beds / number_baths)




# NOTE: this pair plot crashes the Jupyter kernel; run it locally (e.g. in RStudio)
price_pairplot <- training_full|> 
  ggpairs(
    lower = list(continuous = wrap('points', alpha = 0.4)),
    diag = list(continuous = "barDiag")
  ) +
  theme(text = element_text(size = 20))
price_pairplot

The pair plot showed that `sum` and `number_baths` had the highest correlation coefficients with price (0.423 and 0.471, respectively), making them the best linear predictors. The only other predictor that was not collinear with `number_baths` was `median_family_income`, but its coefficient was only 0.053, so we decided to use `number_baths` as the sole predictor for the linear regression.
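As a rough illustration of why derived predictors such as `sum` tend to be collinear with `number_baths`, the sketch below computes correlations on a tiny hypothetical data frame. The values are made up for illustration and are not taken from the housing dataset.

```r
# Hypothetical listings (made-up values, not the housing data)
toy <- data.frame(
  number_beds  = c(1, 2, 3, 4, 5),
  number_baths = c(1, 1, 2, 2, 3)
)

# Derived predictors, built the same way as in the report
toy$sum <- toy$number_beds + toy$number_baths
toy$ratio <- toy$number_beds / toy$number_baths

# Pairwise Pearson correlations between all predictors
cor_matrix <- cor(toy)

# Because `sum` contains `number_baths`, the two are strongly correlated
cor_sum_baths <- cor_matrix["sum", "number_baths"]
print(round(cor_matrix, 3))
```

When two predictors are this strongly correlated, adding both to a linear model contributes little new information, which is one reason to keep only one of them.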

In [None]:
lm_spec <- linear_reg() |> set_engine("lm") |> set_mode("regression")
lm_recipe <- recipe(price ~ number_baths, data = training_full)
lm_fit <-  workflow() |> add_recipe(lm_recipe) |> 
  add_model(lm_spec) |> 
  fit(data = training_full)
lm_fit


In [None]:
lm_test_results <- lm_fit |>
  predict(testing_full) |>
  bind_cols(testing_full) |>
  metrics(truth = price, estimate = .pred)
lm_test_results

This approach gives an RMSE of `809292.1` (in CAD) on the testing set.
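For reference, the RMSE is the square root of the mean squared difference between the true and predicted prices. The helper `rmse()` and the three prices below are hypothetical and only sketch the calculation; the report itself relies on the `metrics()` function for this.

```r
# Root mean squared error: square the errors, average them, take the square root
rmse <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Hypothetical true and predicted prices (not taken from the housing data)
truth    <- c(500000, 750000, 1200000)
estimate <- c(450000, 800000, 1000000)

example_rmse <- rmse(truth, estimate)
print(example_rmse)
```

Because the errors are squared before averaging, the single 200,000 miss in this toy example dominates the result, which is also why a few very expensive listings can inflate the RMSE.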

In [None]:
lm_plot <- ggplot(testing_full, aes(x = number_baths , y = price)) + 
  geom_point()+ 
  geom_abline(intercept = -17159, slope =  374799 ,linetype = "dashed", color = "blue", linewidth = 1) +
  labs(y = "Home price (dollars)", x = "Number of baths in home") +
  ggtitle("Graph 1. Regression Visualization")
lm_plot

A closer look at the region where most of the data lies:

In [None]:
lm_plot_2 <- ggplot(testing_full, aes(x = number_baths , y = price)) + 
  geom_point()+ 
  geom_abline(intercept = -17159, slope =  374799 ,linetype = "dashed", color = "blue", linewidth = 1) +
  labs(y = "Home price (dollars)", x = "Number of baths in home") + xlim(0, 15)+
  ggtitle("Graph 1 (zoomed). Regression Visualization")

lm_plot_2

### 2. kNN Regression

We then fitted a kNN regression model to predict the housing price from the median family income and the number of beds and baths of each listing. We first tried numbers of neighbours from 1 to 100 in steps of 10 to see how the mean RMSE changes.
We first set up a recipe that predicts the price from the median family income, the number of beds, and the number of baths in the training data. We then specified a kNN regression model with the number of neighbours left to be tuned, and combined the recipe and the model in a workflow.
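The idea behind tuning the number of neighbours can be sketched in base R without any modelling packages. Everything below is hypothetical: the simulated data, the helper `knn_predict()`, and the grid of k values are illustrative stand-ins for the tidymodels workflow, not a reproduction of it.

```r
set.seed(1234)

# Simulated toy data standing in for the housing listings (hypothetical)
n <- 100
x <- runif(n, 0, 10)
y <- 3 * x + rnorm(n, sd = 2)

# Unweighted kNN regression for a vector of query points
knn_predict <- function(x_new, x_train, y_train, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]
    mean(y_train[nearest])
  })
}

# Assign every observation to one of 5 folds at random
folds <- sample(rep(1:5, length.out = n))
k_grid <- seq(from = 1, to = 41, by = 10)

# For each k: hold out one fold at a time, predict it, and average the fold RMSEs
cv_rmse <- sapply(k_grid, function(k) {
  fold_rmse <- sapply(1:5, function(f) {
    held_out <- folds == f
    pred <- knn_predict(x[held_out], x[!held_out], y[!held_out], k)
    sqrt(mean((y[held_out] - pred)^2))
  })
  mean(fold_rmse)
})

# The k with the lowest cross-validated RMSE is the one we would keep
best_k <- k_grid[which.min(cv_rmse)]
print(data.frame(neighbors = k_grid, mean_rmse = cv_rmse))
```

A coarse grid like this one is a cheap first pass; a finer grid around the best value can then be searched in a second step.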

In [8]:
head(training_full)
tail(training_full)

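| price | number_beds | number_baths | population | median_family_income | sum | ratio |
| --- | --- | --- | --- | --- | --- | --- |
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 439000 | 2 | 1 | 5647656 | 97000 | 3 | 2.0 |
| 438000 | 2 | 1 | 5647656 | 97000 | 3 | 2.0 |
| 448800 | 3 | 2 | 5647656 | 97000 | 5 | 1.5 |
| 468500 | 1 | 1 | 5647656 | 97000 | 2 | 1.0 |
| 45000 | 3 | 5 | 5647656 | 97000 | 8 | 0.6 |
| 399000 | 1 | 1 | 5647656 | 97000 | 2 | 1.0 |


| price | number_beds | number_baths | population | median_family_income | sum | ratio |
| --- | --- | --- | --- | --- | --- | --- |
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1750000 | 3 | 4 | 733156 | 76500 | 7 | 0.75 |
| 1250000 | 4 | 3 | 431479 | 86753 | 7 | 1.333333 |
| 1500000 | 3 | 4 | 431479 | 86753 | 7 | 0.75 |
| 1249900 | 9 | 3 | 431479 | 86753 | 12 | 3.0 |
| 6995000 | 4 | 5 | 431479 | 86753 | 9 | 0.8 |
| 1799900 | 5 | 3 | 431479 | 86753 | 8 | 1.666667 |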


In [None]:
set.seed(1234)
# Predict price from the three predictors described above, standardized so
# that no single predictor dominates the distance calculation
housing_recipe <- recipe(price ~ median_family_income + number_beds + number_baths, data = training_full) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
housing_recipe
housing_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

housing_vfold <- vfold_cv(training_full, v = 5, strata = price)

housing_workflow <- workflow() |>
  add_recipe(housing_recipe) |>
  add_model(housing_spec)

# Coarse search: k from 1 to 100 in steps of 10
tuned_housing <- housing_workflow |>
  tune_grid(resamples = housing_vfold, grid = tibble(neighbors = seq(from = 1, to = 100, by = 10))) |>
  collect_metrics() |>
  filter(.metric == "rmse")

tuned_housing

In [None]:
ggplot(tuned_housing, aes(x = neighbors, y = mean)) + geom_point() + geom_line() +
ggtitle("Graph 2. Relationship between mean RMSE and number of neighbours")

We want to take a closer look at the values of k around 30:

In [None]:
tuned_housing_2 <- housing_workflow |>
  tune_grid(resamples = housing_vfold, grid = tibble(neighbors = seq(from = 20, to = 41, by = 2))) |>
  collect_metrics()|>
  filter(.metric == "rmse")

In [None]:
ggplot(tuned_housing_2, aes(x = neighbors, y = mean)) + geom_point() + geom_line() +
ggtitle("Graph 3. Relationship between mean RMSE and number of neighbours")

In [None]:
housing_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 30) |>
  set_engine("kknn") |>
  set_mode("regression")

housing_fit <- workflow() |>
  add_recipe(housing_recipe) |>
  add_model(housing_best_spec) |>
  fit(data = training_full)

housing_summary <- housing_fit |>
  predict(testing_full) |>
  bind_cols(testing_full) |>
  metrics(truth = price, estimate = .pred) |>
  filter(.metric == 'rmse')
housing_summary

There is a clear issue here. The kNN algorithm simply looks for listings with a similar city (represented by `median_family_income`), `number_beds`, and `number_baths`, and averages the prices of the k most similar listings. Of course, a 2-bedroom in a downtown Vancouver skyscraper would have a very different price from a 2-bedroom in East Van. Our dataset does not capture this difference, which is a major factor in the large errors.
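This limitation can be sketched with a handful of made-up listings. The data frame below is hypothetical: four listings share the same predictor values but have very different prices, as a downtown unit and an East Van unit might.

```r
# Hypothetical listings: identical predictors, very different prices
listings <- data.frame(
  median_family_income = c(90000, 90000, 90000, 90000),
  number_beds          = c(2, 2, 2, 2),
  number_baths         = c(1, 1, 1, 1),
  price                = c(1500000, 1400000, 600000, 650000)
)

# A query listing with the same predictor values
query <- c(median_family_income = 90000, number_beds = 2, number_baths = 1)

# Euclidean distance from the query to every listing (all zero here)
predictors <- as.matrix(listings[, names(query)])
distances <- sqrt(rowSums(sweep(predictors, 2, query)^2))

# With k = 4, kNN returns the plain average of the four prices
nearest <- order(distances)[1:4]
knn_prediction <- mean(listings$price[nearest])

# Every listing is far from that average, whichever one we try to predict
errors <- listings$price - knn_prediction
print(knn_prediction)
print(errors)
```

Since the predictors cannot tell these listings apart, no choice of k can recover the price gap; only an additional predictor describing the location within the city could.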

## Discussion

The results suggest that the kNN model is the better of the two models for predicting home values, since it uses several predictors that directly affect the price: the median family income together with the number of beds and baths.
