## 1. Introduction
Our dataset is a collection of housing listings from the 45 most populous cities in Canada. Because we are first-year students looking for housing ourselves, we found it interesting and relevant to study this data and identify patterns in it. Our goal is to build a regression model that predicts listing price using the number of bedrooms, the number of bathrooms, and the city as predictors.

## 2. Preliminary Data Analysis

In [13]:
library(tidyverse)
library(repr)
library(tidymodels)
library(janitor)
options(repr.matrix.max.rows = 6)

We load our data into a tibble and standardize the column names with `clean_names()`.

In [14]:
housing_raw <- read_csv("data/HouseListings-Top45Cities-10292023-kaggle.csv") |> clean_names()
housing_raw

Rows: 35768 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): City, Address, Province
dbl (7): Price, Number_Beds, Number_Baths, Population, Latitude, Longitude, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


city,price,address,number_beds,number_baths,province,population,latitude,longitude,median_family_income
<chr>,<dbl>,<chr>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Toronto,779900,#318 -20 SOUTHPORT ST,3,2,Ontario,5647656,43.7417,-79.3733,97000
Toronto,799999,#818 -60 SOUTHPORT ST,3,1,Ontario,5647656,43.7417,-79.3733,97000
Toronto,799900,#714 -859 THE QUEENSWAY,2,2,Ontario,5647656,43.7417,-79.3733,97000
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
Halifax,419900,212 60 Walter Havill Drive,2,2,Nova Scotia,431479,44.8857,63.1005,86753
Halifax,949900,10 Idlewylde Road,3,1,Nova Scotia,431479,44.8857,63.1005,86753
Halifax,592900,208 2842-2856 Gottingen,2,1,Nova Scotia,431479,44.8857,63.1005,86753


This tibble is already tidy, so we can move on to summarizing and visualizing the data.
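One possible first summary is a per-city table of listing counts and typical prices. The sketch below is only an illustration: `toy_housing` is a hypothetical tibble built from a few rows of the preview above, and the choice of summary statistics is our assumption rather than something the analysis has settled on.

```r
library(tidyverse)

# Hypothetical stand-in for the loaded data: a few rows copied from the preview
toy_housing <- tibble(
  city = c("Toronto", "Toronto", "Toronto", "Halifax", "Halifax"),
  price = c(779900, 799999, 799900, 419900, 949900),
  number_beds = c(3, 3, 2, 2, 3)
)

# One row per city: number of listings, median price, and mean bedroom count
city_summary <- toy_housing |>
  group_by(city) |>
  summarise(
    n_listings = n(),
    median_price = median(price),
    mean_beds = mean(number_beds)
  ) |>
  arrange(desc(median_price))

city_summary
```

The median is used here instead of the mean because listing prices tend to be right-skewed, but either could be reported.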

Before modelling, we check for listings that report 0 bedrooms or 0 bathrooms.

In [20]:

filter(housing_raw, number_beds == 0 | number_baths == 0)

city,price,address,number_beds,number_baths,province,population,latitude,longitude,median_family_income
<chr>,<dbl>,<chr>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Toronto,548000,#2503 -99 HARBOUR SQ,0,1,Ontario,5647656,43.7417,-79.3733,97000
Toronto,459900,#2311 -170 SUMACH ST N,0,1,Ontario,5647656,43.7417,-79.3733,97000
Toronto,499900,#202 -1030 KING ST W,0,1,Ontario,5647656,43.7417,-79.3733,97000
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
Saskatoon,199000,408-404 C AVENUE S,0,1,Saskatchewan,266141,52.1333,-106.6833,89000
Saskatoon,350000,428 F AVENUE S,0,1,Saskatchewan,266141,52.1333,-106.6833,89000
Saskatoon,84900,19-400 4th AVENUE N,0,1,Saskatchewan,266141,52.1333,-106.6833,89000


We treat all of these rows as data errors (a house listing should not have 0 bedrooms or 0 bathrooms), so we remove them to keep them from skewing the model.

In [25]:
housing_filter <- housing_raw |> filter(number_beds > 0 & number_baths > 0)
housing_filter

city,price,address,number_beds,number_baths,province,population,latitude,longitude,median_family_income
<chr>,<dbl>,<chr>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Toronto,779900,#318 -20 SOUTHPORT ST,3,2,Ontario,5647656,43.7417,-79.3733,97000
Toronto,799999,#818 -60 SOUTHPORT ST,3,1,Ontario,5647656,43.7417,-79.3733,97000
Toronto,799900,#714 -859 THE QUEENSWAY,2,2,Ontario,5647656,43.7417,-79.3733,97000
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
Halifax,419900,212 60 Walter Havill Drive,2,2,Nova Scotia,431479,44.8857,63.1005,86753
Halifax,949900,10 Idlewylde Road,3,1,Nova Scotia,431479,44.8857,63.1005,86753
Halifax,592900,208 2842-2856 Gottingen,2,1,Nova Scotia,431479,44.8857,63.1005,86753
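It can be useful to record how many rows a filter like the one above removes, since dropping a large share of the data would itself affect the model. The sketch below shows one way to do that on a small hypothetical tibble, `toy_raw`, whose values are made up for illustration; the same two lines of arithmetic would apply to `housing_raw` and `housing_filter`.

```r
library(tidyverse)

# Hypothetical stand-in for `housing_raw` (made-up rows, same column names)
toy_raw <- tibble(
  city = c("Toronto", "Toronto", "Saskatoon", "Halifax"),
  price = c(548000, 779900, 84900, 419900),
  number_beds = c(0, 3, 0, 2),
  number_baths = c(1, 2, 1, 2)
)

# Same condition as the filter applied to the real data
toy_filter <- toy_raw |> filter(number_beds > 0 & number_baths > 0)

# Number of rows dropped, and the share of the raw data they represent
n_dropped <- nrow(toy_raw) - nrow(toy_filter)
share_dropped <- n_dropped / nrow(toy_raw)

n_dropped      # 2 of the 4 toy rows have 0 bedrooms
share_dropped  # 0.5
```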


## 3. Methods

Address, latitude, and longitude won't help us make a prediction, so we remove them. It is also important to note that `median_family_income` and `population` take a single value per city, so they serve only to attach a numeric value to each city.
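The claim that these two columns are constant within a city can be checked directly by counting distinct values per city. The following is a minimal sketch on a hypothetical tibble, `toy_housing`, built from values shown in the preview; running the same pipeline on the full data would confirm (or refute) the claim for all 45 cities.

```r
library(tidyverse)

# Hypothetical stand-in for the housing data, using values from the preview
toy_housing <- tibble(
  city = c("Toronto", "Toronto", "Halifax", "Halifax"),
  population = c(5647656, 5647656, 431479, 431479),
  median_family_income = c(97000, 97000, 86753, 86753)
)

# Count the distinct values of each city-level column within every city
city_check <- toy_housing |>
  group_by(city) |>
  summarise(
    n_population = n_distinct(population),
    n_income = n_distinct(median_family_income)
  )

# TRUE when every city has exactly one population and one income value
all_constant <- all(city_check$n_population == 1 & city_check$n_income == 1)
all_constant
```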

In [26]:
housing <- housing_filter |> select(-address, -latitude, -longitude)
housing

city,price,number_beds,number_baths,province,population,median_family_income
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>
Toronto,779900,3,2,Ontario,5647656,97000
Toronto,799999,3,1,Ontario,5647656,97000
Toronto,799900,2,2,Ontario,5647656,97000
⋮,⋮,⋮,⋮,⋮,⋮,⋮
Halifax,419900,2,2,Nova Scotia,431479,86753
Halifax,949900,3,1,Nova Scotia,431479,86753
Halifax,592900,2,1,Nova Scotia,431479,86753
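With the columns reduced, the regression described in the introduction could be set up along the following lines. This is only a sketch under stated assumptions: the proposal does not say which kind of regression will be used, so ordinary linear regression via tidymodels is assumed here, and `toy_housing` is a simulated tibble with made-up prices, not the real data. The 75/25 split is likewise an assumption.

```r
library(tidymodels)

set.seed(1)

# Simulated stand-in for `housing` (all values are made up for illustration)
toy_housing <- tibble(
  city = rep(c("Toronto", "Halifax", "Saskatoon"), each = 20),
  number_beds = sample(1:5, 60, replace = TRUE),
  number_baths = sample(1:3, 60, replace = TRUE)
) |>
  mutate(
    price = 200000 + 100000 * number_beds + 50000 * number_baths +
      if_else(city == "Toronto", 300000, 0) + rnorm(60, sd = 20000)
  )

# Hold out 25% of rows for testing, stratified so every city is in both sets
housing_split <- initial_split(toy_housing, prop = 0.75, strata = city)
housing_train <- training(housing_split)
housing_test <- testing(housing_split)

# Assumed model: ordinary least squares linear regression
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# Predict price from beds, baths, and city (city is treated as categorical)
housing_recipe <- recipe(price ~ number_beds + number_baths + city,
                         data = housing_train)

housing_fit <- workflow() |>
  add_recipe(housing_recipe) |>
  add_model(lm_spec) |>
  fit(data = housing_train)

# Evaluate on the held-out rows (reports RMSE, R-squared, and MAE)
housing_metrics <- housing_fit |>
  predict(housing_test) |>
  bind_cols(housing_test) |>
  metrics(truth = price, estimate = .pred)

housing_metrics
```

Swapping `linear_reg()` for another regression specification would leave the rest of the workflow unchanged, which is the main reason for structuring it this way.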


## 4. Expected findings and significance