In [None]:
library(tidyverse)
library(data.table)
library(plotly) # for interactive plotting
library(DT) # for interactive tabulation
library(broom) # for tidy statistical summaries
library(caret) # for regression performance measures

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns shown when tables are printed

In [None]:
datapath <- "~/data_ad454"

# Simple Linear Regression

Let's first import the realty dataset:

In [None]:
realty_data <- readRDS(sprintf("%s/rds/02_01_realty_data.rds", datapath))

In [None]:
realty_data

See which variables are of factor type and what the levels of each are:

In [None]:
realty_data %>% keep(is.factor) %>% lapply(levels)

And the frequencies of those levels:

In [None]:
realty_data %>% keep(is.factor) %>% summary

Let's see the numeric variables:

In [None]:
realty_data %>% keep(is.numeric) %>% names

And statistical summaries of numeric columns:

In [None]:
realty_data %>% keep(is.numeric) %>% summary()

And the same summaries in a tidier format, rounded to two decimals:

In [None]:
realty_data %>% keep(is.numeric) %>% broom::tidy() %>% mutate_if(is.numeric, round, 2) %>%
select(column, n, mean, sd, median, min, max)
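Recent versions of broom warn that data frame tidiers are deprecated, so `broom::tidy()` may not accept a data frame in every installation. As a hedged alternative, a similar table can be built with base R alone. The sketch below runs on a small made-up data frame (`toy`, with invented values) rather than on `realty_data`, and `summarise_numeric` is an illustrative helper name, not a function from any package.

```r
# Hypothetical toy data: column names mimic the realty dataset, values are invented
toy <- data.frame(
  brut_metrekare = c(80, 95, 110, 130, NA),
  oda = c(1, 2, 2, 3, 3),
  ilce = factor(c("a", "b", "a", "b", "a"))
)

# Illustrative helper: one row of rounded summary statistics per numeric column
summarise_numeric <- function(df, digits = 2) {
  # Keep only the numeric columns
  num <- df[vapply(df, is.numeric, logical(1))]
  # Compute the statistics on non-missing values of each column
  stats <- lapply(num, function(x) {
    x <- x[!is.na(x)]
    c(n = length(x), mean = mean(x), sd = sd(x),
      median = median(x), min = min(x), max = max(x))
  })
  # Stack the per-column vectors into a table and add the column names
  out <- as.data.frame(round(do.call(rbind, stats), digits))
  data.frame(column = names(num), out, row.names = NULL)
}

summary_tbl <- summarise_numeric(toy)
summary_tbl
```

Assuming the real data is loaded, `summarise_numeric(realty_data)` should give a comparable table. Note that `n` here counts non-missing values only.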

Please follow the steps:

- Filter the data for properties with a single bathroom (banyo_sayisi) and a single living room (salon)
- Select gross size (brut_metrekare) and room count (oda) features, exclude rows with NA values
- Trim the top and bottom 5% brut_metrekare values
- Plot the relationship between gross size (brut_metrekare) and room count (oda) with a best fit line 
- Set an arbitrary seed for reproducibility with set.seed(xxx) (so that your written interpretations stay consistent with the printed results) and randomly partition the data into a 0.7 train set and a 0.3 test set
- Create a linear model where gross size is the dependent variable and room count is the independent variable
- Interpret the model summary. What do the intercept and the coefficient tell you? How significant are they? How much of the variance in the dependent variable does the model explain?
- Calculate the predicted values for the train and test sets
- Plot predicted vs actual values for train and test sets with diagonal lines
- Calculate and compare RMSE and R2 values using predicted and actual values for train and test sets. Interpret the results
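
The trim, partition, fit and evaluate steps above can be sketched on synthetic data before applying them to the realty dataset. This is a minimal base R illustration, not the reference answer: the data frame `toy` and its generating formula are invented, the seed 42 is arbitrary, and `rmse` and `r2` are hand-rolled helpers (here R2 is taken as the squared correlation between predicted and actual values).

```r
set.seed(42)  # arbitrary seed, chosen only for reproducibility

# Synthetic stand-in for the filtered data: size grows with room count plus noise
n <- 500
oda <- sample(1:4, n, replace = TRUE)
toy <- data.frame(oda = oda, brut_metrekare = 40 + 30 * oda + rnorm(n, sd = 15))

# Trim the top and bottom 5% of brut_metrekare
bounds <- quantile(toy$brut_metrekare, c(0.05, 0.95))
toy <- toy[toy$brut_metrekare >= bounds[1] & toy$brut_metrekare <= bounds[2], ]

# Random 70/30 train/test partition
train_idx <- sample(nrow(toy), size = floor(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Gross size is the dependent variable, room count the independent variable
fit <- lm(brut_metrekare ~ oda, data = train)
summary(fit)

# Illustrative metric helpers
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
r2 <- function(pred, obs) cor(pred, obs)^2

# Predictions for both partitions
pred_train <- predict(fit, newdata = train)
pred_test <- predict(fit, newdata = test)

# Side-by-side comparison of train and test performance
metrics <- data.frame(
  set = c("train", "test"),
  RMSE = c(rmse(pred_train, train$brut_metrekare), rmse(pred_test, test$brut_metrekare)),
  R2 = c(r2(pred_train, train$brut_metrekare), r2(pred_test, test$brut_metrekare))
)
metrics
```

With caret loaded, `caret::RMSE(pred, obs)` and `caret::R2(pred, obs)` can replace the two helpers. Test metrics that are close to the train metrics suggest the model is not overfitting.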

## Answer