In [None]:
library(tidyverse)
library(data.table)
library(plotly) # for interactive plotting
library(DT) # for interactive tabulation
library(broom) # for tidy statistical summaries
library(caret) # for regression performance measures

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns shown when tables are printed

In [None]:
datapath <- "~/data_ad454"

# Simple Linear Regression

Let's first import the realty dataset:

In [None]:
realty_data <- readRDS(sprintf("%s/rds/02_01_realty_data.rds", datapath))

In [None]:
realty_data

See which variables are of factor type and what the levels of each are:

In [None]:
realty_data %>% keep(is.factor) %>% lapply(levels)

And the frequencies of those levels:

In [None]:
realty_data %>% keep(is.factor) %>% summary

Let's see the numeric variables:

In [None]:
realty_data %>% keep(is.numeric) %>% names

And statistical summaries of numeric columns:

In [None]:
realty_data %>% keep(is.numeric) %>% summary()

And statistical summaries of numeric columns in a better format:

In [None]:
realty_data %>%
    keep(is.numeric) %>%
    broom::tidy() %>%
    mutate_if(is.numeric, round, 2) %>%
    select(column, n, mean, sd, median, min, max)
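
As far as we know, newer broom releases deprecate calling tidy() on a plain data frame, so the cell above may warn or fail depending on the installed version. A minimal base R sketch of the same kind of table is given below; the `toy` data frame and the `summary_tbl` name are made up for illustration and are not part of the realty data.

```r
# Hypothetical toy data standing in for the numeric columns of the realty data
toy <- data.frame(oda = c(1, 2, 2, 3, 4),
                  brut_metrekare = c(55, 80, 95, 110, 150))

# Keep numeric columns only
num_cols <- toy[sapply(toy, is.numeric)]

# One row per column, one statistic per field
summary_tbl <- data.frame(
    column = names(num_cols),
    n      = sapply(num_cols, function(x) sum(!is.na(x))),
    mean   = sapply(num_cols, mean, na.rm = TRUE),
    sd     = sapply(num_cols, sd, na.rm = TRUE),
    median = sapply(num_cols, median, na.rm = TRUE),
    min    = sapply(num_cols, min, na.rm = TRUE),
    max    = sapply(num_cols, max, na.rm = TRUE)
)
rownames(summary_tbl) <- NULL

# Round every statistic (all columns except the name) to two decimals
summary_tbl[-1] <- round(summary_tbl[-1], 2)
summary_tbl
```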

Please follow the steps:

- Filter the data for properties with a single bathroom (banyo_sayisi) and a single living room (salon)
- Select gross size (brut_metrekare) and room count (oda) features, exclude rows with NA values
- Trim the top and bottom 5% brut_metrekare values
- Plot the relationship between gross size (brut_metrekare) and room count (oda) with a best fit line 
- Set an arbitrary seed for reproducibility with set.seed(xxx) (so that your typed interpretations stay consistent with the printed results) and randomly partition the data into a 0.7 train set and a 0.3 test set
- Create a linear model where gross size is the dependent and the room count is the independent variable
- Interpret the model summary. What do the intercept and the coefficient tell us? How significant are they? How much of the variance in the dependent variable does the model explain?
- Calculate the predicted values for the train and test sets
- Plot predicted vs actual values for train and test sets with diagonal lines
- Calculate and compare RMSE and R2 values using predicted and actual values for train and test sets. Interpret the results

## Answer

In [None]:
features <- c("oda", "brut_metrekare")

In [None]:
realty_data2 <- realty_data %>%
    filter(banyo_sayisi == 1 & salon == 1) %>%
    select(all_of(features)) %>%
    na.omit %>%
    filter(between(brut_metrekare,
                   quantile(brut_metrekare, 0.05),
                   quantile(brut_metrekare, 0.95)))
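
The trimming step can be looked at in isolation. The sketch below uses base R only and a synthetic `toy` data frame (the values are made up, not the realty data) to show how the 5% and 95% quantiles act as cut-offs and how many rows survive.

```r
# Synthetic stand-in for the realty data (values are made up for illustration)
set.seed(1)
toy <- data.frame(oda = sample(1:4, 200, replace = TRUE))
toy$brut_metrekare <- 40 + 25 * toy$oda + rnorm(200, sd = 15)

# Compute both cut-offs once, on the untrimmed column
lower <- quantile(toy$brut_metrekare, 0.05)
upper <- quantile(toy$brut_metrekare, 0.95)

# Keep rows whose gross size lies within [lower, upper]
trimmed <- toy[toy$brut_metrekare >= lower & toy$brut_metrekare <= upper, ]

nrow(toy)                       # rows before trimming
nrow(trimmed)                   # rows after trimming
range(trimmed$brut_metrekare)   # stays inside the two cut-offs
```

Roughly 10% of the rows are dropped, 5% from each tail.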

In [None]:
realty_data2 %>% str

In [None]:
realty_data2 %>%
    ggplot(aes(x = oda, y = brut_metrekare)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

In [None]:
train_ratio <- 0.7

Randomly create row indices for train partition

In [None]:
set.seed(100000)
# draw floor(.N * train_ratio) distinct row numbers from all .N rows
train_indices <- realty_data2[, sample(.N, floor(.N * train_ratio))]
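
A note on sample(): called with a single number x, it returns a permutation of 1:x, whereas sample(n, size) draws size distinct values from all of 1:n. The sketch below uses a hypothetical row count `n` (not taken from the realty data) to show the difference for a 0.7 train ratio.

```r
n <- 10              # hypothetical row count
train_ratio <- 0.7

set.seed(42)

# Single-argument form: a shuffle of the first 70% of row numbers only,
# so the last 30% of rows can never be selected
first_rows_only <- sample(n * train_ratio)

# Two-argument form: `train_size` distinct row numbers drawn from all of 1:n
train_size <- floor(n * train_ratio)
random_rows <- sample(n, train_size)

sort(first_rows_only)   # only indices from the first 70% of rows
sort(random_rows)       # indices spread over the whole range 1:n
```

With the single-argument form the split is random only if the row order itself is random, so the two-argument form is the safer choice.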

Split the data into two partitions

In [None]:
train_data <- realty_data2[train_indices]
test_data <- realty_data2[-train_indices]

Check that the sizes of the two partitions add up to the total row count:

In [None]:
realty_data2[,.N]
train_data[,.N]
test_data[,.N]
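
Row counts only show that the sizes add up. A minimal sketch of a stricter check, that the partitions share no rows and together cover every row, is given below; it works on hypothetical index vectors (`train_idx`, `test_idx`) rather than on the notebook's objects.

```r
n <- 50                                  # hypothetical row count
set.seed(7)
train_idx <- sample(n, floor(n * 0.7))   # row numbers of the train partition
test_idx  <- setdiff(seq_len(n), train_idx)

# No row number appears in both partitions
no_overlap <- length(intersect(train_idx, test_idx)) == 0

# Together the partitions contain every row number exactly once
covers_all <- setequal(c(train_idx, test_idx), seq_len(n)) &&
    length(train_idx) + length(test_idx) == n

no_overlap
covers_all
```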

### Train the model

Let's create the model:

In [None]:
model1 <- lm(brut_metrekare ~ oda, data = train_data)

See the summary:

In [None]:
summary(model1)

tidy() from the broom package extracts the key information from the model and presents it as a table:

In [None]:
tidy(model1)

What we see is:

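- Both the intercept and the coefficient of oda are statistically significant
- The intercept, about 37.6 m², is the average size of everything other than the rooms (living room, bathroom, kitchen, etc.)
- The coefficient of oda means that each additional room adds about 25 m² on average
- Nearly half of the variance in gross size is explained by the model (R2)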
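
To make the reading of the intercept and the slope concrete, the sketch below fits the same kind of model on synthetic data (the `toy` data frame and its generating values are assumptions for illustration, not the realty data) and reproduces a prediction by hand from the coefficients.

```r
# Synthetic data: gross size grows linearly with room count, plus noise
set.seed(3)
toy <- data.frame(oda = sample(1:4, 300, replace = TRUE))
toy$brut_metrekare <- 40 + 25 * toy$oda + rnorm(300, sd = 20)

fit <- lm(brut_metrekare ~ oda, data = toy)
b <- coef(fit)

# Prediction for a 2-room property: intercept + slope * 2
by_hand <- unname(b["(Intercept)"] + b["oda"] * 2)
via_predict <- unname(predict(fit, data.frame(oda = 2)))
all.equal(by_hand, via_predict)

# Going from 2 to 3 rooms shifts the prediction by exactly the slope
step <- unname(diff(predict(fit, data.frame(oda = c(2, 3)))))

# Share of variance explained, and p-values of the two coefficients
summary(fit)$r.squared
summary(fit)$coefficients[, "Pr(>|t|)"]
```

The p-values in the last column are what the significance statement above refers to, and r.squared is the share of variance explained.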

### Predict with the model

We collect the actual and predicted gross size (brut_metrekare) values for the train and test sets:

In [None]:
actual_train <- train_data$brut_metrekare
predicted_train <- predict(model1, train_data)

In [None]:
actual_test <- test_data$brut_metrekare
predicted_test <- predict(model1, test_data)

In [None]:
data.table(actual = actual_train, predictions = predicted_train) %>%
    ggplot(aes(x = actual, y = predictions)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Train Actual vs. Predictions")

In [None]:
data.table(actual = actual_test, predictions = predicted_test) %>%
    ggplot(aes(x = actual, y = predictions)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Test Actual vs. Predictions")

We calculate the R2, RMSE and MAE values with the caret package, comparing predicted and actual values for the train and test sets:

In [None]:
model_dt <- data.table(partition = c("train", "test"),
                       R2 = c(R2(predicted_train, actual_train),
                              R2(predicted_test, actual_test)),
                       RMSE = c(RMSE(predicted_train, actual_train),
                                RMSE(predicted_test, actual_test)),
                       MAE = c(MAE(predicted_train, actual_train),
                               MAE(predicted_test, actual_test)))

In [None]:
model_dt

Performance on the test set is similar to that on the train set, so the model does not appear to overfit.
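
The three measures can also be computed by hand, which makes clear what each one captures. The sketch below uses made-up `actual` and `predicted` vectors, not the notebook's results. To our knowledge, caret's R2() defaults to the squared correlation (formula = "corr"); the "traditional" form, 1 - SSE/SST, is shown alongside for comparison.

```r
# Made-up actual values and noisy predictions, for illustration only
set.seed(5)
actual <- 40 + 25 * sample(1:4, 100, replace = TRUE) + rnorm(100, sd = 20)
predicted <- actual + rnorm(100, sd = 10)

# Root mean squared error: penalises large errors more heavily
rmse <- sqrt(mean((predicted - actual)^2))

# Mean absolute error: average size of the error, in m² here
mae <- mean(abs(predicted - actual))

# R2 as squared correlation between predictions and actual values
r2_corr <- cor(predicted, actual)^2

# R2 as 1 - (residual sum of squares / total sum of squares)
r2_trad <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

c(RMSE = rmse, MAE = mae, R2_corr = r2_corr, R2_traditional = r2_trad)
```

RMSE is never smaller than MAE, and the traditional R2 is never larger than the squared correlation; a large gap between the two R2 values suggests the predictions are biased or badly scaled.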