In [None]:
library(tidyverse)
library(data.table)
library(plotly) # for interactive plotting
library(DT) # for interactive tabulation
library(broom) # for tidy statistical summaries
library(caret) # for regression performance measures

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns shown when tables are printed

In [None]:
datapath <- "../data"

# Simple Linear Regression

In this session we will conduct a simple linear regression using one dependent variable and one independent variable.

We will split the dataset into train and test partitions, fit the model on the train partition and evaluate it on the test partition.

Let's first import the realty dataset:

In [None]:
realty_data <- readRDS(sprintf("%s/rds/02_01_realty_data.rds", datapath))

In [None]:
realty_data

Let's see the structure:

In [None]:
realty_data %>% str

You can navigate through and filter the data:

In [None]:
realty_data %>% datatable(
  filter = "top",
  options = list(pageLength = 20)
)

See which variables are of factor type and what the levels of each are:

In [None]:
realty_data %>% keep(is.factor) %>% lapply(levels)

And the frequencies of those levels:

In [None]:
realty_data %>% keep(is.factor) %>% summary

Let's see the numeric variables:

In [None]:
realty_data %>% keep(is.numeric) %>% names

And statistical summaries of numeric columns:

In [None]:
realty_data %>% keep(is.numeric) %>% summary()

And statistical summaries of numeric columns in a better format:

In [None]:
realty_data %>% keep(is.numeric) %>% broom::tidy() %>% mutate_if(is.numeric, round, 2) %>%
select(column, n, mean, sd, median, min, max)

Now let's select some of the features:

In [None]:
features <- c("price", "brut_metrekare",
             "krediye_uygunluk",
             "kira_getirisi")

Let's create the unit price and unit rent columns, keep only the rows that are "eligible for loan" (krediye uygun), and trim the top and bottom 5% of the unit_price and unit_rent values:

In [None]:
realty_data2 <- realty_data %>%
  select(all_of(features)) %>%
  mutate(unit_price = price / brut_metrekare) %>%
  mutate(unit_rent = kira_getirisi / brut_metrekare) %>%
  filter(krediye_uygunluk == "uygun") %>%
  na.omit %>%
  filter(between(unit_price, quantile(unit_price, 0.05), quantile(unit_price, 0.95))) %>%
  filter(between(unit_rent, quantile(unit_rent, 0.05), quantile(unit_rent, 0.95)))
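The trimming step can be tried in isolation. The following is a minimal sketch on simulated values (the `toy` table and its numbers are hypothetical, not taken from the realty dataset), assuming only that dplyr is installed:

```r
library(dplyr)

# Hypothetical toy data: 100 simulated unit prices
set.seed(1)
toy <- tibble(unit_price = runif(100, min = 1000, max = 10000))

# Compute the 5th and 95th percentile cut-offs once
bounds <- quantile(toy$unit_price, probs = c(0.05, 0.95))

# Keep only the rows whose unit_price lies within the cut-offs
trimmed <- toy %>%
  filter(between(unit_price, bounds[[1]], bounds[[2]]))

nrow(toy)      # rows before trimming
nrow(trimmed)  # rows after trimming
```

Note that when two such filters are chained, as in the pipeline above, the quantiles in the second filter are computed on the rows that survived the first one.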

In [None]:
realty_data2 %>% str

Now we will try to understand whether unit_price is related to unit_rent.

## Visual examination

In [None]:
realty_data2 %>%
  ggplot(aes(x = unit_rent, y = unit_price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

We see a positive and moderately strong linear relationship.
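A visual impression like this can be backed by a number. The sketch below computes the Pearson correlation coefficient on simulated data (the values and the slope of 200 are made up for illustration); on the real data the same call would be `cor(realty_data2$unit_rent, realty_data2$unit_price)`:

```r
# Simulated data for illustration only, not the realty dataset
set.seed(42)
unit_rent <- runif(200, min = 5, max = 30)
unit_price <- 200 * unit_rent + rnorm(200, sd = 1000)

# Pearson correlation: the sign gives the direction of the relationship,
# the magnitude (between 0 and 1) gives its linear strength
r <- cor(unit_rent, unit_price)
r
```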

## Partition

Let's determine a ratio for train partition:

In [None]:
train_ratio <- 0.7

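Randomly draw row indices for the train partition:

In [None]:
# sample(n, size) draws `size` distinct row numbers out of 1..n;
# sample(.N * train_ratio) alone would only shuffle the first 70% of the rows
train_indices <- realty_data2[, sample(.N, round(.N * train_ratio))]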

Split the data into two partitions:

In [None]:
train_data <- realty_data2[train_indices]
test_data <- realty_data2[!train_indices]

Check that the partition sizes add up to the total number of rows (a quick sanity check that the partitions do not overlap):

In [None]:
realty_data2[,.N]
train_data[,.N]
test_data[,.N]
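Matching row counts are a necessary condition only. A stricter check compares row identifiers directly. The following is a minimal sketch on a hypothetical toy table with an `id` column (the realty data may not have such a column, so this is an assumption), with a fixed seed so that the split is reproducible:

```r
library(data.table)

# Hypothetical toy table standing in for realty_data2
set.seed(123)  # fix the seed so the same split is drawn on every run
toy <- data.table(id = 1:50, unit_rent = runif(50))

train_ratio <- 0.7
train_idx <- toy[, sample(.N, round(.N * train_ratio))]
toy_train <- toy[train_idx]
toy_test <- toy[!train_idx]

# No id may appear in both partitions ...
no_overlap <- length(intersect(toy_train$id, toy_test$id)) == 0
# ... and together the partitions must cover every row
full_cover <- setequal(c(toy_train$id, toy_test$id), toy$id)

no_overlap
full_cover
```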

## Train the model

Let's create the model:

In [None]:
model1 <- lm(unit_price ~ unit_rent, data = train_data)

See the summary:

In [None]:
summary(model1)

tidy() from the broom package extracts the coefficient estimates and their test statistics from the model and presents them as a table:

In [None]:
tidy(model1)

What we see is:

- The coefficient on unit_rent is significantly different from 0 (statistically significant)
- 44% of the overall variance in unit_price is explained by the model
- When unit_rent is zero, unit_price is estimated to be negative. Maintenance costs and monthly dues (aidat) may be the reason for that
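The share of explained variance quoted in the list is the model's R-squared. As a sketch of where that number can be read programmatically, the example below fits a model on R's built-in `cars` dataset (used here only as a stand-in for the realty data) and extracts R-squared in two equivalent ways:

```r
library(broom)

# Stand-in model on the built-in cars dataset (stopping distance vs. speed)
toy_model <- lm(dist ~ speed, data = cars)

# R-squared from the base R summary object
r2_summary <- summary(toy_model)$r.squared

# The same value from broom's one-row model summary
r2_glance <- glance(toy_model)$r.squared

r2_summary
r2_glance
```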

How can you interpret the coefficient of unit_rent?
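One way to approach this question: in a simple linear regression the slope is the change in the predicted value when the independent variable increases by one unit. The sketch below illustrates this on simulated data (the intercept of -500 and slope of 200 are made-up values, not estimates from the realty dataset):

```r
# Simulated data for illustration only
set.seed(7)
toy <- data.frame(unit_rent = runif(100, min = 5, max = 30))
toy$unit_price <- -500 + 200 * toy$unit_rent + rnorm(100, sd = 300)

toy_model <- lm(unit_price ~ unit_rent, data = toy)
slope <- coef(toy_model)[["unit_rent"]]

# Predictions at two unit_rent values that are exactly one unit apart
preds <- predict(toy_model, newdata = data.frame(unit_rent = c(10, 11)))

# The difference between the two predictions equals the estimated slope
slope_matches <- isTRUE(all.equal(unname(diff(preds)), slope))
slope_matches
```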

## Predict with the model

We collect the actual unit_price values and compute the predicted ones for the train and test sets:

In [None]:
actual_train <- train_data$unit_price
predicted_train <- predict(model1, train_data)

In [None]:
actual_test <- test_data$unit_price
predicted_test <- predict(model1, test_data)

The test data was not used when we fitted the model, so it is unseen data. If the model performs well on the train data but not on the test data, we may conclude that the model has "memorized" the train data (overfitting) rather than learned a relationship that generalizes.

Some information on regression performance metrics can be found by following these links:

[Regression Model Accuracy (MAE, MSE, RMSE, R-squared) Check in R](https://www.datatechnotes.com/2019/02/regression-model-accuracy-mae-mse-rmse.html)

[Measuring Performance](https://topepo.github.io/caret/measuring-performance.html)

We calculate the R2, RMSE and MAE values with the caret package, comparing predicted and actual values for the train and test sets:

In [None]:
model_dt <- data.table(
  partition = c("train", "test"),
  R2 = c(R2(predicted_train, actual_train),
         R2(predicted_test, actual_test)),
  RMSE = c(RMSE(predicted_train, actual_train),
           RMSE(predicted_test, actual_test)),
  MAE = c(MAE(predicted_train, actual_train),
          MAE(predicted_test, actual_test))
)

In [None]:
model_dt

We see that R2 is lower (but not by much) for the test set than for the train set. The higher the R2, the better the fit.

RMSE and MAE are higher for the test set than for the train set. The lower the RMSE and MAE, the better the fit.
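Both error measures are simple to compute by hand, which helps when interpreting them. A minimal base R sketch on four hypothetical value pairs (not taken from the realty data):

```r
# Hypothetical actual and predicted values, for illustration only
actual <- c(10, 12, 15, 20)
predicted <- c(11, 11, 17, 18)

errors <- actual - predicted   # prediction errors: -1, 1, -2, 2

# RMSE: square the errors, average them, take the square root
rmse <- sqrt(mean(errors^2))   # sqrt((1 + 1 + 4 + 4) / 4) = sqrt(2.5)

# MAE: average of the absolute errors
mae <- mean(abs(errors))       # (1 + 1 + 2 + 2) / 4 = 1.5

rmse
mae
```

Because the errors are squared before averaging, RMSE weights large errors more heavily than MAE does, so RMSE is never smaller than MAE.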

So the model performs slightly worse on the test set than on the train set, but the difference is small.

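Note that RMSE and MAE are not standardized: they are expressed in the units of the target variable, so they depend on its scale. R2 is unitless. As computed by caret's R2() with its default setting (the squared correlation between predicted and actual values), it always lies between 0 and 1. The traditional definition, 1 - SSE/SST, can be negative on unseen data when the predictions are worse than simply predicting the mean.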
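The bounds of R2 depend on which definition is used. The sketch below contrasts the squared correlation (the default formula in caret's R2()) with the traditional 1 - SSE/SST definition, using hypothetical values chosen so that the predictions follow the actual values perfectly in shape but are off by a constant:

```r
# Hypothetical values: predictions are perfectly correlated with the
# actual values but shifted upward by 10
actual <- c(1, 2, 3, 4, 5)
predicted <- c(11, 12, 13, 14, 15)

# Squared correlation: ignores the constant shift
r2_corr <- cor(predicted, actual)^2

# Traditional definition: 1 - SSE / SST, which penalizes the shift
sse <- sum((actual - predicted)^2)      # 5 * 10^2 = 500
sst <- sum((actual - mean(actual))^2)   # 4 + 1 + 0 + 1 + 4 = 10
r2_trad <- 1 - sse / sst                # 1 - 50 = -49

r2_corr
r2_trad
```

With caret, the traditional variant can be requested via `R2(pred, obs, formula = "traditional")`.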