In [None]:
library(tidyverse)
library(data.table)
library(plotly) # for interactive plotting
library(DT) # for interactive tabulation
library(broom) # for tidy statistical summaries
library(caret) # for regression performance measures
library(psych) # for pairwise comparisons
library(GGally) # for pairwise comparisons
library(magrittr) # for the assignment pipe %<>%
library(lindia) # for qqplots

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

In [None]:
datapath <- "~/data_ad454"

# Multiple Linear Regression

In this session we include more than one independent variable to build a multiple linear regression.

We will split the dataset into train and test partitions.

Let's first import the realty dataset:

In [None]:
realty_data <- readRDS(sprintf("%s/rds/02_01_realty_data.rds", datapath))

In [None]:
realty_data

Let's see the structure:

In [None]:
realty_data %>% str

You can navigate through and filter the data:

In [None]:
realty_data %>% datatable(
  filter = "top",
  options = list(pageLength = 20)
)

See which variables are of factor type and what the levels of each are:

In [None]:
realty_data %>% keep(is.factor) %>% lapply(levels)

And the frequencies of those levels:

In [None]:
realty_data %>% keep(is.factor) %>% summary

Let's see the numeric variables:

In [None]:
realty_data %>% keep(is.numeric) %>% names

And statistical summaries of numeric columns:

In [None]:
realty_data %>% keep(is.numeric) %>% summary()

And statistical summaries of numeric columns in a better format:

In [None]:
realty_data %>% keep(is.numeric) %>% broom::tidy() %>% mutate_if(is.numeric, round, 2) %>%
select(column, n, mean, sd, median, min, max)

And the boolean (logical) variables showing which features exist or not:

In [None]:
realty_data %>% keep(is.logical) %>% names

Frequencies of boolean and NA values:

In [None]:
realty_data %>% keep(is.logical) %>% summary

Frequencies of boolean and NA values in a better format:

In [None]:
do.call(rbind, realty_data %>% keep(is.logical) %>% 
lapply(table, useNA = "always")) %>%
datatable(
  filter = "top",
  options = list(pageLength = 20)
)

## Data cleaning

Now let's clean some of the categorical and boolean variables.

Some categories or variables point to similar things, so it is better to merge them:

In [None]:
realty_data[isinma_tipi == "merkezi (pay olcer)", isinma_tipi := "merkezi"]

In [None]:
realty_data[, su_deposu := su_deposu | hidrofor]
realty_data[, hidrofor := NULL]

In [None]:
realty_data[, isi_yalitim := isi_yalitim | isicam]
realty_data[, isicam := NULL]

In [None]:
realty_data[, saten_boya := saten_boya | saten_alci]
realty_data[, saten_alci := NULL]

## Feature extraction

Now let's select some of the boolean variables.

First exclude the direction variables, since they have missing values:

In [None]:
realty_data %>% keep(is.logical) %>% select(-c("kuzey", "bati", "guney", "dogu")) %>% length

See how the rows are distributed by the number of boolean columns that are TRUE for each ad:

In [None]:
realty_data %>% keep(is.logical) %>% select(-c("kuzey", "bati", "guney", "dogu")) %>% rowSums %>% table

128 rows have no TRUE boolean values, while only 5 rows have a single one.

For the ads with no TRUE values, it is highly probable that the owners did not take the time to tick any options, not that the property lacks all of those features. These rows can be filtered out.

In [None]:
bools <- realty_data %>% keep(is.logical) %>% select(-c("kuzey", "bati", "guney", "dogu"))

A feature that appears in too many or too few ads is not very useful. So it is better to select the ones that are more balanced between TRUE and FALSE, with the number of TRUE cases close to half the total number of rows:

In [None]:
normx <- bools[,.N] / 2
normx

In [None]:
bools_select <- bools %>% 
colSums %>% # get total count of TRUE values
"-"(normx) %>% # subtract from half of row count
abs %>% # take absolute value
sort %>% # sort
.[. <= normx / 2] %>% # keep variables whose absolute difference is at most a quarter of the total row count
names %>% # get the names
str_subset("yakin", negate = T) %>% # drop the many proximity ("yakin") columns: every property around mecidiyekoy is close to the center
setdiff("merkezde") # same goes for this column
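
The balance filter above can be sketched on a small example. This is only an illustration with a hypothetical 8-row table of flags (the column names rare, balanced, common and moderate are made up), using base R instead of the pipe chain:

```r
# Hypothetical toy data: 8 ads, 4 logical feature flags
flags <- data.frame(
  rare     = c(TRUE, rep(FALSE, 7)),          # 1 of 8 TRUE
  balanced = rep(c(TRUE, FALSE), 4),          # 4 of 8 TRUE
  common   = rep(TRUE, 8),                    # 8 of 8 TRUE
  moderate = c(rep(TRUE, 3), rep(FALSE, 5))   # 3 of 8 TRUE
)

half <- nrow(flags) / 2                   # half of the row count
distance <- abs(colSums(flags) - half)    # distance of each TRUE count from half

# keep flags whose distance is at most a quarter of the row count,
# ordered from most to least balanced
selected <- names(sort(distance[distance <= half / 2]))
selected
```

Flags that are almost always or almost never TRUE fall outside the band and are dropped.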

See whether these columns are highly correlated:

In [None]:
# create a correlation matrix
cor0 <- realty_data %>% select(all_of(bools_select)) %>% cor

caret has a function for detecting high correlations, but its output is not very useful here:

In [None]:
cor0 %>% caret::findCorrelation(cutoff = 0.5, verbose = T, exact = F)

Manually we can do better:

In [None]:
# take only the upper triangle and off diagonal values to eliminate duplicates
cor0[row(cor0) >= col(cor0)] <- NA
cor0

In [None]:
cor0 %>% as.data.table(keep.rownames = T) %>%
gather("key", "value", -"rn", na.rm = T) %>% as.data.table %>% # convert to long format data.table
arrange(-value) %>%
filter(value > 0.5) # filter for higher correlations
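
The same idea can be sketched in base R on made-up vectors (x1, x2 and x3 are hypothetical). Here the absolute value is used so that strong negative correlations would be listed as well, which is an assumption that goes beyond the cell above:

```r
# Hypothetical toy variables: x2 is almost a copy of x1, x3 is unrelated
x1 <- 1:10
x3 <- rep(c(1, -1), 5)
x2 <- x1 + x3

cm <- cor(cbind(x1, x2, x3))
cm[lower.tri(cm, diag = TRUE)] <- NA      # keep the upper triangle only

# positions of the remaining entries whose absolute correlation exceeds the cutoff
idx <- which(abs(cm) > 0.5, arr.ind = TRUE)
high_pairs <- data.frame(
  var1  = rownames(cm)[idx[, "row"]],
  var2  = colnames(cm)[idx[, "col"]],
  value = cm[idx]
)
high_pairs
```

Each correlated pair appears once because the lower triangle and the diagonal are masked before filtering.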

No semantically close variables with very high correlations remain.

Now let's select some of the features:

In [None]:
features <- c("price",
              "neighborhood",
              "esyali",
              "krediye_uygunluk",
              "isinma_tipi",
              "kullanim_durumu",
              "brut_metrekare",
              "oda",
              "salon",
              "bina_yasi",
              "banyo_sayisi",
              "kat_sayisi",
              "kat", bools_select)

We calculate the row sum of all boolean variables to filter out properties with no features set:

In [None]:
nboolean <- realty_data %>% keep(is.logical) %>% select(-c("kuzey", "bati", "guney", "dogu")) %>% rowSums

In [None]:
realty_data2 <- realty_data %>%
select(all_of(features)) %>%
mutate(nboolean = nboolean) %>%
mutate(unit_price = price / brut_metrekare) %>%
mutate(unit_size = brut_metrekare / kat) %>% # this feature controls for the land share of the property
na.omit %>%
filter(unit_price %between% quantile(unit_price, c(0.05, 0.95))) %>%
filter(nboolean != 0) %>%
filter(isinma_tipi %in% c("merkezi", "kombi", "kat kaloriferi")) %>% # other categories are very rare, better to take out
select(-c("price", "nboolean"))

In [None]:
realty_data2

In [None]:
realty_data2 %>% str

Now let's see the correlations after the transformations:

model.matrix creates the numeric representation of the data that the regression model uses.

Conveniently, it automatically converts categorical variables into sets of dummies, normally excluding one category from each factor to avoid linear dependency with the intercept.

The -1 term excludes the intercept from the model; in that case the first factor keeps all of its levels:

In [None]:
dat1 <- model.matrix(unit_price ~ . - 1, realty_data2)

In [None]:
colnames(dat1)
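
To see the dummy coding in isolation, here is a minimal sketch on a hypothetical six-row table (the columns y, heat and size and the level "other" are made up for illustration):

```r
# Hypothetical toy data with one factor and one numeric column
toy <- data.frame(
  y    = c(1, 2, 3, 4, 5, 6),
  heat = factor(c("kombi", "merkezi", "kombi", "merkezi", "other", "other")),
  size = c(80, 100, 90, 120, 70, 60)
)

# with an intercept, the first level ("kombi") is the dropped reference category
with_intercept <- colnames(model.matrix(y ~ ., toy))

# without an intercept, every level of the factor gets its own dummy column
without_intercept <- colnames(model.matrix(y ~ . - 1, toy))

with_intercept
without_intercept
```

Comparing the two sets of column names shows where the reference category goes when the intercept is removed.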

Same method for detecting high correlations:

In [None]:
cor1 <- dat1 %>% cor

In [None]:
cor1[row(cor1) >= col(cor1)] <- NA
cor1

In [None]:
cor1 %>% as.data.table(keep.rownames = T) %>%
gather("key", "value", -"rn", na.rm = T) %>% as.data.table %>%
arrange(-value) %>%
filter(value > 0.5)

It seems better to keep only brut_metrekare and drop the other two.

The assignment pipe %<>% applies a transformation and assigns the result back:

In [None]:
realty_data2 %<>% select(-c("oda", "banyo_sayisi"))

## Partition

Let's determine a ratio for train partition:

In [None]:
set.seed(1000)
train_ratio <- 0.7

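Randomly draw row indices for the train partition. Note that sample() needs both the population size and the sample size; with a single number it would only shuffle the first rows:

In [None]:
train_indices <- realty_data2[, sample(.N, floor(.N * train_ratio))]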

Split the data into two partitions

In [None]:
train_data <- realty_data2[train_indices]
test_data <- realty_data2[-train_indices]

Check the partition sizes; the train and test counts should add up to the total:

In [None]:
realty_data2[,.N]
train_data[,.N]
test_data[,.N]
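
Row counts alone do not prove that the partitions are disjoint. A minimal sketch of an explicit check, assuming a hypothetical table of 10 rows and working on row numbers only:

```r
set.seed(1000)
n <- 10                 # hypothetical number of rows
train_ratio <- 0.7

# draw 70% of the row numbers 1..n without replacement
train_idx <- sample(n, floor(n * train_ratio))
test_idx  <- setdiff(seq_len(n), train_idx)

no_overlap <- length(intersect(train_idx, test_idx)) == 0     # mutually exclusive
full_cover <- setequal(c(train_idx, test_idx), seq_len(n))    # nothing is lost
c(no_overlap = no_overlap, full_cover = full_cover)

# for contrast: a single numeric argument is treated as "permute 1..k",
# so this only reorders the first rows instead of sampling from all of them
first_only <- sample(n * train_ratio)
sort(first_only)
```

The last two lines show why the sample size has to be passed as a second argument when the aim is a random subset of all rows.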

## Initial model

First include all variables without the intercept term. "." is a shorthand for all variables except the LHS (response) variable:

In [None]:
model1 <- lm(unit_price ~ . - 1, train_data)

In [None]:
model1 %>% summary

model1 %>% tidy %>% filter(p.value < 0.1)

The best source for learning the formula language in R is the built-in help. Please check the details:

In [None]:
?formula

The Q-Q plot shows whether the residuals are normally distributed. Ideally the points should lie along the red line:

In [None]:
gg_qqplot(model1, scale.factor = 1)

They deviate from the line, so the normality assumption is violated.

Compare predictions and actual values for train and test sets:

In [None]:
options(repr.plot.width = 5, repr.plot.height = 5)

actual_train <- train_data$unit_price
predicted_train <- predict(model1, train_data)

actual_test <- test_data$unit_price
predicted_test <- predict(model1, test_data)

model_dt <- data.table(partition = c("train", "test"),
                       R2 = c(R2(predicted_train, actual_train),
                                R2(predicted_test, actual_test)),
                        RMSE = c(RMSE(predicted_train, actual_train),
                                 RMSE(predicted_test, actual_test)),
                        MAE = c(MAE(predicted_train, actual_train),
                                MAE(predicted_test, actual_test))
                        )

model_dt

data.table(actual = actual_train, predictions = predicted_train) %>%
ggplot(aes(x = actual, y = predictions)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
ggtitle("Train Actual vs. Predictions")

data.table(actual = actual_test, predictions = predicted_test) %>%
ggplot(aes(x = actual, y = predictions)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
ggtitle("Test Actual vs. Predictions")
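
For reference, the three measures can be computed by hand. This is a sketch on made-up numbers; it assumes the default behavior of caret's R2(), which reports the squared correlation between predictions and actual values:

```r
# Hypothetical actual and predicted values
actual    <- c(1, 2, 3, 4)
predicted <- c(1.5, 2, 2.5, 5)

errors <- actual - predicted

r2   <- cor(predicted, actual)^2   # squared correlation
rmse <- sqrt(mean(errors^2))       # root mean squared error
mae  <- mean(abs(errors))          # mean absolute error

c(R2 = r2, RMSE = rmse, MAE = mae)
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two hints at a few badly predicted rows.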

In [None]:
data.table(residuals = actual_train - predicted_train, predictions = predicted_train) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
ggtitle("Train Predictions vs. Residuals")

data.table(residuals = actual_test - predicted_test, predictions = predicted_test) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
ggtitle("Test Predictions vs. Residuals")

The multiple R2 reported by the model summary is very high, while the squared correlations between actual and predicted values are much lower. Why?

Part of the gap is mechanical: for a formula without an intercept, summary() measures R2 against zero rather than against the mean of the response, which inflates it. In addition, the neighborhood determines the level of the unit price to a great extent, and prices vary widely across neighborhoods. What the model mostly captures is this wide range of price levels across neighborhoods. So the variance is not uniform across the set but clusters around the different neighborhoods.
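
The effect of dropping the intercept on the reported R2 can be sketched with simulated data. Everything here is hypothetical: three groups standing in for neighborhoods, one numeric predictor, and arbitrary noise levels:

```r
set.seed(1)
g <- factor(rep(c("a", "b", "c"), each = 50))    # hypothetical "neighborhoods"
x <- rnorm(150)
y <- c(100, 200, 300)[as.integer(g)] + 5 * x + rnorm(150, sd = 30)

fit_no_int <- lm(y ~ g + x - 1)   # all three group dummies, no intercept
fit_int    <- lm(y ~ g + x)       # intercept plus two group dummies

# both parameterizations give the same fitted values ...
same_fit <- isTRUE(all.equal(unname(fitted(fit_no_int)), unname(fitted(fit_int))))

# ... but the summaries report different R2 values
r2_no_int <- summary(fit_no_int)$r.squared   # measured against zero
r2_int    <- summary(fit_int)$r.squared      # measured against the mean of y
r2_corr   <- cor(fitted(fit_int), y)^2       # squared correlation

c(no_intercept = r2_no_int, intercept = r2_int, squared_cor = r2_corr)
```

With an intercept, the summary R2 matches the squared correlation between fitted and actual values; without one, the same fit reports a higher number.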

It is better to model the deviation of the unit price from the median unit price of its neighborhood, that is, the relative premium:

## Premium by neighborhoods

In [None]:
neigh_av <- realty_data2[, .(price_neigh = median(unit_price)), by = neighborhood]

In [None]:
neigh_av

In [None]:
realty_data3 <- neigh_av[realty_data2, on = "neighborhood"] %>% # merge the data.table way
mutate(premium_neigh = unit_price / price_neigh -1) %>% # calculate the premium
select(-c("unit_price", "price_neigh", "neighborhood")) %>%
filter(premium_neigh %between% quantile(premium_neigh, c(0.1, 0.9)))
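
The premium calculation can be traced on a tiny example. This sketch uses a hypothetical six-row table and base R's ave() in place of the data.table join:

```r
# Hypothetical unit prices in two neighborhoods
toy <- data.frame(
  neighborhood = c("a", "a", "a", "b", "b", "b"),
  unit_price   = c(10, 20, 30, 100, 200, 400)
)

# median unit price of each row's own neighborhood
toy$price_neigh <- ave(toy$unit_price, toy$neighborhood, FUN = median)

# relative premium: 0 means priced at the neighborhood median
toy$premium_neigh <- toy$unit_price / toy$price_neigh - 1
toy
```

After this transformation the two neighborhoods are on the same scale even though their price levels differ by a factor of ten.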

We partition again:

In [None]:
set.seed(1000)
train_ratio <- 0.7

Randomly draw row indices for the train partition:

In [None]:
train_indices2 <- realty_data3[, sample(.N, floor(.N * train_ratio))]

Split the data into two partitions

In [None]:
train_data2 <- realty_data3[train_indices2]
test_data2 <- realty_data3[-train_indices2]

Check the partition sizes; the train and test counts should add up to the total:

In [None]:
realty_data3[,.N]
train_data2[,.N]
test_data2[,.N]

## Model 2

Now let's try to explain the premium using all variables at once:

In [None]:
model2 <- lm(premium_neigh ~ ., train_data2)

In [None]:
model2 %>% summary
model2 %>% tidy %>% filter(p.value < 0.1)

In [None]:
options(repr.plot.width = 5, repr.plot.height = 5)

actual_train <- train_data2$premium_neigh
predicted_train <- predict(model2, train_data2)

actual_test <- test_data2$premium_neigh
predicted_test <- predict(model2, test_data2)

model_dt <- data.table(partition = c("train", "test"),
                       R2 = c(R2(predicted_train, actual_train),
                                R2(predicted_test, actual_test)),
                        RMSE = c(RMSE(predicted_train, actual_train),
                                 RMSE(predicted_test, actual_test)),
                        MAE = c(MAE(predicted_train, actual_train),
                                MAE(predicted_test, actual_test))
                        )

model_dt

data.table(actual = actual_train, predictions = predicted_train) %>%
ggplot(aes(x = actual, y = predictions)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
ggtitle("Train Actual vs. Predictions")

data.table(actual = actual_test, predictions = predicted_test) %>%
ggplot(aes(x = actual, y = predictions)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
ggtitle("Test Actual vs. Predictions")

The predictive performance is still not very good.

In [None]:
gg_qqplot(model2, scale.factor = 1)

The residuals are closer to normality.

In [None]:
data.table(residuals = actual_train - predicted_train, predictions = predicted_train) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
ggtitle("Train Predictions vs. Residuals")

data.table(residuals = actual_test - predicted_test, predictions = predicted_test) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
ggtitle("Test Predictions vs. Residuals")

## Pairwise comparisons

Now let's try to detect non-linear relationships using some alternative and similar methods:

In [None]:
options(repr.plot.width = 15, repr.plot.height = 15)

In [None]:
pairs(train_data2 %>% keep(is.numeric))

In [None]:
pairs.panels(train_data2 %>% keep(is.numeric))

In [None]:
ggpairs(train_data2 %>% keep(is.numeric))

## Model with quadratic terms

There might be a quadratic relationship between bina_yasi and premium.

However, if we add the quadratic term manually, it may be multicollinear with the linear term:

In [None]:
cor(realty_data3 %>% select(bina_yasi) %>% mutate(binayasi2 = bina_yasi^2))

The poly() function avoids this by creating orthogonal terms:

In [None]:
realty_data3[, poly(bina_yasi, 2)] %>% cor
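
The difference between raw and orthogonal terms can be sketched on a made-up age vector (the values 0 to 30 are hypothetical, not taken from the dataset):

```r
age <- 0:30                     # hypothetical building ages

# raw linear and quadratic terms are strongly correlated
raw_cor <- cor(age, age^2)

# poly() returns a linear and a quadratic column that are orthogonal to each other
orth <- poly(age, 2)
orth_cor <- cor(orth[, 1], orth[, 2])

c(raw = raw_cor, orthogonal = orth_cor)
```

The orthogonal columns span the same quadratic curve as the raw terms, so the fitted values are the same; only the individual coefficients change meaning.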

Since "." includes all terms, bina_yasi is subtracted to prevent double counting:

In [None]:
model3 <- lm(premium_neigh ~ . - bina_yasi + poly(bina_yasi, 2), train_data2)

In [None]:
model3 %>% summary
model3 %>% tidy %>% filter(p.value < 0.1)

In [None]:
options(repr.plot.width = 5, repr.plot.height = 5)
gg_qqplot(model3, scale.factor = 1)

In [None]:
actual_train <- train_data2$premium_neigh
predicted_train <- predict(model3, train_data2)

actual_test <- test_data2$premium_neigh
predicted_test <- predict(model3, test_data2)

model_dt <- data.table(partition = c("train", "test"),
                       R2 = c(R2(predicted_train, actual_train),
                                R2(predicted_test, actual_test)),
                        RMSE = c(RMSE(predicted_train, actual_train),
                                 RMSE(predicted_test, actual_test)),
                        MAE = c(MAE(predicted_train, actual_train),
                                MAE(predicted_test, actual_test))
                        )

model_dt

data.table(actual = actual_train, predictions = predicted_train) %>%
ggplot(aes(x = actual, y = predictions)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
ggtitle("Train Actual vs. Predictions")

data.table(actual = actual_test, predictions = predicted_test) %>%
ggplot(aes(x = actual, y = predictions)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
ggtitle("Test Actual vs. Predictions")

In [None]:
data.table(residuals = actual_train - predicted_train, predictions = predicted_train) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0) +
ggtitle("Train Predictions vs. Residuals")

data.table(residuals = actual_test - predicted_test, predictions = predicted_test) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0) +
ggtitle("Test Predictions vs. Residuals")

Performance is still not good enough.

## Model with fewer variables

Let's select only those variables from the previous model that are significant at 10% level:

In [None]:
model4 <- lm(premium_neigh ~ esyali + krediye_uygunluk +
             salon + poly(bina_yasi, 2) +
             manzara_sehir +
             pvc_dograma +
             goruntulu_diafon +
             cadde_uzerinde +
             su_deposu,
             train_data2)

In [None]:
model4 %>% summary
model4 %>% tidy %>% filter(p.value < 0.1)

In [None]:
gg_qqplot(model4, scale.factor = 1)

In [None]:
options(repr.plot.width = 5, repr.plot.height = 5)

actual_train <- train_data2$premium_neigh
predicted_train <- predict(model4, train_data2)

actual_test <- test_data2$premium_neigh
predicted_test <- predict(model4, test_data2)

model_dt <- data.table(partition = c("train", "test"),
                       R2 = c(R2(predicted_train, actual_train),
                                R2(predicted_test, actual_test)),
                        RMSE = c(RMSE(predicted_train, actual_train),
                                 RMSE(predicted_test, actual_test)),
                        MAE = c(MAE(predicted_train, actual_train),
                                MAE(predicted_test, actual_test))
                        )

model_dt

data.table(actual = actual_train, predictions = predicted_train) %>%
ggplot(aes(x = actual, y = predictions)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
ggtitle("Train Actual vs. Predictions")

data.table(actual = actual_test, predictions = predicted_test) %>%
ggplot(aes(x = actual, y = predictions)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
ggtitle("Test Actual vs. Predictions")

In [None]:
data.table(residuals = actual_train - predicted_train, predictions = predicted_train) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0) +
ggtitle("Train Predictions vs. Residuals")

data.table(residuals = actual_test - predicted_test, predictions = predicted_test) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0) +
ggtitle("Test Predictions vs. Residuals")

## Model with interaction

Now let's add interaction terms between su_deposu and the other variables:

In [None]:
model5 <- lm(premium_neigh ~ (esyali + krediye_uygunluk +
             salon + poly(bina_yasi, 2) +
             manzara_sehir +
             goruntulu_diafon)*su_deposu,
             train_data2)
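
To see what the (...) * variable pattern expands to, the terms of a formula can be listed without fitting anything. A small sketch with placeholder names y, a, b and c:

```r
# placeholder variable names, no data needed
f <- y ~ (a + b) * c

# list the main effects and interactions that the formula generates
term_labels <- attr(terms(f), "term.labels")
term_labels
```

Every variable inside the parentheses gets a main effect and an interaction with the variable after the *, which also enters as a main effect itself.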

In [None]:
model5 %>% summary
model5 %>% tidy %>% filter(p.value < 0.1)

In [None]:
gg_qqplot(model5, scale.factor = 1)

In [None]:
options(repr.plot.width = 5, repr.plot.height = 5)

actual_train <- train_data2$premium_neigh
predicted_train <- predict(model5, train_data2)

actual_test <- test_data2$premium_neigh
predicted_test <- predict(model5, test_data2)

model_dt <- data.table(partition = c("train", "test"),
                       R2 = c(R2(predicted_train, actual_train),
                                R2(predicted_test, actual_test)),
                        RMSE = c(RMSE(predicted_train, actual_train),
                                 RMSE(predicted_test, actual_test)),
                        MAE = c(MAE(predicted_train, actual_train),
                                MAE(predicted_test, actual_test))
                        )

model_dt

data.table(actual = actual_train, predictions = predicted_train) %>%
ggplot(aes(x = actual, y = predictions)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
ggtitle("Train Actual vs. Predictions")

data.table(actual = actual_test, predictions = predicted_test) %>%
ggplot(aes(x = actual, y = predictions)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
ggtitle("Test Actual vs. Predictions")

In [None]:
data.table(residuals = actual_train - predicted_train, predictions = predicted_train) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0) +
ggtitle("Train Predictions vs. Residuals")

data.table(residuals = actual_test - predicted_test, predictions = predicted_test) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0) +
ggtitle("Test Predictions vs. Residuals")

The model performs better on the train set but worse on the test set. It has overfitted: it memorized the training data instead of learning patterns that generalize.

## Final multicollinearity check

Caret way:

Manual way:

In [None]:
cor2 <- model.matrix(model4) %>% cor

In [None]:
# take only the upper triangle and off diagonal values to eliminate duplicates
cor2[row(cor2) >= col(cor2)] <- NA
cor2

In [None]:
cor2 %>% as.data.table(keep.rownames = T) %>%
gather("key", "value", -"rn", na.rm = T) %>% as.data.table %>% # convert to long format data.table
arrange(-value) %>%
filter(value > 0.5) # filter for higher correlations

No high correlations remain.