<font size="6"><b>MULTIPLE LINEAR REGRESSION</b></font>

In [None]:
library(tidyverse)
library(data.table)
library(plotly) # for interactive plotting
library(DT) # for interactive tabulation
library(broom) # for tidy statistical summaries
library(caret) # for regression performance measures
library(psych) # for pairwise comparisons
library(lindia) # for qqplots
library(car) # for multicollinearity
library(moments) # for higher moments 
library(PearsonDS) # for Pearson distribution
library(rethinking) # for LKJ distribution

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

In [None]:
datapath <- "~/databa"

![xkcd](../imagesba/change_in_slope.png)

(https://xkcd.com/2701/)

In this session we will work through an example of multiple linear regression.

We will split the dataset into train and test partitions, fit the model on the train partition and evaluate it on the test partition.

# Exploring the Dataset

Let's first import the realty dataset:

In [None]:
realty_data <- readRDS(sprintf("%s/rds/realty_data3.rds", datapath))

This dataset includes features of around 1000 residential realty ads in the Şişli and Mecidiyeköy area. This version contains only 402 filtered ads.

The response variable is `premium_neigh`, the premium of the realty's unit price relative to the median unit price of its neighbourhood.

Let's see the structure:

In [None]:
realty_data %>% str

You can navigate through and filter the data:

In [None]:
realty_data %>% datatable(
  filter = "top",
  options = list(pageLength = 20)
)

See which variables are of factor type and what the levels of each are:

In [None]:
realty_data %>% keep(is.factor) %>% lapply(levels)

And the frequencies of those levels:

In [None]:
realty_data %>% keep(is.factor) %>% summary

Let's see the numeric variables:

In [None]:
realty_data %>% keep(is.numeric) %>% names

And statistical summaries of numeric columns:

In [None]:
realty_data %>% keep(is.numeric) %>% broom::tidy() %>% mutate_if(is.numeric, round, 2) %>%
select(column, n, mean, sd, median, min, max)

And the boolean (logical) variables showing which features exist or not:

In [None]:
realty_data %>% keep(is.logical) %>% names

Frequencies of boolean and NA values:

In [None]:
do.call(rbind, realty_data %>% keep(is.logical) %>% 
lapply(table, useNA = "always")) %>%
datatable(
  filter = "top",
  options = list(pageLength = 20)
)

Let's examine the bivariate plots of numeric variables:

In [None]:
options(repr.plot.width = 15, repr.plot.height = 15)
pairs.panels(realty_data %>% keep(is.numeric))

The plots suggest that the relationship between `premium_neigh` and `bina_yasi` (building age) may be quadratic.

# Partition

Now we will split the dataset into train and test partitions. We will fit the model on the train partition and measure its predictive power on the test partition.

Let's determine a ratio for train partition:

In [None]:
train_ratio <- 0.7

Randomly sample row indices for the train partition:

In [None]:
set.seed(1000)
# draw floor(.N * train_ratio) distinct row numbers out of all .N rows
train_indices <- realty_data[, sample(.N, floor(.N * train_ratio))]

Split the data into two partitions:

In [None]:
train_data <- realty_data[train_indices]
test_data <- realty_data[-train_indices]

Check that the partition sizes add up to the size of the full dataset:

In [None]:
realty_data[,.N]
train_data[,.N]
test_data[,.N]
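
Matching row counts show that no row was lost, but they do not prove that the partitions are disjoint. A minimal sketch of a stricter check, using the built-in `mtcars` data as a stand-in for the realty data; the `row_id` column and the `toy_*` names are hypothetical helpers and not part of the original notebook:

```r
# Toy stand-in for the realty data (built-in dataset)
toy <- mtcars
toy$row_id <- seq_len(nrow(toy))   # hypothetical helper column

set.seed(1000)
toy_ratio <- 0.7
# draw 70% of the row numbers without replacement
toy_idx <- sample(nrow(toy), floor(nrow(toy) * toy_ratio))
toy_train <- toy[toy_idx, ]
toy_test  <- toy[-toy_idx, ]

# no row may appear in both partitions
overlap <- intersect(toy_train$row_id, toy_test$row_id)
length(overlap) == 0
# together the partitions must cover every row
setequal(c(toy_train$row_id, toy_test$row_id), toy$row_id)
```

The same idea should carry over to the realty data once a row identifier column is available.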

# Model

Let's use some of the variables in the dataset to predict the response variable `premium_neigh`. Note that:

- `bina_yasi` is included with polynomial terms
- An interaction term between `krediye_uygunluk` and `cadde_uzerinde` is also added

In [None]:
model1 <- lm(premium_neigh ~ esyali + krediye_uygunluk*cadde_uzerinde +
             salon + poly(bina_yasi, 2) +
             manzara_sehir +
             pvc_dograma +
             goruntulu_diafon +
             su_deposu,
             train_data)
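
To see which columns such a formula produces, you can inspect the design matrix. A minimal sketch on the built-in `mtcars` data, where `am`, `vs` and `hp` merely stand in for the realty variables (an assumption for illustration): `a * b` expands to both main effects plus their interaction, and `poly(x, 2)` adds a linear and a quadratic column that are orthogonal to each other.

```r
toy <- data.frame(
  y  = mtcars$mpg,
  am = factor(mtcars$am),   # stands in for krediye_uygunluk
  vs = factor(mtcars$vs),   # stands in for cadde_uzerinde
  hp = mtcars$hp            # stands in for bina_yasi
)

# design matrix that lm() would build for this formula
X <- model.matrix(y ~ am * vs + poly(hp, 2), data = toy)
colnames(X)

# the two polynomial columns are orthogonal:
# the off-diagonal entries of their cross product are numerically zero
poly_cols <- X[, grep("^poly", colnames(X))]
round(crossprod(poly_cols), 10)
```

Because the polynomial columns are orthogonal, the linear and quadratic terms do not inflate each other's VIF values.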

Let's see the model summary:

In [None]:
model1 %>% summary
model1 %>% tidy %>% filter(p.value < 0.05) %>% mutate_if(is.numeric, round, 3)

The F-statistic is significant, the R-squared value is 27% and 7 terms are significant at the 5% level.

- There is a significant positive relationship between premium_neigh and the esyalihayir, quadratic bina_yasi, goruntulu_diafonTRUE and su_deposuTRUE terms
- There is a significant negative relationship between premium_neigh and the krediye_uygunlukuygundeil, salon and manzara_sehirTRUE terms

Now let's check VIF values:

In [None]:
car::vif(model1)

No value is above the commonly used threshold of 5, so there is no sign of serious multicollinearity.
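
For a plain numeric predictor, the VIF equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A minimal sketch on the built-in `mtcars` data; `vif_manual()` is a hypothetical helper written for illustration and not a function of the car package:

```r
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # auxiliary regression: predictor p on the remaining predictors
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

preds <- c("wt", "hp", "disp")
vif_values <- vif_manual(mtcars, preds)
vif_values

# the same numbers appear on the diagonal of the inverse correlation matrix
diag(solve(cor(mtcars[, preds])))
```

A VIF of 5 therefore corresponds to an auxiliary R² of 0.8: 80% of the predictor's variance is explained by the other predictors.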

Let's check the diagnostic plots:

In [None]:
options(repr.plot.width = 7, repr.plot.height = 7)
plot(model1)

And Cook's Distance plot:

In [None]:
plot(model1, which = 4)

And conduct Shapiro-Wilk test for the normality of residuals:

In [None]:
shapiro.test(model1$residuals)

What we see here is that:

- There is significant heteroscedasticity in residuals with a funnel shape
- Residuals are non-normal according to normal Q-Q plot and Shapiro-Wilk test
- There are influential observations
- There is no multicollinearity issue

Now let's get the predictions on test data:

In [None]:
actual_test <- test_data$premium_neigh
predicted_test <- predict(model1, test_data)
residuals_test <- actual_test - predicted_test # residual = observed - predicted

Let's compare train and test R-squared values:

In [None]:
r2_train <- summary(model1)$r.squared
r2_test <- R2(predicted_test, actual_test)

In [None]:
r2_train
r2_test

And compare the RMSE values:

In [None]:
summary(model1)$sigma
sqrt(sum(residuals_test^2) / (test_data[, .N] - 1))

In the test partition, the root mean squared error (RMSE) is higher and the R-squared value is lower, so the model does not generalize well to unseen data.
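
Note that `summary(model1)$sigma` is the residual standard error, which divides the sum of squared residuals by the residual degrees of freedom, whereas the test formula divides by n - 1. A minimal sketch, on the built-in `mtcars` data, that applies one and the same definition to both partitions; `rmse()` and `r2_corr()` are hypothetical helpers, and the squared-correlation R² is chosen to mirror the default of caret's `R2()`:

```r
set.seed(1)
idx <- sample(nrow(mtcars), 22)
toy_train <- mtcars[idx, ]
toy_test  <- mtcars[-idx, ]
toy_fit <- lm(mpg ~ wt + hp, data = toy_train)

rmse    <- function(obs, pred) sqrt(mean((obs - pred)^2))  # divides by n
r2_corr <- function(obs, pred) cor(obs, pred)^2            # squared correlation

rmse_toy_train <- rmse(toy_train$mpg, fitted(toy_fit))
rmse_toy_test  <- rmse(toy_test$mpg, predict(toy_fit, toy_test))
r2_toy_train   <- r2_corr(toy_train$mpg, fitted(toy_fit))
r2_toy_test    <- r2_corr(toy_test$mpg, predict(toy_fit, toy_test))

c(train = rmse_toy_train, test = rmse_toy_test)
c(train = r2_toy_train, test = r2_toy_test)
```

On the train partition the squared correlation coincides with the R-squared reported by `summary()`, and the RMSE computed with n in the denominator is slightly smaller than `sigma`.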

# Treat data problems

## Influential observations

Let's first calculate Cook's D values:

In [None]:
cd1 <- cooks.distance(model1)

In [None]:
sort(cd1, decreasing = TRUE)[1:10]

We can exclude the six observations with the largest Cook's D values.

In [None]:
infobv <- which(rank(-cd1) %in% 1:6)
infobv

Exclude those values:

In [None]:
train_datab <- copy(train_data)[-infobv]
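
An equivalent way to pick the k most influential observations is `order()`. A minimal sketch on the built-in `mtcars` data; the 4/n cutoff shown next to it is only a common rule of thumb and an assumption here, not the criterion used above:

```r
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)
toy_cd <- cooks.distance(toy_fit)

k <- 3
# row positions of the k largest Cook's D values
top_k <- order(toy_cd, decreasing = TRUE)[1:k]
toy_cd[top_k]

# rule-of-thumb alternative: flag every observation above 4 / n
flagged <- which(toy_cd > 4 / nrow(mtcars))
flagged

# refit without the k most influential rows
toy_fit_trim <- lm(mpg ~ wt + hp, data = mtcars[-top_k, ])
nobs(toy_fit_trim)
```

Whichever rule you use, it is worth refitting and comparing the coefficients before and after the exclusion.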

In [None]:
dim(train_data)

In [None]:
dim(train_datab)

## Non-normality of response variable

The response variable is not normally distributed:

In [None]:
hist(train_datab$premium_neigh)

Since the predictors include categorical and boolean variables, it is hard to apply a Box-Cox transformation.

Instead, we will use the moments of the variable's distribution and the cumulative probabilities (P-values) of the matching Pearson distribution to transform it into a normally distributed variable:

In [None]:
m1 <- mean(train_datab$premium_neigh)
v1 <- var(train_datab$premium_neigh)
sk1 <- skewness(train_datab$premium_neigh)
ku1 <- kurtosis(train_datab$premium_neigh)
m1
v1
sk1
ku1

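Using these moments, get the cumulative probabilities from the Pearson distribution for the train and test data and store them in a new variable `premium_neigh2`. Note that the moments of the train data are used for both partitions: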

In [None]:
train_datab[, premium_neigh2 := ppearson(premium_neigh, moments = c(m1, v1, sk1, ku1))]
test_data[, premium_neigh2 := ppearson(premium_neigh, moments = c(m1, v1, sk1, ku1))]

Then we map these probabilities to standard normal quantiles, so that `premium_neigh2` becomes standard normally distributed. Probabilities of exactly 0 or 1 yield infinite values, which are replaced with NA:

In [None]:
train_datab[, premium_neigh2 := qnorm(premium_neigh2)]
test_data[, premium_neigh2 := qnorm(premium_neigh2)]

In [None]:
train_datab[is.infinite(premium_neigh2), premium_neigh2 := NA]
test_data[is.infinite(premium_neigh2), premium_neigh2 := NA]
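
The two steps above form a probability integral transform: the cumulative distribution function maps the variable to probabilities between 0 and 1, and `qnorm()` maps those to standard normal quantiles. A minimal sketch with an exponential variable, whose distribution function is known exactly, in place of a fitted Pearson distribution (the exponential example is an assumption chosen for illustration):

```r
set.seed(42)
x <- rexp(500, rate = 2)     # strongly right-skewed variable
u <- pexp(x, rate = 2)       # step 1: cumulative probabilities
z <- qnorm(u)                # step 2: standard normal quantiles
z[is.infinite(z)] <- NA      # probabilities of exactly 0 or 1 give -Inf or Inf

# mean and standard deviation should be close to 0 and 1
c(mean = mean(z, na.rm = TRUE), sd = sd(z, na.rm = TRUE))

# the transform is monotone, so the ranks of the observations are preserved
cor(x, z, method = "spearman", use = "complete.obs")
```

With the Pearson approach the distribution function is only an approximation fitted from four moments, which is why the transformed variable may still deviate from normality.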

Let's confirm that the response in the train data is normalized:

In [None]:
m1b <- mean(train_datab$premium_neigh2, na.rm = T)
v1b <- var(train_datab$premium_neigh2, na.rm = T)
sk1b <- skewness(train_datab$premium_neigh2, na.rm = T)
ku1b <- kurtosis(train_datab$premium_neigh2, na.rm = T)
m1b
v1b
sk1b
ku1b

In [None]:
hist(train_datab$premium_neigh)
hist(train_datab$premium_neigh2)

While the distribution of the test data response is now closer to normal, it still has positive skewness and leptokurtosis:

In [None]:
m1t <- mean(test_data$premium_neigh2, na.rm = T)
v1t <- var(test_data$premium_neigh2, na.rm = T)
sk1t <- skewness(test_data$premium_neigh2, na.rm = T)
ku1t <- kurtosis(test_data$premium_neigh2, na.rm = T)
m1t
v1t
sk1t
ku1t

In [None]:
hist(test_data$premium_neigh)
hist(test_data$premium_neigh2)

## Re-run the model

Now we will rerun the same model on the corrected datasets:

In [None]:
model1b <- lm(premium_neigh2 ~ esyali + krediye_uygunluk*cadde_uzerinde +
             salon + poly(bina_yasi, 2) +
             manzara_sehir +
             pvc_dograma +
             goruntulu_diafon +
             su_deposu,
             train_datab)

Let's see the model summary:

In [None]:
model1b %>% summary
model1b %>% tidy %>% filter(p.value < 0.05) %>% mutate_if(is.numeric, round, 3)

The F-statistic is significant, the R-squared value is 31% and 6 terms are significant at the 5% level.

Now let's check VIF values:

In [None]:
car::vif(model1b)

No value is above the commonly used threshold of 5, so there is no sign of serious multicollinearity.

Let's check the diagnostic plots:

In [None]:
options(repr.plot.width = 7, repr.plot.height = 7)
plot(model1b)

In [None]:
plot(model1b, which = 4)

And conduct Shapiro-Wilk test for the normality of residuals:

In [None]:
shapiro.test(model1b$residuals)

What we see here is that:

- The heteroscedasticity in residuals is mostly cured
- Residuals are now normal according to normal Q-Q plot and Shapiro-Wilk test
- Most influential observations are excluded
- There is no multicollinearity issue

Now let's get the predictions on test data:

In [None]:
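actual_testb <- test_data$premium_neigh2
predicted_testb <- predict(model1b, test_data)
# keep only the rows with a non-missing response;
# a logical mask also behaves correctly when there are no NA values at all
not_na <- !is.na(actual_testb)
actual_testb <- actual_testb[not_na]
predicted_testb <- predicted_testb[not_na]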

In [None]:
residuals_testb <- actual_testb - predicted_testb # residual = observed - predicted

Let's compare train and test R-squared values:

In [None]:
r2_trainb <- summary(model1b)$r.squared
r2_testb <- R2(predicted_testb, actual_testb)

In [None]:
r2_trainb
r2_testb

And compare the RMSE values:

In [None]:
summary(model1b)$sigma
sqrt(sum(residuals_testb^2) / (length(residuals_testb) - 1))

In the test partition, the root mean squared error (RMSE) is higher and the R-squared value is lower, so the model still does not generalize well to unseen data.

Note that since the scale of the response variable is changed, the scale of RMSE values is different from that of the previous model.

# Object Generating Code

In [None]:
student_id <- 2025000000
library(tidyverse)
library(data.table)
library(broom) # for tidy statistical summaries
library(caret) # for regression performance measures
library(psych) # for pairwise comparisons
library(car) # for multicollinearity
library(moments) # for higher moments 
library(PearsonDS) # for Pearson distribution
library(rethinking) # for LKJ distribution
set.seed(student_id)
nvar <- 6
sampsize <- 1e3
etax <- 1e-3
train_ratio <- 0.7
matx <- rlkjcorr(1, nvar, etax)
sampx <- rmvnorm(sampsize, sigma = matx)
sampx <- pnorm(sampx)
means <- rnorm(nvar)
vars <- rexp(nvar, 1)
kurts <- rexp(nvar, 1) + 3
skews <- (rbeta(nvar, 3, 3) - 0.5)*2
colnamesx <- paste(sample(words, nvar + 1), "1", sep = "")
sampx_dt <- as.data.table(sampx)
sampx_dt <- as.data.table(mapply(function(x, a, b, c, d) qpearson(x, moments = c(a, b, c, d)), sampx_dt,
                                 means, vars, skews, kurts))
paramst <- as.matrix(runif(nvar, -5, 5))
errx <- as.matrix(rnorm(sampsize, 0, sqrt(rexp(1, 0.1))))
responsex <- as.matrix(sampx_dt) %*% paramst + errx
sampx_dt <- cbind(responsex, sampx_dt)
setnames(sampx_dt, colnamesx)
train_indices <- sampx_dt[, sample(.N, floor(.N * train_ratio))]
train_data <- sampx_dt[train_indices]
test_data <- sampx_dt[-train_indices]
normlz <- function(x)
{
    meanr <- mean(x, na.rm = T)
    varr <- var(x, na.rm = T)
    skewr <- skewness(x, na.rm = T)
    kurtr <- kurtosis(x, na.rm = T)
    normlx <- qnorm(ppearson(x, moments = c(meanr, varr, skewr, kurtr)))
    ifelse(is.infinite(normlx), NA, normlx)
}

## Tutorial

The code above generates three useful objects:

- train_data to train your model
- test_data to test your model and make predictions
- normlz() function to normalize a non-normal variable

The first column of the dataset is to be used as the response variable in your model. Note that the variable names are different for each of you.

### Exploration

This code generates a matrix of bivariate scatter plots along with the pairwise correlations, histograms and density plots of the variables. Note that the plot size options are increased for a better view:

In [None]:
options(repr.plot.width = 15, repr.plot.height = 15)
pairs.panels(train_data)

### Modeling and diagnosis

This code creates a regression model. Note that the variable names are different for each of you. The "1"s at the end ensure that the random variable names do not coincide with built-in function names, which might cause problems:

In [None]:
model1 <- lm(for1 ~ shut1 + like1 + long1 + air1 + ball1 + game1, train_data)

You can get a summary of the regression model:

In [None]:
summary(model1)

You can extract the coefficients in a better format this way:

In [None]:
tidy(model1)

You can check the R-squared value and the significance of the F-statistic, as well as the direction, magnitude and significance of the coefficients, and hence interpret the model. In this case:

- There is a significant positive relationship between for1 and the like1 and ball1 variables
- There is a significant negative relationship between for1 and the shut1 variable

You can print diagnostic plots with the code below. Note that the plot sizes are changed for this plot.

- From the Residuals vs. Fitted plot, you can check whether there is a systematic pattern that signals a model misspecification (like a non-linear relationship) or heteroscedasticity (different residual variance across fitted values)
- From the Q-Q Residuals plot, you can check whether the normality assumption is violated

In [None]:
options(repr.plot.width = 7, repr.plot.height = 7)
plot(model1)

You can check the influential observations by Cook's Distance:

In [None]:
plot(model1, which = 4)

You can get the Cook's Distance values of the observations to determine which ones should be omitted, if any:

In [None]:
cooksd <- cooks.distance(model1)
summary(cooksd)

We see that the most influential observations are the first six: observations no. 387, 101, 125, 454, 633 and 381. Note that after the sixth one there is a clear drop from 0.013 to 0.011, and the subsequent values are much lower. This interpretation applies only to this example; you have to check your own model output:

In [None]:
round(head(sort(cooksd, decreasing = TRUE), 10), 4)

After visual examination of normal Q-Q plot of residuals, you can test the normality assumption of residuals with Shapiro-Wilk test:

In [None]:
shapiro.test(model1$residuals)

Here a P-value above 0.05 means that the null hypothesis of normality is not rejected: there is no evidence that the residuals are non-normal.
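
To get a feeling for how the test behaves, a minimal sketch with two simulated samples (both are assumptions for illustration): one drawn from a normal distribution and one from a right-skewed exponential distribution. A small P-value leads to rejecting normality:

```r
set.seed(123)
normal_sample <- rnorm(200)   # drawn from a normal distribution
skewed_sample <- rexp(200)    # drawn from a right-skewed distribution

p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value
c(normal = p_normal, skewed = p_skewed)

# TRUE means normality is rejected at the 5% level
c(normal = p_normal < 0.05, skewed = p_skewed < 0.05)
```

Keep in mind that with large samples the test can reject normality even for small deviations, so it is best read together with the Q-Q plot.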

You can check the VIF values of the variables to detect multicollinearity. You can exclude the variable with the highest VIF value if it is above 5 and rerun the model. If you do so, you don't have to delete the variable from the dataset; just exclude the term from the model formula:

In [None]:
vif(model1)

Here we can start by excluding the long1 variable, which has the largest value, rerunning the model and checking the VIF values again:

In [None]:
model2 <- lm(for1 ~ shut1 + like1 + air1 + ball1 + game1, train_data)
vif(model2)

After long1 is taken out, there are no more variables with VIF > 5.

### Predictive performance

You can get predictions on the test data and combine them with the actual y values to get the residuals:

In [None]:
actual_test <- test_data$for1
predicted_test <- predict(model1, test_data)
residuals_test <- actual_test - predicted_test # residual = observed - predicted

You can compare R-squared values of train and test sets:

In [None]:
r2_train <- summary(model1)$r.squared
r2_test <- R2(predicted_test, actual_test)
r2_train
r2_test

And calculate and compare the root mean squared error (RMSE) values:

In [None]:
rmse_train <- summary(model1)$sigma
rmse_test <- sqrt(sum(residuals_test^2) / (length(residuals_test) - 1))
rmse_train
rmse_test

### Normalization

If a variable (especially the response variable) is highly non-normal, you can normalize it with the normlz() function created in the object generating code above.

For example, in the above dataset the shut1 variable is highly left-skewed:

In [None]:
hist(train_data$shut1)

While its kurtosis is not extreme:

In [None]:
kurtosis(train_data$shut1)

We can test its normality:

In [None]:
shapiro.test(train_data$shut1)

And reject the normality assumption.

Now let's normalize the variable:

In [None]:
hist(normlz(train_data$shut1))

Much more symmetric.

In [None]:
skewness(normlz(train_data$shut1))
kurtosis(normlz(train_data$shut1))

Skewness is now close to zero, while kurtosis has increased!

Test normality again:

In [None]:
shapiro.test(normlz(train_data$shut1))

Still non-normal. What to do?

Let's normalize a second time:

In [None]:
hist(normlz(normlz(train_data$shut1)))

In [None]:
skewness(normlz(normlz(train_data$shut1)))
kurtosis(normlz(normlz(train_data$shut1)))

Skewness is even closer to zero and the distribution is closer to mesokurtic now.

And test the normality again:

In [None]:
shapiro.test(normlz(normlz(train_data$shut1)))

Now normality is no longer rejected, so we can treat our variable as normal.

You can assign the normalized version of the variable to a newly created column or overwrite the original column. Note that if you normalize a variable, you have to normalize it in the test set as well.

To ensure that the normalization is made with the same parameters, it is better to combine the train and test data, normalize, assign the normalized values back and split into train and test partitions again.
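
Another option that also keeps the parameters identical, and mirrors what was done with `premium_neigh2` earlier, is to estimate the moments on the train partition only and reuse them for the test partition. A minimal sketch; `make_normlz()` is a hypothetical helper and the gamma samples are simulated stand-ins for a skewed variable:

```r
library(moments)    # skewness(), kurtosis()
library(PearsonDS)  # ppearson()

make_normlz <- function(x_train) {
  # the four moments are estimated once, on the train data only
  mom <- c(mean(x_train, na.rm = TRUE), var(x_train, na.rm = TRUE),
           skewness(x_train, na.rm = TRUE), kurtosis(x_train, na.rm = TRUE))
  # the returned function reuses these moments for any data passed to it
  function(x) {
    z <- qnorm(ppearson(x, moments = mom))
    ifelse(is.infinite(z), NA, z)
  }
}

set.seed(7)
x_train <- rgamma(700, shape = 2)   # simulated, right-skewed
x_test  <- rgamma(300, shape = 2)

normlz_train <- make_normlz(x_train)
z_train <- normlz_train(x_train)
z_test  <- normlz_train(x_test)

c(skew_before = skewness(x_train), skew_after = skewness(z_train, na.rm = TRUE))
```

With this variant no information from the test partition enters the transformation, at the cost of a test distribution that may be less close to normal.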