<font size="6"><b>GENERALIZED LINEAR MODELS AND LOGISTIC REGRESSION: APPLICATION</b></font>

<font size="5"><b>Serhat Çevikel</b></font>

In [None]:
library(tidyverse)
library(data.table)
library(plotly) # for interactive plotting
library(broom) # for tidy statistical summaries
library(pROC) # for ROC curve
library(caret) # for lift chart and confusion matrix

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

In [None]:
datapath <- "~/databa"

![xkcd](../imagesba/logistic.png)

(https://xkcd.com/2048/)

We continue with the realty dataset.

Recall that we calculated the premium_neigh variable, which is the premium of the property's unit price over the median unit price of its neighborhood.

Now we will try to classify the properties into premium and discount classes.

Let's first import the realty dataset:

In [None]:
realty_data3 <- readRDS(sprintf("%s/rds/realty_data3.rds", datapath))

In [None]:
head(realty_data3)

Let's add the binary variable premium, which takes the value 1 when premium_neigh is above 0, and 0 otherwise:

In [None]:
realty_data3[, premium := as.integer(premium_neigh > 0)]

Let's see the structure:

In [None]:
realty_data3 %>% str

Now, select some of the variables:

In [None]:
vars <- c("premium", "esyali", "krediye_uygunluk", "bina_yasi", "kat_sayisi", "kat", realty_data3 %>% keep(is.logical) %>% names)
vars

And assign the subset:

In [None]:
realty_data4 <- realty_data3 %>% select(all_of(vars)) %>% na.omit

Our tasks are to:

- Partition the data set into 70% train and 30% test sets
- Create and run a logistic regression model to explain premium with all other variables **without intercept**. Note that the neighborhood median is taken as the basis for the premium, so the two classes are nearly balanced
- Print the summary of the model. Compare and interpret null and residual deviance values and create a table of the coefficients of the variables that are significant at 5% level
- Calculate the fitted positive case ("1") probabilities from the model and also the fitted classes for the train set with a cut value of 0.5
- Create a confusion matrix. You may use the below code template:

```R
table(fitted = fitted_classes, actual = actual_classes) %>% caret::confusionMatrix(positive = "1") # rows must hold the fitted classes, columns the actual classes
```
- What are the TP, TN, FP, FN counts? Interpret accuracy, sensitivity and specificity.
- Interpret Kappa (what is the level of class agreement)
- Create a ROC curve and calculate AUC. How much better is the model than pure random guessing in the train set?
- Calculate the predicted positive case ("1") probabilities from the model and also the predicted classes for the test set with a cut value of 0.5
- Create a confusion matrix for the test set similar to the one above.
- What are the TP, TN, FP, FN counts? Interpret accuracy, sensitivity and specificity.
- Interpret Kappa (what is the level of class agreement)
- Create a ROC curve and calculate AUC. How much better is the model than pure random guessing in the test set?
- Compare the results from the confusion matrices and AUC values of the train and test sets. 

# Partition the dataset

We will take 70% as the train set and 30% as the test set:

In [None]:
#partition data
set.seed(2)
train <- realty_data4[,sample(.I, .N * 0.7)]

In [None]:
realty_train <- realty_data4[train]
realty_test <- realty_data4[-train]
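
A quick sanity check on a split like this is to confirm that the two partitions do not overlap and together cover every row. The sketch below does this in base R on a small made-up table; `toy` and its columns are illustrative assumptions and not part of the realty data.

```r
set.seed(2)

# Hypothetical toy table standing in for realty_data4
toy <- data.frame(id = 1:20, premium = rep(0:1, 10))

# Draw 70% of the row indices without replacement
train_idx <- sample(nrow(toy), floor(nrow(toy) * 0.7))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The partitions should not share rows and should cover the whole table
length(intersect(toy_train$id, toy_test$id)) == 0
nrow(toy_train) + nrow(toy_test) == nrow(toy)
```

Negative indexing with `-train_idx` keeps exactly the rows that were not sampled, which is the same idea as `realty_data4[-train]` above.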

# Build and train a model

Build a logistic regression model: 

In [None]:
logit_reg <- glm(`premium` ~ . -1, data = realty_train, family = "binomial")

In [None]:
summary(logit_reg)

Let's check whether this model does significantly better than the null model:

In [None]:
anova(update(logit_reg, . ~ 1), logit_reg, test = "LRT")

Since the p-value is below 0.05, we can say that the model fits significantly better than the null model at the 5% significance level.

The coefficients significantly different from 0 at 5% significance level are:

In [None]:
logit_reg %>% broom::tidy() %>% filter(p.value < 0.05)

So:

- The kapiciTRUE and parke_laminantTRUE variables have a positive effect on the probability of having a premium unit value,
- while the esyalievet, "krediye_uygunlukuygun degil" and on_cepheTRUE variables have a negative effect on that probability.

Note that while a coefficient's effect on the probability can be interpreted in terms of direction, its numeric size can only be interpreted in log-odds or odds terms.
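
To make this concrete, the sketch below takes a made-up coefficient (the value `0.8` is an assumption for illustration, not an estimate from `logit_reg`) and shows that the odds are always multiplied by the same factor, while the change in probability depends on the starting probability.

```r
# Hypothetical coefficient of a dummy variable (not taken from the fitted model)
b <- 0.8

# Switching the dummy from FALSE to TRUE multiplies the odds by exp(b)
odds_ratio <- exp(b)

# The change in probability depends on the baseline probability
baseline_p <- c(0.1, 0.5, 0.9)
new_p <- plogis(qlogis(baseline_p) + b)

effects <- data.frame(
  baseline_p,
  new_p,
  change = new_p - baseline_p,
  odds_ratio = (new_p / (1 - new_p)) / (baseline_p / (1 - baseline_p))
)
effects
```

The `odds_ratio` column is constant across rows, whereas the `change` column is largest for the baseline of 0.5 and shrinks towards the extremes.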

## Get fitted values and create confusion matrix

In [None]:
fit_train <- predict(logit_reg, realty_train, type = "response")

In [None]:
actual_train <- realty_train$premium

In [None]:
train_class <- ifelse(fit_train > 0.5, 1, 0)
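
One caveat with this kind of thresholding: if every fitted probability falls on the same side of the cut value, a plain `table()` has a single row, and a non-square table is likely to be rejected by `caret::confusionMatrix`. A minimal sketch of a more defensive variant, using made-up probabilities, fixes the factor levels first:

```r
# Hypothetical fitted probabilities that all fall below the cut value
fit_probs <- c(0.10, 0.25, 0.40, 0.45)
actual    <- c(0, 1, 0, 0)

# Without explicit levels the table has a single row
tab_plain <- table(fitted = ifelse(fit_probs > 0.5, 1, 0), actual = actual)
dim(tab_plain)

# Fixing the levels keeps the 2 x 2 layout even when a class is never predicted
fitted_class <- factor(ifelse(fit_probs > 0.5, 1, 0), levels = c(0, 1))
actual_class <- factor(actual, levels = c(0, 1))
tab_fixed <- table(fitted = fitted_class, actual = actual_class)
dim(tab_fixed)
```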

In [None]:
confmat <- table(fitted = train_class, actual = actual_train) %>% caret::confusionMatrix(positive = "1")
confmat

- TP: 101
- TN: 96
- FP: 46
- FN: 38

- Accuracy: 0.7, the share of all cases that are correctly classified
- Sensitivity: 0.68, the share of actual positive cases that are classified as positive
- Specificity: 0.71, the share of actual negative cases that are classified as negative
- Kappa: 0.445, moderate agreement
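
For reference, the sketch below shows how these statistics follow from the four cell counts of a confusion matrix. The counts are hypothetical round numbers chosen for easy arithmetic; they are not the realty results.

```r
# Hypothetical counts (not the realty results)
TP <- 30; TN <- 50; FP <- 10; FN <- 10
n  <- TP + TN + FP + FN

accuracy    <- (TP + TN) / n
sensitivity <- TP / (TP + FN)   # share of actual positives that are found
specificity <- TN / (TN + FP)   # share of actual negatives that are found
precision   <- TP / (TP + FP)   # share of fitted positives that are correct

# Cohen's kappa: observed agreement corrected for chance agreement
p_chance  <- ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / n^2
kappa_hat <- (accuracy - p_chance) / (1 - p_chance)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision,
        kappa = kappa_hat), 3)
```

Sensitivity and specificity are conditioned on the actual class, while precision is conditioned on the fitted class, so the two pairs answer different questions.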

Let's calculate the lift value. First the baseline precision (ratio of positive classes in the dataset) is:

In [None]:
bprec <- sum(actual_train) / length(actual_train)
bprec

And the ratio of precision to baseline precision is:

In [None]:
confmat$byClass["Precision"] / bprec

So the precision of our model is 1.39 times the share of positives in the data; in other words, the model is 1.39 times better at identifying positive cases than random guessing.
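
The same calculation can be wrapped in a small helper that works directly on the label vectors. `lift` below is a hypothetical function written for this illustration, and the labels are made up.

```r
# Lift of the positive class: precision divided by the share of positives
lift <- function(actual, predicted) {
  precision <- sum(predicted == 1 & actual == 1) / sum(predicted == 1)
  base_rate <- mean(actual == 1)
  precision / base_rate
}

# Hypothetical labels: 4 positives out of 10, 3 of the 4 predicted positives are correct
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
lift(actual, predicted)

# Labelling every case as positive gives a lift of exactly 1
lift(actual, rep(1, length(actual)))
```

A lift of 1 is therefore the natural benchmark: the classifier does no better than the base rate.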

## Create ROC and calculate AUC

In [None]:
plot.roc(actual_train, fit_train, legacy.axes = T)

In [None]:
auc(actual_train, fit_train) # legacy.axes only affects plotting, so it is not needed here

Pure random guessing would yield an AUC of 0.5 and perfect classification would yield 1, so the model's performance lies between the two. An AUC of 0.746 is usually considered acceptable (fair) classification performance.
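
AUC also has a direct probabilistic reading: it is the probability that a randomly chosen positive case receives a higher fitted probability than a randomly chosen negative case. The sketch below computes it from ranks in base R on made-up scores; `auc_rank` is a hypothetical helper written for this illustration, not a pROC function.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic scaled to [0, 1])
auc_rank <- function(actual, score) {
  r     <- rank(score)            # average ranks, so ties count as one half
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical labels and scores
actual <- c(0, 0, 1, 1)
score  <- c(0.10, 0.40, 0.35, 0.80)

# 3 of the 4 positive-negative pairs are ordered correctly
auc_rank(actual, score)
```

Of the four positive-negative pairs, only the pair (0.35, 0.40) is ordered the wrong way, which gives 3/4.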

# Get predictions on test set and create a confusion matrix

In [None]:
pred_test <- predict(logit_reg, realty_test, type = "response")

In [None]:
test_class <- ifelse(pred_test > 0.5, 1, 0)

In [None]:
actual_test <- realty_test$premium

In [None]:
confmat2 <- table(prediction = test_class, actual = actual_test) %>% caret::confusionMatrix(positive = "1")
confmat2

- TP: 40
- TN: 36
- FP: 25
- FN: 20

- Accuracy: 0.62, the share of all cases that are correctly classified
- Sensitivity: 0.62, the share of actual positive cases that are predicted as positive
- Specificity: 0.64, the share of actual negative cases that are predicted as negative
- Kappa: 0.25, fair agreement

In [None]:
bprect <- sum(actual_test) / length(actual_test)
bprect

And the ratio of precision to baseline precision is:

In [None]:
confmat2$byClass["Precision"] / bprect

So our model is 1.24 times better at identifying positive cases than random guessing, slightly below the lift on the train set.

Although the classification performance on the test set is below that on the train set, we still have fair prediction performance.

In [None]:
plot.roc(actual_test, pred_test, legacy.axes = T)

In [None]:
auc(actual_test, pred_test)

The AUC value on the test set indicates that the class prediction performance can be considered poor.

# Object Generating Code

In [None]:
student_id <- 2025000000
library(tidyverse)
library(data.table)
library(broom) # for tidy statistical summaries
library(moments) # for higher moments 
library(PearsonDS) # for Pearson distribution
library(rethinking) # for LKJ distribution
library(caret) # for confusion matrix
library(pROC) # for roc and auc
set.seed(floor((student_id %% 1e8) * 1.1))
nvar <- 6
sampsize <- 1e3
etax <- 1e-3
train_ratio <- 0.7
matx <- rlkjcorr(1, nvar, etax)
sampx <- rmvnorm(sampsize, sigma = matx) # use sampsize (1e3) instead of a hard-coded count
sampx <- pnorm(sampx)
means <- rnorm(nvar)
vars <- rexp(nvar, 1)
kurts <- rexp(nvar, 1) + 3
skews <- (rbeta(nvar, 3, 3) - 0.5)*2
colnamesx <- paste(sample(words, nvar + 1), "1", sep = "")
sampx_dt <- as.data.table(sampx)
sampx_dt <- as.data.table(mapply(function(x, a, b, c, d) qpearson(x, moments = c(a, b, c, d)), sampx_dt,
                                 means, vars, skews, kurts))
paramst <- as.matrix(runif(nvar, -5, 5))
errx <- as.matrix(rnorm(sampsize, 0, sqrt(rexp(1, 0.02))))
responsex <- as.matrix(sampx_dt) %*% paramst + errx
posrate <- runif(1, 0.2, 0.4)
cutp <- quantile(responsex, 1 - posrate)
responsex <- ifelse(responsex > cutp, 1, 0)
sampx_dt <- cbind(responsex, sampx_dt)
setnames(sampx_dt, colnamesx)
train_indices <- sampx_dt[,sample(.I, .N * train_ratio), by = c(colnamesx[1])]$V1
train_data <- sampx_dt[train_indices]
test_data <- sampx_dt[-train_indices]
normlz <- function(x)
{
    meanr <- mean(x, na.rm = T)
    varr <- var(x, na.rm = T)
    skewr <- skewness(x, na.rm = T)
    kurtr <- kurtosis(x, na.rm = T)
    normlx <- qnorm(ppearson(x, moments = c(meanr, varr, skewr, kurtr)))
    ifelse(is.infinite(normlx), NA, normlx)
}

We have two partitions:

- train_data
- test_data

Let's see the variable names:

In [None]:
names(train_data)

evidence1 is a binary variable that takes only the values 0 and 1:

In [None]:
table(train_data$evidence1)

The ratio of positive class in the train_data is 25.3%:

In [None]:
train_data[, sum(evidence1) / .N]

Now let's use boxplots to see how well each variable separates the classes of the response variable:

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)
train_data %>%
  pivot_longer(-"evidence1") %>%
  ggplot(aes(x = factor(evidence1), y = value)) + # one labelled box per response class
  geom_boxplot() +
  labs(x = "evidence1") +
  facet_wrap(~ name, scales = "free_y")

We see that fight1, site1 and visit1 separate the two classes better than the other variables.

Now let's create a logistic regression model:

In [None]:
model1 <- glm(evidence1 ~ fight1 + site1 + visit1, family = "binomial", data = train_data)

View the summary of the model:

In [None]:
summary(model1)

Let's test whether the model fits the response classes significantly better than the null model:

In [None]:
anova(update(model1, . ~ 1), model1, test = "LRT")

The p-value is well below 5%, so the model fits significantly better than the null model.
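
The likelihood ratio test behind this comparison can be reproduced by hand: the test statistic is the drop in deviance from the null model to the fitted model, compared against a chi-square distribution with degrees of freedom equal to the number of added parameters. The sketch below checks this on simulated data; all names and values in it are assumptions for illustration.

```r
set.seed(1)

# Simulated data, for illustration only
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))

fit  <- glm(y ~ x, family = "binomial")
null <- update(fit, . ~ 1)

# Likelihood ratio statistic: drop in deviance from the null to the fitted model
lr_stat <- null$deviance - fit$deviance
lr_df   <- null$df.residual - fit$df.residual

# Upper-tail chi-square probability
p_manual <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# The same p-value as reported by anova()
p_anova <- anova(null, fit, test = "LRT")[2, "Pr(>Chi)"]
all.equal(p_manual, p_anova)
```

The deviance of the intercept-only model is also what `summary()` reports as the null deviance of the fitted model, so the statistic can be read off the summary output as null deviance minus residual deviance.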

The intercept is significant at the 5% level and equals -3.53, which means that when all predictors are zero, the log-odds of a positive class is -3.53. The corresponding odds (not an odds ratio) of a positive class are:

In [None]:
exp(-3.53)
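
The exponentiated intercept can be taken one step further and converted into a probability. A minimal sketch, using only the intercept value quoted above:

```r
intercept <- -3.53             # intercept value quoted in the text
odds <- exp(intercept)         # odds of a positive class when all predictors are zero
prob <- odds / (1 + odds)      # the corresponding probability

# plogis() is the inverse logit and gives the same result in one step
all.equal(prob, plogis(intercept))
prob
```

So when all three predictors are zero, the model assigns a probability of a little under 3% to the positive class.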

All three variable coefficients are significant at 5%.

An increase in fight1 decreases the odds of having a positive class, while an increase in site1 or visit1 increases the odds.

Let's get fitted probabilities and create a confusion matrix:

In [None]:
fit_train <- predict(model1, train_data, type = "response")

In [None]:
actual_train <- train_data$evidence1

In [None]:
train_class <- ifelse(fit_train > 0.5, 1, 0)

In [None]:
cm_train <- table(fitted = train_class, actual = actual_train) %>% caret::confusionMatrix(positive = "1")
cm_train

- Accuracy is 86%, so 86% of all cases are correctly predicted.
- Kappa value is 0.602, so the prediction performance is considered good
- Sensitivity is 0.62, so 62% of actual positive values are correctly predicted
- Specificity is 0.94, so 94% of actual negative values are correctly predicted
- Positive predictive value or precision is 0.78, so 78% of predicted positives are true positives
- Negative predictive value is 0.88, so 88% of predicted negatives are true negatives  
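
The gap between sensitivity and specificity depends on the cut value: with a minority positive class, a cut of 0.5 tends to favor specificity, and lowering the cut trades specificity for sensitivity. The sketch below illustrates the trade-off on simulated data; the data and the helper `sens_spec` are assumptions for illustration and are unrelated to train_data.

```r
set.seed(42)

# Simulated data with a minority positive class, for illustration only
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1.5 + 1.5 * x))
p_hat <- fitted(glm(y ~ x, family = "binomial"))

# Sensitivity and specificity at a given cut value
sens_spec <- function(cut) {
  pred <- as.integer(p_hat > cut)
  c(cut = cut,
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}

res <- t(sapply(c(0.25, 0.5, 0.75), sens_spec))
round(res, 3)
```

Raising the cut value can only shrink the set of predicted positives, so sensitivity never increases and specificity never decreases as the cut goes up.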

Let's calculate the lift value. First the baseline precision (ratio of positive classes in the dataset) is:

In [None]:
bprec_tr <- sum(actual_train) / length(actual_train)
bprec_tr

And the ratio of precision to baseline precision is:

In [None]:
cm_train$byClass["Precision"] / bprec_tr

So our model is 3 times better at identifying positive cases than random guessing.

The ROC curve is below:

In [None]:
plot.roc(actual_train, fit_train, legacy.axes = T)

AUC value is 0.896, which can be considered as an excellent classification performance:

In [None]:
auc(actual_train, fit_train)

Let's do the same on test_data: predict probabilities and create a confusion matrix:

In [None]:
pred_test <- predict(model1, test_data, type = "response")

In [None]:
test_class <- ifelse(pred_test > 0.5, 1, 0)

In [None]:
actual_test <- test_data$evidence1

In [None]:
cm_test <- table(prediction = test_class, actual = actual_test) %>% caret::confusionMatrix(positive = "1")
cm_test

While the accuracy of 0.81 is only slightly below that on train_data, the kappa value of 0.478 is much lower and indicates moderate agreement.

Sensitivity of 0.558 and specificity of 0.897 are also lower than the ones calculated for train_data.

Let's calculate the lift value. First the baseline precision (ratio of positive classes in the dataset) is:

In [None]:
bprec_te <- sum(actual_test) / length(actual_test)
bprec_te

And the ratio of precision to baseline precision is:

In [None]:
cm_test$byClass["Precision"] / bprec_te

So our model is 2.55 times better at identifying positive cases than random guessing, which is below the lift calculated on train_data but still high.

The ROC curve for test_data is:

In [None]:
plot.roc(actual_test, pred_test, legacy.axes = T)

The AUC can still be considered excellent, although it is slightly below that of train_data:

In [None]:
auc(actual_test, pred_test)