# LOGISTIC REGRESSION AND PCA WITH CRAB DATASET

Adapted from Lewis (2017), Chapter 7

We will import a dataset from the MASS package that contains body measurements of crabs, and we will try to classify each crab as male or female.

## Libraries and dataset

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
library(plotly) # for interactive visualizations
library(corrplot) # for correlation plots
library(psych) # for visualizing relationship among pairs of variables and PCA
library(GPArotation) # for rotation of components in PCA
library(GGally) # for better visualizing relationship among pairs of variables
library(listviewer) # for visualizing nested data structures
library(MASS) # for crab dataset
library(pROC) # for ROC curve
library(ROCR) # another library for ROC curve
library(plotROC) # for pretty plot ROC curve
library(IRdisplay) # for displaying interactive ROC curves
library(gains) # for lift charts
library(caret) # for lift charts and confusion matrix
library(lift) # for decile lift chart
options(warn = -1) # for suppressing warnings

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

In [None]:
datapath <- "~/data_ad454"

In [None]:
data("crabs", package = "MASS")

In [None]:
crabs_dt <- as.data.table(crabs)

## Explore data

In [None]:
?crabs

```
This data frame contains the following columns:

sp
species - "B" or "O" for blue or orange.

sex
as it says.

index
index 1:50 within each of the four groups.

FL
frontal lobe size (mm).

RW
rear width (mm).

CL
carapace length (mm).

CW
carapace width (mm).

BD
body depth (mm).
```



Get the structure:

In [None]:
str(crabs_dt)

Delete index column:

In [None]:
crabs_dt[,index := NULL]

View initial rows:

In [None]:
crabs_dt %>% head()

### Factor variables

Let's explore and visualize factor variables:

In [None]:
crabs_dt %>% purrr::keep(is.factor) %>% purrr::map(levels)

In [None]:
crabs_factors <- crabs_dt %>% purrr::keep(is.factor) %>% # select factor columns
    tidyr::gather() %>% # convert into long format for faceting
    ggplot(aes(x = value)) + # plot value
    facet_wrap(~ key, scales = "free") + # divide into separate plots by key
    geom_bar()

plotly::ggplotly(crabs_factors)

### Numeric variables

And let's explore and visualize numeric variables:

In [None]:
crabs_dt %>% purrr::keep(is.numeric) %>% sapply(quantile) %>% t()

In [None]:
crabs_dt %>% purrr::keep(is.numeric) %>% # select columns
    tidyr::gather() %>% # reshape into long format in columns "key" and "value"
    ggplot(aes(value)) + # plot value
        facet_wrap(~ key, scale = "free" ) + # divide into separate plots by key
        geom_density(fill = "green")  # get density plots

### Relationships among features

View the correlation plot across numeric variables:

In [None]:
crabs_dt %>% purrr::keep(is.numeric) %>% cor() %>%
    corrplot::corrplot.mixed(upper = "ellipse",
                             lower = "number",
                             tl.pos = "lt",
                             number.cex = .5,
                             lower.col = "black",
                             tl.cex = 0.7)

We see that many variables are highly correlated.

### Scatterplots

Let's combine histograms, density plots, correlations, and scatterplots:

In [None]:
crabs_dt %>% purrr::keep(is.numeric) %>% psych::pairs.panels()

In [None]:
crabs_dt %>% purrr::keep(is.numeric) %>% GGally::ggpairs() %>% ggplotly()

## Partition the dataset

`y` will be the target variable, that is, the class labels (sex):

In [None]:
y <- crabs$sex

Select 150 indices for train set:

In [None]:
set.seed(2018)
train <- crabs_dt[,sample(.I, 150)]
head(train)

And split the dataset:

In [None]:
y_train <- y[train]
y_test <- y[-train]

crabs_train <- crabs_dt[train]
crabs_test <- crabs_dt[-train]

## Build and train a model

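Like linear regression, logistic regression works best when the features are not strongly correlated with each other: multicollinearity inflates the standard errors of the coefficients and makes them hard to interpret.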

We saw earlier that there is high correlation between the features.

Given this, we will fit our initial model using FL, RW, and the categorical variable sp.

The logistic regression can be fitted to the sample data using the glm function, with the family argument set to binomial.

First, create the formula:

In [None]:
formula1 <- reformulate(c("sp", "FL", "RW"), "y_train")
formula1

And run the logistic regression model:

In [None]:
fit1 <- glm(formula1,
            family=binomial(link='logit'),
            data=crabs_train)

## Evaluate model

View the summary of the model:

In [None]:
summary(fit1)

### Deviance residuals

Like the residual in a linear regression model, the deviance residuals are a measure of model fit.

Smaller absolute values indicate better fit. This part of the output shows minimum, quantiles, and the maximum of the deviance residuals for individual sample examples used to fit the model. The maximum deviance is 2.143, with a very small median value of -0.00255.

In [None]:
summary(fit1)$deviance.resid %>% summary()

### Estimated coefficients

In [None]:
summary(fit1)$coefficients

The estimated coefficients are shown in the next part of the output.

They indicate that FL influences crab sex positively, while sp and RW have a negative effect.

The estimated values tell us the change in the log odds of the target variable for a one unit increase in a feature variable.

As an example, for a one unit increase in FL, the log odds of being a male crab (versus female) increases by 3.725. 

However, for a one unit increase in RW, the log odds of being a male crab decreases by 5.11.
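
Log odds are hard to read directly, so a common follow-up step is to exponentiate the coefficients to obtain odds ratios. The sketch below is a self-contained illustration: it refits the same formula on the full `crabs` data (an assumption made so that the block runs on its own), and `fit_demo` is a hypothetical name, so its estimates will differ from those of `fit1`.

```r
# Load the crab data shipped with the MASS package
data("crabs", package = "MASS")

# Illustrative fit on all 200 rows (assumption: the notebook's fit1 uses only the train rows)
fit_demo <- glm(sex ~ sp + FL + RW, family = binomial(link = "logit"), data = crabs)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios
log_odds <- coef(fit_demo)
odds_ratios <- exp(log_odds)

# Show both scales side by side
round(cbind(log_odds, odds_ratios), 3)
```

An odds ratio above 1 means the odds of being male rise with the feature, while a value below 1 means they fall.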

###  Statistical significance of coefficients

In [None]:
confint.default(fit1)

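The `confint.default` function returns Wald confidence intervals, which rely on the asymptotic normality of the estimates.

For logistic models, the `confint` function instead reports confidence intervals based on the profiled log-likelihood function.

To see these: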

In [None]:
confint(fit1)

Since none of the confidence intervals straddle 0, we can have some empirically grounded certainty that the sign of the estimated coefficients captures the direction of the relationship of the features to the log odds of being a male crab.
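
The Wald intervals returned by `confint.default` can be reproduced by hand as the estimate plus or minus a normal quantile times the standard error. The sketch below assumes a model refitted on the full `crabs` data so that it is self-contained; `fit_demo` and `wald_ci` are hypothetical names.

```r
# Load the crab data shipped with the MASS package
data("crabs", package = "MASS")

# Illustrative fit on all 200 rows (assumption, not the notebook's fit1)
fit_demo <- glm(sex ~ sp + FL + RW, family = binomial, data = crabs)

est <- coef(fit_demo)              # point estimates (log odds)
se <- sqrt(diag(vcov(fit_demo)))   # standard errors from the covariance matrix
z <- qnorm(0.975)                  # normal quantile for a 95% interval

# Wald interval: estimate +/- z * standard error
wald_ci <- cbind(lower = est - z * se, upper = est + z * se)
round(wald_ci, 3)

# Should agree with the built-in Wald intervals
all.equal(wald_ci, confint.default(fit_demo), check.attributes = FALSE)
```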

### Variable importance

The absolute value of the z-score is often used to measure variable importance.

In this case, RW, with an absolute z-score of 5.286, followed by FL, with an absolute z-score of 5.160, are the most influential features.

This is useful to know, and makes sense because both features are related to body size.

In [None]:
summary(fit1)$coefficients

### Null and residual deviance

The residual deviance is analogous to the residual sum of squares of a linear regression model.

Lower values indicate better fit.

It takes a value of 58.703.

The null deviance reports how well the target variable is predicted by a model that includes only the intercept.

We would expect our model to do better than this.

In this case it does, as the null deviance = 207.917.

This implies our model has reduced the deviance by just over 149 points.

In [None]:
fit1$deviance
fit1$null.deviance

### AIC and Fisher scoring

The following two items are also reported via the summary function:

- The Akaike Information Criterion (AIC) is a measure of the relative quality of statistical models. It is only useful for comparing models.
- The “Number of Fisher Scoring” iterations simply tells you how many iterations were needed to fit the model by maximum likelihood.

In [None]:
fit1$aic
fit1$iter

### ANOVA

How well our model fits depends on the difference between the model and the observed data.

One approach to evaluate this is to use the anova function:

In [None]:
anova(fit1 , test="Chisq")

The anova function adds the features in the order given in the model formula (left to right).

Hence, sp appears first followed by FL and RW.

Analyzing the table, we observe only a small drop in deviance when adding sp and FL. For example, adding sp reduces the deviance from the null model's value of 207.917 to 207.485.

This is a tiny drop. We see a similar pattern when adding FL.

In this case, the model deviance drops by 0.533 to 206.951.

The addition of these two variables moves the model deviance in the right direction (downward).

However, as indicated by the large p-values (Pr(>Chi)) for both sp and FL, the change is not statistically significant.

This indicates that the model without these variables explains approximately the same amount of variation.

Fortunately, adding the feature RW leads to a significant reduction in deviance of over 148 points.

A highly significant p-value here supports the importance of this feature.
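
The sequential deviance drops reported by `anova` can be reproduced by fitting the nested models one at a time and differencing their deviances. This is a minimal sketch that assumes the full `crabs` data rather than the train split, so its numbers will not match the table above; `formulas`, `deviances`, and `drops` are hypothetical names.

```r
# Load the crab data shipped with the MASS package
data("crabs", package = "MASS")

# Nested models, adding one term at a time in formula order
formulas <- list(sex ~ 1, sex ~ sp, sex ~ sp + FL, sex ~ sp + FL + RW)

# Residual deviance of each nested model
deviances <- sapply(formulas, function(f) {
  deviance(glm(f, family = binomial, data = crabs))
})

# Drop in deviance attributable to each added term
drops <- -diff(deviances)
names(drops) <- c("sp", "FL", "RW")
round(drops, 3)
```

Because the terms enter sequentially, changing their order in the formula changes the drop attributed to each one.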

### Pseudo R<sup>2</sup> statistic

The pseudo-R<sup>2</sup> is a useful goodness-of-fit metric for logistic regression.

Similar to the traditional R<sup>2</sup> statistic, it takes a value between 0 and 1. It is calculated as:

$${\text{pseudo } R^{2}} = 1- \frac{\text{model deviance}}{{\text{Null deviance}}}$$

In [None]:
1 - (fit1$deviance / fit1$null.deviance)

The closer the metric is to 1, the more useful the features are in predicting the target variable.

In statistical language, it is more a measure of effect size than overall fit.

In any case, the value of about 0.718 indicates that the model is useful for predicting crab sex.
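
The two deviances that enter the pseudo-R<sup>2</sup> also give a likelihood-ratio test of the whole model against the intercept-only model: the drop in deviance is compared with a chi-squared distribution whose degrees of freedom equal the number of added coefficients. The sketch below assumes a model refitted on the full `crabs` data so that it runs on its own; `fit_demo` is a hypothetical name.

```r
# Load the crab data shipped with the MASS package
data("crabs", package = "MASS")

# Illustrative fit on all 200 rows (assumption, not the notebook's fit1)
fit_demo <- glm(sex ~ sp + FL + RW, family = binomial, data = crabs)

# Drop in deviance relative to the intercept-only model
dev_drop <- fit_demo$null.deviance - fit_demo$deviance

# Number of coefficients added on top of the intercept
df_drop <- fit_demo$df.null - fit_demo$df.residual

# Upper-tail chi-squared probability of a drop at least this large
p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
c(deviance_drop = dev_drop, df = df_drop, p_value = p_value)
```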

### Model discrimination, ROC, and AUC

The discrimination of a model, that is, how well the model separates male from female crabs, can also be assessed using the area under the receiver operating characteristic curve (AUC).

The ROC curve is built from two metrics, specificity and sensitivity.

Specificity, or the true negative rate, measures how often the model predicts “female” (y = 0) when the crab is actually female:

$${\text{Specificity}} = \frac{\text{True Negatives}}{{\text{Total Negatives}}}$$

Sensitivity, or the true positive rate, measures how often the model predicts “male” when the crab is actually male:

$${\text{Sensitivity}} = \frac{\text{True Positives}}{{\text{Total Positives}}}$$
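
Both ratios can be read off a two-by-two table of predicted versus actual classes. The sketch below is an illustration under stated assumptions: it refits the model on the full `crabs` data, uses a 0.5 probability cutoff, and treats "M" as the positive class; `fit_demo`, `prob_male`, and `tab` are hypothetical names.

```r
# Load the crab data shipped with the MASS package
data("crabs", package = "MASS")

# Illustrative fit on all 200 rows (assumption, not the notebook's fit1)
fit_demo <- glm(sex ~ sp + FL + RW, family = binomial, data = crabs)

# Predicted probability of being male, then a class label at a 0.5 cutoff
prob_male <- predict(fit_demo, type = "response")
pred_class <- factor(ifelse(prob_male > 0.5, "M", "F"), levels = c("F", "M"))

# Rows are predictions, columns are actual classes
tab <- table(Predicted = pred_class, Actual = crabs$sex)

# Sensitivity: true positives over all actual positives (males)
sensitivity <- tab["M", "M"] / sum(tab[, "M"])

# Specificity: true negatives over all actual negatives (females)
specificity <- tab["F", "F"] / sum(tab[, "F"])

c(sensitivity = sensitivity, specificity = specificity)
```

Moving the cutoff away from 0.5 trades one metric against the other, which is exactly what the ROC curve traces out.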

Specificity and Sensitivity are often combined via a Receiver Operating Characteristic Curve (ROC).

The ROC visually measures how well the predictive model separates the data into positives and negatives

![ROC](https://www.researchgate.net/publication/8636163/figure/fig2/AS:202684352208899@1425335123086/Four-ROC-curves-with-different-values-of-the-area-under-the-ROC-curve-A-perfect-test-A.png)

Four ROC curves with different values of the area under the ROC curve:
- A perfect test (A) has an area under the ROC curve of 1.
- The chance diagonal (D, the line segment from 0, 0 to 1, 1) has an area under the ROC curve of 0.5.
- ROC curves of tests with some ability to distinguish between those subjects with and those without a disease (B, C) lie between these two extremes.
- Test B with the higher area under the ROC curve has a better overall diagnostic performance than test C.

(https://www.researchgate.net/figure/Four-ROC-curves-with-different-values-of-the-area-under-the-ROC-curve-A-perfect-test-A_fig2_8636163)

#### ROC curve with pROC package

First, get the predicted probability values for the train set:

In [None]:
pred1 <- predict(fit1 , type="response")

as.data.table(pred1) %>%
ggplot(aes(x = seq_along(pred1), y = pred1)) +
xlab("Index") +
geom_point()

Since the factor levels of sex are F (first) and M (second), in alphabetical order, `glm` treats F as the reference level, so the predicted values are the probabilities of being M.

Now create a roc object from the predicted probability values and the actual class labels:

In [None]:
roc <- pROC::roc(y_train, pred1)

In [None]:
roc

The train set data gave a value of 0.9067, indicating that the model discriminates well

The confidence interval can be called using the ci function:

In [None]:
pROC::ci(roc)

Plot the ROC curve with base plot:

In [None]:
plot(roc)

A fancier option is from plotROC package:

In [None]:
p1 <- data.table(D = y_train, M = pred1) %>%
ggplot(aes(m = M, d = D)) +
    plotROC::geom_roc() +
    plotROC::style_roc(theme = theme_grey)

p1

Or an interactive version of the same plot:

In [None]:
plotROC::export_interactive_roc(p1) %>% IRdisplay::display_html()

The closer the curve is to the perfect classifier, the better it is at identifying positive 
values. This can be measured using a statistic known as the area under the ROC 
curve (abbreviated AUC). The AUC treats the ROC diagram as a two-dimensional 
square and measures the total area under the ROC curve. AUC ranges from 0.5 (for 
a classifier with no predictive value) to 1.0 (for a perfect classifier). A convention to 
interpret AUC scores uses a system similar to academic letter grades:

- A: Outstanding = 0.9 to 1.0
- B: Excellent/good = 0.8 to 0.9
- C: Acceptable/fair = 0.7 to 0.8
- D: Poor = 0.6 to 0.7
- E: No discrimination = 0.5 to 0.6
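
The AUC also equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which allows it to be computed from ranks alone. The function below is a minimal sketch of that rank-based calculation; `auc_rank` and the toy vectors are hypothetical and not part of any package.

```r
# Rank-based AUC: share of (positive, negative) pairs ranked correctly
auc_rank <- function(scores, is_positive) {
  r <- rank(scores)                 # ties receive average ranks
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data: three of the four positive-negative pairs are ordered correctly
scores <- c(0.1, 0.4, 0.35, 0.8)
is_positive <- c(FALSE, FALSE, TRUE, TRUE)
auc_rank(scores, is_positive)  # 0.75
```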

In [None]:
pROC::auc(roc)

AUC is outstanding

#### ROC curve with ROCR library

Adapted from Lantz (2015) Chapter 10

First we create a prediction object, taking into account the probabilities:

In [None]:
pred_ROCR <- ROCR::prediction(predictions = pred1,
                             labels = y_train,
                             label.ordering = c("F", "M"))

And then a performance object for true positives vs false positives:

In [None]:
perf <- ROCR::performance(pred_ROCR,
                            measure = "tpr",
                            x.measure = "fpr")

And let's plot it:

In [None]:
plot(perf, lwd = 3)

For AUC calculation:

In [None]:
perf_auc <- ROCR::performance(pred_ROCR, measure = "auc")
perf_auc@y.values %>% unlist()

### Lift chart

In some applications, the goal is to search, among a set of new records, for a subset of records that gives the highest cumulative predicted values.

In such cases, a graphical way to assess predictive performance is through a lift chart.

This compares the model’s predictive performance to a baseline model that has no predictors.

A lift chart for a continuous response is relevant only when we are searching for a set of records that gives the highest cumulative predicted values.

A lift chart is not relevant if we are interested in predicting the outcome value for each new record.

The lift chart is based on ordering the set of records of interest (typically validation data) by their predicted value, from high to low.

Then, we accumulate the actual values and plot their cumulative value on the y-axis as a function of the number of records accumulated (the x-axis value).

This curve is compared to assigning a naive prediction (the mean, $\bar{y}$) to each record and accumulating these average values, which results in a diagonal line.

The further away the lift curve is from the diagonal benchmark line, the better the model is doing in separating records with high value outcomes from those with low value outcomes.
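
The construction just described can be sketched in a few lines of base R: sort by predicted value, accumulate the actual values, and accumulate the mean for the baseline. The helper `cumulative_gains` and the toy vectors are hypothetical names used only for illustration.

```r
# Build the points of a cumulative lift (gains) chart
cumulative_gains <- function(pred, actual) {
  ord <- order(pred, decreasing = TRUE)          # highest predictions first
  data.frame(
    n_records = seq_along(pred),                 # x axis: records accumulated
    model = cumsum(actual[ord]),                 # y axis: cumulative actual values
    baseline = seq_along(pred) * mean(actual)    # naive diagonal benchmark
  )
}

# Toy data: five records with predicted scores and actual 0/1 outcomes
pred <- c(0.9, 0.2, 0.7, 0.4, 0.6)
actual <- c(1, 0, 1, 0, 0)
gains_tbl <- cumulative_gains(pred, actual)
gains_tbl
```

Both curves end at the same point, the total of the actual values, so only the path taken to get there distinguishes the model from the baseline.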

The same information can be presented in a decile lift chart, where the ordered records are grouped into ten deciles, and for each decile, the chart presents the ratio of model lift to naive benchmark lift.

(Shmueli (2017) Chapter 5)

#### Lift chart with gains package and ggplot

First we create a gains object:

In [None]:
gain <- gains::gains(as.numeric(y_train) - 1,
                        pred1,
                        groups = length(pred1))



In [None]:
str(gain)

In order to plot the lift chart, we should take the first 11 elements from gain and convert to a data frame object in order to plot with ggplot:

In [None]:
as.data.frame(gain[1:11]) %>%
ggplot(aes(x = cume.obs / max(cume.obs), y = cume.pct.of.total)) +
geom_line()

#### Lift chart with caret and ggplot

A better option is to use the lift() function from the caret package.

In [None]:
lift1 <- caret::lift(y_train ~ pred1)
lift1

Here we see that the function took the F class as the positive case while in our model M is the positive (predicted) case. We change the factor levels for that:

In [None]:
lift1 <- caret::lift(y_train %>% forcats::fct_relevel("M") ~ pred1)

First let's view the percent table of class labels:

In [None]:
table(y_train) %>% prop.table()

And plot the lift curve:

In [None]:
lift1 %>%
ggplot(plot = "gain") %>% plotly::ggplotly()

Let's interpret this plot:

The dashed line at the top is the ideal case where classification accuracy is 100%.

The dark line is the plot for the model.

The dashed line at the bottom is the pure random case and is the baseline.

The observations are sorted by probabilities. The x axis shows the percent of cases covered, and the y axis shows the percent of the positive class found in those cases.

Since the percent of positive class is 49.3, in the best case the line will have a smooth upward slope until x reaches 49.3 and y reaches 100%. After that no more positive cases are left, so the line will be horizontal.

In the random case, 100% of positive cases will be found only when 100% of samples are tested.

We see that the model line follows the best line until 36.7% of all cases are tested.

#### Decile lift by lift package

Another way to look at the lift chart is to look at each subsequent decile of observations, and what portion of the positive cases are caught with the model:

In [None]:
lift::plotLift(pred1, as.numeric(y_train) - 1, n.buckets = 10, cumulative = F)

In [None]:
lift::TopDecileLift(pred1, as.numeric(y_train) - 1)

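The top decile lift of 2.027 means that the 10% of observations with the highest predicted probabilities contain positive cases at 2.027 times the overall rate, so 20.27% of all positive cases are found in that decile.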
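
The arithmetic behind this figure can be checked by hand. The counts below are assumptions inferred from the text rather than read from the output: 150 training records, 74 of them male (49.3%), and a top decile of 15 records that are all male.

```r
# Assumed counts, inferred from the 49.3% positive rate reported above
n_total <- 150        # training records
n_pos <- 74           # assumed number of males (74 / 150 = 49.3%)
n_top <- n_total / 10 # size of the top decile
pos_in_top <- 15      # assumption: every record in the top decile is male

# Lift: positive rate in the top decile relative to the overall positive rate
top_decile_lift <- (pos_in_top / n_top) / (n_pos / n_total)

# Share of all positives captured by the top decile
share_captured <- pos_in_top / n_pos

round(top_decile_lift, 3)  # 2.027
round(share_captured, 4)   # 0.2027
```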

### Classification accuracy

Let's convert the predicted probabilities to class labels:

In [None]:
pred1_train_f <- factor(ifelse(pred1 > 0.5, "M", "F"))

And create a confusion matrix:

In [None]:
cm1_train <- table(Predicted = pred1_train_f, Actual = y_train) %>% caret::confusionMatrix(positive = "M") # caret expects predictions in rows
cm1_train

Classification accuracy is 0.9067

## Predictive power

Let's predict the probability values for the test dataset (`type = "response"` is needed here, otherwise `predict` returns log odds):

In [None]:
pred1_test <- predict(fit1,
                      newdata = crabs_test,
                      type = "response")

And convert them to class labels:

In [None]:
predclass1_test <- ifelse(pred1_test > 0.5, "M", "F") %>% factor()

Create the confusion matrix:

In [None]:
cm1_test <- table(Predicted = predclass1_test, Actual = y_test) %>% caret::confusionMatrix(positive = "M") # caret expects predictions in rows
cm1_test

We have an accuracy of 0.94

## Improve model performance

### Principal Components Analysis (PCA)

Let's look at the relationships among numeric variables again:

In [None]:
crabs_dt %>% purrr::keep(is.numeric) %>% cor() %>%
    corrplot::corrplot.mixed(upper = "ellipse",
                             lower = "number",
                             tl.pos = "lt",
                             number.cex = .5,
                             lower.col = "black",
                             tl.cex = 0.7)

In [None]:
crabs_dt %>% purrr::keep(is.numeric) %>% GGally::ggpairs() %>% ggplotly()

What stands out is the very high correlation between the attributes.

This is not surprising as they are all measurements related to body
size.

However, such strong multicollinearity makes the coefficient estimates unstable and hard to interpret.

What to do?

One solution is to use the principal components.

#### PCA with base-r

The prcomp function calculates the principal components:

In [None]:
pca <- crabs_dt %>%
    purrr::keep(is.numeric) %>%
    prcomp(center = TRUE, # subtract the column means
           scale. = TRUE) # divide by the column standard deviations

PCA components are linear combinations of the original features.
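
To make the "linear combination" concrete, the component scores can be rebuilt by multiplying the standardized data by the rotation (loading) matrix. This is a minimal sketch on the five numeric `crabs` columns; `X`, `pca_demo`, and `scores_manual` are hypothetical names.

```r
# Load the crab data shipped with the MASS package
data("crabs", package = "MASS")

# Numeric measurement columns as a matrix
X <- as.matrix(crabs[, c("FL", "RW", "CL", "CW", "BD")])

# PCA on centered and scaled data
pca_demo <- prcomp(X, center = TRUE, scale. = TRUE)

# Each score is a weighted sum of the standardized features,
# with the weights given by the columns of the rotation matrix
scores_manual <- scale(X) %*% pca_demo$rotation

# The manual scores should match the scores stored by prcomp
all.equal(scores_manual, pca_demo$x, check.attributes = FALSE)
```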

In [None]:
pca

Let’s look at the PCA coefficient weights:

In [None]:
pca$rotation %>% round(3)

Check whether the components are orthogonal (uncorrelated):

In [None]:
cor(pca$x) %>% round(2)

Yes, cross correlations are near zero

##### Proportion of variance explained

The proportion of variation explained by each principal component can be viewed using the summary function:

In [None]:
summary(pca)

The first component explains 95.8% of the variation in the feature data, and the second component 3%.

Since these two components account for over 98% of the variation in the data, we will use them as our independent variables in the logistic regression model.

To do this we create a new R object called pca_dt:

In [None]:
pca_dt <- pca$x[,c("PC1", "PC2")] %>% as.data.table()

#### PCA with psych package

Adapted from Lesmeister (2015), Chapter 9

To extract the components with the psych package, you will use the principal() function.

We will state that we do not want to rotate the components at this time.

In [None]:
?principal

In [None]:
pca2 <- crabs_dt %>%
    purrr::keep(is.numeric) %>%
    psych::principal(nfactors = 5, rotate = "none")

In [None]:
pca2

How many of the components should we take?

A good rule of thumb is to select enough components to account for at least 70 percent of the total variance, which means that the cumulative variance explained by the selected components together is at least 70 percent of the variance explained by all the components.
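
This rule can be applied directly to the eigenvalues of the correlation matrix, which are the variances of the components when the data are standardized. The sketch below assumes the five numeric `crabs` columns; `eig`, `cum_share`, and `n_keep` are hypothetical names.

```r
# Load the crab data shipped with the MASS package
data("crabs", package = "MASS")
X <- crabs[, c("FL", "RW", "CL", "CW", "BD")]

# Eigenvalues of the correlation matrix, returned in decreasing order
eig <- eigen(cor(X))$values

# Cumulative share of total variance explained by the first k components
cum_share <- cumsum(eig) / sum(eig)
round(cum_share, 3)

# Smallest number of components reaching the 70 percent threshold
n_keep <- which(cum_share >= 0.70)[1]
n_keep

# Number of components with an eigenvalue above one
sum(eig > 1)
```

The two criteria need not agree with a visual reading of the scree plot, so they are best treated as guides rather than hard rules.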

A visual technique is to do a scree plot.

A scree plot can aid you in assessing the components that explain the most variance in the data.

It shows the Component number on the x axis and their associated Eigenvalues on the y axis.

In [None]:
data.table(component = seq_along(pca2$values), eigenvalue = pca2$values) %>% # one row per component
    ggplot(aes(x = component, y = eigenvalue)) +
        geom_line() +
        xlab("Component") +
        ylab("Eigenvalues")

What you are looking for in a scree plot is the components with eigenvalues greater than one, and the point where the additional variance explained does not differ greatly from one component to the next.

In other words, it is the break point where the plot flattens out. In this case, two components look pretty compelling.

Now let's rotate the two components:

```
rotate	
"none", "varimax", "quartimax", "promax", "oblimin", "simplimax", and "cluster" are possible rotations/transformations of the solution. See fa for all rotations avaiable.
```

In [None]:
?principal

In [None]:
pca3 <- crabs_dt %>%
    purrr::keep(is.numeric) %>%
    psych::principal(nfactors = 2, rotate = "simplimax")

In [None]:
pca3

And the component scores can be viewed with:

In [None]:
pca3$scores

### Fit model with PCs

We fit the model using the training data

In [None]:
pca_train <- pca_dt[train]

In [None]:
fit2 <- glm(y_train ~ .,
           data = pca_train,
           family = "binomial")

See the coefficients:

In [None]:
fit2$coefficients %>% round(3)

It appears that both PC1 and PC2 influence crab sex positively.

For a one unit increase in PC1, the log odds of being a male crab (versus female) increases by 0.328.

However, notice that for PC2 the effect size of 21.217 is many times larger than that of PC1.

Such a large difference cannot be ignored.

Let’s take a look at the statistics, in terms of the confidence interval of the estimates:

In [None]:
confint.default(fit2)

In [None]:
summary(fit2)$coefficients %>% round(2)

Intercept and PC1 are statistically not significant

### Model deviance

One way to evaluate the overall performance of the model is to look at the null deviance and the residual deviance.

Null deviance indicates how well the class is predicted by a model with nothing but the intercept

We would expect such a model to be a poor classifier

In [None]:
fit2$null.deviance %>% round(2)

In [None]:
fit2$deviance %>% round(2)

Despite PC1 not being statistically significant, the residual deviance of fit2 at 31.14 is considerably lower than the null deviance.

Adding in our predictors decreased the deviance by just over 176 points

You may also have noticed that fit2 deviance is lower than for fit1

This indicates fit2 has a smaller prediction error.

In [None]:
fit1$deviance %>% round(2)
fit2$deviance %>% round(2)

### Classification accuracy

Let's see how well the model fits the train data.

Get predictions for probabilities:

In [None]:
pred2 <- predict(fit2, type = "response")

And convert to class labels:

In [None]:
predclass2 <- ifelse(pred2 > 0.5, "M", "F") %>% factor()

Get the confusion matrix:

In [None]:
cm2_train <- table(Predicted = predclass2, Actual = y_train) %>% caret::confusionMatrix(positive = "M") # caret expects predictions in rows
cm2_train

In [None]:
cm1_train$overall["Accuracy"] %>% round(3)
cm2_train$overall["Accuracy"] %>% round(3)

Classification accuracy is better than for the first model.

View the ROC curve:

In [None]:
p2 <- data.table(D = y_train, M = pred2) %>%
ggplot(aes(m = M, d = D)) +
    plotROC::geom_roc() +
    plotROC::style_roc(theme = theme_grey)

plotROC::export_interactive_roc(p2) %>% IRdisplay::display_html()

And AUC:

In [None]:
pROC::auc(y_train, pred2)

Compare with the previous model:

In [None]:
pROC::auc(y_train, pred1)

Slightly better

### Predictive power

Now its predictive power on test data:

In [None]:
pca_test <- pca_dt[-train]

In [None]:
pred2_test <- predict(fit2,
                     newdata = pca_test,
                     type = "response")

In [None]:
predclass2_test <- ifelse(pred2_test > 0.5, "M", "F") %>% factor()

In [None]:
cm2_test <- table(Predicted = predclass2_test, Actual = y_test) %>% caret::confusionMatrix(positive = "M") # caret expects predictions in rows
cm2_test

In [None]:
cm1_test$overall["Accuracy"] %>% round(3)
cm2_test$overall["Accuracy"] %>% round(3)

The accuracy of predictions is also better for the second model.

### Drop insignificant variables

PC1 was insignificant.

Now let's only take PC2 as a predictor:

In [None]:
fit3 <- glm(y_train ~ PC2,
           data = pca_train,
           family = "binomial")

In [None]:
fit3

Get the predicted probabilities:

In [None]:
pred3 <- predict(fit3, type = "response")

And factorize them:

In [None]:
predclass3 <- ifelse(pred3 > 0.5, "M", "F") %>% factor()

And the confusion matrix:

In [None]:
cm3_train <- table(Predicted = predclass3, Actual = y_train) %>% caret::confusionMatrix(positive = "M") # caret expects predictions in rows
cm3_train

Compare accuracies:

In [None]:
cm2_train$overall["Accuracy"] %>% round(3)
cm3_train$overall["Accuracy"] %>% round(3)

Predict on test set:

In [None]:
pred3_test <- predict(fit3,
                     newdata = pca_test,
                     type = "response")

Get class labels:

In [None]:
predclass3_test <- ifelse(pred3_test > 0.5, "M", "F") %>% factor()

And confusion matrix:

In [None]:
cm3_test <- table(Predicted = predclass3_test, Actual = y_test) %>% caret::confusionMatrix(positive = "M") # caret expects predictions in rows
cm3_test

Compare accuracies:

In [None]:
cm2_test$overall["Accuracy"] %>% round(3)
cm3_test$overall["Accuracy"] %>% round(3)

Accuracy is better than for model 2, and there is only one misclassification.
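
The principal components used above were computed from all 200 rows before the split. A variant worth sketching, assuming one wants the test rows to play no part in the transformation, is to fit `prcomp` on the training rows only and project the test rows with `predict`. All names below (`train_idx`, `pca_tr`, `pcs_train`, `pcs_test`, `fit_tr`) are hypothetical, and the resulting accuracy may differ from the figures reported above.

```r
# Load the crab data shipped with the MASS package
data("crabs", package = "MASS")
num_cols <- c("FL", "RW", "CL", "CW", "BD")

# Train/test split (the sampled rows depend on the R version's sampler)
set.seed(2018)
train_idx <- sample(nrow(crabs), 150)

# Fit the PCA on the training rows only
pca_tr <- prcomp(crabs[train_idx, num_cols], center = TRUE, scale. = TRUE)

# Training scores, plus the class label for modelling
pcs_train <- as.data.frame(pca_tr$x[, c("PC1", "PC2")])
pcs_train$sex <- crabs$sex[train_idx]

# Project the test rows using the training centering, scaling, and rotation
pcs_test <- as.data.frame(
  predict(pca_tr, newdata = crabs[-train_idx, num_cols])[, c("PC1", "PC2")]
)

# Logistic regression on the training components, then test probabilities
fit_tr <- glm(sex ~ PC1 + PC2, data = pcs_train, family = binomial)
prob_test <- predict(fit_tr, newdata = pcs_test, type = "response")

# Test accuracy at a 0.5 cutoff
mean(ifelse(prob_test > 0.5, "M", "F") == crabs$sex[-train_idx])
```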