In [6]:
# Run this first so it's ready by the time you need it
install.packages("readr")
install.packages("dplyr")
install.packages("supernova")
install.packages("ggformula")
library(readr)
library(dplyr)
library(supernova)
library(ggformula)




Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: ggplot2

Loading required package: ggstance


Attaching package: ‘ggstance’


The following objects are masked from ‘package:ggplot2’:

    geom_errorbarh, GeomErrorbarh


Loading required package: scales


Attaching package: ‘scales’


The following object is masked from ‘package:supernova’:

    number


The following object is masked from ‘package:readr’:

    col_factor


Loading required package: ggridges


New to ggformula?  Try the tutorials: 
	learnr::run_tutorial("introduction", package = "ggformula")
	learnr::run_tutorial("refining", package = "ggformula")

Rows: 2348 Columns: 45
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","

# Chapter 15 - Models with Categorical Outcomes

Thus far in the course, we've explored a great many types of models that the general linear model framework can handle. This gives us a large amount of flexibility in the types of research questions we can answer with statistics. Hopefully you are beginning to appreciate how understanding the fundamentals of the general linear model can open a great many avenues of inquiry for you. The general linear model is not the only way to do statistics (there are more advanced statistics classes you can take that will teach you different approaches for specific problems), but it is an extremely powerful one to start with.

The last major class of general linear model that we will learn in this course addresses one thing we've been leaving out so far. Across all the regressions, interactions, and nonparametric tests, with all sorts of predictors and relationships between predictors, one thing has been constant: we've always been predicting a *continuous* outcome variable. Of course, there are research questions concerning categorical outcome variables as well, such as what political party someone is likely to join or whether or not someone is admitted to college. These sorts of models are called **logistic regression**, and we will cover them here. 

## 15.1 Non-continuous outcomes

Standard regression models are linear combinations of parameters estimated from the data. Multiplying these parameters by different values of the predictor variables (or by each other) gives estimates of the outcome.

However, because there’s no hard limit on the range of predictor variables (at least, no limit coded into the model itself), the predictions of a linear model can in theory range between -∞ (negative infinity) and +∞. Although values approaching infinity might be very unlikely, there is no hard limit on either the parameters we fit (the regression coefficients) or the predictor values themselves.

When outcome data are continuous or somewhat like a continuous variable this isn’t usually a problem. Although our models might predict some improbable values (for example, that someone is 8 feet tall), they will not often be strictly impossible.

It might occur to you at this point that, if a model predicted a height of -8 feet or a temperature below absolute zero, then this *would* be impossible. This is true, and it is a theoretical violation of the linear model's assumption that the outcome can range between -∞ and +∞ (we'll talk more about the assumptions the GLM makes, and what violates them, in chapter 20). Despite this, researchers use linear regression to predict many outcomes which have this type of range restriction and, although models can make strange predictions in edge cases, they are useful and can make good predictions most of the time.

However, for other types of outcomes like categories, this often won’t be the case. Standard linear regression will fail to make sensible predictions even in cases that are not unusual.

For binary data we want to predict the probability of a positive response, and this can range only between 0 and 1. For these data, the lack of constraint on predictions from linear regression is a problem.

Let's see a demonstration of this. We have some medical data about patients with kidney disease from [this dataset](https://matthew-brett.github.io/cfd2020/data/chronic_kidney_disease.html). 
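
As a minimal sketch of the problem, here is a made-up toy example (the data frame `toy` and its variables `x` and `y` are hypothetical and are not taken from the kidney dataset). It fits an ordinary linear model to a 0/1 outcome and shows that the predictions can fall outside the range a probability is allowed to take:

```r
# Toy data (assumed for illustration): a binary outcome that switches from 0 to 1 as x increases
toy <- data.frame(x = seq(-5, 5, by = 0.5))
toy$y <- as.numeric(toy$x > 0)

# Fit an ordinary linear regression to the 0/1 outcome
linear_fit <- lm(y ~ x, data = toy)

# Predict at the low end, the middle, and the high end of x
preds <- predict(linear_fit, newdata = data.frame(x = c(-5, 0, 5)))
preds

# The lowest prediction is below 0 and the highest is above 1
range(preds)
```

Neither a negative value nor a value above 1 can be read as a probability, even though both `x` values are well within the observed data.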

## 15.2 Link functions

One of the most common cases where this occurs is when you're trying to predict a binary outcome, yes or no. We already learned in simple regression how expressing this as a dummy-coded number, 1 or 0, lets us do math with it. That is also why we choose to express categorical outcome variables with dummy codes.

But hang on. When a dummy variable is an outcome, no other option outside the range of 0 to 1 is possible, and no other option *between* these values is possible either. This poses a problem for predictions based off a continuous regression line like in linear regression, where all possible inputs can be operated on to produce theoretically all possible output values. 

The answer to this in the GLM framework is to still make continuous predictions, but then translate those predictions into being either a 1 or 0 final answer. The mathematical tool for doing this kind of translation is called a **link function**. A link function, like the name suggests, is a separate mathematical function that *links* the output of a linear model to a corresponding value that makes sense in terms of the actual outcome variable. 

Simple linear regression actually has a link function too. It's just a trivial one, called the **identity function** - the linear model's prediction is mapped to an outcome value by its identity, its already-existing value. For other types of model different functions are used.

## 15.3 Log odds

The link function in logistic regression is called the **logit function** (hence the name logistic regression). The logit function expresses category membership in terms of the *log odds* of being in that category. If we're predicting whether or not someone is admitted to college, their *likelihood* of admittance is what the prediction means.

### Log odds conversion - step 1
The first step in converting a binary categorical variable into log odds is to consider the probability of someone being in one category versus the other. Recall our discussion of probabilities back in chapter 8. Under a probability model, we can't determine the value of any one datapoint, but we know over the whole population of datapoints how many end up in one category versus another. We know the *probability* of category membership. When 9% of students are admitted to Pomona and 91% are not, there is a 9% likelihood that any one person will end up in the category "admitted." 

Probabilities thus allow us to express a binary outcome - admitted or not - as a probability of category membership. Thus, we can now express this variable as a continuous variable between 0 and 1. True data will always have a 0 or 1 value - when someone has *already* been admitted to college or rejected, the probability of their admittance is 100% or 0%. But when making predictions about new data points, their predicted probability can be anything between 0 and 1. Predictors in a logistic model help us make guesses about these probabilities.

### Log odds conversion - step 2
We now have a binary variable expressed as a continuous variable, but it is still bounded between 0 and 1. We can transform a probability on the 0—1 scale to a 0 → ∞ scale by converting it to **odds**, which are expressed as a ratio:

$$odds = \frac{p}{1-p} $$

Probabilities and odds are two equivalent ways of expressing the same idea. So a probability of P(X) = 0.5 equates to odds of 1 (i.e. 1 to 1 or 1/1, which is what 0.5/(1-0.5) reduces to). P(X) = 0.6 equates to odds of 1.5 (that is, 1.5 to 1, or 3 to 2). And P(X) = 0.95 equates to odds of 19 (19 to 1).

Odds convert or map probabilities from 0 to 1 onto the real numbers from 0 to ∞.

<img src="images/ch15-probtoodds.png" width="500">

We can reverse the transformation (which is important later) like so:

$$probability = \frac{odds}{1 + odds} $$
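
The two conversion formulas can be sketched in R as a pair of small helper functions. The names `prob_to_odds()` and `odds_to_prob()` are made up for this illustration and are not built-in functions:

```r
# Hypothetical helper functions implementing the two formulas
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

# The three probabilities used as examples in the text
p <- c(0.5, 0.6, 0.95)

odds <- prob_to_odds(p)
odds                 # approximately 1, 1.5, and 19

# Converting back recovers the original probabilities
odds_to_prob(odds)
```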

### Log odds conversion - step 3
When we convert a probability to odds, the odds can go up to infinity, but will never be less than 0. This is still a problem for our linear model. We’d like our regression coefficients to be able to vary between -∞ and ∞.

To avoid this restriction, we can take the *logarithm* of the odds — sometimes called the **logit** for conciseness. The figure below shows the transformation of probabilities between 0 and 1 to the log-odds scale. 

<img src="images/ch15-probtologit.png" width="500">

The logit has two nice properties:

- It converts odds of less than one to negative numbers, because the log of a number between 0 and 1 is always negative.
- It flattens the rather square curve for the odds in the figure above, giving you more interpretability among different logit values. 

By doing these mathematical conversion steps, we can now express a binary 0-or-1 variable as a continuous variable, possibly ranging from -∞ to ∞.
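
Here is a short sketch of the full conversion in R. The function `logit()` is written by hand for illustration; base R also provides `qlogis()` for the logit and `plogis()` for its inverse, which should give matching results:

```r
# Hand-written logit: the log of the odds
logit <- function(p) log(p / (1 - p))

p <- c(0.05, 0.25, 0.5, 0.75, 0.95)

logit(p)           # negative below p = 0.5, zero at 0.5, positive above
qlogis(p)          # base R's built-in logit

# The inverse logit maps log-odds back onto the 0-to-1 probability scale
plogis(logit(p))
```

Probabilities that are equally far from 0.5 (such as 0.25 and 0.75) get logits of the same size with opposite signs.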


## 15.4 Interpreting a logistic model

### Interpreting predictions
As we’ve seen here, the logit or logistic link function transforms probabilities between 0 and 1 to the range from negative to positive infinity. This means logistic regression coefficients are in log-odds units, so we must interpret logistic regression coefficients differently from regular regression with continuous outcomes. Consider the equation form of a simple logistic regression model:

$$logit(Y_i) = b_0 + b_1X_i + e_i$$

The right side of the equation, the actual model side, is built the same between a linear model and a logistic model. We still combine the values of predictors with coefficients in order to make predictions. However, in linear regression, a coefficient like b<sub>1</sub> means the change in the outcome (expressed with the outcome's units) for a unit change in the predictor.

For logistic regression, the same coefficient means the change in the *log odds* of the outcome being 1, for a unit change in the predictor.

If we want to interpret logistic regression in terms of probabilities, we need to undo the transformations described in steps 3 and 2, in that order. To do this:

- 1) We take the exponent of the logit to ‘undo’ the log transformation and get the predicted odds. Taking the exponent means calculating e<sup>y</sup>, where e is the special mathematical value 2.71828... and y is the logit. 

- 2) We convert the odds back to probability: prob = odds / (1 + odds) 

Here's a hypothetical example to walk through how to do this. Imagine if we have a model to predict whether a person has any children. The outcome is binary, so equals 1 if the person has any children, and 0 otherwise.

The model has an intercept and one predictor, age in years:

$$logit(children_i) = b_0 + b_1age_i + e_i$$

We fit this model and get two parameter estimates: b<sub>0</sub> = 0.5 and b<sub>1</sub> = 0.02.  

The outcome of the linear model is the log-odds of having any children. So to compute this for any particular person, we simply input their age as the value of the predictor. For someone aged 30, the predicted log-odds are:

$$0.5 + 0.02*30 = 1.1$$

In order to understand what that prediction means in terms of probabilities, we first take the exponent to find odds:

$$odds = e^{1.1} = 3.004166$$
 
That suggests there are 3 to 1 odds that a person aged 30 will have children. Lastly, we can use the conversion equation above to find probability:

$$probability = \frac{odds}{1 + odds} = \frac{3.004166}{1 + 3.004166} = 0.7502601$$
 
Thus, given our logistic regression model, we would predict that by taking a random person off the street who is 30 years old, not knowing anything else about them, there is a 75% chance that they have children. 
 
For someone aged 40:

$$0.5 + 0.02*40 = 1.3 $$

$$odds = e^{1.3} = 3.669297$$

$$probability = \frac{3.669297}{1 + 3.669297} = 0.785835$$

and so on.
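
The same calculation can be sketched in R for any age. The function name `predict_prob()` is made up for this example, and the coefficients are the hypothetical estimates from above, not values fit to real data:

```r
# Hypothetical coefficient estimates from the worked example
b0 <- 0.5
b1 <- 0.02

# Predicted probability of having children at a given age
predict_prob <- function(age) {
  log_odds <- b0 + b1 * age   # linear model prediction, on the logit scale
  odds <- exp(log_odds)       # undo the log to get odds
  odds / (1 + odds)           # convert odds to probability
}

predict_prob(c(30, 40))       # about 0.750 and 0.786

# Base R's inverse logit does both conversion steps at once
plogis(b0 + b1 * c(30, 40))
```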

### Interpreting coefficients
That's how to understand the predictions of a logistic model, the log odds of category membership. But what does that b<sub>1</sub> = 0.02 value mean for how the prediction *changes*, for each one-unit change in the predictor? It would be the change in the log odds of someone having children, for a one-year increase in age. You could leave it at that, but that's hard to wrap one's head around. 

If we take the exponent of 0.02, would that tell us how much odds are changing by? 

$$e^{0.02} = 1.020201$$

Unfortunately, no. If we interpreted things this way, it would imply that a 31-year-old has 1.02 higher odds of having kids compared to a 30-year-old, i.e., 4.02 to 1 compared with 3 to 1. However, solving for the odds of a 31-year-old tells us that's not the case: 

$$0.5 + 0.02*31 = 1.12 $$

$$odds = e^{1.12} = 3.064854$$

Instead, an important feature of logs to know about is that subtracting the logs of two numbers is the same thing as taking the log of those numbers' ratio:

$$log(3.064854) - log(3.004166) = 0.02$$

$$log(\frac{3.064854}{3.004166}) = 0.02$$

You can verify this in the code window below: 

In [1]:
log(3.064854) - log(3.004166)
log(3.064854/3.004166)

Thus, we can interpret the exponent of the coefficient b<sub>1</sub> = 0.02 as the *ratio* of odds for a one-unit change in the predictor. In other words, e<sup>0.02</sup> = 1.02, so the odds of having kids are multiplied by 1.02 for each 1-year increase in someone's age. 

To take exponents in R, you can use the ```exp()``` function:

In [2]:
exp(0.02)

Note that this is not the same as saying the *probability* of having kids is 1.02x greater for each 1-year increase in age. Probability is odds/(1+odds), so the probability that a 31-year-old has kids is:

In [3]:
#probability of a 31-year-old having kids
3.064854 / (1 + 3.064854)

That's not the same thing as 1.02 times 0.7502601 (the probability of a 30-year-old having kids):

In [5]:
1.02*0.7502601

In logistic regression, you have to be very careful with how you talk about the interpretations. 0.02 is the change in log odds for every one-unit increase in the predictor. e<sup>0.02</sup> is the multiplier of odds for every one-unit increase in the predictor. 
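
To see the distinction more generally, this sketch (again using the hypothetical coefficients b<sub>0</sub> = 0.5 and b<sub>1</sub> = 0.02) compares a one-year increase at several starting ages. The ratio of odds should be the same every time, while the ratio of probabilities depends on where you start:

```r
# Hypothetical coefficient estimates from the worked example
b0 <- 0.5
b1 <- 0.02
ages <- c(20, 30, 60)

# Odds at each age, and at one year older
odds_now   <- exp(b0 + b1 * ages)
odds_later <- exp(b0 + b1 * (ages + 1))
odds_later / odds_now      # the same value, exp(0.02), at every starting age

# Probabilities at each age, and at one year older
prob_now   <- odds_now / (1 + odds_now)
prob_later <- odds_later / (1 + odds_later)
prob_later / prob_now      # a different value at each starting age
```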

That's the case for a simple logistic regression. In a multiple logistic regression, there are multiple predictors. For example:

$$logit(children_i) = b_0 + b_1age_i + b_2married_i + e_i$$

This would be a model predicting whether someone has any children, both from how old they are and whether or not they're married. 

Let's say when fitting this model, we estimate b<sub>0</sub> = 0.1, b<sub>1</sub> = 0.01, and b<sub>2</sub> = 1.3. The coefficients still speak to the change in predicted log-odds for every one-unit increase in the predictor, but we have to interpret that in the context of the other variable again. In this case, b<sub>1</sub> means the predicted change in log-odds (or multiplier of odds) of having children for every additional year in age, when holding marriage status constant. b<sub>2</sub> means the predicted change in log-odds (or multiplier of odds) of having children for someone who is married compared to someone who is not, when age is held constant.
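
As a rough sketch of what "holding the other variable constant" means, the code below plugs the hypothetical estimates into the model. The function `log_odds()` is made up for this illustration, and `married` is assumed to be dummy coded as 1 for married and 0 for not married:

```r
# Hypothetical coefficient estimates from the text
b0 <- 0.1
b1 <- 0.01
b2 <- 1.3

# Predicted log-odds for a given age and marriage status (assumed coding: 1 = married, 0 = not)
log_odds <- function(age, married) b0 + b1 * age + b2 * married

# Odds multiplier for married vs. not married at the same age
exp(b2)

# The ratio of odds is the same whichever age we hold constant
exp(log_odds(30, 1)) / exp(log_odds(30, 0))
exp(log_odds(50, 1)) / exp(log_odds(50, 0))

# Predicted probabilities for a 30-year-old: not married, then married
plogis(log_odds(30, c(0, 1)))
```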

Extending to the case of a logistic regression with an interaction: 

$$logit(children_i) = b_0 + b_1age_i + b_2married_i + b_3age_i*married_i + e_i$$

Make sure to interpret b<sub>3</sub> appropriately as an interaction coefficient, but in terms of log-odds of the outcome variable. Here, we might say it's the change in difference of log odds between people who are married and people who are not, for every year increase in their age. 

## 15.5 Fitting logistic regression models

In R, the function ```lm()``` finds the best-fitting coefficients for the model in order to minimize error between predictions and true outcomes. However, the fact that real outcome data only have two unique values (0 and 1) makes it really hard to do this fitting process, when the predictions could be any real number value. If you convert a binary categorical variable into 0s and 1s and include that as the outcome in a linear model, the function will *run*, and give you an answer, but because the fitting process was flawed those coefficient estimates will be very inaccurate for making guesses about new data. 

To deal with this, we use a separate function: ```glm()```. This stands for ***generalized* linear model**. This function can fit any linear model, but will *generalize* to outcome variables of types other than continuous if you tell it to. In order to be sensitive to different data types, it takes an additional argument ```family =```. In the case of logistic regression, this argument should be set to ```family = binomial```, indicating that we're predicting binary data. 

In fact, we could use glm() to run the linear models we were building before. For that, we would set ```family = gaussian``` to tell it we're predicting gaussian-distributed data (another name for the continuous normal distribution). 

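The upside of this function is that it can handle outcome data on many sorts of distributions. The downside is that it needs a binary outcome in a form it can compute with: a character variable with values like "Legal" and "Not legal" has to be recoded as a 0/1 dummy variable first, and we have to do that manually. (Categorical *predictors* are still converted to dummy variables automatically, just as in ```lm()```.) 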
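
As a quick check of the claim that ```glm()``` with ```family = gaussian``` reproduces an ordinary linear model, here is a sketch using `mtcars`, a dataset that ships with R. It is used here only as a stand-in and is not part of this chapter's data:

```r
# Fit the same model two ways on a built-in dataset
lm_fit  <- lm(mpg ~ wt, data = mtcars)
glm_fit <- glm(mpg ~ wt, data = mtcars, family = gaussian)

# The coefficient estimates should agree
coef(lm_fit)
coef(glm_fit)
all.equal(coef(lm_fit), coef(glm_fit))
```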

Let's use the General Social Survey again to make predictions about how likely people are to support or oppose marijuana legalization, depending on their religious affiliation. In these data, marijuana legalization attitude is in the variable ```should_marijuana_be_made_legal```. Let's investigate the data type and the possible values:

In [8]:
str(GSS$should_marijuana_be_made_legal)
table(GSS$should_marijuana_be_made_legal)

 chr [1:2348] NA "Not legal" "Legal" "Not legal" "Not legal" NA "Legal" ...



    Legal Not legal 
      938       509 

It is a character datatype with two possible values, "Legal" and "Not legal". It is binary, which means we can use logistic regression to predict someone's likelihood of supporting marijuana legalization or not. However, we first need to recode it as a numeric dummy variable:

In [10]:
#resetting values
GSS$marijuana_dummy <- recode(GSS$should_marijuana_be_made_legal, 
                             "Legal" = "1",
                             "Not legal" = "0")
GSS$marijuana_dummy <- as.numeric(GSS$marijuana_dummy)

#confirming we did it right
str(GSS$marijuana_dummy)

 num [1:2348] NA 0 1 0 0 NA 1 1 1 1 ...


Our predictor of interest is in the variable ```rs_religious_preference```:

In [11]:
str(GSS$rs_religious_preference)
table(GSS$rs_religious_preference)

 chr [1:2348] "Christian" "Catholic" "None" "Protestant" "Catholic" ...



               Buddhism                Catholic               Christian 
                     19                     493                      30 
               Hinduism Inter-nondenominational                  Jewish 
                      8                       1                      39 
           Moslem/islam         Native american                    None 
                     16                       1                     542 
     Orthodox-christian                   Other           Other eastern 
                      6                      32                       1 
             Protestant 
                   1139 

This is a categorical variable with many levels, but some levels are very rare in the dataset. It will be hard to fit accurate coefficients for these levels, so let's only retain data for religions that have at least ten people representing them: 

In [13]:
included_religions <- c("Buddhism", "Catholic", "Christian", "Jewish", "Moslem/islam", "None", 
                       "Other", "Protestant")
GSS_subset <- filter(GSS, rs_religious_preference %in% included_religions)

#confirming we did it right
table(GSS_subset$rs_religious_preference)


    Buddhism     Catholic    Christian       Jewish Moslem/islam         None 
          19          493           30           39           16          542 
       Other   Protestant 
          32         1139 

Now we will fit our logistic regression model. Because ```rs_religious_preference``` is a many-leveled categorical variable there will be many coefficients fit in the model representing whether or not someone is in a specific religious category, but the formula for fitting it is just ```marijuana_dummy ~ rs_religious_preference```. 

In [15]:
#fitting the logistic regression with glm(), including family argument for binary data
logistic_model <- glm(marijuana_dummy ~ rs_religious_preference, data = GSS_subset, family = binomial)
logistic_model


Call:  glm(formula = marijuana_dummy ~ rs_religious_preference, family = binomial, 
    data = GSS_subset)

Coefficients:
                        (Intercept)      rs_religious_preferenceCatholic  
                              2.565                               -2.247  
   rs_religious_preferenceChristian        rs_religious_preferenceJewish  
                             -1.649                               -1.738  
rs_religious_preferenceMoslem/islam          rs_religious_preferenceNone  
                             -2.565                               -1.134  
       rs_religious_preferenceOther    rs_religious_preferenceProtestant  
                             -1.118                               -2.196  

Degrees of Freedom: 1429 Total (i.e. Null);  1422 Residual
  (880 observations deleted due to missingness)
Null Deviance:	    1854 
Residual Deviance: 1786 	AIC: 1802

And that's that! The intercept is the log-odds of supporting marijuana legalization for someone in the reference group, which is whatever religion is missing from this list of coefficients (in this case, Buddhism). Each of the other coefficients is the difference in log-odds of supporting marijuana legalization between that particular religious group and the reference group. 

If you wanted to make predictions about the likelihood of people's marijuana legalization support given their religion, you could simply plug in those coefficient estimates to the regression equation to get predicted log-odds, and then convert to probabilities. For example, for someone who is Buddhist, 

$$\hat{logit(Y)} = 2.565$$

Because they are the reference group, every predictor X has a value of 0 and those components of the model drop out. Log-odds of 2.565 means:

$$e^{2.565} = 13.00066$$

$$\frac{13.00066}{1 + 13.00066} = 0.9285748$$

There is about a 93% chance someone who is Buddhist will support marijuana legalization. 

All the other coefficients being negative means that all other religious groups have a lower chance of supporting marijuana legalization. For instance, for someone who is Catholic: 

$$\hat{logit(Y)} = 2.565 - 2.247 = 0.318$$

$$e^{0.318} = 1.374376$$

$$\frac{1.374376}{1 + 1.374376} = 0.5788367$$

An individual Catholic person is only 58% likely to support marijuana legalization. 
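
The same conversion can be sketched for every group at once. To keep this example self-contained, the coefficients are typed in by hand from the rounded model output above rather than extracted from `logistic_model`, so the results are approximate. The names `intercept`, `offsets`, and `probs` are made up for this illustration:

```r
# Rounded coefficients copied by hand from the printed model output
intercept <- 2.565
offsets <- c(Buddhism = 0, Catholic = -2.247, Christian = -1.649, Jewish = -1.738,
             "Moslem/islam" = -2.565, None = -1.134, Other = -1.118, Protestant = -2.196)

# Add each group's coefficient to the intercept, then convert log-odds to probability
probs <- plogis(intercept + offsets)
round(probs, 3)
```

With the fitted model object available, the hand-typed numbers could be replaced by the values returned from `coef(logistic_model)`.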

As with linear models, we can also make predictions for lots of datapoints at once. One twist though is that we need to use the special function ```predict.glm()``` to go along with the glm model object we made. In addition, we have to choose whether to make predictions in probability units of the response (i.e. probability of supporting marijuana legalization), or predictions of the transformed response (logit) that is actually the outcome in a logistic model.

By default ```predict.glm()``` will make predictions in terms of log-odds. But it's much easier to understand the consequences of your model in terms of probabilities. For that sort of prediction, you need to add ```type="response"``` to the ```predict.glm()``` function call. Here we predict the chance of marijuana legalization support for the first ten people in the dataset:

In [17]:
predict.glm(logistic_model, GSS_subset[1:10,], type="response")

## 15.6 Error in logistic models

In linear models, we talked extensively 