# Modelling the Non-Agentivity Alternation

This notebook models the non-agentivity alternation as described in the doctoral thesis. Refer to the relevant chapter for the rationale behind the modelling choices.

## Preparations

### Importing Relevant Libraries

In [None]:
library(lme4)
library(glue)
library(performance)
library(rstudioapi)
library(broom)
library(scales)
library(sjPlot)
library(ggplot2)
library(dplyr)
library(stargazer)
library(broom.mixed)

### Loading and Preprocessing Data

Note that, as mentioned in the doctoral thesis, only data from the VACC corpus is used for modelling this alternation set.

In [None]:
data <- read.csv("VACC/NON-AGENTIVITY_for_modelling_VACC.csv")

Treatment coding of response variable: "man" = 0, "werden" = 1.

In [None]:
data$CURRENT <- as.factor(data$CURRENT)
cat("R models this level:", tail(levels(data$CURRENT), 1))
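To make the coding above concrete, here is a minimal sketch on a hypothetical toy vector (not the VACC data). It assumes only base R and shows that the first factor level acts as the reference (0) while `glm()` models the probability of the last level (1).

```r
# Toy response vector (hypothetical values, not the VACC data)
toy_current <- factor(c("man", "werden", "werden", "man", "werden"))

# Levels are sorted alphabetically, so "man" comes first and is coded 0
toy_levels <- levels(toy_current)

# An intercept-only binomial model estimates the log-odds of the second level
toy_fit <- glm(toy_current ~ 1, family = binomial)

# Back-transforming the intercept gives the share of "werden" in the toy vector
toy_prob <- unname(plogis(coef(toy_fit)[1]))
toy_prob
```

In this toy vector three of five observations are "werden", so the back-transformed intercept matches that proportion.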

Renaming speakers for consistency with the rest of the thesis.

In [None]:
data <- data %>% mutate(PREVIOUS_SPEAKER = recode(PREVIOUS_SPEAKER, "A" = "VA", "S" = "HS", "J" = "CF"))

Combining the variables PREVIOUS and PREVIOUS_SPEAKER into a single factor, because modelling them as an interaction of two independent variables leads to a singularity issue.

In [None]:
data$PREVIOUS_SPEAKER_COMBINED <- interaction(data$PREVIOUS, data$PREVIOUS_SPEAKER) #R automatically creates a factor
cat("Levels are:", levels(data$PREVIOUS_SPEAKER_COMBINED))

`interaction()` also creates levels for combinations that never occur in the data.

In [None]:
#discarding empty levels
data$PREVIOUS_SPEAKER_COMBINED <- droplevels(data$PREVIOUS_SPEAKER_COMBINED)

#Before renaming (see above), "werden.A" was the reference level as it was first in alphabetical order.
#To preserve this after renaming, the factor is relevelled.
data$PREVIOUS_SPEAKER_COMBINED <- relevel(data$PREVIOUS_SPEAKER_COMBINED, ref = "werden.VA")
cat("The reference level of PREVIOUS_SPEAKER_COMBINED is:", head(levels(data$PREVIOUS_SPEAKER_COMBINED), 1)) 
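The three steps above (combining, dropping empty levels, relevelling) can be sketched on a hypothetical toy data frame. The values below are invented for illustration and only base R is assumed.

```r
# Toy data (hypothetical): the combination "man" + "HS" never occurs
toy <- data.frame(
  PREVIOUS = c("man", "werden", "werden", "man"),
  PREVIOUS_SPEAKER = c("VA", "VA", "HS", "VA")
)

# interaction() builds every combination of the two variables,
# including the unobserved level "man.HS"
toy_combined <- interaction(toy$PREVIOUS, toy$PREVIOUS_SPEAKER)
levels(toy_combined)

# droplevels() removes levels without any observations
toy_dropped <- droplevels(toy_combined)

# relevel() moves the chosen level to the front, making it the reference level
toy_releveled <- relevel(toy_dropped, ref = "werden.VA")
levels(toy_releveled)
```

Dropping the empty levels before fitting avoids coefficients that cannot be estimated because no observation carries that level.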

Treatment coding of further predictor variables.

In [None]:
data$PREVIOUS <- as.factor(data$PREVIOUS)
data$PREVIOUS_BETA_MAN <- as.factor(data$PREVIOUS_BETA_MAN) 
data$PREVIOUS_BETA_WERDEN <- as.factor(data$PREVIOUS_BETA_WERDEN) 

## Simple Model

### Model Fitting

In [None]:
simple_model <- glm(CURRENT ~ PREVIOUS_SPEAKER_COMBINED,
                                data = data, family = 'binomial')

summary(simple_model)

### Model Evaluation

#### Events per Predictor

In [None]:
#Events per predictor
events <- min(table(data$CURRENT)) #Number of cases in the less frequent outcome class
predictors <- length(coef(simple_model)) - 1 #Exclude intercept
EPP <- events / predictors
print(EPP)
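As a worked example of the calculation above, the following sketch uses hypothetical counts. It assumes that events are counted in the less frequent outcome class, which is a common convention for events per predictor; in the toy data this class is "man". The coefficient count is also an invented value.

```r
# Toy response (hypothetical): 30 "man" and 70 "werden" observations
toy_current <- factor(c(rep("man", 30), rep("werden", 70)))

# Hypothetical model size: intercept plus three dummy-coded predictor levels
n_coefficients <- 4

# Events are taken from the less frequent outcome class
toy_events <- min(table(toy_current))

# The intercept is not counted as a predictor
toy_predictors <- n_coefficients - 1

toy_epp <- toy_events / toy_predictors
toy_epp
```

With 30 events and 3 predictors the toy ratio is 10.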

#### R²

In [None]:
r2_nagelkerke(simple_model)

#### Predictive Efficiency

In [None]:
#Predictive efficiency
fixed_model_predictions <- predict(simple_model, newdata = data, type = "response")
#Convert probabilities to binary outcomes (i.e., if probability > 0.5, predict "werden", else "man")
fixed_model_predicted_class <- ifelse(fixed_model_predictions > 0.5, "werden", "man")
#Compare predicted values to actual values
fixed_model_accuracy <- mean(fixed_model_predicted_class == data$CURRENT)
fixed_model_accuracy

In [None]:
#Calculate baseline accuracy, i.e., a dumb intercept-only model only ever predicting the most frequent outcome
counts <- table(data$CURRENT) 
dumb_model_accuracy <- max(counts) / sum(counts)
dumb_model_accuracy
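The comparison between model accuracy and baseline accuracy can be traced by hand on a small example. The outcomes and predicted probabilities below are hypothetical and stand in for the output of `predict(..., type = "response")`.

```r
# Toy observed outcomes and predicted probabilities of "werden" (hypothetical)
toy_actual <- factor(c("werden", "werden", "werden", "man",
                       "man", "werden", "man", "werden"))
toy_prob <- c(0.9, 0.7, 0.4, 0.2, 0.6, 0.8, 0.3, 0.55)

# Classify with the same 0.5 threshold as above
toy_pred <- ifelse(toy_prob > 0.5, "werden", "man")

# Share of correct predictions made by the toy model
toy_model_accuracy <- mean(toy_pred == toy_actual)

# Baseline: always predict the most frequent outcome
toy_counts <- table(toy_actual)
toy_baseline_accuracy <- as.numeric(max(toy_counts) / sum(toy_counts))

c(model = toy_model_accuracy, baseline = toy_baseline_accuracy)
```

Here the toy model classifies 6 of 8 observations correctly, while the baseline reaches 5 of 8 by always predicting "werden".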

In [None]:
#McNemar's Test for significance against baseline
#Create baseline predictions (always predict the most frequent outcome)
baseline_predicted_class <- rep(names(which.max(counts)), length(data$CURRENT))
#Create a contingency table: compare model and baseline predictions against actual values
#Fixing the levels keeps the table 2x2 even if one vector contains only TRUE or only FALSE
mcnemar_table <- table(
  model_correct = factor(fixed_model_predicted_class == data$CURRENT, levels = c(FALSE, TRUE)),
  baseline_correct = factor(baseline_predicted_class == data$CURRENT, levels = c(FALSE, TRUE)))
#Perform McNemar's Test
mcnemar_result <- mcnemar.test(mcnemar_table)
mcnemar_result
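McNemar's test only uses the discordant pairs, i.e., observations that one set of predictions gets right and the other gets wrong. The sketch below illustrates this on hypothetical correctness vectors; it relies on `mcnemar.test()` applying a continuity correction by default.

```r
# Hypothetical correctness of model and baseline for ten observations:
# 3 both correct, 5 only model correct, 1 only baseline correct, 1 both wrong
toy_model_correct    <- c(rep(TRUE, 8), rep(FALSE, 2))
toy_baseline_correct <- c(rep(TRUE, 3), rep(FALSE, 5), TRUE, FALSE)

# Explicit levels guarantee a 2x2 table
toy_table <- table(
  model_correct    = factor(toy_model_correct, levels = c(FALSE, TRUE)),
  baseline_correct = factor(toy_baseline_correct, levels = c(FALSE, TRUE)))

toy_result <- mcnemar.test(toy_table)

# Manual check with continuity correction: (|b - c| - 1)^2 / (b + c)
b <- toy_table["TRUE", "FALSE"]   # only the model is correct
c_ <- toy_table["FALSE", "TRUE"]  # only the baseline is correct
toy_manual <- (abs(b - c_) - 1)^2 / (b + c_)

c(test = unname(toy_result$statistic), manual = unname(toy_manual))
```

The concordant cells (both correct, both wrong) do not enter the statistic, which is why a large accuracy on easy cases alone does not make the test significant.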

### Visualisation

#### Coefficient Plot

In [None]:
model_summary <- tidy(simple_model, conf.int = TRUE) 

plot <- ggplot(model_summary, aes(x = estimate, y = term, color = p.value < 0.05)) +
          geom_point(size = 3) +
          geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
          geom_vline(xintercept = 0, linetype = "dashed") +
          labs(
            x = "Estimated Coefficient",
            y = "Predictors"
          ) +
          theme_minimal() +
          scale_color_manual(values = c("TRUE" = "black", "FALSE" = "gray")) +  #Grey out non-significant predictors
          coord_fixed(ratio = 0.5)

plot

#### Prediction Plot

The plot visualises the predicted probability of observing "werden" in CURRENT for each combination of PREVIOUS and PREVIOUS_SPEAKER.

In [None]:
#for "werden" in CURRENT
#theme_minimal() is a complete theme, so it is applied first; otherwise it would reset the theme() adjustments
plot <- plot_model(simple_model, type="pred", terms=c("PREVIOUS_SPEAKER_COMBINED"), dpi=300) + 
        theme_minimal(base_size=12) +
        theme(axis.text = element_text(size = 16), axis.title = element_text(size = 16)) +
        theme(plot.title = element_blank()) 

plot