
# Lab Week 05 (15.4.2025): Exploring Interaction Effects in Depression Treatment

In this notebook, you'll analyze data from a study that compared the effectiveness of three treatments (A, B, and C) for severe depression.

We aim to determine whether the effect of **age** on **effectiveness** depends on the **treatment** received.

The data includes:
- `age`: The age of the patient
- `TRT`: The treatment group (A, B, or C)
- `y`: The effectiveness of the therapy, a score measuring improvement

---

## 📋 Tasks

1. Explore the data
2. Fit a baseline regression model (additive only)
3. Add interaction terms to the model
4. Compare both models and interpret
5. Perform model diagnostics
6. Predict the value for a new data point and quantify the uncertainty


In [None]:
# Run this cell to import the packages used and to load the data
import pandas as pd
import numpy as np

import statsmodels.api as sm

import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.stats.outliers_influence \
     import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm

from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)



# Load dataset
url = 'https://online.stat.psu.edu/stat501/sites/stat501/files/data/depression.txt'
df = pd.read_csv(url, sep='\t')
df = df.drop(['x2','x3'], axis = 1)

# Preview data
df.head()


## 🔍 Task 1: Data Exploration

Inspect the structure of the dataset and visualize the relationship between `age` and `y` across treatment groups.


In [None]:
# your code here


## 🧮 Task 2: Baseline Model

Fit a multiple linear regression model predicting `y` from:
- `age`
- `TRT` (careful: this is a categorical variable)

Interpret the model output.
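
Because `TRT` is categorical, it must enter the model as dummy variables rather than as a number. The following is a minimal sketch of one way to do this. It assumes statsmodels' formula interface (`statsmodels.formula.api`, which the import cell above does not load) and uses a synthetic stand-in data set `toy` with made-up values, not the depression data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (made-up values, NOT the depression data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(20, 70, size=36),
    "TRT": np.repeat(["A", "B", "C"], 12),
})
slope = toy["TRT"].map({"A": 1.0, "B": 0.6, "C": 0.2})
toy["y"] = 10 + slope * toy["age"] + rng.normal(0, 3, size=36)

# C(TRT) dummy-codes the treatment; the first level (A) is the baseline,
# so the model has one dummy for B and one for C
additive = smf.ols("y ~ age + C(TRT)", data=toy).fit()
print(additive.summary())
```

In this coding, each treatment coefficient is the shift in the intercept relative to treatment A, while the single `age` coefficient is shared by all three groups.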

In [None]:
# your code here

*Your model interpretation here*

## 🔁 Task 3: Interaction Model

Now fit a model that extends the model from Task 2 with interaction terms between `age` and `TRT`.

Interpret whether age affects effectiveness differently depending on TRT.
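
As a hedged illustration, the sketch below fits an interaction model with statsmodels' formula interface (an assumption; the notebook's imports suggest `ModelSpec` from ISLP as an alternative) on a synthetic data set `toy` with made-up values. In formula notation, `age * C(TRT)` expands to `age + C(TRT) + age:C(TRT)`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (made-up values, NOT the depression data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(20, 70, size=36),
    "TRT": np.repeat(["A", "B", "C"], 12),
})
slope = toy["TRT"].map({"A": 1.0, "B": 0.6, "C": 0.2})
toy["y"] = 10 + slope * toy["age"] + rng.normal(0, 3, size=36)

# Main effects plus age-by-treatment interaction terms
interaction = smf.ols("y ~ age * C(TRT)", data=toy).fit()

# Interaction coefficients are the ones whose name contains ":"
is_interaction = interaction.params.index.str.contains(":")
print(interaction.params[is_interaction])
print(interaction.pvalues[is_interaction])
```

Each interaction coefficient estimates how much the age slope in that treatment group differs from the age slope in the baseline group A, so its p-value speaks directly to the question of whether the age effect depends on treatment.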


In [None]:
# your code here


## 📊 Task 4: Compare Models

Compare both models using R-squared and adjusted R-squared. Summarize which model is better and what the interaction terms tell you. Finally, show how the model with interaction terms yields one regression line per treatment, and write down the equations of the three regression lines.
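
One possible shape for this comparison is sketched below on a synthetic data set `toy` (made-up values, not the depression data), again assuming statsmodels' formula interface. In addition to the two R-squared measures, it runs the nested-model F-test via `anova_lm`, and it reads each group's line off the fitted model by predicting at ages 0 and 1:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic stand-in data (made-up values, NOT the depression data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(20, 70, size=36),
    "TRT": np.repeat(["A", "B", "C"], 12),
})
slope = toy["TRT"].map({"A": 1.0, "B": 0.6, "C": 0.2})
toy["y"] = 10 + slope * toy["age"] + rng.normal(0, 3, size=36)

additive = smf.ols("y ~ age + C(TRT)", data=toy).fit()
interaction = smf.ols("y ~ age * C(TRT)", data=toy).fit()

for name, fit in [("additive", additive), ("interaction", interaction)]:
    print(f"{name}: R2 = {fit.rsquared:.3f}, adj. R2 = {fit.rsquared_adj:.3f}")

# F-test of the additive model against the larger interaction model
print(anova_lm(additive, interaction))

# One regression line per treatment: the prediction at age 0 is the
# intercept, and the change from age 0 to age 1 is the slope
lines = {}
for trt in ["A", "B", "C"]:
    grid = pd.DataFrame({"age": [0, 1], "TRT": [trt, trt]})
    at0, at1 = interaction.predict(grid)
    lines[trt] = (at0, at1 - at0)
    print(f"TRT {trt}: y = {at0:.2f} + {at1 - at0:.3f} * age")
```

The same lines can be obtained by hand: for group A the intercept and slope are the `Intercept` and `age` coefficients, and for B and C the corresponding dummy and interaction coefficients are added to them.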

In [13]:
# your code here

*Your comparison here*

In [14]:
# your regression line equations here

## 🧪 Task 5: Model Diagnostics

Now that we have fitted our final model (with interaction terms), let’s assess how well it meets the assumptions of linear regression.

We’ll perform the following diagnostics:

- **Residual Plot**: Check for randomness in the residuals. (Hint: see the [Chapter 3 lab](https://islp.readthedocs.io/en/latest/labs/Ch03-linreg-lab.html) for instructions on how to compute the residuals and draw the residual plot.)
- **Standardized Residuals**: Identify potential outliers, defined as observations with $|\text{standardized residual}| > 3$. To compute standardized residuals, call the [`get_influence()`](https://www.statsmodels.org/0.9.0/generated/statsmodels.stats.outliers_influence.OLSInfluence.html#statsmodels.stats.outliers_influence.OLSInfluence) method on the fitted model and read the `resid_studentized_internal` attribute of the result.
- **Leverage**: Identify observations with unusual predictor values. Compute the leverage statistic from the `hat_matrix_diag` attribute of the object returned by calling `get_influence()` on the fitted model. Use $2 \cdot \frac{p+1}{n}$ as the cutoff for high-leverage observations, where $p$ is the number of predictors and $n$ is the number of observations.

### ➤ Why?
These checks help us validate our model's assumptions and identify points that may disproportionately affect model results.
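
A minimal sketch of the outlier and leverage checks follows, run on a synthetic data set `toy` (made-up values, not the depression data) and assuming the model was fitted with statsmodels' formula interface. It treats absolute standardized residuals above 3 as potential outliers and takes $p$ from the fitted model's `df_model`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (made-up values, NOT the depression data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(20, 70, size=36),
    "TRT": np.repeat(["A", "B", "C"], 12),
})
slope = toy["TRT"].map({"A": 1.0, "B": 0.6, "C": 0.2})
toy["y"] = 10 + slope * toy["age"] + rng.normal(0, 3, size=36)

fit = smf.ols("y ~ age * C(TRT)", data=toy).fit()
influence = fit.get_influence()

# Internally studentized (standardized) residuals and leverage values
std_resid = influence.resid_studentized_internal
leverage = influence.hat_matrix_diag

n_obs = int(fit.nobs)
p = int(fit.df_model)          # number of predictors, intercept excluded
cutoff = 2 * (p + 1) / n_obs   # high-leverage threshold

outliers = np.flatnonzero(np.abs(std_resid) > 3)
high_leverage = np.flatnonzero(leverage > cutoff)
print("potential outliers (row positions):", outliers)
print("high-leverage points (row positions):", high_leverage)
```

For the residual plot, `fit.fittedvalues` and `fit.resid` hold the values for the horizontal and vertical axes.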

In [15]:
# your code here

## 🔮 Task 6: Prediction with Confidence and Prediction Intervals

Use the final model to predict the effectiveness for a **new patient**:

- Age = 45
- Treatment = A

Calculate:
- A **confidence interval** for the mean effectiveness for this group
- A **prediction interval** for an individual with these characteristics

### ➤ Why?
Understanding prediction vs. confidence intervals is critical:
- **Confidence Interval**: Tells us where the **mean** outcome is likely to lie.
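- **Prediction Interval**: Tells us where a **single new observation** is likely to fall. It is always wider than the confidence interval because it also accounts for the random scatter of individual outcomes around the mean.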

> Provide your interpretation of both intervals in context.
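
Both intervals can be obtained from a single call, as the sketch below suggests. It assumes a model fitted with statsmodels' formula interface on a synthetic data set `toy` (made-up values, not the depression data); the variable name `new_patient` is illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (made-up values, NOT the depression data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(20, 70, size=36),
    "TRT": np.repeat(["A", "B", "C"], 12),
})
slope = toy["TRT"].map({"A": 1.0, "B": 0.6, "C": 0.2})
toy["y"] = 10 + slope * toy["age"] + rng.normal(0, 3, size=36)

fit = smf.ols("y ~ age * C(TRT)", data=toy).fit()

# New patient: age 45, treatment A
new_patient = pd.DataFrame({"age": [45], "TRT": ["A"]})
frame = fit.get_prediction(new_patient).summary_frame(alpha=0.05)

# mean_ci_* columns: 95% confidence interval for the mean response
# obs_ci_*  columns: 95% prediction interval for a single new observation
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```

Setting `alpha=0.05` requests 95% intervals; the `obs_ci_*` bounds always enclose the `mean_ci_*` bounds.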

In [16]:
# your code here

*Your interpretation of the intervals here*