**SM339 &#x25aa; Applied Statistics &#x25aa; Spring 2024 &#x25aa; Uhan**

# Lesson 18. Comparing Two Regression Lines &mdash; Part 2

## Using one model to fit two lines with different intercepts

### Example 1

In a Municipal Court in Columbus, Ohio, jury duty lasts two weeks, so the court randomly summons potential jurors 26 times over the course of a year. However, not every person who receives a summons actually shows up in court, and in fact the court has noticed a negative trend in the percent of jurors reporting as the calendar year goes on. 

After 1998, the court implemented a variety of methods to try to increase participation rates, and they are interested in knowing if (on average) these methods made any real difference.

The dataset `Jurors` in the `Stat2Data` library contains the participation rates for each of the 26 periods (numbered sequentially) for the years 1998 and 2000. Use these data to help answer the court's question.

First, let's take a look at the data:

In [None]:
library(Stat2Data)
data(Jurors)
head(Jurors)

#### a.
Make a color-coded plot of the data to help investigate what's going on. Comment on the general trends you see.

_Hints._ The `col` keyword argument in `plot()` lets you specify colors. You can combine this with the `ifelse()` function to conditionally color points in your plot, like this:

```r
col = ifelse(Jurors$Year == 1998, "black", "red") 
```
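
As a minimal sketch of how this hint works, the code below applies `ifelse()` to a small vector and passes the result to `plot()`. The values in `year`, `x`, and `y` are hypothetical toy numbers made up for illustration, not values from `Jurors`, and the `pch` and `legend()` settings are just one reasonable choice.

```r
# Hypothetical toy values, NOT taken from the Jurors data
year <- c(1998, 1998, 1998, 2000, 2000, 2000)
x <- c(1, 2, 3, 1, 2, 3)
y <- c(90, 85, 80, 95, 93, 91)

# ifelse() is vectorized: it returns one color per element of `year`
point_colors <- ifelse(year == 1998, "black", "red")
print(point_colors)

# Pass the color vector to plot(), then add a legend explaining the coding
plot(x, y, col = point_colors, pch = 19, xlab = "x", ylab = "y")
legend("topright", legend = c("1998", "2000"),
       col = c("black", "red"), pch = 19)
```

Because the color vector has one entry per observation, each point is drawn in the color that matches its year.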

*Write your notes here. Double-click to edit.*

#### b.
The variable $\mathit{I2000}$ is an indicator variable:

$$ \mathit{I2000} = \begin{cases}
1 & \text{if } \mathit{Year} = 2000\\
0 & \text{otherwise}
\end{cases} $$

We can use the following _single_ population-level model to answer the court's question:

$$ \mathit{PctReport} = \beta_0 + \beta_1 \mathit{Period} + \beta_2 \mathit{I2000} + \varepsilon \quad \text{where} \quad \varepsilon \sim \text{iid } N(0, \sigma_{\varepsilon}^2) $$
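
One way to see why this single model describes two lines is to substitute each value of the indicator into the model. The sketch below writes $E(\mathit{PctReport})$ for the mean response, which is notation assumed here for illustration:

$$ \begin{aligned}
\text{1998 } (\mathit{I2000} = 0): \quad & E(\mathit{PctReport}) = \beta_0 + \beta_1 \mathit{Period} \\
\text{2000 } (\mathit{I2000} = 1): \quad & E(\mathit{PctReport}) = (\beta_0 + \beta_2) + \beta_1 \mathit{Period}
\end{aligned} $$

The two lines share the slope $\beta_1$ and differ only in their intercepts, by $\beta_2$.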

#### c.
Fit the model and report the prediction equation.

*Write your notes here. Double-click to edit.*

#### d.
Interpret the coefficients.

*Write your notes here. Double-click to edit.*

#### e.
Does the model appear to be a "good" model? Check the diagnostic plots, overall ANOVA F-test, and $R^2$.

*Write your notes here. Double-click to edit.*

#### f.
Use a formal statistical test to address the court's question: 
> Do you see evidence that, after accounting for the period of the year, the percentage of summoned jurors who reported to court was significantly different in 2000 than in 1998?

*Write your notes here. Double-click to edit.*

#### g.
Provide a confidence interval that estimates the size of the average change (after accounting for the period of the year) in juror turnout from 1998 to 2000.

## Using categorical variables in regression models with R 

* There are two other ways to include categorical variables in your model in R:

* Instead of using the pre-defined indicator (which may not exist in other data sets!), you can create your own with the `I()` function, like this:

In [None]:
fit2 <- lm(PctReport ~ Period + I(Year == 2000), data = Jurors)
summary(fit2)

* As another option, you can tell R that the original variable (in our example, $\mathit{Year}$) is categorical with the `as.factor()` function, like this:

In [None]:
fit3 <- lm(PctReport ~ Period + as.factor(Year), data = Jurors)
summary(fit3)
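
To illustrate that these approaches give the same fit, here is a minimal sketch using R's built-in `mtcars` data instead of `Jurors`, so that it runs without any extra packages. The dataset and the names `fit_indicator`, `fit_I`, `fit_factor`, and `estimates` are assumptions made for this illustration only. In `mtcars`, `am` is already a 0/1 indicator.

```r
# mtcars ships with base R; `am` is already a 0/1 indicator (1 = manual)
data(mtcars)

# 1. Use the existing 0/1 indicator directly
fit_indicator <- lm(mpg ~ wt + am, data = mtcars)

# 2. Build the indicator inside the formula with I()
fit_I <- lm(mpg ~ wt + I(am == 1), data = mtcars)

# 3. Tell R the variable is categorical with as.factor()
fit_factor <- lm(mpg ~ wt + as.factor(am), data = mtcars)

# The coefficient names differ across the fits, but the estimates agree
estimates <- cbind(
  indicator = unname(coef(fit_indicator)),
  I         = unname(coef(fit_I)),
  factor    = unname(coef(fit_factor))
)
print(estimates)
```

With `as.factor()`, R by default uses the first level (here, the smaller value) as the baseline category, which is why the third fit matches the other two.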

## Using one model to fit two lines with different intercepts AND different slopes

### Example 2

In Example 1, we analyzed the `Jurors` dataset to determine whether there was a difference in juror turnout ($\mathit{PctReport}$) between the years 1998 and 2000, after accounting for the time of year ($\mathit{Period}$).

The model we used, which __did not__ allow the two years to have different slopes for $\mathit{Period}$, was:

$$ \mathit{PctReport} = \beta_0 + \beta_1 \mathit{Period} + \beta_2 \mathit{I2000} + \varepsilon \quad \text{where} \quad \varepsilon \sim \text{iid } N(0, \sigma_{\varepsilon}^2) $$

Now we want to allow different intercepts __and__ different slopes for $\mathit{Period}$. We can do so with the following model:

$$ \mathit{PctReport} = \beta_0 + \beta_1 \mathit{Period} + \beta_2 \mathit{I2000} + \beta_3 (\fbox{$\mathit{Period}$} \times \mathit{I2000}) + \varepsilon \quad \text{where} \quad \varepsilon \sim \text{iid } N(0, \sigma_{\varepsilon}^2) $$ 
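
To see how the interaction term produces different slopes, substitute each value of the indicator into the model. The sketch below writes $E(\mathit{PctReport})$ for the mean response, which is notation assumed here for illustration:

$$ \begin{aligned}
\text{1998 } (\mathit{I2000} = 0): \quad & E(\mathit{PctReport}) = \beta_0 + \beta_1 \mathit{Period} \\
\text{2000 } (\mathit{I2000} = 1): \quad & E(\mathit{PctReport}) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \mathit{Period}
\end{aligned} $$

So $\beta_2$ is the difference in intercepts and $\beta_3$ is the difference in slopes between the two years.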

#### a.
Fit the model and provide the summary output.

To include an interaction term between variables `x` and `y` in R, use a colon, like this:

```r
x:y 
```
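
As a minimal sketch of the colon syntax in a full model formula, the code below uses R's built-in `mtcars` data rather than `Jurors`, so that it runs without extra packages. The dataset and the names `fit_colon`, `fit_star`, and `same_fit` are assumptions made for this illustration only. It also shows R's `*` formula operator, which expands to both main effects plus their interaction.

```r
data(mtcars)

# Main effects plus the interaction, written out with a colon
fit_colon <- lm(mpg ~ wt + am + wt:am, data = mtcars)

# Shorthand: wt * am expands to wt + am + wt:am
fit_star <- lm(mpg ~ wt * am, data = mtcars)

# Both formulas produce the same four coefficients
print(coef(fit_colon))
same_fit <- isTRUE(all.equal(coef(fit_colon), coef(fit_star)))
print(same_fit)
```

Writing the terms out with the colon makes it explicit which coefficient belongs to the interaction, which is convenient when testing that coefficient.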

#### b.
Do we see statistical evidence that the _rate of change_ in juror turnout over the course of the year differs between 1998 and 2000? Provide the four steps for the hypothesis test of the relevant coefficient.

*Write your notes here. Double-click to edit.*

## Exercises

### Problem 1
Suppose we are interested in how the growth rates of boys and girls compare.

The dataset `Kids198` from `Stat2Data` contains the ages (in months) and weights (in pounds) for a random sample of 198 children. Girls are coded as $\mathit{Sex} = 1$ and boys are coded as $\mathit{Sex} = 0$.

In [None]:
library(Stat2Data)
data(Kids198)
head(Kids198)

#### a.
Make a color-coded plot of the data, and comment on the general trends you see.

*Write your notes here. Double-click to edit.*

#### b.
Specify the single model you will use to answer the research question.

*Write your notes here. Double-click to edit.*

#### c.
Fit the model and report the prediction equation.

*Solution.*

$$ \widehat{\mathit{Weight}} = -33.69254 + 0.90871 \mathit{Age} + 31.85057 \mathit{Sex} - 0.28122 (\mathit{Age} \times \mathit{Sex}) $$
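
Substituting the two values of $\mathit{Sex}$ into this prediction equation gives one fitted line per group. The following is a worked sketch using only the coefficients reported above:

$$ \begin{aligned}
\text{Boys } (\mathit{Sex} = 0): \quad & \widehat{\mathit{Weight}} = -33.69254 + 0.90871 \mathit{Age} \\
\text{Girls } (\mathit{Sex} = 1): \quad & \widehat{\mathit{Weight}} = (-33.69254 + 31.85057) + (0.90871 - 0.28122) \mathit{Age} \\
& \phantom{\widehat{\mathit{Weight}}} = -1.84197 + 0.62749 \mathit{Age}
\end{aligned} $$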

#### d.
Does the model appear to be a "good" model? Justify your answer.

*Write your notes here. Double-click to edit.*

#### e.
Is the growth rate (i.e., the slope of $\mathit{Age}$) significantly different for boys versus girls? Justify your answer.

*Write your notes here. Double-click to edit.*

#### f.
Provide an interval estimating how much the growth rates differ.
Be clear which sex appears to grow faster.

*Write your notes here. Double-click to edit.*