**SM339 &#x25aa; Applied Statistics &#x25aa; Spring 2024 &#x25aa; Uhan**

# Lesson 10. Outliers and Unusual Points

_Setup._ Tweak the width and height values below to adjust the size of your plots in this notebook.

In [None]:
options(repr.plot.width=8, repr.plot.height=8)

## Overview

- If one or two "unusual" observations are strongly affecting the estimate of the fitted line, we need to know

- This is part of assessing the model

## Example 1

The dataset `PalmBeach` contains data from the US presidential election in 2000.
The race was very close, and the votes from Florida ended up determining the outcome.
`PalmBeach` contains the number of votes for two candidates in Florida counties.

- First, let's load the data:

In [None]:
library(Stat2Data)
data(PalmBeach)
head(PalmBeach)

- Let's start by plotting the number of votes for candidate George W. Bush (on the x-axis) versus the number of votes for candidate Pat Buchanan (on the y-axis)

- Let's also include a fitted line for a simple linear regression model

- The code below adds labels to the points in the scatterplot, so we can identify unusual points. Add it to your plotting code:

```r
text(PalmBeach$Bush, PalmBeach$Buchanan, pos=1)  # pos = 1 puts each label (the row number, by default) below its point
```
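_Sketch._ One possible version of the full plotting code is shown below. It assumes the model object is named `fit` (the name used in the diagnostic plots later in this lesson), with `Buchanan` as the response and `Bush` as the explanatory variable. The axis labels are illustrative choices.

```r
library(Stat2Data)
data(PalmBeach)

# Simple linear regression: Buchanan votes (y) on Bush votes (x)
fit <- lm(Buchanan ~ Bush, data = PalmBeach)

# Scatterplot with the fitted line
plot(PalmBeach$Bush, PalmBeach$Buchanan,
     xlab = "Votes for Bush", ylab = "Votes for Buchanan")
abline(fit)

# Label each point with its row number; pos = 1 puts the label below the point
text(PalmBeach$Bush, PalmBeach$Buchanan,
     labels = seq_len(nrow(PalmBeach)), pos = 1)
```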

- Which observations appear to be "unusual"?

_Write your answer here. Double-click to edit._

- We can inspect the unusual observations like this:
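_Sketch._ A minimal example, assuming the points of interest are rows 13 and 50 (the row numbers this lesson uses for Dade and Palm Beach). The name `unusual` is only for illustration.

```r
library(Stat2Data)
data(PalmBeach)

# Pull out the rows for the two labeled points; don't forget the comma
unusual <- PalmBeach[c(13, 50), ]
unusual
```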

## Three ways to quantify how unusual a point is

### Standardized residual

- The __standardized residual__ of observation $i$ is (roughly) $\dfrac{y_i - \hat{y}_i}{\hat{\sigma}_{\varepsilon}}$

- This expresses the residuals on a common, unitless scale
    
- Points with large standardized residuals are called __outliers__
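_Sketch._ The "roughly" above is there because the exact version also divides by $\sqrt{1 - h_i}$, where $h_i$ is the leverage defined in the next section. The example below compares the two on hypothetical toy data (simulated, not from `PalmBeach`); the names `toy_fit`, `rough`, and `exact` are made up for illustration.

```r
# Hypothetical toy data: a linear trend with one point pushed off the line
set.seed(339)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)
y[10] <- y[10] + 10   # observation 10 now sits far above the trend

toy_fit <- lm(y ~ x)

# Rough version: residual divided by the estimated standard deviation of the errors
rough <- residuals(toy_fit) / sigma(toy_fit)

# rstandard() additionally divides by sqrt(1 - h_i)
exact <- rstandard(toy_fit)

# Compare the two versions for a few observations around the shifted point
round(cbind(rough, exact)[8:12, ], 2)
```

For points whose leverage is small, the two columns are nearly identical.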

### Leverage

- The __leverage__ of observation $i$ is denoted by $h_i$

- See STAT2 Chapter 4 for details on how $h_i$ is computed
    
- A point with high leverage _can_ have a strong effect on the fitted line, especially the slope

- In simple linear regression, this means that the point's $x$ value is far from $\bar{x}$

- Points with high leverage are called __leverage points__
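_Sketch._ In simple linear regression, leverage has the closed form $h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$, which is why it depends only on how far $x_i$ is from $\bar{x}$. The example below checks this against R's `hatvalues()` on hypothetical toy data (simulated, not from `PalmBeach`).

```r
# Hypothetical toy data: 19 clustered x values plus one far to the right
set.seed(339)
x <- c(1:19, 40)
y <- 3 + 2 * x + rnorm(20)

toy_fit <- lm(y ~ x)

# Leverage as computed by R
h <- hatvalues(toy_fit)

# Leverage from the closed-form expression for simple linear regression
h_formula <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

round(cbind(x, h, h_formula), 3)
```

Note that the $y$ values play no role in the leverage calculation.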

### Cook's distance

- The __Cook's distance__ of observation $i$ is denoted by $D_i$

- See STAT2 Chapter 4 for details on how $D_i$ is computed
    
- __We are most concerned with a point that has high leverage and is an outlier__

- $D_i$ combines the standardized residual and leverage of a point to give an overall summary of its effect on the regression line

- Points with large Cook's distance are called __influential points__
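_Sketch._ One common way to write Cook's distance is $D_i = \dfrac{r_i^2}{p} \cdot \dfrac{h_i}{1 - h_i}$, where $r_i$ is the standardized residual and $p$ is the number of coefficients ($p = 2$ in simple linear regression). The example below checks this against R's `cooks.distance()` on hypothetical toy data (simulated, not from `PalmBeach`).

```r
# Hypothetical toy data: one point has an extreme x value AND falls off the trend
set.seed(339)
x <- c(1:19, 40)
y <- 3 + 2 * x + rnorm(20)
y[20] <- y[20] - 15

toy_fit <- lm(y ~ x)

r <- rstandard(toy_fit)        # standardized residuals
h <- hatvalues(toy_fit)        # leverages
p <- length(coef(toy_fit))     # number of coefficients (intercept and slope)

# Cook's distance as computed by R, and from the formula above
D <- cooks.distance(toy_fit)
D_formula <- (r^2 / p) * h / (1 - h)

round(cbind(r, h, D, D_formula), 3)
```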

## Rules of thumb for identifying unusual points in simple linear regression

- Remember that $n$ is the number of observations used in the simple linear regression model

| Statistic | Moderately unusual | Very unusual |
| :- | :- | :- |
| Standardized residual | beyond $\pm 2$ | beyond $\pm 3$ |
| Leverage $h_i$ | $h_i > \dfrac{4}{n}$ | $h_i > \dfrac{6}{n}$ |
| Cook's distance $D_i$ | $D_i > 0.5$ | $D_i > 1$| 
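_Sketch._ The table can be applied in code along these lines. The helpers `grade()` and `flag_unusual()` are hypothetical (not part of R or `Stat2Data`), the leverage cutoffs $4/n$ and $6/n$ apply to simple linear regression only, and the data are simulated toy data, not `PalmBeach`.

```r
# Hypothetical helper: classify values using a "moderate" and a "very" cutoff
grade <- function(value, moderate, very) {
  cut(value, breaks = c(-Inf, moderate, very, Inf),
      labels = c("no", "moderate", "very"))
}

# Hypothetical helper: apply the rules of thumb to a simple linear regression fit
flag_unusual <- function(model) {
  n <- nobs(model)
  r <- rstandard(model)
  h <- hatvalues(model)
  D <- cooks.distance(model)
  data.frame(
    std_resid     = r,
    leverage      = h,
    cooks_d       = D,
    outlier_flag  = grade(abs(r), 2, 3),
    leverage_flag = grade(h, 4 / n, 6 / n),
    cooks_flag    = grade(D, 0.5, 1)
  )
}

# Toy data: observation 20 has an extreme x value and falls off the trend
set.seed(339)
x <- c(1:19, 40)
y <- 3 + 2 * x + rnorm(20)
y[20] <- y[20] - 15

flags <- flag_unusual(lm(y ~ x))

# Show only the observations flagged by at least one rule
flags[flags$outlier_flag != "no" | flags$leverage_flag != "no" | flags$cooks_flag != "no", ]
```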

## Example 2

Run the following code to produce a diagnostic plot that displays standardized residuals, leverage, and Cook's distance for each observation. Use the output to comment on our two unusual points.

In [None]:
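fit <- lm(Buchanan ~ Bush, data=PalmBeach)  # re-fit the model from Example 1, in case fit is not already defined
nrow(PalmBeach)     # n = number of observations in dataset
plot(fit, which=5)  # diagnostic plot: standardized residuals vs. leverage, with Cook's distance contours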

*Write your answer here. Double-click to edit.*

- __Palm Beach (50)__

- __Dade (13)__

_Note._ We can also look at the following diagnostic plot for a direct comparison of Cook's distance:

In [None]:
plot(fit, which=4)

## Some options for dealing with a "troubling" observation

- Make sure there wasn't an __error__ with recording data

- Investigate whether a __different model__ would be a better fit
    - For example: other predictors, transformations of predictors, different distributional assumptions

    
- Fit the model with and without the point in question and __report both results__    

❗️ It is __NOT OKAY__ to just remove a data point and never mention it!

## Example 3

We saw that Palm Beach County (50) has the highest Cook's distance.

- In R, we can remove observations 7 and 42 from data frame `MyDataFrame` like this:

    ```r
    MyDataFrame[-c(7, 42), ]    # Don't forget the comma!
    ```

- Let's create a new data frame `PalmBeach2` that removes the Palm Beach County observation from `PalmBeach`:

In [None]:
PalmBeach2 <- PalmBeach[-c(50),]

- Now re-fit the model without the Palm Beach County observation:

- Let's visualize the two different fitted models

- First, let's make a scatterplot of `Buchanan` (y-axis) versus `Bush` (x-axis) using all the data

- Then, we can overlay both the original fitted line and the new fitted line without Palm Beach
    - _Tip._ `abline(fit2, col="blue")` draws the fitted line in blue.

- Finally, let's look at the summary output for both models and write the fitted model equations with and without Palm Beach County
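_Sketch._ The steps above could be carried out roughly as follows. This assumes the original model is `fit` with `Buchanan` as the response and `Bush` as the explanatory variable, and uses `fit2` (the name from the tip above) for the model without Palm Beach County. The axis labels are illustrative choices.

```r
library(Stat2Data)
data(PalmBeach)

# Fit with all counties, then without Palm Beach County (row 50)
fit <- lm(Buchanan ~ Bush, data = PalmBeach)
PalmBeach2 <- PalmBeach[-c(50), ]
fit2 <- lm(Buchanan ~ Bush, data = PalmBeach2)

# Scatterplot of all the data, with both fitted lines overlaid
plot(PalmBeach$Bush, PalmBeach$Buchanan,
     xlab = "Votes for Bush", ylab = "Votes for Buchanan")
abline(fit)                 # original fitted line
abline(fit2, col = "blue")  # fitted line without Palm Beach County

# Summary output for both models; the Estimate column gives the intercept and slope
summary(fit)
summary(fit2)
```

Read the intercept and slope from each summary to write the two fitted model equations.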

*Write your answer here. Double-click to edit.*

- The fitted model with Palm Beach:

- The fitted model without Palm Beach: