# Smoothing
```{r, echo=FALSE}
img_path <- "img"
```
Before we continue learning about machine learning algorithms, we introduce the important concept of *smoothing*. Smoothing is a very powerful technique used all across data analysis. Other names given to this technique are *curve fitting* and *low pass filtering*. It is designed to detect trends in the presence of noisy data in cases in which the shape of the trend is unknown. The *smoothing* name comes from the fact that to accomplish this feat, we assume that the trend is *smooth*, as in a smooth surface. In contrast, the noise, or deviation from the trend, is unpredictably wobbly:
```{r signal-plus-noise-example, message=FALSE, warning=FALSE, fig.height=6, echo=FALSE, cache=FALSE}
library(tidyverse)
set.seed(1)
n <- 100
x <- seq(-pi*4, pi*4, len = n)
tmp <- data.frame(x = x , f = sin(x) + x/8, e = rnorm(n, 0, 0.5))
p1 <- qplot(x, f, main = "smooth trend", ylim = range(tmp$f + tmp$e), data = tmp, geom = "line")
p2 <- qplot(x, e, main = "noise", ylim = range(tmp$f + tmp$e), data = tmp, geom = "line")
p3 <- qplot(x, f + e, main = "data = smooth trend + noise", ylim = range(tmp$f + tmp$e), data = tmp, geom = "line")
gridExtra::grid.arrange(p1, p2, p3)
```
Part of what we explain in this section are the assumptions that permit us to extract the trend from the noise.
## Example: Is it a 2 or a 7? {#sec-two-or-seven}
To motivate the need for smoothing and make the connection with machine learning, we will construct a simplified version of the MNIST dataset with just two classes for the outcome and two predictors. Specifically, we define the challenge as building an algorithm that can determine if a digit is a 2 or 7 from the proportion of dark pixels in the upper left quadrant ($X_1$) and the lower right quadrant ($X_2$). We also selected a random sample of 1,000 digits, 500 in the training set and 500 in the test set. We provide this dataset in the _mnist_27_ object in the `dslabs` package.
For the training data, we have $n=500$ observed outcomes $y_1,\dots,y_n$, with $Y$ defined as $1$ if the digit is 7 and 0 if it's 2, and $n=500$ features $\mathbf{x}_1, \dots, \mathbf{x}_n$, with each feature a two-dimensional point $\mathbf{x}_i = (x_{i,1}, x_{i,2})^\top$. Here is a plot of the $x_2$s versus the $x_1$s with color determining if $y$ is 1 (blue) or 0 (red):
```{r two-or-seven-scatter, warning=FALSE, message=FALSE, cache=FALSE}
library(caret)
library(dslabs)
mnist_27$train |> ggplot(aes(x_1, x_2, color = y)) + geom_point()
```
We can immediately see some patterns. For example, if $x_1$ (the proportion of dark pixels in the upper left quadrant) is very large, then the digit is probably a 7. Also, for smaller values of $x_1$, the 2s appear to be in the mid range values of $x_2$.
To illustrate how to interpret $x_1$ and $x_2$, we include four example images. On the left are the original images of the two digits with the largest and smallest values for $x_1$ and on the right we have the images corresponding to the largest and smallest values of $x_2$:
```{r two-or-seven-images-large-x1, echo=FALSE, out.width="100%", fig.height=3, fig.width=6.5}
if (!exists("mnist")) mnist <- read_mnist()
is <- mnist_27$index_train[c(which.min(mnist_27$train$x_1), which.max(mnist_27$train$x_1))]
titles <- c("smallest","largest")
tmp <- lapply(1:2, function(i){
expand.grid(Row = 1:28, Column = 1:28) |>
mutate(label = titles[i],
value = mnist$train$images[is[i],])
})
tmp <- Reduce(rbind, tmp)
p1 <- tmp |> ggplot(aes(Row, Column, fill = value)) +
geom_raster(show.legend = FALSE) +
scale_y_reverse() +
scale_fill_gradient(low = "white", high = "black") +
facet_grid(.~label) +
geom_vline(xintercept = 14.5) +
geom_hline(yintercept = 14.5) +
ggtitle("Largest and smallest x_1")
is <- mnist_27$index_train[c(which.min(mnist_27$train$x_2), which.max(mnist_27$train$x_2))]
titles <- c("smallest","largest")
tmp <- lapply(1:2, function(i){
expand.grid(Row = 1:28, Column = 1:28) |>
mutate(label = titles[i],
value = mnist$train$images[is[i],])
})
tmp <- Reduce(rbind, tmp)
p2 <- tmp |> ggplot(aes(Row, Column, fill = value)) +
geom_raster(show.legend = FALSE) +
scale_y_reverse() +
scale_fill_gradient(low = "white", high = "black") +
facet_grid(.~label) +
geom_vline(xintercept = 14.5) +
geom_hline(yintercept = 14.5) +
ggtitle("Largest and smallest x_2")
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
We can start getting a sense for why these predictors are useful, but also why the problem will be somewhat challenging.
We haven't really learned any algorithms yet, so let's try building an algorithm using multivariable regression. The model is simply:
$$
p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x}) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2) =
\beta_0 + \beta_1 x_1 + \beta_2 x_2
$$
We can fit this model using least squares to obtain the estimates $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$, which in turn give us the estimate $\hat{p}(\mathbf{x})$.
We define a decision rule by predicting $\hat{y}=1$ if $\hat{p}(\mathbf{x})>0.5$ and 0 otherwise.
```{r, echo=FALSE}
fit <- mnist_27$train |> mutate(y = ifelse(y == 7, 1, 0)) |> lm(y ~ x_1 + x_2, data = _)
p_hat <- predict(fit, newdata = mnist_27$test)
y_hat <- factor(ifelse(p_hat > 0.5, 7, 2))
```
We get an accuracy of `r confusionMatrix(y_hat, mnist_27$test$y)$overall[["Accuracy"]]`, well above 50%. Not bad for our first try. But can we do better?
Because we constructed the `mnist_27` example ourselves and had at our disposal 60,000 digits in the full MNIST dataset, we were able to build the _true_ conditional distribution $p(\mathbf{x})$. Keep in mind that in practice we don't have access to the true conditional distribution. We include it in this educational example because it permits the comparison of $\hat{p}(\mathbf{x})$ to the true $p(\mathbf{x})$. This comparison teaches us the limitations of different algorithms. We have stored the true $p(\mathbf{x})$ in the `mnist_27` object and can plot it as an image.
We draw a curve that separates pairs $(\mathbf{x})$ for which $p(\mathbf{x}) > 0.5$ and pairs for which $p(\mathbf{x}) < 0.5$:
```{r true-p-better-colors, echo=FALSE}
mnist_27$true_p |>
ggplot(aes(x_1, x_2, z = p)) +
geom_raster(aes(fill = p)) +
scale_fill_gradientn(colors = c("#F8766D", "white", "#00BFC4")) +
stat_contour(breaks = c(0.5), color = "black")
```
To start understanding the limitations of regression, first note that with regression $\hat{p}(\mathbf{x})$ has to be a plane, so the boundary defined by the decision rule is given by $\hat{p}(\mathbf{x}) = 0.5$, which implies:
$$
\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = 0.5 \implies
x_2 = (0.5-\hat{\beta}_0)/\hat{\beta}_2 - (\hat{\beta}_1/\hat{\beta}_2)\, x_1
$$
This implies that for the boundary, $x_2$ is a linear function of $x_1$, which suggests that our regression approach has no chance of capturing the non-linear nature of the true $p(\mathbf{x})$. Below is a visual representation of $\hat{p}(\mathbf{x})$ which clearly shows how it fails to capture the shape of $p(\mathbf{x})$:
```{r regression-p-hat, echo=FALSE, out.width="100%", fig.height=3, fig.width=7}
p_hat <- predict(fit, newdata = mnist_27$true_p)
p_hat <- scales::squish(p_hat, c(0, 1))
p1 <- mnist_27$true_p |> mutate(p_hat = p_hat) |>
ggplot(aes(x_1, x_2, z = p_hat)) +
geom_raster(aes(fill = p_hat)) +
scale_fill_gradientn(colors = c("#F8766D","white","#00BFC4")) +
stat_contour(breaks = c(0.5), color = "black")
p2 <- mnist_27$true_p |>
mutate(p_hat = p_hat) |>
ggplot() +
stat_contour(aes(x_1, x_2, z = p_hat), breaks = c(0.5), color = "black") +
geom_point(mapping = aes(x_1, x_2, color = y), data = mnist_27$test)
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
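As a sketch of this computation, the intercept and slope of the boundary line can be recovered directly from the fitted coefficients. This assumes the **dslabs** package is installed; the variable names (`train`, `b`, `boundary_intercept`, `boundary_slope`) are ours, for illustration only:

```r
# Sketch: recover the linear decision boundary from the least squares fit.
# Assumes the dslabs package is installed; variable names are illustrative.
library(dslabs)
train <- mnist_27$train
train$y01 <- as.numeric(train$y == 7)          # code 7 as 1, 2 as 0
fit <- lm(y01 ~ x_1 + x_2, data = train)
b <- coef(fit)
boundary_intercept <- (0.5 - b[["(Intercept)"]]) / b[["x_2"]]
boundary_slope <- -b[["x_1"]] / b[["x_2"]]
# the decision boundary is the straight line:
#   x_2 = boundary_intercept + boundary_slope * x_1
```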
We need something more flexible: a method that permits estimates with shapes other than a plane. Smoothing techniques permit this flexibility. We will start by describing nearest neighbor and kernel approaches. To understand why we cover this topic, remember that the concepts behind smoothing techniques are extremely useful in machine learning because conditional expectations/probabilities can be thought of as *trends* of unknown shapes that we need to estimate in the presence of uncertainty.
## Signal plus noise model
To explain these concepts, we will focus first on a problem with just one predictor. Specifically, we try to estimate the time trend in the 2008 US popular vote poll margin (the difference between Obama and McCain). Later we will learn about methods, such as k-nearest neighbors, that can be used to smooth with higher dimensions.
```{r polls-2008-data, warning=FALSE, message=FALSE, cache=FALSE}
polls_2008 |> ggplot(aes(day, margin)) + geom_point()
```
For the purposes of the popular vote example, do not think of it as a forecasting problem. Instead, we are simply interested in learning the shape of the trend *after* the election is over.
We assume that for any given day $x$, there is a true preference among the electorate $f(x)$, but due to the uncertainty introduced by the polling, each data point comes with an error $\varepsilon$. A mathematical model for the observed poll margin $Y_i$ is:
$$
Y_i = f(x_i) + \varepsilon_i
$$
To think of this as a machine learning problem, consider that we want to predict $Y$ given a day $x$. If we knew the conditional expectation $f(x) = \mbox{E}(Y \mid X=x)$, we would use it. But since we don't know this conditional expectation, we have to estimate it. Let's use regression, since it is the only method we have learned up to now.
```{r linear-regression-not-flexible, echo=FALSE}
resid <- ifelse(lm(margin~day, data = polls_2008)$resid > 0, "+", "-")
polls_2008 |>
mutate(resid = resid) |>
ggplot(aes(day, margin)) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
geom_point(aes(color = resid), size = 3)
```
The fitted regression line does not appear to describe the trend very well. For example, on September 4 (day -62), the Republican Convention was held and the data suggest that it gave John McCain a boost in the polls. However, the regression line does not capture this potential trend. To see the *lack of fit* more clearly, we note that points above the fitted line (blue) and those below (red) are not evenly distributed across days. We therefore need an alternative, more flexible approach.
## Bin smoothing
The general idea of smoothing is to group data points into strata in which the value of $f(x)$ can be assumed to be constant. We can make this assumption when we think $f(x)$ changes slowly and, as a result, $f(x)$ is almost constant in small windows of $x$. An example of this idea for the `polls_2008` data is to assume that public opinion remained approximately the same within a week's time. With this assumption in place, we have several data points with the same expected value.
If we fix a day to be in the center of our week, call it $x_0$, then for any other day $x$ such that $|x - x_0| \leq 3.5$, we assume $f(x)$ is a constant $f(x) = \mu$. This assumption implies that:
$$
E[Y_i | X_i = x_i ] \approx \mu \mbox{ if } |x_i - x_0| \leq 3.5
$$
In smoothing, we call the size of the interval satisfying $|x_i - x_0| \leq 3.5$ the *window size*, *bandwidth* or *span*. Later we will see that we try to optimize this parameter.
This assumption implies that a good estimate for $f(x_0)$ is the average of the $Y_i$ values in the window. If we define $A_0$ as the set of indexes $i$ such that $|x_i - x_0| \leq 3.5$ and $N_0$ as the number of indexes in $A_0$, then our estimate is:
$$
\hat{f}(x_0) = \frac{1}{N_0} \sum_{i \in A_0} Y_i
$$
We make this calculation with each value of $x$ as the center. In the poll example, for each day, we would compute the average of the values within a week with that day in the center. Here are two examples: $x_0 = -125$ and $x_0 = -55$. The blue segment represents the resulting average.
```{r binsmoother-expained, echo=FALSE}
span <- 3.5
tmp <- polls_2008 |>
crossing(center = polls_2008$day) |>
mutate(dist = abs(day - center)) |>
filter(dist <= span)
tmp |> filter(center %in% c(-125, -55)) |>
ggplot(aes(day, margin)) +
geom_point(data = polls_2008, size = 3, alpha = 0.5, color = "grey") +
geom_point(size = 2) +
geom_smooth(aes(group = center),
method = "lm", formula = y~1, se = FALSE) +
facet_wrap(~center)
```
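The window averages shown above are simple to compute by hand. Below is a minimal sketch on simulated data; the function name `bin_smooth` is ours, not part of any package:

```r
# Sketch of a bin smoother computed by hand on simulated data.
set.seed(1)
n <- 100
x <- seq(-4*pi, 4*pi, length.out = n)
y <- sin(x) + x/8 + rnorm(n, 0, 0.5)
bin_smooth <- function(x, y, half_window) {
  # for each center x0, average the y values with |x - x0| <= half_window
  sapply(x, function(x0) mean(y[abs(x - x0) <= half_window]))
}
f_hat <- bin_smooth(x, y, half_window = 1)
```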
By computing this mean for every point, we form an estimate of the underlying curve $f(x)$. Below we show the procedure as we move from $x_0 = -155$ up to $0$. At each value of $x_0$, we keep the estimate $\hat{f}(x_0)$ and move on to the next point:
```{r binsmoother-animation, echo=FALSE, warning=FALSE}
library(gganimate)
span <- 7
fit <- with(polls_2008, ksmooth(day, margin, kernel = "box", x.points = day, bandwidth = span))
bin_fit <- data.frame(x = fit$x, .fitted = fit$y)
if (!file.exists(file.path(img_path, "binsmoother-animation.gif"))) {
p <- tmp |>
ggplot() +
geom_smooth(aes(day, margin, group = center, frame = center), method = "lm", formula = y~1, se = FALSE) +
geom_point(aes(day, margin), data = polls_2008, size = 3, alpha = .5, color = "grey") +
geom_point(aes(day, margin, frame = center)) +
geom_line(aes(x = x, y = .fitted, frame = x, cumulative = TRUE), data = bin_fit, color = "red") +
ggtitle("x0 = ")
gganimate(p, filename = file.path(img_path,"binsmoother-animation.gif"), interval = .1)
}
if (knitr::is_html_output()) {
knitr::include_graphics(file.path(img_path,"binsmoother-animation.gif"))
} else{
centers <- quantile(tmp$center, seq(1,6)/6)
tmp_bin_fit <- crossing(center = centers, bin_fit) |>
group_by(center) |>
filter(x <= center) |>
ungroup()
tmp |> filter(center %in% centers) |>
ggplot() +
geom_smooth(aes(day, margin), method = "lm", formula = y~1, se = FALSE) +
geom_point(aes(day, margin), data = polls_2008, size = 3, alpha = .5, color = "grey") +
geom_point(aes(day, margin)) +
geom_line(aes(x = x, y = .fitted), data = tmp_bin_fit, color = "red") +
ggtitle("x0 = ") +
facet_wrap(~center, nrow = 2)
}
```
The final code and resulting estimate look like this:
```{r binsmoother-final}
span <- 7
fit <- with(polls_2008, ksmooth(day, margin, kernel = "box", bandwidth = span))
polls_2008 |> mutate(fit = fit$y) |>
ggplot(aes(x = day)) +
geom_point(aes(y = margin), size = 3, alpha = .5, color = "grey") +
geom_line(aes(y = fit), color = "red")
```
## Kernels
The final result from the bin smoother is quite wiggly. One reason for this is that each time the window moves, two points change. We can attenuate this somewhat by taking weighted averages that give the center point more weight than points farther away, with the two points at the edges receiving very little weight.
You can think of the bin smoother approach as a weighted average:
$$
\hat{f}(x_0) = \sum_{i=1}^N w_0(x_i) Y_i
$$
in which each point receives a weight of either $0$ or $1/N_0$, with $N_0$ the number of points in the week. In the code above, we used the argument `kernel = "box"` in our call to the function `ksmooth`. This is because the weight function looks like a box. The `ksmooth` function also provides the `kernel = "normal"` option, which uses the normal density to assign weights.
```{r gaussian-kernel, echo=FALSE, out.width="80%", fig.height=3, fig.width=6}
x_0 <- -125
p1 <- data.frame(x = polls_2008$day) |> mutate(w_0 = 1*I(abs(x - x_0) <= span/2)) |>
mutate(w_0 = w_0/sum(w_0)) |>
ggplot(aes(x, w_0)) +
geom_step() +
ggtitle("Box")
tmp <- with(data.frame(day = seq(min(polls_2008$day), max(polls_2008$day), .25)),
ksmooth(day, 1*I(day == x_0), kernel = "normal", x.points = day, bandwidth = span))
p2 <- data.frame(x = tmp$x, w_0 = tmp$y) |>
mutate(w_0 = w_0/sum(w_0)) |>
ggplot(aes(x, w_0)) +
geom_line() +
ggtitle("Normal")
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
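As a sketch, the box weights shown above can be written out explicitly for a given center $x_0$; the variable names are ours:

```r
# Sketch: box kernel weights for a one-week window centered at x0.
x <- -155:0                                 # days, as in polls_2008
x0 <- -125
span <- 7
w_0 <- as.numeric(abs(x - x0) <= span/2)    # 1 inside the window, 0 outside
w_0 <- w_0 / sum(w_0)                       # normalize so the weights sum to 1
# each of the 7 days in the window receives weight 1/7
```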
The final code and resulting plot for the normal kernel look like this:
```{r final-ksmooth-normal-kernel}
span <- 7
fit <- with(polls_2008, ksmooth(day, margin, kernel = "normal", bandwidth = span))
polls_2008 |> mutate(smooth = fit$y) |>
ggplot(aes(day, margin)) +
geom_point(size = 3, alpha = .5, color = "grey") +
geom_line(aes(day, smooth), color = "red")
```
Notice that this version looks smoother.
There are several functions in R that implement bin smoothers. One example is `ksmooth`, shown above. In practice, however, we typically prefer methods that use slightly more complex models than fitting a constant. The final result above, for example, is still somewhat wiggly in parts we don't expect it to be (between -125 and -75, for example). Methods such as `loess`, which we explain next, improve on this.
## Local weighted regression (loess)
A limitation of the bin smoother approach just described is that we need small windows for the approximately constant assumptions to hold. As a result, we end up with a small number of data points to average and obtain imprecise estimates $\hat{f}(x)$. Here we describe how *local weighted regression* (loess) permits us to consider larger window sizes. To do this, we will use a mathematical result, referred to as Taylor's theorem, which tells us that if you look closely enough at any smooth function $f(x)$, it will look like a line. To see why this makes sense, consider the curved edges gardeners make using straight-edged spades:
![](img/garden.png)
("Downing Street garden path edge"[^smoothing-1] by Flickr user Number 10[^smoothing-2]. CC-BY 2.0 license[^smoothing-3].)
[^smoothing-1]: https://www.flickr.com/photos/49707497\@N06/7361631644
[^smoothing-2]: https://www.flickr.com/photos/number10gov/
[^smoothing-3]: https://creativecommons.org/licenses/by/2.0/
Instead of assuming the function is approximately constant in a window, we assume the function is locally linear. We can consider larger window sizes with the linear assumption than with a constant. Instead of the one-week window, we consider a larger one in which the trend is approximately linear. We start with a three-week window and later consider and evaluate other options:
$$
E[Y_i | X_i = x_i ] = \beta_0 + \beta_1 (x_i-x_0) \mbox{ if } |x_i - x_0| \leq 21
$$
For every point $x_0$, loess defines a window and fits a line within that window. Here is an example showing the fits for $x_0=-125$ and $x_0 = -55$:
```{r loess, echo=FALSE}
span <- 21/diff(range(polls_2008$day))
tmp <- polls_2008 |>
crossing(center = polls_2008$day) |>
mutate(dist = abs(day - center)) |>
filter(rank(dist) / n() <= span) |>
mutate(weight = (1 - (dist / max(dist)) ^ 3) ^ 3)
tmp |>
filter(center %in% c(-125, -55)) |>
ggplot(aes(day, margin)) +
scale_size(range = c(0, 3)) +
geom_smooth(aes(group = center, weight = weight),
method = "lm", se = FALSE) +
geom_point(data = polls_2008, size = 3, alpha = .5, color = "grey") +
geom_point(aes(size = weight)) +
facet_wrap(~center)
```
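A single local fit like the two shown above can be sketched by hand: keep the points in the window, fit a line by least squares, and read off the intercept. This uses `polls_2008` from **dslabs** and is a simplification, since we drop the kernel weights that loess applies:

```r
# Sketch: one local linear fit at x0 = -125 with a three-week window.
# Unweighted for simplicity; loess additionally applies tri-weight weights.
library(dslabs)
x0 <- -125
window <- polls_2008[abs(polls_2008$day - x0) <= 21, ]
fit <- lm(margin ~ I(day - x0), data = window)
f_hat_x0 <- coef(fit)[[1]]   # the intercept is the fitted value at x0
```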
The fitted value at $x_0$ becomes our estimate $\hat{f}(x_0)$. Below we show the procedure as we move from $x_0 = -155$ up to $0$:
```{r loess-animation, echo=FALSE, warning=FALSE}
library(broom)
fit <- loess(margin ~ day, degree = 1, span = span, data = polls_2008)
loess_fit <- augment(fit)
if (!file.exists(file.path(img_path, "loess-animation.gif"))) {
p <- ggplot(tmp, aes(day, margin)) +
scale_size(range = c(0, 3)) +
geom_smooth(aes(group = center, frame = center, weight = weight), method = "lm", se = FALSE) +
geom_point(data = polls_2008, size = 3, alpha = .5, color = "grey") +
geom_point(aes(size = weight, frame = center)) +
geom_line(aes(x = day, y = .fitted, frame = day, cumulative = TRUE),
data = loess_fit, color = "red")
gganimate(p, filename = file.path(img_path, "loess-animation.gif"), interval = .1)
}
if (knitr::is_html_output()) {
knitr::include_graphics(file.path(img_path,"loess-animation.gif"))
} else{
centers <- quantile(tmp$center, seq(1,6)/6)
tmp_loess_fit <- crossing(center = centers, loess_fit) |>
group_by(center) |>
filter(day <= center) |>
ungroup()
tmp |> filter(center %in% centers) |>
ggplot() +
geom_smooth(aes(day, margin), method = "lm", se = FALSE) +
geom_point(aes(day, margin, size = weight), data = polls_2008, size = 3, alpha = .5, color = "grey") +
geom_point(aes(day, margin)) +
geom_line(aes(x = day, y = .fitted), data = tmp_loess_fit, color = "red") +
facet_wrap(~center, nrow = 2)
}
```
The final result is a smoother fit than the bin smoother since we use larger sample sizes to estimate our local parameters:
```{r final-loess}
total_days <- diff(range(polls_2008$day))
span <- 21/total_days
fit <- loess(margin ~ day, degree = 1, span = span, data = polls_2008)
polls_2008 |> mutate(smooth = fit$fitted) |>
ggplot(aes(day, margin)) +
geom_point(size = 3, alpha = .5, color = "grey") +
geom_line(aes(day, smooth), color = "red")
```
Different spans give us different estimates. Below we compare four window sizes:
```{r loess-multi-span-animation, echo=FALSE, warning=FALSE}
spans <- c(.66, 0.25, 0.15, 0.10)
fits <- tibble(span = spans) |>
group_by(span) |>
do(broom::augment(loess(margin ~ day, degree = 1, span = .$span, data = polls_2008)))
tmp <- fits |>
crossing(center = polls_2008$day) |>
mutate(dist = abs(day - center)) |>
filter(rank(dist) / n() <= span) |>
mutate(weight = (1 - (dist / max(dist)) ^ 3) ^ 3)
if (!file.exists(file.path(img_path, "loess-multi-span-animation.gif"))) {
p <- ggplot(tmp, aes(day, margin)) +
scale_size(range = c(0, 2)) +
geom_smooth(aes(group = center, frame = center, weight = weight), method = "lm", se = FALSE) +
geom_point(data = polls_2008, size = 2, alpha = .5, color = "grey") +
geom_line(aes(x = day, y = .fitted, frame = day, cumulative = TRUE),
data = fits, color = "red") +
geom_point(aes(size = weight, frame = center)) +
facet_wrap(~span)
gganimate(p, filename = file.path(img_path, "loess-multi-span-animation.gif"), interval = .1)
}
if (knitr::is_html_output()) {
knitr::include_graphics(file.path(img_path,"loess-multi-span-animation.gif"))
} else{
centers <- quantile(tmp$center, seq(1,3)/3)
tmp_fits <- crossing(center = centers, fits) |>
group_by(center) |>
filter(day <= center) |>
ungroup()
tmp |> filter(center %in% centers) |>
ggplot() +
geom_smooth(aes(day, margin), method = "lm", se = FALSE) +
geom_point(aes(day, margin, size = weight), data = polls_2008, size = 3, alpha = .5, color = "grey") +
geom_point(aes(day, margin)) +
geom_line(aes(x = day, y = .fitted), data = tmp_fits, color = "red") +
facet_grid(span ~ center)
}
```
Here are the final estimates:
```{r loess-final, echo=FALSE}
tmp |> ggplot(aes(day, margin)) +
geom_point(size = 2, alpha = .5, color = "grey") +
geom_line(aes(day, .fitted), data = fits, color = "red") +
facet_wrap(~span)
```
There are three other differences between `loess` and the typical bin smoother.
1\. Rather than keeping the bin size the same, `loess` keeps the number of points used in the local fit the same. This number is controlled via the `span` argument, which expects a proportion. For example, if `N` is the number of data points and `span=0.5`, then for a given $x$, `loess` will use the `0.5 * N` closest points to $x$ for the fit.
2\. When fitting a line locally, `loess` uses a *weighted* approach. Basically, instead of using least squares, we minimize a weighted version:
$$
\sum_{i=1}^N w_0(x_i) \left[Y_i - \left\{\beta_0 + \beta_1 (x_i-x_0)\right\}\right]^2
$$
However, instead of the Gaussian kernel, loess uses a function called the Tukey tri-weight:
$$
W(u)= \left( 1 - |u|^3\right)^3 \mbox{ if } |u| \leq 1 \mbox{ and } W(u) = 0 \mbox{ if } |u| > 1
$$
To define the weights, we denote $2h$ as the window size and define:
$$
w_0(x_i) = W\left(\frac{x_i - x_0}{h}\right)
$$
This kernel differs from the Gaussian kernel in that more points receive weights closer to the maximum:
```{r triweight-kernel, echo=FALSE, out.width="80%", fig.height=3, fig.width=6}
x_0 <- -125
tmp <- with(data.frame(day = seq(min(polls_2008$day), max(polls_2008$day), .25)),
ksmooth(day, 1*I(day == x_0), kernel = "normal", x.points = day, bandwidth = 7))
p1 <- data.frame(x = tmp$x, w_0 = tmp$y) |>
mutate(w_0 = w_0/sum(w_0)) |>
ggplot(aes(x, w_0)) +
geom_line() +
ggtitle("Normal")
p2 <- data.frame(x = seq(min(polls_2008$day), max(polls_2008$day), length.out = 100)) |>
mutate(w_0 = (1 - (abs(x - x_0)/21)^3)^3*I(abs(x - x_0) <= 21)) |>
ggplot(aes(x, w_0)) +
geom_line() +
ggtitle("Tri-weight")
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
3\. `loess` has the option of fitting the local model *robustly*. An iterative algorithm is implemented in which, after fitting a model in one iteration, outliers are detected and down-weighted for the next iteration. To use this option, we use the argument `family="symmetric"`.
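The last two points can be sketched in code. The tri-weight function is easy to write down, and the effect of `family = "symmetric"` can be seen on simulated data with an injected outlier; the function name and the simulated data are ours:

```r
# Sketch: the Tukey tri-weight kernel and a robust loess fit.
triweight <- function(u) ifelse(abs(u) <= 1, (1 - abs(u)^3)^3, 0)

# simulated data with one large outlier
set.seed(2)
x <- 1:100
y <- sin(x/10) + rnorm(100, 0, 0.1)
y[50] <- 5
fit_ls  <- loess(y ~ x, degree = 1, span = 0.3)
fit_rob <- loess(y ~ x, degree = 1, span = 0.3, family = "symmetric")
# the robust fit should be pulled less by the outlier near x = 50
```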
### Fitting parabolas
Taylor's theorem also tells us that if you look at any smooth function closely enough, it looks like a parabola. The theorem further states that you don't have to look as closely when approximating with parabolas as you do when approximating with lines. This means we can make our windows even larger and fit parabolas instead of lines.
$$
E[Y_i | X_i = x_i ] = \beta_0 + \beta_1 (x_i-x_0) + \beta_2 (x_i-x_0)^2 \mbox{ if } |x_i - x_0| \leq h
$$
You may have noticed that when we showed the code for using loess, we set `degree = 1`. This tells loess to fit polynomials of degree 1, a fancy name for lines. If you read the help page for loess, you will see that the argument `degree` defaults to 2. By default, loess fits parabolas not lines. Here is a comparison of the fitting lines (red dashed) and fitting parabolas (orange solid):
```{r polls-2008-parabola-line-loess}
total_days <- diff(range(polls_2008$day))
span <- 28/total_days
fit_1 <- loess(margin ~ day, degree = 1, span = span, data = polls_2008)
fit_2 <- loess(margin ~ day, span = span, data = polls_2008)
polls_2008 |> mutate(smooth_1 = fit_1$fitted, smooth_2 = fit_2$fitted) |>
ggplot(aes(day, margin)) +
geom_point(size = 3, alpha = .5, color = "grey") +
geom_line(aes(day, smooth_1), color = "red", lty = 2) +
geom_line(aes(day, smooth_2), color = "orange", lty = 1)
```
Using `degree = 2` gives more wiggly results. In general, we prefer `degree = 1` as it is less prone to this kind of noise.
### Beware of default smoothing parameters
`ggplot` uses loess in its `geom_smooth` function:
```{r ggplot-loess-default, warning=FALSE, message=FALSE}
polls_2008 |> ggplot(aes(day, margin)) +
geom_point() +
geom_smooth(method = loess)
```
But be careful with default parameters, as they are rarely optimal. Fortunately, they are easy to change:
```{r ggplot-loess-degree-1, warning=FALSE, message=FALSE}
polls_2008 |> ggplot(aes(day, margin)) +
geom_point() +
geom_smooth(method = loess, method.args = list(span = 0.15, degree = 1))
```
## Connecting smoothing to machine learning {#sec-smoothing-ml-connection}
To see how smoothing relates to machine learning with a concrete example, consider again our @sec-two-or-seven example. If we define the outcome $Y = 1$ for digits that are seven and $Y=0$ for digits that are 2, then we are interested in estimating the conditional probability:
$$
p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2).
$$
with $x_1$ and $x_2$ the two predictors defined in @sec-two-or-seven. In this example, the 0s and 1s we observe are "noisy" because for some regions the probabilities $p(\mathbf{x})$ are not that close to 0 or 1. We therefore need to estimate $p(\mathbf{x})$. Smoothing provides an alternative way to accomplish this. In @sec-two-or-seven, we saw that linear regression was not flexible enough to capture the non-linear nature of $p(\mathbf{x})$, so smoothing approaches offer an improvement. In @sec-knn-cv-intro, we describe a popular machine learning algorithm, k-nearest neighbors, which is based on the concept of smoothing.
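As a sketch of this idea in one dimension, smoothing 0/1 outcomes directly yields an estimate of the conditional probability; the simulated data and variable names below are ours:

```r
# Sketch: smoothing binary outcomes estimates Pr(Y = 1 | X = x).
set.seed(3)
n <- 500
x <- runif(n, -2, 2)
p <- plogis(2 * x)              # true conditional probability
y <- rbinom(n, 1, p)            # noisy 0/1 outcomes
fit <- loess(y ~ x, degree = 1, span = 0.5)
p_hat_0 <- predict(fit, newdata = data.frame(x = 0))
# at x = 0 the true probability is 0.5; p_hat_0 should be close to it
```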
## Exercises
1\. The **dslabs** package provides the following dataset with mortality counts for Puerto Rico for 2015-2018.
```{r, eval=FALSE}
library(dslabs)
head(pr_death_counts)
```
Remove data from before May 2018, then use the `loess` function to obtain a smooth estimate of the expected number of deaths as a function of date. Plot this resulting smooth function. Make the span about two months long.
2\. Plot the smooth estimates against day of the year, all on the same plot but with different colors.
3\. Suppose we want to predict 2s and 7s in our `mnist_27` dataset with just the second covariate. Can we do this? On first inspection it appears the data does not have much predictive power. In fact, if we fit a regular logistic regression, the coefficient for `x_2` is not significant!
```{r, eval = FALSE}
library(broom)
library(dslabs)
mnist_27$train |>
glm(y ~ x_2, family = "binomial", data = _) |>
tidy()
```
Plotting a scatterplot here is not useful since `y` is binary:
```{r, eval = FALSE}
qplot(x_2, y, data = mnist_27$train)
```
Fit a loess line to the data above and plot the results. Notice that there is predictive power, except the conditional probability is not linear.