---
title: 'Solutions: Introduction to Outliers'
author: "STA 325: Lab 2, Fall 2018"
output: pdf_document
---
Today's agenda: finding outliers
Programming partners:
You should have a programming partner for each lab. Switch off who is programming, and use each other for help. We will spend about 30--50 minutes per week on lab exercises, and you will be expected to bring your laptops to class to work on them. The TA and I will be in class to help you.
***Background***
Identifying outliers in data is an important part of statistical analyses. One
simple rule of thumb (due to John Tukey) for finding outliers is based on the
quartiles of the data: the first quartile $Q_1$ is the value that is $\geq 1/4$ of the
data, the second quartile $Q_2$ (the median) is the value that is $\geq 1/2$ of the
data, and the third quartile $Q_3$ is the value that is $\geq 3/4$ of the data. The
interquartile range, $IQR$, is $Q_3 - Q_1$.
Tukey's rule says that the outliers are values more than $1.5$ times the interquartile range from the quartiles --- either below $Q_1 - 1.5 IQR$, or above $Q_3 + 1.5 IQR$.
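As a quick illustration of the rule, here is a minimal sketch on a small toy vector. The vector and the `toy.*` names are hypothetical and not part of the lab data; the sketch assumes R's default `quantile` method.

```{r}
# Toy data (hypothetical, not part of the lab)
toy <- c(1, 2, 3, 4, 100)
toy.q <- quantile(toy, c(0.25, 0.75), names = FALSE) # Q1 and Q3
toy.iqr <- toy.q[2] - toy.q[1]                       # interquartile range
toy.lower <- toy.q[1] - 1.5 * toy.iqr                # lower fence
toy.upper <- toy.q[2] + 1.5 * toy.iqr                # upper fence
c(lower = toy.lower, upper = toy.upper)
toy[toy < toy.lower | toy > toy.upper]               # values flagged as outliers
```

Here $Q_1 = 2$, $Q_3 = 4$, and $IQR = 2$, so the fences are $-1$ and $7$, and only 100 falls outside them.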
In this lab, we will consider the following data
```{r}
x <- c(2.2, 7.8, -4.4, 0.0, -1.2, 3.9, 4.9, 2.0, -5.7, -7.9, -4.9, 28.7, 4.9)
```
We will use these as part of writing a function to identify outliers according
to Tukey's rule. Our function will be called \texttt{tukey.outlier}, and will
take in a data vector, and return a Boolean vector, \texttt{TRUE} for the
outlier observations and \texttt{FALSE} elsewhere.
***Lab Tasks***
1. (5) Calculate the first quartile, the third quartile, and the
inter-quartile range of \texttt{x}. Some built-in R functions calculate
these; you cannot use them, but you could use other functions, like
\texttt{sort} and \texttt{quantile}.
```{r}
(Q1 <- quantile(x,0.25)) #first quartile
(Q3 <- quantile(x,0.75)) #third quartile
(iqr.x <- Q3-Q1) #inter-quartile range
```
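To see what \texttt{quantile} is doing, the same quartiles can be computed from the sorted data. The sketch below assumes R's default method (type 7), which interpolates linearly between order statistics; \texttt{manual.quantile} and \texttt{x.demo} are hypothetical names introduced only for this illustration.

```{r}
# Hypothetical helper: type-7 quantile computed by hand from the sorted data
manual.quantile <- function(v, p) {
  v.sorted <- sort(v)
  h <- (length(v.sorted) - 1) * p + 1 # fractional position in the sorted data
  lo <- floor(h)
  hi <- ceiling(h)
  # linear interpolation between the two neighbouring order statistics
  v.sorted[lo] + (h - lo) * (v.sorted[hi] - v.sorted[lo])
}
# Copy of the lab data, so this chunk does not depend on earlier chunks
x.demo <- c(2.2, 7.8, -4.4, 0.0, -1.2, 3.9, 4.9, 2.0, -5.7, -7.9, -4.9, 28.7, 4.9)
manual.quantile(x.demo, 0.25)
manual.quantile(x.demo, 0.75)
all.equal(manual.quantile(x.demo, 0.25), quantile(x.demo, 0.25, names = FALSE))
all.equal(manual.quantile(x.demo, 0.75), quantile(x.demo, 0.75, names = FALSE))
```

With 13 observations, the positions for $p = 0.25$ and $p = 0.75$ are exactly 4 and 10, so the quartiles are simply the 4th and 10th sorted values.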
2. (10) Write a function, \texttt{quartiles}, which takes a data vector and
returns a vector of three components, the first quartile, the third quartile,
and the inter-quartile range. Show that it gives the right answers on
\texttt{x}. (You do not have to write a formal test for \texttt{quartiles}.)
```{r}
quartiles <- function(x) {
  q1 <- quantile(x, 0.25, names=FALSE) # first quartile
  q3 <- quantile(x, 0.75, names=FALSE) # third quartile
  return(c(first=q1, third=q3, iqr=q3-q1))
}
```
Let's check that our function, applied to the vector \texttt{x}, returns the answers from task 1.
The code below shows that the first quartile, third quartile, and IQR returned by our new function match the computations from task 1.
```{r}
quartiles(x)
```
3. (5) Which points in \texttt{x} are outliers, according to Tukey's rule, if
any?
Recall that Tukey's rule says that the outliers are values more than $1.5$ times the interquartile range from the quartiles --- either below $Q_1 - 1.5 IQR$, or above $Q_3 + 1.5 IQR$.
We can see below that the only outlier by Tukey's rule is 28.7.
```{r}
Q1 - 1.5*iqr.x # lower fence
Q3 + 1.5*iqr.x # upper fence
x[x > Q3 + 1.5*iqr.x] # values above the upper fence
x[x < Q1 - 1.5*iqr.x] # values below the lower fence
```
4. (20) Write \texttt{tukey.outlier}, using your \texttt{quartiles}
function. The function should take a single data vector and return a
Boolean vector, \texttt{TRUE} for the outlier observations and
\texttt{FALSE} elsewhere. Show that it passes \texttt{test.tukey.outlier}.
```{r}
# Input: a numeric data vector x
# Output: Boolean vector, TRUE where x is an outlier by Tukey's rule
tukey.outlier <- function(x) {
  q <- quartiles(x) # components: first quartile, third quartile, IQR
  lower.limit <- q[1] - 1.5*q[3]
  upper.limit <- q[2] + 1.5*q[3]
  outliers <- ((x < lower.limit) | (x > upper.limit))
  return(outliers)
}
tukey.outlier(x)
```
5. (20) Write a function, \texttt{test.tukey.outlier}, which tests the
function \texttt{tukey.outlier} against your answer.
This function should return \texttt{TRUE} if \texttt{tukey.outlier} works
properly; otherwise, it can either return \texttt{FALSE}, or an error
message, as you prefer.
```{r}
# Input: Nothing
# Output: Boolean
test.tukey.outlier <- function() {
  x <- c(2.2, 7.8, -4.4, 0.0, -1.2, 3.9, 4.9, 2.0, -5.7, -7.9, -4.9, 28.7, 4.9)
  # Expected pattern: only the 12th observation (28.7) is an outlier
  x.pattern <- rep(FALSE, length(x))
  x.pattern[12] <- TRUE
  stopifnot(all(tukey.outlier(x) == x.pattern))
  return(TRUE)
}
test.tukey.outlier()
```
Remark: since \texttt{tukey.outlier(x)} and \texttt{x.pattern} are both
vectors, \texttt{==} will compare them element by element, giving us yet
another Boolean vector. We want to summarize this in a single TRUE/FALSE
value, hence we use the \texttt{all} command.
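The behaviour described in the remark can be seen on two short Boolean vectors. This is only an illustrative sketch; \texttt{a.demo} and \texttt{b.demo} are hypothetical names that do not appear elsewhere in the lab.

```{r}
a.demo <- c(TRUE, FALSE, TRUE)
b.demo <- c(TRUE, FALSE, FALSE)
a.demo == b.demo        # element-by-element comparison: TRUE TRUE FALSE
all(a.demo == b.demo)   # FALSE, because the third elements differ
all(a.demo == a.demo)   # TRUE, because every element matches
```

Passing a single summarized value to \texttt{stopifnot} keeps the intent of the test explicit: the test fails if any element differs.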
6. (5) Which data values should be outliers in \texttt{-x}?
```{r}
tukey.outlier(-x)
```
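The same observation as before (the 12th, which is 28.7 in \texttt{x} and $-28.7$ in \texttt{-x}) is the only outlier. Negating the data mirrors the quartiles and leaves the IQR unchanged, so Tukey's rule flags the same positions.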
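The symmetry can be checked numerically. The sketch below assumes R's default \texttt{quantile} method, under which the first quartile of the negated data is minus the third quartile of the original data, and vice versa; \texttt{v.demo}, \texttt{q.pos}, and \texttt{q.neg} are hypothetical names.

```{r}
# Copy of the lab data, so this chunk does not depend on earlier chunks
v.demo <- c(2.2, 7.8, -4.4, 0.0, -1.2, 3.9, 4.9, 2.0, -5.7, -7.9, -4.9, 28.7, 4.9)
q.pos <- quantile(v.demo, c(0.25, 0.75), names = FALSE)  # Q1, Q3 of the data
q.neg <- quantile(-v.demo, c(0.25, 0.75), names = FALSE) # Q1, Q3 of the negated data
all.equal(q.neg, -rev(q.pos))        # quartiles are mirrored
all.equal(diff(q.neg), diff(q.pos))  # the IQR is unchanged
```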
7. (5) Which data values should be outliers in \texttt{100*x}?
```{r}
tukey.outlier(100*x)
```
Multiplying all the values by 100 also multiplies the quartiles and the IQR by 100, so once again only the next to last value is an outlier in \texttt{100*x}.
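The scaling argument can be written out for any constant $c > 0$ (here $c = 100$), assuming the quartiles scale with the data, as they do for R's \texttt{quantile}:

$$
\begin{aligned}
Q_1(cx) - 1.5\, IQR(cx) &= c\, Q_1(x) - 1.5\, c\, IQR(x) = c\,\bigl(Q_1(x) - 1.5\, IQR(x)\bigr), \\
Q_3(cx) + 1.5\, IQR(cx) &= c\, Q_3(x) + 1.5\, c\, IQR(x) = c\,\bigl(Q_3(x) + 1.5\, IQR(x)\bigr).
\end{aligned}
$$

Both fences are multiplied by $c$, so $c x_i$ lies outside the fences of $cx$ exactly when $x_i$ lies outside the fences of $x$.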
8. (10) Let's modify \texttt{test.tukey.outlier} to include two test cases for tasks 6 and 7.
```{r}
# Inputs: none
# Output: TRUE if all tests pass, else stops with an error
test.tukey.outlier <- function() {
  x <- c(2.2, 7.8, -4.4, 0.0, -1.2, 3.9, 4.9, 2.0, -5.7, -7.9, -4.9, 28.7, 4.9)
  # Expected pattern: only the 12th observation (28.7) is an outlier
  x.pattern <- rep(FALSE, length(x))
  x.pattern[12] <- TRUE
  stopifnot(all(tukey.outlier(x) == x.pattern))
  # Negating or rescaling the data should flag the same positions
  stopifnot(all(tukey.outlier(-x) == tukey.outlier(x)))
  stopifnot(all(tukey.outlier(100*x) == tukey.outlier(x)))
  return(TRUE)
}
test.tukey.outlier()
```
9. (5) Show that your \texttt{tukey.outlier} function passes the new set of
tests, or modify it until it does.
```{r}
test.tukey.outlier()
```
10. (15) According to Tukey's rule, which points in the next vector $y$ are
outliers? What is the output of your function? If they differ, explain why.
```{r}
y <- c(11.0, 14.0, 3.5, 52.5, 21.5, 12.7, 16.7, 11.7, 10.8, -9.2, 12.3, 13.8, 11.1)
```
```{r}
tukey.outlier(y)
which(tukey.outlier(y))
```
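The hand calculation for this task can be sketched as follows, again assuming R's default \texttt{quantile} method. The names \texttt{y.demo}, \texttt{q.y}, \texttt{fences.y}, and \texttt{out.y} are hypothetical and used only here.

```{r}
# Copy of y, so this chunk does not depend on earlier chunks
y.demo <- c(11.0, 14.0, 3.5, 52.5, 21.5, 12.7, 16.7, 11.7, 10.8, -9.2, 12.3, 13.8, 11.1)
q.y <- quantile(y.demo, c(0.25, 0.75), names = FALSE) # Q1 and Q3
iqr.y <- q.y[2] - q.y[1]                              # interquartile range
fences.y <- c(lower = q.y[1] - 1.5 * iqr.y, upper = q.y[2] + 1.5 * iqr.y)
fences.y
out.y <- y.demo < fences.y["lower"] | y.demo > fences.y["upper"]
y.demo[out.y]  # outlying values
which(out.y)   # their positions in y
```

Here $Q_1 = 11.0$, $Q_3 = 14.0$, and $IQR = 3.0$, so the fences are $6.5$ and $18.5$. Four values fall outside them: 3.5, 52.5, 21.5, and $-9.2$, at positions 3, 4, 5, and 10, which is what the function output can be compared against.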
We can see that our function has passed our simple sanity checks for both data vectors that we have used, so we should feel comfortable using this code again in other exercises.