-
Notifications
You must be signed in to change notification settings - Fork 2
/
04-data-types.Rmd
120 lines (79 loc) · 2.48 KB
/
04-data-types.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
---
title: "Data Types"
output: html_notebook
editor_options:
chunk_output_type: inline
---
```{r setup, include = FALSE}
library(tidyverse)
library(lubridate)
library(hms)
library(babynames)
library(nycflights13)
```
## Your Turn 1
Use `flights` to create `delayed`, the variable that displays whether a flight was delayed (`arr_delay > 0`).
Then, remove all rows that contain an NA in `delayed`.
Finally, create a summary table that shows:
1. How many flights were delayed?
2. What proportion of flights were delayed?
```{r}
```
## Your Turn 2
Fill in the blanks to:
1. Isolate the last letter of every name.
2. Create a variable that indicates whether the last letter is one of "a", "e", "i", "o", "u", or "y".
3. Calculate the proportion of children whose name ends in a vowel, by **`year`** and **`sex`**.
4. Display the results as a line plot.
```{r}
babynames %>%
___(last = ___,
vowel = ___) %>%
group_by(___) %>%
___(prop_vowel = weighted.mean(vowel, w = n)) %>%
ggplot(mapping = aes(x = ___, y = ___)) +
___(mapping = ___)
```
## Your Turn 3
Repeat the previous exercise, some of whose code is below, to make a sensible graph of average TV consumption by marital status.
```{r}
gss_cat %>%
drop_na(tvhours) %>%
group_by(___) %>%
summarize(___) %>%
ggplot(mapping = ___) +
geom_col()
```
## Your Turn 4
Do you think liberals or conservatives watch more TV?
Compute average tv hours by party ID an then plot the results.
```{r}
```
## Your Turn 5
What is the best time of day to fly?
Use the `hour` and `minute` variables in `flights` to compute the time of day for each flight as an hms. Then use a smooth line to plot the relationship between time of day and `arr_delay`.
```{r}
```
## Your Turn 6
Fill in the blanks to:
1. Extract the day of the week of each flight (as a full name) from `time_hour`.
2. Calculate the average `arr_delay` by day of the week.
3. Plot the results as a column chart (bar chart) with `geom_col()`.
```{r}
flights %>%
drop_na(arr_delay) %>%
mutate(weekday = ___) %>%
___ %>%
___(avg_delay = ___) %>%
ggplot(mapping = aes(x = weekday, y = avg_delay)) +
___
```
***
# Take Aways
dplyr gives you three _general_ functions for manipulating data: `mutate()`, `summarize()`, and `group_by()`. Augment these with functions from the packages below, which focus on specific types of data.
Package | Data Type
--------- | --------
stringr | strings
forcats | factors
hms | times
lubridate | dates and times