materials/exercises/04-data-types.Rmd

---
title: "Data Types"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

```{r setup, include = FALSE}
library(tidyverse)
library(lubridate)
library(hms)
library(babynames)
library(nycflights13)
```


## Your Turn 1

Use `flights` to create `delayed`, the variable that displays whether a flight was delayed (`arr_delay > 0`).

Then, remove all rows that contain an NA in `delayed`. 

Finally, create a summary table that shows:

1. How many flights were delayed?
2. What proportion of flights were delayed?

```{r}

```


## Your Turn 2

Fill in the blanks to:

1. Isolate the last letter of every name.
2. Create a variable that indicates whether the last letter is one of "a", "e", "i", "o", "u", or "y".
3. Calculate the proportion of children whose name ends in a vowel, by **`year`** and **`sex`**.
4. Display the results as a line plot.

```{r}
babynames %>%
  ___(last = ___, 
      vowel = ___) %>%
  group_by(___) %>%
  ___(prop_vowel = weighted.mean(vowel, w = n)) %>%
  ggplot(mapping = aes(x = ___, y = ___)) +
  ___(mapping = ___)
```


## Your Turn 3

Repeat the previous exercise, some of whose code is below, to make a sensible graph of average TV consumption by marital status.

```{r}
gss_cat %>%
  drop_na(tvhours) %>%
  group_by(___) %>%
  summarize(___) %>%
  ggplot(mapping = ___) +
  geom_col()
```


## Your Turn 4

Do you think liberals or conservatives watch more TV?

Compute average tv hours by party ID an then plot the results.

```{r}

```


## Your Turn 5

What is the best time of day to fly?

Use the `hour` and `minute` variables in `flights` to compute the time of day for each flight as an hms. Then use a smooth line to plot the relationship between time of day and `arr_delay`.

```{r}

```


## Your Turn 6

Fill in the blanks to:

1. Extract the day of the week of each flight (as a full name) from `time_hour`.
2. Calculate the average `arr_delay` by day of the week.
3. Plot the results as a column chart (bar chart) with `geom_col()`.

```{r}
flights %>% 
  drop_na(arr_delay) %>% 
  mutate(weekday = ___) %>% 
  ___ %>% 
  ___(avg_delay = ___) %>% 
  ggplot(mapping = aes(x = weekday, y = avg_delay)) +
  ___
```


***

# Take Aways

dplyr gives you three _general_ functions for manipulating data: `mutate()`, `summarize()`, and `group_by()`. Augment these with functions from the packages below, which focus on specific types of data.

Package   | Data Type
--------- | --------
stringr   | strings
forcats   | factors
hms       | times
lubridate | dates and times