# Read these instructions completely in order to receive full credit

- Before you submit the problem set, make sure everything runs as expected. Go to the menu bar at the top of Jupyter Notebook and click `Kernel > Restart & Run All`. Your code should run from top to bottom with no errors. Failure to do this will result in loss of points.

- You should not use `install.packages()` anywhere. You may assume that we have already installed all the packages needed to run your code.

- Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE" and delete the `stop()` functions. The `stop()` functions produce an error and are there to remind you of cells that need an answer.

- If you are working in a group, make sure you and your collaborators have been [added to a group on Canvas](https://umich.instructure.com/courses/270337/discussion_topics/658777). You should also specify your group members when submitting to Gradescope.
- As a backup, *also* fill in your uniqid as well as those of your collaborators below:

Your uniqid: `<replace with your uniqid>`

Uniqids of your collaborator(s): `<replace with their uniqids>`

- This assignment should be submitted to both Canvas and Gradescope using the [instructions](https://piazza.com/class/jqh1wx3xw9amg?cid=55) posted on Piazza. As this problem set contains some questions which cannot be autograded, **you must upload a PDF to Canvas in order to receive full credit.**

---

In [None]:
library(tidyverse)
library(nycflights13)

# STATS 306
## Problem Set 3: Data manipulation using `dplyr`
Each question is worth two points, for a total of 20. Problems with a `**` after the problem number are challenge problems.

## Billionaires
Problems 1-3 are based on a dataset on billionaires that is included along with this problem set:

In [None]:
load("bil.RData")
print(bil)

The columns of `bil` include things like age, how the billionaire(s) made their wealth, their country of residence, etc. We will study this dataset in more detail in a coming lecture.

#### Problem 1
In which country are billionaires oldest on average? Youngest? (Assume all billionaires reside in the country of their citizenship.) Do not count any country which has fewer than five observations. Store your answers in variables `oldest1` and `youngest1`, respectively.
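
As a rough illustration of the general pattern (a grouped average restricted to groups of a minimum size), here is a minimal sketch on made-up data. The tibble `toy`, its columns `group` and `value`, and the cutoff of 2 are all hypothetical and are not taken from `bil`.

```r
library(dplyr)

# Hypothetical toy data, NOT the `bil` dataset
toy <- tibble(
  group = c("a", "a", "a", "b", "b", "c"),
  value = c(10, 20, 30, 5, 15, 100)
)

# Mean of `value` per group, keeping only groups with at least 2 rows
toy_summary <- toy %>%
  group_by(group) %>%
  summarize(n = n(), mean_value = mean(value)) %>%
  filter(n >= 2) %>%
  arrange(desc(mean_value))

print(toy_summary)
```

Group `c` has the largest mean but is dropped because it has only one observation, which is exactly why the size filter matters.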

In [None]:
# YOUR CODE HERE
stop()

In [None]:
stopifnot(exists("oldest1"))
stopifnot(exists("youngest1"))

#### Problem 2
Each billionaire has a `category` indicating how they made their fortune. The overall distribution of categories is:

In [None]:
table(bil$category)

After excluding billionaires with missing ages, group them into three age brackets: "40 and under", "41 to 65", and "above 65". (There are 90 billionaires aged 40 and under, for example.) What is the most common category of billionaire in each of the three age brackets? Store your answer in `table2`. The table should have three rows (one per age bracket) and three columns: `age_bracket`, `most_common_category`, and `n`, the number of billionaires in the most common category. Sort the table in descending order of `n`.
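
One possible way to turn a numeric variable into labeled brackets is base R's `cut()`. The sketch below uses a hypothetical vector `ages` and is only meant to show how the interval endpoints behave; other approaches (for example `case_when()`) would work as well.

```r
# Hypothetical ages, chosen to sit on the bracket boundaries
ages <- c(25, 40, 41, 65, 66, NA)

# By default cut() uses right-closed intervals: (-Inf, 40], (40, 65], (65, Inf)
brackets <- cut(
  ages,
  breaks = c(-Inf, 40, 65, Inf),
  labels = c("40 and under", "41 to 65", "above 65")
)

print(data.frame(ages, brackets))
```

Note that a missing age produces a missing bracket, so rows with `NA` ages still need to be excluded separately.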

In [None]:
# YOUR CODE HERE
stop()

In [None]:
stopifnot(exists("table2"))

#### Problem 3**
Define a country's "gender gap" to be the difference in the percentage of male and female billionaires. (Hence, it is equal to zero if the country has exactly equal numbers of male and female billionaires.) 
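
To make the definition concrete, here is a small worked example for a single made-up country. It assumes the gap is expressed as a proportion between 0 and 1 rather than on a 0-100 scale; the vector `gender` is hypothetical.

```r
# Hypothetical country with three male and one female billionaire
gender <- c("male", "male", "male", "female")

share_male   <- mean(gender == "male")    # 3/4
share_female <- mean(gender == "female")  # 1/4

# Gender gap as defined above: difference between the two shares
gap <- share_male - share_female
print(gap)
```

Under this assumption a country with no female billionaires has a gap of 1, and a country at parity has a gap of 0.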

Only one country has an equal number of male and female billionaires. The average gender gap across all countries is 0.85. If we plot countries according to their deviation from 0 (parity), grouping the 41 countries with no female billionaires into a single category, we obtain the following plot:

![image.png](attachment:image.png)

Recreate this plot. (If the billionaire represents a married couple, count it as both a male and female billionaire. If the billionaire represents a family fortune, drop it before summarizing the data.)

In [None]:
# YOUR CODE HERE
stop()

## Flights
The remaining problems pertain to the `flights` table.

#### Problem 4
Recall that each airplane has a unique tail number given by `tailnum`. Find the tail number of the airplane which flew to the largest number of *unique* destinations from any of the three departure airports in `flights`. Store the string containing this tail number in a variable called `most_dests`.

In [None]:
# YOUR CODE HERE
stop()

In [None]:
stopifnot(exists("most_dests"))

#### Problem 5**
The following code adds a variable `week` to `flights`, such that `week==1` for the first seven days of the year, `week==2` for days 8-14, etc. (In the second half of the semester we will learn how to manipulate dates using the `lubridate` package.)

In [None]:
flights_week = mutate(flights, week=lubridate::week(time_hour))
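
If you want to convince yourself that `lubridate::week()` matches the description above, a quick check on a few hand-picked dates can help. The dates below are chosen purely for illustration, and the sketch assumes `lubridate` is installed.

```r
# A few hand-picked 2013 dates around the week boundaries
dates <- as.Date(c("2013-01-01", "2013-01-07", "2013-01-08", "2013-12-31"))

# week() counts complete seven-day periods since January 1, plus one
weeks <- lubridate::week(dates)
print(data.frame(dates, weeks))
```

January 7 still falls in week 1 and January 8 starts week 2; the last day of a non-leap year ends up alone in week 53.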

Let a flight's "positive arrival delay" be defined as the larger of `arr_delay` and zero. We say a flight is *ridiculously late* if its arrival delay was more than ten times the average positive arrival delay for all flights in that week.
- Use the `flights_week` table to calculate the number of ridiculously late flights in each week of the year. For example, in the first week of the year there were 81 ridiculously late flights.
- Also add in the total number of flights in the data set for each week. 

Sort the resulting table in descending order of the number of ridiculously late flights and store it in a variable called `table5`. The table should have three columns, `week`, `n`, and `n_ridiculously_late`.

(*Hint*: Many students try to use the `max()` command for this exercise. This is the right idea, but make sure you understand what this function does: `?max`).
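
To see why the hint matters, compare `max()` with its elementwise relative `pmax()` (both documented on the `?max` help page) on a small made-up vector. The vector `delays` is hypothetical.

```r
# Hypothetical arrival delays in minutes (negative means early)
delays <- c(-5, 12, 0, 30)

# max() collapses all of its arguments into one single number
overall_max <- max(delays, 0)

# pmax() compares element by element and returns a vector of the same length
elementwise_max <- pmax(delays, 0)

print(overall_max)
print(elementwise_max)
```

Inside `mutate()`, the first form would silently recycle one number across every row, whereas the second yields one value per flight.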

In [None]:
# YOUR CODE HERE
stop()

In [None]:
stopifnot(exists("table5"))

#### Problem 6
Use your solution to Problem 5 to generate a bar plot of the number of ridiculously late flights in each week. Give your plot an appropriate title and axis labels.

In [None]:
# YOUR CODE HERE
stop()

#### Problem 7
Your plot from the preceding problem should exhibit a curious feature: in a couple of weeks there were far fewer ridiculously late flights than in the others.
- Investigate this further by determining the fraction of departure times which were missing in each week. For example, in week 1, 0.57% of flights had missing departure times.
- Additionally, rank each week by this fraction. The week with the highest fraction of missing departure times should have rank one, second highest rank two, and so on.
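
The two ingredients here, a per-group fraction of missing values and a descending rank, can be sketched on made-up data as follows. The tibble `toy` and its columns `grp` and `x` are hypothetical, and `min_rank(desc(...))` is just one reasonable choice of ranking function.

```r
library(dplyr)

# Hypothetical toy data with some missing values
toy <- tibble(
  grp = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
  x   = c(5, NA, 7, 8, NA, NA, 1, 2, 3, 4)
)

toy_ranked <- toy %>%
  group_by(grp) %>%
  # The mean of a logical vector is the fraction of TRUE values
  summarize(frac_missing = mean(is.na(x))) %>%
  # Rank 1 goes to the largest fraction
  mutate(rank = min_rank(desc(frac_missing))) %>%
  arrange(rank)

print(toy_ranked)
```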

Store the result in a variable called `table7`. 
Your table should have three columns: `week`, `frac_miss_dep_time` and `rank`. Sort your table in ascending order of `rank`.

In [None]:
# YOUR CODE HERE
stop()

In [None]:
stopifnot(exists("table7"))

#### Problem 8
For the week with the highest fraction of missing departure times in Problem 7, generate a table `table8` which shows the total number of missing departure times for each hour and day of that week. Your table should have columns `year`, `month`, `day`, `hour`, and `n_miss_dep_time`. Sort your table in chronological order.

In [None]:
# YOUR CODE HERE
stop()

In [None]:
stopifnot(exists("table8"))

#### Problem 9
Two days in `table8` should stand out from the rest. To figure out what is going on, we will join some weather data from the `weather` table. Since we have not yet covered joins, this table is provided for you:

In [None]:
table9 = weather %>% filter(origin=="LGA") %>% left_join(table8, .)

Define a new variable `snowfall` in `table9`, which is equal to hourly precipitation (in millimeters) if the temperature is below 36 degrees Fahrenheit in that hour, and zero otherwise.
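
A variable that takes one value when a condition holds and another otherwise can be built with `if_else()`. The sketch below uses a hypothetical tibble `toy_weather` and a hypothetical column name `frozen_precip`; it leaves the precipitation values in whatever unit they are given and does not attempt any unit conversion.

```r
library(dplyr)

# Hypothetical hourly readings, NOT the `weather` table
toy_weather <- tibble(
  temp   = c(30, 36, 40, 20),
  precip = c(0.1, 0.2, 0.3, 0)
)

# Keep precipitation only in hours strictly below the threshold, else zero
toy_weather <- toy_weather %>%
  mutate(frozen_precip = if_else(temp < 36, precip, 0))

print(toy_weather)
```

An hour at exactly 36 degrees does not satisfy the strict inequality, so it gets zero.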

Use `table9` to generate two plots: 

1. A bar plot of total snowfall for each day of the week in question, and
2. A line plot showing the total number of flights with missing departure times for each day.

In [None]:
# YOUR CODE HERE
stop()

#### Problem 10
In your own words, summarize your findings from Problems 7-9. What do missing departure times likely represent in these data?

YOUR ANSWER HERE