# Lab 9 (3/16): MLB Data

### Web pages
Course page: https://ambujtewari.github.io/teaching/STATS306-Winter2020/

Lab page: https://rogerfan.github.io/stats306_w20/

### Office Hours
    Mondays: 2-4pm, USB 2165
    
### Contact
    Questions on problems: Use the slack discussions
    If you need to email me, include in the subject line: [STATS 306]
    Email: rogerfan@umich.edu

In [None]:
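# library() stops with an error if a package is missing; require() only warns
library(tidyverse)
library(stringr)  # already attached by tidyverse; listed explicitly for clarity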

## `pivot_longer` and `pivot_wider` review

Remember that `pivot_longer` transforms datasets from "wide" to "long" by collecting several column names into the values of a single variable. `pivot_wider` does the opposite: it turns "long" datasets into "wide" ones by spreading a variable's values across several columns.

In [None]:
grades_wide = tribble(
  ~student,  ~`2015`, ~`2016`, ~`2017`,
'Roger',       83,      89,      93,
  'Jon',       92,      90,      93)
grades_wide

In [None]:
grades_long = grades_wide %>% 
    pivot_longer(
        `2015`:`2017`,     # Columns with data values
        names_to="year",   # New variable name in the long data to store the names
        values_to="grade"  # New variable name in the long data to store the values
    )

grades_long

In [None]:
grades_long %>% 
    pivot_wider(
        names_from=year,   # Variable from the long data where names are contained
        values_from=grade  # Variable from the long data where values are contained
    )

### `unite` and `separate`

`unite` combines the information from two or more variables into a single variable, and `separate` splits one variable that holds several pieces of information into multiple variables. Both are often useful when cleaning data, and they can also help with some data transformations.

In [None]:
table3

In [None]:
table3 %>% separate(rate, into=c('cases', 'pop'))

In [None]:
# Note that the new variables are still string variables in the above
# example. To convert them automatically, use the convert argument.
table3 %>% separate(rate, into=c('cases', 'pop'), convert=TRUE)

`unite` does the opposite, and can be useful for some data manipulations. It is also often used to create unique identification variables, for instance by combining state and district number or similar hierarchical identifiers.

In [None]:
voting <- tribble(
  ~year,  ~gender,  ~age, ~percentage,
    2018,  'm', 1824, 27.4,
    2018,  'f', 1824, 32.8,
    2018,  'm', 2544, 38.0,
    2018,  'f', 2544, 42.9,
    2014,  'm', 1824, 14.7,
    2014,  'f', 1824, 17.2,
    2014,  'm', 2544, 26.3,
    2014,  'f', 2544, 30.4,
)

voting

In [None]:
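# pivot_wider accepts several names_from columns and joins their values with "_" by default
voting %>% pivot_wider(names_from=c('gender', 'age'), values_from='percentage')

# Equivalent: unite gender and age into one variable first, then pivot on it
voting %>% unite(gender_age, gender, age) %>%
    pivot_wider(names_from=gender_age, values_from=percentage)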


## MLB Data

This dataset contains information for player-seasons in the American League from 2015 to 2018. For those unfamiliar with baseball:
* `PA`: Plate appearances, the number of times a player came up to bat.
* `HR`: The number of home runs.
* `BB`: The number of walks.
* `BBrate`: The number of walks as a percentage of plate appearances (BB/PA).
* `K`: The number of strikeouts.
* `AVG`: A batter's batting average.
* `FB`: The number of fly balls a batter hit.

Note: Also recall the functions [`unite`](https://tidyr.tidyverse.org/reference/unite.html), [`separate`](https://tidyr.tidyverse.org/reference/separate.html), and [`complete`](https://tidyr.tidyverse.org/reference/complete.html).
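
Since `complete` is not demonstrated elsewhere in this lab, here is a minimal sketch of what it does. The `seasons` data below is a hypothetical toy example, not part of the MLB dataset: `complete` adds a row for every combination of the listed variables that is missing from the data, and the optional `fill` argument sets the value used in those new rows.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: player "b" has no row for 2016
seasons <- tribble(
  ~player, ~year, ~HR,
  "a",     2015,  10,
  "a",     2016,  12,
  "b",     2015,   7
)

# Add one row per missing player-year combination;
# fill sets HR to 0 in the new rows instead of NA
seasons_full <- seasons %>%
  complete(player, year, fill = list(HR = 0))

seasons_full
```

Without the `fill` argument, the new rows would contain `NA` for `HR`.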

In [None]:
mlb = read_csv('https://raw.githubusercontent.com/rogerfan/stats306_f18_labs/master/mlb.csv')
head(mlb)

### Problem 1

Note that `BBrate` and `BB_K` were read in as strings. Clean up these variables and convert them to numeric variables. `BB_K` should be split into two integer variables named `BB` and `K`. For `BBrate`, recall the function `str_replace` ([documentation](https://stringr.tidyverse.org/reference/str_replace.html)).
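
As a reminder of how `str_replace` behaves, here is a small sketch on a hypothetical character vector. The strings below are made up for illustration, and the actual format of `BBrate` in `mlb` may differ, so inspect the column before adapting this.

```r
library(stringr)

# Hypothetical strings; check the real format of the column first
rate_strings <- c("8.5 %", "12.1 %")

# Replace the first "%" in each string with nothing, trim the
# leftover whitespace, then convert the result to numeric
rate_numeric <- as.numeric(str_trim(str_replace(rate_strings, "%", "")))

rate_numeric
```

`str_replace` only replaces the first match in each string; `str_replace_all` replaces every match.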

### Problem 2

Calculate the HR per FB rate for each team and year. Convert this to a wide dataset, so your variables should be `division`, `team`, and `2015`-`2018`, where values are the HR/FB rate. Note that you should ensure that `division` is still in the dataset.

Create a variable called `increased`, which checks if the HR/FB rate was higher in 2018 than it was in 2015 for that team.

Turn this back into a "long" dataset. Make sure you *do not* treat the `increased` variable as a values column, so the final dataset should have the variables `division`, `team`, `increased`, `year`, and `HR_FB`. Then plot HR/FB rate against year. Color the plot by `team`, facet it by `division`, and choose the linetype according to the `increased` variable.

### Problem 3

Go back to `mlb` and calculate total HRs and PAs per year per team. Create a wide version of this dataset. So there should be a `team` variable, then eight variables tracking values: `HR_2015`, `HR_2016`, `HR_2017`, `HR_2018`, `PA_2015`, `PA_2016`, `PA_2017`, and `PA_2018`.

Hint: Once you calculate the summary statistics you can use a `pivot_longer`, `unite`, and `pivot_wider` (noting that you can combine the `unite` operation into the `pivot_wider` command) in that order to create the wide version.
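
The pattern in the hint can be sketched on a toy dataset. Everything below (`toy`, `group`, `period`, `x`, `y`) is hypothetical and only illustrates the order of operations; it is not the MLB data.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy summary: one row per group and period, two value columns
toy <- tribble(
  ~group, ~period, ~x, ~y,
  "g1",   1,       10, 100,
  "g1",   2,       20, 200,
  "g2",   1,       30, 300,
  "g2",   2,       40, 400
)

# Step by step: stack x and y, glue the statistic name to the period,
# then spread the combined names into columns
toy_wide <- toy %>%
  pivot_longer(c(x, y), names_to = "stat", values_to = "value") %>%
  unite(stat_period, stat, period) %>%
  pivot_wider(names_from = stat_period, values_from = value)

# Same result with the unite step folded into pivot_wider
toy_wide2 <- toy %>%
  pivot_longer(c(x, y), names_to = "stat", values_to = "value") %>%
  pivot_wider(names_from = c(stat, period), values_from = value)

toy_wide
```

Both versions produce one row per `group` with the columns `x_1`, `y_1`, `x_2`, and `y_2`, because `unite` and `pivot_wider` both use `"_"` as the default separator.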

### Problem 4

The following code creates a dataset containing, for each player-season, the change in AVG from the previous season to the current season, as well as the change from the current season to the next season.

In [None]:
dat4 = mlb %>% filter(PA >= 200) %>%
    group_by(playerid) %>%
    arrange(playerid, year) %>%
    mutate(AVG_change = AVG - lag(AVG),
           next_AVG_change = lead(AVG) - AVG) %>%
    filter(!is.na(AVG)) %>%
    select(year:division, playerid, AVG:next_AVG_change)

head(dat4, 10)

Using this data, make two scatterplots where the `x`-axis is the current AVG and the `y`-axis is each of these change variables. Use only a single plotting command and faceting to accomplish this. What conclusions can you draw from these plots?

HINT: You will first need to do an additional data transformation involving a pivot.
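
The general idea of pivoting before faceting can be sketched on a hypothetical toy dataset. The names `toy`, `a`, `b`, and `measure` are made up for illustration. Once the two measurement columns are stacked, the former column names become a variable that `facet_wrap` can split on.

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Hypothetical toy data with two measurement columns, a and b
toy <- tribble(
  ~x, ~a, ~b,
   1,  2,  5,
   2,  4,  3,
   3,  6,  1
)

# Stack a and b so the column name becomes a variable to facet on
toy_long <- toy %>%
  pivot_longer(c(a, b), names_to = "measure", values_to = "value")

# One plotting command; facet_wrap draws one panel per value of measure
p <- ggplot(toy_long, aes(x = x, y = value)) +
  geom_point() +
  facet_wrap(~ measure)

p
```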