
# More Practice Stacking and Unstacking in R

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


# Stack and Unstack


* `library(tidyr)`
* Stack $\rightarrow$ `gather`
* Unstack $\rightarrow$ `spread`
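
As a quick refresher, the round trip below stacks two columns and then unstacks them again. The `scores` table and its column names are hypothetical toy data, not part of this activity. The second half repeats the same reshaping with `pivot_longer` and `pivot_wider`, the newer `tidyr` verbs that supersede `gather` and `spread`; this notebook keeps using `gather` and `spread`.

```r
library(tidyr)

# Hypothetical toy data: one row per student, one column per quiz.
scores <- data.frame(student = c("A", "B"),
                     quiz1 = c(8, 6),
                     quiz2 = c(9, 7))

# Stack: quiz1 and quiz2 become (quiz, score) pairs, one row per pair.
stacked <- gather(scores, key = "quiz", value = "score", quiz1:quiz2)

# Unstack: spread the labels in `quiz` back out into separate columns.
unstacked <- spread(stacked, key = quiz, value = score)

# The same reshaping with the newer tidyr verbs.
stacked_new <- pivot_longer(scores, cols = quiz1:quiz2,
                            names_to = "quiz", values_to = "score")
unstacked_new <- pivot_wider(stacked_new, names_from = quiz,
                             values_from = score)
```

Stacking turns the 2 x 3 table into one with four rows (one per student and quiz), and unstacking recovers the original layout.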

In [None]:
library(dplyr)
library(tidyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



## The stack + mutate + aggregate + unstack trick

Recall that stacking and unstacking columns lets us apply the same transformation to many columns at once: stack the columns, transform the single stacked column, aggregate if needed, and then unstack the result.

### Example - Recoding auto sales

In [3]:
sales <- read.csv("https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/auto_sales.csv")
sales

Salesperson,Compact,Sedan,SUV,Truck
<chr>,<int>,<int>,<int>,<int>
Ann,22,18,15,12
Bob,19,12,17,20
Yolanda,19,8,32,15
Xerxes,12,23,18,9


In [4]:
(sales
 %>% gather(key = "auto_type",
            value = "num_sales",
            Compact:Truck)
 %>% mutate(car_type = recode(auto_type,
                             `Compact` = 'car',
                             `Sedan` = 'car',
                             `SUV` = 'utility',
                             `Truck` = 'utility'))
 %>% group_by(Salesperson,
              car_type)
 %>% summarize(total_sales = sum(num_sales))
 %>% spread(key = car_type,
            value = total_sales)
 )

[1m[22m`summarise()` has grouped output by 'Salesperson'. You can override using the
`.groups` argument.


Salesperson,car,utility
<chr>,<int>,<int>
Ann,40,27
Bob,31,37
Xerxes,35,27
Yolanda,27,47
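
The message from `summarise()` above is informational. One way to silence it, sketched below, is to pass `.groups = "drop"`, which also returns an ungrouped table. To keep the sketch runnable without downloading the CSV, the values from the printed `sales` table are typed in by hand; the names `sales_local` and `totals` are introduced here only for illustration.

```r
library(dplyr)
library(tidyr)

# Same values as the printed sales table, entered by hand.
sales_local <- data.frame(
  Salesperson = c("Ann", "Bob", "Yolanda", "Xerxes"),
  Compact = c(22L, 19L, 19L, 12L),
  Sedan   = c(18L, 12L, 8L, 23L),
  SUV     = c(15L, 17L, 32L, 18L),
  Truck   = c(12L, 20L, 15L, 9L)
)

totals <- (sales_local
  # Stack the four vehicle columns into label/value pairs.
  %>% gather(key = "auto_type", value = "num_sales", Compact:Truck)
  # Collapse the four labels into two broader categories.
  %>% mutate(car_type = recode(auto_type,
                               `Compact` = 'car',
                               `Sedan` = 'car',
                               `SUV` = 'utility',
                               `Truck` = 'utility'))
  %>% group_by(Salesperson, car_type)
  # .groups = "drop" removes the grouping and the message.
  %>% summarize(total_sales = sum(num_sales), .groups = "drop")
  # Unstack the two categories into their own columns.
  %>% spread(key = car_type, value = total_sales)
)
totals
```

The totals match the table above; the only difference is that the result is no longer grouped by `Salesperson`.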


## <font color="red"> Exercise 8.4.1 </font>

Recall that the MoMA `Artists.csv` data has two columns (`BeginDate` and `EndDate`) that need to be cleaned up by replacing zeros with a better representation of missing values, namely `NA` in R.

Since we need to perform the same transformations on both columns, we can use the stack + transform + unstack trick to clean both columns at once.

In [None]:
artist <- read.csv("https://github.com/MuseumofModernArt/collection/raw/master/Artists.csv")
head(artist)

ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki.QID,ULAN
1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,
6,Danilo Aroldi,"Italian, born 1925",Italian,Male,1925,0,,


**Task:** Fix this issue as follows.

1. Use `gather` to stack the two columns.
2. Use `mutate` and `ifelse` to replace all zeros with `NA`.
3. Use `spread` to unstack the two columns, this time giving them more meaningful names.
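
If the pattern is hard to picture, here is a minimal sketch of the same three steps on a toy table rather than the MoMA data. The `people` data frame, its columns `began` and `ended`, and the new names `begin_year` and `end_year` are all hypothetical; adapting the idea to `BeginDate` and `EndDate` is left as the exercise.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 0 stands in for "unknown" in both year columns.
people <- data.frame(id = 1:3,
                     began = c(1930L, 1936L, 0L),
                     ended = c(1992L, 0L, 0L))

cleaned <- (people
  # Step 1: stack the two year columns into label/value pairs.
  %>% gather(key = "which_year", value = "year", began, ended)
  # Step 2: one ifelse() now cleans both original columns;
  # recode() relabels the keys so the unstacked columns get new names.
  %>% mutate(year = ifelse(year == 0, NA, year),
             which_year = recode(which_year,
                                 began = "begin_year",
                                 ended = "end_year"))
  # Step 3: unstack, producing one column per (renamed) label.
  %>% spread(key = which_year, value = year)
)
cleaned
```

Renaming the labels while the data is stacked is one design choice among several; renaming the columns after `spread` would work as well.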

In [None]:
# Your code here

## <font color="red"> Exercise 8.4.2 </font>

In this exercise we will visualize the effect of the introduction of the designated hitter by comparing the best team-wide earned run average (ERA) in each league. In the process, you will see an important application of reshaping tables when creating visualizations.

Take a look at the `Teams.csv` file, which contains team-by-team statistics for each season. We will focus on the ERA, which measures the average number of earned runs allowed by each team’s pitchers per nine innings, with a smaller number indicating better pitching and defense.
Your job is to recreate the following graph.

<img src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/img/min_era.png"/>

In [None]:
teams <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Teams.csv')
head(teams)

yearID,lgID,teamID,franchID,divID,Rank,G,Ghome,W,L,⋯,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro
1871,,BS1,BNA,,3,31,,20,10,⋯,24,0.834,Boston Red Stockings,South End Grounds I,,103,98,BOS,BS1,BS1
1871,,CH1,CNA,,2,28,,19,9,⋯,16,0.829,Chicago White Stockings,Union Base-Ball Grounds,,104,102,CHI,CH1,CH1
1871,,CL1,CFC,,8,29,,10,19,⋯,15,0.818,Cleveland Forest Citys,National Association Grounds,,96,100,CLE,CL1,CL1
1871,,FW1,KEK,,7,19,,7,12,⋯,8,0.803,Fort Wayne Kekiongas,Hamilton Field,,101,107,KEK,FW1,FW1
1871,,NY2,NNA,,5,33,,16,17,⋯,14,0.84,New York Mutuals,Union Grounds (Brooklyn),,90,88,NYU,NY2,NY2
1871,,PH1,PNA,,1,28,,21,7,⋯,13,0.845,Philadelphia Athletics,Jefferson Street Grounds,,102,98,ATH,PH1,PH1


**Tasks:**

1. Filter the data to only the years after World War II (1946+).
2. Group and aggregate the data to compute the minimum ERA for each league for each season.
3. Split the min(ERA) by the leagues so that you have the two columns of min(ERA) values—one for each league—with one row per year.
4. Compute AL – NL, storing the result in a new column.
5. Stack the data for the AL, NL, and AL – NL, with the labels column called `Type` and the data column called `min(ERA)`.
6. Save the resulting data frame to a variable named `min_era_by_league`.
7. Use `ggplot` to recreate the plot. Hint: Use `geom_hline` and `geom_vline` to add the reference lines and `annotate` to add the annotation.
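
The reshaping in steps 3 to 5 and the plotting helpers in step 7 can be sketched on a toy table. Everything below is hypothetical: the `toy` data frame holds made-up numbers (not real ERA values), and the column names, reference-line positions, and annotation text are placeholders chosen for illustration, not the ones in the target plot.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up numbers for illustration only; these are not real ERA values.
toy <- data.frame(year = rep(2001:2003, each = 2),
                  league = rep(c("AL", "NL"), times = 3),
                  best = c(3.5, 3.2, 3.8, 3.4, 3.6, 3.7))

toy_long <- (toy
  # Unstack: one row per year, with one column per league.
  %>% spread(key = league, value = best)
  # With the leagues side by side, the difference is a simple mutate.
  %>% mutate(`AL - NL` = AL - NL)
  # Stack all three columns again so ggplot can map the label to color.
  %>% gather(key = "Type", value = "value", AL, NL, `AL - NL`)
)

p <- (ggplot(toy_long, aes(x = year, y = value, color = Type))
  + geom_line()
  # Horizontal and vertical reference lines at placeholder positions.
  + geom_hline(yintercept = 0, linetype = "dashed")
  + geom_vline(xintercept = 2002, linetype = "dotted")
  # A text annotation at a placeholder location.
  + annotate("text", x = 2002, y = 1, label = "hypothetical event")
)
p
```

The key idea is that the wide form makes the AL minus NL calculation easy, while the long form is what `ggplot` needs to draw one line per `Type`.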

In [None]:
# Your code here