
# Stacking and Unstacking in R

# Stack and Unstack


* `library(tidyr)`
* Stack $\rightarrow$ `gather`
* Unstack $\rightarrow$ `spread`

In [None]:
library(dplyr)
library(tidyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



## Stacking columns with `gather()`

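The `gather` function from the `tidyr` package stacks several columns into a pair of label/data columns.

Arguments, after the data frame:

1. `key`: name of the new column that will hold the labels (the original column names)
2. `value`: name of the new column that will hold the stacked data values
3. the columns to stack, given by name, by range, or by exclusion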

#### Imperative syntax:
```{r}
new_df <- gather(old_df, key = "lbl1", value = "lbl2", col1, col2, col3, ...)
```

#### Piping syntax:
```{r}
data %>%
  gather(key = "lbl1", value = "lbl2", col1, col2, col3, ...)
```
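
Before turning to real data, here is a minimal sketch of both call styles on a tiny table. The `scores` data frame and its column names are hypothetical and exist only for illustration; the point is that the imperative and piped forms should build the same stacked table.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per student, one column per quiz
scores <- data.frame(
  student = c("A", "B"),
  quiz1 = c(90, 75),
  quiz2 = c(85, 80)
)

# Imperative form: the data frame is the first argument
long_imperative <- gather(scores, key = "quiz", value = "score", quiz1, quiz2)

# Piped form: the data frame is supplied by the pipe
long_piped <- scores %>%
  gather(key = "quiz", value = "score", quiz1, quiz2)

# Check that both calls produce the same result
identical(long_imperative, long_piped)
long_piped
```

Each stacked row pairs a student with one quiz label and the matching score, so two students and two quiz columns give four rows.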

### A familiar example

In [None]:
sales <- read.csv("https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/auto_sales.csv")
sales

Salesperson,Compact,Sedan,SUV,Truck
Ann,22,18,15,12
Bob,19,12,17,20
Yolanda,19,8,32,15
Xerxes,12,23,18,9


#### Option 1: spell out all stacking columns

In [None]:
stacked_sales <- (
  sales
    %>% gather(key = "auto_type",
               value = "num_sales",
               Compact, Sedan, SUV, Truck)
)
head(stacked_sales)

Salesperson,auto_type,num_sales
Ann,Compact,22
Bob,Compact,19
Yolanda,Compact,19
Xerxes,Compact,12
Ann,Sedan,18
Bob,Sedan,12


#### Option 2: refer to column range

In [None]:
stacked_sales <- (
  sales
    %>% gather(key = "auto_type",
               value = "num_sales",
               Compact:Truck)
)
head(stacked_sales)

Salesperson,auto_type,num_sales
Ann,Compact,22
Bob,Compact,19
Yolanda,Compact,19
Xerxes,Compact,12
Ann,Sedan,18
Bob,Sedan,12


#### Option 3: select by exclusion

In [None]:
stacked_sales <- (
  sales
    %>% gather(key = "auto_type",
               value = "num_sales",
               -Salesperson)
)
head(stacked_sales)

Salesperson,auto_type,num_sales
Ann,Compact,22
Bob,Compact,19
Yolanda,Compact,19
Xerxes,Compact,12
Ann,Sedan,18
Bob,Sedan,12
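
The three options select the same four columns, so they should produce the same table. The sketch below checks this. It rebuilds `sales` by hand from the values printed earlier so that it runs without downloading the CSV; the names `by_name`, `by_range`, and `by_exclusion` are illustrative only.

```{r}
library(dplyr)
library(tidyr)

# Rebuild the sales table from the printed values (assumed to match the CSV)
sales <- data.frame(
  Salesperson = c("Ann", "Bob", "Yolanda", "Xerxes"),
  Compact = c(22, 19, 19, 12),
  Sedan = c(18, 12, 8, 23),
  SUV = c(15, 17, 32, 18),
  Truck = c(12, 20, 15, 9)
)

# Option 1: spell out every column
by_name <- sales %>%
  gather(key = "auto_type", value = "num_sales", Compact, Sedan, SUV, Truck)

# Option 2: column range
by_range <- sales %>%
  gather(key = "auto_type", value = "num_sales", Compact:Truck)

# Option 3: everything except Salesperson
by_exclusion <- sales %>%
  gather(key = "auto_type", value = "num_sales", -Salesperson)

# Compare the three results
identical(by_name, by_range)
identical(by_name, by_exclusion)
dim(by_name)
```

Exclusion is often the most convenient choice when there are many columns to stack and only a few to keep.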


## <font color="red"> Exercise 7.3.1 </font>

Notice that each year has its own column in `world_bank_fresh_download.csv`.

In [None]:
world_bank <- read.csv("https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/world_bank_fresh_download.csv")
head(world_bank)

Country,Region,Indicator,X1960,X1961,X1962,X1963,X1964,X1965,X1966,⋯,X2006,X2007,X2008,X2009,X2010,X2011,X2012,X2013,X2014,X2015
Algeria,Africa,Total_population,11124890.0,11404860.0,11690150.0,11985130.0,12295970.0,12626950.0,12980270.0,⋯,33749330.0,34261970.0,34811060.0,35401790.0,36036160.0,36717130.0,37439430.0,38186140.0,38934330.0,39666519.0
Algeria,Africa,CO2_emissions,0.5537636,0.53181,0.4849537,0.4528245,0.4595689,0.5224485,0.6494806,⋯,2.990267,3.189978,3.205183,3.428472,3.309912,3.316038,,,,
Algeria,Africa,Life_expectancy,46.13512,46.59032,47.045,47.4962,47.9419,48.3761,48.7908,⋯,72.55771,72.89837,73.21932,73.52102,73.80405,74.07,74.3241,74.56895,74.8081,
Algeria,Africa,Internet_usage,,,,,,,,⋯,7.375985,9.451191,10.18,11.23,12.5,14.0,15.22803,16.5,18.09,
Angola,Africa,Total_population,5270844.0,5367287.0,5465905.0,5565808.0,5665701.0,5765025.0,5863568.0,⋯,18541470.0,19183910.0,19842250.0,20520100.0,21219950.0,21942300.0,22685630.0,23448200.0,24227520.0,25021974.0
Angola,Africa,CO2_emissions,0.1043571,0.08471841,0.2160253,0.2068771,0.2161741,0.206089,0.2651641,⋯,1.200877,1.311096,1.369425,1.430873,1.401654,1.354008,,,,


**Question:** Why is this a violation of the Golden Rule?

> Your answer here

**Task:** Fix this issue as follows:

1. Use `gather` in a pipe to stack all of the year columns. Store the years in a column called `year` and the numbers in a column called `values`.
2. Use `mutate`, `gsub`, and `as.numeric` to clean up the resulting `year` column and convert it to a numeric column.
3. Save the resulting data frame to a variable named `world_bank_stacked`.
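
As a hint for the cleanup step: `read.csv` prefixes column names that start with a digit with `X`, which is why the year columns appear as `X1960`, `X1961`, and so on. The sketch below shows the `mutate`/`gsub`/`as.numeric` pattern on a small hypothetical table (`toy` and its values are made up for illustration and are not the World Bank data).

```{r}
library(dplyr)

# Hypothetical stacked table whose year labels kept the "X" prefix
toy <- data.frame(
  year = c("X1960", "X1961", "X1962"),
  values = c(1.5, 2.5, 3.5)
)

# Remove the "X", then convert the remaining text to a number
toy_clean <- toy %>%
  mutate(year = as.numeric(gsub("X", "", year)))

str(toy_clean)
```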

In [None]:
# Your code here

## Unstacking columns with `spread()`

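The function `spread`, also from the `tidyr` package, is used to unstack a column.

Arguments, after the data frame:

1. `key`: the column to split on; its labels become the new column names
2. `value`: the column to split; its values fill the new columns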

#### Imperative syntax:

```{r}
new_df <- spread(old_df, key = col1, value = col2)
```

#### Piping syntax:

```{r}
data %>%
  spread(key = col1, value = col2)
```
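
A minimal sketch of `spread` on a tiny stacked table follows. The `long_scores` data frame and its column names are hypothetical and chosen only for illustration.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical stacked data: one row per (student, quiz) pair
long_scores <- data.frame(
  student = c("A", "B", "A", "B"),
  quiz = c("quiz1", "quiz1", "quiz2", "quiz2"),
  score = c(90, 75, 85, 80)
)

# Each distinct label in quiz becomes a column filled with the matching score
wide_scores <- long_scores %>%
  spread(key = quiz, value = score)

wide_scores
```

Note that `key` and `value` are given here as bare column names, because they refer to columns that already exist in the data.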

#### A simple unstack

In [None]:
head(stacked_sales)

Salesperson,auto_type,num_sales
Ann,Compact,22
Bob,Compact,19
Yolanda,Compact,19
Xerxes,Compact,12
Ann,Sedan,18
Bob,Sedan,12


In [None]:
(stacked_sales
 %>% spread(key = auto_type,
            value = num_sales)
 )

Salesperson,Compact,Sedan,SUV,Truck
Ann,22,18,15,12
Bob,19,12,17,20
Xerxes,12,23,18,9
Yolanda,19,8,32,15
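
Compared with the original `sales` table, the unstacked rows come back sorted by `Salesperson` (Xerxes now precedes Yolanda), so a round trip through `gather` and `spread` recovers the same data but not necessarily the same row order. The sketch below checks this under the assumption that the hand-built `sales` table matches the CSV; it sorts both tables and matches columns by name as a precaution before comparing.

```{r}
library(dplyr)
library(tidyr)

# Rebuild the sales table from the printed values (assumed to match the CSV)
sales <- data.frame(
  Salesperson = c("Ann", "Bob", "Yolanda", "Xerxes"),
  Compact = c(22, 19, 19, 12),
  Sedan = c(18, 12, 8, 23),
  SUV = c(15, 17, 32, 18),
  Truck = c(12, 20, 15, 9)
)

# Stack, then unstack again
round_trip <- sales %>%
  gather(key = "auto_type", value = "num_sales", -Salesperson) %>%
  spread(key = auto_type, value = num_sales)

# Put both tables in the same column and row order before comparing
round_trip_sorted <- round_trip %>%
  select(all_of(names(sales))) %>%
  arrange(Salesperson)
sales_sorted <- sales %>%
  arrange(Salesperson)

all.equal(round_trip_sorted, sales_sorted)
```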


## <font color="red"> Exercise 7.3.2 </font>

Continuing with the `world_bank_fresh_download.csv` data, notice that the labels in the `Indicator` column are actually variables.



In [None]:
head(world_bank_stacked)

ERROR: Error in head(world_bank_stacked): object 'world_bank_stacked' not found


**Question:** Why is this a violation of the Golden Rule?

> Your answer here

**Task:** Fix this issue as follows:

1. Use `spread` in a pipe to unstack all these labels into their own columns.
2. Save the resulting data frame to a variable named `world_bank_clean`.

In [None]:
# Your code here