# Lab 8 (3/9): Data Manipulation

### Web pages
Course page: https://ambujtewari.github.io/teaching/STATS306-Winter2020/

Lab page: https://rogerfan.github.io/stats306_w20/

### Office Hours
    Mondays: 2-4pm, USB 2165
    
### Contact
    Questions on problems: Use the Slack discussions
    If you need to email me, include in the subject line: [STATS 306]
    Email: rogerfan@umich.edu

In [None]:
# library() stops with an error if the package is missing; require() only returns FALSE
library(tidyverse)


## Tibble miscellanea

In [None]:
mydat = tribble(
  ~variable1, ~another_var, ~`final var`,
  'a', 2, 3.6,
  'b', 1, 8.5
)

print(mydat)

### Subsetting

For selecting variables out of tibbles, you can use `$` or `[[ ]]` (in addition to `select`).
* `$` only selects by name and requires you to hard-code the variable name.
* `[[ ]]` selects by name or position and takes an argument, which can be a variable.
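
Because `[[ ]]` evaluates its argument, it is the form to use when the column is chosen programmatically. The following is a minimal sketch on a made-up tibble (`scores`, `quiz1`, and `quiz2` are hypothetical names, not part of the lab data); writing `scores$col` inside the loop would instead look for a column literally named `col`.

```r
library(tibble)

# Hypothetical example data
scores = tribble(
  ~quiz1, ~quiz2,
  8,      6,
  10,     9
)

# Loop over column names held in a variable; `[[ ]]` evaluates `col`
for (col in names(scores)) {
  cat(col, "mean:", mean(scores[[col]]), "\n")
}

# The same works with a position stored in a variable
pos = 2
scores[[pos]]
```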


In [None]:
mydat

In [None]:
mydat$variable1

In [None]:
mydat[['variable1']]
mydat[[1]]

In [None]:
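# `x` has not been defined, so this cell raises an error:
# `[[ ]]` evaluates its argument rather than treating it as a literal name
mydat[[x]]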

In [None]:
y = 'variable1'
mydat[[y]]

In [None]:
mydat$`final var`
mydat[['final var']]

In [None]:
varname = 'final var'
mydat[[varname]]

## Data Import

The package `readr` (part of `tidyverse`) contains several functions for reading in flat data. See the [readr documentation](https://readr.tidyverse.org/reference/index.html) for details. 

`read_csv` reads standard comma-delimited files. Variants include `read_csv2` (semicolon-delimited) and `read_tsv` (tab-delimited), while `read_delim` reads files with any delimiter. All of these read functions accept either a local file path or a URL.

Also note that the equivalents for writing/saving data files also exist, called `write_csv`, etc.
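
As a small sketch of the read/write pairing with a non-standard delimiter, the cell below writes a made-up tibble to a temporary file with `|` as the delimiter and reads it back. The data and the names `scores` and `scores_back` are assumptions for illustration only.

```r
library(readr)
library(tibble)

# Hypothetical example data
scores = tribble(
  ~name, ~score,
  "a",   1.5,
  "b",   2.5
)

# Write to a temporary file using "|" as the delimiter
path = tempfile(fileext = ".txt")
write_delim(scores, path, delim = "|")

# Read it back, telling read_delim which delimiter to expect
scores_back = read_delim(path, delim = "|")
scores_back
```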

These functions are the `readr` counterparts of base R's `read.table`, `read.csv`, `write.table`, etc., which can also be used to read and write files.
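
For comparison, here is a rough sketch that reads the same file with both the base R and the `readr` function. The data frame `df` is invented for the example. One visible difference is the class of the result: `read.csv` returns a plain data frame, while `read_csv` returns a tibble.

```r
library(readr)

# Hypothetical example data written to a temporary file
df = data.frame(id = 1:3, label = c("a", "b", "c"))
path = tempfile(fileext = ".csv")

# Base R writer; row.names = FALSE avoids an extra column of row names
write.csv(df, path, row.names = FALSE)

base_dat = read.csv(path)    # plain data.frame
readr_dat = read_csv(path)   # tibble

class(base_dat)
class(readr_dat)
```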

In [None]:
# write_csv does not write row names, so store them in a regular column first
mtcars$car = rownames(mtcars)

write_csv(mtcars, "mtcars.csv")

In [None]:
mydat = read_csv("mtcars.csv")
head(mydat)

An option that commonly needs tweaking is `na`: this specifies the value (or values) that are used to represent missing values in your file:

In [None]:
read_csv(
"a,b,c
1,2,.", na='.')
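
Since `na` accepts a vector, several missing-value codes can be handled at once. The sketch below uses invented survey-style data in which `.`, `NA`, and `-99` all stand for a missing value; the column names are hypothetical.

```r
library(readr)

# Literal CSV text (made up): ".", "NA" and "-99" all mean "missing"
survey = read_csv(
"id,age,income
1,.,52000
2,34,-99
3,NA,41000",
  na = c(".", "NA", "-99")
)

survey

# Count the missing values in each column
colSums(is.na(survey))
```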

## `pivot_longer` and `pivot_wider`

Remember that `pivot_longer` transforms datasets from "wide" to "long," collecting several column names into the values of a single variable. `pivot_wider` does the opposite, turning "long" datasets into "wide" ones by spreading a variable's values out into several columns.

In [None]:
grades_wide = tribble(
  ~student, ~`2015`, ~`2016`, ~`2017`,
  'Roger',       83,      89,      93,
  'Jon',         92,      90,      93
)
grades_wide

In [None]:
grades_long = grades_wide %>% 
    pivot_longer(
        `2015`:`2017`, 
        names_to="year", 
        values_to="grade"
    )

grades_long
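
One thing to watch for: the values created by `names_to` come from column names, so `year` is a character column after pivoting. The sketch below rebuilds the same small grades table and converts `year` to an integer with `mutate`, which is one possible approach if you want to treat it as a number (for plotting or arithmetic, say). The name `grades_long_int` is introduced here only for illustration.

```r
library(tidyverse)

grades_wide = tribble(
  ~student, ~`2015`, ~`2016`, ~`2017`,
  'Roger',       83,      89,      93,
  'Jon',         92,      90,      93
)

grades_long_int = grades_wide %>%
    pivot_longer(
        -student,
        names_to = "year",
        values_to = "grade"
    ) %>%
    # names_to produces strings; convert them to integers
    mutate(year = as.integer(year))

grades_long_int
```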

In [None]:
grades_long %>% 
    pivot_wider(
        names_from="year", 
        values_from="grade"
    )

### Pivoting with multiple value columns

In [None]:
family_wide <- tribble(
  ~family,  ~dob_child1,  ~dob_child2, ~gender_child1, ~gender_child2,
       1L, "1998-11-26", "2000-01-29",             1L,             2L,
       2L, "1996-06-22",           NA,             2L,             NA,
       3L, "2002-07-11", "2004-04-05",             2L,             2L,
       4L, "2004-10-10", "2009-08-27",             1L,             1L,
       5L, "2000-12-05", "2005-02-28",             2L,             1L,
)
family_wide

If your variable names are well-formatted, you can use the `names_sep` argument to split each column name into parts.

Note that `.value` is not a new column name but a special placeholder in `names_to`: it marks the part of the original column name that gives the names of the new value columns (here `dob` and `gender`).

In [None]:
family_long = family_wide %>%
    pivot_longer(
        -family,
        names_to = c('.value', 'child'),
        names_sep = '_'
    )

family_long

In [None]:
family_long %>%
    pivot_wider(
        names_from = c('child'),
        values_from = c('dob', 'gender')
    )

For more complex variable names, you can instead use the `names_pattern` argument, a regular expression in which each capture group `( )` corresponds to one entry of `names_to`.

In [None]:
family_long = family_wide %>%
    pivot_longer(
        -family,
        names_to = c('.value', 'child'),
        names_pattern = '(.*)_(.*)'
    )

family_long
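
A more specific pattern can also clean up the extracted pieces. As a sketch, the pattern below captures only the digits after `child`, so the `child` column holds `"1"` and `"2"` rather than `"child1"` and `"child2"`. The two-family table `family_small` is a shortened copy of the data above, introduced here so the cell runs on its own.

```r
library(tidyverse)

# Shortened copy of family_wide (first two families only)
family_small = tribble(
  ~family, ~dob_child1,  ~dob_child2,  ~gender_child1, ~gender_child2,
  1L,      "1998-11-26", "2000-01-29", 1L,             2L,
  2L,      "1996-06-22", NA,           2L,             NA
)

family_small_long = family_small %>%
    pivot_longer(
        -family,
        names_to = c('.value', 'child'),
        # group 1: everything before "_child"; group 2: the child number
        names_pattern = '(.*)_child(\\d+)'
    )

family_small_long
```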

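### Handling missing values

`pivot_wider` fills in `NA` for any combination that does not appear in the long data. Use `values_fill` to choose a different fill value, and `values_drop_na = TRUE` in `pivot_longer` to drop the rows that would otherwise hold `NA`.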

In [None]:
head(fish_encounters)

In [None]:
fish_wide1 = fish_encounters %>% 
    pivot_wider(names_from = station, values_from = seen)

head(fish_wide1)

In [None]:
fish_wide = fish_encounters %>% 
    pivot_wider(
      names_from = station, 
      values_from = seen,
      values_fill = list(seen = 0)
    )

head(fish_wide)

In [None]:
fish_wide1 %>% 
    pivot_longer(I80_1:MAW, 
                 names_to="station", 
                 values_to="seen")

In [None]:
fish_wide1 %>% 
    pivot_longer(I80_1:MAW, 
                 names_to="station", 
                 values_to="seen", 
                 values_drop_na=TRUE)

## Question 1

The following is a dataset of US voting participation, broken down by gender and age group.

In [None]:
voting_par <- tribble(
  ~year,  ~m_1824,  ~f_1824, ~m_2544, ~f_2544, ~m_4564, ~f_4564, ~m_65p, ~f_65p,
    2018, 27.4, 32.8, 38.0, 42.9, 53.7, 56.3, 65.4, 62.5,
    2014, 14.7, 17.2, 26.3, 30.4, 45.0, 47.0, 60.1, 55.5,
    2010, 18.7, 20.6, 30.5, 33.9, 50.7, 51.5, 62.0, 56.5,
    2006, 18.6, 21.2, 32.3, 36.5, 53.4, 55.1, 64.4, 57.5,
    2002, 15.7, 18.6, 32.7, 35.4, 52.6, 53.5, 65.4, 57.7
)

voting_par

Convert this dataset to a long version. The new dataset should have variables `year`, `gender`, `age`, and `voting_perc`. 

Use ggplot to visualize this new dataset. Make whatever aesthetic and formatting choices make the most sense to you.

## Question 2

The following is an example dataset of answers from a multiple choice questionnaire, where for each question respondents could select up to three choices.

In [None]:
multi <- tribble(
  ~id, ~choice1, ~choice2, ~choice3,
  1, "A", "B", "C",
  2, "C", "B",  NA,
  3, "D",  NA,  NA,
  4, "B", "D",  NA
)

multi

Using the pivot functions and dplyr operations, turn `multi` into the following dataset, which is easier to use.
```
# A tibble: 4 x 5
     id A     B     C     D    
  <dbl> <lgl> <lgl> <lgl> <lgl>
1     1 TRUE  TRUE  TRUE  FALSE
2     2 FALSE TRUE  TRUE  FALSE
3     3 FALSE FALSE FALSE TRUE 
4     4 FALSE TRUE  FALSE TRUE 
```
HINT: This will probably require more than one pivot operation.