# Section 2

In today's section we will learn some more __R__ functions, improve our ability to manipulate and summarize data, and spend some time getting more familiar with our regression output.


For today, let's load a few packages and read in a dataset on sleep quality and time allocation for 706 individuals. This dataset is saved to the section folder as `sleep75.dta`. 

In [None]:
library(tidyverse)
library(haven)
sleepdata <- read_dta("sleep75.dta")

## Grouping Data

Sometimes we may want to group our data by values of certain variables. For example, if we have data on automobile accidents and speed limits across states, we might want to group our data at the state level and look at how factors differ across states.

We can group data using __tidyverse__'s `group_by()` function. The function takes two arguments: first the name of the data, second the variable to group on: `group_by(data, varname)`
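
As a minimal sketch of how grouping behaves, the example below builds a small made-up table of accidents and speed limits (the data, the object names `accidents`, `by_state`, and `state_summary`, and the column names are all hypothetical, chosen only for illustration). It shows that `group_by()` on its own does not change the rows; it only records the grouping, which later functions such as `summarize()` then respect.

```r
library(tidyverse)

# Hypothetical toy data: accident counts and speed limits in three states
accidents <- tibble(
  state       = c("CA", "CA", "OR", "OR", "WA"),
  speed_limit = c(65, 70, 55, 65, 60),
  n_accidents = c(12, 15, 4, 6, 8)
)

# group_by() keeps all five rows; it only marks which rows belong together
by_state <- group_by(accidents, state)

# summarize() now computes each statistic within each state
state_summary <- summarize(by_state,
                           mean_limit      = mean(speed_limit),
                           total_accidents = sum(n_accidents))
state_summary
```

The summary has one row per state rather than one row per observation, which is usually what we want when comparing groups.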

## Pipes

As we start wanting to generate more specific summary statistics that require multiple coding steps, it can get tedious (and memory-intensive) to constantly have to assign objects to memory in each intermediate step.

For an example, let's return to the `sleep75.dta` data on sleep quality and time allocation for 706 individuals.

Let's consider the variable `sleep`, the minutes slept at night per week. If we wanted to convert it to hours per night and also obtain summary statistics by whether individuals are in good or excellent health (`gdhlth = 1`), we could do it in the following way:


In [3]:
library(tidyverse)
library(haven)
sleepdata <- read_dta("sleep75.dta")

sleepdata <- mutate(sleepdata, hrs_night = sleep/(7*60))
sleepdata_goodhealth <- filter(sleepdata, gdhlth == 1)
sleepdata_poorhealth <- filter(sleepdata, gdhlth == 0)

summarize(sleepdata_goodhealth, mean_hours = mean(hrs_night), min_hours = min(hrs_night), max = max(hrs_night), count_goodhealth = n())
summarize(sleepdata_poorhealth, mean_hours = mean(hrs_night), min_hours = min(hrs_night), max = max(hrs_night), count_poorhealth = n())




To get summary statistics on hours slept per night for each of the good and poor health groups, we had to use `filter()` to subset the data by health status, store those subsets as separate objects, and then generate summary statistics for each subset individually.

The __tidyverse__ has a fantastic alternative that helps us skip these intermediate steps: the pipe, `%>%`. The pipe takes the output of the expression on its left and plugs it into the expression on its right, by default as that expression's first argument. For instance, we could rewrite the above code using pipes in fewer lines and without having to store our intermediate data:

In [None]:
sleepdata %>%
    mutate(hrs_night = sleep/(7*60)) %>%
    group_by(gdhlth) %>%
    summarize(mean_hours = mean(hrs_night), min_hours = min(hrs_night), max = max(hrs_night), count = n())
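
To see that the pipe is only a different way of writing the same function calls, here is a small sketch comparing a nested call with its piped version. The table `toy` and its values are made up for illustration (they are not taken from `sleep75.dta`), and the object names `nested` and `piped` are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data: weekly minutes of sleep and a health indicator
toy <- tibble(
  sleep  = c(3360, 2940, 3150, 2520),
  gdhlth = c(1, 1, 0, 0)
)

# Nested calls: we have to read from the inside out
nested <- summarize(group_by(toy, gdhlth), mean_sleep = mean(sleep))

# Piped calls: each result becomes the first argument of the next function
piped <- toy %>%
  group_by(gdhlth) %>%
  summarize(mean_sleep = mean(sleep))

# Check that both approaches produce the same group means
identical(nested$mean_sleep, piped$mean_sleep)
```

Both versions do the same work; the piped version just lets us read the steps in the order they happen.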