# ECON 490: Within Group Analysis (6)

## Prerequisites
---
1. Inspect and clean the variables of a data set.
2. Generate basic variables for a variety of purposes.

## Learning Outcomes
---
1. Use `arrange`, `group_by`, `group_keys` and `ungroup` to sort and organize data for specific purposes.
2. Generate variables with `summarize` to analyze patterns within groups of data. 
3. Reshape data frames using `pivot_wider` and `pivot_longer`.

In [1]:
source("6_tests.r")


## 6.1 Key Functions for Group Analysis

When we are working on a particular project, it is often important to know how to examine data for specific groupings, whether of variables or of observations meeting specific conditions. In this notebook, we will look in depth at a variety of functions for conducting group-level analysis. We will rely heavily on the `dplyr` package, which is loaded as part of the `tidyverse` package. Let's import these packages and load in our "fake_data" now. Recall that this data set simulates information on workers in the years 1995-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

In [1]:
library(haven)
library(tidyverse)
library(IRdisplay)

fake_data <- read_csv("../econ490-stata/fake_data.csv") # change me!


Now that we've loaded in our data and already know how to view it, clean it, and generate additional variables for it as needed, we can look at commands which group this data.

#### 6.1.1 `arrange`

Before grouping data, we may want to order our data set based on the values of a particular variable. The `arrange` function helps us achieve this. It takes in a data frame and a variable and rearranges the data frame in ascending order by default; to arrange in descending order, we wrap the variable in the `desc` function. If we want to rearrange our entire data set in order of one of our variables, say _year_, we can do it as below.

In [None]:
# arrange the dataframe by ascending year
fake_data %>% arrange(year)

# arrange the dataframe by descending year
fake_data %>% arrange(desc(year))

We can pass multiple variable parameters to the `arrange` function to indicate how we should further sort our data within each year grouping. For instance, including the _region_ variable will further sort each year grouping in order of region.

In [None]:
fake_data %>% arrange(year, region)
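
Ascending and descending sorts can also be mixed in a single call. The sketch below uses a small hypothetical data frame (`toy`, not the course's "fake_data") to sort by ascending year and then by descending earnings within each year.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not fake_data)
toy <- tibble(
  year     = c(2001, 2000, 2001, 2000),
  region   = c(2, 1, 1, 2),
  earnings = c(300, 100, 400, 200)
)

# Ascending year first, then descending earnings within each year
sorted <- toy %>% arrange(year, desc(earnings))
print(sorted)
```

Only the variable wrapped in `desc` is reversed; the other sort keys keep their default ascending order.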

#### 6.1.2 `group_by`

This is one of the most important functions in `dplyr`. It allows you to group a data frame by the values of a specific variable and perform further operations on those groups. Let's say that we wanted to group our data set by _region_ and count the number of observations in each region. We can simply pass this variable as a parameter to our `group_by` function and then pipe the result into the `tally` function.

In [None]:
fake_data %>% group_by(region) %>% tally()

Notice how the output lists the regions in ascending order automatically. Note that `group_by` on its own does not remove any rows: it only records which group each observation belongs to. It is the `tally` call that collapses the data set to one row per group, so it is important not to overwrite our `fake_data` data frame with this result if we want to preserve our original data. We can also pass multiple arguments to `group_by`. If we pass both _region_ and _treated_ to our function as inputs, our region groups will be further grouped by observations which are and are not treated. Let's count the number of treated and untreated observations in each region.

In [None]:
fake_data %>% group_by(region, treated) %>% tally()
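
To see what grouping does to a data frame on its own, here is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration). It checks the number of rows before and after `tally` and uses `group_vars` to inspect the current grouping.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not fake_data)
toy <- tibble(
  region   = c(1, 1, 2, 2, 2),
  treated  = c(0, 1, 0, 0, 1),
  earnings = c(10, 20, 30, 40, 50)
)

grouped <- toy %>% group_by(region)

nrow(grouped)        # grouping alone keeps every row
group_vars(grouped)  # the variables the data frame is grouped by

# tally() is the step that reduces the data to one row per group
counts <- grouped %>% tally()
print(counts)
```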

Finally, we can pipe a grouped data frame into another `group_by` call. In this case, the second `group_by` will simply overwrite the first. For example, if we pass our original _region_ grouping into a _treated_ `group_by`, we will simply get a data frame counting the total number of observations that are treated and untreated.

In [None]:
fake_data %>% group_by(region) %>% group_by(treated) %>% tally()
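
If the goal is to keep the first grouping and add to it rather than overwrite it, `group_by` accepts an `.add` argument. The sketch below assumes `dplyr` 1.0.0 or later and uses a hypothetical toy data frame to compare the two behaviours.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not fake_data)
toy <- tibble(
  region  = c(1, 1, 2, 2, 2),
  treated = c(0, 1, 0, 0, 1)
)

# Default behaviour: the second group_by replaces the first
overwritten <- toy %>% group_by(region) %>% group_by(treated)

# With .add = TRUE the new variable is appended to the existing grouping
added <- toy %>% group_by(region) %>% group_by(treated, .add = TRUE)

group_vars(overwritten)
group_vars(added)
```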

#### 6.1.3 `group_keys`

This function allows us to see the specific groups for a group_by data frame we have created. For instance, if we wanted to see every year in the data, we could group by _year_ and then apply the `group_keys` function.

In [None]:
fake_data %>% group_by(year) %>% group_keys()

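This gives the same set of values as using the `unique` function directly on a column of our data set, as below. The output in this case is a vector instead of another data frame, and the values appear in the order they are first encountered in the data rather than in sorted order.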

In [None]:
unique(fake_data$year)
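
The difference between the two approaches is easiest to see on unsorted data. This is a small sketch with a hypothetical one-column data frame.

```r
library(dplyr)

# Hypothetical toy data with years deliberately out of order
toy <- tibble(year = c(2003, 1995, 2003, 2001))

# group_keys() returns a data frame with one row per group, sorted by the key
keys <- toy %>% group_by(year) %>% group_keys()
print(keys)

# unique() returns a plain vector in order of first appearance
years <- unique(toy$year)
print(years)
```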

#### 6.1.4 `ungroup`

We can even selectively remove groups from a grouped data frame. Say we realized that we didn't need the data frame grouped by both _region_ and _treated_ and wanted to just count by _region_. If this grouped data frame had been defined as A, we can simply use `ungroup` to "work backwards", removing the grouping by treatment status and leaving a count for just regions.

In [None]:
# group by region and treatment status (no tally yet, so both groupings are kept)
A <- fake_data %>% group_by(region, treated)
# drop the treated grouping, then count observations by region only
A %>% ungroup(treated) %>% tally()
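
The following sketch, on a hypothetical toy data frame, uses `group_vars` to confirm which groupings remain after a selective `ungroup`. It also shows that `tally` itself drops the last grouping variable, which is why the order of these steps matters. Passing variable names to `ungroup` assumes `dplyr` 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not fake_data)
toy <- tibble(
  region  = c(1, 1, 2, 2, 2),
  treated = c(0, 1, 0, 0, 1)
)

A_toy <- toy %>% group_by(region, treated)

# Remove only the treated grouping; region stays
by_region <- A_toy %>% ungroup(treated)
group_vars(by_region)

region_counts <- by_region %>% tally()
print(region_counts)

# tally() on the two-variable grouping already peels off the last level
after_tally <- A_toy %>% tally()
group_vars(after_tally)
```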

### Exercise 1

Order the following pieces of code to create a data frame which shows, in descending order, the population of each region and treatment pairing. Be sure to include piping. Pass your final code into answer_1.

In [None]:
# count()
# arrange(desc(n))
# fake_data
# group_by(region, treated)

answer_1 <- # your answer ordering the 4 code pieces above and adjoining them with %>% here

test_1()

### Exercise 2

Use the methods we've learned so far to determine how many unique combinations of _region_, *birth_year* and *sex* exist in the data set.

In [None]:
# your code here

answer_2 <- # your answer here

test_2()

## 6.2 Generating Variables for Group Analysis

We have already seen how to redefine and add new variables to a data frame using the `df$variable <-` format. We have also seen how to use the `mutate` function to add new variables to a data frame. However, we often want to add new variables to grouped data frames to display information about the different groups rather than different observations from the original data frame. That is where `summarise` comes in (`summarize` is an alternative spelling of the same function). The `summarise` function gives us access to a variety of common functions we can use to generate variables corresponding to individual groups. For instance, we may want to find the mean earnings of each region. As such, we will group on _region_ and then add a variable to our grouped data frame which aggregates the mean of the _earnings_ variable for each region group.

In [None]:
fake_data %>% group_by(region) %>% summarise(meanearnings = mean(earnings))

We may want more detailed information about each region. We can pass a series of parameters to `summarise`, and it will generate variables for all of these requests. Let's say we want the mean and standard deviation of _earnings_ for each group, as well as the range of _earnings_ of each group (max _earnings_ - min _earnings_).

In [None]:
fake_data %>% 
    group_by(region) %>% 
    summarise(meanearnings = mean(earnings), stdevearnings = sd(earnings), range = max(earnings) - min(earnings))

We may also want to calculate the number of observations in each region as an additional variable. Before, we could simply group by our _region_ variable and then immediately apply the `tally` function. However, now that we have defined a series of other variables, our data set on which `tally` operates is different. Watch what happens when we try to use `tally` after using `summarise`.

In [None]:
fake_data %>% 
    group_by(region) %>% 
    summarise(meanearnings = mean(earnings), stdevearnings = sd(earnings), range = max(earnings) - min(earnings)) %>%
    tally()

Now watch what happens when we try to use `tally` before using `summarise`.

In [None]:
fake_data %>% 
    group_by(region) %>% 
    tally() %>%
    summarise(meanearnings = mean(earnings), stdevearnings = sd(earnings), range = max(earnings) - min(earnings))

In the first case, `summarise` has already collapsed the data frame to one row per region, so `tally` can only count those remaining rows rather than the original observations in each region. In the second case, `tally` has reduced the data frame to just _region_ and the count, so the _earnings_ variable that the functions within `summarise` need no longer exists and the code throws an error. This is where `n` comes in. It is a special function used within the `summarise` function that returns the number of observations within each group of a data frame. As such, it is usually paired with `group_by`, although it can also be used within `mutate`, where it attaches the group size to every observation (or the total number of observations if the data frame is not grouped).

In [None]:
fake_data %>% 
    group_by(region) %>% 
    summarise(meanearnings = mean(earnings), stdevearnings = sd(earnings), range = max(earnings) - min(earnings), total = n())
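
To make the behaviour of `n` concrete, here is a minimal sketch on a hypothetical toy data frame. It uses `n` inside both `summarise` and `mutate`, and contrasts it with `row_number`, which is the function that gives the position of each observation within its group.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not fake_data)
toy <- tibble(
  region   = c(1, 1, 2, 2, 2),
  earnings = c(10, 20, 30, 40, 50)
)

# Inside summarise: one row per region, with its mean and its group size
summary_tbl <- toy %>%
  group_by(region) %>%
  summarise(meanearnings = mean(earnings), total = n())
print(summary_tbl)

# Inside mutate: every row is kept; n() repeats the group size on each row,
# while row_number() gives the position of the row within its group
with_size <- toy %>%
  group_by(region) %>%
  mutate(group_size = n(), position = row_number()) %>%
  ungroup()
print(with_size)
```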

The entire process in this section is similar to using `collapse` in Stata to aggregate a statistic across groups of observations. In R, it can be done within a single pipeline and without overwriting the original data set.

### Exercise 3

Create a data frame which has the total number of people born in all birth years in which more than 100,000 people were born. The data frame should have two columns, one for *birth_year* and one for *total_births*, where *total_births* is arranged from most popular birth year to least popular birth year (with the least still having more than 100,000 in its *total_births*). Save your answer to answer_3.

In [None]:
answer_3 <- # your code here

test_3()

## 6.3 Reshaping Data 

Sometimes in our process of data analysis, we may want to restructure our data frame. To do this, we can take advantage of a series of functions within the `tidyr` package that we have imported implicitly through loading in the `tidyverse` package. Within this package, we can use a series of functions to quickly change the format of our data frame without having to redefine all of its columns and rows manually.

We often want to transform our data from "wide" to "long" format, or vice versa. Suppose that we wish to make our data set more "cross-sectional" in appearance by dropping the age variable and adding variables for each year, with the values in these columns corresponding to the earnings of that person in that year. Effectively, by adding columns, we are making our data set "wider", so we will use the `pivot_wider` function.

In [2]:
wide_data <- fake_data %>% select(-age) %>% pivot_wider(names_from = "year", values_from = "earnings")


We can see that the function above took the values of _year_ and generated a new variable for each of them, 1995-2012, then supplied the corresponding values from _earnings_ to each of these year columns. In the case where a worker didn't have a recorded observation for a given year (and thus no wage), the column for that year is marked as missing (`NA`).
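
The same reshaping can be traced by hand on a very small example. The sketch below uses a hypothetical toy data frame (the `workerid` column name and all values are made up) in which one worker has no observation for 1996.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: worker 2 has no record for 1996
toy_long <- tibble(
  workerid = c(1, 1, 2),
  year     = c(1995, 1996, 1995),
  earnings = c(100, 110, 90)
)

# One column per year, filled with that year's earnings
toy_wide <- toy_long %>%
  pivot_wider(names_from = "year", values_from = "earnings")
print(toy_wide)
```

The new columns are named after the year values, and the cell for worker 2 in 1996 is `NA` because there was no matching row in the long data.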

Now suppose we want to work backward and transform this data set back into its original, "longer" shape (just now without the `age` variable). To do this, we can invoke the complementary `pivot_longer` function.

In [None]:
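# select the year columns by name; a bare 1995:2012 would be read as column positions
wide_data %>% pivot_longer(cols = all_of(as.character(1995:2012)), names_to = "year", values_to = "earnings")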

Here, we supply the names of all variables/columns we want to compress to cols (being all the year variables we now want to squash), we supply the name of the singular variable all of these columns will be compressed into (being our original _year_ variable), and then we finally supply the name of the column for all of these earnings values to now be stored in: our original _earnings_ variable.
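
As a sketch of the full round trip, the example below reshapes a hypothetical toy wide data frame back to long format. The year columns are selected by name with `all_of`, since their names are numbers stored as text. The two extra arguments are optional and assume `tidyr` 1.1.0 or later: `names_transform` converts the year labels back to integers, and `values_drop_na` removes the rows that only exist because of missing cells in the wide format.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy wide data: worker 2 has no earnings recorded for 1996
toy_wide <- tibble(
  workerid = c(1, 2),
  `1995`   = c(100, 90),
  `1996`   = c(110, NA)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols            = all_of(c("1995", "1996")),  # year columns, selected by name
    names_to        = "year",
    values_to       = "earnings",
    names_transform = list(year = as.integer),    # "1995" -> 1995L
    values_drop_na  = TRUE                        # drop worker 2's empty 1996 row
  )
print(toy_long)
```

Without `values_drop_na = TRUE`, the long data would contain an extra row for worker 2 in 1996 with missing earnings.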

If this doesn't seem intuitive or easy to grasp, don't worry. Even many experienced coders struggle with the reshape functionality in R, Stata, and other tools. With practice, it will become much more digestible!

### Exercise 4

Study the output of the cell below.

In [2]:
people <- c("Jack", "Jack", "Jack", "Jill", "Jill", "Jill")
weeks <- c(1, 2, 3, 1, 2, 3)
moneyspent <- c(50, 75, 25, 15, 100, 60)

Expenses <- data.frame(Person = people, Week = weeks, Money_Spent = moneyspent)
Expenses

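| Person | Week | Money_Spent |
|--------|------|-------------|
| `<chr>` | `<dbl>` | `<dbl>` |
| Jack | 1 | 50 |
| Jack | 2 | 75 |
| Jack | 3 | 25 |
| Jill | 1 | 15 |
| Jill | 2 | 100 |
| Jill | 3 | 60 |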


Run the code cell below to see the exercise!

In [5]:
display_html('<iframe src="https://h5p.open.ubc.ca/wp-admin/admin-ajax.php?action=h5p_embed&id=1207" width="862" height="258" frameborder="0" allowfullscreen="allowfullscreen" title="R 6.4"></iframe><script src="https://h5p.open.ubc.ca/wp-content/plugins/h5p/h5p-php-library/js/h5p-resizer.js" charset="UTF-8"></script>')

**Extra**: Try to create two data frames representing the two reshaping options described above! Feel free to label the columns of each new data frame as you see fit.

In [None]:
# your code here

## 6.4 Wrap Up
Being able to generate new variables and modify a data set to suit your specific research is pivotal. Now you should hopefully have more confidence in your ability to perform these tasks. Next, we will explore the challenges posed by working with multiple datasets at once.