# ECON 490: Combining Data Sets (7)

## Prerequisites 

1. Import data sets in csv and dta format. 
2. Create new variables for a variety of purposes. 
3. Use `group_by` and other functions to conduct group-level analysis.

## Learning Outcomes

1. Append new observations and variables to an already existing data set using `rbind` and `cbind`.
2. Merge variables and their values from one data set into another using `left_join`, `right_join`, `inner_join`, and `full_join`.

In [7]:
source("7_tests.r")

## 7.1 Working with Multiple Data Sets

We'll continue working with the "fake_data" data set introduced in the previous lecture. Recall that this data set simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings. Let's load this data set now.

In [1]:
library(haven)
library(tidyverse)
library(IRdisplay)

fake_data <- read_csv("../econ490-stata/fake_data.csv")  # change me!

Since we are working with multiple data sets in this module, we will also create the "region_year_data" data set below. This data set is much smaller and gives the average log earnings and the number of observations for each region in each year.

In [None]:
# Collapse "fake_data" to one row per year-region pair
region_year_data <- fake_data %>%
    group_by(year, region) %>%
    summarize(average_logearn = mean(log_earnings), n = n(), .groups = "drop")



Often we will need to draw on data from multiple data sets such as these. Most of the time, these data sets will be available for download in different files (each for a given year, month, country, etc.) and may store different variables or observations. Thus, if we want to compile them we need to combine them into the same data frame.

There are two key ways of combining data, each reflecting different goals:

1. When we want to paste data directly beside or under our existing data set, we call this **appending** data.
    * If we think of a data set as a spreadsheet, this is like taking one data set and "pasting" it into the bottom of another to add more observations, or pasting one data set directly beside another to add more variables. We do this when two data sets have identical columns/variables (so that we can stack them vertically) or an equal number of observations (so that we can stick them beside each other horizontally).
2. When we want to add new variables and their data from another data set into our existing data set, we call this **merging** data.
    * This is like looking up values in a table and then adding a column; in Excel, this is like using `VLOOKUP`. Importantly, we can only merge data sets that share a common column or key to identify observations with particular values. For example, if we want to merge in data from a different year but for the same people (observations) as those we are currently working with, data sets will usually have an identifying number for the person that functions as our key when merging. Unlike with appending, this does not require column names or numbers of observations to be identical.
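
The difference between the two approaches can be sketched with a few made-up data frames. The names `wages_2001`, `wages_2002`, and `worker_info` below are hypothetical and are not part of "fake_data"; the functions used are introduced properly later in this module.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
wages_2001 <- data.frame(id = c(1, 2), year = c(2001, 2001), wage = c(10, 12))
wages_2002 <- data.frame(id = c(1, 2), year = c(2002, 2002), wage = c(11, 13))
worker_info <- data.frame(id = c(1, 2), sex = c("F", "M"))

# Appending: both data frames have the same columns, so we stack their rows
wages <- rbind(wages_2001, wages_2002)

# Merging: look up each worker's sex using the shared key `id`
wages_full <- left_join(wages, worker_info, by = "id")
wages_full
```

Appending changed the number of rows, while merging added a column by matching on `id`.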

## 7.2 Appending Data Sets

#### 7.2.1 Append vertically with `rbind`
Let's say that our "fake_data" data set is inexplicably missing 3 observations for worker 1; specifically, the earnings for this worker for the years 2003, 2005, and 2007 are missing. However, let's say these observations exist in another data set, "missing_data", which we can append to our "fake_data" data set since it contains all of the same variables. We can inspect this small data frame below.

In [2]:
missing_data <- data.frame(workerid = c(1, 1, 1), year = c(2003, 2005, 2007), sex = c("M", "M", "M"), 
                           birth_year = c(1944, 1944, 1944), age = c(59, 61, 63), start_year = c(1997, 1997, 1997),
                           region = c(1, 1, 1), treated = c(0, 0, 0), earnings = c(30000, 35000, 36000))

missing_data

workerid,year,sex,birth_year,age,start_year,region,treated,earnings
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,2003,M,1944,59,1997,1,0,30000
1,2005,M,1944,61,1997,1,0,35000
1,2007,M,1944,63,1997,1,0,36000


To append these three rows to the bottom of our data set, we can simply use the `rbind` function (row bind). This function allows us to bind together data sets vertically, with the data set specified second being placed directly underneath the data set specified first. In this way, we can combine data sets vertically if they share the same column names. 

In [None]:
fake_data <- rbind(fake_data, missing_data)

tail(fake_data)

This is a fast way of concatenating data sets vertically. We can see that it also does not require us to have a designated "master" and "using" data set. We can have both data sets stored in our notebook and view them simultaneously, making the process of appending data sets simpler, especially if we want to check for identical column names or missing values.
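
Since `rbind` relies on the column names agreeing, it can help to check them before appending. Below is a minimal sketch using two hypothetical toy data frames (`main_df` and `extra_df` are made-up names, not objects from this module).

```r
# Hypothetical toy data frames, for illustration only
main_df <- data.frame(workerid = c(1, 2), year = c(2001, 2001), earnings = c(100, 200))
extra_df <- data.frame(year = 2002, earnings = 150, workerid = 1)  # same columns, different order

# Do both data frames contain exactly the same column names?
same_columns <- setequal(names(main_df), names(extra_df))
same_columns

# For data frames, rbind matches columns by name, so the column order does not matter
combined <- rbind(main_df, extra_df)
combined
```

If the column names differ, `rbind` stops with an error. The `dplyr` function `bind_rows` is an alternative that instead fills columns missing from one data frame with NA.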

#### 7.2.2 Append horizontally with `cbind`
We may also want to concatenate data sets horizontally. Suppose that we have a new variable, _religious_, which is a dummy coded as 1 if the person self-identified as religious in that year and 0 if not. This data frame (which has a single column, so it is effectively a vector) is below.

In [2]:
set.seed(123)

missing_data2 <- data.frame(religious = sample(0:1, nrow(fake_data), replace = TRUE))  # one value per row of fake_data

head(missing_data2)

Assuming it is ordered identically to our "fake_data" data set with respect to participants, we can simply bind this column to our existing data set using the `cbind` function.

In [None]:
fake_data <- cbind(fake_data, missing_data2)

head(fake_data)

We can see that this function appended our _religious_ variable to the data set. However, it required us to have an identical number of observations in the two data frames, and for both data frames to be ordered identically with respect to people. Often this is not the case, so we must turn to a more commonly used and slightly more challenging concept next: merging data sets. However, there are some exercises for you to try first.
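
To see why identical ordering matters, here is a minimal sketch with two small made-up data frames (`workers` and `survey` are hypothetical). Because `cbind` pairs rows purely by position, it raises no complaint when the rows refer to different people.

```r
# Hypothetical toy data, for illustration only
workers <- data.frame(workerid = c(1, 2, 3), earnings = c(100, 200, 300))
survey <- data.frame(survey_id = c(3, 1, 2), religious = c(1, 0, 1))  # stored in a different order

# cbind pairs row 1 with row 1, row 2 with row 2, and so on
bound <- cbind(workers, survey)
bound

# Count the rows where the two id columns disagree, i.e. rows paired with the wrong person
n_mismatched <- sum(bound$workerid != bound$survey_id)
n_mismatched
```

Every row in this example is paired incorrectly, which is the kind of silent error that merging on a key avoids.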

## Exercise 1

Study the Stores data frame below.

In [2]:
names = c(1, 2, 3)
locations = c("A", "B", "C")

Stores <- data.frame(Stores = names, Locations = locations)
Stores

Stores,Locations
<dbl>,<chr>
1,A
2,B
3,C


Run the code cell below to see the exercise!

In [5]:
display_html('<iframe src="https://h5p.open.ubc.ca/wp-admin/admin-ajax.php?action=h5p_embed&id=1208" width="862" height="349" frameborder="0" allowfullscreen="allowfullscreen" title="R 7.1"></iframe><script src="https://h5p.open.ubc.ca/wp-content/plugins/h5p/h5p-php-library/js/h5p-resizer.js" charset="UTF-8"></script>')

## Exercise 2

Run the code cell below to see this second exercise!

In [6]:
display_html('<iframe src="https://h5p.open.ubc.ca/wp-admin/admin-ajax.php?action=h5p_embed&id=1209" width="862" height="371" frameborder="0" allowfullscreen="allowfullscreen" title="R 7.2"></iframe><script src="https://h5p.open.ubc.ca/wp-content/plugins/h5p/h5p-php-library/js/h5p-resizer.js" charset="UTF-8"></script>')

## 7.3 Merging Data Sets 

Merging data sets means matching existing observations between data sets along specific variables, typically in order to add more information about existing participants to our current data set. This process, also known in R as joining data, is more complicated than simply appending data. Luckily, we have four functions with descriptive names which clarify this process for us, depending on how we want to merge two data sets. Before we start, we should look at the structure of each data set.

In [None]:
head(fake_data)

head(region_year_data)

To do a merge of any type, we need to specify a "key": a variable (or set of variables) on which we will merge our data sets. It is best to choose a key which uniquely identifies each observation in at least one of the data sets; otherwise, matches become ambiguous and rows may be duplicated unexpectedly. We can guess from our knowledge of the data set that every combination of _workerid_ and _year_ returns a unique observation in the "fake_data" data set. Looking at the "region_year_data" data set above, we can see that every combination of _year_ and _region_ identifies a unique observation in that data set. This second data set, however, does not have the _workerid_ variable, while the first data set has all three of _workerid_, _year_, and _region_. Since the identifiers common to both data sets are _year_ and _region_, we will use these as our keys within the join functions. Since there are many observations with identical years and regions within the "fake_data" data set, we will be doing something similar to an m:1 merge in Stata. However, we can specify how we would like matched and unmatched observations to be treated.
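
Rather than guessing, we can check whether a set of keys uniquely identifies observations by counting how often each key combination appears. The sketch below uses small hypothetical stand-ins (`workers` and `regions`) for the two data sets, so the values are made up.

```r
library(dplyr)

# Hypothetical toy versions of the two data sets, for illustration only
workers <- data.frame(workerid = c(1, 2, 3, 1),
                      year = c(1999, 1999, 1999, 2000),
                      region = c(1, 1, 2, 1))
regions <- data.frame(year = c(1999, 1999, 2000),
                      region = c(1, 2, 1),
                      average_logearn = c(10.1, 10.3, 10.2))

# Key combinations that appear more than once in each data frame
dup_regions <- regions %>% count(year, region) %>% filter(n > 1)
dup_workers <- workers %>% count(year, region) %>% filter(n > 1)

nrow(dup_regions)  # no duplicated keys: (year, region) is unique in `regions`
dup_workers        # duplicated keys: several workers share a (year, region) pair
```

Keys that are unique on one side and repeated on the other correspond to the many-to-one situation described above.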

> **Tip**: If we do not have any common identifiers between our data sets, but do have variables which express the exact same information, we can simply rename one of the variables so that they are identical.
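
A minimal sketch of this tip, using hypothetical toy data in which the same information is stored as `region` in one data frame and `region_code` in the other. It also shows an alternative: the `dplyr` join functions accept a named vector in `by`, so renaming is optional.

```r
library(dplyr)

# Hypothetical toy data: same information, different column names
workers <- data.frame(workerid = c(1, 2), region = c(1, 2))
region_names <- data.frame(region_code = c(1, 2), region_name = c("North", "South"))

# Option 1: rename so that the key columns have identical names
region_names_renamed <- rename(region_names, region = region_code)
merged_a <- left_join(workers, region_names_renamed, by = "region")

# Option 2: keep the original names and state which columns correspond
merged_b <- left_join(workers, region_names, by = c("region" = "region_code"))

merged_a
```

Both options should produce the same merged data frame in this example.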

#### 7.3.1 Merge with `left_join`
A left join merges two data sets along one or more "keys", keeping every observation from the data set specified first in the function, whether or not it finds a match, and discarding the unmatched observations from the data set specified second.

In [None]:
left_join(fake_data, region_year_data, by = c("year", "region"))

Notice here that this function preserves all rows in the first data set, in this case the "fake_data"  data set, no matter what. The only rows of the second data set, "region_year_data", which are kept are those which can be matched to a corresponding row from the first with identical key values (identical values for _year_ and _region_). A direct partner to this function is the `right_join` function, which operates identically but in reverse. That is, it keeps all observations in the second data set and keeps only those in the first which found a match with the second based on the identifier columns specified.
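
The difference between the two functions is easier to see on a small example. The data frames `left_df` and `right_df` below are hypothetical toy data in which only ids 2 and 3 appear on both sides.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
left_df <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right_df <- data.frame(id = c(2, 3, 4), y = c(20, 30, 40))

# Keeps every row of left_df; id 1 has no match, so its y is NA
left_result <- left_join(left_df, right_df, by = "id")
left_result

# Keeps every row of right_df; id 4 has no match, so its x is NA
right_result <- right_join(left_df, right_df, by = "id")
right_result
```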

#### 7.3.2 Merge with `inner_join`
The inner join merge is a type of merge whereby we keep only observations which have found a match between the two data sets. In this way, this function necessarily discards as many or more observations than the other types of merges.

In [None]:
inner_join(fake_data, region_year_data, by = c("year", "region"))

We can see that this function matched many identical _region_ and _year_ pairings to different workers. That is because there are many workers who have data reported for the same year and same region (i.e. many different workers in "fake_data" have earnings recorded for 1999 in region 1). In some data sets, however, especially those which are not as large as "fake_data", we will lose many observations with `inner_join`, since this function only preserves observations which can be matched across the key/s specified in both data sets.
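
Before relying on `inner_join`, it can be useful to see which observations would be lost. One way to do this, sketched below on hypothetical toy data, is `anti_join`, a `dplyr` function not otherwise covered in this module that returns the rows of its first argument with no match in its second.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
left_df <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right_df <- data.frame(id = c(2, 3, 4), y = c(20, 30, 40))

# Only ids 2 and 3 appear in both data frames
kept <- inner_join(left_df, right_df, by = "id")

# Rows that inner_join discards from each side
dropped_left <- anti_join(left_df, right_df, by = "id")   # id 1
dropped_right <- anti_join(right_df, left_df, by = "id")  # id 4

kept
```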

#### 7.3.3 Merge with `full_join`
This is the function that is closest to appending data horizontally. The process of full join ensures that all observations from both data sets are maintained; if observations from one data set do not find a match, they simply take on values of NA for the newly merged variables from the other data set.

In [None]:
full_join(fake_data, region_year_data, by = c("year", "region"))

We can see that this function left many observations from our "fake_data" data set with missing values for variables from our "region_year_data" data set, such as _average_logearn_ and _n_. This is because the "fake_data" data set has observations for workers in years which are not included in the "region_year_data" data set (since the former records information from 1982 on and the latter records information from 1998 on). In this way, while `full_join` typically retains the highest number of observations, it fills our data set with many missing values.

When deciding which merge method to use, it is important to consider whether any observations will not find a match, which data sets these "unmatched" observations are in, and whether we would like these observations to be recorded as missing or dropped. If we wish to drop unmatched observations in all cases, `inner_join` is most appropriate. If we have two data sets and want to keep every observation from the first while dropping unmatched observations solely from the second, `left_join` is most appropriate (and correspondingly `right_join` if we want to drop unmatched observations solely from the first). Finally, if we want to keep all observations no matter what and have unmatched observations automatically marked with missing values for variables for which they have no recorded information, we should use `full_join`. In all cases, unmatched observations are observations in a data set which do not share the same recorded value for the specified key/s (common identifier/s) with the data set they are being merged with.
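
One quick way to compare the four merges is to apply each of them to the same pair of data sets and count the resulting rows. The sketch below does this for hypothetical toy data in which each data frame has one unmatched observation.

```r
library(dplyr)

# Hypothetical toy data: ids 2 and 3 match, id 1 and id 4 do not
left_df <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right_df <- data.frame(id = c(2, 3, 4), y = c(20, 30, 40))

# Number of rows produced by each type of join
row_counts <- c(
    left = nrow(left_join(left_df, right_df, by = "id")),
    right = nrow(right_join(left_df, right_df, by = "id")),
    inner = nrow(inner_join(left_df, right_df, by = "id")),
    full = nrow(full_join(left_df, right_df, by = "id"))
)
row_counts
```

Here the inner join returns the fewest rows and the full join the most, with the left and right joins in between.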

## Exercise 3
Study the Hallways data frame concerning information about hallways in a building below.

In [8]:
names = c("North", "South", "East", "West")
doors = c(12, 5, 8, 9)
skylight = c(0, 1, 0, 0)

Hallways <- data.frame(Name = names, Number_of_Doors = doors, Skylight = skylight)
Hallways

Name,Number_of_Doors,Skylight
<chr>,<dbl>,<dbl>
North,12,0
South,5,1
East,8,0
West,9,0


Now look at the following data set containing information about the number of bathrooms in hallways in the same building.

In [9]:
names = c("West", "East", "North", "Upstairs Left", "Upstairs Right")
bathrooms = c(1, 2, 3, 3, 2)

Hallways2 <- data.frame(Name = names, Number_of_Bathrooms = bathrooms)
Hallways2

Name,Number_of_Bathrooms
<chr>,<dbl>
West,1
East,2
North,3
Upstairs Left,3
Upstairs Right,2


Complete the code below with the appropriate function to create a data frame with information about skylights, the number of doors, and number of bathrooms for just hallways "North", "East", and "West".

In [None]:
answer_3 <- ???(Hallways, Hallways2, by = "Name") # replace the ??? here with your function
answer_3

test_3()

## Exercise 4

Now look at this new data set containing more information about the building. Notice that information for the "South" hallway is not available.

In [10]:
names = c("West", "West", "East", "East", "North", "North")
year = c(2018, 2020, 2018, 2020, 2018, 2020)
closed = c(0, 1, 1, 1, 0, 1)

Hallways3 <- data.frame(Name = names, Year = year, Closed = closed)
Hallways3

Name,Year,Closed
<chr>,<dbl>,<dbl>
West,2018,0
West,2020,1
East,2018,1
East,2020,1
North,2018,0
North,2020,1


Now fill in the code with the appropriate function to create a data frame with information about the skylights, number of doors, and status of being open or closed for all 4 hallways in 2018 and 2020. You will have missing information for the year and open/closed status of the "South" hallway.

In [None]:
answer_4 <- ???(Hallways, Hallways3, by = "Name") # replace the ??? with your function here
answer_4

test_4()

## 7.4 Wrap Up

In this module, we learned how to combine different data sets. The most important lesson we should take away from this module is that we can append data sets vertically when they have identical variables and horizontally when they have identical observations (and when these variables and observations are identically ordered in both data sets). More generally, however, we want to merge different variables (columns) between two data sets using common identifier variables. We have a series of four types of merges we can use to accomplish this, each of which treats unmatched observations differently.

As a final note, throughout this module we used the `dplyr` join functions. However, base R has a `merge` function which can accomplish all of the joins we have discussed. We didn't cover this function in detail because it tends to operate much more slowly on large data sets. If you wish to learn more about this function, you can view its documentation by running the code cell below!

In [None]:
?merge
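
As a rough guide, the four joins map onto the `all`, `all.x`, and `all.y` arguments of `merge`. The sketch below uses hypothetical toy data and is only meant to illustrate the correspondence.

```r
# Hypothetical toy data, for illustration only
left_df <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right_df <- data.frame(id = c(2, 3, 4), y = c(20, 30, 40))

m_inner <- merge(left_df, right_df, by = "id")                # similar to inner_join
m_left <- merge(left_df, right_df, by = "id", all.x = TRUE)   # similar to left_join
m_right <- merge(left_df, right_df, by = "id", all.y = TRUE)  # similar to right_join
m_full <- merge(left_df, right_df, by = "id", all = TRUE)     # similar to full_join

m_full
```

Note that `merge` sorts the result by the key columns by default, so the row order can differ from what the join functions return.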

In the next module, we will look at graphing in R: the main types of graphs we can create, how to save these graphs, and best practices for data visualization more generally.