# ECON 490: Combining Datasets (7)
---
## Prerequisites: 
---
1. Import datasets in csv and dta format. 
2. Create new variables for a variety of purposes. 
3. Use `group_by` and other functions to conduct group level analysis.

## Learning Objectives:
---
- Append new observations and variables to an already existing dataset using `rbind` and `cbind`.
- Merge variables and their values from one dataset into another using `left_join`, `right_join`, `inner_join`, and `full_join`.

In [1]:
library(haven)
library(tidyverse)



We'll continue working with the fake data dataset introduced in the previous lecture. Recall that this dataset simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

In [None]:
fake_data <- read_csv("../econ490-stata/fake_data.csv")  # change me!

Since we are working with multiple datasets in this module, we will also import the region year dataset below. This dataset is much smaller and gives the average log earnings and total number of people employed in each region for a series of years.

In [None]:
region_year_data <- read_dta("../econ490-stata/region_year_data.dta") # change me!

Often we will need to draw on data from multiple datasets such as these. Most of the time, these datasets will be available for download in different files (each for a given year, month, country, etc.) and may store different variables or observations. Thus, if we want to compile them we need to combine them into the same data frame.

There are two key ways of combining data, each reflecting different goals:

1. When we want to paste data directly beside or under our existing dataset, we call this **appending** data.
    * If you think of a dataset as a spreadsheet, this is like taking one dataset and "pasting" it into the bottom of another to add more observations, or pasting one dataset directly beside another to add more variables. We do this when two datasets have identical columns/variables (so that we can stack them vertically) or identical number of observations (so that we can stick them beside each other horizontally).
2. When we want to add new variables and their data from another dataset into our existing dataset, we call this **merging** data.
    * This is like looking up values in a table and then adding a column; in Excel, this is called a `VLOOKUP`. Importantly, we can only merge datasets that share a common column, or key, which identifies observations with particular values. For example, if we want to merge in data from a different year but for the same people (observations) as those we are currently working with, the datasets will usually have an identifying number for each person that functions as our key when merging. Unlike with appending, this does not require column names or numbers of observations to be identical.

## 7.1: Appending Datasets
---

### 7.1.1: Append vertically with `rbind`
Let's say that our `fake_data` dataset is inexplicably missing 3 observations for worker 1; specifically, the earnings for this worker for the years 2003, 2005, and 2007 are missing. However, let's say these observations exist in another dataset, `missing_data`, which we can append to our `fake_data` dataset since it contains all of the same variables. We can inspect this small dataframe below.

In [2]:
missing_data <- data.frame(workerid = c(1, 1, 1), year = c(2003, 2005, 2007), sex = c("M", "M", "M"), 
                           birth_year = c(1944, 1944, 1944), age = c(59, 61, 63), start_year = c(1997, 1997, 1997),
                           region = c(1, 1, 1), treated = c(0, 0, 0), earnings = c(30000, 35000, 36000))

missing_data

workerid,year,sex,birth_year,age,start_year,region,treated,earnings
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,2003,M,1944,59,1997,1,0,30000
1,2005,M,1944,61,1997,1,0,35000
1,2007,M,1944,63,1997,1,0,36000


To append these three rows to the bottom of our dataset, we can simply use the `rbind` function (row bind). This function allows us to bind together datasets vertically, with the dataset specified second being placed directly underneath the dataset specified first. In this way, we can combine datasets vertically if they share the same column names.

In [None]:
fake_data <- rbind(fake_data, missing_data)

tail(fake_data)

This is a fast way of concatenating datasets vertically. We can see that it also does not require us to have a designated "master" and "using" dataset. We can have both datasets stored in our notebook and view them simultaneously, making the process of appending datasets simpler, especially if we want to check for identical column names or missing values.
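As a minimal sketch of the column-name check mentioned above, the example below uses two small toy data frames (`main_df` and `extra_df` are hypothetical names, not part of `fake_data`). It relies on the documented behaviour that `rbind` on data frames matches columns by name rather than by position.

```r
# Toy data frames (hypothetical values, for illustration only)
main_df <- data.frame(workerid = c(1, 2), year = c(2001, 2001), earnings = c(100, 200))
extra_df <- data.frame(year = 2002, workerid = 1, earnings = 150)  # same columns, different order

# Columns present in one data frame but not the other (both should be empty)
only_in_main <- setdiff(names(main_df), names(extra_df))
only_in_extra <- setdiff(names(extra_df), names(main_df))

# rbind on data frames aligns columns by name, so the differing order is harmless
combined <- rbind(main_df, extra_df)
combined
```

If either `setdiff` call returns a column name, `rbind` will stop with an error, so it is worth fixing the names before appending.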

### 7.1.2: Append horizontally with `cbind`
We may also want to concatenate datasets horizontally. Suppose that we have a new variable, `religious`, which is a dummy coded as 1 if the person self-identified as religious in that year and 0 if not. This single-column dataframe is created below.

In [2]:
set.seed(123)

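# one draw per row of fake_data, so that cbind below finds matching row counts
# (the original notebook hard-coded this count as 2861772)
missing_data2 <- data.frame(religious = sample(0:1, nrow(fake_data), replace = TRUE))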

head(missing_data2)

Assuming it is ordered identically to our `fake_data` dataset with respect to participants, we can simply bind this column to our existing dataset using the `cbind` function.

In [None]:
fake_data <- cbind(fake_data, missing_data2)

head(fake_data)

We can see that this function appended our `religious` variable to the dataset. However, it required us to have an identical number of observations between the two dataframes, and for both dataframes to be ordered identically with respect to people. Often this is not the case, so we must turn to a more commonly used and slightly more challenging concept: merging datasets.
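The ordering requirement is easy to overlook because `cbind` never checks it. The sketch below, built on small hypothetical data frames (`left_df`, `new_col`, and `reordered` are illustrative names), shows that `cbind` pairs rows purely by position: the call succeeds whether or not the rows actually line up.

```r
# Toy data frames (hypothetical values, for illustration only)
left_df <- data.frame(workerid = c(1, 2, 3), earnings = c(100, 200, 300))
new_col <- data.frame(religious = c(0, 1, 1))

# cbind needs equal row counts, so check this before binding
stopifnot(nrow(left_df) == nrow(new_col))
bound <- cbind(left_df, new_col)

# The same values in a different order bind without any warning,
# silently attaching the wrong value to workers 1 and 3
reordered <- data.frame(religious = rev(new_col$religious))
bound_wrong <- cbind(left_df, reordered)
```

Comparing `bound` and `bound_wrong` side by side makes the risk concrete: nothing in the output signals that the second result is misaligned.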

## 7.2: Merging Datasets 
---
Merging datasets means matching existing observations between datasets along specific variables, typically in order to add more information about existing participants to our current dataset. This process, also known in R as joining data, is more complicated than simply appending data. Luckily, we have four functions with descriptive names which help to crystallize this process for us depending on how we want to merge two datasets. Before we start, we should look at the structure of each dataset.

In [None]:
head(fake_data)

head(region_year_data)

To do a merge of any type, we need to specify a "key": a variable (or set of variables) on which we will merge our datasets. It is best to choose keys which uniquely identify each observation in at least one of the datasets; if the keys are not unique in either dataset, every matching pair of rows is returned and observations end up duplicated. We can guess from our knowledge of the dataset that every combination of `workerid` and `year` returns a unique observation in the `fake_data` dataset. Looking at the `region_year_data` dataset above, we can see that every combination of `year` and `region` identifies a unique observation in that dataset. This second dataset, however, does not have the `workerid` variable, while the first dataset has all three of `workerid`, `year`, and `region`. Since the identifiers common to both datasets are `year` and `region`, we will use these as our keys within the join functions. Since there are many observations with identical years and regions within the `fake_data` dataset, we will be doing something similar to an m:1 merge in Stata. The join function we choose then determines how matched and unmatched observations are treated.
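Rather than guessing, we can check whether a set of keys uniquely identifies observations. The sketch below does this on a toy stand-in for `region_year_data` (the data frame `toy_region_year` and its values are hypothetical); the same pattern should work on the real datasets, assuming they contain the columns named.

```r
library(dplyr)

# Toy stand-in for region_year_data (hypothetical values)
toy_region_year <- data.frame(
  year = c(1998, 1998, 1999, 1999),
  region = c(1, 2, 1, 2),
  total_employment = c(50, 60, 55, 65)
)

# Count rows per key combination; any n > 1 means the key is not unique
duplicate_keys <- toy_region_year %>%
  count(year, region) %>%
  filter(n > 1)

# Zero rows here means year + region uniquely identify each observation
nrow(duplicate_keys)
```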

> **Tip**: If we do not have any common identifiers between our datasets, but do have variables which express the exact same information, we can simply rename one of the variables so that they are identical.
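A short sketch of this tip, using hypothetical toy data in which the year variable is called `yr` in one dataset: we can either rename the variable with `rename`, or leave the names alone and tell the join which columns correspond through a named `by` vector.

```r
library(dplyr)

# Toy data frames (hypothetical values); the year column is named differently in each
workers <- data.frame(workerid = c(1, 2), year = c(1999, 1999), region = c(1, 2))
regions <- data.frame(yr = c(1999, 1999), region = c(1, 2), total_employment = c(55, 65))

# Option 1: rename so that both data frames share the key name
regions_renamed <- rename(regions, year = yr)
merged_a <- left_join(workers, regions_renamed, by = c("year", "region"))

# Option 2: map the names inside `by` (name in first dataset = name in second)
merged_b <- left_join(workers, regions, by = c("year" = "yr", "region" = "region"))
```

Both options attach `total_employment` to each worker; the second keeps the key name from the first dataset.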

### 7.2.1: Merge with `left_join`
The left join merge is a type of merge whereby we merge two datasets along one or more "keys", keeping all observations from the dataset specified first in the function (whether or not they find a match) and discarding all unmatched observations from the dataset specified second.

In [None]:
left_join(fake_data, region_year_data, by = c("year", "region"))

Notice here that this function preserves all rows in the first dataset, in this case the `fake_data` dataset, no matter what. The only rows of the second dataset, `region_year_data`, which are kept are those which can be matched to a corresponding row from the first with identical key values (identical values for `year` and `region`). A direct partner to this function is the `right_join` function, which operates identically but in reverse. That is, it keeps all observations in the second dataset and keeps only those in the first which found a match with the second based on the identifier columns specified.
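To see the difference between the two functions on something small enough to read in full, the sketch below uses two hypothetical toy data frames (`workers` and `regions`) in which one worker has no matching region-year and one region-year has no matching worker.

```r
library(dplyr)

# Toy data frames (hypothetical values, for illustration only)
workers <- data.frame(
  workerid = c(1, 2, 3),
  year = c(1997, 1999, 1999),
  region = c(1, 1, 2)
)
regions <- data.frame(
  year = c(1999, 1999, 2000),
  region = c(1, 2, 1),
  total_employment = c(55, 65, 58)
)

# Keeps all three workers; worker 1 (1997) gets NA for total_employment
left_result <- left_join(workers, regions, by = c("year", "region"))

# Keeps all three region-years; the 2000 row gets NA for workerid
right_result <- right_join(workers, regions, by = c("year", "region"))
```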

### 7.2.2: Merge with `inner_join`
The inner join merge is a type of merge whereby we keep only observations which have found a match between the two datasets. In this way, this function necessarily discards as many or more observations than the other types of merges.

In [None]:
inner_join(fake_data, region_year_data, by = c("year", "region"))

We can see that this function matched many identical region and year pairings to different workers. That is because there are many workers who have data reported for the same year and same region (i.e., many different workers in `fake_data` have earnings recorded for 1999 in region 1). In some datasets, however, especially those which are not as large as `fake_data`, we will lose many observations with `inner_join`, since this function only preserves observations which can be matched across the key/s specified in both datasets.
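Before committing to an inner join, it can help to preview which observations would be lost. One way to do this, sketched here on hypothetical toy data, is dplyr's `anti_join`, which returns the rows of the first dataset that have no match in the second.

```r
library(dplyr)

# Toy data frames (hypothetical values, for illustration only)
workers <- data.frame(
  workerid = c(1, 2, 3),
  year = c(1997, 1999, 1999),
  region = c(1, 1, 2)
)
regions <- data.frame(
  year = c(1999, 1999, 2000),
  region = c(1, 2, 1),
  total_employment = c(55, 65, 58)
)

# Only the two workers observed in 1999 find a match
inner_result <- inner_join(workers, regions, by = c("year", "region"))

# Rows of workers that inner_join would drop (here, worker 1 in 1997)
dropped_workers <- anti_join(workers, regions, by = c("year", "region"))
```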

### 7.2.3: Merge with `full_join`
This is the function that is closest to appending data horizontally. The process of full join ensures that all observations from both datasets are maintained; if observations from one dataset do not find a match, they simply take on values of NA for the newly merged variables from the other dataset.

In [None]:
full_join(fake_data, region_year_data, by = c("year", "region"))

We can see that this function left many observations from our `fake_data` dataset with missing values for variables from our `region_year_data` dataset such as `avg_log_earnings` and `total_employment`. This is because the `fake_data` dataset has observations for workers in years which are not included in the `region_year_data` dataset (since the former records information from 1982 on and the latter records information from 1998 on). In this way, while `full_join` typically retains the highest number of observations, it fills our dataset with many missing values.

When choosing which merge method to use, it is important to consider whether any observations will not find a match, which datasets these "unmatched" observations are in, and whether we would like these observations to be recorded as missing or dropped. If we wish to drop unmatched observations in all cases, `inner_join` is most appropriate. If we want to keep every observation in the first dataset and drop unmatched observations solely from the second, `left_join` is most appropriate (and correspondingly `right_join` if we want to drop unmatched observations solely from the first). Finally, if we want to keep all observations no matter what and have unmatched observations automatically marked with missing values for variables for which they have no recorded information, we should use `full_join`. In all cases, unmatched observations refer to observations in a dataset which do not share the same recorded value for the specified key/s (common identifier/s) with the dataset they are being merged with.
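One quick way to compare the four options is to run each on the same pair of datasets and look at the resulting row counts. The sketch below does this with hypothetical toy data in which two workers and one region-year have no match.

```r
library(dplyr)

# Toy data frames (hypothetical values): workers 1 and 2 are observed in 1997,
# which is not in regions; the 2000 region-year has no workers
workers <- data.frame(
  workerid = c(1, 2, 3, 4),
  year = c(1997, 1997, 1999, 1999),
  region = c(1, 2, 1, 2)
)
regions <- data.frame(
  year = c(1999, 1999, 2000),
  region = c(1, 2, 1),
  total_employment = c(55, 65, 58)
)

keys <- c("year", "region")

# Number of rows returned by each type of join
row_counts <- c(
  left = nrow(left_join(workers, regions, by = keys)),
  right = nrow(right_join(workers, regions, by = keys)),
  inner = nrow(inner_join(workers, regions, by = keys)),
  full = nrow(full_join(workers, regions, by = keys))
)
row_counts
```

Here the inner join returns the fewest rows and the full join the most, with the left and right joins in between depending on which dataset holds the unmatched observations.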

## 7.3: Wrap Up
---

In this module, we learned how to combine different datasets. The most important lesson we should take away from this module is that we can append datasets vertically when they have identical variables and horizontally when they have identical observations (and when these variables and observations are identically ordered in both datasets). More generally, however, we want to merge different variables (columns) between two datasets using common identifier variables. We have a series of four types of merges we can use to accomplish this, each of which treats unmatched observations differently.

As a final note, throughout this module we used the dplyr join functions. However, base R has a `merge` function which can accomplish all of the joins we have discussed. We didn't cover this function in detail because it tends to operate more slowly on large datasets. If you wish to learn more about this function, you can view its documentation by running the code cell below!

In [None]:
?merge
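For reference, the sketch below shows how each join discussed in this module can be approximated with `merge`, using its `all`, `all.x`, and `all.y` arguments. The toy data frames are hypothetical. Note that `merge` sorts the result by the key columns by default, so row order may differ from the dplyr joins.

```r
# Toy data frames (hypothetical values, for illustration only)
workers <- data.frame(
  workerid = c(1, 2, 3, 4),
  year = c(1997, 1997, 1999, 1999),
  region = c(1, 2, 1, 2)
)
regions <- data.frame(
  year = c(1999, 1999, 2000),
  region = c(1, 2, 1),
  total_employment = c(55, 65, 58)
)

keys <- c("year", "region")

inner_base <- merge(workers, regions, by = keys)                # like inner_join
left_base  <- merge(workers, regions, by = keys, all.x = TRUE)  # like left_join
right_base <- merge(workers, regions, by = keys, all.y = TRUE)  # like right_join
full_base  <- merge(workers, regions, by = keys, all = TRUE)    # like full_join
```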