# ECON 490: Merge and Append (8)

## Prerequisites 
---
1. Effectively use Stata do-files and generate log-files.
2. Be able to change your directory so that Stata can find your files.
3. Import datasets in csv and dta format. 
4. Save data files. 
5. Create new variables using the command `egen` and the pre-commands `by` and `bysort`.

## Learning Outcomes
---
1. Adding new variables to an existing dataset (merge).
2. Adding new observations to already existing variables (append).

## 8.1 Introduction to Merge and Append

When working with data sets, we often need to merge or append our data with other data sets. For example, imagine that you want to do one of the following:

- You want to run a regression that has national fertility rate as the main dependent variable and GDP/capita as an explanatory variable. You have one macro data set that has three variables (country, year, and fertility rate) and a second macro data set that also has three variables (country, year, and GDP/capita). To do your research you would need to merge these two data sets to create a final data set. That final data set would have the same number of observations as the initial data set(s), but now with four variables: country, year, fertility rate, and GDP/capita.

- You want to run a regression that has number of births as the main dependent variable and education level of the mother as an explanatory variable. You have two such micro data sets, one from Canada and one from the US, and you want to combine them into one data set that includes observations from both countries. To do your research you would need to take one data set (say, the Canadian data) and append the second data set (here, the US data). This final data set would have the same number of variables as the initial data set(s), but the number of observations would be the number of observations in the Canadian data set plus the number of observations in the US data set.

Here you will be learning how to undertake these two approaches to combining data sets: merge and append. 

We'll continue working with the fake_data dataset introduced in the previous lecture. Recall that this dataset simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

In [4]:
clear*

cd "/Users/marinaadshade/Documents/TELF Project/raw"
use fake_data, clear



/Users/marinaadshade/Documents/TELF Project/raw



## 8.2 Getting Ready to Merge and Append

Before introducing the command `merge`, we need to follow the steps below in order to properly combine datasets.

#### 8.2.1 Check the data set's unique identifiers 

The key to merging data sets is to understand which variables uniquely identify each observation.

Let's look at our data. 

In [5]:
%browse 10

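|    | workerid | year | sex | birth_year | age | start_year | region | treated | earnings  |
|----|----------|------|-----|------------|-----|------------|--------|---------|-----------|
| 1  | 1        | 1999 | M   | 1944       | 55  | 1997       | 1      | 0       | 39975.008 |
| 2  | 1        | 2001 | M   | 1944       | 57  | 1997       | 1      | 0       | 278378.06 |
| 3  | 2        | 2001 | M   | 1947       | 54  | 2001       | 4      | 0       | 18682.6   |
| 4  | 2        | 2002 | M   | 1947       | 55  | 2001       | 4      | 0       | 293336.41 |
| 5  | 2        | 2003 | M   | 1947       | 56  | 2001       | 4      | 0       | 111797.26 |
| 6  | 3        | 2005 | M   | 1951       | 54  | 2005       | 5      | 0       | 88351.672 |
| 7  | 3        | 2010 | M   | 1951       | 59  | 2005       | 5      | 0       | 46229.574 |
| 8  | 4        | 1997 | M   | 1952       | 45  | 1997       | 5      | 1       | 24911.029 |
| 9  | 4        | 2001 | M   | 1952       | 49  | 1997       | 5      | 1       | 9908.3623 |
| 10 | 5        | 2009 | M   | 1954       | 55  | 1998       | 2      | 1       | 137207.34 |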


Here we can see that each observation in the fake_data dataset is identified by the variables *workerid* and *year* (worker-year pairs). 

We can check to see if this is correct using the command `duplicates report`.

In [6]:
duplicates report workerid year


Duplicates in terms of workerid year

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |      2861772             0
--------------------------------------


What this table shows is that there are 2,861,772 workerid-year combinations, which is exactly equal to the number of observations. Every combination appears exactly once (one copy, zero surplus), so every observation corresponds to one worker in one particular year.

Now let's take a look at a different data set that is stored in the same folder.

In [8]:
use region_year_data, clear

In [9]:
%browse 10

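|    | year | region | avg_log_earnings | total_employment |
|----|------|--------|------------------|------------------|
| 1  | 1998 | 1      | 10.506687        | 30004            |
| 2  | 1999 | 1      | 10.513171        | 31367            |
| 3  | 2000 | 1      | 10.511585        | 33429            |
| 4  | 2001 | 1      | 10.550608        | 34547            |
| 5  | 2002 | 1      | 10.529206        | 35503            |
| 6  | 2003 | 1      | 10.615291        | 35809            |
| 7  | 2004 | 1      | 10.558952        | 36161            |
| 8  | 2005 | 1      | 10.538996        | 36966            |
| 9  | 2006 | 1      | 10.511196        | 38161            |
| 10 | 2007 | 1      | 10.525853        | 38051            |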


In this case, it seems that every observation corresponds to a region and year combination. Again, we can use `duplicates report` to see if the variables *region* and *year* uniquely identify all observations.

In [10]:
duplicates report region year


Duplicates in terms of region year

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |           70             0
--------------------------------------


The table shows that there is not a single case of repeated observations. Hence, we will refer to these variables as the "unique identifiers".
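As a complementary check, Stata's `isid` command stays silent when the listed variables uniquely identify the observations and exits with an error when they do not. The sketch below uses a tiny made-up worker-year dataset (all values are hypothetical) rather than fake_data, so that it can be run on its own.

```stata
* Toy worker-year data; all values are invented for illustration
clear
input int workerid int year float earnings
1 1999 100
1 2001 120
2 2001  90
2 2002  95
end

* isid is silent when the variables uniquely identify every observation
isid workerid year

* workerid alone repeats, so isid fails; capture stores the return code in _rc
capture isid workerid
display "isid workerid return code: " _rc

* duplicates report shows how many observations share each workerid
duplicates report workerid
```

Placing `isid` at the top of a do-file is a convenient safeguard: the do-file stops early if the identifiers are not what you expect.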

#### 8.2.2 Identify the "master" and "using" data sets

When merging data we need to decide which data set will be the primary data set (Stata refers to this data set as "master") and which will be the secondary data set (Stata refers to this data set as "using"). Most of the time it will not matter which is the master and which is the using data set. But you will need to know which is which in order to properly interpret the results.

#### 8.2.3 Identify the matching observations 

There are three main ways to match observations. In a `1:1` merge, both data sets share the same unique identifiers, so one observation in the master dataset is matched to one observation in the using dataset. In an `m:1` merge, multiple observations in the master dataset are matched to one observation in the using dataset. In a `1:m` merge, one observation in the master dataset is matched to multiple observations in the using dataset.
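One way to decide which case applies is to count how many observations share each value of the linking variables in each dataset. The sketch below does this with `bysort` on two small made-up datasets (the variable `n_per_key` and all values are hypothetical).

```stata
* Toy region-year data (candidate "using" data); values are invented
clear
input int region int year float avg_earn
1 2000 50
1 2001 55
2 2000 40
end
bysort region year: generate n_per_key = _N
summarize n_per_key   // maximum of 1: each region-year appears once

* Toy worker-level data (candidate "master" data)
clear
input int workerid int region int year
1 1 2000
2 1 2000
3 1 2001
4 2 2000
end
bysort region year: generate n_per_key = _N
summarize n_per_key   // maximum above 1: workers share a region-year

* Linking variables repeat in the master and are unique in the using data,
* so this pair of datasets calls for an m:1 merge on region and year
```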

## 8.3 Merging Data Sets

Once we know the unique identifiers, the master and using data sets, and what type of match we are doing we are able to merge the data sets. 

We begin with the master data set open in the current Stata session. For the sake of showing an example, let's suppose we want to set fake_data as the master dataset, and use region_year_data as the using dataset.

We already know that fake_data's unique identifiers are *workerid* and *year*, while the unique identifiers of region_year_data are *region* and *year*. The variables we use to link both data sets have to be identifiers that are present in both data sets. Because *workerid* does not exist in the region-level data set, we will use the variables *region* and *year* to merge the data sets.

This means that for every region-year combination in the using data set there will be many observations in the individual-level (master) data set to be matched. Therefore, this will be an `m:1` merge.

In [11]:
use fake_data, clear  // This sets this data set as the master

In [12]:
merge m:1 region year using region_year_data 


    Result                      Number of obs
    -----------------------------------------
    Not matched                       406,963
        from master                   406,963  (_merge==1)
        from using                          0  (_merge==2)

    Matched                         2,454,809  (_merge==3)
    -----------------------------------------


Let's analyze the table above. It says that there were 406,963 observations in the master data that could not be matched to any observation in the using dataset. This is due to the fact that our dataset at the region-year level does not have information for some years.

Furthermore, the table shows that every observation from the using dataset was matched to some observation in the master dataset. The total number of matched observations is roughly 2.5 million. All of this information is recorded in a new variable named *_merge*. Because `merge` stops with an error if a variable named *_merge* already exists, it is good practice to write `cap drop _merge` before running a merge command.
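The role of *_merge* can be seen on a small scale. The sketch below builds two made-up datasets (all names and values are hypothetical), merges them, and shows why dropping *_merge* matters before a second merge. It is meant to be run as a single do-file, because the temporary file name is stored in a local macro.

```stata
* Toy using data at the region-year level (values invented)
clear
input int region int year float avg_earn
1 2000 50
1 2001 55
2 2000 40
end
tempfile regions
save `regions'

* Toy master data at the worker level
clear
input int workerid int region int year
1 1 2000
2 1 2000
3 1 2001
4 2 2000
5 2 2001
end

merge m:1 region year using `regions'
tabulate _merge       // worker 5 (region 2, year 2001) has no match

* A second merge fails while _merge is still in memory ...
capture merge m:1 region year using `regions'
display "return code with _merge present: " _rc

* ... so drop it first; capture avoids an error when _merge is absent
capture drop _merge
merge m:1 region year using `regions'
```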

Would we get the same results if we switched the master and using datasets?

In [13]:
use region_year_data, clear
merge 1:m region year using fake_data




    Result                      Number of obs
    -----------------------------------------
    Not matched                       406,963
        from master                         0  (_merge==1)
        from using                    406,963  (_merge==2)

    Matched                         2,454,809  (_merge==3)
    -----------------------------------------


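Indeed, we get the same counts. The only difference is that the unmatched observations are now reported as coming from the using dataset rather than from the master. We typically want to restrict the sample to observations that were correctly matched across datasets.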

In [14]:
keep if _merge==3

(406,963 observations deleted)


<div class="alert alert-block alert-warning">
    
<b>Warning:</b> Before dropping the unmatched observations, make sure you spend some time thinking about why they did not merge, and correct any errors that you identify. For example, maybe the country names are different in the two data sets (e.g., one data set has "Barbados" and another data set has "The Barbados"). If this is the case, you will want to change the names and attempt your merge a second time.
</div>
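The situation described in the warning can be reproduced with a minimal sketch. The two country-level datasets below are made up (names and values are hypothetical); the point is to list the unmatched observations, fix the spelling, and merge again. It should be run as a single do-file, because temporary file names are stored in local macros.

```stata
* Using data: note the spelling "The Barbados" (values invented)
clear
input str20 country float gdp_pc
"The Barbados" 17.0
"Canada"       45.0
end
tempfile gdp
save `gdp'

* Master data spells the same country "Barbados"
clear
input str20 country float fertility
"Barbados" 1.6
"Canada"   1.5
end
tempfile fert
save `fert'

* First attempt: the two spellings do not match each other
merge 1:1 country using `gdp'
list country _merge if _merge != 3

* Harmonize the name in the using data, then merge again from scratch
use `gdp', clear
replace country = "Barbados" if country == "The Barbados"
save `gdp', replace

use `fert', clear
merge 1:1 country using `gdp'
tabulate _merge
```

Listing the observations with `_merge != 3` before dropping anything is what reveals the spelling problem in the first place.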

## 8.4 Appending Data Sets

We have used `merge` to combine datasets horizontally (we have added columns to the `master` dataset). If we instead want to combine datasets vertically (add observations to the `master` dataset), we use `append`. Adding new information with `append` is very simple compared to the previous command. When we have a master dataset opened in our session, we can add observations using the syntax:

```stata
    append using new_dataset
```

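This command will add new observations to the variables that are named *the same* across both datasets. A variable that exists in only one of the datasets is kept, with missing values for the observations that come from the other dataset.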
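As a small illustration of the Canada/US scenario from the introduction, the sketch below appends a made-up US dataset to a made-up Canadian one. All variable names and values are hypothetical, and the `generate()` option is used to mark which rows were appended. Run it as a single do-file, because the temporary file name is stored in a local macro.

```stata
* Toy US data (values invented); it has no province variable
clear
input int births int educ
2 12
1 16
end
generate str3 country = "US"
tempfile us
save `us'

* Toy Canadian data is the master and has one extra variable
clear
input int births int educ byte province
3 10 1
0 18 2
1 14 1
end
generate str3 country = "CAN"

* Stack the US observations under the Canadian ones;
* generate() adds a 0/1 marker: 0 = master rows, 1 = appended rows
append using `us', generate(from_us)
tabulate country from_us

* province exists only in the Canadian data, so it is missing for US rows
list country births educ province
```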

## 8.5 Wrap Up

In this module we learned how to combine different datasets. This is an extremely useful skill, especially when you are undertaking panel data regressions. We will be learning more about these types of regressions in [Module 16]().