# 8. Merge and Append 

In this module we'll learn how to combine different datasets. The new commands will help us deal with the following cases:
- Adding new variables to an existing dataset (merge).
- Adding new observations to already existing variables (append).



# 8.1 Merge 


Before introducing the command `merge`, we need to understand how the data is structured. Let us open the dataset used in the previous module.


In [1]:
clear*

use fake_data, clear

The key to merging datasets is knowing which variables *uniquely* identify each observation.

In [3]:
%browse 10

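|  | workerid | year | sex | birth_year | age | start_year | region | treated | earnings |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1999 | M | 1944 | 55 | 1997 | 1 | 0 | 39975.008 |
| 2 | 1 | 2001 | M | 1944 | 57 | 1997 | 1 | 0 | 278378.06 |
| 3 | 2 | 2001 | M | 1947 | 54 | 2001 | 4 | 0 | 18682.6 |
| 4 | 2 | 2002 | M | 1947 | 55 | 2001 | 4 | 0 | 293336.41 |
| 5 | 2 | 2003 | M | 1947 | 56 | 2001 | 4 | 0 | 111797.26 |
| 6 | 3 | 2005 | M | 1951 | 54 | 2005 | 5 | 0 | 88351.672 |
| 7 | 3 | 2010 | M | 1951 | 59 | 2005 | 5 | 0 | 46229.574 |
| 8 | 4 | 1997 | M | 1952 | 45 | 1997 | 5 | 1 | 24911.029 |
| 9 | 4 | 2001 | M | 1952 | 49 | 1997 | 5 | 1 | 9908.3623 |
| 10 | 5 | 2009 | M | 1954 | 55 | 1998 | 2 | 1 | 137207.34 |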


It seems like each observation is a worker-year pair. We need to check whether this is true or not using the `duplicates report` command.

In [4]:
duplicates report workerid year


Duplicates in terms of workerid year

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |      2861772             0
--------------------------------------


The table shows that there are 2,861,772 workerid-year combinations, each appearing exactly once (the `surplus` column is 0). This is exactly equal to the total number of observations, so every observation corresponds to one worker in one particular year.

Let's take a look at a different dataset now.

In [5]:
use region_year_data, clear

In [6]:
%browse 10

|  | year | region | avg_log_earnings | total_employment |
|---|---|---|---|---|
| 1 | 1998 | 1 | 10.506687 | 30004 |
| 2 | 1999 | 1 | 10.513171 | 31367 |
| 3 | 2000 | 1 | 10.511585 | 33429 |
| 4 | 2001 | 1 | 10.550608 | 34547 |
| 5 | 2002 | 1 | 10.529206 | 35503 |
| 6 | 2003 | 1 | 10.615291 | 35809 |
| 7 | 2004 | 1 | 10.558952 | 36161 |
| 8 | 2005 | 1 | 10.538996 | 36966 |
| 9 | 2006 | 1 | 10.511196 | 38161 |
| 10 | 2007 | 1 | 10.525853 | 38051 |


In this case, it seems that every observation corresponds to a region and year combination. Again, we can use `duplicates report` to see if the variables `region` and `year` uniquely identify all observations.

In [7]:
duplicates report region year


Duplicates in terms of region year

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |           70             0
--------------------------------------


Indeed! The table shows that no region-year combination appears more than once. From now on we will refer to variables like these as the *unique identifiers* of a dataset.
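
As a complement to `duplicates report`, Stata's `isid` command performs the same check in a single step: it runs silently when the listed variables uniquely identify the observations and stops with an error otherwise. The following is a minimal sketch on a made-up three-row panel (the values are assumptions for illustration, not taken from `fake_data`).

```stata
* Toy worker-year panel; the values are made up for illustration
clear
input workerid year earnings
1 1999 100
1 2001 120
2 2001  90
end

* Runs silently: workerid and year together identify every row
isid workerid year

* workerid alone does not; capture stores the error code instead of stopping
capture isid workerid
display _rc
```

A nonzero return code in the last line signals that `workerid` on its own is not a unique identifier in this toy data.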

## 8.1.1 Master and Using Datasets

Suppose we want to combine both datasets. First, we need to decide which one is going to be the main dataset (Stata refers to this dataset as `master`) and which one is the secondary dataset that we'll use to bring new variables into the master data. The latter is referred to as the `using` data.

## 8.1.2 How does matching observations work across datasets?


There are three main ways to match observations. The first case is when both datasets share the same unique identifiers, so one observation in the master dataset is matched to one observation in the using dataset (referred to as a `1:1` merge). The second case arises when multiple observations in the master dataset are matched to one observation in the using dataset (referred to as an `m:1` merge). The third case is the reverse: one observation in the master dataset is matched to multiple observations in the using dataset, which is known as a `1:m` merge.
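
To make the `m:1` case concrete, here is a small sketch with two made-up datasets: a region-level file with one row per region and a worker-level file where several workers share a region. All variable names and values (`avg_wage`, the temporary file `regions`) are hypothetical.

```stata
* Hypothetical region-level data (the using dataset), saved to a temporary file
clear
input region avg_wage
1 10
2 20
end
tempfile regions
save `regions'

* Hypothetical worker-level data (the master dataset)
clear
input workerid region
1 1
2 1
3 2
end

* Many workers per region in master, one row per region in using
merge m:1 region using `regions'
list
```

Workers 1 and 2 both receive the `avg_wage` value of region 1. If the region-level file were the master instead, the same match would be written as a `1:m` merge.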

## 8.1.3 Merge in practice

We begin by choosing the master dataset and opening it in the current Stata session. For the sake of example, suppose we set `fake_data.dta` as the master dataset and want to bring in the region-year level variables we computed in the other dataset.

This means that each region-year observation in the using dataset will be matched to many observations in the individual-level (master) dataset. Therefore, this is an `m:1` merge.


In [8]:
use fake_data, clear

The variables we use to link the two datasets must be present in both of them and, for an `m:1` merge, must uniquely identify the observations in the using dataset. In this case, these variables are `region` and `year` (notice that `workerid` does not exist in the region-level dataset).

In [10]:
merge m:1 region year using region_year_data


    Result                           # of obs.
    -----------------------------------------
    not matched                       406,963
        from master                   406,963  (_merge==1)
        from using                          0  (_merge==2)

    matched                         2,454,809  (_merge==3)
    -----------------------------------------


Let's analyze the table above. It says that 406,963 observations in the master data could not be matched to any observation in the using dataset. This is because our dataset at the region-year level does not have information for some years.

Furthermore, the table shows that every observation from the using dataset was matched to at least one observation in the master dataset. The total number of matched observations is roughly 2.5 million. All of this information is recorded in a new variable named `_merge`. Since `merge` stops with an error if a variable named `_merge` already exists, it is good practice to write `cap drop _merge` before running a merge command, or to use the `nogen` option so that `_merge` is not created at all.
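
The sketch below illustrates both ways of handling `_merge` when two merges are run in a row. The datasets and variable names (`avg_wage`, `population`, the temporary files `wages` and `pop`) are made up for illustration.

```stata
* Two hypothetical lookup files keyed by region
clear
input region avg_wage
1 10
2 20
end
tempfile wages
save `wages'

clear
input region population
1 500
2 800
end
tempfile pop
save `pop'

* Hypothetical master data
clear
input workerid region
1 1
2 2
end

* First merge creates _merge; tabulate it to inspect the match results
merge m:1 region using `wages'
tabulate _merge

* Drop _merge before merging again; nogenerate avoids creating it this time
capture drop _merge
merge m:1 region using `pop', nogenerate
```

After the second merge the data contain both `avg_wage` and `population`, and no `_merge` variable is left behind.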

Would we get the same results if we switched the master and using datasets?

In [13]:
use region_year_data, clear
merge 1:m region year using fake_data




    Result                           # of obs.
    -----------------------------------------
    not matched                       406,963
        from master                         0  (_merge==1)
        from using                    406,963  (_merge==2)

    matched                         2,454,809  (_merge==3)
    -----------------------------------------


Indeed, we get the same counts; the only difference is that the unmatched observations now come from the using dataset instead of the master. We typically want to keep only the observations that were matched across datasets.

In [14]:
keep if _merge==3

(406,963 observations deleted)
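
The merge and the `keep if _merge==3` step can also be combined through the `keep(match)` option of `merge`, optionally together with `nogenerate`. A minimal sketch on made-up data (the names and values are assumptions), where worker 3 lives in a region that is absent from the using data:

```stata
* Hypothetical region-level data without region 3
clear
input region avg_wage
1 10
2 20
end
tempfile regions
save `regions'

* Hypothetical worker-level data; worker 3 has no matching region
clear
input workerid region
1 1
2 2
3 3
end

* Keep only matched observations and skip creating _merge
merge m:1 region using `regions', keep(match) nogenerate
list
```

Only workers 1 and 2 remain. Keeping `_merge` and dropping unmatched rows by hand has the advantage that the unmatched observations can be inspected first.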


# 8.2 Append

Adding new observations is very simple compared to the previous command. When we have a master dataset open in our session, we can add the observations of another dataset using the `append` command.

The syntax is simpler than that of `merge`:

```
    append using new_dataset
```


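This command adds the new observations under the variables that are named *the same* in both datasets. A variable that exists in only one of the datasets is set to missing for the observations that come from the other.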
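
A small sketch of this behaviour, using two made-up datasets that share `workerid` and `year` but each have one variable the other lacks. The `generate()` option of `append` creates a marker for the origin of each row; the name `source` and the temporary file `new_rows` are hypothetical.

```stata
* Hypothetical dataset to be appended: has earnings but no sex
clear
input workerid year earnings
3 2005 300
end
tempfile new_rows
save `new_rows'

* Hypothetical master dataset: has sex but no earnings
clear
input workerid year str1 sex
1 1999 "M"
2 2001 "F"
end

* source is 0 for rows from master and 1 for rows from the appended file
append using `new_rows', generate(source)
list
```

In the result, `earnings` is missing for the two master rows and `sex` is empty for the appended row.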