# Merge Data Frame
> Full and Partial Match

Very often, we have data from multiple sources. To perform an analysis, we need to **merge** two dataframes together with one or more **common key variables**.

## Full match

A full match returns only the rows whose key value **has a counterpart** in the destination table. Rows without a match are not returned in the new data frame. A **partial** match, by contrast, keeps the unmatched rows and fills their missing values with **NA**.


We will start with a simple inner join, which selects the records that have matching values in both tables. To join two datasets, we can use the merge() function. We will use four arguments:

```r
merge(x, y, by.x = "key_column_in_x", by.y = "key_column_in_y")
```
### Arguments:
* x: The origin data frame
* y: The data frame to merge
* by.x: The name of the column in the x data frame to merge on
* by.y: The name of the column in the y data frame to merge on

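As a minimal sketch of how these arguments fit together, the snippet below joins two toy data frames whose key columns have different names. The data frames `left` and `right` and their columns are hypothetical and serve only to illustrate the call.

```r
# Hypothetical toy data frames, used only to illustrate the arguments
left  <- data.frame(id = c(1, 2, 3), score = c(90, 85, 70))
right <- data.frame(key = c(2, 3, 4), grade = c("B", "C", "D"))

# Inner join: keep only the rows whose key value appears in both data frames
joined <- merge(left, right, by.x = "id", by.y = "key")
joined
```

Only the ids 2 and 3 appear in both data frames, so the result has two rows, and the key column keeps the name given in by.x.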
### Example:

Create the first dataset with the variables:

* surname
* nationality

Create the second dataset with the variables:

* surname
* movies

The common key variable is surname. We can merge the two data frames and check that the result has dimensions 7x3 (7 rows, 3 columns).

We add stringsAsFactors=FALSE when creating the data frame because we don't want R to convert strings to factors; we want the variables to be treated as character.

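To make the effect of this argument concrete, here is a small sketch that builds the same column with and without factor conversion and inspects its class. The object names `with_factors` and `as_character` are hypothetical. Note that since R 4.0.0 the default is already stringsAsFactors = FALSE, so passing it explicitly mainly matters on older versions.

```r
# Same data, created once with factor conversion and once without
with_factors <- data.frame(surname = c("Spielberg", "Scorsese"),
                           stringsAsFactors = TRUE)
as_character <- data.frame(surname = c("Spielberg", "Scorsese"),
                           stringsAsFactors = FALSE)

class(with_factors$surname)  # "factor"
class(as_character$surname)  # "character"
```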
Create the origin data frame

In [3]:
producers <- data.frame(   
    surname =  c("Spielberg","Scorsese","Hitchcock","Tarantino","Polanski"),    
    nationality = c("US","US","UK","US","Poland"),    
    stringsAsFactors=FALSE)
producers

surname,nationality
<chr>,<chr>
Spielberg,US
Scorsese,US
Hitchcock,UK
Tarantino,US
Polanski,Poland


Create the destination data frame

In [6]:
movies <- data.frame(
    surname = c("Spielberg",
                "Scorsese",
                "Hitchcock",
                "Hitchcock",
                "Spielberg",
                "Tarantino",
                "Polanski"),
    title = c("Super 8",
              "Taxi Driver",
              "Psycho",
              "North by Northwest",
              "Catch Me If You Can",
              "Reservoir Dogs",
              "Chinatown"),
    stringsAsFactors=FALSE)
movies

surname,title
<chr>,<chr>
Spielberg,Super 8
Scorsese,Taxi Driver
Hitchcock,Psycho
Hitchcock,North by Northwest
Spielberg,Catch Me If You Can
Tarantino,Reservoir Dogs
Polanski,Chinatown


Merge the two datasets

In [7]:
# Both data frames share the key column name, so a single `by` is enough
m1 <- merge(producers, movies, by = "surname")
m1

surname,nationality,title
<chr>,<chr>,<chr>
Hitchcock,UK,Psycho
Hitchcock,UK,North by Northwest
Polanski,Poland,Chinatown
Scorsese,US,Taxi Driver
Spielberg,US,Super 8
Spielberg,US,Catch Me If You Can
Tarantino,US,Reservoir Dogs

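The dimension check mentioned earlier can be sketched with dim(). The block below rebuilds the two data frames so that it runs on its own.

```r
producers <- data.frame(
    surname = c("Spielberg", "Scorsese", "Hitchcock", "Tarantino", "Polanski"),
    nationality = c("US", "US", "UK", "US", "Poland"),
    stringsAsFactors = FALSE)

movies <- data.frame(
    surname = c("Spielberg", "Scorsese", "Hitchcock", "Hitchcock",
                "Spielberg", "Tarantino", "Polanski"),
    title = c("Super 8", "Taxi Driver", "Psycho", "North by Northwest",
              "Catch Me If You Can", "Reservoir Dogs", "Chinatown"),
    stringsAsFactors = FALSE)

# Inner join on the shared key column
m1 <- merge(producers, movies, by = "surname")

# dim() returns the number of rows, then the number of columns
dim(m1)  # 7 3
```

Every movie has a matching producer, so all 7 rows of movies survive the merge, and the result carries the key plus one column from each data frame.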

### With different names
Let's merge data frames when the common key variables have different names.

We change surname to name in the movies data frame. We use the function identical(x1, x2) to check if both dataframes are identical.

In [10]:
# Rename the `surname` column of the `movies` data frame to `name`
colnames(movies)[colnames(movies) == 'surname'] <- 'name'
movies

name,title
<chr>,<chr>
Spielberg,Super 8
Scorsese,Taxi Driver
Hitchcock,Psycho
Hitchcock,North by Northwest
Spielberg,Catch Me If You Can
Tarantino,Reservoir Dogs
Polanski,Chinatown


Merge on key columns with different names. Notice that the key column in the result takes its name from by.x.

In [12]:
m2 <- merge(producers, movies,by.x = "surname",by.y = "name")
m2

surname,nationality,title
<chr>,<chr>,<chr>
Hitchcock,UK,Psycho
Hitchcock,UK,North by Northwest
Polanski,Poland,Chinatown
Scorsese,US,Taxi Driver
Spielberg,US,Super 8
Spielberg,US,Catch Me If You Can
Tarantino,US,Reservoir Dogs


Check whether m1 and m2 are identical

In [14]:
identical(m1,m2)

## Partial match

It is common for two data frames not to contain exactly the same key values. With a full match, merge() returns only the rows found in both the x and y data frames. With a partial merge, it is possible to keep the rows of x that have no matching row in y. These rows will have NA in the columns that are normally filled with values from y. We can do that by setting all.x = TRUE.

For instance, we can add a new producer, Lucas, to the producers data frame without adding any of his movies to the movies data frame. If we set all.x = FALSE (the default), R joins only the values that match in both datasets. In our case, the producer Lucas will not appear in the merged result because he is missing from one dataset.

Let's see the dimension of each output when we specify all.x= TRUE and when we don't.

Create a producer

In [16]:
add_producer <-  c('Lucas', 'US')
add_producer

Append to the `producers` data frame

In [17]:
producers <-rbind(producers,add_producer)
producers

surname,nationality
<chr>,<chr>
Spielberg,US
Scorsese,US
Hitchcock,UK
Tarantino,US
Polanski,Poland
Lucas,US


A partial merge, obtained by specifying all.x = TRUE

In this case, the row for Lucas has an NA value in the title column

In [20]:
m3 <-merge(producers,movies,by.x = "surname",by.y = "name", all.x = TRUE)
m3

surname,nationality,title
<chr>,<chr>,<chr>
Hitchcock,UK,Psycho
Hitchcock,UK,North by Northwest
Lucas,US,NA
Polanski,Poland,Chinatown
Scorsese,US,Taxi Driver
Spielberg,US,Super 8
Spielberg,US,Catch Me If You Can
Tarantino,US,Reservoir Dogs


If all.x = FALSE, we get exactly the same result as before

In [25]:
identical(m2, merge(producers,movies,by.x = "surname",by.y = "name", all.x= FALSE))
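
To compare the dimensions of the two outputs, the following sketch rebuilds the data frames (with Lucas already included and the key column of movies already renamed to name) and runs both merges. The object names `inner` and `partial` are hypothetical.

```r
producers <- data.frame(
    surname = c("Spielberg", "Scorsese", "Hitchcock", "Tarantino",
                "Polanski", "Lucas"),
    nationality = c("US", "US", "UK", "US", "Poland", "US"),
    stringsAsFactors = FALSE)

movies <- data.frame(
    name = c("Spielberg", "Scorsese", "Hitchcock", "Hitchcock",
             "Spielberg", "Tarantino", "Polanski"),
    title = c("Super 8", "Taxi Driver", "Psycho", "North by Northwest",
              "Catch Me If You Can", "Reservoir Dogs", "Chinatown"),
    stringsAsFactors = FALSE)

# Full match: producers without a movie are dropped
inner <- merge(producers, movies, by.x = "surname", by.y = "name")

# Partial match: every producer is kept, unmatched titles become NA
partial <- merge(producers, movies, by.x = "surname", by.y = "name",
                 all.x = TRUE)

dim(inner)    # 7 3
dim(partial)  # 8 3

# Rows of x that found no match in y
partial[is.na(partial$title), ]
```

The partial merge has one extra row, the one for Lucas, whose title is NA.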