# III. Joining and Concatenating

There may be cases when you have data stored in separate datasets and you want to combine them so that you can work with all of the data at once (not just each dataset separately). Often the separate datasets have a __column variable in common__, which we refer to as the key. The key is the variable you use to combine the separate datasets into a single dataset, i.e. it is the variable you will join on. In other cases you may not need to merge on a key but rather just concatenate dataframes vertically or horizontally.

### Joins:

There are five joins we'll be going over using the DataFrames package: **inner**, **left**, **right**, **outer**, and **anti**.


* Inner join: The resultant dataframe will combine rows from both dataframes __ONLY__ for values of the key that exist in both tables. <br/>
<br/>
* Left join: The resultant dataframe will keep __ALL__ rows from the first dataframe, whether or not the value of the key exists in the second dataframe. If a key value exists in the second table but not the first, then it will not be kept in the resultant dataframe. <br/>
<br/>
* Right join: The resultant dataframe will keep __ALL__ rows from the second dataframe, whether or not the value of the key exists in the first dataframe. If a key value exists in the first table but not the second, then it will not be kept in the resultant dataframe. <br/>
<br/>
* Outer join: The resultant dataframe will keep __ALL__ rows from both the first and second dataframes. <br/>
<br/>
* Anti join: The resultant dataframe will keep those rows for key values that exist in the first table but __NOT__ the second table. The only columns in the resultant dataframe will be those from the first table.


Other joins, that we won't discuss, include semi and cross joins. 
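As a rough illustration of the definitions above, the set of key values each join keeps can be sketched with Julia's built-in set operations. The two key vectors here are made up for the example (they are not the dataframes used in this lesson), and the sketch only tracks key values, not the rows or columns a real join returns.

```julia
# Hypothetical key values for two small tables (illustration only)
first_keys  = [1, 2, 3, 4]
second_keys = [3, 4, 5]

inner_keys = intersect(first_keys, second_keys)  # keys in both tables
left_keys  = first_keys                          # every key of the first table
right_keys = second_keys                         # every key of the second table
outer_keys = union(first_keys, second_keys)      # keys in either table
anti_keys  = setdiff(first_keys, second_keys)    # keys only in the first table

println("inner: ", inner_keys)
println("left:  ", left_keys)
println("right: ", right_keys)
println("outer: ", outer_keys)
println("anti:  ", anti_keys)
```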

Now let's create some dataframes so we can see how these joins work in practice.

In [1]:
using DataFrames, Dates

In [2]:
#create dataframes

df1 = DataFrame(A = [0, 1, 0, 1, 1, 1, 1, 0, 1, 0], 
                B = [100, 300, 200, 400, 200, 300, 200, 800, 1000, 900], 
                C = [Date(2019, 8, 31), Date(2017, 6, 2), Date(2015, 5, 5), Date(2014, 10, 8), Date(2013, 5, 10),
                      Date(2012, 4, 2), Date(2011, 8, 2), Date(2009, 10, 11), Date(2008, 10, 2), Date(2007, 6,6)],
                ID = [1234, 89, 3412, 578, 998, 124, 665, 8881, 901, 12] )

Row,A,B,C,ID
Type,Int64,Int64,Date,Int64
1,0,100,2019-08-31,1234
2,1,300,2017-06-02,89
3,0,200,2015-05-05,3412
4,1,400,2014-10-08,578
5,1,200,2013-05-10,998
6,1,300,2012-04-02,124
7,1,200,2011-08-02,665
8,0,800,2009-10-11,8881
9,1,1000,2008-10-02,901
10,0,900,2007-06-06,12


In [3]:
df2 = DataFrame(D = [2, 3, 3, 3, 2, 3, 2, 2, 2, 2, 3], 
                E = [3.2, 8.9, 10.2, 11.3, 88.1, 52.3, 15.2, 66.7, 98.9, 30.2, 16.0], 
                F = randn(11),
                ID = [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44])

Row,D,E,F,ID
Type,Int64,Float64,Float64,Int64
1,2,3.2,-1.21653,1234
2,3,8.9,-0.310181,22
3,3,10.2,0.393614,3412
4,3,11.3,0.0325621,578
5,2,88.1,-0.872102,998
6,3,52.3,0.577471,124
7,2,15.2,0.806145,18
8,2,66.7,0.631664,8881
9,2,98.9,-0.975538,901
10,2,30.2,-0.819793,12


Notice that the key variable is <i>ID</i>. So we'll be using the <i>ID</i> variable to do our joins.

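To do joins you'll use the __join__ function. (Newer releases of DataFrames replace `join` with dedicated functions such as `innerjoin` and `leftjoin`, so depending on your installed version the calls in this lesson may emit deprecation warnings or need adapting.) The basic syntax is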

`join(table1, table2, on = [key], kind = [jointype])`

where __table1__ is the first table, __table2__ the second table, __[key]__ is the key variable, __[jointype]__ is the type of join.

#### Inner join.

Looking at the <i>ID</i> variable you can see that the dataframes have eight <i>ID</i> values in common: 1234, 3412, 578, 998, 124, 8881, 901, and 12. Therefore, we'd expect the resultant dataframe of an inner join to have eight rows and seven columns: <i>ID</i>, <i>A</i>, <i>B</i>, <i>C</i>, <i>D</i>, <i>E</i>, and <i>F</i>.
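The count of eight can be double-checked before running the join. This is a minimal sketch using only Base Julia; the two vectors are copied from the <i>ID</i> columns defined above, and the names `ids1`, `ids2`, and `common_ids` are introduced here just for the check.

```julia
# ID columns copied from df1 and df2
ids1 = [1234, 89, 3412, 578, 998, 124, 665, 8881, 901, 12]
ids2 = [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44]

# Key values present in both tables: these are the rows an inner join keeps
common_ids = intersect(ids1, ids2)

println(common_ids)
println(length(common_ids))
```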

In [4]:
inner_join = join(df1, df2, on = :ID, kind = :inner)



Row,A,B,C,ID,D,E,F
Type,Int64,Int64,Date,Int64,Int64,Float64,Float64
1,0,100,2019-08-31,1234,2,3.2,-1.21653
2,0,200,2015-05-05,3412,3,10.2,0.393614
3,1,400,2014-10-08,578,3,11.3,0.0325621
4,1,200,2013-05-10,998,2,88.1,-0.872102
5,1,300,2012-04-02,124,3,52.3,0.577471
6,0,800,2009-10-11,8881,2,66.7,0.631664
7,1,1000,2008-10-02,901,2,98.9,-0.975538
8,0,900,2007-06-06,12,2,30.2,-0.819793


If you want to rearrange the columns so that the <i>ID</i> column variable comes first, you can use __permutecols!__.

In [5]:
permutecols!(inner_join, [4, 1, 2, 3, 5, 6, 7])



Row,ID,A,B,C,D,E,F
Type,Int64,Int64,Int64,Date,Int64,Float64,Float64
1,1234,0,100,2019-08-31,2,3.2,-1.21653
2,3412,0,200,2015-05-05,3,10.2,0.393614
3,578,1,400,2014-10-08,3,11.3,0.0325621
4,998,1,200,2013-05-10,2,88.1,-0.872102
5,124,1,300,2012-04-02,3,52.3,0.577471
6,8881,0,800,2009-10-11,2,66.7,0.631664
7,901,1,1000,2008-10-02,2,98.9,-0.975538
8,12,0,900,2007-06-06,2,30.2,-0.819793


#### Left join.

You can see that the first dataframe **df1** has ten <i>ID</i> values. Therefore, we'd expect the resultant dataframe to have 10 rows and seven columns: <i>ID</i> , <i>A</i>, <i>B</i>, <i>C</i>, <i>D</i>, <i>E</i>, and  <i>F</i>.

Notice that of the ten <i>ID</i> values in the first table, the two <i>ID</i> values 89 and 665 **do not exist** in the ID column for the second table **df2**. What do you expect to see for the <i>D</i>, <i>E</i>, and <i>F</i> variables for these two values of the key variable that exist in the left table but not the right?

In [6]:
left_join = join(df1, df2, on = :ID, kind = :left)

Row,A,B,C,ID,D,E,F
Type,Int64,Int64,Date,Int64,Int64?,Float64?,Float64?
1,0,100,2019-08-31,1234,2,3.2,-1.21653
2,1,300,2017-06-02,89,missing,missing,missing
3,0,200,2015-05-05,3412,3,10.2,0.393614
4,1,400,2014-10-08,578,3,11.3,0.0325621
5,1,200,2013-05-10,998,2,88.1,-0.872102
6,1,300,2012-04-02,124,3,52.3,0.577471
7,1,200,2011-08-02,665,missing,missing,missing
8,0,800,2009-10-11,8881,2,66.7,0.631664
9,1,1000,2008-10-02,901,2,98.9,-0.975538
10,0,900,2007-06-06,12,2,30.2,-0.819793
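The `missing` entries can be reproduced by hand. The sketch below mimics what a left join does for a single column, using a `Dict` built from the second table as a lookup. It is an illustration in Base Julia under the assumption that key values are unique, not how DataFrames implements joins, and the names `ids1`, `ids2`, `E`, `lookup`, and `left_E` are introduced only for this example.

```julia
# ID columns and the E column copied from df1 and df2
ids1 = [1234, 89, 3412, 578, 998, 124, 665, 8881, 901, 12]
ids2 = [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44]
E    = [3.2, 8.9, 10.2, 11.3, 88.1, 52.3, 15.2, 66.7, 98.9, 30.2, 16.0]

# Map each key in the second table to its E value
lookup = Dict(zip(ids2, E))

# For every key in the first table, take the matching E value,
# or `missing` when the key has no match in the second table
left_E = [get(lookup, id, missing) for id in ids1]

println(left_E)
println(count(ismissing, left_E))
```

Because the joined columns now have to hold `missing`, their types are shown as `Int64?` and `Float64?` in the output above rather than `Int64` and `Float64`.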


#### Right join.

You can see that the second dataframe has 11 <i>ID</i> values. Therefore, we'd expect the resultant dataframe to have 11 rows and seven columns: <i>ID</i> , <i>A</i>, <i>B</i>, <i>C</i>, <i>D</i>, <i>E</i>, and  <i>F</i>.

Notice that of the 11 <i>ID</i> values in the second table, the three <i>ID</i> values 22, 18, and 44 **do not exist** in the <i>ID</i> column for the left table. What do you expect to see for the <i>A</i>, <i>B</i>, and <i>C</i> variables for these three values of the key variable that exist in the right table but not the left?

In [7]:
right_join = join(df1, df2, on = :ID, kind = :right)

Row,A,B,C,ID,D,E,F
Type,Int64?,Int64?,Date?,Int64,Int64,Float64,Float64
1,0,100,2019-08-31,1234,2,3.2,-1.21653
2,0,200,2015-05-05,3412,3,10.2,0.393614
3,1,400,2014-10-08,578,3,11.3,0.0325621
4,1,200,2013-05-10,998,2,88.1,-0.872102
5,1,300,2012-04-02,124,3,52.3,0.577471
6,0,800,2009-10-11,8881,2,66.7,0.631664
7,1,1000,2008-10-02,901,2,98.9,-0.975538
8,0,900,2007-06-06,12,2,30.2,-0.819793
9,missing,missing,missing,22,3,8.9,-0.310181
10,missing,missing,missing,18,2,15.2,0.806145


#### Outer join.

With this type of join we know that all rows of data from both tables will be kept. How many rows do you expect the resultant dataframe to have?

The two tables have eight key values in common, **df1** has two key values that **df2** does not have, and **df2** has three key values that **df1** does not have. As a result, the resultant dataframe will have 13 (8+2+3) total rows and seven columns.
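The 8 + 2 + 3 arithmetic can be checked directly from the key columns. A minimal sketch in Base Julia, with `ids1` and `ids2` copied from the <i>ID</i> columns (the variable names are introduced here for illustration):

```julia
# ID columns copied from df1 and df2
ids1 = [1234, 89, 3412, 578, 998, 124, 665, 8881, 901, 12]
ids2 = [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44]

in_both     = intersect(ids1, ids2)  # key values matched by both tables
only_first  = setdiff(ids1, ids2)    # key values in df1 but not df2
only_second = setdiff(ids2, ids1)    # key values in df2 but not df1

# Key values are unique in each table here, so every key contributes one row
total_rows = length(in_both) + length(only_first) + length(only_second)

println((length(in_both), length(only_first), length(only_second)))
println(total_rows)
```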

In [8]:
outer_join = join(df1, df2, on = :ID, kind = :outer)

Row,A,B,C,ID,D,E,F
Type,Int64?,Int64?,Date?,Int64,Int64?,Float64?,Float64?
1,0,100,2019-08-31,1234,2,3.2,-1.21653
2,1,300,2017-06-02,89,missing,missing,missing
3,0,200,2015-05-05,3412,3,10.2,0.393614
4,1,400,2014-10-08,578,3,11.3,0.0325621
5,1,200,2013-05-10,998,2,88.1,-0.872102
6,1,300,2012-04-02,124,3,52.3,0.577471
7,1,200,2011-08-02,665,missing,missing,missing
8,0,800,2009-10-11,8881,2,66.7,0.631664
9,1,1000,2008-10-02,901,2,98.9,-0.975538
10,0,900,2007-06-06,12,2,30.2,-0.819793


#### Anti join.

With this type of join you will keep rows and columns of data from the first table for those key values that exist __ONLY__ in the first table. So if a key value exists in the first table __AND__ the second table that data will __NOT__ be in the resultant dataframe.

If we do an anti join where **df1** is the first table and **df2** is the second table how many rows of data will be in the resultant dataframe? We know **df1** only has two key values (89 and 665) that are not in **df2** so in this case the resultant dataframe will have two rows of data.

In [9]:
anti_join_1 = join(df1, df2, on = :ID, kind = :anti)

Row,A,B,C,ID
Type,Int64,Int64,Date,Int64
1,1,300,2017-06-02,89
2,1,200,2011-08-02,665


We can do an anti join where we use **df2** as the first table and **df1** as the second table. Again, we know **df2** only has three key values not in **df1** so the resultant dataframe will have three rows of data.

In [10]:
anti_join_2 = join(df2, df1, on = :ID, kind = :anti)

Row,D,E,F,ID
Type,Int64,Float64,Float64,Int64
1,3,8.9,-0.310181,22
2,2,15.2,0.806145,18
3,3,16.0,-0.969091,44
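An anti join can be thought of as a row filter on the first table. The sketch below reproduces the second anti join by hand in Base Julia: rows of **df2** are represented as named tuples (the random <i>F</i> column is left out), and only rows whose <i>ID</i> is absent from **df1** are kept. The names `rows2`, `keys_in_df1`, and `anti_rows` are introduced for this illustration; this is not how DataFrames implements the join.

```julia
# ID column copied from df1
ids1 = [1234, 89, 3412, 578, 998, 124, 665, 8881, 901, 12]

# Rows of df2 as named tuples (D, E, and ID columns copied from df2)
rows2 = [(D = d, E = e, ID = id) for (d, e, id) in zip(
    [2, 3, 3, 3, 2, 3, 2, 2, 2, 2, 3],
    [3.2, 8.9, 10.2, 11.3, 88.1, 52.3, 15.2, 66.7, 98.9, 30.2, 16.0],
    [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44])]

# Keep only the rows whose key does not appear in df1
keys_in_df1 = Set(ids1)
anti_rows = filter(row -> !(row.ID in keys_in_df1), rows2)

foreach(println, anti_rows)
```

Note that the kept rows carry only the columns of the first table passed to the join, matching the output above.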


The key variables don't have to have the same name in each dataframe. Let's change the name of the <i>ID</i> variable in **df1** to be <i>ID1</i> and in **df2** to be <i>ID2</i>:

In [11]:
rename!(df1, :ID => :ID1)
#df2temp = rename!(df2, :ID => :ID2)
rename!(df2, :ID => :ID2)

Row,D,E,F,ID2
Type,Int64,Float64,Float64,Int64
1,2,3.2,-1.21653,1234
2,3,8.9,-0.310181,22
3,3,10.2,0.393614,3412
4,3,11.3,0.0325621,578
5,2,88.1,-0.872102,998
6,3,52.3,0.577471,124
7,2,15.2,0.806145,18
8,2,66.7,0.631664,8881
9,2,98.9,-0.975538,901
10,2,30.2,-0.819793,12


Now, to do an inner join the `on` keyword argument is expressed as a pair where the first element of the pair is the name of the key in the first dataframe and the second element in the pair is the name of the key in the second dataframe.
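The `=>` syntax builds an ordinary Julia `Pair`, so the two key names can be inspected with `first` and `last`. A small sketch (the variable name `key_pair` is just for illustration):

```julia
# A Pair mapping the key name in the first table to the key name in the second
key_pair = :ID1 => :ID2

println(typeof(key_pair))
println(first(key_pair))   # key name in the first dataframe
println(last(key_pair))    # key name in the second dataframe
```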

In [12]:
join(df1, df2, on = :ID1 => :ID2, kind = :inner)

Row,A,B,C,ID1,D,E,F
Type,Int64,Int64,Date,Int64,Int64,Float64,Float64
1,0,100,2019-08-31,1234,2,3.2,-1.21653
2,0,200,2015-05-05,3412,3,10.2,0.393614
3,1,400,2014-10-08,578,3,11.3,0.0325621
4,1,200,2013-05-10,998,2,88.1,-0.872102
5,1,300,2012-04-02,124,3,52.3,0.577471
6,0,800,2009-10-11,8881,2,66.7,0.631664
7,1,1000,2008-10-02,901,2,98.9,-0.975538
8,0,900,2007-06-06,12,2,30.2,-0.819793


#### More on doing joins.

In [13]:
# create dataframes

df1 = DataFrame(A = [0, 1, 0, 1, 1, 1, 1, 0, 1, 0], 
                B = [100, 300, 200, 400, 200, 300, 200, 800, 1000, 900], 
                C = [Date(2019, 8, 31), Date(2017, 6, 2), Date(2015, 5, 5), Date(2014, 10, 8), Date(2013, 5, 10),
                      Date(2012, 4, 2), Date(2011, 8, 2), Date(2009, 10, 11), Date(2008, 10, 2), Date(2007, 6,6)],
                ID = [1234, 89, 3412, 578, 998, 124, 665, 8881, 901, 12] );

df2 = DataFrame(D = [2, 3, 3, 3, 2, 3, 2, 2, 2, 2, 3], 
                E = [3.2, 8.9, 10.2, 11.3, 88.1, 52.3, 15.2, 66.7, 98.9, 30.2, 16.0], 
                F = randn(11),
                ID = [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44]);

I. The **indicator** argument: <br/>
<br/>
You can pass an **indicator** keyword argument to the **join** function so that the resultant dataframe contains a column variable indicating which table each observation came from: the first, the second, or both. Let's use this argument in an outer join.

In [14]:
outer_join = join(df1, df2, on = :ID, kind = :outer, indicator = :source)

Row,A,B,C,ID,D,E,F,source
Type,Int64?,Int64?,Date?,Int64,Int64?,Float64?,Float64?,Cat…
1,0,100,2019-08-31,1234,2,3.2,0.0128778,both
2,1,300,2017-06-02,89,missing,missing,missing,left_only
3,0,200,2015-05-05,3412,3,10.2,-0.527092,both
4,1,400,2014-10-08,578,3,11.3,1.1258,both
5,1,200,2013-05-10,998,2,88.1,-0.663834,both
6,1,300,2012-04-02,124,3,52.3,0.380954,both
7,1,200,2011-08-02,665,missing,missing,missing,left_only
8,0,800,2009-10-11,8881,2,66.7,-0.604882,both
9,1,1000,2008-10-02,901,2,98.9,0.27131,both
10,0,900,2007-06-06,12,2,30.2,0.713718,both
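The rows displayed above only show the `both` and `left_only` labels. As a sketch of how each key value would be labelled, the classification can be reproduced in Base Julia. The helper `source_label` is hypothetical and introduced only for this example, and the third label, `right_only`, is assumed to mirror the two shown above.

```julia
# ID columns copied from df1 and df2
ids1 = [1234, 89, 3412, 578, 998, 124, 665, 8881, 901, 12]
ids2 = [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44]

# Hypothetical helper: label a key value by the table(s) it appears in
function source_label(id, first_ids, second_ids)
    in_first  = id in first_ids
    in_second = id in second_ids
    if in_first && in_second
        return "both"
    elseif in_first
        return "left_only"
    else
        return "right_only"   # assumed label for keys found only in the second table
    end
end

# Label every key value that appears in either table
labels = Dict(id => source_label(id, ids1, ids2) for id in union(ids1, ids2))

println(labels[1234])
println(labels[89])
println(labels[22])
```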


II. The __makeunique__ argument: <br/>
<br/>
This argument is useful in cases where each dataset has a column variable with the **same** name. 

Let's modify **df1** so that it has a column variable with the same name as a column in **df2**.

In [15]:
# Modify df1 to have a "D" column variable

df1.D = rand(100:900,10);

What happens when you do a join on **df1** and **df2** now that they share a column variable with the same name?

In [16]:
inner_join = join(df1, df2, on = :ID, kind = :inner)

ArgumentError: ArgumentError: Duplicate variable names: :D. Pass makeunique=true to make them unique using a suffix automatically.

The error message indicates what to do. You can pass a boolean value of true for the `makeunique` keyword argument.

In [17]:
inner_join = join(df1, df2, on = :ID, makeunique = true, kind = :inner)

Row,A,B,C,ID,D,D_1,E,F
Type,Int64,Int64,Date,Int64,Int64,Int64,Float64,Float64
1,0,100,2019-08-31,1234,292,2,3.2,0.0128778
2,0,200,2015-05-05,3412,561,3,10.2,-0.527092
3,1,400,2014-10-08,578,791,3,11.3,1.1258
4,1,200,2013-05-10,998,165,2,88.1,-0.663834
5,1,300,2012-04-02,124,136,3,52.3,0.380954
6,0,800,2009-10-11,8881,552,2,66.7,-0.604882
7,1,1000,2008-10-02,901,280,2,98.9,0.27131
8,0,900,2007-06-06,12,339,2,30.2,0.713718


The resultant dataframe keeps both columns of the <i>D</i> variable but the <i>D</i> column for the second table (in this case **df2**) is renamed to <i>D_1</i>.

That is, a suffix of \_1 was automatically added to the name of the D column variable in the second table. If there had already been a variable named <i>D_1</i> then the suffix \_2 would've been used instead. The <i>D</i> column for the first table keeps the same name of <i>D</i>.
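The suffixing rule described above can be sketched as a small function. This is a hypothetical helper written for illustration (the name `suffix_duplicates` is not part of DataFrames, and the real implementation may differ in detail): a repeated name gets the first suffix `_1`, `_2`, ... that is not already taken.

```julia
# Hypothetical helper illustrating the renaming rule; not the DataFrames implementation
function suffix_duplicates(names::Vector{Symbol})
    seen = Set{Symbol}()
    result = copy(names)
    duplicates = Int[]
    # First pass: record every name's first occurrence and note the repeats
    for (i, name) in enumerate(names)
        if name in seen
            push!(duplicates, i)
        else
            push!(seen, name)
        end
    end
    # Second pass: give each repeat the first free suffix _1, _2, ...
    for i in duplicates
        k = 1
        candidate = Symbol(names[i], "_", k)
        while candidate in seen
            k += 1
            candidate = Symbol(names[i], "_", k)
        end
        result[i] = candidate
        push!(seen, candidate)
    end
    return result
end

println(suffix_duplicates([:A, :B, :C, :ID, :D, :D, :E, :F]))
println(suffix_duplicates([:D, :D_1, :D]))
```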

III. The __validate__ argument:

In our above tables the key values were unique in each table. That is, in each table the <i>ID</i> variable did not contain duplicate values. However, this is not a requirement for the key variable.

In **df2** let's add another row of data for <i>ID</i> 1234 and in **df1** let's drop the <i>D</i> column variable and add two more rows of data for <i>ID</i> 1234.

In [18]:
push!(df2, (3, 21.2, randn(1)[1], 1234));

In [19]:
select!(df1, Not([:D]))
push!(df1, (1, 700, Date(2011,7,7), 1234));
push!(df1, (1, 500, Date(2011,1,17), 1234));

In [20]:
df1

Row,A,B,C,ID
Type,Int64,Int64,Date,Int64
1,0,100,2019-08-31,1234
2,1,300,2017-06-02,89
3,0,200,2015-05-05,3412
4,1,400,2014-10-08,578
5,1,200,2013-05-10,998
6,1,300,2012-04-02,124
7,1,200,2011-08-02,665
8,0,800,2009-10-11,8881
9,1,1000,2008-10-02,901
10,0,900,2007-06-06,12


In [21]:
df2

Row,D,E,F,ID
Type,Int64,Float64,Float64,Int64
1,2,3.2,0.0128778,1234
2,3,8.9,0.41265,22
3,3,10.2,-0.527092,3412
4,3,11.3,1.1258,578
5,2,88.1,-0.663834,998
6,3,52.3,0.380954,124
7,2,15.2,-0.419663,18
8,2,66.7,-0.604882,8881
9,2,98.9,0.27131,901
10,2,30.2,0.713718,12


These changes result in **df2** having two rows of data with ID 1234 and with **df1** having three rows of data with ID 1234.

Let's now do a left join. In this case the join operation will match all combinations of rows with <i>ID</i> 1234. How many rows will the resultant dataframe have with a value of 1234 for the <i>ID</i> variable?

Keep in mind the left table (__df1__) has three row entries for <i>ID</i> 1234 and the right table (__df2__) has two row entries for <i>ID</i> 1234, so in this case the result dataframe will have 6 (3x2) entries for <i>ID</i> 1234. 
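The 3x2 count can be verified from the key columns alone. A minimal sketch in Base Julia, where `ids1` and `ids2` are the <i>ID</i> columns after the rows for 1234 were added (the variable names are introduced here for illustration):

```julia
# ID columns of df1 and df2 after the extra rows for ID 1234 were pushed
ids1 = [1234, 89, 3412, 578, 998, 124, 665, 8881, 901, 12, 1234, 1234]
ids2 = [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44, 1234]

n_first  = count(==(1234), ids1)   # rows with ID 1234 in the left table
n_second = count(==(1234), ids2)   # rows with ID 1234 in the right table
println(n_first * n_second)

# Total rows of the left join: each left row appears once per match in the
# right table, or once (with missing values) if it has no match at all
total_rows = sum(max(1, count(==(id), ids2)) for id in ids1)
println(total_rows)
```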

In [22]:
left_join_2 = join(df1, df2, on = :ID, kind = :left)

Row,A,B,C,ID,D,E,F
Type,Int64,Int64,Date,Int64,Int64?,Float64?,Float64?
1,0,100,2019-08-31,1234,2,3.2,0.0128778
2,0,100,2019-08-31,1234,3,21.2,-0.0271656
3,1,300,2017-06-02,89,missing,missing,missing
4,0,200,2015-05-05,3412,3,10.2,-0.527092
5,1,400,2014-10-08,578,3,11.3,1.1258
6,1,200,2013-05-10,998,2,88.1,-0.663834
7,1,300,2012-04-02,124,3,52.3,0.380954
8,1,200,2011-08-02,665,missing,missing,missing
9,0,800,2009-10-11,8881,2,66.7,-0.604882
10,1,1000,2008-10-02,901,2,98.9,0.27131


We can see that the table has six entries for _ID_ 1234 by using `filter` to return only rows with an _ID_ of 1234:

In [23]:
filter(row -> row[:ID] == 1234, left_join_2)

Row,A,B,C,ID,D,E,F
Type,Int64,Int64,Date,Int64,Int64?,Float64?,Float64?
1,0,100,2019-08-31,1234,2,3.2,0.0128778
2,0,100,2019-08-31,1234,3,21.2,-0.0271656
3,1,700,2011-07-07,1234,2,3.2,0.0128778
4,1,700,2011-07-07,1234,3,21.2,-0.0271656
5,1,500,2011-01-17,1234,2,3.2,0.0128778
6,1,500,2011-01-17,1234,3,21.2,-0.0271656


You can make the join fail if the keys are not unique in one or both tables by using the __validate__ keyword argument. You pass this argument a tuple of two boolean values indicating whether each table must have unique key values. If the first boolean is set to true, the first table cannot have duplicate key values; similarly, if the second boolean is set to true, the second table cannot have duplicate key values.

In [24]:
df1

Row,A,B,C,ID
Type,Int64,Int64,Date,Int64
1,0,100,2019-08-31,1234
2,1,300,2017-06-02,89
3,0,200,2015-05-05,3412
4,1,400,2014-10-08,578
5,1,200,2013-05-10,998
6,1,300,2012-04-02,124
7,1,200,2011-08-02,665
8,0,800,2009-10-11,8881
9,1,1000,2008-10-02,901
10,0,900,2007-06-06,12


In [25]:
left_join_3 = join(df1, df2, on = :ID, validate = (true, true), kind = :left)

ArgumentError: ArgumentError: Merge key(s) are not unique in both df1 and df2. df1 contains 1 duplicate key: (ID = 1234,). df2 contains 1 duplicate key: (ID = 1234,).

Let's delete the last row, i.e. row 12, from **df2** with <i>ID</i> 1234. This will leave it with just one entry for <i>ID</i> 1234.

In [26]:
deleterows!(df2, 12)



Row,D,E,F,ID
Type,Int64,Float64,Float64,Int64
1,2,3.2,0.0128778,1234
2,3,8.9,0.41265,22
3,3,10.2,-0.527092,3412
4,3,11.3,1.1258,578
5,2,88.1,-0.663834,998
6,3,52.3,0.380954,124
7,2,15.2,-0.419663,18
8,2,66.7,-0.604882,8881
9,2,98.9,0.27131,901
10,2,30.2,0.713718,12


Now if we set `validate = (false, true)` the join will work. The first value, false, means we are not requiring the left table to have unique key values; the second value, true, means we are requiring the right table to have unique key values (which it now does).
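The rule behind `validate` can be sketched with Base Julia's `allunique`. The helper `keys_pass_validation` is hypothetical and written only to illustrate the check described above; DataFrames performs its own check internally and raises an error rather than returning a boolean.

```julia
# ID columns at this point: df1 has three rows for 1234, df2 has one again
ids1 = [1234, 89, 3412, 578, 998, 124, 665, 8881, 901, 12, 1234, 1234]
ids2 = [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44]

# Hypothetical helper: a `true` flag requires the corresponding keys to be unique
function keys_pass_validation(first_ids, second_ids, validate::Tuple{Bool,Bool})
    first_ok  = !validate[1] || allunique(first_ids)
    second_ok = !validate[2] || allunique(second_ids)
    return first_ok && second_ok
end

println(keys_pass_validation(ids1, ids2, (true, true)))    # left keys are not unique
println(keys_pass_validation(ids1, ids2, (false, true)))   # only the right keys are checked
```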

In [27]:
left_join_4 = join(df1, df2, on = :ID, validate = (false, true), kind = :left)

Row,A,B,C,ID,D,E,F
Type,Int64,Int64,Date,Int64,Int64?,Float64?,Float64?
1,0,100,2019-08-31,1234,2,3.2,0.0128778
2,1,300,2017-06-02,89,missing,missing,missing
3,0,200,2015-05-05,3412,3,10.2,-0.527092
4,1,400,2014-10-08,578,3,11.3,1.1258
5,1,200,2013-05-10,998,2,88.1,-0.663834
6,1,300,2012-04-02,124,3,52.3,0.380954
7,1,200,2011-08-02,665,missing,missing,missing
8,0,800,2009-10-11,8881,2,66.7,-0.604882
9,1,1000,2008-10-02,901,2,98.9,0.27131
10,0,900,2007-06-06,12,2,30.2,0.713718


IV. Joining on two variables: <br/>

In [28]:
# create dataframes

df1 = DataFrame(A = [0, 1, 0, 1, 1, 1, 1, 0, 1, 0], 
                B = [100, 300, 200, 400, 200, 300, 200, 800, 1000, 900], 
                C = [Date(2019, 8, 31), Date(2017, 6, 2), Date(2015, 5, 5), Date(2014, 10, 8), Date(2013, 5, 10),
                      Date(2012, 4, 2), Date(2011, 8, 2), Date(2009, 10, 11), Date(2008, 10, 2), Date(2007, 6,6)],
                ID = [1234, 89, 3412, 578, 998, 124, 665, 8881, 901, 12] );

df2 = DataFrame(D = [2, 3, 3, 3, 2, 3, 2, 2, 2, 2, 3], 
                E = [3.2, 8.9, 10.2, 11.3, 88.1, 52.3, 15.2, 66.7, 98.9, 30.2, 16.0], 
                F = randn(11),
                ID = [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44]);

You can join on more than one variable. Here we'll do an example where we join on two variables. We'll add a column named <i>A</i> to **df2** so that now both dataframes will have <i>ID</i> and <i>A</i> columns.

In [29]:
df2.A = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1];

In [30]:
df2

Row,D,E,F,ID,A
Type,Int64,Float64,Float64,Int64,Int64
1,2,3.2,-1.57344,1234,0
2,3,8.9,-0.204683,22,1
3,3,10.2,0.69225,3412,1
4,3,11.3,-0.837547,578,0
5,2,88.1,-0.0024437,998,0
6,3,52.3,-0.433899,124,1
7,2,15.2,0.828746,18,1
8,2,66.7,-1.29748,8881,0
9,2,98.9,-1.54095,901,0
10,2,30.2,0.440707,12,1


Now we can do an inner join on the two variables <i>ID</i> and <i>A</i>. You can just specify these variables in the **on** keyword argument. Now when joining on both <i>ID</i> and <i>A</i> there are only three matching cases.
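Joining on two variables amounts to matching on the pair (<i>ID</i>, <i>A</i>). The three matching cases can be confirmed with a short Base Julia sketch in which each row's key is a tuple; the vectors are copied from the two dataframes and the variable names are introduced for illustration.

```julia
# Key columns copied from df1
ids1 = [1234, 89, 3412, 578, 998, 124, 665, 8881, 901, 12]
A1   = [0, 1, 0, 1, 1, 1, 1, 0, 1, 0]

# Key columns copied from df2 (including the newly added A column)
ids2 = [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44]
A2   = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]

# Composite keys: one (ID, A) tuple per row
keys1 = collect(zip(ids1, A1))
keys2 = collect(zip(ids2, A2))

# A row survives the inner join only if both ID and A agree
matches = intersect(keys1, keys2)
println(matches)
```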

In [31]:
inner_join = join(df1, df2, on = [:ID, :A], kind = :inner)

Row,A,B,C,ID,D,E,F
Type,Int64,Int64,Date,Int64,Int64,Float64,Float64
1,0,100,2019-08-31,1234,2,3.2,-1.57344
2,1,300,2012-04-02,124,3,52.3,-0.433899
3,0,800,2009-10-11,8881,2,66.7,-1.29748


### Concatenation:

Concatenating, or combining, dataframes is relatively straightforward in Julia. The main commands are `hcat` and `vcat` for horizontal and vertical concatenation respectively.

The **hcat** function assumes the input dataframes have the same number of rows. Horizontal concatenation combines two dataframes along the second dimension. The result is a dataframe with the same number of rows as each input but with the columns from both dataframes.
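The row-count requirement is the same one Base Julia imposes on ordinary arrays. As an analogy only (plain vectors rather than dataframes, with made-up values):

```julia
# Three plain vectors standing in for columns (illustrative values)
a = [1, 2, 3]
b = [10.0, 20.0, 30.0]
c = [7, 8]            # one element short

# Same length: the vectors become the columns of a 3x2 matrix
wide = hcat(a, b)
println(size(wide))

# Different lengths: hcat throws an error instead of padding
mismatch_failed = try
    hcat(a, c)
    false
catch
    true
end
println(mismatch_failed)
```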

In [32]:
# create dataframes

df1 = DataFrame(A = [0, 1, 0, 1, 1, 1, 1, 0, 1, 0], 
                B = [100, 300, 200, 400, 200, 300, 200, 800, 1000, 900], 
                C = [Date(2019, 8, 31), Date(2017, 6, 2), Date(2015, 5, 5), Date(2014, 10, 8), Date(2013, 5, 10),
                      Date(2012, 4, 2), Date(2011, 8, 2), Date(2009, 10, 11), Date(2008, 10, 2), Date(2007, 6,6)],
                ID = [1234, 89, 3412, 578, 998, 124, 665, 8881, 901, 12] );

df2 = DataFrame(D = [2, 3, 3, 3, 2, 3, 2, 2, 2, 2, 3], 
                E = [3.2, 8.9, 10.2, 11.3, 88.1, 52.3, 15.2, 66.7, 98.9, 30.2, 16.0], 
                F = randn(11),
                ID = [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44]);

Let's create a new dataframe called **df2_noid** that is the same as **df2** but without the <i>ID</i> column. This way the **df1** and **df2_noid** dataframes will have uniquely named columns.

We'll also delete the last row from **df2** so it has the same number of rows as **df1**, which is a requirement for `hcat`.

In [33]:
deleterows!(df2, nrow(df2)); # delete last row from df2



In [34]:
df2_noid = df2[:, Not(:ID)];

Now we can horizontally concatenate the two dataframes.

In [35]:
hcat(df1, df2_noid)

Row,A,B,C,ID,D,E,F
Type,Int64,Int64,Date,Int64,Int64,Float64,Float64
1,0,100,2019-08-31,1234,2,3.2,1.66717
2,1,300,2017-06-02,89,3,8.9,-0.3383
3,0,200,2015-05-05,3412,3,10.2,-0.0274986
4,1,400,2014-10-08,578,3,11.3,-0.0368451
5,1,200,2013-05-10,998,2,88.1,0.232558
6,1,300,2012-04-02,124,3,52.3,-0.500451
7,1,200,2011-08-02,665,2,15.2,-0.396486
8,0,800,2009-10-11,8881,2,66.7,0.255381
9,1,1000,2008-10-02,901,2,98.9,-1.25766
10,0,900,2007-06-06,12,2,30.2,0.531815


Note that if you had left the <i>ID</i> column in the right dataframe, you would have needed to pass the `makeunique` option to **hcat**. This keeps both <i>ID</i> columns in the resulting dataframe and automatically renames the second one to <i>ID_1</i>.

In [36]:
hcat(df1, df2, makeunique=true)

Row,A,B,C,ID,D,E,F,ID_1
Type,Int64,Int64,Date,Int64,Int64,Float64,Float64,Int64
1,0,100,2019-08-31,1234,2,3.2,1.66717,1234
2,1,300,2017-06-02,89,3,8.9,-0.3383,22
3,0,200,2015-05-05,3412,3,10.2,-0.0274986,3412
4,1,400,2014-10-08,578,3,11.3,-0.0368451,578
5,1,200,2013-05-10,998,2,88.1,0.232558,998
6,1,300,2012-04-02,124,3,52.3,-0.500451,124
7,1,200,2011-08-02,665,2,15.2,-0.396486,18
8,0,800,2009-10-11,8881,2,66.7,0.255381,8881
9,1,1000,2008-10-02,901,2,98.9,-1.25766,901
10,0,900,2007-06-06,12,2,30.2,0.531815,12


The `vcat` function works much the same as the `hcat` function except that it concatenates along the first dimension. The input dataframes can have different numbers of rows but should have the same number of columns with the same column names.

Let's create a **df1new** dataframe with the same column names as the **df1** dataframe. Note that the column names do not need to be in the same order for **vcat** to work, though the names must match.

In [37]:
df1new = DataFrame(A = [3.2, 8.9, 10.2, 11.3, 88.1, 52.3, 15.2, 66.7, 98.9, 30.2, 16.0],
                   B = [2, 3, 3, 3, 2, 3, 2, 2, 2, 2, 3], 
                   C = randn(11),
                   ID = [1234, 22, 3412, 578, 998, 124, 18, 8881, 901, 12, 44])

Row,A,B,C,ID
Type,Float64,Int64,Float64,Int64
1,3.2,2,0.655238,1234
2,8.9,3,2.23704,22
3,10.2,3,0.829874,3412
4,11.3,3,-0.577773,578
5,88.1,2,-0.898256,998
6,52.3,3,0.904124,124
7,15.2,2,0.0501277,18
8,66.7,2,-0.221864,8881
9,98.9,2,0.775255,901
10,30.2,2,0.562006,12


In [38]:
df1

Row,A,B,C,ID
Type,Int64,Int64,Date,Int64
1,0,100,2019-08-31,1234
2,1,300,2017-06-02,89
3,0,200,2015-05-05,3412
4,1,400,2014-10-08,578
5,1,200,2013-05-10,998
6,1,300,2012-04-02,124
7,1,200,2011-08-02,665
8,0,800,2009-10-11,8881
9,1,1000,2008-10-02,901
10,0,900,2007-06-06,12


In [39]:
vcat(df1, df1new)

Row,A,B,C,ID
Type,Float64,Int64,Any,Int64
1,0.0,100,2019-08-31,1234
2,1.0,300,2017-06-02,89
3,0.0,200,2015-05-05,3412
4,1.0,400,2014-10-08,578
5,1.0,200,2013-05-10,998
6,1.0,300,2012-04-02,124
7,1.0,200,2011-08-02,665
8,0.0,800,2009-10-11,8881
9,1.0,1000,2008-10-02,901
10,0.0,900,2007-06-06,12


The dataframes essentially get stacked with the first input dataframe stacked on top of the second.
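The column types in the output (`Float64` for <i>A</i> and `Any` for <i>C</i>) reflect how mixed element types are combined when stacking. Base Julia's `vcat` on plain vectors behaves analogously, which the sketch below uses as an illustration. It does not call DataFrames, and the short vectors are stand-ins for the actual columns.

```julia
using Dates

# Short stand-ins for the A and C columns of df1 and df1new
a_int   = [0, 1, 0]                              # Int64, like df1.A
a_float = [3.2, 8.9]                             # Float64, like df1new.A
c_date  = [Date(2019, 8, 31), Date(2017, 6, 2)]  # Date, like df1.C
c_float = [0.655238, 2.23704]                    # Float64, like df1new.C

# Integers and floats share a common numeric type, so the result is Float64
stacked_a = vcat(a_int, a_float)

# Dates and floats have no common concrete type, so the result falls back to Any
stacked_c = vcat(c_date, c_float)

println(eltype(stacked_a))
println(eltype(stacked_c))
```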

In this lesson we covered:
* Basic join operations including inner, left, right, outer, and anti.
* Additional join arguments and joining on two variables.
* Concatenating dataframes: hcat and vcat.