In [1]:
import pandas as pd

## Joining and Concatenating Data

Sometimes we have several data sources that we would like to combine. In pandas this is done with merges (similar to a join in SQL).

To do a join, the data sets need a common feature (a key column) on which rows from the different sources can be matched. We also have to decide how to join/merge the data: inner, outer, left, or right.

<table><tr><td><img src='./pics/inner_join.PNG' width = 400></td><td><img src='pics/outer_join.PNG' width = 400></td></tr></table>
<table><tr><td><img src='./pics/left_join.PNG' width = 400></td><td><img src='pics/right_join.PNG' width = 400></td></tr></table>

**Examples** 

**Inner Join** <br>
<img src="./pics/inner_join example.PNG" width = 400/>

**Outer Join** <br>
<img src="./pics/outer_join example.PNG" width = 400/>

**Left Join** <br>
<img src="./pics/left_join example.PNG" width = 400/>

**Right Join** <br>
<img src="./pics/right_join example.PNG" width = 400/>
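
The four join types can also be tried out on a small pair of tables. The sketch below uses two made-up dataframes (`left`, `right`) with a shared `key` column; the names and values are purely illustrative and are not taken from the World Bank data.

```python
import pandas as pd

# Hypothetical toy tables that share a "key" column
left = pd.DataFrame({"key": ["A", "B", "C"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "right_val": [20, 30, 40]})

inner = left.merge(right, how="inner", on="key")       # keys found in both tables
outer = left.merge(right, how="outer", on="key")       # keys found in either table
left_join = left.merge(right, how="left", on="key")    # every key of the left table
right_join = left.merge(right, how="right", on="key")  # every key of the right table

for name, result in [("inner", inner), ("outer", outer),
                     ("left", left_join), ("right", right_join)]:
    print(name, result.shape)
```

Rows without a match on the other side are filled with `NaN`, e.g. `right_val` for key `A` in the left join.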



### Let's do an example
- two data sets (GDP, Population) from the World Bank

In [14]:
# read in the datasets
gdp = pd.read_csv("./data/worldbank/WorldBank_GDP.csv")
pop = pd.read_csv("./data/worldbank/WorldBank_POP.csv")

In [None]:
gdp.head(10)

In [None]:
pop.head(10)

Now, we will use `.merge()` to combine the 2 datasets. 

NOTE: We can specify more than one column on which to merge by passing a list to `on`, if our datasets have 2+ columns in common.

In [None]:
world_data = gdp.merge(pop, how="left", on=["Country Name", "Year"])

world_data.head()

Note how the columns that had the same name in both original tables (and were not used as merge keys) are now marked with `_x` or `_y` at the end. `_x` is for the left (first) original table, and `_y` is for the right (second) original table.

Let's have a look at some additional parameters of merge.

- e.g. suffixes, left_on, right_on

<img src="./pics/pandas dataframe merge.PNG" width = 600/>

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

Using the `suffixes=` parameter, we can change the default `_x` and `_y` suffixes.

In [None]:
world_data = gdp.merge(pop, how="left", on=["Country Name", "Year"], suffixes=("_gdp", "_pop"))

world_data.head()
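
The `left_on` and `right_on` parameters mentioned above are useful when the key column has a different name in each table. A minimal sketch, assuming two made-up tables (`gdp_small`, `pop_small`) with placeholder values rather than real World Bank figures:

```python
import pandas as pd

# Hypothetical tables: the key is "Country Name" on the left, "country" on the right
gdp_small = pd.DataFrame({"Country Name": ["Chile", "Kenya"], "GDP": [1.0, 2.0]})
pop_small = pd.DataFrame({"country": ["Chile", "Kenya"], "Population": [10, 20]})

merged = gdp_small.merge(
    pop_small, how="left", left_on="Country Name", right_on="country"
)

# Both key columns are kept in the result, so drop the redundant one
merged = merged.drop(columns="country")
print(merged)
```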

**Relationship between two data sets**

<img src="./pics/One-to-One Relationships.PNG" width = 600/>


<img src="./pics/One-to-Many Relationship.PNG" width = 600/>
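
If we expect a particular relationship between the two tables, `merge` can check it for us through its `validate` parameter. The sketch below uses two made-up tables (`countries`, `yearly`) to illustrate this; the data is hypothetical.

```python
import pandas as pd

# Hypothetical data: each code appears once in `countries`, several times in `yearly`
countries = pd.DataFrame({"code": ["AA", "BB"], "name": ["Country A", "Country B"]})
yearly = pd.DataFrame(
    {"code": ["AA", "AA", "BB"], "year": [2000, 2001, 2000], "value": [1, 2, 3]}
)

# Passes: the keys are unique in the left table
one_to_many = countries.merge(yearly, on="code", validate="one_to_many")
print(one_to_many)

# Fails: the keys are not unique in the right table
try:
    countries.merge(yearly, on="code", validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError as err:
    one_to_one_ok = False
    print("Not a one-to-one relationship:", err)
```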

## Concatenating two dataframes
Concatenation is used when we want to add more data *with the exact same columns* to our existing dataframe. You can think of it as tacking on more rows to the original dataframe. 


<img src="./pics/concat.PNG" width = 600/>

**Example:**

In [None]:
# read in our data
df = pd.read_csv("http://bit.ly/kaggletrain")

print("Shape of Original Dataframe: " + str(df.shape))

In [None]:
df.head()

Next, we split our original dataset into two smaller datasets, each with fewer rows.

In [None]:
df1 = df.iloc[:400, :]
df2 = df.iloc[400:, :]

print("Shape of DF1: " + str(df1.shape))
print("Shape of DF2: " + str(df2.shape))

Finally, we use concat to stitch them back together.

In [None]:
df_concat = pd.concat([df1, df2])
print("Shape of df_concat: " + str(df_concat.shape))

In [None]:
# check that the two DataFrames have the same shape and contain the same elements
df_concat.equals(df)

**Additional parameters in concat**

<img src="./pics/pandas_concat.PNG" width = 600/>

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
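
A few of these parameters in action, as a rough sketch on two tiny made-up dataframes (`top` and `bottom` are hypothetical names):

```python
import pandas as pd

top = pd.DataFrame({"a": [1, 2]})
bottom = pd.DataFrame({"a": [3, 4]})

# Default: the original index labels are kept, so 0 and 1 appear twice
kept = pd.concat([top, bottom])

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([top, bottom], ignore_index=True)

# keys= adds an outer index level recording which input each row came from
labelled = pd.concat([top, bottom], keys=["top", "bottom"])

# axis=1 places the dataframes side by side instead of stacking rows
side_by_side = pd.concat([top, bottom], axis=1)

print(kept.index.tolist())
print(renumbered.index.tolist())
```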

Let's redo our example and add the `verify_integrity` parameter.

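`verify_integrity=True` checks whether the index of the concatenated result contains duplicate labels and raises a `ValueError` if it does. Note that it looks at the index labels, not at the row contents.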

In [24]:
df_concat = pd.concat([df1, df2], verify_integrity=True)

**Let's create dataframes with a duplicate** -> both slices now contain the row with index label 399, so we get an error that reports the overlapping index value

In [None]:
df1 = df.iloc[:400, :]   # rows 0..399
df2 = df.iloc[399:, :]   # rows 399..end, so row 399 is in both slices

df_concat = pd.concat([df1, df2], verify_integrity=True)
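
If we do not want the notebook to stop at this error, we can catch the `ValueError` and clean up the overlap ourselves. The sketch below is one possible approach, shown on a small made-up dataframe (`data`) instead of the Titanic data; keeping the first row per index label is an assumption about what "cleaning up" should mean here.

```python
import pandas as pd

# Hypothetical stand-in for df: six rows with the default index 0..5
data = pd.DataFrame({"value": range(6)})
part1 = data.iloc[:4, :]  # index labels 0..3
part2 = data.iloc[3:, :]  # index labels 3..5, so label 3 overlaps

try:
    pd.concat([part1, part2], verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print("Concatenation rejected:", err)

# One possible cleanup: concatenate, then keep the first row for each index label
combined = pd.concat([part1, part2])
deduplicated = combined[~combined.index.duplicated(keep="first")]
print(deduplicated)
```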