In [1]:
import pandas as pd

## Joining and Concatenating Data

Sometimes we have several data sources that we would like to combine. In pandas this is done with a merge, which works like a join in SQL.

To merge data from different sources, each dataset needs a common feature (a key column) on which the rows can be matched. We also have to decide which type of join to use: inner, outer, left, or right.

<table><tr><td><img src='./pics/inner_join.PNG' width = 400></td><td><img src='pics/outer_join.PNG' width = 400></td></tr></table>
<table><tr><td><img src='./pics/left_join.PNG' width = 400></td><td><img src='pics/right_join.PNG' width = 400></td></tr></table>

**Examples** 

**Inner Join** <br>
<img src="./pics/inner_join example.PNG" width = 400/>

**Outer Join** <br>
<img src="./pics/outer_join example.PNG" width = 400/>

**Left Join** <br>
<img src="./pics/left_join example.PNG" width = 400/>

**Right Join** <br>
<img src="./pics/right_join example.PNG" width = 400/>
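
The four join types can also be compared in code. The following is a minimal sketch on two made-up tables (`left`, `right`, and their columns are hypothetical and not part of the World Bank data used later); it runs the same merge with each `how` value and prints which keys survive.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration only
left = pd.DataFrame({"key": ["A", "B", "C"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "right_val": [20, 30, 40]})

# Run the same merge with each join type and collect the results
results = {
    how: left.merge(right, how=how, on="key")
    for how in ["inner", "outer", "left", "right"]
}

for how, merged in results.items():
    # Show which keys are kept by each join type
    print(how, sorted(merged["key"]))
```

Keys that exist in only one table are dropped by the inner join, kept by the outer join, and kept only for the respective side in the left and right joins; the missing values on the other side are filled with `NaN`.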



### Let's do an example
- Two datasets (GDP and population) from the World Bank

In [5]:
# read in the datasets
gdp = pd.read_csv("./data/worldbank/WorldBank_GDP.csv")
pop = pd.read_csv("./data/worldbank/WorldBank_POP.csv")

In [6]:
gdp.head(10)

Unnamed: 0,Country Name,Country Code,Indicator Name,Year,GDP
0,China,CHN,GDP (current US$),2010,6087160000000.0
1,Germany,DEU,GDP (current US$),2010,3417090000000.0
2,Japan,JPN,GDP (current US$),2010,5700100000000.0
3,United States,USA,GDP (current US$),2010,14992100000000.0
4,China,CHN,GDP (current US$),2011,7551500000000.0
5,Germany,DEU,GDP (current US$),2011,3757700000000.0
6,Japan,JPN,GDP (current US$),2011,6157460000000.0
7,United States,USA,GDP (current US$),2011,15542600000000.0
8,China,CHN,GDP (current US$),2012,8532230000000.0
9,Germany,DEU,GDP (current US$),2012,3543980000000.0


In [7]:
pop.head(10)

Unnamed: 0,Country Name,Country Code,Indicator Name,Year,Pop
0,Aruba,ABW,"Population, total",2010,101669.0
1,Afghanistan,AFG,"Population, total",2010,29185507.0
2,Angola,AGO,"Population, total",2010,23356246.0
3,Albania,ALB,"Population, total",2010,2913021.0
4,Andorra,AND,"Population, total",2010,84449.0
5,Arab World,ARB,"Population, total",2010,354890042.0
6,United Arab Emirates,ARE,"Population, total",2010,8549988.0
7,Argentina,ARG,"Population, total",2010,40788453.0
8,Armenia,ARM,"Population, total",2010,2877319.0
9,American Samoa,ASM,"Population, total",2010,56079.0


Now we will use `.merge()` to combine the two datasets.

NOTE: We can merge on more than one column when the datasets have two or more columns in common.

In [8]:
world_data = gdp.merge(pop, how="left", on=["Country Name", "Year"])

world_data.head()

Unnamed: 0,Country Name,Country Code_x,Indicator Name_x,Year,GDP,Country Code_y,Indicator Name_y,Pop
0,China,CHN,GDP (current US$),2010,6087160000000.0,CHN,"Population, total",1337705000.0
1,Germany,DEU,GDP (current US$),2010,3417090000000.0,DEU,"Population, total",81776930.0
2,Japan,JPN,GDP (current US$),2010,5700100000000.0,JPN,"Population, total",128070000.0
3,United States,USA,GDP (current US$),2010,14992100000000.0,USA,"Population, total",309326100.0
4,China,CHN,GDP (current US$),2011,7551500000000.0,CHN,"Population, total",1344130000.0


Note how the columns that had the same name in both original tables, but were not used as merge keys, now end in `_x` or `_y`. The `_x` suffix marks the column from the left (first) table, and `_y` marks the column from the right (second) table.
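
One way to avoid the duplicated `Country Code` columns is to include that column in the merge keys. The sketch below assumes small stand-in tables (`gdp_small` and `pop_small` are hypothetical names; the values are copied from the outputs above, and the `Indicator Name` column is left out for brevity).

```python
import pandas as pd

# Small stand-ins for the gdp and pop tables (hypothetical names)
gdp_small = pd.DataFrame({
    "Country Name": ["China", "Germany"],
    "Country Code": ["CHN", "DEU"],
    "Year": [2010, 2010],
    "GDP": [6087160000000.0, 3417090000000.0],
})
pop_small = pd.DataFrame({
    "Country Name": ["China", "Germany"],
    "Country Code": ["CHN", "DEU"],
    "Year": [2010, 2010],
    "Pop": [1337705000.0, 81776930.0],
})

# Every shared column is a merge key, so no suffix is needed
keys = ["Country Name", "Country Code", "Year"]
merged = gdp_small.merge(pop_small, how="left", on=keys)

print(list(merged.columns))
```

A shared column whose values differ between the tables, such as `Indicator Name` in the real data, cannot be used as a key and would still receive a suffix.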

Let's have a look at some additional parameters of `merge`.

- e.g. `suffixes`, `left_on`, `right_on`

<img src="./pics/pandas dataframe merge.PNG" width = 600/>

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

Using the `suffixes=` parameter, we can change the default `_x` and `_y` suffixes.

In [12]:
world_data = gdp.merge(pop, how="left", on=["Country Name", "Year"], suffixes=("_gdp", "_pop"))

world_data.head()

Unnamed: 0,Country Name,Country Code_gdp,Indicator Name_gdp,Year,GDP,Country Code_pop,Indicator Name_pop,Pop
0,China,CHN,GDP (current US$),2010,6087160000000.0,CHN,"Population, total",1337705000.0
1,Germany,DEU,GDP (current US$),2010,3417090000000.0,DEU,"Population, total",81776930.0
2,Japan,JPN,GDP (current US$),2010,5700100000000.0,JPN,"Population, total",128070000.0
3,United States,USA,GDP (current US$),2010,14992100000000.0,USA,"Population, total",309326100.0
4,China,CHN,GDP (current US$),2011,7551500000000.0,CHN,"Population, total",1344130000.0
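
The `left_on` and `right_on` parameters are useful when the key columns have different names in the two tables. The following is a minimal sketch that assumes a hypothetical population table whose country column is called `Nation` (that name and the small tables are invented for illustration; the values come from the outputs above).

```python
import pandas as pd

gdp_small = pd.DataFrame({
    "Country Name": ["China", "Germany"],
    "Year": [2010, 2010],
    "GDP": [6087160000000.0, 3417090000000.0],
})
# Hypothetical variant of the population table with a differently named key
pop_renamed = pd.DataFrame({
    "Nation": ["China", "Germany"],
    "Year": [2010, 2010],
    "Pop": [1337705000.0, 81776930.0],
})

# Pair up the key columns by position: "Country Name" with "Nation", "Year" with "Year"
merged = gdp_small.merge(
    pop_renamed,
    how="left",
    left_on=["Country Name", "Year"],
    right_on=["Nation", "Year"],
)

# The differently named right-hand key is kept in the result, so drop it
merged = merged.drop(columns="Nation")
print(merged)
```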


**Relationship between two data sets**

<img src="./pics/One-to-One Relationships.PNG" width = 600/>


<img src="./pics/One-to-Many Relationship.PNG" width = 600/>
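
If we expect a particular relationship between the two tables, the `validate` parameter of `merge` can check it for us. The sketch below uses two made-up tables (`countries` and `gdp_rows` are hypothetical names; the GDP values are taken from the output above): each country code appears once in the first table and several times in the second.

```python
import pandas as pd
from pandas.errors import MergeError

# One row per country (hypothetical lookup table)
countries = pd.DataFrame({
    "Country Code": ["CHN", "DEU"],
    "Country Name": ["China", "Germany"],
})
# Several rows per country (one per year)
gdp_rows = pd.DataFrame({
    "Country Code": ["CHN", "CHN", "DEU"],
    "Year": [2010, 2011, 2010],
    "GDP": [6087160000000.0, 7551500000000.0, 3417090000000.0],
})

# Passes: keys are unique on the left side only
one_to_many = countries.merge(gdp_rows, on="Country Code", validate="one_to_many")
print(one_to_many.shape)

# Fails: the right side contains repeated keys, so it is not one-to-one
try:
    countries.merge(gdp_rows, on="Country Code", validate="one_to_one")
    one_to_one_ok = True
except MergeError as err:
    one_to_one_ok = False
    print("Not one-to-one:", err)
```

Checking the relationship up front can catch unexpected duplicate keys before they silently multiply rows in the merged result.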

## Concatenating two dataframes
Concatenation is used when we want to add more data *with the exact same columns* to our existing dataframe. You can think of it as tacking on more rows to the original dataframe. 


<img src="./pics/concat.PNG" width = 600/>

**Example:**

In [13]:
# read in our data
df = pd.read_csv("http://bit.ly/kaggletrain")

print("Shape of Original Dataframe: " + str(df.shape))

Shape of Original Dataframe: (891, 12)


In [14]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Next, we split our original dataset into two smaller datasets, each with fewer rows.

In [16]:
df1 = df.iloc[:400, :]
df2 = df.iloc[400:, :]

print("Shape of DF1: " + str(df1.shape))
print("Shape of DF2: " + str(df2.shape))

Shape of DF1: (400, 12)
Shape of DF2: (491, 12)


Finally, we use concat to stitch them back together.

In [17]:
df_concat = pd.concat([df1, df2])
print("Shape of df_concat: " + str(df_concat.shape))

Shape of df_concat: (891, 12)


In [18]:
# .equals() returns True when two Series/DataFrames have the same shape and contain the same elements
df_concat.equals(df)

True

**Additional parameters in concat**

<img src="./pics/pandas_concat.PNG" width = 600/>

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
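
One of these parameters, `join`, controls what happens when the dataframes do not have exactly the same columns. The following is a small sketch on made-up frames (`first`, `second`, and their values are hypothetical) that compares the default `join="outer"` with `join="inner"`.

```python
import pandas as pd

# Hypothetical toy frames that share only the "id" column
first = pd.DataFrame({"id": [1, 2], "fare": [7.25, 71.28]})
second = pd.DataFrame({"id": [3, 4], "cabin": ["C85", "C123"]})

# Default join="outer": union of all columns, gaps filled with NaN
outer = pd.concat([first, second])

# join="inner": keep only the columns present in every frame
inner = pd.concat([first, second], join="inner", ignore_index=True)

print(list(outer.columns))
print(list(inner.columns))
```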

Let's redo our example and add the `verify_integrity` parameter.

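Setting `verify_integrity=True` checks whether the concatenated result would contain duplicate index labels and raises a `ValueError` if it does. Note that it compares index values, not the contents of the rows.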

In [19]:
df_concat = pd.concat([df1, df2], verify_integrity=True)

**Let's create two dataframes that overlap.** Both slices now contain the row with index label 399, so we get an error that reports the duplicated index value.

In [23]:
df1 = df.iloc[:400, :]
df2 = df.iloc[399:, :]

df_concat = pd.concat([df1, df2], verify_integrity=True)

ValueError: Indexes have overlapping values: Int64Index([399], dtype='int64')
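
Once an overlap like this is found, there are a couple of ways to handle it. The sketch below uses a small made-up dataframe (`toy` and its values are hypothetical, standing in for the Titanic data) and shows two possible options: dropping the duplicated rows, or keeping every row and renumbering the index with `ignore_index=True`. Which one is appropriate depends on whether the repeated rows are genuine duplicates.

```python
import pandas as pd

# Hypothetical toy data standing in for the larger dataframe
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5], "value": [10, 20, 30, 40, 50]})
part1 = toy.iloc[:3, :]
part2 = toy.iloc[2:, :]  # the row with index label 2 is in both parts

# verify_integrity=True raises because index label 2 appears twice
try:
    pd.concat([part1, part2], verify_integrity=True)
except ValueError as err:
    print("verify_integrity raised:", err)

# Option 1: concatenate, then drop rows that are exact duplicates
deduped = pd.concat([part1, part2]).drop_duplicates()

# Option 2: keep every row and build a fresh 0..n-1 index
renumbered = pd.concat([part1, part2], ignore_index=True)

print(deduped.shape, renumbered.shape)
```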