In [None]:
import pandas as pd

# 8 Merge

Pandas makes it easy to combine data from several DataFrames. It supports concatenation as well as database-style joins such as inner, outer, and left joins.

## 8.1 The meetup dataset

We use the meetup dataset to explain how concatenation and merging work. The meetup dataset consists of:

- groups1 and groups2 data
- categories
- cities

### 8.1.1 Load groups

In [None]:
groups1 = pd.read_csv("groups1.csv")
display(groups1)  # show explicitly; a cell only auto-displays its last expression
groups2 = pd.read_csv("groups2.csv")
groups2

### 8.1.2 Load categories

In [None]:
categories = pd.read_csv("categories.csv")
categories

### 8.1.3 Load cities

In [None]:
cities = pd.read_csv("cities.csv", dtype={"zip": "string"})
cities

## 8.2 Concatenate groups

You can concatenate `groups1` and `groups2` with the `pd.concat()` function.

### 8.2.1 Keep index from original DataFrame

In [None]:
groups = pd.concat(objs=[groups1, groups2])

### 8.2.2 Drop original index and create a new one

In [None]:
groups = pd.concat(objs=[groups1, groups2], ignore_index=True)
groups

### 8.2.3 Create a MultiIndex

In [None]:
groups = pd.concat(objs=[groups1, groups2], keys=["G1", "G2"])
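
To see how the three variants above differ, here is a minimal sketch on two tiny hypothetical frames (`a` and `b` stand in for `groups1` and `groups2`; their columns and values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frames standing in for groups1 and groups2.
a = pd.DataFrame({"group_id": [1, 2], "name": ["Hikers", "Coders"]})
b = pd.DataFrame({"group_id": [3], "name": ["Bakers"]})

kept = pd.concat(objs=[a, b])                      # original row labels are kept, so they repeat
fresh = pd.concat(objs=[a, b], ignore_index=True)  # labels are replaced by a new 0..n-1 index
keyed = pd.concat(objs=[a, b], keys=["G1", "G2"])  # outer level records the source frame

print(kept.index.tolist())
print(fresh.index.tolist())
print(keyed.index.tolist())

# With keys, the rows of one source frame can be selected by its key.
print(keyed.loc["G2"])
```

The keyed form is handy when you later need to know which file a row came from, while `ignore_index=True` is usually the simplest choice when the original labels carry no meaning.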

## 8.3 Left join

A left join keeps every row of the left data set, which makes it particularly useful when one data set is the focal point of the analysis. We pull in the second data set to provide supplemental information related to the primary data set.

<img src="images/left-join.png" alt="Left Join" width="40%"/>

Let's supplement the groups dataset with category names by left joining the `categories` dataset:

In [None]:
groups.merge(categories, how="left", on="category_id").drop(columns=["category_id"])
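
The behaviour of a left join is easiest to see on a small example. The sketch below uses hypothetical toy frames (`left` and `right`, with made-up names and ids) rather than the meetup files:

```python
import pandas as pd

# Hypothetical toy data: the last group points at a category id that does not exist.
left = pd.DataFrame({"name": ["Hikers", "Coders", "Bakers"],
                     "category_id": [1, 2, 99]})
right = pd.DataFrame({"category_id": [1, 2, 3],
                      "category_name": ["Outdoors", "Tech", "Music"]})

joined = left.merge(right, how="left", on="category_id")

# Every row of `left` survives; the unmatched one gets a missing category_name.
# The unused category "Music" does not appear at all.
print(joined)
```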

## 8.4 Inner join

An inner join keeps only the rows whose key values exist in both DataFrames.

<img src="images/inner-join.png" alt="Inner Join" width="40%"/>

Let's list only the groups with valid categories by inner joining the `categories` dataset:

In [None]:
groups.merge(categories, how="inner", on="category_id")
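
As a small illustration of how an inner join filters rows, the sketch below uses hypothetical toy frames (all names and ids are made up) in which one group has no matching category:

```python
import pandas as pd

# Hypothetical toy data: "Bakers" refers to category 99, which is not in `right`.
left = pd.DataFrame({"name": ["Hikers", "Coders", "Bakers"],
                     "category_id": [1, 2, 99]})
right = pd.DataFrame({"category_id": [1, 2, 3],
                      "category_name": ["Outdoors", "Tech", "Music"]})

inner = left.merge(right, how="inner", on="category_id")

# Only keys present on both sides remain: category ids 1 and 2.
print(inner)
print(len(left), "rows before,", len(inner), "rows after")
```

Comparing the row counts before and after is a quick way to check how many rows an inner join dropped.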

## 8.5 Outer join

An outer join combines all records from both data sets, whether or not they have a match on the other side.

<img src="images/outer-join.png" alt="Outer Join" width="40%"/>

Let's list groups with cities by outer joining the `cities` dataset:

In [None]:
groups.merge(cities, how="outer", left_on="city_id", right_on="id")

If you wish to know which DataFrame each row of the merged dataset comes from, pass `indicator=True` to the `merge()` method. It adds a `_merge` column whose value is `left_only`, `right_only`, or `both`:

In [None]:
m1 = groups.merge(cities, how="outer", left_on="city_id", right_on="id", indicator=True)
m1[m1["_merge"] == "right_only"]
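
The following sketch shows the `_merge` column on hypothetical toy frames (`groups_toy` and `cities_toy` are made up for illustration and do not reflect the real meetup data):

```python
import pandas as pd

# Hypothetical toy data: city 30 has no city record, and city 20 has no group.
groups_toy = pd.DataFrame({"name": ["Hikers", "Coders"], "city_id": [10, 30]})
cities_toy = pd.DataFrame({"id": [10, 20], "city": ["Boston", "Denver"]})

merged = groups_toy.merge(cities_toy, how="outer",
                          left_on="city_id", right_on="id", indicator=True)

# Count how many rows matched and how many came from one side only.
print(merged["_merge"].value_counts())

# Groups whose city_id has no matching city.
orphans = merged[merged["_merge"] == "left_only"]
print(orphans[["name", "city_id"]])
```

Filtering on `left_only` or `right_only` like this is a common way to audit a join for missing reference data.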

## 8.6 Join with index

We can also join on an index instead of a column, as long as the index holds the key values.

Let's change the `cities` index to its id column:

In [24]:
cities2 = cities.set_index('id')

### 8.6.1 Merge on index

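Use the `left_index` and `right_index` parameters of `merge()` to join on the index of the left or right DataFrame. Here we match the `city_id` column of `groups` against the index of `cities2`: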

In [None]:
# Match groups.city_id against the index of cities2 (the city id)
groups.merge(cities2, how="left", left_on="city_id", right_index=True)
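
To compare joining on an index with joining on a column, here is a minimal sketch on hypothetical toy frames (the names `groups_toy`, `cities_toy`, and `cities_by_id` and their values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data.
groups_toy = pd.DataFrame({"name": ["Hikers", "Coders"], "city_id": [20, 10]})
cities_toy = pd.DataFrame({"id": [10, 20], "city": ["Boston", "Denver"]})
cities_by_id = cities_toy.set_index("id")

# Join the city_id column against the index of the right frame.
via_index = groups_toy.merge(cities_by_id, how="left",
                             left_on="city_id", right_index=True)

# Equivalent column-based join; it keeps the redundant "id" column.
via_column = groups_toy.merge(cities_toy, how="left",
                              left_on="city_id", right_on="id")

print(via_index)
print(via_column)
```

Both joins attach the same city to each group. The index-based version avoids carrying a second copy of the key as a column, which is why the column-based joins earlier in this chapter sometimes need a follow-up `drop()`.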