## Pandas Tutorial 9: Merging DataFrames

In this tutorial, we explore how to merge DataFrames using Pandas' `merge()` function, similar to database joins. You'll learn to perform left, right, outer, and inner joins, and use advanced options like `indicator` and `suffixes`. 

#### Topics covered:
* **Merging DataFrames with `merge()`**
* **Outer and Inner joins**
* **Left Joins**
* **Using the `indicator` Flag**
* **Applying `suffixes` in Merges**

In [1]:
import pandas as pd

In [2]:
df1 = pd.DataFrame({
    "city": ["new york","chicago","orlando"],
    "temperature": [21,14,35],
})
df1

,city,temperature
0,new york,21
1,chicago,14
2,orlando,35


In [3]:
df2 = pd.DataFrame({
    "city": ["chicago", "new york","orlando"],
    "humidity": [65,68,75],
})
df2

,city,humidity
0,chicago,65
1,new york,68
2,orlando,75


## Merging Two DataFrames

`pd.merge(df1, df2, on="city")` combines two DataFrames using the `city` column.

**Key Features:**
* Combines DataFrames on a common column.
* Produces a new DataFrame with data from both sources.
* Aligns rows based on the matching values in the `city` column.

In [4]:
# Merges df1 and df2 on the 'city' column
df3 = pd.merge(df1, df2, on="city")
df3

,city,temperature,humidity
0,new york,21,68
1,chicago,14,65
2,orlando,35,75
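
As a small aside, the same merge can also be written in method form with `DataFrame.merge`. The sketch below rebuilds the two frames from above and checks that both spellings give the same result; the names `merged_func` and `merged_method` are illustrative and not part of the original notebook.

```python
import pandas as pd

df1 = pd.DataFrame({
    "city": ["new york", "chicago", "orlando"],
    "temperature": [21, 14, 35],
})
df2 = pd.DataFrame({
    "city": ["chicago", "new york", "orlando"],
    "humidity": [65, 68, 75],
})

# Function form and method form of the same merge
merged_func = pd.merge(df1, df2, on="city")
merged_method = df1.merge(df2, on="city")

# Both calls should return identical DataFrames
print(merged_func.equals(merged_method))

# Rows are matched by the value in 'city', not by row position,
# so each city keeps its own humidity even though df2 lists them in a different order
print(merged_method)
```

Note that `df2` lists the cities in a different order than `df1`; the merge still pairs each temperature with the right humidity because it aligns on the key values.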


## Types of Database Joins

In [5]:
df1 = pd.DataFrame({
    "city": ["new york","chicago","orlando","baltimore"],
    "temperature": [21,14,35,32],
})
df1

,city,temperature
0,new york,21
1,chicago,14
2,orlando,35
3,baltimore,32


In [6]:
df2 = pd.DataFrame({
    "city": ["chicago", "new york","san francisco"],
    "humidity": [65,68,71],
})
df2

,city,humidity
0,chicago,65
1,new york,68
2,san francisco,71


## Inner Join (default)
The **inner join** (the default for `merge()`) returns only the rows where there is a match in both DataFrames based on the specified key (`city` in this case).
<img src="inner-join.png" alt="Inner Join" width="300"/>

**Key Features:**
* **Inner Join:** Returns only rows with matching values in both DataFrames.
* **Default Behavior:** The default join type when using `pd.merge()`.
* **Result:** Produces a DataFrame containing only cities that appear in both `df1` and `df2`.

In [7]:
df3 = pd.merge(df1, df2, on="city")
df3

,city,temperature,humidity
0,new york,21,68
1,chicago,14,65
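
To make the "default" claim concrete, the following sketch compares a merge without `how` against one with `how="inner"`, and checks that the resulting cities are exactly the intersection of the two key columns. The variable names (`default_merge`, `explicit_inner`, `common_cities`) are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "city": ["new york", "chicago", "orlando", "baltimore"],
    "temperature": [21, 14, 35, 32],
})
df2 = pd.DataFrame({
    "city": ["chicago", "new york", "san francisco"],
    "humidity": [65, 68, 71],
})

# Omitting `how` should behave the same as how="inner"
default_merge = pd.merge(df1, df2, on="city")
explicit_inner = pd.merge(df1, df2, on="city", how="inner")
print(default_merge.equals(explicit_inner))

# The surviving cities are those present in both key columns
common_cities = set(df1["city"]) & set(df2["city"])
print(set(explicit_inner["city"]) == common_cities)
```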


## Outer Join
The **outer join** returns all rows from both DataFrames. If there is no match, the missing side will have `NaN` values.
<img src="outer-join.png" alt="Outer Join" width="300"/>

**Key Features:**
* **Outer Join:** Includes all rows from both DataFrames, with `NaN` where there are no matches.
* **Result:** Produces a DataFrame containing all cities, even if a city appears in only one DataFrame.

In [11]:
df3 = pd.merge(df1, df2, on="city", how="outer")
df3

,city,temperature,humidity
0,new york,21.0,68.0
1,chicago,14.0,65.0
2,orlando,35.0,
3,baltimore,32.0,
4,san francisco,,71.0
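
The decimal points in the output above (`21.0`, `68.0`) are a side effect of the missing values: `NaN` is a floating-point value, so integer columns that receive it are converted to floats. A minimal sketch of how one might inspect this, using the same data; the name `outer` is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "city": ["new york", "chicago", "orlando", "baltimore"],
    "temperature": [21, 14, 35, 32],
})
df2 = pd.DataFrame({
    "city": ["chicago", "new york", "san francisco"],
    "humidity": [65, 68, 71],
})

outer = pd.merge(df1, df2, on="city", how="outer")

# The inputs hold integers, but the merged columns contain NaN and become floats
print(df1["temperature"].dtype, "->", outer["temperature"].dtype)
print(df2["humidity"].dtype, "->", outer["humidity"].dtype)

# Count the missing values introduced per column
print(outer.isna().sum())
```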


## Left Join
The **left join** returns all rows from the left DataFrame (`df1`), along with matching rows from the right DataFrame (`df2`). If there is no match, `NaN` is used for missing values in the right DataFrame.
<img src="left-join.png" alt="Left Join" width="300"/>

**Key Features:**
* **Left Join:** Keeps all rows from the left DataFrame and adds matching rows from the right DataFrame.
* **Result:** Includes all cities from `df1`, with `NaN` where there's no match in `df2`.

In [9]:
df3 = pd.merge(df1, df2, on="city", how="left")
df3

,city,temperature,humidity
0,new york,21,68.0
1,chicago,14,65.0
2,orlando,35,
3,baltimore,32,
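
One way to sanity-check a left join is to confirm that every row of the left DataFrame survived and to list the rows that found no partner. The sketch below does this for the data above. It assumes the `city` values in `df2` are unique; with duplicate keys on the right, a left join can return more rows than `df1` has.

```python
import pandas as pd

df1 = pd.DataFrame({
    "city": ["new york", "chicago", "orlando", "baltimore"],
    "temperature": [21, 14, 35, 32],
})
df2 = pd.DataFrame({
    "city": ["chicago", "new york", "san francisco"],
    "humidity": [65, 68, 71],
})

left = pd.merge(df1, df2, on="city", how="left")

# Every row of df1 is kept (df2 has unique keys, so no rows are duplicated)
print(len(left) == len(df1))

# Cities from df1 that had no match in df2
unmatched = left.loc[left["humidity"].isna(), "city"].tolist()
print(unmatched)  # ['orlando', 'baltimore']
```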


## Right Join
The **right join** returns all rows from the right DataFrame (`df2`), along with matching rows from the left DataFrame (`df1`). If there is no match, `NaN` is used for missing values in the left DataFrame.

<img src="right-join.png" alt="Right Join" width="300"/>

**Key Features:**
* **Right Join:** Keeps all rows from the right DataFrame and adds matching rows from the left DataFrame.
* **Result:** Includes all cities from `df2`, with `NaN` where there's no match in `df1`.

In [10]:
df3 = pd.merge(df1, df2, on="city", how="right")
df3

,city,temperature,humidity
0,chicago,14.0,65
1,new york,21.0,68
2,san francisco,,71
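
A right join is the mirror image of a left join: swapping the two DataFrames and using `how="left"` should give the same rows, only with the non-key columns in a different order. The following sketch checks that for this data; `right` and `swapped_left` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({
    "city": ["new york", "chicago", "orlando", "baltimore"],
    "temperature": [21, 14, 35, 32],
})
df2 = pd.DataFrame({
    "city": ["chicago", "new york", "san francisco"],
    "humidity": [65, 68, 71],
})

right = pd.merge(df1, df2, on="city", how="right")
swapped_left = pd.merge(df2, df1, on="city", how="left")

# Column order follows the order of the arguments
print(right.columns.tolist())
print(swapped_left.columns.tolist())

# After aligning the column order, the two results should match;
# assert_frame_equal raises an AssertionError if they differ
pd.testing.assert_frame_equal(right, swapped_left[right.columns])
print("right join matches swapped left join")
```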


## Outer Join with `indicator=True` flag

The `indicator=True` flag adds a new column named `_merge` that shows the origin of each row - whether it exists in the left DataFrame, right DataFrame, or both.

**Key Features:**
* **Indicator Column:** Adds `_merge` to show whether rows are from the left DataFrame, right DataFrame, or both.
* **Outer Join:** Returns all rows from both DataFrames, with `NaN` where there's no match.
* **Result:** Provides extra insight into the source of each row in the merged DataFrame.

In [12]:
df3 = pd.merge(df1, df2, on="city", how="outer", indicator=True)
df3

,city,temperature,humidity,_merge
0,new york,21.0,68.0,both
1,chicago,14.0,65.0,both
2,orlando,35.0,,left_only
3,baltimore,32.0,,left_only
4,san francisco,,71.0,right_only
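
The `_merge` column is handy for filtering. As a sketch of one common use, the code below picks out the cities that appear only in `df1` and counts how many rows fall into each category; the names `outer` and `only_in_df1` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "city": ["new york", "chicago", "orlando", "baltimore"],
    "temperature": [21, 14, 35, 32],
})
df2 = pd.DataFrame({
    "city": ["chicago", "new york", "san francisco"],
    "humidity": [65, 68, 71],
})

outer = pd.merge(df1, df2, on="city", how="outer", indicator=True)

# Keep only the rows whose key was found in the left DataFrame alone
only_in_df1 = outer.loc[outer["_merge"] == "left_only", "city"]
print(sorted(only_in_df1))  # ['baltimore', 'orlando']

# Number of rows per origin: both, left_only, right_only
print(outer["_merge"].value_counts())
```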


## Merging with Column Suffixes

When the two DataFrames share column names other than the join key (here `temperature` and `humidity`), Pandas automatically appends the suffixes `_x` (left DataFrame) and `_y` (right DataFrame) to tell the two sets of columns apart.

**Key Features:**
* **Automatic Suffixes:** Pandas adds `_x` and `_y` to non-key columns that exist in both DataFrames.
* **Result:** The merged DataFrame contains data from both DataFrames, with clearly marked columns to avoid conflicts.

This is useful when combining datasets that have overlapping column names.

In [13]:
df1 = pd.DataFrame({
    "city": ["new york","chicago","orlando","baltimore"],
    "temperature": [21,14,35,38],
    "humidity": [65,68,71,75]
})
df1

,city,temperature,humidity
0,new york,21,65
1,chicago,14,68
2,orlando,35,71
3,baltimore,38,75


In [14]:
df2 = pd.DataFrame({
    "city": ["chicago","new york","san diego"],
    "temperature": [21,14,35],
    "humidity": [65,68,71]
})
df2

,city,temperature,humidity
0,chicago,21,65
1,new york,14,68
2,san diego,35,71


In [15]:
df3 = pd.merge(df1, df2, on="city")
df3

,city,temperature_x,humidity_x,temperature_y,humidity_y
0,new york,21,65,14,68
1,chicago,14,68,21,65


## Merging with Custom Suffixes

You can use the `suffixes` argument to customize the suffixes added to overlapping column names during a merge. Here, `_left` and `_right` are used instead of the default `_x` and `_y`.

**Key Features:**
* **Custom Suffixes:** Replaces the default `_x` and `_y` suffixes with user-defined ones (`_left` and `_right`).
* **Result:** A more descriptive and readable DataFrame when merging columns with the same names from different DataFrames.

In [16]:
df3 = pd.merge(df1, df2, on="city", suffixes=("_left", "_right"))
df3

,city,temperature_left,humidity_left,temperature_right,humidity_right
0,new york,21,65,14,68
1,chicago,14,68,21,65
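
As a variation on the cell above, the `suffixes` argument also accepts `None` for one side, which leaves that side's column names unchanged. The sketch below keeps the left names as they are and marks only the right-hand columns; treat it as an illustration of the parameter, not as part of the original notebook.

```python
import pandas as pd

df1 = pd.DataFrame({
    "city": ["new york", "chicago", "orlando", "baltimore"],
    "temperature": [21, 14, 35, 38],
    "humidity": [65, 68, 71, 75],
})
df2 = pd.DataFrame({
    "city": ["chicago", "new york", "san diego"],
    "temperature": [21, 14, 35],
    "humidity": [65, 68, 71],
})

# None for the left side: left column names stay as-is,
# only the overlapping right-hand columns get "_right"
one_sided = pd.merge(df1, df2, on="city", suffixes=(None, "_right"))
print(one_sided.columns.tolist())
# ['city', 'temperature', 'humidity', 'temperature_right', 'humidity_right']
```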


In [17]:
df1 = pd.DataFrame({
    "city": ["new york","chicago","orlando"],
    "temperature": [21,14,35],
})
df1.set_index('city',inplace=True)
df1

city,temperature
new york,21
chicago,14
orlando,35


In [18]:
df2 = pd.DataFrame({
    "city": ["chicago","new york","orlando"],
    "humidity": [65,68,75],
})
df2.set_index('city',inplace=True)
df2

city,humidity
chicago,65
new york,68
orlando,75


## Joining DataFrames with Suffixes

The `join()` method combines two DataFrames using their index. The `lsuffix` and `rsuffix` parameters add custom suffixes to column names that appear in both the left and right DataFrames. Because `df1` and `df2` here have no column names in common, the suffixes are not actually applied in the output below.

**Key Features:**
* **Joining on Index:** Joins DataFrames based on their index values.
* **Custom Suffixes:** `lsuffix` and `rsuffix` prevent column name conflicts by adding custom suffixes (`_l` and `_r`) to overlapping columns.
* **Result:** A joined DataFrame with columns clearly distinguished using custom suffixes.

In [19]:
df1.join(df2, lsuffix='_l', rsuffix='_r')

city,temperature,humidity
new york,21,68
chicago,14,65
orlando,35,75
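
To see `lsuffix` and `rsuffix` take effect, the two DataFrames need a column name in common. The sketch below uses two made-up frames, `temps_a` and `temps_b` (hypothetical names and values, not from the notebook), that both carry a `temperature` column and are indexed by `city`.

```python
import pandas as pd

# Hypothetical data: two temperature readings per city, indexed by city
temps_a = pd.DataFrame(
    {"temperature": [21, 14, 35]},
    index=pd.Index(["new york", "chicago", "orlando"], name="city"),
)
temps_b = pd.DataFrame(
    {"temperature": [20, 15, 33]},
    index=pd.Index(["chicago", "new york", "orlando"], name="city"),
)

# With overlapping column names, the suffixes are applied
joined = temps_a.join(temps_b, lsuffix="_l", rsuffix="_r")
print(joined.columns.tolist())  # ['temperature_l', 'temperature_r']
print(joined)

# Without suffixes, join() refuses to guess and raises a ValueError
try:
    temps_a.join(temps_b)
    raised = False
except ValueError as err:
    raised = True
    print("join failed:", err)
```

Unlike `merge()`, which falls back to `_x` and `_y`, `join()` has no default suffixes, so overlapping columns must be disambiguated explicitly.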
