# **📘 Day 8 – Merging, Joining & Concatenating🐼**

#### **Goal:** Learn to combine multiple DataFrames into one for richer analysis.

#### **Topics to Cover:** Concatenation, Merging and Joining.

***

## **Introduction 🌱**

**Merging**, **Joining** and **Concatenation** are fundamental pandas operations for combining different DataFrames. These are the tools you use to integrate data from multiple sources into a single, cohesive dataset for analysis.

- **`Concatenation`:** is like stacking two DataFrames on top of each other (vertically) or side-by-side (horizontally). It's used when the DataFrames have similar columns or rows.

- **`Merging` & `Joining`:** are database-style operations. They combine two DataFrames based on a shared column (a "key") to link corresponding rows. This is essential for bringing together information that lives in separate tables.

### **Importance for an AI/ML Student**

Data for machine learning models rarely comes from a single file. You'll often have multiple datasets that need to be combined before you can start training. Merging, joining, and concatenating are the primary methods you'll use for:

- **Feature Engineering:** You can merge a dataset of customer demographics with their purchase history to create new features, like "average age of customers who buy a specific product."

- **Creating a Unified Dataset:** You might have one CSV file with product information and another with sales data. You need to join them to calculate revenue per product.

- **Data Aggregation from Multiple Sources:** Imagine you have daily sales data from multiple regional offices. You can concatenate them into a single DataFrame to get a holistic view of global sales.

---

## Let's begin 🚀

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

---

### **1. `Concatenation` 🔗**

#### Methods and Attributes.

| Method      | Purpose                                                     | Syntax                  |
|-------------|-----------------------------------------------------------------------------|-------------------------|
| `pd.concat()` | Stacks DataFrames/Series together vertically (rows) or horizontally (columns). | `pd.concat([df1, df2])` |

**1.1 `pd.concat()`:**  This method is used to combine a list of DataFrames along a specified axis. It can stack them **vertically (rows)** or **horizontally (columns)**, with flexible alignment rules.

**Parameters:**

- `objs`: The list of DataFrames you want to combine.

- `axis`: Which way to combine them. 0 (default) for rows, 1 for columns.

- `join`: How to handle columns that aren't in all DataFrames. 'outer' (default) keeps all columns and fills missing values with NaN. 'inner' keeps only the columns found in all DataFrames.

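- `ignore_index`: If True, the original labels along the concatenation axis are discarded and replaced with 0, 1, 2, ... (row labels when `axis=0`, column labels when `axis=1`).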

<br>

#### **Important to Note:**

**`axis=0` (vertical stacking / add rows)**  
  - Rows are added one below the other.  
  - **Alignment happens on columns.**  
  - `join` decides which columns to keep:  
    - `outer`: keep all columns (union).  
    - `inner`: keep only the common columns (intersection).  

**`axis=1` (horizontal stacking / add columns)**  
  - Columns are added side by side.  
  - **Alignment happens on row indexes.**  
  - `join` decides which indexes to keep:  
    - `outer`: keep all indexes (union).  
    - `inner`: keep only the common indexes (intersection).  

👉 **Rule of thumb:** the `join` always applies to **the other axis**.  
- If stacking by rows (`axis=0`), `join` applies to **columns**.  
- If stacking by columns (`axis=1`), `join` applies to **indexes**.
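
The rule of thumb is easiest to see when the inputs do not line up perfectly. The sketch below uses small hypothetical DataFrames (`people_a`, `people_b`, `math_df`, `science_df` are illustrative names, not part of the examples that follow) and compares `join='outer'` with `join='inner'` on both axes.

```python
import pandas as pd

# Hypothetical frames: both have 'id' and 'name', and each has one extra column.
people_a = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"], "city": ["New York", "Chicago"]})
people_b = pd.DataFrame({"id": [3, 4], "name": ["Cara", "Dan"], "age": [31, 45]})

# axis=0 stacks rows, so `join` is applied to the columns.
rows_outer = pd.concat([people_a, people_b], axis=0, join="outer", ignore_index=True)
rows_inner = pd.concat([people_a, people_b], axis=0, join="inner", ignore_index=True)
print(rows_outer.columns.tolist())  # ['id', 'name', 'city', 'age'] (gaps filled with NaN)
print(rows_inner.columns.tolist())  # ['id', 'name']

# Hypothetical frames with partly overlapping row labels.
math_df = pd.DataFrame({"math_score": [85, 90, 78]}, index=[1, 2, 3])
science_df = pd.DataFrame({"science_score": [88, 76]}, index=[2, 4])

# axis=1 places columns side by side, so `join` is applied to the row index.
cols_outer = pd.concat([math_df, science_df], axis=1, join="outer")
cols_inner = pd.concat([math_df, science_df], axis=1, join="inner")
print(sorted(cols_outer.index))   # [1, 2, 3, 4]
print(cols_inner.index.tolist())  # [2]
```

With `outer`, nothing is lost but missing cells become `NaN`; with `inner`, only the labels shared by every input survive.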


### *DataFrames for Concatenation part*

In [2]:
# Example DataFrames for row-wise concatenation (same columns, different rows)
concat_rows_df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'city': ['New York', 'Los Angeles', 'Chicago']
})

concat_rows_df2 = pd.DataFrame({
    'id': [4, 5, 6],
    'name': ['David', 'Eva', 'Frank'],
    'city': ['Houston', 'Phoenix', 'San Diego']
})

# Example DataFrames for column-wise concatenation (same rows, different columns)
concat_cols_df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'math_score': [85, 90, 78]
})

concat_cols_df2 = pd.DataFrame({
    'id': [1, 2, 3],
    'science_score': [88, 76, 92]
})

In [3]:
# objs: the dataframes you want to combine
# by default axis=0
pd.concat(objs=[concat_rows_df1, concat_rows_df2]) # Vertical concatenation, stacking rows (column aligned)

Unnamed: 0,id,name,city
0,1,Alice,New York
1,2,Bob,Los Angeles
2,3,Charlie,Chicago
0,4,David,Houston
1,5,Eva,Phoenix
2,6,Frank,San Diego


In [4]:
# axis=1 Horizontal concatenation (index aligned)
pd.concat(objs=[concat_cols_df1, concat_cols_df2], axis=1) # places the DataFrames side by side, matching rows by index label (0, 1, 2); the 'id' column is not used as a key, so it appears twice

Unnamed: 0,id,math_score,id.1,science_score
0,1,85,1,88
1,2,90,2,76
2,3,78,3,92
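
Because `pd.concat()` aligns on the index and not on `id`, the result above carries the `id` column twice (the second copy shows up as `id.1` in the output). One possible way to avoid that, sketched here under the assumption that `id` uniquely identifies each row, is to move `id` into the index before concatenating. The variable names are illustrative.

```python
import pandas as pd

math_scores = pd.DataFrame({"id": [1, 2, 3], "math_score": [85, 90, 78]})
science_scores = pd.DataFrame({"id": [1, 2, 3], "science_score": [88, 76, 92]})

# Use 'id' as the row label so concat aligns on it, then turn it back into a column.
scores = pd.concat(
    [math_scores.set_index("id"), science_scores.set_index("id")],
    axis=1,
).reset_index()

print(scores.columns.tolist())  # ['id', 'math_score', 'science_score']
```

When the two tables do not contain exactly the same ids, `pd.merge()` on `id` is usually the clearer choice.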


In [5]:
# ignore_index: If True, labels along the concatenation axis are reset. With axis=1 that means the column names become 0, 1, 2, 3.
pd.concat(objs=[concat_cols_df1, concat_cols_df2], axis=1, ignore_index=True)

Unnamed: 0,0,1,2,3
0,1,85,1,88
1,2,90,2,76
2,3,78,3,92


In [6]:
# with 'outer' (union); a notebook cell displays only its last expression, so the output below is the axis=1 result
pd.concat(objs=[concat_rows_df1, concat_rows_df2], ignore_index=True, join='outer')
pd.concat(objs=[concat_cols_df1, concat_cols_df2], axis=1, join='outer')

Unnamed: 0,id,math_score,id.1,science_score
0,1,85,1,88
1,2,90,2,76
2,3,78,3,92


In [7]:
# with 'inner' (intersection); the output matches 'outer' here because both DataFrames share exactly the same index labels
pd.concat(objs=[concat_rows_df1, concat_rows_df2], join='inner')
pd.concat(objs=[concat_cols_df1, concat_cols_df2], axis=1, join='inner')

Unnamed: 0,id,math_score,id.1,science_score
0,1,85,1,88
1,2,90,2,76
2,3,78,3,92


---

### **2. `Merging & Joining`** 🔗🔀

#### Methods and Attributes:

| Method     | Purpose                                                                 | Syntax                                          |
|------------|-------------------------------------------------------------------------|-------------------------------------------------|
| `pd.merge()` | Combines DataFrames based on a shared key or column, similar to a database join. | `pd.merge(df1, df2, on='key_column', how='inner')` |
| `df.join()`  | Combines DataFrames based on their indexes, which is often a more convenient syntax for simple merges. | `df1.join(df2)` |


***

#### **2.1 `pd.merge()`:** Combining DataFrames

`pd.merge()` is the primary method for combining DataFrames by linking rows based on a shared column, much like performing a SQL `JOIN`. It is the correct method to use when the two DataFrames have different information about the same entities and you want to combine them logically.

<!-- *** -->

#### Key Parameters

* **`left` & `right`**: The two DataFrames you want to merge.

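* **`on`**: The name of the column (or list of columns) to join on. This column must exist in both DataFrames.
    * **Note:** If `on` is omitted, `pd.merge()` uses **all** columns that the two DataFrames have in common as the key.
    * **Note:** If the column names are different in each DataFrame, you can use `left_on` and `right_on` instead.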

* **`how`**: Specifies the type of merge to perform. It determines which rows to keep based on the join key.
    * **`'inner'` (default):** Keeps only the rows where the key exists in **both** DataFrames. This is the most common type of merge.
    * **`'left'`:** Keeps **all** rows from the left DataFrame and adds matching rows from the right. If no match is found, it fills the missing values with `NaN`.
    * **`'right'`:** Keeps **all** rows from the right DataFrame and adds matching rows from the left.
    * **`'outer'`:** Keeps **all** rows from both DataFrames, combining them where the key matches and filling in `NaN` where there is no match.
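
As a quick illustration of the parameters above, the following sketch merges two small hypothetical tables whose key columns have different names (`customer_id` on the left, `cust` on the right, both made up for this example) and counts the rows each `how` option keeps.

```python
import pandas as pd

# Hypothetical data: customer 3 has no order, and order 103 belongs to an unknown customer 9.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
orders = pd.DataFrame({"order_id": [101, 102, 103], "cust": [1, 2, 9], "amount": [1200, 300, 80]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    # left_on / right_on name the key column on each side when the names differ.
    merged = pd.merge(customers, orders, left_on="customer_id", right_on="cust", how=how)
    row_counts[how] = len(merged)

print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```

Note that with `left_on`/`right_on` both key columns (`customer_id` and `cust`) are kept in the result.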

<!-- *** -->

#### **Important to Note:**

`pd.merge()` is a powerful tool designed for combining data based on **common values** in one or more columns. By default it matches on columns, not on the index.

**Key Difference from Concatenation:**

`pd.merge()` is smart. It doesn't care about the row position or index of your DataFrames; it only cares about the **values in the `on` column**. This is the fundamental difference that makes it perfect for linking a customers table and an orders table (`df1` and `df2` in the examples that follow) using `customer_id`.
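
To make the difference concrete, here is a minimal sketch with hypothetical data in which the orders are stored in a different row order and with unrelated index labels. Pairing the rows by position gives the wrong amounts, while `pd.merge()` pairs them by key value.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
# Same customers, but shuffled and with arbitrary index labels.
orders = pd.DataFrame({"customer_id": [3, 1, 2], "amount": [400, 1200, 300]}, index=[10, 20, 30])

# Position-based pairing: Alice is wrongly paired with customer 3's amount.
by_position = pd.concat([customers, orders["amount"].reset_index(drop=True)], axis=1)

# Key-based pairing: each name gets the amount that belongs to its customer_id.
by_key = pd.merge(customers, orders, on="customer_id")

print(by_position.loc[0, "amount"])                  # 400
print(by_key.set_index("name")["amount"].to_dict())  # {'Alice': 1200, 'Bob': 300, 'Charlie': 400}
```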

### *DataFrames for Merging part*

In [8]:

#Create DataFrame 1
df1 = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 30, 35, 40, 28],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
})
df1

Unnamed: 0,customer_id,name,age,city
0,1,Alice,25,New York
1,2,Bob,30,Los Angeles
2,3,Charlie,35,Chicago
3,4,David,40,Houston
4,5,Eva,28,Phoenix


In [9]:
#Create DataFrame 2
df2 = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [1, 2, 2, 3, 6, 7],
    'city': ['New York', 'Los Angeles', 'San Francisco', 'Chicago', 'Miami', 'Boston'],
    'product': ['Laptop', 'Tablet', 'Headphones', 'Monitor', 'Keyboard', 'Mouse'],
    'amount': [1200, 300, 150, 400, 80, 40]
})
df2

Unnamed: 0,order_id,customer_id,city,product,amount
0,101,1,New York,Laptop,1200
1,102,2,Los Angeles,Tablet,300
2,103,2,San Francisco,Headphones,150
3,104,3,Chicago,Monitor,400
4,105,6,Miami,Keyboard,80
5,106,7,Boston,Mouse,40


In [10]:
# Let's start practicing
pd.merge(left=df1, right=df2) # With no `on`, the merge key is every column shared by both DataFrames (here: customer_id and city)
# Because both customer_id AND city must match, customer 2's second order (order_id 103, San Francisco) is dropped: Bob's city in df1 is Los Angeles. Only three rows survive this default inner merge.

Unnamed: 0,customer_id,name,age,city,order_id,product,amount
0,1,Alice,25,New York,101,Laptop,1200
1,2,Bob,30,Los Angeles,102,Tablet,300
2,3,Charlie,35,Chicago,104,Monitor,400


In [11]:
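# merge on basis of customer_id
pd.merge(left=df1, right=df2, on='customer_id')
# customer_id 2 appears once in df1 but twice in df2, so Bob's row is repeated once per matching order. This is the expected behavior of a database-style merge.
# 'city' exists in both DataFrames but is not part of the key, so pandas keeps both copies as city_x (from df1) and city_y (from df2).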

Unnamed: 0,customer_id,name,age,city_x,order_id,city_y,product,amount
0,1,Alice,25,New York,101,New York,Laptop,1200
1,2,Bob,30,Los Angeles,102,Los Angeles,Tablet,300
2,2,Bob,30,Los Angeles,103,San Francisco,Headphones,150
3,3,Charlie,35,Chicago,104,Chicago,Monitor,400
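
The `_x` and `_y` endings in the output above are the default suffixes pandas adds to overlapping non-key columns. The `suffixes` parameter of `pd.merge()` lets you choose more descriptive ones. The sketch below uses a trimmed-down, hypothetical version of the two tables; the suffix names are only an illustration.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "city": ["New York", "Los Angeles"]})
orders = pd.DataFrame({
    "order_id": [102, 103],
    "customer_id": [2, 2],
    "city": ["Los Angeles", "San Francisco"],
})

# suffixes=(left_suffix, right_suffix) replaces the default ('_x', '_y').
merged = pd.merge(customers, orders, on="customer_id", suffixes=("_customer", "_order"))

print(merged.columns.tolist())  # ['customer_id', 'city_customer', 'order_id', 'city_order']
```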


In [12]:
# merge on basis of city
pd.merge(left=df1, right=df2, on='city')

Unnamed: 0,customer_id_x,name,age,city,order_id,customer_id_y,product,amount
0,1,Alice,25,New York,101,1,Laptop,1200
1,2,Bob,30,Los Angeles,102,2,Tablet,300
2,3,Charlie,35,Chicago,104,3,Monitor,400


In [13]:
# merge on basis of all shared columns
pd.merge(left=df1, right=df2, on=['customer_id', 'city'])
# 👉 In short: all columns listed in on=[...] act together as a combined key. A row will only join if the values in every column match in both DataFrames.

Unnamed: 0,customer_id,name,age,city,order_id,product,amount
0,1,Alice,25,New York,101,Laptop,1200
1,2,Bob,30,Los Angeles,102,Tablet,300
2,3,Charlie,35,Chicago,104,Monitor,400


In [14]:
# merge how='left'
pd.merge(left=df1, right=df2, how='left')

Unnamed: 0,customer_id,name,age,city,order_id,product,amount
0,1,Alice,25,New York,101.0,Laptop,1200.0
1,2,Bob,30,Los Angeles,102.0,Tablet,300.0
2,3,Charlie,35,Chicago,104.0,Monitor,400.0
3,4,David,40,Houston,,,
4,5,Eva,28,Phoenix,,,


In [15]:
# merge how='right'
pd.merge(left=df1, right=df2, how='right')

Unnamed: 0,customer_id,name,age,city,order_id,product,amount
0,1,Alice,25.0,New York,101,Laptop,1200
1,2,Bob,30.0,Los Angeles,102,Tablet,300
2,2,,,San Francisco,103,Headphones,150
3,3,Charlie,35.0,Chicago,104,Monitor,400
4,6,,,Miami,105,Keyboard,80
5,7,,,Boston,106,Mouse,40


In [16]:
# merge how='outer'
pd.merge(left=df1, right=df2, how='outer')

Unnamed: 0,customer_id,name,age,city,order_id,product,amount
0,1,Alice,25.0,New York,101.0,Laptop,1200.0
1,2,Bob,30.0,Los Angeles,102.0,Tablet,300.0
2,2,,,San Francisco,103.0,Headphones,150.0
3,3,Charlie,35.0,Chicago,104.0,Monitor,400.0
4,4,David,40.0,Houston,,,
5,5,Eva,28.0,Phoenix,,,
6,6,,,Miami,105.0,Keyboard,80.0
7,7,,,Boston,106.0,Mouse,40.0
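
In an outer merge it can be hard to tell which side each row came from. This is not covered in the examples above, but `pd.merge()` accepts `indicator=True`, which adds a `_merge` column for exactly that purpose. A minimal sketch with hypothetical data:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 4], "name": ["Alice", "Bob", "David"]})
orders = pd.DataFrame({"customer_id": [1, 2, 6], "amount": [1200, 300, 80]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(customers, orders, on="customer_id", how="outer", indicator=True)

origin = merged.set_index("customer_id")["_merge"].astype(str).to_dict()
print(origin)  # {1: 'both', 2: 'both', 4: 'left_only', 6: 'right_only'}
```

Filtering on `_merge` is a handy way to find customers without orders or orders without a known customer.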


***

#### **2.2 `.join()`:** Combining DataFrames on the Index

`.join()` is a convenient DataFrame method used to combine two DataFrames based on their **indexes**. It's a slightly simpler syntax for a specific type of merge where the alignment happens on the row labels instead of a specific column.

<!-- *** -->

#### Key Parameters

* **`other`**: The other DataFrame you want to join with.

* **`on`**: The column name in the left DataFrame to use as a key. This column will be joined to the index of the `other` DataFrame.
    * **Note:** If not specified, `.join()` defaults to joining on the indexes of both DataFrames.

* **`how`**: Specifies the type of join to perform. It works just like in `pd.merge()` and determines which rows to keep.
    * **`'left'` (default):** Keeps all rows from the left DataFrame.
    * **`'right'`:** Keeps all rows from the right DataFrame.
    * **`'inner'`:** Keeps only the rows where the index exists in **both** DataFrames.
    * **`'outer'`:** Keeps **all** rows from both DataFrames.

<!-- *** -->

#### **Important to Note:**

The biggest difference between `.join()` and `pd.merge()` is that `.join()` is designed to align DataFrames on their **index** by default. While you can specify a column to join `on`, that column will be matched against the other DataFrame's **index**, not its columns.

`.join()` always matches against the **index** of the `other` DataFrame, so the key must be set as the index there (for example with `set_index()`). On the left side, the key can be either the index (the default) or a column named through `on`. If neither DataFrame is indexed by the key, the join aligns on unrelated row labels and will not give the result you expect.

**👉 Rule of Thumb:**
* Use **`pd.merge()`** when you want to join on a **column** (e.g., `on='customer_id'`).
* Use **`.join()`** when you want to join on the **index**.
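
The `on` behaviour described above can be sketched as follows. The tables are hypothetical: `orders` keeps `customer_id` as an ordinary column, while `customers` uses it as its index.

```python
import pandas as pd

# Right side: the key is the index.
customers = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"]},
    index=pd.Index([1, 2, 3], name="customer_id"),
)
# Left side: the key is a regular column.
orders = pd.DataFrame({"order_id": [101, 102, 103], "customer_id": [1, 2, 2]})

# The left column 'customer_id' is matched against the index of `customers`.
result = orders.join(customers, on="customer_id")

print(result["name"].tolist())  # ['Alice', 'Bob', 'Bob']
```

If `customers` stored `customer_id` as a column instead, calling `customers.set_index("customer_id")` first would give the same setup.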

### *DataFrames for Joining part*

In [17]:
df1 = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "city": ["New York", "Los Angeles", "Chicago", "Houston"]
}, index=[1, 2, 3, 4])

df2 = pd.DataFrame({
    "order_count": [5, 2, 7, 3],
    "membership": ["Gold", "Silver", "Gold", "Bronze"]
}, index=[1, 2, 2, 4])

In [18]:
df1.join(df2)


Unnamed: 0,name,city,order_count,membership
1,Alice,New York,5.0,Gold
2,Bob,Los Angeles,2.0,Silver
2,Bob,Los Angeles,7.0,Gold
3,Charlie,Chicago,,
4,David,Houston,3.0,Bronze


In [19]:
df1.join(df2, how='right')

Unnamed: 0,name,city,order_count,membership
1,Alice,New York,5,Gold
2,Bob,Los Angeles,2,Silver
2,Bob,Los Angeles,7,Gold
4,David,Houston,3,Bronze


In [20]:
df1.join(df2, how='inner')

Unnamed: 0,name,city,order_count,membership
1,Alice,New York,5,Gold
2,Bob,Los Angeles,2,Silver
2,Bob,Los Angeles,7,Gold
4,David,Houston,3,Bronze


In [21]:
df1.join(df2, how='outer')

Unnamed: 0,name,city,order_count,membership
1,Alice,New York,5.0,Gold
2,Bob,Los Angeles,2.0,Silver
2,Bob,Los Angeles,7.0,Gold
3,Charlie,Chicago,,
4,David,Houston,3.0,Bronze


***


## **Summary & Key Takeaways** 📝

- **Concatenation (`pd.concat`)** is for stacking DataFrames (either vertically or horizontally) based on their index or columns. Think of it as putting blocks on top of each other or side by side.

- **Merging (`pd.merge`)** and **Joining (`.join`)** are for combining DataFrames by matching key values: `pd.merge` matches on one or more shared columns, while `.join` matches on the index.

---

### **The Common Confusion: `.join()` vs. `.merge()`** 😕

This is one of the most frequent points of confusion for new pandas users. The core difference is the default method of operation:

* **`.join()`**: Primarily designed to combine DataFrames on their **indexes**. While you can join on a column, it must be matched against the other DataFrame's index.
* **`pd.merge()`**: The more versatile and explicit method for combining DataFrames on a **common column** (or columns). This is often the preferred choice when you have a specific column you want to use as a key, as it doesn't require you to modify the DataFrame's index first.

---

### **Rule of Thumb** ✅

To keep it simple:

* Use **`.join()`** when the DataFrames you want to combine already share a common **index**.
* Use **`pd.merge()`** when you want to combine DataFrames based on a common **column**.
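
To see how closely the two methods are related, the sketch below reuses the DataFrames from the joining section and reproduces `df1.join(df2)` with `pd.merge()` by asking it to match on both indexes (`left_index=True`, `right_index=True`). It is an illustration of the relationship, not a recommendation to prefer one spelling.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "city": ["New York", "Los Angeles", "Chicago", "Houston"],
}, index=[1, 2, 3, 4])

df2 = pd.DataFrame({
    "order_count": [5, 2, 7, 3],
    "membership": ["Gold", "Silver", "Gold", "Bronze"],
}, index=[1, 2, 2, 4])

# .join() defaults to a left join on the indexes of both DataFrames.
via_join = df1.join(df2)

# The same operation spelled out with pd.merge().
via_merge = pd.merge(df1, df2, left_index=True, right_index=True, how="left")

print(via_join.equals(via_merge))
```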

