# Day 9: Merging and Joining DataFrames

On Day 9, we'll tackle a critical data manipulation task: combining multiple datasets. Real-world data is often split across different files or tables. Pandas provides functions such as `pd.merge` to combine these datasets, much as joins do in SQL.

We will cover:
- Inner Joins
- Outer, Left, and Right Joins
- Merging on common columns and on key columns with different names

Let's begin by importing Pandas.

In [None]:
import pandas as pd

---

## Part 1: Preparing the DataFrames

To learn about joins, we first need some data. Let's create two DataFrames: one for employees and one for their departments.

In [None]:
employees_data = {
    "EmployeeID": [101, 102, 103, 104, 105],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "DepartmentID": [1, 2, 1, 3, 4],
}
employees_df = pd.DataFrame(employees_data)

departments_data = {
    "DepartmentID": [1, 2, 3, 5],
    "DepartmentName": ["HR", "Engineering", "Marketing", "Sales"],
}
departments_df = pd.DataFrame(departments_data)

print("Employees DataFrame:")
print(employees_df)
print("-" * 25)
print("Departments DataFrame:")
print(departments_df)

*Notice that employee Eve has a `DepartmentID` of 4, which doesn't exist in the departments table. Also, the `Sales` department (ID 5) has no employees.*

---

## Part 2: The Inner Join

An inner join returns only the rows where the key exists in **both** DataFrames.

**Exercise 2.1:** Perform an inner join on `employees_df` and `departments_df` to create a new DataFrame that includes employee names and their corresponding department names.

In [None]:
# Your code here

**Solution 2.1:**

In [None]:
# Solution
inner_join_df = pd.merge(employees_df, departments_df, on="DepartmentID", how="inner")
print(inner_join_df)

*Note that Eve (DepartmentID 4) and the Sales department (DepartmentID 5) are missing, as expected.*
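To see why those rows drop out, it can help to compute the shared keys directly. The cell below is a minimal sketch that rebuilds the Part 1 data so it runs on its own; `shared_keys` is a name introduced here for illustration only.

```python
import pandas as pd

# Same sample data as in Part 1, repeated so this cell runs on its own.
employees_df = pd.DataFrame({
    "EmployeeID": [101, 102, 103, 104, 105],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "DepartmentID": [1, 2, 1, 3, 4],
})
departments_df = pd.DataFrame({
    "DepartmentID": [1, 2, 3, 5],
    "DepartmentName": ["HR", "Engineering", "Marketing", "Sales"],
})

# Keys present in both frames: the only ones an inner join can keep.
shared_keys = set(employees_df["DepartmentID"]) & set(departments_df["DepartmentID"])
print(sorted(shared_keys))

inner_join_df = pd.merge(employees_df, departments_df, on="DepartmentID", how="inner")

# Every row that survives the inner join carries one of the shared keys.
print(inner_join_df["DepartmentID"].isin(shared_keys).all())
print(len(inner_join_df))
```

Department 1 appears twice in `employees_df`, so the join produces one row per matching employee, not one row per shared key.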

---

## Part 3: Outer Joins (Left, Right, and Full)

Outer joins are useful when you want to keep all records from one or both DataFrames, even if there isn't a match.

**Exercise 3.1: Left Join**

A left join keeps all rows from the **left** DataFrame (`employees_df` in this case). Perform a left join and observe the result for Eve.

In [None]:
# Your code here

**Solution 3.1:**

In [None]:
# Solution
left_join_df = pd.merge(employees_df, departments_df, on="DepartmentID", how="left")
print(left_join_df)

*Notice that Eve is included, but her `DepartmentName` is `NaN` (Not a Number, the marker pandas uses for missing values), as there was no match.*
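A left join is also a handy way to find rows without a match. The following is a sketch under the Part 1 data; the names `unmatched` and `labeled` and the placeholder label `"Unassigned"` are illustrative choices, not part of the original exercise.

```python
import pandas as pd

# Same sample data as in Part 1, repeated so this cell runs on its own.
employees_df = pd.DataFrame({
    "EmployeeID": [101, 102, 103, 104, 105],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "DepartmentID": [1, 2, 1, 3, 4],
})
departments_df = pd.DataFrame({
    "DepartmentID": [1, 2, 3, 5],
    "DepartmentName": ["HR", "Engineering", "Marketing", "Sales"],
})

left_join_df = pd.merge(employees_df, departments_df, on="DepartmentID", how="left")

# Rows whose DepartmentName is missing are employees with no matching department.
unmatched = left_join_df[left_join_df["DepartmentName"].isna()]
print(unmatched)

# Optionally replace the missing name with a placeholder (hypothetical label).
labeled = left_join_df.fillna({"DepartmentName": "Unassigned"})
print(labeled)
```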

**Exercise 3.2: Right Join**

A right join keeps all rows from the **right** DataFrame (`departments_df`). Perform a right join and observe the result for the Sales department.

In [None]:
# Your code here

**Solution 3.2:**

In [None]:
# Solution
right_join_df = pd.merge(employees_df, departments_df, on="DepartmentID", how="right")
print(right_join_df)

*This time, the Sales department is included, but its `EmployeeID` and `Name` are `NaN` because no employee belongs to it.*
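One side effect worth knowing: an integer column that receives a `NaN` is upcast to `float64`, so `EmployeeID` values print as `101.0` and so on. The sketch below shows the dtype change and one possible fix, converting to pandas' nullable `Int64` dtype; `employee_ids` is a name introduced here for illustration.

```python
import pandas as pd

# Same sample data as in Part 1, repeated so this cell runs on its own.
employees_df = pd.DataFrame({
    "EmployeeID": [101, 102, 103, 104, 105],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "DepartmentID": [1, 2, 1, 3, 4],
})
departments_df = pd.DataFrame({
    "DepartmentID": [1, 2, 3, 5],
    "DepartmentName": ["HR", "Engineering", "Marketing", "Sales"],
})

right_join_df = pd.merge(employees_df, departments_df, on="DepartmentID", how="right")

# The missing EmployeeID for Sales forces the whole column to float64.
print(right_join_df["EmployeeID"].dtype)

# The nullable Int64 dtype keeps whole numbers and shows the gap as <NA>.
employee_ids = right_join_df["EmployeeID"].astype("Int64")
print(employee_ids)
```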

**Exercise 3.3: Outer Join**

A full outer join keeps all rows from **both** DataFrames. Perform an outer join and see how both Eve and the Sales department are handled.

In [None]:
# Your code here

**Solution 3.3:**

In [None]:
# Solution
outer_join_df = pd.merge(employees_df, departments_df, on="DepartmentID", how="outer")
print(outer_join_df)

*The result includes every employee and every department; wherever one side has no match, its columns are filled with `NaN`.*
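If you want to know which side each row came from, `pd.merge` accepts an `indicator=True` argument that adds a `_merge` column with the values `left_only`, `right_only`, or `both`. A minimal sketch, using the Part 1 data and a hypothetical variable name `tagged_df`:

```python
import pandas as pd

# Same sample data as in Part 1, repeated so this cell runs on its own.
employees_df = pd.DataFrame({
    "EmployeeID": [101, 102, 103, 104, 105],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "DepartmentID": [1, 2, 1, 3, 4],
})
departments_df = pd.DataFrame({
    "DepartmentID": [1, 2, 3, 5],
    "DepartmentName": ["HR", "Engineering", "Marketing", "Sales"],
})

# indicator=True adds a "_merge" column describing the origin of each row.
tagged_df = pd.merge(
    employees_df, departments_df, on="DepartmentID", how="outer", indicator=True
)
print(tagged_df[["Name", "DepartmentName", "_merge"]])

# Count how many rows matched on both sides versus only one.
print(tagged_df["_merge"].value_counts())
```

Filtering on `_merge` is a quick way to audit a join before you rely on its output.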

---

## Part 4: Merging on Different Column Names

Sometimes the key you want to join on has a different name in each DataFrame. `pd.merge` lets you name the key column for each side separately.

In [None]:
# Let's create a new DataFrame for salaries
salaries_data = {"EmpID": [101, 102, 103, 104], "Salary": [70000, 80000, 75000, 90000]}
salaries_df = pd.DataFrame(salaries_data)

print("Employees DataFrame:")
print(employees_df)
print("-" * 20)
print("Salaries DataFrame:")
print(salaries_df)

**Exercise 4.1:** Merge `employees_df` and `salaries_df` using `EmployeeID` from the left frame and `EmpID` from the right frame.

In [None]:
# Your code here

**Solution 4.1:**

In [None]:
# Solution
# Use the left_on and right_on parameters
employee_salaries_df = pd.merge(
    employees_df, salaries_df, left_on="EmployeeID", right_on="EmpID", how="left"
)
print(employee_salaries_df)

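*The result contains both key columns, `EmployeeID` and `EmpID`. Eve has no salary record, so her `EmpID` and `Salary` are `NaN`. You can drop the redundant key column (`EmpID`) if you wish.*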
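Here is one way to do that cleanup, sketched with the same data. It also shows an alternative that renames the key before merging so that only one key column is ever created; `tidy_df` and `renamed_merge_df` are names introduced here for illustration.

```python
import pandas as pd

# Same sample data as above, repeated so this cell runs on its own.
employees_df = pd.DataFrame({
    "EmployeeID": [101, 102, 103, 104, 105],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "DepartmentID": [1, 2, 1, 3, 4],
})
salaries_df = pd.DataFrame({
    "EmpID": [101, 102, 103, 104],
    "Salary": [70000, 80000, 75000, 90000],
})

employee_salaries_df = pd.merge(
    employees_df, salaries_df, left_on="EmployeeID", right_on="EmpID", how="left"
)

# Option 1: drop the redundant right-hand key after merging.
tidy_df = employee_salaries_df.drop(columns="EmpID")
print(tidy_df)

# Option 2: rename the key first, then merge on the shared name.
renamed_merge_df = pd.merge(
    employees_df,
    salaries_df.rename(columns={"EmpID": "EmployeeID"}),
    on="EmployeeID",
    how="left",
)
print(renamed_merge_df)
```

Both options should yield the same columns; which one reads better depends on whether you need the original `salaries_df` column name elsewhere.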

---

### Great job!

Combining data sources is a fundamental step in data analysis. The join type you choose determines which rows survive, so understanding the differences is crucial for building the correct dataset for your analysis. Next, we will dive into the world of SciPy!