# Day 15 - Merging and Joining DataFrames

## Why Merging and Joining DataFrames is Important
In data science, it's common to encounter data spread across multiple sources or files. For example, you might have customer information in one dataset and transaction records in another. Combining these datasets enables you to gain a complete view of your data, allowing for more in-depth analysis.

Pandas provides tools for merging and joining DataFrames. These operations are useful for creating a unified dataset, allowing you to reveal insights that may not be apparent when the data is siloed. Whether you're performing business analysis or building machine learning models, merging datasets is an essential skill.

## Types of SQL-Style Joins in Pandas
Before we dive into examples, let's quickly recap the types of joins you can perform in SQL and how they compare to Pandas.

Joins combine two DataFrames based on a common column (or index). Depending on the type of join, the result includes all rows from both DataFrames or only those with matching values in the join key(s). In Pandas, these operations are performed with `pd.merge()` (or the equivalent `DataFrame.merge()` method), and the `how` parameter selects the join type.

Here are the four main types of joins:
- **Inner Join**: Returns only the rows where there is a match in both DataFrames.
- **Outer Join**: Returns all rows from both DataFrames, filling in NaN where there are no matches.
- **Left Join**: Returns all rows from the left DataFrame and the matching rows from the right. If no match is found, the result will contain NaN values for the right DataFrame columns.
- **Right Join**: Returns all rows from the right DataFrame and the matching rows from the left. If no match is found, the result will contain NaN values for the left DataFrame columns.

### 1. Inner Join:
An inner join returns only the rows where there are matching values in both DataFrames.

In [None]:
import pandas as pd

# Example DataFrames
df1 = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'CustomerID': [1, 2, 5],
    'Purchase': ['Laptop', 'Tablet', 'Phone']
})

# Inner Join
inner_join = pd.merge(df1, df2, on='CustomerID', how='inner')
print("Inner Join Result:")
print(inner_join)

### 2. Outer Join:
An outer join returns all rows from both DataFrames. Where a row has no match in the other DataFrame, the columns coming from that other DataFrame are filled with NaN.

In [None]:
# Outer Join
outer_join = pd.merge(df1, df2, on='CustomerID', how='outer')
print("\nOuter Join Result:")
print(outer_join)
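
To see which side each row of an outer join came from, `merge` accepts `indicator=True`, which adds a `_merge` column. The following is a minimal sketch that re-creates the same `df1` and `df2` so it runs on its own; the names `outer_flagged` and `unmatched` are chosen here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'CustomerID': [1, 2, 5],
    'Purchase': ['Laptop', 'Tablet', 'Phone']
})

# indicator=True adds a '_merge' column whose values are
# 'both', 'left_only', or 'right_only'
outer_flagged = pd.merge(df1, df2, on='CustomerID', how='outer', indicator=True)
print(outer_flagged)

# Rows that exist in only one of the two DataFrames
unmatched = outer_flagged[outer_flagged['_merge'] != 'both']
print(unmatched[['CustomerID', '_merge']])
```

Filtering on `_merge` is a quick way to audit unmatched keys before deciding which join type to keep.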

### 3. Left Join:
A left join returns all rows from the left DataFrame and the matching rows from the right DataFrame. Left rows with no match get NaN in the columns that come from the right DataFrame, and right rows with no match are dropped.

In [None]:
# Left Join
left_join = pd.merge(df1, df2, on='CustomerID', how='left')
print("\nLeft Join Result:")
print(left_join)

### 4. Right Join:
A right join returns all rows from the right DataFrame and the matching rows from the left DataFrame. Right rows with no match get NaN in the columns that come from the left DataFrame, and left rows with no match are dropped.

In [None]:
# Right Join
right_join = pd.merge(df1, df2, on='CustomerID', how='right')
print("\nRight Join Result:")
print(right_join)

### Key Takeaways:
- **Inner Join**: Best when you only need rows where there is a match between both DataFrames.
- **Outer Join**: Useful for preserving all rows from both DataFrames, even if some data is missing.
- **Left Join**: Ideal when you want to keep all rows from the left DataFrame, even if there are no matches in the right DataFrame.
- **Right Join**: Similar to the left join, but prioritizes keeping all rows from the right DataFrame.
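
The differences summarized above can be checked by counting the rows each join type produces for the same inputs. This is a small sketch using the `df1` and `df2` from the earlier examples; the `row_counts` dictionary is introduced here only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'CustomerID': [1, 2, 5],
    'Purchase': ['Laptop', 'Tablet', 'Phone']
})

# Count the rows each join type produces for the same pair of DataFrames
row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    result = pd.merge(df1, df2, on='CustomerID', how=how)
    row_counts[how] = len(result)
    print(f"{how:>5} join: {len(result)} rows")
```

With these inputs the inner join keeps only the two shared IDs (1 and 2), the outer join keeps all five distinct IDs, and the left and right joins keep four and three rows respectively, matching the sizes of `df1` and `df2`.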

## Example: Combining Multiple DataFrames
Pandas offers two main methods for combining DataFrames: `.merge()` and `.join()`. Both are useful depending on the structure of your data and how you want to align the datasets.

### Merging DataFrames with `.merge()`
The `.merge()` function allows you to combine two DataFrames based on one or more key columns, similar to SQL joins. If `how` is not specified, it performs an inner join.

In [None]:
# Example DataFrames
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

transactions = pd.DataFrame({
    'TransactionID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 2, 4],
    'Amount': [250, 150, 200, 300]
})

# Merging on 'CustomerID'
merged_df = pd.merge(customers, transactions, on='CustomerID')
print("Merged DataFrame:")
print(merged_df)
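
Because one customer can have many transactions, it can help to state that expectation explicitly and to keep customers who have not purchased anything. The sketch below assumes the same `customers` and `transactions` data as above; `all_customers` and `no_purchases` are illustrative names.

```python
import pandas as pd

customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
transactions = pd.DataFrame({
    'TransactionID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 2, 4],
    'Amount': [250, 150, 200, 300]
})

# how='left' keeps every customer, including those with no transactions.
# validate='one_to_many' raises pandas.errors.MergeError if 'CustomerID'
# is duplicated in the left (customers) DataFrame.
all_customers = pd.merge(
    customers, transactions,
    on='CustomerID', how='left', validate='one_to_many'
)
print(all_customers)

# Customers with no matching transaction have NaN in the transaction columns
no_purchases = all_customers[all_customers['TransactionID'].isna()]
print(no_purchases['Name'].tolist())
```

The `validate` argument is optional; it is used here as a guard against accidental row duplication when the key is not unique on the side where it should be.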

### Joining DataFrames with `.join()`
The `.join()` method combines DataFrames based on their index and performs a left join by default. It is useful when the DataFrames share the same index or when you want to perform index-based joins.

In [None]:
# Setting 'CustomerID' as the index
customers.set_index('CustomerID', inplace=True)
transactions.set_index('CustomerID', inplace=True)

# Joining the DataFrames
joined_df = customers.join(transactions)
print("\nJoined DataFrame:")
print(joined_df)
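
An index-based `.join()` can also be written as a `pd.merge()` call with `left_index=True` and `right_index=True`. The sketch below builds fresh copies of the two DataFrames (so it does not depend on the in-place `set_index` calls above) and also shows the `how` parameter of `.join()`; the variable names are illustrative.

```python
import pandas as pd

customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
}).set_index('CustomerID')

transactions = pd.DataFrame({
    'TransactionID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 2, 4],
    'Amount': [250, 150, 200, 300]
}).set_index('CustomerID')

# .join() aligns on the index and uses how='left' unless told otherwise
joined = customers.join(transactions)

# The same index-based left join expressed with merge
merged = pd.merge(
    customers, transactions,
    left_index=True, right_index=True, how='left'
)
print(joined.equals(merged))

# An inner join on the index drops customers without transactions
inner_joined = customers.join(transactions, how='inner')
print(inner_joined)
```

Which form to use is mostly a matter of convenience: `.join()` is shorter when the keys are already in the index, while `.merge()` is more flexible when the keys are ordinary columns.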

## Use Case: Merging Customer Data with Transaction Records
Let's apply these concepts to a real-world scenario. Imagine you're working with customer data and transaction records stored in two CSV files. We want to merge these files to create a unified dataset, allowing for a more comprehensive analysis.

### Step 1: Loading the Datasets
We have two datasets: one containing customer information and another with transaction data. We'll load these datasets into Pandas DataFrames.

In [None]:
# Loading the customer data
customers_df = pd.read_csv('customers.csv')

# Loading the transaction data
transactions_df = pd.read_csv('transactions.csv')

# Display the first few rows of each DataFrame
print("Customers DataFrame:")
print(customers_df.head())

print("\nTransactions DataFrame:")
print(transactions_df.head())
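
Before merging, it is worth checking that the key column is comparable on both sides. The sketch below is only an illustration: the real contents of `customers.csv` and `transactions.csv` are not shown here, so it uses hypothetical in-memory stand-ins with assumed `CustomerID`, `Name`, `TransactionID`, and `Amount` columns.

```python
import io
import pandas as pd

# Hypothetical file contents, used only so the example runs without the real CSVs
customers_csv = io.StringIO(
    "CustomerID,Name\n1,Alice\n2,Bob\n3,Charlie\n4,David\n"
)
transactions_csv = io.StringIO(
    "TransactionID,CustomerID,Amount\n101,1,250\n102,2,150\n103,2,200\n104,4,300\n"
)

customers_df = pd.read_csv(customers_csv)
transactions_df = pd.read_csv(transactions_csv)

# The key column should have the same dtype on both sides
same_dtype = customers_df['CustomerID'].dtype == transactions_df['CustomerID'].dtype
print("Key dtypes match:", same_dtype)

# Each customer should appear only once in the customer table
duplicate_customers = customers_df['CustomerID'].duplicated().sum()
print("Duplicate customer IDs:", duplicate_customers)

# Transactions whose CustomerID has no customer record
orphans = transactions_df[
    ~transactions_df['CustomerID'].isin(customers_df['CustomerID'])
]
print("Transactions without a customer:", len(orphans))
```

If the dtypes differ (for example, integers on one side and strings on the other), converting one column with `astype` before merging avoids errors or silently empty results.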

### Step 2: Merging the Datasets
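Next, we'll merge these DataFrames on the `CustomerID` column to combine customer information with transaction history. Because `pd.merge()` defaults to an inner join, customers with no transactions are excluded from the result.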

In [None]:
# Merging the DataFrames on 'CustomerID'
merged_data = pd.merge(customers_df, transactions_df, on='CustomerID')

# Display the merged DataFrame
print("\nMerged Customer and Transaction Data:")
print(merged_data.head())

### Step 3: Analyzing the Merged Data
Now that we've combined the data, we can perform analyses such as calculating the total spending per customer or identifying the customers with the most transactions.

In [None]:
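# Calculating total spending per customer
# (grouping by CustomerID as well as Name keeps customers who share a name separate)
total_spending = merged_data.groupby(['CustomerID', 'Name'])['Amount'].sum().reset_index()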

print("\nTotal Spending per Customer:")
print(total_spending)
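
The other analysis mentioned above, finding the customers with the most transactions, follows the same pattern. Since the CSV files are not available here, this sketch reuses the small `customers` and `transactions` DataFrames from the earlier example as stand-in data; the column names `TransactionCount`, `TotalSpent`, and `Transactions` are introduced for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
transactions = pd.DataFrame({
    'TransactionID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 2, 4],
    'Amount': [250, 150, 200, 300]
})
merged_data = pd.merge(customers, transactions, on='CustomerID')

# Number of transactions per customer, largest first
transaction_counts = (
    merged_data.groupby(['CustomerID', 'Name'])
    .size()
    .reset_index(name='TransactionCount')
    .sort_values('TransactionCount', ascending=False)
)
print(transaction_counts)

# Total spending and transaction count in a single summary table
summary = (
    merged_data.groupby(['CustomerID', 'Name'])['Amount']
    .agg(TotalSpent='sum', Transactions='count')
    .reset_index()
)
print(summary)
```

Computing both figures in one `agg` call keeps the summary in a single DataFrame, which is convenient for sorting or filtering afterwards.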