
# Day 33 – Combining and Joining Datasets in Python

In today’s post, we’ll explore techniques for merging datasets with different structures. This is a crucial skill when working with data from multiple sources, allowing us to perform comprehensive analyses by bringing related data together. Whether you’re handling business intelligence data, building machine learning models, or conducting research, merging datasets is a fundamental task.

## Why Combining and Joining Datasets Is Important

Data is often stored in multiple files or tables, each representing a different aspect of the overall picture. Combining datasets enables you to:

- Gain a complete view: Merging related data sources provides a holistic perspective, which is essential for thorough analysis.

- Enhance analysis: Merging datasets allows you to calculate key metrics, create detailed reports, and gain actionable insights.

- Simplify data preparation: By combining datasets early in your analysis workflow, you create a unified dataset that’s easier to work with.

## Types of Joins in Pandas

Pandas provides several ways to combine dataframes, each suited to specific use cases:

- `merge()`: Used to join datasets based on common columns, similar to SQL joins.

- `join()`: Joins on the index and performs a left join by default; most convenient when both datasets are indexed by the same key.

**Common Types of Joins:**

- Inner Join: Returns only the rows where there is a match in both dataframes.

- Outer Join: Returns all rows from both dataframes, filling in NaN where there’s no match.

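- Left Join: Returns all rows from the left dataframe plus the matching rows from the right; unmatched rows get NaN in the right-hand columns.

- Right Join: Returns all rows from the right dataframe plus the matching rows from the left; unmatched rows get NaN in the left-hand columns.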

Each type of join serves a specific purpose, allowing you to merge datasets in the way that best suits your analysis needs.

## Example Data

We’ll use two sample dataframes to demonstrate these joins:

- Customers: Contains basic customer information.

- Transactions: Contains transaction data, including amounts spent.

In [None]:
import pandas as pd

# Sample DataFrames
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

transactions = pd.DataFrame({
    'TransactionID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 2, 4],
    'Amount': [250, 150, 200, 300]
})

# Displaying the dataframes
print("Customers DataFrame:")
print(customers)

print("\nTransactions DataFrame:")
print(transactions)

## Merging DataFrames Using `merge()`

The `merge()` function in Pandas allows you to combine datasets on shared columns. Here, we’ll merge the `customers` and `transactions` dataframes using the `CustomerID` column.

In [None]:
# Merging on 'CustomerID'
merged_df = pd.merge(customers, transactions, on='CustomerID')

# Displaying the merged DataFrame
print("Merged DataFrame:")
print(merged_df)

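**Explanation:** Because `how` defaults to `'inner'`, this is an inner join on `CustomerID`: the result combines customer details with their corresponding transactions. Charlie (`CustomerID` 3) has no transactions and is dropped, while Bob (`CustomerID` 2) appears twice, once per transaction.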
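
Since one customer can have several transactions, it can help to state that expectation explicitly. The sketch below uses the `validate` parameter of `merge()` to check the key relationship; the variable names `checked_df` and `one_to_one_ok` are only illustrative.

```python
import pandas as pd

# Same sample data as above, repeated so the block runs on its own
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
transactions = pd.DataFrame({
    'TransactionID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 2, 4],
    'Amount': [250, 150, 200, 300]
})

# We expect each CustomerID to be unique in customers but possibly repeated
# in transactions. validate raises pandas.errors.MergeError otherwise.
checked_df = pd.merge(customers, transactions, on='CustomerID', validate='one_to_many')
print(checked_df.shape)

# A one-to-one assumption does not hold here: CustomerID 2 has two transactions
try:
    pd.merge(customers, transactions, on='CustomerID', validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError as exc:
    one_to_one_ok = False
    print("Not one-to-one:", exc)
```

A check like this is cheap and tends to catch accidental duplicate keys before they silently multiply rows in the merged result.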

## Exploring Different Types of Joins

You can specify the type of join by using the `how` parameter in the `merge()` function. Let’s explore each join type with our sample data.

**Inner Join (default):** Only keeps rows where there’s a match in both dataframes.

In [None]:
inner_join_df = pd.merge(customers, transactions, on='CustomerID', how='inner')
print("\nInner Join Result:")
print(inner_join_df)

**Left Join:** Keeps all rows from the left dataframe (customers), adding data from the right dataframe (transactions) where available.

In [None]:
left_join_df = pd.merge(customers, transactions, on='CustomerID', how='left')
print("\nLeft Join Result:")
print(left_join_df)

**Right Join:** Keeps all rows from the right dataframe (transactions), adding data from the left dataframe (customers) where available.

In [None]:
right_join_df = pd.merge(customers, transactions, on='CustomerID', how='right')
print("\nRight Join Result:")
print(right_join_df)

**Outer Join:** Returns all rows from both dataframes, filling in missing values where there’s no match.

In [None]:
outer_join_df = pd.merge(customers, transactions, on='CustomerID', how='outer')
print("\nOuter Join Result:")
print(outer_join_df)

**Explanation:** Different join types allow you to handle missing data and maintain specific rows in the result, depending on your analysis needs.
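
To see which rows actually found a match, one option is the `indicator` parameter of `merge()`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. This is a minimal sketch on the sample data; `audit_df` and `no_transactions` are illustrative names.

```python
import pandas as pd

# Same sample data as above, repeated so the block runs on its own
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
transactions = pd.DataFrame({
    'TransactionID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 2, 4],
    'Amount': [250, 150, 200, 300]
})

# Outer join keeps every row; indicator=True records where each row came from
audit_df = pd.merge(customers, transactions, on='CustomerID', how='outer', indicator=True)
print(audit_df[['CustomerID', 'Name', 'TransactionID', '_merge']])

# Rows present only in the left dataframe are customers without transactions
no_transactions = audit_df.loc[audit_df['_merge'] == 'left_only', 'Name'].tolist()
print(no_transactions)
```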

## Joining DataFrames with `.join()`

The `.join()` method is used for index-based joins, typically when both datasets are indexed by the same key. Let’s set `CustomerID` as the index of both dataframes and perform a join.

In [None]:
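# Setting 'CustomerID' as the index (set_index returns a new DataFrame,
# so the originals stay unchanged and this cell can be re-run safely)
customers_indexed = customers.set_index('CustomerID')
transactions_indexed = transactions.set_index('CustomerID')

# Joining the DataFrames on their shared index (a left join by default)
joined_df = customers_indexed.join(transactions_indexed)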

# Displaying the joined DataFrame
print("\nJoined DataFrame:")
print(joined_df)

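**Explanation:** The `.join()` method matches rows on the index rather than on a specific column. It performs a left join by default, so every customer is kept and Charlie, who has no transactions, appears with NaN in the transaction columns.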
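
As a rough comparison, the same index-based join can also be written with `merge()` by passing `left_index=True` and `right_index=True`, and `.join()` accepts a `how` argument just like `merge()`. The sketch below assumes the sample data from earlier; the names ending in `_by_id` are illustrative.

```python
import pandas as pd

# Same sample data as above, repeated so the block runs on its own
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
transactions = pd.DataFrame({
    'TransactionID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 2, 4],
    'Amount': [250, 150, 200, 300]
})

customers_by_id = customers.set_index('CustomerID')
transactions_by_id = transactions.set_index('CustomerID')

# .join() with its default how='left'
joined_left = customers_by_id.join(transactions_by_id)

# The same join expressed with merge() on both indexes
merged_left = pd.merge(
    customers_by_id, transactions_by_id,
    left_index=True, right_index=True, how='left'
)

# Switching to an inner join drops customers without transactions
joined_inner = customers_by_id.join(transactions_by_id, how='inner')

print(len(joined_left), len(merged_left), len(joined_inner))
```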

## Use Case: Combining Data from Multiple Sources

Let’s apply these techniques to a practical use case. Imagine you’re working on a project where you need to combine customer information with transaction data, and the two are stored in separate CSV files. You’ll use these files to calculate key insights, such as total customer spending.

### Step 1: Loading the Datasets

We’ll load two CSV files: one containing customer information and the other containing transaction data.

In [None]:
# Loading customer and transaction data
customers_df = pd.read_csv('customers.csv')
transactions_df = pd.read_csv('transactions.csv')

# Displaying the first few rows of each dataset
print("Customers DataFrame:")
print(customers_df.head())

print("\nTransactions DataFrame:")
print(transactions_df.head())
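
The files `customers.csv` and `transactions.csv` are not included with this post. To follow along without your own data, one option is to write the sample dataframes from earlier to disk first. This is only a sketch: the file contents and the temporary directory `data_dir` are assumptions, not the author's actual data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in data: the sample dataframes from earlier in the post
sample_customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
sample_transactions = pd.DataFrame({
    'TransactionID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 2, 4],
    'Amount': [250, 150, 200, 300]
})

# Write both files to a temporary directory (use Path('.') to write
# next to your notebook instead)
data_dir = Path(tempfile.mkdtemp())
sample_customers.to_csv(data_dir / 'customers.csv', index=False)
sample_transactions.to_csv(data_dir / 'transactions.csv', index=False)

# Load them back the same way the step above does
customers_df = pd.read_csv(data_dir / 'customers.csv')
transactions_df = pd.read_csv(data_dir / 'transactions.csv')

# The key column should have the same dtype in both files before merging
print(customers_df['CustomerID'].dtype, transactions_df['CustomerID'].dtype)
```

Passing `index=False` to `to_csv()` avoids writing the row index as an extra unnamed column, which would otherwise show up when the file is read back.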

### Step 2: Merging the Datasets

Next, we’ll merge the datasets on the CustomerID column to create a unified dataset.

In [None]:
# Merging the DataFrames on 'CustomerID'
merged_data = pd.merge(customers_df, transactions_df, on='CustomerID')

# Displaying the merged DataFrame
print("\nMerged Customer and Transaction Data:")
print(merged_data.head())

### Step 3: Analyzing the Combined Data

With the datasets combined, you can now analyze the data to extract insights. For example, let’s calculate the total spending per customer and identify customers with the highest number of transactions.

In [None]:
# Calculating total spending per customer
# (grouping by CustomerID as well as Name keeps customers who share a name separate)
total_spending = merged_data.groupby(['CustomerID', 'Name'])['Amount'].sum().reset_index()

print("\nTotal Spending per Customer:")
print(total_spending)

# Calculating transaction counts per customer, highest first
transaction_count = (
    merged_data.groupby(['CustomerID', 'Name'])['TransactionID']
    .count()
    .reset_index(name='TransactionCount')
    .sort_values('TransactionCount', ascending=False)
)

print("\nTransaction Count per Customer:")
print(transaction_count)

**Explanation:** After merging, you can use Pandas functions to group, sum, or count specific data points. In this case, we calculated:

- Total spending per customer: Summing the transaction Amount for each customer.

- Transaction count per customer: Counting the number of transactions each customer made.
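
Because the merge in Step 2 is an inner join, customers without any transactions do not appear in these totals. If they should be reported with a spending of zero, one possible approach is to aggregate the transactions first and then left-join the totals onto the customer list. The sketch below uses the sample dataframes from earlier as a stand-in for the CSV data; `totals` and `spending_all` are illustrative names.

```python
import pandas as pd

# Stand-in for the CSV data: the sample dataframes from earlier in the post
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
transactions = pd.DataFrame({
    'TransactionID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 2, 4],
    'Amount': [250, 150, 200, 300]
})

# Aggregate first: one row per customer that has at least one transaction
totals = transactions.groupby('CustomerID', as_index=False)['Amount'].sum()

# Left join keeps every customer; those without transactions get NaN
spending_all = pd.merge(customers, totals, on='CustomerID', how='left')

# Treat "no transactions" as zero spending
spending_all['Amount'] = spending_all['Amount'].fillna(0)

print(spending_all)
```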

## Conclusion

In today’s post, we explored how to combine datasets using Pandas, with a focus on the `merge()` function and the `.join()` method. These tools are fundamental for merging data from multiple sources, enabling you to create comprehensive datasets for analysis. By mastering these techniques, you can unify disparate data, generate meaningful insights, and prepare datasets for more advanced analysis.

**Key Takeaways:**

- Use `merge()` to join dataframes based on common columns.

- Use `.join()` for index-based joins, when both dataframes share a common index.

- Practice using different types of joins (inner, outer, left, right) based on the data and the analysis goal.

Stay tuned for tomorrow’s post, where we’ll dive into Advanced Data Aggregation Techniques!