# Change Data Capture (CDC) with Pandas
In this notebook, we will explore how to implement Change Data Capture (CDC) using `pandas`.
CDC identifies which records were added, removed, or modified in a dataset, so downstream steps can process only those changes instead of the full dataset.

We will simulate a dataset where changes occur over time and demonstrate how to capture and track those changes.

## Step 1: Simulating Initial Dataset
Let's start by creating an initial dataset for a simple inventory system.

In [1]:
import pandas as pd

# Initial dataset
data_initial = {'Item': ['Apple', 'Banana', 'Orange', 'Mango'],
                'Stock': [100, 150, 80, 200],
                'Price': [1.2, 0.5, 0.8, 1.5]}

df_initial = pd.DataFrame(data_initial)
df_initial

|   | Item | Stock | Price |
|---|------|-------|-------|
| 0 | Apple | 100 | 1.2 |
| 1 | Banana | 150 | 0.5 |
| 2 | Orange | 80 | 0.8 |
| 3 | Mango | 200 | 1.5 |


## Step 2: Capturing Changes
Let's assume that over time the stock levels and prices of some items change and a new item is added to the inventory.
We will create a new dataset to reflect these changes and demonstrate how to capture them.

In [2]:
# Updated dataset
data_updated = {'Item': ['Apple', 'Banana', 'Orange', 'Mango', 'Grapes'],
                'Stock': [90, 160, 75, 200, 50],
                'Price': [1.2, 0.5, 0.85, 1.5, 2.0]}

df_updated = pd.DataFrame(data_updated)
df_updated

|   | Item | Stock | Price |
|---|------|-------|-------|
| 0 | Apple | 90 | 1.2 |
| 1 | Banana | 160 | 0.5 |
| 2 | Orange | 75 | 0.85 |
| 3 | Mango | 200 | 1.5 |
| 4 | Grapes | 50 | 2.0 |
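
Comparing two snapshots record by record relies on a key column that identifies each record exactly once; in this notebook that key is `Item`. The following is a minimal sketch of a sanity check to run before comparing. The helper name `assert_unique_key` and the sample frames are hypothetical, introduced only for illustration.

```python
import pandas as pd

def assert_unique_key(df: pd.DataFrame, key: str) -> None:
    """Raise ValueError if any value in the `key` column appears more than once."""
    # keep=False marks every occurrence of a repeated value, not just the later ones
    duplicated = df.loc[df[key].duplicated(keep=False), key].unique()
    if len(duplicated) > 0:
        raise ValueError(f"Duplicate values in key column {key!r}: {list(duplicated)}")

# Hypothetical snapshots: one with unique keys, one with a repeated key
snapshot_ok = pd.DataFrame({'Item': ['Apple', 'Banana'], 'Stock': [90, 160]})
snapshot_bad = pd.DataFrame({'Item': ['Apple', 'Apple'], 'Stock': [90, 95]})

assert_unique_key(snapshot_ok, 'Item')  # unique keys, so nothing is raised

try:
    assert_unique_key(snapshot_bad, 'Item')
except ValueError as exc:
    print(exc)
```

If a key repeats, a later join on that key can produce several rows for the same item, which makes the change categories ambiguous.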


## Step 3: Detecting Changes with CDC
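To detect changes, we compare the initial and updated datasets with an outer `merge()` on `Item`.
Passing `indicator=True` adds a `_merge` column that marks each row as `left_only` (present only in the initial dataset), `right_only` (present only in the updated dataset), or `both`.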

In [3]:
# Merging datasets to detect changes
df_changes = pd.merge(
    df_initial,
    df_updated,
    on='Item',
    how='outer',
    suffixes=('_initial', '_updated'),
    indicator=True,
)
df_changes

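|   | Item | Stock_initial | Price_initial | Stock_updated | Price_updated | _merge |
|---|------|---------------|---------------|---------------|---------------|--------|
| 0 | Apple | 100.0 | 1.2 | 90 | 1.2 | both |
| 1 | Banana | 150.0 | 0.5 | 160 | 0.5 | both |
| 2 | Orange | 80.0 | 0.8 | 75 | 0.85 | both |
| 3 | Mango | 200.0 | 1.5 | 200 | 1.5 | both |
| 4 | Grapes | NaN | NaN | 50 | 2.0 | right_only |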
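
The sample data contains no removed items, so the `left_only` value never shows up in the output above. The sketch below uses small hypothetical frames (`before` and `after`) in which one item disappears, and also passes `validate='one_to_one'` so that pandas raises a `MergeError` if `Item` is not unique on either side. The missing side of an unmatched row is filled with `NaN`, which is why `Stock_initial` is displayed as a float column.

```python
import pandas as pd

# Hypothetical snapshots: Mango is removed, Grapes is added, Apple's stock changes
before = pd.DataFrame({'Item': ['Apple', 'Banana', 'Mango'], 'Stock': [100, 150, 200]})
after = pd.DataFrame({'Item': ['Apple', 'Banana', 'Grapes'], 'Stock': [90, 150, 50]})

merged = pd.merge(
    before,
    after,
    on='Item',
    how='outer',
    suffixes=('_initial', '_updated'),
    indicator=True,
    validate='one_to_one',  # fail fast if 'Item' repeats in either frame
)

# Map each item to its merge status: 'both', 'left_only', or 'right_only'
status = merged.set_index('Item')['_merge'].astype(str).to_dict()
print(status)

# Rows that exist on one side only carry NaN for the other side's columns
print(merged['Stock_initial'].isna().sum(), 'row(s) without an initial stock value')
```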


## Step 4: Categorizing Changes
Now that we have detected changes, we can categorize them as follows:
- **Added**: Items that are present in the updated dataset but not in the initial dataset.
- **Removed**: Items that were in the initial dataset but are not in the updated dataset.
- **Updated**: Items where values have changed between the two datasets.

In [4]:
# Categorizing changes
added_items = df_changes[df_changes['_merge'] == 'right_only']
removed_items = df_changes[df_changes['_merge'] == 'left_only']
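# An item counts as updated if it exists in both datasets and its Stock or Price differs
in_both = df_changes['_merge'] == 'both'
stock_changed = df_changes['Stock_initial'] != df_changes['Stock_updated']
price_changed = df_changes['Price_initial'] != df_changes['Price_updated']
updated_items = df_changes[in_both & (stock_changed | price_changed)]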

print('Added Items:')
print(added_items[['Item', 'Stock_updated', 'Price_updated']])

print('Removed Items:')
print(removed_items[['Item', 'Stock_initial', 'Price_initial']])

print('Updated Items:')
print(updated_items[['Item', 'Stock_initial', 'Stock_updated']])


Added Items:
     Item  Stock_updated  Price_updated
4  Grapes             50            2.0
Removed Items:
Empty DataFrame
Columns: [Item, Stock_initial, Price_initial]
Index: []
Updated Items:
     Item  Stock_initial  Stock_updated
0   Apple          100.0             90
1  Banana          150.0            160
2  Orange           80.0             75
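
The categorization above can be generalized to any key and any set of value columns. The following is a minimal sketch under stated assumptions: the helper name `categorize_changes` and the sample frames are hypothetical, and every column in `value_cols` is assumed to exist in both snapshots. Two missing values are treated as equal, because a plain `!=` comparison reports `NaN != NaN` as a change.

```python
import pandas as pd

def categorize_changes(old, new, key, value_cols):
    """Return (added, removed, updated) rows between two snapshots keyed by `key`."""
    merged = pd.merge(old, new, on=key, how='outer',
                      suffixes=('_initial', '_updated'), indicator=True)
    added = merged[merged['_merge'] == 'right_only']
    removed = merged[merged['_merge'] == 'left_only']
    both = merged[merged['_merge'] == 'both']

    # Start with "nothing changed" and flag a row as soon as any column differs
    changed = pd.Series(False, index=both.index)
    for col in value_cols:
        before, after = both[f'{col}_initial'], both[f'{col}_updated']
        both_missing = before.isna() & after.isna()
        changed |= (before != after) & ~both_missing
    return added, removed, both[changed]

# Hypothetical snapshots: Apple changes price only, Banana changes stock only
old = pd.DataFrame({'Item': ['Apple', 'Banana', 'Mango'],
                    'Stock': [100, 150, 200],
                    'Price': [1.2, 0.5, 1.5]})
new = pd.DataFrame({'Item': ['Apple', 'Banana', 'Grapes'],
                    'Stock': [100, 160, 50],
                    'Price': [1.3, 0.5, 2.0]})

added, removed, updated = categorize_changes(old, new, 'Item', ['Stock', 'Price'])
print('Added:', list(added['Item']))
print('Removed:', list(removed['Item']))
print('Updated:', sorted(updated['Item']))
```

Checking every value column matters because a record whose price changes while its stock stays the same would otherwise go unreported.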


## Step 5: Logging Changes for Future Analysis
We can log the changes detected by CDC for future analysis or auditing.
This can help track inventory adjustments and price changes over time.

In [5]:
# Creating logs of changes
log_added = added_items[['Item', 'Stock_updated', 'Price_updated']].copy()
log_removed = removed_items[['Item', 'Stock_initial', 'Price_initial']].copy()
log_updated = updated_items[['Item', 'Stock_initial', 'Stock_updated', 'Price_initial', 'Price_updated']].copy()

# Save logs to CSV
log_added.to_csv('log_added.csv', index=False)
log_removed.to_csv('log_removed.csv', index=False)
log_updated.to_csv('log_updated.csv', index=False)

print('Logs saved successfully.')

Logs saved successfully.
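
For auditing, it can be convenient to keep all changes in one log with a change type and a capture timestamp rather than in three separate files. The sketch below is one possible layout, not the notebook's method: the helper `build_change_log`, the columns `change_type` and `captured_at`, the fixed timestamp, and the file name `change_log.csv` are all assumptions made for illustration. The sample frames mirror the shape of the logs created above.

```python
import os
import tempfile

import pandas as pd

def build_change_log(changes, captured_at):
    """Stack per-category frames into one log tagged with change type and timestamp."""
    frames = [
        frame.assign(change_type=change_type, captured_at=captured_at)
        for change_type, frame in changes.items()
        if not frame.empty  # skip categories with no rows
    ]
    if not frames:
        return pd.DataFrame(columns=['change_type', 'captured_at'])
    # Columns that a category lacks are filled with NaN in the combined log
    return pd.concat(frames, ignore_index=True)

# Hypothetical per-category logs shaped like log_added, log_removed, log_updated
sample_added = pd.DataFrame({'Item': ['Grapes'], 'Stock_updated': [50], 'Price_updated': [2.0]})
sample_removed = pd.DataFrame(columns=['Item', 'Stock_initial', 'Price_initial'])
sample_updated = pd.DataFrame({'Item': ['Apple', 'Banana', 'Orange'],
                               'Stock_initial': [100.0, 150.0, 80.0],
                               'Stock_updated': [90, 160, 75]})

change_log = build_change_log(
    {'added': sample_added, 'removed': sample_removed, 'updated': sample_updated},
    captured_at=pd.Timestamp('2024-01-01'),  # placeholder timestamp for the example
)

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'change_log.csv')
    change_log.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(change_log[['Item', 'change_type', 'captured_at']])
```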


## Summary
In this notebook, we have demonstrated how to:
- Simulate changes in a dataset.
- Detect changes using Change Data Capture (CDC) techniques.
- Categorize the changes as added, removed, or updated records.
- Log the changes for future analysis.

The next notebook explores data validation and quality assurance.