# Real-Time Data Processing with Pandas
In this notebook, we will explore how to handle and process real-time data using `pandas`.
We will demonstrate how to work with large datasets in chunks and how to aggregate real-time data streams.

## Step 1: Loading Data in Chunks
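Real-time data can arrive in large volumes. Reading it in fixed-size chunks keeps memory use bounded, because only one chunk is held in memory at a time: with `chunksize` set, `pd.read_csv` returns an iterator of DataFrames instead of a single DataFrame.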

In [1]:
import pandas as pd

# Simulating a large dataset
data = pd.DataFrame({
    # 'h' = hourly; the uppercase 'H' alias is deprecated in recent pandas releases
    'Timestamp': pd.date_range(start='2023-01-01', periods=10000, freq='h'),
    'Visitor Count': pd.Series(range(10000))
})

# Saving this as a CSV to simulate a large file
data.to_csv('large_dataset.csv', index=False)

# Loading the data in chunks
chunk_size = 1000  # Defining chunk size
chunks = pd.read_csv('large_dataset.csv', chunksize=chunk_size)

# Processing each chunk
for chunk in chunks:
    print(chunk.head())  # Display the first 5 rows of each chunk
    # Perform operations on each chunk


             Timestamp  Visitor Count
0  2023-01-01 00:00:00              0
1  2023-01-01 01:00:00              1
2  2023-01-01 02:00:00              2
3  2023-01-01 03:00:00              3
4  2023-01-01 04:00:00              4
                Timestamp  Visitor Count
1000  2023-02-11 16:00:00           1000
1001  2023-02-11 17:00:00           1001
1002  2023-02-11 18:00:00           1002
1003  2023-02-11 19:00:00           1003
1004  2023-02-11 20:00:00           1004
                Timestamp  Visitor Count
2000  2023-03-25 08:00:00           2000
2001  2023-03-25 09:00:00           2001
2002  2023-03-25 10:00:00           2002
2003  2023-03-25 11:00:00           2003
2004  2023-03-25 12:00:00           2004
                Timestamp  Visitor Count
3000  2023-05-06 00:00:00           3000
3001  2023-05-06 01:00:00           3001
3002  2023-05-06 02:00:00           3002
3003  2023-05-06 03:00:00           3003
3004  2023-05-06 04:00:00           3004
... (output truncated)
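
The loop above reads `Timestamp` as plain strings, because `read_csv` does not parse dates unless asked. The sketch below shows one way to do real per-chunk work: it parses the column with `parse_dates` and keeps only the rows recorded at noon. It uses a small in-memory CSV (48 rows, chunks of 10) as a stand-in for `large_dataset.csv`; the names `sample`, `csv_buffer`, `noon_parts`, and `noon` are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Assumption: a small in-memory stand-in for 'large_dataset.csv' (same columns)
sample = pd.DataFrame({
    'Timestamp': pd.date_range(start='2023-01-01', periods=48, freq='h'),
    'Visitor Count': range(48),
})
csv_buffer = io.StringIO(sample.to_csv(index=False))

noon_parts = []
for chunk in pd.read_csv(csv_buffer, chunksize=10, parse_dates=['Timestamp']):
    # The .dt accessor works here only because the column was parsed to datetime64
    noon_parts.append(chunk[chunk['Timestamp'].dt.hour == 12])

# Only the filtered rows are kept; each full chunk is discarded after its iteration
noon = pd.concat(noon_parts, ignore_index=True)
print(noon)
```

Filtering inside the loop means the combined result stays small even when the source file does not fit in memory.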

## Step 2: Real-Time Data Aggregation
Aggregating a stream as it arrives gives insights on the fly, without waiting for the full dataset.
Here we keep a running total of visitor counts and update it after each chunk.

In [2]:
# Simulating a data stream: read the dataset in chunks and update a running visitor total
# (chunk_size is reused from Step 1)
total_visitors = 0

for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):
    # Aggregating visitor count in real-time
    total_visitors += chunk['Visitor Count'].sum()
    print(f'Total Visitors So Far: {total_visitors}')



Total Visitors So Far: 499500
Total Visitors So Far: 1999000
Total Visitors So Far: 4498500
Total Visitors So Far: 7998000
Total Visitors So Far: 12497500
Total Visitors So Far: 17997000
Total Visitors So Far: 24496500
Total Visitors So Far: 31996000
Total Visitors So Far: 40495500
Total Visitors So Far: 49995000
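
A single running total is the simplest case. Grouped aggregates need one extra step, because a group (for example, a calendar day) can be split across two chunks. The sketch below is one possible approach, not the only one: it computes partial daily sums per chunk, then merges partial sums that share the same day. It again uses a small in-memory stand-in for the CSV; `sample`, `csv_buffer`, `partials`, and `daily_totals` are illustrative names.

```python
import io
import pandas as pd

# Assumption: 72 hourly rows (3 full days) standing in for 'large_dataset.csv'
sample = pd.DataFrame({
    'Timestamp': pd.date_range(start='2023-01-01', periods=72, freq='h'),
    'Visitor Count': range(72),
})
csv_buffer = io.StringIO(sample.to_csv(index=False))

partials = []
for chunk in pd.read_csv(csv_buffer, chunksize=10, parse_dates=['Timestamp']):
    # normalize() sets the time to midnight, so each row is keyed by its calendar day
    day = chunk['Timestamp'].dt.normalize()
    partials.append(chunk.groupby(day)['Visitor Count'].sum())

# A day split across a chunk boundary appears in two partials; sum them together
daily_totals = pd.concat(partials).groupby(level=0).sum()
print(daily_totals)
```

Only the small per-chunk summaries are retained, so memory use grows with the number of distinct days rather than the number of rows.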


## Step 3: Integrating Real-Time Data with an API
Real-time data often comes from APIs. In this section, we will simulate pulling real-time data from an API and processing it using `pandas`.

In [3]:
import requests

# Endpoint we would query in a real deployment (placeholder URL; no request is sent here)
api_url = 'https://api.example.com/real_time_data'
# Simulated API response, standing in for the parsed JSON a real call might return
api_response = {
    'Timestamp': '2023-09-18 12:00:00',
    'Visitor Count': 150
}

# Convert API response to DataFrame and process
real_time_data = pd.DataFrame([api_response])
real_time_data['Timestamp'] = pd.to_datetime(real_time_data['Timestamp'])
real_time_data


            Timestamp  Visitor Count
0 2023-09-18 12:00:00            150
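
A real feed delivers many records over time rather than one. The sketch below shows one way to collect several polled records and build a time-indexed DataFrame from them. The function `fake_fetch` is a hypothetical stand-in for an API call and its record shape simply mirrors the mocked response above; a real fetcher might instead wrap something like `requests.get(api_url, timeout=5).json()`, assuming the endpoint returns a JSON object with these two keys.

```python
import pandas as pd

def fake_fetch(i):
    """Hypothetical stand-in for one API poll; returns a JSON-like record."""
    return {
        'Timestamp': f'2023-09-18 12:{i:02d}:00',
        'Visitor Count': 150 + i,
    }

# Three simulated polls, one minute apart
records = [fake_fetch(i) for i in range(3)]

# Build the DataFrame once from the collected records, then index it by time
stream = pd.DataFrame(records)
stream['Timestamp'] = pd.to_datetime(stream['Timestamp'])
stream = stream.set_index('Timestamp').sort_index()

latest_total = int(stream['Visitor Count'].sum())
print(stream)
print(f'Total visitors across polls: {latest_total}')
```

Collecting records in a list and constructing the DataFrame once is generally cheaper than growing a DataFrame one row at a time.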


## Summary
In this notebook, we have demonstrated:
- How to load and process large datasets in chunks.
- How to perform real-time data aggregation.
- How to simulate real-time data from an API and integrate it into a `pandas` DataFrame.

The next notebook explores Change Data Capture (CDC).