# Final Project: Pathumthani Platform Data Integration
In this final project notebook, we will integrate all the concepts we've learned so far:
- Data Cleaning
- Real-Time Data Processing
- Change Data Capture (CDC)
- Data Validation and Quality Assurance

The goal is to prepare and integrate data from different sectors (tourism, agriculture, and sports) into a reliable dataset for the platform. The examples in this notebook use small sample datasets from the tourism and agriculture sectors.

## Step 1: Loading the Datasets
We will load sample datasets from the tourism and agriculture sectors and inspect them before cleaning. Note that the tourism dataset contains a missing `Visitors` value.

In [1]:
import pandas as pd

# Sample datasets for tourism and agriculture
tourism_data = {'Location': ['Temple A', 'Park B', 'Beach C'],
                'Visitors': [500, None, 850],  # Missing value
                'Date': ['2023-09-15', '2023-09-16', '2023-09-17']}

agriculture_data = {'Crop': ['Rice', 'Corn', 'Mango'],
                    'Yield': [1200, 1500, 850],
                    'Harvest Date': ['2023-09-10', '2023-09-12', '2023-09-14']}

# Create DataFrames
df_tourism = pd.DataFrame(tourism_data)
df_agriculture = pd.DataFrame(agriculture_data)

# Display tourism dataset
df_tourism

Unnamed: 0,Location,Visitors,Date
0,Temple A,500.0,2023-09-15
1,Park B,,2023-09-16
2,Beach C,850.0,2023-09-17


In [2]:
# Display agriculture dataset
df_agriculture

Unnamed: 0,Crop,Yield,Harvest Date
0,Rice,1200,2023-09-10
1,Corn,1500,2023-09-12
2,Mango,850,2023-09-14
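
Before cleaning, it can help to check where values are missing and how each column is typed. The following is a minimal sketch; `df_check` and `missing_counts` are illustrative names, and the data is the same tourism sample rebuilt so the snippet runs on its own.

```python
import pandas as pd

# Same sample tourism data as above, rebuilt so this snippet is self-contained
df_check = pd.DataFrame({
    'Location': ['Temple A', 'Park B', 'Beach C'],
    'Visitors': [500, None, 850],
    'Date': ['2023-09-15', '2023-09-16', '2023-09-17'],
})

# Count missing values per column
missing_counts = df_check.isna().sum()
print(missing_counts)

# Column dtypes: 'Visitors' is float64 because of the missing value,
# and 'Date' is still stored as plain strings at this point
print(df_check.dtypes)
```

This kind of check makes the cleaning steps in the next section easier to justify: one missing `Visitors` value to fill, and a `Date` column that still needs parsing.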


## Step 2: Cleaning the Data
Let's handle the missing values in the tourism dataset and standardize the date formats.

In [3]:
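# Fill missing values in 'Visitors' with the average visitor count.
# Assigning the result back avoids chained in-place modification, which newer pandas versions warn about.
df_tourism['Visitors'] = df_tourism['Visitors'].fillna(df_tourism['Visitors'].mean())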

# Convert 'Date' and 'Harvest Date' columns to datetime format
df_tourism['Date'] = pd.to_datetime(df_tourism['Date'])
df_agriculture['Harvest Date'] = pd.to_datetime(df_agriculture['Harvest Date'])

df_tourism

Unnamed: 0,Location,Visitors,Date
0,Temple A,500.0,2023-09-15
1,Park B,675.0,2023-09-16
2,Beach C,850.0,2023-09-17
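
The sample dates above are all well formed, but real feeds often contain entries that cannot be parsed. As a sketch of one way to handle that, `pd.to_datetime` accepts `errors='coerce'`, which turns unparseable values into `NaT` instead of raising. The `raw_dates` series below is hypothetical and not part of the project datasets.

```python
import pandas as pd

# Hypothetical raw dates, including one malformed entry (not from the datasets above)
raw_dates = pd.Series(['2023-09-15', '2023-09-16', 'not a date'])

# errors='coerce' converts values that do not match the format into NaT
parsed_dates = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

# Rows that failed to parse can then be reviewed, corrected, or dropped
bad_rows = raw_dates[parsed_dates.isna()]
print(bad_rows)
```

Whether to drop, correct, or quarantine such rows depends on the platform's requirements; the sketch only shows how to find them.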


## Step 3: Real-Time Data Processing
A true streaming pipeline is out of scope for this notebook, so we simulate real-time processing instead: a small batch of new tourism records arrives and is appended to the existing dataset.

In [4]:
# Simulating real-time incoming data
real_time_tourism_data = pd.DataFrame({
    'Location': ['Temple A', 'Park B'],
    'Visitors': [520, 780],
    'Date': ['2023-09-18', '2023-09-18']
})

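# Apply the same date cleaning as in Step 2 so the 'Date' column keeps a single datetime dtype
real_time_tourism_data['Date'] = pd.to_datetime(real_time_tourism_data['Date'])

# Append the incoming batch to the existing tourism data
df_tourism = pd.concat([df_tourism, real_time_tourism_data], ignore_index=True)
df_tourism

Unnamed: 0,Location,Visitors,Date
0,Temple A,500.0,2023-09-15
1,Park B,675.0,2023-09-16
2,Beach C,850.0,2023-09-17
3,Temple A,520.0,2023-09-18
4,Park B,780.0,2023-09-18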
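
To make the idea of processing data in chunks more concrete, here is a minimal sketch that handles records batch by batch. The generator `incoming_batches` and the helper `clean_batch` are hypothetical names standing in for a real feed (for example a message queue or an API poller); the values are the same two records used above.

```python
import pandas as pd

def incoming_batches():
    """Yield small DataFrames to mimic records arriving over time (hypothetical feed)."""
    yield pd.DataFrame({'Location': ['Temple A'], 'Visitors': [520], 'Date': ['2023-09-18']})
    yield pd.DataFrame({'Location': ['Park B'], 'Visitors': [780], 'Date': ['2023-09-18']})

def clean_batch(batch):
    """Apply the same cleaning rules to every batch before it is stored."""
    batch = batch.copy()
    batch['Date'] = pd.to_datetime(batch['Date'])
    batch['Visitors'] = batch['Visitors'].astype(float)
    return batch

processed = []
running_total = 0.0
for batch in incoming_batches():
    cleaned = clean_batch(batch)
    processed.append(cleaned)
    # Update a simple running metric as each batch arrives
    running_total += cleaned['Visitors'].sum()
    print(f"Received {len(cleaned)} row(s); running visitor total: {running_total}")

# Combine all processed batches once at the end
df_stream = pd.concat(processed, ignore_index=True)
print(df_stream)
```

Collecting the cleaned batches in a list and concatenating once is usually cheaper than calling `pd.concat` inside the loop, since each concatenation copies the data.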


## Step 4: Change Data Capture (CDC)
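We will track changes between the initial and updated agriculture datasets with a simple snapshot comparison: joining both versions on `Crop` places the old and new values side by side, so any differences are easy to spot.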

In [5]:
# Updated agriculture dataset with changes
agriculture_data_updated = {'Crop': ['Rice', 'Corn', 'Mango'],
                           'Yield': [1250, 1450, 870],  # Updated values
                           'Harvest Date': ['2023-09-10', '2023-09-12', '2023-09-14']}

df_agriculture_updated = pd.DataFrame(agriculture_data_updated)

# Join the initial and updated records on 'Crop' to compare their values side by side
df_cdc = pd.merge(df_agriculture, df_agriculture_updated, on='Crop', suffixes=('_initial', '_updated'))
df_cdc

Unnamed: 0,Crop,Yield_initial,Harvest Date_initial,Yield_updated,Harvest Date_updated
0,Rice,1200,2023-09-10,1250,2023-09-10
1,Corn,1500,2023-09-12,1450,2023-09-12
2,Mango,850,2023-09-14,870,2023-09-14
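
The merged table shows old and new values next to each other, but it does not yet say which rows changed. One possible way to make the changes explicit is sketched below: an outer merge with `indicator=True` also reveals rows that exist in only one snapshot, and a difference column quantifies each update. The column names `Yield_change` and `Change_type` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Same initial and updated yields as above, rebuilt so this snippet is self-contained
initial = pd.DataFrame({'Crop': ['Rice', 'Corn', 'Mango'], 'Yield': [1200, 1500, 850]})
updated = pd.DataFrame({'Crop': ['Rice', 'Corn', 'Mango'], 'Yield': [1250, 1450, 870]})

# An outer merge keeps rows that appear in only one snapshot;
# indicator=True adds a '_merge' column saying where each row came from
cdc = pd.merge(initial, updated, on='Crop', how='outer',
               suffixes=('_initial', '_updated'), indicator=True)

# Difference between the two snapshots for each crop
cdc['Yield_change'] = cdc['Yield_updated'] - cdc['Yield_initial']

# Translate the merge indicator into a change label
labels = {'left_only': 'deleted', 'right_only': 'inserted', 'both': 'unchanged'}
cdc['Change_type'] = cdc['_merge'].astype(str).map(labels)

# Rows present in both snapshots with a different yield are updates
is_update = (cdc['_merge'] == 'both') & (cdc['Yield_change'] != 0)
cdc.loc[is_update, 'Change_type'] = 'updated'

print(cdc[['Crop', 'Yield_initial', 'Yield_updated', 'Yield_change', 'Change_type']])
```

With the sample data every crop is present in both snapshots and every yield differs, so all three rows are labelled `updated`. Production CDC systems typically read a database's change log rather than comparing snapshots; this sketch only covers the snapshot approach used in this notebook.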


## Step 5: Data Validation and Quality Assurance
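We will apply validation rules to the tourism and agriculture data to ensure quality before integration. Each rule selects the rows that violate it, so an empty result means the data passed the check.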

In [6]:
# Validation: Ensure that 'Visitors' in tourism data is always greater than zero
invalid_visitors = df_tourism[df_tourism['Visitors'] <= 0]
print('Invalid Visitors Data:', invalid_visitors)

# Validation: Check if 'Yield' in agriculture data is within the expected range (100 to 2000)
invalid_yield = df_agriculture[(df_agriculture['Yield'] < 100) | (df_agriculture['Yield'] > 2000)]
print('Invalid Yield Data:', invalid_yield)


Invalid Visitors Data: Empty DataFrame
Columns: [Location, Visitors, Date]
Index: []
Invalid Yield Data: Empty DataFrame
Columns: [Crop, Yield, Harvest Date]
Index: []
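
When the same kind of rule applies to several columns, it can be convenient to wrap it in a small helper. The sketch below assumes a hypothetical function `validate_range` that returns the rows whose value is missing or outside the allowed bounds; the `sample` data is invented to show failing cases and is not part of the project datasets.

```python
import pandas as pd

def validate_range(df, column, min_value=None, max_value=None):
    """Return the rows of df whose column is missing or outside [min_value, max_value]."""
    invalid = df[column].isna()
    if min_value is not None:
        invalid |= df[column] < min_value
    if max_value is not None:
        invalid |= df[column] > max_value
    return df[invalid]

# Hypothetical data containing a too-low, a missing, and a too-high yield
sample = pd.DataFrame({
    'Crop': ['Rice', 'Corn', 'Mango', 'Cassava'],
    'Yield': [1200, 50, None, 2500],
})

bad = validate_range(sample, 'Yield', min_value=100, max_value=2000)
print(bad)
```

Unlike the checks above, this helper also reports missing values, because comparisons against `NaN` are always false and would otherwise slip through silently.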


## Step 6: Final Integration and Reporting
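After cleaning, processing, and validating the data, we can now integrate the datasets and generate a report. Here `pd.concat` with `axis=1` places the two datasets side by side, aligned on the row index. The rows are paired by position only, not by a shared key, so the two extra tourism rows have no agriculture counterpart and show missing values in the agriculture columns.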

In [7]:
# Make sure 'Date' has a single datetime dtype before integrating (a no-op if it already does)
df_tourism['Date'] = pd.to_datetime(df_tourism['Date'])

# Place the tourism and agriculture data side by side; rows are aligned by index position only
df_integrated = pd.concat([df_tourism, df_agriculture], axis=1)
df_integrated

Unnamed: 0,Location,Visitors,Date,Crop,Yield,Harvest Date
0,Temple A,500.0,2023-09-15,Rice,1200.0,2023-09-10
1,Park B,675.0,2023-09-16,Corn,1500.0,2023-09-12
2,Beach C,850.0,2023-09-17,Mango,850.0,2023-09-14
3,Temple A,520.0,2023-09-18,,,NaT
4,Park B,780.0,2023-09-18,,,NaT
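
Because the two sectors share no natural join key, an alternative worth considering is a long ("tidy") layout in which every row carries its sector, item, metric, and value. The sketch below is one possible shape for such a table and a small summary report built from it; the column names `Sector`, `Item`, `Metric`, and `Value` are assumptions made for illustration, and the data is a subset of the cleaned samples above.

```python
import pandas as pd

# Subsets of the cleaned sample data, rebuilt so this snippet is self-contained
tourism = pd.DataFrame({
    'Location': ['Temple A', 'Park B'],
    'Visitors': [500.0, 675.0],
    'Date': pd.to_datetime(['2023-09-15', '2023-09-16']),
})
agriculture = pd.DataFrame({
    'Crop': ['Rice', 'Corn'],
    'Yield': [1200, 1500],
    'Harvest Date': pd.to_datetime(['2023-09-10', '2023-09-12']),
})

# Map each sector onto a shared set of columns (hypothetical schema)
tourism_long = (tourism
                .rename(columns={'Location': 'Item', 'Visitors': 'Value'})
                .assign(Sector='tourism', Metric='Visitors'))
agriculture_long = (agriculture
                    .rename(columns={'Crop': 'Item', 'Yield': 'Value', 'Harvest Date': 'Date'})
                    .assign(Sector='agriculture', Metric='Yield'))

# Stack the sectors vertically so no row is paired with an unrelated one
columns = ['Sector', 'Item', 'Metric', 'Value', 'Date']
df_long = pd.concat([tourism_long[columns], agriculture_long[columns]], ignore_index=True)

# A simple per-sector report: record count, total, and average
report = df_long.groupby(['Sector', 'Metric'])['Value'].agg(['count', 'sum', 'mean'])
print(report)
```

In this layout, adding another sector such as sports means appending rows rather than widening the table, and no missing values are introduced by mismatched row counts.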


## Summary
In this final project notebook, we have:
- Cleaned and prepared datasets from multiple sectors.
- Simulated an incoming batch of real-time data and appended it to the existing dataset.
- Applied Change Data Capture (CDC) to track changes.
- Validated data quality to ensure accurate results.
- Integrated the datasets for the Pathumthani platform.

This concludes the demo series on data cleaning, real-time processing, CDC, and validation.