### **Report on Data Processing Approach**

**Introduction**  
This report outlines the approach taken for data processing, exploration, and validation in our project. Given the nature of our work, we have chosen to handle these tasks separately from the main pipeline orchestration and data analysis process.

**Purpose of This Notebook**  
The primary function of this notebook is to facilitate data exploration, cleaning, and validation. These steps are crucial in ensuring data quality before it is integrated into further analysis.

**Separation from Pipeline Orchestration**  
As this notebook focuses on pre-processing activities, it is not part of our main pipeline orchestration or analytical workflows. Instead, it serves as a dedicated space to refine and understand the data before it is incorporated into structured analysis.

**Extending Data Exploration**  
In addition to its core purpose, this notebook lets us explore the data beyond the established star schema. Working in this environment, we can uncover additional insights and refine our approach to data modeling and analysis.

**Conclusion**  
By keeping data processing separate, we ensure that our pipeline remains streamlined while still leveraging a structured approach to data exploration and validation. This method allows for a more flexible and comprehensive understanding of our dataset without interfering with the core analytical processes.


### **Test Case:** 
##### **Validation of Delivered Order Value Against Payment Receipts**

**Objective**  
This test case examines the relationship between orders, order items, and payments to ensure data consistency. Specifically, we validate whether delivered orders align with recorded payment receipts while identifying discrepancies in order item tracking.

**Observations**  
1) Order-Payment Relationship
    - Orders are linked to payments.
    - Orders are also linked to order items.
    - However, order items are not directly linked to payments.

2) Data Inconsistencies Identified
    - Payments exist without corresponding order item details.
    - Some orders marked as delivered and paid have no order item records.
    - A total of 643 transactions have the status unavailable and show no order item value (price sum of 0.0). This points to inconsistent data management and a risk of misreporting.

**Validation Rule Implementation**  
To improve data quality, our validation rules are set as follows:
- Accepted: Orders with the status canceled (considered valid exceptions).
- Flagged as Errors: Orders with statuses created, delivered, invoiced, shipped, or unavailable that do not have proper order item details. These will require rectification and data cleaning.
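
The rule above can be sketched in pandas as follows. This is a minimal illustration on made-up data, not the project's actual implementation: the `order_id` column name, the helper `flag_orders_without_items`, and the sample rows are assumptions introduced for the example.

```python
import pandas as pd

# Statuses accepted as valid exceptions when an order has no order items.
ACCEPTED_STATUSES = {"canceled"}


def flag_orders_without_items(orders: pd.DataFrame, order_items: pd.DataFrame) -> pd.DataFrame:
    """Return the orders that have no order item rows, labeled 'accepted' or 'error'.

    Hypothetical helper: assumes both frames share an 'order_id' column.
    """
    has_items = orders["order_id"].isin(order_items["order_id"])
    missing = orders.loc[~has_items].copy()
    missing["validation"] = missing["order_status"].map(
        lambda status: "accepted" if status in ACCEPTED_STATUSES else "error"
    )
    return missing


# Made-up sample data (not taken from the dataset).
orders = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "order_status": ["delivered", "canceled", "unavailable", "delivered"],
})
order_items = pd.DataFrame({"order_id": ["a"], "price": [10.0]})

result = flag_orders_without_items(orders, order_items)
print(result)
```

Keeping the accepted statuses in a single set makes the rule easy to adjust if another status is later judged to be a valid exception.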

**Conclusion**  
This validation process highlights critical gaps in data tracking, particularly with missing order item records. By enforcing these validation rules, we ensure greater accuracy in order and payment reconciliation, reducing potential errors in financial reporting and business operations.




##### **Step 1: Collect the raw data and explore data**
[Raw data stored in the GitHub repository](https://github.com/yeesoontuck/module2-project/tree/main/data)

In [12]:
import pandas as pd

In [13]:
payment_df = pd.read_csv('../../data/olist_order_payments_dataset.csv')

# Total of all payment values in the raw payments file
total_payment = round(payment_df['payment_value'].sum(), 2)
print(f'Total Payment Value from order_payments (raw data): BRL {total_payment}')


Total Payment Value from order_payments (raw data): BRL 16008872.12


##### **Step 2: Data Discovery**

##### **Step 3: Data Cleaning**  
To ensure data integrity, we implement a structured data cleaning process using dbt tests, leveraging the dbt_utils and dbt_expectations packages to validate data quality.

**Key Validation Checks:**  
- Data Duplication – Identify and remove duplicate records to prevent redundancy.
- Irrelevant Information – Filter out unnecessary or misleading data that does not contribute to analysis.
- Error Removal – Detect and eliminate inconsistencies or anomalies to improve data accuracy.

By applying these validations, we enhance data reliability, ensuring a cleaner and more structured dataset for further analysis.
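
The project runs these checks as dbt tests. For readers exploring the data in this notebook, the duplication and error-removal checks could be approximated in pandas roughly as below. The sample rows, the `order_id` and `payment_sequential` key columns, and the rule that payment values must be non-negative are assumptions made for illustration only.

```python
import pandas as pd

# Made-up sample with one duplicated row and one negative payment value.
payments = pd.DataFrame({
    "order_id": ["a", "a", "b", "c"],
    "payment_sequential": [1, 1, 1, 1],
    "payment_value": [50.0, 50.0, 20.0, -5.0],
})

# Data duplication: rows repeating the same (order_id, payment_sequential) key.
dup_mask = payments.duplicated(subset=["order_id", "payment_sequential"], keep="first")
deduped = payments.loc[~dup_mask]

# Error removal: assumed rule that a payment value cannot be negative.
cleaned = deduped.loc[deduped["payment_value"] >= 0].reset_index(drop=True)

print(f"Duplicate rows removed: {int(dup_mask.sum())}")
print(cleaned)
```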

##### **Step 4: Transform and Enrich Data (Data Integration)**  
In this step, we focus on integrating and transforming the data into a usable format suitable for business or data analysis. This ensures that the data is structured and ready for deeper insights.

**Key Actions:**  
- Data Transformation – Convert raw data into a consistent, well-organized format for analysis, applying necessary transformations to align with business requirements.
- Data Integration – Combine data from various sources to create a unified dataset, ensuring consistency and compatibility for analysis.
- Logical Validation – Verify key logical checks, such as ensuring delivered orders are not marked as unpaid, ensuring that business rules and logic are correctly applied throughout the dataset.

By transforming and enriching the data in this manner, we enhance its usability, providing a solid foundation for accurate and reliable business analysis.
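
As an illustration of the logical validation mentioned above (delivered orders should not be unpaid), a check along these lines could be run in pandas. It is a sketch on made-up data; the `order_id` join key and the sample rows are assumptions, and the actual check in the project may be implemented differently.

```python
import pandas as pd

# Made-up sample data: order "b" is delivered but has no payment rows.
orders = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "order_status": ["delivered", "delivered", "shipped"],
})
payments = pd.DataFrame({
    "order_id": ["a", "a", "c"],
    "payment_value": [30.0, 20.0, 15.0],
})

# Total paid per order, then attach it to every order (0.0 when no payment exists).
paid = payments.groupby("order_id", as_index=False)["payment_value"].sum()
checked = orders.merge(paid, on="order_id", how="left")
checked["payment_value"] = checked["payment_value"].fillna(0.0)

# Logical check: delivered orders with nothing paid violate the business rule.
delivered_unpaid = checked[
    (checked["order_status"] == "delivered") & (checked["payment_value"] <= 0)
]
print(delivered_unpaid)
```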

In [14]:
# 'oip' stands for order-order_item-payment
oip_df = pd.read_csv('3_data_csv/fact_order_item_details.csv')

# prop_payment_value is the order's payment value apportioned across its order item lines
filtered_oip_df = oip_df[['price', 'freight_value', 'prop_payment_value']].sum()
filtered_oip_df


price                 13591643.70
freight_value          2251909.54
prop_payment_value    15846280.17
dtype: float64

**Data Explanation: Price, Freight, and Payment Values**  
1) price: Represents the total order value (excluding freight).
2) freight_value: Represents the total freight value associated with the order items.
3) prop_payment_value: The total payment value linked to the order, proportionately distributed across the corresponding order items.
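
The notebook does not show how prop_payment_value is derived. One plausible way to apportion an order's payment across its item lines, assuming each line's share is proportional to its price plus freight_value, is sketched below. The sample data, the `order_id` key, and the helper columns `line_value` and `share` are hypothetical, and the sketch assumes every order has a non-zero item total.

```python
import pandas as pd

# Made-up sample: order "a" has two item lines, order "b" has one.
items = pd.DataFrame({
    "order_id": ["a", "a", "b"],
    "price": [60.0, 30.0, 40.0],
    "freight_value": [10.0, 0.0, 10.0],
})
order_payments = pd.DataFrame({
    "order_id": ["a", "b"],
    "payment_value": [110.0, 50.0],
})

# Attach each order's total payment to its item lines.
lines = items.merge(order_payments, on="order_id", how="left")

# Each line's share of the order is its (price + freight) over the order's total.
lines["line_value"] = lines["price"] + lines["freight_value"]
order_total = lines.groupby("order_id")["line_value"].transform("sum")
lines["share"] = lines["line_value"] / order_total

# Apportion the order's payment across its lines.
lines["prop_payment_value"] = lines["payment_value"] * lines["share"]
print(lines[["order_id", "price", "freight_value", "prop_payment_value"]])
```

Under this scheme the apportioned values of an order always add back up to that order's payment, which is what makes the column usable for reconciling totals.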

**Summary**  
For each order, the payment value that can be traced through its order items is the sum of price and freight_value. In the totals above, that sum is BRL 15,843,553.24, against a prop_payment_value total of BRL 15,846,280.17. Comparing the two is how payments are tracked and matched against each order and its items.

##### **Step 4 (continued): Finding Discrepancies between Raw Payment Data and Transformed Payment Data**

In this step, we identify discrepancies between the original raw payment data and the transformed payment data. This helps confirm that all payment information has been correctly processed and integrated into the final dataset.

**Key Actions:**
- Data Comparison – Compare the raw payment data with the transformed payment data to identify any differences or mismatches.
- Identify Discrepancies – Focus on key areas where discrepancies may arise, such as differences in payment amounts or missing payment records.
- Investigate Anomalies – Investigate the root causes of any inconsistencies, such as data entry errors, transformation issues, or incorrect mapping between order and payment records.
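
The comparison described above could be sketched in pandas with a left merge and the `indicator` flag, which marks payments that have no matching order item rows. The sample frames and the `order_id` key are assumptions for illustration; the project's own comparison produced the case1_missingdata.csv file and may be built differently.

```python
import pandas as pd

# Made-up sample: payment for order "c" has no order item rows.
raw_payments = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "payment_value": [100.0, 40.0, 25.0],
})
item_orders = pd.DataFrame({"order_id": ["a", "a", "b"]})  # one row per order item line

# Left merge against the distinct order IDs that have items;
# the '_merge' column is 'left_only' where no item row exists.
compared = raw_payments.merge(
    item_orders.drop_duplicates(), on="order_id", how="left", indicator=True
)
payments_without_items = compared.loc[
    compared["_merge"] == "left_only", ["order_id", "payment_value"]
]

print(payments_without_items)
print("Payment value without items:", payments_without_items["payment_value"].sum())
```

Deduplicating the item-side order IDs before the merge keeps one row per payment, so the summed payment value is not inflated by orders with several item lines.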


**Result (see below)**  
- Missing Data: A total of 883 transactions were identified in which the order items could not be linked or identified.
- Distinct Orders: Counting the distinct orders among these transactions gives a total of 776 unique orders.


In [15]:
# Transactions whose order items could not be linked or identified
missing_df = pd.read_csv('3_data_csv/case1_missingdata.csv')

filtered_missing_df = missing_df.groupby('order_status', as_index=False).agg(
    order_status_count=('order_status', 'count'),
    price_sum=('price', 'sum'))

filtered_missing_df

| | order_status | order_status_count | price_sum |
|---|---|---|---|
| 0 | canceled | 179 | 0.0 |
| 1 | created | 5 | 0.0 |
| 2 | delivered | 3 | 134.97 |
| 3 | invoiced | 2 | 0.0 |
| 4 | shipped | 1 | 0.0 |
| 5 | unavailable | 643 | 0.0 |


In [16]:
filtered_payment_no_item_df = missing_df.groupby('order_in_orders', as_index=False).agg(
    payment_count=('payment_without_items', 'count'),
    total_payment_without_items=('payment_without_items', 'sum')
)
filtered_payment_no_item_df

| | order_in_orders | payment_count | total_payment_without_items |
|---|---|---|---|
| 0 | 0010dedd556712d7bb69a19cb7bbd37a | 1 | 111.12 |
| 1 | 00a500bc03bc4ec968e574c2553bed4b | 1 | 555.99 |
| 2 | 00b1cb0320190ca0daa2c88b35206009 | 1 | 0.00 |
| 3 | 00bca4adac549020c1273714d04d0208 | 1 | 111.30 |
| 4 | 00d0ffd14774da775ac832ba8520510f | 1 | 134.49 |
| ... | ... | ... | ... |
| 771 | fdcca0e15a4d03e3fb89fb14664a3744 | 1 | 29.59 |
| 772 | fddbd183387b5c9bcbafbd0fe965301f | 1 | 40.00 |
| 773 | fe87d4b944748f63ca5ed22cc55b6fb6 | 1 | 173.68 |
| 774 | feae5ecdf2cc16c1007741be785fe3cd | 1 | 68.53 |


In [17]:
total_payment_value_without_items = missing_df['payment_without_items'].sum()
print('Total payment value without order item track: BRL',total_payment_value_without_items)

Total payment value without order item track: BRL 162591.95
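
As a cross-check, the figures reported in this notebook reconcile: the raw payment total from Step 1 minus the prop_payment_value total from the fact table equals the payment value without order item tracking. The short calculation below uses only the totals printed above; the variable names are chosen for the example.

```python
# Totals copied from the outputs printed in this notebook (BRL).
raw_total = 16008872.12           # sum of payment_value in the raw payments file
transformed_total = 15846280.17   # sum of prop_payment_value in fact_order_item_details
payments_without_items = 162591.95

# Payment value that did not make it into the order item fact table.
gap = round(raw_total - transformed_total, 2)

print(f"Raw minus transformed: BRL {gap}")
print("Matches payments without items:", abs(gap - payments_without_items) < 0.005)
```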


##### **Step 5: Set Validation after Data Evaluation**

This validation rule accepts only one order status, 'canceled', for orders without order item details. Orders in any other status are flagged, so that the system administrator and data analysts are alerted to verify and confirm each case, which protects data accuracy and integrity.