
Explanation of the Python Data Pipeline Code
The provided Python script (pipeline.py) serves as a practical demonstration of a simple Data Pipeline and the Data Flow that occurs within it. It uses the pandas library to process simulated sales order data, illustrating the core concepts of data engineering: Extraction, Transformation, and Loading (ETL/ELT).

1. The Data Pipeline: Orchestration (The "How" and "When")
The Data Pipeline is represented by the run_pipeline() function, which acts as the orchestrator. Its primary role is to define the sequence of operations and manage the flow of control between the different stages.

Function	Pipeline Stage	Role in Orchestration
run_pipeline()	Orchestrator	Calls the stages in order: extract_data → transform_data → load_data. It aborts the run if extraction returns no data.
extract_data()	Extract/Ingestion	The starting point. Connects to the data source (simulated by reading a CSV file) and brings the raw data into memory as a pandas.DataFrame.
transform_data()	Transform/Data Flow	The middle stage. Cleanses, filters, and aggregates the raw data, returning the cleansed and aggregated DataFrames.
load_data()	Load/Storage	The endpoint. Writes the processed data to the final storage locations (simulated by writing to two separate CSV files).
This structure ensures that the pipeline is modular, allowing each stage to be developed, tested, and monitored independently.
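To make the separation between orchestration and stage logic more concrete, here is a minimal sketch of an orchestrator that knows only the order of the stages, not what they do. The run_stages helper and the toy extract, transform, and load functions are hypothetical stand-ins and are not part of the script; they work on plain lists rather than DataFrames to keep the example short.

```python
from typing import Any, Callable, List, Tuple

# A stage is a (name, function) pair; this alias is an assumption for the sketch.
Stage = Tuple[str, Callable[[Any], Any]]


def run_stages(stages: List[Stage], initial: Any) -> Any:
    """Hypothetical orchestrator: call each stage in order, passing results along."""
    data = initial
    for name, stage in stages:
        print(f"Running stage: {name}")
        data = stage(data)
    return data


# Toy stages standing in for extract, transform, and load.
def extract(_: Any) -> List[int]:
    return [50, 100, 250]


def transform(amounts: List[int]) -> List[int]:
    return [a for a in amounts if a >= 100]


def load(amounts: List[int]) -> int:
    return len(amounts)  # pretend this is the number of rows written


rows_written = run_stages(
    [("extract", extract), ("transform", transform), ("load", load)], None
)
print(rows_written)  # 2
```

Because each stage is an ordinary function, it can also be called directly with hand-made input, which is what makes independent testing practical.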

2. The Data Flow: Transformation Logic (The "What" and "Where")
The Data Flow is primarily contained within the transform_data() function. This function defines the specific sequence of steps and logic applied to the data to convert it from its raw state into a business-ready format.

The code demonstrates a multi-step data flow:

Step 2.1: Cleansing and Validation
The first part of the transformation ensures data quality:
df['Amount'] = pd.to_numeric(df['Amount'], errors='coerce')
df.dropna(subset=['Amount'], inplace=True)
This logic cleans the Amount column by converting it to a numeric type. If any value cannot be converted (e.g., "N/A" or "Error"), it is set to NaN and then dropped, ensuring that only valid, numeric data proceeds to the next steps.
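The behaviour of these two lines can be checked on a small in-memory example. The rows below are made up for illustration and are not taken from the source file.

```python
import pandas as pd

# Made-up rows; "N/A" and "Error" stand in for values that cannot be parsed.
df = pd.DataFrame(
    {"OrderID": [1, 2, 3, 4], "Amount": ["120.50", "N/A", "80", "Error"]}
)

# errors="coerce" turns anything that cannot be parsed into NaN
df["Amount"] = pd.to_numeric(df["Amount"], errors="coerce")
print(df["Amount"].isna().sum())  # 2

# Drop the rows whose Amount became NaN
df = df.dropna(subset=["Amount"])
print(df["OrderID"].tolist())  # [1, 3]
```

Note that dropna removes any row with a missing Amount, whether the value was unparseable or absent from the start.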

Step 2.2: Filtering (Business Rule Application)
A specific business rule is applied to the data flow:
cleansed_df = df[df['Amount'] >= MIN_ORDER_AMOUNT].copy()
This step filters the dataset, keeping only orders where the Amount is greater than or equal to $100 (defined by MIN_ORDER_AMOUNT). This simulates a common scenario where a pipeline filters out irrelevant or low-value transactions before further processing. The output cleansed_df represents a Curated Layer of data.
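A short sketch of the same filter on made-up rows shows that the boundary value is kept (the comparison is >=) and that .copy() yields a frame independent of the original.

```python
import pandas as pd

MIN_ORDER_AMOUNT = 100  # same threshold as in the script

# Made-up rows chosen to sit just below, exactly at, and above the threshold.
df = pd.DataFrame({"OrderID": [1, 2, 3], "Amount": [99.99, 100.0, 250.0]})

# Boolean mask: True for rows that satisfy the business rule
mask = df["Amount"] >= MIN_ORDER_AMOUNT
cleansed_df = df[mask].copy()
print(cleansed_df["OrderID"].tolist())  # [2, 3]

# Because of .copy(), changing cleansed_df leaves df untouched
cleansed_df["Amount"] = cleansed_df["Amount"] * 2
print(df["Amount"].tolist())  # [99.99, 100.0, 250.0]
```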

Step 2.3: Aggregation (Data Enrichment)
The final step of the data flow aggregates the filtered data:
aggregated_df = cleansed_df.groupby('CustomerID').agg(...)
This logic groups the data by CustomerID and calculates key metrics: the total number of orders, the total sales amount, and the average order value. This transformation creates a summary dataset (aggregated_df) that is optimized for reporting and analytics, representing a Presentation Layer of data.
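The aggregation can be tried on a few made-up rows. The sketch uses the same named-aggregation arguments as the script; only the sample data is invented.

```python
import pandas as pd

# Made-up cleansed orders: customer C1 has two orders, C2 has one.
cleansed_df = pd.DataFrame(
    {
        "OrderID": [1, 2, 3],
        "CustomerID": ["C1", "C1", "C2"],
        "Amount": [100.0, 300.0, 150.0],
    }
)

# Named aggregation: output_column=(input_column, aggregation_function)
aggregated_df = cleansed_df.groupby("CustomerID").agg(
    Total_Orders=("OrderID", "count"),
    Total_Sales=("Amount", "sum"),
    Average_Order_Value=("Amount", "mean"),
).reset_index()

print(aggregated_df)
```

For C1 this gives 2 orders, 400.0 in total sales, and an average order value of 200.0. The reset_index() call turns CustomerID back into a regular column so it is written out with the rest of the summary.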

3. Simulating Storage Layers
The load_data() function demonstrates how a single pipeline can feed multiple storage layers, a key concept in modern data architectures (like the Data Lakehouse):

1. cleansed_orders.csv: Stores the filtered, row-level data. This is analogous to a Silver Layer: clean, consistent, and ready for detailed analysis.
2. customer_sales_summary.csv: Stores the aggregated, summarized data. This is analogous to a Gold Layer: highly refined, business-ready, and optimized for dashboards and reports.
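The two-layer write can be sketched without touching the Colab paths. The example below is an assumption-laden stand-in: it writes made-up frames to a temporary directory (not /content/sample_data) and reads them back to confirm that each layer holds a different grain of data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frames standing in for the outputs of the transform stage.
cleansed_df = pd.DataFrame(
    {"OrderID": [1, 2], "CustomerID": ["C1", "C1"], "Amount": [100.0, 300.0]}
)
aggregated_df = pd.DataFrame({"CustomerID": ["C1"], "Total_Sales": [400.0]})

with tempfile.TemporaryDirectory() as tmp:
    # Temporary paths are an assumption for this sketch only.
    silver_path = Path(tmp) / "cleansed_orders.csv"
    gold_path = Path(tmp) / "customer_sales_summary.csv"

    # index=False keeps the pandas row index out of the files
    cleansed_df.to_csv(silver_path, index=False)
    aggregated_df.to_csv(gold_path, index=False)

    # Read both files back: row-level data versus one row per customer
    silver_rows = len(pd.read_csv(silver_path))
    gold_rows = len(pd.read_csv(gold_path))

print(silver_rows, gold_rows)  # 2 1
```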

In summary, the Python script clearly separates the Pipeline (the sequence of function calls) from the Data Flow (the logic inside transform_data), providing a concrete, executable example of the concepts discussed in your syllabus.


In [1]:
import pandas as pd
from datetime import datetime

# --- Configuration ---
SOURCE_FILE = "/content/sample_data/source_data.csv"
CLEANSED_FILE = "/content/sample_data/cleansed_orders.csv"
AGGREGATED_FILE = "/content/sample_data/customer_sales_summary.csv"
MIN_ORDER_AMOUNT = 100

def extract_data(file_path: str) -> pd.DataFrame:
    """
    Stage 1: Extraction (Simulated Ingestion)
    Reads the raw data from the source CSV file.
    """
    print(f"[{datetime.now().strftime('%H:%M:%S')}] Starting Stage 1: Extracting data from {file_path}...")
    try:
        # Use pandas to read the CSV file
        df = pd.read_csv(file_path)
        print(f"Successfully extracted {len(df)} records.")
        return df
    except FileNotFoundError:
        print(f"Error: Source file not found at {file_path}")
        return pd.DataFrame()

def transform_data(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Stage 2: Transformation (Data Flow Logic)
    Performs cleansing, filtering, and aggregation.
    """
    print(f"[{datetime.now().strftime('%H:%M:%S')}] Starting Stage 2: Transforming data...")

    # Work on a copy so the caller's raw DataFrame is not mutated by the steps below
    df = df.copy()

    # 2.1 Cleansing: Ensure 'Amount' is numeric and handle potential errors
    df['Amount'] = pd.to_numeric(df['Amount'], errors='coerce')
    df.dropna(subset=['Amount'], inplace=True)
    print(f"   - Cleansing complete. Remaining records: {len(df)}")

    # 2.2 Filtering: Filter out small orders (Data Flow Logic 1)
    cleansed_df = df[df['Amount'] >= MIN_ORDER_AMOUNT].copy()
    print(f"   - Filtered orders below ${MIN_ORDER_AMOUNT}. Records for analysis: {len(cleansed_df)}")

    # 2.3 Aggregation: Calculate total sales per customer (Data Flow Logic 2)
    aggregated_df = cleansed_df.groupby('CustomerID').agg(
        Total_Orders=('OrderID', 'count'),
        Total_Sales=('Amount', 'sum'),
        Average_Order_Value=('Amount', 'mean')
    ).reset_index()

    print(f"   - Aggregation complete. Generated summary for {len(aggregated_df)} customers.")

    return cleansed_df, aggregated_df

def load_data(cleansed_df: pd.DataFrame, aggregated_df: pd.DataFrame):
    """
    Stage 3: Load (Simulated Storage)
    Writes the transformed data to destination CSV files.
    """
    print(f"[{datetime.now().strftime('%H:%M:%S')}] Starting Stage 3: Loading data...")

    # Load 1: Cleansed data (Simulating a Silver/Curated layer)
    cleansed_df.to_csv(CLEANSED_FILE, index=False)
    print(f"   - Cleansed orders loaded to {CLEANSED_FILE}")

    # Load 2: Aggregated summary (Simulating a Gold/Presentation layer)
    aggregated_df.to_csv(AGGREGATED_FILE, index=False)
    print(f"   - Customer sales summary loaded to {AGGREGATED_FILE}")

    print(f"[{datetime.now().strftime('%H:%M:%S')}] Data Pipeline execution complete.")

def run_pipeline():
    """
    Orchestrates the entire data pipeline process.
    """
    # 1. Extract
    raw_data = extract_data(SOURCE_FILE)
    if raw_data.empty:
        print("Pipeline aborted due to extraction error.")
        return

    # 2. Transform (The Data Flow)
    cleansed_data, aggregated_summary = transform_data(raw_data)

    # 3. Load
    load_data(cleansed_data, aggregated_summary)

if __name__ == "__main__":
    run_pipeline()


[11:35:13] Starting Stage 1: Extracting data from /content/sample_data/source_data.csv...
Successfully extracted 10 records.
[11:35:13] Starting Stage 2: Transforming data...
   - Cleansing complete. Remaining records: 10
   - Filtered orders below $100. Records for analysis: 5
   - Aggregation complete. Generated summary for 4 customers.
[11:35:13] Starting Stage 3: Loading data...
   - Cleansed orders loaded to /content/sample_data/cleansed_orders.csv
   - Customer sales summary loaded to /content/sample_data/customer_sales_summary.csv
[11:35:13] Data Pipeline execution complete.
