# Real-Time Analytics - IBM Cloud Object Storage & Spark Integration

## **1. IBM Cloud Object Storage (COS) Configuration**
This section defines the necessary parameters to connect to IBM Cloud Object Storage:

- **`SERVICE_NAME`**: Logical service name for Spark to recognize the COS instance.
- **`ACCESS_KEY`**: Your IBM Cloud Object Storage access key.
- **`SECRET_KEY`**: Your IBM Cloud Object Storage secret key.
- **`ENDPOINT`**: The region-specific endpoint URL for COS (Frankfurt: `eu-de`).
- **`COS_BUCKET`**: The bucket where processed data is stored.
- **`COS_FILE`**: The name of the processed dataset file.


In [None]:
SERVICE_NAME = "mycos"  # logical service name
ACCESS_KEY = "access_key"         # Copy from your new service credentials
SECRET_KEY = "secret_key"
ENDPOINT = "s3.eu-de.cloud-object-storage.appdomain.cloud"  # Frankfurt region endpoint
COS_BUCKET = "processed-data-bucket"
COS_FILE = "processed_customer_purchase_behavior.csv"
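
Keys hard-coded in a notebook cell are easy to leak when the notebook is shared. As a minimal sketch, the credentials could instead be read from environment variables. The variable names `COS_ACCESS_KEY` and `COS_SECRET_KEY` and the helper `load_cos_credentials` are hypothetical and not part of the original notebook.

```python
import os


def load_cos_credentials(env=None):
    """Return (access_key, secret_key) read from environment variables.

    COS_ACCESS_KEY and COS_SECRET_KEY are hypothetical names chosen for
    this sketch; substitute whatever names your environment defines.
    """
    env = os.environ if env is None else env
    required = ("COS_ACCESS_KEY", "COS_SECRET_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return env["COS_ACCESS_KEY"], env["COS_SECRET_KEY"]
```

With the variables exported, the cell above could assign `ACCESS_KEY, SECRET_KEY = load_cos_credentials()` instead of using literal strings.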

## **🔧 Configuration Details**
- **`appName`**: `"Real_Time_Analytics"` - Sets the name of the Spark application.
- **`fs.stocator.scheme.list`**: `"cos"` - Tells Stocator to handle the `cos://` URI scheme.
- **`fs.cos.impl`**: `"com.ibm.stocator.fs.ObjectStoreFileSystem"` - Registers Stocator as the file system handler for COS.
- **Authentication Credentials**:
  - **`access.key`**: IBM COS Access Key.
  - **`secret.key`**: IBM COS Secret Key.
  - **`endpoint`**: The regional endpoint (Frankfurt: `s3.eu-de.cloud-object-storage.appdomain.cloud`).
- **`getOrCreate()`**: Returns the active Spark session if one already exists; otherwise creates a new one with the settings above.

In [None]:
# Initialize Spark Session with basic Stocator config using the logical service name
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Real_Time_Analytics") \
    .config("spark.hadoop.fs.stocator.scheme.list", "cos") \
    .config("spark.hadoop.fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem") \
    .config(f"spark.hadoop.fs.cos.{SERVICE_NAME}.access.key", ACCESS_KEY) \
    .config(f"spark.hadoop.fs.cos.{SERVICE_NAME}.secret.key", SECRET_KEY) \
    .config(f"spark.hadoop.fs.cos.{SERVICE_NAME}.endpoint", f"https://{ENDPOINT}") \
    .getOrCreate()

print("✅ Spark Session Initialized with COS Service Name:", SERVICE_NAME)

✅ Spark Session Initialized with COS Service Name: mycos
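
The builder settings all follow one naming pattern, so they can be collected in one place. The sketch below is an illustration only; `stocator_settings` is a hypothetical helper, not a Stocator or Spark API. It builds the same five keys as the cell above for any logical service name.

```python
def stocator_settings(service_name, access_key, secret_key, endpoint):
    """Build the spark.hadoop.* settings for one Stocator COS service."""
    prefix = f"spark.hadoop.fs.cos.{service_name}"
    return {
        "spark.hadoop.fs.stocator.scheme.list": "cos",
        "spark.hadoop.fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
        f"{prefix}.access.key": access_key,
        f"{prefix}.secret.key": secret_key,
        f"{prefix}.endpoint": f"https://{endpoint}",
    }


settings = stocator_settings(
    "mycos",
    "access_key",
    "secret_key",
    "s3.eu-de.cloud-object-storage.appdomain.cloud",
)
for key in sorted(settings):
    print(key)  # print only the key names, never the secret values
```

Each entry could then be passed to the builder in a loop with `builder.config(key, value)` before calling `getOrCreate()`.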


## **2. Configuring Hadoop for IBM Cloud Object Storage**
In this step, we explicitly set the **Hadoop configuration** to allow Spark to authenticate and interact with **IBM Cloud Object Storage (COS)**.

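The Spark session builder above already passes these settings through `spark.hadoop.*` properties. Setting them again directly on the active Hadoop configuration is a safeguard:
- **Existing sessions**: If `getOrCreate()` returned a session that was already running (common in managed notebook environments), the builder's `spark.hadoop.*` settings may not have reached the existing Hadoop configuration.
- **Shared configuration**: Values set here are visible to every subsequent read and write that uses this Spark context.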

---

## **Hadoop Configuration Parameters**
- **`fs.cos.{SERVICE_NAME}.access.key`**: Sets the access key for authentication.
- **`fs.cos.{SERVICE_NAME}.secret.key`**: Sets the secret key for secure access.
- **`fs.cos.{SERVICE_NAME}.endpoint`**: Defines the regional endpoint for object storage access.

In [None]:
# Explicitly set the Hadoop configuration for COS credentials
hadoopConf = spark._jsc.hadoopConfiguration()
hadoopConf.set(f"fs.cos.{SERVICE_NAME}.access.key", ACCESS_KEY)
hadoopConf.set(f"fs.cos.{SERVICE_NAME}.secret.key", SECRET_KEY)
hadoopConf.set(f"fs.cos.{SERVICE_NAME}.endpoint", f"https://{ENDPOINT}")

## Construct COS URL

Stocator addresses objects with URLs of the form `cos://<bucket>.<service>/<object>`. The URL below combines the bucket name, the logical service name, and the file name.

In [None]:
# Build the COS URL using the logical service name:
cos_url = f"cos://{COS_BUCKET}.{SERVICE_NAME}/{COS_FILE}"
print("Using COS URL:", cos_url)

Using COS URL: cos://processed-data-bucket.mycos/processed_customer_purchase_behavior.csv
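
To check that a URL has the expected shape before handing it to Spark, it can be split back into its parts. The sketch below uses only the standard library; `split_cos_url` is a hypothetical helper, and it assumes the logical service name contains no dots, so everything before the last dot in the host part is treated as the bucket.

```python
from urllib.parse import urlparse


def split_cos_url(url):
    """Split cos://<bucket>.<service>/<object> into (bucket, service, object)."""
    parsed = urlparse(url)
    if parsed.scheme != "cos":
        raise ValueError(f"Not a cos:// URL: {url}")
    # Assumption: the service name has no dots, so split at the last dot.
    bucket, dot, service = parsed.netloc.rpartition(".")
    if not (dot and bucket and service):
        raise ValueError(f"Expected <bucket>.<service> in: {url}")
    return bucket, service, parsed.path.lstrip("/")
```

For example, `split_cos_url(cos_url)` should return the bucket, service name, and file name that were used to build `cos_url`.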


## Configure Spark with COS Credentials

Spark configurations are updated to enable access to IBM Cloud Object Storage.

In [None]:
# Stocator credential keys are scoped by the logical service name, not the bucket name
spark.conf.set(f"spark.hadoop.fs.cos.{SERVICE_NAME}.access.key", ACCESS_KEY)
spark.conf.set(f"spark.hadoop.fs.cos.{SERVICE_NAME}.secret.key", SECRET_KEY)
spark.conf.set(f"spark.hadoop.fs.cos.{SERVICE_NAME}.endpoint", f"https://{ENDPOINT}")

print("✅ Spark Configuration Updated with COS Credentials")


✅ Spark Configuration Updated with COS Credentials


## Define Schema and Load Data from COS

The dataset schema is defined using `StructType` and `StructField`. The processed data is then loaded from IBM Cloud Object Storage using the specified schema.

In [None]:
# Define the schema for your dataset
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("Customer ID", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Gender", StringType(), True),
    StructField("Item Purchased", StringType(), True),
    StructField("Category", StringType(), True),
    StructField("Purchase Amount (USD)", DoubleType(), True),
    StructField("Location", StringType(), True),
    StructField("Size", StringType(), True),
    StructField("Color", StringType(), True),
    StructField("Season", StringType(), True),
    StructField("Review Rating", DoubleType(), True),
    StructField("Subscription Status", IntegerType(), True),
    StructField("Payment Method", StringType(), True),
    StructField("Shipping Type", StringType(), True),
    StructField("Discount Applied", IntegerType(), True),
    StructField("Promo Code Used", IntegerType(), True),
    StructField("Previous Purchases", IntegerType(), True),
    StructField("Preferred Payment Method", StringType(), True),
    StructField("Frequency of Purchases", StringType(), True),
    StructField("High Value Customer", StringType(), True)
])

# Read the data from COS using Stocator
processed_df = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv(cos_url)

print("✅ Processed Data Loaded Successfully!")
processed_df.show(5)
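
When an explicit schema is supplied, Spark's CSV reader by default applies it by column position and ignores the header names (the `enforceSchema` option defaults to `true`). A reordered or renamed column would therefore be misread without an error. The sketch below checks a header line against the expected column order in plain Python, without Spark. `EXPECTED_COLUMNS` mirrors the schema above, and `header_mismatches` is a hypothetical helper.

```python
import csv
import io
from itertools import zip_longest

# Column names in the same order as the StructType defined above.
EXPECTED_COLUMNS = [
    "Customer ID", "Age", "Gender", "Item Purchased", "Category",
    "Purchase Amount (USD)", "Location", "Size", "Color", "Season",
    "Review Rating", "Subscription Status", "Payment Method",
    "Shipping Type", "Discount Applied", "Promo Code Used",
    "Previous Purchases", "Preferred Payment Method",
    "Frequency of Purchases", "High Value Customer",
]


def header_mismatches(csv_text, expected=EXPECTED_COLUMNS):
    """Return (position, expected_name, found_name) for each differing column."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [
        (i, want, found)
        for i, (want, found) in enumerate(zip_longest(expected, header))
        if want != found
    ]
```

An empty result means the header matches the schema order. Alternatively, setting `.option("enforceSchema", "false")` on the reader asks Spark itself to validate the header against the schema.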

## Write Batch Data to a Temporary Directory

In [None]:
import os

# Define a temporary directory to store CSV files (adjust path if necessary)
temp_dir = "/tmp/stream_data"
os.makedirs(temp_dir, exist_ok=True)

# Write the processed DataFrame to this directory as CSV files (overwrite existing)
processed_df.write.mode("overwrite").option("header", "true").csv(temp_dir)
print("✅ Processed data written to:", temp_dir)


✅ Processed data written to: /tmp/stream_data
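
Spark writes a directory of part files rather than a single CSV, along with marker files such as `_SUCCESS`. As a small sketch, the data files can be listed with the standard library to see what the streaming reader will later pick up. `list_part_files` is a hypothetical helper, and the `part-*.csv` pattern assumes Spark's usual naming for uncompressed CSV output.

```python
import glob
import os


def list_part_files(directory):
    """Return the sorted CSV part files in a Spark output directory."""
    # Marker files such as _SUCCESS do not match this pattern.
    return sorted(glob.glob(os.path.join(directory, "part-*.csv")))
```

In this notebook it would be called as `list_part_files(temp_dir)`.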


## Define Schema and Read Streaming Data

The schema is defined using `StructType` and `StructField`. A streaming DataFrame is created by reading data from the temporary directory with a specified schema.

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Use the same schema as before
schema = StructType([
    StructField("Customer ID", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Gender", StringType(), True),
    StructField("Item Purchased", StringType(), True),
    StructField("Category", StringType(), True),
    StructField("Purchase Amount (USD)", DoubleType(), True),
    StructField("Location", StringType(), True),
    StructField("Size", StringType(), True),
    StructField("Color", StringType(), True),
    StructField("Season", StringType(), True),
    StructField("Review Rating", DoubleType(), True),
    StructField("Subscription Status", IntegerType(), True),
    StructField("Payment Method", StringType(), True),
    StructField("Shipping Type", StringType(), True),
    StructField("Discount Applied", IntegerType(), True),
    StructField("Promo Code Used", IntegerType(), True),
    StructField("Previous Purchases", IntegerType(), True),
    StructField("Preferred Payment Method", StringType(), True),
    StructField("Frequency of Purchases", StringType(), True),
    StructField("High Value Customer", StringType(), True)
])

# Read streaming data from the temporary directory
streaming_df = spark.readStream \
    .option("header", "true") \
    .option("maxFilesPerTrigger", 1) \
    .schema(schema) \
    .csv(temp_dir)

print("✅ Streaming DataFrame created from directory")


✅ Streaming DataFrame created from directory
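
The `maxFilesPerTrigger` option caps how many new files each micro-batch may read, which is what turns a static directory into a simulated stream. The sketch below is a plain-Python model of that grouping, not Spark's implementation. It assumes all files exist before the query starts and ignores Spark's ordering by file modification time. `plan_micro_batches` is a hypothetical name.

```python
def plan_micro_batches(files, max_files_per_trigger):
    """Group files into consecutive batches of at most max_files_per_trigger."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    return [
        files[i:i + max_files_per_trigger]
        for i in range(0, len(files), max_files_per_trigger)
    ]
```

With `maxFilesPerTrigger` set to 1 as above, each part file in the directory would correspond to its own micro-batch in this model.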


## Streaming Query for Aggregating Total Sales per Category

A streaming query computes the total purchase amount per category. Results are written to the console in `complete` output mode, so the full aggregated result table is re-emitted after every micro-batch rather than only the rows that changed.

### Key Configurations:
- **Grouping by "Category"**: Aggregates sales per product category.
- **Summing "Purchase Amount (USD)"**: Computes total sales for each category.
- **`outputMode("complete")`**: Ensures the full updated result is displayed in every batch.
- **`format("console")`**: Outputs the results to the console for monitoring.

In [None]:
# Run a streaming query that aggregates total sales per category and writes to the console.
stream_query = streaming_df.groupBy("Category").agg({"Purchase Amount (USD)": "sum"}) \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

print("✅ Streaming query started. Check the console output for streaming results.")
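
What `complete` mode means for an aggregation can be shown without Spark. The sketch below is a plain-Python model, not Spark code: it keeps running totals across micro-batches and emits the whole table after each one. The category names and amounts are made-up sample values, and `complete_mode_totals` is a hypothetical name.

```python
from collections import defaultdict


def complete_mode_totals(micro_batches):
    """Yield the full running-total table after each micro-batch."""
    totals = defaultdict(float)
    for batch in micro_batches:
        for category, amount in batch:
            totals[category] += amount
        # Complete mode: emit every category, not just those in this batch.
        yield dict(sorted(totals.items()))


# Made-up sample rows: (Category, Purchase Amount (USD))
sample_batches = [
    [("Clothing", 50.0), ("Footwear", 30.0)],
    [("Clothing", 20.0)],
]
for table in complete_mode_totals(sample_batches):
    print(table)
```

The second emitted table still includes `Footwear` even though the second batch contained no footwear rows.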

## Streaming Query Execution with Time-Controlled Execution

A streaming query is initiated to compute the total purchase amount per category in real time. The query runs for a fixed duration before being stopped.

### Key Configurations:
- **Grouping by "Category"**: Aggregates sales data for each product category.
- **Summing "Purchase Amount (USD)"**: Computes total sales in real time.
- **`outputMode("complete")`**: Displays the full updated result in every batch.
- **`format("console")`**: Prints the results to the console for monitoring.
- **`time.sleep(20)`**: Allows the query to run for 20 seconds before stopping.

In [None]:
import time

# Run a streaming query to compute total sales per category in real time
query = streaming_df.groupBy("Category").agg({"Purchase Amount (USD)": "sum"}) \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

# Let the streaming query run for a short period (e.g., 20 seconds)
time.sleep(20)
query.stop()
print("✅ Streaming query stopped.")

✅ Streaming query stopped.
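
An alternative to `time.sleep` is the query's own `awaitTermination(timeout)` method, which returns early if the query ends and raises if it failed. The sketch below wraps that pattern so the query is always stopped. `run_for` is a hypothetical helper that works with any object exposing `awaitTermination` and `stop`, such as a PySpark `StreamingQuery`.

```python
def run_for(query, seconds):
    """Let a streaming query run for at most `seconds`, then stop it.

    Returns whatever awaitTermination reports: for a PySpark
    StreamingQuery, True if the query ended on its own within the timeout.
    """
    try:
        finished = query.awaitTermination(seconds)
    finally:
        # Stop the query even if awaitTermination raised an error.
        query.stop()
    return finished
```

In the cell above, `run_for(query, 20)` could replace the `time.sleep(20)` and `query.stop()` pair.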


## Monitoring and Stopping the Streaming Query

This section ensures the streaming query runs for a controlled duration while providing insights into its execution status before stopping it.

### Key Configurations:
- **`time.sleep(20)`**: Simulates real-time processing by allowing the query to run for 20 seconds.
- **Query Status Monitoring**:
  - `stream_query.status`: Retrieves the current status of the streaming query.
  - `stream_query.lastProgress`: Fetches details about the last processing batch.
- **Query Termination**:
  - `stream_query.stop()`: Stops the streaming query once monitoring is finished.

In [None]:
import time

# Let the streaming query run for a while (e.g., 20 seconds) to simulate real-time processing.
time.sleep(20)

# Check streaming query status and progress (optional)
print("Query Status:", stream_query.status)
print("Last Progress:", stream_query.lastProgress)

# Stop the streaming query when finished.
stream_query.stop()
print("✅ Streaming query stopped.")


Query Status: {'message': 'Waiting for data to arrive', 'isDataAvailable': False, 'isTriggerActive': False}
Last Progress: {'id': '48111222-4447-43da-97b2-20bc58538bd9', 'runId': 'aa831cd6-17c6-4c08-b38f-e39d41095744', 'name': None, 'timestamp': '2025-02-24T15:35:48.129Z', 'batchId': 0, 'numInputRows': 0, 'inputRowsPerSecond': 0.0, 'processedRowsPerSecond': 0.0, 'durationMs': {'latestOffset': 4, 'triggerExecution': 4}, 'stateOperators': [], 'sources': [{'description': 'FileStreamSource[file:/tmp/stream_data]', 'startOffset': None, 'endOffset': None, 'latestOffset': None, 'numInputRows': 0, 'inputRowsPerSecond': 0.0, 'processedRowsPerSecond': 0.0}], 'sink': {'description': 'org.apache.spark.sql.execution.streaming.ConsoleTable$@51a20257', 'numOutputRows': -1}}
✅ Streaming query stopped.
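
The raw `lastProgress` dictionary is hard to scan. As a minimal sketch, the fields most useful for a quick check can be pulled into one line. `summarize_progress` is a hypothetical helper; it relies only on keys visible in the output above and handles the case where no progress has been reported yet.

```python
def summarize_progress(progress):
    """Condense a lastProgress dictionary into a one-line summary."""
    if progress is None:
        return "no progress reported yet"
    sources = progress.get("sources", [])
    source = sources[0]["description"] if sources else "unknown source"
    return (
        f"batch {progress['batchId']}: "
        f"{progress['numInputRows']} rows from {source}"
    )
```

It could be called as `print(summarize_progress(stream_query.lastProgress))`. A row count of 0, as in the output above, indicates that the batch read no input.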


The **RealTime_Analytics.ipynb** notebook successfully demonstrates real-time data processing using **Apache Spark Structured Streaming** and **IBM Cloud Object Storage (COS)**. The implementation covers key aspects of big data analytics, including:

- **Spark Session Initialization**: Configuring Spark with COS integration for seamless data access.
- **Data Loading & Schema Definition**: Ensuring structured and type-safe data ingestion.
- **Batch Processing**: Loading and analyzing pre-processed customer purchase data.
- **Real-time Streaming Processing**: Implementing Structured Streaming to process continuous data streams.
- **Aggregation & Monitoring**: Computing real-time total sales per category and tracking query progress.

This notebook serves as a foundation for real-time analytics pipelines, enabling businesses to gain insights from continuously incoming data streams. Further enhancements may include **visualization dashboards, advanced machine learning models, and cloud deployment** for scalable, production-ready solutions.