# **Predictive Modeling with Apache Spark and IBM Cloud Object Storage**

## **Installing Required Dependencies**  
To interact with IBM Cloud Object Storage (COS) and process data with Apache Spark, install the required dependencies.

In [44]:
# !pip install ibm-cos-sdk

- The `ibm-cos-sdk` library allows interaction with IBM Cloud Object Storage.

## **Importing Necessary Libraries**  
The following libraries are essential for data processing and cloud storage interaction:

In [45]:
import os
import sys
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
import ibm_boto3
from ibm_botocore.client import Config
import pandas as pd
import io

### **Library Functions:**
- `pyspark.sql.SparkSession`: Manages Spark functionalities.  
- `pyspark.sql.types`: Defines structured schemas for Spark DataFrames.  
- `ibm_boto3` & `ibm_botocore.client.Config`: Enables interaction with IBM Cloud Object Storage (COS).  
- `pandas` & `io`: Handles data in memory, allowing efficient processing.  

## **Setting Up IBM Cloud Object Storage Client**  
To interact with IBM Cloud Object Storage (COS), we must configure authentication credentials and initialize a client.  

In [46]:
# IBM Cloud Object Storage Credentials
cos_credentials = {
    "access_key_id": "891d47bfa17346038b970fa46d63fd41",
    "secret_access_key": "25fb5c98b01a4359fc5a11fcd848ae60debd89c2ac7ebf66",
    "endpoint": "https://s3.eu-de.cloud-object-storage.appdomain.cloud",
}

# Create IBM COS Client
cos_client = ibm_boto3.client(
    service_name="s3",
    aws_access_key_id=cos_credentials["access_key_id"],
    aws_secret_access_key=cos_credentials["secret_access_key"],
    endpoint_url=cos_credentials["endpoint"]
)

BUCKET_NAME = "processed-data-bucket"
FILE_NAME = "processed_customer_purchase_behavior.csv"

### **Explanation:**
- `cos_credentials`: A dictionary storing IBM COS credentials (access key, secret key, and endpoint).  
- `ibm_boto3.client("s3")`: Initializes an IBM COS client for interacting with the object storage service.  
- `BUCKET_NAME`: The name of the storage bucket where data is stored.  
- `FILE_NAME`: The specific file to be accessed for predictive modeling.  

✅ With this setup, we can now retrieve data from IBM Cloud Object Storage securely.

## **Initializing Spark Session**  
Apache Spark is a powerful distributed processing system. Here, we initialize a Spark session to process large datasets.  

In [47]:
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Predictive_Modeling") \
    .master("local[*]") \
    .getOrCreate()

print("✅ Spark Session Initialized!")


✅ Spark Session Initialized!


### **Key Parameters:**
- `.appName("Predictive_Modeling")`: Assigns a name to the Spark application.  
- `.master("local[*]")`: Runs Spark locally using all available CPU cores.  
- `.getOrCreate()`: Creates a new Spark session if one does not exist.  

✅ If successful, Spark will display an initialization message.

## **Downloading Data from IBM Cloud Object Storage**  
The dataset is stored in IBM Cloud Object Storage. To analyze it, we first retrieve the file from the specified bucket.  

In [48]:
# Download the file from COS
response = cos_client.get_object(Bucket=BUCKET_NAME, Key=FILE_NAME)
file_content = response["Body"].read().decode("utf-8")

### **Explanation:**
- `cos_client.get_object()`: Fetches the file from COS.  
- `.read().decode("utf-8")`: Reads and decodes the file content for processing.  

## **Converting Data to Pandas and Saving Locally**  
Once the dataset is retrieved, it is loaded into a Pandas DataFrame for further processing.

In [49]:
data = pd.read_csv(io.StringIO(file_content))

data.to_csv("/content/" + FILE_NAME, index=False)  # Save locally for Spark to read

### **Why Save Locally?**  
- `pd.read_csv()`: Loads the dataset into a Pandas DataFrame.  
- `to_csv()`: Saves the file locally so that it can be read by Apache Spark.  

This ensures that Spark can efficiently access the data for further transformations.

## **Defining the Schema and Loading Processed Data**  

To ensure structured data processing, we define an explicit schema for our dataset before loading it into a Spark DataFrame. This guarantees data integrity and consistency.

In [50]:
import os
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Define your file path (Ensure this is the correct path)
#LOCAL_FILE_PATH = "processed_data_bucket/processed_customer_purchase_behavior.csv"

# Define the schema of your processed data
schema = StructType([
    StructField("Customer ID", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Gender", StringType(), True),
    StructField("Item Purchased", StringType(), True),
    StructField("Category", StringType(), True),
    StructField("Purchase Amount (USD)", DoubleType(), True),
    StructField("Location", StringType(), True),
    StructField("Size", StringType(), True),
    StructField("Color", StringType(), True),
    StructField("Season", StringType(), True),
    StructField("Review Rating", DoubleType(), True),
    StructField("Subscription Status", IntegerType(), True),
    StructField("Payment Method", StringType(), True),
    StructField("Shipping Type", StringType(), True),
    StructField("Discount Applied", IntegerType(), True),
    StructField("Promo Code Used", IntegerType(), True),
    StructField("Previous Purchases", IntegerType(), True),
    StructField("Preferred Payment Method", StringType(), True),
    StructField("Frequency of Purchases", StringType(), True),
    StructField("High Value Customer", StringType(), True)
])

# Read the processed data from local CSV file
processed_df = spark.read.option("header", "true").schema(schema).csv("/content/" + FILE_NAME)

# Display schema and first few rows
print("✅ Processed Data Loaded Successfully!")
processed_df.printSchema()

✅ Processed Data Loaded Successfully!
root
 |-- Customer ID: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Item Purchased: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Purchase Amount (USD): double (nullable = true)
 |-- Location: string (nullable = true)
 |-- Size: string (nullable = true)
 |-- Color: string (nullable = true)
 |-- Season: string (nullable = true)
 |-- Review Rating: double (nullable = true)
 |-- Subscription Status: integer (nullable = true)
 |-- Payment Method: string (nullable = true)
 |-- Shipping Type: string (nullable = true)
 |-- Discount Applied: integer (nullable = true)
 |-- Promo Code Used: integer (nullable = true)
 |-- Previous Purchases: integer (nullable = true)
 |-- Preferred Payment Method: string (nullable = true)
 |-- Frequency of Purchases: string (nullable = true)
 |-- High Value Customer: string (nullable = true)



### **Explanation:**
- **`StructType([])`**: Defines the schema as a structured collection of fields.  
- **Each `StructField` contains:**
  - Column Name  
  - Data Type (e.g., `StringType()`, `IntegerType()`, `DoubleType()`)  
  - Nullability (`True` if the column can have missing values)  
- **Spark Read Operation**:  
  - Reads the CSV file while applying the predefined schema.  
  - Ensures proper data types instead of relying on automatic type inference.  
- **Validation**:  
  - `printSchema()`: Displays the structure of the loaded DataFrame to confirm correctness.  

✅ **This schema-based loading ensures structured data ingestion, reducing errors in subsequent analysis.**  

In [51]:
processed_df.limit(10).toPandas()

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Payment Method,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Preferred Payment Method,Frequency of Purchases,High Value Customer
0,1,55,Male,Blouse,Clothing,53.0,Kentucky,L,Gray,Winter,3.1,1,Credit Card,Express,1,1,14,Venmo,Fortnightly,No
1,2,19,Male,Sweater,Clothing,64.0,Maine,L,Maroon,Winter,3.1,1,Bank Transfer,Express,1,1,2,Cash,Fortnightly,No
2,3,50,Male,Jeans,Clothing,73.0,Massachusetts,S,Maroon,Spring,3.1,1,Cash,Free Shipping,1,1,23,Credit Card,Weekly,No
3,4,21,Male,Sandals,Footwear,90.0,Rhode Island,M,Maroon,Spring,3.5,1,PayPal,Next Day Air,1,1,49,PayPal,Weekly,Yes
4,5,45,Male,Blouse,Clothing,49.0,Oregon,M,Turquoise,Spring,2.7,1,Cash,Free Shipping,1,1,31,PayPal,Annually,No
5,6,46,Male,Sneakers,Footwear,20.0,Wyoming,M,White,Summer,2.9,1,Venmo,Standard,1,1,14,Venmo,Weekly,No
6,7,63,Male,Shirt,Clothing,85.0,Montana,M,Gray,Fall,3.2,1,Debit Card,Free Shipping,1,1,49,Cash,Quarterly,Yes
7,8,27,Male,Shorts,Clothing,34.0,Louisiana,L,Charcoal,Winter,3.2,1,Debit Card,Free Shipping,1,1,19,Credit Card,Weekly,No
8,9,26,Male,Coat,Outerwear,97.0,West Virginia,L,Silver,Summer,2.6,1,Venmo,Express,1,1,8,Venmo,Annually,Yes
9,10,57,Male,Handbag,Accessories,31.0,Missouri,M,Pink,Spring,4.8,1,PayPal,2-Day Shipping,1,1,4,Cash,Quarterly,No


## **Handling Missing Values**  

To ensure data completeness, we compute the number of missing values in each column.

In [52]:
from pyspark.sql.functions import col, sum

# Count missing values in each column
missing_values = processed_df.select([sum(col(c).isNull().cast("int")).alias(c) for c in processed_df.columns])
print("✅ Missing Values per Column:")
missing_values.limit(1).toPandas()


✅ Missing Values per Column:


Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Payment Method,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Preferred Payment Method,Frequency of Purchases,High Value Customer
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


- Uses `isNull().cast("int")` to convert missing values to integers (1 for missing, 0 for present).
- Aggregates the sum for each column to get the count of missing values.
- Converts the result to a Pandas DataFrame for better readability.

## **Counting Unique Values per Column**  

Unique values indicate the variety of data in each column, which is essential for understanding categorical features.

- `countDistinct(col(c))` counts unique values for each column.
- Helps in identifying categorical vs. numerical variables.
- Provides insights into feature cardinality, useful for modeling.

In [53]:
from pyspark.sql.functions import countDistinct

# Count unique values for each column
unique_counts = processed_df.select([countDistinct(col(c)).alias(c) for c in processed_df.columns])

print("✅ Unique Value Counts per Column:")
unique_counts.limit(1).toPandas()


✅ Unique Value Counts per Column:


Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Payment Method,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Preferred Payment Method,Frequency of Purchases,High Value Customer
0,3900,53,2,25,4,81,50,4,25,4,26,2,6,6,2,2,50,6,7,2


## **Distribution Analysis for Categorical Features**  

We analyze categorical columns to understand the distribution of each feature.

- Groups the data by categorical features.
- Counts the occurrences of each unique value.
- Helps in identifying imbalanced distributions in categorical variables.


In [54]:
categorical_features = ["Subscription Status", "Discount Applied", "Promo Code Used", "High Value Customer"]

for col_name in categorical_features:
    print(f"✅ Distribution for {col_name}:")
    processed_df.groupBy(col_name).count().show()


✅ Distribution for Subscription Status:
+-------------------+-----+
|Subscription Status|count|
+-------------------+-----+
|                  1| 1053|
|                  0| 2847|
+-------------------+-----+

✅ Distribution for Discount Applied:
+----------------+-----+
|Discount Applied|count|
+----------------+-----+
|               1| 1677|
|               0| 2223|
+----------------+-----+

✅ Distribution for Promo Code Used:
+---------------+-----+
|Promo Code Used|count|
+---------------+-----+
|              1| 1677|
|              0| 2223|
+---------------+-----+

✅ Distribution for High Value Customer:
+-------------------+-----+
|High Value Customer|count|
+-------------------+-----+
|                 No| 2974|
|                Yes|  926|
+-------------------+-----+



## **Statistical Analysis of Purchase Amount**  

To summarize purchasing patterns, we calculate key statistical metrics.

- Computes **minimum, maximum, mean, and median** purchase amounts.
- `percentile_approx()` is used for efficient median calculation.
- Helps in understanding spending behavior across customers.

In [55]:
from pyspark.sql.functions import min, max, avg, percentile_approx

purchase_stats = processed_df.select(
    min("Purchase Amount (USD)").alias("Min_Purchase"),
    max("Purchase Amount (USD)").alias("Max_Purchase"),
    avg("Purchase Amount (USD)").alias("Mean_Purchase"),
    percentile_approx("Purchase Amount (USD)", 0.5).alias("Median_Purchase")
)

print("✅ Purchase Amount Statistics:")
purchase_stats.show()


✅ Purchase Amount Statistics:
+------------+------------+-----------------+---------------+
|Min_Purchase|Max_Purchase|    Mean_Purchase|Median_Purchase|
+------------+------------+-----------------+---------------+
|        20.0|       100.0|59.76435897435898|           60.0|
+------------+------------+-----------------+---------------+



## **Detecting Outliers in Purchase Amount**  

We use the Interquartile Range (IQR) method to detect potential outliers.

- **Q1 & Q3**: 25th and 75th percentiles of purchase amounts.
- **IQR (Interquartile Range)**: Q3 - Q1.
- **Outlier Criteria**: Values below `Q1 - 1.5*IQR` or above `Q3 + 1.5*IQR` are considered outliers.
- Helps in identifying anomalies in purchasing behavior.

In [56]:
quantiles = processed_df.approxQuantile("Purchase Amount (USD)", [0.25, 0.75], 0.05)
Q1, Q3 = quantiles[0], quantiles[1]
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = processed_df.filter(
    (col("Purchase Amount (USD)") < lower_bound) | (col("Purchase Amount (USD)") > upper_bound)
).count()

print(f"✅ Number of Outliers in 'Purchase Amount (USD)': {outliers}")


✅ Number of Outliers in 'Purchase Amount (USD)': 0


## **Categorical Feature Distribution Analysis**  

We analyze distributions across all categorical variables.

- Groups data by each categorical column.
- Counts occurrences of each unique value.
- Helps in understanding category distributions, detecting class imbalances, and improving feature selection.

In [57]:
from pyspark.sql.functions import count

# Categorical columns to analyze
categorical_cols = ["Gender", "Item Purchased", "Category", "Location", "Size", "Color",
                    "Season", "Payment Method", "Shipping Type", "Preferred Payment Method", "Frequency of Purchases"]

for col_name in categorical_cols:
    print(f"✅ Distribution for {col_name}:")
    processed_df.groupBy(col_name).agg(count("*").alias("count")).show(truncate=False)


✅ Distribution for Gender:
+------+-----+
|Gender|count|
+------+-----+
|Female|1248 |
|Male  |2652 |
+------+-----+

✅ Distribution for Item Purchased:
+--------------+-----+
|Item Purchased|count|
+--------------+-----+
|T-shirt       |147  |
|Jacket        |163  |
|Sneakers      |145  |
|Belt          |161  |
|Dress         |166  |
|Sweater       |164  |
|Hat           |154  |
|Coat          |161  |
|Sunglasses    |161  |
|Pants         |171  |
|Hoodie        |151  |
|Handbag       |153  |
|Gloves        |140  |
|Backpack      |143  |
|Shirt         |169  |
|Shoes         |150  |
|Blouse        |171  |
|Jewelry       |171  |
|Boots         |144  |
|Shorts        |157  |
+--------------+-----+
only showing top 20 rows

✅ Distribution for Category:
+-----------+-----+
|Category   |count|
+-----------+-----+
|Outerwear  |324  |
|Clothing   |1737 |
|Footwear   |599  |
|Accessories|1240 |
+-----------+-----+

✅ Distribution for Location:
+-------------+-----+
|Location     |count|
+-----

## **Categorical Feature Encoding**
To prepare categorical variables for machine learning, we use the following steps:
1. **String Indexing**: Converts categorical values into numerical indices.
2. **One-Hot Encoding**: Converts indexed categorical features into sparse vectors.
3. **Pipeline Processing**: Applies transformations efficiently.

After encoding, the original categorical columns are dropped.

✅ **Outcome**: All categorical features are now numerically represented and ready for model training.


In [58]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline

categorical_features = ["Item Purchased", "Category", "Location", "Size", "Color",
                        "Season", "Payment Method", "Shipping Type",
                        "Preferred Payment Method", "Frequency of Purchases"]

indexers = [StringIndexer(inputCol=col, outputCol=col+"_index") for col in categorical_features]
encoders = [OneHotEncoder(inputCol=col+"_index", outputCol=col+"_encoded") for col in categorical_features]

pipeline = Pipeline(stages=indexers + encoders)
processed_df = pipeline.fit(processed_df).transform(processed_df)

# Drop original categorical columns after encoding
processed_df = processed_df.drop(*categorical_features)

print("✅ Categorical Encoding Completed!")
processed_df.printSchema()


✅ Categorical Encoding Completed!
root
 |-- Customer ID: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Purchase Amount (USD): double (nullable = true)
 |-- Review Rating: double (nullable = true)
 |-- Subscription Status: integer (nullable = true)
 |-- Discount Applied: integer (nullable = true)
 |-- Promo Code Used: integer (nullable = true)
 |-- Previous Purchases: integer (nullable = true)
 |-- High Value Customer: string (nullable = true)
 |-- Item Purchased_index: double (nullable = false)
 |-- Category_index: double (nullable = false)
 |-- Location_index: double (nullable = false)
 |-- Size_index: double (nullable = false)
 |-- Color_index: double (nullable = false)
 |-- Season_index: double (nullable = false)
 |-- Payment Method_index: double (nullable = false)
 |-- Shipping Type_index: double (nullable = false)
 |-- Preferred Payment Method_index: double (nullable = false)
 |-- Frequency of Purchases_index: double (nulla

## **Numeric Feature Scaling**
To normalize numeric features, we:
1. **Assemble Numeric Features**: Combines numeric columns into a feature vector.
2. **Min-Max Scaling**: Scales values between 0 and 1 for better model performance.
3. **Pipeline Processing**: Ensures transformations are applied efficiently.



In [59]:
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# Assemble numeric features for scaling
numeric_features = ["Purchase Amount (USD)", "Review Rating"]
assembler = VectorAssembler(inputCols=numeric_features, outputCol="num_features")

# Apply Min-Max Scaling
scaler = MinMaxScaler(inputCol="num_features", outputCol="scaled_features")

pipeline = Pipeline(stages=[assembler, scaler])
processed_df = pipeline.fit(processed_df).transform(processed_df)

# Drop original numeric columns after scaling
processed_df = processed_df.drop(*numeric_features)

print("✅ Numeric Feature Scaling Completed!")
processed_df.printSchema()


✅ Numeric Feature Scaling Completed!
root
 |-- Customer ID: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Subscription Status: integer (nullable = true)
 |-- Discount Applied: integer (nullable = true)
 |-- Promo Code Used: integer (nullable = true)
 |-- Previous Purchases: integer (nullable = true)
 |-- High Value Customer: string (nullable = true)
 |-- Item Purchased_index: double (nullable = false)
 |-- Category_index: double (nullable = false)
 |-- Location_index: double (nullable = false)
 |-- Size_index: double (nullable = false)
 |-- Color_index: double (nullable = false)
 |-- Season_index: double (nullable = false)
 |-- Payment Method_index: double (nullable = false)
 |-- Shipping Type_index: double (nullable = false)
 |-- Preferred Payment Method_index: double (nullable = false)
 |-- Frequency of Purchases_index: double (nullable = false)
 |-- Item Purchased_encoded: vector (nullable = true)
 |-- Category_encoded: vecto

## **Feature Engineering & Assembly**
1. **Feature Selection**: Combines encoded categorical and scaled numerical features.
2. **Vector Assembling**: Merges all features into a single feature vector.
3. **Column Cleanup**: Drops unnecessary columns after transformation.



In [60]:
from pyspark.ml.feature import VectorAssembler

# Selecting encoded categorical & scaled numeric features
feature_cols = [
    "Item Purchased_encoded", "Category_encoded", "Location_encoded", "Size_encoded",
    "Color_encoded", "Season_encoded", "Payment Method_encoded", "Shipping Type_encoded",
    "Preferred Payment Method_encoded", "Frequency of Purchases_encoded", "scaled_features"
]

# Assemble all features into one vector
feature_assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Transform data
processed_df = feature_assembler.transform(processed_df)

# Drop redundant columns
columns_to_drop = [
    "Customer ID", "Gender", "Item Purchased_index", "Category_index", "Location_index", "Size_index",
    "Color_index", "Season_index", "Payment Method_index", "Shipping Type_index",
    "Preferred Payment Method_index", "Frequency of Purchases_index",
    "Item Purchased_encoded", "Category_encoded", "Location_encoded", "Size_encoded", "Color_encoded",
    "Season_encoded", "Payment Method_encoded", "Shipping Type_encoded", "Preferred Payment Method_encoded",
    "Frequency of Purchases_encoded", "scaled_features", "num_features"
]

processed_df = processed_df.drop(*columns_to_drop)

print("✅ Features Assembled & Unnecessary Columns Dropped!")
processed_df.printSchema()


✅ Features Assembled & Unnecessary Columns Dropped!
root
 |-- Age: integer (nullable = true)
 |-- Subscription Status: integer (nullable = true)
 |-- Discount Applied: integer (nullable = true)
 |-- Promo Code Used: integer (nullable = true)
 |-- Previous Purchases: integer (nullable = true)
 |-- High Value Customer: string (nullable = true)
 |-- features: vector (nullable = true)



## **Import Required Libraries for Model Training & Evaluation**

This cell imports essential PySpark libraries for machine learning model training, evaluation, and feature transformation.

- **SparkSession**: Required for Spark DataFrame operations.
- **StringIndexer**: Converts categorical labels into numerical indices for machine learning models.
- **RandomForestClassifier & LogisticRegression**: Classification models for predicting high-value customers.
- **MulticlassClassificationEvaluator & BinaryClassificationEvaluator**: Evaluators for assessing model performance using metrics such as accuracy, precision, recall, F1-score, and AUC.
- **Pipeline**: Helps in structuring ML workflows by chaining multiple transformations and model training steps into a single execution.


In [61]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.ml import Pipeline


## **Target Variable Processing**
1. **String Indexing**: Converts the target column (`High Value Customer`) into a numerical label.
2. **Train-Test Split**: Splits data into 80% training and 20% testing for evaluation.


In [62]:
# ✅ Convert target column to numeric (needed for ML)
indexer = StringIndexer(inputCol="High Value Customer", outputCol="label")
processed_df = indexer.fit(processed_df).transform(processed_df).drop("High Value Customer")

# ✅ Train-Test Split (80-20)
train_data, test_data = processed_df.randomSplit([0.8, 0.2], seed=42)
print(f"✅ Data Split Completed: Train ({train_data.count()}), Test ({test_data.count()})")


✅ Data Split Completed: Train (3177), Test (723)


## **Model Training & Evaluation**
1. **Define Models**: Implements `RandomForestClassifier` and `LogisticRegression`.
2. **Define Evaluation Metrics**: Includes Accuracy, Precision, Recall, F1 Score, and AUC.
3. **Train & Evaluate Models**: Uses Spark ML Pipeline to train models and evaluate performance.


In [63]:
# Define your classifiers
models = {
    "Random Forest": RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=50),
    "Logistic Regression": LogisticRegression(featuresCol="features", labelCol="label", maxIter=20),
}

# Define evaluation metrics
evaluators = {
    "Accuracy": MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy"),
    "Precision": MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedPrecision"),
    "Recall": MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedRecall"),
    "F1 Score": MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1"),
    "AUC": BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC"),
}

### 🚀 Model Training, Evaluation, and Cloud Storage Function

This function **trains a model, evaluates its performance, saves predictions and the trained model**, and optionally uploads them to IBM Cloud Object Storage (COS).

#### 📌 **Function: `evaluate_model`**
- **Inputs:**
  - `model_name`: Name of the model (e.g., "Random Forest", "Logistic Regression").
  - `model`: The PySpark ML model instance.
  - `train_data`: Training dataset.
  - `test_data`: Testing dataset.
  - `cos_client`: IBM Cloud Object Storage client.
  - `bucket_name`: IBM COS bucket where the model and predictions will be stored.
  - `save_path`: Optional directory path to save the model and predictions locally before upload.

- **Workflow:**
  1. Uses a **Pipeline** to train the model on `train_data`.
  2. Generates **predictions** on `test_data`.
  3. Evaluates **Accuracy, Precision, Recall, F1 Score, and AUC** using predefined evaluators.
  4. **Displays performance metrics** in a readable format.
  5. **Saves predictions and the trained model** locally if `save_path` is provided.
  6. **Compresses the saved outputs** (Parquet files and model files) into ZIP format.
  7. **Uploads the compressed ZIP files** to IBM Cloud Object Storage.

- **Returns:**
  - A tuple containing (`model_name`, Accuracy, Precision, Recall, F1 Score, AUC).

#### 📌 **Execution**
- The function is executed for all models stored in the `models` dictionary.
- The results are stored in the `model_results` list.

This approach ensures an **automated, scalable** pipeline for model training, evaluation, and deployment to the cloud. ✅


In [64]:
import shutil
import os

# Define a function to train, evaluate, and save predictions & model to COS.
def evaluate_model(model_name, model, train_data, test_data, cos_client, bucket_name, save_path=None):
    """
    Trains a model, evaluates performance, saves predictions and model if save_path is provided.
    Returns a tuple: (model_name, Accuracy, Precision, Recall, F1 Score, AUC)
    """
    pipeline = Pipeline(stages=[model])
    trained_model = pipeline.fit(train_data)
    predictions = trained_model.transform(test_data)

    # Evaluate performance metrics
    metrics = {metric_name: evaluator.evaluate(predictions)
               for metric_name, evaluator in evaluators.items()}

    print(f"\n🔹 {model_name} Performance Metrics:")
    for metric, value in metrics.items():
        print(f"✅ {metric}: {value:.4f}")

    if save_path is not None:
        model_folder = f"{save_path}/{model_name.replace(' ', '_')}_model"
        predictions_folder = f"{save_path}/{model_name.replace(' ', '_')}_predictions"

        # ✅ Save predictions as Parquet (Spark saves it as a directory)
        predictions.write.mode("overwrite").parquet(predictions_folder)
        print(f"✅ Predictions for {model_name} saved to: {predictions_folder}")

        # ✅ Save trained model
        trained_model.write().overwrite().save(model_folder)
        print(f"✅ Model saved to: {model_folder}")

        # ✅ Zip the Parquet directory
        predictions_zip = f"{predictions_folder}.zip"
        shutil.make_archive(predictions_folder, 'zip', predictions_folder)

        model_zip = f"{model_folder}.zip"
        shutil.make_archive(model_folder, 'zip', model_folder)

        # ✅ Upload to IBM COS
        with open(predictions_zip, "rb") as f:
            cos_client.put_object(Bucket=bucket_name, Key=f"{model_name.replace(' ', '_')}_predictions.zip", Body=f.read())

        with open(model_zip, "rb") as f:
            cos_client.put_object(Bucket=bucket_name, Key=f"{model_name.replace(' ', '_')}_model.zip", Body=f.read())

        print(f"✅ Model and predictions uploaded to IBM Cloud Object Storage as ZIP files.")

    return (model_name, *metrics.values())

# Evaluate models and store results
model_results = [evaluate_model(name, model, train_data, test_data, cos_client, BUCKET_NAME, save_path="/mnt/data")
                 for name, model in models.items()]



🔹 Random Forest Performance Metrics:
✅ Accuracy: 0.9461
✅ Precision: 0.9497
✅ Recall: 0.9461
✅ F1 Score: 0.9439
✅ AUC: 0.9992
✅ Predictions for Random Forest saved to: /mnt/data/Random_Forest_predictions
✅ Model saved to: /mnt/data/Random_Forest_model
✅ Model and predictions uploaded to IBM Cloud Object Storage as ZIP files.

🔹 Logistic Regression Performance Metrics:
✅ Accuracy: 0.9723
✅ Precision: 0.9723
✅ Recall: 0.9723
✅ F1 Score: 0.9723
✅ AUC: 0.9974
✅ Predictions for Logistic Regression saved to: /mnt/data/Logistic_Regression_predictions
✅ Model saved to: /mnt/data/Logistic_Regression_model
✅ Model and predictions uploaded to IBM Cloud Object Storage as ZIP files.


## **Model Performance Comparison**
1. **Convert Results to DataFrame**: Store model evaluation metrics in a structured format.
2. **Upload Comparison Results**: Save model performance comparison as a CSV file to IBM COS.
3. **Select Best Model**: Choose the model with the highest AUC score.


In [65]:
from pyspark.sql import DataFrame
import pandas as pd
import io

# Convert results to Pandas DataFrame (excluding predictions)
results_pd = pd.DataFrame(
    model_results,
    columns=["Model", *evaluators.keys()]
)

print("✅ Model Comparison Results:")
print(results_pd)

# Upload model comparison results to IBM COS
csv_buffer = io.StringIO()
results_pd.to_csv(csv_buffer, index=False)
cos_client.put_object(Bucket=BUCKET_NAME, Key="model_comparison_results.csv", Body=csv_buffer.getvalue())

print("✅ Model comparison results saved to IBM Cloud Object Storage")

# Extract best model based on highest AUC
best_model_name = results_pd.loc[results_pd["AUC"].idxmax(), "Model"]
print(f"\n🔹 Best model based on AUC is: {best_model_name}")


✅ Model Comparison Results:
                 Model  Accuracy  Precision    Recall  F1 Score       AUC
0        Random Forest  0.946058   0.949730  0.946058  0.943905  0.999188
1  Logistic Regression  0.972337   0.972337  0.972337  0.972337  0.997384
✅ Model comparison results saved to IBM Cloud Object Storage

🔹 Best model based on AUC is: Random Forest


In [66]:
!mv /content/predictions/Random_Forest_predictions.parquet "/content/drive/MyDrive/IBM_Project/"

mv: cannot stat '/content/predictions/Random_Forest_predictions.parquet': No such file or directory


## **Download & Extract Predictions from IBM COS**
1. **Download ZIP File**: Fetches the model predictions from IBM Cloud Object Storage.
2. **Extract Contents**: Unzips the downloaded file for further analysis.
3. **Load into Spark**: Reads extracted Parquet files into a Spark DataFrame for inspection.


In [67]:
import zipfile
import os

def extract_zip_from_cos(cos_client, bucket_name, zip_filename, extract_path="/mnt/data"):
    """
    Downloads a ZIP file from IBM COS, extracts it, and returns the extracted folder path.
    """
    local_zip_path = f"{extract_path}/{zip_filename}"

    # ✅ Download the ZIP file from COS
    with open(local_zip_path, "wb") as f:
        cos_client.download_fileobj(bucket_name, zip_filename, f)

    print(f"✅ {zip_filename} downloaded from IBM COS.")

    # ✅ Extract the ZIP file
    extracted_folder = local_zip_path.replace(".zip", "")
    with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
        zip_ref.extractall(extracted_folder)

    print(f"✅ {zip_filename} extracted to {extracted_folder}")

    return extracted_folder

# Example usage:
predictions_folder = extract_zip_from_cos(cos_client, BUCKET_NAME, "Random_Forest_predictions.zip")

# ✅ Load extracted Parquet data into Spark DataFrame
predictions_df = spark.read.parquet(predictions_folder)
predictions_df.show(5)


✅ Random_Forest_predictions.zip downloaded from IBM COS.
✅ Random_Forest_predictions.zip extracted to /mnt/data/Random_Forest_predictions
+---+-------------------+----------------+---------------+------------------+--------------------+-----+--------------------+--------------------+----------+
|Age|Subscription Status|Discount Applied|Promo Code Used|Previous Purchases|            features|label|       rawPrediction|         probability|prediction|
+---+-------------------+----------------+---------------+------------------+--------------------+-----+--------------------+--------------------+----------+
| 18|                  0|               0|              0|                 3|(129,[6,37,76,94,...|  0.0|[44.1882008211829...|[0.88376401642365...|       0.0|
| 18|                  0|               0|              0|                 5|(129,[12,24,34,76...|  0.0|[43.9732433678702...|[0.87946486735740...|       0.0|
| 18|                  0|               0|              0|              

## **Validating Predictions File**
1. **Check Parquet File Existence**: Ensures the extracted predictions file is valid.
2. **Load Predictions**: Reads and displays sample predictions.



In [68]:
from pyspark.sql.utils import AnalysisException

try:
    predictions_df = spark.read.parquet(predictions_folder)
    print("✅ Parquet file loaded successfully!")
    predictions_df.show(5)  # Show first 5 rows
except AnalysisException:
    print("❌ No valid Parquet file found in extracted folder!")


✅ Parquet file loaded successfully!
+---+-------------------+----------------+---------------+------------------+--------------------+-----+--------------------+--------------------+----------+
|Age|Subscription Status|Discount Applied|Promo Code Used|Previous Purchases|            features|label|       rawPrediction|         probability|prediction|
+---+-------------------+----------------+---------------+------------------+--------------------+-----+--------------------+--------------------+----------+
| 18|                  0|               0|              0|                 3|(129,[6,37,76,94,...|  0.0|[44.1882008211829...|[0.88376401642365...|       0.0|
| 18|                  0|               0|              0|                 5|(129,[12,24,34,76...|  0.0|[43.9732433678702...|[0.87946486735740...|       0.0|
| 18|                  0|               0|              0|                 5|(129,[14,24,64,10...|  0.0|[43.4896757876328...|[0.86979351575265...|       0.0|
| 18|           

## **List IBM Cloud Object Storage Contents**
1. **Retrieve File List**: Lists all files and folders stored in the IBM COS bucket.
2. **Validate Uploads**: Ensures that required files (models, predictions) exist.

In [69]:
import ibm_boto3
from ibm_botocore.client import Config

def list_bucket_contents(cos_client, bucket_name):
    """
    Lists all files and folders stored in the specified IBM COS bucket.
    """
    try:
        response = cos_client.list_objects_v2(Bucket=bucket_name)

        if 'Contents' not in response:
            print(f"📂 Bucket '{bucket_name}' is empty.")
            return

        print(f"📂 Contents of '{bucket_name}':")
        for obj in response['Contents']:
            print(f"  - {obj['Key']}")

    except Exception as e:
        print(f"❌ Error listing bucket contents: {str(e)}")

# Example usage:
list_bucket_contents(cos_client, BUCKET_NAME)


📂 Contents of 'processed-data-bucket':
  - Logistic_Regression_model.zip
  - Logistic_Regression_predictions.zip
  - Random Forest/._SUCCESS.crc
  - Random Forest/.part-00000-e88714df-5953-4649-bb99-31df0fade10a-c000.snappy.parquet.crc
  - Random Forest/_SUCCESS
  - Random Forest/part-00000-e88714df-5953-4649-bb99-31df0fade10a-c000.snappy.parquet
  - Random_Forest_model.zip
  - Random_Forest_predictions.zip
  - model_comparison_results.csv
  - processed_customer_purchase_behavior.csv


## **Extract & Re-upload Files to IBM Cloud**
1. **Download & Extract ZIP File**: Fetches and unzips model files from IBM COS.
2. **Upload Extracted Files**: Re-uploads extracted files back to IBM COS for further usage.

In [70]:
import ibm_boto3
from ibm_botocore.client import Config
import zipfile
import os

# Function to extract and upload files back to IBM COS
def extract_and_upload(cos_client, bucket_name, zip_filename, extract_to="/mnt/data/extracted_files"):
    """
    Downloads a ZIP file from IBM COS, extracts it, and uploads extracted files back to COS.
    """
    # Ensure extract directory exists
    os.makedirs(extract_to, exist_ok=True)

    # Download ZIP file from COS
    zip_path = f"/mnt/data/{zip_filename}"
    try:
        cos_client.download_file(Bucket=bucket_name, Key=zip_filename, Filename=zip_path)
        print(f"✅ Downloaded {zip_filename} from IBM COS.")
    except Exception as e:
        print(f"❌ Error downloading ZIP file: {e}")
        return

    # Extract ZIP file
    try:
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(extract_to)
        print(f"✅ Extracted files to: {extract_to}")
    except Exception as e:
        print(f"❌ Error extracting ZIP file: {e}")
        return

    # Upload extracted files back to IBM COS
    for root, _, files in os.walk(extract_to):
        for file in files:
            file_path = os.path.join(root, file)
            object_key = f"{best_model_name}/{file}"  # Upload inside 'extracted/' folder in COS

            try:
                with open(file_path, "rb") as f:
                    cos_client.put_object(Bucket=bucket_name, Key=object_key, Body=f)
                print(f"✅ Uploaded {file} to IBM COS at: extracted/{file}")
            except Exception as e:
                print(f"❌ Error uploading {file}: {e}")

# Example usage:
extract_and_upload(cos_client, "processed-data-bucket", "Random_Forest_predictions.zip")


✅ Downloaded Random_Forest_predictions.zip from IBM COS.
✅ Extracted files to: /mnt/data/extracted_files
✅ Uploaded ._SUCCESS.crc to IBM COS at: extracted/._SUCCESS.crc
✅ Uploaded _SUCCESS to IBM COS at: extracted/_SUCCESS
✅ Uploaded .part-00000-e88714df-5953-4649-bb99-31df0fade10a-c000.snappy.parquet.crc to IBM COS at: extracted/.part-00000-e88714df-5953-4649-bb99-31df0fade10a-c000.snappy.parquet.crc
✅ Uploaded part-00000-9f0f3f38-6cb7-4d8b-8ca9-5c925c7a5c1e-c000.snappy.parquet to IBM COS at: extracted/part-00000-9f0f3f38-6cb7-4d8b-8ca9-5c925c7a5c1e-c000.snappy.parquet
✅ Uploaded part-00000-e88714df-5953-4649-bb99-31df0fade10a-c000.snappy.parquet to IBM COS at: extracted/part-00000-e88714df-5953-4649-bb99-31df0fade10a-c000.snappy.parquet
✅ Uploaded .part-00000-9f0f3f38-6cb7-4d8b-8ca9-5c925c7a5c1e-c000.snappy.parquet.crc to IBM COS at: extracted/.part-00000-9f0f3f38-6cb7-4d8b-8ca9-5c925c7a5c1e-c000.snappy.parquet.crc
