<a href="https://colab.research.google.com/github/suriarasai/BEAD2025/blob/main/colab/05c_Image_Processing_Using_RDD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates the core RDD-based approach, which is well suited to custom, low-level transformations on unstructured data such as images.

### PySpark Setup

The first step installs a Java runtime (OpenJDK 21) and points `JAVA_HOME` at it, because Spark runs on the JVM. The next step installs pyspark with pip and starts a local Spark session.

*Note: pip's `--ignore-installed` flag can be added to ignore a previous installation and pull the latest release; the cell below does not use it.*

In [1]:
import os

# 1. Install OpenJDK 21 (if not already done in a previous cell)
!apt-get update -qq
!apt-get install -qq openjdk-21-jdk-headless

# 2. Verify where it landed (if needed)
!ls /usr/lib/jvm | grep 21

# 3. Point to JDK 21
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-21-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

# 4. Install PySpark via pip
!pip install pyspark --quiet

# 5. Import and start Spark (JAVA_HOME must already be set at this point)
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
      .master("local[*]")
      .appName("Spark on Java21")
      .getOrCreate()
)


W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Selecting previously unselected package openjdk-21-jre-headless:amd64.
(Reading database ... 126371 files and directories currently installed.)
Preparing to unpack .../openjdk-21-jre-headless_21.0.8+9~us1-0ubuntu1~22.04.1_amd64.deb ...
Unpacking openjdk-21-jre-headless:amd64 (21.0.8+9~us1-0ubuntu1~22.04.1) ...
Selecting previously unselected package openjdk-21-jdk-headless:amd64.
Preparing to unpack .../openjdk-21-jdk-headless_21.0.8+9~us1-0ubuntu1~22.04.1_amd64.deb ...
Unpacking openjdk-21-jdk-headless:amd64 (21.0.8+9~us1-0ubuntu1~22.04.1) ...
Setting up openjdk-21-jre-headless:amd64 (21.0.8+9~us1-0ubuntu1~22.04.1) ...
update-alternatives: using /usr/lib/jvm/java-21-openjdk-amd64/bin/java to provide /usr/bin/java (java) in auto mode
update-alternatives: using /usr/lib/jvm/java-21-openjdk-amd64/bin/j
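If the session fails to start, a common cause is that `JAVA_HOME` does not point at a usable JDK. The following is a minimal sketch of a pre-flight check that uses only the standard library. `locate_java` is a hypothetical helper name, not part of PySpark, and it only approximates how Spark's launcher picks a Java binary (`JAVA_HOME/bin/java` first, then `java` on `PATH`).

```python
import os
import shutil


def locate_java(env=None):
    """Return the path of the `java` binary Spark would likely use, or None.

    `env` is a mapping of environment variables; it defaults to os.environ.
    (Hypothetical helper for illustration only.)
    """
    env = os.environ if env is None else env

    # Prefer the JDK that JAVA_HOME points to, if it contains an executable.
    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate

    # Otherwise fall back to searching the PATH from the same environment.
    return shutil.which("java", path=env.get("PATH"))
```

Calling `locate_java()` right after setting `JAVA_HOME` and getting `None` back suggests the directory name from step 2 differs from the one assumed in step 3.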

Next, we need to set up the Kaggle API to download our dataset.

In [2]:
# Install required packages
!pip install kaggle -q

### Configure Kaggle API
To use the Kaggle API, you need your API token.

Go to your Kaggle account page: https://www.kaggle.com/your-username/account

Click on "Create New API Token". This will download a kaggle.json file.

Run the next cell and upload that kaggle.json file when prompted.

In [3]:
# Upload your kaggle.json file
from google.colab import files
print('Please upload your kaggle.json file')
files.upload()

# Move the file to the required directory and set permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Please upload your kaggle.json file


Saving kaggle.json to kaggle.json
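The shell commands above copy the token without checking it. As a hedged alternative, the sketch below validates the file before installing it: it assumes the token is a JSON object with `username` and `key` fields, which is the format Kaggle issues, and it restricts the file to owner-only access. `install_kaggle_token` is a hypothetical helper name.

```python
import json
from pathlib import Path


def install_kaggle_token(src, dest_dir):
    """Validate a kaggle.json file and copy it into dest_dir with mode 600.

    Returns the path of the installed file. (Hypothetical helper.)
    """
    token = json.loads(Path(src).read_text())

    # The Kaggle CLI needs both fields; fail early with a clear message.
    missing = {"username", "key"} - token.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "kaggle.json"
    dest.write_text(json.dumps(token))

    # Owner read/write only, matching the chmod 600 used in the cell above.
    dest.chmod(0o600)
    return dest
```

In Colab this would be called as `install_kaggle_token("kaggle.json", Path.home() / ".kaggle")` after the upload completes.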


### Download and Unzip the Dataset
We'll use the "Intel Image Classification" dataset, which is well-structured and a manageable size. It contains images of natural scenes categorized into folders.

In [4]:
# Download the dataset from Kaggle
!kaggle datasets download -d puneet6060/intel-image-classification

# Unzip the dataset quietly
!unzip -q intel-image-classification.zip

Dataset URL: https://www.kaggle.com/datasets/puneet6060/intel-image-classification
License(s): copyright-authors
Downloading intel-image-classification.zip to /content
 92% 319M/346M [00:00<00:00, 517MB/s]
100% 346M/346M [00:00<00:00, 431MB/s]
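Because `unzip -q` prints nothing, it can be worth confirming the extracted layout before handing paths to Spark. The sketch below counts image files per class folder using only the standard library. It assumes the layout that the later cells rely on (one sub-folder per class under `seg_train/seg_train`); `count_images_per_class` is a hypothetical helper name.

```python
from pathlib import Path


def count_images_per_class(root, suffix=".jpg"):
    """Return {class_folder_name: number_of_files_with_suffix} for each sub-folder of root."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            # Compare suffixes case-insensitively so '.JPG' files are counted too.
            counts[class_dir.name] = sum(
                1 for p in class_dir.iterdir() if p.suffix.lower() == suffix
            )
    return counts
```

For example, `count_images_per_class("seg_train/seg_train")` should list one entry per scene category; an empty dictionary would indicate that the archive was extracted somewhere else.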


### Load Images as an RDD
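The key to reading binary files such as images is the `sc.binaryFiles()` method. It creates an RDD where each element is a tuple `(filepath, content)`. In the Scala and Java APIs the content is a `PortableDataStream`; in PySpark it arrives as a plain `bytes` object holding the raw image bytes, as the output further down confirms.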

In [None]:
!pip uninstall -y Pillow
!pip install Pillow

Before loading images into Spark, first check that Pillow can open and transform a single image.

In [14]:
from PIL import Image
import numpy as np

# Define a path to a single, specific image
single_image_path = 'seg_train/seg_train/buildings/0.jpg'

print(f"Attempting to process one image: {single_image_path}")

try:
    # Try to open the image from the file system
    with open(single_image_path, 'rb') as f:
        image = Image.open(f)

        # Perform the same transformations
        processed_image = image.convert('L').resize((64, 64))
        image_array = np.array(processed_image)

        print(f"✅ Success! Single image processed correctly.")
        print(f"Image mode: {processed_image.mode}, Size: {processed_image.size}, Shape: {image_array.shape}")

except FileNotFoundError:
    print(f"🛑 FAILED: The file was not found at '{single_image_path}'. Make sure the dataset was unzipped correctly.")
except Exception as e:
    print(f"🛑 FAILED TO PROCESS SINGLE IMAGE. THIS IS THE KEY ERROR:")
    print(f"   Error Type: {type(e).__name__}")
    print(f"   Error Message: {e}")

Attempting to process one image: seg_train/seg_train/buildings/0.jpg
✅ Success! Single image processed correctly.
Image mode: L, Size: (64, 64), Shape: (64, 64)
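Decoding with Pillow is the real test of whether a file is a valid image, but a cheaper check on the first few bytes can screen out obviously wrong files (for example a stray text file) before any decoding happens. The sketch below relies on the fact that JPEG data begins with the bytes `FF D8 FF`; `looks_like_jpeg` is a hypothetical helper name, and passing this check does not guarantee that the file decodes.

```python
# A JPEG stream starts with the Start Of Image marker (FF D8)
# followed by the first byte of the next marker (FF).
JPEG_SIGNATURE = b"\xff\xd8\xff"


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if the byte string starts with the JPEG signature."""
    return len(data) > len(JPEG_SIGNATURE) and data.startswith(JPEG_SIGNATURE)
```

Since the RDD elements used later in this notebook carry the file content as `bytes`, the same function could be applied in a `filter()` step ahead of the Pillow decoding.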


In [5]:
# Define the path to one class folder of the training images (here: forest)
# The path is relative to the notebook's working directory, which is /content
image_dir = "seg_train/seg_train/forest"

# Load every file in this directory into an RDD
# The result is an RDD of (file_path, binary_content)
image_rdd = spark.sparkContext.binaryFiles(image_dir)

# --- Verification ---
# You can confirm the path is correct by listing its contents
print("Verifying directory contents:")
!ls {image_dir} | head -n 5

# Let's inspect the first element to see the structure
print("\n--- Processing RDD ---")
first_element = image_rdd.take(1)[0]
print(f"File Path: {first_element[0]}")
print(f"Data Type: {type(first_element[1])}")
print(f"A sample of image RDD has {image_rdd.count()} images.")

Verifying directory contents:
10007.jpg
10010.jpg
10020.jpg
10030.jpg
10037.jpg

--- Processing RDD ---
File Path: file:/content/seg_train/seg_train/forest/17180.jpg
Data Type: <class 'bytes'>
A sample of image RDD has 2271 images.
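Note that the file path in the output is a `file:` URI rather than a plain path. The sketch below shows one way to recover the class label (the parent folder name) from such a URI using only the standard library; `label_from_uri` is a hypothetical helper name, and it accepts plain relative paths as well.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def label_from_uri(file_uri: str) -> str:
    """Return the name of the folder that directly contains the file."""
    # urlparse strips the 'file:' scheme; unquote decodes escapes such as %20.
    path = unquote(urlparse(file_uri).path)
    return PurePosixPath(path).parent.name


print(label_from_uri("file:/content/seg_train/seg_train/forest/17180.jpg"))  # forest
print(label_from_uri("seg_train/seg_train/buildings/0.jpg"))                 # buildings
```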


In [16]:
import os
from PIL import Image
import io
import numpy as np

def process_image(element):
    """
    Parses an RDD element and converts its image to a 64x64 grayscale array.

    Args:
        element: A tuple containing the file path and the image data as bytes.

    Returns:
        A tuple of (label, numpy_array) or None if processing fails.
    """
    filepath, image_bytes = element

    try:
        # 1. Extract the label from the parent directory's name
        label = os.path.basename(os.path.dirname(filepath))

        # 2. Read the image bytes into a Pillow Image object
        # PySpark's binaryFiles() already yields 'bytes', so no .readAll() call is needed
        image = Image.open(io.BytesIO(image_bytes))

        # 3. Convert to grayscale ('L' mode) and resize to 64x64 pixels
        processed_image = image.convert('L').resize((64, 64))

        # 4. Convert the processed image to a NumPy array
        image_array = np.array(processed_image)

        return (label, image_array)

    except Exception as e:
        # Report the failure and return None so the filter() step can drop this element
        print(f"Could not process {filepath}: {e}")
        return None

# --- Apply the Function and Analyze Results ---

# Use map() to apply our processing function to each element of the RDD
processed_rdd = image_rdd.map(process_image)

# Use filter() to remove any images that failed processing
processed_rdd = processed_rdd.filter(lambda x: x is not None)

# Cache the RDD in memory for faster access
processed_rdd.cache()

# Action 1: Count the total number of successfully processed images
total_images = processed_rdd.count()
print(f"✅ Successfully processed {total_images} images.")

# Action 2: Inspect the first 2 elements of our processed RDD
print("\n--- Sample of Processed RDD ---")
sample_data = processed_rdd.take(2)
for label, img_array in sample_data:
    print(f"Label: {label}, Image Array Shape: {img_array.shape}, DType: {img_array.dtype}")


# Action 3: Get a count of images per category
label_counts = processed_rdd.map(lambda x: x[0]).countByValue()

print("\n--- Image Count Per Category ---")
for label, count in sorted(label_counts.items()):
    print(f"- {label}: {count} images")



✅ Successfully processed 2271 images.

--- Sample of Processed RDD ---
Label: forest, Image Array Shape: (64, 64), DType: uint8
Label: forest, Image Array Shape: (64, 64), DType: uint8

--- Image Count Per Category ---
- forest: 2271 images
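The `process_image` function reports failures with `print()` and returns `None`, so a failure message may not be easy to find once the function runs inside Spark worker processes. One alternative, sketched below, is to return the outcome as data so that failures can be counted and inspected like any other result. This is an illustration under assumptions: the built-in `map()` stands in for `RDD.map()`, and `safe` and `toy_process` are hypothetical names rather than part of the notebook.

```python
from collections import Counter


def safe(fn):
    """Wrap fn so it returns ("ok", result) or ("error", (key, message)) instead of raising."""
    def wrapper(element):
        key = element[0]  # the file path identifies the failing record
        try:
            return ("ok", fn(element))
        except Exception as exc:
            return ("error", (key, f"{type(exc).__name__}: {exc}"))
    return wrapper


def toy_process(element):
    """Stand-in for process_image: returns (label, byte_count) and rejects empty files."""
    path, data = element
    if not data:
        raise ValueError("empty file")
    return (path.split("/")[-2], len(data))


records = [("forest/1.jpg", b"abc"), ("forest/2.jpg", b""), ("sea/3.jpg", b"xy")]

# With Spark, the equivalent step would be rdd.map(safe(process_fn)).
results = list(map(safe(toy_process), records))

status_counts = Counter(status for status, _ in results)
failures = [payload for status, payload in results if status == "error"]

print(dict(status_counts))  # {'ok': 2, 'error': 1}
print(failures)             # [('forest/2.jpg', 'ValueError: empty file')]
```

The trade-off is one extra filtering step to separate successes from failures, in exchange for error details that stay in the data instead of in log output.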


### Conclusion and Cleanup

This example demonstrates the fundamental RDD workflow for custom data processing. We loaded raw binary files, applied a complex Python function using map(), and performed aggregations.

For more structured machine learning tasks, the next step would often be to convert this RDD into a Spark DataFrame, which integrates with Spark's MLlib. For custom ETL (Extract, Transform, Load) and preprocessing, however, the RDD API offers the most flexibility, because an arbitrary Python function can be applied to each element.

Finally, it's good practice to stop the Spark session to release resources.

In [17]:
# Stop the Spark session
spark.stop()