# Homework Assignment: Building an Image Processing Pipeline with PySpark

## Objective
Develop a PySpark pipeline to classify images from the CIFAR-10 dataset using a machine learning model.

## Prerequisites
- Basic knowledge of Python and machine learning.
- Access to an environment where PySpark is installed and configured.
- CIFAR-10 dataset available in your working directory.

## Resources
- [PySpark Documentation](https://spark.apache.org/docs/latest/api/python/index.html)
- [Machine Learning with PySpark MLlib](https://spark.apache.org/docs/latest/ml-guide.html)
- [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html)

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("CIFAR-10 Image Processing with PySpark") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "4g") \
    .config("spark.memory.fraction", "0.6") \
    .config("spark.executor.memoryOverhead", "512m") \
    .config("spark.sql.shuffle.partitions", "100") \
    .getOrCreate()

1. Why is it necessary to configure `spark.executor.memory` and `spark.driver.memory`?

2. What does setting `spark.memory.fraction` achieve?
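
As a starting point for question 2, the arithmetic behind `spark.memory.fraction` can be sketched in plain Python. The Spark tuning guide describes the shared execution and storage region as a fraction of the JVM heap minus a 300 MiB reservation. The helper name `unified_memory_mib` is hypothetical, and the result is only an approximation, because the heap the JVM actually reports is usually somewhat smaller than the configured value.

```python
RESERVED_MIB = 300  # fixed reservation described in the Spark tuning guide


def unified_memory_mib(heap_mib, memory_fraction):
    """Approximate size of the shared execution + storage region, in MiB.

    Hypothetical helper for illustration; Spark computes this internally.
    """
    return (heap_mib - RESERVED_MIB) * memory_fraction


# 4g executor heap and fraction 0.6, as configured in the session above
heap_mib = 4 * 1024
print(round(unified_memory_mib(heap_mib, 0.6), 1))  # 2277.6
```

Note that `0.6` is also Spark's default for this setting, so the configuration above makes the default explicit rather than changing it.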

In [None]:
import pickle

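def unpickle(file):
    # Open in binary mode; encoding='bytes' keeps the dictionary keys as bytes (e.g. b'data').
    with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes')  # avoid shadowing the built-in `dict`
    return batch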

3. What is the purpose of serialization in distributed systems?
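
To see serialization in action without the real dataset, the following minimal sketch writes a small dictionary to disk with `pickle` and reads it back with the same `unpickle` logic. The `fake_batch` dictionary is a made-up stand-in that only mimics the byte-string keys of a CIFAR-10 batch; it is not real data.

```python
import os
import pickle
import tempfile


def unpickle(file):
    with open(file, 'rb') as fo:
        return pickle.load(fo, encoding='bytes')


# Hypothetical stand-in for a CIFAR-10 batch: byte-string keys, tiny payload
fake_batch = {b'data': [[0, 128, 255]], b'labels': [7]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'fake_batch')
    # Serialize: turn the in-memory object into a byte stream on disk
    with open(path, 'wb') as fo:
        pickle.dump(fake_batch, fo)
    # Deserialize: rebuild an equal object from that byte stream
    restored = unpickle(path)

print(restored == fake_batch)  # True
```

The same idea applies when Spark ships data or functions between the driver and executors: objects must be turned into bytes before they can cross a process or network boundary.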

The CIFAR-10 batch files store image data and labels in a format that Spark cannot analyze directly, so you need to transform them into a form from which a Spark DataFrame can be created. Below is a starter function that loads a CIFAR-10 batch file into a list of tuples; this list will later be parallelized into an RDD and then converted into a DataFrame. Add line-by-line comments to the provided code to explain the transformation, focusing in particular on the image reshaping and the serialization to bytes.

In [None]:
from PIL import Image
import io
import numpy as np

def load_cifar10_batch(file):
    """
    Loads a CIFAR-10 batch file and returns a list of tuples containing image data and labels.
    Args:
    - file (str): Path to the CIFAR-10 batch file.
    Returns:
    - list: A list of tuples, where each tuple contains (image_data, label).
    """
    batch = unpickle(file)
    data = batch[b'data']
    labels = batch[b'labels']
    images_and_labels = []

    # TODO: Comment the following code
    for i in range(len(data)):
        image_array = data[i]
        image_array_reshaped = image_array.reshape(3, 32, 32).transpose(1, 2, 0)
        image = Image.fromarray(image_array_reshaped)
        img_byte_arr = io.BytesIO()
        image.save(img_byte_arr, format='PNG')
        image_bytes = img_byte_arr.getvalue()
        images_and_labels.append((image_bytes, labels[i]))
    
    return images_and_labels
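
As a hint for the reshaping step, the sketch below reproduces in plain Python what `reshape(3, 32, 32).transpose(1, 2, 0)` does to one row of the `data` array. Each CIFAR-10 row holds 1024 red values, then 1024 green, then 1024 blue, each channel stored row by row. The helper names `flat_index` and `to_height_width_channel` are hypothetical, and the input is synthetic (the values are just their own positions) so the mapping is easy to read.

```python
CHANNELS, HEIGHT, WIDTH = 3, 32, 32


def flat_index(row, col, channel):
    """Position of one pixel component inside a flat 3072-value CIFAR-10 row."""
    return channel * HEIGHT * WIDTH + row * WIDTH + col


def to_height_width_channel(flat_row):
    """Rearrange a channel-major flat row into nested [row][col][channel] lists."""
    return [
        [[flat_row[flat_index(r, c, ch)] for ch in range(CHANNELS)]
         for c in range(WIDTH)]
        for r in range(HEIGHT)
    ]


# Synthetic row: every value equals its own index in the flat layout
flat_row = list(range(CHANNELS * HEIGHT * WIDTH))
pixels = to_height_width_channel(flat_row)

print(pixels[0][0])    # [0, 1024, 2048]
print(pixels[31][31])  # [1023, 2047, 3071]
```

The top-left pixel takes its red, green, and blue values from positions 0, 1024, and 2048, which is why the array has to be reordered before an image library that expects (height, width, channel) can read it.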

## Data Preparation

DataFrames provide a convenient and efficient way to handle structured data in Spark. You will now take the data loaded from CIFAR-10 files, parallelize it using RDDs (Resilient Distributed Datasets), and then convert these RDDs into DataFrames. This process must handle multiple batches of data to form a comprehensive dataset.

In [None]:
from pyspark.sql import Row

# Function to create a DataFrame from a single batch file
def create_dataframe_from_batch(file):
    images_and_labels = load_cifar10_batch(file)
    rdd = spark.sparkContext.parallelize(images_and_labels)
    row_rdd = rdd.map(lambda x: Row(image_data=x[0], label=x[1]))
    df = spark.createDataFrame(row_rdd)
    return df

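# Load and combine multiple batches
# Assumption: the five training batches from the official archive sit in
# ./cifar-10-batches-py; adjust the path to match your working directory.
batch_files = [f"cifar-10-batches-py/data_batch_{i}" for i in range(1, 6)]
df = None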
for batch_file in batch_files:
    batch_df = create_dataframe_from_batch(batch_file)
    if df is None:
        df = batch_df
    else:
        df = df.union(batch_df)


4. What does the `parallelize` method do, and why is it important in Spark?
5. How does the `union` method help in combining data from different sources?
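
To build intuition for questions 4 and 5, here is a conceptual sketch in plain Python. It is not PySpark's actual implementation: `split_into_partitions` is a hypothetical helper that only illustrates the idea of cutting a local list into contiguous slices that separate workers could process. List concatenation stands in for `union`, which appends rows without removing duplicates.

```python
def split_into_partitions(records, num_partitions):
    """Cut a list into contiguous, roughly equal slices (illustration only)."""
    n = len(records)
    return [
        records[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


# Ten synthetic records spread over three partitions
partitions = split_into_partitions(list(range(10)), 3)
print(partitions)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]

# Combining two "batches" keeps every row, including duplicates
batch_a = [("img_a", 3), ("img_b", 5)]
batch_b = [("img_b", 5), ("img_c", 1)]
combined = batch_a + batch_b
print(len(combined))  # 4
```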

In machine learning, features need to be numeric and typically normalized. The images in the CIFAR-10 dataset are in byte format and must be converted into a usable form for machine learning models. This task involves writing a UDF that converts the image byte data into a dense vector of normalized pixel values.

In [None]:
import io

import numpy as np
from PIL import Image
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

def convert_bytes_to_vector(image_bytes):
    image = Image.open(io.BytesIO(image_bytes))
    array = np.array(image).flatten().astype(float) / 255.0
    return Vectors.dense(array)

convert_udf = udf(convert_bytes_to_vector, VectorUDT())

# Apply UDF to the DataFrame
df = df.withColumn("features", convert_udf("image_data"))


6. Why is it necessary to normalize the pixel values in image processing?
7. What are the benefits of using UDFs in Spark?
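
For question 6, the scaling that the UDF applies can be tried on a handful of raw byte values without Spark. This is a minimal sketch; `normalize_pixels` is a hypothetical helper that mirrors the division by `255.0` in `convert_bytes_to_vector`.

```python
def normalize_pixels(pixel_bytes):
    """Scale 8-bit pixel values (0..255) into floats in the range [0.0, 1.0]."""
    return [value / 255.0 for value in pixel_bytes]


# Three sample intensities: black, a dark gray, and full brightness
raw = bytes([0, 51, 255])
normalized = normalize_pixels(raw)
print(normalized)  # [0.0, 0.2, 1.0]
```

After scaling, every feature lies in the same small range regardless of the original bit depth, which is a useful observation to build on when answering the question.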