# Deep Learning Using Big Data

To meet the requirements, it’s important to address the Big Data aspects of the project before diving into the deep learning model. The steps below show how we can achieve this and fulfil the learning outcomes.


1. Critically assess the data storage and management requirements (MLO 1)

Before starting with deep learning, we need to assess what Big Data storage the project requires. CIFAR-10, for instance, is relatively small by Big Data standards, so we need to show how we would store and manage a much larger dataset.

Task: Compare traditional storage approaches (e.g., relational databases) versus modern Big Data storage solutions (e.g., HDFS, cloud storage).
Modern Perspective: Emphasize the scalability and fault tolerance of modern storage solutions like Hadoop HDFS or cloud-based object storage (S3, GCS).

We will describe how a distributed system like Hadoop/Spark would handle a larger dataset by distributing chunks across nodes, which improves access speed and throughput.

Implementation: We’ll later process CIFAR-10 using Spark. The dataset is small (its compressed archive is roughly 163 MB), but it lets us simulate the steps we would follow for Big Data.
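To make the idea of distributing chunks across nodes concrete, here is a minimal sketch in plain Python. It assumes a 128 MB block size and a replication factor of 3 (common HDFS defaults), and it uses simple round-robin placement over a hypothetical four-node cluster; real HDFS placement is rack-aware, so treat this only as an illustration.

```python
import math

BLOCK_SIZE_MB = 128   # assumed HDFS block size (a common default)
REPLICATION = 3       # assumed replication factor (a common default)


def plan_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return a {block_id: [node, ...]} placement using simple round-robin.

    Real HDFS placement is rack-aware; round-robin is only an illustration.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block_id in range(n_blocks):
        # Each replica of a block goes to a different node.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after a node fails."""
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical cluster
placement = plan_blocks(file_size_mb=1000, nodes=nodes)
survivors = surviving_blocks(placement, "node-b")

print(len(placement))   # number of blocks the file is split into
print(len(survivors))   # blocks still readable after node-b fails
```

Because every block has replicas on three different nodes, losing a single node leaves every block readable, which is the fault-tolerance property the comparison with a single-server relational database should highlight.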


2. Assess design concepts and architectural patterns of distributed Big Data systems (MLO 2)

Before the deep learning phase, this step involves understanding how distributed systems like Hadoop and Spark handle data.

Task: Analyze Spark’s architecture, including the roles of the driver (the coordinating process, often described as the master) and the executors that run on worker nodes. We will discuss how data is split into partitions, processed in parallel, and combined, using Resilient Distributed Datasets (RDDs) and MapReduce-style operations.

We can demonstrate this with an experiment in which we distribute and process the CIFAR-10 dataset using Spark, for instance by normalizing the data or converting it into a different format (e.g., Parquet).

Design Focus: We will discuss the use of in-memory processing in Spark, which allows faster computation than traditional Hadoop MapReduce, which relies heavily on disk I/O.
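The partition, process-in-parallel, and combine pattern described above can be sketched without Spark. In this illustrative example, threads stand in for executors and a tiny list of hypothetical pixel rows stands in for the image data; the function names are made up for the sketch and are not part of any Spark API.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def make_partitions(rows, num_partitions):
    """Split rows into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(rows[start:end])
        start = end
    return partitions


def partial_stats(partition):
    """Map step: each worker returns (sum, count) for its own partition."""
    total = sum(sum(row) for row in partition)
    count = sum(len(row) for row in partition)
    return total, count


def combine(a, b):
    """Reduce step: merge two partial results."""
    return a[0] + b[0], a[1] + b[1]


# Hypothetical stand-in for image rows: 10 rows of 4 pixel values each.
rows = [[i, i + 1, i + 2, i + 3] for i in range(10)]
partitions = make_partitions(rows, num_partitions=3)

with ThreadPoolExecutor(max_workers=3) as pool:   # threads stand in for executors
    partials = list(pool.map(partial_stats, partitions))

total, count = reduce(combine, partials)
mean_pixel = total / count
print(mean_pixel)
```

Only the small `(sum, count)` pairs travel back to the coordinating code, not the rows themselves. The same idea is what keeps the combine step cheap in a real cluster.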

3. Critically evaluate and select a Big Data environment (MLO 3)

Before or during data processing for deep learning, this involves justifying the choice of environment (e.g., using Spark in Colab).

Task: Compare different Big Data environments (e.g., Spark, Hadoop, cloud-based solutions) and explain why Spark is a suitable choice for handling our data.

Why Spark?: Its in-memory processing, its scalability, and its support for integration with deep learning libraries like TensorFlow.

Data Management: We will describe how we will manage large-scale data in a distributed environment, including fault tolerance (replication at the storage layer and, in Spark, recomputation of lost partitions from their lineage) and parallel processing (using multiple cores).

Order of Execution:
- Big Data Setup (before deep learning):
  - Download the CIFAR-10 dataset and load it into Spark as RDDs or DataFrames.
  - Process the dataset in a distributed way (e.g., normalization, transformations).
  - Discuss how we’d handle a much larger dataset (scalability, storage).
- Deep Learning Model:
  - After the Big Data preprocessing, transition to building our neural network using the preprocessed CIFAR-10 data.
  - Train the model using TensorFlow or PyTorch, but still emphasize Spark for data preprocessing if needed.

**Installing the Java JDK**

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

**Downloading Spark (prebuilt for Hadoop 3)**

In [None]:
!wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz


--2024-09-23 15:03:08--  https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
Resolving archive.apache.org (archive.apache.org)... 65.108.204.189, 2a01:4f9:1a:a084::2
Connecting to archive.apache.org (archive.apache.org)|65.108.204.189|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 299350810 (285M) [application/x-gzip]
Saving to: ‘spark-3.3.1-bin-hadoop3.tgz’


2024-09-23 15:22:32 (251 KB/s) - ‘spark-3.3.1-bin-hadoop3.tgz’ saved [299350810/299350810]



In [None]:
!tar -xvzf spark-3.3.1-bin-hadoop3.tgz


spark-3.3.1-bin-hadoop3/
spark-3.3.1-bin-hadoop3/LICENSE
spark-3.3.1-bin-hadoop3/NOTICE
spark-3.3.1-bin-hadoop3/R/
spark-3.3.1-bin-hadoop3/R/lib/
spark-3.3.1-bin-hadoop3/R/lib/SparkR/
spark-3.3.1-bin-hadoop3/R/lib/SparkR/DESCRIPTION
spark-3.3.1-bin-hadoop3/R/lib/SparkR/INDEX
spark-3.3.1-bin-hadoop3/R/lib/SparkR/Meta/
spark-3.3.1-bin-hadoop3/R/lib/SparkR/Meta/Rd.rds
spark-3.3.1-bin-hadoop3/R/lib/SparkR/Meta/features.rds
spark-3.3.1-bin-hadoop3/R/lib/SparkR/Meta/hsearch.rds
spark-3.3.1-bin-hadoop3/R/lib/SparkR/Meta/links.rds
spark-3.3.1-bin-hadoop3/R/lib/SparkR/Meta/nsInfo.rds
spark-3.3.1-bin-hadoop3/R/lib/SparkR/Meta/package.rds
spark-3.3.1-bin-hadoop3/R/lib/SparkR/Meta/vignette.rds
spark-3.3.1-bin-hadoop3/R/lib/SparkR/NAMESPACE
spark-3.3.1-bin-hadoop3/R/lib/SparkR/R/
spark-3.3.1-bin-hadoop3/R/lib/SparkR/R/SparkR
spark-3.3.1-bin-hadoop3/R/lib/SparkR/R/SparkR.rdb
spark-3.3.1-bin-hadoop3/R/lib/SparkR/R/SparkR.rdx
spark-3.3.1-bin-hadoop3/R/lib/SparkR/doc/
spark-3.3.1-bin-hadoop3/R/lib/Spar

Now we will set the environment variables for Java and Spark.

These environment variables point to the Java and Spark installations. Without them, Spark won’t know where to find its required components.


This step supports MLO 2: it configures the Big Data system architecture by integrating Spark with the existing system, which allows us to handle large datasets.

In [None]:
import os

# Set the Java path
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

# Set the Spark path
os.environ["SPARK_HOME"] = "/content/spark-3.3.1-bin-hadoop3"
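Since a wrong path only surfaces later as a hard-to-read Spark startup error, it can help to verify the variables right after setting them. The helper below is a hypothetical convenience function (not part of PySpark) that reports any variable that is unset or does not point to an existing directory.

```python
import os


def missing_paths(var_names, environ=os.environ):
    """Return {name: reason} for variables that are unset or not a directory."""
    problems = {}
    for name in var_names:
        value = environ.get(name)
        if not value:
            problems[name] = "not set"
        elif not os.path.isdir(value):
            problems[name] = f"directory not found: {value}"
    return problems


# Check the two variables configured above; an empty result means both look fine.
problems = missing_paths(["JAVA_HOME", "SPARK_HOME"])
for name, reason in problems.items():
    print(f"{name}: {reason}")
```

If this prints nothing, both directories exist. Otherwise the message shows which installation step needs to be re-run.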


Now that the environment variables are set, the next step is to install **PySpark**, which is the Python API for Spark. This allows us to use Spark's features in Python directly.


Why are we doing this?
- **PySpark** is essential for interacting with Spark through Python, which is what we'll use to preprocess the CIFAR-10 dataset in a distributed manner.
- This step supports **MLO 3**: selecting an appropriate Big Data environment for data retrieval and management. PySpark is a key part of integrating big data processing with neural networks, as it bridges distributed data processing with advanced analytics (deep learning).


In [None]:
!pip install -q pyspark

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.3/317.3 MB 4.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  Building wheel for pyspark (setup.py) ... done


**Initializing a Spark Session**

- The Spark session is the entry point for working with Spark (here, a build packaged with Hadoop 3 libraries). It allows us to load, process, and manage large datasets in the same way we would across a distributed cluster, simulating a real-world scenario.

- The configuration assigns 4 GB of memory each to the driver and to the executor, which handles data processing. This step is essential for initiating any distributed data processing task, fulfilling MLO 2 and 3 by establishing the architecture and environment for Big Data processing.



In [None]:
from pyspark.sql import SparkSession


spark = SparkSession.builder \
    .appName("CIFAR-10 Processing") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()


# Checking the Spark session
spark
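One way to sanity-check memory settings like these is to estimate the raw size of the data at different numeric precisions. The sketch below assumes the 50,000 training images (5 batches of 10,000) with 32 × 32 × 3 values each, and it ignores DataFrame and JVM overhead, so real usage will be higher.

```python
N_IMAGES = 50_000                 # 5 training batches x 10,000 images
VALUES_PER_IMAGE = 32 * 32 * 3    # 3,072 values per image

BYTES_PER_VALUE = {"uint8": 1, "float32": 4, "float64": 8}


def footprint_mib(dtype, n_images=N_IMAGES, values_per_image=VALUES_PER_IMAGE):
    """Raw array size in MiB, ignoring DataFrame and JVM overhead."""
    return n_images * values_per_image * BYTES_PER_VALUE[dtype] / 2**20


for dtype in BYTES_PER_VALUE:
    print(f"{dtype:8s} {footprint_mib(dtype):8.1f} MiB")
```

The raw bytes take about 146 MiB, but the same data as 64-bit floats (what pandas uses for its `float` dtype) takes about 1.1 GiB, which is why a few gigabytes of driver memory is a reasonable setting even for a small dataset.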

**Loading the CIFAR-10 Dataset**

We download the CIFAR-10 archive directly from its official source inside the notebook, so no manual download is needed.

- This step demonstrates how to load a widely used deep learning dataset into our Spark environment, leveraging Spark's distributed processing capabilities.

- Converting the dataset into a Spark DataFrame allows us to perform distributed computations efficiently, meeting **MLO 1 and 3** by showcasing data management and preprocessing in a Big Data context.


In [None]:
import pandas as pd
import numpy as np

# Load the CIFAR-10 dataset from its pickled batch files
import tarfile
import pickle

# Download the dataset
!wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

# Extracting the dataset
with tarfile.open('cifar-10-python.tar.gz', 'r:gz') as tar:
    tar.extractall()

# Loading the dataset
def load_cifar_batch(filename):
    with open(filename, 'rb') as f:
        datadict = pickle.load(f, encoding='latin1')
        images = datadict['data']
        labels = datadict['labels']
        return images, labels

# Loading all batches
train_images = []
train_labels = []

for i in range(1, 6):
    batch_images, batch_labels = load_cifar_batch(f'cifar-10-batches-py/data_batch_{i}')
    train_images.append(batch_images)
    train_labels.append(batch_labels)

train_images = np.vstack(train_images)
train_labels = np.hstack(train_labels)

# Flatten the images and create a Pandas DataFrame in smaller chunks
data = []

# Loading data in smaller batches
for i in range(0, len(train_images), 1000):
    batch_images = train_images[i:i + 1000].reshape(-1, 32 * 32 * 3)
    batch_labels = train_labels[i:i + 1000].flatten()
    # Create a DataFrame for this batch
    batch_df = pd.DataFrame(batch_images.tolist(), columns=[f'pixel_{j}' for j in range(32 * 32 * 3)])
    batch_df['label'] = batch_labels
    data.append(batch_df)

# Concatenate all batches into a single DataFrame
train_df_pandas = pd.concat(data, ignore_index=True)

# Converting pixel values to float
train_df_pandas = train_df_pandas.astype({f'pixel_{i}': 'float' for i in range(32 * 32 * 3)})

# Defining the schema for the DataFrame
from pyspark.sql.types import StructType, StructField, FloatType, IntegerType

schema = StructType([
    StructField(f'pixel_{i}', FloatType(), False) for i in range(32 * 32 * 3)
] + [StructField('label', IntegerType(), False)])




--2024-09-23 16:07:32--  https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
Resolving www.cs.toronto.edu (www.cs.toronto.edu)... 128.100.3.30
Connecting to www.cs.toronto.edu (www.cs.toronto.edu)|128.100.3.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170498071 (163M) [application/x-gzip]
Saving to: ‘cifar-10-python.tar.gz.1’


2024-09-23 16:07:47 (12.5 MB/s) - ‘cifar-10-python.tar.gz.1’ saved [170498071/170498071]
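Each row returned by `load_cifar_batch` holds 3,072 values: in the Python version of CIFAR-10 the first 1,024 are the red plane, the next 1,024 the green plane, and the last 1,024 the blue plane, each stored row by row. The sketch below shows that layout, together with the [0, 1] normalization mentioned earlier as a preprocessing step. It uses plain lists and a made-up all-red image so it runs without the dataset; the function names are illustrative only.

```python
PIXELS_PER_CHANNEL = 32 * 32


def split_channels(flat_row):
    """Split one 3072-value CIFAR-10 row into its red, green and blue planes."""
    if len(flat_row) != 3 * PIXELS_PER_CHANNEL:
        raise ValueError("expected 3072 values per image")
    red = flat_row[:PIXELS_PER_CHANNEL]
    green = flat_row[PIXELS_PER_CHANNEL:2 * PIXELS_PER_CHANNEL]
    blue = flat_row[2 * PIXELS_PER_CHANNEL:]
    return red, green, blue


def normalize(values):
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return [v / 255.0 for v in values]


def pixel_at(flat_row, row, col):
    """Return the normalized (r, g, b) triple at a given image position."""
    red, green, blue = split_channels(flat_row)
    idx = row * 32 + col          # each plane is stored row by row
    return tuple(normalize([red[idx], green[idx], blue[idx]]))


# Hypothetical image: pure red everywhere.
fake_image = [255] * 1024 + [0] * 1024 + [0] * 1024
print(pixel_at(fake_image, row=5, col=7))
```

Keeping this layout in mind matters later, because a column such as `pixel_0` in the flattened DataFrame is a red value, while `pixel_1024` is the green value of the same pixel.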



In [None]:
# Converting the Pandas DataFrame to a Spark DataFrame with the defined schema
train_df = spark.createDataFrame(train_df_pandas, schema=schema)

# Showing the first few entries
train_df.show(5)

NameError: name 'spark' is not defined

This error means the `spark` variable does not exist in the current runtime, for example because the runtime was restarted after the session cell ran. Re-run the **Initializing a Spark Session** cell before this one.

In [None]:
from pyspark.sql.types import StructType, StructField, FloatType, IntegerType

# Define schema
schema = StructType([
    StructField(f'pixel_{i}', FloatType(), False) for i in range(32 * 32 * 3)
] + [StructField('label', IntegerType(), False)])

print(schema)


StructType([StructField('pixel_0', FloatType(), False), StructField('pixel_1', FloatType(), False), StructField('pixel_2', FloatType(), False), StructField('pixel_3', FloatType(), False), StructField('pixel_4', FloatType(), False), StructField('pixel_5', FloatType(), False), StructField('pixel_6', FloatType(), False), StructField('pixel_7', FloatType(), False), StructField('pixel_8', FloatType(), False), StructField('pixel_9', FloatType(), False), StructField('pixel_10', FloatType(), False), StructField('pixel_11', FloatType(), False), StructField('pixel_12', FloatType(), False), StructField('pixel_13', FloatType(), False), StructField('pixel_14', FloatType(), False), StructField('pixel_15', FloatType(), False), StructField('pixel_16', FloatType(), False), StructField('pixel_17', FloatType(), False), StructField('pixel_18', FloatType(), False), StructField('pixel_19', FloatType(), False), StructField('pixel_20', FloatType(), False), StructField('pixel_21', FloatType(), False), StructFi