# Unit 2 Preprocessing the Data

# Introduction to Preprocessing the Data

Welcome back! Now that you've learned how to load and understand the dataset for drawing recognition, it's time to move on to the next crucial step: **preprocessing the data**. Preprocessing is essential because it prepares your data for the machine learning model, ensuring that it can learn effectively. By the end of this lesson, you'll be equipped with the skills to clean and normalize your dataset, setting the stage for successful model training.

-----

## What You'll Learn

In this lesson, you will learn how to preprocess the dataset to make it suitable for training a drawing recognition model. We'll cover three main tasks: loading and cleaning the data, shuffling and normalizing it, and splitting it into training and testing sets. Here's a glimpse of the code you'll be working with:

```python
import numpy as np
from sklearn.model_selection import train_test_split

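# Categories to load (the same five used in the practice exercises below)
categories = ['cat', 'house', 'airplane', 'apple', 'bicycle']

# Load and prepare data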
data = []
labels = []
for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:15000]
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, random_state=42, stratify=labels
)
```

This code snippet demonstrates how to load, clean, and normalize the data, as well as how to split it into training and testing sets. You'll learn how to ensure your data is in the right format and ready for model training.

-----

## Why It Matters

Preprocessing is a vital step in any machine learning project because it directly impacts the model's performance. By cleaning the data, you remove any inconsistencies or errors that could skew the results. Normalizing the data ensures that all input features are on a similar scale, which helps the model learn more effectively.

Are you excited to dive in? Let's start the practice section and apply these preprocessing techniques to your dataset!

## Loading and Labeling Drawing Data

Great job analyzing the dataset distribution! Now let's move on to the first step of preprocessing: loading and preparing the drawing data.

When working with image datasets for machine learning, it's crucial to ensure all images have consistent dimensions and data types. Inconsistent data can lead to errors or poor model performance later on.
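
As a rough illustration of what "consistent dimensions and data types" can mean in practice, the sketch below validates one category array before it is used. It assumes each file holds flattened 28x28 bitmaps, that is, an array of shape `(N, 784)`, which is what the later reshape to `(-1, 28, 28, 1)` implies. The helper name `check_bitmaps` is hypothetical and not part of the lesson code.

```python
import numpy as np

def check_bitmaps(imgs, side=28):
    """Return imgs as uint8 after verifying each row is a flattened side x side image."""
    imgs = np.asarray(imgs)
    if imgs.ndim != 2 or imgs.shape[1] != side * side:
        raise ValueError(f"expected shape (N, {side * side}), got {imgs.shape}")
    if imgs.dtype != np.uint8:
        # Converting is only safe when values already fit in 0..255;
        # out-of-range values would be silently corrupted by astype.
        if imgs.min() < 0 or imgs.max() > 255:
            raise ValueError("pixel values fall outside the 0-255 range")
        imgs = imgs.astype(np.uint8)
    return imgs

# Synthetic stand-in for one category file: 4 blank drawings stored as int64
fake = np.zeros((4, 784), dtype=np.int64)
clean = check_bitmaps(fake)
print(clean.shape, clean.dtype)
```

A check like this fails early with a readable message instead of surfacing later as a reshape error or as distorted pixel values.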

In this practice, you'll load the drawing files, check their data types, and convert them if necessary. You'll also create label arrays for each category and combine everything into unified arrays.

This foundational step ensures your data is properly formatted before moving on to more advanced preprocessing techniques like normalization and augmentation.

```python
import urllib.request
import numpy as np
import os

# Ensure the data is downloaded
categories = ['cat', 'house', 'airplane', 'apple', 'bicycle']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []

# TODO: Load and prepare data for each category
for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    # TODO: Load up to 15000 images from each category file
    imgs = ________
    
    # TODO: Check if the data type is uint8, if not convert it
    if imgs.dtype != np.uint8:
        imgs = ________
    
    # TODO: Add the images to the data list
    data.append(________) 
    
    # TODO: Create labels for this category (use the category index)
    # and add them to the labels list
    labels.append(________) 

# TODO: Combine all data and labels into single arrays
data = ________
labels = ________

# Print information about the loaded data
print(f"Total number of images: {len(data)}")
print(f"Image data type: {data.dtype}")
print(f"Image shape: {data[0].shape}")
print(f"Label shape: {labels.shape}")
print(f"First 5 labels: {labels[:5]}")

```

Here is the completed code for loading and preparing the drawing data.

### Completed Code

```python
import urllib.request
import numpy as np
import os

# Ensure the data is downloaded
categories = ['cat', 'house', 'airplane', 'apple', 'bicycle']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []

# Load and prepare data for each category
for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    # Load up to 15000 images from each category file
    imgs = np.load(filepath)[:15000]
    
    # Check if the data type is uint8, if not convert it
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    
    # Add the images to the data list
    data.append(imgs) 
    
    # Create labels for this category (use the category index)
    # and add them to the labels list
    labels.append(np.full(imgs.shape[0], idx)) 

# Combine all data and labels into single arrays
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Print information about the loaded data
print(f"Total number of images: {len(data)}")
print(f"Image data type: {data.dtype}")
print(f"Image shape: {data[0].shape}")
print(f"Label shape: {labels.shape}")
print(f"First 5 labels: {labels[:5]}")
```

-----

### Explanation 🧑‍🏫

1.  **Load Images**: `np.load(filepath)[:15000]` loads the `.npy` file for a category and slices it to keep at most the first **15,000 images** (fewer if the file contains fewer).
2.  **Convert Data Type**: `imgs.astype(np.uint8)` converts the image data to `uint8` (unsigned 8-bit integer), a standard and memory-efficient format for grayscale images.
3.  **Append Data**: `data.append(imgs)` adds the array of images for the current category to the `data` list.
4.  **Create and Append Labels**: `np.full(imgs.shape[0], idx)` creates an array where every element is the category index (`idx`). The length of this array (`imgs.shape[0]`) matches the number of images loaded for that category. This label array is then added to the `labels` list.
5.  **Concatenate Arrays**: `np.concatenate(...)` takes the lists of NumPy arrays (`data` and `labels`) and joins them into single, large NumPy arrays, which are necessary for feeding into a machine learning model.
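
The labeling and concatenation steps can be tried on tiny synthetic arrays, without downloading anything. This is a minimal sketch: the two arrays in `per_category` are made-up stand-ins for the per-category image files, and their sizes are arbitrary.

```python
import numpy as np

# Two synthetic "categories" with different image counts (values are arbitrary)
per_category = [
    np.zeros((3, 784), dtype=np.uint8),  # stands in for category 0
    np.ones((2, 784), dtype=np.uint8),   # stands in for category 1
]

data, labels = [], []
for idx, imgs in enumerate(per_category):
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))  # one label per image

# Join the per-category arrays along the sample axis
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

print(data.shape)  # (5, 784)
print(labels)      # [0 0 0 1 1]
```

Note that the labels come out grouped by category, which is why the data is shuffled in the next practice.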

## Shuffling and Normalizing Drawing Data

Great job loading and labeling your drawing data! Now let's continue with essential preprocessing steps: shuffling and normalizing the data.

Shuffling ensures your model doesn't learn patterns based on the order of samples. Normalizing (scaling pixel values from 0-255 to 0-1) helps your model converge faster during training.

In this practice, you'll take your prepared data and:

1.  Shuffle the data and labels together.
2.  Reshape the images for CNN compatibility.
3.  Normalize the pixel values.

These steps are critical for preparing your drawing dataset for effective model training.

```python
import urllib.request
import numpy as np
import os

# Ensure the data is downloaded
categories = ['cat', 'house', 'airplane', 'apple', 'bicycle']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:15000]  # Load up to 15000 images per category
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# TODO: Create an array of indices for shuffling
indices = ________

# TODO: Shuffle the indices
np.random.________

# TODO: Use the shuffled indices to reorder both data and labels
data, labels = ________, ________

# TODO: Reshape the data to (samples, height, width, channels) and normalize to [0,1]
data = data.reshape(________, ________, ________, ________).astype(________) / ________

```

Here is the completed code for shuffling and normalizing the drawing data.

### Completed Code

```python
import urllib.request
import numpy as np
import os

# Ensure the data is downloaded
categories = ['cat', 'house', 'airplane', 'apple', 'bicycle']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:15000]  # Load up to 15000 images per category
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Create an array of indices for shuffling
indices = np.arange(len(data))

# Shuffle the indices
np.random.shuffle(indices)

# Use the shuffled indices to reorder both data and labels
data, labels = data[indices], labels[indices]

# Reshape the data to (samples, height, width, channels) and normalize to [0,1]
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Print information to verify the changes
print(f"Data shape after reshape: {data.shape}")
print(f"Data type after normalization: {data.dtype}")
print(f"Min and max pixel values: {data.min()}, {data.max()}")
print(f"First 5 labels after shuffle: {labels[:5]}")
```

-----

### Explanation 🧑‍🏫

1.  **Create Indices**: `np.arange(len(data))` generates an array of sequential integers from `0` to `len(data) - 1`, one index per image. This creates a reference for the original order of the data.
2.  **Shuffle Indices**: `np.random.shuffle(indices)` shuffles this array of indices in-place. This creates a new, random order.
3.  **Apply Shuffled Order**: `data, labels = data[indices], labels[indices]` uses NumPy's advanced indexing to reorder both the `data` and `labels` arrays according to the shuffled `indices`. Because the same index array is applied to both, each image remains correctly associated with its label.
4.  **Reshape and Normalize**: `data.reshape(-1, 28, 28, 1).astype('float32') / 255.0` performs two operations in one line:
      * `reshape(-1, 28, 28, 1)`: Changes the shape of the data array to be compatible with Convolutional Neural Networks (CNNs), which expect a 4D tensor: `(number_of_samples, height, width, color_channels)`. Here, `-1` tells NumPy to automatically calculate the number of samples.
      * `.astype('float32') / 255.0`: First, it converts the data type from `uint8` to `float32`. Then, it divides every pixel value by `255.0`, scaling the range from `[0, 255]` down to `[0, 1]`. This normalization helps the model train more efficiently.
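
To see that index-based shuffling keeps images and labels aligned, the steps above can be sketched on synthetic data where each image is filled with its own label value. One deliberate difference from the lesson code: this sketch uses a seeded `np.random.default_rng` generator and `permutation` instead of `np.random.shuffle`, purely so the run is repeatable. That choice is an assumption of this example, not something the exercise requires.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (assumption; the lesson uses np.random.shuffle)

# Synthetic data: image i is filled with the value i, and its label is also i
n = 6
data = np.repeat(np.arange(n, dtype=np.uint8)[:, None], 784, axis=1)  # shape (6, 784)
labels = np.arange(n)

indices = rng.permutation(n)  # shuffled copy of 0..n-1
data, labels = data[indices], labels[indices]

# Every image still carries the pixel value that matches its label
assert np.array_equal(data[:, 0], labels)

# Reshape to (samples, height, width, channels) and scale to [0, 1]
scaled = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0
print(scaled.shape, scaled.dtype)  # (6, 28, 28, 1) float32
```

If `data` and `labels` were shuffled with two separate random calls instead of one shared index array, the alignment check in the middle would generally fail.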

## Splitting the Data into Training and Test Sets

Now that you've successfully shuffled and normalized your drawing data, let's split the data into training and test sets.


```python
import urllib.request
import numpy as np
import os
from sklearn.model_selection import train_test_split

# Ensure the data is downloaded
categories = ['cat', 'house', 'airplane', 'apple', 'bicycle']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:15000]  # Load up to 15000 images per category
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# TODO: Split data into training and testing sets and print shapes

```

Here's the completed code to split your preprocessed data into training and test sets.

-----

### Completed Code

```python
import urllib.request
import numpy as np
import os
from sklearn.model_selection import train_test_split

# Ensure the data is downloaded
categories = ['cat', 'house', 'airplane', 'apple', 'bicycle']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:15000]
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, random_state=42, stratify=labels
)

# Print the shapes to verify the split
print("Training data shape:", x_train.shape)
print("Testing data shape:", x_test.shape)
print("Training labels shape:", y_train.shape)
print("Testing labels shape:", y_test.shape)
```

-----

### Explanation 🧑‍🏫

The final step in this preprocessing workflow is to partition the data. This is accomplished using Scikit-learn's `train_test_split` function.

`x_train, x_test, y_train, y_test = train_test_split(...)`

This single line splits both the image data (`data`) and the corresponding `labels` into four new arrays:

  * **`x_train`**: The subset of images used for training the model (80% of the original data in this case).
  * **`x_test`**: The subset of images used for evaluating the model's performance after training (the remaining 20%).
  * **`y_train`**: The labels corresponding to the `x_train` images.
  * **`y_test`**: The labels corresponding to the `x_test` images.

The function's parameters are set as follows:

  * `test_size=0.2`: Specifies that **20%** of the data should be allocated to the test set.
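  * `random_state=42`: Ensures that the split is **reproducible**. Given the same input arrays, anyone who runs this code with the same `random_state` will get the exact same split. Note that the earlier `np.random.shuffle` call is unseeded, so reproducing the whole pipeline from run to run also requires seeding NumPy (for example with `np.random.seed`).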
  * `stratify=labels`: This is an important parameter that ensures the proportion of each drawing category is the same in both the training and testing sets. It prevents a situation where, by random chance, one set has significantly more or fewer images of a certain category than the other.
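
The effect of `stratify` is easiest to see on an imbalanced dataset. The sketch below uses synthetic labels with a made-up 60/30/10 class distribution and all-zero placeholder images. The lesson's real data is roughly balanced, so the numbers here are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels: 60 / 30 / 10 samples per class
labels = np.repeat([0, 1, 2], [60, 30, 10])
data = np.zeros((len(labels), 28, 28, 1), dtype='float32')  # placeholder images

x_train, x_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, random_state=42, stratify=labels
)

# Count samples per class in each subset
print(np.bincount(y_train))  # [48 24  8]
print(np.bincount(y_test))   # [12  6  2]
```

Both subsets keep the original 6:3:1 ratio. Without `stratify`, the per-class counts in the 20-sample test set would vary with the random draw, and the smallest class could end up with very few test samples.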