# Unit 3 Evaluating the Model and Visualizing the Predictions

Welcome back! Now that you've trained your Convolutional Neural Network (CNN) model, it's time to evaluate its performance and visualize its predictions. This lesson will guide you through assessing how well your model performs on unseen data and how to interpret its predictions. Whether you're familiar with model evaluation or this is your first time, this lesson will give you the knowledge and skills to evaluate a CNN model effectively.

## What You'll Learn

In this lesson, you will learn how to evaluate the performance of your CNN model using various metrics. We will cover the steps involved in assessing the model's accuracy and visualizing its predictions. You will also learn how to interpret the results and understand the model's strengths and weaknesses.

Here's a glimpse of the code you'll be working with:

```python
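import os

import matplotlib.pyplot as plt
import numpy as np

# Assumes model, x_test, y_test, and categories are defined as in the previous lessons.

# Evaluate the model on the held-out test set
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")

# Visualize predictions for the first nine test images
def show_predictions(model, x_test, y_test, categories):
    predictions = model.predict(x_test)                # class probabilities per image
    predicted_labels = np.argmax(predictions, axis=1)  # most likely class per image
    plt.figure(figsize=(10, 10))
    for i in range(9):
        plt.subplot(3, 3, i + 1)
        plt.imshow(x_test[i].reshape(28, 28), cmap='gray')
        plt.title(f"True: {categories[y_test[i]]}\nPred: {categories[predicted_labels[i]]}")
        plt.axis('off')
    plt.tight_layout()
    os.makedirs('static/images', exist_ok=True)  # ensure the output folder exists
    plt.savefig('static/images/predictions.png')

show_predictions(model, x_test, y_test, categories)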
```

This code snippet demonstrates how to evaluate your model's accuracy and visualize its predictions. You will understand each step and how it contributes to assessing the model's performance.
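To make the accuracy number less of a black box, here is a minimal sketch of how it can be derived from predicted probabilities. The probability values and labels are made up for illustration and do not come from the lesson's model.

```python
import numpy as np

# Hypothetical softmax outputs for 4 test images and 3 classes
# (illustrative values, not produced by the lesson's model).
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
true_labels = np.array([0, 1, 2, 1])

# argmax picks the index of the most probable class in each row.
predicted = np.argmax(probabilities, axis=1)

# Accuracy is the fraction of predictions that match the true labels.
manual_accuracy = np.mean(predicted == true_labels)

print(predicted)        # [0 1 2 0]
print(manual_accuracy)  # 0.75
```

The last image is misclassified (predicted class 0, true class 1), so three of the four predictions are correct.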

## Understanding Precision and Recall

Precision and recall are two important metrics for evaluating classification models, especially when dealing with imbalanced datasets.

**Precision** measures how many of the items predicted as a certain class are actually of that class. High precision means that when the model predicts a class, it is usually correct.

$$\text{Precision}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}$$

**Recall** (also known as sensitivity) measures how many of the actual items of a certain class the model correctly identified. High recall means the model finds most of the items of that class.

$$\text{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$$

Precision and recall are often in tension: increasing one can sometimes decrease the other. That’s why the F1 score, which combines both, is also useful.
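As a small worked example of the two formulas, the sketch below counts true positives, false positives, and false negatives by hand for a single class. The labels are toy values chosen for illustration, with 1 standing for "cat" and 0 for "not cat".

```python
import numpy as np

# Toy labels, made up for illustration: 1 = "cat", 0 = "not cat".
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 1])

# Count each outcome for the positive class.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted cat, really cat
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted cat, not a cat
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # missed cats

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, fn)                       # 2 1 2
print(f"Precision: {precision:.4f}")    # Precision: 0.6667
print(f"Recall:    {recall:.4f}")       # Recall:    0.5000
```

Here the model is right two out of the three times it says "cat", but it finds only two of the four actual cats.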

You can compute precision and recall for your model as follows:

```python
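import numpy as np
from sklearn.metrics import precision_score, recall_score

# Convert predicted probabilities to class indices
predicted_labels = np.argmax(model.predict(x_test), axis=1)

# Compute precision and recall (macro average for multi-class)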
precision = precision_score(y_test, predicted_labels, average='macro')
recall = recall_score(y_test, predicted_labels, average='macro')
print(f"Precision (macro): {precision:.4f}")
print(f"Recall (macro): {recall:.4f}")
```

## Evaluating with the F1 Score

In addition to accuracy, another important metric for evaluating your CNN model is the **F1 score**. The F1 score is especially useful when your data is imbalanced, meaning some classes have more samples than others. It combines both **precision** (how many of the predicted positives are actually positive) and **recall** (how many of the actual positives were correctly predicted) into a single metric.

The F1 score is calculated as:

$$\text{F1 Score}=2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$$

The F1 score ranges from 0 to 1, and a higher value indicates better model performance. It is most informative when you care equally about precision and recall.
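The formula above is a harmonic mean, which means a low value in either precision or recall pulls the score down sharply. The following sketch illustrates this with a hypothetical helper, `f1_from`, which is not part of scikit-learn and is defined here only for the example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (hypothetical helper for illustration)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)

balanced = f1_from(0.5, 1.0)
lopsided = f1_from(0.9, 0.1)

print(f"{balanced:.4f}")  # 0.6667
print(f"{lopsided:.4f}")  # 0.1800
```

In the second case the arithmetic mean of precision and recall would be 0.5, but the F1 score is only 0.18 because recall is so low.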

Here's how you can compute the F1 score for your model's predictions:

```python
from sklearn.metrics import f1_score

# Get predicted labels
predictions = model.predict(x_test)
predicted_labels = np.argmax(predictions, axis=1)

# Compute F1 score (macro average for multi-class)
f1 = f1_score(y_test, predicted_labels, average='macro')
print(f"F1 Score (macro): {f1:.4f}")
```

By including the F1 score in your evaluation, you gain a more comprehensive understanding of your model's performance, especially in cases where accuracy alone might be misleading.
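To see how accuracy can mislead, consider a sketch with deliberately imbalanced toy data: 90 samples of class 0, 10 samples of class 1, and a "model" that always predicts class 0. The data is made up for illustration; `zero_division=0` is passed so that the class with no predicted samples is scored as 0 without a warning.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced labels: 90 samples of class 0 and 10 of class 1.
y_true = np.array([0] * 90 + [1] * 10)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f"Accuracy:   {acc:.4f}")       # Accuracy:   0.9000
print(f"F1 (macro): {macro_f1:.4f}")  # F1 (macro): 0.4737
```

Accuracy looks strong at 0.90, while the macro F1 score of about 0.47 reveals that one class is never predicted correctly.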

## The Classification Report

The **classification report** is a comprehensive summary of key evaluation metrics for each class in your dataset. It includes precision, recall, F1 score, and support (the number of true instances for each class). This report helps you quickly assess how well your model is performing for each individual class, not just overall.

Here’s how you can generate and display a classification report:

```python
from sklearn.metrics import classification_report

# Generate classification report
report = classification_report(y_test, predicted_labels, target_names=categories)
print(report)
```

The output will look something like this:

```
              precision    recall  f1-score   support

       apple       0.85      0.80      0.82        50
      banana       0.78      0.84      0.81        50
       chair       0.90      0.88      0.89        50

    accuracy                           0.84       150
   macro avg       0.84      0.84      0.84       150
weighted avg       0.84      0.84      0.84       150
```

Each row shows the metrics for a specific class, while the averages at the bottom summarize overall performance. This detailed breakdown helps you identify which classes your model predicts well and which may need more attention.
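If you want to work with these numbers programmatically rather than read them from printed text, `classification_report` also accepts `output_dict=True`. The sketch below uses toy labels (not the lesson's test set) to pick out the class with the lowest F1 score.

```python
from sklearn.metrics import classification_report

# Toy labels for three classes, made up for illustration.
class_names = ['cat', 'house', 'apple']
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0]

# output_dict=True returns a nested dictionary instead of a formatted string.
report = classification_report(
    y_true, y_pred, target_names=class_names, output_dict=True
)

# Collect the F1 score of each class and find the weakest one.
f1_by_class = {name: report[name]['f1-score'] for name in class_names}
weakest = min(f1_by_class, key=f1_by_class.get)

print(f"Weakest class: {weakest} (F1 = {f1_by_class[weakest]:.4f})")
# Weakest class: cat (F1 = 0.5714)
```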

## Understanding the Confusion Matrix and Heatmap

Another valuable tool for evaluating your CNN model is the **confusion matrix**. A confusion matrix provides a detailed breakdown of your model’s predictions by showing how many times each class was correctly or incorrectly predicted. Each row of the matrix represents the actual class, while each column represents the predicted class.

  * **Diagonal elements** (from top-left to bottom-right) show the number of correct predictions for each class.
  * **Off-diagonal elements** indicate misclassifications, showing where the model confused one class for another.

This matrix helps you identify specific classes your model struggles with, which is especially useful for multi-class problems like sketch recognition.
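The row and column convention described above can be made concrete by building a small confusion matrix by hand. This is a minimal sketch with toy labels; in practice you would typically rely on scikit-learn's `confusion_matrix`, which follows the same layout.

```python
import numpy as np

# Toy labels for three classes, made up for illustration.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 0, 1, 1, 2, 2, 2, 0]

num_classes = 3
cm = np.zeros((num_classes, num_classes), dtype=int)

# Row index = actual class, column index = predicted class.
for actual, predicted in zip(y_true, y_pred):
    cm[actual, predicted] += 1

correct = int(np.trace(cm))     # sum of the diagonal
errors = int(cm.sum()) - correct  # everything off the diagonal

print(cm)
# [[2 1 0]
#  [0 2 1]
#  [1 0 2]]
print(correct, errors)  # 6 3
```

For instance, `cm[0, 1] == 1` means one sample of class 0 was predicted as class 1.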

To make the confusion matrix easier to interpret, you can visualize it as a **heatmap**. A heatmap uses color intensity to represent the values in the confusion matrix, making patterns and problem areas stand out visually.

Here’s how you can compute and visualize the confusion matrix as a heatmap:

```python
import os

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Compute confusion matrix (rows: true labels, columns: predicted labels)
cm = confusion_matrix(y_test, predicted_labels)

# Plot confusion matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=categories, yticklabels=categories)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.tight_layout()
os.makedirs('static/images', exist_ok=True)  # ensure the output folder exists
plt.savefig('static/images/confusion_matrix.png')
```

By analyzing the confusion matrix and its heatmap, you can quickly spot which classes are most often confused with each other. This insight can guide you in refining your model or dataset to improve overall performance.
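Raw counts can be hard to compare when classes have different numbers of test samples. One common option, sketched here with hypothetical counts rather than real results, is to normalize each row so that it sums to 1. Each cell then reads as "the fraction of this true class that was predicted as that class".

```python
import numpy as np

# Hypothetical confusion matrix counts (rows: true, columns: predicted).
cm = np.array([
    [540,  40,  20],
    [ 30, 560,  10],
    [ 25,  15, 560],
])

# Divide each row by its total so every row sums to 1.
row_totals = cm.sum(axis=1, keepdims=True)
cm_normalized = cm / row_totals

print(np.round(cm_normalized, 3))
```

A normalized matrix like this can be passed to `sns.heatmap` in place of the raw counts, using `fmt='.2f'` instead of `fmt='d'` for the annotations.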

## Why It Matters

Evaluating a CNN model is a crucial step in the machine learning process. It allows you to measure the model's accuracy and identify areas for improvement. By mastering the evaluation process, you will be able to create models that can reliably recognize and classify images. This skill is essential for developing robust AI systems and advancing research in machine learning.

Are you excited to see how well your model performs? Let's dive into the practice section and start evaluating your CNN model for sketch recognition!




## Visualizing Loss and Accuracy Together

Great job saving and loading your sketch recognition model! Now, let's enhance your evaluation toolkit by visualizing both accuracy and loss metrics. This dual visualization will give you deeper insights into how your model learns over time and help you identify potential issues like overfitting.

The `history` object returned by `model.fit()` contains all metrics tracked during training. You've already seen how to plot accuracy; now you'll expand this to include loss metrics in a side-by-side comparison.

Plotting both metrics side by side is standard practice for understanding model performance. For example, a validation loss that starts rising while the training loss keeps falling is a typical sign of overfitting, which can help you decide when to stop training or how to adjust your model architecture.

```python
import os
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", message="Your `PyDataset` class should call", category=UserWarning)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  

import urllib.request
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import matplotlib.pyplot as plt

# Data loading and preprocessing (already done for you)
categories = ['cat', 'house', 'apple']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []
IMAGE_COUNT = 3000

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:IMAGE_COUNT]
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
print(f"Training data shape: {x_train.shape}, Testing data shape: {x_test.shape}")

# Create data augmentation generator
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    shear_range=0.1,
    fill_mode='nearest'
)

# Fit the generator to the training data
datagen.fit(x_train)

def build_simple_cnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28,28,1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(len(categories), activation='softmax')
    ])
    return model

# Build the model
model = build_simple_cnn()

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Print model summary
model.summary()

# Train the model for a few epochs
history = model.fit(datagen.flow(x_train, y_train, batch_size=32),
                   epochs=3,
                   validation_data=(x_test, y_test),
                   steps_per_epoch=len(x_train) // 32)

# Evaluate the original model
original_loss, original_accuracy = model.evaluate(x_test, y_test)
print(f"Original model - Test accuracy: {original_accuracy:.4f}")

# Save the model to disk
model.save('sketch_recognition_model.keras')
print("Model saved successfully!")

# Load the model from disk
loaded_model = tf.keras.models.load_model('sketch_recognition_model.keras')
print("Model loaded successfully!")

# Evaluate the loaded model
loaded_loss, loaded_accuracy = loaded_model.evaluate(x_test, y_test)
print(f"Loaded model - Test accuracy: {loaded_accuracy:.4f}")

# Compare the performance
print(f"Accuracy difference: {abs(original_accuracy - loaded_accuracy):.6f}")

# Check if the models perform identically
if abs(original_accuracy - loaded_accuracy) < 1e-6:
    print("Success! The loaded model performs identically to the original model.")
else:
    print("There might be small differences between the original and loaded models.")

# Create a figure with two subplots
plt.figure(figsize=(12, 5))

# Plot training & validation accuracy values
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')

# TODO: Create the second subplot (1, 2, 2) for loss values
# TODO: Plot the training loss from history.history['loss']
# TODO: Plot the validation loss from history.history['val_loss']
# TODO: Add appropriate title, labels, and legend to the loss plot
plt.subplot(________) 
plt.plot(________) 
plt.plot(________) 
plt.title(________) 
plt.ylabel(________) 
plt.xlabel(________) 
plt.legend(________, loc=________) 

plt.tight_layout()
os.makedirs('static/images', exist_ok=True)  # ensure the output folder exists
plt.savefig('static/images/plot.png')
print("Training metrics visualization saved to 'static/images/plot.png'")

```

The completed version of this exercise fills in the second subplot with the training and validation loss:

```python
import os
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", message="Your `PyDataset` class should call", category=UserWarning)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  

import urllib.request
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import matplotlib.pyplot as plt

# Data loading and preprocessing (already done for you)
categories = ['cat', 'house', 'apple']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []
IMAGE_COUNT = 3000

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:IMAGE_COUNT]
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
print(f"Training data shape: {x_train.shape}, Testing data shape: {x_test.shape}")

# Create data augmentation generator
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    shear_range=0.1,
    fill_mode='nearest'
)

# Fit the generator to the training data
datagen.fit(x_train)

def build_simple_cnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28,28,1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(len(categories), activation='softmax')
    ])
    return model

# Build the model
model = build_simple_cnn()

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Print model summary
model.summary()

# Train the model for a few epochs
history = model.fit(datagen.flow(x_train, y_train, batch_size=32),
                   epochs=3,
                   validation_data=(x_test, y_test),
                   steps_per_epoch=len(x_train) // 32)

# Evaluate the original model
original_loss, original_accuracy = model.evaluate(x_test, y_test)
print(f"Original model - Test accuracy: {original_accuracy:.4f}")

# Save the model to disk
model.save('sketch_recognition_model.keras')
print("Model saved successfully!")

# Load the model from disk
loaded_model = tf.keras.models.load_model('sketch_recognition_model.keras')
print("Model loaded successfully!")

# Evaluate the loaded model
loaded_loss, loaded_accuracy = loaded_model.evaluate(x_test, y_test)
print(f"Loaded model - Test accuracy: {loaded_accuracy:.4f}")

# Compare the performance
print(f"Accuracy difference: {abs(original_accuracy - loaded_accuracy):.6f}")

# Check if the models perform identically
if abs(original_accuracy - loaded_accuracy) < 1e-6:
    print("Success! The loaded model performs identically to the original model.")
else:
    print("There might be small differences between the original and loaded models.")

# Create a figure with two subplots
plt.figure(figsize=(12, 5))

# Plot training & validation accuracy values
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')

# Create the second subplot (1, 2, 2) for loss values
plt.subplot(1, 2, 2) 
# Plot the training loss from history.history['loss']
plt.plot(history.history['loss']) 
# Plot the validation loss from history.history['val_loss']
plt.plot(history.history['val_loss']) 
# Add appropriate title, labels, and legend to the loss plot
plt.title('Model loss') 
plt.ylabel('Loss') 
plt.xlabel('Epoch') 
plt.legend(['Train', 'Validation'], loc='upper right') 

plt.tight_layout()
os.makedirs('static/images', exist_ok=True)  # ensure the output folder exists
plt.savefig('static/images/plot.png')
print("Training metrics visualization saved to 'static/images/plot.png'")
```
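Besides reading the curves by eye, you can inspect the same numbers directly. The sketch below uses a hypothetical dictionary shaped like `history.history` (the values are invented and the run is longer than the three epochs trained above) to find the epoch with the lowest validation loss and the final gap between validation and training loss.

```python
import numpy as np

# Hypothetical values shaped like history.history (not real training output).
history_dict = {
    'loss':     [0.90, 0.55, 0.40, 0.30, 0.22],
    'val_loss': [0.80, 0.60, 0.52, 0.55, 0.61],
}

# Epoch (0-based index) where validation loss was lowest.
best_epoch = int(np.argmin(history_dict['val_loss']))

# Gap between validation and training loss at the end of training.
final_gap = history_dict['val_loss'][-1] - history_dict['loss'][-1]

print(f"Lowest validation loss at epoch {best_epoch + 1}")  # epoch 3
print(f"Final val/train loss gap: {final_gap:.2f}")         # 0.39
```

In this invented run, validation loss bottoms out at epoch 3 and then rises while training loss keeps dropping, which is the pattern usually associated with overfitting.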

## Creating Confusion Matrix for Sketch Recognition

Now that you've visualized your model's training metrics, let's create a confusion matrix to gain deeper insights into your model's performance. A confusion matrix shows how many predictions were correct for each class and reveals patterns in misclassifications.

In this task, you'll create and visualize a confusion matrix for your sketch recognition model, complete with proper category labels. This visualization will help you identify which sketch types are most challenging for your model to distinguish.

The confusion matrix is a standard evaluation tool in machine learning that complements accuracy metrics by showing exactly where your model makes mistakes. This information is invaluable for improving your model's performance on specific classes.


```python
import os
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", message="Your `PyDataset` class should call", category=UserWarning)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  

import urllib.request
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Data loading and preprocessing (already done for you)
categories = ['cat', 'house', 'apple']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []
IMAGE_COUNT = 3000

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:IMAGE_COUNT]
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
print(f"Training data shape: {x_train.shape}, Testing data shape: {x_test.shape}")

# Create data augmentation generator
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    shear_range=0.1,
    fill_mode='nearest'
)

# Fit the generator to the training data
datagen.fit(x_train)

def build_simple_cnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28,28,1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(len(categories), activation='softmax')
    ])
    return model

# Build the model
model = build_simple_cnn()

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Print model summary
model.summary()

# Train the model for a few epochs
history = model.fit(datagen.flow(x_train, y_train, batch_size=32),
                   epochs=3,
                   validation_data=(x_test, y_test),
                   steps_per_epoch=len(x_train) // 32)

# Evaluate the original model
original_loss, original_accuracy = model.evaluate(x_test, y_test)
print(f"Original model - Test accuracy: {original_accuracy:.4f}")

# Save the model to disk
model.save('sketch_recognition_model.keras')
print("Model saved successfully!")

# TODO: Get model predictions on test data (convert from probabilities to class indices)
y_pred = model.predict(________) 
y_pred_classes = np.argmax(________, axis=________) 

# TODO: Create the confusion matrix using sklearn's confusion_matrix function
cm = confusion_matrix(________, ________) 

# TODO: Plot the confusion matrix as a heatmap using seaborn's heatmap function.
#       - Set annot=True to show numbers in each cell.
#       - Set fmt='d' for integer formatting.
#       - Use cmap='Blues' for color.
#       - Set xticklabels and yticklabels to the category names.
#       - Add axis labels and a title.

# TODO: For each category, print the accuracy for that class.
#       - For each row in the confusion matrix, calculate the number of correct predictions (diagonal value).
#       - Calculate the total number of true samples for that class (sum of the row).
#       - Print the accuracy as a percentage and the count of correct predictions.

```

The completed version of this exercise computes the predictions, builds the confusion matrix, plots it as a heatmap, and prints the accuracy for each class:

```python
import os
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", message="Your `PyDataset` class should call", category=UserWarning)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  

import urllib.request
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Data loading and preprocessing (already done for you)
categories = ['cat', 'house', 'apple']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []
IMAGE_COUNT = 3000

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:IMAGE_COUNT]
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
print(f"Training data shape: {x_train.shape}, Testing data shape: {x_test.shape}")

# Create data augmentation generator
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    shear_range=0.1,
    fill_mode='nearest'
)

# Fit the generator to the training data
datagen.fit(x_train)

def build_simple_cnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28,28,1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(len(categories), activation='softmax')
    ])
    return model

# Build the model
model = build_simple_cnn()

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Print model summary
model.summary()

# Train the model for a few epochs
history = model.fit(datagen.flow(x_train, y_train, batch_size=32),
                   epochs=3,
                   validation_data=(x_test, y_test),
                   steps_per_epoch=len(x_train) // 32)

# Evaluate the original model
original_loss, original_accuracy = model.evaluate(x_test, y_test)
print(f"Original model - Test accuracy: {original_accuracy:.4f}")

# Save the model to disk
model.save('sketch_recognition_model.keras')
print("Model saved successfully!")

# Get model predictions on test data (convert from probabilities to class indices)
y_pred = model.predict(x_test) 
y_pred_classes = np.argmax(y_pred, axis=1) 

# Create the confusion matrix using sklearn's confusion_matrix function
cm = confusion_matrix(y_test, y_pred_classes) 

# Plot the confusion matrix as a heatmap using seaborn's heatmap function.
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=categories, yticklabels=categories)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.tight_layout()
os.makedirs('static/images', exist_ok=True)  # ensure the output folder exists
plt.savefig('static/images/confusion_matrix.png')
print("Confusion matrix visualization saved to 'static/images/confusion_matrix.png'")

# For each category, print the accuracy for that class.
print("\nAccuracy per class:")
for i, category in enumerate(categories):
    correct_predictions = cm[i, i]
    total_true_samples = np.sum(cm[i, :])
    if total_true_samples > 0:
        accuracy = (correct_predictions / total_true_samples) * 100
        print(f"  {category}: {accuracy:.2f}% ({correct_predictions}/{total_true_samples} correct)")
    else:
        print(f"  {category}: No true samples for this class in the test set.")

```
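The per-class accuracy computed in the loop above (diagonal value divided by the row total) is the same quantity as per-class recall. As a sketch with hypothetical counts, both recall and precision can be read off a confusion matrix in a vectorized way:

```python
import numpy as np

# Hypothetical confusion matrix (rows: true labels, columns: predicted labels).
cm = np.array([
    [8, 2, 0],
    [1, 9, 0],
    [3, 0, 7],
])

diagonal = np.diag(cm)  # correct predictions per class

# Recall: correct predictions divided by the number of true samples (row sums).
recall_per_class = diagonal / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions (column sums).
precision_per_class = diagonal / cm.sum(axis=0)

print(np.round(recall_per_class, 4))     # [0.8 0.9 0.7]
print(np.round(precision_per_class, 4))  # [0.6667 0.8182 1.    ]
```

This version assumes every class appears at least once in both the true and predicted labels; otherwise a row or column sum would be zero and the division would need a guard like the `if` check in the loop above.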

## Finding Most Confused Category Pairs

Great job creating the confusion matrix! Now, let's take your analysis one step further by identifying the most confused pairs of categories. This is a crucial skill for model improvement — knowing exactly which classes your model struggles to distinguish helps you focus your efforts on the right areas.

In this task, you'll implement a function that analyzes the confusion matrix to find and rank pairs of categories that are most frequently confused with each other. For example, your model might often mistake cat sketches for house sketches, or vice versa.

Understanding these confusion patterns is valuable for targeting specific weaknesses in your model, potentially collecting more training data for problematic categories, or considering feature engineering to better distinguish similar classes.
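One possible way to approach this, shown as a sketch rather than the expected solution, is to add up the errors in both directions for each pair of classes and sort the pairs by that total. The function name `most_confused_pairs`, its `top_k` parameter, and the counts in `cm` are all hypothetical and chosen only for this example.

```python
import numpy as np

def most_confused_pairs(cm, class_names, top_k=3):
    """Rank unordered class pairs by total confusions in both directions.

    Hypothetical helper for illustration; cm has true labels as rows
    and predicted labels as columns.
    """
    pairs = []
    n = len(class_names)
    for i in range(n):
        for j in range(i + 1, n):
            # cm[i, j]: class i mistaken for class j; cm[j, i]: the reverse.
            count = int(cm[i, j] + cm[j, i])
            pairs.append((class_names[i], class_names[j], count))
    # Most frequently confused pairs first.
    pairs.sort(key=lambda item: item[2], reverse=True)
    return pairs[:top_k]

# Hypothetical confusion matrix counts, not real model output.
cm = np.array([
    [540,  40,  20],
    [ 30, 560,  10],
    [ 25,  15, 560],
])
ranked = most_confused_pairs(cm, ['cat', 'house', 'apple'])

for first, second, count in ranked:
    print(f"{first} <-> {second}: {count} confusions")
# cat <-> house: 70 confusions
# cat <-> apple: 45 confusions
# house <-> apple: 25 confusions
```

Summing both directions treats "cat mistaken for house" and "house mistaken for cat" as the same problem. If the direction matters for your analysis, you could rank the individual off-diagonal cells instead.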


```python

import os
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", message="Your `PyDataset` class should call", category=UserWarning)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  

import urllib.request
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Data loading and preprocessing (already done for you)
categories = ['cat', 'house', 'apple']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []
IMAGE_COUNT = 3000

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:IMAGE_COUNT]
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
print(f"Training data shape: {x_train.shape}, Testing data shape: {x_test.shape}")

# Create data augmentation generator
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    shear_range=0.1,
    fill_mode='nearest'
)

# Fit the generator to the training data
datagen.fit(x_train)

def build_simple_cnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28,28,1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(len(categories), activation='softmax')
    ])
    return model

# Build the model
model = build_simple_cnn()

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Print model summary
model.summary()

# Train the model for a few epochs
history = model.fit(datagen.flow(x_train, y_train, batch_size=32),
                   epochs=3,
                   validation_data=(x_test, y_test),
                   steps_per_epoch=len(x_train) // 32)

# Evaluate the original model
original_loss, original_accuracy = model.evaluate(x_test, y_test)
print(f"Original model - Test accuracy: {original_accuracy:.4f}")

# Get predictions on test data
y_pred = model.predict(x_test)
y_pred_classes = np.argmax(y_pred, axis=1)

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred_classes)

# TODO: Define a function to find the most confused pairs
# The function should:
# 1. Take the confusion matrix and category list as input
# 2. Create a list to store tuples of (true_category, predicted_category, count)
# 3. Loop through each off-diagonal cell and append its (true_category, predicted_category, count) tuple
# 4. Sort the list by count in descending order
# 5. Return the sorted list
def find_confused_pairs(cm, categories):
    n_classes = len(________) 
    confused_pairs = []
    
    for i in range(________): 
        for j in range(________): 
            if ________:  # Skip diagonal (correct predictions) 
                confused_pairs.append((________, ________, ________)) 
    
    # Sort by confusion count (descending)
    confused_pairs.sort(key=lambda x: ________, reverse=________) 
    return confused_pairs

# TODO: Call the function and print the results
confused_pairs = find_confused_pairs(________, ________) 
print("\nMost Confused Category Pairs:")
print("-----------------------------")
for true_cat, pred_cat, count in confused_pairs:
    print(f"True: {________}, Predicted: {________}, Count: {________}") 

# TODO: Calculate and print confusion as percentage of true class
print("\nConfusion as Percentage of True Class:")
print("------------------------------------")
for true_cat, pred_cat, count in confused_pairs:
    true_idx = categories.index(________) 
    total = np.sum(cm[________, :]) 
    percentage = (________ / ________) * 100 
    print(f"True: {true_cat}, Predicted: {pred_cat}: {percentage:.2f}%")

```

## Finding Most Confused Category Pairs

Great job creating the confusion matrix! Now, let's take your analysis one step further by identifying the most confused pairs of categories. This is a crucial skill for model improvement: knowing exactly which classes your model struggles to distinguish helps you focus your efforts on the right areas.

In this task, you'll implement a function that analyzes the confusion matrix to find and rank pairs of categories that are most frequently confused with each other. For example, your model might often mistake cat sketches for house sketches, or vice versa.

Understanding these confusion patterns helps you target specific weaknesses in your model, for example by collecting more training data for problematic categories or by engineering features that better distinguish similar classes.

```python
import os
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", message="Your `PyDataset` class should call", category=UserWarning)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  

import urllib.request
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Data loading and preprocessing (already done for you)
categories = ['cat', 'house', 'apple']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []
IMAGE_COUNT = 3000

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:IMAGE_COUNT]
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
print(f"Training data shape: {x_train.shape}, Testing data shape: {x_test.shape}")

# Create data augmentation generator
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    shear_range=0.1,
    fill_mode='nearest'
)

# Fit the generator to the training data
datagen.fit(x_train)

def build_simple_cnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28,28,1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(len(categories), activation='softmax')
    ])
    return model

# Build the model
model = build_simple_cnn()

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Print model summary
model.summary()

# Train the model for a few epochs
history = model.fit(datagen.flow(x_train, y_train, batch_size=32),
                   epochs=3,
                   validation_data=(x_test, y_test),
                   steps_per_epoch=len(x_train) // 32)

# Evaluate the original model
original_loss, original_accuracy = model.evaluate(x_test, y_test)
print(f"Original model - Test accuracy: {original_accuracy:.4f}")

# Get predictions on test data
y_pred = model.predict(x_test)
y_pred_classes = np.argmax(y_pred, axis=1)

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred_classes)

# Define a function to find the most confused pairs
def find_confused_pairs(cm, categories):
    n_classes = len(categories) 
    confused_pairs = []
    
    for i in range(n_classes): 
        for j in range(n_classes): 
            if i != j:  # Skip diagonal (correct predictions) 
                confused_pairs.append((categories[i], categories[j], cm[i, j])) 
    
    # Sort by confusion count (descending)
    confused_pairs.sort(key=lambda x: x[2], reverse=True) 
    return confused_pairs

# Call the function and print the results
confused_pairs = find_confused_pairs(cm, categories) 
print("\nMost Confused Category Pairs:")
print("-----------------------------")
for true_cat, pred_cat, count in confused_pairs:
    print(f"True: {true_cat}, Predicted: {pred_cat}, Count: {count}") 

# Calculate and print confusion as percentage of true class
print("\nConfusion as Percentage of True Class:")
print("------------------------------------")
for true_cat, pred_cat, count in confused_pairs:
    true_idx = categories.index(true_cat) 
    total = np.sum(cm[true_idx, :]) 
    if total > 0:
        percentage = (count / total) * 100 
        print(f"True: {true_cat}, Predicted: {pred_cat}: {percentage:.2f}%")
    else:
        print(f"True: {true_cat}, Predicted: {pred_cat}: No true samples for this class.")
```
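
The function above treats "cat predicted as house" and "house predicted as cat" as two separate entries. If you care about which two categories are hard to tell apart regardless of direction, one option is to add the confusion matrix to its transpose and rank each unordered pair once. The sketch below is a minimal illustration under that assumption; `find_symmetric_confusions` and the toy matrix are hypothetical and not part of the lesson code.

```python
import numpy as np

def find_symmetric_confusions(cm, categories):
    """Rank unordered category pairs by total confusion in both directions."""
    cm = np.asarray(cm)
    # combined[i, j] = mistakes i->j plus mistakes j->i
    combined = cm + cm.T
    pairs = []
    n_classes = len(categories)
    for i in range(n_classes):
        for j in range(i + 1, n_classes):  # upper triangle: each pair once
            pairs.append((categories[i], categories[j], int(combined[i, j])))
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs

# Made-up confusion matrix (rows = true class, columns = predicted class)
toy_cm = np.array([
    [550, 30, 20],   # true cat
    [10, 580, 10],   # true house
    [40, 5, 555],    # true apple
])
toy_categories = ['cat', 'house', 'apple']

symmetric_pairs = find_symmetric_confusions(toy_cm, toy_categories)
for first, second, count in symmetric_pairs:
    print(f"{first} <-> {second}: {count} confusions")
```

With these made-up numbers, cat and apple come out as the most confused pair because the two directions (20 and 40) are summed.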

## Detailed Metrics for Sketch Recognition Performance

After analyzing confusion patterns between categories, let's generate a more comprehensive evaluation using a classification report. This powerful tool provides detailed metrics for each sketch category that go beyond simple accuracy.

A classification report includes three key metrics:

  * **Precision**: How many of the sketches identified as a certain category are actually that category
  * **Recall**: How many of the actual sketches of a category were correctly identified
  * **F1-score**: A balanced measure combining precision and recall

These metrics reveal nuanced insights about your model's performance on each category. For instance, your model might correctly identify most apple sketches (high recall) but sometimes mistakenly classify other sketches as apples (lower precision).

In this task, you'll generate a classification report and add a custom per-class analysis to better understand your model's strengths and weaknesses with each sketch type.

```python
import os
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", message="Your `PyDataset` class should call", category=UserWarning)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  

import urllib.request
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
from sklearn.metrics import classification_report

# Data loading and preprocessing (already done for you)
categories = ['cat', 'house', 'apple']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []
IMAGE_COUNT = 3000

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:IMAGE_COUNT]
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
print(f"Training data shape: {x_train.shape}, Testing data shape: {x_test.shape}")

# Create data augmentation generator
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    shear_range=0.1,
    fill_mode='nearest'
)

# Fit the generator to the training data
datagen.fit(x_train)

def build_simple_cnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28,28,1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(len(categories), activation='softmax')
    ])

    model.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    return model

# Build the model
model = build_simple_cnn()

# Print model summary
model.summary()

# Train the model for a few epochs
history = model.fit(datagen.flow(x_train, y_train, batch_size=32),
                   epochs=3,
                   validation_data=(x_test, y_test),
                   steps_per_epoch=len(x_train) // 32)

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")

# TODO: Get model predictions on the test data
y_pred = model.predict(________) 

# TODO: Convert the prediction probabilities to class indices
y_pred_classes = np.argmax(________, axis=________) 

# TODO: Generate a classification report using sklearn's classification_report function
report = classification_report(________, ________, target_names=________) 

# TODO: Print the classification report with a descriptive header
print("\nClassification Report:")
print(________) 

# TODO: Add a per-class analysis section that reports accuracy and sample count for each category
print("\nPer-class Analysis:")
print("------------------")
for i, category in enumerate(categories):
    # TODO: Find indices for this class
    class_indices = np.where(________ == i)[0] 
    
    # TODO: Get predictions and true labels for this class only
    class_pred = y_pred_classes[________] 
    class_true = y_test[________] 
    
    # TODO: Calculate accuracy for this class
    class_accuracy = np.mean(________ == ________) 
    
    # TODO: Count total samples for this class
    total_samples = len(________) 
    
    # TODO: Print the results for this class
    print(f"{category}: {________:.2%} accuracy ({________} samples)") 

```

## Detailed Metrics for Sketch Recognition Performance

After analyzing confusion patterns between categories, let's generate a more comprehensive evaluation using a classification report. This powerful tool provides detailed metrics for each sketch category that go beyond simple accuracy.

A classification report includes three key metrics:

  * **Precision**: How many of the sketches identified as a certain category are actually that category
  * **Recall**: How many of the actual sketches of a category were correctly identified
  * **F1-score**: A balanced measure combining precision and recall

These metrics reveal nuanced insights about your model's performance on each category. For instance, your model might correctly identify most apple sketches (high recall) but sometimes mistakenly classify other sketches as apples (lower precision).
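
To see where these numbers come from, here is a small sketch that derives precision, recall, and F1-score directly from a confusion matrix. The matrix values are made up for illustration (they are not results from the lesson's model), and the sketch assumes every class appears at least once as a true label and as a prediction, so no division by zero occurs.

```python
import numpy as np

# Made-up confusion matrix (rows = true class, columns = predicted class)
toy_cm = np.array([
    [40, 4, 16],   # true cat
    [3, 53, 4],    # true house
    [2, 2, 56],    # true apple
])
toy_categories = ['cat', 'house', 'apple']

true_positives = np.diag(toy_cm)        # correct predictions per class
predicted_totals = toy_cm.sum(axis=0)   # column sums: how often each class was predicted
actual_totals = toy_cm.sum(axis=1)      # row sums: how many true samples each class has

precision = true_positives / predicted_totals
recall = true_positives / actual_totals
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(toy_categories, precision, recall, f1):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}, f1={f:.2f}")
```

In this toy matrix the apple class mirrors the situation described above: most apples are found (high recall), but many cat sketches are also labeled as apples, which pulls apple precision down.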

In this task, you'll generate a classification report and add a custom per-class analysis to better understand your model's strengths and weaknesses with each sketch type.

```python
import os
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", message="Your `PyDataset` class should call", category=UserWarning)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  

import urllib.request
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
from sklearn.metrics import classification_report

# Data loading and preprocessing (already done for you)
categories = ['cat', 'house', 'apple']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []
IMAGE_COUNT = 3000

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:IMAGE_COUNT]
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
print(f"Training data shape: {x_train.shape}, Testing data shape: {x_test.shape}")

# Create data augmentation generator
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    shear_range=0.1,
    fill_mode='nearest'
)

# Fit the generator to the training data
datagen.fit(x_train)

def build_simple_cnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28,28,1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(len(categories), activation='softmax')
    ])

    model.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    return model

# Build the model
model = build_simple_cnn()

# Print model summary
model.summary()

# Train the model for a few epochs
history = model.fit(datagen.flow(x_train, y_train, batch_size=32),
                   epochs=3,
                   validation_data=(x_test, y_test),
                   steps_per_epoch=len(x_train) // 32)

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")

# Get model predictions on the test data
y_pred = model.predict(x_test) 

# Convert the prediction probabilities to class indices
y_pred_classes = np.argmax(y_pred, axis=1) 

# Generate a classification report using sklearn's classification_report function
report = classification_report(y_test, y_pred_classes, target_names=categories) 

# Print the classification report with a descriptive header
print("\nClassification Report:")
print(report) 

# Add a per-class analysis section that reports accuracy and sample count for each category
print("\nPer-class Analysis:")
print("------------------")
for i, category in enumerate(categories):
    # Find indices for this class
    class_indices = np.where(y_test == i)[0] 
    
    # Get predictions and true labels for this class only
    class_pred = y_pred_classes[class_indices] 
    class_true = y_test[class_indices] 
    
    # Calculate accuracy for this class
    class_accuracy = np.mean(class_pred == class_true) 
    
    # Count total samples for this class
    total_samples = len(class_indices) 
    
    # Print the results for this class
    print(f"{category}: {class_accuracy:.2%} accuracy ({total_samples} samples)")
```
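
One detail worth noticing: because the per-class analysis only looks at samples whose true label is the class in question, the "accuracy" it prints is the same quantity as that class's recall. The sketch below checks this on a handful of made-up labels; the arrays are hypothetical and the confusion matrix is built with plain NumPy to keep the example self-contained.

```python
import numpy as np

# Made-up true labels and predictions for three classes
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
toy_pred = np.array([0, 0, 1, 2, 1, 1, 0, 2, 2, 2])
n_classes = 3

# Build the confusion matrix: rows = true class, columns = predicted class
toy_cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(toy_cm, (toy_true, toy_pred), 1)

# Recall from the confusion matrix: diagonal divided by row sums
recall_from_cm = np.diag(toy_cm) / toy_cm.sum(axis=1)

# Per-class accuracy computed the same way as in the per-class analysis loop
per_class_accuracy = np.array([
    np.mean(toy_pred[toy_true == k] == k) for k in range(n_classes)
])

print("Recall from confusion matrix:", recall_from_cm)
print("Per-class accuracy:          ", per_class_accuracy)
```

Both lines print the same values, so the per-class section is best read as a recall breakdown with sample counts attached.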

## Visualizing Misclassified Sketches in Action

After analyzing your model's performance with metrics and confusion patterns, let's examine what is actually happening with misclassified sketches. Seeing examples of where your model makes mistakes provides valuable insights that numbers alone cannot capture.

In this task, you will identify sketches that your model classified incorrectly and display them in a grid. This visual inspection will help you spot patterns in your model's errors — perhaps certain sketches are consistently misinterpreted due to similar visual features.

By examining misclassified examples, you can better understand your model's weaknesses and potentially improve its architecture or training data to address these specific issues.


```python
import os
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", message="Your `PyDataset` class should call", category=UserWarning)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  

import urllib.request
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import matplotlib.pyplot as plt

# Data loading and preprocessing (already done for you)
categories = ['cat', 'house', 'apple']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []
IMAGE_COUNT = 3000

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:IMAGE_COUNT]
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
print(f"Training data shape: {x_train.shape}, Testing data shape: {x_test.shape}")

# Create data augmentation generator
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    shear_range=0.1,
    fill_mode='nearest'
)

# Fit the generator to the training data
datagen.fit(x_train)

def build_simple_cnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28,28,1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(len(categories), activation='softmax')
    ])

    model.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    return model

# Build the model
model = build_simple_cnn()

# Print model summary
model.summary()

# Train the model for a few epochs
history = model.fit(datagen.flow(x_train, y_train, batch_size=32),
                   epochs=3,
                   validation_data=(x_test, y_test),
                   steps_per_epoch=len(x_train) // 32)

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")

# Get predictions
y_pred = model.predict(x_test)
y_pred_classes = np.argmax(y_pred, axis=1)

# TODO: Find indices of misclassified images (where predicted class != true class)
misclassified_indices = np.where(________ != ________)[0]
print(f"Found {len(misclassified_indices)} misclassified images")

# TODO: Create a figure to display misclassified images
plt.figure(figsize=(________, ________))

# TODO: Loop through the first 9 misclassified images (or fewer if there aren't 9)
for i in range(min(________, len(________))): 
    # TODO: Get the index of the misclassified image
    idx = ________[i]
    
    # TODO: Create a subplot in a 3x3 grid
    plt.subplot(________, ________, ________)
    
    # TODO: Display the image (remember to reshape from (28,28,1) to (28,28))
    plt.imshow(________.reshape(________, ________), cmap='gray')
    
    # TODO: Set the title to show both true and predicted categories
    plt.title(f"True: {________[________[idx]]}\nPred: {________[________[idx]]}")
    
    # TODO: Turn off axis labels
    plt.axis(________)

# TODO: Adjust layout and save the figure
plt.tight_layout()
os.makedirs('static/images', exist_ok=True)  # make sure the output folder exists
plt.savefig('static/images/misclassified.png')
print("Misclassified images visualization saved to 'static/images/misclassified.png'")

```

## Visualizing Misclassified Sketches in Action

After analyzing your model's performance with metrics and confusion patterns, let's examine what is actually happening with misclassified sketches. Seeing examples of where your model makes mistakes provides valuable insights that numbers alone cannot capture.

In this task, you will identify sketches that your model classified incorrectly and display them in a grid. This visual inspection will help you spot patterns in your model's errors — perhaps certain sketches are consistently misinterpreted due to similar visual features.

By examining misclassified examples, you can better understand your model's weaknesses and potentially improve its architecture or training data to address these specific issues.

```python
import os
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", message="Your `PyDataset` class should call", category=UserWarning)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  

import urllib.request
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import matplotlib.pyplot as plt

# Data loading and preprocessing (already done for you)
categories = ['cat', 'house', 'apple']
base_url = 'https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/'

os.makedirs('quickdraw_data', exist_ok=True)

for category in categories:
    file_path = f'quickdraw_data/{category}.npy'
    if not os.path.exists(file_path):
        print(f"Downloading {category}...")
        urllib.request.urlretrieve(base_url + category + '.npy', file_path)
    else:
        print(f"{category}.npy already exists.")

# Load and prepare data
data = []
labels = []
IMAGE_COUNT = 3000

for idx, cat in enumerate(categories):
    filepath = f'quickdraw_data/{cat}.npy'
    imgs = np.load(filepath)[:IMAGE_COUNT]
    if imgs.dtype != np.uint8:
        imgs = imgs.astype(np.uint8)
    data.append(imgs)
    labels.append(np.full(imgs.shape[0], idx))

# Combine all data and labels
data = np.concatenate(data, axis=0)
labels = np.concatenate(labels, axis=0)

# Shuffle data
indices = np.arange(len(data))
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

# Reshape and normalize
data = data.reshape(-1, 28, 28, 1).astype('float32') / 255.0

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
print(f"Training data shape: {x_train.shape}, Testing data shape: {x_test.shape}")

# Create data augmentation generator
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    shear_range=0.1,
    fill_mode='nearest'
)

# Fit the generator to the training data
datagen.fit(x_train)

def build_simple_cnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28,28,1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(len(categories), activation='softmax')
    ])

    model.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    return model

# Build the model
model = build_simple_cnn()

# Print model summary
model.summary()

# Train the model for a few epochs
history = model.fit(datagen.flow(x_train, y_train, batch_size=32),
                   epochs=3,
                   validation_data=(x_test, y_test),
                   steps_per_epoch=len(x_train) // 32)

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")

# Get predictions
y_pred = model.predict(x_test)
y_pred_classes = np.argmax(y_pred, axis=1)

# Find indices of misclassified images (where predicted class != true class)
misclassified_indices = np.where(y_pred_classes != y_test)[0]
print(f"Found {len(misclassified_indices)} misclassified images")

# Create a figure to display misclassified images
plt.figure(figsize=(10, 10))

# Loop through the first 9 misclassified images (or fewer if there aren't 9)
for i in range(min(9, len(misclassified_indices))): 
    # Get the index of the misclassified image
    idx = misclassified_indices[i]
    
    # Create a subplot in a 3x3 grid
    plt.subplot(3, 3, i + 1)
    
    # Display the image (remember to reshape from (28,28,1) to (28,28))
    plt.imshow(x_test[idx].reshape(28, 28), cmap='gray')
    
    # Set the title to show both true and predicted categories
    plt.title(f"True: {categories[y_test[idx]]}\nPred: {categories[y_pred_classes[idx]]}")
    
    # Turn off axis labels
    plt.axis('off')

# Adjust layout and save the figure
plt.tight_layout()
os.makedirs('static/images', exist_ok=True)  # make sure the output folder exists
plt.savefig('static/images/misclassified.png')
print("Misclassified images visualization saved to 'static/images/misclassified.png'")
```
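
The grid above shows the first nine mistakes in test-set order. A possible refinement is to look at the mistakes the model was most confident about, since those often reveal the most systematic errors. The sketch below ranks misclassified samples by the probability assigned to the (wrong) predicted class. The probability array and labels are made up for illustration; in the lesson's code the same idea would apply to `y_pred` and `y_test`.

```python
import numpy as np

# Made-up softmax outputs for five samples and three classes
toy_probs = np.array([
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.40, 0.35, 0.25],
    [0.05, 0.05, 0.90],
    [0.10, 0.85, 0.05],
])
toy_true = np.array([0, 0, 1, 2, 2])

toy_pred = np.argmax(toy_probs, axis=1)   # predicted class per sample
confidence = np.max(toy_probs, axis=1)    # probability of the predicted class

# Indices where the prediction disagrees with the true label
wrong = np.where(toy_pred != toy_true)[0]

# Sort the wrong indices from most to least confident
ranked_wrong = wrong[np.argsort(-confidence[wrong])]

for idx in ranked_wrong:
    print(f"Sample {idx}: true={toy_true[idx]}, pred={toy_pred[idx]}, "
          f"confidence={confidence[idx]:.2f}")
```

Passing the first nine entries of such a ranked index array to the plotting loop, instead of the first nine misclassified indices, would display the most confident mistakes first.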