<a target="_blank" href="https://colab.research.google.com/github/umanitoba-meagher-projects/public-experiments/blob/main/jupyter-notebooks/Object%20Classification%20and%20Localization/dataset-sizes-classification-report.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:

"""
Author: Zhenggang Li & A.V. Ronquillo
Date: May 21, 2024

## Note: The author generated this text in part with GPT-4,
OpenAI’s large-scale language-generation model. Upon generating
draft code, the author reviewed, edited, and revised the code
to their own liking and takes ultimate responsibility for
the content of this code.

"""

# Introduction

This notebook introduces methods for evaluating the performance of a machine learning model in identifying and locating different animal species in trail camera photos. A classification model is used to classify and locate animal species across datasets of varying sizes (100, 500, and 1000 images). The notebook employs Python-based libraries such as `fastai`, `matplotlib`, and `sklearn` to preprocess data, train a convolutional neural network (CNN) using the `resnet34` architecture, and generate classification reports. These reports include precision, recall, F1-score, and support metrics for each class, visualized through bar plots.

# Critical Uses & Adaptability

## What the Notebook Can Be Used For:

- **Dataset Exploration:**  
  This notebook allows users to explore datasets of varying sizes, providing insights into how dataset size impacts model performance. It evaluates classification accuracy and localization metrics, offering a detailed understanding of the dataset's characteristics.

- **Educational Purposes & Demonstrations:**  
  The notebook serves as an educational resource for understanding Python-based machine learning workflows. It demonstrates the use of libraries like `fastai` and `sklearn` for image classification tasks, making it suitable for teaching concepts such as CNNs, data preprocessing, and performance evaluation.

- **Feature Extraction:**  
  By fine-tuning the `resnet34` model, the notebook extracts meaningful features from images, which can be used for further analysis or integrated into other machine learning pipelines.

## How the Notebook Can Be Adapted:

- **Integration with Spatial Design & Architectural Studies:**  
  The notebook can be adapted for site analysis by replacing the animal image dataset with datasets containing architectural elements or spatial layouts. This enables the classification of design features or spatial patterns.

- **Variables & Customization:**  
  - The `species` and `dataset_size` variables in the dataset processing cells can be modified to accommodate different classes or dataset sizes.  
  - The `batch_tfms` and `item_tfms` parameters in the `ImageDataLoaders` initialization allow customization of image transformations.

- **Swapping Datasets:**  
  - The dataset path in the `base_path` variable (e.g., in the cell processing the 100 dataset) can be updated to point to a custom dataset. This enables the notebook to work with entirely different image collections.

- **Scalability:**  
  The notebook can be scaled to handle larger datasets by adjusting the `batch_size` and `num_workers` parameters in the data loader. Additionally, the number of epochs in the `fine_tune` method can be increased for more extensive training.

# Module: Importing Additional Necessary Python Packages


To keep the Colab runtime manageable, this notebook generates classification reports only for the 100, 500, and 1000 image datasets, which is enough to analyze the model's animal classification accuracy.

`matplotlib.pyplot`, imported under the alias `plt`, is a library for creating two-dimensional plots in Python. From `sklearn.metrics`, which provides model evaluation metrics, the notebook imports the `classification_report` function. A classification report evaluates a classification model by calculating precision, recall, F1-score, and support for each class. It expresses classification performance as several quantitative metrics and can provide insights that a confusion matrix alone may not offer. `pandas` and `seaborn` are also imported to tabulate and plot the report.
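As a minimal sketch of what `classification_report` returns, the example below scores a handful of made-up labels. The label lists are illustrative only and are not results from the trail camera model.

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration only (not output from the notebook's model)
y_true = ["deer", "deer", "fox", "fox", "squirrel", "squirrel"]
y_pred = ["deer", "fox", "fox", "fox", "squirrel", "deer"]

# Text form: one row per class plus accuracy, macro avg, and weighted avg
print(classification_report(y_true, y_pred))

# Dictionary form: the same numbers, addressable by class name and metric
report = classification_report(y_true, y_pred, output_dict=True)
print(report["fox"]["recall"])
```

The dictionary form is the one that the later cells convert into a table and plot.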

In [None]:
import os
# Borealis API configuration
import requests
import zipfile

BOREALIS_SERVER = "https://borealisdata.ca"

def get_public_dataset_info(persistent_id):
    """
    Get information about a public dataset
    """
    url = f"{BOREALIS_SERVER}/api/datasets/:persistentId/"
    params = {"persistentId": persistent_id}

    response = requests.get(url, params=params)

    if response.status_code == 200:
        dataset_info = response.json()
    else:
        print(f"Cannot access dataset: {response.status_code}")
        return None
    # Build a list of the files in the dataset's latest version
    # Access the list of files from the dataset_info dictionary
    files_list = dataset_info['data']['latestVersion']['files']

    # Create an empty list to store file information
    file_info_list = []

    # Iterate through the files list and append file ID and filename to the list
    for file_info in files_list:
        file_id = file_info['dataFile']['id']
        filename = file_info['dataFile']['filename']
        file_info_list.append({"file_id": file_id, "filename": filename})

    return file_info_list

def download_public_file(file_id, save_path="./"):
    """
    Download a specific public file from a dataset by its file ID
    No authentication required
    """
    url = f"{BOREALIS_SERVER}/api/access/datafile/{file_id}"

    response = requests.get(url, stream=True)

    if response.status_code == 200:
        # Determine filename from headers or URL
        filename = None
        if "Content-Disposition" in response.headers:
            cd = response.headers["Content-Disposition"]
            # Try to extract filename from content disposition
            if "filename=" in cd:
                filename = cd.split("filename=")[1].strip('"')

        # Fallback to extracting from URL if header not available or malformed
        if not filename:
             filename = url.split("/")[-1]

        file_path = f"{save_path}/{filename}"

        with open(file_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)

        print(f"SUCCESS: File downloaded to {file_path}")
        return file_path
    else:
        print(f"ERROR: {response.status_code}: File may be restricted or not found")
        return None

def is_zip_file(filepath):
    """
    Checks if a file is a valid zip file.
    """
    return zipfile.is_zipfile(filepath)

def unzip_file(filepath, extract_path="./"):
    """
    Unzips a zip file to a specified path and returns the name of the top-level extracted folder.
    Returns None if not a zip file or extraction fails.
    """
    if is_zip_file(filepath):
        try:
            with zipfile.ZipFile(filepath, 'r') as zip_ref:
                # Get the name of the top-level directory within the zip
                # Assumes there is a single top-level directory
                top_level_folder = None
                for file_info in zip_ref.infolist():
                    parts = file_info.filename.split('/')
                    if parts[0] and len(parts) > 1:
                        top_level_folder = parts[0]
                        break # Assuming the first entry gives the top-level folder

                zip_ref.extractall(extract_path)
                print(f"SUCCESS: Successfully unzipped {filepath} to {extract_path}")
                return top_level_folder

        except Exception as e:
            print(f"ERROR: Error unzipping {filepath}: {e}")
            return None
    else:
        print(f"INFO: {filepath} is not a valid zip file.")
        return None

# Initialize Borealis dataset access
public_doi = "doi:10.5683/SP3/H3HGWF"
print("Borealis dataset initialized for animal notebook data.")

from fastai.vision.all import *
from pathlib import Path

#Additional Packages
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
import pandas as pd
import seaborn as sns
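The `unzip_file` helper assumes that the archive has a single top-level folder and reports the first one it finds. The sketch below reproduces that check on a small throwaway archive, using only the standard library. The function name `top_level_folder` and the file names inside the archive are hypothetical and chosen for illustration.

```python
import tempfile
import zipfile
from pathlib import Path

def top_level_folder(zip_path):
    """Return the first top-level folder name found in the archive, or None."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            # An entry such as 'folder/file.txt' has a non-empty first part
            if parts[0] and len(parts) > 1:
                return parts[0]
    return None

# Build a small throwaway archive; folder and file names are hypothetical
work_dir = Path(tempfile.mkdtemp())
archive = work_dir / "example.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("av datasets/fox_100/img_001.txt", "placeholder")
    zf.writestr("av datasets/deer_100/img_002.txt", "placeholder")

folder = top_level_folder(archive)

# Extract next to the archive and confirm the folder exists on disk
with zipfile.ZipFile(archive) as zf:
    zf.extractall(work_dir)

print(folder, (work_dir / folder / "fox_100").is_dir())
```

If an archive stores its files without any enclosing folder, the function returns `None`, so callers should handle that case before building a path from the result.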

##### Google Drive File Path

In [None]:
def mount_google_drive():
    # The Borealis helper functions and `public_doi` are already defined in the
    # import cell above, so they are not repeated here.
    print("Google Drive is mounted. Proceed with file operations.")

# Module: Repeating Error Handling Functions

The error-handling function from the Confusion Matrix script is reused here. It skips image files that cannot be opened, so corrupted files do not interrupt training.

In [None]:
def safe_get_image_files(path):
    image_files = []
    for img_path in get_image_files(path):
        try:
            # Attempt to open the image to verify it is not corrupted
            with PILImage.create(img_path) as img:
                image_files.append(img_path)
        except Exception as e:
            print(f"Skipping file {img_path} due to error: {e}")
    return image_files
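The same idea can be tried outside fastai. The sketch below is an assumption-laden variant that uses Pillow directly (`Image.open` and `Image.verify`) instead of `PILImage.create`; the helper name `is_readable_image` and the test files are hypothetical.

```python
import tempfile
from pathlib import Path
from PIL import Image

def is_readable_image(path):
    """Return True if Pillow can open and verify the file, False otherwise."""
    try:
        with Image.open(path) as img:
            img.verify()  # checks file integrity without decoding every pixel
        return True
    except Exception as e:
        print(f"Skipping file {path} due to error: {e}")
        return False

# Create one valid image and one file that only pretends to be an image
tmp = Path(tempfile.mkdtemp())
good = tmp / "good.png"
bad = tmp / "bad.png"
Image.new("RGB", (8, 8), color="white").save(good)
bad.write_bytes(b"not an image")

kept = [p for p in (good, bad) if is_readable_image(p)]
print(kept)
```

Only the valid file survives the filter, which mirrors how `safe_get_image_files` builds its list.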

# Module: Data Processing & Fine-Tuning the Model

This module reuses the code from the Confusion Matrix script that gathers images by `species` and `size`.

The learner is still created with fastai's `cnn_learner` (renamed `vision_learner` in newer fastai releases) using the `resnet34` architecture, with `accuracy` as the metric for image classification. The model is then fine-tuned for `4` epochs.

In [None]:
def process_dataset(base_path, species, size):
    files = []
    # Gather files for all species for the given size
    for animal in species:
        path = Path(base_path) / f'{animal}_{size}'
        files += safe_get_image_files(path)

    if not files:
        print(f"No images found for dataset size {size}. Skipping...")
        return

    # Initialize Data Loaders
    dls = ImageDataLoaders.from_path_func(
        path=base_path,
        fnames=files,
        label_func=lambda x: x.parent.name.split('_')[0],
        item_tfms=Resize(460),
        batch_tfms=aug_transforms(size=224),
        bs=32,
        num_workers=0,
        valid_pct=0.8  # hold out 80% of the images as the validation set
    )

    # Initialize CNN Learner and Fine-Tune the Model
    learn = cnn_learner(dls, resnet34, metrics=accuracy)
    learn.fine_tune(4)
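The `label_func` passed to the data loaders derives each label from the parent folder name. The sketch below shows that rule on a few hypothetical file paths that follow the `<species>_<size>` folder layout; the function name `label_from_folder` is an illustrative stand-in for the lambda.

```python
from pathlib import Path

def label_from_folder(file_path):
    """Return the species prefix of the parent folder, e.g. 'fox_100' -> 'fox'."""
    return Path(file_path).parent.name.split("_")[0]

# Hypothetical file paths that follow the '<species>_<size>' folder layout
examples = [
    "av datasets/fox_100/IMG_0001.JPG",
    "av datasets/squirrel_500/IMG_0420.JPG",
    "av datasets/deer_1000/IMG_0999.JPG",
]

labels = [label_from_folder(p) for p in examples]
print(labels)
```

Because the rule keeps only the text before the first underscore, a species name that itself contains an underscore would be truncated, so such folders would need a different labeling rule.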

# Module: Retrieving Predictions & True Labels
After training, the model's predictions and true labels are retrieved. The `get_preds` method returns a tuple that is unpacked into the predictions (`preds`) and the true labels (`y_true`). Calling `softmax(dim=1)` on the predictions yields `preds_softmax`, in which the values for each sample sum to 1. In fastai, `get_preds` by default already applies the loss function's activation, so this extra `softmax` call may rescale the probabilities, but it does not change which class scores highest.

The `max` function is applied along `dim=1`, the dimension that corresponds to the different classes. It returns two tensors: `predicted_probs` contains the highest probability for each sample, and `actual_classes`, despite its name, contains the index of the predicted class.

In [None]:
    preds, y_true = learn.get_preds()
    preds_softmax = preds.softmax(dim=1)
    predicted_probs, actual_classes = preds_softmax.max(dim=1)

# Module: Converting the Report into a Pandas DataFrame

Using the `argmax` function, `y_pred` holds the predicted class label for each sample: the index of the largest value along the class dimension (`dim=1`) of the `preds_softmax` tensor, which contains the predicted probabilities for each class.

The `report` variable stores the output of the `classification_report` function from the previously imported `sklearn.metrics` module. The function takes the true labels `y_true` and the predicted class labels `y_pred` as input, and `output_dict=True` returns the report as a dictionary. This gives structured access to the individual metrics for each class as well as the overall metrics. Because both inputs hold class indices rather than names, the rows of the report are keyed by index.

Transposing the `DataFrame` swaps the rows and columns so that class labels become the index and the metrics (precision, recall, f1-score, etc.) become the columns.

In [None]:
    # Assuming y_true and preds_softmax are your true labels and output probabilities.
    y_pred = preds_softmax.argmax(dim=1)
    report = classification_report(y_true, y_pred, output_dict=True)
    df = pd.DataFrame(report).transpose()
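The sketch below shows the shape of the transposed table using a hand-written dictionary in the same layout that `output_dict=True` produces. All numbers are made up for illustration, and the `per_class` filter is an optional step that is not part of the notebook's code.

```python
import pandas as pd

# Hand-written dictionary in the layout of classification_report(..., output_dict=True);
# the numbers are made up for illustration
report = {
    "fox": {"precision": 0.80, "recall": 1.00, "f1-score": 0.89, "support": 4},
    "deer": {"precision": 1.00, "recall": 0.75, "f1-score": 0.86, "support": 4},
    "accuracy": 0.875,
    "macro avg": {"precision": 0.90, "recall": 0.875, "f1-score": 0.875, "support": 8},
    "weighted avg": {"precision": 0.90, "recall": 0.875, "f1-score": 0.875, "support": 8},
}

# After transposing, each row is a class or summary entry and each column is a metric
df = pd.DataFrame(report).transpose()

# Optional: keep only the per-class rows, e.g. before plotting
summary_rows = ["accuracy", "macro avg", "weighted avg"]
per_class = df.drop(index=summary_rows)
print(per_class)
```

Note that `accuracy` is a single number in the dictionary, so its row repeats that value in every column of the table.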

# Module: Keys Dictionary & Generating the Classification Visual

The report dictionary contains these keys:

- `accuracy`: The overall classification accuracy of the model, that is, the fraction of correctly predicted instances out of all instances in the dataset.
- `macro avg`: The unweighted mean of precision, recall, and F1-score across all classes. Each metric is calculated per class and then averaged, which gives equal weight to each class regardless of how many instances it has. Its `support` entry is the total number of instances.
- `weighted avg`: The mean of precision, recall, and F1-score across all classes, with each class's metric weighted by the number of instances in that class. This gives more weight to larger classes.

For each class, the dictionary contains the following metrics:

- `precision`: The fraction of predicted positive instances that are actually positive. It is calculated as the number of true positives divided by the total number of predicted positives.
- `recall`: The fraction of actual positive instances that are predicted to be positive. It is calculated as the number of true positives divided by the total number of actual positives.
- `f1-score`: The harmonic mean of precision and recall. It is calculated as `2 * (precision * recall) / (precision + recall)`.
- `support`: The number of true instances of the class.

These metrics allow the model's performance to be evaluated on each class individually, which is useful for identifying classes the model struggles with and for comparing performance across classes.
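The definitions above can be worked through by hand. The sketch below computes the per-class metrics and the two averages in plain Python on made-up labels; the function name `per_class_metrics` is illustrative and is not part of `sklearn`.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1-score, and support for every class in y_true."""
    metrics = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)   # true positives + false positives
        support = sum(t == cls for t in y_true)     # true positives + false negatives
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[cls] = {"precision": precision, "recall": recall,
                        "f1-score": f1, "support": support}
    return metrics

# Made-up labels: four deer images and two fox images
y_true = ["deer", "deer", "deer", "deer", "fox", "fox"]
y_pred = ["deer", "deer", "deer", "fox", "fox", "fox"]

metrics = per_class_metrics(y_true, y_pred)
total = sum(m["support"] for m in metrics.values())

# Macro average: every class counts equally
macro_recall = sum(m["recall"] for m in metrics.values()) / len(metrics)
# Weighted average: each class counts in proportion to its support
weighted_recall = sum(m["recall"] * m["support"] for m in metrics.values()) / total

print(metrics)
print(macro_recall, weighted_recall)
```

With unequal class sizes the two averages differ: the macro average treats the small fox class the same as the larger deer class, while the weighted average leans toward the deer class.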

In [None]:
    plt.figure(figsize=(10, 6))
    sns.barplot(data=df, x=df.index, y='precision', label='Precision', color='b')
    sns.barplot(data=df, x=df.index, y='recall', label='Recall', color='r', alpha=0.6)
    sns.barplot(data=df, x=df.index, y='f1-score', label='F1-Score', color='g', alpha=0.3)
    plt.xticks(rotation=45)
    plt.xlabel('Classes')
    plt.ylabel('Scores')
    plt.title(f'Classification Report for {size} Dataset')
    plt.legend()
    plt.show()

`figsize` sets the figure to 10 inches wide and 6 inches high. Seaborn's `sns.barplot` draws each bar chart. The `label` argument sets the legend entry for each metric, and the `color` argument sets the color of the bars. The `alpha` argument controls the transparency of the `recall` and `f1-score` bars so that the bars drawn beneath them remain visible. Because `df` also contains the `accuracy`, `macro avg`, and `weighted avg` rows, these appear on the x-axis alongside the classes. These parameters can be adjusted to suit what you want to communicate through the classification report.
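One possible variation, shown here as a sketch rather than as the notebook's method, is to place the three metrics side by side instead of overlaying them. It uses the plotting interface built into pandas, and the function name `plot_grouped_report` and the values in the table are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_grouped_report(df, title):
    """Draw precision, recall, and F1-score side by side for each row of df."""
    ax = df[["precision", "recall", "f1-score"]].plot.bar(figsize=(10, 6), rot=45)
    ax.set_xlabel("Classes")
    ax.set_ylabel("Scores")
    ax.set_title(title)
    ax.set_ylim(0, 1.05)  # scores are fractions between 0 and 1
    return ax

# Made-up per-class scores for illustration only
df = pd.DataFrame(
    {
        "precision": [0.90, 0.75, 0.80],
        "recall": [0.85, 0.80, 0.70],
        "f1-score": [0.87, 0.77, 0.75],
    },
    index=["deer", "fox", "squirrel"],
)

ax = plot_grouped_report(df, "Classification Report (illustrative values)")
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards. Grouped bars avoid the case where a taller bar hides a shorter one drawn earlier.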

# Module: Process Datasets to create the Classification Report

As with the Confusion Matrix generation, this module uses the same snippet of code to process each dataset size for the Classification Report. This ensures that the necessary files and dataset sizes from Google Drive are accessible for processing. The first run below produces the Classification Report for the 100 dataset; the same steps are repeated for each remaining dataset in the notebook.

# 100 Dataset

In [None]:
if __name__ == "__main__":
    mount_google_drive()
    base_path = './av datasets'
    species = ['fox', 'squirrel', 'deer']
    dataset_size = 100
    process_dataset(base_path, species, dataset_size)

# 500 Dataset

In [None]:
if __name__ == "__main__":
    mount_google_drive()
    base_path = './av datasets'
    species = ['fox', 'squirrel', 'deer']
    dataset_size = 500
    process_dataset(base_path, species, dataset_size)

# 1000 Dataset

In [None]:
if __name__ == "__main__":
    mount_google_drive()
    base_path = './av datasets'
    species = ['fox', 'squirrel', 'deer']
    dataset_size = 1000
    process_dataset(base_path, species, dataset_size)