<a target="_blank" href="https://colab.research.google.com/github/umanitoba-meagher-projects/public-experiments/blob/main/jupyter-notebooks/Explore%20Large%20Image%20Datasets/image-filter.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
"""
Author: Ryleigh J. Bruce
Date: June 4, 2024

Purpose: To sort through a directory of images and copy files over to a new folder based on a specific string in the file name. A text file containing a list of all of the copied files is produced alongside the folder.


Note: The author generated this text in part with GPT-4,
OpenAI’s large-scale language-generation model. Upon generating
draft code, the author reviewed, edited, and revised the code
to their own liking and takes ultimate responsibility for
the content of this code.

"""

**NOTE: These scripts will need to be modified to extract the necessary information from metadata. Delete once necessary adjustments have been completed.**

# Overview

This notebook describes methods for filtering, organizing, and reviewing large collections of image files based on specific criteria embedded in their filenames. The primary purpose is to automate the process of identifying images in a directory whose filenames contain user-defined strings (such as species names, dates, or camera locations) and copying those files to a new destination folder.

# Critical Uses & Adaptability

## What the Notebook Can Be Used For:

- Exploration and filtering of large image datasets based on filename patterns: this is useful for curating datasets for specific analyses or experiments, and for quickly assessing the distribution of images matching certain criteria.
- File management, data filtering, and visualization in the context of image datasets: this is useful for automating repetitive tasks.
- By filtering images according to embedded metadata in filenames, datasets can be prepared for feature extraction or downstream analysis. The approach can be extended to select images with particular attributes relevant to research questions.

## How the Notebook Can Be Adapted:

- The workflow can be adapted for projects involving spatial analysis or architectural site studies by modifying the search parameters to reflect spatial features, site codes, or architectural elements present in filenames. This supports organization and review of spatially referenced image collections.
- The notebook can be used with different datasets by changing the `source_directory` and `destination_directory` variables.

### Set Up Borealis Data Access and Import Necessary Libraries

Here the Borealis data access functions are defined, allowing the notebook to download and access files from the Borealis public data repository without requiring authentication. The notebook sets up functions for dataset access, file downloading, and zip file handling.
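
When downloading, the file name is typically recovered from the `Content-Disposition` response header. As a minimal sketch, the standard library's `email.message.Message` class can parse that header, including quoted values and additional parameters. The helper name `filename_from_content_disposition` is hypothetical and is not part of the notebook's functions.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition header, or None.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() reads the 'filename' parameter of Content-Disposition
    return msg.get_filename()


print(filename_from_content_disposition('attachment; filename="4370-entire-subset.zip"'))
print(filename_from_content_disposition("attachment"))
```

If no file name is present the helper returns `None`, so a fallback such as the file ID from the URL is still needed.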

In [None]:
# Borealis API configuration
import requests
import zipfile

BOREALIS_SERVER = "https://borealisdata.ca"

def get_public_dataset_info(persistent_id):
    """
    Get a list of files in a public dataset.
    Returns a list of {"file_id", "filename"} dictionaries,
    or None if the dataset cannot be accessed.
    """
    url = f"{BOREALIS_SERVER}/api/datasets/:persistentId/"
    params = {"persistentId": persistent_id}

    response = requests.get(url, params=params)

    if response.status_code != 200:
        print(f"Cannot access dataset: {response.status_code}")
        return None

    dataset_info = response.json()

    # Access the list of files from the dataset_info dictionary
    files_list = dataset_info['data']['latestVersion']['files']

    # Create an empty list to store file information
    file_info_list = []

    # Iterate through the files list and append file ID and filename to the list
    for file_info in files_list:
        file_id = file_info['dataFile']['id']
        filename = file_info['dataFile']['filename']
        file_info_list.append({"file_id": file_id, "filename": filename})

    return file_info_list

def download_public_file(file_id, save_path="./"):
    """
    Download a specific public file from a dataset by its file ID
    No authentication required
    """
    url = f"{BOREALIS_SERVER}/api/access/datafile/{file_id}"

    response = requests.get(url, stream=True)

    if response.status_code == 200:
        # Determine filename from headers or URL
        filename = None
        if "Content-Disposition" in response.headers:
            cd = response.headers["Content-Disposition"]
            # Try to extract filename from content disposition
            if "filename=" in cd:
                filename = cd.split("filename=")[1].strip('"')

        # Fallback to extracting from URL if header not available or malformed
        if not filename:
            filename = url.split("/")[-1]

        file_path = f"{save_path}/{filename}"

        with open(file_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)

        print(f"✅ File downloaded to {file_path}")
        return file_path
    else:
        print(f"❌ Error {response.status_code}: File may be restricted or not found")
        return None

def is_zip_file(filepath):
    """
    Checks if a file is a valid zip file.
    """
    return zipfile.is_zipfile(filepath)

def unzip_file(filepath, extract_path="./"):
    """
    Unzips a zip file to a specified path and returns the name of the top-level extracted folder.
    Returns None if not a zip file or extraction fails.
    """
    if is_zip_file(filepath):
        try:
            with zipfile.ZipFile(filepath, 'r') as zip_ref:
                # Get the name of the top-level directory within the zip
                # Assumes there is a single top-level directory
                top_level_folder = None
                for file_info in zip_ref.infolist():
                    parts = file_info.filename.split('/')
                    if parts[0] and len(parts) > 1:
                        top_level_folder = parts[0]
                        break # Assuming the first entry gives the top-level folder

                zip_ref.extractall(extract_path)
                print(f"✅ Successfully unzipped {filepath} to {extract_path}")
                return top_level_folder

        except Exception as e:
            print(f"❌ Error unzipping {filepath}: {e}")
            return None
    else:
        print(f"ℹ️ {filepath} is not a valid zip file.")
        return None

# Initialize Borealis dataset access
public_doi = "doi:10.5683/SP3/H3HGWF"

# Download the 4370-entire-subset.zip file which contains the image collection
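# Fall back to an empty list if the dataset could not be accessed (the function returns None)
dataset_files = get_public_dataset_info(public_doi) or []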
subset_file_id = None
for file_info in dataset_files:
    if file_info['filename'] == '4370-entire-subset.zip':
        subset_file_id = file_info['file_id']
        break

if subset_file_id:
    print("Downloading 4370-entire-subset.zip...")
    downloaded_file = download_public_file(subset_file_id, "./")
    if downloaded_file and is_zip_file(downloaded_file):
        extracted_folder = unzip_file(downloaded_file, "./")
        print(f"Data extracted to: {extracted_folder}")
else:
    print("❌ Could not find 4370-entire-subset.zip in the dataset")

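The `os` and `shutil` Python modules allow for file processing within the Colab environment, specifically reading, writing, copying, and moving files. The `random` module, `PIL`, and `matplotlib` are imported for sampling and displaying the filtered images later in the notebook.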

In [None]:
import os
import shutil
import random
from PIL import Image
import matplotlib.pyplot as plt

# File Search Based on Species

### Define the Directories and Search Parameters

This code block defines the source folder, the destination folder, and the file path for the text file that will be produced alongside the new image folder.

In [None]:
# Define the source directory where images are stored
source_directory = './4370-entire-subset/small-animal-collection'
# Define the destination directory where raccoon images will be copied
destination_directory = "./image-filter/racoons"
# Define the text file path where filenames will be saved
output_text_file = "./image-filter/Racoon Images.txt"

This line determines what the later script is looking for in the file names. Here, the string that has been specified is ‘raccoon’.
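
As a small illustration of the matching rule, the sketch below tests a lowercase search string against lowercased filenames. The filenames are hypothetical and only stand in for the naming pattern of a real dataset.

```python
# Hypothetical filenames, used only to illustrate the matching rule
sample_filenames = [
    "camera2_2020-06-03_Raccoon_001.JPG",
    "camera1_2020-06-04_deer_002.jpg",
    "camera2_2020-06-05_RACCOON_003.jpeg",
]

species = "raccoon"

# Lowercase each filename before testing so that capitalization is ignored
matches = [name for name in sample_filenames if species in name.lower()]
print(matches)
```

Because the filename is lowercased before the comparison, the search string itself must be lowercase; a value such as `'Raccoon'` would match nothing.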

In [None]:
#define the species that is being searched for
species = 'raccoon'

### Sort the Dataset and Save the Selected Images

In this code block the `os` module is used to check for the destination directory and create it if it does not exist.

The call `os.makedirs(destination_directory, exist_ok=True)` creates a directory at the specified path, including any missing parent folders. The `exist_ok=True` argument ensures that the code will not fail if the directory already exists, and will instead move on to the following statements.
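
The effect of `exist_ok=True` can be checked in isolation. This is a minimal sketch that works inside a temporary folder (an assumption made so that nothing is written to the working directory).

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "image-filter", "raccoons")

    # The first call creates the folder and its missing parent
    os.makedirs(target, exist_ok=True)
    # A second call is harmless because exist_ok=True
    os.makedirs(target, exist_ok=True)
    created = os.path.isdir(target)

    # Without exist_ok=True, an existing folder raises FileExistsError
    try:
        os.makedirs(target)
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```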

The `images = []` initializes the images list to be used in the making of a text file in later code.

In [None]:
# Ensure that the destination directory exists, create if it does not
os.makedirs(destination_directory, exist_ok=True)

images = []

The script begins by opening a text file that will be used to record the names of the selected images. The `os.walk` function from the `os` module is used to go through all of the files in the supplied source directory and its subfolders, checking for the specified species in each file name (here it is searching for ‘raccoon’). Because the file name is converted to lowercase before the comparison, the search string should also be lowercase. When a file matching these criteria is found, its name is written to the text file and the file is copied to the destination directory.

The final print statement notifies us that the script has completed and the images have been copied to a new folder.

The `except` block ensures that any files that cannot be copied to the destination folder are reported along with the associated error message.
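
To show how `os.walk` reaches files in nested folders, the sketch below builds a small temporary folder tree and collects the matching image paths without copying anything. The function name `find_matching_images` and the file names are hypothetical.

```python
import os
import tempfile


def find_matching_images(root, search_string, extensions=(".png", ".jpg", ".jpeg")):
    """Return sorted paths of image files under root whose names contain search_string."""
    needle = search_string.lower()
    matches = []
    # os.walk yields one (dirpath, dirnames, filenames) tuple per folder in the tree
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            lowered = filename.lower()
            if needle in lowered and lowered.endswith(extensions):
                matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "site-a", "nested"))
    # Hypothetical files: two raccoon images, one deer image, one text file
    for relative in [
        "raccoon_01.JPG",
        "site-a/deer_01.jpg",
        "site-a/nested/Raccoon_02.png",
        "site-a/raccoon_notes.txt",
    ]:
        open(os.path.join(root, *relative.split("/")), "w").close()

    found = [
        os.path.relpath(path, root).replace(os.sep, "/")
        for path in find_matching_images(root, "raccoon")
    ]

print(found)
```

The text file is skipped because of the extension check, and the nested image is found because `os.walk` descends into every subfolder.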

In [None]:
# Open the text file for writing
with open(output_text_file, "w") as file:
    # Walk through all files in the source directory
    for dirpath, dirnames, filenames in os.walk(source_directory):
        # Filter for image files that include the species name (case-insensitive)
        for filename in filenames:
            if species in filename.lower() and filename.lower().endswith(('.png', '.jpg', '.jpeg')):
                # Full path of the file
                full_file_path = os.path.join(dirpath, filename)
                # Add the file to the list of images
                images.append(full_file_path)
                # Write filename to the text file
                file.write(filename + "\n")
                # Copy the file to the destination directory, reporting any failure
                try:
                    shutil.copy(full_file_path, os.path.join(destination_directory, filename))
                except Exception as e:
                    print(f"Failed to copy {filename}. Reason: {str(e)}")

print("Files have been filtered and copied.")

### Display a Subset of the Filtered Images

In this code block the `subset_size` is the number of images that will be displayed within the grid. Here the value is set to 15. The subset is selected randomly using the `random.sample` function.
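
The `min(subset_size, len(images))` guard matters because `random.sample` raises a `ValueError` when asked for more items than the list contains. A minimal sketch of the same idea, using a hypothetical helper `pick_subset` with an optional seed for repeatable selections:

```python
import random


def pick_subset(items, subset_size=15, seed=None):
    """Return up to subset_size randomly chosen items without repetition."""
    rng = random.Random(seed)  # a fixed seed gives the same selection each run
    return rng.sample(items, min(subset_size, len(items)))


few = ["a.jpg", "b.jpg", "c.jpg"]
many = [f"img_{i}.jpg" for i in range(100)]

print(len(pick_subset(few)), len(pick_subset(many)))
```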

In [None]:
# Display a subset of images in grid format
subset_size = 15
selected_files_subset = random.sample(images, min(subset_size, len(images)))

`plt.figure(figsize=(20, 10))` sets the size of the figure to 20 inches wide and 10 inches tall. The columns and rows values have been set to 5 and 3 respectively, giving 15 grid positions to match `subset_size`.
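
If `subset_size` is changed, the fixed 5 by 3 grid may no longer fit the number of images. One possible approach is to derive the row count from the number of images. The helper `grid_shape` is hypothetical and not part of the notebook.

```python
import math


def grid_shape(n_images, columns=5):
    """Return (rows, columns) large enough to hold n_images."""
    if n_images <= 0:
        return (0, 0)
    columns = min(columns, n_images)       # avoid empty columns for small sets
    rows = math.ceil(n_images / columns)   # round up so the last row is included
    return (rows, columns)


print(grid_shape(15), grid_shape(7), grid_shape(3))
```

The returned values could then be passed to `fig.add_subplot(rows, columns, i + 1)` in place of the hard-coded numbers.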

In [None]:
fig = plt.figure(figsize=(20, 10)) # Size of the entire figure
columns = 5
rows = 3

This code block loops over each file in `selected_files_subset` and opens it using `PIL`. It then adds a new subplot to the figure for each image and displays the image in that subplot. `ax.axis('off')` removes the x and y axes from the subplot to maintain legibility. `ax.set_title(os.path.basename(file_path), fontsize=8, pad=5)` sets the title of the subplot to the filename of the image in size 8 font, offset five points from the top of the image.

In [None]:
for i, file_path in enumerate(selected_files_subset):
    img = Image.open(file_path)
    ax = fig.add_subplot(rows, columns, i + 1)
    ax.imshow(img)
    ax.axis('off')
    ax.set_title(os.path.basename(file_path), fontsize=8, pad=5)

Here `plt.subplots_adjust(wspace=0.5, hspace=0.5)` sets the horizontal (`wspace`) and vertical (`hspace`) spacing between subplots to 0.5, expressed as a fraction of the average subplot width and height. `plt.tight_layout(pad=1)` then automatically adjusts the layout so that titles and images do not overlap, and the gridded images are displayed.

In [None]:
# Display a subset of images in grid format
subset_size = 15
selected_files_subset = random.sample(images, min(subset_size, len(images)))

fig = plt.figure(figsize=(20, 10)) # Size of the entire figure
columns = 5
rows = 3

for i, file_path in enumerate(selected_files_subset):
    img = Image.open(file_path)
    ax = fig.add_subplot(rows, columns, i + 1)
    ax.imshow(img)
    ax.axis('off')
    ax.set_title(os.path.basename(file_path), fontsize=8, pad=5)

# Adjust spacing
plt.subplots_adjust(wspace=0.5, hspace=0.5)
plt.tight_layout(pad=1)

plt.show()

# File Search Based on Date

### Define the Directories and Search Parameters

This script functions and looks largely the same as the previous file search script, aside from changing the string the script is searching for in the file names.

In [None]:
# Define the source directory where images are stored
source_directory = './4370-entire-subset/small-animal-collection'
# Define the destination directory where images will be copied
destination_directory = './image-filter/June 3rd 2020'
# Define the text file path where filenames will be saved
output_text_file = './image-filter/June 3rd 2020 Images.txt'

It is critical to format the date the same way that it is formatted in the file names, or else the search will return no images.
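
One way to avoid formatting mistakes is to build the search string from a `datetime.date` object. The sketch below assumes the filenames use the `YYYY-MM-DD` pattern; the example filename is hypothetical.

```python
from datetime import date

# Build the search string from a date object instead of typing it by hand
target = date(2020, 6, 3)
date_to_search = target.strftime("%Y-%m-%d")  # zero-padded year-month-day
print(date_to_search)

# Hypothetical filename following the assumed pattern
example_filename = "camera2_2020-06-03_raccoon_001.jpg"

print(date_to_search in example_filename)  # zero-padded form matches
print("2020-6-3" in example_filename)      # unpadded form does not
```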

In [None]:
# Define the date we are searching for in the filename
date_to_search = "2020-06-03"

### Sort the Dataset and Save the Selected Images

The remainder of the script remains the same, aside from the `species` variable being replaced by the `date_to_search` variable.
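
Since only the search string changes between sections, the loop could also be wrapped in a single function. This is a sketch under stated assumptions, not the notebook's own code: the name `filter_and_copy` is hypothetical, the destination folder is assumed to lie outside the source folder, and the folder that holds the text file is assumed to exist already.

```python
import os
import shutil

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def filter_and_copy(source_directory, destination_directory, output_text_file, search_string):
    """Copy images whose names contain search_string; return the copied source paths."""
    os.makedirs(destination_directory, exist_ok=True)
    needle = search_string.lower()
    images = []
    with open(output_text_file, "w") as file:
        for dirpath, _dirnames, filenames in os.walk(source_directory):
            for filename in filenames:
                lowered = filename.lower()
                if needle in lowered and lowered.endswith(IMAGE_EXTENSIONS):
                    full_file_path = os.path.join(dirpath, filename)
                    try:
                        shutil.copy(full_file_path, os.path.join(destination_directory, filename))
                    except OSError as e:
                        print(f"Failed to copy {filename}. Reason: {e}")
                        continue  # record only files that were actually copied
                    images.append(full_file_path)
                    file.write(filename + "\n")
    return images
```

With such a function, each section would reduce to one call, for example `images = filter_and_copy(source_directory, destination_directory, output_text_file, date_to_search)`.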

In [None]:
# Ensure that the destination directory exists, create if it does not
os.makedirs(destination_directory, exist_ok=True)

images = []

# Open the text file for writing
with open(output_text_file, "w") as file:
    # Walk through all files in the source directory
    for dirpath, dirnames, filenames in os.walk(source_directory):
        # Filter for image files that include the date in their name
        for filename in filenames:
            if date_to_search in filename.lower() and filename.lower().endswith(('.png', '.jpg', '.jpeg')):
                # Full path of the file
                full_file_path = os.path.join(dirpath, filename)
                # Add the file to the list of images
                images.append(full_file_path)
                # Write filename to the text file
                file.write(filename + "\n")
                # Copy the file to the destination directory, reporting any failure
                try:
                    shutil.copy(full_file_path, os.path.join(destination_directory, filename))
                except Exception as e:
                    print(f"Failed to copy {filename}. Reason: {str(e)}")

print("Files have been filtered and copied.")

### Display a Subset of the Filtered Images

A randomly selected subset of images can now be displayed using the script from the previous section to ensure that the script is functioning properly:

In [None]:
# Display a subset of images in grid format
subset_size = 15
selected_files_subset = random.sample(images, min(subset_size, len(images)))

fig = plt.figure(figsize=(20, 10)) # Size of the entire figure
columns = 5
rows = 3

for i, file_path in enumerate(selected_files_subset):
    img = Image.open(file_path)
    ax = fig.add_subplot(rows, columns, i + 1)
    ax.imshow(img)
    ax.axis('off')
    ax.set_title(os.path.basename(file_path), fontsize=8, pad=5)

# Adjust spacing
plt.subplots_adjust(wspace=0.5, hspace=0.5)
plt.tight_layout(pad=1)

plt.show()

# File Search Based on Location

### Define the Directories and Search Parameters

As long as the desired variable is in the file name, this script can be modified to search files based on a range of variables.
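
The same idea extends to several variables at once. As a minimal sketch, the hypothetical helper `matches_all` keeps only filenames that contain every search term; the filenames are invented for illustration.

```python
def matches_all(filename, search_terms):
    """Return True if every search term appears in the filename (case-insensitive)."""
    lowered = filename.lower()
    return all(term.lower() in lowered for term in search_terms)


# Hypothetical filenames following an assumed camera_date_species pattern
sample_filenames = [
    "camera2_2020-06-03_raccoon_001.jpg",
    "camera2_2020-06-03_deer_002.jpg",
    "camera1_2020-06-03_raccoon_003.jpg",
]

selected = [name for name in sample_filenames if matches_all(name, ["camera2", "raccoon"])]
print(selected)
```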

In [None]:
# Define the source directory where images are stored
source_directory = './4370-entire-subset/small-animal-collection'
# Define the destination directory where images will be copied
destination_directory = './image-filter/camera2'
# Define the text file path where filenames will be saved
output_text_file = './image-filter/camera2/Camera 2 Images.txt'

Here the script searches for and copies all images taken at the ‘camera2’ site.

In [None]:
# Define the camera location we are searching for in the filename
camera_location = "camera2"

### Sort the Dataset and Save the Selected Images

The remainder of the script remains the same, aside from the `date_to_search` variable being replaced by the `camera_location` variable.

In [None]:
# Ensure that the destination directory exists, create if it does not
os.makedirs(destination_directory, exist_ok=True)

images = []

# Open the text file for writing
with open(output_text_file, "w") as file:
    # Walk through all files in the source directory
    for dirpath, dirnames, filenames in os.walk(source_directory):
        # Filter for image files that include the camera location in their name
        for filename in filenames:
            if camera_location in filename.lower() and filename.lower().endswith(('.png', '.jpg', '.jpeg')):
                # Full path of the file
                full_file_path = os.path.join(dirpath, filename)
                # Add the file to the list of images
                images.append(full_file_path)
                # Write filename to the text file
                file.write(filename + "\n")
                # Copy the file to the destination directory, reporting any failure
                try:
                    shutil.copy(full_file_path, os.path.join(destination_directory, filename))
                except Exception as e:
                    print(f"Failed to copy {filename}. Reason: {str(e)}")

print("Files have been filtered and copied.")

### Display a Subset of the Filtered Images

A randomly selected subset of images can now be displayed using the script from the previous sections to ensure that the script is functioning properly:

In [None]:
# Display a subset of images in grid format
subset_size = 15
selected_files_subset = random.sample(images, min(subset_size, len(images)))

fig = plt.figure(figsize=(20, 10)) # Size of the entire figure
columns = 5
rows = 3

for i, file_path in enumerate(selected_files_subset):
    img = Image.open(file_path)
    ax = fig.add_subplot(rows, columns, i + 1)
    ax.imshow(img)
    ax.axis('off')
    ax.set_title(os.path.basename(file_path), fontsize=8, pad=5)

# Adjust spacing
plt.subplots_adjust(wspace=0.5, hspace=0.5)
plt.tight_layout(pad=1)

plt.show()