<a target="_blank" href="https://colab.research.google.com/github/mmeagher/experiments/blob/main/jupyter-notebooks/Explore%20Large%20Image%20Datasets/random-manual-review.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
"""
Author: Ryleigh J. Bruce
Date: June 7, 2024

Purpose: Copy a subset of images to a new folder for manual review.


Note: The author generated this text in part with GPT-4,
OpenAI’s large-scale language-generation model. Upon generating
draft code, the author reviewed, edited, and revised the code
to their own liking and takes ultimate responsibility for
the content of this code.

"""

# Overview

This notebook provides a workflow for selecting a random subset of images from a larger dataset for manual review. It covers sampling the images, copying them to a separate folder, and displaying a selection of them for visual inspection.

# Critical Uses & Adaptability

## What the Notebook Can Be Used For:

- Randomly sample and inspect images from large datasets.

- Automate repetitive tasks and learn foundational concepts in working with image data.

- Extend the workflow to include feature extraction or pre-processing steps. This can include extracting metadata, computing image statistics, or preparing data for downstream analysis.

## How the Notebook Can Be Adapted:

- The animal image dataset can be substituted with collections relevant to architectural contexts such as site photographs, plan scans, or spatial diagrams. The random sampling and visualization steps support unbiased review and documentation of spatial features.

- Variables such as `source_dir`, `destination_dir`, `number_of_images`, and `images_to_display` can be modified to adjust the source and target locations, the number of images sampled, and the number of images displayed, tailoring the workflow to specific project requirements.

- To use a different image dataset, change the `source_dir` and `destination_dir` variables in the cell labeled "Step 2: Set your paths and parameters."

### Mount Google Drive and Import Necessary Libraries

Here the `drive` module is imported, allowing the Colab environment to access files within Google Drive. Google Drive is then mounted at `/content/drive` so that the notebook can read and write the files stored there.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Here `os`, `shutil`, `random`, the `Image` module from Pillow (`PIL`), and `matplotlib.pyplot` are imported. The `os` module is used to work with directories and join path components. The `shutil` module copies files to the new directory, and the `random` module provides the random sampling function. `Image` and `matplotlib.pyplot` are used to open and display the images after they have been copied.

In [None]:
import os
import shutil
import random
from PIL import Image
import matplotlib.pyplot as plt

### Define Directories and Number of Images

In this cell the source and destination directories are defined, along with the number of images to copy and the number to display. `source_dir` should contain the path to the folder with the dataset that needs to be manually reviewed, while `destination_dir` is the location the subset is copied to. `number_of_images` sets how many images are sampled and copied, and `images_to_display` sets how many of the copied images are shown at the end of the notebook.

In [None]:
# Step 2: Set your paths and parameters
source_dir = '/content/drive/MyDrive/shared-data/Notebook datafiles/4370-entire-subset/small-animal-collection'
destination_dir = '/content/drive/MyDrive/shared-data/Notebook datafiles/image-filter/Subset-review'
number_of_images = 50  # Number of random images to copy
images_to_display = 15 # Number of images to display

Here the `os` module is used to check if the destination directory already exists, and if it doesn’t the `os.makedirs()` function is used to create it.

In [None]:
# Create the destination directory if it doesn't exist
if not os.path.exists(destination_dir):
    os.makedirs(destination_dir)
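The existence check and the creation step can also be collapsed into a single call. This is a minimal sketch, assuming Python 3.2 or later, where `os.makedirs()` accepts the `exist_ok` argument; the temporary folder and the name `demo_destination` are stand-ins for `destination_dir` so the example runs outside Colab.

```python
import os
import tempfile

# Stand-in for destination_dir so the sketch runs anywhere (hypothetical path).
base_dir = tempfile.mkdtemp()
demo_destination = os.path.join(base_dir, "Subset-review")

# exist_ok=True makes the call a no-op when the folder already exists,
# so no separate os.path.exists() check is needed.
os.makedirs(demo_destination, exist_ok=True)
os.makedirs(demo_destination, exist_ok=True)  # Second call does not raise

created = os.path.isdir(demo_destination)
print(created)
```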

### List all Files and Select a Sample of Random Images

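This cell creates a list called `all_files` containing the names (not the full paths) of all the files in the source directory. For each entry `f` returned by `os.listdir(source_dir)`, `os.path.join(source_dir, f)` builds the full path, and `os.path.isfile()` checks that the path points to a file rather than a subfolder. Only entries that pass this check are added to `all_files`. Note that the list is not filtered by file type, so any non-image files in the folder are included as well.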

In [None]:
# Step 3: List all files in the source directory
all_files = [f for f in os.listdir(source_dir) if os.path.isfile(os.path.join(source_dir, f))]
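If the source folder may contain non-image files, the listing step could also filter by extension. The following is a minimal sketch, not part of the original workflow: the helper name `list_image_files` and the `IMAGE_EXTENSIONS` set are hypothetical, and a temporary folder stands in for `source_dir`.

```python
import os
import tempfile

# Hypothetical set of extensions; adjust it to match the dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tif", ".tiff"}


def list_image_files(folder):
    """Return the sorted names of files in `folder` with an image-like extension."""
    return sorted(
        f for f in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, f))
        and os.path.splitext(f)[1].lower() in IMAGE_EXTENSIONS
    )


# Small demonstration in a temporary folder standing in for source_dir.
demo_dir = tempfile.mkdtemp()
for name in ["cat.JPG", "dog.png", "notes.txt"]:
    open(os.path.join(demo_dir, name), "w").close()
os.mkdir(os.path.join(demo_dir, "subfolder.png"))  # A directory, so it is skipped

image_files = list_image_files(demo_dir)
print(image_files)
```

The extension check is case-insensitive, so `cat.JPG` is kept while `notes.txt` and the subfolder are dropped.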

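Here `random.sample()` is used to draw a random sample of file names from the newly created `all_files` list without replacement, so no file is selected twice. The `min()` call caps the sample size at the number of available files, which prevents an error when `number_of_images` is larger than the dataset.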

In [None]:
# Step 4: Select a random subset of images
selected_files = random.sample(all_files, min(number_of_images, len(all_files)))
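When the same subset needs to be drawn again later, for example so that two reviewers see identical images, the sample can be seeded. This is a sketch under assumptions: the helper `sample_files` and its `seed` parameter are hypothetical additions, and the file names are made up for the demonstration.

```python
import random


def sample_files(file_names, k, seed=None):
    """Return up to k names drawn without replacement.

    Passing the same seed and the same set of names gives the same sample.
    """
    rng = random.Random(seed)  # Independent generator; global state is untouched
    # os.listdir() returns entries in arbitrary order, so sort first to make
    # the result depend only on the seed and the set of names.
    ordered = sorted(file_names)
    return rng.sample(ordered, min(k, len(ordered)))


demo_files = [f"img_{i:03d}.jpg" for i in range(10)]  # Made-up file names
first = sample_files(demo_files, 4, seed=42)
second = sample_files(demo_files, 4, seed=42)
too_many = sample_files(demo_files, 50, seed=42)  # Capped at 10 files

print(first == second, len(too_many))
```

Leaving `seed` as `None` keeps the behaviour of the original cell, where every run draws a different sample.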

### Copy the Selected Files to the New Directory

In this code block `os.path.join()` combines the source directory and each file name to build the full path of the file to copy. The same is done with the destination directory to build the target path. `shutil.copy2()` then copies each selected file to its destination path while attempting to preserve the file's metadata, such as its modification time.

In [None]:
# Step 5: Copy the selected images to the destination directory
for file_name in selected_files:
    source_path = os.path.join(source_dir, file_name)
    destination_path = os.path.join(destination_dir, file_name)
    shutil.copy2(source_path, destination_path)
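`shutil.copy2()` overwrites a destination file that has the same name, and re-running the notebook adds a fresh sample to whatever is already in the destination folder. The sketch below shows one way to make that explicit by reporting which files were copied and which were skipped. The function `copy_selected` and its `overwrite` flag are hypothetical, and temporary folders stand in for the Drive paths.

```python
import os
import shutil
import tempfile


def copy_selected(file_names, src_dir, dst_dir, overwrite=False):
    """Copy files with shutil.copy2 and return (copied, skipped) name lists."""
    os.makedirs(dst_dir, exist_ok=True)
    copied, skipped = [], []
    for name in file_names:
        dst_path = os.path.join(dst_dir, name)
        # Leave existing files alone unless overwriting was requested.
        if os.path.exists(dst_path) and not overwrite:
            skipped.append(name)
            continue
        shutil.copy2(os.path.join(src_dir, name), dst_path)
        copied.append(name)
    return copied, skipped


# Demonstration with temporary folders standing in for the Drive paths.
src = tempfile.mkdtemp()
dst = os.path.join(tempfile.mkdtemp(), "Subset-review")
for name in ["a.jpg", "b.jpg"]:
    with open(os.path.join(src, name), "w") as handle:
        handle.write(name)

first_run = copy_selected(["a.jpg", "b.jpg"], src, dst)
second_run = copy_selected(["a.jpg", "b.jpg"], src, dst)
print(first_run, second_run)
```

On the second run both files already exist in the destination, so they are reported as skipped rather than copied again.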

This print statement informs the user when the script is completed and shows the destination folder path.

In [None]:
print(f'Copied {len(selected_files)} images to {destination_dir} for manual review.')

### Display a Sample of the Copied Images

The code first uses `random.sample()` to select a random subset of the copied files from the `selected_files` list. The subset contains at most `images_to_display` images, or fewer if fewer files were copied.

Then `matplotlib` is used to create a figure measuring 20 by 10 inches in which to display the selected subset. The script iterates over the selected files: for each one, `Image.open()` opens the image file, `fig.add_subplot()` creates a new subplot at the next position in the grid, and `ax.imshow()` draws the image in that subplot. The first two arguments of `fig.add_subplot()` set the number of rows and columns, so `fig.add_subplot(3, 5, i+1)` results in a grid with three rows and five columns. Because this grid holds at most 15 images, setting `images_to_display` above 15 raises an error unless the grid dimensions are changed as well.

The `axis('off')` call turns off the axis lines, ticks, and labels for each subplot, and `set_title()` displays the file name above each image in 8-point font. `pad=5` places the title five points above the top of the subplot.

In [None]:
# Display a subset of images
selected_files_subset = random.sample(selected_files, min(images_to_display, len(selected_files)))
fig = plt.figure(figsize=(20, 10))
for i, file_name in enumerate(selected_files_subset):
    img = Image.open(os.path.join(destination_dir, file_name))
    ax = fig.add_subplot(3, 5, i+1)  # Assuming a grid of 3x5 images
    ax.imshow(img)
    ax.axis('off')
    ax.set_title(file_name, fontsize=8, pad=5)  # pad is the title offset in points
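The grid dimensions could be derived from the number of images instead of being fixed at three rows and five columns. A minimal sketch, assuming a hypothetical helper named `grid_shape` with a default of five columns:

```python
import math


def grid_shape(n_images, n_cols=5):
    """Return (rows, cols) for a grid large enough to hold n_images subplots."""
    if n_images <= 0:
        return (0, 0)
    cols = min(n_cols, n_images)       # Do not reserve columns that stay empty
    rows = math.ceil(n_images / cols)  # Round up so the last partial row fits
    return (rows, cols)


print(grid_shape(15))  # (3, 5), the layout used in this notebook
print(grid_shape(16))  # (4, 5), one extra row for the sixteenth image
print(grid_shape(3))   # (1, 3)
```

With `rows, cols = grid_shape(len(selected_files_subset))`, a call such as `fig.add_subplot(rows, cols, i + 1)` would stay within the grid for any value of `images_to_display`.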

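This code block uses the `matplotlib` library to adjust the display layout of the images. The first line, `plt.subplots_adjust(wspace=0.5, hspace=0.5)`, sets the horizontal and vertical spacing between subplots. The second line, `plt.tight_layout(pad=1)`, fits the subplots snugly within the figure created earlier; its `pad` parameter sets the padding around the figure edge (and, by default, between adjacent subplots) as a fraction of the font size. Because `tight_layout()` recomputes the subplot spacing, it largely supersedes the `wspace` and `hspace` values set on the first line. The final line, `plt.show()`, displays the figure.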

In [None]:
plt.subplots_adjust(wspace=0.5, hspace=0.5)  # Here we set wspace and hspace to 0.5
plt.tight_layout(pad=1)  # And set pad in tight_layout to 1

plt.show()