<a target="_blank" href="https://colab.research.google.com/github/umanitoba-meagher-projects/public-experiments/blob/main/jupyter-notebooks/Visualize%20Image%20Information/photo-select-tool.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
# This script displays images in the notebook and allows for quick manual selection.
"""
Author: Zhenggang Li & A.V. Ronquillo
Date: May 19, 2024

## Purpose: The script extracts photos from the original files, allows for quick manual selection, categorizes the photos, and saves them into the required folders for future use.
## Note: The author generated this text in part with GPT-4,
OpenAI’s large-scale language-generation model. Upon generating
draft code, the author reviewed, edited, and revised the code
to their own liking and takes ultimate responsibility for
the content of this code.

"""

# Module: Introduction
This notebook defines a function to display images along with their filenames and includes another main function to manage image selection from a directory.

This overall structure is useful for processing large sets of images in manageable chunks, allowing intermittent user control and the option to stop processing.

# Module: Import Python Packages
This module imports the `Image` class and the `display` function from IPython's display module, which are used to display images in Jupyter notebooks or other IPython environments. Next, it imports the `os` module, which provides functions for interacting with the operating system, e.g., path manipulation and directory and file operations. Finally, it imports the `shutil` module, which offers high-level file operations such as copying and moving files.

In [None]:
## Import python packages
from IPython.display import Image, display
import os
import shutil

# Module: Printing Image Information


This module defines a function `show_image_with_filename` that takes two arguments: `image_path` and `image_number`. This function is designed to display an image in a notebook environment (e.g., Google Colab) and print information about the image.

`Image(filename=image_path, width=800)` creates an image object from the provided image path and sets its display width to 800 pixels for better inline visibility. This image object is then passed to the `display` function to show it in the notebook.

The print statement outputs the `image_number` and the filename of the image, where `os.path.basename(image_path)` extracts the filename from the provided image path. As a result, each image appears in the notebook with its number and filename printed beneath it.

In [None]:
## Show images in the notebook output
def show_image_with_filename(image_path, image_number):
    ### Adjust the width to suit your screen
    display(Image(filename=image_path, width=800))
    print(image_number, "Image Filename:", os.path.basename(image_path))
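
As a minimal sketch of what the print statement produces, the snippet below builds the same caption text without displaying anything. The helper name `format_image_label` and the example path are hypothetical and used only for illustration.

```python
import os

def format_image_label(image_path, image_number):
    """Build the caption text that is printed under each image."""
    return f"{image_number} Image Filename: {os.path.basename(image_path)}"

# Hypothetical path used only for illustration.
label = format_image_label("./cat-100/img_001.jpg", 1)
print(label)
```

Only the last component of the path is kept, so the caption stays short even when the images sit in a deeply nested folder.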

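# Module: Accessing the Dataset & Establishing the File Directory
The cell below defines helper functions for working with the public Borealis data repository: listing the files in a dataset, downloading a file by its ID, and unzipping a downloaded archive. It then sets `file_path` to the local image folder, `./cat-100`.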

In [None]:
# Set your image directory path
# Borealis API configuration
import requests
import zipfile

BOREALIS_SERVER = "https://borealisdata.ca"

def get_public_dataset_info(persistent_id):
    """
    Get information about a public dataset
    """
    url = f"{BOREALIS_SERVER}/api/datasets/:persistentId/"
    params = {"persistentId": persistent_id}

    response = requests.get(url, params=params)

    if response.status_code == 200:
        dataset_info = response.json()
    else:
        print(f"Cannot access dataset: {response.status_code}")
        return None

    # Get a list of files in the public dataset
    # Access the list of files from the dataset_info dictionary
    files_list = dataset_info['data']['latestVersion']['files']

    # Create an empty list to store file information
    file_info_list = []

    # Iterate through the files list and append file ID and filename to the list
    for file_info in files_list:
        file_id = file_info['dataFile']['id']
        filename = file_info['dataFile']['filename']
        file_info_list.append({"file_id": file_id, "filename": filename})

    return file_info_list

def download_public_file(file_id, save_path="./"):
    """
    Download a specific public file from a dataset by its file ID
    No authentication required
    """
    url = f"{BOREALIS_SERVER}/api/access/datafile/{file_id}"

    response = requests.get(url, stream=True)

    if response.status_code == 200:
        # Determine filename from headers or URL
        filename = None
        if "Content-Disposition" in response.headers:
            cd = response.headers["Content-Disposition"]
            # Try to extract filename from content disposition
            if "filename=" in cd:
                filename = cd.split("filename=")[1].strip('"')

        # Fallback to extracting from URL if header not available or malformed
        if not filename:
            filename = url.split("/")[-1]

        file_path = f"{save_path}/{filename}"

        with open(file_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)

        print(f"SUCCESS: File downloaded to {file_path}")
        return file_path
    else:
        print(f"ERROR: {response.status_code}: File may be restricted or not found")
        return None

def is_zip_file(filepath):
    """
    Checks if a file is a valid zip file.
    """
    return zipfile.is_zipfile(filepath)

def unzip_file(filepath, extract_path="./"):
    """
    Unzips a zip file to a specified path and returns the name of the top-level extracted folder.
    Returns None if not a zip file or extraction fails.
    """
    if is_zip_file(filepath):
        try:
            with zipfile.ZipFile(filepath, 'r') as zip_ref:
                # Get the name of the top-level directory within the zip
                # Assumes there is a single top-level directory
                top_level_folder = None
                for file_info in zip_ref.infolist():
                    parts = file_info.filename.split('/')
                    if parts[0] and len(parts) > 1:
                        top_level_folder = parts[0]
                        break # Assuming the first entry gives the top-level folder

                zip_ref.extractall(extract_path)
                print(f"SUCCESS: Successfully unzipped {filepath} to {extract_path}")
                return top_level_folder

        except Exception as e:
            print(f"ERROR: Error unzipping {filepath}: {e}")
            return None
    else:
        print(f"INFO: {filepath} is not a valid zip file.")
        return None

# Initialize Borealis dataset access
public_doi = "doi:10.5683/SP3/H3HGWF"
print("Borealis dataset initialized for animal notebook data.")

# Local path to the image folder
file_path = './cat-100'
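
The top-level folder detection inside `unzip_file` can be tried in isolation. The following is a sketch under stated assumptions: the helper name `find_top_level_folder` and the file names in the archive are hypothetical, and the archive is built in memory so that no download is needed.

```python
import io
import zipfile

def find_top_level_folder(zip_ref):
    """Return the first path component of the first nested entry, or None."""
    for name in zip_ref.namelist():
        parts = name.split("/")
        if parts[0] and len(parts) > 1:
            return parts[0]
    return None

# Build a small in-memory archive (hypothetical file names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("cat-100/img_001.jpg", b"fake image bytes")
    zf.writestr("cat-100/img_002.jpg", b"fake image bytes")

buffer.seek(0)
with zipfile.ZipFile(buffer, "r") as zf:
    top_level = find_top_level_folder(zf)
print(top_level)

# An archive without any folder yields None.
flat_buffer = io.BytesIO()
with zipfile.ZipFile(flat_buffer, "w") as zf:
    zf.writestr("img_001.jpg", b"fake image bytes")
flat_buffer.seek(0)
with zipfile.ZipFile(flat_buffer, "r") as zf:
    flat_top_level = find_top_level_folder(zf)
```

As in the notebook's function, only the first nested entry is inspected, so the result is reliable only when the archive has a single top-level folder.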

The `main()` function is defined. It declares two directory paths: the source directory containing the images (`file_path`) and the destination directory for selected images (`select_dir`).

In [None]:
## Set path
def main():
    file_path = './cat-100'
    select_dir = './cat-100-select'

After establishing the file paths, an `if not` statement checks whether `select_dir` exists. If it does not, `os.makedirs(select_dir)` creates it along with any necessary parent directories.

In [None]:
    if not os.path.exists(select_dir):
        os.makedirs(select_dir)
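
The same check-then-create step can also be written in a single call. This is an alternative sketch, not the notebook's own code: it uses the `exist_ok=True` parameter of `os.makedirs` and a temporary directory so that it can be run without touching real data.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical destination folder inside a throwaway directory.
    select_dir = os.path.join(tmp, "cat-100-select")
    os.makedirs(select_dir, exist_ok=True)
    # Calling it again does not raise, because the folder already exists.
    os.makedirs(select_dir, exist_ok=True)
    created = os.path.isdir(select_dir)
print(created)
```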

# Module: Sorting Images into Batches
`os.listdir(file_path)` lists the names of the entries in `file_path`, and `images.sort(...)` sorts them by modification time, oldest first. The total number of images is then stored in `total_images`.

In [None]:
    images = os.listdir(file_path)
    images.sort(key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
    total_images = len(images)
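
To see the modification-time ordering in action, the sketch below creates three empty files in a temporary directory and assigns them explicit timestamps with `os.utime`. The helper name `list_by_mtime`, the file names, and the timestamps are assumptions made for the example.

```python
import os
import tempfile

def list_by_mtime(directory):
    """List directory entries sorted by modification time, oldest first."""
    names = os.listdir(directory)
    names.sort(key=lambda name: os.path.getmtime(os.path.join(directory, name)))
    return names

with tempfile.TemporaryDirectory() as tmp:
    # (file name, modification time in seconds since the epoch)
    samples = [("b.jpg", 1_000_000_200), ("a.jpg", 1_000_000_300), ("c.jpg", 1_000_000_100)]
    for name, mtime in samples:
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(b"")
        os.utime(path, (mtime, mtime))  # set access and modification times
    ordered = list_by_mtime(tmp)
print(ordered)
```

The result follows the timestamps rather than the alphabetical order of the names.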

The script then initializes the batch-processing parameters. `batch_size` defines how many images are processed per batch (here, `100`), and `current_batch` counts the batches completed so far, starting at `0`.

In [None]:
    batch_size = 100 ### Separate the images into batches of 100 images each
    current_batch = 0
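
The notebook tracks batches with a counter inside a single loop. As an illustration of how a batch size divides a list, the sketch below instead slices the list into explicit batches. The helper name `split_into_batches` and the generated file names are hypothetical.

```python
def split_into_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[start:start + batch_size] for start in range(0, len(items), batch_size)]

# 250 hypothetical file names, split into batches of 100.
images = [f"img_{n:03d}.jpg" for n in range(1, 251)]
batches = split_into_batches(images, 100)
batch_lengths = [len(batch) for batch in batches]
print(batch_lengths)
```

When the image count is not a multiple of the batch size, the final batch is simply shorter than the others.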

# Module: Displaying Numbered Images through Loops
This module uses a `for` loop to iterate over each image. For each image, its full path (`image_path`) is determined and the image is displayed with its number. Then, `input` asks the user whether or not to select the image.

In [None]:
    ## Set and show the number for each image
    for i, image in enumerate(images):
        image_path = os.path.join(file_path, image)
        image_number = i + 1  ### Image numbers start at 1
        show_image_with_filename(image_path, image_number)
        choice = input("Do you want to select this image? (y/n): ")

# Module: Processing the User Input

Users are prompted to make a decision about each image viewed. This piece interprets the response: the input determines whether to select the image (move it to the destination folder), skip it, or delete it entirely.

`choice.strip()` removes any surrounding whitespace (spaces, tabs, newlines, etc.) from the user's input. If the result is an empty string (`''`), the user pressed "Enter" without typing anything, and the script assigns `'y'` to `choice`. Pressing "Enter" therefore defaults to 'yes'.

In [None]:
        ### Press ENTER=y
        if choice.strip() == '':
            choice = 'y'

Next, `elif choice.lower() == 'd'` converts the choice string to lowercase and checks whether it is `'d'`. If it is, `os.remove(image_path)` deletes the file located at `image_path`, permanently removing the image from the file system.

`continue` then immediately ends the current iteration of the loop and moves on to the next image. Once the file is deleted, there is no need to execute any further code for this iteration (like copying or moving the file). Because `continue` skips everything after it in the loop body, the batch completion check later in the loop is also skipped for a deleted image.

In [None]:
        elif choice.lower() == 'd':
            os.remove(image_path)
            continue

Similarly, `elif choice.lower() == 'n'` checks whether the user's input, converted to lowercase, equals `'n'`. Here `'n'` means the user does not select the image, so no action (like copying or moving the image) is taken: `continue` skips the rest of the code in the current iteration, including the batch completion check, and the loop proceeds to the next image.

In [None]:
        elif choice.lower() == 'n':
            continue

If the choice is `'y'`, the script moves the image to `select_dir`, the destination directory.

In [None]:
        if choice.lower() == 'y':
            shutil.move(image_path, os.path.join(select_dir, image))
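
The decision logic above can be summarized as a small pure function, which makes it easy to check each case without running the interactive loop. This is a sketch, not the notebook's code: the name `interpret_choice` and the action labels are hypothetical, and unlike the notebook it strips whitespace before every comparison, not only for the empty-input check.

```python
def interpret_choice(raw):
    """Map raw user input to an action label.

    Empty input defaults to 'select', mirroring the ENTER=y rule.
    Unrecognized input maps to 'ignore', meaning no file operation.
    """
    cleaned = raw.strip().lower()
    if cleaned == "":
        return "select"
    actions = {"y": "select", "n": "skip", "d": "delete"}
    return actions.get(cleaned, "ignore")

for sample in ["", "Y", "n", "D", "maybe"]:
    print(repr(sample), "->", interpret_choice(sample))
```

Keeping the interpretation separate from the file operations means the destructive `'d'` branch can be checked before any image is removed.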

# Module: Batch Completion Check
This module is designed to process groups of images batch-wise and checks at each image whether that image completes a batch or is the last in the series of images.

First, a conditional `if` statement checks whether the current image, numbered `i + 1` because `i` is a zero-based index, is either the last image of the current batch (`(i + 1) % batch_size == 0`) or the last image in the whole set (`(i + 1) == total_images`). `batch_size` is the predefined number of images processed in one batch.

In [None]:
        if (i + 1) % batch_size == 0 or (i + 1) == total_images:

A batch counter increment then occurs. If the condition is true, this increments a counter `current_batch` that tracks which batch is currently being processed.

In [None]:
            current_batch += 1

After the increment, the script checks whether to ask about continuing. The expression `total_images // batch_size` calculates the number of complete batches in the image set, and the prompt appears only when `current_batch` differs from that number. When the image count is an exact multiple of `batch_size`, the prompt is therefore skipped after the final batch.

When prompted, the user decides whether to proceed with the next batch. If the response is not `'y'` (yes), `break` exits the loop and stops any further processing.

In [None]:
            if current_batch != total_images // batch_size:
                next_batch = input("Do you want to start the next batch? (y/n): ")
                if next_batch.lower() != 'y':
                    break
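
The batch-boundary condition and the batch count can be checked with plain numbers. The sketch below uses hypothetical helper names (`is_batch_end`, `count_batches`) and a hypothetical total of 250 images. It also contrasts floor division, which the notebook uses, with ceiling division, which is shown here only as an alternative that counts a final partial batch.

```python
import math

def is_batch_end(index, batch_size, total_images):
    """True if the zero-based index closes a batch or is the last image."""
    position = index + 1
    return position % batch_size == 0 or position == total_images

def count_batches(total_images, batch_size):
    """Number of batches, counting a final partial batch (ceiling division)."""
    return math.ceil(total_images / batch_size)

total_images, batch_size = 250, 100
batch_ends = [i + 1 for i in range(total_images) if is_batch_end(i, batch_size, total_images)]
complete_batches = total_images // batch_size
all_batches = count_batches(total_images, batch_size)
print(batch_ends)
print(complete_batches, all_batches)
```

With 250 images, floor division gives 2 while there are 3 batch boundaries, so a comparison against `total_images // batch_size` matches the second batch rather than the last one.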

# Module: Print Statements of Completion

A message is printed indicating that the current batch has been completed successfully, showing which batch number just finished. After all batches are processed or the loop is exited early, `"Program completed successfully."` is printed to show that the program has run its course, either by completing all batches or by user interruption.

In [None]:
            print("Batch", current_batch, "completed successfully.")

    print("Program completed successfully.")

# Module: Trigger the Main Function
This block runs only if the file is executed directly, as a script or as a notebook cell, and not when it is imported as a module. It calls the `main()` function, which contains the rest of the script: setting up variables like `batch_size`, initializing `current_batch`, and looping through the images.

In [None]:
if __name__ == '__main__':
    main()

In [None]:
## import python packages
from IPython.display import Image, display
import os
import shutil

## Show images in coding surface
def show_image_with_filename(image_path, image_number):
    display(Image(filename=image_path, width=800))  ### adjust image size to suit for screen
    print(image_number, "Image Filename:", os.path.basename(image_path))

# Set your image directory path
# Borealis API configuration
import requests
import zipfile

BOREALIS_SERVER = "https://borealisdata.ca"

def get_public_dataset_info(persistent_id):
    """
    Get information about a public dataset
    """
    url = f"{BOREALIS_SERVER}/api/datasets/:persistentId/"
    params = {"persistentId": persistent_id}

    response = requests.get(url, params=params)

    if response.status_code == 200:
        dataset_info = response.json()
    else:
        print(f"Cannot access dataset: {response.status_code}")
        return None

    # Get a list of files in the public dataset
    # Access the list of files from the dataset_info dictionary
    files_list = dataset_info['data']['latestVersion']['files']

    # Create an empty list to store file information
    file_info_list = []

    # Iterate through the files list and append file ID and filename to the list
    for file_info in files_list:
        file_id = file_info['dataFile']['id']
        filename = file_info['dataFile']['filename']
        file_info_list.append({"file_id": file_id, "filename": filename})

    return file_info_list

def download_public_file(file_id, save_path="./"):
    """
    Download a specific public file from a dataset by its file ID
    No authentication required
    """
    url = f"{BOREALIS_SERVER}/api/access/datafile/{file_id}"

    response = requests.get(url, stream=True)

    if response.status_code == 200:
        # Determine filename from headers or URL
        filename = None
        if "Content-Disposition" in response.headers:
            cd = response.headers["Content-Disposition"]
            # Try to extract filename from content disposition
            if "filename=" in cd:
                filename = cd.split("filename=")[1].strip('"')

        # Fallback to extracting from URL if header not available or malformed
        if not filename:
            filename = url.split("/")[-1]

        file_path = f"{save_path}/{filename}"

        with open(file_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)

        print(f"SUCCESS: File downloaded to {file_path}")
        return file_path
    else:
        print(f"ERROR: {response.status_code}: File may be restricted or not found")
        return None

def is_zip_file(filepath):
    """
    Checks if a file is a valid zip file.
    """
    return zipfile.is_zipfile(filepath)

def unzip_file(filepath, extract_path="./"):
    """
    Unzips a zip file to a specified path and returns the name of the top-level extracted folder.
    Returns None if not a zip file or extraction fails.
    """
    if is_zip_file(filepath):
        try:
            with zipfile.ZipFile(filepath, 'r') as zip_ref:
                # Get the name of the top-level directory within the zip
                # Assumes there is a single top-level directory
                top_level_folder = None
                for file_info in zip_ref.infolist():
                    parts = file_info.filename.split('/')
                    if parts[0] and len(parts) > 1:
                        top_level_folder = parts[0]
                        break # Assuming the first entry gives the top-level folder

                zip_ref.extractall(extract_path)
                print(f"SUCCESS: Successfully unzipped {filepath} to {extract_path}")
                return top_level_folder

        except Exception as e:
            print(f"ERROR: Error unzipping {filepath}: {e}")
            return None
    else:
        print(f"INFO: {filepath} is not a valid zip file.")
        return None

# Initialize Borealis dataset access
public_doi = "doi:10.5683/SP3/H3HGWF"
print("Borealis dataset initialized for animal notebook data.")

# Local path to the image folder
file_path = './cat-100'

## Set path
def main():
    file_path = './cat-100'
    select_dir = './cat-100-select'

    if not os.path.exists(select_dir):
        os.makedirs(select_dir)

    images = os.listdir(file_path)
    images.sort(key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
    total_images = len(images)
    batch_size = 100 ### Separate the images into batches of 100 images each
    current_batch = 0

    ## Set and show the number for each image
    for i, image in enumerate(images):
        image_path = os.path.join(file_path, image)
        image_number = i + 1  ### image number start at 1
        show_image_with_filename(image_path, image_number)
        choice = input("Do you want to select this image? (y/n): ")

        ### press ENTER=y
        if choice.strip() == '':
            choice = 'y'
        elif choice.lower() == 'd':
            os.remove(image_path)
            continue
        elif choice.lower() == 'n':
            continue

        if choice.lower() == 'y':
            shutil.move(image_path, os.path.join(select_dir, image))

        if (i + 1) % batch_size == 0 or (i + 1) == total_images:
            current_batch += 1
            if current_batch != total_images // batch_size:
                next_batch = input("Do you want to start the next batch? (y/n): ")
                if next_batch.lower() != 'y':
                    break
            print("Batch", current_batch, "completed successfully.")

    print("Program completed successfully.")

if __name__ == '__main__':
    main()