# Image Captioning: Data Preparation


This notebook covers the first phase of the image captioning project: preparing the dataset. It copies the images into separate train, validation, and test directories, and writes the captions for each split to a `metadata.jsonl` file in the matching directory.


## Required Libraries

The standard-library modules imported below handle file paths (`os`), file copying (`shutil`), and JSON serialization (`json`).


In [None]:
import os
import shutil
import json

## Directory Structure and Constants

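The constants below point to the source image directory, the text files that list the image filenames for each split, the captions file, and the output directory that will hold the train, validation, and test data.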


In [None]:
# Constants
ROOT_DIR = '/content/drive/MyDrive/Msc Project/Flicker8k_Dataset'
TRAIN_TXT = '/content/drive/MyDrive/Msc Project/Dummy Dataset/TrainImages.txt'
VAL_TXT = '/content/drive/MyDrive/Msc Project/Dummy Dataset/DevImages.txt'
TEST_TXT = '/content/drive/MyDrive/Msc Project/Dummy Dataset/TestImages.txt'
NEW_DIR = '/content/drive/MyDrive/Msc Project/Dataset'
CAPTIONS_TXT = '/content/drive/MyDrive/Msc Project/Dummy Dataset/Captions.txt'
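
Because every later step depends on these paths, it can help to confirm they exist before copying anything. The following is a minimal sketch, not part of the original notebook: `find_missing_paths` is a hypothetical helper, and the temporary files stand in for the real Drive paths so the example runs anywhere.

```python
import os
import tempfile


def find_missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk (hypothetical helper)."""
    return [path for path in paths if not os.path.exists(path)]


# Stand-in files for illustration only; in the notebook you would pass
# ROOT_DIR, TRAIN_TXT, VAL_TXT, TEST_TXT, and CAPTIONS_TXT instead.
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "TrainImages.txt")
    open(existing, "w").close()  # create an empty placeholder file
    absent = os.path.join(tmp, "DevImages.txt")  # deliberately never created

    missing = find_missing_paths([existing, absent])

for path in missing:
    print(f"Missing input: {path}")
```

An empty result would mean all inputs are present, and anything printed points to a path that needs fixing before the copy step.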


## Utility Functions

### Copying Files

`setup_directories` creates the train, validation, and test directories under `NEW_DIR`. `copy_files` then copies images from `ROOT_DIR` into a target directory, using the filenames listed one per line in a `.txt` file.


In [None]:
def setup_directories():
    """
    Set up directories for train, validation, and test data.

    Returns:
    - tuple: Paths to the train, validation, and test directories.
    """

    # Define the path for the training directory using the global NEW_DIR constant
    train_dir = os.path.join(NEW_DIR, 'train')

    # Define the path for the validation directory using the global NEW_DIR constant
    val_dir = os.path.join(NEW_DIR, 'validation')

    # Define the path for the test directory using the global NEW_DIR constant
    test_dir = os.path.join(NEW_DIR, 'test')

    # Create the training directory, if it doesn't already exist
    os.makedirs(train_dir, exist_ok=True)

    # Create the validation directory, if it doesn't already exist
    os.makedirs(val_dir, exist_ok=True)

    # Create the test directory, if it doesn't already exist
    os.makedirs(test_dir, exist_ok=True)

    # Return the paths to the train, validation, and test directories
    return train_dir, val_dir, test_dir


In [None]:
def copy_files(txt_path, target_dir):
    """
    Copy images specified in a txt file to a target directory.

    Args:
    - txt_path (str): Path to the .txt file containing image filenames.
    - target_dir (str): Path to the target directory.
    """

    # Open the .txt file and extract all image filenames into a list,
    # skipping blank lines (an empty name would resolve to ROOT_DIR itself)
    with open(txt_path, 'r', encoding='utf-8') as f:
        image_names = [line.strip() for line in f if line.strip()]

    # Iterate over each image filename
    for image_name in image_names:

        # Construct the full source path for the image
        source = os.path.join(ROOT_DIR, image_name)

        # Construct the full target path where the image should be copied
        target = os.path.join(target_dir, image_name)

        # Check that the source image exists and is a regular file
        if os.path.isfile(source):

            # Copy the image from source to target directory
            shutil.copy(source, target)
        else:

            # Print a warning message if the source image is missing
            print(f"Warning: {source} does not exist or is not a file!")
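
The copy step can be tried on a throwaway directory before running it against the real dataset. The sketch below is a hypothetical variant of `copy_files`, named `copy_listed_files`. It takes the source directory as a parameter instead of reading the global `ROOT_DIR`, and it returns the copied and missing filenames instead of printing warnings. The filenames and file contents are made up for illustration.

```python
import os
import shutil
import tempfile


def copy_listed_files(list_path, source_dir, target_dir):
    """Copy the files named in `list_path` from `source_dir` to `target_dir`.

    Hypothetical variant of copy_files: returns (copied, missing) name lists.
    """
    copied, missing = [], []
    with open(list_path, "r", encoding="utf-8") as f:
        names = [line.strip() for line in f if line.strip()]  # ignore blank lines

    for name in names:
        source = os.path.join(source_dir, name)
        if os.path.isfile(source):
            shutil.copy(source, os.path.join(target_dir, name))
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing


with tempfile.TemporaryDirectory() as tmp:
    source_dir = os.path.join(tmp, "images")
    target_dir = os.path.join(tmp, "train")
    os.makedirs(source_dir)
    os.makedirs(target_dir)

    # One placeholder "image" exists; b.jpg is listed but absent.
    with open(os.path.join(source_dir, "a.jpg"), "wb") as f:
        f.write(b"placeholder bytes")
    list_path = os.path.join(tmp, "TrainImages.txt")
    with open(list_path, "w", encoding="utf-8") as f:
        f.write("a.jpg\n\nb.jpg\n")

    copied, missing = copy_listed_files(list_path, source_dir, target_dir)
    copied_on_disk = sorted(os.listdir(target_dir))

print("copied:", copied)
print("missing:", missing)
```

Returning the two lists makes it easy to assert on the outcome or to log a summary count, which is harder to do when the function only prints.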


In [None]:
def load_captions(file_path):
    """
    Load image captions from a file.

    Args:
    - file_path (str): Path to the file containing captions.

    Returns:
    - dict: Dictionary with image IDs as keys and lists of captions as values.
    """

    # Initialize an empty dictionary to store captions for each image
    image_captions = {}

    # Open the specified file in read mode with UTF-8 encoding
    with open(file_path, 'r', encoding='utf-8') as f:

        # Read all lines from the file
        lines = f.readlines()

    # Iterate over each line from the file
    for line in lines:

        # Split the line on the first tab to separate the image ID from the caption
        image_id_caption = line.strip().split('\t', 1)

        # Skip blank or malformed lines that have no tab-separated caption
        if len(image_id_caption) < 2:
            continue

        # Extract the image ID (excluding any #index) and the associated caption
        image_id = image_id_caption[0].split('#')[0]
        caption = image_id_caption[1]

        # If the image ID is not already in the dictionary, initialize an empty list
        if image_id not in image_captions:
            image_captions[image_id] = []

        # Append the caption to the list of captions for the current image ID
        image_captions[image_id].append(caption)

    # Return the dictionary containing image IDs and their associated captions
    return image_captions
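
The parsing logic above implies that each line of the captions file has the form `image_id#index<TAB>caption`. The sketch below illustrates that format on a few made-up lines. The filenames and captions are invented for the example, and `parse_caption_line` is a hypothetical helper, not a function from the notebook.

```python
# Made-up sample lines in the assumed "image_id#index<TAB>caption" format.
sample_lines = [
    "img_001.jpg#0\tA dog runs across a field .",
    "img_001.jpg#1\tA brown dog is running outside .",
    "img_002.jpg#0\tTwo children play on a beach .",
]


def parse_caption_line(line):
    """Split one caption line into (image_id, caption), dropping the '#index' suffix."""
    key, caption = line.strip().split("\t", 1)
    return key.split("#")[0], caption


# Group captions by image ID, as load_captions does.
image_captions = {}
for line in sample_lines:
    image_id, caption = parse_caption_line(line)
    image_captions.setdefault(image_id, []).append(caption)

for image_id, captions in image_captions.items():
    print(image_id, len(captions))
```

Here the two lines for `img_001.jpg` collapse into one dictionary key holding two captions.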


In [None]:
def load_image_ids(file_path):
    """
    Load image IDs from a specified file.

    Args:
    - file_path (str): Path to the file containing image IDs.

    Returns:
    - list: List of image IDs.
    """

    # Open the specified file in read mode with UTF-8 encoding
    with open(file_path, 'r', encoding='utf-8') as f:

        # Read all lines, strip surrounding whitespace, and skip blank lines
        image_ids = [line.strip() for line in f if line.strip()]

    # Return the list of image IDs
    return image_ids


In [None]:
def create_caption_list(image_captions, image_ids):
    """
    Create a list of captions with associated image IDs in a structured format.

    Args:
    - image_captions (dict): Dictionary with image IDs as keys and lists of captions as values.
    - image_ids (list): List of image IDs for which to extract captions.

    Returns:
    - list: List of dictionaries, each containing an image ID ("file_name") and a caption ("text").
    """

    # Initialize an empty list to store structured captions
    structured_captions = []

    # Iterate over the provided list of image IDs
    for image_id in image_ids:

        # For each image ID, iterate over its associated captions in the image_captions dictionary
        for caption in image_captions.get(image_id, []):

            # Append a dictionary with the image ID and caption to the structured_captions list
            structured_captions.append({"file_name": image_id, "text": caption})

    # Return the list of structured captions
    return structured_captions
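
To make the output shape concrete, the sketch below runs the same flattening on toy data and serializes each record as one JSON line. `build_records` is a compact, hypothetical equivalent of `create_caption_list`, and the image IDs and captions are made up.

```python
import json


def build_records(image_captions, image_ids):
    """Flatten {image_id: [captions]} into one record per (image, caption) pair."""
    return [
        {"file_name": image_id, "text": caption}
        for image_id in image_ids
        for caption in image_captions.get(image_id, [])  # unknown IDs yield nothing
    ]


image_captions = {
    "img_001.jpg": ["A dog runs .", "A brown dog ."],
    "img_002.jpg": ["Two children play ."],
}

# img_003.jpg has no captions, so it contributes no records.
records = build_records(image_captions, ["img_001.jpg", "img_003.jpg"])
jsonl_lines = [json.dumps(record) for record in records]

for line in jsonl_lines:
    print(line)
```

An image with several captions produces several records that share the same `file_name`, and an image ID that is missing from the captions dictionary is silently skipped.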


## Data Organization

With the utility functions defined, the next step is to organize images into their respective directories (train, validation, test) and process the image captions.


In [None]:
def main():
    """
    Main execution function to set up directories, copy images, and save captions to jsonl files.
    """
    train_dir, val_dir, test_dir = setup_directories()

    copy_files(TRAIN_TXT, train_dir)
    copy_files(VAL_TXT, val_dir)
    copy_files(TEST_TXT, test_dir)

    # Load captions and image IDs
    image_captions = load_captions(CAPTIONS_TXT)
    train_image_ids = load_image_ids(TRAIN_TXT)
    dev_image_ids = load_image_ids(VAL_TXT)
    test_image_ids = load_image_ids(TEST_TXT)

    # Create lists of captions
    train_captions = create_caption_list(image_captions, train_image_ids)
    dev_captions = create_caption_list(image_captions, dev_image_ids)
    test_captions = create_caption_list(image_captions, test_image_ids)

    # Save captions to separate jsonl files
    for set_name, captions_set in zip(["train", "validation", "test"], [train_captions, dev_captions, test_captions]):
        output_path = os.path.join(NEW_DIR, set_name, "metadata.jsonl")
        with open(output_path, 'w', encoding='utf-8') as f:
            for item in captions_set:
                f.write(json.dumps(item) + "\n")



In [None]:
# If this script is the main module, execute the main function
if __name__ == "__main__":
    main()
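
After `main()` finishes, one way to sanity-check a split is to read its `metadata.jsonl` back and confirm that every referenced image was actually copied. The sketch below assumes the layout written above (one `metadata.jsonl` per split directory, alongside the images). `read_jsonl` and `find_orphan_records` are hypothetical helpers, and the temporary directory with made-up files stands in for a real split directory.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Load a JSON Lines file into a list of dicts, ignoring blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def find_orphan_records(split_dir):
    """Return metadata records whose image file is absent from `split_dir`."""
    records = read_jsonl(os.path.join(split_dir, "metadata.jsonl"))
    return [
        record
        for record in records
        if not os.path.isfile(os.path.join(split_dir, record["file_name"]))
    ]


# Stand-in split directory: a.jpg exists, b.jpg is referenced but missing.
with tempfile.TemporaryDirectory() as split_dir:
    with open(os.path.join(split_dir, "a.jpg"), "wb") as f:
        f.write(b"placeholder bytes")
    with open(os.path.join(split_dir, "metadata.jsonl"), "w", encoding="utf-8") as f:
        f.write(json.dumps({"file_name": "a.jpg", "text": "A dog runs ."}) + "\n")
        f.write(json.dumps({"file_name": "b.jpg", "text": "A cat sleeps ."}) + "\n")

    orphans = find_orphan_records(split_dir)

print(orphans)
```

In the notebook, the same check could be run on `os.path.join(NEW_DIR, 'train')` and the other split directories. A non-empty result would indicate captions that refer to images that were not copied.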

## Summary

The images have been copied into their train, validation, and test directories, and the captions for each split have been saved as `metadata.jsonl` in the matching directory. The dataset is now ready for the next phase: model training and evaluation.
