# Image Sampling Notebook

This notebook randomly samples images (`.png`, `.jpg`, `.jpeg`) from a source folder and its subfolders, then splits the sample into training and validation sets at a 90/10 ratio.

## How to use this notebook:
1. Run each cell in order by clicking on it and pressing Shift+Enter.
2. In Step 3, set the source folder, destination folder, and sample size to match your setup.
3. The script will create 'train' and 'val' subfolders in your destination folder and populate them with the sampled images.

## Step 1: Import required libraries

This cell imports the necessary Python libraries for our script.

In [None]:
import os
import random
import shutil
from tqdm import tqdm

## Step 2: Define the image sampling function

This function will:
1. Scan the source folder and its subfolders for `.png`, `.jpg`, and `.jpeg` files
2. Randomly sample the specified number of images
3. Split the sampled images into train (90%) and validation (10%) sets
4. Copy the images to the respective folders
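
The 90/10 split in step 3 can be illustrated on its own. The sketch below mirrors the `int(0.9 * sample_size)` calculation used in the function; the helper name `split_sizes` and the `train_fraction` parameter are hypothetical and only serve this example. Because `int()` truncates, any remainder goes to the validation set.

```python
def split_sizes(sample_size, train_fraction=0.9):
    """Return (train_count, val_count) for a sample of the given size.

    Hypothetical helper for illustration; not part of the notebook's function.
    """
    # int() truncates toward zero, so any remainder lands in the validation set
    train_count = int(train_fraction * sample_size)
    return train_count, sample_size - train_count


for n in (1000, 25, 5):
    print(n, split_sizes(n))
# 1000 (900, 100)
# 25 (22, 3)
# 5 (4, 1)
```

For small samples the ratio is only approximate: a sample of 5 yields a 4/1 split, which is 80/20.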

In [None]:
def sample_images(source_folder, destination_folder, sample_size):
    # Ensure the destination folders exist
    train_folder = os.path.join(destination_folder, 'train')
    val_folder = os.path.join(destination_folder, 'val')
    os.makedirs(train_folder, exist_ok=True)
    os.makedirs(val_folder, exist_ok=True)

    # List all image files in the source folder and its subfolders
    all_images = []
    for root, _, files in tqdm(os.walk(source_folder), desc="Scanning folders"):
        all_images.extend([os.path.join(root, file) for file in files if file.lower().endswith(('.png', '.jpg', '.jpeg'))])

    # Check if the sample size is larger than the number of available images
    if sample_size > len(all_images):
        raise ValueError(f"Sample size ({sample_size}) is larger than the number of available images ({len(all_images)})")

    # Randomly sample the images; random.sample returns them in random
    # selection order, so no separate shuffle is needed before the split
    sampled_images = random.sample(all_images, sample_size)

    # Calculate the split
    train_size = int(0.9 * sample_size)
    train_images = sampled_images[:train_size]
    val_images = sampled_images[train_size:]

    # Copy the sampled images to the destination folders
    for img in tqdm(train_images, desc="Copying training images"):
        dst_path = os.path.join(train_folder, os.path.basename(img))
        shutil.copy2(img, dst_path)

    for img in tqdm(val_images, desc="Copying validation images"):
        dst_path = os.path.join(val_folder, os.path.basename(img))
        shutil.copy2(img, dst_path)

    print(f"Copied {len(train_images)} images to the training set")
    print(f"Copied {len(val_images)} images to the validation set")
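
Because the scan is recursive but the copies all land in one flat folder, two source images with the same file name in different subfolders would overwrite each other, and the printed counts would then overstate the number of files on disk. One possible way to avoid this, assuming renamed copies are acceptable, is sketched below. The helper `unique_destination` is hypothetical and not part of the function above.

```python
import os


def unique_destination(folder, filename):
    """Return a path in `folder` that does not collide with an existing file.

    Hypothetical helper for illustration only.
    """
    stem, ext = os.path.splitext(filename)
    candidate = os.path.join(folder, filename)
    counter = 1
    # Append _1, _2, ... to the stem until the name is free
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f"{stem}_{counter}{ext}")
        counter += 1
    return candidate
```

In the copy loops this would replace the plain join, for example `dst_path = unique_destination(train_folder, os.path.basename(img))`. Note that under this approach, re-running the notebook into a non-empty destination would add renamed duplicates rather than overwrite the earlier copies.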

## Step 3: Set the input parameters

This cell defines the source folder, destination folder, and sample size. Edit the values to match your setup before running it:

In [None]:
source_folder = '/downloads'
destination_folder = '/data/images/'
sample_size = 1000
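
Before running the sampling step, it may help to sanity-check these three values. The following is a minimal sketch, not part of the original notebook; the function name `check_inputs` and the specific checks are assumptions about what is worth validating. The nesting check is there because a destination inside the source folder would be picked up by the recursive scan on a later run.

```python
import os


def check_inputs(source_folder, destination_folder, sample_size):
    """Raise ValueError if the settings are unlikely to work.

    Hypothetical helper for illustration only.
    """
    if not os.path.isdir(source_folder):
        raise ValueError(f"Source folder not found: {source_folder}")
    if not isinstance(sample_size, int) or sample_size <= 0:
        raise ValueError(f"Sample size must be a positive integer, got {sample_size!r}")

    # A destination inside the source would be rescanned on later runs
    src = os.path.abspath(source_folder)
    dst = os.path.abspath(destination_folder)
    try:
        nested = os.path.commonpath([src, dst]) == src
    except ValueError:
        # Raised on Windows when the two paths are on different drives
        nested = False
    if nested:
        raise ValueError("Destination folder must not be inside the source folder")
```

Calling `check_inputs(source_folder, destination_folder, sample_size)` right after this cell would surface a misconfigured path before any scanning or copying starts.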

## Step 4: Run the image sampling function

This cell will execute the sampling process based on your input:

In [None]:
try:
    sample_images(source_folder, destination_folder, sample_size)
    print("\nImage sampling completed successfully!")
    print(f"Training images are in: {os.path.join(destination_folder, 'train')}")
    print(f"Validation images are in: {os.path.join(destination_folder, 'val')}")
except Exception as e:
    print(f"An error occurred: {e}")

## Completion

If the previous cell printed the success message rather than an error, your images have been sampled and split into training and validation sets.

You can find your sampled images in the following locations:
- Training set: `[destination_folder]/train/`
- Validation set: `[destination_folder]/val/`
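
To confirm the result, the files in each output folder can be counted and compared with the numbers printed by the sampling step. This is a minimal sketch; `count_images` and `IMAGE_EXTENSIONS` are hypothetical names, and the extension list simply repeats the one used when scanning.

```python
import os

# Same extensions the sampling function looks for
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def count_images(folder):
    """Count files directly inside `folder` that have an image extension."""
    return sum(
        1
        for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
        and os.path.isfile(os.path.join(folder, name))
    )
```

For example, `count_images(os.path.join(destination_folder, 'train'))` should match the training count reported in Step 4. A lower number would suggest that some copied files shared a name and overwrote each other.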

If you encounter any issues or have questions, please don't hesitate to ask for help.