# Data Splitting and Data Augmentation Summary

## 1. Data Splitting Process
   - Split each dataset into train, validation, and test sets at three ratios: 60/20/20, 70/15/15, and 80/10/10.
   - Applied the split to each of the three individual datasets and to the combined dataset.
   - Split each class separately, so every class is divided at the same ratio and the directory layout is identical across splits.
   - **Details**:
     - Shuffled each class's file list with `random.shuffle` before slicing it into splits; no seed is set, so the allocation differs between runs.
     - **Directory Structure**:
       - Created `train`, `val`, and `test` directories, each containing one subdirectory per class.
   - **Outcome**:
     - Generated separate training, validation, and test sets for each split ratio.
     - Printed per-class and overall split statistics.
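
The per-class counts follow from truncating the train and validation shares to whole numbers and giving the remainder to the test set, which is why the test split is sometimes slightly larger than the validation split. A minimal sketch of that arithmetic; the helper name `split_counts` is an assumption for illustration and mirrors the index calculation used in the split cells of this notebook:

```python
def split_counts(n_images, train_ratio, val_ratio):
    """Return (train, val, test) counts for one class.

    Train and validation counts are truncated toward zero; the test
    split receives whatever is left over.
    """
    n_train = int(n_images * train_ratio)
    n_val = int(n_images * val_ratio)
    n_test = n_images - n_train - n_val
    return n_train, n_val, n_test


# A class with 35 images under the 70/15/15 ratio
print(split_counts(35, 0.7, 0.15))
```

With small classes the rounding can move the effective ratio noticeably away from the nominal one, so the printed per-class counts are worth checking.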

## 2. Data Augmentation Process
   - Applied data augmentation to the pre-split datasets for every split ratio (60/20/20, 70/15/15, and 80/10/10).
   - Augmented the train, validation, and test splits separately, so every augmented image stays in the same split as its original.
   - **Augmentation Techniques**:
     - **Transformations Used**:
       - `RandomResizedCrop`: Cropped a random region covering 80 to 100% of the image area and resized it to 224×224.
       - `RandomRotation`: Rotated by a random angle of up to ±15°.
       - `RandomHorizontalFlip`: Flipped images horizontally with probability 0.5.
       - `ColorJitter`: Randomly adjusted brightness, contrast, and saturation (±20%) and hue (±0.1).
       - `RandomAffine`: Applied random translation (up to 10%), scaling (0.9 to 1.1×), and shear (up to ±10°).
       - `RandomErasing`: Erased a random rectangle covering 2 to 15% of the image area with probability 0.5.
     - **Augmentation Count**:
       - Generated 10 augmented images per original image; the originals were copied alongside them.
   - **Outcome**:
     - Created augmented datasets with multiple variants of each original image for every dataset and split ratio.
     - Printed per-class, per-split, and overall statistics for the augmented data.
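
Because the originals are copied next to their augmented variants, an augmented split can be up to eleven times the size of its source split. A small sketch of the expected counts, assuming every augmentation is saved successfully; `expected_counts` is a hypothetical helper, not part of the notebook:

```python
NUM_AUGMENTATIONS = 10  # augmented copies generated per original image


def expected_counts(n_original, n_aug=NUM_AUGMENTATIONS):
    """Return (augmented, total) image counts for a class or split."""
    augmented = n_original * n_aug
    return augmented, n_original + augmented


# The 70/15/15 training split of dataset 1 holds 796 originals
augmented, total = expected_counts(796)
print(f"Augmented: {augmented}, Total: {total}")
```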

## 3. Tools and Libraries Utilized
   - **Data Splitting**: Used Python's standard `os`, `shutil`, and `random` modules for file handling, directory management, and shuffling.
   - **Image Augmentation**: Used Pillow (`PIL`) for image loading and saving and `torchvision.transforms` for the augmentations.
   - **Progress Monitoring**: Used `tqdm` to show progress for file copying and augmentation.
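
The split cells call `random.shuffle` without setting a seed, so re-running them assigns files to different splits. If reproducibility matters, one option is a dedicated seeded generator. The sketch below assumes an arbitrary seed of 42 and a hypothetical helper name `shuffled_copy`; neither appears in the notebook:

```python
import random


def shuffled_copy(items, seed=42):
    """Return a shuffled copy of items that is identical for a given seed."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    result = sorted(items)     # sort first so the directory listing order does not matter
    rng.shuffle(result)
    return result


files = ["img_03.jpg", "img_01.jpg", "img_02.jpg", "img_04.jpg"]
# The same seed yields the same order regardless of the input order
print(shuffled_copy(files) == shuffled_copy(list(reversed(files))))
```

Sorting before shuffling matters because `os.listdir` returns entries in arbitrary order, which would otherwise defeat the seed.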

## 4. Final Results
   - Delivered train, validation, and test splits with consistent class distribution across various split ratios.
   - Generated augmented datasets with comprehensive transformations to increase data diversity.
   - Processed all three individual datasets as well as the combined dataset.
   - Provided detailed documentation and summaries, including per-class statistics and overall dataset statistics for each stage.


In [9]:
# Create the virtual environment named 'dmp'
!python3 -m venv /scratch/movi/dmp
# Install ipykernel inside the 'dmp' environment
!/scratch/movi/dmp/bin/pip install ipykernel
# Add 'dmp' as a kernel for Jupyter Notebook
!/scratch/movi/dmp/bin/python -m ipykernel install --user --name=dmp --display-name "Python (dmp)"
# # Upgrade pip in the 'dmp' environment
# !/scratch/movi/dmp/bin/python3 -m pip install --upgrade pip
# # Install necessary packages (NumPy, PyTorch, etc.) inside 'dmp'
# !/scratch/movi/dmp/bin/pip install numpy torch torchvision torchaudio pandas matplotlib scikit-learn


# !pip uninstall -y tensorflow
# !pip install numpy==1.21.4 scikit-learn==1.0.2
# import tensorflow as tf
# print("TensorFlow version:", tf.__version__)

Installed kernelspec dmp in /home/movi/.local/share/jupyter/kernels/dmp


In [2]:
# Prints the installed versions of Python, NumPy, and PyTorch libraries

import sys
import numpy as np
import torch

print(f"Python Version: {sys.version}")
print(f"NumPy Version: {np.__version__}")
print(f"PyTorch Version: {torch.__version__}")



# Function to check GPU availability and display memory statistics using PyTorch's CUDA interface

def check_gpu_status():
    # Check if GPU is available
    if torch.cuda.is_available():
        print("CUDA is available. PyTorch is using GPU.\n")

        # Get the number of available GPUs
        num_gpus = torch.cuda.device_count()
        print(f"Number of GPUs available: {num_gpus}")

        # Loop through each GPU and display its details
        for gpu_id in range(num_gpus):
            gpu_name = torch.cuda.get_device_name(gpu_id)
            gpu_memory_allocated = torch.cuda.memory_allocated(gpu_id) / (1024 ** 3)  # In GB
            gpu_memory_cached = torch.cuda.memory_reserved(gpu_id) / (1024 ** 3)      # In GB
            gpu_memory_total = torch.cuda.get_device_properties(gpu_id).total_memory / (1024 ** 3)  # In GB

            print(f"\nGPU {gpu_id}: {gpu_name}")
            print(f"  Total Memory: {gpu_memory_total:.2f} GB")
            print(f"  Memory Allocated: {gpu_memory_allocated:.2f} GB")
            print(f"  Memory Reserved (Cached): {gpu_memory_cached:.2f} GB")
    else:
        print("CUDA is not available. PyTorch is using the CPU.")

# Run the GPU status check
check_gpu_status()

Python Version: 3.9.9 (main, Mar 25 2022, 16:08:31) 
[GCC 10.3.0]
NumPy Version: 1.21.4
PyTorch Version: 1.12.1+cu113
CUDA is available. PyTorch is using GPU.

Number of GPUs available: 1

GPU 0: NVIDIA A100-SXM4-80GB MIG 3g.40gb
  Total Memory: 39.25 GB
  Memory Allocated: 0.00 GB
  Memory Reserved (Cached): 0.00 GB


# Splits dataset into train/validation/test sets with specified ratios

In [21]:
# Written by Ovi
# Code to split dataset into train, validation, and test sets, with detailed analysis and logging

import os
import random
import shutil

# Paths
original_dataset = '/scratch/movi/dm_project/data/dataset1_unique'  # Replace with the path to your original dataset
split_base_dir = '/scratch/movi/dm_project/data/split_70/dataset1_split'        # Base directory to store train/val/test splits

# Split ratios
TRAIN_RATIO = 0.7
VAL_RATIO = 0.15
TEST_RATIO = 0.15

def create_dir_structure(base_dir, class_names):
    """Create train, val, and test directories for each class."""
    for split in ['train', 'val', 'test']:
        for class_name in class_names:
            os.makedirs(os.path.join(base_dir, split, class_name), exist_ok=True)

def analyze_and_split_dataset(original_dataset, split_base_dir):
    """Analyze dataset and split into train, val, and test sets."""
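    # Keep only class subdirectories, sorted by name; stray files would otherwise get split folders of their own
    class_names = sorted(d for d in os.listdir(original_dataset)
                         if os.path.isdir(os.path.join(original_dataset, d)))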
    create_dir_structure(split_base_dir, class_names)   # Create the necessary directory structure

    total_images = 0  # Track the total number of images across all classes
    split_summary = {}  # Dictionary to store per-class split details

    # Loop through each class folder
    for class_name in class_names:
        class_path = os.path.join(original_dataset, class_name)

        if os.path.isdir(class_path):  # Ensure it's a folder
            # List all images in the class folder
            image_files = [f for f in os.listdir(class_path) if os.path.isfile(os.path.join(class_path, f))]
            random.shuffle(image_files)  # Shuffle images to ensure randomness

            # Calculate split indices
            total_images_in_class = len(image_files)
            train_end = int(total_images_in_class * TRAIN_RATIO)
            val_end = train_end + int(total_images_in_class * VAL_RATIO)

            # Split the image files into train, val, and test sets
            train_files = image_files[:train_end]
            val_files = image_files[train_end:val_end]
            test_files = image_files[val_end:]

            # Copy files to the respective split directories
            for file in train_files:
                shutil.copy(os.path.join(class_path, file), os.path.join(split_base_dir, 'train', class_name, file))
            for file in val_files:
                shutil.copy(os.path.join(class_path, file), os.path.join(split_base_dir, 'val', class_name, file))
            for file in test_files:
                shutil.copy(os.path.join(class_path, file), os.path.join(split_base_dir, 'test', class_name, file))

            # Store the split summary for this class
            split_summary[class_name] = {
                'Total': total_images_in_class,
                'Train': len(train_files),
                'Validation': len(val_files),
                'Test': len(test_files)
            }

            # Update total image count
            total_images += total_images_in_class

            # Print per-class summary
            print(f"{class_name}: {len(train_files)} train, {len(val_files)} val, {len(test_files)} test (Total: {total_images_in_class})")

    # Print overall summary
    print("\nOverall Dataset Summary:")
    print(f"Total Images: {total_images}")
    print(f"Train Ratio: {TRAIN_RATIO}, Validation Ratio: {VAL_RATIO}, Test Ratio: {TEST_RATIO}\n")

    # Print detailed split summary for all classes
    print("Detailed Split Summary:")
    for class_name, counts in split_summary.items():
        print(f"{class_name} - Total: {counts['Total']}, Train: {counts['Train']}, Val: {counts['Validation']}, Test: {counts['Test']}")

    return split_summary

# Run the split function and store the summary
dataset_summary = analyze_and_split_dataset(original_dataset, split_base_dir)


1: 24 train, 5 val, 6 test (Total: 35)
10: 110 train, 23 val, 25 test (Total: 158)
100: 110 train, 23 val, 25 test (Total: 158)
1000: 60 train, 13 val, 14 test (Total: 87)
2: 111 train, 23 val, 25 test (Total: 159)
20: 102 train, 21 val, 23 test (Total: 146)
200: 0 train, 0 val, 0 test (Total: 0)
5: 104 train, 22 val, 23 test (Total: 149)
50: 98 train, 21 val, 21 test (Total: 140)
500: 77 train, 16 val, 17 test (Total: 110)

Overall Dataset Summary:
Total Images: 1142
Train Ratio: 0.7, Validation Ratio: 0.15, Test Ratio: 0.15

Detailed Split Summary:
1 - Total: 35, Train: 24, Val: 5, Test: 6
10 - Total: 158, Train: 110, Val: 23, Test: 25
100 - Total: 158, Train: 110, Val: 23, Test: 25
1000 - Total: 87, Train: 60, Val: 13, Test: 14
2 - Total: 159, Train: 111, Val: 23, Test: 25
20 - Total: 146, Train: 102, Val: 21, Test: 23
200 - Total: 0, Train: 0, Val: 0, Test: 0
5 - Total: 149, Train: 104, Val: 22, Test: 23
50 - Total: 140, Train: 98, Val: 21, Test: 21
500 - Total: 110, Train: 77, Val: 16, Test: 17
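
Class `200` has no images in this dataset, so the split produces empty `train`, `val`, and `test` folders for it, which can confuse or break downstream loaders. A minimal sketch for flagging such classes before splitting; `find_empty_classes` is a hypothetical helper, and the temporary directory stands in for the real dataset path:

```python
import os
import tempfile


def find_empty_classes(dataset_dir):
    """Return the names of class subdirectories that contain no files."""
    empty = []
    for name in sorted(os.listdir(dataset_dir)):
        class_path = os.path.join(dataset_dir, name)
        if not os.path.isdir(class_path):
            continue  # ignore stray files at the top level
        has_file = any(
            os.path.isfile(os.path.join(class_path, f))
            for f in os.listdir(class_path)
        )
        if not has_file:
            empty.append(name)
    return empty


# Demo on a throwaway directory tree: class "100" has one file, "200" has none
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "100"))
    os.makedirs(os.path.join(root, "200"))
    open(os.path.join(root, "100", "a.jpg"), "w").close()
    print(find_empty_classes(root))
```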

In [22]:
# Written by Ovi, 2024-11-03
# Code to apply augmentations to pre-split dataset

import os
from PIL import Image
from torchvision import transforms
import shutil
from tqdm import tqdm

# Paths - update these paths
split_base_dir = '/scratch/movi/dm_project/data/split_70/dataset1_split'  # Your already split dataset
augmented_data_dir = '/scratch/movi/dm_project/data/split_70/dataset1_aug'  # Where to save augmented data

# Number of augmentations per image
NUM_AUGMENTATIONS = 10

# Define augmentation transformations
augmentation_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1), shear=10),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.15), ratio=(0.3, 3.3)),
])

def apply_augmentations():
    """Apply augmentations to each split of the pre-split dataset."""
    print("\n=== Starting Augmentation Process ===")
    
    # Create destination directory structure
    for split in ['train', 'val', 'test']:
        split_path = os.path.join(augmented_data_dir, split)
        os.makedirs(split_path, exist_ok=True)
        for class_name in os.listdir(os.path.join(split_base_dir, split)):
            class_path = os.path.join(split_path, class_name)
            os.makedirs(class_path, exist_ok=True)

    # Stats dictionary
    stats = {'train': {}, 'val': {}, 'test': {}}

    # Process each split
    for split in ['train', 'val', 'test']:
        print(f"\nProcessing {split.upper()} split:")
        split_source = os.path.join(split_base_dir, split)
        split_dest = os.path.join(augmented_data_dir, split)
        
        # Process each class
        for class_name in sorted(os.listdir(split_source)):
            class_source = os.path.join(split_source, class_name)
            class_dest = os.path.join(split_dest, class_name)
            
            if os.path.isdir(class_source):
                # Get list of original images
                original_files = [f for f in os.listdir(class_source) 
                                if os.path.isfile(os.path.join(class_source, f))]
                
                print(f"\nProcessing class {class_name}:")
                print(f"Found {len(original_files)} original images")
                
                # First copy original files
                for file in tqdm(original_files, desc="Copying originals"):
                    shutil.copy2(os.path.join(class_source, file),
                               os.path.join(class_dest, file))
                
                # Then create augmented versions
                print("Generating augmented images...")
                generated = 0  # Count only augmentations that were actually saved
                for file in tqdm(original_files, desc="Generating augmentations"):
                    img_path = os.path.join(class_source, file)
                    try:
                        with Image.open(img_path) as img:
                            # Convert to RGB if needed
                            if img.mode != 'RGB':
                                img = img.convert('RGB')

                            img_tensor = transforms.ToTensor()(img)

                            # Generate augmentations
                            for i in range(NUM_AUGMENTATIONS):
                                try:
                                    augmented_tensor = augmentation_transforms(img_tensor)
                                    augmented_img = transforms.ToPILImage()(augmented_tensor)

                                    # Save augmented image
                                    base_name = os.path.splitext(file)[0]
                                    aug_name = f"{base_name}_aug_{i+1}.jpg"
                                    augmented_img.save(os.path.join(class_dest, aug_name))
                                    generated += 1
                                except Exception as e:
                                    print(f"Error generating augmentation {i+1} for {file}: {e}")
                    except Exception as e:
                        print(f"Error processing file {file}: {e}")

                # Update stats from what was actually written, not the nominal count
                stats[split][class_name] = {
                    'original': len(original_files),
                    'augmented': generated,
                    'total': len(original_files) + generated
                }

    # Print comprehensive summary
    print("\n=== Augmentation Summary ===")
    print("\nPer Split Statistics:")
    for split in ['train', 'val', 'test']:
        print(f"\n{split.upper()} Split:")
        split_total_orig = 0
        split_total_aug = 0
        
        for class_name, counts in sorted(stats[split].items()):
            print(f"Class {class_name}:")
            print(f"  Original: {counts['original']}")
            print(f"  Augmented: {counts['augmented']}")
            print(f"  Total: {counts['total']}")
            split_total_orig += counts['original']
            split_total_aug += counts['augmented']
        
        print(f"\n{split.upper()} Split Totals:")
        print(f"  Original Images: {split_total_orig}")
        print(f"  Augmented Images: {split_total_aug}")
        print(f"  Total Images: {split_total_orig + split_total_aug}")
        print("-" * 50)

    # Overall totals
    total_orig = sum(sum(c['original'] for c in s.values()) for s in stats.values())
    total_aug = sum(sum(c['augmented'] for c in s.values()) for s in stats.values())
    print("\nOverall Dataset Statistics:")
    print(f"Total Original Images: {total_orig}")
    print(f"Total Augmented Images: {total_aug}")
    print(f"Total Images: {total_orig + total_aug}")

if __name__ == "__main__":
    apply_augmentations()


=== Starting Augmentation Process ===

Processing TRAIN split:

Processing class 1:
Found 24 original images


Copying originals: 100%|██████████| 24/24 [00:00<00:00, 221.27it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 24/24 [00:02<00:00,  8.48it/s]



Processing class 10:
Found 110 original images


Copying originals: 100%|██████████| 110/110 [00:00<00:00, 219.61it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 110/110 [00:12<00:00,  8.97it/s]



Processing class 100:
Found 110 original images


Copying originals: 100%|██████████| 110/110 [00:00<00:00, 227.08it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 110/110 [00:13<00:00,  8.43it/s]



Processing class 1000:
Found 60 original images


Copying originals: 100%|██████████| 60/60 [00:00<00:00, 181.75it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 60/60 [00:08<00:00,  7.01it/s]



Processing class 2:
Found 111 original images


Copying originals: 100%|██████████| 111/111 [00:00<00:00, 115.99it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 111/111 [00:11<00:00, 10.01it/s]



Processing class 20:
Found 102 original images


Copying originals: 100%|██████████| 102/102 [00:00<00:00, 248.90it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 102/102 [00:10<00:00,  9.32it/s]



Processing class 200:
Found 0 original images


Copying originals: 0it [00:00, ?it/s]


Generating augmented images...


Generating augmentations: 0it [00:00, ?it/s]



Processing class 5:
Found 104 original images


Copying originals: 100%|██████████| 104/104 [00:00<00:00, 209.03it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 104/104 [00:23<00:00,  4.50it/s]



Processing class 50:
Found 98 original images


Copying originals: 100%|██████████| 98/98 [00:00<00:00, 221.73it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 98/98 [00:23<00:00,  4.22it/s]



Processing class 500:
Found 77 original images


Copying originals: 100%|██████████| 77/77 [00:00<00:00, 216.09it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 77/77 [00:12<00:00,  6.39it/s]



Processing VAL split:

Processing class 1:
Found 5 original images


Copying originals: 100%|██████████| 5/5 [00:00<00:00, 196.20it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 5/5 [00:00<00:00,  9.40it/s]



Processing class 10:
Found 23 original images


Copying originals: 100%|██████████| 23/23 [00:00<00:00, 230.71it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 23/23 [00:02<00:00,  9.97it/s]



Processing class 100:
Found 23 original images


Copying originals: 100%|██████████| 23/23 [00:00<00:00, 203.77it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 23/23 [00:02<00:00,  9.29it/s]



Processing class 1000:
Found 13 original images


Copying originals: 100%|██████████| 13/13 [00:00<00:00, 212.94it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 13/13 [00:02<00:00,  6.29it/s]



Processing class 2:
Found 23 original images


Copying originals: 100%|██████████| 23/23 [00:00<00:00, 188.05it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 23/23 [00:02<00:00,  8.99it/s]



Processing class 20:
Found 21 original images


Copying originals: 100%|██████████| 21/21 [00:00<00:00, 225.45it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 21/21 [00:02<00:00,  9.32it/s]



Processing class 200:
Found 0 original images


Copying originals: 0it [00:00, ?it/s]


Generating augmented images...


Generating augmentations: 0it [00:00, ?it/s]



Processing class 5:
Found 22 original images


Copying originals: 100%|██████████| 22/22 [00:00<00:00, 236.50it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 22/22 [00:03<00:00,  6.12it/s]



Processing class 50:
Found 21 original images


Copying originals: 100%|██████████| 21/21 [00:00<00:00, 217.65it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 21/21 [00:02<00:00,  8.47it/s]



Processing class 500:
Found 16 original images


Copying originals: 100%|██████████| 16/16 [00:00<00:00, 200.67it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 16/16 [00:01<00:00, 10.28it/s]



Processing TEST split:

Processing class 1:
Found 6 original images


Copying originals: 100%|██████████| 6/6 [00:00<00:00, 250.34it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 6/6 [00:00<00:00,  9.66it/s]



Processing class 10:
Found 25 original images


Copying originals: 100%|██████████| 25/25 [00:00<00:00, 230.48it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 25/25 [00:02<00:00, 10.22it/s]



Processing class 100:
Found 25 original images


Copying originals: 100%|██████████| 25/25 [00:00<00:00, 191.59it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 25/25 [00:06<00:00,  3.58it/s]



Processing class 1000:
Found 14 original images


Copying originals: 100%|██████████| 14/14 [00:00<00:00, 231.46it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 14/14 [00:01<00:00,  9.83it/s]



Processing class 2:
Found 25 original images


Copying originals: 100%|██████████| 25/25 [00:00<00:00, 239.80it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 25/25 [00:02<00:00, 10.10it/s]



Processing class 20:
Found 23 original images


Copying originals: 100%|██████████| 23/23 [00:00<00:00, 216.56it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 23/23 [00:02<00:00,  8.26it/s]



Processing class 200:
Found 0 original images


Copying originals: 0it [00:00, ?it/s]


Generating augmented images...


Generating augmentations: 0it [00:00, ?it/s]



Processing class 5:
Found 23 original images


Copying originals: 100%|██████████| 23/23 [00:00<00:00, 125.41it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 23/23 [00:02<00:00, 10.34it/s]



Processing class 50:
Found 21 original images


Copying originals: 100%|██████████| 21/21 [00:00<00:00, 194.43it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 21/21 [00:02<00:00,  9.65it/s]



Processing class 500:
Found 17 original images


Copying originals: 100%|██████████| 17/17 [00:00<00:00, 214.55it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 17/17 [00:02<00:00,  7.29it/s]


=== Augmentation Summary ===

Per Split Statistics:

TRAIN Split:
Class 1:
  Original: 24
  Augmented: 240
  Total: 264
Class 10:
  Original: 110
  Augmented: 1100
  Total: 1210
Class 100:
  Original: 110
  Augmented: 1100
  Total: 1210
Class 1000:
  Original: 60
  Augmented: 600
  Total: 660
Class 2:
  Original: 111
  Augmented: 1110
  Total: 1221
Class 20:
  Original: 102
  Augmented: 1020
  Total: 1122
Class 200:
  Original: 0
  Augmented: 0
  Total: 0
Class 5:
  Original: 104
  Augmented: 1040
  Total: 1144
Class 50:
  Original: 98
  Augmented: 980
  Total: 1078
Class 500:
  Original: 77
  Augmented: 770
  Total: 847

TRAIN Split Totals:
  Original Images: 796
  Augmented Images: 7960
  Total Images: 8756
--------------------------------------------------

VAL Split:
Class 1:
  Original: 5
  Augmented: 50
  Total: 55
Class 10:
  Original: 23
  Augmented: 230
  Total: 253
Class 100:
  Original: 23
  Augmented: 230
  Total: 253
Class 1000:
  Original: 13
  Augmented: 130
  Total: 143
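
The printed summary reports in-memory counters rather than inspecting the output folders. A hedged sketch for cross-checking those numbers against the files actually on disk, relying on the `_aug_` infix used when saving; `count_on_disk` is a hypothetical helper, and the temporary folder stands in for a real class directory:

```python
import os
import tempfile


def count_on_disk(class_dir):
    """Return (original, augmented) file counts for one class folder.

    Files whose stem contains '_aug_' are treated as augmented; this
    assumes no original filename contains that infix.
    """
    original = augmented = 0
    for name in os.listdir(class_dir):
        if not os.path.isfile(os.path.join(class_dir, name)):
            continue
        stem = os.path.splitext(name)[0]
        if "_aug_" in stem:
            augmented += 1
        else:
            original += 1
    return original, augmented


# Demo: one original with two augmented variants
with tempfile.TemporaryDirectory() as class_dir:
    for name in ["cat.png", "cat_aug_1.jpg", "cat_aug_2.jpg"]:
        open(os.path.join(class_dir, name), "w").close()
    print(count_on_disk(class_dir))
```

A mismatch between these counts and the summary would point to failed saves or to filename collisions, for example two originals that share a stem but differ in extension.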




In [23]:
# Written by Ovi
# Code to split dataset into train, validation, and test sets, with detailed analysis and logging

import os
import random
import shutil

# Paths
original_dataset = '/scratch/movi/dm_project/data/dataset2_unique'  # Replace with the path to your original dataset
split_base_dir = '/scratch/movi/dm_project/data/split_80/dataset2_split'        # Base directory to store train/val/test splits

# Split ratios
TRAIN_RATIO = 0.8
VAL_RATIO = 0.1
TEST_RATIO = 0.1

def create_dir_structure(base_dir, class_names):
    """Create train, val, and test directories for each class."""
    for split in ['train', 'val', 'test']:
        for class_name in class_names:
            os.makedirs(os.path.join(base_dir, split, class_name), exist_ok=True)

def analyze_and_split_dataset(original_dataset, split_base_dir):
    """Analyze dataset and split into train, val, and test sets."""
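    # Keep only class subdirectories, sorted by name; stray files would otherwise get split folders of their own
    class_names = sorted(d for d in os.listdir(original_dataset)
                         if os.path.isdir(os.path.join(original_dataset, d)))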
    create_dir_structure(split_base_dir, class_names)   # Create the necessary directory structure

    total_images = 0  # Track the total number of images across all classes
    split_summary = {}  # Dictionary to store per-class split details

    # Loop through each class folder
    for class_name in class_names:
        class_path = os.path.join(original_dataset, class_name)

        if os.path.isdir(class_path):  # Ensure it's a folder
            # List all images in the class folder
            image_files = [f for f in os.listdir(class_path) if os.path.isfile(os.path.join(class_path, f))]
            random.shuffle(image_files)  # Shuffle images to ensure randomness

            # Calculate split indices
            total_images_in_class = len(image_files)
            train_end = int(total_images_in_class * TRAIN_RATIO)
            val_end = train_end + int(total_images_in_class * VAL_RATIO)

            # Split the image files into train, val, and test sets
            train_files = image_files[:train_end]
            val_files = image_files[train_end:val_end]
            test_files = image_files[val_end:]

            # Copy files to the respective split directories
            for file in train_files:
                shutil.copy(os.path.join(class_path, file), os.path.join(split_base_dir, 'train', class_name, file))
            for file in val_files:
                shutil.copy(os.path.join(class_path, file), os.path.join(split_base_dir, 'val', class_name, file))
            for file in test_files:
                shutil.copy(os.path.join(class_path, file), os.path.join(split_base_dir, 'test', class_name, file))

            # Store the split summary for this class
            split_summary[class_name] = {
                'Total': total_images_in_class,
                'Train': len(train_files),
                'Validation': len(val_files),
                'Test': len(test_files)
            }

            # Update total image count
            total_images += total_images_in_class

            # Print per-class summary
            print(f"{class_name}: {len(train_files)} train, {len(val_files)} val, {len(test_files)} test (Total: {total_images_in_class})")

    # Print overall summary
    print("\nOverall Dataset Summary:")
    print(f"Total Images: {total_images}")
    print(f"Train Ratio: {TRAIN_RATIO}, Validation Ratio: {VAL_RATIO}, Test Ratio: {TEST_RATIO}\n")

    # Print detailed split summary for all classes
    print("Detailed Split Summary:")
    for class_name, counts in split_summary.items():
        print(f"{class_name} - Total: {counts['Total']}, Train: {counts['Train']}, Val: {counts['Validation']}, Test: {counts['Test']}")

    return split_summary

# Run the split function and store the summary
dataset_summary = analyze_and_split_dataset(original_dataset, split_base_dir)


1: 28 train, 3 val, 5 test (Total: 36)
10: 107 train, 13 val, 14 test (Total: 134)
100: 111 train, 13 val, 15 test (Total: 139)
1000: 60 train, 7 val, 9 test (Total: 76)
2: 102 train, 12 val, 14 test (Total: 128)
20: 104 train, 13 val, 13 test (Total: 130)
200: 15 train, 1 val, 3 test (Total: 19)
5: 96 train, 12 val, 13 test (Total: 121)
50: 94 train, 11 val, 13 test (Total: 118)
500: 72 train, 9 val, 10 test (Total: 91)

Overall Dataset Summary:
Total Images: 992
Train Ratio: 0.8, Validation Ratio: 0.1, Test Ratio: 0.1

Detailed Split Summary:
1 - Total: 36, Train: 28, Val: 3, Test: 5
10 - Total: 134, Train: 107, Val: 13, Test: 14
100 - Total: 139, Train: 111, Val: 13, Test: 15
1000 - Total: 76, Train: 60, Val: 7, Test: 9
2 - Total: 128, Train: 102, Val: 12, Test: 14
20 - Total: 130, Train: 104, Val: 13, Test: 13
200 - Total: 19, Train: 15, Val: 1, Test: 3
5 - Total: 121, Train: 96, Val: 12, Test: 13
50 - Total: 118, Train: 94, Val: 11, Test: 13
500 - Total: 91, Train: 72, Val: 9, Test: 10
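
Since each class list is sliced once after shuffling, the three splits should never share a file. A minimal sketch that verifies this on the output folders; `find_split_overlaps` is a hypothetical helper, and it compares filenames only, so it would not catch duplicate images stored under different names:

```python
import os
import tempfile
from itertools import combinations

SPLITS = ("train", "val", "test")


def find_split_overlaps(split_base_dir):
    """Return {(class, split_a, split_b): shared filenames} for any overlap."""
    overlaps = {}
    train_dir = os.path.join(split_base_dir, "train")
    for class_name in sorted(os.listdir(train_dir)):
        files = {}
        for split in SPLITS:
            class_dir = os.path.join(split_base_dir, split, class_name)
            files[split] = set(os.listdir(class_dir)) if os.path.isdir(class_dir) else set()
        for a, b in combinations(SPLITS, 2):
            shared = files[a] & files[b]
            if shared:
                overlaps[(class_name, a, b)] = sorted(shared)
    return overlaps


# Demo on a throwaway tree where one file was wrongly placed in two splits
with tempfile.TemporaryDirectory() as root:
    layout = {"train": ["a.jpg", "b.jpg"], "val": ["c.jpg"], "test": ["a.jpg"]}
    for split, names in layout.items():
        class_dir = os.path.join(root, split, "100")
        os.makedirs(class_dir)
        for name in names:
            open(os.path.join(class_dir, name), "w").close()
    print(find_split_overlaps(root))
```

An empty dictionary means no filename appears in more than one split for any class.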

In [24]:
# Written by Ovi, 2024-11-03
# Code to apply augmentations to pre-split dataset

import os
from PIL import Image
from torchvision import transforms
import shutil
from tqdm import tqdm

# Paths - update these paths
split_base_dir = '/scratch/movi/dm_project/data/split_80/dataset2_split'  # Your already split dataset
augmented_data_dir = '/scratch/movi/dm_project/data/split_80/dataset2_aug'  # Where to save augmented data

# Number of augmentations per image
NUM_AUGMENTATIONS = 10

# Define augmentation transformations
augmentation_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1), shear=10),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.15), ratio=(0.3, 3.3)),
])

def apply_augmentations():
    """Apply augmentations to each split of the pre-split dataset."""
    print("\n=== Starting Augmentation Process ===")
    
    # Create destination directory structure
    for split in ['train', 'val', 'test']:
        split_path = os.path.join(augmented_data_dir, split)
        os.makedirs(split_path, exist_ok=True)
        for class_name in os.listdir(os.path.join(split_base_dir, split)):
            class_path = os.path.join(split_path, class_name)
            os.makedirs(class_path, exist_ok=True)

    # Stats dictionary
    stats = {'train': {}, 'val': {}, 'test': {}}

    # Process each split
    for split in ['train', 'val', 'test']:
        print(f"\nProcessing {split.upper()} split:")
        split_source = os.path.join(split_base_dir, split)
        split_dest = os.path.join(augmented_data_dir, split)
        
        # Process each class
        for class_name in sorted(os.listdir(split_source)):
            class_source = os.path.join(split_source, class_name)
            class_dest = os.path.join(split_dest, class_name)
            
            if os.path.isdir(class_source):
                # Get list of original images
                original_files = [f for f in os.listdir(class_source) 
                                if os.path.isfile(os.path.join(class_source, f))]
                
                print(f"\nProcessing class {class_name}:")
                print(f"Found {len(original_files)} original images")
                
                # First copy original files
                for file in tqdm(original_files, desc="Copying originals"):
                    shutil.copy2(os.path.join(class_source, file),
                               os.path.join(class_dest, file))
                
                # Then create augmented versions
                print("Generating augmented images...")
                generated = 0  # Count only augmentations that were actually saved
                for file in tqdm(original_files, desc="Generating augmentations"):
                    img_path = os.path.join(class_source, file)
                    try:
                        with Image.open(img_path) as img:
                            # Convert to RGB if needed
                            if img.mode != 'RGB':
                                img = img.convert('RGB')

                            img_tensor = transforms.ToTensor()(img)

                            # Generate augmentations
                            for i in range(NUM_AUGMENTATIONS):
                                try:
                                    augmented_tensor = augmentation_transforms(img_tensor)
                                    augmented_img = transforms.ToPILImage()(augmented_tensor)

                                    # Save augmented image
                                    base_name = os.path.splitext(file)[0]
                                    aug_name = f"{base_name}_aug_{i+1}.jpg"
                                    augmented_img.save(os.path.join(class_dest, aug_name))
                                    generated += 1
                                except Exception as e:
                                    print(f"Error generating augmentation {i+1} for {file}: {e}")
                    except Exception as e:
                        print(f"Error processing file {file}: {e}")

                # Update stats from what was actually written, not the nominal count
                stats[split][class_name] = {
                    'original': len(original_files),
                    'augmented': generated,
                    'total': len(original_files) + generated
                }

    # Print comprehensive summary
    print("\n=== Augmentation Summary ===")
    print("\nPer Split Statistics:")
    for split in ['train', 'val', 'test']:
        print(f"\n{split.upper()} Split:")
        split_total_orig = 0
        split_total_aug = 0
        
        for class_name, counts in sorted(stats[split].items()):
            print(f"Class {class_name}:")
            print(f"  Original: {counts['original']}")
            print(f"  Augmented: {counts['augmented']}")
            print(f"  Total: {counts['total']}")
            split_total_orig += counts['original']
            split_total_aug += counts['augmented']
        
        print(f"\n{split.upper()} Split Totals:")
        print(f"  Original Images: {split_total_orig}")
        print(f"  Augmented Images: {split_total_aug}")
        print(f"  Total Images: {split_total_orig + split_total_aug}")
        print("-" * 50)

    # Overall totals
    total_orig = sum(sum(c['original'] for c in s.values()) for s in stats.values())
    total_aug = sum(sum(c['augmented'] for c in s.values()) for s in stats.values())
    print("\nOverall Dataset Statistics:")
    print(f"Total Original Images: {total_orig}")
    print(f"Total Augmented Images: {total_aug}")
    print(f"Total Images: {total_orig + total_aug}")

if __name__ == "__main__":
    apply_augmentations()


=== Starting Augmentation Process ===

Processing TRAIN split:

Processing class 1:
Found 28 original images


Copying originals: 100%|██████████| 28/28 [00:00<00:00, 199.21it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 28/28 [00:05<00:00,  4.97it/s]



Processing class 10:
Found 107 original images


Copying originals: 100%|██████████| 107/107 [00:00<00:00, 206.80it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 107/107 [00:15<00:00,  7.06it/s]



Processing class 100:
Found 111 original images


Copying originals: 100%|██████████| 111/111 [00:00<00:00, 206.76it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 111/111 [00:40<00:00,  2.74it/s]



Processing class 1000:
Found 60 original images


Copying originals: 100%|██████████| 60/60 [00:00<00:00, 236.38it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 60/60 [00:07<00:00,  7.54it/s]



Processing class 2:
Found 102 original images


Copying originals: 100%|██████████| 102/102 [00:00<00:00, 171.25it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 102/102 [00:17<00:00,  5.88it/s]



Processing class 20:
Found 104 original images


Copying originals: 100%|██████████| 104/104 [00:00<00:00, 246.17it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 104/104 [00:10<00:00,  9.47it/s]



Processing class 200:
Found 15 original images


Copying originals: 100%|██████████| 15/15 [00:00<00:00, 129.59it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 15/15 [00:03<00:00,  4.49it/s]



Processing class 5:
Found 96 original images


Copying originals: 100%|██████████| 96/96 [00:00<00:00, 194.49it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 96/96 [00:10<00:00,  9.22it/s]



Processing class 50:
Found 94 original images


Copying originals: 100%|██████████| 94/94 [00:00<00:00, 191.03it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 94/94 [00:22<00:00,  4.19it/s]



Processing class 500:
Found 72 original images


Copying originals: 100%|██████████| 72/72 [00:00<00:00, 198.24it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 72/72 [00:07<00:00, 10.14it/s]



Processing VAL split:

Processing class 1:
Found 3 original images


Copying originals: 100%|██████████| 3/3 [00:00<00:00, 189.48it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 3/3 [00:00<00:00, 10.21it/s]



Processing class 10:
Found 13 original images


Copying originals: 100%|██████████| 13/13 [00:00<00:00, 210.59it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 13/13 [00:01<00:00, 10.35it/s]



Processing class 100:
Found 13 original images


Copying originals: 100%|██████████| 13/13 [00:00<00:00, 165.43it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 13/13 [00:01<00:00, 10.00it/s]



Processing class 1000:
Found 7 original images


Copying originals: 100%|██████████| 7/7 [00:00<00:00, 176.64it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 7/7 [00:00<00:00, 10.07it/s]



Processing class 2:
Found 12 original images


Copying originals: 100%|██████████| 12/12 [00:00<00:00, 196.65it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 12/12 [00:01<00:00,  8.57it/s]



Processing class 20:
Found 13 original images


Copying originals: 100%|██████████| 13/13 [00:00<00:00, 210.52it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 13/13 [00:17<00:00,  1.34s/it]



Processing class 200:
Found 1 original images


Copying originals: 100%|██████████| 1/1 [00:00<00:00, 193.05it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 1/1 [00:00<00:00,  9.99it/s]



Processing class 5:
Found 12 original images


Copying originals: 100%|██████████| 12/12 [00:00<00:00, 158.39it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 12/12 [00:04<00:00,  2.62it/s]



Processing class 50:
Found 11 original images


Copying originals: 100%|██████████| 11/11 [00:00<00:00, 228.56it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 11/11 [00:01<00:00, 10.55it/s]



Processing class 500:
Found 9 original images


Copying originals: 100%|██████████| 9/9 [00:00<00:00, 207.64it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 9/9 [00:05<00:00,  1.69it/s]



Processing TEST split:

Processing class 1:
Found 5 original images


Copying originals: 100%|██████████| 5/5 [00:00<00:00, 205.09it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 5/5 [00:02<00:00,  2.34it/s]



Processing class 10:
Found 14 original images


Copying originals: 100%|██████████| 14/14 [00:00<00:00, 40.34it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 14/14 [00:01<00:00,  9.90it/s]



Processing class 100:
Found 15 original images


Copying originals: 100%|██████████| 15/15 [00:00<00:00, 208.67it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 15/15 [00:01<00:00,  8.07it/s]



Processing class 1000:
Found 9 original images


Copying originals: 100%|██████████| 9/9 [00:00<00:00, 219.42it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 9/9 [00:00<00:00,  9.18it/s]



Processing class 2:
Found 14 original images


Copying originals: 100%|██████████| 14/14 [00:00<00:00, 38.52it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 14/14 [00:09<00:00,  1.49it/s]



Processing class 20:
Found 13 original images


Copying originals: 100%|██████████| 13/13 [00:00<00:00, 161.95it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 13/13 [00:01<00:00, 11.34it/s]



Processing class 200:
Found 3 original images


Copying originals: 100%|██████████| 3/3 [00:00<00:00, 195.20it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 3/3 [00:00<00:00, 11.32it/s]



Processing class 5:
Found 13 original images


Copying originals: 100%|██████████| 13/13 [00:00<00:00, 219.85it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 13/13 [00:01<00:00, 10.92it/s]



Processing class 50:
Found 13 original images


Copying originals: 100%|██████████| 13/13 [00:00<00:00, 208.03it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 13/13 [00:01<00:00, 11.03it/s]



Processing class 500:
Found 10 original images


Copying originals: 100%|██████████| 10/10 [00:00<00:00, 218.29it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 10/10 [00:00<00:00, 10.93it/s]


=== Augmentation Summary ===

Per Split Statistics:

TRAIN Split:
Class 1:
  Original: 28
  Augmented: 280
  Total: 308
Class 10:
  Original: 107
  Augmented: 1070
  Total: 1177
Class 100:
  Original: 111
  Augmented: 1110
  Total: 1221
Class 1000:
  Original: 60
  Augmented: 600
  Total: 660
Class 2:
  Original: 102
  Augmented: 1020
  Total: 1122
Class 20:
  Original: 104
  Augmented: 1040
  Total: 1144
Class 200:
  Original: 15
  Augmented: 150
  Total: 165
Class 5:
  Original: 96
  Augmented: 960
  Total: 1056
Class 50:
  Original: 94
  Augmented: 940
  Total: 1034
Class 500:
  Original: 72
  Augmented: 720
  Total: 792

TRAIN Split Totals:
  Original Images: 789
  Augmented Images: 7890
  Total Images: 8679
--------------------------------------------------

VAL Split:
Class 1:
  Original: 3
  Augmented: 30
  Total: 33
Class 10:
  Original: 13
  Augmented: 130
  Total: 143
Class 100:
  Original: 13
  Augmented: 130
  Total: 143
Class 1000:
  Original: 7
  Augmented: 70
  Total: 77
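
One way to cross-check the reported totals is to count the files actually present in an augmented class directory. The sketch below is a minimal illustration, not part of the notebook: `count_images` is a hypothetical helper, and it assumes that augmented files can be recognised by the `_aug_` marker the augmentation cell puts in their names (an original file whose name already contains `_aug_` would be miscounted).

```python
import os


def count_images(class_dir, aug_marker="_aug_"):
    """Count original and augmented files in one class directory.

    Assumption: augmented files contain `aug_marker` in their base name,
    matching the "<base>_aug_<i>.jpg" naming used by the augmentation cell.
    """
    originals = 0
    augmented = 0
    for name in os.listdir(class_dir):
        # Skip sub-directories and anything that is not a regular file
        if not os.path.isfile(os.path.join(class_dir, name)):
            continue
        if aug_marker in os.path.splitext(name)[0]:
            augmented += 1
        else:
            originals += 1
    return originals, augmented
```

Calling it on a class folder under `augmented_data_dir` (for example the `train` folder of one class) should return the same "Original" and "Augmented" figures as the summary if every augmentation was saved.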




In [25]:
# Written by Ovi
# Code to split dataset into train, validation, and test sets, with detailed analysis and logging

import os
import random
import shutil

# Paths
original_dataset = '/scratch/movi/dm_project/data/dataset3_unique'  # Replace with the path to your original dataset
split_base_dir = '/scratch/movi/dm_project/data/split_80/dataset3_split'        # Base directory to store train/val/test splits

# Split ratios
TRAIN_RATIO = 0.8
VAL_RATIO = 0.1
TEST_RATIO = 0.1

def create_dir_structure(base_dir, class_names):
    """Create train, val, and test directories for each class."""
    for split in ['train', 'val', 'test']:
        for class_name in class_names:
            os.makedirs(os.path.join(base_dir, split, class_name), exist_ok=True)

def analyze_and_split_dataset(original_dataset, split_base_dir):
    """Analyze dataset and split into train, val, and test sets."""
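    # Keep only sub-directories as classes, sorted lexicographically (so '10' comes before '2')
    class_names = sorted(
        name for name in os.listdir(original_dataset)
        if os.path.isdir(os.path.join(original_dataset, name))
    )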
    create_dir_structure(split_base_dir, class_names)   # Create the necessary directory structure

    total_images = 0  # Track the total number of images across all classes
    split_summary = {}  # Dictionary to store per-class split details

    # Loop through each class folder
    for class_name in class_names:
        class_path = os.path.join(original_dataset, class_name)

        if os.path.isdir(class_path):  # Ensure it's a folder
            # List all images in the class folder
            image_files = [f for f in os.listdir(class_path) if os.path.isfile(os.path.join(class_path, f))]
            random.shuffle(image_files)  # Shuffle images to ensure randomness

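            # Calculate split indices. int() rounds the train and val counts down,
            # and the test set takes whatever remains, so TEST_RATIO is only reported
            # in the summary and the test share can slightly exceed it.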
            total_images_in_class = len(image_files)
            train_end = int(total_images_in_class * TRAIN_RATIO)
            val_end = train_end + int(total_images_in_class * VAL_RATIO)

            # Split the image files into train, val, and test sets
            train_files = image_files[:train_end]
            val_files = image_files[train_end:val_end]
            test_files = image_files[val_end:]

            # Copy files to the respective split directories
            for file in train_files:
                shutil.copy(os.path.join(class_path, file), os.path.join(split_base_dir, 'train', class_name, file))
            for file in val_files:
                shutil.copy(os.path.join(class_path, file), os.path.join(split_base_dir, 'val', class_name, file))
            for file in test_files:
                shutil.copy(os.path.join(class_path, file), os.path.join(split_base_dir, 'test', class_name, file))

            # Store the split summary for this class
            split_summary[class_name] = {
                'Total': total_images_in_class,
                'Train': len(train_files),
                'Validation': len(val_files),
                'Test': len(test_files)
            }

            # Update total image count
            total_images += total_images_in_class

            # Print per-class summary
            print(f"{class_name}: {len(train_files)} train, {len(val_files)} val, {len(test_files)} test (Total: {total_images_in_class})")

    # Print overall summary
    print("\nOverall Dataset Summary:")
    print(f"Total Images: {total_images}")
    print(f"Train Ratio: {TRAIN_RATIO}, Validation Ratio: {VAL_RATIO}, Test Ratio: {TEST_RATIO}\n")

    # Print detailed split summary for all classes
    print("Detailed Split Summary:")
    for class_name, counts in split_summary.items():
        print(f"{class_name} - Total: {counts['Total']}, Train: {counts['Train']}, Val: {counts['Validation']}, Test: {counts['Test']}")

    return split_summary

# Run the split function and store the summary
dataset_summary = analyze_and_split_dataset(original_dataset, split_base_dir)


1: 0 train, 0 val, 0 test (Total: 0)
10: 160 train, 20 val, 20 test (Total: 200)
100: 68 train, 8 val, 9 test (Total: 85)
1000: 131 train, 16 val, 17 test (Total: 164)
2: 77 train, 9 val, 11 test (Total: 97)
20: 173 train, 21 val, 23 test (Total: 217)
200: 0 train, 0 val, 0 test (Total: 0)
5: 124 train, 15 val, 16 test (Total: 155)
50: 172 train, 21 val, 23 test (Total: 216)
500: 149 train, 18 val, 20 test (Total: 187)

Overall Dataset Summary:
Total Images: 1321
Train Ratio: 0.8, Validation Ratio: 0.1, Test Ratio: 0.1

Detailed Split Summary:
1 - Total: 0, Train: 0, Val: 0, Test: 0
10 - Total: 200, Train: 160, Val: 20, Test: 20
100 - Total: 85, Train: 68, Val: 8, Test: 9
1000 - Total: 164, Train: 131, Val: 16, Test: 17
2 - Total: 97, Train: 77, Val: 9, Test: 11
20 - Total: 217, Train: 173, Val: 21, Test: 23
200 - Total: 0, Train: 0, Val: 0, Test: 0
5 - Total: 155, Train: 124, Val: 15, Test: 16
50 - Total: 216, Train: 172, Val: 21, Test: 23
500 - Total: 187, Train: 149, Val: 18, Test: 20
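
The per-class counts above follow from how the split indices are computed: the train and validation counts are rounded down with `int()`, and the test set receives the remainder. The sketch below reproduces that arithmetic for a few of the class sizes printed above. `split_counts` is a hypothetical helper name used only for illustration; the default ratios are the 0.8 and 0.1 values from the cell.

```python
def split_counts(n_images, train_ratio=0.8, val_ratio=0.1):
    """Return (train, val, test) counts using the split cell's index arithmetic."""
    train_end = int(n_images * train_ratio)          # train count, rounded down
    val_end = train_end + int(n_images * val_ratio)  # val count, rounded down
    # The test set is everything after val_end, i.e. the remainder
    return train_end, val_end - train_end, n_images - val_end


if __name__ == "__main__":
    for n in (200, 85, 164, 0):
        print(n, split_counts(n))
```

Because of the rounding, small classes drift from the nominal ratio: a class with 85 images yields 68/8/9, so the test set ends up one image larger than the validation set.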

In [26]:
# Written by Ovi, 2024-11-03
# Code to apply augmentations to pre-split dataset

import os
from PIL import Image
from torchvision import transforms
import shutil
from tqdm import tqdm

# Paths - update these paths
split_base_dir = '/scratch/movi/dm_project/data/split_80/dataset3_split'  # Your already split dataset
augmented_data_dir = '/scratch/movi/dm_project/data/split_80/dataset3_aug'  # Where to save augmented data

# Number of augmentations per image
NUM_AUGMENTATIONS = 10

# Define augmentation transformations
augmentation_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1), shear=10),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.15), ratio=(0.3, 3.3)),
])

def apply_augmentations():
    """Apply augmentations to each split of the pre-split dataset."""
    print("\n=== Starting Augmentation Process ===")
    
    # Create destination directory structure
    for split in ['train', 'val', 'test']:
        split_path = os.path.join(augmented_data_dir, split)
        os.makedirs(split_path, exist_ok=True)
        for class_name in os.listdir(os.path.join(split_base_dir, split)):
            class_path = os.path.join(split_path, class_name)
            os.makedirs(class_path, exist_ok=True)

    # Stats dictionary
    stats = {'train': {}, 'val': {}, 'test': {}}

    # Process each split
    for split in ['train', 'val', 'test']:
        print(f"\nProcessing {split.upper()} split:")
        split_source = os.path.join(split_base_dir, split)
        split_dest = os.path.join(augmented_data_dir, split)
        
        # Process each class
        for class_name in sorted(os.listdir(split_source)):
            class_source = os.path.join(split_source, class_name)
            class_dest = os.path.join(split_dest, class_name)
            
            if os.path.isdir(class_source):
                # Get list of original images
                original_files = [f for f in os.listdir(class_source) 
                                if os.path.isfile(os.path.join(class_source, f))]
                
                print(f"\nProcessing class {class_name}:")
                print(f"Found {len(original_files)} original images")
                
                # First copy original files
                for file in tqdm(original_files, desc="Copying originals"):
                    shutil.copy2(os.path.join(class_source, file),
                               os.path.join(class_dest, file))
                
                # Then create augmented versions
                print("Generating augmented images...")
                generated = 0  # Number of augmented images actually written to disk
                for file in tqdm(original_files, desc="Generating augmentations"):
                    img_path = os.path.join(class_source, file)
                    try:
                        with Image.open(img_path) as img:
                            # Convert to RGB if needed
                            if img.mode != 'RGB':
                                img = img.convert('RGB')
                            
                            img_tensor = transforms.ToTensor()(img)
                            
                            # Generate augmentations
                            for i in range(NUM_AUGMENTATIONS):
                                try:
                                    augmented_tensor = augmentation_transforms(img_tensor)
                                    augmented_img = transforms.ToPILImage()(augmented_tensor)
                                    
                                    # Save augmented image
                                    base_name = os.path.splitext(file)[0]
                                    aug_name = f"{base_name}_aug_{i+1}.jpg"
                                    augmented_img.save(os.path.join(class_dest, aug_name))
                                    generated += 1
                                except Exception as e:
                                    print(f"Error generating augmentation {i+1} for {file}: {e}")
                    except Exception as e:
                        print(f"Error processing file {file}: {e}")
                
                # Update stats from the images actually saved, so files that
                # failed to load or augment do not inflate the reported totals
                stats[split][class_name] = {
                    'original': len(original_files),
                    'augmented': generated,
                    'total': len(original_files) + generated
                }

    # Print comprehensive summary
    print("\n=== Augmentation Summary ===")
    print("\nPer Split Statistics:")
    for split in ['train', 'val', 'test']:
        print(f"\n{split.upper()} Split:")
        split_total_orig = 0
        split_total_aug = 0
        
        for class_name, counts in sorted(stats[split].items()):
            print(f"Class {class_name}:")
            print(f"  Original: {counts['original']}")
            print(f"  Augmented: {counts['augmented']}")
            print(f"  Total: {counts['total']}")
            split_total_orig += counts['original']
            split_total_aug += counts['augmented']
        
        print(f"\n{split.upper()} Split Totals:")
        print(f"  Original Images: {split_total_orig}")
        print(f"  Augmented Images: {split_total_aug}")
        print(f"  Total Images: {split_total_orig + split_total_aug}")
        print("-" * 50)

    # Overall totals
    total_orig = sum(sum(c['original'] for c in s.values()) for s in stats.values())
    total_aug = sum(sum(c['augmented'] for c in s.values()) for s in stats.values())
    print("\nOverall Dataset Statistics:")
    print(f"Total Original Images: {total_orig}")
    print(f"Total Augmented Images: {total_aug}")
    print(f"Total Images: {total_orig + total_aug}")

if __name__ == "__main__":
    apply_augmentations()


=== Starting Augmentation Process ===

Processing TRAIN split:

Processing class 1:
Found 0 original images


Copying originals: 0it [00:00, ?it/s]


Generating augmented images...


Generating augmentations: 0it [00:00, ?it/s]



Processing class 10:
Found 160 original images


Copying originals: 100%|██████████| 160/160 [00:00<00:00, 227.96it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 160/160 [00:32<00:00,  4.87it/s]



Processing class 100:
Found 68 original images


Copying originals: 100%|██████████| 68/68 [00:00<00:00, 177.68it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 68/68 [00:09<00:00,  6.91it/s]



Processing class 1000:
Found 131 original images


Copying originals: 100%|██████████| 131/131 [00:00<00:00, 238.41it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 131/131 [00:13<00:00,  9.91it/s]



Processing class 2:
Found 77 original images


Copying originals: 100%|██████████| 77/77 [00:00<00:00, 243.16it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 77/77 [00:14<00:00,  5.15it/s]



Processing class 20:
Found 173 original images


Copying originals: 100%|██████████| 173/173 [00:00<00:00, 203.34it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 173/173 [01:06<00:00,  2.61it/s]



Processing class 200:
Found 0 original images


Copying originals: 0it [00:00, ?it/s]


Generating augmented images...


Generating augmentations: 0it [00:00, ?it/s]



Processing class 5:
Found 124 original images


Copying originals: 100%|██████████| 124/124 [00:00<00:00, 207.95it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 124/124 [00:11<00:00, 10.52it/s]



Processing class 50:
Found 172 original images


Copying originals: 100%|██████████| 172/172 [00:00<00:00, 224.82it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 172/172 [00:26<00:00,  6.56it/s]



Processing class 500:
Found 149 original images


Copying originals: 100%|██████████| 149/149 [00:00<00:00, 191.16it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 149/149 [00:42<00:00,  3.52it/s]



Processing VAL split:

Processing class 1:
Found 0 original images


Copying originals: 0it [00:00, ?it/s]


Generating augmented images...


Generating augmentations: 0it [00:00, ?it/s]



Processing class 10:
Found 20 original images


Copying originals: 100%|██████████| 20/20 [00:00<00:00, 213.11it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 20/20 [00:02<00:00,  9.49it/s]



Processing class 100:
Found 8 original images


Copying originals: 100%|██████████| 8/8 [00:00<00:00, 217.60it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 8/8 [00:00<00:00,  8.99it/s]



Processing class 1000:
Found 16 original images


Copying originals: 100%|██████████| 16/16 [00:00<00:00, 249.32it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 16/16 [00:01<00:00, 10.61it/s]



Processing class 2:
Found 9 original images


Copying originals: 100%|██████████| 9/9 [00:00<00:00, 241.63it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 9/9 [00:03<00:00,  2.36it/s]



Processing class 20:
Found 21 original images


Copying originals: 100%|██████████| 21/21 [00:00<00:00, 233.96it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 21/21 [00:02<00:00, 10.01it/s]



Processing class 200:
Found 0 original images


Copying originals: 0it [00:00, ?it/s]


Generating augmented images...


Generating augmentations: 0it [00:00, ?it/s]



Processing class 5:
Found 15 original images


Copying originals: 100%|██████████| 15/15 [00:00<00:00, 250.80it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 15/15 [00:01<00:00, 10.50it/s]



Processing class 50:
Found 21 original images


Copying originals: 100%|██████████| 21/21 [00:00<00:00, 254.07it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 21/21 [00:02<00:00,  9.51it/s]



Processing class 500:
Found 18 original images


Copying originals: 100%|██████████| 18/18 [00:00<00:00, 222.13it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 18/18 [00:03<00:00,  5.37it/s]



Processing TEST split:

Processing class 1:
Found 0 original images


Copying originals: 0it [00:00, ?it/s]


Generating augmented images...


Generating augmentations: 0it [00:00, ?it/s]



Processing class 10:
Found 20 original images


Copying originals: 100%|██████████| 20/20 [00:00<00:00, 221.76it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 20/20 [00:03<00:00,  5.52it/s]



Processing class 100:
Found 9 original images


Copying originals: 100%|██████████| 9/9 [00:00<00:00, 246.14it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 9/9 [00:00<00:00, 10.96it/s]



Processing class 1000:
Found 17 original images


Copying originals: 100%|██████████| 17/17 [00:00<00:00, 217.19it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 17/17 [00:05<00:00,  2.90it/s]



Processing class 2:
Found 11 original images


Copying originals: 100%|██████████| 11/11 [00:00<00:00, 232.21it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 11/11 [00:01<00:00, 10.52it/s]



Processing class 20:
Found 23 original images


Copying originals: 100%|██████████| 23/23 [00:00<00:00, 224.04it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 23/23 [00:02<00:00,  9.74it/s]



Processing class 200:
Found 0 original images


Copying originals: 0it [00:00, ?it/s]


Generating augmented images...


Generating augmentations: 0it [00:00, ?it/s]



Processing class 5:
Found 16 original images


Copying originals: 100%|██████████| 16/16 [00:00<00:00, 215.09it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 16/16 [00:01<00:00,  8.87it/s]



Processing class 50:
Found 23 original images


Copying originals: 100%|██████████| 23/23 [00:00<00:00, 237.05it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 23/23 [00:02<00:00, 10.62it/s]



Processing class 500:
Found 20 original images


Copying originals: 100%|██████████| 20/20 [00:00<00:00, 234.17it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 20/20 [00:02<00:00,  9.62it/s]


=== Augmentation Summary ===

Per Split Statistics:

TRAIN Split:
Class 1:
  Original: 0
  Augmented: 0
  Total: 0
Class 10:
  Original: 160
  Augmented: 1600
  Total: 1760
Class 100:
  Original: 68
  Augmented: 680
  Total: 748
Class 1000:
  Original: 131
  Augmented: 1310
  Total: 1441
Class 2:
  Original: 77
  Augmented: 770
  Total: 847
Class 20:
  Original: 173
  Augmented: 1730
  Total: 1903
Class 200:
  Original: 0
  Augmented: 0
  Total: 0
Class 5:
  Original: 124
  Augmented: 1240
  Total: 1364
Class 50:
  Original: 172
  Augmented: 1720
  Total: 1892
Class 500:
  Original: 149
  Augmented: 1490
  Total: 1639

TRAIN Split Totals:
  Original Images: 1054
  Augmented Images: 10540
  Total Images: 11594
--------------------------------------------------

VAL Split:
Class 1:
  Original: 0
  Augmented: 0
  Total: 0
Class 10:
  Original: 20
  Augmented: 200
  Total: 220
Class 100:
  Original: 8
  Augmented: 80
  Total: 88
Class 1000:
  Original: 16
  Augmented: 160
  Total: 176
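
The split totals in the summary can be checked by hand: each original image yields `NUM_AUGMENTATIONS` augmented copies, so a split with N originals should contain N × 10 augmented images and N × 11 images overall. The sketch below works through the TRAIN split of this dataset as an illustration. `expected_totals` is a hypothetical helper, and it assumes every augmentation was generated successfully.

```python
NUM_AUGMENTATIONS = 10  # same value as in the augmentation cell


def expected_totals(original_counts, num_aug=NUM_AUGMENTATIONS):
    """Return (originals, augmented, total), assuming no augmentation failed."""
    originals = sum(original_counts)
    augmented = originals * num_aug
    return originals, augmented, originals + augmented


# Per-class TRAIN counts copied from the output above
# (classes 1, 10, 100, 1000, 2, 20, 200, 5, 50, 500)
train_counts = [0, 160, 68, 131, 77, 173, 0, 124, 172, 149]

if __name__ == "__main__":
    print(expected_totals(train_counts))  # (1054, 10540, 11594)
```

The result matches the "TRAIN Split Totals" block, which is what one would expect when no "Error generating augmentation" lines appear in the log.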




In [27]:
# Written by Ovi
# Code to split dataset into train, validation, and test sets, with detailed analysis and logging

import os
import random
import shutil

# Paths
original_dataset = '/scratch/movi/dm_project/data/dataset_combined_unique'  # Replace with the path to your original dataset
split_base_dir = '/scratch/movi/dm_project/data/split_80/dataset_combined_split'        # Base directory to store train/val/test splits

# Split ratios
TRAIN_RATIO = 0.8
VAL_RATIO = 0.1
TEST_RATIO = 0.1

def create_dir_structure(base_dir, class_names):
    """Create train, val, and test directories for each class."""
    for split in ['train', 'val', 'test']:
        for class_name in class_names:
            os.makedirs(os.path.join(base_dir, split, class_name), exist_ok=True)

def analyze_and_split_dataset(original_dataset, split_base_dir):
    """Analyze dataset and split into train, val, and test sets."""
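    # Keep only sub-directories as classes, sorted lexicographically (so '10' comes before '2')
    class_names = sorted(
        name for name in os.listdir(original_dataset)
        if os.path.isdir(os.path.join(original_dataset, name))
    )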
    create_dir_structure(split_base_dir, class_names)   # Create the necessary directory structure

    total_images = 0  # Track the total number of images across all classes
    split_summary = {}  # Dictionary to store per-class split details

    # Loop through each class folder
    for class_name in class_names:
        class_path = os.path.join(original_dataset, class_name)

        if os.path.isdir(class_path):  # Ensure it's a folder
            # List all images in the class folder
            image_files = [f for f in os.listdir(class_path) if os.path.isfile(os.path.join(class_path, f))]
            random.shuffle(image_files)  # Shuffle images to ensure randomness

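            # Calculate split indices. int() rounds the train and val counts down,
            # and the test set takes whatever remains, so TEST_RATIO is only reported
            # in the summary and the test share can slightly exceed it.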
            total_images_in_class = len(image_files)
            train_end = int(total_images_in_class * TRAIN_RATIO)
            val_end = train_end + int(total_images_in_class * VAL_RATIO)

            # Split the image files into train, val, and test sets
            train_files = image_files[:train_end]
            val_files = image_files[train_end:val_end]
            test_files = image_files[val_end:]

            # Copy files to the respective split directories
            for file in train_files:
                shutil.copy(os.path.join(class_path, file), os.path.join(split_base_dir, 'train', class_name, file))
            for file in val_files:
                shutil.copy(os.path.join(class_path, file), os.path.join(split_base_dir, 'val', class_name, file))
            for file in test_files:
                shutil.copy(os.path.join(class_path, file), os.path.join(split_base_dir, 'test', class_name, file))

            # Store the split summary for this class
            split_summary[class_name] = {
                'Total': total_images_in_class,
                'Train': len(train_files),
                'Validation': len(val_files),
                'Test': len(test_files)
            }

            # Update total image count
            total_images += total_images_in_class

            # Print per-class summary
            print(f"{class_name}: {len(train_files)} train, {len(val_files)} val, {len(test_files)} test (Total: {total_images_in_class})")

    # Print overall summary
    print("\nOverall Dataset Summary:")
    print(f"Total Images: {total_images}")
    print(f"Train Ratio: {TRAIN_RATIO}, Validation Ratio: {VAL_RATIO}, Test Ratio: {TEST_RATIO}\n")

    # Print detailed split summary for all classes
    print("Detailed Split Summary:")
    for class_name, counts in split_summary.items():
        print(f"{class_name} - Total: {counts['Total']}, Train: {counts['Train']}, Val: {counts['Validation']}, Test: {counts['Test']}")

    return split_summary

# Run the split function and store the summary
dataset_summary = analyze_and_split_dataset(original_dataset, split_base_dir)


1: 28 train, 3 val, 5 test (Total: 36)
10: 284 train, 35 val, 36 test (Total: 355)
100: 196 train, 24 val, 26 test (Total: 246)
1000: 200 train, 25 val, 25 test (Total: 250)
2: 211 train, 26 val, 27 test (Total: 264)
20: 289 train, 36 val, 37 test (Total: 362)
200: 15 train, 1 val, 3 test (Total: 19)
5: 237 train, 29 val, 31 test (Total: 297)
50: 284 train, 35 val, 36 test (Total: 355)
500: 238 train, 29 val, 31 test (Total: 298)

Overall Dataset Summary:
Total Images: 2482
Train Ratio: 0.8, Validation Ratio: 0.1, Test Ratio: 0.1

Detailed Split Summary:
1 - Total: 36, Train: 28, Val: 3, Test: 5
10 - Total: 355, Train: 284, Val: 35, Test: 36
100 - Total: 246, Train: 196, Val: 24, Test: 26
1000 - Total: 250, Train: 200, Val: 25, Test: 25
2 - Total: 264, Train: 211, Val: 26, Test: 27
20 - Total: 362, Train: 289, Val: 36, Test: 37
200 - Total: 19, Train: 15, Val: 1, Test: 3
5 - Total: 297, Train: 237, Val: 29, Test: 31
50 - Total: 355, Train: 284, Val: 35, Test: 36
500 - Total: 298, Train
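
The per-class counts above follow directly from the truncating index arithmetic in `analyze_and_split_dataset`. The sketch below isolates that arithmetic so the numbers can be checked by hand; `split_counts` is a hypothetical helper introduced here for illustration and is not part of the original notebook.

```python
def split_counts(n_images, train_ratio, val_ratio):
    """Return (train, val, test) sizes using the same truncating arithmetic.

    Hypothetical helper: mirrors the train_end / val_end computation in the
    split function, without touching any files.
    """
    train_end = int(n_images * train_ratio)           # truncates, e.g. 28.8 -> 28
    val_end = train_end + int(n_images * val_ratio)   # truncates, e.g. 3.6 -> 3
    return train_end, val_end - train_end, n_images - val_end


# Class sizes taken from the 80/10/10 log above
for n in (36, 19, 355):
    print(n, split_counts(n, 0.8, 0.1))
# 36 (28, 3, 5)
# 19 (15, 1, 3)
# 355 (284, 35, 36)
```

Because both products are truncated, the test slice absorbs the remainder. For very small classes this matters: class `200` ends up with a single validation image.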

In [28]:
# Written by Ovi, 2024-11-03
# Code to apply augmentations to pre-split dataset

import os
from PIL import Image
from torchvision import transforms
import shutil
from tqdm import tqdm

# Paths - update these paths
split_base_dir = '/scratch/movi/dm_project/data/split_80/dataset_combined_split'  # Your already split dataset
augmented_data_dir = '/scratch/movi/dm_project/data/custom/dataset_combined_aug'  # Where to save augmented data

# Number of augmentations per image
NUM_AUGMENTATIONS = 10

# Define augmentation transformations
augmentation_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1), shear=10),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.15), ratio=(0.3, 3.3)),
])

def apply_augmentations():
    """Apply augmentations to each split of the pre-split dataset."""
    print("\n=== Starting Augmentation Process ===")
    
    # Create destination directory structure
    for split in ['train', 'val', 'test']:
        split_path = os.path.join(augmented_data_dir, split)
        os.makedirs(split_path, exist_ok=True)
        for class_name in os.listdir(os.path.join(split_base_dir, split)):
            class_path = os.path.join(split_path, class_name)
            os.makedirs(class_path, exist_ok=True)

    # Stats dictionary
    stats = {'train': {}, 'val': {}, 'test': {}}

    # Process each split
    for split in ['train', 'val', 'test']:
        print(f"\nProcessing {split.upper()} split:")
        split_source = os.path.join(split_base_dir, split)
        split_dest = os.path.join(augmented_data_dir, split)
        
        # Process each class
        for class_name in sorted(os.listdir(split_source)):
            class_source = os.path.join(split_source, class_name)
            class_dest = os.path.join(split_dest, class_name)
            
            if os.path.isdir(class_source):
                # Get list of original images
                original_files = [f for f in os.listdir(class_source) 
                                if os.path.isfile(os.path.join(class_source, f))]
                
                print(f"\nProcessing class {class_name}:")
                print(f"Found {len(original_files)} original images")
                
                # First copy original files
                for file in tqdm(original_files, desc="Copying originals"):
                    shutil.copy2(os.path.join(class_source, file),
                               os.path.join(class_dest, file))
                
                # Then create augmented versions
                print("Generating augmented images...")
                successful_augs = 0  # Count only augmentations that were actually saved
                for file in tqdm(original_files, desc="Generating augmentations"):
                    img_path = os.path.join(class_source, file)
                    try:
                        with Image.open(img_path) as img:
                            # Convert to RGB so every image yields a 3-channel tensor
                            if img.mode != 'RGB':
                                img = img.convert('RGB')
                            
                            # Work on a tensor: RandomErasing does not support PIL images
                            img_tensor = transforms.ToTensor()(img)
                            
                            # Generate augmentations
                            for i in range(NUM_AUGMENTATIONS):
                                try:
                                    augmented_tensor = augmentation_transforms(img_tensor)
                                    augmented_img = transforms.ToPILImage()(augmented_tensor)
                                    
                                    # Save augmented image
                                    base_name = os.path.splitext(file)[0]
                                    aug_name = f"{base_name}_aug_{i+1}.jpg"
                                    augmented_img.save(os.path.join(class_dest, aug_name))
                                    successful_augs += 1
                                except Exception as e:
                                    print(f"Error generating augmentation {i+1} for {file}: {str(e)}")
                    except Exception as e:
                        print(f"Error processing file {file}: {str(e)}")
                
                # Update stats using the number of augmentations actually written,
                # so failed files do not inflate the summary
                stats[split][class_name] = {
                    'original': len(original_files),
                    'augmented': successful_augs,
                    'total': len(original_files) + successful_augs
                }

    # Print comprehensive summary
    print("\n=== Augmentation Summary ===")
    print("\nPer Split Statistics:")
    for split in ['train', 'val', 'test']:
        print(f"\n{split.upper()} Split:")
        split_total_orig = 0
        split_total_aug = 0
        
        for class_name, counts in sorted(stats[split].items()):
            print(f"Class {class_name}:")
            print(f"  Original: {counts['original']}")
            print(f"  Augmented: {counts['augmented']}")
            print(f"  Total: {counts['total']}")
            split_total_orig += counts['original']
            split_total_aug += counts['augmented']
        
        print(f"\n{split.upper()} Split Totals:")
        print(f"  Original Images: {split_total_orig}")
        print(f"  Augmented Images: {split_total_aug}")
        print(f"  Total Images: {split_total_orig + split_total_aug}")
        print("-" * 50)

    # Overall totals
    total_orig = sum(sum(c['original'] for c in s.values()) for s in stats.values())
    total_aug = sum(sum(c['augmented'] for c in s.values()) for s in stats.values())
    print("\nOverall Dataset Statistics:")
    print(f"Total Original Images: {total_orig}")
    print(f"Total Augmented Images: {total_aug}")
    print(f"Total Images: {total_orig + total_aug}")

if __name__ == "__main__":
    apply_augmentations()


=== Starting Augmentation Process ===

Processing TRAIN split:

Processing class 1:
Found 28 original images


Copying originals: 100%|██████████| 28/28 [00:00<00:00, 215.88it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 28/28 [00:02<00:00, 10.04it/s]



Processing class 10:
Found 284 original images


Copying originals: 100%|██████████| 284/284 [00:01<00:00, 212.88it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 284/284 [01:03<00:00,  4.48it/s]



Processing class 100:
Found 196 original images


Copying originals: 100%|██████████| 196/196 [00:00<00:00, 210.80it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 196/196 [01:16<00:00,  2.55it/s]



Processing class 1000:
Found 200 original images


Copying originals: 100%|██████████| 200/200 [00:01<00:00, 181.07it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 200/200 [01:30<00:00,  2.21it/s]



Processing class 2:
Found 211 original images


Copying originals: 100%|██████████| 211/211 [00:00<00:00, 245.11it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 211/211 [00:59<00:00,  3.55it/s]



Processing class 20:
Found 289 original images


Copying originals: 100%|██████████| 289/289 [00:01<00:00, 213.73it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 289/289 [00:44<00:00,  6.44it/s]



Processing class 200:
Found 15 original images


Copying originals: 100%|██████████| 15/15 [00:00<00:00, 239.04it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 15/15 [00:01<00:00,  9.84it/s]



Processing class 5:
Found 237 original images


Copying originals: 100%|██████████| 237/237 [00:01<00:00, 211.31it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 237/237 [01:09<00:00,  3.39it/s]



Processing class 50:
Found 284 original images


Copying originals: 100%|██████████| 284/284 [00:02<00:00, 110.86it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 284/284 [01:50<00:00,  2.57it/s]



Processing class 500:
Found 238 original images


Copying originals: 100%|██████████| 238/238 [00:01<00:00, 205.77it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 238/238 [00:30<00:00,  7.72it/s]



Processing VAL split:

Processing class 1:
Found 3 original images


Copying originals: 100%|██████████| 3/3 [00:00<00:00, 129.05it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 3/3 [00:00<00:00,  9.61it/s]



Processing class 10:
Found 35 original images


Copying originals: 100%|██████████| 35/35 [00:00<00:00, 145.58it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 35/35 [00:05<00:00,  6.63it/s]



Processing class 100:
Found 24 original images


Copying originals: 100%|██████████| 24/24 [00:00<00:00, 233.34it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 24/24 [00:02<00:00, 10.31it/s]



Processing class 1000:
Found 25 original images


Copying originals: 100%|██████████| 25/25 [00:00<00:00, 203.42it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 25/25 [00:02<00:00, 10.48it/s]



Processing class 2:
Found 26 original images


Copying originals: 100%|██████████| 26/26 [00:00<00:00, 255.34it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 26/26 [00:02<00:00,  9.90it/s]



Processing class 20:
Found 36 original images


Copying originals: 100%|██████████| 36/36 [00:00<00:00, 229.83it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 36/36 [00:20<00:00,  1.72it/s]



Processing class 200:
Found 1 original images


Copying originals: 100%|██████████| 1/1 [00:00<00:00, 211.45it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 1/1 [00:00<00:00,  9.66it/s]



Processing class 5:
Found 29 original images


Copying originals: 100%|██████████| 29/29 [00:00<00:00, 161.29it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 29/29 [00:05<00:00,  5.74it/s]



Processing class 50:
Found 35 original images


Copying originals: 100%|██████████| 35/35 [00:00<00:00, 232.11it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 35/35 [00:03<00:00, 10.34it/s]



Processing class 500:
Found 29 original images


Copying originals: 100%|██████████| 29/29 [00:00<00:00, 146.76it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 29/29 [00:02<00:00, 10.13it/s]



Processing TEST split:

Processing class 1:
Found 5 original images


Copying originals: 100%|██████████| 5/5 [00:00<00:00, 236.03it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 5/5 [00:00<00:00, 11.07it/s]



Processing class 10:
Found 36 original images


Copying originals: 100%|██████████| 36/36 [00:00<00:00, 184.27it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 36/36 [00:03<00:00,  9.41it/s]



Processing class 100:
Found 26 original images


Copying originals: 100%|██████████| 26/26 [00:00<00:00, 219.93it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 26/26 [00:37<00:00,  1.44s/it]



Processing class 1000:
Found 25 original images


Copying originals: 100%|██████████| 25/25 [00:00<00:00, 227.67it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 25/25 [00:02<00:00, 10.17it/s]



Processing class 2:
Found 27 original images


Copying originals: 100%|██████████| 27/27 [00:00<00:00, 215.46it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 27/27 [00:02<00:00,  9.82it/s]



Processing class 20:
Found 37 original images


Copying originals: 100%|██████████| 37/37 [00:00<00:00, 182.73it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 37/37 [00:07<00:00,  4.84it/s]



Processing class 200:
Found 3 original images


Copying originals: 100%|██████████| 3/3 [00:00<00:00, 213.45it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 3/3 [00:00<00:00,  9.80it/s]



Processing class 5:
Found 31 original images


Copying originals: 100%|██████████| 31/31 [00:00<00:00, 166.60it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 31/31 [00:03<00:00,  8.85it/s]



Processing class 50:
Found 36 original images


Copying originals: 100%|██████████| 36/36 [00:00<00:00, 233.09it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 36/36 [00:03<00:00,  9.79it/s]



Processing class 500:
Found 31 original images


Copying originals: 100%|██████████| 31/31 [00:00<00:00, 187.83it/s]


Generating augmented images...


Generating augmentations: 100%|██████████| 31/31 [00:04<00:00,  6.87it/s]


=== Augmentation Summary ===

Per Split Statistics:

TRAIN Split:
Class 1:
  Original: 28
  Augmented: 280
  Total: 308
Class 10:
  Original: 284
  Augmented: 2840
  Total: 3124
Class 100:
  Original: 196
  Augmented: 1960
  Total: 2156
Class 1000:
  Original: 200
  Augmented: 2000
  Total: 2200
Class 2:
  Original: 211
  Augmented: 2110
  Total: 2321
Class 20:
  Original: 289
  Augmented: 2890
  Total: 3179
Class 200:
  Original: 15
  Augmented: 150
  Total: 165
Class 5:
  Original: 237
  Augmented: 2370
  Total: 2607
Class 50:
  Original: 284
  Augmented: 2840
  Total: 3124
Class 500:
  Original: 238
  Augmented: 2380
  Total: 2618

TRAIN Split Totals:
  Original Images: 1982
  Augmented Images: 19820
  Total Images: 21802
--------------------------------------------------

VAL Split:
Class 1:
  Original: 3
  Augmented: 30
  Total: 33
Class 10:
  Original: 35
  Augmented: 350
  Total: 385
Class 100:
  Original: 24
  Augmented: 240
  Total: 264
Class 1000:
  Original: 25
  Augmented:
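
Since the summary is computed in memory, it can be useful to cross-check it against what was actually written to disk. The following is a minimal sketch, assuming the `<split>/<class>/<file>` layout used above and that augmented files can be recognised by the `_aug_` marker in their names; `count_augmented_files` is a hypothetical helper, not part of the original notebook, and an original file whose name happens to contain `_aug_` would be miscounted.

```python
import os
from collections import Counter


def count_augmented_files(root_dir, aug_marker="_aug_"):
    """Count original and augmented files per (split, class) under root_dir.

    Assumes a <split>/<class>/<file> layout and that augmented files
    contain `aug_marker` in their file name (an assumption of this sketch).
    """
    counts = {}
    for split in sorted(os.listdir(root_dir)):
        split_path = os.path.join(root_dir, split)
        if not os.path.isdir(split_path):
            continue
        for class_name in sorted(os.listdir(split_path)):
            class_path = os.path.join(split_path, class_name)
            if not os.path.isdir(class_path):
                continue
            tally = Counter()
            for name in os.listdir(class_path):
                if os.path.isfile(os.path.join(class_path, name)):
                    kind = "augmented" if aug_marker in name else "original"
                    tally[kind] += 1
            counts[(split, class_name)] = {
                "original": tally["original"],
                "augmented": tally["augmented"],
            }
    return counts
```

Calling `count_augmented_files(augmented_data_dir)` after the run and comparing each entry with the printed summary would reveal any class where fewer than `NUM_AUGMENTATIONS` images per original were saved.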




---

# END of Data Split and Augmentation

---