# Creating a Subset from a Larger Dataset

This notebook demonstrates how to extract a smaller subset of images and their corresponding annotation files from a larger dataset contained in a ZIP file. 

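The assumed structure of the ZIP file is shown below. The file names are placeholders; the code in this notebook expects `.png` images and `.xml` annotations, so adjust the extensions to match your dataset.

```
LargeDataset.zip
 ├── images/
 │    ├── image_001.png
 │    ├── image_002.png
 │    └── ...
 └── annotations/
      ├── image_001.xml
      ├── image_002.xml
      └── ...
```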

We will create a subset with a specified number of images (e.g., 1000) and copy both the images and their corresponding annotation files into a new folder structure for further experimentation.
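
Before extracting anything, it can help to confirm that the archive really follows the layout assumed above. The following is a minimal sketch using only the standard library; the helper name `summarize_zip` is hypothetical and not part of the original notebook.

```python
import os
import zipfile
from collections import Counter

def summarize_zip(zip_path):
    """Count archive members by (top-level folder, lowercase file extension)."""
    counts = Counter()
    with zipfile.ZipFile(zip_path, "r") as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            parts = name.split("/")  # ZIP member names always use forward slashes
            top = parts[0] if len(parts) > 1 else "(root)"
            ext = os.path.splitext(parts[-1])[1].lower() or "(none)"
            counts[(top, ext)] += 1
    return counts
```

Calling `summarize_zip('hard-hat-detection.zip')` should then show whether the images and annotations sit in the expected folders and which extensions they use. If the archive wraps everything in an extra top-level folder, the paths used later in the notebook would need adjusting.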

## 1. Setup & Unzipping the Dataset

Place your ZIP file (e.g., `hard-hat-detection.zip`) in the current working directory. Then, run the following code to unzip it.

In [1]:
import zipfile
import os

# Define the path to the zip file and the extraction folder
zip_path = 'hard-hat-detection.zip'  # Change to your zip file name
extract_folder = 'hard-hat-detection'

# Check if the folder already exists
if not os.path.exists(extract_folder):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_folder)
    print(f"Dataset unzipped to '{extract_folder}'")
else:
    print(f"Folder '{extract_folder}' already exists. Skipping unzipping.")

Folder 'hard-hat-detection' already exists. Skipping unzipping.


## 2. List and Sample Images

Next, we list all the image files in the `images/` folder and randomly select a specified number of them to form our subset. You can adjust the number of images (`sample_size`) as needed.

In [2]:
import glob
import random

# Path to images folder
images_folder = os.path.join(extract_folder, 'images')
image_files = glob.glob(os.path.join(images_folder, '*.png'))  # Adjust extension if necessary
print(f"Total images found: {len(image_files)}")

# Set the sample size (number of images to select)
sample_size = 1000  # Change this to the number you want in your subset

# Ensure sample size does not exceed total available images
sample_size = min(sample_size, len(image_files))

# Randomly sample images
sampled_images = random.sample(image_files, sample_size)
print(f"Number of images in subset: {len(sampled_images)}")

Total images found: 5000
Number of images in subset: 1000
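
Note that `random.sample` draws a different subset on every run unless the random generator is seeded. If the subset needs to be reproducible, one option is sketched below. The helper name `sample_reproducibly` and the default seed of 42 are assumptions made for illustration, not part of the original notebook.

```python
import random

def sample_reproducibly(paths, sample_size, seed=42):
    """Return a random sample that is identical across runs for the same seed."""
    ordered = sorted(paths)              # glob order is not guaranteed, so sort first
    k = min(sample_size, len(ordered))   # never request more items than are available
    rng = random.Random(seed)            # local generator; global random state is untouched
    return rng.sample(ordered, k)
```

With this helper, `sampled_images = sample_reproducibly(image_files, sample_size)` could stand in for the `random.sample` call above. Sorting matters because the same seed only yields the same subset when the input list is in the same order.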


## 3. Copy Sampled Images and Annotations

We now copy the sampled images and their corresponding annotation files (assumed to be in the `annotations/` folder with the same base filename) into a new folder structure.

The new structure will look like this:

```
hard-hat-detection-1k/
 ├── images/
 └── annotations/
```


In [3]:
import shutil

# Define output folders
subset_folder = 'hard-hat-detection-1k'
subset_images = os.path.join(subset_folder, 'images')
subset_annotations = os.path.join(subset_folder, 'annotations')

# Create the folders if they don't exist
os.makedirs(subset_images, exist_ok=True)
os.makedirs(subset_annotations, exist_ok=True)

# Define annotations folder in the original dataset
annotations_folder = os.path.join(extract_folder, 'annotations')

# Loop through sampled images, copy each image and its annotation
for img_path in sampled_images:
    # Get the base filename without extension
    base_name = os.path.splitext(os.path.basename(img_path))[0]
    
    # Define the destination path for the image and copy it
    dest_img_path = os.path.join(subset_images, os.path.basename(img_path))
    shutil.copy(img_path, dest_img_path)
    
    # Define source annotation file (assuming .xml format)
    annotation_file = os.path.join(annotations_folder, base_name + '.xml')
    if os.path.exists(annotation_file):
        dest_ann_path = os.path.join(subset_annotations, base_name + '.xml')
        shutil.copy(annotation_file, dest_ann_path)
    else:
        print(f"Annotation file not found for {base_name}")

print("Subset creation completed.")

Subset creation completed.
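
The loop above copies an image even when its annotation file is missing, which can leave the subset with unlabeled images. If only complete pairs are wanted, a variant along the following lines could be used. This is a sketch under the same assumptions as the cell above (matching base filenames, `.xml` annotations); the function name `copy_complete_pairs` is hypothetical.

```python
import shutil
from pathlib import Path

def copy_complete_pairs(image_paths, annotations_dir, out_dir, ann_ext=".xml"):
    """Copy each image only if its annotation exists; return (copied, skipped) names."""
    out_images = Path(out_dir) / "images"
    out_anns = Path(out_dir) / "annotations"
    out_images.mkdir(parents=True, exist_ok=True)
    out_anns.mkdir(parents=True, exist_ok=True)

    copied, skipped = [], []
    for img in map(Path, image_paths):
        ann = Path(annotations_dir) / (img.stem + ann_ext)
        if not ann.is_file():
            skipped.append(img.name)  # no annotation: leave this image out
            continue
        shutil.copy(img, out_images / img.name)
        shutil.copy(ann, out_anns / ann.name)
        copied.append(img.name)
    return copied, skipped
```

For example, `copied, skipped = copy_complete_pairs(sampled_images, annotations_folder, subset_folder)` returns the skipped image names so they can be inspected afterwards. Skipped images reduce the subset size, so the result may contain fewer than `sample_size` pairs.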


## 4. Verify the Subset

Finally, let's verify that the subset folders contain the expected number of files.

In [4]:
subset_image_files = glob.glob(os.path.join(subset_images, '*.png'))
subset_annotation_files = glob.glob(os.path.join(subset_annotations, '*.xml'))
print(f"Subset images: {len(subset_image_files)}")
print(f"Subset annotations: {len(subset_annotation_files)}")

Subset images: 1000
Subset annotations: 1000
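
Equal counts do not by themselves guarantee that every image has a matching annotation. A stricter check compares base filenames directly, as in the sketch below. The helper name `unmatched_stems` is hypothetical, and the default extensions are assumed to match those used earlier in this notebook.

```python
from pathlib import Path

def unmatched_stems(images_dir, annotations_dir, img_ext=".png", ann_ext=".xml"):
    """Return (images without annotations, annotations without images) as sorted stems."""
    image_stems = {p.stem for p in Path(images_dir).glob(f"*{img_ext}")}
    ann_stems = {p.stem for p in Path(annotations_dir).glob(f"*{ann_ext}")}
    return sorted(image_stems - ann_stems), sorted(ann_stems - image_stems)
```

Running `unmatched_stems(subset_images, subset_annotations)` should return two empty lists when the subset is consistent. Leftover files from an earlier run into the same output folder would show up here as well.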


## 5. Conclusion

You have now created a subset of 1000 randomly sampled images from the larger dataset, with the images and their corresponding annotations copied into a new folder structure.

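This subset is now ready for further experiments, such as training an object detection model with YOLOv5. Note that YOLOv5 expects one `.txt` label file per image, so the `.xml` annotations would need to be converted first.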