# Section 01 - Dataset Preparation

This section prepares the COVID-19 Radiography dataset (chest X-ray images) for training.  
The images are reorganized into one folder per split, each containing one subfolder per class:

- data/train
- data/validation
- data/test

Each class (COVID, Lung Opacity, Normal, Viral Pneumonia) is split separately into 70% training, 15% validation, and 15% test images, using a fixed random seed.  
Images are copied into the matching split and class folders so that every model loads the same data.  
This gives all later sections a common, reproducible starting point.
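
A 70/15/15 ratio can be reached with two successive splits: hold out 30% of the images, then divide the held-out part evenly between validation and test. The sketch below only checks that arithmetic. `two_stage_fractions` is a hypothetical helper introduced for illustration and is not part of the notebook.

```python
from fractions import Fraction


def two_stage_fractions(holdout, test_share_of_holdout):
    """Return the (train, validation, test) fractions of a two-stage split.

    holdout: fraction of all images set aside in the first split.
    test_share_of_holdout: fraction of the held-out images that go to test.
    """
    holdout = Fraction(holdout)
    share = Fraction(test_share_of_holdout)
    train = 1 - holdout
    test = holdout * share
    validation = holdout - test
    return train, validation, test


# Hold out 30%, then give half of the held-out images to the test set.
train, validation, test = two_stage_fractions("3/10", "1/2")
print(train, validation, test)  # 7/10 3/20 3/20
```

Exact fractions are used here so the result is not affected by floating-point rounding; 3/20 is the 15% share for validation and for test.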

In [17]:
# Import necessary libraries
import os
import shutil
from sklearn.model_selection import train_test_split

# Define source and destination directories
SOURCE = "COVID-19_Radiography_Dataset"
DEST = "data"

CLASSES = ["COVID", "Lung_Opacity", "Normal", "Viral Pneumonia"]
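
Before splitting, it may help to confirm that every class folder is where the paths above expect it, since a misspelled class name would otherwise only surface as an error partway through the copy. The following is a minimal sketch, assuming each class keeps its files in an `images` subfolder. `missing_class_folders` is a hypothetical helper, not part of the notebook.

```python
import os

SOURCE = "COVID-19_Radiography_Dataset"
CLASSES = ["COVID", "Lung_Opacity", "Normal", "Viral Pneumonia"]


def missing_class_folders(source, classes):
    """Return the class names whose '<source>/<class>/images' folder is absent."""
    return [
        label
        for label in classes
        if not os.path.isdir(os.path.join(source, label, "images"))
    ]


# Report problems up front instead of failing in the middle of the split.
missing = missing_class_folders(SOURCE, CLASSES)
if missing:
    print("Missing image folders for:", ", ".join(missing))
else:
    print("All class folders found.")
```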

In [18]:
for label in CLASSES:
    src_folder = os.path.join(SOURCE, label, "images")

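    # List all image filenames from the source.
    # os.listdir returns entries in arbitrary order, so sort them to make
    # the seeded split identical on every machine and every run.
    all_images = sorted(os.listdir(src_folder))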

    # Make target folders
    for split in ["train", "validation", "test"]:
        os.makedirs(os.path.join(DEST, split, label), exist_ok=True)

    # Split off 70% for training, then halve the remaining 30%
    # into 15% validation and 15% test
    train_images, temp = train_test_split(all_images, test_size=0.3, random_state=42)
    validation_images, test_images = train_test_split(temp, test_size=0.5, random_state=42)

    # Copy each split's images into its destination folder
    splits = {
        "train": train_images,
        "validation": validation_images,
        "test": test_images,
    }
    for split, images in splits.items():
        for image in images:
            shutil.copy(
                os.path.join(src_folder, image),
                os.path.join(DEST, split, label)
            )

print("Dataset split completed successfully!")

Dataset split completed successfully!
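
The success message does not show how many images landed in each folder. A quick check like the one below can confirm the shares are close to 70/15/15 and that no filename appears in more than one split of the same class, which could happen if the split is rerun with different settings over an existing `data` folder. This is a minimal sketch: `count_images` and `shared_filenames` are hypothetical helpers, and the folder names are assumed to match those created above.

```python
import os

DEST = "data"
SPLITS = ["train", "validation", "test"]
CLASSES = ["COVID", "Lung_Opacity", "Normal", "Viral Pneumonia"]


def count_images(dest, splits, classes):
    """Return {class: {split: file count}}; a missing folder counts as 0."""
    counts = {}
    for label in classes:
        counts[label] = {}
        for split in splits:
            folder = os.path.join(dest, split, label)
            n_files = len(os.listdir(folder)) if os.path.isdir(folder) else 0
            counts[label][split] = n_files
    return counts


def shared_filenames(dest, splits, label):
    """Return filenames of one class that appear in more than one split."""
    seen, shared = set(), set()
    for split in splits:
        folder = os.path.join(dest, split, label)
        names = set(os.listdir(folder)) if os.path.isdir(folder) else set()
        shared |= seen & names
        seen |= names
    return shared


for label, per_split in count_images(DEST, SPLITS, CLASSES).items():
    total = sum(per_split.values())
    shares = ", ".join(
        f"{split}: {n} ({n / total:.0%})" if total else f"{split}: 0"
        for split, n in per_split.items()
    )
    print(f"{label}: {shares}")
    overlap = shared_filenames(DEST, SPLITS, label)
    if overlap:
        print(f"  warning: {len(overlap)} filename(s) appear in more than one split")
```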
