<a href="https://colab.research.google.com/github/tamer-elkoT/Chest_X_Ray/blob/main/01_Member1_Dataset_Manager.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chest X-Ray Multi‑Class Project — Role Notebook

**Dataset:** Kaggle “Lungs Disease Dataset (4 types)” by Omkar Manohar Dalvi  
**Classes:** Normal, Bacterial Pneumonia, Viral Pneumonia, COVID‑19, Tuberculosis

> Use this notebook in **Google Colab**. If you’re running locally, adapt the Drive mount steps accordingly.

## Role — Member 1: Dataset Manager

**Responsibilities**  
- Download and organize the dataset from Kaggle  
- Inspect for corrupted/mislabeled files  
- Count samples per class; generate initial stats  
- Document dataset source & licensing  
- Produce a cleaned directory ready for training

## Environment & Paths

- The code below mounts Google Drive (for persistence) and prepares base paths.  
- Set `DATASET_DIR` to where the extracted dataset resides (after Kaggle download).

## (Optional) Download from Kaggle directly in Colab

Run the following once to set up your Kaggle API, then download and unzip the dataset directly into Drive.
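
Uploading `kaggle.json` through `files.upload()` echoes the file contents, including the API key, into the cell output if the return value is not assigned. As a hedged alternative, the sketch below writes the credentials file directly with owner-only permissions. The helper name `write_kaggle_credentials` and its `home` parameter are hypothetical and not part of the Kaggle tooling; the only thing assumed about the CLI is that it reads `~/.kaggle/kaggle.json`, as the cells in this notebook already rely on.

```python
import json
import os
from pathlib import Path


def write_kaggle_credentials(username, key, home=None):
    """Write <home>/.kaggle/kaggle.json readable by the owner only; return its path.

    Hypothetical helper for illustration; `home` defaults to the current user's home.
    """
    kaggle_dir = Path(home) if home is not None else Path.home()
    kaggle_dir = kaggle_dir / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    cred_path = kaggle_dir / "kaggle.json"
    cred_path.write_text(json.dumps({"username": username, "key": key}))
    os.chmod(cred_path, 0o600)  # same effect as `chmod 600` in the shell cell
    return cred_path


# Example use in a notebook: getpass() does not echo the key into the output.
# from getpass import getpass
# write_kaggle_credentials(input("Kaggle username: "), getpass("Kaggle API key: "))
```

If a key has already appeared in saved notebook output, regenerating the token on Kaggle is the safer course.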

## Verify Structure & Basic Stats

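The notebook assumes the dataset is organized into subfolders by split and class, as sketched below. The Kaggle archive extracts into an extra wrapper folder, `Lung Disease Dataset/`, so point `DATASET_DIR` at the folder that directly contains `train/`: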

```
lungs_dataset/
  train/
    Bacterial Pneumonia/
    Viral Pneumonia/
    COVID/
    TB/
    Normal/
  val/
  test/
```
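
Counting samples per class is one of the listed responsibilities. A minimal sketch of how it could be done for the layout above follows. The helper name `count_images` and the set of accepted extensions are assumptions made for illustration, not something the dataset documents.

```python
from collections import Counter
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed image extensions


def count_images(dataset_dir, splits=("train", "val", "test")):
    """Return {split: Counter({class_name: n_images})} for a split/class layout."""
    stats = {}
    for split in splits:
        split_dir = Path(dataset_dir) / split
        if not split_dir.is_dir():
            continue  # splits that do not exist are left out of the result
        counts = Counter()
        for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
        stats[split] = counts
    return stats


# Example use, once DATASET_DIR points at the folder holding train/val/test:
# for split, counts in count_images(DATASET_DIR).items():
#     print(split, dict(counts))
```

A split that is missing from the returned dictionary, or a class with a count of 0, usually means the path points one level too high or too low.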

In [None]:
# === Colab & Paths ===
import os, sys, glob, json, random, shutil, time
from pathlib import Path

# If in Colab, mount Drive. Elsewhere the import fails, the except branch prints a message,
# and IN_COLAB is set to False.
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    IN_COLAB = True
except Exception as e:
    print("Not running on Colab or Drive not available:", e)
    IN_COLAB = False

Mounted at /content/drive


In [None]:


# Project root inside Drive (you can change this)
PROJECT_ROOT = Path('/content/drive/MyDrive/Colab_Notebooks/NTI_Offline/Lunge_Disease_Project')
PROJECT_ROOT.mkdir(parents=True, exist_ok=True)

# Where the dataset will live (after download & unzip). Adjust as needed.
DATASET_DIR = PROJECT_ROOT / 'lungs_dataset'
OUTPUTS_DIR = PROJECT_ROOT / 'outputs'
MODELS_DIR = PROJECT_ROOT / 'models'
REPORTS_DIR = PROJECT_ROOT / 'reports'

for p in [OUTPUTS_DIR, MODELS_DIR, REPORTS_DIR]:
    p.mkdir(parents=True, exist_ok=True)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATASET_DIR :", DATASET_DIR)
print("OUTPUTS_DIR :", OUTPUTS_DIR)
print("MODELS_DIR  :", MODELS_DIR)
print("REPORTS_DIR :", REPORTS_DIR)

PROJECT_ROOT: /content/drive/MyDrive/Colab_Notebooks/NTI_Offline/Lunge_Disease_Project
DATASET_DIR : /content/drive/MyDrive/Colab_Notebooks/NTI_Offline/Lunge_Disease_Project/lungs_dataset
OUTPUTS_DIR : /content/drive/MyDrive/Colab_Notebooks/NTI_Offline/Lunge_Disease_Project/outputs
MODELS_DIR  : /content/drive/MyDrive/Colab_Notebooks/NTI_Offline/Lunge_Disease_Project/models
REPORTS_DIR : /content/drive/MyDrive/Colab_Notebooks/NTI_Offline/Lunge_Disease_Project/reports


In [None]:
from google.colab import files
# Assign the result so the cell does not echo the file contents (including the API key) as output
uploaded = files.upload()

Saving kaggle.json to kaggle.json

In [None]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d omkarmanohardalvi/lungs-disease-dataset-4-types

Dataset URL: https://www.kaggle.com/datasets/omkarmanohardalvi/lungs-disease-dataset-4-types
License(s): unknown
Downloading lungs-disease-dataset-4-types.zip to /content
 99% 2.01G/2.02G [00:17<00:00, 340MB/s]
100% 2.02G/2.02G [00:17<00:00, 122MB/s]


In [None]:
!unzip lungs-disease-dataset-4-types.zip -d lungs_dataset


Streaming output truncated to the last 5000 lines.
  inflating: lungs_dataset/Lung Disease Dataset/train/Normal/NORMAL2-IM-0673-0001.jpeg  
  inflating: lungs_dataset/Lung Disease Dataset/train/Normal/NORMAL2-IM-0675-0001.jpeg  
  inflating: lungs_dataset/Lung Disease Dataset/train/Normal/NORMAL2-IM-0678-0001.jpeg  
  inflating: lungs_dataset/Lung Disease Dataset/train/Normal/NORMAL2-IM-0680-0001.jpeg  
  inflating: lungs_dataset/Lung Disease Dataset/train/Normal/NORMAL2-IM-0682-0001.jpeg  
  inflating: lungs_dataset/Lung Disease Dataset/train/Normal/NORMAL2-IM-0683-0001.jpeg  
  inflating: lungs_dataset/Lung Disease Dataset/train/Normal/NORMAL2-IM-0684-0001-0001.jpeg  
  inflating: lungs_dataset/Lung Disease Dataset/train/Normal/NORMAL2-IM-0686-0001.jpeg  
  inflating: lungs_dataset/Lung Disease Dataset/train/Normal/NORMAL2-IM-0690-0001.jpeg  
  inflating: lungs_dataset/Lung Disease Dataset/train/Normal/NORMAL2-IM-0692-0001.jpeg  
  inflating: lungs_dataset/Lung Disease 

In [None]:
# # === Kaggle direct download (optional) ===
# # 1) Upload kaggle.json when prompted
# try:
#     from google.colab import files
#     print("Upload kaggle.json (from Kaggle > Account > Create API Token)")
#     uploaded = files.upload()
#     if 'kaggle.json' in uploaded:
#         os.makedirs('/root/.kaggle', exist_ok=True)
#         shutil.move('kaggle.json', '/root/.kaggle/kaggle.json')
#         os.chmod('/root/.kaggle/kaggle.json', 0o600)
#         !pip -q install kaggle >/dev/null
#         # Download dataset
#         !kaggle datasets download -d omkarmanohardalvi/lungs-disease-dataset-4-types -p $PROJECT_ROOT
#         # Unzip
#         !unzip -q -o $PROJECT_ROOT/lungs-disease-dataset-4-types.zip -d $PROJECT_ROOT
#         # Standardize folder name if needed
#         if not Path(DATASET_DIR).exists():
#             # Try to infer the extracted folder
#             candidates = [p for p in PROJECT_ROOT.iterdir() if p.is_dir() and 'lungs' in p.name.lower()]
#             if candidates:
#                 candidates[0].rename(DATASET_DIR)
#         print("Download & extraction complete.")
#     else:
#         print("kaggle.json not found. Skipping Kaggle step.")
# except Exception as e:
#     print("Not on Colab or Kaggle step skipped:", e)

In [None]:
!ls -R "/content/drive/MyDrive/Colab_Notebooks/NTI_Offline/Lunge_Disease_Project"


Streaming output truncated to the last 5000 lines.
 00030561_009.png
 00030561_012.png
 00030561_015.png
 00030561_016.png
 00030561_019.png
 00030561_021.png
 00030562_000.png
 00030570_000.png
 00030571_001.png
 00030573_004.png
 00030573_013.png
 00030573_015.png
 00030582_000.png
 00030582_003.png
 00030586_000.png
 00030591_000.png
 00030594_000.png
 00030606_001.png
 00030606_002.png
 00030609_002.png
 00030609_005.png
 00030609_007.png
 00030609_011.png
 00030609_024.png
 00030609_025.png
 00030615_001.png
 00030621_002.png
 00030621_005.png
 00030622_001.png
 00030635_001.png
 00030635_007.png
 00030635_010.png
 00030636_001.png
 00030636_005.png
 00030636_019.png
 00030637_000.png
 00030637_001.png
 00030640_005.png
 00030645_000.png
 00030645_001.png
 00030648_000.png
 00030650_006.png
 00030650_009.png
 00030651_000.png
 00030651_001.png
 00030652_003.png
 00030654_000.png
 00030655_000.png
 00030656_000.png
 00030657_000.png
 00030659_000.png
 00030669_000.png

In [None]:
import os
import shutil
import random

# === Settings ===
original_dataset_dir = "/content/drive/MyDrive/Colab_Notebooks/NTI_Offline/Lunge_Disease_Project/DataSet/lungs_dataset"  # your current dataset folder
output_base_dir = "/content/drive/MyDrive/Colab_Notebooks/NTI_Offline/Lunge_Disease_Project/DataSet"  # where new train/val/test will be created

train_ratio = 0.7
val_ratio = 0.15
test_ratio = 0.15

# === Make sure ratios sum to 1 ===
assert abs((train_ratio + val_ratio + test_ratio) - 1.0) < 1e-6, "Ratios must sum to 1"

# === Create output directories ===
for split in ["train", "val", "test"]:
    os.makedirs(os.path.join(output_base_dir, split), exist_ok=True)

In [None]:

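# === Detect class folders ===
# Lists the immediate subfolders of original_dataset_dir. These are class names only if
# that path points directly at the class folders, not at a wrapper or split folder.
classes = [d for d in os.listdir(original_dataset_dir) if os.path.isdir(os.path.join(original_dataset_dir, d))]
classes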

['Lung Disease Dataset']
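
The output above lists the archive's wrapper folder rather than the disease classes, so the later cells would treat `Lung Disease Dataset` as a single class. One possible way to handle this is to first locate the folder that holds the splits and read the class names from inside it. This is a sketch that assumes the layout `Lung Disease Dataset/train/<class>/` suggested by the unzip log; `find_split_root` and `list_classes` are hypothetical helper names.

```python
from pathlib import Path


def find_split_root(base_dir, splits=("train", "val", "test"), max_depth=2):
    """Return the first folder at or below base_dir that contains a split folder.

    Searches level by level down to max_depth; returns None if nothing matches.
    """
    frontier = [Path(base_dir)]
    for _ in range(max_depth + 1):
        next_frontier = []
        for d in frontier:
            if not d.is_dir():
                continue
            if any((d / s).is_dir() for s in splits):
                return d
            next_frontier.extend(sorted(p for p in d.iterdir() if p.is_dir()))
        frontier = next_frontier
    return None


def list_classes(split_root, split="train"):
    """List the class folder names inside one split, sorted alphabetically."""
    return sorted(p.name for p in (Path(split_root) / split).iterdir() if p.is_dir())


# Example use with the notebook's variable:
# root = find_split_root(original_dataset_dir)
# classes = list_classes(root) if root is not None else []
```

Because the archive already ships `train/`, `val/`, and `test/` folders, it is worth deciding whether a fresh 70/15/15 split is needed at all before running the split cell.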

In [None]:



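random.seed(42)  # fixed seed (arbitrary value) so the split is reproducible

for cls in classes:
    print(f"Processing class: {cls}")
    cls_dir = os.path.join(original_dataset_dir, cls)
    # Only regular files are collected; subfolders are ignored. Sort first because
    # os.listdir() returns entries in arbitrary order.
    images = sorted(f for f in os.listdir(cls_dir) if os.path.isfile(os.path.join(cls_dir, f)))

    random.shuffle(images)  # shuffle before slicing into train/val/test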

    # Calculate split sizes
    total = len(images)
    train_end = int(total * train_ratio)
    val_end = train_end + int(total * val_ratio)

    # Split the data
    train_files = images[:train_end]
    val_files = images[train_end:val_end]
    test_files = images[val_end:]

    # Create class folders inside each split
    for split_name, split_files in zip(["train", "val", "test"], [train_files, val_files, test_files]):
        split_class_dir = os.path.join(output_base_dir, split_name, cls)
        os.makedirs(split_class_dir, exist_ok=True)
        for file in split_files:
            src = os.path.join(cls_dir, file)
            dst = os.path.join(split_class_dir, file)
            shutil.copy2(src, dst)  # copy file while keeping metadata

print("\n✅ Dataset split complete!")


Processing class: Lung Disease Dataset

✅ Dataset split complete!


In [None]:
from pathlib import Path
from PIL import Image

# List of splits to check
splits = ["train", "val", "test"]
# === Detect corrupted images ===
def is_image_ok(path):
    try:
        with Image.open(path) as img:
            img.verify()  # integrity check that does not decode pixel data; truncated files may still pass
        return True
    except Exception:
        return False

bad_files = []
for split in splits:
    split_dir = DATASET_DIR / split
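    if not split_dir.exists():
        continue  # a missing split is skipped silently, so a count of 0 can also mean nothing was scanned
    for c in classes:
        cdir = split_dir / c
        # glob patterns are case-sensitive on Linux (Colab); extend the tuple for variants such as '*.PNG'
        for ext in ('*.png','*.jpg','*.jpeg','*.JPG'):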
            for p in cdir.glob(ext):
                if not is_image_ok(p):
                    bad_files.append(str(p))

print("Corrupted images found:", len(bad_files))
if bad_files:
    with open(PROJECT_ROOT / 'corrupted_images.txt', 'w') as f:
        f.write("\n".join(bad_files))
    print("List saved to:", PROJECT_ROOT / 'corrupted_images.txt')

# Optionally remove corrupted files
# for bf in bad_files:
#     os.remove(bf)

Corrupted images found: 0
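
The check above covers files that Pillow cannot open. To also catch files whose extension does not match their content (for example a PNG saved with a `.jpeg` name, or an empty file), a dependency-free complement is to compare each file's leading bytes with the standard JPEG and PNG signatures. This is only a sketch: `sniff_format` and `find_suspect_files` are hypothetical helpers, and the check does not prove that an image decodes correctly.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of the two formats that appear in this dataset.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}
SUFFIX_TO_FORMAT = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png"}


def sniff_format(path):
    """Return 'jpeg', 'png', or None based on the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(8)
    for name, signature in SIGNATURES.items():
        if head.startswith(signature):
            return name
    return None


def find_suspect_files(root):
    """Return (path, expected_format, detected_format) for every mismatch under root."""
    suspects = []
    for p in sorted(Path(root).rglob("*")):
        expected = SUFFIX_TO_FORMAT.get(p.suffix.lower())
        if p.is_file() and expected is not None:
            detected = sniff_format(p)
            if detected != expected:
                suspects.append((str(p), expected, detected))
    return suspects


# Example use with the notebook's variable:
# for path, expected, detected in find_suspect_files(DATASET_DIR):
#     print(f"{path}: extension says {expected}, content looks like {detected}")
```

Files reported here are candidates for manual review rather than automatic deletion.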


In [None]:
# === Licensing & provenance README ===
readme = f"""
# Dataset Provenance & Licensing

- **Source**: Kaggle — Lungs Disease Dataset (4 types) by Omkar Manohar Dalvi
- **URL**: https://www.kaggle.com/datasets/omkarmanohardalvi/lungs-disease-dataset-4-types
- **Classes**: {classes if classes else 'To be detected after download'}
- **Generated**: This README created by Member 1 Dataset Manager notebook.

Notes:
- The Kaggle CLI reported `License(s): unknown` at download time; confirm the license on the Kaggle page before redistribution.
- Avoid pushing raw data into public repos unless license permits.
"""

with open(PROJECT_ROOT / 'DATASET_README.md', 'w', encoding='utf-8') as f:
    f.write(readme)
print("Wrote:", PROJECT_ROOT / 'DATASET_README.md')

Wrote: /content/drive/MyDrive/Colab_Notebooks/NTI_Offline/Lunge_Disease_Project/DATASET_README.md
