# Chest X-Ray Multi‑Class Project — Role Notebook

**Dataset:** Kaggle “Lungs Disease Dataset (4 types)” by Omkar Manohar Dalvi  
**Classes (5):** Normal plus four disease types: Bacterial Pneumonia, Viral Pneumonia, COVID‑19, Tuberculosis

> Use this notebook in **Google Colab**. If you’re running locally, adapt the Drive mount steps accordingly.

## Role — Member 1: Dataset Manager

**Responsibilities**  
- Download and organize the dataset from Kaggle  
- Inspect for corrupted/mislabeled files  
- Count samples per class; generate initial stats  
- Document dataset source & licensing  
- Produce a cleaned directory ready for training
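
The last responsibility, producing a cleaned directory, has no dedicated cell in this notebook. A minimal sketch is shown below, assuming corrupted files have already been listed and that copying (rather than moving) is acceptable. The helper name `build_clean_copy` and its parameters are hypothetical, and the destination is assumed to lie outside the source tree.

```python
import shutil
from pathlib import Path

# Hypothetical helper: not part of the original notebook.
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}

def build_clean_copy(src_root, dst_root, bad_files=()):
    """Copy every image under src_root to dst_root, skipping paths in bad_files.

    The split/class folder layout is preserved. Returns the number of files copied.
    Assumption: dst_root is not located inside src_root.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    skip = {Path(b).resolve() for b in bad_files}
    copied = 0
    for src in sorted(src_root.rglob("*")):
        # Keep only image files, matching extensions case-insensitively
        if not src.is_file() or src.suffix.lower() not in IMAGE_EXTS:
            continue
        if src.resolve() in skip:
            continue
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied += 1
    return copied
```

Copying leaves the raw download untouched, so the cleaning step can be repeated if the list of bad files changes.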

## Environment & Paths

- The code below mounts Google Drive (for persistence) and prepares base paths.  
- Set `DATASET_DIR` to where the extracted dataset resides (after Kaggle download).
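
When running outside Colab, `/content/drive` does not exist, so the base paths need a local fallback. The sketch below shows one way to pick the project root. The helper `resolve_project_root` and the local default (a `Chest_XRay_Project` folder under the current directory) are assumptions, not part of the original notebook.

```python
from pathlib import Path

def resolve_project_root(drive_mount="/content/drive", project_name="Chest_XRay_Project"):
    """Return the Drive-backed project root if Drive is mounted, else a local folder.

    Hypothetical helper; the mount point and folder name mirror the defaults
    used in this notebook.
    """
    my_drive = Path(drive_mount) / "MyDrive"
    if my_drive.is_dir():
        return my_drive / project_name
    # Assumption: outside Colab, keep everything under the current directory.
    return Path.cwd() / project_name

# Possible usage (left commented out so nothing is created on import):
# PROJECT_ROOT = resolve_project_root()
# PROJECT_ROOT.mkdir(parents=True, exist_ok=True)
```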

## (Optional) Download from Kaggle directly in Colab

Run the Kaggle cell once to set up your Kaggle API token (`kaggle.json`), then download and unzip the dataset directly into Drive.
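
The `!unzip` shell command used in the Kaggle cell depends on the `unzip` binary being installed, which may not hold for local runs. As a portable alternative, the archive can be extracted with Python's standard `zipfile` module. This is only a sketch, and `extract_archive` is a hypothetical helper.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the sorted top-level entry names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        # Zip member names always use forward slashes, whatever the OS
        top_level = {name.split("/", 1)[0] for name in zf.namelist() if name}
    return sorted(top_level)
```

The returned top-level names show which extracted folder needs renaming to match `DATASET_DIR`.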

## Verify Structure & Basic Stats

We assume the dataset contains subfolders by split and class, e.g.:

```
lungs_dataset/
  train/
    Bacterial Pneumonia/
    Viral Pneumonia/
    COVID/
    TB/
    Normal/
  val/
  test/
```
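
Because the layout above is an assumption, it is worth checking before computing statistics. The sketch below lists the class folders in each split and reports any split that is missing or whose class folders differ from the first split found. `check_layout` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def check_layout(dataset_dir, splits=("train", "val", "test")):
    """Return (classes_by_split, problems) for a split/class folder layout."""
    dataset_dir = Path(dataset_dir)
    classes_by_split, problems = {}, []
    for split in splits:
        split_dir = dataset_dir / split
        if not split_dir.is_dir():
            problems.append(f"missing split: {split}")
            continue
        classes_by_split[split] = sorted(
            p.name for p in split_dir.iterdir() if p.is_dir()
        )
    if classes_by_split:
        # Compare every split against the first one that exists
        ref_split, ref = next(iter(classes_by_split.items()))
        for split, names in classes_by_split.items():
            if names != ref:
                diff = sorted(set(names) ^ set(ref))
                problems.append(f"{split} classes differ from {ref_split}: {diff}")
    return classes_by_split, problems
```

An empty `problems` list means every expected split exists and all splits share the same class folder names.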

In [None]:
# === Colab & Paths ===
import os, sys, glob, json, random, shutil, time
from pathlib import Path

# If in Colab, mount Drive. Elsewhere the import fails, a message is printed, and IN_COLAB is set to False
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    IN_COLAB = True
except Exception as e:
    print("Not running on Colab or Drive not available:", e)
    IN_COLAB = False

# Project root inside Drive (you can change this)
PROJECT_ROOT = Path('/content/drive/MyDrive/Chest_XRay_Project')
PROJECT_ROOT.mkdir(parents=True, exist_ok=True)

# Where the dataset will live (after download & unzip). Adjust as needed.
DATASET_DIR = PROJECT_ROOT / 'lungs_dataset'
OUTPUTS_DIR = PROJECT_ROOT / 'outputs'
MODELS_DIR = PROJECT_ROOT / 'models'
REPORTS_DIR = PROJECT_ROOT / 'reports'

for p in [OUTPUTS_DIR, MODELS_DIR, REPORTS_DIR]:
    p.mkdir(parents=True, exist_ok=True)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATASET_DIR :", DATASET_DIR)
print("OUTPUTS_DIR :", OUTPUTS_DIR)
print("MODELS_DIR  :", MODELS_DIR)
print("REPORTS_DIR :", REPORTS_DIR)

In [None]:
# === Kaggle direct download (optional) ===
# 1) Upload kaggle.json when prompted
try:
    from google.colab import files
    print("Upload kaggle.json (from Kaggle > Account > Create API Token)")
    uploaded = files.upload()
    if 'kaggle.json' in uploaded:
        os.makedirs('/root/.kaggle', exist_ok=True)
        shutil.move('kaggle.json', '/root/.kaggle/kaggle.json')
        os.chmod('/root/.kaggle/kaggle.json', 0o600)
        !pip -q install kaggle >/dev/null
        # Download dataset
        !kaggle datasets download -d omkarmanohardalvi/lungs-disease-dataset-4-types -p $PROJECT_ROOT
        # Unzip
        !unzip -q -o $PROJECT_ROOT/lungs-disease-dataset-4-types.zip -d $PROJECT_ROOT
        # Standardize folder name if needed
        if not DATASET_DIR.exists():
            # Try to infer the extracted folder; match 'lung' so that both
            # "Lung ..." and "Lungs ..." folder names are caught
            candidates = sorted(p for p in PROJECT_ROOT.iterdir() if p.is_dir() and 'lung' in p.name.lower())
            if candidates:
                candidates[0].rename(DATASET_DIR)
        print("Download & extraction complete.")
    else:
        print("kaggle.json not found. Skipping Kaggle step.")
except Exception as e:
    print("Not on Colab or Kaggle step skipped:", e)

In [None]:
# === Verify directory structure and counts ===
from pathlib import Path
from PIL import Image

splits = ['train', 'val', 'test']
classes = []

stats = {}
for split in splits:
    split_dir = DATASET_DIR / split
    if not split_dir.exists():
        print(f"Warning: split {split} not found at {split_dir}")
        continue
    class_dirs = sorted(p for p in split_dir.iterdir() if p.is_dir())
    if not classes and class_dirs:
        classes = [c.name for c in class_dirs]
    stats[split] = {}
    for c in class_dirs:
        # Match extensions case-insensitively so .PNG/.JPEG files are counted
        # and nothing is double-counted on case-insensitive filesystems
        n = sum(1 for f in c.iterdir() if f.is_file() and f.suffix.lower() in {'.png', '.jpg', '.jpeg'})
        stats[split][c.name] = n

print("Classes detected:", classes)
print(json.dumps(stats, indent=2))

# Save stats
with open(PROJECT_ROOT / 'dataset_stats.json', 'w') as f:
    json.dump(stats, f, indent=2)
print("Saved:", PROJECT_ROOT / 'dataset_stats.json')
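
The raw counts are easier to act on once summarized. The sketch below assumes the nested `{split: {class: count}}` shape used for `stats` and summarizes one split at a time. `summarize_split` is a hypothetical helper, and the imbalance ratio (largest class divided by smallest non-empty class) is just one convenient indicator.

```python
def summarize_split(class_counts):
    """Return total, per-class share, and max/min imbalance ratio for one split."""
    total = sum(class_counts.values())
    shares = {c: (n / total if total else 0.0) for c, n in class_counts.items()}
    # Ignore empty classes so the ratio never divides by zero
    nonzero = [n for n in class_counts.values() if n > 0]
    ratio = max(nonzero) / min(nonzero) if nonzero else None
    return {"total": total, "shares": shares, "imbalance_ratio": ratio}

# Illustrative counts only, not measured from the dataset.
example = {"Normal": 300, "TB": 100}
print(summarize_split(example))
```

In the notebook this could be applied per split, for example `summarize_split(stats['train'])`, once the counts have been computed.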

In [None]:
# === Detect corrupted images ===
def is_image_ok(path):
    try:
        with Image.open(path) as img:
            img.verify()  # integrity check; does not decode pixel data, so some truncated files can still pass
        return True
    except Exception:
        return False

bad_files = []
for split in splits:
    split_dir = DATASET_DIR / split
    if not split_dir.exists():
        continue
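    for c in classes:
        cdir = split_dir / c
        if not cdir.is_dir():
            continue  # class folder absent from this split
        # Case-insensitive extension match; avoids duplicate hits for *.jpg / *.JPG
        for p in sorted(cdir.iterdir()):
            if p.is_file() and p.suffix.lower() in {'.png', '.jpg', '.jpeg'} and not is_image_ok(p):
                bad_files.append(str(p))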

print("Corrupted images found:", len(bad_files))
if bad_files:
    with open(PROJECT_ROOT / 'corrupted_images.txt', 'w') as f:
        f.write("\n".join(bad_files))
    print("List saved to:", PROJECT_ROOT / 'corrupted_images.txt')

# Optionally remove corrupted files
# for bf in bad_files:
#     os.remove(bf)

In [None]:
# === Licensing & provenance README ===
readme = f"""
# Dataset Provenance & Licensing

- **Source**: Kaggle — Lungs Disease Dataset (4 types) by Omkar Manohar Dalvi
- **URL**: https://www.kaggle.com/datasets/omkarmanohardalvi/lungs-disease-dataset-4-types
- **Classes**: {', '.join(classes) if classes else 'To be detected after download'}
- **Generated**: This README was created by the Member 1 (Dataset Manager) notebook.

Notes:
- Confirm the original dataset license on Kaggle before redistribution.
- Avoid pushing raw data into public repos unless the license permits it.
"""

with open(PROJECT_ROOT / 'DATASET_README.md', 'w', encoding='utf-8') as f:
    f.write(readme)
print("Wrote:", PROJECT_ROOT / 'DATASET_README.md')