# CS171 Project — Data Preparation 

## E-Waste Image Dataset

**Course:** CS 171 — Introduction to Machine Learning  
**Name:** Samriddhi Matharu  
**Student ID:** 016328156  

**Dataset Name:** E-Waste Image Dataset  
**Source:** [Kaggle — E-Waste Image Dataset](https://www.kaggle.com/datasets/akshat103/e-waste-image-dataset/data)  
**License:** Apache 2.0  

---

### Project Overview
This dataset contains approximately **3,000 images** of common electronic waste items, organized into **10 categories** such as batteries, circuit boards, televisions, and mobile devices.  
Images are pre-labeled and stored in separate `train`, `val`, and `test` folders. The aim is to explore how computer vision can support real-world waste-management pipelines and, more broadly, how machine learning can promote sustainability and reduce landfill impact.

---

### Objective 
The goal is to build a **Convolutional Neural Network (CNN)** that can classify images of e-waste into their respective categories.  
This will serve as part of a broader project on **machine learning for waste processing and recycling**.  
The primary task for now is **image classification**.

---

### Planned Data Preparation
This notebook prepares the e-waste image dataset for modeling.

Steps:
- Load dataset using `torchvision.datasets.ImageFolder`.  
- Inspect the Kaggle `train`, `val`, and `test` folders.
- Merge Kaggle’s `val` split into the `test` split to create one larger test set.
- Summarize per-class image counts after merging.
- Load and summarize an additional **hand-curated validation set** (`val (by hand)`) of real-world images collected from the web.

The modeling and training code live in a separate notebook.
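
As a rough illustration of the folder layout `ImageFolder` expects, the sketch below mimics its labeling rule: each subfolder of the root is treated as a class, and class names are sorted before being assigned integer indices. It uses only the standard library. The helper name `find_classes_sketch` and the temporary folder layout are assumptions for illustration, not part of the project code.

```python
import os
import tempfile

def find_classes_sketch(root: str):
    """Mimic ImageFolder's labeling rule: sorted subfolder names -> indices."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}
    return classes, class_to_idx

# Hypothetical mini layout using three of the dataset's class names
with tempfile.TemporaryDirectory() as root:
    for name in ["PCB", "Battery", "Mouse"]:
        os.makedirs(os.path.join(root, name))
    demo_classes, demo_class_to_idx = find_classes_sketch(root)

print(demo_classes)        # class names in sorted order
print(demo_class_to_idx)   # label index assigned to each class
```

Because the indices depend only on the sorted folder names, every split that contains the same class folders ends up with the same label mapping.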


### Import Libraries

In [1]:
# Imports
import os
import shutil
import stat
from collections import defaultdict

import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


Using device: cpu


## 1. Inspect Kaggle’s original `train`, `val`, and `test` splits

The Kaggle E-Waste dataset comes pre-split into:
- `data/train`
- `data/val`
- `data/test`

Before changing anything, we:
- List the class names in each split.
- Count how many images exist per class and per split.


In [3]:
# Config & helper functions
BASE_DIR = "data"
SPLITS   = ["train", "val", "test"]  # splits provided by Kaggle

def sp(split: str) -> str:
    """Return the full path for a split under BASE_DIR."""
    return os.path.join(BASE_DIR, split)

def list_classes(split: str):
    """List class folders inside a split"""
    p = sp(split)
    if not os.path.exists(p):
        return []
    return sorted(
        d for d in os.listdir(p)
        if os.path.isdir(os.path.join(p, d))
    )

def count_images(split: str):
    """Count files per class in a given split (all regular files, not only images)."""
    p = sp(split)
    counts = {}
    if not os.path.exists(p):
        return counts

    for cls in list_classes(split):
        cls_path = os.path.join(p, cls)
        counts[cls] = sum(
            1 for f in os.listdir(cls_path)
            if os.path.isfile(os.path.join(cls_path, f))
        )
    return counts

def summarize(split: str):
    """Pretty-print per-class counts for one split."""
    counts = count_images(split)
    if not counts:
        print(f"[info] '{split}' not found or empty.")
        return 0

    print(f"\n=== {split.upper()} per-class counts ===")
    for cls, n in counts.items():
        print(f"{cls:>16}: {n}")
    total = sum(counts.values())
    print(f"Total in {split}: {total}")
    return total

def summarize_all():
    """Summarize all splits (train/val/test)."""
    print("\n=== Classes per split ===")
    for s in SPLITS:
        print(f"{s:<5}: {list_classes(s) or '—'}")
    totals = {s: summarize(s) for s in SPLITS}
    return totals

# Initial summary before any changes
_ = summarize_all()



=== Classes per split ===
train: ['Battery', 'Keyboard', 'Microwave', 'Mobile', 'Mouse', 'PCB', 'Player', 'Printer', 'Television', 'Washing Machine']
val  : ['Battery', 'Keyboard', 'Microwave', 'Mobile', 'Mouse', 'PCB', 'Player', 'Printer', 'Television', 'Washing Machine']
test : ['Battery', 'Keyboard', 'Microwave', 'Mobile', 'Mouse', 'PCB', 'Player', 'Printer', 'Television', 'Washing Machine']

=== TRAIN per-class counts ===
         Battery: 240
        Keyboard: 240
       Microwave: 240
          Mobile: 240
           Mouse: 240
             PCB: 240
          Player: 240
         Printer: 240
      Television: 240
 Washing Machine: 240
Total in train: 2400

=== VAL per-class counts ===
         Battery: 30
        Keyboard: 30
       Microwave: 30
          Mobile: 30
           Mouse: 30
             PCB: 30
          Player: 30
         Printer: 30
      Television: 30
 Washing Machine: 30
Total in val: 300

=== TEST per-class counts ===
         Battery: 30
        Keyboard: 
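
Since all splits are expected to share the same ten classes, a small check along the following lines can flag a missing class folder or an unbalanced class early. This is only a sketch: `class_counts` and `check_splits` are hypothetical helpers, and the demo runs on a throwaway directory rather than on `data/`.

```python
import os
import tempfile

def class_counts(split_dir: str) -> dict:
    """Map each class folder in split_dir to its number of regular files."""
    counts = {}
    for cls in sorted(os.listdir(split_dir)):
        cls_path = os.path.join(split_dir, cls)
        if os.path.isdir(cls_path):
            counts[cls] = sum(
                os.path.isfile(os.path.join(cls_path, f))
                for f in os.listdir(cls_path)
            )
    return counts

def check_splits(base_dir: str, splits):
    """Return (all splits share the same classes, per-split balance flags)."""
    per_split = {s: class_counts(os.path.join(base_dir, s)) for s in splits}
    class_sets = [set(c) for c in per_split.values()]
    same = all(cs == class_sets[0] for cs in class_sets)
    # A split is "balanced" here if every class has the same file count
    balance = {s: len(set(c.values())) <= 1 for s, c in per_split.items()}
    return same, balance

# Demo on a made-up layout: 'train' is balanced, 'test' is not
with tempfile.TemporaryDirectory() as base:
    for split, n in [("train", 3), ("test", 1)]:
        for cls in ["Battery", "Mouse"]:
            d = os.path.join(base, split, cls)
            os.makedirs(d)
            for i in range(n):
                open(os.path.join(d, f"img_{i}.jpg"), "w").close()
    open(os.path.join(base, "test", "Mouse", "extra.jpg"), "w").close()
    same_classes, balanced = check_splits(base, ["train", "test"])

print(same_classes, balanced)
```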

## 2. Merge Kaggle `val` into `test` to create one big test folder 

For this project, we follow the professor’s guidance:

- Use **all original Kaggle `train` images** for training.
- Combine Kaggle `val` and `test` into **one larger test set**.
- Later, use a **separate hand-curated validation set** of real-world images (outside Kaggle).

Below, I move all images from `data/val/*` into `data/test/*` (class-by-class) and then remove the now-empty `val` folder. The function is written to be **idempotent**: if you run it again, it won’t duplicate files.


In [5]:
# One-time merge: move val -> test 

def _force_writable(func, path, exc_info):
    """
    Helper for shutil.rmtree on Windows:
    Make read-only files writable, then retry.
    """
    try:
        os.chmod(path, stat.S_IWRITE)
    except Exception:
        pass
    func(path)

def merge_val_into_test_once():
    val_root, test_root = sp("val"), sp("test")

    if not os.path.exists(val_root):
        print("[skip] 'val' folder not present. Nothing to merge.")
        return

    os.makedirs(test_root, exist_ok=True)
    transferred = 0
    skipped = 0

    # Move files class-by-class
    for cls in list_classes("val"):
        src_cls = os.path.join(val_root, cls)
        if not os.path.isdir(src_cls):
            continue

        dst_cls = os.path.join(test_root, cls)
        os.makedirs(dst_cls, exist_ok=True)

        for fname in os.listdir(src_cls):
            src = os.path.join(src_cls, fname)
            if not os.path.isfile(src):
                continue

            dst = os.path.join(dst_cls, fname)
            # Skip (and count) files whose name already exists in test
            if os.path.exists(dst):
                skipped += 1
                continue

            shutil.move(src, dst)
            transferred += 1

    # Keep 'val' if anything was skipped, so those files are not deleted
    if skipped:
        print(f"[warn] Moved {transferred} files; {skipped} left in 'val/' due to name clashes.")
        return

    # Remove 'val' folder (and any empty subfolders) once everything is moved
    try:
        shutil.rmtree(val_root, onerror=_force_writable)
        print(f"[done] Merged (moved) {transferred} files from 'val' → 'test' and removed 'val/'.")
    except Exception as e:
        print(f"[warn] Could not completely remove 'val': {e}. You can delete it manually if needed.")

# Run the merge (only meaningful the first time)
merge_val_into_test_once()

# Re-summarize after merging
_ = summarize_all()


[done] Merged (moved) 300 files from 'val' → 'test' and removed 'val/'.

=== Classes per split ===
train: ['Battery', 'Keyboard', 'Microwave', 'Mobile', 'Mouse', 'PCB', 'Player', 'Printer', 'Television', 'Washing Machine']
val  : —
test : ['Battery', 'Keyboard', 'Microwave', 'Mobile', 'Mouse', 'PCB', 'Player', 'Printer', 'Television', 'Washing Machine']

=== TRAIN per-class counts ===
         Battery: 240
        Keyboard: 240
       Microwave: 240
          Mobile: 240
           Mouse: 240
             PCB: 240
          Player: 240
         Printer: 240
      Television: 240
 Washing Machine: 240
Total in train: 2400
[info] 'val' not found or empty.

=== TEST per-class counts ===
         Battery: 60
        Keyboard: 60
       Microwave: 60
          Mobile: 60
           Mouse: 60
             PCB: 60
          Player: 60
         Printer: 60
      Television: 60
 Washing Machine: 60
Total in test: 600
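
The merge above skips any file whose name already exists in the destination class folder. If `val` and `test` ever reused a file name for different images, an alternative is to rename on collision instead of skipping. The following is a minimal sketch of that idea, not the method used in this notebook; `move_without_overwrite` and the `_val` suffix are assumptions for illustration.

```python
import os
import shutil
import tempfile

def move_without_overwrite(src: str, dst_dir: str, suffix: str = "_val") -> str:
    """Move src into dst_dir; on a name clash, append a suffix (and counter)."""
    stem, ext = os.path.splitext(os.path.basename(src))
    dst = os.path.join(dst_dir, stem + ext)
    n = 0
    while os.path.exists(dst):
        n += 1
        tag = suffix if n == 1 else f"{suffix}{n}"
        dst = os.path.join(dst_dir, f"{stem}{tag}{ext}")
    shutil.move(src, dst)
    return dst

# Demo: the same file name exists in both folders with different content
with tempfile.TemporaryDirectory() as tmp:
    val_cls = os.path.join(tmp, "val", "Battery")
    test_cls = os.path.join(tmp, "test", "Battery")
    os.makedirs(val_cls)
    os.makedirs(test_cls)
    with open(os.path.join(val_cls, "img_1.jpg"), "w") as f:
        f.write("val image")
    with open(os.path.join(test_cls, "img_1.jpg"), "w") as f:
        f.write("test image")

    moved_to = os.path.basename(
        move_without_overwrite(os.path.join(val_cls, "img_1.jpg"), test_cls)
    )
    merged_files = sorted(os.listdir(test_cls))

print(moved_to)
print(merged_files)
```

With this approach both images survive the merge, at the cost of file names in `test` that no longer match the original Kaggle names.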


## 3. Hand-Curated Validation Set (`val (by hand)`)

To test how well the model generalizes beyond curated Kaggle images,  
we created a **separate folder**:

- `data/val (by hand)/Battery/`
- `data/val (by hand)/Keyboard/`
- ...
- `data/val (by hand)/Washing Machine/`

Each subfolder contains ~10 real-world images per class collected from the web  
(e.g., product photos, different backgrounds, lighting, viewpoints).

Below, we:
- Define a transform consistent with the CNN input size (128×128).
- Load this hand-curated validation set with `ImageFolder`.
- Print the class names and the number of valid image files per class.
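
For reference, the transform in the next cell also normalizes each channel. `transforms.ToTensor()` scales pixel values to the range [0, 1], and `transforms.Normalize` with mean 0.5 and standard deviation 0.5 then computes (x - 0.5) / 0.5, which maps that range to [-1, 1]. The plain-Python sketch below reproduces the arithmetic on single values; `normalize_value` is a hypothetical helper, not a torchvision function.

```python
def normalize_value(x: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Per-value version of the normalization formula (x - mean) / std."""
    return (x - mean) / std

# Black, mid-gray, and white pixels after ToTensor-style scaling to [0, 1]
black, mid_gray, white = (normalize_value(v) for v in (0.0, 0.5, 1.0))
print(black, mid_gray, white)
```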


In [6]:
# Hand-curated validation set: data/val (by hand) 

val_byhand_dir = "data/val (by hand)"   # folder created manually

IMG_SIZE_V3 = 128

val_transform_v3 = transforms.Compose([
    transforms.Resize((IMG_SIZE_V3, IMG_SIZE_V3)),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5],
                         [0.5, 0.5, 0.5]),
])

# Dataset + loader for the hand-curated validation set
val_byhand_ds = datasets.ImageFolder(root=val_byhand_dir,
                                     transform=val_transform_v3)
val_byhand_loader = DataLoader(val_byhand_ds, batch_size=32, shuffle=False)

print("Classes in hand-curated val set:", val_byhand_ds.classes)
print("Total images in val (by hand):", len(val_byhand_ds))

# Optional: per-class image counts (only valid image extensions)
valid_exts = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff", ".webp"}

print("\n=== Per-class counts in 'val (by hand)' ===")
for cls in sorted(os.listdir(val_byhand_dir)):
    cls_path = os.path.join(val_byhand_dir, cls)
    if os.path.isdir(cls_path):
        imgs = [
            f for f in os.listdir(cls_path)
            if os.path.splitext(f)[1].lower() in valid_exts
        ]
        print(f"{cls:>16}: {len(imgs)} image files")


Classes in hand-curated val set: ['Battery', 'Keyboard', 'Microwave', 'Mobile', 'Mouse', 'PCB', 'Player', 'Printer', 'Television', 'Washing Machine']
Total images in val (by hand): 108

=== Per-class counts in 'val (by hand)' ===
         Battery: 10 image files
        Keyboard: 10 image files
       Microwave: 10 image files
          Mobile: 10 image files
           Mouse: 10 image files
             PCB: 10 image files
          Player: 10 image files
         Printer: 10 image files
      Television: 10 image files
 Washing Machine: 10 image files
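
Note that the total reported by `ImageFolder` (108) and the per-class counts (10 each, 100 in total) are computed differently. `ImageFolder` walks each class folder recursively and accepts its own extension list, which also includes `.ppm` and `.pgm`, whereas the loop above looks only at the top level of each class folder. Nested subfolders or other extensions would be one possible explanation for the gap, though nothing here confirms it. A recursive count such as the sketch below could help locate the extra files; `count_flat` and `count_recursive` are hypothetical helpers, and the demo folder is made up.

```python
import os
import tempfile

# Extensions accepted by torchvision's ImageFolder (its IMG_EXTENSIONS tuple)
IMAGEFOLDER_EXTS = {".jpg", ".jpeg", ".png", ".ppm", ".bmp",
                    ".pgm", ".tif", ".tiff", ".webp"}

def count_flat(cls_path: str, exts) -> int:
    """Count matching names at the top level of a class folder only."""
    return sum(os.path.splitext(f)[1].lower() in exts
               for f in os.listdir(cls_path))

def count_recursive(cls_path: str, exts) -> int:
    """Count matching files in a class folder and all of its subfolders."""
    total = 0
    for _root, _dirs, files in os.walk(cls_path):
        total += sum(os.path.splitext(f)[1].lower() in exts for f in files)
    return total

# Demo: two images at the top level, one hidden in a nested folder
with tempfile.TemporaryDirectory() as cls_dir:
    os.makedirs(os.path.join(cls_dir, "extra"))
    for rel in ["a.jpg", "b.png", os.path.join("extra", "c.jpg")]:
        open(os.path.join(cls_dir, rel), "w").close()
    flat = count_flat(cls_dir, IMAGEFOLDER_EXTS)
    recursive = count_recursive(cls_dir, IMAGEFOLDER_EXTS)

print(flat, recursive)
```

Any class where the two counts differ contains files that the flat listing does not see.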


## 4. Summary

After running this notebook:

- `data/train/` contains the Kaggle training images (10 e-waste classes).
- `data/test/` contains the original Kaggle test images **plus** the merged Kaggle validation images.
- `data/val (by hand)/` contains real-world images collected manually, one folder per class.

These prepared splits are now ready to be used in the main modeling notebook  
(e.g., `02_modeling_ewaste.ipynb`), where I train:

- A custom CNN (Model V3), and  
- A pretrained ResNet-18 (Model V4),

and compare their performance both on Kaggle data and the hand-curated validation set.
