# 01 — Data Exploration: Weed Detection Datasets for Indonesian Rice Fields

**Purpose:** Understand the datasets we'll use for training before touching any models.  
**Runtime:** CPU only — no GPU needed. Save your GPU hours for training notebooks.  
**Platform:** Works on both Kaggle and Google Colab.

## What This Notebook Covers

1. **Platform detection** — auto-detect Kaggle vs Colab and set paths accordingly
2. **Dataset switcher** — configure which dataset to explore (YOLO detection, folder classification, or mask segmentation)
3. **Dataset exploration** — class distribution, sample images, image properties, data quality
4. **Segmentation survey** — RiceSEG for pixel-level weed masks (D2)
5. **Key observations** — what we learned that affects training decisions

### Why Not DeepWeeds?

The original notebook used **DeepWeeds** (Australian rangeland weeds, 9 classes). We replaced it because:
- **Zero species overlap** with Indonesian rice paddy weeds
- Australian rangeland ecology is fundamentally different from tropical rice fields
- Training on irrelevant species teaches the model features that don't transfer

### Recommended Datasets

| Dataset | Task | Role | Source |
|---------|------|------|--------|
| **Crop & Weed Detection** (YOLO) | Object Detection | Pipeline learning, YOLO format | Kaggle (already available) |
| **Bangladesh Rice Field Weed** | Classification | 11 tropical rice weed species | Mendeley Data (upload to Kaggle) |
| **RiceSEG** | Segmentation | Pixel-level weed masks in rice fields | HuggingFace (upload to Kaggle) |

This notebook explores the **detection** and **classification** datasets. Notebook 02 covers **RiceSEG** (segmentation).

---
## 1. Platform Detection & Setup

Since we want this notebook to run on **both Kaggle and Colab**, we detect the platform and set file paths accordingly.

**How it works:**
- Kaggle notebooks have `/kaggle/input/` directory
- Colab notebooks have `google.colab` module available
- If neither → running locally

### Install Dependencies

On Kaggle and Colab, pip installs are **ephemeral** — they disappear when the session restarts.  
That's why every notebook starts with an install cell. This is normal for cloud notebooks.

In [9]:
import os
import sys

# --- Platform Detection ---
IS_KAGGLE = os.path.exists('/kaggle/input')

try:
    import google.colab
    IS_COLAB = True
except ImportError:
    IS_COLAB = False

IS_LOCAL = not IS_KAGGLE and not IS_COLAB

PLATFORM = 'kaggle' if IS_KAGGLE else ('colab' if IS_COLAB else 'local')
print(f'Platform detected: {PLATFORM}')
print(f'Python version: {sys.version}')

Platform detected: local
Python version: 3.14.2 (main, Dec  5 2025, 16:49:16) [Clang 17.0.0 (clang-1700.6.3.2)]


In [10]:
# These are lightweight — no ML frameworks needed for exploration
# matplotlib and pandas are pre-installed on both platforms
import subprocess
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'Pillow'])

print('Dependencies ready.')

Dependencies ready.


In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.gridspec as gridspec
from pathlib import Path
from PIL import Image
from collections import Counter
import json
import warnings
warnings.filterwarnings('ignore')

# Consistent plot style
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 100
plt.rcParams['font.size'] = 11

print('Imports ready.')

Imports ready.


---
## 2. Dataset Configuration

This notebook supports multiple datasets through a **switcher pattern**. Change `ACTIVE_DATASET` to explore a different dataset — all analysis cells adapt automatically.

### Available Datasets

| Key | Dataset | Format | Status |
|-----|---------|--------|--------|
| `crop_weed_yolo` | Crop & Weed Detection | YOLO (images + `.txt` annotations) | On Kaggle — ready to use |
| `bangladesh_rice_weed` | Bangladesh Rice Field Weed | Folder-based (one folder per class) | Upload from Mendeley Data |

**Default:** `crop_weed_yolo` — works immediately on Kaggle with no extra setup.

In [12]:
# ============================================================
# DATASET CONFIGURATION — Change ACTIVE_DATASET to switch
# ============================================================

DATASET_CONFIGS = {
    'crop_weed_yolo': {
        'name': 'Crop & Weed Detection (YOLO)',
        'format': 'yolo',          # YOLO annotation format
        'task': 'detection',
        'kaggle_slug': 'crop-and-weed-detection-data-with-bounding-boxes',
        'paths': {
            'kaggle': '/kaggle/input/crop-and-weed-detection-data-with-bounding-boxes',
            'colab': '/content/crop_weed_yolo',
            'local': './data/crop_weed_yolo',
        },
        'description': (
            '2-class detection dataset (crop vs weed) with YOLO-format bounding boxes. '
            'Good for learning the YOLO annotation pipeline and detection basics.'
        ),
    },
    'bangladesh_rice_weed': {
        'name': 'Bangladesh Rice Field Weed',
        'format': 'folder',         # One subfolder per class
        'task': 'classification',
        'kaggle_slug': None,        # Must upload from Mendeley
        'paths': {
            'kaggle': '/kaggle/input/bangladesh-rice-field-weed',
            'colab': '/content/bangladesh_rice_weed',
            'local': './data/bangladesh_rice_weed',
        },
        'description': (
            '11 rice weed species from Bangladesh (tropical climate, closest to Indonesia). '
            'Requires manual upload from Mendeley Data: '
            'https://data.mendeley.com/datasets/mt72bmxz73/4'
        ),
    },
}

# >>> CHANGE THIS to switch datasets <<<
ACTIVE_DATASET = 'crop_weed_yolo'

config = DATASET_CONFIGS[ACTIVE_DATASET]
print(f'Active dataset: {config["name"]}')
print(f'Format: {config["format"]}')
print(f'Task: {config["task"]}')
print(f'\nDescription: {config["description"]}')

Active dataset: Crop & Weed Detection (YOLO)
Format: yolo
Task: detection

Description: 2-class detection dataset (crop vs weed) with YOLO-format bounding boxes. Good for learning the YOLO annotation pipeline and detection basics.


---
## 3. Dataset Catalog

### Crop & Weed Detection (YOLO)

| Property | Value |
|----------|-------|
| **Classes** | 2 (crop, weed) |
| **Format** | YOLO — each image has a `.txt` annotation with bounding boxes |
| **Source** | Kaggle (search: "crop and weed detection") |
| **Task** | Object detection |
| **Annotation** | `class_id center_x center_y width height` (normalized 0-1) |

### Bangladesh Rice Field Weed

| Property | Value |
|----------|-------|
| **Classes** | 11 rice weed species |
| **Format** | Folder-based — one subdirectory per species |
| **Source** | Mendeley Data (NOT on Kaggle — must upload as private dataset) |
| **Task** | Image classification |
| **Climate** | Tropical (Bangladesh) — closest match to Indonesian rice fields |
| **Species** | *Cyperus difformis*, *Echinochloa crus-galli*, *Fimbristylis miliacea*, etc. |

### Why These Datasets?

- **Crop & Weed Detection** — Already on Kaggle, zero setup, teaches YOLO annotation format
- **Bangladesh Rice Weed** — 11 species from tropical rice fields, directly relevant to Indonesia
- Both are better than DeepWeeds (Australian rangeland, zero species overlap with Indonesian rice)

> **Note:** For segmentation (pixel-level masks), see **notebook 02** which explores RiceSEG.

In [13]:
# --- Set dataset path based on platform ---

if IS_KAGGLE:
    DATA_ROOT = Path(config['paths']['kaggle'])
elif IS_COLAB:
    DATA_ROOT = Path(config['paths']['colab'])
else:
    DATA_ROOT = Path(config['paths']['local'])

print(f'Data root: {DATA_ROOT}')
print(f'Exists: {DATA_ROOT.exists()}')

if not DATA_ROOT.exists():
    print()
    print('=' * 60)
    print('DATASET NOT FOUND — Setup Instructions')
    print('=' * 60)
    if config['format'] == 'yolo':
        print(f'On Kaggle:')
        print(f'  1. Click "Add Data" in the sidebar')
        print(f'  2. Search: "crop and weed detection"')
        print(f'  3. Add the dataset by ravirajsinh45')
        print(f'  4. Re-run this cell')
        print()
        print(f'On Colab:')
        print(f'  1. Download from Kaggle: kaggle datasets download -d ravirajsinh45/{config["kaggle_slug"]}')
        print(f'  2. Unzip to {DATA_ROOT}')
    elif config['format'] == 'folder':
        print(f'This dataset is NOT on Kaggle. You must upload it manually:')
        print(f'  1. Download from Mendeley Data:')
        print(f'     https://data.mendeley.com/datasets/mt72bmxz73/4')
        print(f'  2. On Kaggle: New Dataset > upload the extracted folder')
        print(f'  3. Name it "bangladesh-rice-field-weed"')
        print(f'  4. Attach it to this notebook via "Add Data"')
    print('=' * 60)

Data root: data/crop_weed_yolo
Exists: False

DATASET NOT FOUND — Setup Instructions
On Kaggle:
  1. Click "Add Data" in the sidebar
  2. Search: "crop and weed detection"
  3. Add the dataset by ravirajsinh45
  4. Re-run this cell

On Colab:
  1. Download from Kaggle: kaggle datasets download -d ravirajsinh45/crop-and-weed-detection-data-with-bounding-boxes
  2. Unzip to data/crop_weed_yolo


### Dataset Structure Discovery

Before loading anything, let's see what files and folders exist.  
Different datasets have different structures — YOLO has `.txt` annotation files alongside images, while folder-based datasets organize images into class subdirectories.

In [14]:
# List top-level contents
if DATA_ROOT.exists():
    contents = sorted(DATA_ROOT.iterdir())
    print(f'Contents of {DATA_ROOT}:')
    for item in contents[:30]:  # Show first 30 items
        kind = 'DIR' if item.is_dir() else f'FILE ({item.suffix})'
        size = item.stat().st_size if item.is_file() else ''
        size_str = f' — {size / 1024:.1f} KB' if size else ''
        print(f'  {kind}: {item.name}{size_str}')
    
    if len(contents) > 30:
        print(f'  ... and {len(contents) - 30} more items')
    
    # Count file types
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp'}
    all_images = [f for f in DATA_ROOT.rglob('*') if f.suffix.lower() in image_extensions]
    all_txts = [f for f in DATA_ROOT.rglob('*.txt')]
    all_csvs = [f for f in DATA_ROOT.rglob('*.csv')]
    all_xmls = [f for f in DATA_ROOT.rglob('*.xml')]
    
    print(f'\nFile type summary:')
    print(f'  Images (.jpg/.jpeg/.png/.bmp): {len(all_images)}')
    print(f'  Text files (.txt):             {len(all_txts)}')
    print(f'  CSV files (.csv):              {len(all_csvs)}')
    print(f'  XML files (.xml):              {len(all_xmls)}')
    
    # Show which directories contain images
    image_dirs = Counter(str(f.parent.relative_to(DATA_ROOT)) for f in all_images)
    print(f'\nImage directories: {dict(image_dirs)}')
else:
    print(f'Data root does not exist: {DATA_ROOT}')
    print('Follow the setup instructions above for your platform.')

Data root does not exist: data/crop_weed_yolo
Follow the setup instructions above for your platform.


### Load Dataset

The loading strategy depends on the dataset format:
- **YOLO format:** Parse `.txt` annotation files alongside images
- **Folder format:** Walk subdirectories, each directory name = class name

In [None]:
# --- YOLO annotation parser ---

def parse_yolo_annotation(txt_path):
    """Parse a YOLO annotation file into a list of bounding boxes.
    
    Each line: class_id center_x center_y width height (all normalized 0-1)
    Returns: list of dicts with keys: class_id, cx, cy, w, h
    """
    boxes = []
    try:
        with open(txt_path) as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) >= 5:
                    boxes.append({
                        'class_id': int(parts[0]),
                        'cx': float(parts[1]),
                        'cy': float(parts[2]),
                        'w': float(parts[3]),
                        'h': float(parts[4]),
                    })
    except Exception:
        pass
    return boxes


def load_classes_txt(data_root):
    """Load class names from classes.txt or similar files."""
    for candidate in ['classes.txt', 'obj.names', 'data.names']:
        path = data_root / candidate
        if not path.exists():
            # Try one level deeper
            for subdir in data_root.iterdir():
                if subdir.is_dir():
                    deeper = subdir / candidate
                    if deeper.exists():
                        path = deeper
                        break
        if path.exists():
            with open(path) as f:
                names = [line.strip() for line in f if line.strip()]
            return {i: name for i, name in enumerate(names)}
    
    # Also try to find in any .txt that looks like a class list (short, no spaces per line)
    return None


def load_yolo_dataset(data_root):
    """Load a YOLO-format dataset into a DataFrame.
    
    Returns DataFrame with columns: image_path, annotation_path, num_objects, class_ids
    """
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp'}
    rows = []
    
    # Find all images
    all_images = sorted(f for f in data_root.rglob('*') if f.suffix.lower() in image_extensions)
    
    for img_path in all_images:
        # Look for matching .txt annotation
        txt_path = img_path.with_suffix('.txt')
        if not txt_path.exists():
            # Try in a parallel directory (e.g., images/ vs labels/)
            rel = img_path.relative_to(data_root)
            parts = list(rel.parts)
            for i, part in enumerate(parts):
                if part.lower() in ('images', 'image', 'img'):
                    parts[i] = 'labels'
                    alt = data_root / Path(*parts)
                    alt = alt.with_suffix('.txt')
                    if alt.exists():
                        txt_path = alt
                        break
        
        boxes = parse_yolo_annotation(txt_path) if txt_path.exists() else []
        class_ids = [b['class_id'] for b in boxes]
        
        rows.append({
            'image_path': str(img_path),
            'annotation_path': str(txt_path) if txt_path.exists() else None,
            'num_objects': len(boxes),
            'class_ids': class_ids,
            'has_annotation': txt_path.exists(),
        })
    
    return pd.DataFrame(rows)


def load_folder_dataset(data_root):
    """Load a folder-based classification dataset into a DataFrame.
    
    Expects: data_root/class_name/*.jpg
    Returns DataFrame with columns: image_path, class_name, label
    """
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp'}
    rows = []
    
    # Find all subdirectories that contain images
    subdirs = sorted(d for d in data_root.iterdir() if d.is_dir())
    
    # Check one level deeper if there's only one subdir
    if len(subdirs) == 1:
        deeper = subdirs[0]
        deeper_subdirs = sorted(d for d in deeper.iterdir() if d.is_dir())
        if len(deeper_subdirs) > 1:
            subdirs = deeper_subdirs
            print(f'  Found nested structure: using {deeper.name}/ as root')
    
    class_names = {}
    for idx, subdir in enumerate(subdirs):
        imgs = [f for f in subdir.iterdir() if f.suffix.lower() in image_extensions]
        if imgs:
            class_names[idx] = subdir.name
            for img in imgs:
                rows.append({
                    'image_path': str(img),
                    'class_name': subdir.name,
                    'label': idx,
                })
    
    return pd.DataFrame(rows), class_names


print('Dataset loaders defined.')

In [15]:
# --- Load the active dataset ---
df = None
CLASS_NAMES = {}

if config['format'] == 'yolo':
    print(f'Loading YOLO dataset from {DATA_ROOT}...')
    
    # Load class names
    CLASS_NAMES = load_classes_txt(DATA_ROOT)
    if CLASS_NAMES:
        print(f'Class names from file: {CLASS_NAMES}')
    else:
        # Default for Crop & Weed Detection
        CLASS_NAMES = {0: 'crop', 1: 'weed'}
        print(f'Using default class names: {CLASS_NAMES}')
    
    NUM_CLASSES = len(CLASS_NAMES)
    df = load_yolo_dataset(DATA_ROOT)
    
    FNAME_COL = 'image_path'
    LABEL_COL = 'class_ids'  # list of class IDs per image
    
    print(f'\nLoaded {len(df)} images')
    print(f'With annotations: {df["has_annotation"].sum()}')
    print(f'Without annotations: {(~df["has_annotation"]).sum()}')
    print(f'Total objects: {df["num_objects"].sum()}')

elif config['format'] == 'folder':
    print(f'Loading folder-based dataset from {DATA_ROOT}...')
    
    df, CLASS_NAMES = load_folder_dataset(DATA_ROOT)
    NUM_CLASSES = len(CLASS_NAMES)
    
    FNAME_COL = 'image_path'
    LABEL_COL = 'label'
    
    print(f'\nLoaded {len(df)} images across {NUM_CLASSES} classes:')
    for idx, name in sorted(CLASS_NAMES.items()):
        count = len(df[df['label'] == idx])
        print(f'  {idx}: {name} — {count} images')

if df is not None:
    print(f'\nDataFrame ready: {df.shape[0]} rows, {df.shape[1]} columns')
    display(df.head(10))
else:
    print('Could not load dataset. Check DATA_ROOT and dataset structure above.')

Loading YOLO dataset from data/crop_weed_yolo...


FileNotFoundError: [Errno 2] No such file or directory: 'data/crop_weed_yolo'

In [16]:
if df is None:
    raise RuntimeError(
        'DataFrame not loaded. Check the output of the previous cell.\n'
        'The dataset structure may not match what this notebook expects.\n'
        'Look at the "Dataset Structure Discovery" output for clues.'
    )

# --- Normalize and validate ---
print(f'Dataset: {config["name"]}')
print(f'Format: {config["format"]}')
print(f'Rows: {len(df)}')
print(f'Columns: {list(df.columns)}')
print(f'Classes ({NUM_CLASSES}): {CLASS_NAMES}')

RuntimeError: DataFrame not loaded. Check the output of the previous cell.
The dataset structure may not match what this notebook expects.
Look at the "Dataset Structure Discovery" output for clues.

---
## 4. Class Distribution Analysis

**Why this matters:** If classes are imbalanced (some have far more examples), the model will:
- Perform well on majority classes (lots of examples to learn from)
- Perform poorly on minority classes (too few examples)
- Report misleadingly high accuracy (just by guessing the majority class)

**What to look for:**
- Are any classes severely underrepresented?
- What's the ratio between largest and smallest class?
- For YOLO: how many objects per image on average?

In [17]:
# Class distribution — adapts to dataset format

if config['format'] == 'yolo':
    # Count objects per class across all images
    all_class_ids = []
    for ids in df['class_ids']:
        all_class_ids.extend(ids)
    
    class_counts_raw = Counter(all_class_ids)
    class_counts = pd.Series({CLASS_NAMES.get(k, f'class_{k}'): v 
                              for k, v in sorted(class_counts_raw.items())})
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # Objects per class
    colors = plt.cm.Set3(np.linspace(0, 1, NUM_CLASSES))
    bars = axes[0].barh(class_counts.index, class_counts.values, color=colors)
    axes[0].set_xlabel('Number of Objects')
    axes[0].set_title('Objects per Class')
    for bar, count in zip(bars, class_counts.values):
        axes[0].text(bar.get_width() + 20, bar.get_y() + bar.get_height()/2,
                     f'{count:,}', va='center', fontsize=10)
    
    # Objects per image histogram
    axes[1].hist(df['num_objects'], bins=range(0, df['num_objects'].max() + 2),
                 color='steelblue', edgecolor='white')
    axes[1].set_xlabel('Objects per Image')
    axes[1].set_ylabel('Number of Images')
    axes[1].set_title('Objects per Image Distribution')
    
    # Pie chart
    axes[2].pie(class_counts.values, labels=class_counts.index, colors=colors,
                autopct='%1.1f%%', startangle=90, textprops={'fontsize': 10})
    axes[2].set_title('Class Proportions (by object count)')
    
    plt.suptitle(f'{config["name"]} — Class Distribution', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print(f'Total images: {len(df):,}')
    print(f'Total objects: {sum(class_counts.values):,}')
    print(f'Mean objects per image: {df["num_objects"].mean():.1f}')
    print(f'Images with no objects: {(df["num_objects"] == 0).sum()}')
    if len(class_counts) > 1:
        print(f'Imbalance ratio: {class_counts.max() / max(class_counts.min(), 1):.1f}x')

elif config['format'] == 'folder':
    class_counts = df['class_name'].value_counts().sort_index()
    
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    colors = plt.cm.Set3(np.linspace(0, 1, NUM_CLASSES))
    bars = axes[0].barh(class_counts.index, class_counts.values, color=colors[:len(class_counts)])
    axes[0].set_xlabel('Number of Images')
    axes[0].set_title('Images per Class')
    for bar, count in zip(bars, class_counts.values):
        axes[0].text(bar.get_width() + 5, bar.get_y() + bar.get_height()/2,
                     f'{count:,}', va='center', fontsize=10)
    
    axes[1].pie(class_counts.values, labels=class_counts.index, colors=colors[:len(class_counts)],
                autopct='%1.1f%%', startangle=90, textprops={'fontsize': 9})
    axes[1].set_title('Class Proportions')
    
    plt.suptitle(f'{config["name"]} — Class Distribution', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print(f'Total images: {len(df):,}')
    print(f'Largest class: {class_counts.idxmax()} ({class_counts.max():,} images)')
    print(f'Smallest class: {class_counts.idxmin()} ({class_counts.min():,} images)')
    print(f'Imbalance ratio: {class_counts.max() / class_counts.min():.1f}x')

TypeError: 'NoneType' object is not subscriptable

### Interpreting the Distribution

**For YOLO datasets:**
- If one class dominates, the detector may ignore the minority class
- Images with 0 objects are "negative" examples — useful but should not dominate
- Many objects per image = dense scenes (harder detection task)

**For folder-based datasets:**
- Imbalance ratio >5x usually requires special handling (weighted loss, oversampling)
- Very small classes (<100 images) will need heavy data augmentation
- The confusion matrix in notebook 04 will reveal if imbalance causes prediction bias

---
## 5. Sample Images

**Why?** Visual inspection tells you things statistics can't:
- Are images clear enough for a model to learn from?
- Do some classes look visually similar?
- Are there obviously mislabeled images?
- What's the image quality and resolution?

For YOLO datasets, we overlay bounding boxes on the images to verify annotation quality.

In [18]:
def find_image_path(filename):
    """Find the full path of an image file across different dataset structures."""
    p = Path(filename)
    if p.is_absolute() and p.exists():
        return p
    
    direct = DATA_ROOT / filename
    if direct.exists():
        return direct
    
    for subdir in ['images', 'train', 'data', 'img']:
        candidate = DATA_ROOT / subdir / filename
        if candidate.exists():
            return candidate
    
    for parent in DATA_ROOT.iterdir():
        if parent.is_dir():
            candidate = parent / filename
            if candidate.exists():
                return candidate
            for subdir in ['images', 'train']:
                candidate = parent / subdir / filename
                if candidate.exists():
                    return candidate
    
    matches = list(DATA_ROOT.rglob(Path(filename).name))
    if matches:
        return matches[0]
    
    return None


def draw_yolo_boxes(ax, img, boxes, class_names, colors=None):
    """Draw YOLO bounding boxes on a matplotlib axes."""
    h, w = img.shape[:2] if hasattr(img, 'shape') else img.size[::-1]
    
    if colors is None:
        cmap = plt.cm.Set1
        colors = {i: cmap(i / max(len(class_names), 1)) for i in class_names}
    
    for box in boxes:
        cx, cy, bw, bh = box['cx'], box['cy'], box['w'], box['h']
        x1 = (cx - bw / 2) * w
        y1 = (cy - bh / 2) * h
        rect_w = bw * w
        rect_h = bh * h
        
        cid = box['class_id']
        color = colors.get(cid, 'red')
        label = class_names.get(cid, f'class_{cid}')
        
        rect = patches.Rectangle((x1, y1), rect_w, rect_h,
                                  linewidth=2, edgecolor=color, facecolor='none')
        ax.add_patch(rect)
        ax.text(x1, y1 - 2, label, fontsize=7, color='white',
                bbox=dict(boxstyle='round,pad=0.2', facecolor=color, alpha=0.8))


# Test: find the first image
test_path = find_image_path(df[FNAME_COL].iloc[0])
print(f'Test image: {df[FNAME_COL].iloc[0]}')
print(f'Found at: {test_path}')
if test_path is None:
    print('\nCould not find image file. Check DATA_ROOT structure above.')

NameError: name 'FNAME_COL' is not defined

In [None]:
# Show sample images — adapts to dataset format
np.random.seed(42)

if config['format'] == 'yolo':
    # Show images WITH annotations, overlay bounding boxes
    annotated = df[df['has_annotation'] & (df['num_objects'] > 0)]
    if len(annotated) == 0:
        annotated = df[df['has_annotation']]
    
    samples = annotated.sample(n=min(12, len(annotated)), random_state=42)
    
    rows_display = min(3, (len(samples) + 3) // 4)
    cols_display = min(4, len(samples))
    fig, axes = plt.subplots(rows_display, cols_display, 
                              figsize=(4 * cols_display, 4 * rows_display))
    if rows_display == 1:
        axes = [axes] if cols_display == 1 else [axes]
    axes_flat = np.array(axes).flat
    
    for ax, (_, row) in zip(axes_flat, samples.iterrows()):
        img_path = find_image_path(row['image_path'])
        if img_path and Path(img_path).exists():
            img = Image.open(img_path).convert('RGB')
            img_arr = np.array(img)
            ax.imshow(img_arr)
            
            # Draw bounding boxes
            if row['annotation_path']:
                boxes = parse_yolo_annotation(row['annotation_path'])
                draw_yolo_boxes(ax, img_arr, boxes, CLASS_NAMES)
            
            ax.set_title(f'{row["num_objects"]} objects', fontsize=9)
        else:
            ax.text(0.5, 0.5, 'Not found', ha='center', va='center', transform=ax.transAxes)
        ax.axis('off')
    
    # Hide unused axes
    for ax in list(axes_flat)[len(samples):]:
        ax.axis('off')
    
    plt.suptitle(f'{config["name"]} — Sample Images with Bounding Boxes', 
                 fontsize=14, fontweight='bold', y=1.01)
    plt.tight_layout()
    plt.show()

elif config['format'] == 'folder':
    # Show samples per class
    SAMPLES_PER_CLASS = 5
    num_classes_display = min(NUM_CLASSES, 15)
    
    fig, axes = plt.subplots(num_classes_display, SAMPLES_PER_CLASS,
                              figsize=(3 * SAMPLES_PER_CLASS, 3 * num_classes_display))
    
    for row_idx, (class_idx, class_name) in enumerate(sorted(CLASS_NAMES.items())[:num_classes_display]):
        class_df = df[df['label'] == class_idx]
        samples = class_df.sample(n=min(SAMPLES_PER_CLASS, len(class_df)), random_state=42)
        
        for col_idx, (_, sample) in enumerate(samples.iterrows()):
            ax = axes[row_idx][col_idx] if num_classes_display > 1 else axes[col_idx]
            img_path = find_image_path(sample['image_path'])
            
            if img_path and Path(img_path).exists():
                img = Image.open(img_path)
                ax.imshow(img)
            else:
                ax.text(0.5, 0.5, 'Not found', ha='center', va='center', transform=ax.transAxes)
            
            ax.axis('off')
            if col_idx == 0:
                ax.set_title(f'{class_name}\n(n={len(class_df)})', 
                            fontsize=10, fontweight='bold', loc='left')
    
    plt.suptitle(f'{config["name"]} — Samples per Class', fontsize=14, fontweight='bold', y=1.01)
    plt.tight_layout()
    plt.show()

### Visual Observations Checklist

After looking at the samples above, answer these questions (edit this cell with your notes):

- [ ] **Can you distinguish classes visually?** For YOLO: are crop/weed boxes accurate?
- [ ] **Image quality:** Are images generally sharp and well-lit?
- [ ] **Background variation:** Do images have diverse backgrounds?
- [ ] **Annotation quality:** For YOLO: do boxes tightly fit the objects?
- [ ] **Scale variation:** Are objects shown at different sizes?
- [ ] **Edge cases:** Any images that seem mislabeled or ambiguous?

---
## 6. Image Properties Analysis

**Why this matters for training:**
- **Dimensions** — Models expect fixed input size. If images vary, we need resizing/cropping.
- **Color channels** — RGB (3 channels) vs grayscale (1 channel) affects model architecture.
- **File size** — Affects loading speed. Large files = slower data pipeline.
- **Aspect ratio** — Square images are easier to work with. Non-square requires padding or cropping.

In [None]:
# Sample a subset for analysis
ANALYSIS_SAMPLE_SIZE = 500
sample_df = df.sample(n=min(ANALYSIS_SAMPLE_SIZE, len(df)), random_state=42)

widths, heights, channels, file_sizes = [], [], [], []
errors = []

for _, row in sample_df.iterrows():
    img_path = find_image_path(row[FNAME_COL])
    if img_path and Path(img_path).exists():
        try:
            img = Image.open(img_path)
            w, h = img.size
            c = len(img.getbands())
            widths.append(w)
            heights.append(h)
            channels.append(c)
            file_sizes.append(Path(img_path).stat().st_size / 1024)  # KB
        except Exception as e:
            errors.append((row[FNAME_COL], str(e)))
    else:
        errors.append((row[FNAME_COL], 'File not found'))

print(f'Analyzed {len(widths)} images (sampled from {len(df)})\n')
print(f'Width  — min: {min(widths)}, max: {max(widths)}, unique: {len(set(widths))}')
print(f'Height — min: {min(heights)}, max: {max(heights)}, unique: {len(set(heights))}')
print(f'Channels — unique: {set(channels)} ({"RGB" if 3 in channels else "Grayscale"})')
print(f'File size — min: {min(file_sizes):.1f} KB, max: {max(file_sizes):.1f} KB, mean: {np.mean(file_sizes):.1f} KB')
print(f'Errors: {len(errors)}')

if errors:
    print(f'\nFirst 5 errors:')
    for fname, err in errors[:5]:
        print(f'  {fname}: {err}')

In [None]:
# Distribution plots
fig, axes = plt.subplots(1, 3, figsize=(16, 4))

axes[0].hist(widths, bins=30, color='steelblue', edgecolor='white')
axes[0].set_title('Image Width Distribution')
axes[0].set_xlabel('Width (px)')

axes[1].hist(heights, bins=30, color='coral', edgecolor='white')
axes[1].set_title('Image Height Distribution')
axes[1].set_xlabel('Height (px)')

axes[2].hist(file_sizes, bins=30, color='mediumseagreen', edgecolor='white')
axes[2].set_title('File Size Distribution')
axes[2].set_xlabel('Size (KB)')

plt.suptitle(f'{config["name"]} — Image Properties (sample of {len(widths)})', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

### Training Implications

**If all images are the same size:**
- No resizing needed — can feed directly to the model
- Consistent size simplifies the data pipeline

**If images vary in size (common for smartphone datasets):**
- Need a `Resize()` transform in the data pipeline
- For classification: resize to 224x224 or 384x384 (EfficientNet defaults)
- For detection: resize to 640x640 (YOLO default)
- Choose: resize (distorts aspect ratio) vs center-crop (loses edges) vs pad (adds borders)

---
## 7. Per-Class Image Statistics

**Why?** Different classes might have different visual characteristics (brightness, color).  
This tells us whether color-based augmentation (brightness, contrast jitter) could help.

In [None]:
# Compute mean pixel values per class
class_stats = {}
STATS_SAMPLE = 50  # images per class

if config['format'] == 'folder':
    # Per-class color stats (each image = one class)
    for class_idx, class_name in sorted(CLASS_NAMES.items()):
        class_df = df[df['label'] == class_idx].sample(
            n=min(STATS_SAMPLE, len(df[df['label'] == class_idx])), random_state=42
        )
        
        pixel_means = []
        pixel_stds = []
        
        for _, row in class_df.iterrows():
            img_path = find_image_path(row[FNAME_COL])
            if img_path and Path(img_path).exists():
                img = np.array(Image.open(img_path).convert('RGB')) / 255.0
                pixel_means.append(img.mean(axis=(0, 1)))
                pixel_stds.append(img.std(axis=(0, 1)))
        
        if pixel_means:
            class_stats[class_name] = {
                'mean_rgb': np.mean(pixel_means, axis=0),
                'std_rgb': np.mean(pixel_stds, axis=0),
                'brightness': np.mean([m.mean() for m in pixel_means])
            }

elif config['format'] == 'yolo':
    # For YOLO: overall color stats (images contain multiple classes)
    sample = df.sample(n=min(200, len(df)), random_state=42)
    pixel_means = []
    pixel_stds = []
    
    for _, row in sample.iterrows():
        img_path = find_image_path(row[FNAME_COL])
        if img_path and Path(img_path).exists():
            img = np.array(Image.open(img_path).convert('RGB')) / 255.0
            pixel_means.append(img.mean(axis=(0, 1)))
            pixel_stds.append(img.std(axis=(0, 1)))
    
    if pixel_means:
        class_stats['All Images'] = {
            'mean_rgb': np.mean(pixel_means, axis=0),
            'std_rgb': np.mean(pixel_stds, axis=0),
            'brightness': np.mean([m.mean() for m in pixel_means])
        }

# Display
if class_stats:
    stats_df = pd.DataFrame({
        name: {
            'R_mean': f"{s['mean_rgb'][0]:.3f}",
            'G_mean': f"{s['mean_rgb'][1]:.3f}",
            'B_mean': f"{s['mean_rgb'][2]:.3f}",
            'Brightness': f"{s['brightness']:.3f}"
        }
        for name, s in class_stats.items()
    }).T
    
    print('Per-class mean pixel values (0-1 scale):')
    display(stats_df)
else:
    print('No color statistics computed (no images found).')

In [None]:
# Visual: brightness/color comparison
if class_stats and len(class_stats) > 1:
    names = list(class_stats.keys())
    brightness = [class_stats[n]['brightness'] for n in names]
    
    fig, axes = plt.subplots(1, 2, figsize=(16, max(5, len(names) * 0.5)))
    
    bars = axes[0].barh(names, brightness, color='goldenrod')
    axes[0].set_xlabel('Mean Brightness (0-1)')
    axes[0].set_title('Average Brightness per Class')
    axes[0].set_xlim(0, 1)
    
    x = np.arange(len(names))
    width = 0.25
    r_vals = [class_stats[n]['mean_rgb'][0] for n in names]
    g_vals = [class_stats[n]['mean_rgb'][1] for n in names]
    b_vals = [class_stats[n]['mean_rgb'][2] for n in names]
    
    axes[1].barh(x - width, r_vals, width, color='red', alpha=0.7, label='Red')
    axes[1].barh(x, g_vals, width, color='green', alpha=0.7, label='Green')
    axes[1].barh(x + width, b_vals, width, color='blue', alpha=0.7, label='Blue')
    axes[1].set_yticks(x)
    axes[1].set_yticklabels(names)
    axes[1].set_xlabel('Mean Channel Value (0-1)')
    axes[1].set_title('Mean RGB Channels per Class')
    axes[1].legend()
    
    plt.suptitle(f'{config["name"]} — Per-Class Color Statistics', fontsize=13, fontweight='bold')
    plt.tight_layout()
    plt.show()
elif class_stats:
    # Single entry (YOLO) — just show RGB bars
    name = list(class_stats.keys())[0]
    s = class_stats[name]
    print(f'{name}:')
    print(f'  Mean RGB: R={s["mean_rgb"][0]:.3f}, G={s["mean_rgb"][1]:.3f}, B={s["mean_rgb"][2]:.3f}')
    print(f'  Brightness: {s["brightness"]:.3f}')
    print(f'\nNote: YOLO images contain multiple classes per image,')
    print(f'so per-class color stats are not meaningful. Overall stats shown instead.')

### What This Tells Us

- **High green channel** = images are outdoor vegetation scenes (expected for weed/crop data)
- **Similar brightness across classes** = the model can't rely on brightness alone — it must learn texture/shape features
- **If one class is much darker/brighter** = brightness augmentation will help the model generalize

**Normalization:** When training, we'll normalize images to ImageNet statistics `(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])` because we're using ImageNet-pretrained backbones.

---
## 8. Data Quality Check

Quick checks for potential issues before training:
- For YOLO: image/annotation file pairing, coordinate validation
- For folders: duplicate images, missing files, corrupted files

In [None]:
if config['format'] == 'yolo':
    # Check image-annotation pairing
    with_ann = df['has_annotation'].sum()
    without_ann = (~df['has_annotation']).sum()
    print(f'Images WITH annotation: {with_ann}')
    print(f'Images WITHOUT annotation: {without_ann}')
    
    if without_ann > 0:
        print(f'\nImages missing annotations (first 10):')
        missing_ann = df[~df['has_annotation']].head(10)
        for _, row in missing_ann.iterrows():
            print(f'  {Path(row["image_path"]).name}')
    
    # Validate YOLO coordinates are within [0, 1]
    invalid_count = 0
    checked = 0
    for _, row in df[df['has_annotation']].iterrows():
        boxes = parse_yolo_annotation(row['annotation_path'])
        for box in boxes:
            checked += 1
            if not (0 <= box['cx'] <= 1 and 0 <= box['cy'] <= 1 and
                    0 <= box['w'] <= 1 and 0 <= box['h'] <= 1):
                invalid_count += 1
    
    print(f'\nAnnotation coordinate validation:')
    print(f'  Checked: {checked} boxes')
    print(f'  Invalid (outside 0-1): {invalid_count}')
    if invalid_count == 0:
        print(f'  All coordinates valid.')
    else:
        print(f'  {invalid_count} boxes have out-of-range coordinates — may cause issues.')

elif config['format'] == 'folder':
    # Check for duplicates
    filenames = df['image_path'].apply(lambda p: Path(p).name)
    dupes = filenames[filenames.duplicated(keep=False)]
    print(f'Duplicate filenames: {dupes.nunique()} unique names appear multiple times')
    if len(dupes) > 0:
        print(f'Total duplicate entries: {len(dupes)}')
    
    # Check for missing/corrupted files (sample)
    check_sample = df.sample(n=min(200, len(df)), random_state=42)
    missing = []
    corrupted = []
    for _, row in check_sample.iterrows():
        path = find_image_path(row['image_path'])
        if path is None or not Path(path).exists():
            missing.append(row['image_path'])
        else:
            try:
                Image.open(path).verify()
            except Exception:
                corrupted.append(row['image_path'])
    
    print(f'\nMissing files (in sample of {len(check_sample)}): {len(missing)}')
    print(f'Corrupted files (in sample of {len(check_sample)}): {len(corrupted)}')
    if missing:
        print(f'Missing rate estimate: {len(missing)/len(check_sample)*100:.1f}%')

---
## 9. Dataset Summary for Training

Everything we need to know to configure the training pipeline.

In [None]:
# Generate a summary dict that we can reference in later notebooks
summary = {
    'dataset': config['name'],
    'active_key': ACTIVE_DATASET,
    'task': config['task'],
    'format': config['format'],
    'total_images': len(df),
    'num_classes': NUM_CLASSES,
    'class_names': CLASS_NAMES,
    'image_size': f'{min(widths)}x{min(heights)}' if len(set(widths)) == 1 and len(set(heights)) == 1 else 'varies',
    'channels': 3,
    'has_segmentation_masks': False,
    'platform': PLATFORM,
    'data_root': str(DATA_ROOT),
}

if config['format'] == 'yolo':
    summary['total_objects'] = int(df['num_objects'].sum())
    summary['mean_objects_per_image'] = round(df['num_objects'].mean(), 1)

print(f'=== {config["name"]} — Dataset Summary ===')
print(json.dumps({k: v for k, v in summary.items() if k != 'class_names'}, indent=2))

print(f'\n=== Training Configuration Recommendations ===')
if config['format'] == 'yolo':
    print(f'Task: Object detection')
    print(f'Input size: 640x640 (YOLO default) — resize needed if images differ')
    print(f'Format: Already in YOLO format — ready for YOLOv5/v8 training')
    print(f'Note: This is a 2-class detector. For species-level classification, use Bangladesh Rice Weed.')
elif config['format'] == 'folder':
    print(f'Task: Classification')
    print(f'Input size: 224x224 (EfficientNet default) — resize transform required')
    print(f'Normalize to: ImageNet stats (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])')
    print(f'Augmentation: RandomHorizontalFlip, RandomVerticalFlip, ColorJitter, RandomRotation')

if config['format'] == 'folder':
    imbalance = class_counts.max() / class_counts.min()
    if imbalance > 3:
        print(f'Class imbalance ({imbalance:.1f}x) — use weighted CrossEntropyLoss or oversample minority')
    else:
        print(f'Class balance OK ({imbalance:.1f}x) — standard CrossEntropyLoss should work')

---
## 10. Segmentation Dataset Survey

### The Problem

Neither the Crop & Weed Detection nor Bangladesh Rice Weed datasets include **pixel-level masks**. For our segmentation deliverable (D2: DeepLabV3+), we need a dataset where each image has a corresponding **mask** that marks exactly which pixels are "weed" and which are "background".

### What is a Segmentation Mask?

```
Original Image          Segmentation Mask
+------------------+    +------------------+
|  crop crop  soil |    |  1  1  0  0  0   |   0 = background
|  crop  soil weed |    |  1  0  0  2  2   |   1 = crop
| soil  weed weed  |    |  0  2  2  2  2   |   2 = weed
+------------------+    +------------------+
```

Each pixel in the mask has a class label. The model learns to predict this mask from the image.

### Recommended: RiceSEG

| Property | Value |
|----------|-------|
| **Images** | 3,078 |
| **Resolution** | 512x512 pixels |
| **Classes** | 6: Background, Green vegetation, Senescent vegetation, Panicle, **Weeds**, **Duckweed** |
| **Source** | 5 countries: China, Japan, India, Philippines, Tanzania |
| **Format** | Image + pixel-level mask pairs |
| **Relevance** | Rice field weeds from tropical/subtropical climates |
| **Available at** | HuggingFace (must upload to Kaggle as private dataset) |

### Why RiceSEG?

1. **Rice-field specific** — not generic agriculture, but actual rice paddy environments
2. **Weed classes** — explicitly labels weeds and duckweed at the pixel level
3. **Multi-country** — Philippines subset is closest to Indonesian conditions
4. **Proper size** — 3,078 images is enough for a POC segmentation model

### Important Notes

- Weed pixels are **sparse** (~1.6% of total pixels) due to herbicide use at collection sites
- This makes training harder — will need focal loss or heavy class weighting
- The **Philippines subset** is most relevant (closest tropical climate to Indonesia)
- See **notebook 02** for detailed RiceSEG exploration

### Getting RiceSEG for Kaggle

1. **Download** from HuggingFace (search: "RiceSEG")
2. **Create a private Kaggle Dataset:**
   - Go to kaggle.com > Datasets > New Dataset
   - Upload the extracted RiceSEG folder
   - Name it `riceseg`
   - Set visibility to **Private**
3. **Attach** to notebook 02 via "Add Data" sidebar

This is a one-time setup. Once uploaded, you can attach RiceSEG to any notebook.

---
## 11. Key Takeaways

### What We Learned

1. **Dataset explored:** See the summary above for class counts, image properties, and quality
2. **No segmentation masks** in this dataset — need RiceSEG for D2 (explored in notebook 02)
3. **Class imbalance** — check the distribution plots; may need weighted loss
4. **Image quality** — review your visual observations from section 5
5. **Format-specific notes:** YOLO annotations validated (if applicable)

### What's Next

| Next Notebook | What It Does | Dataset Needed |
|---------------|-------------|----------------|
| `02-segmentation-exploration.ipynb` | Explore RiceSEG masks and class distribution | RiceSEG (from HuggingFace) |
| `04-classification-baseline.ipynb` | Train EfficientNetV2-S classifier | Crop & Weed (YOLO) or Bangladesh Rice Weed |

**Recommended path:**
- **PATH A:** Notebook 04 (classification) — start with Crop & Weed Detection (already on Kaggle), upgrade to Bangladesh for better species-level accuracy
- **PATH B:** Notebook 02 (segmentation exploration) — explore RiceSEG before training DeepLabV3+