# Dataset Exploration - PlantVillage
## Solinfitec Solix - Disease Detection System

**Objective**: Exploratory analysis of the PlantVillage dataset (15 classes, ~20K images)

**Key Questions**:
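- How are images distributed across classes, and how severe is the imbalance?
- What are the image dimensions and overall image quality?
- What per-channel mean and standard deviation should be used for normalization?
- Which minority classes need an augmentation boost?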

## 1. Setup and Imports

In [None]:
import sys
sys.path.insert(0, '../..')

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from PIL import Image
from collections import Counter

from src.data.preprocessing import DataPreprocessor
from src.visualization.dataset_plots import plot_class_distribution, plot_sample_grid, plot_image_size_distribution

plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

DATA_DIR = Path('../../data/raw')
print(f"Data directory: {DATA_DIR.resolve()}")
print(f"Exists: {DATA_DIR.exists()}")

## 2. Load Data Paths

In [None]:
# Initialize preprocessor
preprocessor = DataPreprocessor(str(DATA_DIR), skip_nested="PlantVillage/PlantVillage")

# Get valid class directories (skipping nested duplicate)
class_dirs = preprocessor.get_valid_class_dirs()
print(f"Found {len(class_dirs)} valid classes:")
for d in class_dirs:
    print(f"  - {d.name}")
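
For readers without access to `DataPreprocessor`, the directory discovery above can be approximated as follows. This is a minimal sketch, not the project's implementation: the function name `list_class_dirs`, the `IMAGE_EXTS` set, and the rule "a class directory is any directory that directly contains image files" are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; the real preprocessor may accept others.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def list_class_dirs(root, skip_nested=None):
    """Return directories under `root` that directly contain image files.

    `skip_nested` is a path relative to `root`; that directory and everything
    below it are ignored, so a nested duplicate copy of the dataset is not
    counted a second time.
    """
    root = Path(root)
    skip = (root / skip_nested) if skip_nested else None
    found = []
    for d in sorted(p for p in root.rglob("*") if p.is_dir()):
        # Skip the nested duplicate itself and any directory inside it.
        if skip is not None and (d == skip or skip in d.parents):
            continue
        has_images = any(
            f.is_file() and f.suffix.lower() in IMAGE_EXTS for f in d.iterdir()
        )
        if has_images:
            found.append(d)
    return found
```

Calling `list_class_dirs(DATA_DIR, skip_nested="PlantVillage/PlantVillage")` would then return one `Path` per class folder, in sorted order.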

## 3. Class Distribution Analysis

In [None]:
# Class distribution
class_counts = preprocessor.get_class_counts()
total = sum(class_counts.values())
print(f"Total images: {total}")
print("\nClass distribution:")
for name, count in sorted(class_counts.items(), key=lambda x: x[1], reverse=True):
    pct = count / total * 100
    print(f"  {name}: {count} ({pct:.1f}%)")

# Imbalance ratio
max_count = max(class_counts.values())
min_count = min(class_counts.values())
print(f"\nImbalance ratio: {max_count/min_count:.1f}x")
print(f"Largest class: {max(class_counts, key=class_counts.get)} ({max_count})")
print(f"Smallest class: {min(class_counts, key=class_counts.get)} ({min_count})")

# Plot
plot_class_distribution(class_counts, title="PlantVillage Class Distribution",
                       save_path="../../reports/class_distribution.png")
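
The counts above can also be turned into weights for rebalancing. The sketch below uses the common inverse-frequency heuristic `total / (n_classes * count)`; the helper names and the example counts are illustrative assumptions, not values measured from PlantVillage or functions from this repository.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weight per class: total / (n_classes * count).

    Classes with a zero count are dropped to avoid division by zero.
    """
    counts = {name: c for name, c in class_counts.items() if c > 0}
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * c) for name, c in counts.items()}


def per_sample_weights(labels, class_counts):
    """One weight per sample (1 / class size), so that sampling with these
    weights draws every class with roughly equal probability."""
    return [1.0 / class_counts[label] for label in labels]


# Illustrative counts only, not the real dataset statistics.
example_counts = {"class_a": 300, "class_b": 100}
weights = balanced_class_weights(example_counts)
for name, w in weights.items():
    print(f"{name}: weight {w:.3f}")  # the rarer class gets the larger weight
```

Assuming a PyTorch training loop, the per-sample list is the kind of input `torch.utils.data.WeightedRandomSampler` expects, while the per-class dictionary could serve as a starting point for loss weights.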

## 4. Sample Visualization

In [None]:
# Sample images from each class
import random
random.seed(42)

fig, axes = plt.subplots(3, 5, figsize=(18, 12))
axes = axes.flatten()

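for ax, class_dir in zip(axes, class_dirs):
    # Match .jpg/.JPG once per file (two globs can double-count on
    # case-insensitive filesystems) and sort so the seeded choice is reproducible
    images = sorted(p for p in class_dir.iterdir() if p.suffix.lower() == ".jpg")
    if images:
        # Close the file handle as soon as the pixels are loaded
        with Image.open(random.choice(images)) as img:
            ax.imshow(img.convert("RGB"))
        ax.set_title(class_dir.name.replace("_", "\n"), fontsize=7)

# Hide every axis, including grid cells left empty when there are fewer than 15 classes
for ax in axes:
    ax.axis("off")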

plt.suptitle("Sample Images per Class", fontsize=14)
plt.tight_layout()
plt.savefig("../../reports/sample_images.png", dpi=150, bbox_inches="tight")
plt.show()

## 5. Image Dimension Analysis

In [None]:
# Image size distribution
sizes = preprocessor.get_image_sizes(sample_size=500)
widths = [s[0] for s in sizes]
heights = [s[1] for s in sizes]

print(f"Width  - mean: {np.mean(widths):.0f}, std: {np.std(widths):.0f}, "
      f"range: [{min(widths)}, {max(widths)}]")
print(f"Height - mean: {np.mean(heights):.0f}, std: {np.std(heights):.0f}, "
      f"range: [{min(heights)}, {max(heights)}]")

# Aspect ratios
aspects = [w/h for w, h in sizes]
print(f"Aspect ratio - mean: {np.mean(aspects):.3f}, std: {np.std(aspects):.3f}")

plot_image_size_distribution(sizes, save_path="../../reports/image_sizes.png")
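
Mean and standard deviation can hide whether one resolution dominates. A quick tally of exact `(width, height)` pairs makes that visible. The helper below is a hypothetical sketch, and the example list is made-up input rather than the sampled `sizes`.

```python
from collections import Counter


def summarize_sizes(sizes, top=3):
    """Return the `top` most frequent (width, height) pairs as
    (size, count, share_of_sample) tuples."""
    if not sizes:
        return []
    tally = Counter(tuple(s) for s in sizes)
    n = len(sizes)
    return [(size, count, count / n) for size, count in tally.most_common(top)]


# Illustrative input only; pass the real `sizes` list in the notebook.
example_sizes = [(256, 256)] * 8 + [(512, 384), (300, 200)]
for (w, h), count, share in summarize_sizes(example_sizes):
    print(f"{w}x{h}: {count} images ({share:.0%})")
```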

## 6. Color Channel Analysis

In [None]:
# Channel statistics (for normalization)
print("Computing channel mean/std (this may take a minute)...")
mean, std = preprocessor.compute_channel_stats(sample_size=2000)

print(f"\nChannel Mean: {mean}")
print(f"Channel Std:  {std}")
print("\nImageNet default: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]")

# Minority classes identification
from src.features.augmentation import get_minority_classes
minorities = get_minority_classes(class_counts, threshold=500)
print(f"\nMinority classes (<500 samples): {minorities}")
print("These will receive a 3x augmentation boost during training.")
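
The internals of `compute_channel_stats` are not shown here. One plausible way to obtain per-channel statistics is to accumulate sums and squared sums over all pixels, as sketched below. The function name `channel_stats` and the assumption that images arrive as HxWx3 `uint8` arrays are illustrative. Note that this computes the standard deviation over all pixels pooled together, which differs from averaging per-image standard deviations.

```python
import numpy as np


def channel_stats(images):
    """Per-channel mean and std over an iterable of HxWx3 uint8 arrays,
    with pixel values scaled to [0, 1]."""
    pixel_sum = np.zeros(3, dtype=np.float64)
    pixel_sq_sum = np.zeros(3, dtype=np.float64)
    n_pixels = 0
    for img in images:
        # Scale to [0, 1] and flatten to (num_pixels, 3)
        flat = (np.asarray(img, dtype=np.float64) / 255.0).reshape(-1, 3)
        pixel_sum += flat.sum(axis=0)
        pixel_sq_sum += (flat ** 2).sum(axis=0)
        n_pixels += flat.shape[0]
    if n_pixels == 0:
        raise ValueError("no images provided")
    mean = pixel_sum / n_pixels
    # Var[x] = E[x^2] - E[x]^2; clip tiny negatives caused by rounding
    var = np.maximum(pixel_sq_sum / n_pixels - mean ** 2, 0.0)
    return mean, np.sqrt(var)
```

Because only running sums are kept, memory use stays constant regardless of how many images are sampled.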

## Conclusions

### Key Findings:
- **15 classes** across 3 crops (Pepper, Potato, Tomato), covering diseased and healthy variants
- **High class imbalance**: Tomato_YellowLeaf_Curl_Virus (~3200 images) vs Potato_healthy (~150 images)
- **Minority classes** (<500 images): Potato_healthy and Tomato_mosaic_virus need an augmentation boost
- **Images**: Mostly 256x256 RGB, suitable for 224x224 Swin input

### Strategy:
- WeightedRandomSampler for balanced training batches
- 3x augmentation multiplier for minority classes
- FocalLoss with per-class alpha weights
- Stratified train/val/test split (70/15/15)
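
To make the last item concrete, a 70/15/15 stratified split can be sketched with the standard library alone. The function name, the default seed, and the rounding rule are assumptions for illustration. scikit-learn's `train_test_split` with its `stratify` argument, applied twice, is a common alternative.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split sample indices into train/val/test lists, class by class,
    so each split keeps roughly the same class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        # Whatever remains goes to test; very small classes may end up
        # with an empty test portion.
        test += idxs[n_train + n_val:]
    return train, val, test
```

Splitting per class matters most for the minority classes: a purely random split could leave a class with about 150 images underrepresented in validation or test.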