# PlantDoc Data Exploration (EDA)

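This notebook builds on the statistics generated by `src/data/prepare_dataset.py`. It reproduces the core visualizations and documents class imbalance and sample characteristics so that later reports can cite them.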


In [None]:
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")
DATA_DIR = Path("../data")
LOGS_DIR = Path("../outputs/logs")
FIG_DIR = Path("../outputs/figures")


In [None]:
stats_path = LOGS_DIR / "class_stats.csv"
class_stats = pd.read_csv(stats_path)
print(f"Loaded {len(class_stats)} classes from {stats_path}")
class_stats.head()


In [None]:
total_images = int(class_stats["count"].sum())
print(f"Total images accounted for: {total_images}")
# Sort explicitly so "top" and "bottom" do not depend on the row order of the CSV.
by_count = class_stats.sort_values("count", ascending=False)
cols = ["class_name", "count", "ratio"]
print("Top 5 classes by frequency:")
print(by_count.head()[cols])
print("\nBottom 5 classes by frequency:")
print(by_count.tail()[cols])


In [None]:
plt.figure(figsize=(10, 12))
sns.barplot(data=class_stats, y="class_name", x="count", palette="viridis")
plt.title("PlantDoc Class Distribution")
plt.xlabel("Image count")
plt.ylabel("Class")
plt.tight_layout()
plt.show()


In [None]:
from PIL import Image

grid_img = Image.open(FIG_DIR / "sample_grid.png")
grid_img


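## Key Observations

- Classes such as `leaf blight of corn` and `Tomato Septoria leaf spot` have markedly more samples than classes such as `bell pepper leaf` and `cherry leaf`. This long-tailed distribution needs to be addressed during training (class weights or oversampling).
- The original images come from varied sources, including plain-color backgrounds, field environments, and a range of lighting conditions. This suggests that adding color and geometric perturbations to the augmentation pipeline (ColorJitter, random rotation) could improve robustness.
- Some of the randomly sampled images still come from the original TEST subset; the split script has pooled these into train/val/test so that all images are handled uniformly. When loading data later, rely strictly on `data/splits/plantdoc_split_seed42.json` to ensure that no test images leak.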
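
One way to act on the class-imbalance observation is to derive inverse-frequency class weights from the per-class counts. The sketch below uses the common "balanced" heuristic, w_c = N / (K * n_c), where N is the total number of images, K the number of classes, and n_c the count of class c. The function name `balanced_class_weights` and the example counts are hypothetical and are not taken from the project code or from the real PlantDoc statistics.

```python
from typing import Dict, Mapping


def balanced_class_weights(counts: Mapping[str, int]) -> Dict[str, float]:
    """Return inverse-frequency weights w_c = N / (K * n_c) per class."""
    if not counts:
        raise ValueError("counts must not be empty")
    if any(n <= 0 for n in counts.values()):
        raise ValueError("every class needs at least one sample")
    total = sum(counts.values())   # N: total number of images
    num_classes = len(counts)      # K: number of classes
    return {name: total / (num_classes * n) for name, n in counts.items()}


# Hypothetical counts for illustration only, not the real PlantDoc statistics.
example_counts = {"class_a": 150, "class_b": 100, "class_c": 50}
weights = balanced_class_weights(example_counts)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

In this notebook the counts could be built with `dict(zip(class_stats["class_name"], class_stats["count"]))`. Rare classes receive weights above 1 and frequent classes weights below 1, so the result could be passed to a weighted loss; whether this works better than oversampling for PlantDoc would need to be checked empirically.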

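The leakage concern in the last observation can be checked mechanically by verifying that no image appears in more than one split. The sketch below assumes the split file maps each split name to a list of image paths; the actual layout of `plantdoc_split_seed42.json` is not shown in this notebook, so that structure, the helper name `find_split_overlaps`, and the example file names are all assumptions.

```python
from itertools import combinations
from typing import Dict, List, Mapping, Set, Tuple


def find_split_overlaps(
    splits: Mapping[str, List[str]],
) -> Dict[Tuple[str, str], Set[str]]:
    """Return the images shared by each pair of splits (empty dict if none)."""
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    for (name_a, items_a), (name_b, items_b) in combinations(splits.items(), 2):
        shared = set(items_a) & set(items_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Hypothetical file lists for illustration only; "b.jpg" is deliberately duplicated.
example_splits = {
    "train": ["a.jpg", "b.jpg"],
    "val": ["c.jpg"],
    "test": ["b.jpg", "d.jpg"],
}
overlaps = find_split_overlaps(example_splits)
print(overlaps)
```

If the real file follows the assumed layout, it could be loaded with `json.loads((DATA_DIR / "splits" / "plantdoc_split_seed42.json").read_text())` and passed to the same function; an empty result would indicate that the splits are disjoint.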