# Exploratory Data Analysis - Industrial Surface Defects

## Project Overview
This notebook performs an exploratory data analysis (EDA) on the NEU Surface Defect Dataset.  
The goal is to understand the dataset structure, identify potential issues, and prepare it for building an image classification model.

### Objectives:
1. Inspect the dataset and class distribution
2. Analyze image sizes and formats
3. Visualize sample images per defect class
4. Identify corrupted or problematic images
5. Draw conclusions to guide preprocessing and model training
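
Objectives 2 and 4 can be handled in a single pass over the files. The following is a minimal sketch, not the notebook's own implementation: `scan_images` is a hypothetical helper name, and it assumes Pillow is installed (the notebook already imports `PIL.Image`). It tallies image sizes and colour modes and collects any file Pillow cannot open or verify.

```python
from collections import Counter
from pathlib import Path

from PIL import Image


def scan_images(paths):
    """Return (size_counts, mode_counts, bad_files) for an iterable of image paths.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    size_counts, mode_counts, bad_files = Counter(), Counter(), []
    for path in paths:
        try:
            # verify() looks for a broken file without decoding the pixel data
            with Image.open(path) as img:
                img.verify()
            # An image object cannot be reused after verify(), so reopen it
            with Image.open(path) as img:
                size_counts[img.size] += 1
                mode_counts[img.mode] += 1
        except Exception as exc:  # broad on purpose: any failure marks the file as suspect
            bad_files.append((Path(path), repr(exc)))
    return size_counts, mode_counts, bad_files
```

Once `image_paths` has been built, a call such as `scan_images(p for paths in image_paths.values() for p in paths)` would cover every class. Note that `verify()` may not catch every truncated file; calling `img.load()` after reopening is a stricter but slower check.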

In [1]:
import os
from pathlib import Path
from collections import Counter
from PIL import Image
import matplotlib.pyplot as plt
import random


In [None]:
# Path to the processed dataset
DATA_DIR = Path("../data/processed/train")  # adjust path if needed

# List all class folders (sorted, because iterdir() order is not guaranteed)
classes = sorted(d.name for d in DATA_DIR.iterdir() if d.is_dir())
print("Classes found:", classes)

# Gather image paths for each class (only files matching *.jpg are picked up)
image_paths = {cls: sorted((DATA_DIR / cls).glob("*.jpg")) for cls in classes}

# Count images per class
image_counts = {cls: len(imgs) for cls, imgs in image_paths.items()}
print("Number of images per class:", image_counts)
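
Because the cell above globs only `*.jpg`, files with any other extension would be left out of the counts without warning. A small sketch for checking this, using only the standard library: `count_by_extension` is a hypothetical helper name, and the directory layout (one sub-folder per class) is taken from the cell above.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(data_dir):
    """Map each class folder name to a Counter of lowercase file suffixes.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    data_dir = Path(data_dir)
    summary = {}
    # Sort the class folders so the report order is reproducible
    for class_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        summary[class_dir.name] = Counter(
            f.suffix.lower() for f in class_dir.iterdir() if f.is_file()
        )
    return summary
```

Calling `count_by_extension(DATA_DIR)` and comparing the `.jpg` totals against `image_counts` should show whether any class folder holds files the glob pattern skipped.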
