# 🔥 ThermoSight: In-Depth Exploratory Data Analysis 🔬

Welcome to the Exploratory Data Analysis (EDA) notebook for the ThermoSight project! This notebook aims to:

-   Understand the **structure and composition** of the thermal image dataset.
-   Visualize **class distributions** to check for imbalances.
-   Analyze **image properties** (dimensions, color modes, brightness, contrast).
-   Display **sample images** from each temperature class.
-   Provide **statistical insights** to inform preprocessing and model training strategies.

Let's dive in! 🚀

In [None]:
# Imports
import os
import random
import numpy as np
import pandas as pd
from PIL import Image, ImageStat # For image properties
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings

# Setup
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid') # Using a seaborn style
sns.set_palette("husl") # Setting a color palette for seaborn

print("📚 All necessary libraries imported successfully!")

In [None]:
# Configuration & Constants
BASE_DATA_DIR = os.path.join('..', 'data')
RAW_DATA_DIR = os.path.join(BASE_DATA_DIR, 'raw')
PROCESSED_TRAIN_DIR = os.path.join(BASE_DATA_DIR, 'processed', 'train')  # Processed training split: closest to what the model sees during training

# Assuming these are your classes, adjust if different
# These should ideally be read from the directory structure if consistent
EXPECTED_CLASSES = ['200°C', '400°C', '600°C', '800°C'] 
# Define a color map for classes for consistent plotting
CLASS_COLORS = px.colors.qualitative.Plotly[:len(EXPECTED_CLASSES)] 

print(f"📁 Raw Data Directory: {RAW_DATA_DIR}")
print(f"📂 Processed Train Data Directory (for EDA): {PROCESSED_TRAIN_DIR}")
print(f"🌡️ Expected Classes: {EXPECTED_CLASSES}")

# Use PROCESSED_TRAIN_DIR for EDA as it's what the model will see after make_dataset.py
# If PROCESSED_TRAIN_DIR doesn't exist, fall back to RAW_DATA_DIR or notify user.
if os.path.exists(PROCESSED_TRAIN_DIR) and any(os.scandir(PROCESSED_TRAIN_DIR)):
    data_dir_for_eda = PROCESSED_TRAIN_DIR
    print(f"✅ Using processed train data for EDA: {data_dir_for_eda}")
elif os.path.exists(RAW_DATA_DIR) and any(os.scandir(RAW_DATA_DIR)):
    data_dir_for_eda = RAW_DATA_DIR
    print(f"⚠️ Processed train data not found or empty. Using raw data for EDA: {data_dir_for_eda}")
    print("💡 It's recommended to run `src/data/make_dataset.py` first for a more relevant EDA.")
else:
    data_dir_for_eda = None
    print("❌ CRITICAL: Both the raw and processed data directories are missing or empty. EDA cannot proceed.")
    print(f"Checked paths: {RAW_DATA_DIR}, {PROCESSED_TRAIN_DIR}")


In [None]:
# Load dataset and compute class distribution
if data_dir_for_eda:
    classes = sorted([d for d in os.listdir(data_dir_for_eda) if os.path.isdir(os.path.join(data_dir_for_eda, d)) and not d.startswith('.')])
    
    # Filter classes to match EXPECTED_CLASSES if necessary, or use discovered classes
    # For robustness, we'll use discovered classes but warn if they don't match expected.
    if set(classes) != set(EXPECTED_CLASSES) and EXPECTED_CLASSES:
        print(f"⚠️ Warning: Discovered classes {classes} do not perfectly match expected classes {EXPECTED_CLASSES}.")
        # You might want to decide how to handle this: error, use discovered, or filter.
        # For now, we proceed with discovered classes.

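    # Count non-hidden files per class; subdirectories are not counted as images.
    counts = {}
    for cls in classes:
        cls_dir = os.path.join(data_dir_for_eda, cls)
        counts[cls] = sum(
            1 for name in os.listdir(cls_dir)
            if not name.startswith('.') and os.path.isfile(os.path.join(cls_dir, name))
        )

    # No class folders (empty dict) and all-zero counts both mean there is nothing to plot.
    if sum(counts.values()) == 0:
        print("❌ No images found in the dataset directories.")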
    else:
        df_counts = pd.DataFrame(list(counts.items()), columns=['Class', 'Count']).sort_values('Class')

        # Interactive Plotly Bar Chart
        fig_bar = px.bar(df_counts, x='Class', y='Count', color='Class',
                         color_discrete_map={cls: CLASS_COLORS[i % len(CLASS_COLORS)] for i, cls in enumerate(df_counts['Class'])},
                         title='📊 Image Count per Temperature Class',
                         labels={'Count': 'Number of Images', 'Class': 'Temperature Class'},
                         text='Count')
        fig_bar.update_layout(title_x=0.5, xaxis_title_font_size=14, yaxis_title_font_size=14)
        fig_bar.update_traces(texttemplate='%{text}', textposition='outside')
        fig_bar.show()

        # Interactive Plotly Pie Chart
        fig_pie = px.pie(df_counts, values='Count', names='Class', 
                         title='🥧 Class Distribution Percentage',
                         color='Class',
                         color_discrete_map={cls: CLASS_COLORS[i % len(CLASS_COLORS)] for i, cls in enumerate(df_counts['Class'])})
        fig_pie.update_layout(title_x=0.5)
        fig_pie.update_traces(textinfo='percent+label', pull=[0.05 if i==0 else 0 for i in range(len(df_counts))]) # Pull first slice
        fig_pie.show()
        
        print("\n📋 Class Counts Summary:")
        print(df_counts.to_string(index=False))
else:
    print("Skipping class distribution analysis as data directory is not available.")

In [None]:
# Image Properties Analysis (Dimensions, Mode, Brightness, Contrast)
def get_image_properties(image_path):
    """Return (width, height, mode, brightness, contrast, aspect_ratio), or six Nones if the file cannot be read."""
    try:
        # The context manager closes the file handle even if a later step fails.
        with Image.open(image_path) as img:
            width, height = img.size
            mode = img.mode
            # Grayscale statistics: brightness = mean pixel value, contrast = std dev of pixel values
            stat = ImageStat.Stat(img.convert('L'))
        brightness = stat.mean[0]
        contrast = stat.stddev[0]
        aspect_ratio = width / height if height > 0 else 0
        return width, height, mode, brightness, contrast, aspect_ratio
    except Exception as e:
        # Report unreadable files so data-quality problems are visible in the EDA output.
        print(f"⚠️ Skipping unreadable file {image_path}: {e}")
        return None, None, None, None, None, None

if data_dir_for_eda:
    image_data = []
    print("\n🖼️ Analyzing image properties (this may take a moment for large datasets)...")
    for cls in classes:
        class_path = os.path.join(data_dir_for_eda, cls)
        for img_name in os.listdir(class_path):
            if img_name.startswith('.'): continue # Skip hidden files
            img_path = os.path.join(class_path, img_name)
            props = get_image_properties(img_path)
            if props[0] is not None: # If width is not None, successful read
                image_data.append([cls, img_name] + list(props))

    props_df = pd.DataFrame(image_data, columns=['Class', 'ImageName', 'Width', 'Height', 'Mode', 'Brightness', 'Contrast', 'AspectRatio'])

    if not props_df.empty:
        print(f"\n✅ Analyzed {len(props_df)} images.")
        print("\n🔍 Summary of Image Properties (First 5 rows):")
        print(props_df.head())

        # Plotting distributions
        fig_dims = make_subplots(rows=2, cols=2, subplot_titles=('Width Distribution', 'Height Distribution', 'Brightness Distribution', 'Contrast Distribution'))

        for i, cls in enumerate(props_df['Class'].unique()):
            cls_df = props_df[props_df['Class'] == cls]
            fig_dims.add_trace(go.Histogram(x=cls_df['Width'], name=cls, marker_color=CLASS_COLORS[i % len(CLASS_COLORS)], opacity=0.75), row=1, col=1)
            fig_dims.add_trace(go.Histogram(x=cls_df['Height'], name=cls, marker_color=CLASS_COLORS[i % len(CLASS_COLORS)], showlegend=False, opacity=0.75), row=1, col=2)
            fig_dims.add_trace(go.Box(y=cls_df['Brightness'], name=cls, marker_color=CLASS_COLORS[i % len(CLASS_COLORS)], showlegend=False), row=2, col=1)
            fig_dims.add_trace(go.Box(y=cls_df['Contrast'], name=cls, marker_color=CLASS_COLORS[i % len(CLASS_COLORS)], showlegend=False), row=2, col=2)

        fig_dims.update_layout(height=700, title_text='🔬 Image Property Distributions by Class', title_x=0.5, barmode='overlay')
        fig_dims.update_xaxes(title_text="Pixels", row=1, col=1)
        fig_dims.update_xaxes(title_text="Pixels", row=1, col=2)
        fig_dims.update_yaxes(title_text="Count", row=1, col=1)
        fig_dims.update_yaxes(title_text="Count", row=1, col=2)
        fig_dims.update_xaxes(title_text="Class", row=2, col=1)
        fig_dims.update_xaxes(title_text="Class", row=2, col=2)
        fig_dims.update_yaxes(title_text="Brightness", row=2, col=1)
        fig_dims.update_yaxes(title_text="Contrast", row=2, col=2)
        fig_dims.show()

        print("\n📊 Aggregate Statistics for Image Properties:")
        print(props_df.groupby('Class')[['Width', 'Height', 'Brightness', 'Contrast', 'AspectRatio']].agg(['mean', 'std', 'min', 'max']).round(2))
        
        print("\n🎨 Image Modes Present:")
        print(props_df['Mode'].value_counts())
    else:
        print("❌ No image properties could be analyzed.")
else:
    print("Skipping image property analysis as data directory is not available.")
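
The box plots above show outliers visually but do not name the files involved. A minimal sketch of listing them with the 1.5×IQR rule applied per class, assuming a DataFrame shaped like `props_df`. The helper `flag_iqr_outliers` and the frame `toy_df` are hypothetical names, and the toy values are invented purely for illustration:

```python
import pandas as pd

def flag_iqr_outliers(df, column, group_col="Class", k=1.5):
    """Return rows whose `column` value falls outside [Q1 - k*IQR, Q3 + k*IQR] within its group."""
    grouped = df.groupby(group_col)[column]
    # transform broadcasts each group's quartile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    mask = (df[column] < q1 - k * iqr) | (df[column] > q3 + k * iqr)
    return df[mask]

# Toy stand-in for props_df (made-up values, not real measurements)
toy_df = pd.DataFrame({
    "Class": ["200°C"] * 6 + ["400°C"] * 6,
    "ImageName": [f"img_{i}.png" for i in range(12)],
    "Brightness": [50.0, 52.0, 51.0, 49.0, 53.0, 120.0,
                   90.0, 91.0, 89.0, 92.0, 88.0, 90.0],
})

outliers = flag_iqr_outliers(toy_df, "Brightness")
print(outliers[["Class", "ImageName", "Brightness"]])
```

In the notebook itself, the same call could be tried as `flag_iqr_outliers(props_df, 'Brightness')` or with `'Contrast'`. The multiplier `k=1.5` is the conventional box-plot threshold, not a value tuned for this dataset.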

In [None]:
# Display a random sample per class
if data_dir_for_eda and not props_df.empty:
    num_samples_per_class = 4  # Number of samples to display per class
    
    # Determine grid size
    num_classes_to_display = len(props_df['Class'].unique())
    
    # Adjust figsize dynamically
    fig_width = num_samples_per_class * 4 
    fig_height = num_classes_to_display * 4
    
    # squeeze=False always returns a 2D array of axes, so axes[i, j] works for any grid shape
    fig, axes = plt.subplots(num_classes_to_display, num_samples_per_class, figsize=(fig_width, fig_height), squeeze=False)
    fig.suptitle('🖼️ Random Image Samples per Class', fontsize=20, y=1.02 if num_classes_to_display > 1 else 1.05)


    for i, cls in enumerate(props_df['Class'].unique()):
        class_images = props_df[props_df['Class'] == cls]['ImageName'].tolist()
        if not class_images:
            if num_samples_per_class > 0: # Check if axes[i,0] exists
                 axes[i,0].set_title(f"{cls} (No Images)", fontsize=12)
                 axes[i,0].axis('off')
            continue

        sample_img_names = random.sample(class_images, min(len(class_images), num_samples_per_class))
        
        for j, img_name in enumerate(sample_img_names):
            img_path = os.path.join(data_dir_for_eda, cls, img_name)
            ax = axes[i, j]
            try:
                # imshow reads the pixel data immediately, so the file can be closed right after
                with Image.open(img_path) as img:
                    ax.imshow(img)

                # Get properties for this image
                img_props_row = props_df[(props_df['Class']==cls) & (props_df['ImageName']==img_name)]
                if not img_props_row.empty:
                    w, h, mode = img_props_row[['Width', 'Height', 'Mode']].iloc[0]
                    ax.set_title(f"{cls}\n{w}x{h} ({mode})", fontsize=10)
                else:
                    ax.set_title(f"{cls}\n{img_name[:15]}...", fontsize=10) # Fallback title
            except Exception as e:
                # print(f"Error displaying {img_path}: {e}")
                ax.text(0.5, 0.5, 'Error', horizontalalignment='center', verticalalignment='center')
            ax.axis('off')
        
        # Turn off axes for any remaining subplots in the row if fewer samples than num_samples_per_class
        for j in range(len(sample_img_names), num_samples_per_class):
            axes[i, j].axis('off')

    plt.tight_layout(rect=[0, 0, 1, 0.98]) # Adjust layout to make space for suptitle
    plt.show()
else:
    print("Skipping sample image display as data or properties are not available.")

## 🔑 Key Findings & Next Steps

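The analysis above covers class distribution, image properties, and sample images for the ThermoSight dataset. Replace the italicized placeholders below with the observations from your own run.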

**Summary of Findings:**

1.  **Class Distribution:**
    *   *(Summarize balance/imbalance observed from charts, e.g., "Classes appear relatively balanced/imbalanced, with X class having the most/least samples.")*
2.  **Image Properties:**
    *   **Dimensions:** *(e.g., "Most images are around WxH pixels, with some variations.")*
    *   **Modes:** *(e.g., "The predominant image mode is RGB.")*
    *   **Brightness/Contrast:** *(e.g., "Class X tends to have higher/lower brightness/contrast.")*
3.  **Data Quality:**
    *   *(e.g., "No major issues like corrupted images were found during property analysis / Some images could not be read.")*

**Recommendations & Considerations for Next Steps:**

*   **Data Augmentation:** Choose augmentations based on the variation observed above. For example, if brightness or contrast varies widely but should not discriminate between classes, brightness/contrast augmentation can keep the model from relying on it.
*   **Preprocessing:**
    *   Ensure all images are converted to a consistent mode (e.g., RGB) if not already.
    *   Normalize pixel values consistently across training, validation, and inference (e.g., scale to [0, 1] or standardize per channel).
*   **Class Imbalance:** If significant imbalance exists, consider techniques like weighted loss, oversampling, or undersampling during training.
*   **Outliers:** Investigate any extreme outliers in image properties if they seem erroneous.
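
For the class imbalance point above, one common starting point for a weighted loss is inverse class frequency, `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn applies for `class_weight='balanced'`. A minimal sketch, assuming labels are available as a flat list. The helper `inverse_frequency_weights` is a hypothetical name, and the label counts below are made up for illustration, not taken from the ThermoSight dataset:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical, deliberately imbalanced label list
labels = ["200°C"] * 100 + ["400°C"] * 50 + ["600°C"] * 25 + ["800°C"] * 25
weights = inverse_frequency_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.2f}")
```

Rarer classes receive proportionally larger weights, and a perfectly balanced dataset yields a weight of 1.0 for every class. Whether weighting, oversampling, or undersampling works best still has to be checked on validation data.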

This EDA forms a solid foundation for the subsequent stages of data preprocessing, model training, and evaluation.