# Wildfire Spread: Exploratory Data Analysis Report

This notebook conducts a multi-stage Exploratory Data Analysis (EDA) of the wildfire dataset. The structure follows a formal methodology that separates pure observation (Section 1) from intervention (Section 2), so that every change to the data can be traced and reproduced.

## Section 1: Baseline Data Characterization & Integrity Assessment (EDA Level 0)

**Objective:** To establish a baseline, objective understanding of the dataset as received, covering its structure, content, and quality, prior to any modifications. This stage is purely observational and constitutes an immutable record of the data's initial state.

In [None]:
import h5py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from tqdm import tqdm

# Adjust display options for pandas
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 120)

# Define the feature names based on the final documentation (0-indexed)
FEATURE_NAMES = [
    'VIIRS_M11', 'VIIRS_I2', 'VIIRS_I1', 'NDVI', 'EVI2', # 0-4
    'Total_Precip', 'Wind_Speed', 'Wind_Direction', 'Min_Temp_K', 'Max_Temp_K', # 5-9
    'ERC', 'Spec_Hum', 'Slope', 'Aspect', 'Elevation', # 10-14
    'Landcover', 'Forecast_Precip', 'Forecast_Wind_Speed', 'Forecast_Wind_Dir', # 15-18
    'Forecast_Temp_C', 'Forecast_Spec_Hum', 'Active_Fire' # 19-21
]

# For this EDA, we will load a single file and flatten it for analysis.
# This approach is suitable for understanding distributions and initial quality.
file_path = '../data/processed/2020/fire_24462610.hdf5'

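def load_and_flatten_h5(file_path, feature_names):
    """Loads one HDF5 file and flattens it into a 2D pandas DataFrame.

    Assumes the 'data' array has layout (T, C, H, W): time, channel, height, width.
    """
    with h5py.File(file_path, 'r') as f:
        data = f['data'][:]  # Load the full array into memory
    t, c, h, w = data.shape
    if c != len(feature_names):
        raise ValueError(f"Expected {len(feature_names)} channels, found {c}.")
    # Move channels to the last axis, then flatten to (T*H*W, C)
    data_reshaped = data.transpose(0, 2, 3, 1).reshape(-1, c)
    return pd.DataFrame(data_reshaped, columns=feature_names)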

df_raw = load_and_flatten_h5(file_path, FEATURE_NAMES)

print("--- 1.1 Data Import & Structural Overview ---")
print(f"Dataset Dimensions: {df_raw.shape[0]} observations (pixels) and {df_raw.shape[1]} variables (channels).")
print("\nFirst 5 rows:")
display(df_raw.head())
print("\nLast 5 rows:")
display(df_raw.tail())
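
The flattening step depends on moving the channel axis to the end before reshaping. The following is a minimal sketch on a synthetic array, assuming the same (T, C, H, W) layout that `load_and_flatten_h5` unpacks. The array sizes and the variable names (`cube`, `flat`, `naive`) are illustrative only and do not come from the dataset.

```python
import numpy as np

# Tiny synthetic cube with the assumed (T, C, H, W) layout. Every value in
# channel k equals k, so a correct flattening yields columns of constant k.
t, c, h, w = 2, 3, 4, 5
cube = np.broadcast_to(np.arange(c).reshape(1, c, 1, 1), (t, c, h, w)).copy()

# Same operation as load_and_flatten_h5: channels last, then flatten.
flat = cube.transpose(0, 2, 3, 1).reshape(-1, c)

# A reshape without the transpose mixes channels within each column.
naive = cube.reshape(-1, c)

channels_intact = bool((flat == np.arange(c)).all())
naive_intact = bool((naive == np.arange(c)).all())
print(flat.shape, channels_intact, naive_intact)
```

If the real files used a different axis order, the transpose arguments would need to change accordingly.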

### 1.2 Variable Identification & Data Type Validation

**Rationale:** Incorrect data types are a primary source of errors. This step ensures that each variable's data type is consistent with its real-world meaning.

In [None]:
print("--- 1.2 Variable Data Types ---")
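# DataFrame.info() prints its report directly and returns None,
# so it is called without wrapping it in print().
df_raw.info()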
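
Because every channel is stored in a single numeric array, `info()` is likely to report one dtype for all columns, which says little about each variable's real-world meaning. One possible complement, sketched here under that assumption, is to flag float columns that contain only whole numbers, since these are candidates for categorical or binary codes. The helper name `integer_valued_columns` and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def integer_valued_columns(df: pd.DataFrame) -> list[str]:
    """Return float columns whose non-missing values are all whole numbers."""
    result = []
    for col in df.select_dtypes(include="floating").columns:
        values = df[col].dropna().to_numpy()
        if values.size and np.all(values == np.round(values)):
            result.append(col)
    return result

# Synthetic demo data, not taken from the wildfire files.
demo = pd.DataFrame({
    "Landcover": [1.0, 7.0, 7.0, 12.0],
    "Active_Fire": [0.0, 0.0, 1.0, np.nan],
    "NDVI": [0.12, 0.53, 0.48, 0.91],
})
candidates = integer_valued_columns(demo)
print(candidates)
```

On the real data this would be called as `integer_valued_columns(df_raw)` and the result compared against the variables documented as categorical.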

### 1.3 Preliminary Data Health Check: Macro-Structural Issues

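**Rationale:** Duplicate rows can bias results, and structural errors such as invalid category codes undermine data integrity. Because this stage is purely observational, both are only counted and reported here; any removal or correction is deferred to Section 2.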

In [None]:
print("--- 1.3 Data Health Check ---")
duplicate_rows = df_raw.duplicated().sum()
print(f"Number of duplicate rows found: {duplicate_rows}")

# For structural errors, we'll focus on the 'Landcover' categorical variable
print("Unique values in 'Landcover' (channel index 15, 0-indexed):")
print(sorted(df_raw['Landcover'].unique()))
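
Listing the unique values is more informative when they are compared against the set of codes the variable is allowed to take. The sketch below assumes, without confirmation from the dataset documentation, that `Landcover` uses the 17 IGBP classes coded 1 to 17. The constant `ASSUMED_VALID_LANDCOVER` and the helper `unexpected_codes` are hypothetical.

```python
import pandas as pd

# ASSUMPTION: valid codes are the 17 IGBP classes (1-17). Adjust this set
# to the classification scheme stated in the dataset documentation.
ASSUMED_VALID_LANDCOVER = set(range(1, 18))

def unexpected_codes(series: pd.Series, valid: set) -> pd.Series:
    """Count occurrences of values outside `valid` (NaN is counted too)."""
    mask = ~series.isin(valid)
    return series[mask].value_counts(dropna=False).sort_index()

# Synthetic demo values, not taken from the wildfire files.
demo = pd.Series([1.0, 7.0, 7.0, 0.0, 255.0, 255.0], name="Landcover")
bad = unexpected_codes(demo, ASSUMED_VALID_LANDCOVER)
print(bad.to_dict())
```

An empty result would indicate that all observed codes fall within the assumed scheme.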

### 1.4 Initial Global Descriptive Summary & Data Dictionary

**Rationale:** This provides the first quantitative insight into the dataset's scale, range, and distributions, forming the final step of the 'pure understanding' phase.

In [None]:
print("--- 1.4.1 Descriptive Summary for Numerical Variables ---")
display(df_raw.describe().T)

print("\n--- 1.4.2 Frequency Distribution for Categorical Variables ---")
print("Value counts for 'Landcover':")
display(df_raw['Landcover'].value_counts(normalize=True).sort_index())

#### Table 1: Data Dictionary & Preliminary Assessment

This table serves as the foundational reference document for the entire analysis.

In [None]:
def create_data_dictionary(df):
    """Generates a data dictionary DataFrame from the raw data."""
    dict_data = {
        'Variable Name': df.columns,
        'Original Dtype': [str(t) for t in df.dtypes],
        'Missing Values': df.isnull().sum().values,
        'Missing Percentage': (df.isnull().sum() / len(df) * 100).values
    }
    data_dict_df = pd.DataFrame(dict_data)
    
    # Add corrected analysis type and description based on documentation
    analysis_types = [
        'Numerical-Continuous', 'Numerical-Continuous', 'Numerical-Continuous', 'Numerical-Continuous', 'Numerical-Continuous',
        'Numerical-Continuous', 'Numerical-Continuous', 'Numerical-Cyclical', 'Numerical-Continuous (K)', 'Numerical-Continuous (K)',
        'Numerical-Continuous', 'Numerical-Continuous', 'Numerical-Continuous', 'Numerical-Cyclical', 'Numerical-Continuous',
        'Categorical-Nominal', 'Numerical-Continuous', 'Numerical-Continuous', 'Numerical-Special', 
        'Numerical-Continuous (C)', 'Numerical-Continuous', 'Categorical-Binary'
    ]
    descriptions = [
        'VIIRS Surface Reflectance Band M11', 'VIIRS Surface Reflectance Band I2', 'VIIRS Surface Reflectance Band I1',
        'Normalized Difference Vegetation Index', 'Enhanced Vegetation Index 2',
        'Total Daily Precipitation', 'Daily Wind Speed', 'Daily Wind Direction (degrees)',
        'Minimum Daily Temperature (Kelvin)', 'Maximum Daily Temperature (Kelvin)',
        'Energy Release Component', 'Specific Humidity', 'Topographical Slope (degrees)',
        'Topographical Aspect (degrees)', 'Elevation (meters)',
        'MODIS Land Cover Class', '24h Forecast Precipitation', '24h Forecast Wind Speed',
        '24h Forecast Wind Direction (Vector Component)', '24h Forecast Temperature (Celsius)',
        '24h Forecast Specific Humidity', 'Active Fire Mask'
    ]
    data_dict_df['Analysis Type'] = analysis_types
    data_dict_df['Description'] = descriptions
    
    return data_dict_df[['Variable Name', 'Original Dtype', 'Analysis Type', 'Description', 'Missing Values', 'Missing Percentage']]

data_dictionary = create_data_dictionary(df_raw)
print("--- Table 1: Data Dictionary & Preliminary Assessment ---")
display(data_dictionary)

## Section 2: In-Depth Data Cleaning & Preprocessing (EDA Level 1)

**Objective:** To systematically address the data quality issues identified in Section 1. This is the 'intervention' phase, transforming the raw data into an analysis-ready format. Every decision and its rationale must be meticulously documented.

### 2.1 Systematic Management of Missing Values

**Rationale:** Understanding the pattern of missingness is more critical than just knowing the count. This diagnosis informs the selection of an appropriate imputation strategy.

In [None]:
# Note: The current dataset from one file shows no missing values.
# The following code is a template for how one would proceed if NaNs were present.
print("--- 2.1.1 Missing Value Pattern Diagnosis ---")

if df_raw.isnull().sum().sum() > 0:
    plt.figure(figsize=(12, 8))
    sns.heatmap(df_raw.isnull(), cbar=False, cmap='viridis')
    plt.title('Missing Value Heatmap')
    plt.show()
else:
    print("No missing values found in the sample file. Skipping heatmap.")

print("\n--- 2.1.2 Imputation Strategy ---")
print("Rationale: For this dataset, the primary concern would be NaN values in the Active Fire channel, which should be imputed with 0. For other features, median imputation is a robust choice for skewed distributions, while mean is suitable for symmetric ones. A more advanced approach like KNN imputation could be used for higher accuracy.")

# Create a copy for cleaning
df_cleaned = df_raw.copy()

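# Example imputation (if needed). Assign the result back instead of using
# inplace=True on a column selection, which may not modify df_cleaned
# in recent pandas versions.
# median_val = df_cleaned['Some_Variable'].median()
# df_cleaned['Some_Variable'] = df_cleaned['Some_Variable'].fillna(median_val)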
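
The imputation strategy described above can be sketched as follows: missing values in the fire mask become 0, and every other column is filled with its own median. This is a minimal illustration on synthetic data, not the notebook's final procedure. The function name `impute_sketch` is hypothetical, and categorical columns such as `Landcover` would need a different rule (for example the mode) than the median used here.

```python
import numpy as np
import pandas as pd

def impute_sketch(df: pd.DataFrame, target: str = "Active_Fire") -> pd.DataFrame:
    """Fill NaNs with 0 in the fire mask and with the column median elsewhere."""
    out = df.copy()  # leave the input untouched
    if target in out.columns:
        out[target] = out[target].fillna(0)
    others = [col for col in out.columns if col != target]
    # fillna with a Series of medians fills each column with its own median.
    out[others] = out[others].fillna(out[others].median())
    return out

# Synthetic demo data, not taken from the wildfire files.
demo = pd.DataFrame({
    "Wind_Speed": [2.0, np.nan, 4.0, 100.0],
    "Active_Fire": [np.nan, 1.0, np.nan, 0.0],
})
imputed = impute_sketch(demo)
print(imputed)
```

The median is used rather than the mean because a single extreme value (100.0 in the demo) would otherwise pull the imputed value far from the typical range.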

### 2.2 Outlier Detection & Management Strategy

**Rationale:** Using multiple methods (visual and statistical) provides a more robust identification of potential outliers, prompting careful investigation rather than automatic removal.

In [None]:
print("--- 2.2.1 Univariate Outlier Identification ---")
# We will visualize a few key continuous variables for outliers
outlier_check_vars = ['Max_Temp_K', 'Wind_Speed', 'NDVI', 'Elevation']

plt.figure(figsize=(15, 10))
for i, var in enumerate(outlier_check_vars):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(y=df_cleaned[var])
    plt.title(f'Box Plot of {var}')
plt.tight_layout()
plt.show()

print("\n--- 2.2.2 Handling Philosophy ---")
print("Rationale: Outliers are not always errors; they can be genuine extreme values. For this dataset, outliers in temperature or wind speed are likely real weather events and should be kept. Outliers in sensor data (e.g., VIIRS, NDVI) might be due to atmospheric interference and could be capped or transformed. A log transform is a common strategy for right-skewed data with extreme positive outliers.")
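
The box plots give a visual impression only. A statistical counterpart, sketched here as one possible option, is the interquartile-range (IQR) rule that the box plot whiskers are themselves based on. The helper name `iqr_outlier_mask` and the multiplier `k=1.5` are conventional choices assumed for illustration; the notebook does not prescribe them.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Synthetic demo values, not taken from the wildfire files.
demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0], name="Wind_Speed")
mask = iqr_outlier_mask(demo)
print(f"{int(mask.sum())} of {len(demo)} values flagged")
```

In line with the handling philosophy above, the mask only flags candidates for inspection; it does not remove anything. On the real data it could be applied per variable, for example `iqr_outlier_mask(df_cleaned['Max_Temp_K']).mean()` to obtain the flagged fraction.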

#### Table 2: Data Processing Strategy Log

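This table serves as the official logbook of the data cleaning process. It records each identified issue, the strategy chosen to address it, and the rationale, so that every planned modification to the raw data is documented.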

In [None]:
processing_log_data = {
    'Variable Name': ['Forecast_Temp_C', 'Wind_Direction', 'Aspect', 'Landcover', 'Active_Fire'],
    'Identified Issue': ['Unit Mismatch (Celsius)', 'Cyclical Nature', 'Cyclical Nature', 'Categorical Nature', 'Target Variable'],
    'Chosen Strategy': ['Convert to Kelvin', 'Sine/Cosine Encoding', 'Sine/Cosine Encoding', 'One-Hot Encoding', 'Separate as Target'],
    'Rationale': [
        'To align with historical temperature units (Kelvin) for model consistency.',
        'To represent the cyclical continuity of degrees (360 is equivalent to 0); a sine alone maps distinct angles such as 30 and 150 to the same value.',
        'To represent the cyclical continuity of degrees, as for Wind_Direction.',
        'To convert nominal categories into a numerical format suitable for modeling without implying order.',
        'This is the variable to be predicted, not an input feature for the model itself.'
    ]
}
processing_log_df = pd.DataFrame(processing_log_data)

print("--- Table 2: Data Processing Strategy Log ---")
display(processing_log_df)
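
The strategies in Table 2 can be sketched in code as shown below. This is an illustration on synthetic data, not the notebook's final preprocessing. The function name `apply_strategies` and the output column names (`Forecast_Temp_K`, `*_sin`, `*_cos`) are hypothetical. The sketch pairs each sine with a cosine, because a sine on its own maps distinct angles such as 30 and 150 degrees to the same value, and it assumes `Landcover` contains no missing values so that it can be cast to integer codes.

```python
import numpy as np
import pandas as pd

def apply_strategies(df: pd.DataFrame):
    """Apply the Table 2 strategies and return (features, target)."""
    out = df.copy()
    # Unit alignment: Celsius to Kelvin.
    out["Forecast_Temp_K"] = out.pop("Forecast_Temp_C") + 273.15
    # Cyclical variables: encode degrees as a (sin, cos) pair.
    for col in ("Wind_Direction", "Aspect"):
        radians = np.deg2rad(out.pop(col))
        out[f"{col}_sin"] = np.sin(radians)
        out[f"{col}_cos"] = np.cos(radians)
    # Nominal variable: one-hot encode integer class codes (assumes no NaN).
    out["Landcover"] = out["Landcover"].astype(int)
    out = pd.get_dummies(out, columns=["Landcover"], prefix="Landcover")
    # Target: separate from the input features.
    y = out.pop("Active_Fire")
    return out, y

# Synthetic demo data, not taken from the wildfire files.
demo = pd.DataFrame({
    "Forecast_Temp_C": [0.0, 25.0],
    "Wind_Direction": [0.0, 360.0],
    "Aspect": [90.0, 270.0],
    "Landcover": [1.0, 7.0],
    "Active_Fire": [0, 1],
})
X, y = apply_strategies(demo)
print(X.columns.tolist())
```

In the demo, wind directions of 0 and 360 degrees receive the same encoding up to floating-point error, which is the continuity property the table's rationale asks for.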