# 01: Exploratory Data Analysis

This notebook focuses on understanding the raw dataset, its metadata, and the properties of the processed audio files. It **does not** perform any new feature extraction or training.

We will:
1.  Load and analyze the metadata (`train.csv`).
2.  Check the distribution of speakers and accents.
3.  Load a sample of *processed* audio files (`data/processed/`) to analyze their duration and sample rate.
4.  Visualize a sample waveform and Mel Spectrogram to confirm data quality.

In [None]:
import pandas as pd
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from tqdm.notebook import tqdm
import warnings

# Optionally import project helpers from ../src (nothing below requires them)
import sys
sys.path.append('../src')
try:
    import utils
except ImportError:
    utils = None
    print("src/utils.py not found. Using standard plots.")

# --- Configuration ---
sns.set_style("whitegrid")
warnings.filterwarnings('ignore')

RAW_DATA_DIR = Path("../data/raw")
PROCESSED_DATA_DIR = Path("../data/processed")
REPORTS_PLOTS_DIR = Path("../reports/plots")

# Ensure plots directory exists
REPORTS_PLOTS_DIR.mkdir(parents=True, exist_ok=True)

## 1. Load Metadata

First, let's load the metadata file to see what information we have about each audio clip. We'll assume the main metadata file is `train.csv`.

In [None]:
metadata_path = RAW_DATA_DIR / "train.csv"

if not metadata_path.exists():
    print(f"Error: Metadata file not found at {metadata_path}")
    # Create a dummy dataframe if file not found, to allow notebook to run
    df = pd.DataFrame(columns=['file_path', 'speaker', 'accent', 'duration_sec'])
else:
    df = pd.read_csv(metadata_path)

print(f"Loaded metadata with {len(df)} entries.")
df.head()
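The later cells branch on whether `speaker` and `accent` exist, so it can help to report the schema up front. The following is a minimal sketch, assuming the column names used in this notebook; the helper name `check_columns` is hypothetical and not part of `src/utils.py`.

```python
import pandas as pd

def check_columns(frame: pd.DataFrame,
                  required=("file_path",),
                  optional=("speaker", "accent")):
    """Return (missing_required, missing_optional) column names."""
    missing_required = [c for c in required if c not in frame.columns]
    missing_optional = [c for c in optional if c not in frame.columns]
    return missing_required, missing_optional

# Tiny illustrative frame (not real data): has a speaker column but no accent column
demo = pd.DataFrame({"file_path": ["a.wav"], "speaker": ["s1"]})
missing_required, missing_optional = check_columns(demo)
print("Missing required:", missing_required)
print("Missing optional:", missing_optional)
```

In the notebook itself this would be called as `check_columns(df)` right after loading the CSV.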

### 1.1. Analyze Metadata Distribution

Let's check the distribution of key columns, like `speaker` or `accent` (if they exist).

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isna().sum())

# Check unique values for categorical columns
if 'speaker' in df.columns:
    print(f"\nTotal unique speakers: {df['speaker'].nunique()}")
    
if 'accent' in df.columns:
    print(f"Total unique accents: {df['accent'].nunique()}")
    
    # Plot accent distribution
    plt.figure(figsize=(12, 6))
    sns.countplot(y=df['accent'], order=df['accent'].value_counts().index)
    plt.title("Distribution of Accents in the Dataset")
    plt.xlabel("Number of Samples")
    plt.ylabel("Accent")
    plt.savefig(REPORTS_PLOTS_DIR / "01_accent_distribution.png")
    plt.show()
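A count plot shows the shape of the distribution, but a small table of shares makes class imbalance easier to quote. The sketch below assumes a categorical column such as `accent`; the function name `class_shares` and the demo labels are illustrative only.

```python
import pandas as pd

def class_shares(labels: pd.Series) -> pd.DataFrame:
    """Count each label and express it as a share of all non-missing labels."""
    counts = labels.value_counts()   # NaN labels are dropped by default
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})

# Illustrative labels, not real data
demo_accents = pd.Series(["us", "us", "us", "uk"])
summary = class_shares(demo_accents)

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced
imbalance_ratio = summary["count"].max() / summary["count"].min()
print(summary)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

On the real metadata the equivalent call would be `class_shares(df['accent'])`.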

## 2. Analyze Processed Audio Properties

The metadata might not have audio properties like duration. Let's load a sample of the *processed* audio files from `data/processed/` to get their duration and sample rate.

In [None]:
def get_audio_properties(file_path: Path):
    """Loads an audio file and returns its duration and sample rate."""
    try:
        y, sr = librosa.load(file_path, sr=None)
        duration = librosa.get_duration(y=y, sr=sr)
        return {'duration_sec': duration, 'sample_rate': sr}
    except Exception as e:
        print(f"Could not load {file_path.name}: {e}")
        return {'duration_sec': None, 'sample_rate': None}

# Map each metadata row to its processed file.
# We assume paths in the CSV point at raw files, e.g. "data/raw/train/file.wav",
# and that data_preprocessing.py writes to "data/processed/{speaker}/{filename}"
# (or flat into data/processed/ when there is no speaker column).
# Sample up to 500 files to keep this fast; random_state makes the sample reproducible.
sample_size = min(500, len(df))
file_properties = []

print(f"Analyzing properties of {sample_size} processed audio files...")
for idx, row in tqdm(df.sample(sample_size, random_state=42).iterrows(), total=sample_size):
    # Keep only the filename from the raw path
    raw_path = Path(row['file_path'])

    if 'speaker' in df.columns:
        # str() guards against numeric speaker IDs
        processed_path = PROCESSED_DATA_DIR / str(row['speaker']) / raw_path.name
    else:
        # Fallback if no speaker column
        processed_path = PROCESSED_DATA_DIR / raw_path.name

    if processed_path.exists():
        props = get_audio_properties(processed_path)
        props['file_stem'] = raw_path.stem  # Same key we join on below
        file_properties.append(props)

# Explicit columns keep props_df valid even when no processed files were found
props_df = pd.DataFrame(file_properties, columns=['file_stem', 'duration_sec', 'sample_rate'])
props_df = props_df.drop_duplicates('file_stem').set_index('file_stem')

# Drop same-named columns (from the metadata or an earlier run) so the join cannot collide
df = df.drop(columns=['duration_sec', 'sample_rate'], errors='ignore')
df['file_stem'] = df['file_path'].apply(lambda x: Path(x).stem)
df = df.join(props_df, on='file_stem')
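`librosa.load` decodes every sample of a file just to measure its length, which can be slow on large samples. If the processed files are uncompressed PCM WAV (an assumption about `data_preprocessing.py`, not something this notebook confirms), the same two properties can be read from the header alone. The sketch below uses Python's standard-library `wave` module; the function name `wav_header_properties` is hypothetical.

```python
import tempfile
import wave
from pathlib import Path

def wav_header_properties(file_path: Path) -> dict:
    """Read duration and sample rate from a PCM WAV header without decoding samples."""
    with wave.open(str(file_path), "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
    return {"duration_sec": n_frames / sample_rate, "sample_rate": sample_rate}

# Demo: write half a second of 16-bit mono silence at 16 kHz, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.wav"
    with wave.open(str(demo_path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)       # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(16000)
        wav_file.writeframes(b"\x00\x00" * 8000)  # 8000 frames = 0.5 s
    header_props = wav_header_properties(demo_path)

print(header_props)
```

The `wave` module only handles uncompressed WAV, so for other formats the `librosa.load` approach remains the safer choice.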

### 2.1. Plot Audio Duration Distribution

Now we can visualize the distribution of audio clip lengths.

In [None]:
# Drop any files we couldn't load
df_clean = df.dropna(subset=['duration_sec'])

plt.figure(figsize=(12, 6))
sns.histplot(df_clean['duration_sec'], bins=50, kde=True)
plt.title("Distribution of Audio Durations (in seconds)")
plt.xlabel("Duration (sec)")
plt.ylabel("Count")
plt.savefig(REPORTS_PLOTS_DIR / "01_duration_distribution.png")
plt.show()

print("Audio Duration Stats (seconds):")
print(df_clean['duration_sec'].describe())
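The histogram and `describe()` output show the overall spread; unusually short or long clips can also be listed explicitly. The following is one possible sketch using the common 1.5 × IQR rule. The threshold, the function name `flag_duration_outliers`, and the demo values are assumptions for illustration, not part of the project's preprocessing.

```python
import pandas as pd

def flag_duration_outliers(durations: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the durations lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = durations.quantile(0.25), durations.quantile(0.75)
    iqr = q3 - q1
    mask = (durations < q1 - k * iqr) | (durations > q3 + k * iqr)
    return durations[mask]

# Illustrative durations in seconds, not real data
demo_durations = pd.Series([2.0, 2.1, 2.2, 2.3, 9.0])
outliers = flag_duration_outliers(demo_durations)
print(outliers.tolist())
```

In the notebook this would be applied as `flag_duration_outliers(df_clean['duration_sec'])`.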

### 2.2. Check Sample Rates

Let's confirm that the sampled processed files all share one sample rate (e.g., 16000 Hz), the target set in `data_preprocessing.py`.

In [None]:
plt.figure(figsize=(8, 4))
sns.countplot(x=df_clean['sample_rate'])
plt.title("Sample Rate of Processed Files")
plt.xlabel("Sample Rate (Hz)")
plt.ylabel("Count")
plt.savefig(REPORTS_PLOTS_DIR / "01_sample_rate_distribution.png")
plt.show()

print("\nSample Rate Counts:")
print(df_clean['sample_rate'].value_counts())
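A plot is easy to misread when a handful of files deviate, so a programmatic check can complement it. This sketch assumes a 16000 Hz target, the example value mentioned in this section; the real target lives in `data_preprocessing.py`, and the helper name `has_uniform_sample_rate` is hypothetical.

```python
import pandas as pd

def has_uniform_sample_rate(sample_rates: pd.Series, expected_sr: int = 16000) -> bool:
    """True only if every non-missing value equals expected_sr (and at least one exists)."""
    observed = sample_rates.dropna().unique()
    return len(observed) == 1 and int(observed[0]) == expected_sr

# Illustrative values, not real data
print(has_uniform_sample_rate(pd.Series([16000, 16000, 16000])))
print(has_uniform_sample_rate(pd.Series([16000, 22050])))
```

With the notebook's data the call would be `has_uniform_sample_rate(df_clean['sample_rate'])`.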

## 3. Visualize a Sample Waveform and Spectrogram

Finally, let's load one of the processed audio files to visualize its waveform and a standard Mel Spectrogram.

In [None]:
# Select a sample file to load (the first one in props_df that loaded successfully)
valid_props = props_df.dropna(subset=['duration_sec'])
if not valid_props.empty:
    sample_id = valid_props.index[0]
    
    # Reconstruct the file path
    # This assumes the props_df index matches the filename stem of 'file_path'
    
    try:
        # Find the original row by exact stem match. str.contains would treat
        # sample_id as a regex and also match longer names ("clip_10" for "clip_1").
        stems = df['file_path'].apply(lambda p: Path(p).stem)
        original_row = df[stems == sample_id].iloc[0]
        sample_name = Path(original_row['file_path']).name  # Keeps the real extension
        
        # Construct processed path
        if 'speaker' in df.columns:
            sample_path = PROCESSED_DATA_DIR / str(original_row['speaker']) / sample_name
        else:
            sample_path = PROCESSED_DATA_DIR / sample_name

        print(f"Loading sample file: {sample_path}")
        y, sr = librosa.load(sample_path, sr=None) # Load with its native SR

        print(f"Sample Rate: {sr} Hz, Duration: {len(y)/sr:.2f}s")

        # --- Plot Waveform ---
        plt.figure(figsize=(15, 4))
        librosa.display.waveshow(y, sr=sr)
        plt.title(f"Waveform - {sample_path.name}")
        plt.xlabel("Time (s)")
        plt.ylabel("Amplitude")
        plt.tight_layout()
        plt.savefig(REPORTS_PLOTS_DIR / "01_sample_waveform.png")
        plt.show()

        # --- Plot Mel Spectrogram ---
        # Use parameters from your feature_extraction.py (assuming 80 mels)
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
        S_db = librosa.power_to_db(S, ref=np.max)

        plt.figure(figsize=(15, 4))
        librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel')
        plt.colorbar(format='%+2.0f dB')
        plt.title(f"Mel Spectrogram - {sample_path.name}")
        plt.xlabel("Time (s)")
        plt.ylabel("Frequency (Hz)")
        plt.tight_layout()
        plt.savefig(REPORTS_PLOTS_DIR / "01_sample_melspectrogram.png")
        plt.show()
        
    except Exception as e:
        print(f"Could not load or plot sample file: {e}")
else:
    print("No valid processed files found to visualize.")
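To sanity-check the spectrogram's shape, the number of time frames can be worked out by hand. The `melspectrogram` call above does not override librosa's defaults of `hop_length=512` and `center=True`, so a signal of `n_samples` samples should yield `1 + n_samples // hop_length` frames. The sketch below is plain arithmetic under those assumed defaults; the function name `expected_mel_shape` is hypothetical.

```python
def expected_mel_shape(n_samples: int, n_mels: int = 80, hop_length: int = 512) -> tuple:
    """Expected (n_mels, n_frames) of a centered mel spectrogram."""
    n_frames = 1 + n_samples // hop_length
    return (n_mels, n_frames)

# Example: a 3-second clip at 16000 Hz has 48000 samples
shape_3s = expected_mel_shape(3 * 16000)
print(shape_3s)
```

Comparing this against `S.shape` in the cell above is a quick way to catch a mismatch between the parameters used here and those in `feature_extraction.py`.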

## 4. Initial Findings

*(This section would be filled in after running the notebook)*

* **Metadata:** The dataset contains `[Number]` entries, `[X]` unique speakers, and `[Y]` unique accents. The accent `[Accent Name]` is the most common.
* **Audio Properties:** All sampled processed audio files are at **`[Sample Rate]` Hz**, which matches the preprocessing target.
* **Duration:** The audio clips have a mean duration of `[Mean Duration]` seconds, with most files falling between `[Min]` and `[Max]` seconds. A narrow range like this simplifies batching during training.
* **Quality:** The sample waveform appears clean and normalized (amplitude stays within [-1, 1] with no visible clipping), and the spectrogram shows clear speech components.
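The "clean and normalized" judgment can be backed by a few numbers rather than by eye. The following sketch summarizes peak level, RMS level, and the fraction of near-full-scale samples for a non-empty signal scaled to [-1, 1], as `librosa.load` returns. The 0.999 clipping threshold and the function name `amplitude_summary` are assumptions chosen for illustration.

```python
import numpy as np

def amplitude_summary(y, clip_threshold: float = 0.999) -> dict:
    """Peak, RMS, and fraction of samples at or above clip_threshold in magnitude."""
    y = np.asarray(y, dtype=float)
    magnitude = np.abs(y)
    return {
        "peak": float(magnitude.max()),
        "rms": float(np.sqrt(np.mean(y ** 2))),
        "clipped_fraction": float(np.mean(magnitude >= clip_threshold)),
    }

# Illustrative signal: one second of a 440 Hz sine at half scale, 16 kHz
t = np.arange(16000) / 16000
demo_signal = 0.5 * np.sin(2 * np.pi * 440 * t)
stats = amplitude_summary(demo_signal)
print(stats)
```

For the demo sine, the peak sits near 0.5 and the RMS near 0.5 / √2; on a real clip, a non-zero `clipped_fraction` would be a reason to revisit the normalization step.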

If these findings hold once the notebook has been run, the raw data is well understood and the output of `src/data_preprocessing.py` is correctly formatted for the next step, feature extraction.