# Speech Emotion Recognition – EDA & Mini Pipeline

This notebook shows a small end-to-end workflow for **Emotion Recognition from Speech**:

1. Load an example audio file from a dataset (e.g. RAVDESS / TESS / EMO-DB)
2. Visualize waveform and spectrogram
3. Extract MFCC features
4. Load the precomputed MFCC dataset (`.npz`) created by `src/extract_features.py`
5. Build and train a small CNN
6. Evaluate with a confusion matrix
7. Run a sample prediction

> **Note:** You must first download a dataset and run `src/extract_features.py` to create the `.npz` file used here.


In [None]:
# Imports
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import librosa
import librosa.display
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf
from tensorflow.keras import layers, models

%matplotlib inline
sns.set(style="whitegrid")


## 1. Load one example audio file

Update `example_wav_path` below to point to a real `.wav` file from your dataset.
For RAVDESS it may look like: `data/RAVDESS/Actor_01/03-01-03-01-01-01-01.wav`.
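
RAVDESS encodes the label in the file name itself: seven two-digit fields separated by hyphens, with the third field giving the emotion. A minimal sketch of decoding the example name above, assuming the standard RAVDESS naming scheme (the helper `parse_ravdess_name` is hypothetical and not part of this project):

```python
import os

# Emotion codes from the RAVDESS naming convention (third field of the file name).
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_name(path):
    """Return (emotion, actor_id) decoded from a RAVDESS-style file name."""
    stem = os.path.splitext(os.path.basename(path))[0]
    fields = stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"Not a RAVDESS-style name: {path}")
    # Fields: modality, vocal channel, emotion, intensity, statement, repetition, actor
    return RAVDESS_EMOTIONS[fields[2]], int(fields[6])

emotion, actor = parse_ravdess_name("data/RAVDESS/Actor_01/03-01-03-01-01-01-01.wav")
print(emotion, actor)
```

Other datasets such as TESS or EMO-DB use different naming schemes, so this mapping would not carry over to them unchanged.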


In [None]:
# TODO: change this to a real .wav in your data folder
example_wav_path = "data/RAVDESS/example.wav"  # <- replace with real file path

if not os.path.exists(example_wav_path):
    print("WARNING: example_wav_path does not exist yet. Update the path above.")
else:
    y, sr = librosa.load(example_wav_path, sr=22050, mono=True)
    print(f"Sample rate: {sr}, duration: {len(y)/sr:.2f} seconds")


## 2. Visualize waveform and spectrogram

The waveform shows amplitude over time; the spectrogram shows how much energy each frequency carries at each moment.
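
To see what "energy at each frequency" means without any audio file, here is a small NumPy-only sketch. It uses a synthetic 440 Hz tone (an assumption for illustration, not data from the notebook) and takes a single spectrum over the whole signal, whereas a spectrogram repeats this for many short overlapping frames:

```python
import numpy as np

sr = 22050                               # same sample rate the notebook loads audio with
t = np.arange(sr) / sr                   # one second of time stamps
tone = np.sin(2 * np.pi * 440.0 * t)     # synthetic 440 Hz sine, not real speech

# Magnitude spectrum of the whole signal (a spectrogram does this per short frame).
magnitude = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), d=1.0 / sr)

# Decibels relative to the strongest bin, similar in spirit to amplitude_to_db(ref=np.max).
db = 20.0 * np.log10(np.maximum(magnitude, 1e-10) / magnitude.max())

peak_hz = freqs[np.argmax(magnitude)]
print(f"Strongest frequency: {peak_hz:.1f} Hz")
```

The strongest bin sits at the tone's frequency and every other bin is far below it on the dB scale, which is the pattern a spectrogram would draw as a single bright horizontal line.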


In [None]:
if os.path.exists(example_wav_path):
    plt.figure(figsize=(12, 3))
    librosa.display.waveshow(y, sr=sr)
    plt.title("Waveform")
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.show()

    # Spectrogram
    D = np.abs(librosa.stft(y))
    DB = librosa.amplitude_to_db(D, ref=np.max)
    plt.figure(figsize=(12, 4))
    librosa.display.specshow(DB, sr=sr, x_axis='time', y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Spectrogram (log-frequency)')
    plt.show()
else:
    print("Skipping plots because example_wav_path is not set correctly.")


## 3. Extract MFCC features from a single file

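MFCCs (Mel-Frequency Cepstral Coefficients) are widely used audio features for speech and emotion recognition. The helper below trims leading and trailing silence, stacks the MFCCs with their first- and second-order deltas, and pads or truncates the time axis to `max_len` frames so that every sample has the same shape.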
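
The `max_len=174` default used in this notebook is easier to interpret as a duration. The arithmetic below is a sketch that assumes librosa's default hop length of 512 samples (the notebook does not override it) and a centered STFT, which yields one frame per hop plus one:

```python
SR = 22050          # sample rate used when loading audio in this notebook
HOP_LENGTH = 512    # librosa's default hop length (assumed; not overridden here)
N_MFCC = 40
MAX_LEN = 174

def n_frames(duration_s, sr=SR, hop_length=HOP_LENGTH):
    """Frame count for a centered STFT: one frame per hop, plus one."""
    return 1 + int(duration_s * sr) // hop_length

seconds_covered = MAX_LEN * HOP_LENGTH / SR
feature_rows = 3 * N_MFCC   # MFCC + delta + delta-delta

print(f"{MAX_LEN} frames cover about {seconds_covered:.2f} s")
print(f"A 3 s clip yields {n_frames(3.0)} frames, so {MAX_LEN - n_frames(3.0)} are zero-padded")
print(f"Feature matrix shape: ({feature_rows}, {MAX_LEN})")
```

Clips longer than roughly four seconds after trimming lose their tail, so `max_len` may need adjusting for datasets with longer utterances.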


In [None]:
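def extract_mfcc_single(y, sr=22050, n_mfcc=40, max_len=174):
    """Return a fixed-size (3 * n_mfcc, max_len) feature matrix for one waveform."""
    # Trim leading and trailing silence
    y_trim, _ = librosa.effects.trim(y)
    mfcc = librosa.feature.mfcc(y=y_trim, sr=sr, n_mfcc=n_mfcc)
    # First- and second-order deltas capture how the coefficients change over time
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    feat = np.vstack([mfcc, delta, delta2])  # shape: (n_mfcc*3, t)
    # Zero-pad short clips and truncate long ones to exactly max_len frames
    if feat.shape[1] < max_len:
        pad_width = max_len - feat.shape[1]
        feat = np.pad(feat, ((0, 0), (0, pad_width)), mode='constant')
    else:
        feat = feat[:, :max_len]
    return feat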

if os.path.exists(example_wav_path):
    feat = extract_mfcc_single(y, sr=sr)
    print("MFCC feature shape (features x time):", feat.shape)
    plt.figure(figsize=(10, 4))
    # Rows 0-39 are the static MFCCs; the delta rows are left out of the plot
    librosa.display.specshow(feat[:40], sr=sr, x_axis='time')
    plt.colorbar()
    plt.title('MFCC (40 static coefficients, deltas omitted)')
    plt.tight_layout()
    plt.show()
else:
    print("Skipping MFCC extraction plot because example_wav_path is not set correctly.")


## 4. Load precomputed MFCC dataset (.npz)

Run this command **before** using the cell below:

```bash
python src/extract_features.py --data-dir data/RAVDESS --out-file data/ravdess_mfcc.npz
```

Then set `features_path` to that `.npz` file.
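
The loading cell expects three arrays in the file: `X` (features), `y` (integer label indices) and `labels` (class names). The sketch below writes and reads a toy file with that layout and applies the same per-sample standardization. The data is random and the array sizes and class names are assumptions for illustration; the real output of `extract_features.py` may differ in detail.

```python
import os
import tempfile
import numpy as np

# Synthetic stand-in for the real feature file: 6 samples, 120 feature rows, 174 frames.
rng = np.random.default_rng(0)
X_fake = rng.normal(loc=5.0, scale=3.0, size=(6, 120, 174)).astype("float32")
y_fake = np.array([0, 1, 2, 0, 1, 2])
labels_fake = np.array(["angry", "happy", "sad"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_mfcc.npz")
    np.savez_compressed(path, X=X_fake, y=y_fake, labels=labels_fake)
    with np.load(path) as data:
        X_toy, y_toy, labels_toy = data["X"], data["y"], list(data["labels"])

# Per-sample standardization: each (rows, time) map gets zero mean and unit variance.
mean = X_toy.mean(axis=(1, 2), keepdims=True)
std = X_toy.std(axis=(1, 2), keepdims=True)
X_norm = (X_toy - mean) / (std + 1e-6)

print(X_toy.shape, y_toy.shape, labels_toy)
print(np.allclose(X_norm.mean(axis=(1, 2)), 0.0, atol=1e-3))
```

Standardizing each sample separately removes loudness differences between recordings, at the cost of discarding absolute level as a cue.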


In [None]:
features_path = "data/ravdess_mfcc.npz"  # change if needed

if not os.path.exists(features_path):
    print("WARNING: features_path does not exist yet. Run extract_features.py first.")
else:
    data = np.load(features_path, allow_pickle=True)
    X = data['X'].astype('float32')  # (N, features, time)
    # Use a separate name so the label array does not overwrite the waveform `y` from section 1
    y_idx = data['y']                 # integer class index per sample
    labels = [str(name) for name in data['labels']]
    print("X shape:", X.shape)
    print("y shape:", y_idx.shape)
    print("Labels:", labels)

    # Standardize each sample to zero mean and unit variance
    X = (X - X.mean(axis=(1, 2), keepdims=True)) / (X.std(axis=(1, 2), keepdims=True) + 1e-6)

    # Stratified train / validation split
    X_train, X_val, y_train, y_val = train_test_split(X, y_idx, test_size=0.2, stratify=y_idx, random_state=42)
    print("Train shape:", X_train.shape, "Val shape:", X_val.shape)


## 5. Build a small CNN model

We treat the MFCC feature map as a 2D image (feature rows × time frames) with a single depth channel, which the `Reshape` layer adds. The model is similar to the one in `src/train.py`.
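
It can help to trace how the feature map shrinks on its way through the network. The sketch below does this with plain arithmetic for this notebook's three-convolution stack, assuming a 120 × 174 input (what `extract_mfcc_single` produces with its defaults; the shape stored in your `.npz` file may differ). The helper `trace_shapes` is hypothetical. Batch normalization and dropout are omitted because they do not change shapes.

```python
def trace_shapes(height, width):
    """Follow (rows, time, depth) through a three-block conv/pool stack."""
    shapes = [("Reshape", (height, width, 1))]
    h, w = height, width
    for i, filters in enumerate((32, 64, 128), start=1):
        # 3x3 convolutions with padding='same' keep rows and time unchanged.
        shapes.append((f"Conv2D_{i}", (h, w, filters)))
        if i < 3:
            # 2x2 max pooling with default 'valid' padding floors odd sizes.
            h, w = h // 2, w // 2
            shapes.append((f"MaxPool2D_{i}", (h, w, filters)))
    # Global average pooling collapses rows and time, leaving one value per filter.
    shapes.append(("GlobalAveragePooling2D", (128,)))
    return shapes

for name, shape in trace_shapes(120, 174):
    print(f"{name:24s} {shape}")
```

Because global average pooling removes both spatial axes, the dense layers see a 128-value vector regardless of the input's time length.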


In [None]:
def build_cnn(input_shape, n_classes):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Reshape((*input_shape, 1)),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPool2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPool2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.GlobalAveragePooling2D(),
        layers.Dropout(0.3),
        layers.Dense(64, activation='relu'),
        layers.Dense(n_classes, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

if os.path.exists(features_path):
    input_shape = X_train.shape[1:]
    n_classes = len(labels)
    model = build_cnn(input_shape, n_classes)
    model.summary()
else:
    print("Skipping model creation because features_path is missing.")


## 6. Train the CNN (demo training)

This is a short demo run: at most 20 epochs, with early stopping once validation accuracy has not improved for 5 epochs. For serious training, increase the number of epochs or use `src/train.py`.
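
The training cell passes `patience=5` and `restore_best_weights=True` to Keras `EarlyStopping`. The pure-Python sketch below is a simplified model of that rule, not the Keras implementation, and the validation accuracies are made up for illustration:

```python
def simulate_early_stopping(val_acc, patience=5):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation accuracies."""
    best, best_epoch, wait = float("-inf"), -1, 0
    for epoch, acc in enumerate(val_acc):
        if acc > best:
            # Strict improvement resets the patience counter.
            best, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to the end without exhausting patience.
    return len(val_acc) - 1, best_epoch

# Made-up validation accuracies, one value per epoch (0-indexed).
fake_val_acc = [0.30, 0.42, 0.50, 0.49, 0.50, 0.48, 0.47, 0.46, 0.45, 0.44]
stop_epoch, best_epoch = simulate_early_stopping(fake_val_acc, patience=5)
print(f"Stopped after epoch {stop_epoch}, weights restored from epoch {best_epoch}")
```

With weight restoration enabled, the model evaluated afterwards corresponds to the best epoch rather than the last one trained.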


In [None]:
history = None

if os.path.exists(features_path):
    callbacks = [
        tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=5, restore_best_weights=True)
    ]
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=20,
        batch_size=32,
        callbacks=callbacks
    )
else:
    print("Skipping training because features_path is missing.")


### Plot training history


In [None]:
if history is not None:
    hist = history.history
    plt.figure(figsize=(12,4))
    plt.subplot(1,2,1)
    plt.plot(hist['loss'], label='train')
    plt.plot(hist['val_loss'], label='val')
    plt.title('Loss')
    plt.legend()
    plt.subplot(1,2,2)
    plt.plot(hist['accuracy'], label='train')
    plt.plot(hist['val_accuracy'], label='val')
    plt.title('Accuracy')
    plt.legend()
    plt.show()
else:
    print("No history to plot.")


## 7. Evaluation: confusion matrix & classification report
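
In the confusion matrix plotted in this section, rows are true labels and columns are predicted labels. The per-class numbers in the classification report can be read straight off such a matrix, as this sketch with made-up counts for three classes shows:

```python
import numpy as np

# Made-up counts for three classes; rows are true labels, columns are predictions
# (the layout scikit-learn's confusion_matrix returns).
cm_toy = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
])

recall = np.diag(cm_toy) / cm_toy.sum(axis=1)      # correct / all true members of the class
precision = np.diag(cm_toy) / cm_toy.sum(axis=0)   # correct / all predictions of the class
accuracy = np.trace(cm_toy) / cm_toy.sum()

print("recall   :", np.round(recall, 3))
print("precision:", np.round(precision, 3))
print("accuracy :", round(float(accuracy), 3))
```

Off-diagonal cells show which emotions get confused with each other, which overall accuracy alone does not reveal.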


In [None]:
if os.path.exists(features_path) and history is not None:
    y_val_pred = np.argmax(model.predict(X_val), axis=1)
    print(classification_report(y_val, y_val_pred, target_names=labels))

    cm = confusion_matrix(y_val, y_val_pred)
    plt.figure(figsize=(8,6))
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title('Validation confusion matrix')
    plt.show()
else:
    print("Skipping evaluation because training did not run.")


## 8. Single-sample prediction example

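We reuse the MFCC extraction and the trained model to predict the emotion for a single audio file. By default the path points at `example_wav_path`, which may have been part of the training data; set it to a different file for a more meaningful check.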


In [None]:
# TODO: update this path to another audio sample
predict_wav_path = example_wav_path

if os.path.exists(predict_wav_path) and history is not None:
    y_pred_audio, sr_pred = librosa.load(predict_wav_path, sr=22050, mono=True)
    feat_pred = extract_mfcc_single(y_pred_audio, sr=sr_pred)
    # normalize like others (sample-wise)
    feat_pred = (feat_pred - feat_pred.mean()) / (feat_pred.std() + 1e-6)
    inp = np.expand_dims(feat_pred, axis=0)
    probs = model.predict(inp)[0]
    idx = int(np.argmax(probs))
    print("Predicted emotion:", labels[idx])
    for i, label in enumerate(labels):
        print(f"  {label}: {probs[i]:.3f}")
else:
    print("Skipping prediction example. Check paths and that training finished.")


---
### Next steps / ideas
- Try different model architectures: LSTM, CRNN, or deeper CNNs.
- Add data augmentation (noise, time stretch, pitch shift).
- Use `KerasTuner` for hyperparameter tuning.
- Export the trained model and build a small demo app (Streamlit / Gradio) for interactive testing.
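
As a starting point for the augmentation idea, here is a minimal NumPy-only sketch of additive noise at a chosen signal-to-noise ratio. The helper `add_noise` is hypothetical and the input is a synthetic tone rather than speech:

```python
import numpy as np

def add_noise(y, snr_db, rng):
    """Return y plus white Gaussian noise scaled to the requested SNR in dB."""
    noise = rng.standard_normal(len(y))
    signal_power = np.mean(y ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale so that signal_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return y + scale * noise

rng = np.random.default_rng(0)
sr = 22050
clean = 0.5 * np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr)  # synthetic tone, not speech
noisy = add_noise(clean, snr_db=20.0, rng=rng)

measured = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(f"Measured SNR: {measured:.2f} dB")
```

Augmentation like this belongs on the training waveforms only, before feature extraction. For time stretching and pitch shifting, librosa's `effects.time_stretch` and `effects.pitch_shift` are the usual tools.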
