## Classify Voice By Gender & Emotion

<h3><span style="color:green">For SDSU CS 450: Intro to AI</span></h3>
<h3><span style="color:green">Authors: Noah Nielsen, Young Min Park, Dylan Murphy, Elijah Pearce, Alex Colmenar  </span></h3>

<h3><span style="color:red">Domain & Problem Described</span></h3>

This project is in the domain of speech processing and classification.<br>
We will train a classifier to predict the speaker's gender and/or emotion from a voice sample in multiple languages, including Spanish, German, English, and Russian.<br>
We will compare the accuracy when predicting gender alone, emotion alone, and gender and emotion together.

From the results, we will form hypotheses about why the results came out as they did.<br>
We will also try several methods to optimize the accuracy, then compare our optimized and unoptimized results<br>
and propose why our optimization methods improved the results.

<h3><span style="color:red">Dataset Described</span></h3>

The dataset consists of 7 emotion categories: <br>1) anger; 2) fear; 3) enthusiasm; 4) happiness; 5) sadness; 6) disgust; and 7) neutral. 
The files are in the .wav (uncompressed, high-quality) format.

<h3><span style="color:red">Workflow Described</span></h3>

#### 0. Install libraries
#### 1. Obtain dataset
#### 2. Split dataset into train & test subsets
#### 3A. Use Librosa library to extract features from the voice data
#### 3B. Exploratory Data Analysis: Plot the Principal Components
#### 4A.  Build and train a classifier
#### 4B.  Compare the accuracy across the prediction scenarios
#### 5. Optimize the classifier, analyze the effect
#### 6. Draw conclusion about the results

<h3><span style="color:red">0. Install Libraries</span></h3>

| Library      | Description                                                                |
|--------------|----------------------------------------------------------------------------|
| scikit-learn | Machine learning library used to run the classification algorithm          |
| numpy        | Array and numerical analysis library                                       |
| pandas       | Library that puts data into structured DataFrame objects                   |
| matplotlib   | Library used to plot numerical data                                        |
| seaborn      | Extends matplotlib to allow for creating more complex plots                |
| librosa      | Audio processing library that can extract, modify, and plot sound features |

#### Run the cell below to install the modules

In [1]:
!pip install scikit-learn numpy pandas matplotlib seaborn
!pip install librosa



<h3><span style="color:red">1. Obtain Dataset</span></h3>

#### Procedure

When you receive this notebook file, you should also receive the **emodb.zip** dataset. If not, download the dataset from

https://www.kaggle.com/datasets/piyushagni5/berlin-database-of-emotional-speech-emodb (requires a Kaggle account)

In [2]:
import os, zipfile

def extract_zip(zip_path, dest_dir):
    """
    Unzips *zip_path* into *dest_dir* (creates it if missing)
    and skips extraction if dest_dir already contains files.
    Prints a hint instead of raising if the archive is missing.
    """
    # 1. Skip extraction if the folder already contains files
    if os.path.isdir(dest_dir) and os.listdir(dest_dir):
        print(f"{dest_dir} already populated – skipping extraction.")
        return

    # 2. Check for the archive before creating anything on disk
    if not os.path.isfile(zip_path):
        print(f"{zip_path} not found - place it next to this notebook and re-run.")
        return

    # 3. Make the folder if it doesn't exist, then extract
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as z:
        z.extractall(dest_dir)
    print(f"Extracted {zip_path} → {dest_dir}")

extract_zip("emodb.zip",   "datasets/emodb")
extract_zip("MACHINlEARNINGdATA.zip", "datasets/mldata")


<h3><span style="color:red">2. Split Dataset into Train & Test Subsets</span></h3>

In [None]:
import glob, itertools, pprint
pprint.pprint(
    list(itertools.islice(
        glob.glob("datasets/mldata/**/*.wav", recursive=True), 20))
)

#### Procedure

Populate train_filenames and test_filenames by assigning each voice sample to either
the training or the test subset.<br>
From each voice sample's filename, extract its gender label and emotion label (e.g. the emodb documentation<br>
states whether each speaker ID belongs to a male or a female speaker).<br>
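
As a minimal illustration of the filename rule above, the sketch below decodes a single Emo-DB style name. The sample names and the helper `decode_emodb_name` are hypothetical, chosen only to show which characters carry the speaker ID and the emotion code; the set of female speaker IDs mirrors the one used in this notebook.

```python
import os

# Speaker IDs treated as female in this notebook; all others are treated as male.
FEMALE_SPEAKERS = {"08", "09", "13", "14", "16"}

def decode_emodb_name(path):
    """Return (speaker_id, gender, emotion_code) for an Emo-DB style filename."""
    name = os.path.basename(path)
    speaker_id = name[:2]        # characters 0-1: speaker ID
    emotion_code = name[5]       # character 5: single-letter emotion code
    gender = "Female" if speaker_id in FEMALE_SPEAKERS else "Male"
    return speaker_id, gender, emotion_code

# Hypothetical example names, with and without a directory prefix
print(decode_emodb_name("datasets/emodb/wav/03a01Fa.wav"))
print(decode_emodb_name("16b10Wb.wav"))
```

The RAVDESS-style names in the second archive are dash-separated numeric codes, so they need a different rule; this sketch covers only the Emo-DB layout.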

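The result will be 6 lists: train_filenames, test_filenames, train_gender_labels, test_gender_labels, train_emotion_labels, test_emotion_labels, plus a matching pair of lists (train_dataset_labels, test_dataset_labels) recording which corpus each sample came from.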

The ratio of training : test samples will be 5:1.
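
A 5:1 ratio means one sample in six is held out, which is where a test fraction of about 0.167 comes from. The sketch below works through that arithmetic on a hypothetical corpus of 600 files; the count is invented for illustration and is not the size of the real dataset, and `split_sizes` is a helper name made up for this example.

```python
def split_sizes(n_samples, train_parts=5, test_parts=1):
    """Return (test_fraction, n_train, n_test) for a train:test ratio."""
    total_parts = train_parts + test_parts
    test_fraction = test_parts / total_parts
    # Ceiling division in integer arithmetic: round the test count up.
    n_test = -(-n_samples * test_parts // total_parts)
    return test_fraction, n_samples - n_test, n_test

fraction, n_train, n_test = split_sizes(600)   # 600 is a made-up corpus size
print(round(fraction, 3))                      # the value to pass as test_size
print(n_train, n_test)
```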

In [None]:

import os, re
from sklearn.model_selection import train_test_split

# ------------------------------------------------------------------
# 1)  Dataset‑specific filename parsers
# ------------------------------------------------------------------

EMO_RAVDESS = {
    "01": "neutral",
    "02": "calm",
    "03": "happy",
    "04": "sad",
    "05": "angry",
    "06": "fearful",
    "07": "disgust",
    "08": "surprised",
}

def parse_emodb(path: str):
    fname = os.path.basename(path)
    gender = "Female" if fname[:2] in {"08","09","13","14","16"} else "Male"
    emotion = fname[5].lower()
    return gender, emotion

def parse_ravdess(path: str):
    """
    Works for filenames like .../Anger/03-01-05-01-01-01-16.wav
    Returns (gender, emotion)
    """
    fname = os.path.basename(path)
    parts = fname.split('-')                     # ['03','01','05','01','01','01','16.wav']
    if len(parts) < 7:
        raise ValueError(f"Unexpected RAVDESS name: {fname}")

    emotion_code = parts[2]                     # '05'
    actor_id     = int(parts[-1].split('.')[0]) # 16
    gender       = "Male" if actor_id % 2 else "Female"
    emotion      = EMO_RAVDESS.get(emotion_code, "unknown")
    return gender, emotion

def parse_auto(path: str):
    """
    Chooses the correct parser based on the filename:
      • filenames that contain a dash ('-')   → RAVDESS format
      • filenames without a dash              → Emo‑DB format
    """
    fname = os.path.basename(path)
    if '-' in fname:
        return parse_ravdess(path)
    else:
        return parse_emodb(path)
# ------------------------------------------------------------------
# 2)  Register every corpus you want to merge
# ------------------------------------------------------------------

DATASETS = {
    "emodb" : {"root": "datasets/emodb",                       "parser": parse_emodb},
    "mldata": {"root": "datasets/mldata/MACHINlEARNINGdATA",   "parser": parse_auto},
    # add more here …
}

# ------------------------------------------------------------------
# 3)  Walk *all* roots and collect files + labels
# ------------------------------------------------------------------

filenames, gender_labels, emotion_labels, dataset_labels = [], [], [], []

for ds_name, cfg in DATASETS.items():
    root, parser = cfg["root"], cfg["parser"]

    for cur_dir, _, files in os.walk(root):
        for f in files:
            if f.lower().endswith(".wav"):
                full_path = os.path.join(cur_dir, f)

                gender, emotion = parser(full_path)   # ← pass full path!
                filenames.append(full_path)
                gender_labels.append(gender)
                emotion_labels.append(emotion)
                dataset_labels.append(ds_name)

print(f"Total .wav files across datasets: {len(filenames)}")

# ------------------------------------------------------------------
# 4)  Train / test split
# ------------------------------------------------------------------

(train_filenames, test_filenames,
 train_gender_labels,  test_gender_labels,
 train_emotion_labels, test_emotion_labels,
 train_dataset_labels, test_dataset_labels) = train_test_split(
     filenames, gender_labels,
     emotion_labels, dataset_labels,
     test_size=0.167, random_state=42, shuffle=True
 )

print(f"Training samples : {len(train_filenames)}")
print(f"Testing  samples : {len(test_filenames)}")


<h3><span style="color:red">3. Exploratory Data Analysis</span></h3>

#### Procedure

Separately analyze the distribution of emotions and genders in the dataset and record your observations of the results.<br>


In [None]:
import collections

# The gender and emotion labels were already parsed per dataset in step 2,
# so count those directly. Re-parsing every filename with the Emo-DB rule
# (character 5 = emotion code) would mislabel the dash-separated RAVDESS-style names.
train_emotions, train_genders = train_emotion_labels, train_gender_labels
test_emotions, test_genders = test_emotion_labels, test_gender_labels

emotion_counts = collections.Counter(train_emotions)
gender_counts = collections.Counter(train_genders)
print("Emotion counts:", emotion_counts)
print("Gender  counts:", gender_counts)

#### Procedure

Extract the Mel-Frequency Cepstral Coefficients (MFCCs) of each voice sample using the Librosa module.<br>
MFCCs are a compact description of the short-term power spectrum of a sound: they summarize how energy is distributed<br>
across the most relevant frequency bands. The sound is split into small, overlapping windows and the spectrum of<br>
each window is computed, which lets us observe how the frequency content changes over time. In this notebook, each coefficient is then averaged over all windows, giving one 13-value feature vector per sample.
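
To make the windowing step concrete, here is a rough NumPy-only sketch of the first stage of that pipeline: slice a signal into overlapping frames and take the power spectrum of each. It is not librosa's implementation, and it stops before the mel filterbank and cosine transform that turn spectra into MFCCs. The frame length, hop size, helper name, and the synthetic 440 Hz tone are all assumptions made for illustration.

```python
import numpy as np

def frame_power_spectra(signal, frame_len=512, hop=256):
    """Split `signal` into overlapping frames and return each frame's power spectrum."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)                  # taper each frame to reduce leakage
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)          # one FFT per frame
    return np.abs(spectrum) ** 2                    # shape: (n_frames, frame_len // 2 + 1)

sample_rate = 16000                                 # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate            # one second of sample times
tone = np.sin(2 * np.pi * 440 * t)                  # synthetic 440 Hz tone

power = frame_power_spectra(tone)
peak_bin = power.mean(axis=0).argmax()              # strongest frequency bin on average
print(power.shape)                                  # (frames, frequency bins)
print(peak_bin * sample_rate / 512)                 # centre frequency of that bin, in Hz
```

With a hop of half the frame length, consecutive frames overlap by 50%, and the strongest bin should land within one bin width of the tone's frequency.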

In [None]:
import librosa
import numpy as np

def extract_mfccs(file_path, n_mfcc=13):
    y, sr = librosa.load(file_path, sr=None) 
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfccs_mean = np.mean(mfccs.T, axis=0)
    return mfccs_mean

In [None]:
X_train = [extract_mfccs(f) for f in train_filenames]
X_test  = [extract_mfccs(f) for f in test_filenames]

print(X_train[0])

#### Procedure

Merge the gender and emotion labels into a single combined label so that one classifier can predict both together.<br>
Then do Principal Component Analysis (PCA) on the MFCC features.<br>

First, normalize each feature to have a mean of 0.0 and a standard deviation of 1.0.<br>
PCA then projects the 13-dimensional data onto 6 principal components, of which only the first two are plotted.<br>

Observe that the first principal component captures more than a third of the variance (= 0.375),<br>
and the first two components together capture about half. We would need more components to <br>
capture anything close to the full information in the data, although each further component has a smaller
marginal benefit.
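
The explained-variance figures come from `pca.explained_variance_ratio_`. As a rough sketch of what that number measures, the NumPy-only example below standardizes a synthetic feature matrix and computes the same kind of ratios from a singular value decomposition. The random data, the variable `X_toy`, and the helper `explained_variance_ratio` are hypothetical stand-ins for the real MFCC matrix, so the printed values will not match the notebook's.

```python
import numpy as np

def explained_variance_ratio(X):
    """Standardize columns, then return the share of variance along each principal axis."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)        # zero mean, unit standard deviation
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values ** 2                # proportional to variance per component
    return variances / variances.sum()

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))                  # 3 hidden factors (made up)
mixing = rng.normal(size=(3, 13))                   # spread over 13 MFCC-like features
X_toy = latent @ mixing + 0.3 * rng.normal(size=(200, 13))

ratios = explained_variance_ratio(X_toy)
cumulative = np.cumsum(ratios)
print(np.round(ratios[:6], 3))                      # share captured by each of the first 6 PCs
print(np.round(cumulative[:6], 3))                  # share retained by the first k PCs
```

Reading the cumulative row is one way to judge how many components are worth keeping: the point where it flattens is where extra components stop adding much.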

In [None]:
train_combined_labels = [f"{gender}-{emotion}" for gender, emotion in zip(train_gender_labels, train_emotion_labels)]
test_combined_labels  =  [f"{gender}-{emotion}" for gender, emotion in zip(test_gender_labels, test_emotion_labels)]

print(train_combined_labels[0:5])

In [None]:
import numpy as np
import pandas as pd
import librosa
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Standardize (normalize) the MFCC features
scaler = StandardScaler()
X_train_standardized = scaler.fit_transform(X_train)

# Step 2: Apply PCA to reduce dimensionality to 6 components
pca = PCA(n_components=6)
X_train_6d = pca.fit_transform(X_train_standardized)

# Step 3: Create a DataFrame for visualization (only use the first 2 components for the plot)
df = pd.DataFrame(X_train_6d, columns=[f'PC{i+1}' for i in range(6)])  # Add all 6 PCs as columns
df['Combined_Label'] = train_combined_labels  # Assuming you have these labels

# Step 4: Plot the first 2 principal components (PC1 and PC2)
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PC1', y='PC2', hue='Combined_Label', data=df, palette='viridis', s=60)
plt.title('2D PCA of MFCC Features by Combined Label (First 2 PCs)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

# Optional: Check how much variance is explained by each principal component
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")


<h3><span style="color:red">4. Support Vector Machines Model</span></h3>

#### Procedure

We use Support Vector Machines (SVMs) as our prediction model.<br>

Imagine the data as points in a space with D dimensions, where D is the number of features (i.e. 13).<br>
The goal is to find a hyperplane of dimension (D - 1) that best separates the classes, where "best" means the largest margin between the hyperplane and the nearest points of each class.<br>
The separating hyperplane is the **decision boundary**.<br>
Only the points closest to the decision boundary determine where it lies; these points are the **support vectors**.<br>
When the classes cannot be separated by a flat hyperplane, a **kernel function** can be used to compare points as if they had been mapped into a higher-dimensional space, which produces a decision<br>
boundary that is curved in the original space but distinguishes the classes better.
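
A small sketch can make the role of the kernel more tangible. The example below, written in plain Python with made-up points, compares a linear kernel (a dot product) with an RBF kernel (a similarity that decays with squared distance). It only evaluates kernel values and is not how `SVC` works internally; the `gamma` value is an arbitrary choice for illustration.

```python
import math

def linear_kernel(x, z):
    """Dot product: the similarity used by a linear SVM."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    """Similarity that is 1 for identical points and decays with squared distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

point_a = (1.0, 2.0)
point_b = (1.0, 2.5)     # close to point_a
point_c = (-3.0, 0.0)    # far from point_a

print(linear_kernel(point_a, point_b), linear_kernel(point_a, point_c))
print(rbf_kernel(point_a, point_b), rbf_kernel(point_a, point_c))
```

Because the RBF value depends only on distance, a classifier built on it can draw closed, curved regions around clusters, which a single flat hyperplane cannot do.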


In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Gender classifier
gender_clf = SVC(kernel='linear')
gender_clf.fit(X_train, train_gender_labels)
gender_pred = gender_clf.predict(X_test)
gender_accuracy = accuracy_score(test_gender_labels, gender_pred)
print(f'Gender Prediction Accuracy: {gender_accuracy * 100:.2f}%')

# Emotion classifier
emotion_clf = SVC(kernel='linear')
emotion_clf.fit(X_train, train_emotion_labels)
emotion_pred = emotion_clf.predict(X_test)
emotion_accuracy = accuracy_score(test_emotion_labels, emotion_pred)
print(f'Emotion Prediction Accuracy: {emotion_accuracy * 100:.2f}%')

#### Analysis

As we can see, an SVM model almost perfectly predicts the gender of a sample based on its MFCC features.
This should not be surprising because:
1. Gender prediction involves just 2 classes. Even a random guess would result in an accuracy of ~50%.
2. The information relating to gender is more distinctive. The differences between male and female voices in<br>
pitch, tone, and range are simply larger. These differences are biological and are thus present across different languages.

Compare that to the accuracy of emotion classification (96.67% for gender vs 43% for emotion):
1. Emotions, and the ways humans express them, especially by voice, are complex (e.g. sarcasm).<br>
How humans express emotion varies across individuals and cultures. These differences add noise to the data.
2. Different emotions may sound very similar, for example anger vs. fear or happiness vs. surprise. For instance, angry and
fearful voices may both show higher pitch, a fast speech rate, and loud volume.
3. The target labels for emotion involve many more classes than the 2 for gender.
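
One way to put the two accuracies on a comparable footing is to measure each against a trivial baseline. The sketch below computes the uniform-guess accuracy (1 / number of classes) and the majority-class accuracy for a list of labels. The label lists and the helper `baselines` are invented for illustration and do not reflect the real class balance.

```python
from collections import Counter

def baselines(labels):
    """Return (uniform-guess accuracy, majority-class accuracy) for a label list."""
    counts = Counter(labels)
    chance = 1 / len(counts)                        # guess uniformly among the classes
    majority = max(counts.values()) / len(labels)   # always predict the most common class
    return chance, majority

# Made-up label distributions, 100 samples each
toy_gender = ["Male"] * 55 + ["Female"] * 45
toy_emotion = (["anger"] * 30 + ["sad"] * 20 + ["happy"] * 15 + ["fear"] * 15
               + ["neutral"] * 10 + ["disgust"] * 5 + ["bored"] * 5)

print(baselines(toy_gender))
print(baselines(toy_emotion))
```

A 43% emotion accuracy looks weak next to the gender score, but against a seven-class uniform guess of roughly 14% it is well above chance.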

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Step 1: Normalize the data (fit the scaler on the training set only, then reuse it on the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Hyperparameter Tuning for Gender Classifier
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1]  # Only applies to 'rbf' kernel
}

gender_clf = GridSearchCV(SVC(), param_grid, cv=5)
gender_clf.fit(X_train_scaled, train_gender_labels)

# Evaluate Gender Classifier
best_gender_clf = gender_clf.best_estimator_
gender_pred = best_gender_clf.predict(X_test_scaled)
gender_accuracy = accuracy_score(test_gender_labels, gender_pred)
print(f'Best Parameters for Gender Classifier: {gender_clf.best_params_}')
print(f'Gender Prediction Accuracy: {gender_accuracy * 100:.2f}%')

# Hyperparameter Tuning for Emotion Classifier
emotion_clf = GridSearchCV(SVC(), param_grid, cv=5)
emotion_clf.fit(X_train_scaled, train_emotion_labels)

# Evaluate Emotion Classifier
best_emotion_clf = emotion_clf.best_estimator_
emotion_pred = best_emotion_clf.predict(X_test_scaled)
emotion_accuracy = accuracy_score(test_emotion_labels, emotion_pred)
print(f'Best Parameters for Emotion Classifier: {emotion_clf.best_params_}')
print(f'Emotion Prediction Accuracy: {emotion_accuracy * 100:.2f}%')

#### Procedure

We attempt to optimize the accuracy of our model by 1. normalizing our data and 2. tuning our hyperparameters.
1. Normalizing improves results because it rescales the features to a common scale. Thus, a feature with a larger<br>
magnitude of values will not dominate another feature with a smaller magnitude of values.

2. Hyperparameter tuning is simply running the model with different parameters to find the combination that gives the <br>
best result. GridSearchCV does this for us (it runs all combinations with cross-validation) so we do not have to write our own loop.
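
To see how much work the grid search implies, the sketch below enumerates the same kind of grid by hand with `itertools.product`. The values mirror this notebook's `param_grid` and the fold count mirrors `cv=5`; the variable name `toy_grid` is made up for this example. It only counts combinations and does not train anything.

```python
from itertools import product

toy_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}
n_folds = 5

# Every combination of one value per parameter, as a list of dicts
keys = list(toy_grid)
combos = [dict(zip(keys, values)) for values in product(*toy_grid.values())]

# gamma has no effect on the linear kernel, so those entries repeat the same model
linear_combos = [c for c in combos if c["kernel"] == "linear"]
distinct_linear = {c["C"] for c in linear_combos}

print(len(combos))                               # parameter combinations
print(len(combos) * n_folds)                     # model fits during cross-validation
print(len(linear_combos), len(distinct_linear))  # linear entries vs distinct linear models
```

Since `gamma` is ignored by the linear kernel, several of the linear entries are duplicates. Passing `param_grid` as a list of two dictionaries, one per kernel, is a common way to avoid the redundant fits.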

#### Analysis

As we can see, our 2-step optimization has a negligible effect on the accuracy of gender prediction.<br>
But normalizing the data and tuning the hyperparameters improves the accuracy of emotion classification to a small but significant degree.<br>

Why do we think this is? Because gender classification is a straightforward problem, there is not much that can be done to improve<br>
the prediction except training on more data (which we cannot do). But because emotion prediction is a more complex problem,<br>there is more room for us to use tricks to improve the model.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# ---------------------------------------------------------------
# 1)  Unified mapping → 7 target classes
# ---------------------------------------------------------------

TARGET_MAP = {
    # --- Emo‑DB single‑letter codes ---
    'W': 'anger',
    'E': 'disgust',
    'F': 'happy',
    'T': 'sad',
    'N': 'neutral',
    'A': 'fear/anxious',
    'L': 'other',           # bored  → other

    # --- RAVDESS / word labels ---
    'angry'    : 'anger',
    'disgust'  : 'disgust',
    'happy'    : 'happy',
    'sad'      : 'sad',
    'neutral'  : 'neutral',
    'fearful'  : 'fear/anxious',
    'anxious'  : 'fear/anxious',

    # everything else goes to 'other'
    'bored'    : 'other',
    'calm'     : 'other',
    'surprised': 'other',
}

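def to_target(label):
    """Map raw label to one of the 7 target classes."""
    # parse_emodb lower-cases the Emo-DB letter codes, while TARGET_MAP stores
    # them in upper case, so fall back to an upper-case lookup before 'other'.
    return TARGET_MAP.get(label) or TARGET_MAP.get(str(label).upper(), 'other')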

# Map ground‑truth and predictions
mapped_true  = [to_target(lbl) for lbl in test_emotion_labels]
mapped_pred  = [to_target(lbl) for lbl in emotion_pred]

# ---------------------------------------------------------------
# 2)  Confusion matrix over the 7 categories
# ---------------------------------------------------------------

targets = ['anger', 'disgust', 'happy', 'sad',
           'neutral', 'fear/anxious', 'other']

cm_norm = confusion_matrix(mapped_true, mapped_pred,
                           labels=targets, normalize='true')

disp = ConfusionMatrixDisplay(confusion_matrix=cm_norm, display_labels=targets)
disp.plot(cmap=plt.cm.Blues, xticks_rotation=45)
plt.title('Normalized Confusion Matrix (7‑class emotion set)')
plt.tight_layout()
plt.show()



<h3><span style="color:red">5. SVM: Multiclass Classification</span></h3>

In [None]:
# Hyperparameter Tuning for Combined Classifier
combined_clf = GridSearchCV(SVC(), param_grid, cv=5)
combined_clf.fit(X_train_scaled, train_combined_labels)

# Evaluate Combined Classifier
best_combined_clf = combined_clf.best_estimator_
combined_pred = best_combined_clf.predict(X_test_scaled)
combined_accuracy = accuracy_score(test_combined_labels, combined_pred)

print(f'Best Parameters for Combined Classifier: {combined_clf.best_params_}')
print(f'Combined Prediction Accuracy: {combined_accuracy * 100:.2f}%')

#### Analysis

We combine the gender and emotion labels and train an SVM model to predict them together.

At first glance, it was surprising that predicting both together gives a similar accuracy to that obtained when we
try to predict only emotion.<br>
We expected the combined accuracy to be lower than either individual accuracy. However, on reflection, the result is not that surprising.<br>
Gender accuracy is nearly 100%. Functionally, it is a given that an SVM model can predict the gender correctly.

Viewed that way, predicting gender and X together is like predicting X alone, so we get a similar accuracy.
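
This reasoning can be checked on a toy example: a combined prediction counts as correct only when both parts are correct, so the combined accuracy can never exceed the weaker of the two and stays close to it when errors in the other part are rare. The label lists below are invented for illustration.

```python
def accuracy(true, pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)

# Made-up labels for 10 samples: 1 gender mistake, 3 emotion mistakes
true_gender  = ["M", "F", "M", "F", "M", "F", "M", "F", "M", "F"]
pred_gender  = ["M", "F", "M", "F", "M", "F", "M", "F", "M", "M"]
true_emotion = ["sad", "happy", "anger", "sad", "fear", "happy", "anger", "sad", "fear", "happy"]
pred_emotion = ["sad", "happy", "sad", "sad", "fear", "anger", "anger", "sad", "happy", "happy"]

# Combined labels in the same "gender-emotion" form used in this notebook
true_combined = [f"{g}-{e}" for g, e in zip(true_gender, true_emotion)]
pred_combined = [f"{g}-{e}" for g, e in zip(pred_gender, pred_emotion)]

print(accuracy(true_gender, pred_gender))
print(accuracy(true_emotion, pred_emotion))
print(accuracy(true_combined, pred_combined))
```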

<h3><span style="color:red">6. Conclusion</span></h3>

We successfully built gender and emotion classifiers for the German language, trained on the emodb voice dataset.<br>
We used Support Vector Machines as our algorithm and obtained respectable results (96.67%, 70.00%).

We verified things we suspected or would not be surprised by:
1. Emotion prediction is a more difficult problem than gender prediction.
2. Attempts to optimize a model will improve its performance on a complex problem, but will produce little to no benefit on a simpler problem.
3. Predicting gender is so effortless that multiclass prediction of gender and X is like predicting X alone.

<h3><span style="color:red">Sources & Documentation</span></h3>

#### https://medium.com/@derutycsl/intuitive-understanding-of-mfccs-836d36a1f779
#### https://librosa.org/doc/0.9.2/generated/librosa.feature.mfcc.html
#### https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
#### https://seaborn.pydata.org/generated/seaborn.scatterplot.html
#### https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
#### https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
#### https://scikit-learn.org/1.5/modules/grid_search.html

#### https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio
#### https://www.kaggle.com/datasets/piyushagni5/berlin-database-of-emotional-speech-emodb