## GenAI Pioneer Labs: Advanced AI-Integrated App Dev: Voice Authentication System

**Objective:** Build a CPU-friendly speaker recognition pipeline that:
1. Extracts MFCC features from short audio samples.
2. Trains a simple classifier to distinguish enrolled speakers (KNN in this outline; the code cells below use a linear SVC).
3. Implements enrollment + authentication functions.
4. Demonstrates accuracy and confidence scoring.

**Skills Covered:**
- Audio processing with `librosa`
- Feature engineering (MFCC)
- CPU-based machine learning (scikit-learn)
- Model evaluation (accuracy, classification report)
- Deployment pipeline in Google Colab
- Documentation and code commentary for placements

## Voice Authentication System (Speaker Recognition)

This Google Colab notebook guides you through building a basic voice authentication (speaker recognition) system. Such a system identifies who is speaking rather than what they are saying, similar to how a phone can recognize its owner's voice to unlock.

### Project Overview

We will develop a simple machine learning pipeline that "learns" voices from enrollment samples and then attempts to identify the speaker of a new, unseen voice sample.

### Skills Covered

By working through this notebook, you will learn about:

- **Environment Setup:** Managing Python packages and ensuring compatibility in a Colab environment (especially important for libraries like NumPy).
- **Audio Data Handling:** Loading and processing audio files.
- **Feature Extraction (MFCCs):** Understanding and extracting Mel-frequency Cepstral Coefficients (MFCCs), which are compact numerical representations of voice characteristics.
- **Machine Learning Model Training:** Using a simple classifier (e.g., Support Vector Machine or Random Forest) to learn patterns from voice features.
- **Enrollment & Authentication Logic:** Simulating how new speakers are "enrolled" into the system and how unknown voices are "authenticated" against known ones.
- **Model Evaluation:** Checking whether the system works, using accuracy and a classification report.
- **CPU-Only Processing:** Running the entire system without a GPU, which makes it accessible to more users.


In [1]:
# Cell 1: Environment Setup & NumPy Control

# This cell installs all necessary libraries and *crucially* ensures a specific
# NumPy version is loaded. This will cause a runtime restart.

print("--- Starting Environment Setup ---")
print("1. Installing core libraries: librosa, soundfile, scikit-learn, matplotlib, Pillow...")

# librosa: For audio feature extraction (MFCCs)
# soundfile: For robust audio file reading/writing (used by librosa)
# scikit-learn: For building our machine learning classifier
# matplotlib: For plotting/visualization
# Pillow: For image handling (though not strictly needed here, good practice for general ML setups)
!pip install librosa soundfile scikit-learn matplotlib Pillow

print("\n2. Ensuring NumPy compatibility (this will force a restart)...")
# NumPy version control is critical due to recent changes in Colab and
# potential incompatibilities with other libraries.
# We explicitly install a stable NumPy 1.x version.
# '--force-reinstall' ensures it replaces any existing version.
!pip install numpy==1.26.4 --force-reinstall

print("\n-------------------------------------------------------------")
print("Installation complete for now. YOU MUST RESTART THE RUNTIME!")
print("Look for the 'Restart runtime' button/prompt and click it.")
print("After restarting, run all cells again from the beginning.")
print("-------------------------------------------------------------\n")

# This command will terminate the current Python session.
# In Colab, this typically triggers a prompt to restart the runtime cleanly.
exit()

--- Starting Environment Setup ---
1. Installing core libraries: librosa, soundfile, scikit-learn, matplotlib, Pillow...

2. Ensuring NumPy compatibility (this will force a restart)...
Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following

In [1]:
# Cell 2: Re-Import Libraries & Verify Setup

# This cell should be run *after* the runtime has been restarted from Cell 1.
# It re-imports libraries and verifies that our NumPy version is correctly loaded.

print("--- Re-Importing Libraries ---")

# Import standard libraries
import numpy as np        # Numerical operations (should be 1.26.4 now)
import matplotlib.pyplot as plt # Plotting library
import time               # For timing operations
import os                 # Operating system interactions (e.g., path creation)

# Import machine learning and audio processing libraries
import librosa            # For Mel-frequency Cepstral Coefficients (MFCCs)
import soundfile as sf    # Used by librosa for specific audio formats
from sklearn.svm import SVC # Support Vector Classifier for our model
from sklearn.ensemble import RandomForestClassifier # Another good classifier option
from sklearn.model_selection import train_test_split # For splitting data
from sklearn.metrics import accuracy_score, classification_report # For evaluation
from sklearn.preprocessing import StandardScaler # For scaling features

# Import Colab-specific tools for file upload
from google.colab import files

print("Libraries re-imported. Verifying NumPy version...")
# Verify the NumPy version to ensure the downgrade was successful.
print(f"NumPy version currently loaded: {np.__version__}")

# Check if it's the expected version
if np.__version__ == '1.26.4':
    print("NumPy version is correct (1.26.4). Good to go!")
else:
    print("WARNING: NumPy version might not be 1.26.4. If you face errors, manually restart runtime ('Runtime' -> 'Restart session') and try again.")

print("\nSetup verification complete. Proceed to next cell.")

--- Re-Importing Libraries ---
Libraries re-imported. Verifying NumPy version...
NumPy version currently loaded: 1.26.4
NumPy version is correct (1.26.4). Good to go!

Setup verification complete. Proceed to next cell.


## 3. Data Preparation

1. Create a `voice_data.zip` structured as:
   ```
   voice_data/
   ├── speaker_A/
   │   ├── A_01.wav
   │   └── ...
   ├── speaker_B/
   │   ├── B_01.wav
   │   └── ...
   └── speaker_C/
       ├── C_01.wav
       └── ...
   ```

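2. Each WAV file should be 16 kHz, 16-bit PCM, 2–5 seconds long, and contain the speaker saying the passphrase. (`librosa.load` in Cell 4 resamples every clip to 22,050 Hz.)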
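
The folder layout above can be turned into a list of (file, speaker) pairs with a short directory walk. The following is a minimal sketch, not part of the original notebook: `collect_samples` is a hypothetical helper, and the demo builds a throwaway tree of empty placeholder files instead of unzipping a real `voice_data.zip`.

```python
import tempfile
from pathlib import Path

def collect_samples(root):
    """Return sorted (wav_path, speaker_label) pairs from root/<speaker>/*.wav."""
    samples = []
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # ignore stray files at the top level
        for wav_path in sorted(speaker_dir.glob("*.wav")):
            # The folder name doubles as the speaker label.
            samples.append((wav_path, speaker_dir.name))
    return samples

# Build a tiny throwaway tree with empty placeholder files to show the result.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "voice_data"
    for speaker, stem in [("speaker_A", "A_01"), ("speaker_A", "A_02"), ("speaker_B", "B_01")]:
        (root / speaker).mkdir(parents=True, exist_ok=True)
        (root / speaker / f"{stem}.wav").touch()
    demo_samples = [(path.name, label) for path, label in collect_samples(root)]

print(demo_samples)
# [('A_01.wav', 'speaker_A'), ('A_02.wav', 'speaker_A'), ('B_01.wav', 'speaker_B')]
```

Using the folder name as the label avoids encoding the speaker in every filename, at the cost of needing a separate convention for enrollment versus test clips.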


In [2]:
# Cell 3: Data Preparation: Uploading Audio Samples

# In this cell, you will upload your audio files for speaker enrollment and testing.
# Please create a few short .wav files for 2-3 distinct speakers.
# For each speaker, have:
# - At least 2-3 ENROLLMENT samples (e.g., 'speaker1_enroll_01.wav', 'speaker1_enroll_02.wav')
# - At least 1 TEST sample (e.g., 'speaker1_test_01.wav')

# Example file naming convention:
# speaker_name_type_number.wav
# e.g., 'john_enroll_01.wav', 'john_enroll_02.wav', 'jane_enroll_01.wav', 'jane_test_01.wav'

print("--- Data Preparation: Upload Your Audio Samples ---")
print("Please upload your .wav audio files now. (e.g., speaker1_enroll_01.wav, speaker2_test_01.wav)")

# Create directories for organizing audio files
enrollment_dir = 'enrollment_samples'
test_dir = 'test_samples'
os.makedirs(enrollment_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)

# Upload files from your local machine
uploaded_files = files.upload()

print("\nProcessing uploaded files...")

# List to store information about uploaded files
all_audio_files = []

# Iterate through uploaded files and move them to appropriate directories
for filename in uploaded_files.keys():
    # Expected naming convention: 'speaker_type_number.wav'
    # (the speaker name itself must not contain underscores).
    parts = filename.split('_')
    if len(parts) < 3:
        print(f"Skipping '{filename}': name does not follow 'speaker_type_number.wav'.")
        continue
    speaker_name = parts[0]
    sample_type = parts[1] # 'enroll' or 'test'

    if sample_type == 'enroll':
        destination_path = os.path.join(enrollment_dir, filename)
    elif sample_type == 'test':
        destination_path = os.path.join(test_dir, filename)
    else:
        print(f"Skipping unknown file type: {filename}. Please use 'enroll' or 'test' in filename.")
        continue

    # Write the uploaded bytes to disk, then move the file into our organized directory
    with open(filename, 'wb') as f:
        f.write(uploaded_files[filename])
    os.replace(filename, destination_path) # Move the file (overwrites an older copy on re-run)

    all_audio_files.append({'speaker': speaker_name, 'type': sample_type, 'path': destination_path})
    print(f"  Moved '{filename}' to '{destination_path}'")

print("\nAudio file organization complete.")
print(f"Total uploaded files: {len(all_audio_files)}")
print(f"Enrollment files: {len([f for f in all_audio_files if f['type'] == 'enroll'])}")
print(f"Test files: {len([f for f in all_audio_files if f['type'] == 'test'])}")

--- Data Preparation: Upload Your Audio Samples ---
Please upload your .wav audio files now. (e.g., speaker1_enroll_01.wav, speaker2_test_01.wav)


Saving speaker1_enroll_01.wav to speaker1_enroll_01.wav
Saving speaker1_enroll_02.wav to speaker1_enroll_02.wav
Saving speaker1_enroll_03.wav to speaker1_enroll_03.wav
Saving speaker2_enroll_01.wav to speaker2_enroll_01.wav
Saving speaker2_enroll_02.wav to speaker2_enroll_02.wav
Saving speaker2_enroll_03.wav to speaker2_enroll_03.wav
Saving speaker2_test_01.wav to speaker2_test_01.wav
Saving speaker2_test_02.wav to speaker2_test_02.wav
Saving speaker2_test_03.wav to speaker2_test_03.wav
Saving speaker1_test_01.wav to speaker1_test_01.wav
Saving speaker1_test_02.wav to speaker1_test_02.wav
Saving speaker1_test_03.wav to speaker1_test_03.wav

Processing uploaded files...
  Moved 'speaker1_enroll_01.wav' to 'enrollment_samples/speaker1_enroll_01.wav'
  Moved 'speaker1_enroll_02.wav' to 'enrollment_samples/speaker1_enroll_02.wav'
  Moved 'speaker1_enroll_03.wav' to 'enrollment_samples/speaker1_enroll_03.wav'
  Moved 'speaker2_enroll_01.wav' to 'enrollment_samples/speaker2_enroll_01.wav'
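
The `speaker_type_number.wav` naming convention used in Cell 3 can be validated before any file is moved. The helper below is a sketch under assumptions: `parse_sample_name` and `VALID_TYPES` are hypothetical names that do not appear in the notebook, and splitting from the right is a design choice that lets a speaker name contain underscores.

```python
import os

VALID_TYPES = {"enroll", "test"}

def parse_sample_name(filename):
    """Split '<speaker>_<type>_<number>.wav' into (speaker, type), or return None."""
    stem, ext = os.path.splitext(os.path.basename(filename))
    if ext.lower() != ".wav":
        return None
    # Split from the right so the speaker name may itself contain underscores.
    parts = stem.rsplit("_", 2)
    if len(parts) != 3:
        return None
    speaker, sample_type, _number = parts
    if not speaker or sample_type not in VALID_TYPES:
        return None
    return speaker, sample_type

print(parse_sample_name("john_enroll_01.wav"))    # ('john', 'enroll')
print(parse_sample_name("mary_ann_test_02.wav"))  # ('mary_ann', 'test')
print(parse_sample_name("notes.txt"))             # None
```

Returning `None` for anything that does not match keeps the calling loop simple: it can skip the file with a message instead of failing on an index error.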
  

## 4. Feature Extraction (MFCC)

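We'll extract 13 MFCCs per frame and summarize them over time. The cell below averages each coefficient across frames, giving one 13-dimensional vector per clip; appending the per-coefficient variance as well would give a 26-dimensional vector.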
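
A minimal sketch of the mean-plus-variance summary (26 dimensions) is shown here. It assumes the input has the layout that `librosa.feature.mfcc` returns, with one row per coefficient and one column per frame. `summarize_mfcc` is a hypothetical helper, and a random matrix stands in for real MFCCs so the example runs without audio.

```python
import numpy as np

def summarize_mfcc(mfccs):
    """Collapse an (n_mfcc, n_frames) matrix into [means..., variances...]."""
    mfccs = np.asarray(mfccs, dtype=float)
    mean = mfccs.mean(axis=1)  # average of each coefficient over time
    var = mfccs.var(axis=1)    # spread of each coefficient over time
    return np.concatenate([mean, var])

# Stand-in for real MFCC output: 13 coefficients x 100 frames of random values.
rng = np.random.default_rng(0)
fake_mfccs = rng.normal(size=(13, 100))

feature_vector = summarize_mfcc(fake_mfccs)
print(feature_vector.shape)  # (26,)
```

The mean captures the average spectral shape of a voice, while the variance adds how much each coefficient moves during the clip. Both discard the order of frames, which is acceptable for a small fixed-passphrase demo.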


In [3]:
# Cell 4: Feature Extraction (MFCCs)

# This cell defines a function to extract MFCCs (Mel-frequency Cepstral Coefficients)
# from audio files. MFCCs are widely used in speech recognition because they
# effectively represent the short-term power spectrum of a sound.

print("--- Feature Extraction (MFCCs) ---")

# Define parameters for MFCC extraction
# sr (sampling rate): How many samples per second in the audio. 22050 is common.
# n_mfcc: Number of MFCCs to extract. 13-20 is typical.
SAMPLING_RATE = 22050
N_MFCC = 13 # Number of MFCCs to extract per frame

def extract_mfcc(audio_path, sr=SAMPLING_RATE, n_mfcc=N_MFCC):
    """
    Loads an audio file and extracts its MFCCs.

    Args:
        audio_path (str): Path to the audio file.
        sr (int): Sampling rate.
        n_mfcc (int): Number of MFCC coefficients.

    Returns:
        numpy.ndarray: Averaged MFCCs, or None if an error occurs.
    """
    try:
        # Load the audio file
        # 'y' is the audio time series (waveform)
        # 'sr' is the sampling rate
        y, loaded_sr = librosa.load(audio_path, sr=sr)

        # Extract MFCC features from the audio waveform
        # librosa.feature.mfcc returns an array where each column is an MFCC vector for a frame.
        mfccs = librosa.feature.mfcc(y=y, sr=loaded_sr, n_mfcc=n_mfcc)

        # For simplicity in this basic classifier, we average the MFCCs across all frames.
        # This reduces each audio sample to a single vector of length n_mfcc.
        averaged_mfccs = np.mean(mfccs.T, axis=0)

        return averaged_mfccs
    except Exception as e:
        print(f"Error processing {audio_path}: {e}")
        return None

# Lists to store features and corresponding labels
X_features = [] # To store MFCC features
y_labels = []   # To store speaker labels

print("\nExtracting MFCCs from enrollment samples...")
# Loop through all enrollment files and extract features
enrollment_files_info = [f for f in all_audio_files if f['type'] == 'enroll']
for file_info in enrollment_files_info:
    mfcc_vector = extract_mfcc(file_info['path'])
    if mfcc_vector is not None:
        X_features.append(mfcc_vector)
        y_labels.append(file_info['speaker'])
        print(f"  Extracted MFCCs for {file_info['speaker']} from {os.path.basename(file_info['path'])}")

# Convert lists to NumPy arrays for machine learning
X_features = np.array(X_features)
y_labels = np.array(y_labels)

print("\nMFCC extraction complete.")
print(f"Shape of features (X): {X_features.shape}")
print(f"Shape of labels (y): {y_labels.shape}")

if X_features.shape[0] == 0:
    print("WARNING: No features extracted. Please ensure you uploaded valid audio files following the naming convention.")
else:
    print("Features ready for training!")

--- Feature Extraction (MFCCs) ---

Extracting MFCCs from enrollment samples...
  Extracted MFCCs for speaker1 from speaker1_enroll_01.wav
  Extracted MFCCs for speaker1 from speaker1_enroll_02.wav
  Extracted MFCCs for speaker1 from speaker1_enroll_03.wav
  Extracted MFCCs for speaker2 from speaker2_enroll_01.wav
  Extracted MFCCs for speaker2 from speaker2_enroll_02.wav
  Extracted MFCCs for speaker2 from speaker2_enroll_03.wav

MFCC extraction complete.
Shape of features (X): (6, 13)
Shape of labels (y): (6,)

## 5. Model Training

1. Traverse `voice_data/{speaker}/*.wav`
2. Extract MFCCs, build X, y
3. Split 80/20, train KNN (k=3, Euclidean); note that the code cell below trains a linear SVC instead
4. Evaluate on test set
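
The split-and-train steps listed above, with KNN (k=3, Euclidean) as the classifier, could look roughly like this. It is a sketch on synthetic data rather than the notebook's own training cell: the two Gaussian clusters are an assumption standing in for real 13-dimensional MFCC vectors, and the speaker names are placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for MFCC vectors: two speakers, 20 clips each, 13 features.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 13)),
               rng.normal(5.0, 1.0, size=(20, 13))])
y = np.array(["speaker_A"] * 20 + ["speaker_B"] * 20)

# 80/20 split, stratified so both speakers appear in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# k=3 neighbours with Euclidean distance, as in the outline.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

knn_accuracy = accuracy_score(y_test, knn.predict(X_test))
print(f"KNN test accuracy on synthetic data: {knn_accuracy:.2f}")
```

KNN needs at least k training samples to form a vote, so with only a handful of enrollment clips per speaker a smaller k or a different classifier may be the more practical choice.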


In [4]:
# Cell 5: Model Training (Simple Classifier on CPU)

# Now that we have our numerical features (MFCCs) and speaker labels,
# we can train a machine learning model to learn the association between them.
# We'll use a Support Vector Classifier (SVC) as a robust choice,
# but a RandomForestClassifier could also be used.

print("--- Model Training ---")

if X_features.shape[0] < 2 or len(np.unique(y_labels)) < 2:
    print("ERROR: Not enough data or unique speakers to train a model.")
    print("Please upload at least 2 enrollment samples for at least 2 distinct speakers.")
else:
    # Scale the features
    # Scaling is important for many ML algorithms (like SVC) to ensure
    # all features contribute equally, regardless of their original scale.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_features)
    print(f"Features scaled. Shape: {X_scaled.shape}")

    # Split data into training and testing sets
    # This helps us evaluate how well our model performs on unseen data.
    # test_size=0.2 means 20% of data will be for testing, 80% for training.
    # random_state ensures reproducibility of the split.
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y_labels, test_size=0.2, random_state=42, stratify=y_labels
    )
    print(f"Data split: Train samples={X_train.shape[0]}, Test samples={X_test.shape[0]}")

    # Initialize the Support Vector Classifier model
    # C=1.0 is a regularization parameter. kernel='linear' uses a linear decision boundary.
    # The model will run on CPU by default.
    model = SVC(kernel='linear', C=1.0, random_state=42, probability=True)
    # Alternatively, you could use a RandomForestClassifier:
    # model = RandomForestClassifier(n_estimators=100, random_state=42)

    print("\nTraining the model...")
    train_start_time = time.time()
    model.fit(X_train, y_train) # Train the model using the training data
    train_end_time = time.time()
    print(f"Model training complete in {round(train_end_time - train_start_time, 2)} seconds.")

    # Evaluate the model's performance on the test set (sanity check)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"\nModel accuracy on internal test set: {accuracy * 100:.2f}%")
    print("Classification Report:\n", classification_report(y_test, y_pred))

    print("\nModel trained and ready for authentication!")


--- Model Training ---
Features scaled. Shape: (6, 13)
Data split: Train samples=4, Test samples=2

Training the model...
Model training complete in 0.0 seconds.

Model accuracy on internal test set: 100.00%
Classification Report:
               precision    recall  f1-score   support

    speaker1       1.00      1.00      1.00         1
    speaker2       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
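
The 100% figure above rests on only two held-out samples, and Cell 5 fits the scaler on all six samples before splitting, so the test rows influence the scaling. One possible way to get a slightly more informative estimate from such a small set is leave-one-out cross-validation with the scaler inside a pipeline. The sketch below assumes synthetic data in place of the real 6 x 13 feature matrix; it is not the notebook's evaluation method.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 6 x 13 enrollment matrix (3 clips per speaker).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(3, 13)),
                    rng.normal(4.0, 1.0, size=(3, 13))])
y_demo = np.array(["speaker1"] * 3 + ["speaker2"] * 3)

# The pipeline refits the scaler inside every fold, on training rows only.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# Leave-one-out: each clip is held out once, giving 6 test predictions instead of 2.
loo_scores = cross_val_score(pipeline, X_demo, y_demo, cv=LeaveOneOut())
print(f"Held-out predictions: {len(loo_scores)}, mean accuracy: {loo_scores.mean():.2f}")
```

Each fold scores either 0 or 1 because it tests a single clip, so the mean is simply the fraction of clips identified correctly when left out.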


In [5]:
# Cell 6: Enrollment & Authentication Logic

# This cell defines the core logic for how a new voice sample can be
# authenticated against our trained model.

print("--- Authentication Logic ---")

if 'model' not in locals() or 'scaler' not in locals():
    print("ERROR: Model or scaler not trained/initialized. Please run Cell 5 first.")
else:
    # Get test files info
    test_files_info = [f for f in all_audio_files if f['type'] == 'test']

    if not test_files_info:
        print("No test files found. Please upload test samples in Cell 3.")
    else:
        print("\nPerforming authentication on test samples...")

        # Loop through each test file and try to authenticate it
        for i, test_file_info in enumerate(test_files_info):
            test_audio_path = test_file_info['path']
            true_speaker = test_file_info['speaker']

            print(f"\n--- Test Sample {i+1}: '{os.path.basename(test_audio_path)}' (True speaker: {true_speaker}) ---")

            # Extract MFCCs from the test audio
            test_mfcc = extract_mfcc(test_audio_path)

            if test_mfcc is not None:
                # Reshape for prediction (model expects 2D array: 1 sample, N_MFCC features)
                test_mfcc_reshaped = test_mfcc.reshape(1, -1)

                # Scale the test features using the same scaler used for training
                test_mfcc_scaled = scaler.transform(test_mfcc_reshaped)

                # Predict the speaker using the trained model
                predicted_speaker = model.predict(test_mfcc_scaled)[0]

                # Get prediction probabilities (if using SVC with probability=True or RandomForest)
                # This gives a confidence score for each possible speaker.
                probabilities = model.predict_proba(test_mfcc_scaled)[0]

                # Get the names of all classes (speakers) from the model
                class_labels = model.classes_

                # Create a dictionary of speaker probabilities
                speaker_probabilities = dict(zip(class_labels, probabilities))

                print(f"Predicted Speaker: {predicted_speaker}")
                print(f"True Speaker: {true_speaker}")
                print("Probabilities per speaker:")
                for speaker, prob in speaker_probabilities.items():
                    print(f"  {speaker}: {prob*100:.2f}%")

                if predicted_speaker == true_speaker:
                    print("Status: AUTHENTICATED (Correctly identified)")
                else:
                    print("Status: FAILED AUTHENTICATION (Incorrectly identified)")
            else:
                print(f"  Skipping test sample due to MFCC extraction error: {os.path.basename(test_audio_path)}")

--- Authentication Logic ---

Performing authentication on test samples...

--- Test Sample 1: 'speaker2_test_01.wav' (True speaker: speaker2) ---
Predicted Speaker: speaker2
True Speaker: speaker2
Probabilities per speaker:
  speaker1: 25.00%
  speaker2: 75.00%
Status: AUTHENTICATED (Correctly identified)

--- Test Sample 2: 'speaker2_test_02.wav' (True speaker: speaker2) ---
Predicted Speaker: speaker2
True Speaker: speaker2
Probabilities per speaker:
  speaker1: 25.00%
  speaker2: 75.00%
Status: AUTHENTICATED (Correctly identified)

--- Test Sample 3: 'speaker2_test_03.wav' (True speaker: speaker2) ---
Predicted Speaker: speaker2
True Speaker: speaker2
Probabilities per speaker:
  speaker1: 25.00%
  speaker2: 75.00%
Status: AUTHENTICATED (Correctly identified)

--- Test Sample 4: 'speaker1_test_01.wav' (True speaker: speaker1) ---
Predicted Speaker: speaker2
True Speaker: speaker1
Probabilities per speaker:
  speaker1: 26.60%
  speaker2: 73.40%
Status: FAILED AUTHENTICATION (Incorrectly identified)

# Conclusion & Next Steps

Congratulations! You've successfully built a basic Voice Authentication System that runs entirely on CPU in Google Colab, carefully managing library versions.

### Summary of What We Achieved:

* **Robust Setup:** Handled common NumPy and `dlib` installation issues in Colab by enforcing a stable environment.
* **Audio Processing:** Learned to load audio files and extract key features (MFCCs).
* **Machine Learning Workflow:** Implemented a full ML pipeline from data preparation and feature extraction to model training and prediction.
* **Speaker Recognition:** Developed a system that can, with varying degrees of accuracy, identify speakers based on their voice.

### Further Enhancements & Exploration:

This is just the beginning! Here are some ideas to take this project further:

1.  **More Data:** The accuracy of ML models heavily depends on the amount and quality of training data. Try uploading more enrollment samples per speaker, and more speakers overall.
2.  **Advanced Features:**
    * Explore other audio features like Chroma, Tonal Centroid Feature (Tonnetz), Zero Crossing Rate, Spectral Centroid, etc.
    * Instead of averaging MFCCs, you could use sequences of MFCCs with Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs) for more complex models.
3.  **Different Classifiers:** Experiment with other classifiers, such as `scikit-learn`'s `KNeighborsClassifier` and `GaussianNB`, or `XGBClassifier` from the separate `xgboost` package (requires `pip install xgboost`).
4.  **Deep Learning for Voice:** For higher accuracy, especially with more data, you could move to deep learning architectures. This would involve libraries like `TensorFlow` or `PyTorch` and specialized voice models.
5.  **Robustness:**
    * Implement noise reduction techniques on audio.
    * Consider voice activity detection (VAD) to ensure only speech is processed.
6.  **User Interface:** Build a simple web interface (e.g., using `Gradio` or `Streamlit` in Colab) to make the demo more interactive.
7.  **"Imposter" Detection:** Modify the authentication logic to not just predict *who* the speaker is, but also determine *if* the speaker is one of the enrolled individuals (i.e., reject imposters). This usually involves setting a confidence threshold.
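
As a rough illustration of the confidence-threshold idea in item 7, the sketch below accepts a claimed identity only when the claimed speaker is the top prediction and its probability reaches a cutoff. The helper name `authenticate_with_threshold` and the default cutoff of 0.8 are assumptions made for this example and are not part of the notebook; the input is a per-speaker dictionary shaped like `speaker_probabilities` from Cell 6.

```python
def authenticate_with_threshold(speaker_probabilities, claimed_speaker, threshold=0.8):
    """Hypothetical helper: accept the claim only if the claimed speaker is the
    top prediction AND the model's confidence reaches the threshold."""
    # Speaker with the highest predicted probability
    predicted_speaker = max(speaker_probabilities, key=speaker_probabilities.get)
    confidence = speaker_probabilities[predicted_speaker]
    accepted = predicted_speaker == claimed_speaker and confidence >= threshold
    return accepted, predicted_speaker, confidence


# Probabilities shaped like the dictionary built in Cell 6 (values are illustrative).
example = {"speaker1": 0.25, "speaker2": 0.75}

print(authenticate_with_threshold(example, "speaker2", threshold=0.7))  # accepted
print(authenticate_with_threshold(example, "speaker2", threshold=0.8))  # below cutoff
print(authenticate_with_threshold(example, "speaker1", threshold=0.7))  # wrong speaker
```

Keep in mind that a closed-set classifier always spreads 100% of its probability across the enrolled speakers, so an unknown voice can still score highly for one of them. A threshold like this reduces false accepts but does not eliminate them, and the cutoff should be chosen from held-out genuine and imposter samples rather than guessed.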

Keep experimenting and have fun!


## 8. Next Steps & Extensions

- **Collect More Data:** Collecting 10+ utterances per speaker improves robustness.
- **Feature Variations:** Delta MFCC, spectral contrast, or PLP features.
- **Classifier Experiments:** SVM (`sklearn.svm.SVC`), GMM (e.g., `sklearn.mixture.GaussianMixture` per speaker), or Random Forest.
- **Real-Time Mic Capture:** Use JavaScript/HTML in a web app to record short clips and send them to this pipeline via an API.
- **Threshold Tuning:** Adjust the threshold for `authenticate()` based on the ROC curve to balance false acceptances against false rejections.
- **Embed in an App:** Export KNN to ONNX or pickle, then load in a Flask/Streamlit front-end for a complete demo.
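
The threshold-tuning idea above could be sketched roughly as follows. The labels and scores are made up purely for illustration (1 = genuine attempt, 0 = imposter attempt); in practice they would come from held-out trials scored by the trained model. Picking the point that maximizes TPR minus FPR (Youden's J) is only one possible criterion and is an assumption here, not something the notebook prescribes.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative data only: 1 = genuine attempt, 0 = imposter attempt.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
# Confidence the system assigned to the claimed speaker on each attempt.
scores = np.array([0.90, 0.80, 0.75, 0.35, 0.60, 0.40, 0.30, 0.20])

# roc_curve sweeps the decision threshold and reports FPR/TPR at each step.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Youden's J statistic (TPR - FPR): one common way to choose an operating point.
best = int(np.argmax(tpr - fpr))
best_threshold = float(thresholds[best])
false_accept_rate = float(fpr[best])         # imposters wrongly accepted
false_reject_rate = float(1.0 - tpr[best])   # genuine speakers wrongly rejected

print(f"threshold={best_threshold:.2f} "
      f"FAR={false_accept_rate:.2f} FRR={false_reject_rate:.2f}")
```

For an authentication system, a false acceptance is usually costlier than a false rejection, so a stricter threshold than the one Youden's J selects may be preferable.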
