# SpeechBrain Fine-Tuned Custom Wake Word Model

### Overview

This repository contains a custom wake word detector built on SpeechBrain: a small classifier is trained on embeddings from a pre-trained SpeechBrain model, using custom wake word samples and non-wake word samples. Because the training data is limited (about 5 minutes of 3-second samples), the model's accuracy is currently around 50%, which is chance level for a balanced two-class dataset.

In [None]:
!pip install speechbrain torchaudio torch librosa numpy scikit-learn



In [None]:
import os
import torch
import librosa
import numpy as np
import speechbrain as sb
from speechbrain.inference.classifiers import EncoderClassifier  # speechbrain.pretrained is deprecated in SpeechBrain 1.0
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim




In [None]:
# Load a pre-trained SpeechBrain x-vector model (trained on Google Speech Commands)
# to use as the embedding extractor
wakeword_model = EncoderClassifier.from_hparams(
    source="speechbrain/google_speech_command_xvector",
    savedir="wakeword_model"
)

# Check the model structure
print(wakeword_model)



EncoderClassifier(
  (mods): ModuleDict(
    (compute_features): Fbank(
      (compute_STFT): STFT()
      (compute_fbanks): Filterbank()
      (compute_deltas): Deltas()
      (context_window): ContextWindow()
    )
    (mean_var_norm): InputNormalization()
    (embedding_model): Xvector(
      (blocks): ModuleList(
        (0): Conv1d(
          (conv): Conv1d(24, 512, kernel_size=(5,), stride=(1,))
        )
        (1): LeakyReLU(negative_slope=0.01)
        (2): BatchNorm1d(
          (norm): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (3): Conv1d(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(1,), dilation=(2,))
        )
        (4): LeakyReLU(negative_slope=0.01)
        (5): BatchNorm1d(
          (norm): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (6): Conv1d(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(1,), dilation=(3,))
        )
   



### Dataset


The model was trained on a small dataset consisting of:

- About 2.5 minutes of custom wake word audio (50 three-second samples).
- About 2.5 minutes of non-wake word audio (50 three-second samples).

With so little training data, the model's performance is limited.

In [None]:
import os
print(len(os.listdir('/content/wake_word')))
print(len(os.listdir('/content/not_wake_word')))


50
50
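
Those counts, combined with the 3-second clip length, give the amount of audio per class. A minimal sketch of the arithmetic, assuming every file is exactly 3 seconds long (the helper name `total_minutes` is illustrative and not part of the notebook):

```python
def total_minutes(num_clips: int, clip_seconds: float) -> float:
    """Total audio duration in minutes for a set of fixed-length clips."""
    return num_clips * clip_seconds / 60


per_class = total_minutes(50, 3)  # 50 clips of 3 s each
overall = 2 * per_class           # wake word + non-wake word

print(f"{per_class} min per class, {overall} min overall")
```

Under that assumption each class holds 2.5 minutes of audio, or 5 minutes in total.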


In [None]:
# Paths to wake word and non-wake word samples
WAKE_WORD_DIR = "/content/wake_word/"
NON_WAKE_WORD_DIR = "/content/not_wake_word/"

# Audio processing parameters
SAMPLE_RATE = 16000  # 16kHz expected
DURATION = 3  # 3-second clips

# Function to extract embeddings from audio
def extract_speechbrain_embeddings(file_path):
    # Load audio and ensure it's 3 seconds long
    audio, sr = librosa.load(file_path, sr=SAMPLE_RATE)
    if len(audio) < SAMPLE_RATE * DURATION:
        pad_length = (SAMPLE_RATE * DURATION) - len(audio)
        audio = np.pad(audio, (0, pad_length))  # Pad if shorter
    elif len(audio) > SAMPLE_RATE * DURATION:
        audio = audio[:SAMPLE_RATE * DURATION]  # Trim if longer

    # Convert to Tensor and extract embeddings
    audio_tensor = torch.tensor(audio).unsqueeze(0)
    with torch.no_grad():
        embeddings = wakeword_model.encode_batch(audio_tensor)

    return embeddings.squeeze(0).numpy()

# Load dataset
wake_word_files = [os.path.join(WAKE_WORD_DIR, f) for f in os.listdir(WAKE_WORD_DIR)]
non_wake_word_files = [os.path.join(NON_WAKE_WORD_DIR, f) for f in os.listdir(NON_WAKE_WORD_DIR)]

# Extract embeddings for all samples
wake_embeddings = np.array([extract_speechbrain_embeddings(f) for f in wake_word_files])
non_wake_embeddings = np.array([extract_speechbrain_embeddings(f) for f in non_wake_word_files])

# Create labels (1 = wake word, 0 = non-wake word)
wake_labels = np.ones(len(wake_embeddings))
non_wake_labels = np.zeros(len(non_wake_embeddings))

# Combine dataset
X = np.vstack((wake_embeddings, non_wake_embeddings))
y = np.concatenate((wake_labels, non_wake_labels))

# Shuffle dataset
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=42)


In [None]:
class WakeWordDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Create dataset and dataloaders
dataset = WakeWordDataset(X, y)
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)


In [None]:
# Define a simple classifier using SpeechBrain embeddings
class WakeWordClassifier(nn.Module):
    def __init__(self, input_dim):
        super(WakeWordClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 1)  # Binary Classification (Wake Word or Not)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.sigmoid(self.fc3(x))
        return x

# Initialize model
input_dim = 512 # SpeechBrain embedding size
model = WakeWordClassifier(input_dim)

# Loss function and optimizer
criterion = nn.BCELoss()  # Binary Cross Entropy
optimizer = optim.Adam(model.parameters(), lr=0.001)


In [None]:
print(input_dim)

512


In [None]:
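# Re-initialize the model with the correct input size
model = WakeWordClassifier(input_dim=512)

# Rebuild the optimizer for the new model. The optimizer created earlier still
# holds the parameters of the previous model instance, so reusing it would leave
# this model's weights unchanged and the loss and accuracy would never improve.
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 500
for epoch in range(num_epochs):
    running_loss = 0.0  # sum of the per-batch mean losses
    correct = 0
    total = 0

    for inputs, labels in train_loader:
        optimizer.zero_grad()
        # Flatten (batch, 1, 1) to (batch,); unlike squeeze(), this keeps the
        # batch dimension even when a batch holds a single sample
        outputs = model(inputs).view(-1)

        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        predicted = (outputs > 0.5).float()
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

    accuracy = correct / total
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss:.4f}, Accuracy: {accuracy:.4f}")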


Epoch [1/1000], Loss: 5.0995, Accuracy: 0.5000
Epoch [2/1000], Loss: 5.2179, Accuracy: 0.5000
Epoch [3/1000], Loss: 5.3566, Accuracy: 0.5000
Epoch [4/1000], Loss: 5.1104, Accuracy: 0.5000
Epoch [5/1000], Loss: 5.3679, Accuracy: 0.5000
Epoch [6/1000], Loss: 5.2194, Accuracy: 0.5000
Epoch [7/1000], Loss: 5.5228, Accuracy: 0.5000
Epoch [8/1000], Loss: 5.3688, Accuracy: 0.5000
Epoch [9/1000], Loss: 5.1023, Accuracy: 0.5000
Epoch [10/1000], Loss: 5.5236, Accuracy: 0.5000
Epoch [11/1000], Loss: 5.3255, Accuracy: 0.5000
Epoch [12/1000], Loss: 5.3687, Accuracy: 0.5000
Epoch [13/1000], Loss: 5.3717, Accuracy: 0.5000
Epoch [14/1000], Loss: 5.4994, Accuracy: 0.5000
Epoch [15/1000], Loss: 5.3803, Accuracy: 0.5000
Epoch [16/1000], Loss: 5.1173, Accuracy: 0.5000
Epoch [17/1000], Loss: 5.5625, Accuracy: 0.5000
Epoch [18/1000], Loss: 5.3807, Accuracy: 0.5000
Epoch [19/1000], Loss: 5.5907, Accuracy: 0.5000
Epoch [20/1000], Loss: 5.6513, Accuracy: 0.5000
Epoch [21/1000], Loss: 5.2132, Accuracy: 0.5000
E
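
One way to read this log: a binary classifier that outputs 0.5 for every sample has a binary cross-entropy of ln 2 (about 0.693) per sample, and the training loop sums the mean loss of each batch. A rough check, assuming the 100 samples and batch size of 16 used earlier (the helper name `bce` is illustrative):

```python
import math


def bce(p: float, y: float) -> float:
    """Binary cross-entropy for one predicted probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


num_samples, batch_size = 100, 16
num_batches = math.ceil(num_samples / batch_size)  # batches per epoch

chance_loss = bce(0.5, 1.0)  # identical for y = 0
summed_chance_loss = num_batches * chance_loss

print(f"per batch: {chance_loss:.4f}, "
      f"summed over {num_batches} batches: {summed_chance_loss:.4f}")
```

The logged values of roughly 5.1 to 5.7 sit close to this chance-level figure, which is consistent with a model that is not learning.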

In [None]:

# Save the fine-tuned model
torch.save(model.state_dict(), "fine_tuned_wakeword_model.pth")
print("Fine-tuned model saved successfully!")

Fine-tuned model saved successfully!


In [None]:
# Extract embeddings for a single sample to check shape
sample_embedding = extract_speechbrain_embeddings(wake_word_files[0])
print("Embedding shape:", sample_embedding.shape)


Embedding shape: (1, 512)


In [None]:
# Load and test on a new wake word sample
def predict_wake_word(file_path):
    feature = extract_speechbrain_embeddings(file_path)
    feature = torch.tensor(feature, dtype=torch.float32).unsqueeze(0)

    with torch.no_grad():
        prediction = model(feature).item()

    return "Wake Word Detected!" if prediction > 0.8 else "No Wake Word Detected."

# Test with an audio sample
sample_audio = "/content/Sample3.wav"
result = predict_wake_word(sample_audio)
print(result)


No Wake Word Detected.
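
The 0.8 cutoff in `predict_wake_word` accepts more missed detections in exchange for fewer false alarms. A small sketch of how precision and recall could be compared across thresholds; the scores and labels below are made-up illustrative values, not outputs of this model, and `precision_recall` is a hypothetical helper:

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when scores above `threshold` count as detections."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical classifier scores and ground-truth labels (1 = wake word)
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.20]
labels = [1, 1, 1, 0, 0, 0]

for threshold in (0.5, 0.8):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On these toy values the higher threshold removes the one false alarm but also misses one genuine wake word, which is the trade-off to weigh when choosing the cutoff.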


An ideal model would need about 8 hours of data (3-4 hours each of wake word and non-wake word audio), and 2-3 hours is the minimum needed to train a model of this kind. The more diverse the dataset, the better the model will generalize, so samples should vary in speaker, speaking style, background noise, and microphone type. Collecting that variety of samples for a custom wake word is challenging.
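
Background-noise variety is one factor that can be approximated synthetically. A minimal sketch of mixing a noise recording into a clip at a chosen signal-to-noise ratio, assuming NumPy arrays at the same sample rate; `mix_at_snr` is a hypothetical helper and the signals below are synthetic stand-ins, not data from this project:

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mix has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=np.float32)
    noise = np.asarray(noise, dtype=np.float32)

    # Repeat or trim the noise so it covers the whole clip
    repeats = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, repeats)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        return speech.copy()  # nothing to scale against

    # SNR(dB) = 10 * log10(speech_power / noise_power)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


# Synthetic stand-ins: a 3-second, 16 kHz tone and 1 second of white noise
rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 220 * np.arange(48000) / 16000).astype(np.float32)
noise = rng.normal(size=16000).astype(np.float32)

noisy = mix_at_snr(clip, noise, snr_db=10)
print(noisy.shape)
```

Augmentation like this does not replace recordings from different speakers and microphones, but it is a cheap way to expose the classifier to noisier conditions.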

In [None]:
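# Evaluate at the 0.8 decision threshold.
# Note: this loop reuses train_loader because no held-out test split was created,
# so the figure it reports is accuracy on the training data, not a true test accuracy.
model.eval()
test_correct = 0
test_total = 0

with torch.no_grad():
    for inputs, labels in train_loader:
        outputs = model(inputs).view(-1)  # flatten (batch, 1, 1) to (batch,)
        predicted = (outputs > 0.8).float()  # same threshold as predict_wake_word

        test_correct += (predicted == labels).sum().item()
        test_total += labels.size(0)

accuracy = test_correct / test_total
print(f"Test Accuracy: {accuracy:.4f}")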


Test Accuracy: 0.5000
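
The evaluation loop iterates over `train_loader`, so it measures performance on data the model has already seen. A minimal sketch of a stratified train/test split over sample indices, using only the standard library; `stratified_split`, the 20% test fraction, and the seed are assumptions for illustration:

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 50 wake word and 50 non-wake word samples, as in the dataset above
labels = [1] * 50 + [0] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The resulting index lists could then be used to build separate datasets, for example `WakeWordDataset(X[train_idx], y[train_idx])` and `WakeWordDataset(X[test_idx], y[test_idx])`, so that the test loader only contains samples held out from training.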
