## Audio Fingerprinting

In [1]:
import torch
import torchaudio
import torchaudio.transforms as transforms
from tqdm import tqdm
import warnings
import os
warnings.filterwarnings('ignore')


This code defines a Python class named `AudioFingerprinting` that implements an audio fingerprinting system, using the LibriSpeech dataset built into `torchaudio` for testing.

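The class extracts Mel-frequency cepstral coefficients (MFCCs) from each audio clip, averages them over time into a single feature vector, and identifies similar clips in the database by cosine similarity between these vectors. Here's a brief overview of its functionality:

- `extract_features`: computes 12 MFCCs for a waveform and averages them over time.
- `build_features_database`: stores the features and metadata (speaker, chapter, utterance, filename) for every clip in the dataset.
- `find_top_matches`: ranks database entries by cosine similarity to the input features and returns the top N.
- `print_table`: prints the matches as a formatted table.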
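
To make the matching idea concrete, here is a minimal, dependency-free sketch of the two core steps: averaging per-frame coefficients over time into one vector per clip, then comparing vectors by cosine similarity. The helper names (`mean_pool`, `cosine_similarity`) and the toy numbers are illustrative assumptions and are not part of the notebook, which relies on `torchaudio` and `torch` for the same operations.

```python
import math

def mean_pool(frames):
    """Average a list of per-frame coefficient vectors into one vector."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    return [sum(frame[c] for frame in frames) / n_frames for c in range(n_coeffs)]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "clips" (made-up numbers): each clip is a list of frames,
# and each frame holds 3 coefficients.
clip_a = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
clip_b = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
clip_c = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

fp_a, fp_b, fp_c = mean_pool(clip_a), mean_pool(clip_b), mean_pool(clip_c)
sim_ab = cosine_similarity(fp_a, fp_b)  # pooled vectors are identical
sim_ac = cosine_similarity(fp_a, fp_c)  # pooled vectors point in different directions
```

In this toy data, `clip_a` and `clip_b` differ frame by frame yet pool to the same vector, which illustrates that a time-averaged representation discards temporal order.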


In [2]:
class AudioFingerprinting:
    """
    A class for creating an audio fingerprinting system that builds a database of audio features
    from the LibriSpeech dataset and finds top matches for a given audio input.
    """

    def __init__(self, root_dir, device='cuda'):
        """
        Initializes the audio fingerprinting system.

        Parameters:
        - root_dir: str, the directory where the LibriSpeech dataset will be stored.
        - device: str, the preferred device for computations. Defaults to 'cuda';
          falls back to 'cpu' when CUDA is unavailable.
        """
        self.device = torch.device(device if torch.cuda.is_available() else 'cpu')
        self.dataset = torchaudio.datasets.LIBRISPEECH(root=root_dir, url="dev-clean", download=True)

        # Initialize an empty list to store the database of audio features and metadata.
        self.database = []
        self.mfcc_transform = transforms.MFCC(sample_rate=16000, n_mfcc=12).to(self.device)

    def extract_features(self, waveform):
        """
        Extracts MFCC features from a waveform.

        Parameters:
        - waveform: Tensor, the audio waveform.

        Returns:
        - Tensor, the extracted MFCC features.
        """
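        waveform = waveform.to(self.device)
        # MFCC output has shape (channels, n_mfcc, time); average over the time axis.
        mfcc = self.mfcc_transform(waveform).mean(dim=2)
        # Drop the singleton channel dimension (mono audio) to get a vector of length n_mfcc.
        return mfcc.squeeze()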

    def build_features_database(self):
        """
        Builds a database of features and metadata for each audio file in the dataset.
        """
        for i, data in enumerate(tqdm(self.dataset, desc="Building features database")):
            waveform, _, _, speaker_id, chapter_id, utterance_id = data
            # Retrieve the file name for reference. Note that `_walker` is a private
            # torchaudio attribute and may change between versions.
            file_path = os.path.basename(self.dataset._walker[i])
            features = self.extract_features(waveform)
            # Append the features and metadata to the database.
            self.database.append({
                'features': features.unsqueeze(0),
                'speaker_id': speaker_id,
                'chapter_id': chapter_id,
                'utterance_id': utterance_id,
                'path': file_path
            })

    def find_top_matches(self, input_features, top_n=5):
        """
        Finds the top N matches for a given set of input features in the database.

        Parameters:
        - input_features: Tensor, the features of the input audio to match.
        - top_n: int, the number of top matches to return.

        Returns:
        - List of tuples, each containing the database entry and similarity score for a match.
        """
        input_features = input_features.to(self.device).unsqueeze(0)
        # Calculate the cosine similarity between input features and each entry in the database.
        similarities = [torch.cosine_similarity(input_features, entry['features'], dim=1).item() for entry in self.database]
        # Get indices of the top N matches.
        top_n_indices = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)[:top_n]
        # Retrieve the top matches and their scores.
        top_matches = [(self.database[idx], similarities[idx]) for idx in top_n_indices]

        return top_matches

    @staticmethod
    def print_table(matches):
        """
        Prints a formatted table of the top matches. 

        Parameters:
        - matches: List of tuples, the top matches to print.
        """
        # `default=0` avoids a ValueError when the match list is empty.
        max_path_len = max((len(match['path']) for match, _ in matches), default=0)
        path_col_width = max(max_path_len, 10)  # Ensure a minimum column width for the audio filename.

        # Define the header and row format with appropriate spacing.
        header_format = f"{{:<5}} | {{:^10}} | {{:^8}} | {{:^10}} | {{:^6}} | {{:^{path_col_width}}}"
        row_format = f"{{:<5}} | {{:^10}} | {{:^8}} | {{:^10}} | {{:^6.4f}} | {{:{path_col_width}}}"

        print(header_format.format('Match', 'Speaker', 'Chapter', 'Utterance', 'Score', 'Audio Filename'))
        print("-" * (48 + path_col_width + (5 * 3)))  # Adjust the total length to account for separators.
        
        for i, (match, score) in enumerate(matches, start=1):
            print(row_format.format(i, match['speaker_id'], match['chapter_id'], match['utterance_id'], score, match['path']))

# Example usage
root_dir = './data'
os.makedirs(root_dir, exist_ok=True)
afp = AudioFingerprinting(root_dir)
print('Building database')
afp.build_features_database()

# Take a sample audio clip and extract its features
sample_id = 100
input_waveform, _, _, input_label, input_chapter_id, input_utterance_id = afp.dataset[sample_id]
input_file_path = afp.dataset._walker[sample_id]
print('Extracting Features')
input_features = afp.extract_features(input_waveform)

# Find the top N most similar audio clips
print('Finding Matches')
top_matches = afp.find_top_matches(input_features, top_n=10)
print(f"\nInput Speaker ID: {input_label}, Path: {input_file_path}, Chapter ID: {input_chapter_id}, Utterance ID: {input_utterance_id}")
print("\nTop Matches:")
afp.print_table(top_matches)


Building database


Building features database: 100%|██████████| 2703/2703 [00:18<00:00, 144.20it/s]


Extracting Features
Finding Matches

Input Speaker ID: 1462, Path: 1462-170138-0027, Chapter ID: 170138, Utterance ID: 27

Top Matches:
Match |  Speaker   | Chapter  | Utterance  | Score  |  Audio Filename 
-------------------------------------------------------------------------------
1     |    1462    |  170138  |     27     | 1.0000 | 1462-170138-0027
2     |    1462    |  170145  |     16     | 0.9997 | 1462-170145-0016
3     |    1462    |  170142  |     15     | 0.9997 | 1462-170142-0015
4     |    1462    |  170138  |     6      | 0.9995 | 1462-170138-0006
5     |    1462    |  170138  |     24     | 0.9995 | 1462-170138-0024
6     |    1462    |  170142  |     11     | 0.9994 | 1462-170142-0011
7     |    1462    |  170145  |     22     | 0.9994 | 1462-170145-0022
8     |    1462    |  170142  |     16     | 0.9993 | 1462-170142-0016
9     |    1462    |  170138  |     16     | 0.9993 | 1462-170138-0016
10    |    1462    |  170138  |     3      | 0.9993 | 1462-170138-0003


## Interesting Observations
- All ten top matches belong to the same speaker as the query (Speaker ID: 1462), which suggests that the time-averaged MFCC features capture characteristics of a speaker's voice. This is a single query, and the scores are tightly clustered (0.9993 to 1.0000), so the result is an encouraging indication rather than a measure of overall accuracy.

- The input audio matches itself with a score of 1.0000 as the first result. Because the query clip was taken from the database, this serves as a sanity check: when the exact audio is present, the system ranks it as the most similar.
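
One way to quantify the same-speaker observation is to compute the fraction of returned matches that share the query's speaker, optionally excluding the query's own entry. The sketch below is a hedged illustration: the function name `same_speaker_fraction` is hypothetical, and the toy matches use invented IDs and scores rather than real results. It assumes only the `(entry, score)` structure returned by `find_top_matches`.

```python
def same_speaker_fraction(matches, query_speaker_id, query_path=None):
    """Fraction of matches whose speaker_id equals the query's speaker.

    `matches` is a list of (entry, score) tuples, where `entry` is a dict
    with 'speaker_id' and 'path' keys. If `query_path` is given, the entry
    with that path (the query itself) is excluded before counting.
    """
    kept = [entry for entry, _ in matches if entry['path'] != query_path]
    if not kept:
        return 0.0
    hits = sum(1 for entry in kept if entry['speaker_id'] == query_speaker_id)
    return hits / len(kept)

# Toy matches with invented IDs and scores (not real results).
toy_matches = [
    ({'speaker_id': 1, 'path': 'spk1-utt0'}, 1.00),  # the query itself
    ({'speaker_id': 1, 'path': 'spk1-utt1'}, 0.98),
    ({'speaker_id': 2, 'path': 'spk2-utt0'}, 0.97),  # a different speaker
    ({'speaker_id': 1, 'path': 'spk1-utt2'}, 0.95),
]

with_self = same_speaker_fraction(toy_matches, query_speaker_id=1)
without_self = same_speaker_fraction(toy_matches, query_speaker_id=1, query_path='spk1-utt0')
```

In the notebook, the same function could be applied to `top_matches` with `input_label` as the query speaker and averaged over several values of `sample_id`, which would give a broader picture than a single query.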

