# Audio Fingerprint I: Build a Demo with Towhee & Milvus

Audio fingerprinting extracts features that represent audio as numbers. Typically, the process cuts the input audio into short clips of fixed length and converts each clip into a fingerprint piece of fixed size. Ordered by timestamp, these pieces together form the complete fingerprint of the input audio.

<img src=fingerprint.png  width=450 align='centre'>
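
To make the clip-and-convert idea concrete, here is a minimal sketch in plain Python. It is not the model used in this tutorial: `toy_fingerprint_piece` is a hypothetical stand-in that averages absolute amplitudes, and the 1-second clip length, the 8 kHz sample rate, and the piece size of 4 are assumptions chosen only for illustration.

```python
import math

def split_into_clips(samples, sample_rate, clip_seconds=1.0):
    """Cut samples into consecutive, non-overlapping clips of fixed length."""
    clip_len = int(sample_rate * clip_seconds)
    n_clips = len(samples) // clip_len  # an incomplete tail is dropped
    return [samples[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]

def toy_fingerprint_piece(clip, size=4):
    """Hypothetical stand-in for a real model: mean absolute amplitude per chunk."""
    chunk = len(clip) // size
    return [
        sum(abs(s) for s in clip[i * chunk:(i + 1) * chunk]) / chunk
        for i in range(size)
    ]

def fingerprint(samples, sample_rate, clip_seconds=1.0):
    """Return (start_time_in_seconds, piece) pairs ordered by timestamp."""
    clips = split_into_clips(samples, sample_rate, clip_seconds)
    return [(i * clip_seconds, toy_fingerprint_piece(c)) for i, c in enumerate(clips)]

# 3 seconds of a synthetic 440 Hz tone sampled at 8 kHz
sample_rate = 8000
samples = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(3 * sample_rate)]
pieces = fingerprint(samples, sample_rate)
print(len(pieces), len(pieces[0][1]))
```

Each clip yields one fixed-size piece, so the length of the full fingerprint grows with the duration of the audio while the size of each piece stays constant.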

With audio fingerprints as identities, a system can recognize music even after various transformations. This tutorial uses [Towhee](https://towhee.io) as the feature extractor and [Milvus](https://milvus.io) as the database to build a simple demo of a music recognition system. It has 4 sections; the last two are optional and cover evaluation and a user interface.

1. Prepare packages, data, and the Milvus service in advance
2. Build the system with wrapped APIs and test with an example
3. Evaluate the system performance over all example data
4. Play it online

## Preparation

We need to install some Python packages, prepare example data, and set up the Milvus service.

### Dependencies

Install the following Python packages with the proper versions. The commands below use pip for installation and then try to import all packages in Python. If installation fails or an unexpected error occurs, please manually install the required packages in your environment.

| package | version |
| -- | -- |
| towhee | 0.8.1 |
| towhee.models | 0.8.1 |
| pymilvus | 2.1.2 |
| ipython | |
| gradio | |

In [1]:
! python -m pip install -q towhee==0.8.1 towhee.models==0.8.1 pymilvus==2.1.2 gradio ipython

In [2]:
import os
import pandas as pd
import statistics

import IPython
import gradio

import towhee
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

### Data

The example data uses a subset of [GTZAN](http://marsyas.info/downloads/datasets.html) as candidates. Each query audio file is derived from a candidate by randomly cropping a 10 s segment and mixing it with random background noise at 2 dB SNR. You can download the data from GitHub with the commands below:

In [3]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/audio_fp.zip -O
! unzip -q -o audio_fp.zip

The data directory `audio_fp` is organized as follows:

- candidates: 100 wav files, 30s each
- queries: 100 wav files, 10s each
- ground_truth.csv: a csv file mapping each query to its answer in candidates, including augmentation information as well

Let's take a quick look:

In [4]:
df = pd.read_csv('audio_fp/ground_truth.csv')
df.head()

| | query | answer | time | snr | reverb |
| -- | -- | -- | -- | -- | -- |
| 0 | audio_fp/queries/q0003_blues.00002_snr2_stairw... | audio_fp/candidates/blues.00002.wav | 10.393288 | 2.0 | stairway4 |
| 1 | audio_fp/queries/q0026_blues.00025_snr2_meetin... | audio_fp/candidates/blues.00025.wav | 11.627528 | 2.0 | meeting |
| 2 | audio_fp/queries/q0034_blues.00033_snr2_meetin... | audio_fp/candidates/blues.00033.wav | 5.264717 | 2.0 | meeting |
| 3 | audio_fp/queries/q0059_blues.00058_snr2_lectur... | audio_fp/candidates/blues.00058.wav | 0.261224 | 2.0 | lecture |
| 4 | audio_fp/queries/q0061_blues.00060_snr2_lectur... | audio_fp/candidates/blues.00060.wav | 10.232834 | 2.0 | lecture |


As observed from the csv, the answer can be derived from the query path. For future use, we define a function to get ground truth given a query path:

In [5]:
def get_gt(query_path):
    filename = query_path.split('/')[-1]
    name = filename.split('_')[1]
    answer = os.path.join('audio_fp', 'candidates', name + '.wav')
    return answer
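
As a quick sanity check of this rule, the sketch below repeats the function so it runs on its own and applies it to one query path from the example data. The intermediate `parts` list is only there to show how the file name is split.

```python
import os

def get_gt(query_path):
    # Same rule as above: the second '_'-separated token is the candidate name
    filename = query_path.split('/')[-1]
    name = filename.split('_')[1]
    return os.path.join('audio_fp', 'candidates', name + '.wav')

query_path = 'audio_fp/queries/q0003_blues.00002_snr2_stairway4.wav'

# Tokens of the file name: query id, candidate name, noise level, reverb type
parts = query_path.split('/')[-1].split('_')
print(parts)
print(get_gt(query_path))
```

Note that the rule relies on candidate names containing no underscore, which holds for the file names shown in this tutorial.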

What does the query segment sound like compared to the original music? Click the play buttons below to listen to an example pair:

In [6]:
example_query = df['query'][0]
example_candidate = df['answer'][0]

IPython.display.display(
    f'example query: {example_query}',
    IPython.display.Audio(example_query),
    f'example answer: {example_candidate}',
    IPython.display.Audio(example_candidate)
)

'example query: audio_fp/queries/q0003_blues.00002_snr2_stairway4.wav'

'example answer: audio_fp/candidates/blues.00002.wav'

### Setup Milvus

The last thing to prepare is Milvus. For more options and detailed instructions, refer to the [Milvus documentation](https://milvus.io/docs/v2.1.x). If you need more help with Milvus, feel free to submit tickets or join the discussion on [Milvus GitHub](https://github.com/milvus-io/milvus).

In [7]:
# Download docker yaml for Milvus standalone
! wget https://github.com/milvus-io/milvus/releases/download/v2.1.1/milvus-standalone-docker-compose.yml -O docker-compose.yml
# Run command below under the same directory as the docker yaml
! docker-compose up -d
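
Before moving on, it can help to confirm that something is listening on the Milvus port. The sketch below uses only the Python standard library and assumes the service runs locally on port 19530, the default used by the `connect` function later in this tutorial. The helper name `is_port_open` is made up for this example, and an open port does not prove that Milvus is fully ready, so treat it as a rough check.

```python
import socket

def is_port_open(host='127.0.0.1', port=19530, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False

print('Milvus port reachable:', is_port_open())
```

If this prints `False` right after starting the containers, waiting a few seconds and trying again is usually worthwhile, since the service takes some time to start.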

## Build System

By this step, you should have successfully installed all packages, downloaded example data, and started Milvus service. Now it's time to build the music recognition system. This section is divided into 3 parts:

1. Prepare APIs for easy use
2. Create a database of archived fingerprints
3. Query with an example segment


### Wrap APIs

We will first prepare some APIs before dealing with actual data. By the end of this section, we will have the core APIs needed to build the system. Before that, it helps to define some supporting functions.

#### Fingerprinting

Towhee makes it easy to build neural data processing pipelines for AI applications. It provides hundreds of models, algorithms, and transformations that can be used as standard pipeline building blocks. You can use Towhee's audio embedding operators to extract features for audio input. If you have any questions about this step, you can visit [Towhee Github](https://github.com/towhee-io/towhee).

In this tutorial, we select the Towhee operator [`audio_embedding.nnfp`](https://towhee.io/audio-embedding/nnfp), which uses a pretrained deep learning model designed for audio retrieval. With the default configuration, it generates one 128-dimensional embedding for each second of audio, without overlap.
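
The arithmetic behind "one embedding per second without overlap" can be sketched as follows. The function name and its `segment_s` and `hop_s` parameters are hypothetical and simply model a 1-second window moved in 1-second steps; the operator itself may treat partial trailing segments differently.

```python
def expected_embeddings(duration_s, segment_s=1.0, hop_s=1.0):
    """Number of full windows of length segment_s when stepping by hop_s."""
    if duration_s < segment_s:
        return 0
    return int((duration_s - segment_s) // hop_s) + 1

per_candidate = expected_embeddings(30)  # 30 s candidate file
per_query = expected_embeddings(10)      # 10 s query file
total_candidates = 100 * per_candidate   # 100 candidate files in the example data

print(per_candidate, per_query, total_candidates)
```

Under these assumptions, a 30 s candidate yields 30 vectors of dimension 128 and the 100 candidates yield 3000 vectors in total.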

In [8]:
def fp(query_pattern):
    fp_res = (
        towhee.glob['path'](query_pattern)
              .audio_decode.ffmpeg['path', 'frames'](batch_size=99999)  # use smaller batch_size to reduce mem
              .flatten['frames']()
              .audio_embedding.nnfp['frames', 'embs']()
              .select['path', 'embs']
              .to_list()
    )
    
    ids = []
    vecs = []
    for x in fp_res:
        x_embs = x.embs.tolist()
        # Extend in place; every embedding keeps the path of its source audio as id
        vecs.extend(x_embs)
        ids.extend([x.path] * len(x_embs))
    return ids, vecs

#### Database

Although Milvus has its own Python SDK, it is convenient to wrap its calls in simpler, purpose-specific functions (create a collection, insert data, search in Milvus).

In [9]:
def milvus_create(collection_name, dim):  
    fields = [
        FieldSchema(name='path', dtype=DataType.VARCHAR, description='path to audio', max_length=500, 
                    is_primary=True, auto_id=False),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='audio embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='audio fingerprints')
    collection = Collection(name=collection_name, schema=schema)

    index_params = {
        'metric_type':'IP',
        'index_type':'IVF_FLAT',
        'params':{"nlist":2048}
    }
    collection.create_index(field_name='embedding', index_params=index_params)
    return collection

def milvus_insert(collection_name, ids, vecs):
    assert utility.has_collection(collection_name)
    collection = Collection(collection_name)
    mr = collection.insert([ids, vecs])
    return mr  # return the insert result so callers can inspect it

def milvus_search(collection_name, query_data, topk=10):
    assert utility.has_collection(collection_name)
    collection = Collection(collection_name)
    collection.load()

    mil_res = collection.search(
        query_data,
        anns_field='embedding',
        param={'metric_type': 'IP', 'params': {'nprobe': 12}},
        limit=topk
    )
    return mil_res
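
To illustrate what a search with the `IP` (inner product) metric returns, here is a brute-force sketch in plain Python. It is not how Milvus works internally: the `IVF_FLAT` index searches only part of the data (controlled by `nprobe`), whereas this toy version scores every stored vector. The helper names and the sample vectors are made up for illustration.

```python
def inner_product(a, b):
    """Sum of element-wise products; larger means more similar under 'IP'."""
    return sum(x * y for x, y in zip(a, b))

def brute_force_search(query_vecs, stored, topk=10):
    """For each query vector, return the topk (id, score) pairs by descending score.

    stored is a list of (id, vector) pairs, where several vectors may share an id.
    """
    results = []
    for q in query_vecs:
        scored = [(pid, inner_product(q, v)) for pid, v in stored]
        scored.sort(key=lambda item: item[1], reverse=True)
        results.append(scored[:topk])
    return results

# Toy collection: two embeddings from 'a.wav' and one from 'b.wav'
stored = [('a.wav', [1.0, 0.0]), ('a.wav', [0.8, 0.6]), ('b.wav', [0.0, 1.0])]
hits = brute_force_search([[1.0, 0.0]], stored, topk=2)
print(hits[0])
```

As with `milvus_search`, the result holds one list of hits per query vector, which is what makes per-segment voting possible.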

#### System

With all the functions above, we can create the APIs needed to build the system:

- `connect`: connect to Milvus & check collection by name
- `insert`: fingerprint all wav files under the source directory & insert all fingerprints into Milvus collection
- `query`: given a piece of music, search the database and return the most likely match in the archive

In [10]:
def connect(collection_name=None, host='127.0.0.1', port='19530'):
    connections.connect(host=host, port=port)
    if collection_name:
        if utility.has_collection(collection_name):
            collection = Collection(collection_name)
            print(f'Collection {collection_name} has {collection.num_entities} data.')
        else:
            print(f'Collection {collection_name} does NOT exist.')

def insert(data_dir, collection_name='nnfp', dim=128):  
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)  # Drop collection if it exists to create a new one
    collection = milvus_create(collection_name, dim)
    ids, vecs = fp(os.path.join(data_dir, '*.wav'))
    mr = milvus_insert(collection_name, ids, vecs)
    collection = Collection(collection_name)
    print(f'{collection.num_entities} embeddings are inserted for {len(set(ids))} audio files.')
    
def query(query_path, collection_name):
    ids, query_vecs = fp(query_path)
    mil_res = milvus_search(collection_name, query_vecs)
    votes = []
    for i in range(len(mil_res)):
        vote = statistics.mode(mil_res[i].ids)
        # print(f"Search results for no.{i+1} segment:", vote)
        votes.append(vote)
    final_vote = statistics.mode(votes)
    return final_vote
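
The two-level vote inside `query` can be tried without Milvus. The sketch below replaces the search results with made-up id lists, one list per query segment; the `vote` helper and the file names are hypothetical.

```python
import statistics

def vote(ids_per_segment):
    """Pick the most common id per segment, then the most common id overall."""
    segment_votes = [statistics.mode(ids) for ids in ids_per_segment]
    return statistics.mode(segment_votes), segment_votes

# Mock top-3 ids returned for three 1-second query segments
mock_results = [
    ['a.wav', 'a.wav', 'b.wav'],
    ['b.wav', 'b.wav', 'a.wav'],
    ['a.wav', 'c.wav', 'a.wav'],
]
final_vote, segment_votes = vote(mock_results)
print(segment_votes, final_vote)
```

One detail worth knowing: since Python 3.8, `statistics.mode` returns the first value encountered when several values are tied, while earlier versions raise `StatisticsError` in that case.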

### Insert data

With APIs ready, we can start to build the database with `insert`. It generates fingerprints and assigns audio paths as fingerprint ids. In other words, embeddings generated from the same audio input will share the same id. Then all embeddings with corresponding ids are inserted into the desired Milvus collection.

<img src=storage.png  width=500 align='centre'>
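
The storage layout amounts to two parallel lists, where each embedding is paired with the path of the audio it came from. A small sketch with made-up file names and 2-dimensional vectors (the real embeddings have 128 dimensions):

```python
def to_rows(embeddings_by_path):
    """Flatten {path: [embedding, ...]} into parallel lists of ids and vectors."""
    ids, vecs = [], []
    for path, embs in embeddings_by_path.items():
        ids.extend([path] * len(embs))  # same id for every embedding of one file
        vecs.extend(embs)
    return ids, vecs

# Hypothetical files: x.wav has two embeddings, y.wav has one
mock = {
    'audio_fp/candidates/x.wav': [[0.1, 0.2], [0.3, 0.4]],
    'audio_fp/candidates/y.wav': [[0.5, 0.6]],
}
ids, vecs = to_rows(mock)
print(ids)
print(vecs)
```

Because ids repeat, counting distinct ids gives the number of audio files, while the length of either list gives the number of embeddings.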

Let's insert the fingerprints of all candidate audio files into the database.

In [11]:
%%time

candidate_dir = 'audio_fp/candidates'
query_dir = 'audio_fp/queries'
collection_name = 'nnfp'
dim = 128

# Connect collection
connect()

# Fingerprint & insert embeddings for all candidates
insert(candidate_dir, collection_name, dim)

3000 embeddings are inserted for 100 audio files.
CPU times: user 3min 35s, sys: 6.26 s, total: 3min 41s
Wall time: 12.4 s


### Query example

With the database ready, you can now find the most similar music in the archive given a piece of audio.

<img src=query.png  width=500 align='centre'>

In [12]:
# Connect collection
connect(collection_name)

# Detect for example audio
final_vote = query(example_query, 'nnfp')
print(f"\nFinal result for {example_query}:", final_vote)

assert os.path.samefile(final_vote, get_gt(example_query))


Final result for audio_fp/queries/q0003_blues.00002_snr2_stairway4.wav: audio_fp/candidates/blues.00002.wav


### Performance

You have already built the music recognition system and tried it with an example audio. But how is the overall performance? Here we measure accuracy over all example data. From the results below, we can tell that detecting a 10 s segment takes about 0.5 s on average (128 CPUs & 1 GPU) and that the overall accuracy reaches 96%.

In [13]:
%%time

performance = (
    towhee.glob['path']('audio_fp/queries/*.wav').stream()
          .runas_op['path', 'result'](func=lambda p: query(p, 'nnfp'))
          .runas_op['path', 'ground_truth'](func=get_gt)
          .with_metrics(['accuracy'])
          .evaluate['ground_truth', 'result']('nnfp')
          .report()
)

| | accuracy |
| -- | -- |
| nnfp | 0.96 |


CPU times: user 39min 29s, sys: 1min 15s, total: 40min 45s
Wall time: 53.7 s
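
The two headline numbers follow from simple arithmetic. The sketch below shows the accuracy definition on a made-up toy list and derives the average time per query from the wall time reported above, assuming all 100 query files were processed in that run.

```python
def accuracy(results, ground_truths):
    """Fraction of queries whose result equals the ground truth."""
    correct = sum(r == g for r, g in zip(results, ground_truths))
    return correct / len(ground_truths)

# Toy example with made-up labels: 3 of 4 predictions are right
toy_accuracy = accuracy(['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'x'])

# Average detection time from the reported wall time
wall_time_s = 53.7
n_queries = 100
avg_time_s = wall_time_s / n_queries

print(toy_accuracy, round(avg_time_s, 3))
```

The average includes decoding, fingerprinting, and searching, so it depends heavily on the hardware mentioned above.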


## Showcase

We can also build a simple demo on top of the music detection system with a user interface provided by Gradio. Of course, you can build your own system with a customized front end, using the APIs above as the backend.

In [14]:
import gradio

def query_function(query_path):
    return query(query_path, 'nnfp')

interface = gradio.Interface(query_function, 
                             gradio.inputs.Audio(type="filepath", source='upload'),
                             gradio.outputs.Audio()
                            )

interface.launch(inline=True, share=True)

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://24373.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces: https://huggingface.co/spaces


(<gradio.routes.App at 0x7fc74daaa0a0>,
 'http://127.0.0.1:7860/',
 'https://24373.gradio.app')

## Next ...

This demo shows how to build a simple audio fingerprint system with basic Towhee & Milvus operations. There are some advanced options and tricks to improve performance, increase efficiency, and save resources. The next notebook will illustrate more detailed examples and optimization methods.