# Audio Fingerprint II: Music Detection with Temporal Localization

In the previous tutorial, [Audio Fingerprint I: Build a Demo with Towhee & Milvus](https://github.com/towhee-io/examples/blob/main/audio/audio_fingerprint/audio_fingerprint_beginner.ipynb), we learned how to build a simple music detection system with performance evaluation and a user interface. However, that system cannot tell which parts of the query audio match a song in the archive. This tutorial introduces temporal alignment on top of audio fingerprinting, which lets the music detection system localize similar segments between two audio files. Used as an additional postprocessing step, temporal localization can also improve the system's accuracy.

<img src="audio_tn.png" width="450" align="center">

## Preparation

The preparation is essentially the same as in the [previous tutorial](./audio_fingerprint_beginner.ipynb): installing dependencies, downloading the data, and setting up Milvus.

In [1]:
# ! curl -L https://github.com/towhee-io/examples/releases/download/data/audio_fp.zip -O
# ! unzip -q -o audio_fp.zip

In [2]:
import os
import pandas as pd
import statistics

import IPython
import gradio
import glob

import towhee
from towhee import pipe, ops
from towhee.datacollection import DataCollection
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility
from sklearn.metrics import accuracy_score




## Temporal Localization

Temporal localization is the process of finding overlapping parts between two sequences. It is commonly used in video tasks, especially video deduplication. Put simply, the basic idea is to find runs of consecutive similar frames. In machine learning, audio data resembles video data: both are sequences of frames ordered by timestamp. We can therefore apply video temporal localization methods to audio tasks. Towhee provides an operator that we can use as the temporal localization tool: [`video_copy_detection.temporal_network`](https://towhee.io/video-copy-detection/temporal-network).
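To make the idea of "consecutive similar frames" concrete, here is a minimal pure-Python sketch. It is a deliberately simplified illustration, not the algorithm used by the `temporal_network` operator: it only looks for the longest run of frame pairs that advance in lockstep and whose cosine similarity exceeds a threshold. The function names, the toy 2-dimensional frames, and the 0.9 threshold are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def longest_similar_run(seq1, seq2, threshold=0.9):
    """Return (start1, start2, length) of the longest run of consecutive
    frame pairs (i, j), (i+1, j+1), ... with similarity >= threshold."""
    best = (0, 0, 0)
    # prev[j] holds the length of the similar run ending at the previous
    # frame of seq1 and frame j-1 of seq2
    prev = [0] * (len(seq2) + 1)
    for i, f1 in enumerate(seq1, start=1):
        cur = [0] * (len(seq2) + 1)
        for j, f2 in enumerate(seq2, start=1):
            if cosine(f1, f2) >= threshold:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best

# Toy 2-d "frames": the query reappears in the candidate from frame 2 on
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidate = [[-1.0, 0.5], [0.5, -1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
start1, start2, length = longest_similar_run(query, candidate)
print(start1, start2, length)
```

A real temporal localization method has to tolerate noise, gaps, and tempo changes, which this sketch does not attempt.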

Let's apply this operator to an example query audio and its answer. It returns the ranges and scores of the overlapping parts. There can be multiple overlapping parts, so the ranges and scores are each returned as a list.

- `ranges`: [[start_time1, start_time2, end_time1, end_time2]], where `start_time1` and `end_time1` define the range of the overlapping part in the first input file, and `start_time2` and `end_time2` define the range of the corresponding part in the second input file.
- `scores`: [score], where `score` measures the confidence of the corresponding range.
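Because the start and end times of the two inputs are interleaved in each range, it is easy to misread them. The sketch below regroups each range into one interval per input, following the layout described above. The helper name `describe_overlaps` and the sample values are made up for illustration.

```python
def describe_overlaps(ranges, scores):
    """Pair each [start1, start2, end1, end2] range with its score."""
    overlaps = []
    for (start1, start2, end1, end2), score in zip(ranges, scores):
        overlaps.append({
            'first': (start1, end1),    # interval in the first input
            'second': (start2, end2),   # interval in the second input
            'score': score,
        })
    return overlaps

# Made-up values, only to show the layout
ranges = [[0, 5, 4, 9], [2, 20, 6, 24]]
scores = [0.9, 0.3]
overlaps = describe_overlaps(ranges, scores)
for o in overlaps:
    print(f"first {o['first']} <-> second {o['second']} (score {o['score']:.2f})")
```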

In [3]:
emb_pipe = (
	pipe.input('url')
		.map('url', 'frames', ops.audio_decode.ffmpeg())
		.map('frames', 'emb', ops.audio_embedding.nnfp())
)


tn_pipe = (
	pipe.input('src', 'dst')
		.map(('src', 'dst'), ('segment', 'score'), ops.video_copy_detection.temporal_network(min_length=1))
)

In [4]:
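def tn(path1, path2, show=True):
    """Align two audio files; display the result or return (segments, scores)."""
    # Embed both audio files as sequences of fingerprints
    ep = emb_pipe.output('emb')
    emb1, emb2 = ep(path1).get()[0], ep(path2).get()[0]
    # Run temporal localization on the two embedding sequences
    tp = tn_pipe.output('src', 'dst', 'segment', 'score')
    res = tp(emb1, emb2)
    if show:
        DataCollection(res).show()
    else:
        # Output columns are (src, dst, segment, score)
        res = res.get()
        return res[2], res[3]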

In [5]:
df = pd.read_csv('audio_fp/ground_truth.csv')

example_query = df['query'][0]
example_candidate = df['answer'][0]

tn(example_query, example_candidate)



src,dst,segment,score
"[-0.11100991, 0.034402225, 0.046072867, ...] shape=(10, 128)","[0.0070678946, 0.097111076, 0.017180752, ...] shape=(30, 128)","[[0, 10, 9, 19],[0, 16, 8, 29],[0, 0, 9, 3]] len=3","[0.864746438132392,0.23572891099112375,0.35445693135261536] len=3"


The results above tell us that 3 overlapping parts are detected:
1. 0-9s of the example_query is similar to 10-19s of the example_candidate, with a high confidence score of 0.86.
2. 0-8s of the example_query is similar to 16-29s of the example_candidate, with a low confidence score of 0.24.
3. 0-9s of the example_query is similar to 0-3s of the example_candidate, with a low confidence score of 0.35.

Filtering out ranges with low scores, we can conclude that the `example_query` (0-9s of its 10s) is identified as 10-19s in the `example_candidate`.
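The filtering step can be sketched as follows, using the segments and scores from the output shown above. The helper name `filter_overlaps` and the 0.5 threshold are assumptions for this example; the tutorial does not prescribe a specific cutoff.

```python
def filter_overlaps(ranges, scores, min_score=0.5):
    """Keep only ranges whose score is at least min_score, best first."""
    kept = [(r, s) for r, s in zip(ranges, scores) if s >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Values copied from the output shown above
ranges = [[0, 10, 9, 19], [0, 16, 8, 29], [0, 0, 9, 3]]
scores = [0.864746438132392, 0.23572891099112375, 0.35445693135261536]

kept = filter_overlaps(ranges, scores)
for (start1, start2, end1, end2), score in kept:
    print(f'query {start1}-{end1}s <-> candidate {start2}-{end2}s, score {score:.2f}')
```

With this threshold only the first segment survives. A suitable cutoff for other data would need to be tuned.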

Play the example audios and check the similar parts yourself. We can also confirm the overlap using the `time` column in `ground_truth.csv`, which gives the start time of the query within the answer audio.

In [6]:
df.head(1)

Unnamed: 0,query,answer,time,snr,reverb
0,audio_fp/queries/q0079_blues.00078_snr3_stairw...,audio_fp/candidates/blues.00078.wav,10.095964,3.930908,stairway1
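The comparison against the `time` column can also be done programmatically. The sketch below checks whether a detected start time in the candidate lies close to the ground-truth start time. The helper name and the 1-second tolerance are assumptions for illustration. The tolerance reflects that the detected ranges above are whole seconds while `time` is fractional.

```python
def matches_ground_truth(candidate_start, gt_time, tolerance=1.0):
    """True if the detected start is within `tolerance` seconds of gt_time."""
    return abs(candidate_start - gt_time) <= tolerance

# Detected start of the best segment in the candidate (10s) versus the
# `time` value of the first row shown above
ok = matches_ground_truth(10, 10.095964)
print(ok)
```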


In [7]:
IPython.display.display(
    f'example query: {example_query}',
    IPython.display.Audio(example_query),
    f'example answer: {example_candidate}',
    IPython.display.Audio(example_candidate)
)

'example query: audio_fp/queries/q0079_blues.00078_snr3_stairway1.wav'

'example answer: audio_fp/candidates/blues.00078.wav'

## Performance

With temporal alignment as an additional postprocessing step for each query, the music detection system can be more accurate. We will evaluate the system using the temporal alignment function `tn`.

In order to build a system with all example data, you will first need to insert fingerprints of all candidates into a Milvus collection.

In [8]:
HOST = 'localhost'
PORT = '19530'
COLLECTION_NAME = 'nnfp_advanced'
INDEX_TYPE = 'IVF_FLAT'
METRIC_TYPE = 'IP'
DIM = 128
TOPK = 10

# Connect Milvus
connections.connect(host=HOST, port=PORT)

# Create Milvus collection
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, description='embedding ids', is_primary=True, auto_id=True),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='audio embeddings', dim=DIM),
    FieldSchema(name='path', dtype=DataType.VARCHAR, description='audio path', max_length=500)
    ]
schema = CollectionSchema(fields=fields, description='audio fingerprints')

if utility.has_collection(COLLECTION_NAME):
    collection = Collection(COLLECTION_NAME)
    collection.drop() # drop collection if it exists
    
collection = Collection(name=COLLECTION_NAME, schema=schema)

# Create index
index_params = {
    'metric_type': METRIC_TYPE,
    'index_type': INDEX_TYPE,
    'params':{"nlist":2048}
}

collection.create_index(field_name='embedding', index_params=index_params)

Status(code=0, message='')

In [9]:
insert_pipe = (
	pipe.input('path')
		.map('path', 'frames', ops.audio_decode.ffmpeg())
		.flat_map('frames', 'fingerprints', ops.audio_embedding.nnfp())
		.map(('fingerprints', 'path'), 'milvus_res', ops.ann_insert.milvus_client(host=HOST, port=PORT, collection_name=COLLECTION_NAME))
		.output()
)

path = glob.glob('audio_fp/candidates/*.wav')

for i in path:
    insert_pipe(i)

In [10]:
print(f'Total number of embeddings in the collection: {collection.num_entities}')

Total number of embeddings in the collection: 3000


Now that the collection is populated with candidates, we need to define a new query function that adds temporal localization as an extra step compared to the one used in the beginner notebook. Then we evaluate the system on all example queries, measuring accuracy with scikit-learn's `accuracy_score`.

In [11]:
def vote(milvus_res):
    votes = {}
    for res in milvus_res:
        path = res[2]
        score = res[1]
        if path not in votes:
            votes[path] = score
        else:
            votes[path] = votes[path] + score
    votes = sorted(votes.items(), key=lambda item: item[1], reverse=True)
    return votes[0]

def select(query_path, pred, score):
    preds = {}
    for i, j in zip(pred, score):
        if i not in preds:
            preds[i] = j
        else:
            preds[i] += j
    
    final_preds = sorted(preds.items(), key=lambda item: item[1], reverse=True)

    for i in range(len(final_preds)):
        pred_path = final_preds[i][0]
        ranges, scores = tn(query_path, pred_path, False)
        if len(ranges)!=0:
            break 
        
    return final_preds[i][0]
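The aggregate-then-rerank idea behind `vote` and `select` can be tried without Towhee or Milvus. The sketch below uses made-up scores and a stand-in alignment check in place of `tn`. The names `aggregate` and `rerank` are hypothetical. Falling back to the top-ranked candidate when nothing aligns is a choice made for this sketch and is not necessarily how `select` behaves in that case.

```python
from collections import defaultdict

def aggregate(paths, scores):
    """Sum scores per candidate path and rank candidates, best first."""
    totals = defaultdict(float)
    for path, score in zip(paths, scores):
        totals[path] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

def rerank(ranked, has_overlap):
    """Return the first ranked path for which has_overlap(path) is true.

    If no candidate has an overlap, fall back to the top-ranked path
    (a choice made for this sketch).
    """
    for path, _ in ranked:
        if has_overlap(path):
            return path
    return ranked[0][0]

# Hypothetical per-segment votes: 'a.wav' wins on score alone ...
paths = ['a.wav', 'b.wav', 'a.wav', 'b.wav', 'c.wav']
scores = [0.5, 0.75, 0.5, 0.125, 0.25]
ranked = aggregate(paths, scores)

# ... but the stand-in alignment check only finds an overlap for 'b.wav'
aligned = {'b.wav'}
best = rerank(ranked, lambda path: path in aligned)
print(ranked[0][0], best)
```

Here the candidate with the highest summed score is rejected because it has no aligned segment, and the next candidate is returned instead.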

In [12]:
import numpy as np

collection.load()
search_pipe = (
	pipe.input('path')
		.map('path', 'frames', ops.audio_decode.ffmpeg())
		.map('frames', 'emb', ops.audio_embedding.nnfp())
		.flat_map('emb', 'embs', lambda x: x)
		.map('embs', 'milvus_res', ops.ann_search.milvus_client(
										host=HOST,
										port=PORT,
										collection_name=COLLECTION_NAME,
										metric_type=METRIC_TYPE,
										limit=TOPK,
										output_fields=['path']
									))
		.map('milvus_res', ('pred', 'score'), vote)
		.window_all('pred', 'pred', lambda x: x)
		.window_all('score', 'score', lambda x: x)
		.map(('path', 'pred', 'score'), 'result', select)
)

query_pipe = (
	search_pipe.map('result', 'pred_frames', ops.audio_decode.ffmpeg())
		.map('pred_frames', 'pred_emb', ops.audio_embedding.nnfp())
		.map(('emb', 'pred_emb'), ('segment', 'segment_score'), ops.video_copy_detection.temporal_network(min_length=1))
)

result = query_pipe.output('path', 'result')(example_query)
DataCollection(result).show()

path,result
audio_fp/queries/q0079_blues.00078_snr3_stairway1.wav,audio_fp/candidates/blues.00078.wav


In [13]:
%%time
def get_gt(query_path):
    filename = query_path.split('/')[-1]
    name = filename.split('_')[1]
    answer = os.path.join('audio_fp', 'candidates', name + '.wav')
    return answer

eval_pipe = (
    query_pipe.map('path', 'ground_truth', get_gt)
        .output('result', 'ground_truth')
)

path = glob.glob('audio_fp/queries/*.wav')

preds = []
facts = []
for i in path:
    res = eval_pipe(i).get()
    preds.append(res[0])
    facts.append(res[1])

accuracy_score(facts, preds)

CPU times: user 2min 27s, sys: 28.9 s, total: 2min 56s
Wall time: 2min 29s


0.86

With the same example candidates & queries, the system implemented with temporal localization reaches an accuracy of 86%.
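Accuracy here is simply the fraction of queries whose predicted path equals the ground-truth path. When investigating errors, it can help to list the mismatches as well. The sketch below does both in plain Python. The helper name `accuracy_report` and the labels are made up and are not results from this notebook.

```python
def accuracy_report(facts, preds):
    """Return (accuracy, list of (fact, pred) pairs that disagree)."""
    if len(facts) != len(preds):
        raise ValueError('facts and preds must have the same length')
    misses = [(f, p) for f, p in zip(facts, preds) if f != p]
    accuracy = 1 - len(misses) / len(facts) if facts else 0.0
    return accuracy, misses

# Made-up labels, not results from this notebook
facts = ['a.wav', 'b.wav', 'c.wav', 'd.wav']
preds = ['a.wav', 'b.wav', 'x.wav', 'd.wav']
accuracy, misses = accuracy_report(facts, preds)
print(accuracy, misses)
```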

## Explore

We have successfully improved the music detection system's accuracy using temporal localization. As the evaluation above shows, detection takes more time because of the extra temporal localization step. There are further options to increase accuracy, such as modifying model configurations or using different weights, and to save time and resources, such as different inference methods and parallel execution.
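As one possible starting point for parallel execution, the sketch below runs a per-file function over many paths with a thread pool from the Python standard library. It is generic and uses a stand-in function. Whether a Towhee pipeline can safely be called from several threads is not established in this tutorial, so treat that as an assumption to verify before substituting a real query function.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(fn, items, max_workers=4):
    """Apply fn to every item using a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))

# Stand-in for a per-file query function (hypothetical)
def fake_query(path):
    return path.upper()

results = run_parallel(fake_query, ['q1.wav', 'q2.wav', 'q3.wav'])
print(results)
```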