# Recognize Music Genre using Embeddings


## Scenario Introduction

A **music genre classification system** automatically identifies the genre of a piece of music by matching a short snippet against a database of labeled music. Compared to traditional methods based on frequency-domain analysis, embedding vectors generated by 1D convolutional neural networks improve recall and can, in some cases, improve query speed.

A music genre classification system generally transforms audio data into embeddings and measures similarity by the distances between those embeddings. Its main components are therefore an encoder that converts audio to embeddings and a database for vector storage and retrieval.
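To make the idea of distance-based matching concrete, here is a minimal sketch with NumPy. The embeddings, their dimension, and the labels are made-up toy values, not output from the model used later in this tutorial.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings for three known tracks (toy values).
database = np.array([
    [0.9, 0.1, 0.0, 0.0],   # track 0
    [0.0, 0.8, 0.2, 0.0],   # track 1
    [0.1, 0.0, 0.1, 0.9],   # track 2
])
labels = ['rock', 'jazz', 'pop']

# Embedding of an unknown snippet (also a toy value).
query = np.array([0.0, 0.7, 0.3, 0.1])

# Euclidean (L2) distance from the query to every stored embedding.
distances = np.linalg.norm(database - query, axis=1)

# The closest stored embedding decides the predicted genre.
nearest = int(np.argmin(distances))
predicted = labels[nearest]
print(predicted, distances.round(3))
```

In this toy setup the query is closest to track 1, so the snippet is labeled `jazz`. The real system follows the same principle, with the encoder producing the embeddings and the vector database performing the search.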

## Tutorial Overview

Normally, an audio embedding pipeline generates a set of embeddings for a given audio path, and together these embeddings form a unique fingerprint of the input music. Each embedding corresponds to features extracted from one snippet of the input audio. By comparing the embeddings of audio snippets, the system can determine how similar two audio files are. The image below illustrates music fingerprinting with audio embeddings.

<img src="./music_embedding.png" width = "60%" height = "60%" align=center />
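The sketch below shows one way an audio signal could be cut into snippets before embedding. It is an illustration only: the 16 kHz sample rate, the 0.96-second window, and the helper `split_into_windows` are assumptions made for this example, not values read from the pipeline used in this tutorial.

```python
import numpy as np

SAMPLE_RATE = 16000      # assumed sample rate in Hz
WINDOW_SECONDS = 0.96    # assumed window length per embedding

def split_into_windows(samples, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Cut a 1-D signal into non-overlapping windows, dropping the incomplete tail."""
    window_size = int(round(sample_rate * window_seconds))
    n_windows = len(samples) // window_size
    return samples[:n_windows * window_size].reshape(n_windows, window_size)

# A synthetic 30-second signal stands in for a decoded audio file.
signal = np.zeros(30 * SAMPLE_RATE, dtype=np.float32)
windows = split_into_windows(signal)
print(windows.shape)
```

Under these assumptions a 30-second clip yields 31 windows, and therefore 31 embeddings. If each of the 300 training clips is about 30 seconds long, that would give 300 × 31 = 9300 vectors, which is consistent with the entity count reported after insertion later in this tutorial.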


A block diagram for a basic music genre classification system is shown in images below. The first image illustrates how the system transforms a music dataset to vectors with [Towhee](https://github.com/towhee-io/towhee) and then inserts all vectors into [Milvus](https://github.com/milvus-io/milvus). The second image shows the querying process of an unknown music snippet.

<img src="./music_recog_system.png"  width = "60%" height = "60%" align=center />

Building a music genre classification system typically involves the following steps:

1. Model and pipeline selection
2. Computing embeddings for the existing music dataset
3. Insert all generated vectors into a vector database
4. Identify an unknown music snippet by similarity search of vectors

In the upcoming sections, we will first walk you through some of the prep work for this tutorial. After that, we will elaborate on each of the four steps mentioned above.

## Preparation

First, we need to install Python packages, download example data, and prepare Milvus.

### Install packages
Make sure you have installed the required Python packages:

In [1]:
! python -m pip install -q towhee towhee.models gradio scikit-learn

### Download dataset

This tutorial uses a subset of [GTZAN](http://marsyas.info/downloads/datasets.html). You can download it via [Github](https://github.com/towhee-io/examples/releases/download/data/gtzan300.zip).

The data is organized as follows:

- train: candidate music, 10 classes, 30 audio files per class (300 in total)
- test: query music clips, same 10 classes as train data, 3 audio files per class (30 in total)
- gtzan300.csv: a CSV file containing an id, path, and label for each audio file in the train data

In [2]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/gtzan300.zip -O
! unzip -q -o gtzan300.zip

Let's take a quick look at the data and prepare an id-label dictionary for future use:

In [3]:
import pandas as pd

df = pd.read_csv('./gtzan300.csv')
id_label = df.set_index('id')['label'].to_dict()

df.head(3)

Unnamed: 0,id,path,label
0,0,./train/hiphop/hiphop.00096.wav,hiphop
1,1,./train/hiphop/hiphop.00024.wav,hiphop
2,2,./train/hiphop/hiphop.00015.wav,hiphop


### Start Milvus

Before getting started with the system, we also need to prepare Milvus in advance. Milvus is an open-source vector database built to power embedding similarity search and AI applications. More info about Milvus is available [here](https://github.com/milvus-io/milvus).

Please make sure that you have started a [Milvus service](https://milvus.io/docs/install_standalone-docker.md). This notebook uses [milvus 2.2.10](https://milvus.io/docs/v2.2.x/install_standalone-docker.md) and [pymilvus 2.2.11](https://milvus.io/docs/release_notes.md#2210).

In [None]:
! python -m pip install -q pymilvus==2.2.11

Here we prepare a function to work with a Milvus collection with the following parameters:

- [L2 Distance](https://milvus.io/docs/metric.md#Euclidean-distance-L2)
- [IVF-Flat Index](https://milvus.io/docs/index.md#IVF_FLAT)

In [4]:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

connections.connect(host='localhost', port='19530')

def create_milvus_collection(collection_name, dim):    
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)
    
    fields = [
        FieldSchema(name='id', dtype=DataType.VARCHAR, description='ids', max_length=500, is_primary=True, auto_id=False),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='audio classification')
    collection = Collection(name=collection_name, schema=schema)

    # create IVF_FLAT index for collection.
    index_params = {
        'metric_type':'L2',
        'index_type':"IVF_FLAT",
        'params':{"nlist": 400}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    return collection

In [5]:
collection = create_milvus_collection('vggish', 128)
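To give some intuition for what an IVF-style index does with the `nlist` parameter, here is a toy sketch in NumPy. It is not how Milvus is implemented: the centroids are simply sampled from the data instead of being trained with k-means, and the names `ivf_search` and `nprobe` as used here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 8)).astype(np.float32)   # toy database
query = vectors[42] + 0.01                                # a query close to vector 42

# "Training": pick nlist centroids (here simply sampled from the data)
# and assign every vector to its closest centroid.
nlist = 16
centroids = vectors[rng.choice(len(vectors), size=nlist, replace=False)]
assignment = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1
)

def ivf_search(query, nprobe=4, topk=3):
    """Search only the nprobe buckets whose centroids are closest to the query."""
    closest_buckets = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.flatnonzero(np.isin(assignment, closest_buckets))
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    order = np.argsort(dists)[:topk]
    return candidates[order], dists[order]

ids, dists = ivf_search(query)
print(ids, dists)
```

Probing fewer buckets means fewer distance computations but a higher chance of missing a true neighbor that landed in an unprobed bucket. Setting `nprobe` equal to `nlist` makes this toy search exhaustive.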

## Build System

Now we are ready to build a music genre classification system. We will select models, generate & save embeddings, and then perform a query example.

### 1. Model and pipeline selection

The first step in building a music genre classification system is selecting an appropriate embedding model and one of its associated pipelines. Within Towhee, all operators can be found on the [Towhee hub](https://towhee.io/tasks/operator). Clicking on any of the categories lists the operators available for that task; selecting `audio-embedding` reveals all audio embedding operators that Towhee offers. We also provide a summary of popular audio embedding pipelines [here](https://docs.towhee.io/pipelines/audio-embedding).

Resource requirements, accuracy, and inference latency are the key trade-offs when selecting a pipeline. Towhee provides a multitude of pipelines to meet various application demands. For demonstration purposes, we will use [vggish](https://towhee.io/audio-embedding/vggish).

### 2. Generating embeddings for the existing music dataset

With the operators selected, the next step is generating audio embeddings for our music dataset. Each audio path goes through the pipeline, which outputs a set of vectors.

In [6]:
from towhee import pipe, ops
from towhee.datacollection import DataCollection
from csv import reader
import numpy as np


# Please note the first time run will take time to download model and other files.

emb_pipe = (
    pipe.input('id', 'path', 'label')
        .map('path', 'frames', ops.audio_decode.ffmpeg())
        .map('frames', 'frames', lambda x: [i[0] for i in x])
        .flat_map('frames', 'vec', ops.audio_embedding.vggish())
        .map('vec', 'vec', ops.towhee.np_normalize())
)


- `ops.audio_decode.ffmpeg()`: a built-in Towhee operator that reads audio as frames ([learn more](https://towhee.io/audio-decode/ffmpeg))
- `ops.audio_embedding.vggish()`: a built-in Towhee operator that applies a pretrained VGGish model to audio frames to generate audio embeddings ([learn more](https://towhee.io/audio-embedding/vggish))
- `ops.towhee.np_normalize()`: a built-in Towhee operator that normalizes the embedding results
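One reason normalization is useful: once vectors have unit length, ranking by L2 distance is equivalent to ranking by cosine similarity. The sketch below checks this identity with NumPy. It assumes that normalization means scaling each vector to unit L2 norm, and `l2_normalize` is a hypothetical helper written for this example rather than the Towhee operator itself.

```python
import numpy as np

def l2_normalize(vec):
    """Scale a vector to unit length (assumed to be what normalization does here)."""
    return vec / np.linalg.norm(vec)

rng = np.random.default_rng(1)
a = l2_normalize(rng.normal(size=128))
b = l2_normalize(rng.normal(size=128))

squared_l2 = np.sum((a - b) ** 2)
cosine = np.dot(a, b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(squared_l2, 2 - 2 * cosine))
```

Because the squared distance decreases exactly as the cosine similarity increases, the nearest neighbors under the L2 metric are also the most similar vectors by angle.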

### 3. Insert all generated embedding vectors into a vector database

While brute-force computation of distances between queries and all audio vectors is perfectly fine for small datasets, scaling to billions of music dataset items requires a production-grade vector database that utilizes a search index to greatly speed up the query process. Here, we'll insert vectors computed in the previous section into a Milvus collection.

In [7]:
insert_pipe = (
    emb_pipe.map(('id', 'vec'), 'insert_status',ops.ann_insert.milvus_client(host='127.0.0.1', port='19530', collection_name='vggish'))
        .output()
)

with open('gtzan300.csv') as f:
    freader = reader(f)
    next(freader)
    for row in freader:
        res = insert_pipe(*row)



In [8]:
print(f'Total inserted data:{collection.num_entities}')

Total inserted data:9300


### 4. Identify an unknown music snippet by similarity search of vectors

We can use the same pipeline to generate a set of vectors for a query audio. Then searching across the collection will find the closest embeddings for each vector in the set. 

Our strategy is to find the most likely genre for each frame and then, from the predictions for all frames, determine the most likely genre of the query audio file.
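This two-level vote can be sketched in plain Python. The neighbor labels below are hypothetical values chosen for illustration; in the real pipeline they come from the vector search results.

```python
from statistics import mode

# Hypothetical search results: for each frame of the query audio,
# the genre labels of its nearest neighbors (3 per frame here for brevity).
frame_neighbors = [
    ['pop', 'pop', 'disco'],
    ['disco', 'pop', 'pop'],
    ['rock', 'rock', 'pop'],
    ['pop', 'disco', 'pop'],
]

# Step 1: each frame votes for the most common genre among its neighbors.
frame_votes = [mode(labels) for labels in frame_neighbors]

# Step 2: the audio file gets the most common genre among the frame votes.
prediction = mode(frame_votes)
print(frame_votes, prediction)
```

Three of the four frames vote `pop`, so the file is labeled `pop` even though one frame disagrees. Note that on Python 3.8 and later `statistics.mode` returns the first value encountered when there is a tie, while earlier versions raise `StatisticsError`.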

The following example recognizes music genres for each audio under `./test/pop/*`.

In [9]:
import glob
from statistics import mode
collection.load()

search_pipe = (
    pipe.input('path')
        .map('path', 'frames', ops.audio_decode.ffmpeg())
        .map('frames', 'frames', lambda x: [i[0] for i in x])
        .flat_map('frames', 'vec', ops.audio_embedding.vggish())
        .map('vec', 'vec', ops.towhee.np_normalize())
        .map('vec', 'results', ops.ann_search.milvus_client(host='127.0.0.1', port='19530', collection_name='vggish', limit=10, metric_type='L2'))
        .map('results', 'results', lambda x: [id_label[int(i[0])] for i in x])
        .window_all('results', 'predict', lambda x: mode([mode(i) for i in x]))
)

query_pipe = search_pipe.output('path', 'predict')

path = glob.glob('./test/pop/*')
for i in path:
    res = query_pipe(i)
    DataCollection(res).show()

path,predict
./test/pop/pop.00053.wav,pop


path,predict
./test/pop/pop.00091.wav,pop


path,predict
./test/pop/pop.00054.wav,pop


## Evaluation

We have just built a music genre classification system, but how well does it perform? We can evaluate the system against the ground truths. Here we use `accuracy` as the metric, measured on the example test data of 30 audio files.

In [10]:
eval_pipe = (
    search_pipe.map('path', 'ground_truth', lambda x: x.split('/')[-2])
        .output('path', 'predict', 'ground_truth')
)


predicts = []
facts = []
path = glob.glob('./test/*/*')
for i in path:
    res = eval_pipe(i).get()
    predicts.append(res[1])
    facts.append(res[2])

In [11]:
from sklearn.metrics import accuracy_score

accuracy_score(facts, predicts)

0.7666666666666667

From the test above, we can tell that the accuracy of this basic music genre classification system is about 77%. For a production solution, you can build a more sophisticated system to improve performance. For example, Towhee provides more model options as well as APIs to optimize execution.
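An accuracy of 0.7667 on 30 files corresponds to 23 correct predictions. A single number hides which genres are confused, so a per-genre breakdown can help decide where to improve. The sketch below computes both in plain Python on made-up toy lists (`toy_facts` and `toy_predicts` are hypothetical and are not the results of this tutorial).

```python
from collections import Counter

# Hypothetical ground truths and predictions (toy values, not the tutorial's results).
toy_facts    = ['pop', 'pop', 'rock', 'rock', 'jazz', 'jazz']
toy_predicts = ['pop', 'disco', 'rock', 'rock', 'jazz', 'blues']

# Overall accuracy: share of files whose predicted genre equals the ground truth.
correct = sum(f == p for f, p in zip(toy_facts, toy_predicts))
accuracy = correct / len(toy_facts)

# Per-genre accuracy shows which genres are misclassified most often.
totals = Counter(toy_facts)
hits = Counter(f for f, p in zip(toy_facts, toy_predicts) if f == p)
per_genre = {genre: hits[genre] / totals[genre] for genre in totals}

print(round(accuracy, 4), per_genre)
```

The same computation can be applied to the `facts` and `predicts` lists collected in the evaluation loop above.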

## Release a Showcase

We've just built a music genre classification system and tested its performance. Now it's time to add an interface and release a showcase. We wrap the query pipeline in a plain Python function, `music_classify`, and build a quick demo around it with [Gradio](https://gradio.app/).

In [12]:
import gradio

def music_classify(path):
    query_pipe = (
        pipe.input('path')
            .map('path', 'frames', ops.audio_decode.ffmpeg())
            .map('frames', 'frames', lambda x: [i[0] for i in x])
            .flat_map('frames', 'vec', ops.audio_embedding.vggish())
            .map('vec', 'vec', ops.towhee.np_normalize())
            .map('vec', 'results', ops.ann_search.milvus_client(host='127.0.0.1', port='19530', collection_name='vggish', limit=10, metric_type='L2'))
            .map('results', 'results', lambda x: [id_label[int(i[0])] for i in x])
            .window_all('results', 'predict', lambda x: mode([mode(i) for i in x]))
            .output('predict')
    )

    res = query_pipe(path).get()[0]
    return res

collection.load()
interface = gradio.Interface(music_classify, 
                             inputs=gradio.Audio(source='upload', type='filepath'),
                             outputs=gradio.Textbox(label="Music Genre")
                            )

interface.launch(inline=True, share=True)

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://ec6e6de541b21311cd.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces


