## Preparation

### Install Dependencies

First we need to install dependencies to support operator training and inference.

In [1]:
! python -m pip install torch torchvision torchaudio torchmetrics==0.7.0 towhee "towhee.models>=0.8.0"

### Download dataset
The operator was pre-trained on the [FMA dataset](https://github.com/mdeff/fma). In this tutorial we fine-tune it on the [GTZAN dataset](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification). Besides GTZAN, we also need several datasets that are used to add noise during training: the [Microphone impulse response dataset](http://micirp.blogspot.com/), the [Aachen Impulse Response Database](https://www.iks.rwth-aachen.de/en/research/tools-downloads/databases/aachen-impulse-response-database/), and [AudioSet](https://research.google.com/audioset/download.html). These datasets are all publicly available; please contact us if there are any copyright issues.

We have already preprocessed the data following this [guide](https://github.com/stdio2016/pfann#prepare-dataset), so you only need to download the processed data and metadata.

You can create a folder to store all the downloaded data; it needs about 4 GB of space.

In [2]:
import os
dataset_path = './dataset'
# Create the folder if it does not exist yet; do nothing otherwise.
os.makedirs(dataset_path, exist_ok=True)
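
Before starting the downloads, it may be worth checking that enough disk space is available. The snippet below is a minimal sketch using only the standard library; the helper name `free_gib` is hypothetical, and the 4 GiB threshold is simply the approximate figure quoted above.

```python
import shutil

def free_gib(path="."):
    """Return the free disk space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 2**30

required_gib = 4  # approximate requirement quoted in this notebook
available = free_gib(".")
if available < required_gib:
    print(f"Warning: only {available:.1f} GiB free; about {required_gib} GiB is needed")
else:
    print(f"{available:.1f} GiB free")
```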

#### Download GTZAN dataset
The [GTZAN dataset](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification) contains 1000 tracks, each 30 seconds long. There are 10 genres with 100 tracks each, all stored as 22050 Hz mono 16-bit audio files in .wav format.

In [3]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/gtzan_full.zip -o ./dataset/genres_original.zip
! unzip -q -o ./dataset/genres_original.zip -d ./dataset
! rm ./dataset/genres_original.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 1168M  100 1168M    0     0  4129k      0  0:04:49  0:04:49 --:--:-- 4694k
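
To confirm that an extracted track matches the format described above (22050 Hz, mono, 16-bit), we can read its header with the standard-library `wave` module. This is only a sketch: the helper name `wav_info` is hypothetical, and the example path in the comment assumes the archive keeps the Kaggle folder layout, which is not verified here.

```python
import wave

def wav_info(path):
    """Read basic format fields from the header of a PCM .wav file."""
    with wave.open(path, "rb") as f:
        return {
            "channels": f.getnchannels(),
            "sample_rate": f.getframerate(),
            "bits_per_sample": 8 * f.getsampwidth(),
            "duration_s": f.getnframes() / f.getframerate(),
        }

# Example call (the path is an assumption about the extracted layout):
# print(wav_info("dataset/genres_original/blues/blues.00000.wav"))
```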


#### Download Microphone impulse response dataset
The [Microphone impulse response dataset](http://micirp.blogspot.com/) contains specially recorded microphone impulse responses, which can be used to add noise during training.

In [4]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/micirp.zip -o ./dataset/micirp.zip
! unzip -q -o ./dataset/micirp.zip -d ./dataset
! rm ./dataset/micirp.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  152k  100  152k    0     0  83217      0  0:00:01  0:00:01 --:--:-- 1456k


#### Download Aachen Impulse Response Database
[Aachen Impulse Response Database](https://www.iks.rwth-aachen.de/en/research/tools-downloads/databases/aachen-impulse-response-database/) is a set of impulse responses that were measured in a wide variety of rooms. It can be used for adding noise during training.

In [5]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/AIR_1_4.zip -o ./dataset/AIR_1_4.zip
! unzip -q -o ./dataset/AIR_1_4.zip -d ./dataset
! rm ./dataset/AIR_1_4.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  193M  100  193M    0     0  3854k      0  0:00:51  0:00:51 --:--:-- 4098k
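
An impulse response is usually applied to audio by convolution, which simulates how a room or microphone would color the signal. The sketch below illustrates the idea with NumPy on synthetic data; it is not the augmentation code used by the operator, and the function name `apply_impulse_response` and the synthetic decaying impulse response are assumptions made for illustration.

```python
import numpy as np

def apply_impulse_response(signal, ir):
    """Convolve a mono signal with an impulse response via FFT, keeping the input length."""
    signal = np.asarray(signal, dtype=np.float64)
    ir = np.asarray(ir, dtype=np.float64)
    n = len(signal) + len(ir) - 1          # length of the full linear convolution
    n_fft = 1 << (n - 1).bit_length()      # next power of two >= n
    spectrum = np.fft.rfft(signal, n_fft) * np.fft.rfft(ir, n_fft)
    wet = np.fft.irfft(spectrum, n_fft)[: len(signal)]
    return wet.astype(np.float32)

rng = np.random.default_rng(0)
dry = rng.standard_normal(8000).astype(np.float32)  # 1 s at 8 kHz, stand-in for music
ir = np.exp(-np.arange(400) / 50.0)                  # synthetic decaying IR, not taken from AIR
wet = apply_impulse_response(dry, ir)
print(wet.shape)
```

Truncating the result to the input length keeps segment sizes unchanged, which is convenient when the augmented audio must line up with the original segment.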


#### Download AudioSet
[AudioSet](https://research.google.com/audioset/download.html) is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. We also use it for adding noise during training.

In [6]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/audioset_p1 -o ./dataset/audioset_p1
! curl -L https://github.com/towhee-io/examples/releases/download/data/audioset_p2 -o ./dataset/audioset_p2
! cat ./dataset/audioset_p* > ./dataset/audioset.zip
! unzip -q -o ./dataset/audioset.zip -d ./dataset
! rm ./dataset/audioset_p* ./dataset/audioset.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 1500M  100 1500M    0     0  3133k      0  0:08:10  0:08:10 --:--:-- 4729k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  582M  100  582M    0     0  2620k      0  0:03:47  0:03:47 --:--:-- 2613k
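
Background noise is commonly mixed into a clean signal at a chosen signal-to-noise ratio (SNR). The default configuration shown later in this notebook lists `snr_min` 0 and `snr_max` 10, so the sketch below draws an SNR from that range. It is an illustration on synthetic data under those assumptions, not the operator's actual augmentation code, and the function name `mix_at_snr` is hypothetical.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that the signal-to-noise ratio is `snr_db` dB, then add it."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # A silent noise clip cannot be scaled; return the signal unchanged.
    if noise_power == 0:
        return signal.copy()
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

rng = np.random.default_rng(0)
music = rng.standard_normal(8000)   # stand-in for a 1 s music segment at 8 kHz
noise = rng.standard_normal(8000)   # stand-in for an AudioSet noise segment
snr_db = rng.uniform(0, 10)         # range taken from snr_min / snr_max in the config
noisy = mix_at_snr(music, noise, snr_db)
```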


#### Download data information for training
This archive contains the preprocessed metadata used for training, including the default training configuration and the CSV file lists it refers to.

In [7]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/gtzan_info.zip -o ./gtzan_info.zip
! unzip -q -o ./gtzan_info.zip
! rm ./gtzan_info.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 25286  100 25286    0     0  14237      0  0:00:01  0:00:01 --:--:--  355k


## Fine-tune Audio Embedding with Neural Network Fingerprint operator

### Instantiate operator
We can instantiate a Towhee nnfp operator. This audio embedding operator converts input audio into dense vectors that represent the audio clip's semantics. Each vector represents an audio segment with a fixed length of around 1 s. The operator generates audio embeddings with the fingerprinting method introduced by Neural Audio Fingerprint, which makes it suitable for audio fingerprinting.

In [8]:
import towhee
nnfp_op = towhee.ops.audio_embedding.nnfp().get_op()

In [9]:
test_audio = 'dataset/audioset/B4lyT64WFjc_0.wav'
embedding = nnfp_op(test_audio)
embedding, embedding.shape

(array([[-0.15469249, -0.02260398, -0.05088959, ...,  0.14650534,
          0.04951884, -0.04235527],
        [-0.00608123, -0.06859994, -0.0750239 , ...,  0.0840608 ,
          0.12196919, -0.1123263 ],
        [-0.18665867,  0.08474724,  0.03795987, ...,  0.06031123,
         -0.09239668, -0.08622654],
        ...,
        [ 0.02841254,  0.01915257,  0.02964114, ...,  0.04307787,
         -0.08863434,  0.0016751 ],
        [-0.0166699 ,  0.08893833,  0.05510458, ...,  0.13624884,
          0.03493905, -0.13401009],
        [-0.04592355, -0.07944845,  0.09267115, ...,  0.02575601,
         -0.09419111,  0.03918429]], dtype=float32),
 (10, 128))

### Start training
When initialized, this operator already contains the model with weights trained on the FMA data. The goal of our training is to fine-tune it on another audio dataset domain to better fit the new data distribution. 

We can first look at the default training configuration. 

In [10]:
! cat ./gtzan_info/default_gtzan.json

{
  "train_csv": "gtzan_info/lists/gtzan_train.csv",
  "validate_csv": "gtzan_info/lists/gtzan_valtest.csv",
  "test_csv": "gtzan_info/lists/gtzan_valtest.csv",
  "music_dir": "dataset/genres_original",
  "model_dir": "fma_test",
  "cache_dir": "caches",
  "batch_size": 640,
  "shuffle_size": null,
  "fftconv_n": 32768,
  "sample_rate": 8000,
  "stft_n": 1024,
  "stft_hop": 256,
  "n_mels": 256,
  "f_min": 300,
  "f_max": 4000,
  "segment_size": 1,
  "hop_size": 0.5,
  "time_offset": 1.2,
  "pad_start": 0,
  "epoch": 1,
  "lr": 1e-4,
  "tau": 0.05,
  "noise": {
    "train": "gtzan_info/lists/noise_train.csv",
    "validate": "gtzan_info/lists/noise_val.csv",
    "dir": "dataset/audioset",
    "snr_max": 10,
    "snr_min": 0
  },
  "micirp": {
    "train": "gtzan_info/lists/micirp_train.csv",
    "validate": "gtzan_info/lists/micirp_val.csv",
    "dir": "dataset/micirp",
    "length": 0.5
  },
  "air": {
    "train": "gtzan_info/lists/air_train.csv",

This JSON file contains training settings such as the number of epochs and the batch size, as well as data and model information.  
It also lists several CSV file paths; each CSV file holds the audio file information for the corresponding dataset.  
We only need to pass the path of this JSON file to the `train()` interface to train the operator.
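
If we want to change a setting such as the learning rate without editing the downloaded file, one option is to write a modified copy and pass that path to `train()` instead. The sketch below assumes only that the configuration is a plain JSON object; the helper name `write_config_variant` and the output file name `my_gtzan.json` are hypothetical and not part of Towhee.

```python
import json

def write_config_variant(src_path, dst_path, **overrides):
    """Copy a JSON config, replacing the top-level keys given in `overrides`."""
    with open(src_path) as f:
        config = json.load(f)
    # Reject misspelled keys instead of silently adding new ones.
    unknown = set(overrides) - set(config)
    if unknown:
        raise KeyError(f"keys not present in {src_path}: {sorted(unknown)}")
    config.update(overrides)
    with open(dst_path, "w") as f:
        json.dump(config, f, indent=2)
    return config

# Example usage (hypothetical output file name):
# write_config_variant('./gtzan_info/default_gtzan.json',
#                      './gtzan_info/my_gtzan.json', lr=5e-5)
# nnfp_op.train(config_json_path='./gtzan_info/my_gtzan.json')
```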

In [11]:
nnfp_op.train(config_json_path='./gtzan_info/default_gtzan.json')

loading noise dataset


100%|██████████| 1193/1193 [00:15<00:00, 74.76it/s]


torch.Size([95161077])
loading Aachen IR dataset
loading microphone IR dataset
load cached music from caches/1gtzan_train.bin
training data contains 47200 samples
loading noise dataset


100%|██████████| 299/299 [00:03<00:00, 75.24it/s]


torch.Size([23878784])
loading Aachen IR dataset
loading microphone IR dataset
load cached music from caches/1gtzan_valtest.bin
evaluate before fine-tune...

validate score: 0.805910




  0%|          | 0/148 [00:00<?, ?step/s]


validate score: 0.813064


We fine-tune for only one epoch, because too many epochs lead to overfitting. On the new dataset distribution, the loss has decreased and the validation score (`validate score` in the log) has improved after fine-tuning, from 0.805910 to 0.813064.  
For more training details, refer to [this script](https://towhee.io/audio-embedding/nnfp/src/branch/main/train_nnfp.py). If you want to fine-tune with your own data or training procedure, you can modify it accordingly.
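
The numbers in the training log can be related back to the configuration with a little arithmetic. The sketch below makes two assumptions that are not confirmed by this notebook: segments are cut without padding using `segment_size` and `hop_size`, and each training sample occupies two slots in a batch (an original and an augmented view), so a batch of 640 holds 320 samples. Under those assumptions, a 30-second track yields 59 segments and 47200 samples give 148 steps, which matches the 148-step progress bar. Since 47200 = 800 × 59, this would also be consistent with 800 training tracks, although the size of the training split is not stated here.

```python
import math

# Values copied from the default configuration printed earlier.
cfg = {"batch_size": 640, "segment_size": 1, "hop_size": 0.5}

def segments_per_track(duration_s, segment_size, hop_size):
    """Number of fixed-length windows that fit in a track (no padding assumed)."""
    if duration_s < segment_size:
        return 0
    return int((duration_s - segment_size) // hop_size) + 1

def steps_per_epoch(n_samples, batch_size, views_per_sample=2):
    """Steps per epoch if each sample fills `views_per_sample` batch slots (assumption)."""
    return math.ceil(n_samples / (batch_size // views_per_sample))

print(segments_per_track(30, cfg["segment_size"], cfg["hop_size"]))  # 59
print(steps_per_epoch(47200, cfg["batch_size"]))                      # 148
```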

### Use your fine-tuned weights
To extract embeddings with your trained weights, simply load the trained model weights when instantiating the operator.

In [12]:
new_nnfp_op = towhee.ops.audio_embedding.nnfp(model_path='./fine_tune_output/final_epoch/model.pth').get_op()
embedding = new_nnfp_op(test_audio)
embedding, embedding.shape

(array([[-0.16501123, -0.01201352, -0.0383786 , ...,  0.12018757,
          0.04253541, -0.01914779],
        [ 0.00619547, -0.07101642, -0.06639319, ...,  0.06269935,
          0.12877102, -0.06712442],
        [-0.18207397,  0.09906715,  0.02120742, ...,  0.03872132,
         -0.07399878, -0.07401986],
        ...,
        [ 0.03872605,  0.01387926,  0.04619657, ..., -0.02789425,
         -0.06140808,  0.03822998],
        [ 0.00434632,  0.08006225,  0.05249849, ...,  0.0985365 ,
          0.03861852, -0.10067634],
        [-0.039698  , -0.07496819,  0.09917679, ..., -0.02591049,
         -0.06788268,  0.0777601 ]], dtype=float32),
 (10, 128))

We can observe some differences between the inference output after training and the previous output, but they are small. This shows that we have only fine-tuned the model rather than drastically changing its weights.
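
One way to put a number on "small differences" is the cosine similarity between corresponding rows of the two embedding arrays. The sketch below uses synthetic stand-ins of shape `(10, 128)`; the function name `rowwise_cosine` is hypothetical. To apply it to the real outputs, keep the pre-trained result in its own variable, since the cell above reuses the name `embedding`.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between corresponding rows of two equally shaped 2-D arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dot / norms

rng = np.random.default_rng(0)
before = rng.standard_normal((10, 128))                 # stand-in for the pre-trained embedding
after = before + 0.1 * rng.standard_normal((10, 128))   # stand-in for the fine-tuned embedding
sims = rowwise_cosine(before, after)
print(sims.round(3))
```

Values close to 1 for every segment would support the observation that fine-tuning only nudged the embeddings.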