# Real-time music auto-tagging
In this tutorial, we use Essentia's TensorFlow integration to perform auto-tagging in real time.
It also serves as an example of TensorFlow inference in streaming mode and can easily be adapted to work offline.


## Setup
To install Essentia with TensorFlow support, follow the [Setup](https://essentia.upf.edu/tutorial_tensorflow_auto-tagging_classification_embeddings.html#setup) section of our previous *Music auto-tagging, classification, and embedding extraction* tutorial.

Additionally, we rely on the `soundcard` package (imported below as `sc`) to capture the audio loopback of the system and feed it to Essentia in real time. This way we can easily test our models with any music coming from our local player or browser.

In [9]:
#!pip -q install soundcard

Let's download `MusiCNN`, one of our auto-tagging models, together with the TempoCNN tempo estimation models loaded later in this notebook. These and more models are available from the [Essentia models](https://essentia.upf.edu/models/) site.

In [10]:
# !wget -q https://essentia.upf.edu/models/autotagging/msd/msd-musicnn-1.pb
# !wget -q https://essentia.upf.edu/models/autotagging/msd/msd-musicnn-1.json
# !wget -q https://essentia.upf.edu/models/tempo/tempocnn/deepsquare-k16-3.json
# !wget -q https://essentia.upf.edu/models/tempo/tempocnn/deepsquare-k16-3.pb
# !wget -q https://essentia.upf.edu/models/tempo/tempocnn/deeptemp-k4-3.json
# !wget -q https://essentia.upf.edu/models/tempo/tempocnn/deeptemp-k4-3.pb

Then we import the required packages and Essentia algorithms.
In this case, we use the TensorFlow functionalities in streaming mode.


In [11]:
import json

from essentia.streaming import (
    VectorInput,
    FrameCutter,
    TensorflowInputMusiCNN,
    VectorRealToTensor,
    TensorToPool,
    TensorflowPredict,
    PoolToTensor,
    TensorToVectorReal,
    TempoCNN,
    TensorflowInputTempoCNN,
    TensorflowPredictTempoCNN,
    FrameToReal
)
from essentia import Pool, run, reset
from IPython import display
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import softmax
import soundcard as sc
from datetime import datetime

%matplotlib nbagg

Define the analysis parameters.
To make this demo work in real time, we tweaked some of the analysis parameters of `MusiCNN`.
While it was trained on patches of size 187 (\~3 seconds), we set `patch_size` to 64 (\~1 second) to increase the prediction rate.
You can experiment with the `patch_size` and `display_size` parameters to adjust the prediction rate to your taste.
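As a rough check of the durations quoted above, the time covered by one patch is `patch_size * hop_size / sample_rate`. The sketch below assumes the `MusiCNN` front end works on 16 kHz audio with a hop of 256 samples; those two values are assumptions not stated in this section, and `patch_seconds` is a hypothetical helper introduced only for illustration.

```python
def patch_seconds(patch_size, hop_size=256, sample_rate=16000):
    """Duration in seconds covered by one patch of spectrogram frames.

    The default hop size and sample rate are assumed values for MusiCNN.
    """
    return patch_size * hop_size / sample_rate


training_patch = patch_seconds(187)  # patch size used for training
realtime_patch = patch_seconds(64)   # patch size used in this demo

print(f"training patch:  {training_patch:.2f} s")
print(f"real-time patch: {realtime_patch:.2f} s")
```

A smaller `patch_size` therefore yields predictions more often, at the cost of giving the model less context than it saw during training.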

In [12]:
with open('deeptemp-k4-3.json', 'r') as json_file:
    metadata = json.load(json_file)

model_file = 'deeptemp-k4-3.pb'
input_layer = metadata['schema']['inputs'][0]['name']
output_layer = metadata['schema']['outputs'][0]['name']

# Analysis parameters.
sample_rate = 11025
frame_size = 1024 
hop_size = 256
n_bands = 40
patch_size = 64
display_size = 10

# Samples captured and processed per callback: eight patches' worth of hops.
buffer_size = patch_size * hop_size * 8
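
For reference, the buffer length can be converted to seconds of audio. This is only arithmetic on the parameter values above; `buffer_seconds` is a name introduced for illustration.

```python
sample_rate = 11025
hop_size = 256
patch_size = 64

# Same value as in the cell above: 64 * 256 * 8 samples.
buffer_size = patch_size * hop_size * 8
buffer_seconds = buffer_size / sample_rate

print(f"{buffer_size} samples = {buffer_seconds:.2f} s of audio per buffer")
```

Since each buffer has to be recorded in full before it is processed, this should be close to the interval between consecutive predictions, which is consistent with the roughly 12-second spacing of the timestamps in the output at the end of this tutorial.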

Instantiate the algorithms. This builds a network similar to the one used inside `TensorflowPredictMusiCNN`, the wrapper algorithm presented in the previous tutorial. However, instantiating the algorithms separately gives us the additional control required for real-time usage.

In [13]:
buffer = np.zeros(buffer_size, dtype='float32')
vimp = VectorInput(buffer)
fc = FrameCutter(frameSize=frame_size, hopSize=hop_size)
tim = TensorflowInputTempoCNN()
vtt = VectorRealToTensor(shape=[1, 1, patch_size, n_bands],
                         lastPatchMode='discard')
ttp = TensorToPool(namespace=input_layer)
# tfp = TensorflowPredict(graphFilename=model_file,
#                         inputs=[input_layer],
#                         outputs=[output_layer])
tfp = TensorflowPredictTempoCNN(graphFilename=model_file,
                                input=input_layer,
                                output=output_layer)
ptt = PoolToTensor(namespace=output_layer)
ttv = TensorToVectorReal()
ftr = FrameToReal(frameSize=frame_size,
                  hopSize=hop_size)
pool = Pool()

[   INFO   ] Successfully loaded graph file: `deeptemp-k4-3.pb`


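Connect the algorithms. Only the last two lines of the cell are active: the input buffer feeds `TensorflowPredictTempoCNN` directly, and its predictions are stored in the `Pool`. The commented-out blocks are earlier variants, including the tagging network that also stores the mel-spectrograms in the `Pool` for visualization purposes.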

In [14]:
### Tagging
# vimp.data   >> fc.signal
# fc.frame    >> tim.frame
# tim.bands   >> vtt.frame
# tim.bands   >> (pool, 'melbands')
# vtt.tensor  >> ttp.tensor
# ttp.pool    >> tfp.poolIn
# tfp.poolOut >> ptt.pool
# ptt.tensor  >> ttv.tensor
# ttv.frame   >> (pool, output_layer)

### not working bpm
# vimp.data   >> fc.signal
# fc.frame    >> tim.frame
# tim.bands   >> vtt.frame
# tim.bands   >> (pool, 'melbands')
# vtt.tensor  >> ttv.tensor
# ttv.frame   >> ftr.signal
# ftr.signal  >> tfp.signal
# tfp.predictions >> (pool, output_layer)

# vimp.data   >> fc.signal
# fc.frame    >> ftr.signal
# ftr.signal  >> tfp.signal
# tfp.predictions >> (pool, output_layer)

vimp.data   >> tfp.signal
tfp.predictions >> (pool, output_layer)

Define the callback that runs the network on each captured buffer and prints the result to the console. The loopback stream itself is processed in the next cell.

In [15]:
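# BPM labels indexed by the position of the strongest activation.
bpms = list(range(1, 256))

def callback_console(data):
    # Copy the new audio into the buffer that VectorInput was created with.
    buffer[:] = data.flatten()

    # Generate predictions.
    reset(vimp)
    run(vimp)

    # Debug output: whether any prediction has been stored yet.
    print(pool.containsKey(output_layer))

    if pool.containsKey(output_layer):
        # Take the most activated class of the latest prediction.
        index_max = np.argmax(softmax(20 * pool[output_layer][-1, :].T))
        print(datetime.now())
        print(bpms[index_max])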
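
A note on the `softmax(20 * ...)` step in the callback: softmax is monotonic, so scaling the activations and applying it sharpens the distribution but does not change which index `argmax` selects. The pure-Python sketch below illustrates this with made-up activations; the values and the local `softmax` and `argmax` helpers are illustrative and are neither model output nor the SciPy and NumPy functions used above.

```python
import math


def softmax(values, scale=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [scale * v for v in values]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


activations = [0.05, 0.10, 0.60, 0.25]  # hypothetical activations

raw_index = argmax(activations)
sharpened = softmax(activations, scale=20)
sharp_index = argmax(sharpened)

print(raw_index, sharp_index)
```

The scaled softmax is therefore mostly useful when the sharpened values themselves are displayed, for example in a plot; for picking a single class, `argmax` on the raw activations gives the same index.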



In [16]:
pool.clear()


# Capture and process the speakers loopback.
with sc.all_microphones(include_loopback=True)[0].recorder(samplerate=sample_rate) as mic:
    while True:
        callback_console(mic.record(numframes=buffer_size).mean(axis=1))

2022-02-14 14:33:48.633739: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3194045000 Hz


True
2022-02-14 14:33:48.717349
93
True
2022-02-14 14:34:00.791557
94
True
2022-02-14 14:34:12.369764
96
True
2022-02-14 14:34:24.372459
127
True
2022-02-14 14:34:36.277556
127


KeyboardInterrupt: 
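
The two array operations used in the capture loop and callback above can be tried without a sound card. The sketch below uses a synthetic stereo block in place of `mic.record(...)` (its contents and size are made up). It shows that `.mean(axis=1)` downmixes a `(frames, channels)` array to mono, and that `buffer[:] = ...` overwrites the existing array in place instead of rebinding the name, which matters when another object was created from that same array.

```python
import numpy as np

n_frames = 8  # tiny block size, for illustration only
buffer = np.zeros(n_frames, dtype='float32')
alias = buffer  # stands in for any object holding a reference to the array

# Synthetic stereo block shaped (frames, channels).
stereo = np.stack([np.full(n_frames, 0.5), np.full(n_frames, -0.1)], axis=1)

# Downmix to mono by averaging the channels.
mono = stereo.mean(axis=1)

# In-place copy: the array object stays the same, only its contents change.
buffer[:] = mono.flatten()

print(mono.shape, alias is buffer, alias[0])
```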