# Real-time music auto-tagging
In this tutorial, we use Essentia's TensorFlow integration to perform auto-tagging in real-time.
Additionally, this serves as an example of TensorFlow inference in streaming mode and can be easily adapted to work offline.


## Setup
To install Essentia with TensorFlow support, refer to the [Setup](https://essentia.upf.edu/tutorial_tensorflow_auto-tagging_classification_embeddings.html#setup) section of our previous *Music auto-tagging, classification, and embedding extraction* tutorial for instructions.

Additionally, we rely on the `soundcard` package (imported below as `sc`) to capture the system's audio loopback and feed it to Essentia in real time. This way we can easily test our models with any music playing in a local player or browser.

In [1]:
#!pip -q install soundcard

Let's download `MusiCNN`, one of our auto-tagging models, along with two `TempoCNN` tempo estimation models (`deepsquare-k16-3` and `deeptemp-k4-3`). These and more models are available from the [Essentia models](https://essentia.upf.edu/models/) site.

In [2]:
# !wget -q https://essentia.upf.edu/models/autotagging/msd/msd-musicnn-1.pb
# !wget -q https://essentia.upf.edu/models/autotagging/msd/msd-musicnn-1.json
# !wget -q https://essentia.upf.edu/models/tempo/tempocnn/deepsquare-k16-3.json
# !wget -q https://essentia.upf.edu/models/tempo/tempocnn/deepsquare-k16-3.pb
# !wget -q https://essentia.upf.edu/models/tempo/tempocnn/deeptemp-k4-3.json
# !wget -q https://essentia.upf.edu/models/tempo/tempocnn/deeptemp-k4-3.pb

Then we import the required packages and Essentia algorithms.
In this case, we use the TensorFlow functionalities in streaming mode.


In [3]:
import json

from essentia.streaming import (
    VectorInput,
    RhythmExtractor2013,
    MonoWriter,
    MonoLoader,
    AudioOnsetsMarker
)
from essentia import Pool, run, reset
from IPython import display
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import softmax
import soundcard as sc
from datetime import datetime


Define the analysis parameters.
To make this demo work in real time, we tweaked some of the analysis parameters of `MusiCNN`.
While it was trained on patches of size 187 (\~3 seconds), we set `patch_size` to 64 (\~1 second) to increase the prediction rate.
You can experiment with the `patch_size` and `display_size` parameters to modify the prediction rate to your taste.

In [4]:
with open('deeptemp-k4-3.json', 'r') as json_file:
    metadata = json.load(json_file)

model_file = 'deeptemp-k4-3.pb'
input_layer = metadata['schema']['inputs'][0]['name']
output_layer = metadata['schema']['outputs'][0]['name']

# Analysis parameters.
sample_rate = 44100
frame_size = 1024
hop_size = 256
n_bands = 40
patch_size = 64
display_size = 10

# Capture 20 patches per clip: 20 * 64 * 256 = 327680 samples (about 7.4 s at 44.1 kHz).
buffer_size = 20 * patch_size * hop_size
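
As a quick check of the arithmetic behind these parameters, the sketch below converts frame counts to seconds. The helper `frames_to_seconds` is a hypothetical name introduced only for illustration. The patch durations quoted above (about 3 seconds for 187 frames, about 1 second for 64) are consistent with a 16 kHz sample rate; at the 44.1 kHz set in this cell, the same 64-frame patch covers less time.

```python
def frames_to_seconds(n_frames, hop_size, sample_rate):
    """Duration in seconds covered by n_frames hops (hypothetical helper)."""
    return n_frames * hop_size / sample_rate


hop_size = 256
patch_size = 64

# Length of one captured clip (20 patches) at the 44.1 kHz capture rate.
clip_seconds = frames_to_seconds(20 * patch_size, hop_size, 44100)

# Duration of a single 64-frame patch at two different sample rates.
patch_seconds_44k = frames_to_seconds(patch_size, hop_size, 44100)
patch_seconds_16k = frames_to_seconds(patch_size, hop_size, 16000)

# Duration of the 187-frame training patch at 16 kHz.
trained_patch_seconds_16k = frames_to_seconds(187, hop_size, 16000)

print(f"clip: {clip_seconds:.2f} s")
print(f"64-frame patch at 44.1 kHz: {patch_seconds_44k:.2f} s")
print(f"64-frame patch at 16 kHz: {patch_seconds_16k:.2f} s")
print(f"187-frame patch at 16 kHz: {trained_patch_seconds_16k:.2f} s")
```

Changing `patch_size` or the multiplier of 20 trades latency against the amount of audio available for each estimate.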

Instantiate the algorithms. Rather than relying on a wrapper algorithm such as `TensorflowPredictMusiCNN`, presented in the previous tutorial, we instantiate the algorithms separately, which gives us the additional control required for real-time usage. Here a `VectorInput` feeds the captured buffer to a `MonoWriter`, and a `MonoLoader` reads the saved clip back for `RhythmExtractor2013`.

In [5]:
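# Buffer that receives each captured audio clip.
buffer = np.zeros(buffer_size, dtype='float32')
vimp = VectorInput(buffer)
# Beat tracker; with the 'degara' method the confidence output should be ignored.
rex = RhythmExtractor2013(method="degara")
# The clip is written to disk and then read back from the same file.
monoL = MonoLoader(filename='./audio/clip.wav', sampleRate=sample_rate)
monoW = MonoWriter(filename='./audio/clip.wav', format='wav', sampleRate=sample_rate)
pool = Pool()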

In [6]:
help(AudioOnsetsMarker)

Help on class StreamingAlgo in module essentia.streaming:

class StreamingAlgo(Algorithm)
 |  StreamingAlgo(**kwargs)
 |  
 |  AudioOnsetsMarker
 |  
 |  
 |  Inputs:
 |  
 |    [real] signal - the input signal
 |  
 |  
 |  Outputs:
 |  
 |    [real] signal - the input signal mixed with bursts at onset locations
 |  
 |  
 |  Parameters:
 |  
 |    onsets:
 |      vector_real (default = [])
 |      the list of onset locations [s]
 |  
 |    sampleRate:
 |      real ∈ (0,inf) (default = 44100)
 |      the sampling rate of the output signal [Hz]
 |  
 |    type:
 |      string ∈ {beep,noise} (default = "beep")
 |      the type of sound to be added on the event
 |  
 |  
 |  Description:
 |  
 |    This algorithm creates a wave file in which a given audio signal is mixed
 |    with a series of time onsets. The sonification of the onsets can be heard as
 |    beeps, or as short white noise pulses if configured to do so.
 |    
 |    This algorithm will throw an exception if parameter "fil
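
To make the description above more concrete, here is a rough NumPy illustration of what mixing bursts at onset locations can look like. It is only a sketch and not Essentia's implementation: the function `mark_onsets`, the beep frequency, the beep length, and the gain are all assumptions chosen for the example.

```python
import numpy as np


def mark_onsets(signal, onsets, sample_rate=44100,
                beep_hz=1000.0, beep_seconds=0.04, gain=0.5):
    """Return a copy of `signal` with a short sine beep added at each onset (in seconds)."""
    out = np.array(signal, dtype=np.float32, copy=True)
    beep_len = int(beep_seconds * sample_rate)
    t = np.arange(beep_len) / sample_rate
    beep = (gain * np.sin(2 * np.pi * beep_hz * t)).astype(np.float32)

    for onset in onsets:
        start = int(round(onset * sample_rate))
        if start < 0 or start >= len(out):
            continue  # skip onsets that fall outside the signal
        end = min(start + beep_len, len(out))
        out[start:end] += beep[: end - start]
    return out


# One second of silence with beeps at 0.25 s and 0.75 s.
silence = np.zeros(44100, dtype=np.float32)
marked = mark_onsets(silence, onsets=[0.25, 0.75])
```

In the notebook, the onset times could come from the `ticks` output of the rhythm extractor, so that the detected beats become audible on top of the captured clip.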

Connect the algorithms. The captured buffer is written to disk through `MonoWriter`, and the rhythm descriptors computed from the reloaded clip are stored in the `Pool`.

In [7]:
# Commented-out variant that feeds the captured buffer directly to the rhythm extractor.
# vimp.data >> rex.signal
# rex.bpm >> (pool, 'bpm')
# rex.ticks >> (pool, 'ticks')
# rex.confidence >> (pool, 'confidence')
# rex.estimates >> (pool, 'estimates')
# rex.bpmIntervals >> (pool, 'bpmIntervals')

# Write the captured buffer to disk.
vimp.data >> monoW.audio

# Analyze the saved clip and store the rhythm descriptors in the pool.
monoL.audio >> rex.signal
rex.bpm >> (pool, 'bpm')
rex.ticks >> (pool, 'ticks')
rex.confidence >> (pool, 'confidence')
rex.estimates >> (pool, 'estimates')
rex.bpmIntervals >> (pool, 'bpmIntervals')
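
To build some intuition for how the `ticks`, `bpmIntervals`, and `bpm` outputs relate to each other, the sketch below derives a tempo from a list of beat times. This is a simplified illustration and not necessarily how `RhythmExtractor2013` computes its own `bpm` value; `bpm_from_ticks` is a hypothetical helper.

```python
import numpy as np


def bpm_from_ticks(ticks):
    """Estimate tempo from beat times in seconds using the median inter-beat interval."""
    ticks = np.asarray(ticks, dtype=float)
    if ticks.size < 2:
        # At least two beats are needed to measure an interval.
        return None, np.array([])
    intervals = np.diff(ticks)          # seconds between consecutive beats
    bpm = 60.0 / np.median(intervals)   # beats per minute
    return bpm, intervals


# Synthetic beat positions, one every half second.
ticks = [0.5, 1.0, 1.5, 2.0, 2.5]
bpm, intervals = bpm_from_ticks(ticks)
print(bpm, intervals)
```

The median is used here so that a single missed or spurious beat has little effect on the estimate.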

Define the callback that processes each captured clip, then start processing the loopback stream.

In [8]:
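# Candidate BPM labels, used only by the commented-out decoding below.
bpms = list(range(1, 256))

def callback_console(data):
    """Process one captured clip and print the tempo estimate."""
    # Copy the new samples into the buffer read by `vimp`.
    buffer[:] = data.flatten()

    # Write the clip to disk, then analyze the saved file.
    reset(vimp)
    run(vimp)
    run(monoL)

    print(pool.containsKey('bpm'))

    if pool.containsKey('bpm'):
        print(datetime.now())
        print(pool['bpm'])
        print('confidence: ' + str(pool['confidence']))
        # print(str(pool['bpmIntervals']))
        # index_max = np.argmax(softmax(20 * pool[output_layer][-1, :].T))
        # print(bpms[index_max])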
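
The commented-out lines at the end of the callback hint at a second way of obtaining a tempo: decoding the class activations of a tempo model with a softmax followed by an argmax. The sketch below shows that decoding step on synthetic activations. The mapping from class index to BPM label follows the `bpms` list defined in this cell and is an assumption; it may not match the class layout of the actual model.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.asarray(x, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


bpms = list(range(1, 256))  # same label list as in the cell above

# Synthetic activations standing in for one row of model output.
activations = np.zeros(len(bpms))
activations[119] = 0.6  # strongest class, label 120
activations[59] = 0.3   # weaker class, label 60

# Scaling by 20 sharpens the distribution but does not change the argmax.
probs = softmax(20 * activations)
index_max = int(np.argmax(probs))
decoded_bpm = bpms[index_max]
print(decoded_bpm)
```

Because softmax preserves ordering, the scaling factor only matters if the probabilities themselves are displayed or thresholded.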




In [9]:
pool.clear()


# Capture and process the speakers loopback.
with sc.all_microphones(include_loopback=True)[0].recorder(samplerate=sample_rate) as mic:
    while True:
        callback_console(mic.record(numframes=buffer_size).mean(axis=1))



True
2022-02-15 19:23:32.072162
125.2012710571289
confidence: 0.0


KeyboardInterrupt: 
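
The loop above runs until the kernel is interrupted, which ends the cell with a `KeyboardInterrupt` traceback. One possible way to stop more quietly is to catch the interrupt around the loop. The sketch below uses stand-in functions instead of a real recorder so that it runs without audio hardware; `capture_loop` and `fake_record` are hypothetical names.

```python
def capture_loop(record, process, max_clips=None):
    """Call process(record()) until interrupted; return the number of clips handled."""
    handled = 0
    try:
        while max_clips is None or handled < max_clips:
            process(record())
            handled += 1
    except KeyboardInterrupt:
        # Stop quietly when the user interrupts the kernel.
        pass
    return handled


# Stand-ins so the sketch runs without audio hardware.
clips = iter([[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]])


def fake_record():
    try:
        return next(clips)
    except StopIteration:
        # Simulate the user pressing the stop button.
        raise KeyboardInterrupt from None


seen = []
n_clips = capture_loop(fake_record, seen.append)
print(n_clips)
```

In the notebook, `record` could be `lambda: mic.record(numframes=buffer_size).mean(axis=1)` and `process` could be `callback_console`, with the call placed inside the `with` block so that the recorder is still closed on exit.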