# Music Recommendation Service

## Summary

This notebook uses a library of 11k songs (7 GB of song data) to build an ephemeral music recommendation service. As you execute each cell, you learn how to generate an embedding for each song; these embeddings, together with Qdrant's vector DB, form the backend of our service. There is no GUI beyond the Jupyter environment. **No code edits should be required.**


## Usage Details

The web interface is used to create and edit notebooks, which are documents containing live code, equations, visualizations, and narrative text. The kernel is the computational engine that executes 
the code in the notebook. When you run a code cell in a notebook, the code is sent to the kernel for execution, and the output is returned to the notebook. **Run a cell by clicking on it and pressing "Shift + Enter".**

### Start the download and import the needed modules and models

Note: This code produces some output to inform you about the state of the model download, GPU support, and CPU hardware feature support. These messages are highlighted in red.

In [None]:
# import the needed modules
from IPython.display import Audio as player
from datasets import load_dataset, load_from_disk, Audio
from panns_inference import AudioTagging
from qdrant_client import QdrantClient
from qdrant_client.http import models
from os.path import join
from glob import glob
import pandas as pd
import numpy as np
import librosa
import openl3
import torch
import os
MUSIC_COLLECTION_DB = "my_collection"

# get the hostname from OS ENV
QDRANT_HOST = os.environ.get('QDRANT_HOST')
# connect to Qdrant vector DB
client = QdrantClient(host=QDRANT_HOST, port=6333)
print("Attempting to connect to %s" % QDRANT_HOST)
# check if our collection already exists
# look through every collection, not only the first one
collections = client.get_collections()
music_collection_exits = any(
    collection.name == MUSIC_COLLECTION_DB for collection in collections.collections
)
if music_collection_exits:
    print("%s exists..." % MUSIC_COLLECTION_DB)
else:
    print("%s doesn't exist..." % MUSIC_COLLECTION_DB)

In [None]:
# acquire the data we need here for the next steps
music_data = load_from_disk("./data/complete_music_data_set.arrow")
metadata = pd.read_json("./data/metatdata_complete_music_data_set.json")
payload = metadata.to_dict(orient="records")

# Note:
# We have roughly 11k songs and embedding creation takes 5-10 seconds per song
# this would take 15-30h on a single CPU
# if you don't have a GPU available you may want to filter the dataset by genre
my_genre = "electronic"
# INPUT: Enable (set FILTER = True) or disable (FILTER = False (default)) the filter here
FILTER = False
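# indexes of all songs that belong to the selected genre
genre_indexes = set(metadata.loc[metadata['genre'] == my_genre, "index"])

def filter_songs(row):
    # keep a song only if it belongs to the selected genre
    return row['index'] in genre_indexes

if FILTER:
    music_data = music_data.filter(filter_songs)
    # keep the payload aligned with the filtered songs
    payload = metadata[metadata['genre'] == my_genre].to_dict(orient="records")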
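
The runtime estimate in the note above follows from simple arithmetic. A minimal sketch, assuming exactly 11,000 songs and the 5 to 10 seconds per song quoted in the comment (the function name `embedding_hours` is made up for illustration):

```python
# Rough single-CPU runtime estimate for embedding creation.
# Assumption: 11,000 songs at 5 to 10 seconds each, as quoted in the note.

def embedding_hours(num_songs: int, seconds_per_song: float) -> float:
    """Return the total embedding time in hours."""
    return num_songs * seconds_per_song / 3600

low = embedding_hours(11_000, 5)
high = embedding_hours(11_000, 10)
print(f"{low:.1f} to {high:.1f} hours")
```

Filtering to a single genre shrinks `num_songs`, and the estimate shrinks proportionally.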

In [None]:
# check out one of the songs (choose a smaller index if the filtered dataset has fewer songs)
a_song = music_data[6623]
for key, data in a_song.items():
    print(f"{key}: {data}")

sample_rate = a_song['audio']['sampling_rate']
player(a_song['audio']['array'], rate=sample_rate)

### Create our embeddings with either CPU or GPU

This cell uses PyTorch to create embeddings for batches of songs. It first selects the GPU 
if a CUDA-enabled device is available and falls back to the CPU otherwise. It then defines 
a helper function, `get_panns_embs`, that takes a batch of songs, zero-pads the audio arrays 
to a common length, and creates an embedding for each song with the `AudioTagging` model. 
The embeddings are added to the batch, which is then returned. Finally, the cell checks 
whether `music_data` already contains embeddings and, only if it does not, computes them 
with `get_panns_embs`. Running on a GPU speeds up this step for a large dataset of songs.
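
Songs in a batch rarely have the same number of samples, so the helper pads every waveform with zeros up to the length of the longest one before handing the batch to the model. The sketch below illustrates that idea in plain Python instead of `torch.nn.utils.rnn.pad_sequence`; `pad_batch` is a hypothetical name, and the toy lists stand in for real audio arrays.

```python
# Illustration only: zero-padding a batch of variable-length sequences.
# Assumption: plain lists stand in for the audio arrays used in the notebook.

def pad_batch(arrays, padding_value=0.0):
    """Right-pad each sequence to the length of the longest one."""
    max_len = max(len(a) for a in arrays)
    return [list(a) + [padding_value] * (max_len - len(a)) for a in arrays]

batch = [[0.1, 0.2, 0.3], [0.4], [0.5, 0.6]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular array.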

In [None]:
# determine if we can use cuda or need to fallback to the cpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# create a helper function which takes a batch of songs and then creates embeddings for each
def get_panns_embs(batch):
    # CPU execution
    if device.type == "cpu":
        arrays = [torch.tensor(val['array'], dtype=torch.double) for val in batch['audio']]
        # padding might not be needed for CPU execution
        inputs = torch.nn.utils.rnn.pad_sequence(arrays, batch_first=True, padding_value=0)
        inputs = inputs.numpy()
    # GPU execution
    else:
        arrays = [torch.tensor(val['array'], dtype=torch.float64) for val in batch['audio']]
        inputs = torch.nn.utils.rnn.pad_sequence(arrays, batch_first=True, padding_value=0).type(torch.cuda.FloatTensor)
    
    # get the embeddings from the model
    _, embedding = at.inference(inputs)
    batch['panns_embeddings'] = embedding
    return batch

# Expensive: only run this if we haven't in the past
if "panns_embeddings" not in music_data.features:
    at = AudioTagging(checkpoint_path=None, device=device.type)
    music_data = music_data.map(get_panns_embs, batched=True, batch_size=8)

### Stop here

We will not wait for the embedding generation to complete because it takes too long. A "TV kitchen" moment will take us to the next step. Return to the instructions now to learn how.

### Connect and store our Embeddings in Qdrant's vector DB

In [None]:
# check to see if the collection exists before we recreate it
if not music_collection_exits:
    client.recreate_collection(
        collection_name=MUSIC_COLLECTION_DB,
        vectors_config=models.VectorParams(size=2048, distance=models.Distance.COSINE)
    )

# store the embeddings in the vector DB now so we can start using them in various ways
client.upsert(
    collection_name=MUSIC_COLLECTION_DB,
    points=models.Batch(
        ids=music_data['index'],
        vectors=music_data['panns_embeddings'],
        payloads=payload
    )
)
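
The collection above is configured with `models.Distance.COSINE`, so embeddings are compared by the angle between them rather than by their magnitude. As a rough illustration of that measure only (not of Qdrant's internal implementation), the plain-Python sketch below computes cosine similarity for toy 3-dimensional vectors; the embeddings stored above have 2048 dimensions, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

song_a = [1.0, 2.0, 3.0]
song_b = [2.0, 4.0, 6.0]   # same direction as song_a, twice the magnitude
song_c = [3.0, 0.0, -1.0]  # orthogonal to song_a

print(cosine_similarity(song_a, song_b))  # close to 1: same direction
print(cosine_similarity(song_a, song_c))  # 0: no directional overlap
```

Two songs whose embeddings point in nearly the same direction score close to 1, which is what makes them candidates for a recommendation.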