# 04 Vector Database

## Table of Contents

1. Overview
2. The Problem
3. Audio Data
4. Transformers and Embeddings
    - What are transformers?
    - What are Embeddings?
    - Fine-tuning Wav2Vec
    - Extracting Embeddings
5. Vector Databases
    - What are they?
    - Why do we need them?
    - How can we use them?
6. Enter Qdrant
    - Getting Started
    - Adding Points
    - Payloads
    - Search
7. Putting it all together
    - Basics of Recommender Systems
    - Building a UI
8. Final Thoughts

## 1. Overview

Vector databases are a relatively new way of interacting with abstract data representations derived from opaque machine learning models such as deep learning architectures. These representations are often called embeddings, and they are a compressed representation of the data that a model learns while being trained to accomplish a task (e.g., sentiment analysis, speech recognition, or object detection).

One of the best features of a vector database is its ability to serve as the building block of a recommender system, and in this blog post you'll learn how to accomplish such a feat using audio data. Before we go over such a system, let's first cover the main components of a vector database using Qdrant.

Qdrant "is a vector similarity search engine that provides a production-ready service with a convenient API to store, search, and manage points (i.e. vectors) with an additional payload." You can get started with plain Python using the `qdrant-client`, run a local Docker image of `qdrant` that you can connect to effortlessly, or try out Qdrant's Cloud free tier until you are ready to make the full switch.
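To make the idea of "points with an additional payload" concrete, here is a minimal sketch in plain Python of what a similarity search does conceptually: score every stored vector against a query with cosine similarity, optionally filter on the payload, and return the best matches. The `Point` class, the `search` helper, and the toy 3-dimensional vectors are all hypothetical and only illustrate the concept. This is not how Qdrant works internally, since a real engine relies on an index instead of scanning every point.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Point:
    """Hypothetical stand-in for a stored point: an id, a vector, and a payload."""
    id: int
    vector: list
    payload: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(points, query, limit=3, genre=None):
    """Brute-force search: optionally filter by payload genre, then rank by score."""
    candidates = [p for p in points if genre is None or p.payload.get("genre") == genre]
    scored = [(cosine(p.vector, query), p) for p in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Made-up points; the genres are illustrative only.
points = [
    Point(0, [1.0, 0.0, 0.0], {"genre": "bachata"}),
    Point(1, [0.9, 0.1, 0.0], {"genre": "salsa"}),
    Point(2, [0.0, 1.0, 0.0], {"genre": "bachata"}),
]

hits = search(points, query=[1.0, 0.0, 0.0], limit=2)
hit_ids = [p.id for _, p in hits]
bachata_ids = [p.id for _, p in search(points, [1.0, 0.0, 0.0], genre="bachata")]
print(hit_ids, bachata_ids)
```

The payload filter is applied before ranking here, which is the simplest way to show that metadata and vectors live side by side on each point.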

## 1. Load Libraries and Data

In [1]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from qdrant_client.http import models
from datasets import load_dataset, Audio
from transformers import AutoModelForAudioClassification, AutoConfig, AutoModel
from IPython.display import Audio as player
from transformers import Wav2Vec2Processor, Wav2Vec2Model
import torch
import numpy as np

In [2]:
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2Model: ['project_q.bias', 'project_q.weight', 'quantizer.weight_proj.bias', 'quantizer.codevectors', 'project_hid.weight', 'quantizer.weight_proj.weight', 'project_hid.bias']
- This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The dataset used here is `hugggof/music-caption-eval-v2` from the Hugging Face Hub.

In [3]:
dataset = load_dataset("hugggof/music-caption-eval-v2", split='train')
dataset


Dataset({
    features: ['uri', 'artist_name', 'name', 'release_date', 'genre', 'popularity', 'response_gpt4', 'response_gpt3.5-tags', 'response_gpt3.5', 'response_random', 'response_human', 'audio'],
    num_rows: 59
})

In [4]:
len(dataset.unique("genre"))

49

In [5]:
dataset['audio'][0]

{'path': 'spotify:track:6sOa9gg19G0U9DPR39NYQG.mp3',
 'array': array([0., 0., 0., ..., 0., 0., 0.], dtype=float32),
 'sampling_rate': 48000}

In [6]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

In [7]:
sample = dataset[0]['audio']["array"]
player(sample, rate=16_000)

In [30]:
inputs = processor(sample, sampling_rate=16_000, return_tensors="pt", return_attention_mask=True)

In [32]:
inputs['input_values'].size()

torch.Size([1, 3314881])

In [15]:
with torch.no_grad():
    embeddings = model(inputs.input_values, inputs.attention_mask).last_hidden_state

In [33]:
embeddings.size()

torch.Size([1, 10358, 768])
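
The 10358 in the shape above is the number of frames that the model's convolutional feature encoder produces from the 3,314,881 input samples. We can sanity-check it by applying the standard 1-D convolution length formula once per layer. The kernel sizes and strides below are the default `conv_kernel` and `conv_stride` values of `Wav2Vec2Config`, and this sketch assumes `facebook/wav2vec2-base` uses them unchanged; the helper name `num_frames` is hypothetical.

```python
# Assumed defaults from Wav2Vec2Config (conv_kernel / conv_stride).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Output length after stacking unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


samples = 3_314_881               # from inputs['input_values'].size()
frames = num_frames(samples)      # frame axis of last_hidden_state
duration_s = samples / 16_000     # audio length at the 16 kHz sampling rate
print(frames, round(duration_s, 2))
```

The strides multiply to 320 samples, so under these assumptions each frame advances by about 20 ms of audio.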

In [18]:
embeddings[0, 0, :].size()

torch.Size([768])
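
`embeddings[0, 0, :]` is only the first of the 10358 frame vectors. One common way to get a single vector for a whole track is to average over the frame axis (mean pooling). The sketch below shows the idea on a tiny made-up matrix in plain Python. The `mean_pool` helper is hypothetical, and this is not necessarily how the vectors used later in this post were produced; with the real tensor the equivalent would be `embeddings.mean(dim=1)`.

```python
def mean_pool(frames):
    """Average a list of equal-length frame vectors into one vector."""
    if not frames:
        raise ValueError("need at least one frame")
    count = len(frames)
    # zip(*frames) walks the matrix column by column (one column per dimension).
    return [sum(column) / count for column in zip(*frames)]


# Toy stand-in for a (frames, hidden_size) matrix: 3 frames, 4 dimensions.
toy_frames = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
track_vector = mean_pool(toy_frames)
print(track_vector)
```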

In [19]:
labels = dataset.unique("genre")
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

num_labels = len(id2label)

In [20]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# config = AutoConfig.from_pretrained("facebook/wav2vec2-base")
# model = AutoModelForAudioClassification.from_pretrained(
#     # config
#     "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label,
#     output_attentions=True
# )#.to(device)
model = AutoModel.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
).to(device)


In [21]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
inputs = feature_extractor(
    sample, sampling_rate=feature_extractor.sampling_rate, 
    return_tensors="pt", padding=True, return_attention_mask=True
).to(device)

In [28]:
inputs['input_values'].size()

torch.Size([1, 3314881])

In [23]:
with torch.no_grad():
    # Forward pass without gradient tracking; `embeds` holds the model outputs
    embeds = model(inputs.input_values, inputs.attention_mask)

In [29]:
embeds.last_hidden_state.size()

torch.Size([1, 10358, 768])

In [13]:
embeds['last_hidden_state'].size()

torch.Size([1, 10358, 768])

In [None]:
sample[:, None].shape

In [None]:
model(**inputs).last_hidden_state.size()

In [None]:
vectors = np.load('vectors.npy')
vectors.shape

In [None]:
client = QdrantClient("localhost", port=6333)

In [None]:
client.recreate_collection(
    collection_name="test_collection",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

In [None]:
from pprint import pprint
collection_info = client.get_collection(collection_name="test_collection")
collection_info

In [None]:
from qdrant_client.http.models import CollectionStatus

assert collection_info.status == CollectionStatus.GREEN
assert collection_info.vectors_count == 0

In [None]:
len(vectors)

In [None]:
client.upsert(
    collection_name="test_collection",
    points=models.Batch(
        ids=list(range(len(vectors))),
        vectors=vectors.tolist()
    ),
)
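
The upsert above stores vectors only. Qdrant points can also carry a payload, and the dataset already has fields such as `genre`, `artist_name`, and `name` that would be useful for filtering later. Below is a sketch of how ids, vectors, and payloads could be grouped into fixed-size chunks before uploading. It assumes that row `i` of `vectors` corresponds to row `i` of the dataset, which the notebook does not confirm; `build_batches` and the sample rows are hypothetical.

```python
def build_batches(vectors, rows, batch_size=2,
                  payload_keys=("genre", "artist_name", "name")):
    """Yield (ids, vectors, payloads) chunks, assuming vectors[i] belongs to rows[i]."""
    if len(vectors) != len(rows):
        raise ValueError("vectors and rows must have the same length")
    for start in range(0, len(vectors), batch_size):
        stop = min(start + batch_size, len(vectors))
        ids = list(range(start, stop))
        payloads = [{key: rows[i].get(key) for key in payload_keys} for i in ids]
        yield ids, vectors[start:stop], payloads


# Made-up rows standing in for dataset records.
toy_rows = [
    {"genre": "bachata", "artist_name": "Artist A", "name": "Song 1"},
    {"genre": "salsa", "artist_name": "Artist B", "name": "Song 2"},
    {"genre": "jazz", "artist_name": "Artist C", "name": "Song 3"},
]
toy_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

batches = list(build_batches(toy_vectors, toy_rows))
print(len(batches), batches[-1])
```

Each chunk could then be passed to `client.upsert` through `models.Batch(ids=ids, vectors=vectors, payloads=payloads)`.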

In [None]:
from diffusers import AudioLDMPipeline
import torch

repo_id = "cvssp/audioldm"
pipe = AudioLDMPipeline.from_pretrained(repo_id)
pipe = pipe.to(device)  # reuse the device chosen earlier so this also runs without a GPU

prompt = "high quality bachata"

audio = pipe(prompt=prompt, num_inference_steps=500, audio_length_in_s=5.0).audios[0]

from IPython.display import Audio

Audio(audio, rate=16000)

In [None]:
# classifier(audio)  # incomplete call: `classifier` is never defined in this notebook

In [None]:
audio.shape

In [None]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("first_mod")
inputs = feature_extractor(audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt", max_length=16000, truncation=True)

In [None]:
inputs['input_values'].size()

In [None]:
with torch.no_grad():
    last_hidden_state = model(**inputs.to(device)).last_hidden_state[:, 0]
last_hidden_state.size()

In [None]:
vectr = last_hidden_state.cpu().numpy()[0, :]

In [None]:
results = client.search(
    collection_name="test_collection",
    query_vector=vectr.tolist(),
    limit=10,
    with_vectors=True,  # return the stored vectors so the next cells can read them
)
results

In [None]:
one_array = np.array(results[0].dict()["vector"])

In [None]:
music = []

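for result in results:
    # Caution: each stored vector is a 768-dimensional embedding, not a waveform,
    # so this plays the raw embedding values rather than the matched song.
    the_song = Audio(np.array(result.dict()["vector"]), rate=16_000)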
    # feature_extractor(the_song, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt", max_length=16000, truncation=True)
    music.append(the_song)

In [None]:
music[2]

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
np.sum(one_array)

In [None]:
# Compare per-vector sums (axis=1) to find which stored row matches `one_array`
np.isclose(np.sum(vectors, axis=1), np.sum(one_array))

In [None]:
scores = cosine_similarity([one_array], vectors)[0]
scores

In [None]:
top_scores_ids = np.argsort(scores)[-5:][::-1]
top_scores_ids
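
Since `one_array` comes from the collection itself, the highest entry in `scores` will typically be that track compared against itself. For recommendations we usually want to skip the seed track. Here is a small sketch of a top-k selection that excludes a given index, written in plain Python with made-up scores; `top_k_excluding` is a hypothetical helper, not part of NumPy or Qdrant.

```python
import heapq


def top_k_excluding(scores, k=5, exclude=None):
    """Return the indices of the k highest scores, best first, skipping `exclude`."""
    candidates = ((score, idx) for idx, score in enumerate(scores) if idx != exclude)
    return [idx for _, idx in heapq.nlargest(k, candidates)]


# Made-up similarity scores; index 1 plays the role of the seed track.
toy_scores = [0.12, 1.00, 0.87, 0.45, 0.91, 0.30]
recommended = top_k_excluding(toy_scores, k=3, exclude=1)
print(recommended)
```

With the notebook's variables, the call would look like `top_k_excluding(scores.tolist(), k=5, exclude=seed_index)`, where `seed_index` is the position of the seed track in `vectors`.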