# Recommendation Systems, Vector Databases, and Audio Data

![main](../images/main_pic.png)

## Table of Contents

1. Overview
2. The Challenge
3. Audio Data
    - Intro to Audio Data
    - Data Preparation
4. Vector Databases
    - What are they?
    - Why do we need them?
    - How can we use them?
    - Enter Qdrant
        - Getting Started
        - Adding Points
        - Payloads
        - Search
5. Transformers and Embeddings
    - What are transformers?
    - What are Embeddings?
    - Fine-tuning Wav2Vec
    - Extracting Embeddings
6. Putting it all together
    - Adding Vectors to Qdrant
    - Basics of Recommender Systems
    - Building a UI
7. Final Thoughts

## 1. Overview

Vector databases are a relatively new way of interacting with abstract data representations derived from opaque machine learning models such as deep learning architectures. These representations are often called embeddings, and they are a compressed version of the data used to train a machine learning model to accomplish a task (e.g., sentiment analysis, speech recognition, object detection, and many more).

One of the best features of a vector database is its ability to serve as the building block of a recommender system, and in this blog post, you'll learn how to accomplish such a feat using audio data. Before we go over such a system, let's first cover the main components of a vector database using Qdrant.

Qdrant "is a vector similarity search engine that provides a production-ready service with a convenient API to store, search, and manage points (i.e. vectors) with an additional payload." You can get started with plain Python using the `qdrant-client`, run a local Docker image of `qdrant` that you can connect to effortlessly, or try out Qdrant's Cloud free tier until you are ready to make the full switch.

## 2. The Challenge

Building recommendation systems can be quite challenging. For starters, we never know a priori the needs and wants of a new customer of our stores or a new user of our applications, which makes it difficult to recommend a toaster brand to someone searching for skillets, or to recommend a Beatles song to users who just listened to a bachata by Romeo Santos as their first song in our app.

The aforementioned problems belong to the "cold start" family and, while they might not go away anytime soon, there is a way to bypass these issues and serve relevant results from the get-go. That is what we will work on in this tutorial: a recommender system for music built on top of Qdrant, an open-source vector database designed for flexibility, scalability, and ease of use.

## 3. Audio Data

The data we'll use can be downloaded from Kaggle [here](https://www.kaggle.com/datasets/jorgeruizdev/ludwig-music-dataset-moods-and-subgenres?resource=download&select=labels.json), and it consists of the following pieces.
- `mp3/mp3`: two nested mp3 directories (one of them redundant) with audio files categorized by genre (e.g., latin, hip_hop, etc.). The file name of each song is also its unique ID.
- `spectogram`: Mel-frequency spectrograms of each song saved as `.npy` files.
- `MFCCS`: a folder containing a `.npy` file with the extracted MFCCs of each song. Each `.npy` file contains around 10 MFCCs of 3 s duration each.
- `labels.json`: metadata about each song including genre, artist, song, etc. We can use the unique ID to merge it with the main dataset.
- `subgenres.json`: metadata about the subgenre of each song. We can use the unique ID to merge it with the main dataset.

Once you download the data and unzip it in your data directory, make sure you have the directory arranged in the following way to match this tutorial.

```sh
../data/ludwig_music_data
├── labels.json
├── mfccs
│   ├── blues
│   ├── classical
│   ├── electronic
│   ├── funk_soul
│   ├── hip_hop
│   ├── jazz
│   ├── latin
│   ├── pop
│   ├── reggae
│   └── rock
├── mp3
│   ├── blues
│   ├── classical
│   ├── electronic
│   ├── funk _ soul
│   ├── hip hop
│   ├── jazz
│   ├── latin
│   ├── pop
│   ├── reggae
│   └── rock
├── spectogram
│   └── spectogram
└── subgeneres.json
```

Please note that you can substitute this (12 GB) dataset for another of your choosing, provided it is for a classification task, and follow the tutorial with barely any tweaks to the code.

Now that we know a bit about the challenge we will be tackling and the data we will be working with, let's talk about audio data in general to build an intuition as to how things work.

### 3.1 Intro to Audio Data

Audio data refers to any type of sound that can be stored and transmitted in a digital format. This can include music, spoken words, and other types of sound recordings. In some ways, it's similar to how text can be stored and transmitted in a digital format. Hence, audio data is a way of representing sound in a shape and form that can be processed and analyzed by computers. A wide range of applications use audio data, including music production, telecommunications, and digital assistants like Siri and Alexa.

For data science use cases, we usually convert sound data to Mel spectrograms before feeding it to machine learning models. To do so, the audio data is first divided into small segments. Each segment is then processed using a mathematical operation known as the Fourier transform, which breaks down the sound into its component frequencies.
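
To make the Fourier step more concrete, here is a minimal pure-Python sketch (not the optimized FFT that audio libraries rely on). It builds a hypothetical 32-sample segment holding a single sine wave and shows that the discrete Fourier transform concentrates the energy in the matching frequency bin. The names `dft_magnitudes` and `segment` are illustrative and do not come from any library.

```python
import cmath
import math


def dft_magnitudes(segment):
    """Return the magnitude of each non-negative frequency bin of a real signal."""
    n = len(segment)
    magnitudes = []
    for k in range(n // 2 + 1):  # bins 0..n/2 are enough for real-valued input
        total = sum(
            sample * cmath.exp(-2j * math.pi * k * t / n)
            for t, sample in enumerate(segment)
        )
        magnitudes.append(abs(total))
    return magnitudes


# Hypothetical segment: 32 samples holding exactly 4 cycles of a sine wave.
n_samples, cycles = 32, 4
segment = [math.sin(2 * math.pi * cycles * t / n_samples) for t in range(n_samples)]

magnitudes = dft_magnitudes(segment)
dominant_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
print(dominant_bin)  # 4, the bin that matches the 4 cycles in the segment
```

A real song contains many overlapping frequencies, so its segments light up many bins at once, and stacking those per-segment results over time is what produces a spectrogram.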

After the Fourier transform has been applied, mel-scale filter banks are used to group the frequencies into a set of bands that more closely match the human auditory system's perception of sound. These filter banks essentially amplify some frequency ranges while reducing others. This process results in a set of values for each segment of audio, which can be arranged to form a spectrogram.
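
As a rough illustration of the mel scale, the sketch below uses one common (HTK-style) conversion formula; audio libraries may default to a slightly different variant, so treat the constants as an assumption. It places band centers at equal steps on the mel scale and converts them back to Hz. The helper names are hypothetical.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(low_hz, high_hz, n_bands):
    """Band centers equally spaced on the mel scale, returned in Hz."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_bands + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(n_bands)]


# 8000 Hz is the highest representable frequency for audio sampled at 16 kHz.
centers = mel_band_centers(0.0, 8000.0, n_bands=10)
gaps = [b - a for a, b in zip(centers, centers[1:])]
print([round(c) for c in centers])
```

The gaps between consecutive centers grow with frequency, which mirrors the idea that we hear differences between low pitches more finely than differences between high ones.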

Mel-spectrograms are useful in machine learning applications as they provide a way to represent audio data in a format that can be easily processed and analyzed by computer models. The mel-spectrograms can be used as inputs to machine learning models, which can then learn to recognize patterns and classify different types of sounds.

Now that we know a little bit about audio data, let's examine a sample to get an intuition for how the process works.

Before you run any line of code, make sure you create a virtual environment with the following packages.

```bash
# with conda
mamba create -n my_env python=3.10
mamba activate my_env

# or with virtualenv
python -m venv venv
source venv/bin/activate

# install packages
pip install qdrant-client transformers datasets pandas numpy streamlit torch faker
```

In [1]:
from IPython.display import Audio as player
import numpy as np

### 3.2 Data Preparation

In [None]:
from datasets import load_dataset, Audio

In [None]:
music_data = load_dataset(
    "audiofolder", data_dir="../data/ludwig_music_data/mp3/", split="train"
).shuffle(seed=42).select(range(200))

music_data

In [None]:
music_data[0]

In [None]:
def get_the_id(data):
    data['idx'] = data['audio']['path'].split("/")[-1].replace(".mp3", '')
    return data
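
The same ID extraction can be sketched with `pathlib`, which avoids manual string splitting. The function name and the example path below are hypothetical; the only assumption is that each file is named `<song id>.mp3`, as described earlier.

```python
from pathlib import PurePosixPath


def song_id_from_path(path):
    """Return the file name without its extension, i.e. the song's unique ID."""
    return PurePosixPath(path).stem


# Hypothetical path following the directory layout used in this tutorial.
print(song_id_from_path("../data/ludwig_music_data/mp3/latin/abc123.mp3"))  # abc123
```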

In [None]:
music_data = music_data.map(get_the_id, num_proc=6)
music_data

In [None]:
music_data.unique("label")

Time for the labels.

In [None]:
import pandas as pd

In [None]:
labels = pd.read_json("../data/ludwig_music_data/labels.json")

In [None]:
from random import choice
labels['tracks'].iloc[choice(range(200))]

In [None]:
def get_metadata(x):
    try:
        artist = list(x['artist'].values())[0]
        genre = list(x['genre'].values())[0]
        name = list(x['name'].values())[0]
    except Exception:  # missing or malformed track metadata
        artist = "Unknown"
        genre = "Unknown"
        name = "Unknown"
    return pd.Series([artist, genre, name], index=['artist', 'genre', 'name'])

In [None]:
clean_labels = labels['tracks'].apply(get_metadata).reset_index()
clean_labels.head()

In [None]:
clean_labels.name.value_counts()

In [None]:
music_data = music_data.to_pandas().merge(
    right=clean_labels, left_on='idx', right_on='index', how="left"
).drop("index", axis=1)

music_data.head()

In [None]:
from datasets import Dataset

In [None]:
Dataset.from_pandas??

In [None]:
music_data = Dataset.from_pandas(music_data, preserve_index=False)
music_data = music_data.cast_column('audio', Audio(sampling_rate=16_000))
music_data

In [None]:
music_data[0]

## 4. Vector Databases

### 4.1 What Are They?

A vector database is a type of database designed to store and query high-dimensional vectors efficiently. In traditional databases, data is typically organized in rows and columns, and queries are performed based on the values in those columns. However, in certain applications, such as machine learning, image recognition, natural language processing, and recommendation systems, data is often represented as vectors in a high-dimensional space.

A vector in this context is a mathematical representation of an object or data point, where each element of the vector corresponds to a specific feature or attribute of the object. For example, in an image recognition system, a vector could represent an image, with each element of the vector representing a pixel value or a descriptor of that pixel.

Vector databases are optimized for storing and querying these high-dimensional vectors efficiently, often using specialized data structures and indexing techniques. They enable fast similarity searches, allowing users to find vectors that are similar to a given query vector based on some distance metric, such as Euclidean distance or cosine similarity.
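
The two distance metrics mentioned above, and the idea of a similarity search, can be sketched in a few lines of plain Python. This is a brute-force illustration under simplifying assumptions (tiny, hand-written 3-dimensional vectors and hypothetical helper names); production vector databases rely on approximate indexes instead of scoring every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, vectors, k):
    """Score every stored vector and return the ids of the k most similar ones."""
    scored = [(cosine_similarity(query, vec), idx) for idx, vec in vectors.items()]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]


# Hypothetical 3-dimensional "embeddings" keyed by id.
vectors = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [-1.0, 0.0, 0.0],
}
query = [1.0, 0.05, 0.0]
print(top_k(query, vectors, k=2))  # [0, 1]
```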

### 4.2 Why do we need them?

Vector databases play a crucial role in various applications that require similarity search, such as recommendation systems, content-based image retrieval, and personalized search. By leveraging efficient indexing and search techniques, vector databases enable faster and more accurate retrieval of similar vectors, enabling advanced data analysis and decision-making.

### 4.3 How can we use them?

To get started using vector databases, it is important that we understand the basics of how vectors are generated. We do not need to be a data scientist or a machine learning engineer, but some familiarity with machine learning (and/or the ability to follow a comprehensive tutorial) would be enough to help us get our hands on some vectors. 

Once we train an algorithm, most likely a deep learning one, we will have at our disposal the vector representation of the data used to train the model, and the model itself to transform new data into vectors. Note that, even if we trained the algorithm for a particular task, what we need is the structured representation of our unstructured data (e.g., images, audio, text, etc.) rather than the predictive function itself. The goal is to compare a new sample to existing ones to get the most similar results on the fly.

The next step is to pick a vector database that aligns with what you, your team, or your company is trying to do (e.g., recommendation system, semantic search, ranking, etc.). There are many solutions out there, and they differ in terms of ease of use (i.e., how many more tools you need in order to get started), distribution (e.g., SaaS vs. open source), accessibility (i.e., can you find new creative ways to use the low-level parts of the database?), and coverage (e.g., similarity metrics available and capabilities such as payloads).

Once you have your vector database set up and have familiarized yourself with the API, start adding vectors and create a simple UI to test your use case. For example, use Streamlit or NiceGUI to put together a nice-looking prototype that you can share and get feedback on. If everyone is happy with the results, start drafting a plan for how you will productionize, maintain, and monitor your database, as well as the steps needed to give developers or other data professionals working on recommendation systems access to it.

With that out of the way, let's get started with vector databases using the fastest growing open source tool in the market, Qdrant! 😎

### 4.4 Enter Qdrant

![qdrant](../images/qdrant_overview.png)

#### 4.4.1 Getting Started

In order to get started with Qdrant, we can either pull the latest docker image using `docker pull qdrant/qdrant`, and then run it with,
```bash
docker run -p 6333:6333 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant
```
or we can use its Python client with its in-memory functionality to play around with it. We will opt for the former in this tutorial, but feel free to use the in-memory version if you prefer.

We'll start by instantiating our client using `host="localhost"` and `port=6333` as opposed to `location=":memory:"` for the in-memory option.

In [3]:
from qdrant_client import QdrantClient
from qdrant_client.http import models

In [14]:
client = QdrantClient(host="localhost", port=6333)
client

<qdrant_client.qdrant_client.QdrantClient at 0x7fa461615480>

In traditional OLTP or OLAP databases, we call specific logical bundles of data `Tables`, but in vector databases, we refer to these groups of vectors as `collections`. In the same way in which we can create many tables in a database, we can create many collections in a vector database using a client. The key difference to note is that when we create a collection, we need to specify the width of its vectors with the parameter `size=...` and the similarity metric with `distance=...` (to change either later on, we would recreate the collection).

Let's create our first collection.

In [26]:
first_collection = client.create_collection(
    collection_name="my_first_collection",
    vectors_config=models.VectorParams(size=100, distance=models.Distance.COSINE)
)

In [21]:
# we can check that our collection was indeed created with
client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='test_collection'), CollectionDescription(name='my_first_collection'), CollectionDescription(name='music_collection')])

There are a couple of things to notice from what we have done.
- The first is that when we started our Docker container, we created a local directory, `qdrant_storage`, where all of our collections, plus their metadata, will be saved. You can have a look at that directory in a *nix system with `tree qdrant_storage -L 2`. You should see the following.
    ```bash
    qdrant_storage
    ├── aliases
    │   └── data.json
    ├── collections
    │   └── my_first_collection
    └── raft_state
    ```
- The second is that we used `client.create_collection` and this command can only be used once per collection. To recreate the collection with new parameters and the like, we would use `client.recreate_collection` instead.

Now that we know how to create collections, let's create a bit of fake data and add some vectors to our collection.

#### 4.4.2 Adding Points

Points are the central entity that Qdrant operates with. Each point is a record consisting of a vector, an id, and an optional payload (which we'll talk more about in the next section).

The id can be an unsigned integer or a UUID. We are going to use a range of numbers for this.
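
If you would rather use UUIDs than integers, the standard library can generate them. The sketch below shows two options; the `song_key` value is a hypothetical identifier, and deriving ids with `uuid5` is a design suggestion rather than something this tutorial requires.

```python
import uuid

# Random UUID: unique, but different on every run.
random_id = str(uuid.uuid4())

# Deterministic UUID derived from an existing identifier (hypothetical song key),
# so ingesting the same song twice maps to the same point id.
song_key = "latin/abc123"
stable_id = str(uuid.uuid5(uuid.NAMESPACE_URL, song_key))

print(random_id)
print(stable_id)
```

A deterministic id is handy with `upsert`, because re-running the ingestion overwrites existing points instead of creating duplicates.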

In [51]:
data = np.random.uniform(low=-1.0, high=1.0, size=(1_000, 100))
data

array([[-0.34902928,  0.39127137,  0.77736609, ...,  0.71480541,
        -0.17359971,  0.36136192],
       [ 0.23179378,  0.91367179, -0.01578303, ...,  0.13227556,
        -0.57514356,  0.54422695],
       [-0.55813868,  0.84245178,  0.93939812, ..., -0.62025177,
        -0.56850792, -0.60533854],
       ...,
       [-0.7460135 ,  0.27494219, -0.76449743, ...,  0.75622879,
         0.56995359, -0.90207819],
       [ 0.4431529 , -0.89273111, -0.69514209, ...,  0.75955554,
         0.19940356,  0.58470656],
       [ 0.28079717, -0.0043887 , -0.77346643, ..., -0.13986271,
        -0.27839941,  0.8953143 ]])

In [52]:
index = list(range(len(data)))
index[-10:]

[990, 991, 992, 993, 994, 995, 996, 997, 998, 999]

In [53]:
client.upsert(
    collection_name="my_first_collection",
    points=models.Batch(
        ids=index,
        vectors=data.tolist()
    )
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
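
Sending all 1,000 vectors in a single call works fine at this scale. For much larger collections, a common pattern (an assumption on my part, not something this tutorial needs) is to split the ids and vectors into smaller batches and call `client.upsert` once per batch. The helper below only does the splitting; `chunked` is a hypothetical name.

```python
def chunked(ids, vectors, batch_size):
    """Yield (ids, vectors) slices holding at most `batch_size` items each."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], vectors[start:end]


ids = list(range(10))
vectors = [[float(i)] * 3 for i in ids]  # hypothetical 3-dimensional vectors
batches = list(chunked(ids, vectors, batch_size=4))
print([len(batch_ids) for batch_ids, _ in batches])  # [4, 4, 2]
```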

We can retrieve specific points based on their ID (for example, artist X with ID 100) and get some additional information from that result.

In [55]:
client.retrieve(
    collection_name="my_first_collection",
    ids=[100],
    with_vectors=True # we can turn this on and off depending on our needs
)

[Record(id=100, payload={}, vector=[0.09882033, 0.16172764, -0.09056148, -0.089663155, -0.056109257, 0.060092356, -0.041791342, -0.002466713, -0.047766462, -0.0014954949, 0.048229624, -0.09024368, -0.13306141, 0.09125329, -0.013539769, 0.05341306, -0.09563544, -0.16100252, -0.12960596, -0.092364304, -0.109665275, 0.0070449742, -0.030927822, -0.14808643, 0.14655688, 0.0683223, 0.09412629, 0.15900253, 0.14073977, 0.14068495, -0.0747445, 0.13971582, 0.08059134, -0.1481492, -0.06711511, -0.15731916, 0.09945187, 0.13183598, -0.10551556, 0.03905012, -0.098554, 0.055640813, 0.16447113, 0.14446491, -0.08003098, 0.11543581, -0.15725297, -0.084937714, -0.016937457, -0.09782521, -0.09829953, -0.111397855, 0.051244024, -0.10563912, -0.03890867, -0.16418418, -0.110985085, -0.06731401, -0.019962916, -0.014650847, 0.0140013, 0.016037945, 0.07743408, 0.022067843, 0.09939152, -0.088779725, -0.14362699, -0.035124537, -0.016272748, -0.15549614, -0.05600413, 0.012791288, -0.1149695, -0.11909202, -0.025901
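
The values returned above are noticeably smaller than the ones drawn from `np.random.uniform(-1.0, 1.0)`. A likely explanation, based on how Qdrant documents its cosine metric, is that vectors in a cosine-distance collection are normalized to unit length when they are uploaded, so that cosine similarity reduces to a plain dot product. The sketch below reproduces that normalization on a hypothetical random vector; `l2_normalize` is an illustrative helper, not a Qdrant function.

```python
import math
import random


def l2_normalize(vector):
    """Scale a vector so that its Euclidean length becomes 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


random.seed(0)
raw = [random.uniform(-1.0, 1.0) for _ in range(100)]  # like one row of `data`
unit = l2_normalize(raw)

length = math.sqrt(sum(x * x for x in unit))
print(round(length, 6))  # 1.0
print(max(abs(x) for x in raw) > max(abs(x) for x in unit))  # True
```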

We can also update our collection one point at a time, for example, as new data comes in.

In [124]:
def create_song():
    return np.random.uniform(low=-1.0, high=1.0, size=100).tolist()

In [73]:
client.upsert(
    collection_name="my_first_collection",
    points=[
        models.PointStruct(
            id=1000,
            vector=create_song(),
        )
    ]
)

UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)

We can also delete it in a straightforward fashion.

In [74]:
client.count(
    collection_name="my_first_collection", 
    exact=True,
)

CountResult(count=1001)

In [75]:
client.delete(
    collection_name="my_first_collection",
    points_selector=models.PointIdsList(
        points=[1000],
    ),
)

UpdateResult(operation_id=2, status=<UpdateStatus.COMPLETED: 'completed'>)

In [76]:
client.count(
    collection_name="my_first_collection", 
    exact=True,
)

CountResult(count=1000)

#### 4.4.3 Payloads

Qdrant has many useful features on top of speed and reliability, and one of the most useful is without a doubt the ability to store additional information along with vectors. In Qdrant terminology, this information is called a payload, and it is represented as a JSON object. Not only can you get this information back when you search the database, but you can also filter your search by the fields in the payload, and we'll see how in a second.

Imagine the fake vectors we created actually represented a song. If we were building a recommender system for songs then, naturally, the things we would want to get back would be the song itself, the artist, maybe the genre, and so on.

What we'll do here is take advantage of a Python package called `faker` and create a bit of information to add to our payload.

In [77]:
from faker import Faker

In [84]:
fake_something = Faker()
fake_something.name()

'Heather Sanders DDS'

In [120]:
payload = []

for i in range(len(data)):
    payload.append(
        {
            "artist": fake_something.name(),
            "song": " ".join(fake_something.words()),
            "url_song": fake_something.url(),
            "year": fake_something.year(),
            "country": fake_something.country()
        }
    )

payload[:3]

[{'artist': 'Nicole Peterson',
  'song': 'box firm item',
  'url_song': 'http://curtis.biz/',
  'year': '1977',
  'country': 'Belize'},
 {'artist': 'Kathy Ponce',
  'song': 'though moment almost',
  'url_song': 'https://www.wallace-carter.info/',
  'year': '1991',
  'country': 'Chad'},
 {'artist': 'Lauren Payne',
  'song': 'approach individual despite',
  'url_song': 'https://lindsey.com/',
  'year': '1999',
  'country': 'Mexico'}]

In [121]:
client.upsert(
    collection_name="my_first_collection",
    points=models.Batch(
        ids=index,
        vectors=data.tolist(),
        payloads=payload
    )
)

UpdateResult(operation_id=7, status=<UpdateStatus.COMPLETED: 'completed'>)

In [122]:
results = client.retrieve(
    collection_name="my_first_collection",
    ids=[10, 50, 100, 500],
    with_vectors=False
)
results

[Record(id=500, payload={'artist': 'Martha Evans', 'country': 'Comoros', 'song': 'none approach increase', 'url_song': 'https://www.soto.com/', 'year': '1987'}, vector=None),
 Record(id=10, payload={'artist': 'Zachary Campbell', 'country': 'Spain', 'song': 'once something woman', 'url_song': 'http://www.rogers.com/', 'year': '2011'}, vector=None),
 Record(id=100, payload={'artist': 'Timothy Dorsey', 'country': 'Korea', 'song': 'under far evidence', 'url_song': 'http://hess-brown.com/', 'year': '1999'}, vector=None),
 Record(id=50, payload={'artist': 'John Compton DDS', 'country': 'Algeria', 'song': 'spend general economic', 'url_song': 'http://sanchez.com/', 'year': '1972'}, vector=None)]

In [123]:
results[0].payload

{'artist': 'Martha Evans',
 'country': 'Comoros',
 'song': 'none approach increase',
 'url_song': 'https://www.soto.com/',
 'year': '1987'}

Should we ever need to remove the payloads from our points, we can clear them with `clear_payload`. Skip this cell if you want to follow along with the search examples in the next section, since they rely on the payloads being present.

In [110]:
client.clear_payload(
    collection_name="my_first_collection",
    points_selector=models.PointIdsList(
        points=index,
    )
)

UpdateResult(operation_id=6, status=<UpdateStatus.COMPLETED: 'completed'>)

#### 4.4.4 Search

Now that we have our vectors with an ID and a payload, we can explore a few ways in which we can search for content when, in our use case, new music gets selected. Let's check it out.

Say, for example, that a new song comes in and our model immediately transforms it into a vector.

In [125]:
living_la_vida_loca = create_song()

In [127]:
client.search(
    collection_name="my_first_collection",
    query_vector=living_la_vida_loca,
    limit=10
)

[ScoredPoint(id=601, version=7, score=0.29209563, payload={'artist': 'Victoria Frazier', 'country': 'Nicaragua', 'song': 'even partner keep', 'url_song': 'http://www.lane.com/', 'year': '2012'}, vector=None),
 ScoredPoint(id=232, version=7, score=0.27164534, payload={'artist': 'Jamie Gibson', 'country': 'Wallis and Futuna', 'song': 'from teach special', 'url_song': 'http://matthews.info/', 'year': '1987'}, vector=None),
 ScoredPoint(id=380, version=7, score=0.26160175, payload={'artist': 'Jay Peters', 'country': 'Malawi', 'song': 'model offer chance', 'url_song': 'https://swanson-flores.com/', 'year': '1999'}, vector=None),
 ScoredPoint(id=848, version=7, score=0.26122367, payload={'artist': 'Jennifer Williams', 'country': 'Somalia', 'song': 'happy challenge activity', 'url_song': 'http://jackson.biz/', 'year': '1995'}, vector=None),
 ScoredPoint(id=119, version=7, score=0.24758185, payload={'artist': 'James Shaw', 'country': 'Ukraine', 'song': 'best spend vote', 'url_song': 'http://al

Now imagine that we only want Australian songs recommended to us.

In [128]:
aussie_songs = models.Filter(
    must=[models.FieldCondition(key="country", match=models.MatchValue(value="Australia"))]
)

In [130]:
client.search(
    collection_name="my_first_collection",
    query_vector=living_la_vida_loca,
    query_filter=aussie_songs,
    limit=5
)

[ScoredPoint(id=536, version=7, score=0.22075969, payload={'artist': 'Randy Smith', 'country': 'Australia', 'song': 'including nice special', 'url_song': 'http://www.scott.com/', 'year': '1984'}, vector=None),
 ScoredPoint(id=219, version=7, score=0.17249845, payload={'artist': 'Colleen Underwood', 'country': 'Australia', 'song': 'either represent else', 'url_song': 'https://www.tyler-brown.com/', 'year': '2021'}, vector=None),
 ScoredPoint(id=560, version=7, score=-0.054121234, payload={'artist': 'Kimberly Tucker', 'country': 'Australia', 'song': 'policy although music', 'url_song': 'https://www.thomas.org/', 'year': '1973'}, vector=None),
 ScoredPoint(id=810, version=7, score=-0.0861626, payload={'artist': 'Jacob Jacobs', 'country': 'Australia', 'song': 'serious treatment generation', 'url_song': 'https://jensen.com/', 'year': '1979'}, vector=None)]
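
Conceptually, a filtered search keeps only the points whose payload satisfies every `must` condition and then ranks the survivors by similarity. The plain-Python sketch below illustrates that idea on hypothetical points; it is not how Qdrant implements filtering internally, and `matches` and `filtered_search` are made-up helper names.

```python
def matches(payload, must):
    """True when every key in `must` has exactly the required value."""
    return all(payload.get(key) == value for key, value in must.items())


def filtered_search(query, points, must, limit):
    """Keep points that satisfy `must`, then rank them by dot product."""
    candidates = [p for p in points if matches(p["payload"], must)]
    candidates.sort(
        key=lambda p: sum(q * v for q, v in zip(query, p["vector"])),
        reverse=True,
    )
    return [p["id"] for p in candidates[:limit]]


# Hypothetical points with unit-length vectors, so dot product equals cosine similarity.
points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"country": "Australia"}},
    {"id": 2, "vector": [0.0, 1.0], "payload": {"country": "Australia"}},
    {"id": 3, "vector": [1.0, 0.0], "payload": {"country": "Chad"}},
]
print(filtered_search([0.8, 0.6], points, must={"country": "Australia"}, limit=5))  # [1, 2]
```

Note that `limit` is only an upper bound: when fewer points pass the filter than the limit allows, fewer results come back, which is why the search above returned four songs even though we asked for five.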

Lastly, say we want Aussie songs but we don't care how new or old they are, so we exclude the year from the returned payload.

In [131]:
client.search(
    collection_name="my_first_collection",
    query_vector=living_la_vida_loca,
    query_filter=aussie_songs,
    with_payload=models.PayloadSelectorExclude(exclude=["year"]),
    limit=5
)

[ScoredPoint(id=536, version=7, score=0.22075969, payload={'artist': 'Randy Smith', 'country': 'Australia', 'song': 'including nice special', 'url_song': 'http://www.scott.com/'}, vector=None),
 ScoredPoint(id=219, version=7, score=0.17249845, payload={'artist': 'Colleen Underwood', 'country': 'Australia', 'song': 'either represent else', 'url_song': 'https://www.tyler-brown.com/'}, vector=None),
 ScoredPoint(id=560, version=7, score=-0.054121234, payload={'artist': 'Kimberly Tucker', 'country': 'Australia', 'song': 'policy although music', 'url_song': 'https://www.thomas.org/'}, vector=None),
 ScoredPoint(id=810, version=7, score=-0.0861626, payload={'artist': 'Jacob Jacobs', 'country': 'Australia', 'song': 'serious treatment generation', 'url_song': 'https://jensen.com/'}, vector=None)]

As you can see, the possibilities are endless.

## 5. Transformers and Embeddings

In the context of audio data, embeddings and transformers are used to process the sound waves and extract features that are useful for training machine learning models.

### 5.1 What are transformers?

Transformers are a type of neural network originally developed for natural language processing, but they can also process audio data by breaking the sound wave into smaller parts and learning how those parts relate to one another.

### 5.2 What are Embeddings?

Embeddings are a way of representing audio data as vectors of numbers, which makes it easier for machine learning algorithms to process and analyze them.
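
An audio model typically produces one vector per short frame of audio, so a clip yields a whole sequence of vectors. To obtain a single fixed-size embedding per song, those frame vectors have to be combined somehow. The sketch below uses mean pooling, which is an assumption made for illustration; the notebook code further down keeps only the first frame instead. The `frames` values are hypothetical.

```python
def mean_pool(frames):
    """Average a list of equally sized frame vectors into one vector."""
    n_frames = len(frames)
    return [sum(column) / n_frames for column in zip(*frames)]


# Hypothetical model output for one clip: 3 frames, each a 4-dimensional vector.
frames = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
embedding = mean_pool(frames)
print(embedding)  # [2.0, 1.0, 2.0, 2.0]
```

Whatever pooling strategy is chosen, the result must have the same length for every song, because that length is the `size` of the collection the vectors are stored in.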

### 5.3 Fine-tuning Wav2Vec

In [None]:
import torch
# from datasets import Audio
from transformers import AutoModel, AutoFeatureExtractor

In [None]:
from IPython.display import Audio as player

In [None]:
sample = music_data[0]
player(sample['audio']['array'], rate=16_000)

In [None]:
labels = music_data.unique("genre")
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

num_labels = len(id2label)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
# pass the decoded waveform, not the whole dataset row
inputs = feature_extractor(
    sample['audio']['array'], sampling_rate=feature_extractor.sampling_rate,
    return_tensors="pt", padding=True, return_attention_mask=True
).to(device)

In [None]:
model = AutoModel.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
).to(device)

### 5.4 Extracting Embeddings

In [None]:
def extract_hidden_states(batch):
    inputs = {k: v.to(device) for k, v in batch.items() if k in feature_extractor.model_input_names}
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    return {"hidden_state": last_hidden_state[:, 0].cpu().numpy()}

In [None]:
hidden_state = data.map(extract_hidden_states, batched=True, batch_size=50)

In [None]:
np.save(
    file_path_name, 
    np.array(hidden_state["hidden_state"]), 
    allow_pickle=False
)

Embeddings and transformers are tools used to extract important information from audio data and make it easier for computers to understand and work with that information. They are used in a wide range of applications, from speech recognition to music analysis.

In [None]:
from transformers import Wav2Vec2Processor, Wav2Vec2Model

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

In [None]:
dataset['audio'][0]

In [None]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

In [None]:
sample = dataset[0]['audio']["array"]
player(sample, rate=16_000)

In [None]:
inputs = processor(sample, sampling_rate=16_000, return_tensors="pt", return_attention_mask=True)

In [None]:
inputs['input_values'].size()

In [None]:
with torch.no_grad():
    embeddings = model(inputs.input_values, inputs.attention_mask).last_hidden_state

In [None]:
embeddings.size()

In [None]:
embeddings[0, 0, :].size()

In [None]:
inputs['input_values'].size()

In [None]:
with torch.no_grad():
    embeds = model(inputs.input_values, inputs.attention_mask)
# hidden

In [None]:
embeds.last_hidden_state.size()

In [None]:
embeds['last_hidden_state'].size()

In [None]:
sample[:, None].shape

In [None]:
model(**sample).input_ids

In [None]:
vectors = np.load('vectors.npy')
vectors.shape

In [None]:
client = QdrantClient("localhost", port=6333)

In [None]:
client.recreate_collection(
    collection_name="test_collection",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
)

In [None]:
from pprint import pprint
collection_info = client.get_collection(collection_name="test_collection")
collection_info

In [None]:
from qdrant_client.http.models import CollectionStatus

assert collection_info.status == CollectionStatus.GREEN
assert collection_info.vectors_count == 0

In [None]:
len(vectors)

In [None]:
client.upsert(
    collection_name="test_collection",
    points=models.Batch(
        ids=list(range(len(vectors))),
        vectors=vectors.tolist()
    ),
)

In [None]:
from diffusers import AudioLDMPipeline
import torch

repo_id = "cvssp/audioldm"
pipe = AudioLDMPipeline.from_pretrained(repo_id)
pipe = pipe.to("cuda")

prompt = "high quality bachata"

audio = pipe(prompt=prompt, num_inference_steps=500, audio_length_in_s=5.0).audios[0]

from IPython.display import Audio

Audio(audio, rate=16000)

In [None]:
classifier(audio)

In [None]:
audio.shape

In [None]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("first_mod")
inputs = feature_extractor(audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt", max_length=16000, truncation=True)

In [None]:
inputs['input_values'].size()

In [None]:
with torch.no_grad():
    last_hidden_state = model(**inputs.to(device)).last_hidden_state[:, 0]
last_hidden_state.size()

In [None]:
vectr = last_hidden_state.cpu().numpy()[0, :]

In [None]:
results2 = client.search(
    collection_name="test_collection",
    query_vector=vectr,
    limit=10, 
    # with_vectors=True
)
results2

In [None]:
one_array = np.array(results[0].dict()["vector"])

In [None]:
music = []

for result in results:
    the_song = Audio(np.array(result.dict()["vector"]), rate=16_000)
    # feature_extractor(the_song, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt", max_length=16000, truncation=True)
    music.append(the_song)

In [None]:
music[2]

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
np.sum(one_array)

In [None]:
np.sum(vectors, axis=0) == np.sum(one_array)

In [None]:
scores = cosine_similarity([one_array], vectors)[0]
scores

In [None]:
top_scores_ids = np.argsort(scores)[-5:][::-1]
top_scores_ids