# Qdrant & Text Data

![qdrant](../images/crab_nlp.png)

Welcome to a tutorial on Natural Language Processing and Vector Databases! Here, we will explore how these two exciting technologies work together via Qdrant, a vector similarity search engine that provides a production-ready service with a convenient API to store, search, and manage vectors with an additional payload.

## Table of Contents

1. Learning Outcomes
2. Overview
3. Before We Get Started
4. NLP & Vector Databases
    - The Task & The Data
5. Conclusion
6. Resources

## 1. Learning Outcomes

By the end of this tutorial, you will be able to
- Generate embeddings from text data.
- Create collections of vectors using Qdrant.
- Conduct semantic search over a corpus of documents using Qdrant.
- Provide recommendations with Qdrant.

## 2. Overview

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and human language. It involves teaching computers to understand, interpret, and generate human language in a way that is both meaningful and useful. NLP techniques can help us with tasks such as text classification, named entity recognition, sentiment analysis, and language generation.

Vector Databases, on the other hand, are a type of database that specializes in storing and querying high-dimensional vectors. In the context of NLP, vectors are numerical representations of words, sentences, or documents that capture their semantic meaning. These vector representations, often referred to as word embeddings or document embeddings, transform textual data into a numerical format that machines can easily process and analyze.

Vector Databases serve as efficient storage systems for these vector representations, allowing for fast and accurate similarity search. They enable users to find similar words, sentences, or documents based on their semantic meaning rather than relying solely on exact matches or keywords. By organizing vectors in a way that facilitates quick retrieval and comparison, Vector Databases are instrumental in powering various NLP applications, including information retrieval, recommendation systems, semantic search, and content clustering.

The connecting dot between NLP and Vector Databases lies in the importance of vector representations in NLP tasks. Vector representations enable NLP algorithms to understand the contextual relationships and semantic meaning of textual data. By leveraging Vector Databases, NLP systems can efficiently store and retrieve these vector representations, making it easier to process and analyze large volumes of textual data.

Throughout this tutorial, we will delve deeper into the fundamentals of NLP and Vector Databases. In particular, we will learn (at a high-level) how to use transformers to create embeddings for a corpus of news, and how to use Qdrant to search and recommend the best matches to a chosen document.

## 3. Before We Get Started

To use Qdrant, you will need to pull the latest image from Docker Hub with the command `docker pull qdrant/qdrant`. Next, you can start Qdrant with the following command.

```sh
docker run -p 6333:6333 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant
```

Now that you have Qdrant up and running, your next step is to set up a virtual environment with the packages we'll be using. You can do so via the following commands.

```sh
# with mamba or conda
mamba create -n my_env python=3.10
mamba activate my_env

# or with virtualenv
python -m venv venv
source venv/bin/activate

# install packages
pip install qdrant-client transformers datasets pandas numpy torch faker matplotlib
```

After you have your environment ready, let's get started with Qdrant.

## 4. NLP & Vector Databases

The most common use case you will find at the time of writing will likely involve large language models. You might have heard of models like [GPT-4](https://openai.com/product/gpt-4), [Codex](https://openai.com/blog/openai-codex), and [PaLM-2](https://ai.google/discover/palm2) which are powering incredible tools such as [ChatGPT](https://openai.com/blog/chatgpt), [GitHub Copilot](https://github.com/features/copilot), and [Bard](https://bard.google.com/?hl=en), respectively. These three models are part of a family of deep learning architectures called [transformers](https://arxiv.org/abs/1706.03762), which are known for their ability to learn long-range dependencies between words in a sentence. This ability to learn from text makes them well-suited for tasks such as machine translation, text summarization, and question answering.

Transformer models work by using a technique called attention, which allows them to focus on different parts of a sentence when making predictions. For example, if you are trying to translate a sentence from English to Spanish, the transformer model will use attention to focus on the words in the English sentence that are most important for the translation into Spanish.

One analogy that can be used to explain transformer models is to think of them as a group of people who are trying to solve a puzzle. Each person in the group is given a different piece of the puzzle, and they need to work together to figure out how the pieces fit together. The transformer model is like the group of people, and the attention mechanism is like the way that the people in the group communicate with each other.

In a more concise way, transformer models are a type of machine learning model that can learn long-range dependencies between words in a sentence by using (or paying 😉) attention.
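
To make the idea of attention a little more concrete, here is a minimal, pure-Python sketch of scaled dot-product attention for a single query. The toy vectors and the helper names (`softmax`, `attend`) are made up for illustration; real transformer layers add learned projections, multiple heads, and batching on top of this.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # similarity between the query and every key, scaled by sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # the output is a weighted average of the value vectors
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output

# Toy 2-dimensional vectors for three tokens (hypothetical numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The tokens whose keys line up with the query receive the largest weights, so their values dominate the output. That is the sense in which the model "focuses" on some parts of the sentence more than others.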

In NLP, vector databases are used to store word embeddings. Word embeddings are vector representations of words that capture their semantic meaning, and these are used to improve the performance of different NLP tasks.

The transformers architecture has been incredibly influential in the field of machine learning, and one of the tools at the heart of this is the [`transformers`](https://huggingface.co/docs/transformers/index) library developed by the Hugging Face team. With it, getting embeddings from a corpus of text can be done in a very straightforward way.

Before we get started with the model, let's talk about the use case we will be covering here.

### 4.1 The Task & The Data

> We have been given the **task of creating a system that will recommend the most similar articles to any one article chosen by a user.** 

The dataset we will use is called the **AG News** dataset and here is a description from its [dataset card in Hugging Face](https://huggingface.co/datasets/ag_news):

> "AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic comunity for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html"

With that out of the way, let's download our dataset and load it into our session.

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("ag_news", split="train")
dataset

If you have never used Hugging Face's [`datasets`](https://huggingface.co/docs/datasets/index) library you might be a little puzzled regarding what just happened. Let's break it apart.

- The `datasets` library is a tool that allows us to manipulate unstructured data in a very efficient way by using [Apache Arrow](https://arrow.apache.org/) under the hood. It has a lot of useful functionalities for massaging and shaping up the data in whatever way we need it to be for our task. (It is safe to call it the pandas of unstructured data.)
- Next, we imported the `load_dataset` function and used it to download the dataset from the [Hugging Face Hub](https://huggingface.co/datasets) directly onto our machine.
- Lastly, by passing `split="train"` we ask for the training split only, so we get back a single dataset rather than a collection of every available split.

Let's have a look at a couple of samples.

In [None]:
from random import choice

for i in range(5):
    random_sample = choice(range(len(dataset)))
    print(f"Sample {i+1}")
    print("=" * 70)
    print(dataset[random_sample]['text'])
    print()

One nice feature of Hugging Face `datasets` objects is that we can switch effortlessly to a pandas dataframe by using the method `.to_pandas()`. This allows us to take advantage of many of the nice tools pandas comes with for manipulating and plotting data. Let's have a look at the distribution of the labels, but before we do that, let's extract the class names of our dataset as we will be needing them shortly.

In [None]:
id2label = {str(i): label for i, label in enumerate(dataset.features["label"].names)}

In [None]:
(
    dataset.select_columns('label')
           .to_pandas()
           .astype(str)['label']
           .map(id2label)
           .value_counts()
           .plot(kind="barh", title="Frequency with which each label appears")
);

As you can see, we have a very well-balanced dataset at our disposal. Let's look at the length of the news articles per class label. We will write a function for this and map it to all of the elements in our dataset. Note that this will create a new column in our dataset.

In [None]:
def get_length_of_text(example):
    # number of characters in the article
    example['length_of_text'] = len(example['text'])
    return example

dataset = dataset.map(get_length_of_text)
dataset[:10]['length_of_text']

In [None]:
(
    dataset.select_columns(["label", "length_of_text"])
           .to_pandas()
           .pivot(columns='label', values='length_of_text')
           .plot.hist(
                bins=100, alpha=0.5, #log=True,
                title="Distribution of the Length of News"
           )
);

The character lengths of the news articles seem to be quite similar across all the labels, with a few outliers here and there.

Our next step will be to use a pre-trained model to tokenize our data and create embeddings from it.

Tokenization is like breaking down a sentence into smaller pieces called "tokens." It's similar to how we break a sentence into words, but tokens can be whole words, pieces of words, numbers, curly brackets, or even punctuation marks. This process helps computers understand and analyze text more easily because they can treat each token as a separate unit and work with them individually. It's like taking a sentence and turning it into a set of building blocks that a computer can understand and manipulate.
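
As a rough illustration only, the following sketch splits a sentence into word and punctuation tokens with a regular expression and assigns each distinct token an integer ID. The helper names and the vocabulary are made up for this example; GPT-2 itself uses a learned byte-level byte-pair encoding that often splits words into smaller pieces.

```python
import re

def toy_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

text = "Birds can fly, and birds can sing."
tokens = toy_tokenize(text)
vocab = build_vocab(tokens)
ids = [vocab[tok] for tok in tokens]

print(tokens)  # ['Birds', 'can', 'fly', ',', 'and', 'birds', 'can', 'sing', '.']
print(ids)     # [0, 1, 2, 3, 4, 5, 1, 6, 7]
```

Note that the repeated token `can` maps to the same ID both times, while `Birds` and `birds` get different IDs because this toy scheme is case-sensitive.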

The model we will use to tokenize our news and extract the embeddings is [GPT-2](https://huggingface.co/gpt2). GPT-2 is a powerful language model created by OpenAI, and it is like a super-smart computer program that has been trained on a lot of text from the internet. You can think of it as an AI that can generate human-like text and answer questions based on what it has learned. GPT-2 can be used for a variety of things, like writing articles, creating chatbots, generating story ideas, or even helping with language translation. It's a tool that helps computers understand and generate text in a way that seems very human-like.

The process is similar to the one we followed with the `datasets` library. We will use two classes from the `transformers` library, `GPT2Tokenizer` and `GPT2Model`, and these will make use of the GPT-2 model checkpoint that we pass to them. The example below takes inspiration from an example in Chapter 9 of the excellent book [Natural Language Processing with Transformers](https://transformersbook.com/) by Lewis Tunstall, Leandro von Werra, and Thomas Wolf.

In [None]:
from transformers import GPT2Tokenizer, GPT2Model
import torch

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')#.to(device) # switch this for GPU

In natural language processing (NLP), padding refers to adding extra tokens to make all input sequences the same length. When processing text data, it's common for sentences or documents to have different lengths. However, many machine learning models require fixed-size inputs. Padding solves this issue by adding special tokens (such as zeros) to the shorter sequences, making them equal in length to the longest sequence in the dataset.

For example, let's say you have a set of sentences: "I love cats," "Dogs are friendly," and "Birds can fly." If you want to process them using a model that requires a fixed-length input of, say, five tokens, you would pad each three-word sentence with two padding tokens. The padded sentences would look like this:

1. "I love cats" -> "I love cats [PAD] [PAD]"
2. "Dogs are friendly" -> "Dogs are friendly [PAD] [PAD]"
3. "Birds can fly" -> "Birds can fly [PAD] [PAD]"

By padding the sequences, you ensure that all inputs have the same size, allowing the model to process them uniformly. Padding is a common preprocessing step in NLP tasks like text classification, sentiment analysis, and machine translation.
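
The padding idea can be sketched in a few lines of plain Python. This is a simplified illustration in which whitespace-separated words stand in for real tokens and `pad_batch` is a hypothetical helper; in practice the tokenizer does this for us when we pass `padding=True`. The sketch also builds an attention mask, which marks real tokens with 1 and padding with 0.

```python
def pad_batch(sequences, pad_token="[PAD]"):
    """Pad every token list to the length of the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_token] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, masks

# Sentences of different lengths (words stand in for tokens).
batch = [
    ["I", "love", "cats"],
    ["Dogs", "are", "very", "friendly", "animals"],
    ["Birds", "fly"],
]

padded, masks = pad_batch(batch)
for tokens, mask in zip(padded, masks):
    print(tokens, mask)
```

Every row now has five entries, and the mask lets later steps ignore the positions that hold only padding.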

Because GPT-2 does not have a padding token, we will use the "end of text" token instead.

In [None]:
tokenizer.eos_token

In [None]:
tokenizer.pad_token

In [None]:
tokenizer.pad_token = tokenizer.eos_token

With that out of the way, let's walk through a quick example.

In [None]:
text = "What does a cow use to do math? A cow-culator."
inputs = tokenizer(text, padding=True, truncation=True, max_length=128, return_tensors="pt")#.to(device)
inputs

The tokenizer returns a tensor of input IDs, where each ID maps a token in our sentence to its entry in the model's vocabulary. We can convert these IDs back into tokens to see how the sentence was split.

In [None]:
toks = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
toks

We can always, of course, reverse the process.

In [None]:
tokenizer.convert_tokens_to_string(toks)

And if you are curious about how large the vocabulary of your model is, you can always access it with the attribute `.vocab_size`.

In [None]:
tokenizer.vocab_size

In [None]:
tokenizer.max_model_input_sizes

In [None]:
tokenizer.model_max_length

In [None]:
tokenizer.model_input_names

Now it is time to pass the inputs we got from our tokenizer to our model and examine what we'll get in return.

In [None]:
with torch.no_grad():
    embs = model(**inputs)

embs.last_hidden_state.size(), embs[0]

Notice that we got a tensor of shape `[batch_size, inputs, dimensions]`. The inputs are our tokens, and the dimensions are the embedding of each token. What we want, however, is a single embedding for the whole sentence rather than one per token. So what can we do to get one vector rather than 15? The answer is **mean pooling**. We are going to take the average of all 15 token vectors, using the attention mask so that only real tokens (and not padding) contribute to the result. The details of how this happens are outside of the scope of this tutorial, but please refer to the Natural Language Processing with Transformers book mentioned earlier for a richer discussion of the concepts touched on in this section (including the borrowed function we are about to use).

In [None]:
def mean_pooling(model_output, attention_mask):

    token_embeddings = model_output[0]
    input_mask_expanded = (attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float())
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

In [None]:
embedding = mean_pooling(embs, inputs["attention_mask"])
embedding.shape, embedding[0, :10]
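
To see what `mean_pooling` computes without any tensors involved, here is a small pure-Python sketch of the same masked average for a single sentence. The numbers and the helper name `masked_mean` are made up for illustration.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    dims = len(token_vectors[0])
    sums = [0.0] * dims
    for vec, keep in zip(token_vectors, attention_mask):
        for i in range(dims):
            sums[i] += vec[i] * keep  # padded positions are multiplied by 0
    count = max(sum(attention_mask), 1e-9)  # guard against division by zero
    return [s / count for s in sums]

# Three "token embeddings" of dimension 2; the last one is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

print(masked_mean(token_vectors, attention_mask))  # [2.0, 3.0]
```

The padded vector is ignored entirely, so the result is the average of the first two vectors only.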

Now we have everything we need to extract embeddings from our corpus of news. The last piece of the puzzle is to create a function that we can map to every news article to extract its embedding. Let's do that using our tokenizer and model from earlier, and, since our dataset contains quite a few articles, we'll apply it to a smaller subset of the data.

In [None]:
def embed_text(examples):
    inputs = tokenizer(
        examples["text"], padding=True, truncation=True, return_tensors="pt"
    )#.to(device)
    with torch.no_grad():
        model_output = model(**inputs)
    pooled_embeds = mean_pooling(model_output, inputs["attention_mask"])
    return {"embedding": pooled_embeds.cpu().numpy()}

In [None]:
small_set = (
    dataset.shuffle(42) # randomly shuffles the data, 42 is the seed
           .select(range(1000)) # we'll take 1k rows
           .map(embed_text, batched=True, batch_size=128) # and apply our function above to 128 articles at a time
)
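
With `batched=True`, the function passed to `map` receives a whole batch of rows at once rather than a single row. The plain-Python sketch below shows how 1,000 rows divide into batches of 128. The helper `iter_batches` is hypothetical and only mimics the chunking; it is not how `datasets` is implemented.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

rows = list(range(1000))  # stand-in for our 1k articles
sizes = [len(batch) for batch in iter_batches(rows, 128)]

print(len(sizes), sizes[-1])  # 8 104
```

So `embed_text` is called eight times: seven times with 128 articles and once with the remaining 104.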

In [None]:
small_set

As you can see, we now have an extra column with the embeddings for our data, and we can use these vector representations to semantically search for other news articles or to recommend similar articles to our users by taking advantage of Qdrant.

Before we add our news articles to Qdrant, let's create an index for our dataset and a column with the labels to allow our users to get recommendations in a more precise fashion, i.e. by context.

In [None]:
ids = list(range(len(small_set)))  # one integer id per row
small_set = small_set.add_column("idx", ids)
small_set

In [None]:
small_set['idx'][-10:]

In [None]:
def get_names(label_num):
    return id2label[str(label_num)]

label_names = list(map(get_names, small_set['label']))
small_set = small_set.add_column("label_names", label_names)
small_set

Now that we have everything we need, we can create a new collection for our use case. We'll call it `news_embeddings`.

In [None]:
dim_size = len(small_set[0]["embedding"]) # we'll need the dimensions of our embeddings

The diagram above represents a high-level overview of some of the main components of Qdrant. Here are the terms you should get familiar with.

- [Collections](https://qdrant.tech/documentation/collections/): A collection is a named set of points (vectors with a payload) among which you can search. Vectors within the same collection must have the same dimensionality and be compared by a single metric.
- Distance Metrics: These are used to measure similarities among vectors, and they must be selected at the same time you create a collection. The choice of metric depends on the way the vectors were obtained and, in particular, on the method used to train the neural network encoder.
- [Points](https://qdrant.tech/documentation/points/): The points are the central entity that Qdrant operates with, and they consist of an id, a vector, and an optional payload.
- id: a unique identifier for your vectors.
- Vector: a high-dimensional representation of data, for example, an image, a sound, a document, a video, etc.
- [Payload](https://qdrant.tech/documentation/payload/): A payload is additional data you can add to a vector.
- [Storage](https://qdrant.tech/documentation/storage/): Qdrant can use one of two options for storage: **In-memory** storage (stores all vectors in RAM and has the highest speed, since disk access is required only for persistence) or **Memmap** storage (creates a virtual address space associated with the file on disk).
- Clients: the programming languages you can use to connect to Qdrant.
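
Since we will pick the cosine metric when creating our collection, here is a minimal pure-Python sketch of cosine similarity. The vectors are toy values chosen for illustration, and `cosine_similarity` is our own helper, not part of the Qdrant client.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]  # doc scaled by 2
orthogonal = [3.0, 0.0, -1.0]     # dot product with doc is 0

print(round(cosine_similarity(doc, doc), 6))
print(round(cosine_similarity(doc, same_direction), 6))
print(round(cosine_similarity(doc, orthogonal), 6))
```

A vector compared with itself, or with a scaled copy of itself, gets the highest possible score, because cosine similarity looks only at direction and ignores magnitude.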

The two pieces of the client library we'll use the most are `QdrantClient` and `models`. The former allows us to connect to a running Qdrant instance or to run an in-memory database by setting the parameter `location=":memory:"` (this is a great feature for testing in a CI/CD pipeline). We'll start by instantiating our client using `host="localhost"` and `port=6333` (the port we exposed earlier with docker). You can also follow along with the `location=":memory:"` option commented out below.

In [None]:
from qdrant_client import QdrantClient
from qdrant_client.http import models
from qdrant_client.http.models import CollectionStatus

In [None]:
client = QdrantClient(host="localhost", port=6333)
client

In [None]:
# client = QdrantClient(location=":memory:")
# client

In [None]:
my_collection = "news_embeddings"
second_collection = client.recreate_collection(
    collection_name=my_collection,
    vectors_config=models.VectorParams(size=dim_size, distance=models.Distance.COSINE)
)

Before we fill in our new collection, we want to create a payload that contains the news domain the article belongs to plus the article itself. Note that this payload is a list of JSON objects where the key is the name of the column and the value is the label or text of that same column.

Something that could be incredibly useful is to refocus our model on the task of named entity recognition and extract characteristics from the text that could be used to filter via the payload. I will leave this task to you, though, our dear learner.

In [None]:
payloads = small_set.select_columns(["label_names", "text"]).to_pandas().to_dict(orient="records")
payloads[:3]
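
If the "records" orientation is new to you, the following plain-Python sketch shows the same reshaping on two made-up rows: a dictionary of columns becomes a list with one dictionary per row. The helper `columns_to_records` and the sample texts are hypothetical.

```python
def columns_to_records(columns):
    """Turn {column_name: [values, ...]} into a list with one dict per row."""
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]

# Made-up sample data with the same column names as our payload.
columns = {
    "label_names": ["Business", "World"],
    "text": ["Stocks rallied on Tuesday.", "Leaders met for talks."],
}

payloads_demo = columns_to_records(columns)
print(payloads_demo)
```

Each dictionary in the resulting list is attached to the point at the same position when we upload the data.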

In [None]:
client.upsert(
    collection_name=my_collection,
    points=models.Batch(
        ids=small_set["idx"],
        vectors=small_set["embedding"],
        payloads=payloads
    )
)

We can verify that our collection has been created by scrolling through the points with the following command.

In [None]:
client.scroll(
    collection_name=my_collection, 
    scroll_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="label_names", 
                match=models.MatchValue(value="Business")
            ),
        ]
    ),
    limit=3,
    with_payload=True,
)

We can also have a look at the vectors by adding `with_vectors=True` to the `client.scroll` call, and we can hide the payloads by setting `with_payload=False` if you'd like to see only the vectors.

Now that we have our collection ready to roll, let's start querying the data and see what we get.

In [None]:
query1 = small_set[100]['embedding']
small_set[100]['text'], query1[:7]

As you can see, the text above is talking about stocks, so let's have a look at what kinds of articles we can find with Qdrant.

In [None]:
client.search(
    collection_name=my_collection,
    query_vector=query1,
    limit=3
)

Of course, the first article is going to be the same one we used to query the data, since there is no distance between a vector and itself. The other interesting thing we can see here is that, even though the results have different labels, we still get semantically similar articles with the label `World` as we do with the label `Business`.

The nice thing about what we have done is that we are getting decent results and we haven't even fine-tuned the model to our use case. To fine-tune a transformer model means to take a pre-trained model that has learned general knowledge from vast amounts of data and adapt it to a specific task or domain. It's like giving a smart assistant some additional training to make them better at a particular job. By fine-tuning, the model learns to understand text relevant to the specific task, improving its performance and making it more useful for specific applications. When we do this, we should expect even better results from our search.

Let's pick a random sample from the larger dataset and see what we get back from Qdrant. Note that because our function was created to be applied on a dictionary object, we'll represent the random text in the same way.

In [None]:
# Step 1 - Select Random Sample
query2 = {"text": dataset[choice(range(len(dataset)))]['text']}
query2

In [None]:
# Step 2 - Create a Vector
query2 = embed_text(query2)['embedding'][0, :]
query2.shape, query2[:20]

In [None]:
query2.tolist()[:20]

In [None]:
# Step 3 - Search for similar articles. Don't forget to convert the vector to a list.
client.search(
    collection_name=my_collection,
    query_vector=query2.tolist(),
    limit=3
)

Because we selected a random sample, you will see something different every time you go through this part of the tutorial, so make sure you read some of the articles that come back and evaluate the similarity of these articles to the one you randomly got from the larger dataset. Have some fun with it too.

Let's make things more interesting and pick the most similar results from a Business context. We'll do so by creating a field condition with `models.FieldCondition()`, setting the `key` to `label_names` and the `match` parameter to `models.MatchValue(value="Business")`.

In [None]:
business = models.Filter(
    must=[models.FieldCondition(key="label_names", match=models.MatchValue(value="Business"))]
)

We will add this as a query filter to our `client.search` call and see what we get.

In [None]:
client.search(
    collection_name=my_collection,
    query_vector=query2.tolist(),
    query_filter=business,
    limit=3
)
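
Conceptually, a filtered search keeps only the points whose payload satisfies the condition and then ranks them by similarity to the query. The brute-force sketch below illustrates that idea in plain Python with made-up points. It is not how Qdrant works internally (Qdrant relies on an approximate nearest-neighbor index rather than scanning every point), and `filtered_search` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filtered_search(points, query_vector, must_match, limit=3):
    """Keep points whose payload matches every key/value pair, then rank by similarity."""
    candidates = [
        p for p in points
        if all(p["payload"].get(key) == value for key, value in must_match.items())
    ]
    scored = [(cosine_similarity(p["vector"], query_vector), p["id"]) for p in candidates]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:limit]

# Made-up points: an id, a 2-dimensional vector, and a payload.
points = [
    {"id": 0, "vector": [1.0, 0.0], "payload": {"label_names": "Business"}},
    {"id": 1, "vector": [0.9, 0.1], "payload": {"label_names": "World"}},
    {"id": 2, "vector": [0.0, 1.0], "payload": {"label_names": "Business"}},
    {"id": 3, "vector": [0.7, 0.7], "payload": {"label_names": "Business"}},
]

results = filtered_search(points, [1.0, 0.0], {"label_names": "Business"}, limit=2)
print([point_id for _, point_id in results])  # [0, 3]
```

Point 1 is very close to the query but is excluded because its label is `World`, which mirrors what the `business` filter does in the search above.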

To see all of the collections that we have created today, you can use `client.get_collections`.

In [None]:
client.get_collections()

That's it! You have now gone over a whirlwind tour of vector databases and are ready to tackle new challenges. 😎

## 5. Conclusion

In conclusion, we have explored a bit of the fascinating world of vector databases, natural language processing, transformers, and embeddings. In this tutorial we learned that (1) vector databases provide efficient storage and retrieval of high-dimensional vectors, making them ideal for similarity-based search tasks; (2) natural language processing enables us to understand and process human language, opening up possibilities for different kinds of useful applications for digital technologies; (3) transformers, with their attention mechanism, capture long-range dependencies in language and achieve incredible results in different tasks; and (4) embeddings encode words or sentences into dense vectors, capturing semantic relationships and enabling powerful language understanding.

By combining these technologies, we can unlock new levels of language understanding, information retrieval, and intelligent systems that continue to push the boundaries of what's possible in the realm of AI.

## 6. Resources

Here is a list with some resources that we found useful, and that helped with the development of this tutorial.

1. Books
    - [Natural Language Processing with Transformers](https://transformersbook.com/) by Lewis Tunstall, Leandro von Werra, and Thomas Wolf
    - [Natural Language Processing in Action, Second Edition](https://www.manning.com/books/natural-language-processing-in-action-second-edition) by Hobson Lane and Maria Dyshel
2. Articles
    - [Fine Tuning Similar Cars Search](https://qdrant.tech/articles/cars-recognition/)
    - [Q&A with Similarity Learning](https://qdrant.tech/articles/faq-question-answering/)
    - [Question Answering with LangChain and Qdrant without boilerplate](https://qdrant.tech/articles/langchain-integration/)
    - [Extending ChatGPT with a Qdrant-based knowledge base](https://qdrant.tech/articles/chatgpt-plugin/)
3. Videos
    - [Word Embedding and Word2Vec, Clearly Explained!!!](https://www.youtube.com/watch?v=viZrOnJclY0&ab_channel=StatQuestwithJoshStarmer) by StatQuest with Josh Starmer
    - [Word Embeddings, Bias in ML, Why You Don't Like Math, & Why AI Needs You](https://www.youtube.com/watch?v=25nC0n9ERq4&ab_channel=RachelThomas) by Rachel Thomas
4. Courses
    - [fast.ai Code-First Intro to Natural Language Processing](https://www.youtube.com/playlist?list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9)
    - [NLP Course by Hugging Face](https://huggingface.co/learn/nlp-course/chapter1/1)