# Using a vector database

- https://github.com/openai/openai-cookbook/blob/main/examples/vector_databases/qdrant/Getting_started_with_Qdrant_and_OpenAI.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/vector_databases/qdrant/QA_with_Langchain_Qdrant_and_OpenAI.ipynb

## Which vector database shall we try?

- https://platform.openai.com/docs/guides/embeddings/how-can-i-retrieve-k-nearest-embedding-vectors-quickly

In [1]:
! curl http://localhost:6333

{"title":"qdrant - vector search engine","version":"1.1.1"}

## Connect to Qdrant

In [2]:
import qdrant_client

client = qdrant_client.QdrantClient(
    host="localhost",
    prefer_grpc=True,
)

In [3]:
client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='Articles')])

## Load data

In [4]:
import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)

'vector_database_wikipedia_articles_embedded.zip'

In [5]:
import zipfile

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("../data")

In [8]:
import pandas as pd

from ast import literal_eval

# Load only the first 1,000 rows to keep the demo fast;
# drop nrows to load the full dataset
article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv', nrows=1000)

# Parse the vector columns from their string form back into Python lists
article_df["title_vector"] = article_df.title_vector.apply(literal_eval)
article_df["content_vector"] = article_df.content_vector.apply(literal_eval)
article_df.head()

|   | id | url | title | text | title_vector | content_vector | vector_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0 |
| 1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1 |
| 2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2 |
| 3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3 |
| 4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4 |
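
The vector columns are stored in the CSV as text, which is why the cell above passes them through `literal_eval`. A minimal sketch of that conversion on a single value follows; `raw_cell` is a hypothetical, shortened stand-in for one CSV cell, not data from the actual file.

```python
from ast import literal_eval

# Hypothetical, shortened stand-in for one CSV cell (real vectors are much longer)
raw_cell = "[0.001, -0.0207, 0.5]"

# literal_eval parses Python literals only; it does not execute arbitrary code
vector = literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(vector).__name__)
print(len(vector), vector[0])
```

After parsing, each cell is a plain Python list of floats, which is the form the indexing step below expects.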


## Index data

In [9]:
from qdrant_client.http import models as rest

vector_size = len(article_df["content_vector"][0])

client.recreate_collection(
    collection_name="Articles",
    vectors_config={
        "title": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
        "content": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
    }
)

True

In [10]:
client.upsert(
    collection_name="Articles",
    points=[
        rest.PointStruct(
            id=k,
            vector={
                "title": v["title_vector"],
                "content": v["content_vector"],
            },
            # Store the full row as payload (this includes both vector columns)
            payload=v.to_dict(),
        )
        for k, v in article_df.iterrows()
    ],
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [11]:
client.count(collection_name="Articles")

CountResult(count=1000)
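
With 1,000 rows, a single upsert call is enough. If the full CSV were loaded instead, it might be preferable to send the points in smaller batches. The sketch below shows one way to split a sequence into chunks; the helper name `batched` and the batch size are assumptions made for illustration, and a plain list of integers stands in for the points.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items (hypothetical helper)."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Stand-in for the list of points built in the upsert cell above
points = list(range(10))
batches = list(batched(points, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

In the notebook, each batch could then be passed as the `points` argument of a separate `client.upsert` call on the same collection.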

## Search data

In [12]:
import openai

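def query_qdrant(query, collection_name, vector_name="title", top_k=20):
    """Embed the query and return the top_k nearest points for one named vector."""
    # Create an embedding vector from the user query
    # (openai.Embedding.create is the pre-1.0 openai SDK interface)
    embedded_query = openai.Embedding.create(
        input=query,
        model="text-embedding-ada-002",
    )["data"][0]["embedding"]

    query_results = client.search(
        collection_name=collection_name,
        # A (name, vector) tuple selects which named vector to search against
        query_vector=(
            vector_name, embedded_query
        ),
        limit=top_k,
    )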

    return query_results

In [13]:
query_results = query_qdrant("modern art in Europe", "Articles")
for i, article in enumerate(query_results):
    print(f"{i + 1}. {article.payload['title']} (Score: {round(article.score, 3)})")

1. Art (Score: 0.841)
2. Europe (Score: 0.839)
3. European Union (Score: 0.824)
4. Eurasia (Score: 0.821)
5. Asia (Score: 0.817)
6. Vienna (Score: 0.817)
7. Italy (Score: 0.816)
8. Architecture (Score: 0.815)
9. Madrid (Score: 0.815)
10. France (Score: 0.812)
11. 20th century (Score: 0.811)
12. Munich (Score: 0.808)
13. Belgium (Score: 0.808)
14. Rome (Score: 0.805)
15. Century (Score: 0.805)
16. Netherlands (Score: 0.805)
17. Greece (Score: 0.803)
18. Brussels (Score: 0.803)
19. Holland (Score: 0.803)
20. Euro (Score: 0.803)
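
Because the collection was created with `Distance.COSINE`, the scores above should be read as cosine similarities between the query embedding and each stored vector, with higher values meaning a closer match. A minimal pure-Python sketch of that measure follows; the two-dimensional vectors are toy values chosen for illustration, not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-D vectors (illustrative only)
query = [1.0, 0.0]
same_direction = [2.0, 0.0]
diagonal = [1.0, 1.0]
orthogonal = [0.0, 3.0]

for name, vec in [("same_direction", same_direction),
                  ("diagonal", diagonal),
                  ("orthogonal", orthogonal)]:
    print(name, round(cosine_similarity(query, vec), 3))
```

Vector length does not affect the result, only direction does, which is why `same_direction` scores 1.0 despite being twice as long as `query`.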


In [14]:
# This time we'll query using the content vector
query_results = query_qdrant("Famous battles in Scottish history", "Articles", "content")
for i, article in enumerate(query_results):
    print(f"{i + 1}. {article.payload['title']} (Score: {round(article.score, 3)})")

1. Scotland (Score: 0.817)
2. Scottish Football League (Score: 0.809)
3. 1965 (Score: 0.777)
4. Ireland (Score: 0.761)
5. Dissolution of the monasteries (Score: 0.758)
6. United Kingdom (Score: 0.757)
7. England (Score: 0.757)
8. 1925 (Score: 0.756)
9. Wales (Score: 0.756)
10. 1974 (Score: 0.755)
11. Great Britain (Score: 0.755)
12. Sheffield (Score: 0.752)
13. Plymouth Argyle F.C. (Score: 0.751)
14. 1940s (Score: 0.75)
15. War (Score: 0.747)
16. 1978 (Score: 0.747)
17. Rudyard Kipling (Score: 0.746)
18. 20th century (Score: 0.746)
19. Elizabeth II (Score: 0.746)
20. Welsh League (Score: 0.745)
