# **1. Installation**

In [1]:
# !pip install -U sentence-transformers
# !pip install -U qdrant-client

## **Import the models**

In [2]:
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer



The **Sentence Transformers** framework contains many embedding models. We’ll use `all-MiniLM-L6-v2` because it offers a good balance between speed and embedding quality.

In [3]:
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# **2. Add the dataset**

`all-MiniLM-L6-v2` will encode the data you provide. Here you will list all the science fiction books in your library. Each book has four metadata fields: a name, an author, a publication year, and a short description.

In [4]:
documents = [
    {
        "name": "The Time Machine",
        "description": "A man travels through time and witnesses the evolution of humanity.",
        "author": "H.G. Wells",
        "year": 1895,
    },
    {
        "name": "Ender's Game",
        "description": "A young boy is trained to become a military leader in a war against an alien race.",
        "author": "Orson Scott Card",
        "year": 1985,
    },
    {
        "name": "Brave New World",
        "description": "A dystopian society where people are genetically engineered and conditioned to conform to a strict social hierarchy.",
        "author": "Aldous Huxley",
        "year": 1932,
    },
    {
        "name": "The Hitchhiker's Guide to the Galaxy",
        "description": "A comedic science fiction series following the misadventures of an unwitting human and his alien friend.",
        "author": "Douglas Adams",
        "year": 1979,
    },
    {
        "name": "Dune",
        "description": "A desert planet is the site of political intrigue and power struggles.",
        "author": "Frank Herbert",
        "year": 1965,
    },
    {
        "name": "Foundation",
        "description": "A mathematician develops a science to predict the future of humanity and works to save civilization from collapse.",
        "author": "Isaac Asimov",
        "year": 1951,
    },
    {
        "name": "Snow Crash",
        "description": "A futuristic world where the internet has evolved into a virtual reality metaverse.",
        "author": "Neal Stephenson",
        "year": 1992,
    },
    {
        "name": "Neuromancer",
        "description": "A hacker is hired to pull off a near-impossible hack and gets pulled into a web of intrigue.",
        "author": "William Gibson",
        "year": 1984,
    },
    {
        "name": "The War of the Worlds",
        "description": "A Martian invasion of Earth throws humanity into chaos.",
        "author": "H.G. Wells",
        "year": 1898,
    },
    {
        "name": "The Hunger Games",
        "description": "A dystopian society where teenagers are forced to fight to the death in a televised spectacle.",
        "author": "Suzanne Collins",
        "year": 2008,
    },
    {
        "name": "The Andromeda Strain",
        "description": "A deadly virus from outer space threatens to wipe out humanity.",
        "author": "Michael Crichton",
        "year": 1969,
    },
    {
        "name": "The Left Hand of Darkness",
        "description": "A human ambassador is sent to a planet where the inhabitants are genderless and can change gender at will.",
        "author": "Ursula K. Le Guin",
        "year": 1969,
    },
    {
        "name": "The Three-Body Problem",
        "description": "Humans encounter an alien civilization that lives in a dying system.",
        "author": "Liu Cixin",
        "year": 2008,
    },
]

# **3. Define storage location**

Tell Qdrant where to store the data. With `":memory:"`, the client keeps everything in your machine's RAM, so the data is lost when the process ends.

In [5]:
client = QdrantClient(":memory:")

# **4. Create a collection**

What is a **Qdrant Collection**?
-> A **Qdrant Collection** is the primary way Qdrant organizes data.

A simple analogy:

- A **Collection** in Qdrant is like a **Table** in a **SQL database**, or an **Index** in **Elasticsearch**.

- It is a named container designed to store and manage a set of **Points** (a Point is like a row within that table).

---
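To make the table analogy concrete, here is a minimal pure-Python sketch of the data model. `ToyPoint` and `ToyCollection` are hypothetical names invented for this illustration; they are not part of `qdrant-client`, and a real collection also maintains a vector index.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyPoint:
    """One 'row': an id, a vector, and arbitrary metadata (the payload)."""
    id: int
    vector: List[float]
    payload: dict = field(default_factory=dict)


@dataclass
class ToyCollection:
    """One 'table': a named container whose points all share one vector size."""
    name: str
    size: int
    points: Dict[int, ToyPoint] = field(default_factory=dict)

    def upsert(self, point: ToyPoint) -> None:
        # Reject vectors whose length does not match the collection's size.
        if len(point.vector) != self.size:
            raise ValueError(f"expected {self.size} dimensions, got {len(point.vector)}")
        # Writing to an existing id overwrites that point.
        self.points[point.id] = point


books = ToyCollection(name="my_books", size=3)
books.upsert(ToyPoint(id=0, vector=[0.1, 0.2, 0.3], payload={"name": "Dune"}))
print(len(books.points))  # 1
```

The size check in `upsert` mirrors the idea that every vector in a collection must have the same dimensionality.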

All data in **Qdrant** is organized by collections. In this case, you are storing books, so the collection is called `my_books`.

In [6]:
# call the function to create a new collection
client.create_collection(
    collection_name="my_books",
    vectors_config=models.VectorParams(                     # properties of the vectors
        size=encoder.get_sentence_embedding_dimension(),    # set the dim or len of the vectors
        distance=models.Distance.COSINE,                    # define the distance metrics to compare SIMILARITY 
    ),
)

True

- The `size` parameter defines the size of the vectors for a specific collection. If two vectors have different sizes, it is impossible to calculate the distance between them. 384 is the encoder output dimensionality. You can also use `encoder.get_sentence_embedding_dimension()` to get the dimensionality of the model you are using.

- The `distance` parameter lets you specify the function used to measure the distance between two points.
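As a rough illustration of what `Distance.COSINE` measures, the sketch below computes cosine similarity in plain Python. The `cosine_similarity` helper and the toy vectors are made up for this example; Qdrant's own implementation is not shown here. The length check also shows why vectors of different sizes cannot be compared.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

A higher value means the two vectors point in more similar directions, which is why the search results are ranked from the highest score down.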

# **5. Upload data to collection**

Tell the database to upload `documents` to the `my_books` collection. This will give each record an **`id`** and a **`payload`**. The **`payload`** is just the **metadata** from the dataset.

In [7]:
# build one point per document, then upload them all

loaded_points = []
for idx, doc in enumerate(documents):
    point = models.PointStruct(
        id=idx,
        vector=encoder.encode(doc["description"]).tolist(),
        payload=doc,
    )
    loaded_points.append(point)
# Why tolist()? encoder.encode(str) returns a NumPy array by default,
# while PointStruct is given a plain Python list here, so tolist() converts it.
client.upload_points(
    collection_name="my_books",
    points=loaded_points,
)

# **6. Ask the engine a question**

Now that the data is stored in Qdrant, you can ask it questions and receive semantically relevant results.

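```python
# Signature sketch (comments only, not meant to be run as-is):
#
# response = client.query_points(
#     collection_name=...,  # str: name of the collection to search
#     query=...,            # list[float]: the query vector
#     limit=...,            # int: maximum number of results to return
# )
#
# response         -> a QueryResponse object
# response.points  -> list of results; each one is a scored point (id, score, payload, ...)
# point.score      -> similarity between the query and that point (set per point, not on the response)
```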

In [8]:
# semantic search
hits = client.query_points(
    collection_name="my_books",
    query=encoder.encode("superheroes").tolist(),
    limit=3,    # -> top k most similar
)

hits_points = hits.points

for hit in hits_points:
    print(hit.payload, "score:", hit.score)

{'name': 'The Hunger Games', 'description': 'A dystopian society where teenagers are forced to fight to the death in a televised spectacle.', 'author': 'Suzanne Collins', 'year': 2008} score: 0.2524248497646995
{'name': 'Snow Crash', 'description': 'A futuristic world where the internet has evolved into a virtual reality metaverse.', 'author': 'Neal Stephenson', 'year': 1992} score: 0.23755856160144348
{'name': 'The Left Hand of Darkness', 'description': 'A human ambassador is sent to a planet where the inhabitants are genderless and can change gender at will.', 'author': 'Ursula K. Le Guin', 'year': 1969} score: 0.2094308986093288
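Conceptually, a query like this scores stored vectors against the query vector and keeps the `limit` best. The sketch below does this by brute force with hypothetical 2-dimensional toy vectors standing in for real embeddings; `cosine` and `top_k` are helper names invented here. A Qdrant server normally relies on an approximate index (HNSW) instead of scanning every point, so this only illustrates the ranking idea.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query, points, k):
    """Score every point against the query and return the k highest-scoring ones."""
    scored = [(cosine(query, p["vector"]), p["payload"]) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-D vectors, invented for illustration (the real embeddings have 384 dimensions).
toy_points = [
    {"vector": [1.0, 0.0], "payload": {"name": "A"}},
    {"vector": [0.0, 1.0], "payload": {"name": "B"}},
    {"vector": [1.0, 1.0], "payload": {"name": "C"}},
]

for score, payload in top_k([1.0, 0.2], toy_points, k=2):
    print(payload["name"], round(score, 3))
```

With this toy data, "A" ranks first and "C" second, because their vectors point in directions closest to the query.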


## **Narrow down the query**

How about the best match for "alien invasion" among books published in 2000 or later?

In [9]:
# filtered search: semantic search combined with a payload filter
hits = client.query_points(
    collection_name="my_books",
    query=encoder.encode("alien invasion").tolist(),  # semantic part of the query
    query_filter=models.Filter(
        # must = conditions that every result has to satisfy
        # key  = a payload field: name, description, author, year, ...
        # gte  = greater than or equal
        must=[models.FieldCondition(key="year", range=models.Range(gte=2000))]
    ),
    limit=1,  # return only the best match
).points

for hit in hits:
    print(hit.payload, "score:", hit.score)

{'name': 'The Three-Body Problem', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'author': 'Liu Cixin', 'year': 2008} score: 0.4590292734129918
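The effect of the `year` filter can be reproduced on the raw metadata in plain Python. This sketch copies only the `name` and `year` fields from the dataset above and uses a hypothetical `matches_range` helper that mimics the `gte` condition; it does not call Qdrant.

```python
books = [
    {"name": "The Time Machine", "year": 1895},
    {"name": "Ender's Game", "year": 1985},
    {"name": "Brave New World", "year": 1932},
    {"name": "The Hitchhiker's Guide to the Galaxy", "year": 1979},
    {"name": "Dune", "year": 1965},
    {"name": "Foundation", "year": 1951},
    {"name": "Snow Crash", "year": 1992},
    {"name": "Neuromancer", "year": 1984},
    {"name": "The War of the Worlds", "year": 1898},
    {"name": "The Hunger Games", "year": 2008},
    {"name": "The Andromeda Strain", "year": 1969},
    {"name": "The Left Hand of Darkness", "year": 1969},
    {"name": "The Three-Body Problem", "year": 2008},
]


def matches_range(payload, key, gte=None, lte=None):
    """Return True if payload[key] exists and lies within the given bounds."""
    value = payload.get(key)
    if value is None:
        return False
    if gte is not None and value < gte:
        return False
    if lte is not None and value > lte:
        return False
    return True


candidates = [b["name"] for b in books if matches_range(b, "year", gte=2000)]
print(candidates)  # ['The Hunger Games', 'The Three-Body Problem']
```

Only two books pass the filter, and the semantic score then decides which of them is the better match for the query.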


End of Chapter 1

# **Build a Neural Search Service with Sentence Transformers and Qdrant**

# **Pipeline**

![Neural search workflow](https://qdrant.tech/docs/workflow-neural-search.png)

## **1. Download the dataset**

In [10]:
# import wget

# url = "https://storage.googleapis.com/generall-shared-data/startups_demo.json"
# wget.download(url) 

## **2+3. Install & Import dependencies**

In [11]:
from sentence_transformers import SentenceTransformer
import numpy as np
import json
import pandas as pd
from tqdm.notebook import tqdm

## **4. Download + create a pre-trained sentence encoder**

Pre-trained model used: `all-MiniLM-L6-v2`.

In [12]:
model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    device="cpu",
)

## **5. Read raw data**

In [13]:
path = "startups_demo.json"  # relative path that works on Windows, macOS, and Linux
df = pd.read_json(path, lines=True)
df[:5]

| | name | images | alt | description | link | city |
|---|---|---|---|---|---|---|
| 0 | SaferCodes | https://safer.codes/img/brand/logo-icon.png | SaferCodes Logo QR codes generator system form... | QR codes systems for COVID-19.\nSimple tools f... | https://safer.codes | Chicago |
| 1 | Human Practice | https://d1qb2nb5cznatu.cloudfront.net/startups... | Human Practice - health care information tech... | Point-of-care word of mouth\nPreferral is a mo... | http://humanpractice.com | Chicago |
| 2 | StyleSeek | https://d1qb2nb5cznatu.cloudfront.net/startups... | StyleSeek - e-commerce fashion mass customiza... | Personalized e-commerce for lifestyle products... | http://styleseek.com | Chicago |
| 3 | Scout | https://d1qb2nb5cznatu.cloudfront.net/startups... | Scout - security consumer electronics interne... | Hassle-free Home Security\nScout is a self-ins... | http://www.scoutalarm.com | Chicago |
| 4 | Invitation codes | https://invitation.codes/img/inv-brand-fb3.png | Invitation App - Share referral codes community | The referral community\nInvitation App is a so... | https://invitation.codes | Chicago |
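`lines=True` tells pandas that the file is in JSON Lines format, with one JSON object per line. A standard-library sketch of the same parsing idea follows; the two records are made up for illustration and are not rows from `startups_demo.json`.

```python
import io
import json

# Two invented records in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"name": "Example One", "city": "Chicago"}\n'
    '{"name": "Example Two", "city": "Boston"}\n'
)

# Parse each non-empty line as its own JSON document.
records = [json.loads(line) for line in sample if line.strip()]

print(len(records))        # 2
print(records[0]["name"])  # Example One
```

Each parsed object becomes one row of the DataFrame, and its keys become the columns.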


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40474 entries, 0 to 40473
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         40474 non-null  object
 1   images       40474 non-null  object
 2   alt          40474 non-null  object
 3   description  40474 non-null  object
 4   link         40474 non-null  object
 5   city         40474 non-null  object
dtypes: object(6)
memory usage: 1.9+ MB


## **6. Encode `alt` and `description` to create embedding vectors**

In [15]:
inp = []
for row in df.itertuples():
    text = row.alt + ". " + row.description  # combine both text fields into one input string
    inp.append(text)

vectors = model.encode(
    inp,
    show_progress_bar=True,
)


In [None]:
vectors.shape

(40474, 384)
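The shape can be sanity-checked with some arithmetic. The sketch below assumes that `encode` processes 32 texts per batch (believed to be the sentence-transformers default, so check your installed version) and that each embedding value is a 4-byte `float32`.

```python
import math

n_rows, dim = 40474, 384  # taken from vectors.shape
batch_size = 32           # assumed default batch size
bytes_per_value = 4       # assumed float32 storage

# Number of batches the encoder has to process for the whole dataset.
n_batches = math.ceil(n_rows / batch_size)

# Approximate memory needed to hold all embeddings, in MiB.
size_mib = n_rows * dim * bytes_per_value / 1024 ** 2

print(n_batches)           # 1265
print(round(size_mib, 1))  # 59.3
```

Under these assumptions, the full matrix of embeddings fits comfortably in memory on a typical laptop, while the encoding step itself is the slow part on CPU.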