<a href="https://colab.research.google.com/github/wajihh/genai_projects/blob/main/query_qdrant_steps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step-by-Step Python Code for Querying Qdrant
When you run this code in Colab:

1. It will load all the vectors in the collection.
2. It will ask you to enter a search query (e.g., "What is AI policy?").
3. The query will be converted into a vector using the sentence-transformers model.
4. The vector will be searched against the AI_policy_new collection.
5. The top 5 results (vector IDs, scores, and payloads) will be displayed.

This allows for interactive querying based on natural language inputs, which is useful for finding information across your 14-page document stored in Qdrant.

Install Qdrant client: The Qdrant client is needed to interact with the Qdrant service. You can install it using pip.

Load secrets from JSON file: We will store the QDRANT_URL and QDRANT_API_KEY in a qdrant_secrets.json file and load it securely in Colab.

Connect to Qdrant: Using the secrets loaded, we will establish a connection with Qdrant.

Perform a Query: Once connected, we can query the AI_policy_new collection for relevant vectors.

In [1]:
from google.colab import drive
import os
import json

# Mount Google Drive
drive.mount('/content/drive')

# Define path to the qdrant_secrets.json file in your Google Drive
secrets_path = '/content/drive/MyDrive/2024_Advance_AI/qdrant_secrets.json'

# Load the secrets from the file
with open(secrets_path, 'r') as f:
    secrets = json.load(f)

# Extract the Qdrant URL and API Key
QDRANT_URL = secrets['QDRANT_URL']
QDRANT_API_KEY = secrets['QDRANT_API_KEY']
print("Qdrant credentials loaded successfully.")


Mounted at /content/drive
Qdrant credentials loaded successfully.
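
A missing or misspelled key in qdrant_secrets.json would only surface as a KeyError in the cell above. The sketch below validates the file up front. It assumes the file is a flat JSON object holding the two keys this notebook reads; the helper name `load_qdrant_secrets` is illustrative and not part of the original notebook.

```python
import json

REQUIRED_KEYS = ("QDRANT_URL", "QDRANT_API_KEY")  # the two keys this notebook reads


def load_qdrant_secrets(path):
    """Load the secrets file and fail early if a required key is absent or empty.

    Hypothetical helper: assumes a flat JSON object such as
    {"QDRANT_URL": "...", "QDRANT_API_KEY": "..."}.
    """
    with open(path, "r", encoding="utf-8") as f:
        secrets = json.load(f)
    missing = [key for key in REQUIRED_KEYS if not secrets.get(key)]
    if missing:
        raise ValueError(f"Missing keys in {path}: {', '.join(missing)}")
    return secrets["QDRANT_URL"], secrets["QDRANT_API_KEY"]
```

In the notebook this could be called as `QDRANT_URL, QDRANT_API_KEY = load_qdrant_secrets(secrets_path)`.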


# Step 2: Install Qdrant Client
Install the Qdrant client if you haven't already. This is required to communicate with the Qdrant server.

In [2]:
!pip install qdrant-client


Collecting qdrant-client
  Downloading qdrant_client-1.11.3-py3-none-any.whl.metadata (10 kB)
Collecting grpcio-tools>=1.41.0 (from qdrant-client)
  Downloading grpcio_tools-1.66.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.3 kB)
Collecting httpx>=0.20.0 (from httpx[http2]>=0.20.0->qdrant-client)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting portalocker<3.0.0,>=2.7.0 (from qdrant-client)
  Downloading portalocker-2.10.1-py3-none-any.whl.metadata (8.5 kB)
Collecting protobuf<6.0dev,>=5.26.1 (from grpcio-tools>=1.41.0->qdrant-client)
  Downloading protobuf-5.28.2-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Collecting grpcio>=1.41.0 (from qdrant-client)
  Downloading grpcio-1.66.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)
Collecting httpcore==1.* (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client)
  Downloading httpcore-1.0.6-py3-none-any.whl.metadata (21 kB)
Collecting h11<0.15,>=0

# Step 3: Connect to Qdrant
Use the QDRANT_URL and QDRANT_API_KEY to establish a connection to your Qdrant server.

In [3]:
from qdrant_client import QdrantClient

# Initialize Qdrant Client using URL and API Key
client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
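
Creating the client does not by itself confirm that the expected collection is reachable. As a sketch, the check below lists the collections through `get_collections()` and looks for the name. The helper name `has_collection` is illustrative, and the stand-in client exists only so the snippet runs without a server.

```python
from types import SimpleNamespace


def has_collection(client, collection_name):
    """Return True if the server reports a collection with this name."""
    response = client.get_collections()
    return any(c.name == collection_name for c in response.collections)


# Stand-in for a real QdrantClient so the example runs offline (hypothetical).
fake_client = SimpleNamespace(
    get_collections=lambda: SimpleNamespace(
        collections=[SimpleNamespace(name="AI_policy_new")]
    )
)

print(has_collection(fake_client, "AI_policy_new"))  # True
print(has_collection(fake_client, "missing"))        # False
```

With the real client from the cell above, the call would be `has_collection(client, "AI_policy_new")`.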


# Query with User Input and Open-Source Model

## Step 1: Install Required Libraries
In addition to the previous steps, you’ll need to install sentence-transformers to convert user queries into vectors.

In [4]:
# Install the necessary libraries
!pip install qdrant-client sentence-transformers


Collecting sentence-transformers
  Downloading sentence_transformers-3.1.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.1.1-py3-none-any.whl (245 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.1.1


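## Step 2: Load Vectors from the Collection
We'll first load all the points stored in the AI_policy_new collection, which is useful for checking what the collection contains. The search in Step 4 runs on the Qdrant server, so it does not depend on this local copy.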

In [5]:
# Load all vectors from the collection
def load_all_vectors(client, collection_name):
    vectors = []
    offset = None  # None starts from the first page; later values are point IDs returned by scroll
    limit = 100  # Number of points to fetch per call

    while True:
        # Retrieve points from the collection
        scroll_result, next_page_offset = client.scroll(
            collection_name=collection_name,
            limit=limit,  # Number of points to fetch per call
            offset=offset,
            with_payload=True,
            with_vectors=True  # scroll omits vectors by default, leaving point.vector as None
        )

        # Append retrieved points to the vectors list
        for point in scroll_result:
            vectors.append({
                'id': point.id,
                'vector': point.vector,
                'payload': point.payload  # Assuming each vector has associated metadata
            })

        # If no more points are returned, break the loop
        if next_page_offset is None or len(scroll_result) == 0:
            break

        # Update offset for the next iteration
        offset = next_page_offset

    return vectors

# Load vectors from the AI_policy_new collection
vectors = load_all_vectors(client, "AI_policy_new")

print(f"Loaded {len(vectors)} vectors from the collection.")


Loaded 339 vectors from the collection.
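
The loop above relies on the contract of `scroll`: each call returns a page of points plus the offset of the next page, and that offset is `None` once the last page has been served. A minimal sketch of the same pattern as a generator follows. The names `iter_points` and `FakeScrollClient` are illustrative; the stand-in uses list indices as offsets, whereas a real Qdrant server returns point IDs.

```python
def iter_points(client, collection_name, batch_size=100):
    """Yield every point in a collection, one scroll page at a time."""
    offset = None  # None asks for the first page
    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            limit=batch_size,
            offset=offset,
            with_payload=True,
            with_vectors=True,
        )
        yield from points
        if offset is None:  # no further page to fetch
            break


class FakeScrollClient:
    """Offline stand-in that mimics the (points, next_offset) return shape."""

    def __init__(self, items):
        self.items = items

    def scroll(self, collection_name, limit, offset=None, **kwargs):
        start = offset or 0
        page = self.items[start:start + limit]
        has_more = start + limit < len(self.items)
        return page, (start + limit if has_more else None)


fake = FakeScrollClient(list(range(250)))
print(len(list(iter_points(fake, "AI_policy_new"))))  # 250
```

A generator avoids holding every page in memory at once, which matters little for a few hundred points but scales better for larger collections.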


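## Step 3: Create Query Vector from User Input
We'll use a sentence-transformers model to transform user input into a query vector that can be searched against the collection. The code below loads BAAI/bge-small-en-v1.5 and keeps paraphrase-MiniLM-L6-v2 as a commented-out alternative. Whichever model you choose should be the same one that was used to embed the documents when the collection was created; vectors from different models are not comparable.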

In [6]:
from sentence_transformers import SentenceTransformer

# Load the sentence transformer model
# model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
model = SentenceTransformer('BAAI/bge-small-en-v1.5') # Example of an alternative model

# Function to transform input query to vector
def get_query_vector(user_input):
    return model.encode(user_input).tolist()


  from tqdm.autonotebook import tqdm, trange
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
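
A query vector can only be compared with stored vectors of the same length. As a sketch, the check below compares the two sizes before a search is attempted. The helper name `check_dimension` and the example size of 384 are assumptions for illustration; the real size should be read from the model or the collection.

```python
def check_dimension(query_vector, expected_size):
    """Raise if the query vector length differs from the expected vector size."""
    actual = len(query_vector)
    if actual != expected_size:
        raise ValueError(
            f"Query vector has {actual} dimensions, expected {expected_size}"
        )
    return query_vector


EXPECTED_SIZE = 384  # assumed value for illustration only
check_dimension([0.0] * 384, EXPECTED_SIZE)  # passes silently
```

In the notebook, the model side of the comparison could be obtained from `model.get_sentence_embedding_dimension()`.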





## Step 4: Query Using User Input
Now, we will allow the user to input a query in a text box, convert that input to a vector, and use it to search the AI_policy_new collection.

In [8]:
# Function to search the collection using user input
def search_vectors(client, collection_name, query_vector):
    search_results = client.search(
        collection_name=collection_name,  # The collection name
        query_vector=query_vector,        # Query vector obtained from user input
        limit=5                           # Limit the number of results returned
    )

    return search_results

# Take user input
user_input = input("Enter your search query: ")

# Convert the user input to a vector using the open-source model
query_vector = get_query_vector(user_input)

# Perform the search query in the AI_policy_new collection
search_results = search_vectors(client, "AI_policy_new", query_vector)

# Display the search results
for result in search_results:
    print(f"ID: {result.id}, Score: {result.score}, Payload: {result.payload}")


Enter your search query: What is AI training plan?
ID: 22bedeac-014a-48b6-9338-b3b0582cb54c, Score: 0.727238, Payload: {'page_content': 'Learning/Deep Learning -based technologies trained on the datasets. This makes it a priority task for the IT \nboards to manage the standardization and accessibility of data. The following interventions are suggested:  \nI. IT boards should design  and provide roadmaps for the transformations in various sectors and \nindustries based on the ir awareness and readiness for AI adoption . These roadmaps should start \ncirculating in the respective sectors by 2023 so that the structural and competency transformation \ntoward s effective AI adoption in various sectors, especially public institutions , can be expedited.  \nII. IT boards should become facilitators in designing and providing specialized training courses and \ncertifications to prepare trained and skilled human capital with skills tailored to  sectoral \nrequirements . These training programs m
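
The score printed for each result is a similarity between the query vector and a stored vector. Assuming the collection was created with cosine distance (the notebook does not show how it was created), the same ranking can be sketched locally over the list returned by `load_all_vectors`, provided each entry actually carries its vector (scroll returns vectors only when `with_vectors=True` is passed). The function names and toy data below are illustrative.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def top_k_local(query_vector, vectors, k=5):
    """Rank stored vectors by cosine similarity to the query, best first.

    `vectors` uses the dict layout built by load_all_vectors:
    {'id': ..., 'vector': [...], 'payload': {...}}.
    """
    scored = ((cosine_similarity(query_vector, v["vector"]), v["id"]) for v in vectors)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy data (made up) in place of the real collection.
toy_vectors = [
    {"id": "a", "vector": [1.0, 0.0], "payload": {}},
    {"id": "b", "vector": [0.0, 1.0], "payload": {}},
    {"id": "c", "vector": [1.0, 1.0], "payload": {}},
]
for score, point_id in top_k_local([1.0, 0.0], toy_vectors, k=2):
    print(point_id, round(score, 3))
# a 1.0
# c 0.707
```

This brute-force scan is only a cross-check for small collections; the server-side search is the intended path for real queries.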

In [9]:
# Check the payload of a few vectors
for idx, result in enumerate(search_results[:2], 1):  # Display the first 2 results
    print(f"Result {idx}: Payload - {result.payload}")


Result 1: Payload - {'page_content': 'Learning/Deep Learning -based technologies trained on the datasets. This makes it a priority task for the IT \nboards to manage the standardization and accessibility of data. The following interventions are suggested:  \nI. IT boards should design  and provide roadmaps for the transformations in various sectors and \nindustries based on the ir awareness and readiness for AI adoption . These roadmaps should start \ncirculating in the respective sectors by 2023 so that the structural and competency transformation \ntoward s effective AI adoption in various sectors, especially public institutions , can be expedited.  \nII. IT boards should become facilitators in designing and providing specialized training courses and \ncertifications to prepare trained and skilled human capital with skills tailored to  sectoral \nrequirements . These training programs may be initiated  as early as 2023  to accelerate compliance \nwith AI adoption needs and requiremen

Steps:
Generate Markdown Content: After getting the search results, we’ll format them into a clean Markdown structure.
Convert Markdown to PDF: We'll use markdown2 to convert the content to HTML and then pdfkit to convert the HTML to a PDF.

In [10]:
# Install required libraries
!pip install markdown2 pdfkit
!apt-get install -y wkhtmltopdf  # For PDF conversion; -y answers the confirmation prompt


Collecting markdown2
  Downloading markdown2-2.5.0-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting pdfkit
  Downloading pdfkit-1.0.0-py3-none-any.whl.metadata (9.3 kB)
Downloading markdown2-2.5.0-py2.py3-none-any.whl (47 kB)
Downloading pdfkit-1.0.0-py3-none-any.whl (12 kB)
Installing collected packages: pdfkit, markdown2
Successfully installed markdown2-2.5.0 pdfkit-1.0.0
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  avahi-daemon bind9-host bind9-libs geoclue-2.0 glib-networking glib-networking-common
  glib-networking-services gsettings-desktop-schemas iio-sensor-proxy libavahi-core7 libavahi-glib1
  libdaemon0 libevdev2 libfontenc1 libgudev-1.0-0 libhyphen0 libinput-bin libinput10
  libjson-glib-1.0-0 libjson-glib-1.0-common liblmdb0 libmaxminddb0 libmbim

In [18]:

# Function to format the results into Markdown and include the text content
import markdown2
import pdfkit # import the pdfkit module here

def generate_markdown_from_results(search_results):
    markdown_content = "# Query Results\n\n"
    markdown_content += "Here are the top results for your query:\n\n"

    # Loop through results and format them in Markdown, showing the text from payload
    for idx, result in enumerate(search_results, 1):
        # Assuming the text is stored in the payload under the key 'page_content'
        text_content = result.payload.get('page_content', 'No text available')  # Adjust if the key is named differently
        markdown_content += f"### Result {idx}\n"
        markdown_content += f"- **Similarity Score**: {result.score:.2f}\n"
        markdown_content += f"- **Text**: {text_content}\n\n"

    return markdown_content

# Function to convert Markdown to PDF
def save_markdown_as_pdf(markdown_content, output_pdf_path):
    # Convert markdown to HTML
    html_content = markdown2.markdown(markdown_content)

    # Save the HTML as a PDF
    pdfkit.from_string(html_content, output_pdf_path)

## Step 2: Display Query Results in Colab Output
After querying the collection, we can print the formatted Markdown directly in the Colab notebook and save the same content as a PDF in Google Drive.

In [19]:
# Example: Convert search results to Markdown and save as PDF
query_results_markdown = generate_markdown_from_results(search_results)

# Print the formatted results in Colab
print(query_results_markdown)

# Save the query results as a PDF file
output_pdf_path = "/content/drive/MyDrive/query_results.pdf"  # Save in Google Drive
save_markdown_as_pdf(query_results_markdown, output_pdf_path)

print(f"PDF generated and saved to: {output_pdf_path}")


# Query Results

Here are the top results for your query:

### Result 1
- **Similarity Score**: 0.73
- **Text**: Learning/Deep Learning -based technologies trained on the datasets. This makes it a priority task for the IT 
boards to manage the standardization and accessibility of data. The following interventions are suggested:  
I. IT boards should design  and provide roadmaps for the transformations in various sectors and 
industries based on the ir awareness and readiness for AI adoption . These roadmaps should start 
circulating in the respective sectors by 2023 so that the structural and competency transformation 
toward s effective AI adoption in various sectors, especially public institutions , can be expedited.  
II. IT boards should become facilitators in designing and providing specialized training courses and 
certifications to prepare trained and skilled human capital with skills tailored to  sectoral 
requirements . These training programs may be initiated  as early as 202
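
The extracted text carries hard line breaks and doubled spaces from the source PDF, and the line breaks interrupt the Markdown bullet that holds the text. One possible cleanup, sketched below with a hypothetical helper `normalize_text`, collapses every run of whitespace into a single space before the text is placed in the bullet. It does not repair words that the extraction split apart (such as `the ir`) or spaces before punctuation.

```python
import re


def normalize_text(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


sample = "priority task for the IT \nboards to manage the standardization  and accessibility"
print(normalize_text(sample))
# priority task for the IT boards to manage the standardization and accessibility
```

In `generate_markdown_from_results`, the helper could be applied to `text_content` just before it is appended to `markdown_content`.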

Important Notes:
Ensure that the text you want to display is stored in the payload of each result. If it's named something other than page_content, update the key in this

In [20]:
# prompt: generate code to save output as markdown file and save in google drive

import markdown2

# Function to format the results into Markdown and include the text content
def generate_markdown_from_results(search_results):
    markdown_content = "# Query Results\n\n"
    markdown_content += "Here are the top results for your query:\n\n"

    # Loop through results and format them in Markdown, showing the text from payload
    for idx, result in enumerate(search_results, 1):
        # Assuming the text is stored in the 'payload' under a key like 'page_content'
        text_content = result.payload.get('page_content', 'No text available')  # Adjust if 'page_content' is named differently
        markdown_content += f"### Result {idx}\n"
        markdown_content += f"- **Similarity Score**: {result.score:.2f}\n"
        markdown_content += f"- **Text**: {text_content}\n\n"

    return markdown_content

# Example: Convert search results to Markdown and save as Markdown file in Google Drive
query_results_markdown = generate_markdown_from_results(search_results)

# Define the path to save the Markdown file in Google Drive
output_markdown_path = "/content/drive/MyDrive/query_results.md"  # Save in Google Drive

# Save the Markdown content to a file
with open(output_markdown_path, 'w') as f:
    f.write(query_results_markdown)

print(f"Markdown file generated and saved to: {output_markdown_path}")


Markdown file generated and saved to: /content/drive/MyDrive/query_results.md


In [21]:
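# Take the text of the top-ranked result explicitly rather than relying on a leftover loop variable
text_content = search_results[0].payload.get('page_content', 'No text available')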


In [16]:
print(text_content)

Learning/Deep Learning -based technologies trained on the datasets. This makes it a priority task for the IT 
boards to manage the standardization and accessibility of data. The following interventions are suggested:  
I. IT boards should design  and provide roadmaps for the transformations in various sectors and 
industries based on the ir awareness and readiness for AI adoption . These roadmaps should start 
circulating in the respective sectors by 2023 so that the structural and competency transformation 
toward s effective AI adoption in various sectors, especially public institutions , can be expedited.  
II. IT boards should become facilitators in designing and providing specialized training courses and 
certifications to prepare trained and skilled human capital with skills tailored to  sectoral 
requirements . These training programs may be initiated  as early as 2023  to accelerate compliance 
with AI adoption needs and requirements and prepare a skilled workforce to bear the 