<a href="https://colab.research.google.com/github/sam4410/RAG-Technique-based-models/blob/main/Building_a_semantic_search_engine_and_generative_agent_using_index_based_search_on_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Coverage:
* Building a semantic search engine with the LlamaIndex framework and its indexing methods
* Populating Deep Lake vector stores
* Integration of LlamaIndex, Deep Lake, and OpenAI
* Score ranking and cosine similarity metrics
* Metadata enhancement for traceability
* Query setup and generation configuration
* Introducing automated document ranking
* Vector, tree, list, and keyword indexing types

We will build a semantic, index-based search engine and generative AI agent using Deep Lake vector stores, LlamaIndex, and OpenAI. The goal is to create an index-based RAG agent that answers questions about drone technology. The program will demonstrate this with a question about how drones use computer vision techniques to identify vehicles and other objects.

This project is organized into three main pipelines:
* Pipeline 1. Collecting and preparing the documents for indexing
* Pipeline 2. Creating and populating a Deep Lake vector store
* Pipeline 3. Index-based RAG for query processing and generation, with time and score performance measured for each index type

In [1]:
#!pip install llama-index-vector-stores-deeplake==0.1.6 deeplake==3.9.12 llama-index==0.10.64 openai

In [48]:
import os
import warnings
import requests
from google.colab import userdata
from bs4 import BeautifulSoup
from huggingface_hub import login
import re
import openai
from google.colab import drive
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
warnings.filterwarnings('ignore')

# Connect this Colab to my Google Drive
drive.mount("/content/drive")

# Retrieve and set the OpenAI API key
with open("drive/MyDrive/Colab Notebooks/key_files/openai_api_key.txt", "r") as f:
  API_KEY = f.readline().strip()
os.environ['OPENAI_API_KEY'] = API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")

# Retrieve and set the Activeloop API token
with open("drive/MyDrive/Colab Notebooks/key_files/activeloop_token.txt", "r") as f:
  ACTIVELOOP_TOKEN = f.readline().strip()
os.environ['ACTIVELOOP_TOKEN'] = ACTIVELOOP_TOKEN

# Sign in to the Hugging Face Hub
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=False)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Pipeline 1: Collecting and preparing the documents

In [12]:
# Collect and prepare the drone-related documents, keeping the metadata needed to trace each document back to its source.
# The goal is to trace a response's content back to the exact chunk of data that was retrieved.
# The documents are stored in a directory, which is created in a later cell:
# output_dir = "drive/MyDrive/Colab Notebooks/data/"

In [36]:
# using a heterogeneous corpus for the drone technology data that we will process using BeautifulSoup
# list of sites related to drones, computer vision, and related technologies
urls = [
"https://en.wikipedia.org/wiki/UAV-IQ",
"https://en.wikipedia.org/wiki/Unmanned_combat_aerial_vehicle",
"https://en.wikipedia.org/wiki/Drone-Enhanced_Emergency_Medical_Services",
"https://en.wikipedia.org/wiki/Unmanned_aerial_vehicle",
"https://en.wikipedia.org/wiki/Object_detection",
"https://en.wikipedia.org/wiki/Computer_vision"
]

In [6]:
def clean_text(content):
  # Remove references and unwanted characters
  content = re.sub(r'\[\d+\]', '', content) # Remove references
  content = re.sub(r'[^\w\s\.]', '', content) # Remove punctuation (except periods)
  return content

def fetch_and_clean(url):
  try:
    response = requests.get(url, timeout=30)   # Avoid hanging forever on a slow server
    response.raise_for_status()     # Raise exception for bad responses (e.g., 404)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Prioritize the "mw-parser-output" class but fall back to the element with id "content" if not found
    content = soup.find('div', {'class': 'mw-parser-output'}) or soup.find('div', {'id': 'content'})
    if content is None:
      return None

    # Remove specific sections, including nested ones
    for section_title in ['References', 'Bibliography', 'External links', 'See also', 'Notes']:
      section = content.find('span', id=section_title)
      while section:
        for sib in section.parent.find_next_siblings():
          sib.decompose()
        section.parent.decompose()
        section = content.find('span', id=section_title)

    # Extract and clean text
    text = content.get_text(separator=' ', strip=True)
    text = clean_text(text)
    return text
  except requests.exceptions.RequestException as e:
    print(f"Error fetching content from {url}: {e}")
    return None # Return None on error
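
To see what the two regular expressions in `clean_text` do, the following minimal sketch repeats the function and applies it to a made-up sentence (the sample string is an assumption, not taken from the corpus). Hyphens and slashes are stripped along with the other punctuation, which is why words such as "fixedwing" and "kmh" appear in the stored text later in the notebook.

```python
import re

def clean_text(content):
    # Same two substitutions as in the notebook's clean_text
    content = re.sub(r'\[\d+\]', '', content)      # drop numeric references such as [42]
    content = re.sub(r'[^\w\s\.]', '', content)    # drop punctuation except periods
    return content

sample = "A fixed-wing UAV[42], top speed 850 km/h."  # hypothetical input
cleaned = clean_text(sample)
print(cleaned)
```

If hyphenated terms matter for retrieval, one option would be to add `-` to the characters kept by the second pattern; that is a design choice, not something the notebook does.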

In [38]:
# function to save each piece of text with the name of its data source, by creating a keyword based on its URL
# directory to store the output files
output_dir = "drive/MyDrive/Colab Notebooks/data/"
os.makedirs(output_dir, exist_ok=True)

# Process each URL and write its content to a separate file
for url in urls:
  article_name = url.split('/')[-1].replace('.html',"").replace(".pdf","")
  filename = os.path.join(output_dir, article_name + '.txt')
  clean_article_text = fetch_and_clean(url)
  if clean_article_text is None:
    # Skip pages that could not be fetched or parsed instead of failing on write(None)
    continue
  with open(filename, 'w', encoding='utf-8') as file:
    file.write(clean_article_text)

print(f"Content(ones that were possible) written to files in the '{output_dir}'directory.")

Content(ones that were possible) written to files in the 'drive/MyDrive/Colab Notebooks/data/'directory.


In [39]:
# load data stored in directory using LlamaIndex SimpleDirectoryReader class which is designed for working with unstructured data
documents = SimpleDirectoryReader(output_dir).load_data()
print(documents[0])
#documents[0]    # will give entire metadata about document

Doc ID: 90a64373-c31d-4a3c-9576-ba85684ad2c7
Text: Computerized information extraction from images Part of a series
on Artificial intelligence AI Major goals Artificial general
intelligence Intelligent agent Recursive selfimprovement Planning
Computer vision General game playing Knowledge reasoning Natural
language processing Robotics AI safety Approaches Machine learning
Symbolic Deep learning ...


### Pipeline 2: Creating and populating a Deep Lake vector store

Here, we will create a Deep Lake vector store and populate it with the data in our documents. We will implement a standard tensor configuration with:
* text (str): The content of one of the text files loaded into `documents`. Chunking is handled automatically, breaking the text into meaningful chunks.
* metadata (json): The metadata contains the source filename of each chunk of text, for full transparency and control. We will see how to access this information in code.
* embedding (float32): The embedding is computed automatically by an OpenAI embedding model, called directly through the LlamaIndex-Deep Lake-OpenAI integration.
* id (str, auto-populated): A unique ID is assigned automatically to each chunk. The vector store also contains an index, a number from 0 to n, but it cannot be used semantically, since it changes each time we modify the dataset. The unique ID field, by contrast, remains unchanged until we decide to optimize it with index-based search strategies (covered in the next pipeline).

In [49]:
from llama_index.core import StorageContext

dataset_path = "hub://sam4410/drone_v2"

# Create a vector store, populate it, and create an index over the documents.
# overwrite=True creates the vector store and overwrites any existing one.
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create an index over the documents
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

Your Deep Lake dataset has been successfully created!




Uploading data to deeplake dataset.


100%|██████████| 75/75 [00:00<00:00, 76.03it/s] 
Dataset(path='hub://sam4410/drone_v2', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (75, 1)      str     None   
 metadata     json      (75, 1)      str     None   
 embedding  embedding  (75, 1536)  float32   None   
    id        text      (75, 1)      str     None   


 

In [50]:
# load dataset into memory
import deeplake
ds = deeplake.load(dataset_path)

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/sam4410/drone_v2



hub://sam4410/drone_v2 loaded successfully.





In [51]:
# We can also decide to add code to display the dataset. We begin by loading the data in a pandas Dataframe
import json
import pandas as pd
import numpy as np

# Create a dictionary to hold the data
data = {}

# Iterate through the tensors in the dataset
for tensor_name in ds.tensors:
  tensor_data = ds[tensor_name].numpy()
  # Check if the tensor is multi-dimensional
  if tensor_data.ndim > 1:
    # Flatten multi-dimensional tensors
    data[tensor_name] = [np.array(e).flatten().tolist() for e in tensor_data]
  else:
    # Convert 1D tensors directly to lists and decode text
    if tensor_name == "text":
      data[tensor_name] = [t.tobytes().decode('utf-8') if t else "" for t in tensor_data]
    else:
      data[tensor_name] = tensor_data.tolist()

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

In [54]:
# create a function to display a record:
def display_record(record_number):
  record = df.iloc[record_number]
  display_data = {
    "ID": record["id"] if "id" in record else "N/A",
    "Metadata": record["metadata"] if "metadata" in record else "N/A",
    "Text": record["text"] if "text" in record else "N/A",
    "Embedding": record["embedding"] if "embedding" in record else "N/A"
  }
  return display_data

In [55]:
# select a record and display each field:
# Function call to display a record
rec = 0 # Replace with the desired record number
display_record(rec)

{'ID': ['6dfc7fe9-236b-4f9a-a213-48164ffcf2ca'],
 'Metadata': [{'file_path': '/content/drive/MyDrive/Colab Notebooks/data/Computer_vision.txt',
   'file_name': 'Computer_vision.txt',
   'file_type': 'text/plain',
   'file_size': 56067,
   'creation_date': '2025-03-17',
   'last_modified_date': '2025-03-17',
   '_node_content': '{"id_": "6dfc7fe9-236b-4f9a-a213-48164ffcf2ca", "embedding": null, "metadata": {"file_path": "/content/drive/MyDrive/Colab Notebooks/data/Computer_vision.txt", "file_name": "Computer_vision.txt", "file_type": "text/plain", "file_size": 56067, "creation_date": "2025-03-17", "last_modified_date": "2025-03-17"}, "excluded_embed_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "excluded_llm_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "relationships": {"1": {"node_id": "90a64373-c31d-4a3c-9576-ba85684ad2c7", "node_type": "4", 

In [56]:
# metadata field contains the information we need to trace the content back to the original file and file path
# also contains the information of the node created from the record’s data, which can then be used for the indexing engine
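
As a minimal sketch of that tracing step, the snippet below pulls the source file name and the node ID out of one metadata record. The `metadata` dictionary is a shortened, hand-written stand-in shaped like the record displayed above, and `trace_source` is a hypothetical helper, not part of LlamaIndex or Deep Lake.

```python
import json

# Hypothetical, shortened metadata record shaped like the one displayed above
metadata = {
    "file_path": "/content/drive/MyDrive/Colab Notebooks/data/Computer_vision.txt",
    "file_name": "Computer_vision.txt",
    "_node_content": json.dumps({
        "id_": "6dfc7fe9-236b-4f9a-a213-48164ffcf2ca",
        "metadata": {"file_name": "Computer_vision.txt"},
    }),
}

def trace_source(meta):
    """Return (source file name, node ID) for one metadata record."""
    node = json.loads(meta.get("_node_content", "{}"))  # stored as a JSON string
    return meta.get("file_name", "N/A"), node.get("id_", "N/A")

file_name, node_id = trace_source(metadata)
print(f"{node_id} comes from {file_name}")
```

In the notebook, the same helper could be applied to `df.iloc[rec]["metadata"][0]`, assuming the record keeps the list-wrapped layout shown in the output above.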

### Pipeline 3: Index-based RAG

We now implement an index-based RAG pipeline with LlamaIndex, using the data we have prepared and processed with Deep Lake. It retrieves relevant information from the heterogeneous (noise-containing) drone-related document collection and synthesizes the response with OpenAI's LLMs. We will implement four index engines:
* Vector Store Index Engine: Creates a vector store index from the documents, enabling efficient similarity-based searches.
* Tree Index: Builds a hierarchical tree index from the documents, offering an alternative retrieval structure.
* List Index: Constructs a straightforward list index from the documents.
* Keyword Table Index: Creates an index based on keywords extracted from the documents.

In [57]:
# User input and query parameters
# The user input is the reference question for the four index engines. We will evaluate each response
# based on the index engine's retrievals and measure the outputs using time and score ratios.

user_input="How do drones identify vehicles?"

# The four query engines are called with the same parameters:
# similarity_top_k
k=3
# temperature
temp=0.1
# num_output
mt=1024


In [58]:
# Cosine similarity metric
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_cosine_similarity_with_embeddings(text1, text2):
  embeddings1 = model.encode(text1)
  embeddings2 = model.encode(text2)
  similarity = cosine_similarity([embeddings1], [embeddings2])
  return similarity[0][0]
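
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on small made-up vectors, using only the standard library; it illustrates the formula and is not a replacement for the `sklearn` call above.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # parallel vectors, similarity close to 1
orth = cosine([1.0, 0.0], [0.0, 1.0])             # orthogonal vectors, similarity 0
print(same, orth)
```

Because the measure depends only on direction, scaling a vector does not change the score, which is why sentence embeddings of different magnitudes can still be compared.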

In [59]:
# Vector store index query engine
# first create the vector store index
from llama_index.core import VectorStoreIndex
vector_store_index = VectorStoreIndex.from_documents(documents)

# display the vector store index
print(type(vector_store_index))

<class 'llama_index.core.indices.vector_store.base.VectorStoreIndex'>


In [60]:
# We now need a query engine to retrieve and synthesize the document(s) retrieved with an LLM
vector_query_engine = vector_store_index.as_query_engine(similarity_top_k=k,
                                                         temperature=temp, num_output=mt)

##### Query response and source

In [68]:
# define a function that will manage the query and return information on the content of the response
import textwrap

def index_query(input_query):
  response = vector_query_engine.query(input_query)
  # Print a formatted view of the response
  print(textwrap.fill(str(response), 100))

  node_data = []
  for node_with_score in response.source_nodes:
    node = node_with_score.node
    node_info = {
        "Node ID": node.id_,
        "Score": node_with_score.score,
        "Text": node.text
    }

    node_data.append(node_info)

  df = pd.DataFrame(node_data)
  return df, response

In [80]:
# Now run the query and time it
import time

#start the timer
start_time = time.time()
df, response = index_query(user_input)
# Stop the timer
end_time = time.time()

# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")

print(df.to_markdown(index=False, numalign="left", stralign="left")) # Display the DataFrame using markdown

Drones can identify vehicles through various means such as using sophisticated sensor payloads
including SIGINT gear, which allows them to gather intelligence through signals interception.
Additionally, drones can also rely on visual identification methods and may have the capability to
carry out autonomous target recognition based on pre-programmed algorithms.
Query execution time: 1.2979 seconds
| Node ID                              | Score    | Text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               

In [81]:
# The ID of the node guarantees full transparency and can be traced back to the original document even when the index engines re-index the dataset.
# We can obtain the node source of the first node as
nodeid = response.source_nodes[0].node_id
nodeid

'fe1820aa-13be-4beb-9bbc-bf1608396133'

In [82]:
# We can drill down and retrieve the full text of the node containing the document that was synthesized by the LLM
response.source_nodes[0].get_text()

'Multinational  edit  EADS Surveyor  The EADS Surveyor is still in preliminary investigation phase. It will be a fixedwing jetpowered UAV and is being positioned as a replacement for the CL289. EADS is currently working on a demonstrator the Carapas modified from an Italian Mirach 100 drone. The production Surveyor would be a stealthy machine with a top speed of 850\xa0kmh 530\xa0mph an endurance of up to three hours and capable of carrying a sophisticated sensor payload including SIGINT gear. It would also be able to carry external loads such as airdropped sensors or light munitions.  citation needed  Nonstate actors  edit  In the mid2010s the Islamic State terrorist group began attaching explosives to commerciallyavailable quadcopters such as the Chinesemade DJI Phantom to bomb military targets in Iraq and Syria .  42  During the 201617 battle of Mosul  the Islamic State reportedly used drones as surveillance and weapons delivery platforms using improvised cradles to drop grenades an

##### Optimized chunking

In [83]:
# We can predefine the chunk size, or we can let LlamaIndex select it for us. In this case, LlamaIndex chose the chunk size,
# and the code below reports the size of each retrieved node:
for node_with_score in response.source_nodes:
  node = node_with_score.node # Extract the Node object from NodeWithScore
  chunk_size = len(node.text)
  print(f"Node ID: {node.id_}, Chunk Size: {chunk_size} characters")

# The advantage of an automated chunk size is that it can be variable. In this run, the retrieved nodes fall
# roughly in the 4000-to-5500-character range. The chunking function does not cut the content at fixed offsets
# but tries to produce chunks suited to semantic search.

Node ID: fe1820aa-13be-4beb-9bbc-bf1608396133, Chunk Size: 4562 characters
Node ID: 24165c8b-9eba-4b1b-afdd-5c384323adad, Chunk Size: 4200 characters
Node ID: 209e76ba-d4c4-4154-a3a5-0e1316487439, Chunk Size: 4735 characters
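
To illustrate why chunk sizes vary when text is not cut at fixed offsets, here is a simplified sketch that packs whole sentences into chunks up to a character budget. It is an illustration only, not LlamaIndex's splitter; `chunk_by_sentence`, the sample text, and the 120-character budget are all assumptions.

```python
import re

def chunk_by_sentence(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r'(?<=\.)\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Hypothetical sample text
text = ("Drones carry cameras. " * 5 + "Computer vision detects vehicles. " * 5)
chunks = chunk_by_sentence(text, max_chars=120)
for c in chunks:
    print(len(c), "characters")
```

Each chunk ends on a sentence boundary, so the lengths differ from chunk to chunk. A single sentence longer than the budget would become its own oversized chunk in this sketch.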


##### Performance metric

In [86]:
# We will also implement a performance metric based on the accuracy of the queries and the time elapsed.
# Below function calculates and prints a performance metric for a query, along with its execution time.
# The metric is based on the weighted average relevance scores of the retrieved information, divided by the time it took to get the results.
# Higher scores indicate better performance
def info_metrics(response, elapsed_time):
  # Collect the retrieval scores, skipping nodes without a score
  scores = [node.score for node in response.source_nodes if node.score is not None]
  avg_score = np.mean(scores) if scores else 0
  if scores: # Check if there are any valid scores
    # Softmax weights give slightly more influence to higher-scoring nodes
    weights = np.exp(scores) / np.sum(np.exp(scores))
    perf = np.average(scores, weights=weights) / elapsed_time
  else:
    perf = 0
  print(f"AverageScore: {avg_score}\nQuery execution time:{elapsed_time}\nPerformance metric:{perf}")

In [87]:
info_metrics(response, elapsed_time)

AverageScore: 0.835926428579287
Query execution time:1.297931432723999
Performance metric:0.6440803368110414
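
The arithmetic behind this metric can be followed step by step in the sketch below, which uses only the standard library. The three scores and the elapsed time are hypothetical values chosen for illustration, and `weighted_performance` is a stand-in name rather than a function from the notebook.

```python
import math

def weighted_performance(scores, elapsed_seconds):
    """Softmax-weighted mean of the scores divided by the elapsed time."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]              # softmax weights, sum to 1
    weighted_avg = sum(w * s for w, s in zip(weights, scores))
    return weighted_avg, weighted_avg / elapsed_seconds

scores = [0.85, 0.83, 0.82]    # hypothetical retrieval scores
weighted_avg, perf = weighted_performance(scores, elapsed_seconds=1.3)
print(f"weighted average: {weighted_avg:.4f}, performance: {perf:.4f}")
```

When the scores are close together, the softmax weights are nearly uniform and the weighted average stays very close to the plain mean. The run above shows the same effect: 0.6441 multiplied by 1.2979 seconds gives about 0.836, almost identical to the reported average score.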


In [None]:
"""
This performance metric is not an absolute value. It is an indicator that we can use to compare this output with the other index engines.
It may also vary from one run to another, due to the stochastic
nature of machine learning algorithms. Additionally, the quality of the output depends on the user’s
subjective perception. In any case, this metric will help compare the query engines’ performances.

It can be observed that the average score is satisfactory, even though we loaded heterogeneous and
sometimes unrelated documents in the dataset. In this run, the integrated retriever and synthesizer
functionality of LlamaIndex, Deep Lake, and OpenAI proved effective.
"""

#### Tree index query engine

In [88]:
# The tree index in LlamaIndex creates a hierarchical structure for managing and querying text documents efficiently.
# The tree index engine optimizes the hierarchy, content, and order of the nodes
# The tree index is efficient for large datasets and queries large collections of documents rapidly by breaking them down into manageable optimized chunks
# create a tree index
from llama_index.core import TreeIndex
tree_index = TreeIndex.from_documents(documents)
print(type(tree_index))

# now make our tree index the query engine
tree_query_engine = tree_index.as_query_engine(similarity_top_k=k,
                                               temperature=temp, num_output=mt)

<class 'llama_index.core.indices.tree.base.TreeIndex'>


In [89]:
# Now run the query, measure the elapsed time, and process the response
# Start the timer
start_time = time.time()
response = tree_query_engine.query(user_input)
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 3.3890 seconds
Drones can identify vehicles using computer vision techniques, which involve analyzing images or
video captured by the drone's cameras to detect and recognize vehicles based on their visual
features.
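
The start/stop timing pattern is repeated for every engine in this notebook. One way to factor it out is a small helper like the sketch below; `timed` and `fake_query` are hypothetical names, and `fake_query` only stands in for a real query engine call.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()          # monotonic, high-resolution clock
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for a query engine call; `fake_query` is hypothetical
def fake_query(question):
    time.sleep(0.05)
    return f"answer to: {question}"

response, elapsed_time = timed(fake_query, "How do drones identify vehicles?")
print(f"Query execution time: {elapsed_time:.4f} seconds")
```

With the real engines, the call would look like `timed(tree_query_engine.query, user_input)`. `time.perf_counter()` is used here instead of `time.time()` because it is not affected by system clock adjustments.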


##### Performance metric

In [90]:
# calculate the cosine similarity between the user input and the response of our RAG pipeline
similarity_score = calculate_cosine_similarity_with_embeddings(user_input, str(response))
print(f"Cosine Similarity Score: {similarity_score:.3f}")
print(f"Query execution time: {elapsed_time:.4f} seconds")
performance=similarity_score/elapsed_time
print(f"Performance metric: {performance:.4f}")

Cosine Similarity Score: 0.811
Query execution time: 3.3890 seconds
Performance metric: 0.2393


#### List index query engine

In [91]:
# ListIndex is not just a simple list of nodes. The query engine will process the user input and each document as a prompt for an LLM.
# The LLM will evaluate the semantic similarity relationship between the documents and the query, thus implicitly ranking and selecting the most relevant nodes
# LlamaIndex will filter the documents based on the rankings obtained, and it can also take the task further by synthesizing information from multiple nodes and documents
# the list index can also be created as follows
from llama_index.core import ListIndex
list_index = ListIndex.from_documents(documents)

print(type(list_index))

<class 'llama_index.core.indices.list.base.SummaryIndex'>


In [93]:
# The printed type shows that ListIndex is implemented by the SummaryIndex class, which hints at the document summarization running under the hood
# Now use the list index as a query engine within the same LlamaIndex framework
list_query_engine = list_index.as_query_engine(similarity_top_k=k,
                                               temperature=temp, num_output=mt)

# now run our query, wrap the response up, and display the output
#start the timer
start_time = time.time()
response = list_query_engine.query(user_input)
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 10.7201 seconds
Drones do not specifically identify vehicles. Their primary use in emergency medical services
involves delivering critical medical supplies such as defibrillators, medications, and diagnostic
equipment to emergency situations. Drones are equipped with advanced navigation systems, sensors,
and real-time data transmission capabilities to autonomously deliver these supplies to incident
sites quickly and efficiently.


In [94]:
# The execution time is longer because the query goes through every node in a list rather than through an optimized tree.

##### Performance metric

In [95]:
# will use the cosine similarity, as we did for the tree index, to evaluate the similarity score
similarity_score = calculate_cosine_similarity_with_embeddings(user_input, str(response))

print(f"Cosine Similarity Score: {similarity_score:.3f}")
print(f"Query execution time: {elapsed_time:.4f} seconds")
performance=similarity_score/elapsed_time
print(f"Performance metric: {performance:.4f}")

Cosine Similarity Score: 0.706
Query execution time: 10.7201 seconds
Performance metric: 0.0658


In [96]:
# The performance metric is lower than that of the tree index, mainly because of the longer execution time.
# Looking back at the execution times of the indexing types run so far, the vector store index was the fastest.

#### Keyword index query engine

In [97]:
# KeywordTableIndex is a type of index in LlamaIndex, designed to extract keywords from your documents and organize them in a table-like structure.
# This structure makes it easier to query and retrieve relevant information based on specific keywords or topics.
# The extracted keywords are organized into a table-like format where each keyword is associated with an ID that points to the related nodes.
# create the keyword index as follows
from llama_index.core import KeywordTableIndex
keyword_index = KeywordTableIndex.from_documents(documents)

print(type(keyword_index))

<class 'llama_index.core.indices.keyword_table.base.KeywordTableIndex'>


In [98]:
# let's extract the data and create a pandas DataFrame to see how the index is structured
# Extract data for DataFrame
data = []

for keyword, doc_ids in keyword_index.index_struct.table.items():
  for doc_id in doc_ids:
    data.append({"Keyword": keyword, "Document ID": doc_id})

# Create the DataFrame
df = pd.DataFrame(data)
df  #output will show each keyword is associated with an ID that contains a document or a summary depending on the way LlamaIndex optimizes the index

|      | Keyword              | Document ID                          |
|------|----------------------|--------------------------------------|
| 0    | deep learning        | 2baa1434-1ae7-4c82-bc27-bd56aec1b0bf |
| 1    | deep learning        | c1974c48-ca27-4d6f-8ec3-e82529ddd4c4 |
| 2    | deep learning        | 4b4e95fb-198c-483f-9087-87891fe06c4c |
| 3    | deep learning        | aff4e25a-116d-4777-bb40-241d42ea339b |
| 4    | deep learning        | db49fd72-7b9e-44d4-8d53-569d70d05a63 |
| ...  | ...                  | ...                                  |
| 4111 | united arab emirates | e75ee5d3-4048-4b63-b7d9-9b0285ea61eb |
| 4112 | wing loong           | e75ee5d3-4048-4b63-b7d9-9b0285ea61eb |
| 4113 | emirates             | e75ee5d3-4048-4b63-b7d9-9b0285ea61eb |
| 4114 | egypt                | e75ee5d3-4048-4b63-b7d9-9b0285ea61eb |
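
The keyword-to-ID structure can be pictured as a dictionary that maps each keyword to the set of nodes containing it. The toy sketch below builds such a table and ranks nodes by the number of matching query keywords. It is a simplified illustration under assumed data, not LlamaIndex's implementation; the node IDs, keywords, and `lookup` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical nodes: node ID -> keywords already extracted from its text
node_keywords = {
    "node-a": ["deep learning", "object detection"],
    "node-b": ["deep learning", "drone"],
    "node-c": ["drone", "emergency"],
}

# Build the table: keyword -> set of node IDs
table = defaultdict(set)
for node_id, keywords in node_keywords.items():
    for kw in keywords:
        table[kw].add(node_id)

def lookup(query_keywords):
    """Rank node IDs by how many query keywords point to them."""
    hits = defaultdict(int)
    for kw in query_keywords:
        for node_id in table.get(kw, ()):
            hits[node_id] += 1
    return sorted(hits, key=lambda n: (-hits[n], n))

ranked = lookup(["drone", "deep learning"])
print(ranked)
```

A lookup like this needs no embeddings at query time, which fits the short execution time measured for the keyword engine in the next cell.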


In [99]:
# now define the keyword index as the query engine
keyword_query_engine = keyword_index.as_query_engine(similarity_top_k=k,
                                                     temperature=temp, num_output=mt)

# run the keyword query and see how well and fast it can produce a response
start_time = time.time()
response = keyword_query_engine.query(user_input)
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time

print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

Query execution time: 1.5545 seconds
Drones can identify vehicles through various means such as visual recognition using cameras,
sensors, and artificial intelligence algorithms. They can also utilize GPS technology for tracking
and identification purposes. Additionally, drones can be equipped with specialized systems for
specific identification tasks, depending on the intended application.


##### Performance metric

In [100]:
# code runs the same metric as for the tree and list index
similarity_score = calculate_cosine_similarity_with_embeddings(user_input, str(response))

print(f"Cosine Similarity Score: {similarity_score:.3f}")
print(f"Query execution time: {elapsed_time:.4f} seconds")
performance=similarity_score/elapsed_time
print(f"Performance metric: {performance:.4f}")

Cosine Similarity Score: 0.803
Query execution time: 1.5545 seconds
Performance metric: 0.5168
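
To compare the tree, list, and keyword engines side by side, the sketch below recomputes the similarity-to-time ratio from the rounded values printed in this notebook. The vector store index is left out because its metric was based on retrieval scores rather than on the cosine similarity between question and answer, so the two numbers are not directly comparable. Small differences from the printed metrics are expected, since the inputs here are already rounded.

```python
# Similarity scores and timings copied from the outputs in this notebook
runs = {
    "tree":    {"similarity": 0.811, "seconds": 3.3890},
    "list":    {"similarity": 0.706, "seconds": 10.7201},
    "keyword": {"similarity": 0.803, "seconds": 1.5545},
}

# Recompute the ratio from the rounded values; expect small rounding differences
performance = {name: r["similarity"] / r["seconds"] for name, r in runs.items()}

# Print the engines from best to worst ratio
for name in sorted(performance, key=performance.get, reverse=True):
    print(f"{name:8s} {performance[name]:.4f}")
```

The ratio rewards fast answers as much as similar ones, so it should be read as a rough comparison for this single question and run, not as a general ranking of the index types.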
