#### Thursday, December 12, 2024

Gonna give this another go ...

#### Wednesday, November 13, 2024

mamba activate rag_genai

This all runs in one pass. The code does not make any explicit reference to anything OpenAI yet a quick look at my usage for today reveals 17c of usage. I'm guessing LLamaIndex is using OpenAI under the hood, and if this is true, how can I control the model used?

OK. How do I change the code to use a local model served up through LMStudio??

Whelp, looks like the answers to some of these questions can be found [here](https://docs.llamaindex.ai/en/stable/examples/llm/lmstudio/)

In [1]:
# import os
# print(os.environ['OPENAI_API_KEY'])
# print(os.environ['ACTIVELOOP_TOKEN'])

In [1]:
# Deliberately set the OPENAI_API_KEY to an invalid value to ensure that the code is not using it.
import os
os.environ['OPENAI_API_KEY'] = "Nope!"

# Constructing an Index-Based Deep Lake Vector Store for Semantic Search with LlamaIndex and OpenAI

copyright 2024, Denis Rothman

A Practical Guide to Building a Semantic Search Engine with Deep Lake, LlamaIndex, and OpenAI:

*   Installing the Environment
*   Creating and populating the Vector Store &   dataset
*   Getting started with  index-based semantic search




# Installing the environment

In [3]:
#Google Drive option to store API Keys
#Store you key in a file and read it(you can type it directly in the notebook but it will be visible for somebody next to you)
# from google.colab import drive
# drive.mount('/content/drive')

*First run the following cells and restart Google Colab session if prompted. Then run the notebook again cell by cell to explore the code.*

In [4]:
# !pip install llama-index-vector-stores-deeplake==0.1.6

LlamaIndex supports Deep Lake vector stores through the DeepLakeVectorStore class.

In [5]:
# !pip install deeplake==3.9.18

In [6]:
# !pip install llama-index==0.10.64

Next, let's import the required modules and set the needed environmental variables:

In [2]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document
from llama_index.vector_stores.deeplake import DeepLakeVectorStore



In [8]:
# !pip install sentence-transformers==3.0.1

In [9]:
#Retrieving and setting the OpenAI API key
# f = open("drive/MyDrive/files/api_key.txt", "r")
# API_KEY=f.readline().strip()
# f.close()

# #The OpenAI KeyActiveloop and OpenAI API keys
# import os
# import openai
# os.environ['OPENAI_API_KEY'] =API_KEY
# openai.api_key = os.getenv("OPENAI_API_KEY")

In [10]:
#Retrieving and setting the Activeloop API token
# f = open("drive/MyDrive/files/activeloop.txt", "r")
# API_token=f.readline().strip()
# f.close()
# ACTIVELOOP_TOKEN=API_token
# os.environ['ACTIVELOOP_TOKEN'] =ACTIVELOOP_TOKEN

In [11]:
# For Google Colab and Activeloop while waiting for Activeloop (April 2024) pending new version
#This line writes the string "nameserver 8.8.8.8" to the file. This is specifying that the DNS server the system
#should use is at the IP address 8.8.8.8, which is one of Google's Public DNS servers.
# with open('/etc/resolv.conf', 'w') as file:
#    file.write("nameserver 8.8.8.8")

# Pipeline 1 : Collecting and preparing the documents

In [12]:
# !mkdir data

In [3]:
import requests
from bs4 import BeautifulSoup
import re
import os

urls = [
    "https://github.com/VisDrone/VisDrone-Dataset",
    "https://paperswithcode.com/dataset/visdrone",
    "https://openaccess.thecvf.com/content_ECCVW_2018/papers/11133/Zhu_VisDrone-DET2018_The_Vision_Meets_Drone_Object_Detection_in_Image_Challenge_ECCVW_2018_paper.pdf",
    "https://github.com/VisDrone/VisDrone2018-MOT-toolkit",
    "https://en.wikipedia.org/wiki/Object_detection",
    "https://en.wikipedia.org/wiki/Computer_vision",
    "https://en.wikipedia.org/wiki/Convolutional_neural_network",
    "https://en.wikipedia.org/wiki/Unmanned_aerial_vehicle",
    "https://www.faa.gov/uas/",
    "https://www.tensorflow.org/",
    "https://pytorch.org/",
    "https://keras.io/",
    "https://arxiv.org/abs/1804.06985",
    "https://arxiv.org/abs/2202.11983",
    "https://motchallenge.net/",
    "http://www.cvlibs.net/datasets/kitti/",
    "https://www.dronedeploy.com/",
    "https://www.dji.com/",
    "https://arxiv.org/",
    "https://openaccess.thecvf.com/",
    "https://roboflow.com/",
    "https://www.kaggle.com/",
    "https://paperswithcode.com/",
    "https://github.com/"
]

In [4]:
import requests
import re
import os
from bs4 import BeautifulSoup

def clean_text(content):
    # Remove references and unwanted characters
    content = re.sub(r'\[\d+\]', '', content)   # Remove references
    content = re.sub(r'[^\w\s\.]', '', content)  # Remove punctuation (except periods)
    return content

def fetch_and_clean(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise exception for bad responses (e.g., 404)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Prioritize "mw-parser-output" but fall back to "content" class if not found
        content = soup.find('div', {'class': 'mw-parser-output'}) or soup.find('div', {'id': 'content'})
        if content is None:
            return None

        # Remove specific sections, including nested ones
        for section_title in ['References', 'Bibliography', 'External links', 'See also', 'Notes']:
            section = content.find('span', id=section_title)
            while section:
                for sib in section.parent.find_next_siblings():
                    sib.decompose()
                section.parent.decompose()
                section = content.find('span', id=section_title)

        # Extract and clean text
        text = content.get_text(separator=' ', strip=True)
        text = clean_text(text)
        return text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching content from {url}: {e}")
        return None  # Return None on error

In [5]:
# Directory to store the output files
output_dir = './data/'  # More descriptive name
os.makedirs(output_dir, exist_ok=True)

In [6]:
# This will overwrite existing files ... only 7 files are created
# Processing each URL (and skipping invalid ones)
for url in urls:
    article_name = url.split('/')[-1].replace('.html', '')  # Handle .html extension
    filename = os.path.join(output_dir, f"{article_name}.txt")

    clean_article_text = fetch_and_clean(url)
    if clean_article_text:  # Only write to file if content exists
        with open(filename, 'w', encoding='utf-8') as file:
            file.write(clean_article_text)

# 14.3 s
# 10.6 s
# 8.3 s


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


In [7]:
print(f"Content(ones that were possible) written to files in the '{output_dir}' directory.")

Content(ones that were possible) written to files in the './data/' directory.


In [8]:
# load documents
documents = SimpleDirectoryReader("./data/").load_data()

In [9]:
documents[0]

Document(id_='6bd5ff45-2e96-482c-b2d0-b2165543a68b', embedding=None, metadata={'file_path': '/home/rob/Data/Documents/Github/rkaunismaa/RAG-Driven-Generative-AI/LMStudio/Chapter03/data/1804.06985.txt', 'file_name': '1804.06985.txt', 'file_type': 'text/plain', 'file_size': 3798, 'creation_date': '2024-12-12', 'last_modified_date': '2024-12-12'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='High Energy Physics  Theory arXiv1804.06985 hepth Submitted on 19 Apr 2018 Title A Near Horizon Extreme Binary Black Hole Geometry Authors Jacob Ciafre  Maria J. Rodriguez View a PDF of the paper titled A Near Horizon Extreme Binary Black Hole Geometry by Jacob Ciafre and Maria J. Rodriguez View PDF Abstract A new solution of fourdimensional vacuum General Relativity

How do I load these documents into a local VectorStoreIndex?

In [10]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

In [11]:
# Settings.llm = OpenAI(model="gpt-3.5-turbo")
from llama_index.llms.lmstudio import LMStudio

# Settings.llm = LMStudio(
#     model_name="hermes-3-llama-3.1-8b",
#     base_url="http://localhost:1234/v1",
#     temperature=0.7,
# )

# hermes-3-llama-3.2-3b was relased December 3, 2024 so let's use it!
Settings.llm = LMStudio(
    model_name="hermes-3-llama-3.2-3b",
    base_url="http://localhost:1234/v1",
    temperature=0.7,
)





In [12]:
# Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Initialize embeddings (using a local model or HuggingFace)
Settings.embed_model = HuggingFaceEmbedding(model_name="all-MiniLM-L6-v2")

In [23]:
# Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
# Settings.num_output = 512
# Settings.context_window = 3900

In [13]:
# a vector store index only needs an embed model
index = VectorStoreIndex.from_documents(
    documents, embed_model=Settings.embed_model
)

In [14]:
# ... until you create a query engine
query_engine = index.as_query_engine(llm=Settings.llm)

In [15]:
input_query="How do drones identify vehicles?"

In [16]:
# At this point here, LMStudio is used to generate the response ...
# We can now query the dataset and generate a response ...
response = query_engine.query(input_query)

In [17]:
response

Response(response='The provided context does not directly discuss how drones identify vehicles. Drones use various technologies for identification during different applications. Some of these include Remote ID, which requires commercially used UAVs weighing 250g or more to make drone locations and controller locations public from takeoff until shutdown. However, this is one specific example.\n\nFor a broader understanding, drones may employ advanced technologies like computer vision, artificial intelligence, machine learning, deep learning, thermal sensors, and cloud computing for identifying and monitoring different objects. These technologies are essential in various applications such as aerial photography, area coverage, precision agriculture, river monitoring, environmental monitoring, weather observation, policing, infrastructure inspection, smuggling, product deliveries, entertainment, and drone racing.\n\nDrones may also rely on specific systems like ground-based controllers and

# Pipeline 2 : Creating and populating a Deep Lake Vector Store

**Replace `hub://denis76/drone_v2` by your organization and dataset name**

In [None]:
# from llama_index.core import StorageContext

In [18]:
# vector_store_path = "hub://denis76/drone_v2"
# dataset_path = "hub://denis76/drone_v2"

vector_store_path = "VectorStore"
dataset_path = vector_store_path

In [19]:
# overwrite=True will overwrite dataset, False will append it
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)

# 3.2 s



In [None]:
# storage_context = StorageContext.from_defaults(vector_store=vector_store)

# 0.0 s

In [20]:
# Create an index over the documents
# index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
index = VectorStoreIndex.from_documents(
    documents, embed_model=Settings.embed_model
)

In [21]:
import deeplake
ds = deeplake.load(dataset_path)  # Load the dataset

VectorStore loaded successfully.





In [22]:
import json
import pandas as pd
import numpy as np

# Assuming 'ds' is your loaded Deep Lake dataset

# Create a dictionary to hold the data
data = {}

# Iterate through the tensors in the dataset
for tensor_name in ds.tensors:
    tensor_data = ds[tensor_name].numpy()

    # Check if the tensor is multi-dimensional
    if tensor_data.ndim > 1:
        # Flatten multi-dimensional tensors
        data[tensor_name] = [np.array(e).flatten().tolist() for e in tensor_data]
    else:
        # Convert 1D tensors directly to lists and decode text
        if tensor_name == "text":
            data[tensor_name] = [t.tobytes().decode('utf-8') if t else "" for t in tensor_data]
        else:
            data[tensor_name] = tensor_data.tolist()

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

In [23]:
# Function to display a selected record
def display_record(record_number):
    record = df.iloc[record_number]
    display_data = {
        "ID": record.get("id", "N/A"),
        "Metadata": record.get("metadata", "N/A"),
        "Text": record.get("text", "N/A"),
        "Embedding": record.get("embedding", "N/A")
    }

    # Print the ID
    print("ID:")
    print(display_data["ID"])
    print()

    # Print the metadata in a structured format
    print("Metadata:")
    metadata = display_data["Metadata"]
    if isinstance(metadata, list):
        for item in metadata:
            for key, value in item.items():
                print(f"{key}: {value}")
            print()
    else:
        print(metadata)
    print()

    # Print the text
    print("Text:")
    print(display_data["Text"])
    print()

    # Print the embedding
    print("Embedding:")
    print(display_data["Embedding"])
    print()



In [24]:
# Function call to display a record
rec = 0  # Replace with the desired record number
display_record(rec)

IndexError: single positional indexer is out-of-bounds

# Original documents

In [None]:
# Ensure 'text' column is of type string
df['text'] = df['text'].astype(str)
# Create documents with IDs
documents = [Document(text=row['text'], doc_id=str(row['id'])) for _, row in df.iterrows()]

# Pipeline 3:Index-based RAG

## User input and RAG parameters

In [None]:
user_input="How do drones identify vehicles?"

#similarity_top_k
k=3
#temperature
temp=0.1
#num_output
mt=1024

## Cosine similarity metric

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
def calculate_cosine_similarity_with_embeddings(text1, text2):
    embeddings1 = model.encode(text1)
    embeddings2 = model.encode(text2)
    similarity = cosine_similarity([embeddings1], [embeddings2])
    return similarity[0][0]

# Vector store index query engine

(The first type of llama_index we will be looking at is the VectorSoreStoreIndex. This type of index is useful when you have a collection of documents and you want to be able to query them. The VectorStoreIndex uses a vector store to store the embeddings of the documents and then uses these embeddings to perform similarity search.)

( VectorStoreIndex is a type of index within LlamaIndex that implements vector embeddings to represent and retrieve information from documents. These documents with similar meanings will have embeddings that are closer together in the vector space, as we explored in the previous chapter. However, this time, the VectorStoreIndex does not automatically use the existing Deep Lake vector store. It can create a new in-memory vector index, re-embed the documents, and create a new index structure.)

In [None]:
from llama_index.core import VectorStoreIndex
vector_store_index = VectorStoreIndex.from_documents(documents)

In [None]:
print(type(vector_store_index))

(We now need a query engine to retrieve and synthesize the document(s) retrieved with an LLM—in our case, an OpenAI model)

In [None]:
vector_query_engine = vector_store_index.as_query_engine(similarity_top_k=k, temperature=temp, num_output=mt)

In [None]:
print(type(vector_query_engine))

## Query response and source

In [None]:
import pandas as pd
import textwrap

def index_query(input_query):
    
    # We can now query the dataset and generate a response ...
    response = vector_query_engine.query(input_query)

    # Optional: Print a formatted view of the response (remove if you don't need it in the output)
    print(textwrap.fill(str(response), 100))

    node_data = []
    for node_with_score in response.source_nodes:
        node = node_with_score.node
        node_info = {
            'Node ID': node.id_,
            'Score': node_with_score.score,
            'Text': node.text
        }
        node_data.append(node_info)

    df = pd.DataFrame(node_data)

    # Instead of printing, return the DataFrame and the response object
    return df, response


We will evaluate the time it takes for the query to retrieve the relevant data and generate a response synthesis with the LLM (in this case, an OpenAI model). The output of the semantic search first returns a response synthesized by the LLM:

In [None]:
import time

#start the timer
start_time = time.time()
df, response = index_query(user_input)
# Stop the timer
end_time = time.time()

# Calculate and print the execution time
elapsed_time = end_time - start_time

print(f"Query execution time: {elapsed_time:.4f} seconds")
print(df.to_markdown(index=False, numalign="left", stralign="left"))  # Display the DataFrame using markdown

Node information and relationships

The ID of the node guarantees full transparency and can be traced back to the original document, even when the index engines re-index the dataset. We can obtain the node source of the first node, for example, with the following code:

In [None]:
nodeid=response.source_nodes[0].node_id
nodeid

We can drill down and retrieve the full text of the node containing the document that was synthesized by the LLM:

In [None]:
response.source_nodes[0].get_text()

## Optimized chunking

In [None]:
# Assuming you have the 'response' object from query_engine.query()

for node_with_score in response.source_nodes:

    node = node_with_score.node  # Extract the Node object from NodeWithScore
    chunk_size = len(node.text)
    
    print(f"Node ID: {node.id_}, Chunk Size: {chunk_size} characters")

## Performance metric

We will also implement a performance metric based on the accuracy of the queries and the time elapsed. This function calculates and prints a performance metric for a query, along with its execution time. The metric is based on the weighted average relevance scores of the retrieved information, divided by the time it took to get the results. Higher scores indicate better performance.

In [None]:
import numpy as np

def info_metrics(response):
  
  # Calculate the performance (handling None scores)
  scores = [node.score for node in response.source_nodes if node.score is not None]
  
  if scores:  # Check if there are any valid scores
      weights = np.exp(scores) / np.sum(np.exp(scores))
      perf = np.average(scores, weights=weights) / elapsed_time
  else:
      perf = 0  # Or some other default value if all scores are None

  average_score=np.average(scores, weights=weights)
  print(f"Average score: {average_score:.4f}")
  print(f"Query execution time: {elapsed_time:.4f} seconds")
  print(f"Performance metric: {perf:.4f}")

In [None]:
info_metrics(response)

This performance metric is not an absolute value. It’s an indicator that we can use to compare this output with the other index engines. It may also vary from one run to another, due to the stochastic nature of machine learning algorithms. Additionally, the quality of the output depends on the user’s subjective perception. In any case, this metric will help compare the query engines’ performances in this chapter.

# Tree index query engine

(The second type of llama_index we will look as is the TreeIndex. This index engine is a hierarchical index that organizes data into a tree structure. It uses a tree-based data structure to store and retrieve data, which can improve query performance and efficiency.)

The tree index organizes documents in a tree structure, with broader summaries at higher levels and detailed information at lower levels. Each node in the tree summarizes the text it covers. The tree index is efficient for large datasets and queries large collections of documents rapidly by breaking them down into manageable optimized chunks. Thus, the optimization of the tree structure allows for rapid retrieval by traversing the relevant nodes without wasting time.

In [None]:
from llama_index.core import TreeIndex
tree_index = TreeIndex.from_documents(documents)
# 33.2s

In this index model, the LLM (an OpenAI model in this case) acts like it is answering a multiple-choice question when selecting the best nodes during a query. It analyzes the query, compares it with the summaries of the current node’s children, and decides which path to follow to find the most relevant information.

In [None]:
print(type(tree_index))

In [None]:
tree_query_engine = tree_index.as_query_engine(similarity_top_k=k, temperature=temp, num_output=mt)

In [None]:
import time
import textwrap

# Start the timer
start_time = time.time()
response = tree_query_engine.query(user_input)
# Stop the timer
end_time = time.time()

# Calculate and print the execution time
elapsed_time = end_time - start_time

print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

## Performance metric

In [None]:
similarity_score = calculate_cosine_similarity_with_embeddings(user_input, str(response))
print(f"Cosine Similarity Score: {similarity_score:.3f}")
print(f"Query execution time: {elapsed_time:.4f} seconds")
performance=similarity_score/elapsed_time
print(f"Performance metric: {performance:.4f}")

# List index query engine

(The third type of llama_index we we will look as is the ListIndex. This index engine is a simple and efficient index that stores documents in a list and retrieves them based on similarity scores. It is suitable for small datasets and simple queries.)

Don’t think of ListIndex as simply a list of nodes. The query engine will process the user input and each document as a prompt for an LLM. The LLM will evaluate the semantic similarity relationship between the documents and the query, thus implicitly ranking and selecting the most relevant nodes. LlamaIndex will filter the documents based on the rankings obtained, and it can also take the task further by synthesizing information from multiple nodes and documents.

We can see that the selection process with an LLM is not rule-based. Nothing is predefined, which means that the selection is prompt-based by combining the user input with a collection of documents. The LLM evaluates each document in the list independently, assigning a score based on its perceived relevance to the query. This score isn’t relative to other documents; it’s a measure of how well the LLM thinks the current document answers the question. Then, the top-k documents are retained by the query engine if we wish, as in the function used in this section.



In [None]:
from llama_index.core import ListIndex
list_index = ListIndex.from_documents(documents)

In [None]:
print(type(list_index))

In [None]:
list_query_engine = list_index.as_query_engine(similarity_top_k=k, temperature=temp, num_output=mt)

In [None]:
#start the timer
start_time = time.time()
response = list_query_engine.query(user_input)
# Stop the timer
end_time = time.time()

# Calculate and print the execution time
elapsed_time = end_time - start_time

print(f"Query execution time: {elapsed_time:.4f} seconds")
print(textwrap.fill(str(response), 100))

## Performance metric

In [None]:
similarity_score = calculate_cosine_similarity_with_embeddings(user_input, str(response))
print(f"Cosine Similarity Score: {similarity_score:.3f}")
print(f"Query execution time: {elapsed_time:.4f} seconds")
performance=similarity_score/elapsed_time
print(f"Performance metric: {performance:.4f}")

# Keyword index query index

(The fourth and last type of llama_index we will be looking at is the KeywordTableIndex)

KeywordTableIndex is a type of index in LlamaIndex, designed to extract keywords from your documents and organize them in a table-like structure. This structure makes it easier to query and retrieve relevant information based on specific keywords or topics. Once again, don’t think about this function as a simple list of extracted keywords. The extracted keywords are organized into a table-like format where each keyword is associated with an ID that points to the related nodes.

In [None]:
from llama_index.core import KeywordTableIndex
keyword_index = KeywordTableIndex.from_documents(documents)

# 2m 20.1s

In [None]:
# Extract data for DataFrame
data = []
for keyword, doc_ids in keyword_index.index_struct.table.items():
    for doc_id in doc_ids:
        data.append({"Keyword": keyword, "Document ID": doc_id})

# Create the DataFrame
df = pd.DataFrame(data)
df

In [None]:
keyword_query_engine = keyword_index.as_query_engine(similarity_top_k=k, temperature=temp, num_output=mt)

In [None]:
import time

# Start the timer
start_time = time.time()

# Execute the query (using .query() method)
response = keyword_query_engine.query(user_input)

# Stop the timer
end_time = time.time()

# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")

print(textwrap.fill(str(response), 100))

## Performance metric

In [None]:
similarity_score = calculate_cosine_similarity_with_embeddings(user_input, str(response))
print(f"Cosine Similarity Score: {similarity_score:.3f}")
print(f"Query execution time: {elapsed_time:.4f} seconds")
performance=similarity_score/elapsed_time
print(f"Performance metric: {performance:.4f}")