## Introduction

This notebook contains a script for retrieval-augmented generation (RAG) using the Llama Index RAG framework [9], the Milvus vector store [5], HuggingFace models [2,5], and TruLens evaluation [1]. The most prevalent approach to RAG uses OpenAI's embedding and inference models, as the number of tutorials that use OpenAI in RAG systems suggests [3,5,6,9,10]. However, a vast number of other models are available [7,8,9], so there is value in exploring the RAG performance of non-OpenAI alternatives. In this script, the embedding model was chosen from the HuggingFace leaderboard of embedding models [7]. For inference, the notebook experimented with Gemma 2B [12] and Gemma 7B [13] because of their recent release, state-of-the-art (SoTA) performance, and rising popularity.

In [1]:
# Clears the output of certain cells to maintain a cleaner notebook appearance
from IPython.display import clear_output

In [2]:
# pip installations
!pip install llama-index
!pip install pymilvus
!pip install milvus
!pip install "transformers[torch]"
!pip install openai
clear_output()  # clears this cell's output once the installs finish

In [3]:
# colab % pip installations
%pip install llama-index-vector-stores-milvus
%pip install llama-index-embeddings-huggingface
%pip install llama-index-embeddings-openai
%pip install llama-index-llms-huggingface
clear_output()

## Data, Embeddings, & Inference

In [4]:
# imports
from llama_index.core import SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

In [5]:
# Read the Document
# links to a directory and reads in the relevant files.
# ./ points to the local dir.
docs = SimpleDirectoryReader('./').load_data()

docs[0].extra_info

{'page_label': '1',
 'file_name': '/content/MOOC-2 HumanSensing.pdf',
 'file_path': '/content/MOOC-2 HumanSensing.pdf',
 'file_type': 'application/pdf',
 'file_size': 24406576,
 'creation_date': '2024-02-29',
 'last_modified_date': '2024-02-29',
 'last_accessed_date': None}

In [44]:
# define function for embedding model to be used

def define_embeddings_model(service: str, _model_name: str = None):
    """
    Initializes and sets the embedding model based on the specified service.

    Parameters:
    - service (str): The embedding service to use. Currently supports 'openai', 'hf', or 'huggingface'.
    - _model_name (str): The name or identifier of the embedding model. Only the HuggingFace branch uses it; the OpenAI branch uses its default model.

    Raises:
    - ValueError: If the provided service is not supported.

    Returns:
    None

    Example:
    >>> define_embeddings_model('hf', 'Salesforce/SFR-Embedding-Mistral')
    Sets Settings.embed_model to the named HuggingFace model.

    Notes: GPT-3.5-turbo authored this docstring, not the function.
    """
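    service = service.lower()
    if service == 'openai':
        embed_model = OpenAIEmbedding()  # uses the default OpenAI Embedding model
    elif service in ('hf', 'huggingface'):
        # HuggingFaceInferenceAPI wraps an LLM, not an embedding model.
        # Assumption: the embedding counterpart in this Llama Index release is
        # HuggingFaceInferenceAPIEmbedding, from the same package as HuggingFaceEmbedding.
        from llama_index.embeddings.huggingface import HuggingFaceInferenceAPIEmbedding
        embed_model = HuggingFaceInferenceAPIEmbedding(model_name=_model_name)
    else:
        raise ValueError('Embedding service is not supported.')

    Settings.embed_model = embed_model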

I experimented with two embedding models. At the time of writing, 'Salesforce/SFR-Embedding-Mistral' [8] ranks highest overall and in retrieval on HuggingFace's Massive Text Embedding Benchmark (MTEB) [7]. 'WhereIsAI/UAE-Large-V1' [11] also ranks highly in both categories but is much smaller, which allows it to run locally in a free Google Colab environment. The former requires the HuggingFace Inference API, while the latter can be loaded with the HuggingFaceEmbedding class, as demonstrated in the cells below [4].
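To make the role of the embedding model more concrete, here is a minimal sketch of how a retriever might rank stored vectors against a query vector. It assumes cosine similarity as the comparison metric, which is one common choice; the metric the vector store actually uses depends on its configuration. The vectors are made up, and `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, not Llama Index or Milvus functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors; real embeddings have far more dimensions.
toy_docs = {
    "sleep_sensors": [0.9, 0.1, 0.0],
    "author_bio": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(toy_query, toy_docs)
print(ranking[0][0])  # id of the closest toy document
```

A better embedding model places semantically related text closer together under a comparison like this one, which is why the retrieval ranking on MTEB matters for RAG.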

In [84]:
define_embeddings_model('hf','Salesforce/SFR-Embedding-Mistral')
# define_embeddings_model('hf','WhereIsAI/UAE-Large-V1')

In [93]:
# Use Open Source Embeddings
# This is an alternative method that references a local or downloadable embeddings model
model_name = 'WhereIsAI/UAE-Large-V1'
embed_model = HuggingFaceEmbedding(model_name)

# Adds the selected embed model to be used by the Llama Index functions/framework
# Settings replaces service_context as of the Llama Index v0.10 release
Settings.embed_model = embed_model

In [94]:
# This prints examples of the embeddings to make sure it's working properly
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])

1024
[0.005473877303302288, 0.056089356541633606, 0.015606299974024296, -0.014848134480416775, -0.03771238774061203]


In [85]:
# selection of the inference model
inf_model = 'google/gemma-7b'
# inf_model = 'google/gemma-2b'

llm = HuggingFaceInferenceAPI(model_name=inf_model)
Settings.llm = llm

## Vector Database

In [57]:
from llama_index.core import VectorStoreIndex, Document, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore
from milvus import default_server, debug_server
from pymilvus import connections, utility

Milvus proved to be temperamental. The primary service is incompatible with Jupyter Notebook and Google Colab environments [3]. Milvus Lite was created as a solution [5]. However, in my experience it only works on the first run of the script: on any subsequent run, the index is unable to connect to the Milvus server/vector store. This makes the package setup disorienting and requires users to reload the Colab environment entirely after each working session.
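When the index cannot connect, a quick first check is whether anything is listening on the server port at all. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper, and 19530 is the port this notebook's server reported. It only tells you whether a TCP connection is accepted, not whether the process behind it is a healthy Milvus server.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 19530 is the listen port printed by the server in this notebook.
    print(port_is_open("127.0.0.1", 19530))
```

If this returns False after `default_server.start()`, the server did not come up, which would be consistent with the need to restart the Colab runtime described above.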

In [11]:
# starts a milvus server for the vector database to use
# stops any currently running servers
default_server.stop()

try:
    default_server.cleanup()  # cleans up previous data
except Exception:
    print('Server is not running.')

default_server.start()

connections.connect(host='127.0.0.1', port=default_server.listen_port)

print(utility.get_server_version())
print(default_server.listen_port)

v2.3.5-lite
19530


In [95]:
# Create an Index
vector_store = MilvusVectorStore(dim=1024, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 4dd7a568c16f46528dae2e0257249d22
DEBUG:pymilvus.milvus_client.milvus_client:Successfully created collection: llamacollection
DEBUG:pymilvus.milvus_client.milvus_client:Successfully created an index on collection: llamacollection


In [96]:
# Query Engine
query_engine = index.as_query_engine()

In [None]:
# Test Query
res = query_engine.query('What is a common form of human sensing?')
res

## TruLens Evaluation

The TruLens evaluation framework provides many useful tools for assessing the performance of RAG systems. It keeps a record of experiments, although it does not record which backend inference or embedding models were used. It provides feedback functions for metrics such as language match, groundedness, question answer relevance, and question search relevance, along with a Streamlit-based dashboard for viewing and analysing the results of different experiments [1].
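In this notebook, the question search relevance feedback is computed once per retrieved chunk and the per-chunk scores are aggregated with a mean. The sketch below mimics that shape with a toy word-overlap scorer. `toy_relevance` and `aggregate_relevance` are hypothetical stand-ins, not TruLens functions (the real provider asks an LLM to judge relevance), and the chunks are made-up text.

```python
from statistics import mean


def toy_relevance(question: str, chunk: str) -> float:
    """Toy score in [0, 1]: fraction of question words found in the chunk."""
    q_words = {w.strip(".,?!").lower() for w in question.split()}
    c_words = {w.strip(".,?!").lower() for w in chunk.split()}
    q_words.discard("")
    if not q_words:
        return 0.0
    return len(q_words & c_words) / len(q_words)


def aggregate_relevance(question, chunks, scorer=toy_relevance):
    """Score every chunk against the question, then average the scores."""
    scores = [scorer(question, chunk) for chunk in chunks]
    return mean(scores) if scores else 0.0


question = "What devices are used for sensing sleep?"
chunks = [
    "Wearable devices are used for sensing sleep.",  # made-up, on topic
    "The presentation covers several topics.",       # made-up, off topic
]
print(aggregate_relevance(question, chunks))
```

Averaging means a single off-topic chunk pulls the score down, so the metric reflects the quality of the whole retrieved set rather than only the best chunk.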

In [15]:
!pip install trulens_eval
clear_output()

In [87]:
# TruLens
from trulens_eval import Tru
tru = Tru()
clear_output()

In [97]:
from trulens_eval.app import App
from trulens_eval import Feedback
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.hugs import Huggingface
from trulens_eval.feedback.provider.openai import OpenAI
import numpy as np

# define context used in feedback
context = App.select_context(query_engine)

# provides the model on which the following metrics are calculated
provider = OpenAI()

# alternative option for evaluation metrics
# although HF doesn't support the same metrics
# provider = Huggingface()

# groundedness
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons)
    .on(context.collect()) # collect context chunks into a list
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

# question and answer relevance
f_qa_relevance = Feedback(provider.relevance).on_input_output()

# question search relevance
f_qs_relevance = (
    Feedback(provider.qs_relevance)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)

✅ In groundedness_measure_with_cot_reasons, input source will be set to __record__.app.query.rets.source_nodes[:].node.text.collect() .
✅ In groundedness_measure_with_cot_reasons, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In qs_relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In qs_relevance, input statement will be set to __record__.app.query.rets.source_nodes[:].node.text .


In [98]:
# TruLens Recording/Logging
from trulens_eval import TruLlama
query_engine_recorder = TruLlama(query_engine, feedbacks=[f_groundedness, f_qa_relevance, f_qs_relevance])

## Prompt Experimentation

I chose to experiment with different retrieval questions to qualitatively assess the RAG system's capacity to return coherent and accurate information. On a variety of prompts, the system returned structured, accurate information from the large data source. The output was prone to repetition. Although the model has not undergone instruction fine-tuning, I tried issuing an instruction to 'avoid repeating', which had little effect. The best prompt was the final, uncommented string, which asked about sensing sleep and returned a numbered list of relevant devices. This seems to indicate that the system benefits from additional specificity.
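The repetition noted above was judged by eye. As a rough way to put a number on it, the sketch below computes the fraction of sentences in a response that duplicate an earlier sentence. `repeated_sentence_ratio` is a hypothetical helper that was not part of the original evaluation, and the sample text is invented for illustration.

```python
import re
from collections import Counter


def repeated_sentence_ratio(text: str) -> float:
    """Fraction of sentences that duplicate an earlier one (case-insensitive)."""
    sentences = [
        s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()
    ]
    if not sentences:
        return 0.0
    counts = Counter(sentences)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(sentences)


sample = "Sensors track sleep. Sensors track sleep. Wearables are common."
print(repeated_sentence_ratio(sample))
```

Comparing this ratio across prompts, for example with and without the 'avoid repeating' instruction, would give a simple check on whether a prompt change actually reduced repetition.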

In [115]:

# prompt = 'Who is the author of presentation?'
# prompt = 'After sensory data is gathered, what are some examples of its usage?'
# prompt = 'What kinds of sensors are there?'
# prompt = 'Avoid repeating, what can sensory data be used for?'
prompt = 'What devices are used for sensing sleep?'

with query_engine_recorder as recording:
    query_engine.query(prompt)

In [None]:
# display recordings
rec = recording.get()

display(rec)

In [None]:
# view results of the feedback function
for feedback, feedback_result in rec.wait_for_feedback_results().items():
    print(feedback.name, feedback_result.result)

In [113]:
tru.get_leaderboard()

| app_id | latency | total_cost |
|---|---|---|
| Example1 | 6.0 | 0.0 |
| app_hash_0ebeb74bd1fb0fcb293283b1389190c4 | 6.0 | 0.0 |
| app_hash_0f197a1def38758ccfb392dea7b14c6b | 4.0 | 0.0 |
| app_hash_a28c53e34ac14e64895744a94bdd813c | 6.0 | 0.0 |
| app_hash_ad067d64c00c21b37c5a3597f54433c0 | 4.0 | 0.0 |
| app_hash_df319f4f4b118bd79c19640e56cacfc5 | 6.0 | 0.0 |


In [110]:
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.
npx: installed 22 in 9.24s

Go to this url and submit the ip given here. your url is: https://salty-geese-train.loca.lt

  Submit this IP Address: 35.236.183.2



<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

In [116]:
tru.stop_dashboard()

## Conclusion

The system provides accurate question-answering functionality despite using a smaller, non-OpenAI LLM as a backbone. This supports the view that non-massive models still have their place in the modern machine learning ecosystem.

## Suggested Improvements

This project was challenging at just the right level. Llama Index provided extensive documentation that was essential. I recommend leaving the choice of vector store to the student's discretion. Many companies are creating their own vector databases, and it would be a helpful exercise to review popular technologies and make a selection. I was unimpressed by Milvus, so if you would rather not have students make an open selection, I would change the recommended service from Milvus to Chroma DB.

Next, I would encourage the use of non-OpenAI services. There is little doubt that OpenAI is the industry standard, and their models are commonly the benchmarks against which new models are compared. However, it is beneficial to go through the exercise of selecting embedding and inference models because it introduces students to the resources they need to make those decisions. There are many situations where smaller models are a better choice than the extremely large GPT models.


## Citations

[1] “🤗 HuggingFace - 🦑 TruLens.” Accessed: Feb. 29, 2024. [Online]. Available: https://www.trulens.org/trulens_eval/api/provider/huggingface/#trulens_eval.feedback.provider.hugs.Huggingface.pii_detection

[2] “Building RAG from Scratch (Open-source only!) - LlamaIndex 🦙 v0.10.14.” Accessed: Feb. 29, 2024. [Online]. Available: https://docs.llamaindex.ai/en/stable/examples/low_level/oss_ingestion_retrieval.html

[3] “Get Started with Milvus Lite.” Accessed: Feb. 29, 2024. [Online]. Available: https://milvus.io/docs/milvus_lite.md

[4] “Local Embeddings with HuggingFace - LlamaIndex 🦙 v0.10.14.” Accessed: Feb. 29, 2024. [Online]. Available: https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface.html

[5] “Milvus Vector Store - LlamaIndex 🦙 v0.10.14.” Accessed: Feb. 29, 2024. [Online]. Available: https://docs.llamaindex.ai/en/stable/examples/vector_stores/MilvusIndexDemo.html

[6] “milvus-lite/examples/example.py at main · milvus-io/milvus-lite · GitHub.” Accessed: Feb. 29, 2024. [Online]. Available: https://github.com/milvus-io/milvus-lite/blob/main/examples/example.py

[7] “MTEB Leaderboard - a Hugging Face Space by mteb.” Accessed: Feb. 29, 2024. [Online]. Available: https://huggingface.co/spaces/mteb/leaderboard

[8] “Salesforce/SFR-Embedding-Mistral · Hugging Face.” Accessed: Feb. 29, 2024. [Online]. Available: https://huggingface.co/Salesforce/SFR-Embedding-Mistral

[9] “Starter Tutorial - LlamaIndex 🦙 v0.10.14.” Accessed: Feb. 29, 2024. [Online]. Available: https://docs.llamaindex.ai/en/stable/getting_started/starter_example.html

[10] “Using LLMs - LlamaIndex 🦙 v0.10.14.” Accessed: Feb. 29, 2024. [Online]. Available: https://docs.llamaindex.ai/en/stable/understanding/using_llms/using_llms.html

[11] “WhereIsAI/UAE-Large-V1 · Hugging Face.” Accessed: Feb. 29, 2024. [Online]. Available: https://huggingface.co/WhereIsAI/UAE-Large-V1

[12] “google/gemma-2b · Hugging Face.” Accessed: Feb. 29, 2024. [Online]. Available: https://huggingface.co/google/gemma-2b

[13] “google/gemma-7b · Hugging Face.” Accessed: Feb. 29, 2024. [Online]. Available: https://huggingface.co/google/gemma-7b

