<a href="https://colab.research.google.com/github/vectara/example-notebooks/blob/main/notebooks/using-vectara-with-llamaindex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vectara and LlamaIndex

## About Vectara

[Vectara](https://vectara.com/) is the trusted GenAI and semantic search platform that provides an easy-to-use API for document indexing and querying. 

Vectara provides an end-to-end managed service for Retrieval Augmented Generation or [RAG](https://vectara.com/grounded-generation/), which includes:

1. A way to extract text from document files and chunk them into sentences.

2. The state-of-the-art [Boomerang](https://vectara.com/how-boomerang-takes-retrieval-augmented-generation-to-the-next-level-via-grounded-generation/) embeddings model. Each text chunk is encoded into a vector embedding using Boomerang, and stored in the Vectara internal knowledge (vector+text) store. Thus, when using Vectara with LlamaIndex you do not need to call a separate embedder model - this happens automatically within the Vectara backend.

3. A query service that automatically encodes the query into an embedding and retrieves the most relevant text segments (including support for [Hybrid Search](https://docs.vectara.com/docs/api-reference/search-apis/lexical-matching) and [MMR](https://vectara.com/get-diverse-results-and-comprehensive-summaries-with-vectaras-mmr-reranker/)).

4. An option to create a [generative summary](https://docs.vectara.com/docs/learn/grounded-generation/grounded-generation-overview) based on the retrieved documents, including citations.

See the [Vectara API documentation](https://docs.vectara.com/docs/) for more information on how to use the API.

The main benefits for using Vectara for a RAG application are:
* **Easy to use**: Vectara provides an end-to-end, fully functional, highly scalable and robust RAG pipeline, so as a user you don't have to code up these pieces and maintain them over time.
* **Scalable and Secure**: building GenAI applications may seem easy at first, but the DIY approach can become overwhelming beyond simple examples. Vectara provides instant scalability to millions of documents, while maintaining data security and privacy, as well as latency SLAs.

## About LlamaIndex

LlamaIndex is a "data framework" to help you build LLM apps:

1. It includes **data connectors** to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)
2. It provides ways to **structure your data** (indices, graphs) so that this data can be easily used with LLMs.
3. It provides an **advanced retrieval/query interface over your data**: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.

LlamaIndex's high-level API allows beginner users to use LlamaIndex to ingest and query their data in just a few lines of code, whereas its lower-level APIs allow advanced users to customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules), to fit their needs.

Vectara is implemented in LlamaIndex as a [Managed Service](https://docs.llamaindex.ai/en/stable/community/integrations/managed_indices.html#vectara), which wraps Vectara's API so that it integrates easily into LlamaIndex.

In this notebook, we will demonstrate some of the great ways you can use Vectara together with LlamaIndex.

## Getting Started

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

To get started with Vectara, [sign up](https://console.vectara.com/signup?utm_source=vectara&utm_medium=signup&utm_term=DevRel&utm_content=example-notebooks&utm_campaign=vectara-signup-DevRel-example-notebooks) (if you haven't already) and follow our [quickstart](https://docs.vectara.com/docs/quickstart) guide to create a corpus and an API key. 

Once you have these, you can provide them as environment variables, which will be used by the LlamaIndex code later on:

In [1]:
#!pip install -U llama-index llama-index-indices-managed-vectara arxiv

import os
os.environ['VECTARA_API_KEY'] = "<YOUR_VECTARA_API_KEY>"
os.environ['VECTARA_CORPUS_ID'] = "<YOUR_VECTARA_CORPUS_ID>"
os.environ['VECTARA_CUSTOMER_ID'] = "<YOUR_VECTARA_CUSTOMER_ID>"
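
Before creating the index, it can help to confirm that all three variables are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_vectara_vars` is hypothetical and not part of LlamaIndex or Vectara, and treating `<YOUR_...>` values as unset is an assumption based on the placeholders used above.

```python
import os

# The three variable names come from the cell above.
REQUIRED_VARS = ("VECTARA_API_KEY", "VECTARA_CORPUS_ID", "VECTARA_CUSTOMER_ID")


def missing_vectara_vars(env=os.environ):
    """Return the required variables that are unset or still hold a placeholder."""
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        # Treat empty values and "<YOUR_...>" placeholders as missing (assumption).
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


missing = missing_vectara_vars()
if missing:
    print("Please set:", ", ".join(missing))
```

Running this right after setting the variables surfaces a forgotten placeholder early, rather than as an authentication error during upload or query.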

## Loading Data Into Vectara

As mentioned above, Vectara is a RAG managed service, and in many cases data may be uploaded to the index ahead of time (e.g. by using [Airbyte](https://docs.airbyte.com/integrations/destinations/vectara), directly via Vectara's [indexing API](https://docs.vectara.com/docs/api-reference/indexing-apis/indexing) or using tools like [vectara-ingest](https://github.com/vectara/vectara-ingest)), but another easy way is via the VectaraIndex constructor: `from_documents()`.

For this notebook we will assume the Vectara corpus is empty and will load PDF documents from Arxiv, using Python's [arxiv](https://github.com/lukasschwab/arxiv.py) library. We will pull in the top papers whose titles match "embedding model" or "sentence embedding":

In [2]:
import arxiv

client = arxiv.Client()
search = arxiv.Search(
  query = "(ti:embedding model) OR (ti:sentence embedding)",
  max_results = 100,
  sort_by = arxiv.SortCriterion.Relevance
)
papers = list(client.results(search))

In [3]:
[p.entry_id for p in papers][:5]

['http://arxiv.org/abs/2402.14776v2',
 'http://arxiv.org/abs/2007.01852v2',
 'http://arxiv.org/abs/1910.13291v1',
 'http://arxiv.org/abs/2104.06719v1',
 'http://arxiv.org/abs/1511.08198v3']

Next, download the Arxiv papers and upload them into Vectara using `insert_file()`.

In [4]:
import shutil
from llama_index.indices.managed.vectara import VectaraIndex

data_folder = 'temp'
os.makedirs(data_folder, exist_ok=True)

# Create Vectara Index
index = VectaraIndex()

# Upload content for all papers
for paper in papers:
    try:
        paper_fname = paper.download_pdf(data_folder)
    except Exception as e:
        # paper_fname is not assigned when the download fails, so report the paper's arXiv ID instead
        print(f"Paper {paper.entry_id} failed to load with error {e}")
        continue
    metadata = {
        'url': paper.pdf_url,
        'title': paper.title,
        'author': str(paper.authors[0]),
        'published': str(paper.published.date())
    }
    index.insert_file(file_path=paper_fname, metadata=metadata)

shutil.rmtree(data_folder)
del papers

File temp/1909.03104v2.Efficient_Sentence_Embedding_using_Discrete_Cosine_Transform.pdf failed to load with error HTTP Error 404: Not Found


Two important things to note here:
1. Vectara processes each file uploaded on the backend, and performs appropriate chunking. So you don't need to apply any local processing, or choose a chunking strategy. 
2. We have used the fields `url`, `title`, `author`, and `published` as metadata fields (where author is the first author if there are many, just to simplify). You will need to make sure those fields are defined in your Vectara corpus as [filterable metadata fields](https://docs.vectara.com/docs/learn/metadata-search-filtering/filter-overview) to ensure we can filter by them at query time.

So that's it for upload. 

## Querying with the VectaraIndex
We can now ask questions using the `VectaraIndex` object.

In [5]:
query = "What is sentence embedding?"

In [6]:
query_engine = index.as_query_engine(
    summary_enabled=True, summary_num_results=5,
    summary_response_lang="en",
    summary_prompt_name="vectara-summary-ext-24-05-med"
)
res = query_engine.query(query)
print(res.response)

Sentence embedding is a method used in natural language processing that represents sentences in a numerical format. It involves the use of models like SimCSE-BERT/SRoBERTa/ST5 to convert sentences into a form that can be processed by machine learning algorithms. This process often involves the use of LSTM (Long Short-Term Memory) encoders and decoders. The LSTM encoder architecture with max-pooling is commonly used for all encoders, and LSTM decoders are used for SDAE and NMT. The resulting sentence embeddings can then be used in various applications, such as in a softmax inference classifier [1][2][3].


Note that the response here is fully generated by Vectara. There is no additional LLM involved (or API key you need to set up). The response also includes citations (marked in square brackets), which point to the references Vectara used to generate this response.
<br>
The `res` object includes the actual response to the user query, but also has the citations:

In [7]:
[(inx, n.node.metadata['url']) for inx,n in enumerate(res.source_nodes)]

[(0, 'http://arxiv.org/pdf/2305.03010v1'),
 (1, 'http://arxiv.org/pdf/1904.05542v1'),
 (2, 'http://arxiv.org/pdf/1904.05542v1'),
 (3, 'http://arxiv.org/pdf/1904.05542v1'),
 (4, 'http://arxiv.org/pdf/1904.05542v1'),
 (5, 'http://arxiv.org/pdf/1904.05542v1'),
 (6, 'http://arxiv.org/pdf/1904.05542v1'),
 (7, 'http://arxiv.org/pdf/1904.05542v1'),
 (8, 'http://arxiv.org/pdf/2206.02690v3'),
 (9, 'http://arxiv.org/pdf/2404.03921v2')]
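
To connect the bracketed markers in the response text to these sources, you could resolve each marker to a URL. This is a minimal sketch, not Vectara's or LlamaIndex's own logic: the helper `cited_urls` is hypothetical, the sample text is a stand-in, and it assumes that marker `[n]` refers to the n-th source node (1-based), which you should verify against your own results.

```python
import re


def cited_urls(response_text, source_urls):
    """Map each citation marker like [2] in the text to a source URL.

    Assumption: marker [n] refers to the n-th source (1-based).
    """
    cited = {}
    for match in re.finditer(r"\[(\d+)\]", response_text):
        n = int(match.group(1))
        # Ignore markers that point outside the list of sources.
        if 1 <= n <= len(source_urls):
            cited[n] = source_urls[n - 1]
    return dict(sorted(cited.items()))


# Stand-in values; with the objects above you would pass res.response and
# [n.node.metadata['url'] for n in res.source_nodes].
text = "Sentence embeddings map sentences to vectors [1][3]."
urls = [
    "http://arxiv.org/pdf/2305.03010v1",
    "http://arxiv.org/pdf/1904.05542v1",
    "http://arxiv.org/pdf/1904.05542v1",
]
print(cited_urls(text, urls))
```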

## Using Streaming

You can also stream the Vectara response simply by specifying `streaming=True`:

In [8]:
query_engine = index.as_query_engine(summary_enabled=True, streaming=True)
res = query_engine.query(query)

# print streamed output chunk by chunk
for chunk in res.response_gen:
    print(chunk.delta or "", end="", flush=True)

Sentence embedding refers to the process of converting a sentence into a fixed-dimensional vector representation. This is achieved through various models like SimCSE-BERT, SRoBERTa, and ST5 using techniques such as LSTM encoders and decoders. The embeddings capture the semantic meaning of the sentence, allowing for tasks like entailment, contradiction, and neutral classification. Different architectures like bidirectional LSTM and softmax inference classifiers are utilized in sentence embedding models to encode sentences efficiently.
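
If you also need the complete text after streaming finishes (for logging, say), you can accumulate the chunks while printing them. A minimal sketch follows; `collect_stream` is a hypothetical helper, and the stand-in chunks only mimic the `.delta` attribute used in the cell above.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Print each streamed chunk as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        delta = chunk.delta or ""  # some chunks may carry no text
        print(delta, end="", flush=True)
        parts.append(delta)
    print()
    return "".join(parts)


# Stand-in chunks; in the notebook you would pass res.response_gen instead.
fake_stream = [SimpleNamespace(delta=d) for d in ("Sentence ", None, "embedding")]
full_text = collect_stream(fake_stream)
```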

## Reranking

Vectara supports two types of [reranking](https://docs.vectara.com/docs/api-reference/search-apis/reranking). The first one is called [maximal marginal relevance](https://docs.vectara.com/docs/api-reference/search-apis/reranking#maximal-marginal-relevance-mmr-reranker), or MMR, and provides a reranking that can promote diversity in results at the cost of relevance. The other reranker, called Slingshot, is an ML reranker that increases the accuracy of result ranking and is available to Vectara Scale customers.

Let's see an example of how to use MMR. We will run the same query, but this time with MMR, where `mmr_diversity_bias=0.3` sets the tradeoff between relevance and diversity (0.0 is full relevance, 1.0 is only diversity):

In [9]:
query_engine = index.as_query_engine(
    similarity_top_k=5,
    reranker="mmr",
    rerank_k=50,
    mmr_diversity_bias=0.3,
)
response = query_engine.query(query)
print(response)

Sentence embedding is a technique in Natural Language Processing (NLP) that involves representing sentences as vectors in n-dimensional space to encode their meanings. Various models like ELMo, BERT, and SBERT-WK are used to compute sentence embeddings by averaging LSTM outputs or word representations. These embeddings are crucial for tasks like document classification and sentiment analysis. The goal is to preserve the original sentence meanings effectively in the embedded vectors. Sentence embedding plays a vital role in NLP applications such as search engines and question-and-answer platforms, contributing to significant breakthroughs in the field [2] [3] [4] [7].


In [10]:
[(inx, n.node.metadata['url']) for inx,n in enumerate(response.source_nodes)]

[(0, 'http://arxiv.org/pdf/2305.03010v1'),
 (1, 'http://arxiv.org/pdf/1808.05505v3'),
 (2, 'http://arxiv.org/pdf/2206.02690v3'),
 (3, 'http://arxiv.org/pdf/2002.09620v2'),
 (4, 'http://arxiv.org/pdf/2210.06432v3')]

As you can see, the results are now reranked in a way that provides more diversity instead of maximizing pure relevance. This in turn results in a different set of chunks used to generate the response.
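
To build intuition for what the diversity bias does, here is a minimal sketch of the textbook greedy MMR procedure. It is not Vectara's implementation: the function `mmr_rerank`, the scoring formula, and the toy relevance and similarity numbers are all illustrative assumptions.

```python
def mmr_rerank(relevance, similarity, diversity_bias=0.3, top_k=5):
    """Greedy MMR: pick items one at a time, trading relevance for novelty.

    relevance[i]     : relevance of candidate i to the query
    similarity[i][j] : similarity between candidates i and j
    diversity_bias   : 0.0 ranks purely by relevance, 1.0 purely by diversity
    """
    remaining = list(range(len(relevance)))
    selected = []
    while remaining and len(selected) < top_k:
        def score(i):
            # Penalty is the similarity to the closest already-selected item.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return (1 - diversity_bias) * relevance[i] - diversity_bias * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: candidates 0 and 1 are near-duplicates, candidate 2 is different.
relevance = [0.9, 0.85, 0.5]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr_rerank(relevance, similarity, diversity_bias=0.0, top_k=3))  # [0, 1, 2]
print(mmr_rerank(relevance, similarity, diversity_bias=0.3, top_k=3))  # [0, 2, 1]
```

With a bias of 0.0 the near-duplicate stays in second place. With 0.3 the less relevant but more novel candidate moves ahead of it, which mirrors the change in source URLs seen above.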

So far we've used Vectara's internal summarization capability, which is the best way for most users.

You can still use LlamaIndex's standard VectorStore `as_query_engine()` method, in which case Vectara's summarization won't be used, and you would be using an external LLM (like OpenAI's GPT-4 or similar) and a custom prompt from LlamaIndex to generate the summary. For this option just set `summary_enabled=False`.

For this you would need to specify your own OpenAI API key in the environment:

> `os.environ['OPENAI_API_KEY'] = '<YOUR_OPENAI_API_KEY>'`

In [11]:
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4-turbo", temperature=0)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    summary_enabled=False,
    llm=llm
)
response = query_engine.query(query)
print(response)

Sentence embedding is a term used for representing words or sentences in a text that encodes the meaning of the word or the sentence in n-dimensional space. It involves mapping text data into vectors that can be a set of real numbers, allowing the encoded text to be processed and understood by machines. This representation is crucial in various applications such as search engines, expert systems, and question-and-answer platforms, where the proximity of vectors in the space can indicate the similarity of meanings between words or sentences.


## Using Vectara Chat

Vectara now fully supports Chat in its platform, where the chat history is maintained by Vectara and so you don't have to worry about keeping history and integrating it with your RAG pipeline. 

To use it, simply call `as_chat_engine()`.

(Chat mode always uses Vectara's summarization so you don't have to explicitly specify `summary_enabled=True` like before)

In [12]:
ce = index.as_chat_engine()

In [13]:
questions = [
    'What is a sentence embedding model?',
    'What are some known models?',
    'How are they different than token embedding models'
]

for q in questions:
    print(f"Question: {q}\n")
    response = ce.chat(q).response
    print(f"Response: {response}\n")

Question: What is a sentence embedding model?

Response: A sentence embedding model is a method that represents input sentences as fixed-dimensional vectors, regardless of sentence length. These models have shown significant enhancements in various NLP tasks like information retrieval, question answering, and machine translation. They are trained to adapt to specific domains by fine-tuning on synthesized sentence pairs before further fine-tuning on labeled data, leading to improved performance. Additionally, sentence embedding models can be tailored for cross-lingual applications by aligning at both sentence and token levels, enhancing their versatility and effectiveness in capturing semantic relationships within and across languages [4][3][5].

Question: What are some known models?

Response: Some examples of well-known sentence embedding models include BERT (both BERT-Large and BERT-Base), ELMO + Attn, GenSen, DSE (with different alpha values), FastSent, Quick-Thought, and models lik

Of course streaming works as well with Chat:

In [14]:
ce = index.as_chat_engine(streaming=True)

In [15]:
response = ce.stream_chat("who is behind SBERT?")
for chunk in response.chat_stream:
    print(chunk.delta or "", end="", flush=True)

The entity behind SBERT is a team that extended Sentence-BERT (SBERT) based on transformer models like BERT. SBERT utilizes a siamese network structure and pre-trained weights from BERT for efficient training of suitable sentence embeddings methods. The team achieved state-of-the-art performance for various sentence embedding tasks using SBERT, which applies mean pooling on the output. Additionally, they used XLM-R as a pre-trained network on 100 languages in their experiments. The team's work demonstrates improvements in supervised sentence embedding methods, showcasing the effectiveness of SBERT in enhancing performance. However, the specific names or affiliations of the individuals behind SBERT are not explicitly mentioned in the search results provided.

# Advanced RAG with Vectara and LlamaIndex

## Agentic RAG

LlamaIndex provides various agent implementations, such as chain-of-thought or ReAct.

To use these with Vectara, you need an external LLM as the driver of the agent reasoning. In this example we reuse the OpenAI `llm` object defined above (for this to work, please make sure you have `OPENAI_API_KEY` defined in your environment).

In [16]:
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata

vectara_tool = QueryEngineTool(
    query_engine=index.as_query_engine(
        summary_enabled=True,
        summary_num_results=5,
        summary_response_lang="en",
        summary_prompt_name="vectara-summary-ext-24-05-large",
    ),
    metadata=ToolMetadata(
        name="Vectara",
        description="Vectara tool that can answer Questions about Embedding Models, NLP, and related topics.",
    ),
)
agent = ReActAgent.from_tools(
    tools=[vectara_tool],
    llm=llm,
    context="""
        You are a helpful chatbot that answers any user questions around embedding models in NLP using the Vectara tool.
        You break down complex questions into simpler ones and use the vectara tool to answer every question or sub-question.
        You use the Vectara tool to help answer the user question.
    """,
    verbose=True,
    max_iterations=20
)

In [17]:
question = """
    What are the sentence embedding models? 
    what are the best sentence embedding models, who created each model and in what year was the paper published?
    Compare and contrast their architecture, training procedure and performance
"""

print(agent.chat(question).response)

Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: Vectara
Action Input: {'input': 'What are sentence embedding models?'}
Observation: Sentence embedding models are designed to convert sentences into numerical representations that capture their semantic meanings. These models are typically trained to achieve state-of-the-art representations for tasks like semantic textual similarity. They can be adapted for specific domains by incorporating domain-specific knowledge [2]. Examples of sentence embedding models include Sentence-BERT, SimCSE-BERT/SimCSE-RoBERTa, Sentence-T5, and MP-Net [3]. These models are useful in various applications including information retrieval [4].
Thought: The user asked for details about the best sentence embedding models, their creators, and publication years. I need to gather this information next.
Action: Vectara
Action Input: {'input': 'What are th

## Using Auto Retriever with Vectara

LlamaIndex's auto-retriever functionality is really cool. 
It is most useful when you have metadata fields (like in our case of papers from Arxiv), and would like a query that references a metadata field to be automatically interpreted in the right way.

For example, if I ask "what is a paper about climate change risks published after 2020", the auto-retriever would (behind the scenes) interpret this as the query "what is a paper about climate change risks" along with a filter condition of "published > 2020".

Let's see how this works with the Vectara Index.
First, we have to define a `VectorStoreInfo` structure that describes the metadata fields the auto-retriever needs to know about to do its job:

In [18]:
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo

vector_store_info = VectorStoreInfo(
    content_info="information about a paper",
    metadata_info=[
        MetadataInfo(
            name="published",
            description="The date the paper was published",
            type="string",
        ),
        MetadataInfo(
            name="author",
            description="The author of the paper",
            type="string",
        ),
        MetadataInfo(
            name="title",
            description="The title of the papers",
            type="string",
        ),
        MetadataInfo(
            name="url",
            description="The URL for this paper",
            type="string",
        ),
    ],
)

Auto-retrieval is implemented before calling Vectara as a query transformation. 

Now we can define the `VectaraAutoRetriever`, which can perform auto-retrieval using Vectara:

In [19]:
from llama_index.indices.managed.vectara import VectaraAutoRetriever

retriever = VectaraAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    llm=llm,
    verbose=True
)
res = retriever.retrieve("What is sentence embedding, based on papers before 2019?")
[(r.metadata['published'], r.text) for r in res]

Using query str: What is sentence embedding?
Using implicit filters: [('published', '<', '2019')]
final filter string: (doc.published < '2019')


[('2018-08-16',
  'This\nproblem can be alleviated by obtaining more of para-\nphrase sentence pairs. Conclusion Sentence embedding is one of the most important text\nprocessing techniques in NLP. To date,  various sen-\ntence embedding models have been proposed and have\nyielded good performances in document classification\nand sentiment analysis tasks. However, the fundamen-\ntal ability of sentence embedding methods, i.e., how\neffectively the meanings of the original sentences are\npreserved  in  the  embedded  vectors,  cannot  be  fully\nevaluated through such indirect methods.'),
 ('2018-08-16',
  'Paraphrase Thought:  Sentence Embedding Module Imitating\n                        Human Language Recognition Myeongjun Jang 1 Abstract\nSentence embedding is an important research\ntopic in natural language processing. It is es-\nsential to generate a good embedding vector\nthat  fully  reflects  the  semantic  meaning  of\na sentence in order to achieve an enhanced\nperformance  for 

As you can see, the Auto Retriever was able to translate the natural language text into a shorter query and a proper filter condition (in this case `doc.published < '2019'`).
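
The verbose output above shows two stages: a list of implicit filters as `(field, operator, value)` tuples, and a final filter string. The sketch below illustrates how such tuples could be rendered into that string. It is not the `VectaraAutoRetriever` source code; `build_filter_string` is a hypothetical helper, and joining multiple conditions with `AND` is an assumption.

```python
def build_filter_string(filters):
    """Turn (field, operator, value) tuples into a Vectara-style filter expression."""
    parts = []
    for field, op, value in filters:
        # Quote string values; leave numbers bare.
        rendered = f"'{value}'" if isinstance(value, str) else str(value)
        parts.append(f"(doc.{field} {op} {rendered})")
    # Assumption: multiple conditions are combined with AND.
    return " AND ".join(parts)


print(build_filter_string([("published", "<", "2019")]))
```

For the single filter from the output above, this prints `(doc.published < '2019')`. Because `published` was stored as a string such as `2018-08-16`, the comparison is lexicographic, which works for ISO-formatted dates.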

We can also of course ask a question directly: we use the `VectaraQueryEngine` which can work with the `VectaraAutoRetriever` directly:

In [20]:
from llama_index.indices.managed.vectara.query import VectaraQueryEngine
from llama_index.indices.managed.vectara import VectaraAutoRetriever

ar = VectaraAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    llm=llm,
    summary_enabled=True,
    summary_num_results=5,
    verbose=True
)

query_engine = VectaraQueryEngine(retriever=ar)
response = query_engine.query("What is sentence embedding, based on papers before 2019?")
print(response)

Using query str: What is sentence embedding?
Using implicit filters: [('published', '<', '2019')]
final filter string: (doc.published < '2019')
Sentence embedding is a crucial technique in Natural Language Processing (NLP) that involves transforming sentences into low-dimensional vector representations that capture their semantic meanings. Various models have been developed to create these embeddings, aiming to preserve the original sentence meanings effectively. These embeddings play a vital role in enhancing the performance of NLP tasks like machine translation, document classification, sentiment analysis, and more. Different methods have been proposed to optimize sentence embeddings, such as averaging word embeddings or using self-attention mechanisms. The goal is to ensure that semantically similar sentences have similar embeddings, while dissimilar sentences have distinct embeddings. Overall, sentence embedding is fundamental for improving the efficiency and accuracy of various NL

## Advanced querying with QueryFusionRetriever

The QueryFusion [Retriever](https://docs.llamaindex.ai/en/stable/examples/retrievers/reciprocal_rerank_fusion.html#reciprocal-rerank-fusion-retriever) is an advanced query mechanism whereby the original query is pre-processed to generate N variations. Each of these rephrased queries is then run against the Vectara engine and rank-fusion is used to combine the best results. 

Let's see this in action:

In [21]:
query = "is SBERT a dual encoder? what type of DL architecture does it use?"
query_engine = index.as_query_engine(
    similarity_top_k=3,
    summary_enabled=False,
    llm=llm,
)
response = query_engine.query(query)
print(response)

SBERT, or Sentence-BERT, is a modification of the pre-trained BERT network that uses a siamese and triplet network structure to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This approach effectively makes it a dual-encoder architecture, where each encoder processes one sentence independently, and the embeddings are then compared.


In [22]:
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
import nest_asyncio

rf_retriever = QueryFusionRetriever(
    [index.as_retriever(similarity_top_k=2)],
    similarity_top_k=2,
    num_queries=5,  # this includes the original query; set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
)

nest_asyncio.apply()     # apply nested async to run in a notebook
query_engine = RetrieverQueryEngine.from_args(rf_retriever)
response = query_engine.query(query)
print(response)

Generated queries:
1. What is SBERT and how does it differ from traditional encoder models?
2. Comparison of SBERT with other dual encoder architectures in deep learning.
3. Advantages and disadvantages of using SBERT as a dual encoder in natural language processing tasks.
4. How does SBERT utilize deep learning architecture to achieve superior performance in sentence embeddings?
SBERT is not a dual encoder. It uses a Bi-Encoder architecture.


We can see how the QueryFusionRetriever created additional query variations (they are displayed since we used `verbose=True`) and then the overall response includes the results fused together. This is very helpful in this case because the QueryFusionRetriever creates sub-questions that inquire about the specific architecture of SBERT which is relevant context to answering this question properly.
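
The `reciprocal_rerank` mode refers to reciprocal rank fusion. As a rough illustration of the idea, and not the `QueryFusionRetriever` implementation itself, the sketch below fuses the ranked results of several query variations. The function name, the document IDs, and the constant `k=60` (a commonly used value) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=2):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used constant (assumed here).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n]


# Hypothetical results for three query variations, best match first.
results_per_query = [
    ["paper_a", "paper_b"],
    ["paper_b", "paper_c"],
    ["paper_b", "paper_a"],
]
print(reciprocal_rank_fusion(results_per_query))  # ['paper_b', 'paper_a']
```

A document that ranks well across several query variations accumulates a higher fused score than one that tops only a single list, which is why `paper_b` comes first here.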

## Summary

In this notebook we've seen various examples for using Vectara with LlamaIndex, which provides the following benefits:
* Vectara provides a complete RAG pipeline, so you don't have to deal with a lot of the details around data ingestion: pre-processing, chunking, embedding, etc. Instead all these steps are handled automatically and efficiently in Vectara. 
* Being a platform, Vectara uses its own internal Embedding model (Boomerang), its own vector storage, and calls the LLM for summarization, so you don't have to maintain separate API keys and relationships with additional vendors or install other products.
* Vectara is built for large-scale GenAI applications, and with the tools provided by LlamaIndex like Auto Retrieval and Query Fusion, you can easily build and test advanced RAG applications at enterprise scale.