<a href="https://colab.research.google.com/github/vectara/example-notebooks/blob/main/notebooks/using-vectara-with-llamaindex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vectara and LlamaIndex

In this notebook we are going to show how to use Vectara with LlamaIndex.

## About Vectara

[Vectara](https://vectara.com/) is the trusted AI Assistant and Agent platform which focuses on enterprise readiness for mission-critical applications. 

Vectara provides an end-to-end managed service for Retrieval Augmented Generation or [RAG](https://vectara.com/grounded-generation/), which includes:

1. An integrated API for processing input data, including text extraction from documents and ML-based chunking.

2. The state-of-the-art [Boomerang](https://vectara.com/how-boomerang-takes-retrieval-augmented-generation-to-the-next-level-via-grounded-generation/) embeddings model. Each text chunk is encoded into a vector embedding using Boomerang, and stored in the Vectara internal knowledge (vector+text) store. Thus, when using Vectara with LlamaIndex you do not need to call a separate embedding model - this happens automatically within the Vectara backend.

3. A query service that automatically encodes the query into embeddings and retrieves the most relevant text segments through [hybrid search](https://docs.vectara.com/docs/api-reference/search-apis/lexical-matching) and a variety of [reranking](https://docs.vectara.com/docs/api-reference/search-apis/reranking) strategies, including a [multilingual reranker](https://docs.vectara.com/docs/learn/vectara-multi-lingual-reranker), [maximal marginal relevance (MMR) reranker](https://docs.vectara.com/docs/learn/mmr-reranker), [user-defined function reranker](https://docs.vectara.com/docs/learn/user-defined-function-reranker), and a [chain reranker](https://docs.vectara.com/docs/learn/chain-reranker) that chains multiple reranking methods together, giving finer control over the reranking and combining the strengths of each method.

4. An option to create a [generative summary](https://docs.vectara.com/docs/learn/grounded-generation/grounded-generation-overview) with a wide selection of LLM summarizers (including Vectara's [Mockingbird](https://vectara.com/blog/mockingbird-is-a-rag-specific-llm-that-beats-gpt-4-gemini-1-5-pro-in-rag-output-quality/), trained specifically for RAG-based tasks), based on the retrieved documents, including citations.

See the [Vectara API documentation](https://docs.vectara.com/docs/) for more information on how to use the API.

The main benefits of using Vectara RAG-as-a-service to build your application are:
* **Accuracy and Quality**: Vectara provides an end-to-end platform that focuses on eliminating hallucinations, reducing bias, and safeguarding copyright integrity.
* **Security**: Vectara's platform provides access control, protecting against prompt injection attacks, and meets SOC2 and HIPAA compliance.
* **Explainability**: Vectara makes it easy to troubleshoot bad results by clearly explaining rephrased queries, LLM prompts, retrieved results, and agent actions.

## About LlamaIndex

LlamaIndex is a "data framework" to help you build LLM apps:

1. It includes **data connectors** to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)
2. It provides ways to **structure your data** (indices, graphs) so that this data can be easily used with LLMs.
3. It provides an **advanced retrieval/query interface over your data**: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.

LlamaIndex's high-level API allows beginner users to use LlamaIndex to ingest and query their data in just a few lines of code, whereas its lower-level APIs allow advanced users to customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules), to fit their needs.

Vectara is implemented in LlamaIndex as a [Managed Service](https://docs.llamaindex.ai/en/stable/community/integrations/managed_indices.html#vectara), abstracting all of Vectara's powerful APIs so that they are easily integrated into LlamaIndex.

In this notebook, we will demonstrate some of the great ways you can use Vectara together with LlamaIndex.

## Getting Started

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

To get started with Vectara, [sign up](https://console.vectara.com/signup?utm_source=vectara&utm_medium=signup&utm_term=DevRel&utm_content=example-notebooks&utm_campaign=vectara-signup-DevRel-example-notebooks) (if you haven't already) and follow our [quickstart](https://docs.vectara.com/docs/quickstart) guide to create a corpus and an API key. 

Once you have these, you can provide them as environment variables, which will be used by the LlamaIndex code later on:

In [1]:
# !pip install -U llama-index llama-index-indices-managed-vectara arxiv

import os
# os.environ['VECTARA_API_KEY'] = "<YOUR_VECTARA_API_KEY>"
# os.environ['VECTARA_CORPUS_ID'] = "<YOUR_VECTARA_CORPUS_ID>"
# os.environ['VECTARA_CUSTOMER_ID'] = "<YOUR_VECTARA_CUSTOMER_ID>"
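
Before moving on, it can be useful to confirm that the variables are actually set. The following is a minimal sketch: the variable names come from the cell above, while the helper `missing_vectara_vars` is a hypothetical convenience function and not part of LlamaIndex or Vectara.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = ("VECTARA_API_KEY", "VECTARA_CORPUS_ID", "VECTARA_CUSTOMER_ID")


def missing_vectara_vars(env=os.environ):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_vectara_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this check up front surfaces a missing credential immediately instead of at the first indexing or query call.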

## Loading Data Into Vectara

As mentioned above, Vectara is a RAG managed service, and in many cases data may be uploaded to the index ahead of time (e.g. by using [Airbyte](https://docs.airbyte.com/integrations/destinations/vectara), directly via Vectara's [indexing API](https://docs.vectara.com/docs/api-reference/indexing-apis/indexing) or using tools like [vectara-ingest](https://github.com/vectara/vectara-ingest)), but another easy way is via the VectaraIndex constructor: `from_documents()`.

For this notebook, we will assume the Vectara corpus is empty and will load PDF documents from arXiv, using Python's [arxiv](https://github.com/lukasschwab/arxiv.py) library. We will pull in the top papers whose titles relate to embedding models or sentence embeddings:

In [2]:
import arxiv

client = arxiv.Client()
search = arxiv.Search(
  query = "(ti:embedding model) OR (ti:sentence embedding)",
  max_results = 100,
  sort_by = arxiv.SortCriterion.Relevance
)
papers = list(client.results(search))

In [3]:
[p.entry_id for p in papers][:5]

['http://arxiv.org/abs/2402.14776v2',
 'http://arxiv.org/abs/2007.01852v2',
 'http://arxiv.org/abs/1910.13291v1',
 'http://arxiv.org/abs/2104.06719v1',
 'http://arxiv.org/abs/1511.08198v3']

Next, download the arXiv papers and upload them into Vectara using `insert_file()`:

In [4]:
import shutil
from llama_index.indices.managed.vectara import VectaraIndex

data_folder = 'temp'
os.makedirs(data_folder, exist_ok=True)

# Create Vectara Index
index = VectaraIndex()

# Upload content for all papers
for paper in papers:
    try:
        paper_fname = paper.download_pdf(data_folder)
    except Exception as e:
        # paper_fname is not assigned when the download fails, so report the entry ID instead
        print(f"Paper {paper.entry_id} failed to download with error {e}")
        continue
    metadata = {
        'url': paper.pdf_url,
        'title': paper.title,
        'author': str(paper.authors[0]),
        'published': str(paper.published.date())
    }
    index.insert_file(file_path=paper_fname, metadata=metadata)

shutil.rmtree(data_folder)
del papers

Two important things to note here:
1. Vectara processes each file uploaded on the backend, and performs appropriate chunking. So you don't need to apply any local processing, or choose a chunking strategy. 
2. We have used the fields `url`, `title`, `author`, and `published` as metadata fields (for simplicity, author is the first author if there are multiple). You will need to make sure those fields are defined in your Vectara corpus as [filterable metadata fields](https://docs.vectara.com/docs/learn/metadata-search-filtering/filter-overview) so that we can filter by them at query time.

So that's it for upload. 

## Querying with the VectaraIndex
We can now ask questions using the `VectaraIndex` object.

In [5]:
query = "What is sentence embedding?"

In [6]:
query_engine = index.as_query_engine(
    summary_enabled=True, summary_num_results=5,
    summary_response_lang="eng",
    summary_prompt_name="mockingbird-1.0-2024-07-16"
)
res = query_engine.query(query)
print(res.response)

Based on the provided sources, here is a summary that answers the query "What is sentence embedding?":

Sentence embedding is a form of word or sentence representation that maps text data into vectors, which can be a set of real numbers (a vector) [1]. It is a way to represent words or sentences in a text that encodes the meaning of the word or the sentence in n-dimensional space [1]. The goal of sentence embedding is to make the embeddings of two sentences that are similar to get closer in this vector space [2]. This is achieved by training models such as SimCSE, BERT, or RoBERTa, which input a sentence and output a vector representation of it [3]. The vector representation captures the meaning of the sentence, and sentences that are similar in meaning will have vectors that are closer together in the vector space [1][2].

Sources: [1], [2], [3]


Note that the response here is fully generated by Vectara. There is no additional LLM involved (and no extra API key you need to set up). The response also includes citations (marked in square brackets), which point to the references Vectara used to generate this response.
<br>
The `res` object includes the actual response to the user query, but also has the citations:

In [7]:
[(inx, n.node.metadata['url']) for inx, n in enumerate(res.source_nodes)]

[(0, 'http://arxiv.org/pdf/2206.02690v3'),
 (1, 'http://arxiv.org/pdf/1910.13291v1'),
 (2, 'http://arxiv.org/pdf/2305.03010v1'),
 (3, 'http://arxiv.org/pdf/1904.05542v1'),
 (4, 'http://arxiv.org/pdf/1904.05542v1'),
 (5, 'http://arxiv.org/pdf/1904.05542v1'),
 (6, 'http://arxiv.org/pdf/1904.05542v1'),
 (7, 'http://arxiv.org/pdf/1904.05542v1'),
 (8, 'http://arxiv.org/pdf/1904.05542v1'),
 (9, 'http://arxiv.org/pdf/1904.05542v1')]
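
To connect the bracketed markers in the response text to these URLs, a small helper can extract the citation numbers and look them up. This is only a sketch: `cited_sources` is a hypothetical function, and it assumes that marker `[n]` refers to the n-th entry of `source_nodes` (1-based), which you should verify against your own results.

```python
import re


def cited_sources(response_text, source_urls):
    """Map each [n] marker in the text to a URL, ASSUMING 1-based numbering."""
    cited = {}
    for match in re.finditer(r"\[(\d+)\]", response_text):
        n = int(match.group(1))
        # Ignore markers that fall outside the list of sources.
        if 1 <= n <= len(source_urls):
            cited[n] = source_urls[n - 1]
    return dict(sorted(cited.items()))


# Toy inputs: a shortened response and the first three URLs from the output above.
text = "Sentence embedding maps text into vectors [1]. Similar sentences get closer [2][1]."
urls = [
    "http://arxiv.org/pdf/2206.02690v3",
    "http://arxiv.org/pdf/1910.13291v1",
    "http://arxiv.org/pdf/2305.03010v1",
]
print(cited_sources(text, urls))
```

In the notebook, the same helper could be called as `cited_sources(res.response, [n.node.metadata['url'] for n in res.source_nodes])`.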

## Using Streaming

You can also stream the Vectara response simply by specifying `streaming=True`:

In [8]:
query_engine = index.as_query_engine(
    summary_enabled=True,
    summary_prompt_name="mockingbird-1.0-2024-07-16",
    streaming=True)

res = query_engine.query(query)

# print streamed output chunk by chunk
for chunk in res.response_gen:
    print(chunk.delta or "", end="", flush=True)

Based on the provided sources, here is a summary that answers the query "What is sentence embedding?":

Sentence embedding is a form of word or sentence representation that maps text data into vectors, which can be a set of real numbers (a vector) [1]. It is a way to represent words or sentences in a text that encodes the meaning of the word or the sentence in n-dimensional space [1]. The goal of sentence embedding is to make the embeddings of two sentences that are similar to get closer in this vector space [2]. This is achieved by training models such as SimCSE, BERT, or RoBERTa, which input sentence representations and output sentence embeddings [3].

In the context of natural language processing, sentence embedding is used to capture the meaning of a sentence and is used in tasks such as multiple choice question answering, next sentence prediction, and paraphrase identification [2]. It is also used in models such as LSTM, SDAE, and NMT, which use sentence embeddings as input repres

## Reranking

Vectara supports three types of [reranking](https://docs.vectara.com/docs/api-reference/search-apis/reranking):
1. [Maximal Marginal Relevance](https://docs.vectara.com/docs/learn/mmr-reranker), or MMR, provides a reranking that can promote diversity in results at the cost of relevance.
2. [Slingshot](https://docs.vectara.com/docs/learn/vectara-multi-lingual-reranker) is a multilingual reranker that increases the accuracy of retrieved results across 100+ languages and is available to Vectara Scale customers.
3. [User Defined Functions](https://docs.vectara.com/docs/learn/user-defined-function-reranker) allow you to create your own functions for reranking search results, unlocking better retrieval in a wide variety of use cases, such as sorting by recency or price of a product.

Let's see an example of how to use MMR. We will run the same query, this time with MMR, where `mmr_diversity_bias=0.3` sets the tradeoff between relevance and diversity (0.0 is full relevance, 1.0 is only diversity):

In [9]:
query_engine = index.as_query_engine(
    similarity_top_k=5,
    reranker="mmr",
    rerank_k=50,
    mmr_diversity_bias=0.3,
)
response = query_engine.query(query)
print(response)

Sentence embedding is a method of representing words or sentences in a text by encoding their meaning in n-dimensional space. It involves mapping text data into vectors of real numbers, where words or sentences closer in the vector space are more similar. Different types of embeddings exist, such as traditional, static, and contextualized word embeddings, as well as non-parameterized and parameterized models for sentence embeddings. This technique is crucial in preparing texts for machine understanding and various natural language processing tasks.


In [10]:
[(inx, n.node.metadata['url']) for inx, n in enumerate(response.source_nodes)]

[(0, 'http://arxiv.org/pdf/2206.02690v3'),
 (1, 'http://arxiv.org/pdf/1904.05542v1'),
 (2, 'http://arxiv.org/pdf/2305.15077v2'),
 (3, 'http://arxiv.org/pdf/2404.17606v1'),
 (4, 'http://arxiv.org/pdf/1910.03375v1')]

As you can see, the results are now reranked in a way that provides more diversity instead of maximizing pure relevance. This in turn results in a different set of chunks used to generate the response.
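
To build intuition for what the diversity bias does, here is a minimal sketch of the textbook greedy MMR procedure in plain Python. It is not Vectara's implementation: the function names, the toy relevance scores, and the toy vectors are all illustrative assumptions. Each step picks the candidate that maximizes `(1 - bias) * relevance - bias * redundancy`, where redundancy is the highest cosine similarity to anything already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(relevance, vectors, diversity_bias, top_k):
    """Greedy MMR: return the indices of the selected items in selection order."""
    remaining = list(range(len(relevance)))
    selected = []
    while remaining and len(selected) < top_k:
        def mmr_score(i):
            # Redundancy is the highest similarity to any already-selected item.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return (1 - diversity_bias) * relevance[i] - diversity_bias * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: items 0 and 1 are near-duplicates, item 2 is different but less relevant.
relevance = [0.9, 0.85, 0.5]
vectors = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

print(mmr_rerank(relevance, vectors, diversity_bias=0.0, top_k=2))  # pure relevance
print(mmr_rerank(relevance, vectors, diversity_bias=0.5, top_k=2))  # balanced
```

With a bias of 0.0 the two near-duplicates are returned, while a bias of 0.5 swaps the second duplicate for the less relevant but distinct item, which mirrors the wider spread of source URLs seen above.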

Now let's see an example with a user defined function. We may be interested in getting results that are the most semantically similar to our question, but we also want the most up-to-date information. Thus, we can bias our search results so that the papers that are not only semantically similar but also published more recently are used to answer our query. We can do this by using the available time functions (to see other built-in functions, see the UDF Reranker [documentation](https://docs.vectara.com/docs/learn/user-defined-function-reranker)).

Vectara also supports chain-reranking, which provides a way to chain together multiple reranking methods to achieve better control over the reranking, and combining the strengths of various reranking methods. A great way to use the UDF reranker is in a chain: first the multilingual reranker, followed by the maximal marginal relevance (MMR) reranker, and then a user-defined function, as shown below:

In [11]:
query_engine = index.as_query_engine(
    similarity_top_k = 50,
    reranker="chain",
    rerank_chain=[
        {
            "type": "slingshot"
        },
        {
            "type": "mmr",
            "diversity_bias": 0.3
        },
        {
            "type": "udf",
            "user_function": "max(0, 5 * get('$.score') - hours(seconds((to_unix_timestamp(now()) - to_unix_timestamp(datetime_parse(get('$.document_metadata.published'), 'yyyy-MM-dd'))))) / 24 / 365)",
            "limit": 5
        }
    ]
)

response = query_engine.query("What innovations have been made to sentence embedding models?")
print(response)

Innovations in sentence embedding models include the development of Espresso Sentence Embeddings (ESE) supporting model depth and embedding size scaling [1]. Additionally, there is a shift towards generative models like PromptEOL, which enhance embeddings by introducing an Explicit One word Limitation (EOL) into prompts [2]. Furthermore, research is focusing on computationally efficient direct inference methods for sentence representation, bridging the gap in fine-tuning scenarios [3]. Lastly, advancements like Backward Dependency Enhanced Large Language Models (BeLLM) are exploring the effects of backward dependencies in LLMs for semantic similarity measurements [5].


In [12]:
[(inx, n.node.metadata['published']) for inx, n in enumerate(response.source_nodes)]

[(0, '2024-02-22'),
 (1, '2024-04-05'),
 (2, '2024-04-05'),
 (3, '2023-11-16'),
 (4, '2023-11-09')]

Notice how many of the papers used to generate the final summary were published recently, and they still give us information to generate a relevant response that answers our question.

Also notice how we use a `max()` function with 0 in our user-defined expression. This ensures that all of our reranking scores are non-negative. Additionally, since we multiplied the original score by 5 and its value ranges from 0 to 1, we throw away any search results that are more than 5 years old when generating our final response.
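
The arithmetic of the user-defined expression in the cell above, `max(0, 5 * score - age_in_years)`, can be reproduced locally as a rough sketch. The helper `recency_adjusted_score` is hypothetical, it approximates age in whole days rather than seconds, and the score of 0.8 and the two older dates are made-up inputs for illustration (only `2024-02-22` appears in the output above).

```python
from datetime import date


def recency_adjusted_score(score, published, today, weight=5.0):
    """Mirror max(0, weight * score - age_in_years), using 365-day years."""
    age_years = (today - date.fromisoformat(published)).days / 365
    return max(0.0, weight * score - age_years)


today = date(2024, 9, 1)  # fixed date so the example is reproducible
for published in ("2024-02-22", "2019-10-29", "2015-11-25"):
    print(published, round(recency_adjusted_score(0.8, published, today), 2))
```

Because the weighted relevance can never exceed `weight`, any document older than `weight` years ends up with a score of 0 regardless of how relevant it is.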

So far we've used Vectara's internal summarization capability, which is the best way for most users.

You can still use LlamaIndex's standard VectorStore `as_query_engine()` method, in which case Vectara's summarization won't be used, and you would be using an external LLM (e.g. OpenAI's GPT-4) and a custom prompt from LlamaIndex to generate the summary. For this option, just set `summary_enabled=False`.

For this functionality, you will need to specify your own OpenAI API key in the environment:

> `os.environ['OPENAI_API_KEY'] = '<YOUR_OPENAI_API_KEY>'`

In [13]:
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4-turbo", temperature=0)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    summary_enabled=False,
    llm=llm
)
response = query_engine.query(query)
print(response)

Sentence embedding is a method used to represent sentences in a text by encoding their meaning into n-dimensional space. This technique involves mapping sentences into vectors of real numbers, where each vector represents the semantic content of a sentence. The proximity of these vectors in the vector space indicates the similarity between the meanings of the sentences. Sentence embeddings are utilized in various machine learning applications to handle and process text data effectively.


## Using Vectara Chat

Vectara now fully supports Chat in its platform, where the chat history is maintained by Vectara and so you don't have to worry about keeping history and integrating it with your RAG pipeline. 

To use it, simply call `as_chat_engine()`.

(Chat mode always uses Vectara's summarization so you don't have to explicitly specify `summary_enabled=True` like before)

In [14]:
ce = index.as_chat_engine()

In [15]:
questions = [
    'What is a sentence embedding model?',
    'What are some known models?',
    'How are they different than token embedding models'
]

for q in questions:
    print(f"Question: {q}\n")
    response = ce.chat(q).response
    print(f"Response: {response}\n")

Question: What is a sentence embedding model?

Response: A sentence embedding model is a framework that converts sentences into numerical vectors, capturing their semantic meanings. These models, such as Sentence-BERT, SimCSE-BERT, SimCSE-RoBERTa, Sentence-T5, and MP-Net, are used for tasks like information retrieval and semantic textual similarity benchmarks. They play a crucial role in various applications by representing sentences in a continuous vector space for efficient processing and analysis.

Question: What are some known models?

Response: Some known sentence embedding models include Sentence-BERT, SimCSE-BERT, SimCSE-RoBERTa, Sentence-T5, MP-Net, GenSen, and DSE. These models have been used for various tasks such as embedding inversion attacks and semantic textual similarity benchmarks. Additionally, there is ongoing research and development in this area, with a focus on both supervised and unsupervised models for constructing sentence embeddings.

Question: How are they dif

Of course streaming works as well with Chat:

In [16]:
ce = index.as_chat_engine(streaming=True)

In [17]:
response = ce.stream_chat("Who is behind SBERT?")
for chunk in response.chat_stream:
    print(chunk.delta or "", end="", flush=True)

The individuals behind SBERT are Reimers and Gurevych, as they proposed the SBERT network structure in 2019. SBERT is based on transformer models like BERT and fine-tunes them using a siamese network structure [1].

# Advanced RAG with Vectara and LlamaIndex

## Agentic RAG

Vectara also has its own package, [vectara-agentic](https://github.com/vectara/py-vectara-agentic), built on top of many features from LlamaIndex to easily implement agentic RAG applications. It allows you to create your own AI assistant with RAG query tools and other custom tools, such as making API calls to retrieve information from financial websites. You can find the full documentation for vectara-agentic [here](https://vectara.github.io/vectara-agentic-docs/).

Let's create a ReAct Agent with a single RAG tool using vectara-agentic (to create a ReAct agent, specify `VECTARA_AGENTIC_AGENT_TYPE` as `"REACT"` in your environment).

Vectara does not yet have an LLM capable of acting as an agent for planning and tool use, so we will need to use another LLM as the driver of the agent reasoning.

In this demo, we are using OpenAI's GPT-4o. Please make sure you have `OPENAI_API_KEY` defined in your environment or specify another LLM with the corresponding key (for the full list of supported LLMs, check out our [documentation](https://vectara.github.io/vectara-agentic-docs/introduction.html#try-it-yourself) for setting up your environment).

In [18]:
# !pip install -U vectara-agentic

In [19]:
from vectara_agentic.agent import Agent
from IPython.display import display, Markdown

agent = Agent.from_corpus(
    data_description="sentence embeddings",
    assistant_specialty="sentence embeddings research",
    tool_name="ask_embeddings",
    vectara_summary_num_results=5,
    vectara_summarizer="mockingbird-1.0-2024-07-16",
    vectara_reranker="mmr",
    vectara_rerank_k=50,
    verbose=True,
)

response = agent.chat(
    "Tell me about the latest innovations in sentence embedding models."
)

display(Markdown(response))

Initializing vectara-agentic version 0.1.17...
No observer set.
> Running step 36031c69-ab3e-4952-8e6b-d07b4310a628. Step input: Tell me about the latest innovations in sentence embedding models.
Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: ask_embeddings
Action Input: {'query': 'latest innovations in sentence embedding models'}
Observation: 
                    Response: '''The latest innovations in sentence embedding models include Sentence-BERT, SimCSE-BERT/SimCSE-RoBERTa, Sentence-T5, and MP-Net [1]. These models have been found to have isotropy issues, which can lead to alignment problems in areas such as domain adaptation, word embedding evaluation, and machine translation [2]. Additionally, there are concerns about the trustworthiness of reported results in the field of NLP, with some studies suggesting that neural language models have been misleadingly evaluated [5]. A novel sente

Recent innovations in sentence embedding models include advancements such as Sentence-BERT, SimCSE-BERT/SimCSE-RoBERTa, Sentence-T5, and MP-Net. These models have been noted for having isotropy issues, which can cause alignment problems in various applications like domain adaptation and machine translation. Additionally, there are concerns about the reliability of reported results in NLP, with some studies indicating that neural language models might have been misleadingly evaluated. A new model called Espresso Sentence Embeddings (ESE) has been introduced, which supports scaling in both model depth and embedding size, aiming to address the limitations of existing models and provide a more effective and trustworthy approach. 

For more detailed information, you can refer to the following sources: [arXiv:2305.03010v1](http://arxiv.org/pdf/2305.03010v1), [arXiv:2408.08073v1](http://arxiv.org/pdf/2408.08073v1), and [arXiv:2402.14776v2](http://arxiv.org/pdf/2402.14776v2).

## Summary

In this notebook we've seen various examples for using Vectara with LlamaIndex, which provides the following benefits:
* Vectara provides a complete RAG pipeline, so you don't have to deal with a lot of the details around data ingestion: pre-processing, chunking, embedding, etc. Instead all these steps are handled automatically and efficiently in Vectara. 
* Being a platform, Vectara uses its own internal Embedding model (Boomerang), its own vector storage, and its own LLM (Mockingbird) for summarization, so you don't have to maintain separate API keys and relationships with additional vendors or install other products.
* Vectara is built for large-scale GenAI applications, and with the vectara-agentic package, you can easily build and test advanced RAG applications at enterprise scale.