<a href="https://colab.research.google.com/github/vectara/example-notebooks/blob/main/notebooks/using-vectara-with-llamaindex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vectara and LlamaIndex

In this notebook we are going to show how to use Vectara with LlamaIndex.

## About Vectara

[Vectara](https://vectara.com/) is the trusted AI Assistant and Agent platform which focuses on enterprise readiness for mission-critical applications. 

Vectara provides an end-to-end managed service for Retrieval Augmented Generation or [RAG](https://vectara.com/grounded-generation/), which includes:

1. An integrated API for processing input data, including text extraction from documents and ML-based chunking.

2. The state-of-the-art [Boomerang](https://vectara.com/how-boomerang-takes-retrieval-augmented-generation-to-the-next-level-via-grounded-generation/) embeddings model. Each text chunk is encoded into a vector embedding using Boomerang, and stored in the Vectara internal knowledge (vector+text) store. Thus, when using Vectara with LlamaIndex you do not need to call a separate embedding model - this happens automatically within the Vectara backend.

3. A query service that automatically encodes the query into embeddings and retrieves the most relevant text segments through [hybrid search](https://docs.vectara.com/docs/api-reference/search-apis/lexical-matching) and a variety of [reranking](https://docs.vectara.com/docs/api-reference/search-apis/reranking) strategies. These include a [multilingual reranker](https://docs.vectara.com/docs/learn/vectara-multi-lingual-reranker), a [maximal marginal relevance (MMR) reranker](https://docs.vectara.com/docs/learn/mmr-reranker), a [user-defined function reranker](https://docs.vectara.com/docs/learn/user-defined-function-reranker), and a [chain reranker](https://docs.vectara.com/docs/learn/chain-reranker) that chains multiple reranking methods together, giving finer control over the final ranking by combining their strengths.

4. An option to create a [generative summary](https://docs.vectara.com/docs/learn/grounded-generation/grounded-generation-overview) of the retrieved documents, including citations, using a wide selection of LLM summarizers (including Vectara's [Mockingbird](https://vectara.com/blog/mockingbird-is-a-rag-specific-llm-that-beats-gpt-4-gemini-1-5-pro-in-rag-output-quality/), trained specifically for RAG-based tasks).
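
To make the "hybrid search" idea concrete, here is a simplified sketch of how a lexical (keyword) score can be blended with an embedding score. It assumes a plain linear interpolation controlled by a single weight, called `lexical_interpolation` here; the function and the scores are hypothetical, and Vectara computes this server-side, so consult the hybrid search docs for the exact scoring.

```python
def hybrid_score(neural_score: float, lexical_score: float, lexical_interpolation: float) -> float:
    """Blend an embedding (neural) score with a keyword (lexical) score.

    Simplified sketch that ASSUMES plain linear interpolation:
    0.0 = purely neural retrieval, 1.0 = purely lexical retrieval.
    """
    if not 0.0 <= lexical_interpolation <= 1.0:
        raise ValueError("lexical_interpolation must be between 0 and 1")
    return (1 - lexical_interpolation) * neural_score + lexical_interpolation * lexical_score


# Hypothetical (neural, lexical) scores for two chunks.
candidates = {
    "semantic_match": (0.82, 0.10),  # close in meaning, few shared keywords
    "keyword_match": (0.55, 0.95),   # exact keyword hit, weaker semantic match
}

for lam in (0.0, 0.1, 0.5):
    ranking = sorted(candidates, key=lambda name: hybrid_score(*candidates[name], lam), reverse=True)
    print(lam, ranking)
```

With a small lexical weight the semantic match still ranks first; at 0.5 the exact keyword hit overtakes it, which is the usual motivation for mixing in some lexical matching.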

See the [Vectara API documentation](https://docs.vectara.com/docs/) for more information on how to use the API.

The main benefits of using Vectara RAG-as-a-service to build your application are:
* **Accuracy and Quality**: Vectara provides an end-to-end platform that focuses on eliminating hallucinations, reducing bias, and safeguarding copyright integrity.
* **Security**: Vectara's platform provides access control, protects against prompt injection attacks, and meets SOC2 and HIPAA compliance.
* **Explainability**: Vectara makes it easy to troubleshoot bad results by clearly explaining rephrased queries, LLM prompts, retrieved results, and agent actions.

## About LlamaIndex

LlamaIndex is a "data framework" to help you build LLM apps:

1. It includes **data connectors** to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)
2. It provides ways to **structure your data** (indices, graphs) so that this data can be easily used with LLMs.
3. It provides an **advanced retrieval/query interface over your data**: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.

LlamaIndex's high-level API allows beginner users to use LlamaIndex to ingest and query their data in just a few lines of code, whereas its lower-level APIs allow advanced users to customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules), to fit their needs.

Vectara is implemented in LlamaIndex as a [Managed Service](https://docs.llamaindex.ai/en/stable/community/integrations/managed_indices.html#vectara), which wraps Vectara's API so that its capabilities integrate easily into LlamaIndex.

In this notebook, we will demonstrate some of the great ways you can use Vectara together with LlamaIndex.

## Getting Started

If you're opening this notebook in Colab, you will probably need to install LlamaIndex 🦙.

To get started with Vectara, [sign up](https://console.vectara.com/signup?utm_source=vectara&utm_medium=signup&utm_term=DevRel&utm_content=example-notebooks&utm_campaign=vectara-signup-DevRel-example-notebooks) (if you haven't already) and follow our [quickstart](https://docs.vectara.com/docs/quickstart) guide to create a corpus and an API key. 

Once you have these, you can provide them as environment variables, which will be used by the LlamaIndex code later on:

In [1]:
# !pip install -U llama-index llama-index-indices-managed-vectara arxiv

import os
# os.environ['VECTARA_API_KEY'] = "<YOUR_VECTARA_API_KEY>"
# os.environ['VECTARA_CORPUS_ID'] = "<YOUR_VECTARA_CORPUS_ID>"
# os.environ['VECTARA_CUSTOMER_ID'] = "<YOUR_VECTARA_CUSTOMER_ID>"
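
Before moving on, it can help to confirm that these variables are actually set. The helper below is a hypothetical convenience, not part of LlamaIndex or Vectara; the three variable names are the ones used in the cell above.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VECTARA_VARS = ("VECTARA_API_KEY", "VECTARA_CORPUS_ID", "VECTARA_CUSTOMER_ID")


def missing_env_vars(names=REQUIRED_VECTARA_VARS, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Please set:", ", ".join(missing))
```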

## Loading Data Into Vectara

As mentioned above, Vectara is a RAG managed service, and in many cases data may be uploaded to the index ahead of time (e.g. by using [Airbyte](https://docs.airbyte.com/integrations/destinations/vectara), directly via Vectara's [indexing API](https://docs.vectara.com/docs/api-reference/indexing-apis/indexing), or with tools like [vectara-ingest](https://github.com/vectara/vectara-ingest)). Another easy way is to load documents from LlamaIndex itself, for example with the `VectaraIndex.from_documents()` class method.

For this notebook, we will assume the Vectara corpus is empty and will load PDF documents from Arxiv, using Python's [arxiv](https://github.com/lukasschwab/arxiv.py) library. We will pull in data from the top 100 papers (by relevance) related to embedding models and sentence embeddings:

In [2]:
import arxiv

client = arxiv.Client()
search = arxiv.Search(
    query="(ti:embedding model) OR (ti:sentence embedding)",
    max_results=100,
    sort_by=arxiv.SortCriterion.Relevance,
)
papers = list(client.results(search))

In [3]:
[p.entry_id for p in papers][:5]

['http://arxiv.org/abs/2402.14776v2',
 'http://arxiv.org/abs/2007.01852v2',
 'http://arxiv.org/abs/1910.13291v1',
 'http://arxiv.org/abs/2104.06719v1',
 'http://arxiv.org/abs/1511.08198v3']

Next, we download each Arxiv paper and upload it into Vectara using `insert_file()`:

In [4]:
import shutil
from llama_index.indices.managed.vectara import VectaraIndex

data_folder = 'temp'
os.makedirs(data_folder, exist_ok=True)

# Create Vectara Index
index = VectaraIndex()

# Upload content for all papers
for paper in papers:
    try:
        paper_fname = paper.download_pdf(data_folder)
    except Exception as e:
        # paper_fname is not assigned when download_pdf() raises, so report the paper's ID instead
        print(f"File for {paper.entry_id} failed to load with error {e}")
        continue
    metadata = {
        'url': paper.pdf_url,
        'title': paper.title,
        'author': str(paper.authors[0]),
        'published': str(paper.published.date())
    }
    index.insert_file(file_path=paper_fname, metadata=metadata)

shutil.rmtree(data_folder)
del papers

File temp/1909.03104v2.Efficient_Sentence_Embedding_using_Discrete_Cosine_Transform.pdf failed to load with error HTTP Error 404: Not Found


Two important things to note here:
1. Vectara processes each uploaded file on the backend and performs appropriate chunking, so you don't need to apply any local processing or choose a chunking strategy.
2. We have used the fields `url`, `title`, `author`, and `published` as metadata fields (for simplicity, `author` is the first author when there are several). You will need to make sure those fields are defined in your Vectara corpus as [filterable metadata fields](https://docs.vectara.com/docs/learn/metadata-search-filtering/filter-overview) so that we can filter by them at query time.
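
The upload loop builds each paper's metadata inline. Pulled out into a function, it is easier to see the exact value formats Vectara receives, in particular that `published` becomes an ISO `YYYY-MM-DD` string. The function name and the stand-in paper object are hypothetical (used so the sketch runs without downloading anything, with a placeholder URL); the field logic is copied from the loop.

```python
import datetime
from types import SimpleNamespace


def paper_metadata(paper) -> dict:
    """Build the same metadata dict as the upload loop (hypothetical helper)."""
    return {
        "url": paper.pdf_url,
        "title": paper.title,
        "author": str(paper.authors[0]),  # first author only
        "published": str(paper.published.date()),  # ISO 'YYYY-MM-DD' string
    }


# Hypothetical stand-in for an arxiv search result, so this runs offline.
fake_paper = SimpleNamespace(
    pdf_url="http://arxiv.org/pdf/0000.00000v1",  # placeholder, not a real paper
    title="A Hypothetical Paper",
    authors=["Jane Doe", "John Roe"],
    published=datetime.datetime(2018, 8, 16, 12, 30, tzinfo=datetime.timezone.utc),
)
print(paper_metadata(fake_paper))
```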

So that's it for upload. 

## Querying with the VectaraIndex
We can now ask questions using the `VectaraIndex` object.

In [5]:
query = "What is sentence embedding?"

In [6]:
query_engine = index.as_query_engine(
    summary_enabled=True, summary_num_results=5,
    summary_response_lang="eng",
    summary_prompt_name="mockingbird-1.0-2024-07-16"
)
res = query_engine.query(query)
print(res.response)

Based on the provided sources, sentence embedding can be summarized as follows:

Sentence embedding is a form of word or sentence representation that maps text data into vectors, which can be a set of real numbers (a vector) [1]. It is a term used to represent words or sentences in a text that encodes the meaning of the word or the sentence in n-dimensional space [1]. The goal of sentence embedding is to make the embeddings of two sentences that are similar to get closer in this vector space [2]. This is achieved by training the sentence embedding model to capture the meaning of the sentence, and it is expected that sentences that are closer in the vector space are more similar [1].

There are different approaches to sentence embedding, including traditional word embedding, static word embedding, contextualized word embedding, and two-sentence embeddings approach, with non-parameterized and parameterized models [1]. Sentence embedding has gained attention in recent years, particularly 

Note that the response here is fully generated by Vectara. There is no additional LLM involved (and no extra API key to set up). The response also includes citations (marked in square brackets), which point to the references Vectara used to generate this response.

Besides the response text, the `res` object also contains the retrieved source nodes, including their metadata:

In [7]:
[(inx, n.node.metadata['url']) for inx, n in enumerate(res.source_nodes)]

[(0, 'http://arxiv.org/pdf/2206.02690v3'),
 (1, 'http://arxiv.org/pdf/1910.13291v1'),
 (2, 'http://arxiv.org/pdf/2305.03010v1'),
 (3, 'http://arxiv.org/pdf/2305.03010v1'),
 (4, 'http://arxiv.org/pdf/1904.05542v1'),
 (5, 'http://arxiv.org/pdf/1904.05542v1'),
 (6, 'http://arxiv.org/pdf/1904.05542v1'),
 (7, 'http://arxiv.org/pdf/1904.05542v1'),
 (8, 'http://arxiv.org/pdf/1904.05542v1'),
 (9, 'http://arxiv.org/pdf/1904.05542v1')]
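
If you want to turn the bracketed citations into links programmatically, a minimal sketch follows. Extracting the `[n]` markers with a regular expression is straightforward; the mapping step *assumes* that `[n]` refers to the n-th entry (1-based) of the URL list you pass in, which you should verify against your own responses. Both helper names are hypothetical.

```python
import re

CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def cited_indices(summary: str) -> list[int]:
    """Distinct citation numbers like [1], [2], in order of first appearance."""
    seen = []
    for match in CITATION_PATTERN.finditer(summary):
        n = int(match.group(1))
        if n not in seen:
            seen.append(n)
    return seen


def cited_urls(summary: str, urls: list[str]) -> dict[int, str]:
    """Map each citation number to a URL, ASSUMING [n] refers to urls[n - 1]."""
    return {n: urls[n - 1] for n in cited_indices(summary) if 1 <= n <= len(urls)}


summary = "Sentence embedding maps text to vectors [1]. Similar sentences end up close [2][1]."
print(cited_urls(summary, ["http://example.org/a", "http://example.org/b"]))
```

In the notebook the call would look like `cited_urls(res.response, [n.node.metadata['url'] for n in res.source_nodes])`.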

## Using Streaming

You can also stream the Vectara response simply by specifying `streaming=True`:

In [8]:
query_engine = index.as_query_engine(
    summary_enabled=True,
    summary_prompt_name="mockingbird-1.0-2024-07-16",
    streaming=True)

res = query_engine.query(query)

# print streamed output chunk by chunk
for chunk in res.response_gen:
    print(chunk.delta or "", end="", flush=True)

Based on the provided sources, sentence embedding is a representation of a sentence in a vector space that encodes the meaning of the sentence. It is a form of word or sentence representation that prepares texts in an understandable format for a machine [1]. Sentence embeddings are expected to map sentences that are closer in the vector space to be more similar [1]. There are different approaches to sentence embeddings, including non-parameterized and parameterized models [2]. Sentence embeddings have gained attention in recent years, particularly in natural language processing (NLP), information extraction (IE), and neural machine translation (NMT) tasks [2].

Sources: [1], [2]
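
The `chunk.delta or ""` expression in the streaming loop guards against chunks whose `delta` is `None`. The sketch below applies the same pattern to collect the streamed text into a single string; `FakeChunk` and `fake_stream` are hypothetical stand-ins so it runs without a Vectara connection.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed chunk exposing a `.delta` attribute."""
    delta: Optional[str]


def collect_stream(chunks: Iterable[FakeChunk]) -> str:
    """Concatenate streamed deltas, treating a None delta as an empty string."""
    return "".join(chunk.delta or "" for chunk in chunks)


def fake_stream() -> Iterator[FakeChunk]:
    yield FakeChunk("Sentence ")
    yield FakeChunk(None)  # a chunk that carries no new text
    yield FakeChunk("embedding")


print(collect_stream(fake_stream()))
```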

## Reranking

Vectara supports three types of [reranking](https://docs.vectara.com/docs/api-reference/search-apis/reranking):
1. [Maximal Marginal Relevance](https://docs.vectara.com/docs/learn/mmr-reranker), or MMR, provides a reranking that can promote diversity in results at the cost of relevance.
2. [Slingshot](https://docs.vectara.com/docs/learn/vectara-multi-lingual-reranker) is a multilingual reranker that increases the accuracy of retrieved results across 100+ languages and is available to Vectara Scale customers.
3. [User Defined Functions](https://docs.vectara.com/docs/learn/user-defined-function-reranker) allow you to create your own functions for reranking search results, unlocking better retrieval in a wide variety of use cases, such as sorting by recency or price of a product.

Let's see an example of how to use MMR. We will run the same query, but this time with MMR reranking, where `mmr_diversity_bias=0.3` sets the tradeoff between relevance and diversity (0.0 is pure relevance, 1.0 is pure diversity):

In [14]:
query_engine = index.as_query_engine(
    similarity_top_k=5,
    reranker="mmr",
    rerank_k=50,
    mmr_diversity_bias=0.3,
)
response = query_engine.query(query)
print(response)

Sentence embedding is a method of representing words or sentences in a text by encoding their meaning in n-dimensional space. It involves mapping text data into vectors of real numbers, where words or sentences closer in the vector space are more similar. Different types of embeddings exist, such as traditional, static, and contextualized word embeddings, as well as non-parameterized and parameterized models for sentence embeddings. This approach aims to prepare texts in a machine-understandable format, facilitating various natural language processing tasks.


In [10]:
[(inx, n.node.metadata['url']) for inx, n in enumerate(response.source_nodes)]

[(0, 'http://arxiv.org/pdf/2206.02690v3'),
 (1, 'http://arxiv.org/pdf/2305.03010v1'),
 (2, 'http://arxiv.org/pdf/2305.15077v2'),
 (3, 'http://arxiv.org/pdf/2402.12890v1'),
 (4, 'http://arxiv.org/pdf/2404.17606v1')]

As you can see, the results are now reranked in a way that provides more diversity instead of maximizing pure relevance. This in turn results in a different set of chunks used to generate the response.
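
To build intuition for `mmr_diversity_bias`, here is the textbook greedy MMR algorithm in plain Python, written so that the bias follows the convention above (0.0 = pure relevance, 1.0 = pure diversity). It is a sketch, not Vectara's server-side implementation, whose similarity measure and details may differ; the vectors are toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_rerank(query_vec, doc_vecs, k, diversity_bias=0.3):
    """Greedy Maximal Marginal Relevance; returns selected document indices in order.

    Each step picks the candidate maximizing
        (1 - diversity_bias) * sim(query, doc) - diversity_bias * max sim(doc, already selected)
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected, remaining = [], list(range(len(doc_vecs)))

    def mmr_score(i):
        redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
        return (1 - diversity_bias) * relevance[i] - diversity_bias * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)  # ties go to the earliest candidate
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-D "embeddings": docs 0 and 1 are duplicates, doc 2 is less relevant but different.
query = [1.0, 0.0]
docs = [[0.96, 0.28], [0.96, 0.28], [0.8, -0.6]]
print(mmr_rerank(query, docs, k=2, diversity_bias=0.0))
print(mmr_rerank(query, docs, k=2, diversity_bias=0.5))
```

With no diversity bias the duplicate chunk takes the second slot; with a bias of 0.5 the less relevant but different chunk replaces it, the same kind of shift seen in the source nodes above.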

Now let's see an example with a user-defined function. We may want results that are the most semantically similar to our question, but we also want the most up-to-date information. So we can bias the ranking so that papers that are both semantically similar and recently published are used to answer our query. We do this with the built-in time functions (for other built-in functions, see the UDF reranker [documentation](https://docs.vectara.com/docs/learn/user-defined-function-reranker)).

Vectara also supports chain reranking, which chains multiple reranking methods together for finer control over the final ranking, combining the strengths of each. A great way to use the UDF reranker is at the end of a chain: first the multilingual reranker, then the maximal marginal relevance (MMR) reranker, and finally a user-defined function, as shown below:

In [55]:
# Chain reranking: multilingual (Slingshot) reranker -> MMR -> user-defined function.
# Note: Slingshot is available to Vectara Scale customers.
recency_udf = (
    "max(0, 10 * get('$.score') - hours(seconds((to_unix_timestamp(now()) - "
    "to_unix_timestamp(datetime_parse(get('$.document_metadata.published'), 'yyyy-MM-dd'))))) / 24 / 365)"
)

query_engine = index.as_query_engine(
    similarity_top_k=50,
    reranker="chain",
    rerank_chain=[
        {"type": "slingshot"},
        {"type": "mmr", "diversity_bias": 0.3},
        {"type": "udf", "user_function": recency_udf},
    ],
)

response = query_engine.query("What innovations have been made to sentence embedding models?")
print(response)

In [38]:
[(inx, n.node.metadata['published']) for inx, n in enumerate(response.source_nodes)]

[(0, '2017-03-07'),
 (1, '2023-05-04'),
 (2, '2024-04-05'),
 (3, '2024-02-22'),
 (4, '2023-11-09'),
 (5, '2023-07-06'),
 (6, '2022-04-02'),
 (7, '2020-05-22'),
 (8, '2023-05-04'),
 (9, '2022-04-28'),
 (10, '2022-04-22'),
 (11, '2024-04-05'),
 (12, '2016-05-16'),
 (13, '2022-04-28'),
 (14, '2021-10-02'),
 (15, '2018-03-29'),
 (16, '2020-07-03'),
 (17, '2022-04-28'),
 (18, '2019-06-04'),
 (19, '2018-08-16'),
 (20, '2023-11-09'),
 (21, '2020-03-09'),
 (22, '2022-04-28'),
 (23, '2018-10-20'),
 (24, '2022-10-20'),
 (25, '2023-11-16'),
 (26, '2024-04-05'),
 (27, '2024-04-05'),
 (28, '2019-08-14'),
 (29, '2018-06-16'),
 (30, '2019-08-14'),
 (31, '2020-04-21'),
 (32, '2024-05-30'),
 (33, '2024-05-30'),
 (34, '2022-04-02'),
 (35, '2020-06-05'),
 (36, '2022-05-10'),
 (37, '2020-11-02'),
 (38, '2023-07-31'),
 (39, '2018-06-03'),
 (40, '2024-02-22'),
 (41, '2023-05-24'),
 (42, '2022-05-31'),
 (43, '2023-05-04'),
 (44, '2018-02-12'),
 (45, '2016-07-10'),
 (46, '2015-11-25'),
 (47, '2023-11-09'),
 (

Notice that many of the papers used to generate the final summary were published recently, and they still provide the information needed to generate a relevant answer to our question.

Also notice the `max()` with 0 in our user-defined expression: it ensures that all reranking scores are non-negative. Additionally, since the original score ranges from 0 to 1 and we multiply it by 10, any search result older than 10 years ends up with a score of 0, which effectively drops it from consideration for the final response.
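
To see what the UDF expression does numerically, here is a rough Python equivalent. It is an approximation written for illustration (the function name and the fixed "today" date are hypothetical); it measures age as days / 365, like the `hours(...) / 24 / 365` part of the expression, and Vectara's exact time and timezone handling may differ.

```python
from datetime import date


def recency_score(relevance: float, published: str, today: date) -> float:
    """Approximate Python version of: max(0, 10 * score - age_in_years)."""
    pub = date.fromisoformat(published)  # 'yyyy-MM-dd', same format as the metadata
    age_years = (today - pub).days / 365
    return max(0.0, 10 * relevance - age_years)


today = date(2024, 6, 1)  # fixed "now" so the example is reproducible
for pub in ("2024-04-05", "2018-08-16", "2012-01-01"):
    print(pub, round(recency_score(0.8, pub, today), 2))
```

With a relevance of 0.8, a paper from a couple of months earlier keeps almost all of its score, a paper almost six years old keeps roughly a quarter of it, and anything older than 8 years is clamped to 0; a perfect relevance of 1.0 moves that cutoff to 10 years.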

So far we've used Vectara's internal summarization capability, which is the best way for most users.

You can still use LlamaIndex's standard VectorStore `as_query_engine()` method, in which case Vectara's summarization won't be used; instead, an external LLM (e.g. OpenAI's GPT-4) and a custom prompt from LlamaIndex generate the summary. For this option, just set `summary_enabled=False`.

For this functionality, you will need to specify your own OpenAI API key in the environment:

> `os.environ['OPENAI_API_KEY'] = '<YOUR_OPENAI_API_KEY>'`

In [15]:
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4-turbo", temperature=0)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    summary_enabled=False,
    llm=llm
)
response = query_engine.query(query)
print(response)

Sentence embedding is a technique used in natural language processing to represent sentences as vectors in a high-dimensional space. These vectors encode the meaning of the sentences, allowing for various applications such as machine translation, information retrieval, and text classification. The goal is for sentences with similar meanings to be close to each other in this vector space, facilitating tasks that require understanding of sentence semantics.


## Using Vectara Chat

Vectara now fully supports chat in its platform. The chat history is maintained by Vectara, so you don't have to worry about keeping history and integrating it with your RAG pipeline.

To use it, simply call `as_chat_engine()`.

(Chat mode always uses Vectara's summarization, so you don't have to explicitly specify `summary_enabled=True` as before.)

In [16]:
ce = index.as_chat_engine()

In [17]:
questions = [
    'What is a sentence embedding model?',
    'What are some known models?',
    'How are they different than token embedding models'
]

for q in questions:
    print(f"Question: {q}\n")
    response = ce.chat(q).response
    print(f"Response: {response}\n")

Question: What is a sentence embedding model?

Response: A sentence embedding model is a method that represents input sentences as fixed-dimensional vectors, regardless of sentence length. These models have shown significant enhancements in various natural language processing tasks like information retrieval, question answering, and machine translation. They are particularly beneficial for tasks where sentence-level representations are crucial, enabling improved performance compared to traditional word embeddings. Sentence embedding models are trained to capture semantic meanings and relationships within sentences, providing a more efficient way to process and analyze textual data [4].

Question: What are some known models?

Response: Some known sentence embedding models include non-parameterized models like averaging word embeddings using methods such as average-pooling, min-pooling, and max-pooling, as well as parameterized models like BERT, RoBERTa, GenSen, and DSE. These models aim

Of course streaming works as well with Chat:

In [18]:
ce = index.as_chat_engine(streaming=True)

In [19]:
response = ce.stream_chat("Who is behind SBERT?")
for chunk in response.chat_stream:
    print(chunk.delta or "", end="", flush=True)

The individuals behind SBERT are Reimers and Gurevych, as mentioned in the search results [1], [7].

# Advanced RAG with Vectara and LlamaIndex

## Agentic RAG

Vectara also has its own package, [vectara-agentic](https://github.com/vectara/py-vectara-agentic), built on top of many features from LlamaIndex to easily implement agentic RAG applications. It allows you to create your own AI assistant with RAG query tools and other custom tools, such as making API calls to retrieve information from financial websites. You can find the full documentation for vectara-agentic [here](https://vectara.github.io/vectara-agentic-docs/).

Let's create a ReAct Agent with a single RAG tool using vectara-agentic (to create a ReAct agent, specify `VECTARA_AGENTIC_AGENT_TYPE` as `"REACT"` in your environment).

Vectara does not yet have an LLM capable of acting as an agent for planning and tool use, so we need another LLM to drive the agent's reasoning.

In this demo, we are using OpenAI's GPT-4o. Please make sure you have `OPENAI_API_KEY` defined in your environment or specify another LLM with the corresponding key (for the full list of supported LLMs, check out our [documentation](https://vectara.github.io/vectara-agentic-docs/introduction.html#try-it-yourself) for setting up your environment).

In [23]:
# !pip install -U vectara-agentic

In [24]:
from vectara_agentic.agent import Agent
from IPython.display import display, Markdown

agent = Agent.from_corpus(
    data_description="sentence embeddings",
    assistant_specialty="sentence embeddings research",
    tool_name="ask_embeddings",
    vectara_summary_num_results=5,
    vectara_summarizer="mockingbird-1.0-2024-07-16",
    vectara_reranker="mmr",
    vectara_rerank_k=50,
    verbose=True,
)

response = agent.chat(
    "Tell me about the latest innovations in sentence embedding models."
)

display(Markdown(response))

No observer set.
Added user message to memory: Tell me about the latest innovations in sentence embedding models.
=== Calling Function ===
Calling function: ask_embeddings with args: {"query":"latest innovations in sentence embedding models"}
Got output: 
                    Response: '''The latest innovations in sentence embedding models involve the use of large language models such as LLaMA and Mistral, which have achieved notable breakthroughs in fine-tuning scenarios [1]. However, research on computationally efficient direct inference methods for sentence representation is still in its nascent stage [1]. Recent studies have focused on refining training objectives with contrastive loss and using various training datasets, often involving translation pairs [5]. The integration of large language models and prompting methods has further advanced the capabilities of sentence embedding models [5]. Additionally, the use of uniformity loss and alignment loss in sentence embedding models ha

The latest innovations in sentence embedding models include the use of large language models like LLaMA and Mistral, which have made significant advancements in fine-tuning scenarios. However, research on computationally efficient direct inference methods for sentence representation is still emerging. Recent studies have focused on refining training objectives with contrastive loss and using various training datasets, often involving translation pairs. The integration of large language models and prompting methods has further enhanced the capabilities of sentence embedding models. Additionally, the use of uniformity loss and alignment loss in sentence embedding models has been shown to improve performance.

For more detailed information, you can refer to the following sources:
- [Simple Techniques for Enhancing Sentence Embeddings in Generative Language Models](http://arxiv.org/pdf/2404.03921v2) by Bowen Zhang, published in 2024.
- [arXiv:2204.10931v1](http://arxiv.org/pdf/2204.10931v1) by Miaoran Zhang, published in 2022.
- [Introduction](http://arxiv.org/pdf/2205.15744v2) by Zhuoyuan Mao, published in 2022.

## Using Auto Retriever with Vectara

LlamaIndex's auto-retriever functionality is really cool.
It is most useful when you have metadata fields (as with our papers from Arxiv) and want a query that references a metadata field to be interpreted correctly and automatically.

For example, if I ask "what is a paper about climate change risks published after 2020", the auto-retriever would (behind the scenes) interpret this as the query "what is a paper about climate change risks" along with the filter condition "published > 2020".

Let's see how this works with the Vectara Index.
First, we have to define a `VectorStoreInfo` structure that describes the metadata fields the auto-retriever can use to do its job:

In [25]:
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo

vector_store_info = VectorStoreInfo(
    content_info="information about a paper",
    metadata_info=[
        MetadataInfo(
            name="published",
            description="The date the paper was published",
            type="string",
        ),
        MetadataInfo(
            name="author",
            description="The author of the paper",
            type="string",
        ),
        MetadataInfo(
            name="title",
            description="The title of the paper",
            type="string",
        ),
        MetadataInfo(
            name="url",
            description="The URL for this paper",
            type="string",
        ),
    ],
)

Auto-retrieval is implemented as a query transformation that runs before Vectara is called.

Now we can define the `VectaraAutoRetriever`, which can perform auto-retrieval using Vectara:

In [26]:
from llama_index.indices.managed.vectara import VectaraAutoRetriever

retriever = VectaraAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    llm=llm,
    verbose=True
)
res = retriever.retrieve("What is sentence embedding, based on papers before 2019?")
[(r.metadata['published'], r.text) for r in res]

Using query str: What is sentence embedding?
Using implicit filters: [('published', '<', '2019')]
final filter string: (doc.published < '2019')


[('2018-08-16',
  'This\nproblem can be alleviated by obtaining more of para-\nphrase sentence pairs. Conclusion Sentence embedding is one of the most important text\nprocessing techniques in NLP. To date,  various sen-\ntence embedding models have been proposed and have\nyielded good performances in document classification\nand sentiment analysis tasks. However, the fundamen-\ntal ability of sentence embedding methods, i.e., how\neffectively the meanings of the original sentences are\npreserved  in  the  embedded  vectors,  cannot  be  fully\nevaluated through such indirect methods.'),
 ('2018-08-16',
  'Paraphrase Thought:  Sentence Embedding Module Imitating\n                        Human Language Recognition Myeongjun Jang 1 Abstract\nSentence embedding is an important research\ntopic in natural language processing. It is es-\nsential to generate a good embedding vector\nthat  fully  reflects  the  semantic  meaning  of\na sentence in order to achieve an enhanced\nperformance  for 

As you can see, the auto-retriever translated the natural-language question into a shorter query and a proper filter condition (in this case `doc.published < '2019'`).
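
The translation step can be pictured as turning the extracted `(field, operator, value)` tuples into Vectara's filter syntax. The helper below is hypothetical: it reproduces the single-filter string from the log above, and joining several conditions with `and` is an assumption. It also shows why comparing `published` to `'2019'` works, assuming the filter compares text metadata as strings: ISO `YYYY-MM-DD` dates sort lexicographically in date order.

```python
def to_filter_string(filters):
    """Turn (field, operator, value) tuples into a filter string (hypothetical helper).

    A single filter matches the logged "(doc.published < '2019')";
    joining several with ' and ' is an ASSUMPTION for illustration.
    """
    clauses = [f"(doc.{field} {op} '{value}')" for field, op, value in filters]
    return " and ".join(clauses)


print(to_filter_string([("published", "<", "2019")]))

# ISO 'YYYY-MM-DD' strings compare lexicographically in chronological order,
# so "2018-08-16" < "2019" while "2019-03-01" is not.
dates = ["2018-08-16", "2019-03-01", "2016-05-16"]
print([d for d in dates if d < "2019"])
```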

Of course, we can also ask a question directly, using a `VectaraQueryEngine` on top of the `VectaraAutoRetriever`:

In [27]:
from llama_index.indices.managed.vectara.query import VectaraQueryEngine
from llama_index.indices.managed.vectara import VectaraAutoRetriever

ar = VectaraAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    llm=llm,
    summary_enabled=True,
    summary_num_results=5,
    verbose=True
)

query_engine = VectaraQueryEngine(retriever=ar)
response = query_engine.query("What is sentence embedding, based on papers before 2019?")
print(response)

Using query str: What is sentence embedding?
Using implicit filters: [('published', '<', '2019')]
final filter string: (doc.published < '2019')
Sentence embedding is a crucial technique in Natural Language Processing (NLP) that involves transforming sentences into low-dimensional vector representations to capture their semantic meanings. Various models have been developed to create embedding vectors that enhance performance in tasks like document classification, sentiment analysis, and machine translation. These models aim to preserve the original sentence meanings effectively within the embedded vectors, ultimately improving the efficiency of NLP tasks. Different methods and approaches have been proposed to optimize sentence embeddings, ensuring that semantically similar sentences have similar embeddings while semantically different ones are dissimilar. Overall, sentence embedding plays a vital role in NLP by providing structured representations of unstructured text data, leading to i

## Advanced querying with QueryFusionRetriever

The [QueryFusionRetriever](https://docs.llamaindex.ai/en/stable/examples/retrievers/reciprocal_rerank_fusion.html#reciprocal-rerank-fusion-retriever) is an advanced query mechanism in which the original query is first rewritten into N variations. Each rephrased query is then run against Vectara, and reciprocal rank fusion combines the best results.
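
Reciprocal rank fusion scores each document by summing 1 / (k + rank) over every ranked list it appears in, so documents that rank well for several query variations rise to the top. Below is a minimal sketch with hypothetical document IDs and k = 60 (the constant from the original RRF paper); LlamaIndex's `QueryFusionRetriever` may differ in details.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document ids: score(d) = sum over lists of 1 / (k + rank), rank from 1."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for the original query and two rephrasings.
results = [
    ["sbert_paper", "bert_paper", "use_paper"],
    ["bert_paper", "sbert_paper"],
    ["sbert_paper", "dpr_paper"],
]
print(reciprocal_rank_fusion(results))
```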

Let's see this in action:

In [28]:
query = "is SBERT a dual encoder? what type of DL architecture does it use?"
query_engine = index.as_query_engine(
    similarity_top_k=3,
    summary_enabled=False,
    llm=llm,
)
response = query_engine.query(query)
print(response)

SBERT is not specifically described as a dual encoder in the provided context. It is mentioned as one of the base encoder models used in a study, but the specific architecture type of SBERT itself is not detailed in the excerpts. SBERT typically employs a sentence embedding approach, but further specifics on whether it uses a dual encoder or another type of deep learning architecture are not provided in the context.


In [29]:
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
import nest_asyncio

rf_retriever = QueryFusionRetriever(
    [index.as_retriever(similarity_top_k=2)],
    similarity_top_k=2,
    num_queries=5,  # this includes the original query; set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
)

nest_asyncio.apply()     # apply nested async to run in a notebook
query_engine = RetrieverQueryEngine.from_args(rf_retriever)
response = query_engine.query(query)
print(response)

Generated queries:
1. What is SBERT and how does it work as a dual encoder?
2. Comparison of SBERT with other dual encoder models in natural language processing.
3. Deep learning architecture used in SBERT for sentence embeddings.
4. Advantages and limitations of using a dual encoder like SBERT in machine learning tasks.
SBERT is not a dual encoder. It is a text encoder model that does not use a dual-encoder architecture. Instead, SBERT uses a single embedding space for encoding text.


We can see how the QueryFusionRetriever created additional query variations (they are displayed because we used `verbose=True`) and then fused the results into the overall response. This helps here because some of the generated sub-questions ask specifically about SBERT's architecture, which is the context needed to answer the question properly.

## Summary

In this notebook we've seen various examples for using Vectara with LlamaIndex, which provides the following benefits:
* Vectara provides a complete RAG pipeline, so you don't have to deal with a lot of the details around data ingestion: pre-processing, chunking, embedding, etc. Instead all these steps are handled automatically and efficiently in Vectara. 
* As a platform, Vectara uses its own internal embedding model (Boomerang), its own vector storage, and its own LLM (Mockingbird) for summarization, so you don't have to maintain separate API keys and relationships with additional vendors or install other products.
* Vectara is built for large-scale GenAI applications, and with the tools provided by LlamaIndex like Auto Retrieval and Query Fusion, you can easily build and test advanced RAG applications at enterprise scale.