# Getting Started - Lab 01 - Vectara Index API

We'll now explore the Vectara Index API, where we encode our data into vectors using the Boomerang model, which provides
the best bi-lingual meaning and intent embeddings in the industry. We then store the embeddings plus the document text
and metadata together in the corpus.

This notebook will use our "lab" authentication profile. If you haven't set this up, please see [Setup Authentication](./00_setup_authentication.ipynb).

<img src="./resources/platform-capabilities-index.png" alt="Platform Capabilities - Encode and Index" />

In [None]:
from vectara.factory import Factory
from getting_started_util import GettingStartedUtil

util = GettingStartedUtil()
logger = util.logger
client = Factory(profile="lab").build()

## Setup Corpus
We will set up a lab corpus below before we ingest our data. We'll examine this in more depth in the following notebooks.

In [None]:
corpus_key = util.setup_02(client)

## Load Our Content
We'll now use the same example as the last lab, loading Shakespeare's _Taming of the Shrew_ text.

It's important to note that this text is "one block". The method we use below dictates how it is
structured.

<img src="./resources/Taming_of_the_Shrew_01.jpg" alt="Taming of the Shrew" />

In [None]:
from pathlib import Path

path = Path("resources/shakespeare/taming_shrew.txt")
logger.info(f"Loading {path}")
with open(path, "r", encoding="utf-8") as f:
    play_text = f.read()


## Automatic Chunking with Structured Document Indexing
We'll now submit the document using structured document indexing. This is the simplest method to
put data in Vectara and works for most use cases with unstructured data. The only downside is that document
parts may span multiple chunks. The current default indexing API chunks at the sentence level, with some caveats.

Chunking is an advanced topic, and each strategy has its own pros and cons. We're
keen to hear feedback, and if you need more control you can look at the CoreIndex method below. If you want to learn
more about how we do it at Vectara, see the following blog article: https://vectara.com/blog/grounded-generation-done-right-chunking/
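
To make "chunking at the sentence level" concrete, here is a minimal sketch of a naive sentence splitter. It is an illustration only and is not how Vectara chunks internally: the regular expression, the `split_sentences` name and the sample text are all assumptions made for this example.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Made-up sample text, purely for illustration.
sample = "First sentence. Second one! A third?"
for sentence in split_sentences(sample):
    print(sentence)
```

A real sentence chunker also has to cope with abbreviations, stage directions and speaker labels, which is part of why the default behaviour comes "with some caveats".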

We will highlight the important fields on the indexing below:

* **id** - each document in a corpus must have a unique id field. You cannot insert a document whose ID already exists; you must delete the existing document first.
* **type** - for the V2 API, you must provide a type, which may be "structured" as per below or "core" which we'll show next. This is known as a discriminator value and indicates which type of document you are submitting.
* **title** - provided for context which helps the retrieval model and re-ranker determine relevancy to the user's query.
* **description** - provides further context like the title field.
* **sections** - for structured documents, you must provide the sections of text. There are other fields which can be present here in a nested structure. Note that the "text" field may be split into multiple "document_part" sections.

A key takeaway here is that the "sections" field will be transformed by Vectara into the "core" document format using
optimal processing. This will work for most use cases; however, you may have requirements that define strict boundaries
on the document parts.

In [None]:
from vectara.types import StructuredDocument

request = StructuredDocument.parse_obj({
   "id": "taming_of_the_shrew_structured",
   "type": "structured",
   "title": "Taming of the Shrew",
   "description": "The Shakespeare play, 'the Taming of the Shrew'",
   "sections": [
       {
           "text": play_text # One big section which will be automatically chunked.
       }
   ]
})

structured_index_response = client.documents.index(corpus_key, request=request)

# Let's Look at a "document_part"
In order to know what the Retrieval model can "see" when searching for relevant parts, we can take a look at one of the parts here.

In [None]:
indexed_doc_resp = client.documents.retrieve(corpus_key, "taming_of_the_shrew_structured")

logger.info(f"Here's the 2nd document part:\n{indexed_doc_resp.parts[1].text}")

# Document Size and Reduction
You can see that the ingested size is slightly smaller than the original text: the stored size is
about 97% of its source. Plain text and JSON do not reduce much.
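
The stored-size ratio and the reduction are two different numbers, and they are easy to mix up. The sketch below shows the arithmetic with hypothetical byte counts; the function names and figures are assumptions for illustration, not values returned by the API.

```python
def size_ratio_pct(stored_bytes: int, original_bytes: int) -> float:
    """Stored size as a percentage of the original size."""
    return stored_bytes * 100 / original_bytes

def reduction_pct(stored_bytes: int, original_bytes: int) -> float:
    """How much smaller the stored data is, as a percentage of the original."""
    return 100 - size_ratio_pct(stored_bytes, original_bytes)

# Hypothetical figures: a 100,000 byte source stored in 97,000 bytes.
print(size_ratio_pct(97_000, 100_000))
print(reduction_pct(97_000, 100_000))
```

With these figures the stored data is 97% of the source, which corresponds to a 3% reduction.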

In [None]:
def show_usage_info(index_response):
    bytes_used = index_response.storage_usage.bytes_used
    metadata_bytes_used = index_response.storage_usage.metadata_bytes_used

    # Stored size as a percentage of the original UTF-8 text size (not the reduction).
    original_bytes = len(play_text.encode("utf-8"))
    size_ratio_pct = bytes_used / original_bytes * 100
    kb_used = int(bytes_used / 1024)
    metadata_kb_used = int(metadata_bytes_used / 1024)
    logger.info(f"The stored text is [{size_ratio_pct:.1f}%] of the original size")
    logger.info(f"Total data storage is [{kb_used}KB]")
    logger.info(f"Total metadata storage is [{metadata_kb_used}KB]")

show_usage_info(structured_index_response)


In [None]:
from vectara.queries import SearchCorpusParameters
from vectara.types import GenerationParameters

def run_query(doc_id):
    # Note: doc_id is only used to label the log output; the query itself runs against the whole corpus.
    query = "Does Sly offer to pay for the broken glasses?"
    
    generation = GenerationParameters.parse_obj({
        "generation_preset_name": "vectara-summary-ext-v1.3.0",
        "max_used_search_results": 5,
        "max_response_characters": 300,
        "response_language": "auto",
        
    })
    
    search_corpus = SearchCorpusParameters.parse_obj({
        "lexical_interpolation": 0.025,
        "semantics": "default",
        "offset": 0,
        "limit": 10,
        "reranker": {
            "type": "customer_reranker",
            "reranker_id": "rnk_272725719" # Multi-lingual Re-Ranker
        },
        "context_configuration": {
            "characters_before": 30,
            "characters_after": 30,
            "start_tag": "<b>",
            "end_tag": "</b>"
        },
    })
    
    query_response = client.queries.query_corpus(corpus_key, query=query, search=search_corpus, generation=generation)
    logger.info(f"Document summary for document with id [{doc_id}] is [{query_response.summary}]")
    return query_response.summary

structured_summary = run_query("taming_of_the_shrew_structured")


## Some Structuring
We can see from the example above that there is no true "part" - the document is stored internally as one giant part.

The chunking is done automatically behind the scenes.

We can break up the document parts into more logical elements. We'll now parse the ingested document into acts (INDUCTION, ACT 1, ACT 2 etc) and scenes (Scene 1, Scene 2).
This will allow us to do 2 things:
1. Utilise the metadata to target specific sections in the document which will be relevant when we look at Filter Attributes.
2. Separate distinct areas of text and avoid unrelated context between sections (clipping information which should be distinct).

Note - we'll use an extension of this example to add metadata for the Scene and Act to the information when we perform Corpus Modelling.
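
As a rough idea of what splitting by act and scene involves, here is a minimal sketch. It is not the implementation of `util.lab_02_chunk_play`: the heading patterns, the `parse_acts` name and the sample text are assumptions, chosen so that the output has the same shape (`name`, `scenes`, `scene_texts`) that the indexing code in this notebook consumes.

```python
import re

# Assumed heading formats; the real text file may differ.
ACT_RE = re.compile(r"^(INDUCTION|ACT \d+)$")
SCENE_RE = re.compile(r"^Scene \d+$")

def parse_acts(text: str) -> list[dict]:
    """Group lines into acts and scenes using heading lines as delimiters."""
    acts = []
    scene = None
    for line in text.splitlines():
        stripped = line.strip()
        if ACT_RE.match(stripped):
            acts.append({"name": stripped, "scenes": []})
            scene = None
        elif SCENE_RE.match(stripped) and acts:
            scene = {"name": stripped, "scene_texts": []}
            acts[-1]["scenes"].append(scene)
        elif stripped and scene is not None:
            # Lines before the first scene heading are ignored in this sketch.
            scene["scene_texts"].append(stripped)
    return acts

# Made-up sample text, purely for illustration.
sample_play = "\n".join([
    "INDUCTION", "Scene 1", "First line.", "Second line.",
    "ACT 1", "Scene 1", "Third line.",
])
acts_sketch = parse_acts(sample_play)
print([act["name"] for act in acts_sketch])
```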

In [None]:
# You can ignore the code here - we use the Act/Scene breaks in the text file as section delimiters.
acts = util.lab_02_chunk_play(path)
        

# Convert to Chunks and Index
After we've extracted the text from the raw content, we break it apart into sub-chunks of up to 1000 characters with a 50-character overlap.

Once done, we index the resulting document with the same API call as we used for the Structured document. The key difference is that we've
manually created our chunks and specify "core" instead of "structured".

You will also notice we add in metadata we extracted from the document for the Act and Scene - allowing us to ask very specific questions
when combined with Corpus Modelling.
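
The fixed-size chunking with overlap can be sketched as a standalone function. This is an illustrative helper (the `sliding_chunks` name is an assumption) and it differs slightly from the loop in the next cell: it stops once a chunk reaches the end of the text, so it never emits a trailing chunk that contains only overlap.

```python
def sliding_chunks(text: str, size: int = 1000, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `size` characters, each sharing
    `overlap` characters with the previous chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # This chunk already reaches the end of the text.
        start += step
    return chunks

# Synthetic 2000-character text, purely for illustration.
demo_text = "".join(str(i % 10) for i in range(2000))
demo_chunks = sliding_chunks(demo_text)
print([len(chunk) for chunk in demo_chunks])
```

The step between chunk starts is `size - overlap`, which is where the 950 in the indexing loop comes from.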

In [None]:
document_parts = []
core_document = {
    "id": "taming_of_the_shrew_core",
    "type": "core",
    "document_parts": document_parts # Add these in the loop below.
}

for act in acts:
    # Single quotes inside the f-strings keep this valid on Python versions before 3.12.
    logger.info(f"Act: {act['name']}")
    
    for scene in act["scenes"]:
        logger.info(f"\tScene: {scene['name']}")
        
        # Add a title
        scene_title_part = {
            "text": f"{act['name']} - {scene['name']}",
            "metadata": {
                "is_title": True,
                "act": act["name"],
                "scene": scene["name"]
            }
        }
        document_parts.append(scene_title_part)
        
        full_text = "\n".join(scene["scene_texts"])
        start = 0
        
        while start < len(full_text):
            
            # Simple chunks at 1000 characters with a 50 character overlap.
            chunk_text = full_text[start:start+1000]
            scene_chunk_part = {
                "text": chunk_text,
                "metadata": {
                    "act": act["name"],
                    "scene": scene["name"]
                }
            }
            document_parts.append(scene_chunk_part)
            start += 950

client.documents.index(corpus_key, request=core_document)

# Check the Core Document
Now we'll investigate the core document parts.

In [None]:
indexed_doc_resp = client.documents.retrieve(corpus_key, "taming_of_the_shrew_core")

logger.info(f"Here's the 2nd document part:\n{indexed_doc_resp.parts[1].text}")

In [None]:
logger.info(f"Remember, here was our first summary:\n{structured_summary}")

core_summary = run_query("taming_of_the_shrew_core")
