# Getting Started - Lab 01 - Vectara Index API

We'll now explore the Vectara Index API.

This notebook uses our "lab" authentication profile. If you haven't set this up, please see [Setup Authentication](./00_setup_authentication.ipynb).

TODO - Insert diagram with Index API highlighted.

In [None]:
from vectara.factory import Factory
from vectara.managers import CreateCorpusRequest
from pathlib import Path
import logging

logging.basicConfig(format='%(asctime)s:%(name)-35s %(levelname)s:%(message)s', level=logging.INFO, datefmt='%H:%M:%S %z')
logger = logging.getLogger(__name__)

client = Factory(profile="lab").build()

## Setup Corpus and Data
We will set up a lab corpus below before we ingest our data. We'll examine this in more depth in the following notebooks.

In [None]:
request = CreateCorpusRequest(name="Getting Started - Index API", key="02-getting-started-index-api")
response = client.lab_helper.create_lab_corpus(request)

logger.info(f"Our corpus key is [{response.key}]")

## Load Our Content
We'll now use the same example as the last lab, loading Shakespeare's _Taming of the Shrew_ text.

It's important to note that this text is "one block". How it is structured once indexed depends on which method
we use below.

In [None]:
from pathlib import Path

path = Path("resources/shakespeare/taming_shrew.txt")
logger.info(f"Loading {path}")
with open(path, "r", encoding="utf-8") as f:
    play_text = f.read()


## Automatic Chunking with Structured Document Indexing
We'll now submit the document using structured document indexing. This is the simplest way to
put data into Vectara and works for most use cases with unstructured data. The only downside is that document
parts may span multiple sections.

We will highlight the important fields on the indexing below:

* **id** - each document in a corpus must have a unique id field. You cannot insert a document whose ID already exists; you must first delete the existing document.
* **type** - for the V2 API, you must provide a type, which may be "structured" as shown below or "core", which we'll show next. This is known as a discriminator value and indicates which type of document you are submitting.
* **title** - provided for context, which helps the retrieval model and re-ranker determine relevancy to the user's query.
* **description** - provides further context, like the title field.
* **sections** - for structured documents, you must provide the sections of text. Other fields can be present here in a nested structure. Note that the "text" field may be split into multiple "document_part" sections.

A key takeaway here is that the "sections" field will be transformed by Vectara into the "core" document format using
optimal processing. This will work for most use cases; however, you may have requirements that define strict boundaries
on the document parts.
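
To see why automatically created parts can span sections, here is a toy sketch. It is not how Vectara chunks text. The `chunk_fixed` helper, the sample sentences and the 40-character size are all made up for illustration, and a simple fixed-size split stands in for whatever processing the platform actually applies.

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Two logically separate sections (made-up sample text) joined into one block.
sections = [
    "ACT 1. Lucentio arrives in Padua.",
    "ACT 2. Petruchio meets Katherine.",
]
block = " ".join(sections)

chunks = chunk_fixed(block, 40)

# A chunk "spans" sections when it holds text from both of them.
spanning = [c for c in chunks if "Padua." in c and "ACT 2" in c]

for chunk in chunks:
    print(repr(chunk))
print(f"{len(spanning)} chunk(s) span both sections")
```

The first chunk ends partway into the second section, which is the kind of overlap that strict part boundaries are meant to avoid.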

In [None]:
from vectara.types import StructuredDocument

request = StructuredDocument.parse_obj({
   "id": "my-doc",
   "type": "structured",
   "title": "Taming of the Shrew",
   "description": "The Shakespeare play, 'the Taming of the Shrew'",
   "sections": [
       {
           "text": play_text # One big section which will be automatically chunked.
       }
   ]
})

client.documents.index(response.key, request=request)

In [None]:
from vectara.queries import SearchCorpusParameters
from vectara.types import GenerationParameters, ContextConfiguration
import json

query = "Does Sly offer to pay for the broken glasses?"

search_corpus = SearchCorpusParameters.parse_obj({
    # TODO Add reranker from SearchParameters#SearchReranker
    # TODO Add context_configuration from SearchParameters#ContextConfiguration
    "limit": 1,
    "context_configuration": {
        "characters_before": 10000,
        "characters_after": 10000,
        "start_tag": "<b>",
        "end_tag": "</b>"
    }
    
})

query_response = client.queries.query_corpus(response.key, query=query, search=search_corpus)
logger.info(json.dumps(query_response.model_dump(), indent=4))

logger.info(f"Document part length is [{len(query_response.search_results[0].text)}]")

## Some Structuring
We can see from the example above that there is no true "part" - the document is stored internally as one giant part.

The chunking is done automatically behind the scenes.
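
Some of the length logged above may come from the `context_configuration` in the query, which asks for up to 10,000 characters of surrounding text on each side of the match. The sketch below is a toy model of that idea, not Vectara's implementation: `with_context` and the sample sentence are hypothetical, and only the parameter names are borrowed from the query above.

```python
def with_context(document: str, start: int, end: int,
                 characters_before: int, characters_after: int,
                 start_tag: str, end_tag: str) -> str:
    """Wrap document[start:end] in tags and add the surrounding context."""
    before = document[max(0, start - characters_before):start]
    after = document[end:end + characters_after]
    return f"{before}{start_tag}{document[start:end]}{end_tag}{after}"


# Made-up sample text standing in for an indexed document.
document = "Sly is found asleep. A lord plays a trick on him. Players arrive."
match = "A lord plays a trick on him."
start = document.index(match)
end = start + len(match)

# A small window keeps only a few characters either side of the match.
small = with_context(document, start, end, 5, 5, "<b>", "</b>")

# A window larger than the document returns the whole text around the match.
large = with_context(document, start, end, 10000, 10000, "<b>", "</b>")

print(small)
print(large)
```

With a window this large relative to the document, the returned text is effectively the whole document with the match highlighted, so the reported length says little about the size of the underlying part.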

We can break the document up into more logical elements. We'll now parse the ingested document into acts (INDUCTION, ACT 1, ACT 2, etc.) and scenes (Scene 1, Scene 2).
This will allow us to do two things:
1. Utilise the metadata to target specific sections of the document, which will be relevant when we look at Filter Attributes.
2. Separate distinct areas of text and avoid unrelated context leaking between sections (keeping apart information which should be distinct).
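
As a rough sketch of where this is heading, the parsed acts and scenes could be flattened into one section per scene, each carrying metadata. The `acts` shape follows the parsing cell in this notebook. The `scenes_to_sections` helper, the sample text, and the `title` and `metadata` keys on each section are assumptions for illustration and should be checked against the structured document schema before use.

```python
def scenes_to_sections(acts: list[dict]) -> list[dict]:
    """Flatten acts and scenes into one section per scene, with metadata."""
    sections = []
    for act in acts:
        for scene in act["scenes"]:
            scene_name = scene.get("name")  # Scenes may be unnamed.
            metadata = {"act": act["name"]}
            if scene_name:
                metadata["scene"] = scene_name
            title = f'{act["name"]}, {scene_name}' if scene_name else act["name"]
            sections.append({
                "title": title,
                "text": scene["text"],
                "metadata": metadata,  # Assumed key, see the note above.
            })
    return sections


# Made-up sample in the same shape the parser builds.
acts = [
    {"name": "Overview", "scenes": [{"text": "Front matter."}]},
    {"name": "INDUCTION", "scenes": [
        {"name": "Scene 1", "text": "First scene text."},
        {"name": "Scene 2", "text": "Second scene text."},
    ]},
]

sections = scenes_to_sections(acts)
for section in sections:
    print(section["title"], section["metadata"])
```

Keeping one section per scene means each part carries its own act and scene labels, which is what a later metadata filter would match against.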

In [None]:
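import re

acts = []
act = {"name": "Overview", "scenes": []}
scene_name = None
scene_texts = []

# A heading is assumed to be a line of text followed by a line of "=" characters.
break_marker = re.compile(r'^=+$')

# Headings that should not start a new act or scene.
ignored_break_markers = [
    "Characters in the Play"
]

scene_prefix = "Scene "


def close_scene(act, scene_name, scene_texts):
    """Put the collected scene text into the act (if not empty), then reset it."""
    scene_text = "\n".join(scene_texts).strip()
    if scene_text:
        scene = {"text": scene_text}
        if scene_name:
            scene["name"] = scene_name
        act["scenes"].append(scene)
    scene_texts.clear()


last = None  # Previous line, held back until we know whether it is a heading.
logger.info(f"Loading {path}")
with open(path, "r", encoding="utf-8") as f:
    for idx, line in enumerate(f):
        stripped_line = line.strip()

        # Stop after the first 200 lines (remove this check to parse the whole play).
        if idx > 200:
            break
        if last is not None and break_marker.match(stripped_line):
            if last in ignored_break_markers:
                # Skip the underline; the heading stays in `last` as plain text.
                continue
            logger.info(f"Found break marker, last was: [{last}]")
            # Put the last scene into the act (if not empty).
            close_scene(act, scene_name, scene_texts)
            if last.startswith(scene_prefix):
                # New scene within the current act.
                scene_name = last
            else:
                # New act (assumption: any other heading starts an act).
                if act["scenes"]:
                    acts.append(act)
                act = {"name": last, "scenes": []}
                scene_name = None
            last = None  # The heading has been consumed.
            continue

        if last is not None:
            scene_texts.append(last)
        last = stripped_line

# Flush whatever is still pending once the loop ends.
if last is not None:
    scene_texts.append(last)
close_scene(act, scene_name, scene_texts)
if act["scenes"]:
    acts.append(act)

logger.info(f"Parsed {len(acts)} acts")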