# Setup
We'll now run through the setup for this example, which ends with a populated corpus. This example
uses the unofficial Vectara Python SDK, `vectara-skunk-client`, which requires
a ".vec_auth.yaml" file to be created and stored in your home directory.

## Pre-requisites
Before running this example, you will need:

1. A Vectara account which is enabled for either "Scale" or "Pro"
and has access to custom prompts.
2. An OAuth credential for the account.
3. A ".vec_auth.yaml" file for the SDK to use (see the SDK setup section below).

## Setup Unofficial SDK
The SDK, `vectara-skunk-client`, expects a ".vec_auth.yaml" file to be present in your home directory
before the client is initialized.

More information on this setup can be found here: https://github.com/davidglevy/vectara-skunk-client
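
Before initializing the client, it can be useful to confirm that the file is where the SDK expects it. The following is a minimal sketch using only the standard library; `find_auth_file` is a hypothetical helper written for this example and is not part of the SDK. It checks only that the file exists and makes no assumption about its contents.

```python
from pathlib import Path
from typing import Optional


def find_auth_file(home: Optional[Path] = None) -> Optional[Path]:
    """Return the path to .vec_auth.yaml if it exists, otherwise None.

    `home` defaults to the current user's home directory; it is a
    parameter only so the helper can be exercised against a temp folder.
    """
    base = home if home is not None else Path.home()
    candidate = base / ".vec_auth.yaml"
    return candidate if candidate.is_file() else None


auth_file = find_auth_file()
if auth_file is None:
    print("No .vec_auth.yaml found in the home directory")
else:
    print(f"Found auth file at [{auth_file}]")
```

If the helper reports that the file is missing, follow the instructions in the repository linked above before continuing.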

## Get the SDK
The first step is to install the SDK with pip.

In [1]:
%pip install -q vectara-skunk-client

Note: you may need to restart the kernel to use updated packages.





## Initialize the Client
We'll now initialize the client so we can start making calls to our Vectara account.

In [13]:
from vectara_client.core import Factory
from vectara_client.admin import CorpusBuilder
import logging
from pathlib import Path

logging.basicConfig(format='%(asctime)s:%(name)-35s %(levelname)s:%(message)s', level=logging.INFO, datefmt='%H:%M:%S %z')

logger = logging.getLogger(__name__)

client = Factory().build()
admin_service = client.admin_service

## Delete Existing (if exists)

The following code deletes the corpus if it already exists. Please be aware that this will delete any corpus named exactly "prompt-guardrails";
code like this would typically be used in a test harness teardown.

In [14]:
name = "prompt-guardrails"
logger.info(f"Looking for existing corpus named [{name}]")

# The filter below will match any corpus with "prompt-guardrails" anywhere in the name, so we need a
# client-side check to validate an exact match.
existing_corpora = admin_service.list_corpora(name)
logger.info(f"We found [{len(existing_corpora)}] potential matches")
for existing_corpus in existing_corpora:
    if existing_corpus.name == name:
        # The following code will delete the corpus
        logger.info(f"Deleting corpus with id [{existing_corpus.id}]")
        admin_service.delete_corpus(existing_corpus.id)
    else:
        logger.info(f"Ignoring corpus with id [{existing_corpus.id}] as it doesn't match our target name exactly")
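
The client-side check matters because a substring filter can return corpora that merely contain the target name. The sketch below illustrates the difference without calling the SDK; `CorpusStub` and `exact_matches` are hypothetical names invented for this example, and the stub only mimics the `id` and `name` attributes used in the cell above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CorpusStub:
    # Hypothetical stand-in for the objects returned by list_corpora.
    id: int
    name: str


def exact_matches(corpora: Iterable[CorpusStub], name: str) -> List[CorpusStub]:
    """Keep only the corpora whose name equals the target exactly."""
    return [corpus for corpus in corpora if corpus.name == name]


# A substring filter on "prompt-guardrails" could plausibly return all three.
candidates = [
    CorpusStub(1, "prompt-guardrails"),
    CorpusStub(2, "prompt-guardrails-v2"),
    CorpusStub(3, "old-prompt-guardrails"),
]

to_delete = exact_matches(candidates, "prompt-guardrails")
print([corpus.id for corpus in to_delete])  # prints [1]
```

Only the corpus with id 1 survives the check, so the two similarly named corpora would be left untouched.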

## Create Test Corpus
We'll now create a test corpus, which we will later load with information from the US Tourist Visa information page.

In [15]:
logger.info("Creating our corpus")
corpus = (CorpusBuilder(name)
          .description("Corpus to illustrate guard rails")
          .build()
         )
corpus_id = admin_service.create_corpus_d(corpus).corpusId
logger.info(f"New corpus created with id [{corpus_id}]")

## Upload our Documents
We'll now upload our test documents to the corpus.

We'll also add some unrelated content to the corpus. We do this so that the retrieval component of RAG returns
results which feed the "Augmented Generation" step with data that is outside the scope of our goal. This helps
illustrate the problem, and the need to add guardrails for use cases which accept data from sources which may not be
curated.

### Political Content
To help illustrate the requirement for guardrails, we've included the text from this article:
https://www.govexec.com/oversight/2015/08/there-are-more-republicans-federal-government-you-might-think/119138/

### Tourist Review Content
We've also included a response from a GenAI summarizer for "fun things" to do in Washington DC.


In [26]:
indexer_service = client.indexer_service
document_service = client.document_service

def load_documents(folder: str):
    """Upload every .docx file in the folder, replacing any existing document with the same file name."""
    path = Path(folder)
    for file_path in path.glob("*.docx"):
        logger.info(f"Found [{file_path}]")

        # Delete the document if it exists; the file name is used as the document id.
        doc_list = document_service.list_documents(corpus_id, metadata_filter=f"doc.id = '{file_path.name}'")
        if len(doc_list) > 0:
            logger.info(f"Found existing document with id [{file_path.name}]")
            indexer_service.delete(corpus_id, file_path.name)

        # Upload the file to the corpus.
        indexer_service.upload(corpus_id, file_path)
        
load_documents("resources")

More Republicans in Federal Government.docx: 15.4kB [00:03, 4.87kB/s]                   
US - Visitor Visa.docx: 27.2kB [00:03, 9.11kB/s]                            
Washington DC - Fun things to Visit.docx: 14.1kB [00:03, 3.69kB/s]                   
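
The delete-then-upload pattern in `load_documents` makes the cell safe to re-run. The sketch below shows the same idea against an in-memory stand-in; `InMemoryCorpus` and `replace_document` are hypothetical names made up for this illustration. The stub rejects duplicate ids purely so the sketch demonstrates why the delete step comes first; this is an assumption of the example, not a statement about how Vectara handles duplicates.

```python
from typing import Dict


class InMemoryCorpus:
    """Hypothetical in-memory stand-in for a corpus, keyed by document id."""

    def __init__(self) -> None:
        self.documents: Dict[str, bytes] = {}
        self.delete_count = 0

    def exists(self, doc_id: str) -> bool:
        return doc_id in self.documents

    def delete(self, doc_id: str) -> None:
        del self.documents[doc_id]
        self.delete_count += 1

    def upload(self, doc_id: str, content: bytes) -> None:
        # Assumption for this sketch only: duplicates are rejected.
        if doc_id in self.documents:
            raise ValueError(f"Document [{doc_id}] already exists")
        self.documents[doc_id] = content


def replace_document(corpus: InMemoryCorpus, doc_id: str, content: bytes) -> None:
    """Delete any existing copy of the document, then upload the new one."""
    if corpus.exists(doc_id):
        corpus.delete(doc_id)
    corpus.upload(doc_id, content)


corpus_stub = InMemoryCorpus()
replace_document(corpus_stub, "US - Visitor Visa.docx", b"first version")
replace_document(corpus_stub, "US - Visitor Visa.docx", b"second version")
print(len(corpus_stub.documents), corpus_stub.delete_count)  # prints 1 1
```

Running the replacement twice leaves a single document holding the latest content, which is the behaviour a repeatable notebook or test harness needs.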
