# Corpus Creation
As Carl Sagan once said, "If you wish to make an apple pie from scratch, you must first invent the universe". Up until now, we have worked with a corpus that was created for us during lab setup.
That corpus is akin to the universe in Sagan's quote. We'll now take a look at the **Create a Corpus** API call and the corpus concept itself.

You can find full documentation of these methods here:
* Create a Corpus: https://docs.vectara.com/docs/rest-api/create-corpus
* Retrieve metadata about a Corpus: https://docs.vectara.com/docs/rest-api/get-corpus
* Delete a Corpus and all its data: https://docs.vectara.com/docs/rest-api/delete-corpus

For a greenfield use case, the first step when working with Vectara is to create your corpus. You can think of a corpus as a logical container for your documents.

# TODO - MOVE FOLLOWING TO FILTER ATTRIBUTES / CORPUS MODELLING LAB
When users first try Vectara, they often load a few documents into the system and run some basic queries, but much of Vectara's power is only unlocked when
this is combined with two advanced features:

* Filter Attributes: metadata on documents and the document parts within them; and
* Custom Dimensions: Numerical attributes on a document which allow for boosting.

TBC - Add diagram of Account, Document and Document-Part.



In [None]:
from vectara.managers import CreateCorpusRequest
from vectara.factory import Factory
from getting_started_util import GettingStartedUtil

util = GettingStartedUtil()
logger = util.logger
client = Factory(profile="lab").build()

# Looking a bit closer
Up until now we've relied on the `LabHelper` class to do most of the heavy lifting
to create our corpus. There are three tiers of abstraction here, each covering a
different level of responsibility:

1. `vectara.corpora.client.CorporaClient` - Generated API class, performs a direct mapping to Vectara APIs related to Corpora and Corpus
2. `vectara.managers.corpus.CorpusManager` - Business Facade on top of CorporaClient which will perform intelligent checks, such as "check exists" or "delete if exists"
3. `vectara.utils.lab_helper.LabHelper` - A second layer on top of CorpusManager to create and clean up lab names with a user's prefix.

It is important to note that the Python request object `vectara.managers.CreateCorpusRequest` is passed down through each tier and is only flattened when we invoke `CorporaClient.create`.
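
The layering can be sketched with small stand-in classes. Everything here is hypothetical and only illustrates the pattern: the `Sketch*` names are not part of the SDK, and prefixing only the key (as `{prefix}_{key}`) is an assumption based on the URL shown later in this lab.

```python
from dataclasses import asdict, dataclass, replace


@dataclass
class SketchCreateCorpusRequest:
    """Hypothetical stand-in for CreateCorpusRequest."""
    key: str
    name: str


class SketchCorporaClient:
    """Tier 1 stand-in: receives flat keyword arguments and records them."""

    def __init__(self):
        self.calls = []

    def create(self, **fields):
        self.calls.append(fields)
        return fields


class SketchCorpusManager:
    """Tier 2 stand-in: accepts the request object and flattens it at the last moment."""

    def __init__(self, corpora_client):
        self.corpora_client = corpora_client

    def create_corpus(self, request):
        return self.corpora_client.create(**asdict(request))


class SketchLabHelper:
    """Tier 3 stand-in: applies a per-user prefix to the key, then delegates."""

    def __init__(self, manager, prefix):
        self.manager = manager
        self.prefix = prefix

    def create_lab_corpus(self, request):
        prefixed = replace(request, key=f"{self.prefix}_{request.key}")
        return self.manager.create_corpus(prefixed)


sketch_corpora = SketchCorporaClient()
sketch_helper = SketchLabHelper(SketchCorpusManager(sketch_corpora), prefix="alice")
created = sketch_helper.create_lab_corpus(
    SketchCreateCorpusRequest(key="04-demo", name="Demo")
)
print(created)
```

The request stays a single object until the lowest tier, so each layer above can inspect or adjust it without knowing the argument list of the generated client.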

When we run the code below, the logging shows all three tiers being invoked, including the underlying API methods.

In [None]:
request = CreateCorpusRequest(name="Getting Started - Corpus Creation", key="04-getting-started-corpus_creation")
response = client.lab_helper.create_lab_corpus(request)
corpus_key = response.key

logger.info(f"Our corpus key is [{corpus_key}]")

# Corpus Key
Each corpus has a unique key, which can be set in the request. If it is not set, a key is created by default. If this was the first time you ran the code, you will see the `CorpusManager` checking for a corpus with the same `key`. From version 2.0 of the Vectara API, the corpus `key` has become the primary way to identify the corpus targeted by API calls.

One other convenience of the `CorpusManager` is that it wraps the HTTP 404 error, which makes a uniqueness check simpler. Otherwise, your code would need to handle the exception from the generated `CorporaClient` itself.

After the uniqueness check, your logs will differ depending on whether this was the first time you ran the code or a later run.

## First Run
In this case, you should just see an invocation of the method to **Create a Corpus** (POST to https://api.vectara.io/v2/corpora) with a 201 response.
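
For orientation, here is a minimal sketch of what that POST might look like at the HTTP level, using only the standard library. The endpoint comes from the text above. The `x-api-key` header and the `key`/`name` body fields are assumptions to verify against the Create a Corpus documentation, and `build_create_corpus_request` is a hypothetical helper. The request is built but never sent.

```python
import json
import urllib.request


def build_create_corpus_request(api_key, key, name):
    """Build (but do not send) a POST request for the Create a Corpus endpoint."""
    body = json.dumps({"key": key, "name": name}).encode("utf-8")
    return urllib.request.Request(
        url="https://api.vectara.io/v2/corpora",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "x-api-key": api_key,  # assumption: API-key authentication header
        },
    )


sketch_request = build_create_corpus_request("YOUR_API_KEY", "04-demo", "Demo")
print(sketch_request.get_method(), sketch_request.full_url)
print(sketch_request.data.decode("utf-8"))
```

In the lab itself the SDK tiers issue this call for you, so the sketch is only useful for relating the log lines to the underlying request.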

## Subsequent Runs
On subsequent runs, you should also see a call to **Delete a Corpus and all its data** (DELETE to [https://api.vectara.io/v2/corpora/{prefix}_04-getting-started-corpus_creation](https://api.vectara.io/v2/corpora/{prefix}_04-getting-started-corpus_creation)).
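
The difference between a first run and a later run can be sketched with an in-memory stand-in. This is not the SDK's implementation: `FakeNotFound`, `FakeCorporaClient`, `exists` and `recreate` are hypothetical names, and the check, delete, create ordering is an assumption about how a "delete if exists" helper behaves.

```python
class FakeNotFound(Exception):
    """Hypothetical stand-in for the SDK error raised on an HTTP 404."""
    status_code = 404


class FakeCorporaClient:
    """In-memory stand-in for the generated client; logs each call it receives."""

    def __init__(self):
        self.store = {}
        self.log = []

    def get(self, key):
        self.log.append(f"GET {key}")
        if key not in self.store:
            raise FakeNotFound(key)
        return self.store[key]

    def create(self, key, name):
        self.log.append(f"POST {key}")
        self.store[key] = {"key": key, "name": name}
        return self.store[key]

    def delete(self, key):
        self.log.append(f"DELETE {key}")
        del self.store[key]


def exists(fake_client, key):
    """Turn the 404 into a plain boolean so callers need no try/except."""
    try:
        fake_client.get(key)
        return True
    except FakeNotFound:
        return False


def recreate(fake_client, key, name):
    """Delete the corpus if it is already there, then create it fresh."""
    if exists(fake_client, key):
        fake_client.delete(key)
    return fake_client.create(key, name)


fake_client = FakeCorporaClient()
recreate(fake_client, "alice_04-demo", "Demo")  # first run
first_run = list(fake_client.log)
fake_client.log.clear()
recreate(fake_client, "alice_04-demo", "Demo")  # subsequent run
second_run = list(fake_client.log)
print(first_run)
print(second_run)
```

The first run logs a lookup followed by a create, while the second run logs a delete in between, which mirrors the pattern to look for in the lab's own logging.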

# List Corpora
We'll now show how we can list corpora. Similar to the creation of a corpus, you can pick which level of abstraction
suits your development style:

1. `vectara.corpora.client.CorporaClient` - Generated API class, performs a direct mapping to Vectara APIs related to Corpora and Corpus
2. `vectara.managers.corpus.CorpusManager` - Business Facade on top of CorporaClient which performs intelligent checks, wraps the `SyncPager` generator returned from the API, applies a limit when listing and provides convenience methods:
        1. "find_corpora_with_filter" - adds a true "limit", whereas the SDK API method returns a generator where the limit applies to "each call".
        2. "find_corpora_by_name" - returns all corpora with an exact name match
        3. "find_corpus_by_name" - returns 1 or None using an exact name match. Throws a 


## CorporaClient#list
We now call the API method directly. This may be useful if you need to control the yield behaviour.

In [None]:
def list_corpora_gen():
    list_response = client.corpora.list(filter="Getting Started", limit=100)
    for item in list_response:
        yield item
    
for corpus in list_corpora_gen():
    logger.info(f"Found [{corpus.name}] with key [{corpus.key}]")

## CorpusManager#find_corpora_with_filter
For most uses, you can use the more streamlined method below to achieve the same. This is a more standard way to invoke 
the API, letting the underlying manager handle the generator and paging.
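
To see why a per-call limit differs from an overall limit, consider this sketch. `paged_list` is a hypothetical pager that simulates fetching `limit` items per call, and `find_with_true_limit` shows one way a manager could cap the total with `itertools.islice`. Neither is the SDK's actual code.

```python
from itertools import islice


def paged_list(items, limit):
    """Hypothetical pager: fetches `limit` items per simulated call, but keeps
    yielding until every page has been consumed."""
    for start in range(0, len(items), limit):
        page = items[start:start + limit]  # one simulated API call
        yield from page


def find_with_true_limit(items, page_size, limit=None):
    """Cap the total number of results, regardless of the page size."""
    # islice with a stop of None simply consumes the whole iterator.
    return list(islice(paged_list(items, page_size), limit))


names = ["Getting Started A", "Getting Started B", "Getting Started C"]
per_call_count = len(list(paged_list(names, limit=1)))
capped = find_with_true_limit(names, page_size=1, limit=1)
print(per_call_count)
print(capped)
```

Iterating the pager with `limit=1` still visits all three entries, one page at a time, whereas the capped version stops after the first result.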

In [None]:
# Find all the Getting Started corpora, will include other users if visible.
for corpus in client.corpus_manager.find_corpora_with_filter("Getting Started"):
    logger.info(f"Found [{corpus.name}] with key [{corpus.key}]")


In [None]:
# We can also run with limit=1
for corpus in client.corpus_manager.find_corpora_with_filter("Getting Started", limit=1):
    logger.info(f"Found [{corpus.name}] with key [{corpus.key}]")

## CorpusManager#find_corpora_by_name
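We can also look up a corpus by its exact name: `find_corpus_by_name` returns a single corpus or None, while `find_corpora_by_name` returns every corpus with that name.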
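
As a rough illustration of exact-name matching, the sketch below narrows a list of candidates to those whose name matches exactly. It assumes that a filter query can return partial matches. The function names and the dictionary shape are hypothetical, and the sketch does not model how the real method treats duplicates.

```python
def find_all_by_exact_name(candidates, name):
    """Keep only the entries whose name equals `name` exactly."""
    return [c for c in candidates if c["name"] == name]


def find_one_by_exact_name(candidates, name):
    """Return the first exact match, or None when nothing matches."""
    matches = find_all_by_exact_name(candidates, name)
    return matches[0] if matches else None


candidates = [
    {"key": "alice_04-demo", "name": "Getting Started - Corpus Creation"},
    {"key": "alice_05-demo", "name": "Getting Started - Corpus Creation (copy)"},
]
exact = find_all_by_exact_name(candidates, "Getting Started - Corpus Creation")
missing = find_one_by_exact_name(candidates, "Getting Started")
print(len(exact), missing)
```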


In [None]:
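our_name = response.name
# Expect 1 or None.
corpus = client.corpus_manager.find_corpus_by_name(our_name)
if corpus is None:
    # Guard against a missing corpus so we don't dereference None.
    logger.info(f"No corpus found with name [{our_name}]")
else:
    logger.info(f"We found our lab: {corpus.key}")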


In [None]:
our_name = response.name
# Match multiple exact names, useful for identifying duplicates.
corpora = client.corpus_manager.find_corpora_by_name(our_name)
logger.info(f"We found [{len(corpora)}] corpora with our name")

# Delete Corpora
Finally, we can delete a corpus using the corpus key.

In [None]:
client.corpus_manager.delete(corpus_key)