## Seqtra
Chunking plays an important role in Retrieval Augmented Generation (RAG): it determines how textual data is indexed for retrieval. Chunk size is usually predetermined, i.e. set or limited to a fixed size during the data ingestion phase. Seqtra combines chunking and retrieval in one stage: chunk boundaries are not defined during ingestion but are determined with respect to the query at retrieval time, so chunk sizes are dynamic and adapted to the query. This strategy is known as late chunking in the literature. Seqtra constructs and uses graph-based relationships to chunk the documents.

If you do not have a Seqtra API key, please generate a free one at https://app.seqtra.com/. Please note that we only accept PDFs for now. Additionally, please do not forget to execute the code in the "End Session" section before you exit.


After generating an API key, let's first clone the SeqtraClient repository.

In [1]:
!git clone https://github.com/seqtra/SeqtraClient.git

Cloning into 'SeqtraClient'...
remote: Enumerating objects: 132, done.
remote: Counting objects: 100% (132/132), done.
remote: Compressing objects: 100% (85/85), done.
remote: Total 132 (delta 63), reused 100 (delta 35), pack-reused 0 (from 0)
Receiving objects: 100% (132/132), 498.04 KiB | 19.16 MiB/s, done.
Resolving deltas: 100% (63/63), done.


In [2]:
# Switch to the cloned repo
%cd SeqtraClient

/content/SeqtraClient


## Install required packages

In [3]:
!pip install -q -r requirements.txt

## Initialize Seqtra Client

In [12]:
SEQTRA_API_URL = "https://api.seqtra.com/"
# Generate an API key at https://app.seqtra.com/ if you haven't done so.
SEQTRA_API_TOKEN = "Your Seqtra API key"
# LLM parameters. You don't have to change these if you want to test without an LLM;
# in that case Seqtra only returns the relevant chunks.
# Only claude and openai are available right now.
LLM = "claude"
LLM_KEY = "YOUR LLM API KEY"
# Path to the example files provided with the code. Change it to the directory that holds your own test files.
DIR_PATH = "./Files"
# Project name; it acts like a project folder in which all files of the given collection are stored.
PROJECT_NAME = "test"

In [29]:
from src.seqtra_client import SeqtraClient

In [13]:
# Initialize the client
seqtra = SeqtraClient(
    api_token=SEQTRA_API_TOKEN,
    project_name=PROJECT_NAME,
    url=SEQTRA_API_URL,
    llm=LLM,
    llm_key=LLM_KEY
)

Project already initialized
Initialization time: 0:00:00.728513


## Ingest your test files

In [9]:
seqtra.ingest(DIR_PATH)

Ingestion: This can take time for uploading and ingesting the data into our database...
Uploaded files ingested into Seqtra database!
Ingestion time: 0:00:18.258738


## Query the ingested files
You need to set several parameters here:
1. **query**: The question you want to ask about the ingested files.
2. **num_seed_nodes**: Equivalent to the top-k parameter in RAG. It is named this way in our service because graph linkages are traversed during chunking and retrieval. You may tune it for your use case.
3. **chunk_only**: Setting this to false returns an LLM-generated answer to the query along with the retrieved chunks. Setting it to true returns only the relevant chunks.
4. **strategy**: We currently provide four strategies to chunk and retrieve relevant context for the given query:<br><br>
   &emsp;a) **seed_only**: Similar to conventional vector-based retrieval: it retrieves only the chunks that are relevant to the given query and that are, by definition, independent of each other. These are labeled as chunk during retrieval, but within the database their actual categories are the text-related class labels of [DocLayNet](https://arxiv.org/pdf/2206.01062), including "Text", "List-item" and so on.<br><br>
   &emsp;b) **seed_extended**: In addition to a), it retrieves additional context, i.e. the other paragraphs and list items of the document section in which the given seed chunk is embedded.<br><br>
   &emsp;c) **graph**: Along with the chunks from "seed_only", it retrieves additional chunks that are related to the seed chunk, providing additional context. This relationship is established during the ingestion phase through either conceptual linkages or hyperlink linkages internal to the document (for example, some text pointing to another paragraph or section within the document).<br><br>
   &emsp;d) **graph_extended**: Combines the "graph" strategy with the "seed_extended" strategy: it retrieves the sibling texts of the seed chunk along with the graph linkages.<br>  
   You may explore the strategies and adopt the one best suited to your use case and the nature of your documents. For example, the "graph" strategy might suffice for paragraph-heavy documents, while legal documents with list-heavy clauses might require the "graph_extended" strategy. c) and d) are our major offerings; a) and b) are provided as additional options that you may also find in other services.
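
Since the strategy is passed as a plain string, a typo would only surface once the request reaches the service. The sketch below checks the four parameters locally before they are handed to `seqtra.query`. `build_query_kwargs`, `VALID_STRATEGIES`, the default values, and the rule that `num_seed_nodes` must be at least 1 are assumptions made for illustration; they are not part of SeqtraClient.

```python
# The four strategy names described above.
VALID_STRATEGIES = ("seed_only", "seed_extended", "graph", "graph_extended")


def build_query_kwargs(query, num_seed_nodes=1, chunk_only=True, strategy="graph"):
    """Hypothetical helper: validate query parameters and return them as a dict.

    The defaults and the num_seed_nodes >= 1 rule are assumptions of this sketch.
    """
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(num_seed_nodes, bool) or not isinstance(num_seed_nodes, int):
        raise ValueError("num_seed_nodes must be an integer")
    if num_seed_nodes < 1:
        raise ValueError("num_seed_nodes must be at least 1")
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"unknown strategy {strategy!r}; expected one of {VALID_STRATEGIES}"
        )
    return {
        "query": query,
        "num_seed_nodes": num_seed_nodes,
        "chunk_only": bool(chunk_only),
        "strategy": strategy,
    }


kwargs = build_query_kwargs(
    "How can the Board and the CCO manage control functions?",
    num_seed_nodes=1,
    chunk_only=False,
    strategy="graph_extended",
)
# With an initialized client, the call would then be: seqtra.query(**kwargs)
```

Keeping the validation separate from the client call makes it easy to loop over `VALID_STRATEGIES` and compare the chunks each strategy returns for the same query.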

In [24]:
query = "How can the Board and the CCO manage control functions?"
num_seed_nodes = 1
chunk_only = False
strategy = "graph_extended"

In [25]:
response = seqtra.query(
    query = query,
    num_seed_nodes = num_seed_nodes,
    chunk_only = chunk_only,
    strategy = strategy
)

This could take some time as LLM generates the answer for the given query...
Query time: 0:00:08.733046


In [22]:
response.keys()

dict_keys(['chunks', 'graph', 'answer'])

In [26]:
# The answer will be empty if you set chunk_only to True
print(response['answer'])

# How the Board and the CCO Manage Control Functions

Based on the provided context, the Board and the Chief Compliance Officer (CCO) have specific responsibilities in managing control functions:

## Board's Responsibilities

The Board must approve the sharing of compliance function responsibilities between a dedicated compliance unit and other control functions <chunk>chunk_2</chunk>. Specifically, the Board must:

1. Approve the appointment, remuneration, and termination of the CCO
2. Ensure the CCO has sufficient stature for effective engagement with senior management
3. Regularly engage with the CCO to discuss issues faced by the compliance function
4. Provide the CCO with direct and unimpeded access to the Board
5. Ensure the CCO has sufficient resources and competent officers
6. Be satisfied that the overall control environment won't be compromised if the CCO carries out responsibilities for other control functions <chunk>chunk_10</chunk>

## CCO's Responsibilities

When complian

## How to Interpret the output JSON
Keys:

1. **"chunks"**: JSON object of ("chunk_i", "chunk_id") key-value pairs, where i runs from 1 to n (the number of chunks retrieved). "chunk_id" is the node id in the graph database.
2. **"answer"**: Answer to the given query based on the retrieved chunks. It is an empty string if "chunk_only" is set to true.
3. **"graph"**: JSON object with "nodes" and "edges" keys. Each is a list of JSON objects: every entry in "nodes" represents a graph node, and every entry in "edges" represents a graph edge. This graph represents the relationships among the chunks in the "chunks" key. If, for example, num_seed_nodes is set to 1 and you have used one of the graph strategies, one of the chunks is the seed node, and the additional nodes are retrieved because of their links to the seed node, as extracted during the ingestion stage. The "nodes" data also contains the PDF name, page number and bounding box information needed to locate the exact section of the chunk in the PDF. The bounding box is in the format (left, top, width, height).
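
As a minimal sketch of working with this output, the code below collects the chunk labels cited in the answer (the `<chunk>chunk_i</chunk>` tags seen in the printed answer above), looks up their node ids in "chunks", and converts a (left, top, width, height) bounding box into corner coordinates. The helper names and the `example` response are made up for illustration; in particular the node ids are placeholders, and the field names inside "nodes" are not shown here because the text above does not specify them.

```python
import re

# Matches citation tags such as <chunk>chunk_2</chunk> in the answer text.
CHUNK_TAG = re.compile(r"<chunk>(chunk_\d+)</chunk>")


def cited_chunk_ids(response):
    """Return (label, node_id) pairs for chunks cited in the answer.

    Labels are listed once each, in order of first citation. A label that
    is missing from response["chunks"] maps to None.
    """
    labels = []
    for label in CHUNK_TAG.findall(response.get("answer") or ""):
        if label not in labels:
            labels.append(label)
    chunks = response.get("chunks") or {}
    return [(label, chunks.get(label)) for label in labels]


def bbox_to_corners(bbox):
    """Convert (left, top, width, height) to (left, top, right, bottom)."""
    left, top, width, height = bbox
    return (left, top, left + width, top + height)


# Made-up response with the three documented keys; node ids are placeholders.
example = {
    "chunks": {"chunk_1": "node-a", "chunk_2": "node-b", "chunk_10": "node-c"},
    "graph": {"nodes": [], "edges": []},
    "answer": (
        "The Board approves the sharing <chunk>chunk_2</chunk> and checks the "
        "control environment <chunk>chunk_10</chunk>, as noted "
        "<chunk>chunk_2</chunk>."
    ),
}

print(cited_chunk_ids(example))
# [('chunk_2', 'node-b'), ('chunk_10', 'node-c')]
print(bbox_to_corners((10, 20, 100, 50)))
# (10, 20, 110, 70)
```

The corner form is what many PDF and image libraries expect when drawing a highlight rectangle, which is why the conversion is included here.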

You may further rerank and filter the retrieved chunks if it fits your use case.


## End Session
This cleans up the session state in the backend to free up memory and avoid service interruption.

In [27]:
seqtra.end_session()

Successfully ended the session for the project with name test


## Delete Project
This deletes the previously uploaded files along with their associated graphs. When uploading new documents that are not associated with the previous project, please delete the project or create a new one.

In [28]:
SeqtraClient.remove(
    url=SEQTRA_API_URL,
    project_name=PROJECT_NAME, # Name of the project you want to delete
    api_token=SEQTRA_API_TOKEN
)

Removing project from the server...


'Successfully deleted project with name test'