<a href="https://colab.research.google.com/github/waghmareps12/RANDOM_COLLAB_LLM_NOTEBOOKS/blob/main/Selecting_an_embedding_model_for_custom_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Selecting an embedding model for your custom data

We recommend reading the ["Understanding embedding models: make an informed choice for your RAG"](https://unstructured.io/blog/understanding-embedding-models-make-an-informed-choice-for-your-rag) blog post before proceeding with this tutorial.

In this notebook, we'll build an end-to-end data processing pipeline using the Unstructured Serverless API and incorporate a model evaluation step into it. This way you can eliminate the guesswork: pick several promising candidates from the [Hugging Face MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard), choose the best one for your specific data, and add an embedding step with the best candidate to your Unstructured pipeline.

We'll be comparing the performance of three embedding models from the MTEB leaderboard:
* [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5): a strong general purpose embedding model that has 335M parameters, and achieves an average nDCG@10 of 54.29 on the MTEB leaderboard. On the financial-specific FiQA2018 dataset, it scores a respectable 45.02.
* [mukaj/fin-mpnet-base](https://huggingface.co/mukaj/fin-mpnet-base): a compact text embedding model with only 109M parameters, fine-tuned on financial data. While its average nDCG@10 isn't listed on the leaderboard, it shines on FiQA2018 with an impressive score of 79.91.
* [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l): a general purpose text embedding model with 334M parameters that ranks among the top performers on the leaderboard. It achieves an average nDCG@10 of 55.98. On FiQA2018, its score is 44.71.

To demonstrate the evaluation process, we'll use publicly available financial reports as "custom data", specifically, annual Form 10-K reports from a couple of Fortune 500 companies. These reports, required by the U.S. Securities and Exchange Commission (SEC), offer a deep dive into a company's financial performance. They go beyond the typical annual report, providing detailed information on corporate history, financial statements, earnings per share, and other crucial data points. For investors, 10-Ks are invaluable tools for making informed decisions. You can easily access and download these reports by visiting the website of any publicly traded US company.

To reproduce the notebook, download the following PDFs and place them into a local `PDFS` directory:
* Direct link: [Walmart Form-10K SEC filing for 2023](https://d18rn0p25nwr6d.cloudfront.net/CIK-0000104169/dfe6ee99-8fe6-4333-80ac-829d9e7595fa.pdf), or find it on [stock.walmart.com/financials](https://stock.walmart.com/financials/sec-filings/default.aspx)
* Direct link: [Exxon Mobil Form-10K SEC filing for 2023](https://investor.exxonmobil.com/sec-filings/annual-reports/content/0000034088-24-000018/0000034088-24-000018.pdf), or find it on [https://ir.exxonmobil.com/sec-filings](https://ir.exxonmobil.com/sec-filings)

To evaluate our chosen models on the Form-10-K PDFs as custom data, we’ll go through the following steps:
* Process the raw data from PDFs to ready-to-use chunks
* Generate a synthetic evaluation dataset from the chunks
* Set up and query three retrievers, each with a different embedding model
* Compare performance and pick the best model
* Integrate the best model into the pipeline as an embedding step


## Pre-requisites and setup

In this notebook we'll be generating a synthetic dataset, and you will need an LLM for that.
We ran this notebook locally with the `llama3.1:8b` model via Ollama to generate the synthetic dataset. For a toy example with just two documents (even though they total 386 pages!), this is good enough, but in a real-world scenario you may want to switch to a more powerful model, possibly from a model provider such as OpenAI or Anthropic.
We've also included alternative code for calling Claude 3.5 Sonnet. Pick whichever you prefer.  

To run the notebook locally with Ollama:
* Go to https://ollama.com and download the app for your OS, then pull the model onto your local machine: `ollama pull llama3.1:8b`
* `pip install ollama`

To run the notebook with Claude 3.5 Sonnet from Anthropic:
* Acquire an [Anthropic API key](https://www.anthropic.com/claude) and save it to a local `.env` file as `ANTHROPIC_API_KEY`.
* `pip install anthropic`  

Install the rest of the necessary libraries:

* `unstructured` & `unstructured-ingest` for preprocessing documents.
* `python-dotenv` to load the environment variables from a `.env` file
* `chromadb` and `langchain` to set up retrievers with different embedding models

To use this example, you'll need to get an [Unstructured API key](https://unstructured.io/api-key-hosted). The Unstructured Serverless API comes with a 14-day trial capped at 1000 pages per day.
Save the Unstructured API key as `UNSTRUCTURED_API_KEY`, and the Unstructured API URL (you can get it from your personal dashboard) as `UNSTRUCTURED_URL`, in a local `.env` file.

In [None]:
!pip install -qU "unstructured-ingest[pdf, embed-huggingface]" unstructured python-dotenv langchain chromadb ollama

## Load the environment variables

Load the environment variables from a local file.

In [None]:
import os
import dotenv

dotenv.load_dotenv('.env')

True
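A `True` result from `load_dotenv` does not by itself confirm that every key this notebook needs is present. The following is a minimal sketch of an explicit check; `find_missing_vars` and `REQUIRED_VARS` are hypothetical names introduced here for illustration, not part of `python-dotenv`.

```python
import os

# Names used in this notebook; ANTHROPIC_API_KEY is only needed for the Claude option.
REQUIRED_VARS = ["UNSTRUCTURED_API_KEY", "UNSTRUCTURED_URL"]

def find_missing_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after loading the `.env` file surfaces a missing key before the pipeline makes any API calls.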

Import the libraries. Uncomment `import anthropic` if you plan to use Claude.

In [None]:
import json
import ollama
# import anthropic
import pandas as pd

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig

from unstructured.staging.base import elements_from_json
from unstructured.staging.base import elements_to_dicts
from unstructured.staging.base import dict_to_elements
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores import utils as chromautils
from langchain_community.embeddings import HuggingFaceEmbeddings

## Preprocess PDFs from a source location

The Form 10-K reports are in PDF format, so the first step is to preprocess them - partition to extract text, and chunk them.

Unstructured simplifies the data processing from any source to any destination with its ingestion pipeline. We'll configure the pipeline with a local source connector to read the PDFs from a local directory and a local destination connector to store the processed data. The Unstructured processing pipeline can be assembled from a number of configurations:

* `ProcessorConfig` describes general behavior such as logs verbosity, number of processes, etc.
* `LocalIndexerConfig`, `LocalDownloaderConfig`, and `LocalConnectionConfig` control data ingestion from a local source. You only need to provide the path to your local directory with PDFs here.
* `PartitionerConfig`: use it to supply your credentials for the Unstructured Serverless API, and customize the partitioning behavior, e.g. what partitioning strategy to use, whether to exclude some types of metadata, etc. In this case, we use the `fast` strategy to partition the files, as the PDFs are not complex and contain text only.
* `ChunkerConfig`: after partitioning, we chunk the documents into reasonably sized chunks that do not exceed the input size of any of the embedding models we'll be evaluating.
* `LocalUploaderConfig`: specify a local directory to write the processed files to.

In [None]:
Pipeline.from_configs(
    context=ProcessorConfig(
        verbose=True,
        tqdm=True,
        num_processes=5
    ),
    indexer_config=LocalIndexerConfig(input_path="PDFS"),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_endpoint=os.getenv("UNSTRUCTURED_URL"),
        strategy="fast", # for complex image-based PDFs replace this with "hi_res"
        additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        chunk_max_characters=1500,
        chunk_overlap = 150,
        ),
    uploader_config=LocalUploaderConfig(output_dir="local-ingest-output")
).run()

Since we've selected the `fast` strategy for this preprocessing pipeline, it should take only a minute or two to preprocess the two large PDFs. You can learn more about the partitioning strategies Unstructured offers in the [documentation](https://docs.unstructured.io/api-reference/api-services/partitioning).

Once the pipeline finishes running, you'll find two `*.json` documents, one per original PDF file, in the `local-ingest-output` directory. These files contain the chunks extracted from the original documents that we'll use in the next step to build an evaluation dataset.
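Before building on these files, it can help to sanity-check the chunk sizes. The sketch below assumes each output file is a JSON array of element dictionaries with a `text` field, which is how the loading code later in this notebook treats them; `chunk_length_stats` and `summarize_output_dir` are hypothetical helpers, not part of Unstructured.

```python
import json
import os
from statistics import mean

def chunk_length_stats(chunks):
    """Summarize text lengths (in characters) for a list of chunk dictionaries."""
    lengths = [len(chunk.get("text", "")) for chunk in chunks]
    if not lengths:
        return {"count": 0, "max_chars": 0, "mean_chars": 0.0}
    return {"count": len(lengths), "max_chars": max(lengths), "mean_chars": mean(lengths)}

def summarize_output_dir(directory):
    """Print length statistics for every JSON file in the pipeline's output directory."""
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(".json"):
            with open(os.path.join(directory, filename), encoding="utf-8") as f:
                print(filename, chunk_length_stats(json.load(f)))

# Only runs if the pipeline output from the previous step exists
if os.path.isdir("local-ingest-output"):
    summarize_output_dir("local-ingest-output")
```

Comparing `max_chars` with the `chunk_max_characters` setting is a quick way to confirm the chunker behaved as configured.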

## Create an evaluation dataset

Because we don't have real user queries for our 10-K data, we'll create a synthetic evaluation dataset. A synthetic dataset is better than no evaluation at all, and you can replace it with actual user queries once you have them.

To create the dataset we will generate question-answer pairs for each of the document chunks. First, let's load all the processed files from the output directory:

In [None]:
def load_processed_files(directory_path):
    """
    Reads all preprocessed data from JSON files in the given directory and returns elements as a list
    """
    elements = []
    for filename in os.listdir(directory_path):
        if filename.endswith('.json'):
            file_path = os.path.join(directory_path, filename)
            try:
                elements.extend(elements_to_dicts(elements_from_json(filename=file_path)))
            except IOError:
                print(f"Error: Could not read file {filename}.")

    return elements

In [None]:
elements = load_processed_files("local-ingest-output")

len(elements)

1082

Let's add a helper function that parses string LLM responses into a list of dictionaries. It also adds the `context` (chunk content) and the ID of the chunk each question is based on (stored under `id`), so that we can later check whether that chunk was retrieved:

In [None]:
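def convert_qa_string_to_dict(input_string, chunk_id, chunk_text):
    """
    Converts a string response from an LLM to a list of dictionaries with question-answer-context entries.
    Returns an empty list if the response cannot be parsed.
    """
    try:
        result = json.loads(input_string)
        questions = result["questions"]
        for question in questions:
            question['id'] = chunk_id
            question['context'] = chunk_text
        return questions
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        return []
    except (KeyError, TypeError) as e:
        # Valid JSON, but not in the expected {"questions": [...]} shape
        print(f"Unexpected response structure: {e}")
        return []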


To create the synthetic evaluation dataset, we'll go over the chunks, and for each chunk we'll prompt the local `llama3.1:8b` model to generate two question/answer pairs. To use Claude instead, use the commented-out section.

In [None]:
def generate_chunk_qa_pairs(element):
    """
    Uses a local LLM to generate two question-answer pairs for an individual chunk, then
    parses the string response to a Python dictionary.
    """

    prompt = """
    You are an assistant specialized in RAG tasks. \n
    The task is the following: given a document chunk, you will have to
    generate questions that can be asked by a user to retrieve information from
    a large documentary corpus. \n
    The question should be relevant to the chunk, and should not be too specific
    or too general. The question should be about the subject of the chunk, and
    the answer needs to be found in the chunk. \n

    Remember that the question is asked by a user to get some information from a
    large documentary corpus. \n

    Generate a question that could be asked by a user without knowing the existence and the content of the corpus. \n
    Also generate the answer to the question, which should be found in the
    document chunk.  \n
    Generate TWO pairs of questions and answers per chunk in a
    dictionary with the following format, your answer should ONLY contain this dictionary, NOTHING ELSE: \n
    {
        "questions": [
            {
                "question": "XXXXXX",
                "answer": "YYYYYY"
            },
            {
                "question": "XXXXXX",
                "answer": "YYYYYY"
            }
        ]
    }
    where XXXXXX is the question, YYYYYY is the corresponding answers that could be as long as needed. \n
    Note: If there are no questions to ask about the chunk, return an empty list.
    Focus on making relevant questions concerning the page. \n
    Here is the chunk: \n
"""

    response = ollama.generate('llama3.1:8b', prompt + element['text'])
    return convert_qa_string_to_dict(response['response'], element['element_id'], element['text'])

    # replace with the following if you want to switch to Claude3.5-sonnet
    # client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
    # response = client.messages.create(
    #     model="claude-3-5-sonnet-20240620",
    #     max_tokens=1024,
    #     messages=[
    #         {"role": "user", "content": prompt + element['text']}
    #     ]
    # )
    # return convert_qa_string_to_dict(response.content[0].text, element['element_id'], element['text'])


In [None]:
def generate_qa_pairs_dataset(elements):
    """
    Creates a dataset of question-answer-context pairs from a dictionary with elements.
    """

    dataset = []
    for el in elements:
        dataset.extend(generate_chunk_qa_pairs(el))
    return dataset

Running the following cell can take a long time, depending on your hardware, the model you use, and the number and size of your documents.
You may also see a few JSON parsing errors. This isn't a big issue in this case: we simply skip malformed question-answer pairs and still have plenty left for the dataset.
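If you would rather recover some of those replies than skip them, one option is a more forgiving parser. The sketch below is an assumption about what typically goes wrong (extra text around the JSON, trailing commas), not a description of any library's behavior; `extract_json_object` is a hypothetical helper.

```python
import json
import re

def extract_json_object(text):
    """
    Best-effort parse of an LLM reply: take the outermost {...} span,
    drop trailing commas before a closing bracket, then parse as JSON.
    Returns None if no parseable object is found.
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = text[start:end + 1]
    # Caveat: this would also alter a ",}" or ",]" sequence inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None

reply = 'Sure! {"questions": [{"question": "Q1", "answer": "A1",},]} Hope this helps.'
print(extract_json_object(reply))
```

This trades strictness for coverage, so it is worth spot-checking a few recovered pairs before trusting them in the evaluation set.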


In [None]:
eval_dataset = generate_qa_pairs_dataset(elements)

Once you have generated the dataset, save the results into a local `*.csv` file.

In [None]:
def save_dataset_as_csv(dict_list, output_file):
    """
    Saves a list of dictionaries with QA pairs as a CSV file.
    """

    df = pd.DataFrame(dict_list)
    df = df[df['question'].notna()]
    df.to_csv(output_file, index=False)
    print(f"DataFrame saved to {output_file}")

save_dataset_as_csv(eval_dataset, "qa_pairs_dataset.csv")

DataFrame saved to qa_pairs_dataset.csv


Here's what the dataset looks like.

In [None]:
df = pd.read_csv("qa_pairs_dataset.csv")
df.head()

| | question | answer | id | context |
|---|---|---|---|---|
| 0 | What is the name of the corporation that submi... | Exxon Mobil Corporation | 10982f6a0008dfa086cd91f90df940e7 | 2023\n\nUNITED STATES SECURITIES AND EXCHANGE ... |
| 1 | What type of securities are registered with th... | Common Stock, without par value 0.142% Notes d... | 10982f6a0008dfa086cd91f90df940e7 | 2023\n\nUNITED STATES SECURITIES AND EXCHANGE ... |
| 2 | What is the name of the stock exchange mention... | New York Stock Exchange | f9adc4b9c1da74f14e91b510533ef9d6 | XOM XOM24B XOM28 XOM32 XOM39A\n\nNew York Stoc... |
| 3 | Is the registrant a well-known seasoned issuer... | Yes | f9adc4b9c1da74f14e91b510533ef9d6 | XOM XOM24B XOM28 XOM32 XOM39A\n\nNew York Stoc... |
| 4 | What is the category of the first item listed? | Large accelerated filer | 5399d22de8c81de1534b9bb3a067a3f1 | Large accelerated filer Non-accelerated filer\... |


## Set up retrievers and collect responses to questions

Now that we have our synthetic dataset and processed documents, it's time to put our embedding models to the test. For each model, we'll:
* Set up a retriever with chunks using LangChain and ChromaDB: We'll initialize a retriever using Chroma and the chosen embedding model.
* Load the questions from our synthetic dataset.
* Collect retrieval results: We'll use the retriever to find N most relevant document chunks for each question in the evaluation dataset.
* We'll store the retrieved document IDs along with the corresponding questions.

The following function does just that:

In [None]:
def setup_and_query_rag(embedding_model, documents, eval_dataset, output_directory, n_to_retrieve=10):

    elements = load_processed_files(documents)
    staged_elements = dict_to_elements(elements)

    documents = []

    for element in staged_elements:
        metadata = element.metadata.to_dict()
        metadata['element_id'] = element._element_id
        metadata.pop('orig_elements', None)  # drop the serialized original elements if present
        documents.append(Document(page_content=element.text, metadata=metadata))

    documents = chromautils.filter_complex_metadata(documents)
    db = Chroma.from_documents(documents, HuggingFaceEmbeddings(model_name=embedding_model))
    retriever =  db.as_retriever(search_type="similarity", search_kwargs={"k": n_to_retrieve})

    df = pd.read_csv(eval_dataset)
    df = df[df['question'].notna()]
    questions = df["question"].to_list()

    results = []
    for question in questions:
        try:
            retrieved_documents = retriever.invoke(question)
            retrieved_ids = [doc.metadata['element_id'] for doc in retrieved_documents]
            results.append({"question": question, "retrieved_ids": retrieved_ids})
        except Exception as e:
            print(f"Skipped question: {question} ({e})")

    os.makedirs(output_directory, exist_ok=True)
    file_path = os.path.join(output_directory, f"{embedding_model.replace('/', '@')}-{n_to_retrieve}.csv")

    df = pd.DataFrame(results)
    df.to_csv(file_path, index=False)
    print(f"DataFrame saved to {file_path}")
    db.delete_collection()
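Conceptually, the retriever inside this function embeds the question and returns the chunks whose embeddings are closest to it. The toy sketch below illustrates that idea with cosine similarity over hand-written 2D vectors; it is a stand-in for intuition only, and Chroma's actual distance metric and index may differ. All names and vectors here are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_ids(query_vector, chunk_vectors, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vectors,
        key=lambda chunk_id: cosine_similarity(query_vector, chunk_vectors[chunk_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings, chosen only to make the ranking easy to follow
toy_chunks = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.7, 0.7],
    "chunk-c": [0.0, 1.0],
}
print(top_k_ids([0.9, 0.1], toy_chunks, k=2))
```

A different embedding model changes the vectors, and therefore the ranking, which is exactly what the evaluation compares.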

Let's collect 10 retrieval results for each question from each retriever.

In [None]:
models = ["BAAI/bge-large-en-v1.5", "mukaj/fin-mpnet-base", "Snowflake/snowflake-arctic-embed-l"]

for model in models:
    setup_and_query_rag(model, "local-ingest-output", "qa_pairs_dataset.csv", "retriever_results")

DataFrame saved to retriever_results/BAAI@bge-large-en-v1.5-10.csv
DataFrame saved to retriever_results/mukaj@fin-mpnet-base-10.csv
DataFrame saved to retriever_results/Snowflake@snowflake-arctic-embed-l-10.csv


Next, let's collect 100 retrieval results for each question from each retriever.

In [None]:
for model in models:
    setup_and_query_rag(model, "local-ingest-output", "qa_pairs_dataset.csv", "retriever_results_100", 100)


DataFrame saved to retriever_results_100/BAAI@bge-large-en-v1.5-100.csv
DataFrame saved to retriever_results_100/mukaj@fin-mpnet-base-100.csv
DataFrame saved to retriever_results_100/Snowflake@snowflake-arctic-embed-l-100.csv


## Calculate the metrics and compare the results

Once you have the results from each of the retrievers, let's calculate some metrics.
In this example, we'll use two metrics: Recall, and MRR.

Since the evaluation dataset has one relevant chunk per question, the average Recall will tell us how often this chunk was retrieved _at all_ in the K retrieved documents. The value of 1 would mean that we retrieved the relevant chunk for every question (without taking into account its position in the list of retrieved chunks), the value of 0 would mean that the relevant chunk was never retrieved for any question. The higher the average recall, the better.

The MRR (mean reciprocal rank) is the average of 1/rank of the relevant chunk over all questions, where a question counts as 0 if the relevant chunk was not retrieved. For example, MRR = 1 would mean the relevant chunk was always the first result, while a single question whose relevant chunk is ranked second contributes 1/2 to the average. The higher the MRR, the better.
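To make the two definitions concrete, here is a small worked example on made-up retrieval results with one relevant chunk per question. The ids and the helper name `hit_and_reciprocal_rank` are hypothetical and only serve the illustration.

```python
def hit_and_reciprocal_rank(retrieved_ids, relevant_id, k):
    """Return (hit, reciprocal_rank) for one question, looking only at the top k results."""
    top = retrieved_ids[:k]
    if relevant_id not in top:
        return 0, 0.0
    return 1, 1.0 / (top.index(relevant_id) + 1)

# (retrieved ids, relevant id) for three toy questions
toy_results = [
    (["c1", "c7", "c3"], "c1"),  # relevant chunk at rank 1
    (["c4", "c2", "c9"], "c2"),  # relevant chunk at rank 2
    (["c5", "c6", "c8"], "c0"),  # relevant chunk not retrieved
]

scores = [hit_and_reciprocal_rank(ids, rel, k=3) for ids, rel in toy_results]
recall = sum(hit for hit, _ in scores) / len(scores)
mrr = sum(rr for _, rr in scores) / len(scores)
print(f"Recall@3 = {recall:.3f}, MRR@3 = {mrr:.3f}")
```

Two of three questions are hits, so recall is 2/3, and the reciprocal ranks 1, 1/2, and 0 average to 0.5.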

In [None]:
import ast

def calculate_retrieval_metrics(evaluation_data: pd.DataFrame, retrieval_results: pd.DataFrame, top_k=10):
    """
    Computes average Recall and MRR over the top_k retrieved chunks for every question.
    """
    # Map each question to the id of the chunk it was generated from
    # (if a question appears more than once, the last occurrence wins)
    correct_ids = dict(zip(evaluation_data["question"], evaluation_data["id"]))
    retrieval_list = retrieval_results.to_dict('records')
    recall = []
    ranks = []

    for item in retrieval_list:
        question = item["question"]
        # The ids were saved to CSV as the string form of a list; parse it safely
        retrieved_ids = ast.literal_eval(item["retrieved_ids"])[:top_k]
        correct_id = correct_ids.get(question)

        if correct_id in retrieved_ids:
            recall.append(1)
            rank = retrieved_ids.index(correct_id) + 1
            ranks.append(1 / rank)
        else:
            recall.append(0)
            ranks.append(0)

    # Calculate average metrics
    avg_recall = sum(recall) / len(retrieval_list)
    mrr = sum(ranks) / len(retrieval_list)
    metrics = {
        'Recall': avg_recall,
        'MRR': mrr,
    }

    return metrics

Let's calculate the metrics for 10 retrieved results:

In [None]:
eval_dataset = pd.read_csv("qa_pairs_dataset.csv")
directory_with_retrieval_results = "retriever_results"
k = 10
all_metrics = dict()

for filename in os.listdir(directory_with_retrieval_results):
    if filename.endswith('.csv'):
        file_path = os.path.join(directory_with_retrieval_results, filename)
        try:
            model_name = filename[:-4].rsplit('-', 1)[0].replace('@', '/')
            retrieval_results = pd.read_csv(file_path)
            all_metrics[model_name] = calculate_retrieval_metrics(eval_dataset, retrieval_results, top_k=k)
        except IOError:
            print(f"Error: Could not read file {filename}.")

In [None]:
all_metrics

{'Snowflake/snowflake-arctic-embed-l': {'Recall': 0.3502610346464167,
  'MRR': 0.20764157268666042},
 'BAAI/bge-large-en-v1.5': {'Recall': 0.8794494542002848,
  'MRR': 0.6415374677002579},
 'mukaj/fin-mpnet-base': {'Recall': 0.8239202657807309,
  'MRR': 0.5528548074822408}}

In [None]:
eval_dataset = pd.read_csv("qa_pairs_dataset.csv")
directory_with_retrieval_results = "retriever_results_100"
k = 100
all_metrics_100 = dict()

for filename in os.listdir(directory_with_retrieval_results):
    if filename.endswith('.csv'):
        file_path = os.path.join(directory_with_retrieval_results, filename)
        try:
            model_name = filename[:-4].rsplit('-', 1)[0].replace('@', '/')
            retrieval_results = pd.read_csv(file_path)
            all_metrics_100[model_name] = calculate_retrieval_metrics(eval_dataset, retrieval_results, top_k=k)
        except IOError:
            print(f"Error: Could not read file {filename}.")

all_metrics_100

{'Snowflake/snowflake-arctic-embed-l': {'Recall': 0.7446606549596583,
  'MRR': 0.24060553153830655},
 'BAAI/bge-large-en-v1.5': {'Recall': 0.9905078310393926,
  'MRR': 0.6660323162790893},
 'mukaj/fin-mpnet-base': {'Recall': 0.9857617465590888,
  'MRR': 0.5725198750014092}}

As we can see, for 10 retrieved documents, `'BAAI/bge-large-en-v1.5'` has both the highest recall and the highest MRR, so it is the clear winner. However, I wouldn't disregard `'mukaj/fin-mpnet-base'`. It is a close second, and the recall gap nearly closes when we retrieve 100 results instead of 10. It's also about three times smaller than `'BAAI/bge-large-en-v1.5'` (109M vs. 335M parameters), so it might still be a good choice, especially if you add a reranking step to your RAG. For simplicity, here we'll pick the model that had the best recall on 10 retrievals.
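To put the comparison side by side, the sketch below ranks the models and computes the recall gap between the top two at each cutoff. The numbers are copied from the outputs above and rounded to three decimals; `rank_models` and `recall_gap` are hypothetical helpers.

```python
# Metrics copied from the notebook outputs, rounded to three decimals
metrics_at_10 = {
    "Snowflake/snowflake-arctic-embed-l": {"Recall": 0.350, "MRR": 0.208},
    "BAAI/bge-large-en-v1.5": {"Recall": 0.879, "MRR": 0.642},
    "mukaj/fin-mpnet-base": {"Recall": 0.824, "MRR": 0.553},
}
metrics_at_100 = {
    "Snowflake/snowflake-arctic-embed-l": {"Recall": 0.745, "MRR": 0.241},
    "BAAI/bge-large-en-v1.5": {"Recall": 0.991, "MRR": 0.666},
    "mukaj/fin-mpnet-base": {"Recall": 0.986, "MRR": 0.573},
}

def rank_models(metrics, key="Recall"):
    """Return model names sorted by the given metric, best first."""
    return sorted(metrics, key=lambda name: metrics[name][key], reverse=True)

def recall_gap(metrics, model_a, model_b):
    """Difference in recall between two models (positive if model_a is ahead)."""
    return metrics[model_a]["Recall"] - metrics[model_b]["Recall"]

for label, metrics in (("k=10", metrics_at_10), ("k=100", metrics_at_100)):
    gap = recall_gap(metrics, "BAAI/bge-large-en-v1.5", "mukaj/fin-mpnet-base")
    print(label, rank_models(metrics), f"recall gap between top two: {gap:.3f}")
```

The ordering is the same at both cutoffs, but the gap shrinks from roughly 0.055 to roughly 0.005, which is why the smaller model remains a reasonable candidate when a reranker follows retrieval.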

In [None]:
model_with_max_recall = max(all_metrics, key=lambda k: all_metrics[k]['Recall'])
model_with_max_recall

'BAAI/bge-large-en-v1.5'

## Complete the preprocessing pipeline with embedding and upload steps

Once we have our choice of the best embedding model, we can simply add an embedding step to the existing pipeline and run the pipeline one more time. The results of partitioning and chunking are already cached, so the pipeline will pick up at the embedding step and won't re-process the documents from scratch.

Here we change the destination to a different local directory, but you can set up a vector store as a destination instead.
Find out how to configure your favorite vector store as a destination connector in the Unstructured [documentation](https://docs.unstructured.io/api-reference/ingest/destination-connector/overview).

In [None]:
Pipeline.from_configs(
    context=ProcessorConfig(
        verbose=True,
        tqdm=True,
        num_processes=20,
    ),
    indexer_config=LocalIndexerConfig(input_path="PDFS"),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_endpoint=os.getenv("UNSTRUCTURED_URL"),
        strategy="fast",
        additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        chunk_max_characters=1500,
        chunk_overlap = 150,
        ),
    embedder_config=EmbedderConfig(
        embedding_provider="langchain-huggingface",
        embedding_model_name=model_with_max_recall, # use the model with the highest recall
    ),
    # We're changing the output location here, but you can switch to a vector store as a destination
    uploader_config=LocalUploaderConfig(output_dir="outputs-with-embeddings")
).run()

2024-08-19 18:39:10,872 MainProcess INFO     Created index with configs: {"input_path": "PDFS", "recursive": false}, connection configs: {"access_config": {}}
2024-08-19 18:39:10,873 MainProcess INFO     Created download with configs: {"download_dir": null}, connection configs: {"access_config": {}}
2024-08-19 18:39:10,874 MainProcess INFO     Created partition with configs: {"strategy": "fast", "ocr_languages": null, "encoding": null, "additional_partition_args": {"split_pdf_page": true, "split_pdf_allow_failed": true, "split_pdf_concurrency_level": 15}, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": [], "metadata_include": [], "partition_endpoint": "https://api.unstructuredapp.io/general/v0/general", "partition_by_api": true, "api_key": "*******", "hi_res_model_name": null}
2024-08-19 18:39:10,875 MainProcess INFO     Created chunk with configs: {"chunking_strategy": "by_title"