# All of the Above, but More

So far, we have covered embeddings and similarity calculations. Together, these two concepts enable capabilities that matter in language tasks and that underpin agentic systems ([OpenAI, 2025](https://platform.openai.com/docs/guides/embeddings)):

- Search
- Clustering
- Recommendations
- Anomaly detection
- Diversity measurement
- Classification

## Document RAG

Some of these tasks are related to Retrieval-Augmented Generation (RAG). The diagram below shows how to split a document into chunks, compute an embedding for each chunk, and store the embeddings in a vector DB.

![](./img/02_document_rag_embed.png)

Once the embeddings are stored, we can use proximity search to find the chunk nearest to a given query. That chunk (and other related data) becomes context in the prompt sent to an LLM.

![](./img/02_document_rag_query.png)
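
As a rough illustration of the proximity search described above, the sketch below ranks stored chunks by cosine similarity to a query embedding. It is a toy under stated assumptions: the dictionary `toy_store` stands in for a vector DB, the three-dimensional vectors are made-up values, and the helper names `cosine_similarity` and `nearest_chunks` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_chunks(query_embedding, store, k=1):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [
        (chunk_id, cosine_similarity(query_embedding, embedding))
        for chunk_id, embedding in store.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up embeddings standing in for the contents of a vector DB.
toy_store = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.7, 0.7, 0.0],
}

# Made-up query embedding; in practice it comes from the same embedding model.
top = nearest_chunks([0.9, 0.1, 0.0], toy_store, k=2)
print(top)
```

A real vector DB typically replaces the exhaustive scan with an approximate index, but the ranking idea is the same.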

# Introducing LangChain

[LangChain](https://www.langchain.com/) is a set of tools that supports agent engineering across models from different providers. Among the many options available, the library is both useful and popular.

Some useful resources are:

- [LangChain Documentation](https://docs.langchain.com/).
- [Directory of LangChain Resources](https://www.langchain.com/resources).

## Document Splitting 

Document splitting, or chunking, is usually the first step in any RAG setup. We split documents into smaller sections to:

- Comply with the model's context length constraints.
- Enhance search quality.
- Reduce latency.
- Control costs.

![](./img/02_document_rag_embed.png)

LangChain contains a family of [document loaders](https://python.langchain.com/docs/integrations/document_loaders/). Each document loader has its own set of parameters, but they all implement the `.load()` method. A few examples include:

### Common File Types

- [CSVLoader](https://python.langchain.com/docs/integrations/document_loaders/csv): CSV files
- [DirectoryLoader](https://python.langchain.com/docs/how_to/document_loader_directory): All files in a given directory.
- [Unstructured](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file): Many file types (see https://docs.unstructured.io/platform/supported-file-types)
- [JSONLoader](https://python.langchain.com/docs/integrations/document_loaders/json): JSON files

### PDF

- [PyPDF](https://python.langchain.com/docs/integrations/document_loaders/pypdfloader): Uses [pypdf](https://pypi.org/project/pypdf/) to load and parse PDFs (Package).
- [Unstructured](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file): Uses [Unstructured's](https://pypi.org/project/unstructured/) open source library to load PDFs (Package).
- [PDFPlumber](https://python.langchain.com/docs/integrations/document_loaders/pdfplumber): Loads PDF files using [PDFPlumber](https://pypi.org/project/pdfplumber/) (Package).


### Web Pages

- [Web](https://python.langchain.com/docs/integrations/document_loaders/web_base): Uses urllib and BeautifulSoup to load and parse HTML web pages (Package).
- [Unstructured](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file): Uses [Unstructured](https://pypi.org/project/unstructured/) to load and parse web pages (Package).
- [RecursiveURL](https://python.langchain.com/docs/integrations/document_loaders/recursive_url): Recursively scrapes all child links from a root URL (Package).



## JSONLoader

[JSONLoader](https://python.langchain.com/docs/integrations/document_loaders/json/) implements a document loader for JSON files, including JSON Lines files. JSONLoader uses [`jq`](https://jqlang.org/) expressions to specify which parts of each record to extract.

A few notes on the code below:

- `jq_schema="."` indicates that we will read all keys from each JSON line. The [`jq` specification](https://jqlang.org/manual/#basic-filters) allows for flexible filtering.
- `content_key="content"` names the key whose value becomes the document text. It is required here because `jq_schema` returns a whole object with several keys rather than a single string.
- `json_lines=True` means that the file is a [JSON Lines file](https://jsonlines.org/). Each line of a JSON Lines file is a complete, valid JSON value.
- `metadata_func=get_metadata` indicates that we want to use the function `get_metadata()` to extract metadata from the filtered JSON line.

In [None]:
%load_ext dotenv
%dotenv ../../05_src/.secrets

In [None]:
from langchain_community.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os

In [None]:
def get_metadata(record:dict, metadata: dict) -> dict:
    metadata['reviewid'] = record.get('reviewid')
    return metadata

loader = JSONLoader("../../05_src/documents/pitchfork_content.jsonl", 
                    jq_schema=".",
                    content_key="content",
                    json_lines=True,
                    text_content=True,
                    metadata_func=get_metadata)

In [None]:
data = loader.load()
data

In [None]:
data[1].to_json()
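
To make the roles of `content_key`, `json_lines`, and `metadata_func` more concrete, here is a minimal standard-library imitation of what the loader does with each line. It is a sketch, not LangChain's implementation: the function `load_json_lines` is hypothetical, the two sample records are made up, and documents are represented as plain dictionaries instead of `Document` objects.

```python
import json

def load_json_lines(text, content_key, metadata_func=None):
    """Turn JSON Lines text into a list of {'page_content', 'metadata'} dicts."""
    documents = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)  # each line is a standalone JSON value
        metadata = {"seq_num": len(documents) + 1}
        if metadata_func is not None:
            metadata = metadata_func(record, metadata)
        documents.append(
            {"page_content": str(record[content_key]), "metadata": metadata}
        )
    return documents

def get_metadata(record, metadata):
    """Copy the review identifier from the raw record into the metadata."""
    metadata["reviewid"] = record.get("reviewid")
    return metadata

# Two made-up records in the same shape as the notebook's input file.
sample = (
    '{"reviewid": 1, "content": "A warm, patient record."}\n'
    '{"reviewid": 2, "content": "Loud and brief."}\n'
)
docs = load_json_lines(sample, content_key="content", metadata_func=get_metadata)
print(docs[1])
```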

## Splitting Documents

There are good reasons to split documents. As explained in [LangChain's Documentation](https://python.langchain.com/docs/concepts/text_splitters/#why-split-documents):


- Handling non-uniform document lengths: Real-world document collections often contain texts of varying sizes. Splitting ensures consistent processing across all documents.
- Overcoming model limitations: Many embedding models and language models have maximum input size constraints. Splitting allows us to process documents that would otherwise exceed these limits.
- Improving representation quality: For longer documents, the quality of embeddings or other representations may degrade as they try to capture too much information. Splitting can lead to more focused and accurate representations of each section.
- Enhancing retrieval precision: In information retrieval systems, splitting can improve the granularity of search results, allowing for more precise matching of queries to relevant document sections.
- Optimizing computational resources: Working with smaller chunks of text can be more memory-efficient and allow for better parallelization of processing tasks.

## Text Splitters in LangChain

LangChain contains a family of [document splitters](https://docs.langchain.com/oss/python/integrations/splitters/index):

- Length-based: simple and intuitive approach that ensures a specific text length. Can be based on [characters](https://python.langchain.com/docs/how_to/character_text_splitter/) or [tokens](https://python.langchain.com/docs/how_to/split_by_token/).
- Text structure-based: tries to use the natural structure of text, including paragraphs, sentences, and words. More specifically:

    + The [RecursiveCharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter) attempts to keep larger units (e.g., paragraphs) intact.
    + If a unit exceeds the chunk size, it moves to the next level (e.g., sentences).
    + This process continues down to the word level if necessary.

- Document Structure-based: Uses the structure of documents in specific formats, including Markdown, HTML, and JSON.
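
The text structure-based strategy can be sketched in a few lines of plain Python. This is a simplified illustration, not the `RecursiveCharacterTextSplitter` implementation: the function `recursive_split` is hypothetical, it ignores chunk overlap, and it discards the separator at each cut.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring coarse separators (paragraphs) over fine ones (words)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The unit is too large: move to the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

text = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer than the first one."
)
for chunk in recursive_split(text, chunk_size=40):
    print(len(chunk), repr(chunk))
```

The short paragraph stays intact, while the long one is broken at word boundaries because it has no line breaks to split on.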

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000, 
    chunk_overlap=200, 
    length_function = len, 
    add_start_index = True
)

In [None]:
chunks = text_splitter.split_documents(data)
print(f'Split {len(data)} reviews (documents) into {len(chunks)} chunks.' )

Notice that the output documents (the "chunks") carry the following fields:

- `seq_num` (in `metadata`): a sequential number identifying the original document that the chunk came from.
- `start_index` (in `metadata`): the character offset in the original document at which the chunk starts.
- `page_content`: the text of the document chunk.

In [None]:
chunks

## Batch Embeddings

We now have a large number of documents for which we need embeddings. We could use a direct call to the Embeddings API. However, here we demonstrate how to request embeddings using the [Batch API](https://platform.openai.com/docs/api-reference/batch). From the documentation:

The Batch API is used to send asynchronous groups of requests. This API offers lower costs, a separate pool of significantly higher rate limits, and a clear 24-hour turnaround time. The service is ideal for processing jobs that don't require immediate responses. 

A couple of useful references are: 

- [Batch API Guide](https://platform.openai.com/docs/guides/batch)
- [API Reference](https://platform.openai.com/docs/api-reference/batch)

## Creating Batches

The batch process works as follows:

1. Prepare the batch file. Batches start with a .jsonl file where each line contains the details of an individual request to the API.
2. Upload the batch input file. We must upload the file first so that we can reference it when we create the batch.
3. Create the batch.    
4. Check status of the batch.
5. Retrieve the results.

In addition to the steps above, the API allows us to list all batches and to cancel a batch.

### 1. Prepare the Batch File

Batch processing using the API requires input files to follow a specific format. 

A few notes from the [documentation](https://platform.openai.com/docs/guides/batch#1-prepare-your-batch-file):

+ Batches start with a .jsonl file where each line contains the details of an individual request to the API. 
+ The available endpoints are:

    - Responses API: /v1/responses
    - Chat Completions API: /v1/chat/completions 
    - Embeddings API: /v1/embeddings 
    - Completions API: /v1/completions 
    - Moderations API: /v1/moderations 

+ For a given input file, the parameters in each line's body field are the same as the parameters for the underlying endpoint. 
+ Each request **must include a unique custom_id value**, which you can use to reference results after completion. 
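
Because every request needs a unique `custom_id`, it can be worth checking a batch file before uploading it. The sketch below is one possible check using only the standard library; the helper `find_duplicate_custom_ids` is hypothetical and the sample identifiers are made up.

```python
import json
from collections import Counter

def find_duplicate_custom_ids(jsonl_text):
    """Return the custom_id values that appear more than once."""
    ids = [
        json.loads(line)["custom_id"]
        for line in jsonl_text.splitlines()
        if line.strip()
    ]
    return sorted(cid for cid, count in Counter(ids).items() if count > 1)

# Made-up request lines; the second and third share a custom_id on purpose.
sample_lines = "\n".join(
    json.dumps(
        {
            "custom_id": cid,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": "text-embedding-3-small", "input": "some text"},
        }
    )
    for cid in ["101_1_0", "101_1_1800", "101_1_1800"]
)
print(find_duplicate_custom_ids(sample_lines))
```

An empty list means that all identifiers in the file are unique.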

#### Rate Limits

It is important to keep in mind the [API's rate limits](https://platform.openai.com/docs/guides/batch#rate-limits):


+ **Per-batch limits**: A single batch may include up to 50,000 requests, and a batch input file can be up to 200 MB in size. Note that /v1/embeddings batches are also restricted to a maximum of 50,000 embedding inputs across all requests in the batch.
+ **Enqueued prompt tokens per model**: Each model has a maximum number of enqueued prompt tokens allowed for batch processing. You can find these limits on the [Platform Settings](https://platform.openai.com/settings/organization/limits) page.

It is important to note: 

> There are no limits for output tokens or number of submitted requests for the Batch API today. Because Batch API rate limits are a new, separate pool, using the Batch API will not consume tokens from your standard per-model rate limits, thereby offering you a convenient way to increase the number of requests and processed tokens you can use when querying our API 



We must create files that contain the `page_content` and an identifier that would arguably include important metadata (like 'reviewid' and a chunk identifier) of our document chunks. We also want to create files that are within the rate limits (i.e., at most 50,000 documents per batch).
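
One way to respect both per-batch limits at once is to group the serialized request lines greedily. The sketch below is an illustration under assumptions: the function `plan_batches` and the two constants are hypothetical names, and "200 MB" is read as 200,000,000 bytes, which is the stricter interpretation.

```python
MAX_REQUESTS_PER_BATCH = 50_000
MAX_FILE_BYTES = 200 * 1000 * 1000  # assumption: 200 MB read as 200,000,000 bytes

def plan_batches(request_lines,
                 max_requests=MAX_REQUESTS_PER_BATCH,
                 max_bytes=MAX_FILE_BYTES):
    """Group serialized request lines into (start, end) index ranges so that
    each range stays within the request-count and file-size limits."""
    ranges, start, size = [], 0, 0
    for i, line in enumerate(request_lines):
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the newline
        too_many = i - start >= max_requests
        too_big = size + line_bytes > max_bytes
        if i > start and (too_many or too_big):
            ranges.append((start, i))  # close the current file
            start, size = i, 0
        size += line_bytes
    if start < len(request_lines):
        ranges.append((start, len(request_lines)))
    return ranges

# Five made-up lines of 10 bytes each (9 characters plus a newline).
lines = ["x" * 9] * 5
print(plan_batches(lines, max_requests=2, max_bytes=1000))
print(plan_batches(lines, max_requests=100, max_bytes=25))
```

When each request carries a single `input` string, as in this notebook, the request limit and the embedding-input limit coincide. A single line larger than `max_bytes` still gets its own range, since a request cannot be split.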

The batch definition jsonl should contain one line per request. [Each request is defined as](https://cookbook.openai.com/examples/batch_processing#creating-the-batch-file):

```
{
    "custom_id": <REQUEST_ID>,
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": <MODEL>,
        "messages": <MESSAGES>,
        // other parameters
    }
}
```


In [None]:
chunks[0].page_content

In [None]:
chunks[0].metadata['reviewid']

In [None]:

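import json
import math
import os

def prep_batch_file_for_embedding(input:list, output_path:str, max_lines_per_file:int=10000):
    total_lines = len(input)
    # Ceiling division: avoids creating an empty extra file when total_lines
    # is an exact multiple of max_lines_per_file.
    num_files = math.ceil(total_lines / max_lines_per_file)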
    print(f'Total lines: {total_lines}, Number of files to create: {num_files}')

    for num_file in range(num_files):
        start_index = num_file * max_lines_per_file
        end_index = min(start_index + max_lines_per_file, total_lines)
        output_file = os.path.join(output_path, f"pitchfork_reviews_batch_{num_file+1}.jsonl")
        print(f'Creating file: {output_file} with lines from {start_index} to {end_index-1}')
        create_single_batch_file(input, start_index, end_index, output_file)

def create_single_batch_file(input, start_index, end_index, output_file):
    with open(output_file, 'w') as outfile:
        for line in input[start_index:end_index]:
            custom_id = (
                    str(line.metadata['reviewid']) + "_" + 
                    str(line.metadata['seq_num']) + "_" + 
                    str(line.metadata['start_index'])
                )
            content = line.page_content
            out_dict = {
                    "custom_id": custom_id, 
                    "method": "POST", 
                    "url": "/v1/embeddings", 
                    "body": {
                        "model": "text-embedding-3-small", 
                        "input": content
                    }
                }
            outfile.write(json.dumps(out_dict) + '\n')
        
            

In [None]:
prep_batch_file_for_embedding(
    input=chunks, 
    output_path='../../05_src/documents/'
)

### 2. Upload the Input File

Before running the batch process, we will upload the files to the API. The Files API also lets us list and delete the files that we have uploaded.

#### List available files

In [None]:
from openai import OpenAI

client = OpenAI()
files = client.files.list()


In [None]:
files.to_dict()['data']

#### Remove Files

You can remove files from storage using code like the one below, which deletes all files in the account. 
Note: this is a destructive action that cannot be undone.

In [None]:
# for file in files.to_dict()['data']:
#     print(f'Deleting file: {file["filename"]}')
#     resp = client.files.delete(file["id"])
#     print(resp)
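
Since deleting every file in an account is rarely what we want, a narrower selection can be built first. The sketch below filters a list of file records, in the dictionary form returned by `files.to_dict()['data']`, by filename pattern and purpose. The helper `select_files` is hypothetical and the sample records are made up; only the selected records would then be passed to `client.files.delete()`.

```python
from fnmatch import fnmatch

def select_files(file_records, pattern, purpose=None):
    """Return the records whose filename matches pattern (and purpose, if given)."""
    return [
        record
        for record in file_records
        if fnmatch(record.get("filename", ""), pattern)
        and (purpose is None or record.get("purpose") == purpose)
    ]

# Made-up records with the keys used in this notebook.
sample_records = [
    {"id": "file-1", "filename": "pitchfork_reviews_batch_1.jsonl", "purpose": "batch"},
    {"id": "file-2", "filename": "notes.txt", "purpose": "assistants"},
    {"id": "file-3", "filename": "pitchfork_reviews_batch_2.jsonl", "purpose": "batch"},
]
to_delete = select_files(sample_records, "pitchfork_reviews_batch_*.jsonl", purpose="batch")
print([record["id"] for record in to_delete])
```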

#### Search and Upload Files

We search for the files that we created and upload them.

In [None]:
from glob import glob

batch_files = glob('../../05_src/documents/pitchfork_reviews_batch_*.jsonl')
batch_files

In [None]:
from openai import OpenAI
from tqdm import tqdm
client = OpenAI()


for b_file in tqdm(batch_files):
    # Open each file in a context manager so that the handle is closed after the upload.
    with open(b_file, "rb") as f:
        batch_input_file = client.files.create(
            file=f,
            purpose='batch'
        )
    print(batch_input_file)

### 3. Create Batches

As before, we can consult the files that we have in store:

In [None]:
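batch_files = client.files.list().to_dict()
# Keep only files uploaded with purpose='batch'; other files in the account
# (for example, the outputs of earlier batches) are not valid batch inputs.
batch_file_ids = [file['id'] for file in batch_files['data'] if file.get('purpose') == 'batch']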
batch_file_ids

Unlike the Files API, the Batch API offers no easy way of removing batches that are in a completed or failed state, so a clear description and the status are important for telling batches apart.

Now we can create the batch procedure. For each file, we create the batch with the call below:

In [None]:
my_id = "<add your id here>"  # replace the placeholder with your own identifier

In [None]:
from datetime import datetime

timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
batch_description = f"Pitchfork reviews content embeddings ({my_id}) {timestamp}"

for file_id in tqdm(batch_file_ids):
    client.batches.create(
            input_file_id = file_id,
            endpoint="/v1/embeddings",
            completion_window="24h",
            metadata={
                "description": batch_description,
                "timestamp": timestamp
            }
        )

In [None]:
batch_description

In [None]:
batch_processes = client.batches.list().to_dict()
# Other batches in the account may have no metadata or no description,
# so the description is looked up defensively in the filter.
batch_info = [
    {'batch_id': batch['id'],
     'description': batch['metadata']['description'],
     'status': batch['status'],
     'request_counts': batch.get('request_counts'),
     'output_file_id': batch.get('output_file_id')}
    for batch in batch_processes['data']
    if (batch.get('metadata') or {}).get('description') == batch_description
]
batch_info
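
When several batches are running, a compact summary is easier to read than the full list. The sketch below aggregates a list shaped like `batch_info`; the helper `summarize_batches` is hypothetical and the sample records are made up. It assumes that `request_counts` holds the `total`, `completed`, and `failed` counters.

```python
from collections import Counter

def summarize_batches(batch_info):
    """Count batches per status and add up their request counters."""
    statuses = Counter(batch["status"] for batch in batch_info)
    totals = Counter()
    for batch in batch_info:
        totals.update(batch.get("request_counts") or {})
    return dict(statuses), dict(totals)

# Made-up records in the same shape as batch_info.
sample_info = [
    {"batch_id": "batch_1", "status": "completed",
     "request_counts": {"total": 10, "completed": 10, "failed": 0}},
    {"batch_id": "batch_2", "status": "in_progress",
     "request_counts": {"total": 5, "completed": 2, "failed": 0}},
]
statuses, totals = summarize_batches(sample_info)
print(statuses)
print(totals)
```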

If you need to cancel a batch, you can use the code below:

In [None]:
# for batch in batch_info:
#     client.batches.cancel(batch['batch_id'])