# Elasticsearch Document Processing Pipeline

This notebook implements a pipeline that processes EU legal documents and indexes them in Elasticsearch. It covers:
- Text chunking and vector embeddings
- Named Entity Recognition (NER) annotations
- Full-text search capabilities
- Duplicate detection and removal

## Configuration and Setup

In [1]:
# Configuration
ELASTICSEARCH_HOST = "http://localhost:9201"
INDEX_NAME = "eu_legislation"
JSON_FOLDER = "./output"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100
VECTOR_DIMS = 768
MODEL_NAME = "Alibaba-NLP/gte-multilingual-base"

# Imports
from elasticsearch import Elasticsearch, helpers
import os
import json
import hashlib
import torch
import nltk
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter
import logging

# Initialize Elasticsearch client
es = Elasticsearch(ELASTICSEARCH_HOST, verify_certs=False, request_timeout=60)

# Test connection
try:
    response = es.info()
    es_version = response["version"]["number"]
    print(f"Connected to Elasticsearch Server Version: {es_version}")
except Exception as e:
    print(f"Connection failed: {e}")

Connected to Elasticsearch Server Version: 8.13.3


## Utility Functions

In [2]:
def get_string_hash(input_string):
    """Generate SHA256 hash for a given string."""
    hash_object = hashlib.sha256()
    hash_object.update(input_string.encode("utf-8"))
    return hash_object.hexdigest()


def process_annotation(annotation, text, document_id):
    """Process a single annotation into the required format."""
    name = text[annotation["start"]:annotation["end"]]
    
    ann_object = {
        "mention": name,
        "start": annotation["start"],
        "end": annotation["end"],
        "id": annotation["id"],
        "type": annotation["type"],
    }
    
    # Handle linking information
    if ("linking" in annotation.get("features", {}) and 
        not annotation["features"]["linking"].get("is_nil", True)):
        
        linking = annotation["features"]["linking"]
        ann_object.update({
            "display_name": annotation["features"].get("title", name),
            "is_linked": True,
            "id_ER": linking.get("top_candidate", {}).get("url", "")
        })
    else:
        ann_object.update({
            "display_name": name,
            "is_linked": False,
            "id_ER": f"{document_id}_{name}"
        })
    
    return ann_object


def clean_document_data(file_object):
    """Clean and prepare document data for indexing."""
    # Remove unnecessary fields
    for key in ["annotation_sets", "annoation_sets", "features", "_id"]:
        if key in file_object:
            del file_object[key]
    
    # Ensure required fields exist
    if "metadata" not in file_object:
        file_object["metadata"] = []
    
    return file_object
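
The `get_string_hash` helper is what makes duplicate detection possible later in the pipeline: the document ID is derived from the text itself, so two files with identical text receive the same ID. A minimal standalone sketch of that property follows; the function name `content_id` and the sample strings are illustrative and not part of the notebook.

```python
import hashlib


def content_id(text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Illustrative sample texts (not taken from the dataset)
first = content_id("Article 1\nSubject matter and scope")
second = content_id("Article 1\nSubject matter and scope")
changed = content_id("Article 1\nSubject matter and scope.")

print(first == second)   # identical text, identical ID
print(first == changed)  # any change in the text produces a different ID
print(len(first))        # a SHA-256 hex digest is 64 characters long
```

One consequence of this design is that even a whitespace difference between two otherwise identical files yields two distinct IDs, so normalising the text before hashing may be worth considering.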

In [3]:
# Quick document search and inspection
def search_documents(query="*", size=10, exclude_fields=None):
    """Search documents in the index with optional field exclusions."""
    if exclude_fields is None:
        exclude_fields = ["chunks", "annotations"]
    
    search_query = {
        "query": {"query_string": {"query": query}},
        "_source": {"excludes": exclude_fields},
        "size": size,
    }
    
    return es.search(index=INDEX_NAME, body=search_query)


def find_empty_annotation_documents():
    """Find documents with empty annotations field."""
    search_query = {
        "query": {"query_string": {"query": "*"}},
        "_source": {"excludes": ["chunks", "annotation_sets"]},
        "size": 10000,  # the default of 10 hits would miss most documents
    }
    
    response = es.search(index=INDEX_NAME, body=search_query)
    empty_annotation_ids = []
    
    for hit in response["hits"]["hits"]:
        if hit["_source"].get("annotations") == []:
            name = hit["_source"].get("name", "(no name)")
            print(f"Document with empty 'annotations' field: {name}")
            empty_annotation_ids.append(hit["_source"]["id"])
    
    return empty_annotation_ids


# Example usage
sample_response = search_documents(size=1)
if sample_response["hits"]["hits"]:
    print("Sample document:")
    print(json.dumps(sample_response["hits"]["hits"][0]["_source"], indent=2))

Sample document:
{
  "text": "CHAPTER I\nGENERAL PROVISIONS\nArticle 1\nSubject matter and scope\n1. This Regulation lays down harmonised rules, inter alia, on:\n(a) the making available of product data and related service data to the user of the connected product or related service;\n(b) the making available of data by data holders to data recipients;\n(c) the making available of data by data holders to public sector bodies, the Commission, the European Central Bank and Union bodies, where there is an exceptional need for those data for the performance of a specific task carried out in the public interest;\n(d) facilitating switching between data processing services;\n(e) introducing safeguards against unlawful third-party access to non-personal data; and\n(f) the development of interoperability standards for data to be accessed, transferred and used.\n2. This Regulation covers personal and non-personal data, including the following types of data, in the following contexts:\n(a) Chapt



## Index Management

In [4]:
def get_index_settings():
    """Return the index settings and mappings, with a raised nested-object limit."""
    return {
        "settings": {
            "index.mapping.nested_objects.limit": 20000
        },
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "name": {"type": "keyword"},
                "preview": {"type": "keyword"},
                "id": {"type": "keyword"},
                "metadata": {
                    "type": "nested",
                    "properties": {
                        "type": {"type": "keyword"},
                        "value": {"type": "keyword"}
                    }
                },
                "annotations": {
                    "type": "nested",
                    "properties": {
                        "mention": {"type": "keyword"},
                        "start": {"type": "integer"},
                        "end": {"type": "integer"},
                        "display_name": {"type": "keyword"},
                        "id": {"type": "integer"},
                        "type": {"type": "keyword"},
                        "is_linked": {"type": "boolean"},
                        "id_ER": {"type": "keyword"}
                    }
                },
                "chunks": {
                    "type": "nested",
                    "properties": {
                        "vectors": {
                            "type": "nested",
                            "properties": {
                                "predicted_value": {
                                    "type": "dense_vector",
                                    "index": True,
                                    "dims": VECTOR_DIMS,
                                    "similarity": "cosine",
                                },
                                "text": {"type": "text"},
                                "entities": {"type": "text"},
                            },
                        },
                    },
                }
            }
        }
    }


def recreate_index(index_name=INDEX_NAME, delete_existing=False):
    """Create or recreate the Elasticsearch index."""
    try:
        if delete_existing and es.indices.exists(index=index_name):
            es.indices.delete(index=index_name)
            print(f"Deleted existing index: {index_name}")
        
        if not es.indices.exists(index=index_name):
            index_settings = get_index_settings()
            response = es.indices.create(index=index_name, body=index_settings)
            print(f"Created index: {index_name}")
            return response
        else:
            print(f"Index {index_name} already exists")
            return None
            
    except Exception as e:
        print(f"Error managing index: {e}")
        return None


# Recreate the index (this deletes the existing index and all of its documents)
recreate_index(delete_existing=True)

Deleted existing index: eu_legislation
Created index: eu_legislation




ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'eu_legislation'})
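
The `dense_vector` mapping above sets `"similarity": "cosine"`, so chunk vectors are compared by the angle between them rather than by their magnitude. The sketch below computes cosine similarity in plain Python purely for illustration. The toy 3-dimensional vectors are made up (the embeddings in this notebook have `VECTOR_DIMS = 768` components), and the helper name `cosine_similarity` is not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]  # longer vector, same direction
orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```

Because only direction matters, a chunk embedding that is scaled up or down ranks the same against a given query vector.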

## Data Processing Pipeline

In [5]:
def read_json_files(path, target_ids=None):
    """
    Read and process JSON annotation files from a directory.
    
    Args:
        path: Directory containing the annotated .json files
        target_ids: Optional set of document IDs to filter by
        
    Returns:
        List of processed document objects
    """
    json_files = [f for f in os.listdir(path) if f.endswith(".json")]
    data = []
    
    print(f"Processing {len(json_files)} JSON files from {path}")
    
    for json_file in tqdm(json_files, desc="Reading files"):
        try:
            with open(os.path.join(path, json_file), "r", encoding="utf-8") as file:
                file_object = json.load(file)
                
                # Skip empty files
                if not file_object.get("text"):
                    print(f"Warning: Skipping empty file: {json_file}")
                    continue
                
                # Generate document ID
                file_object["id"] = get_string_hash(file_object["text"])
                
                # Filter by target IDs if provided
                if target_ids and file_object["id"] not in target_ids:
                    continue
                
                # Process annotations
                annotations = process_document_annotations(file_object)
                file_object["annotations"] = annotations
                
                # Clean up the document
                file_object = clean_document_data(file_object)
                
                data.append(file_object)
                
        except Exception as e:
            print(f"Error processing {json_file}: {e}")
            continue
    
    print(f"Successfully processed {len(data)} documents")
    return data


def process_document_annotations(file_object):
    """Extract and process annotations from a document."""
    text = file_object.get("text", "")
    annotations = []
    
    annotation_sets = file_object.get("annotation_sets", {})
    entities = annotation_sets.get("entities_", {})
    raw_annotations = entities.get("annotations", [])
    
    for annotation in raw_annotations:
        try:
            ann_object = process_annotation(annotation, text, file_object.get("id", ""))
            annotations.append(ann_object)
        except Exception as e:
            print(f"Warning: Error processing annotation: {e}")
            continue
    
    print(f"Processed {len(annotations)} annotations for document '{file_object.get('name', 'Unknown')}'")
    return annotations


def send_to_elasticsearch(data, index_name=INDEX_NAME, update_existing=True):
    """
    Send documents to Elasticsearch with optional duplicate handling.
    
    Args:
        data: List of document objects
        index_name: Target index name
        update_existing: Whether to update existing documents
    """
    print(f"Sending {len(data)} documents to Elasticsearch...")
    
    for item in tqdm(data, desc="Indexing documents"):
        try:
            if update_existing:
                # Remove existing documents with same ID
                search_query = {"query": {"term": {"id": item["id"]}}}
                search_response = es.search(index=index_name, body=search_query)
                
                for hit in search_response["hits"]["hits"]:
                    es.delete(index=index_name, id=hit["_id"])
            
            # Index the new document
            es.index(index=index_name, body=item)
            
        except Exception as e:
            print(f"Error indexing document {item.get('name', 'Unknown')}: {e}")
    
    print("Document indexing completed")
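
`send_to_elasticsearch` issues one search request and one index request per document. The notebook already imports `helpers`, so a batched alternative is possible. The following is a sketch of that alternative, not the method used in this notebook: it assumes the content hash is reused as the Elasticsearch `_id`, so that re-indexing the same text overwrites the earlier copy instead of requiring a search-and-delete step. The function name `build_bulk_actions` and the sample documents are hypothetical.

```python
def build_bulk_actions(documents, index_name):
    """Yield one bulk 'index' action per document, keyed by its content hash."""
    for doc in documents:
        yield {
            "_op_type": "index",
            "_index": index_name,
            "_id": doc["id"],  # assumption: reuse the SHA-256 content hash as _id
            "_source": doc,
        }


# Illustrative documents (not taken from the dataset)
sample_docs = [
    {"id": "abc", "name": "doc-a", "text": "first"},
    {"id": "def", "name": "doc-b", "text": "second"},
]
actions = list(build_bulk_actions(sample_docs, "eu_legislation"))
print(len(actions))       # 2
print(actions[0]["_id"])  # abc

# With a live cluster, the actions could then be sent in a single call:
# from elasticsearch import helpers
# success_count, errors = helpers.bulk(es, build_bulk_actions(data, INDEX_NAME))
```

Note that an `index` action replaces the whole stored document, so any `chunks` added by a later step would be lost on re-indexing and would have to be regenerated.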

In [6]:
# Process documents - you can filter by specific document IDs if needed
# empty_annotation_ids = find_empty_annotation_documents()  # Uncomment to filter
target_ids = None  # or set to empty_annotation_ids to process only those documents

data = read_json_files(JSON_FOLDER, target_ids=target_ids)

Processing 48 JSON files from ./output


Reading files:   0%|          | 0/48 [00:00<?, ?it/s]

Reading files: 100%|██████████| 48/48 [00:00<00:00, 813.19it/s]

Processed 122 annotations for document 'DataAct_Chapter_IX'
Processed 158 annotations for document 'DataGovernanceAct_Chapter_II'
Processed 134 annotations for document 'DataAct_Chapter_VIII'
Processed 49 annotations for document 'AIAct_Chapter_XI'
Processed 631 annotations for document 'AIAct_Chapter_III'
Processed 183 annotations for document 'DataGovernanceAct_Chapter_III'
Processed 127 annotations for document 'AIAct_Chapter_I'
Processed 67 annotations for document 'GDPR_Chapter_IX'
Processed 88 annotations for document 'DataGovernanceAct_Chapter_I'
Processed 59 annotations for document 'DataGovernanceAct_Chapter_VI'
Processed 137 annotations for document 'DataGovernanceAct_Chapter_IV'
Processed 104 annotations for document 'DataAct_Chapter_XI'
Processed 204 annotations for document 'DataAct_Chapter_III'
Processed 664 annotations for document 'DataAct-intro'
Processed 501 annotations for document 'AIAct_Chapter_IX'
Processed 24 annotations for document 'GDPR_Chapter_X'
Processed 22




In [None]:
print(f"Loaded {len(data)} documents ready for processing")

In [7]:
send_to_elasticsearch(data)

Sending 48 documents to Elasticsearch...


Indexing documents: 100%|██████████| 48/48 [00:03<00:00, 13.27it/s]

Document indexing completed





## Text Chunking and Vector Embeddings

### Mapping for chunk vector search

In [None]:
# This mapping update is now included in the main index creation
# No need to run separately if using recreate_index() function
print("Vector mapping is included in the main index configuration")

In [8]:
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
    is_separator_regex=False,
)
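
`CHUNK_SIZE` and `CHUNK_OVERLAP` are measured in characters because `length_function=len`. To make the overlap concrete, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works (that class prefers to break on separators such as paragraph and line breaks); it only illustrates how consecutive chunks share `chunk_overlap` characters. The function name `fixed_window_chunks` is hypothetical.

```python
def fixed_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters sharing chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = fixed_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, which is why the last four characters of one chunk reappear as the first four of the next.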

In [9]:
class DocumentChunker:
    """Handles document chunking and embedding generation."""
    
    def __init__(self, model_name=MODEL_NAME, device="mps"):
        """Initialize the chunker with sentence transformer model."""
        self.model_name = model_name
        self.device = device
        self._model = None
        self._initialize_model()
    
    def _initialize_model(self):
        """Load the sentence transformer model once and move it to the target device."""
        if self._model is None:
            print(f"Loading embedding model: {self.model_name}")
            self._model = SentenceTransformer(
                self.model_name, 
                trust_remote_code=True
            ).to(self.device)
            print("Model loaded successfully")
    
    def generate_chunks_with_embeddings(self, text):
        """
        Split text into chunks and generate embeddings for each chunk.
        
        Args:
            text: Input text to chunk and embed
            
        Returns:
            List of lists: [embedding, chunk_text, entities_placeholder]
        """
        try:
            # Split text into chunks
            chunks = text_splitter.split_text(text)
            
            if not chunks:
                print("Warning: No chunks generated from text")
                return []
            
            # Generate embeddings
            embeddings = self._model.encode(chunks, show_progress_bar=False)
            
            # Return as list of lists (mutable) instead of tuples (immutable)
            result = [[emb.tolist(), chunk, ""] for emb, chunk in zip(embeddings, chunks)]
            
            return result
            
        except Exception as e:
            print(f"Error in chunking/embedding: {e}")
            return []
    
    def get_entities_from_chunk(self, chunk_text, full_text, annotations):
        """
        Find entities from annotations that are present in the chunk text.
        
        Args:
            chunk_text: The text of the current chunk
            full_text: The full document text 
            annotations: List of annotation objects
        
        Returns:
            String of entity mentions found in the chunk
        """
        chunk_entities = []
        
        # Find the position of this chunk in the full text
        chunk_start_in_full = full_text.find(chunk_text)
        if chunk_start_in_full == -1:
            return ""
        
        chunk_end_in_full = chunk_start_in_full + len(chunk_text)
        
        # Check which annotations overlap with this chunk
        for annotation in annotations:
            ann_start = annotation.get("start", 0)
            ann_end = annotation.get("end", 0)
            
            # Check if annotation overlaps with chunk boundaries
            if ((ann_start >= chunk_start_in_full and ann_start < chunk_end_in_full) or 
                (ann_end > chunk_start_in_full and ann_end <= chunk_end_in_full) or 
                (ann_start <= chunk_start_in_full and ann_end >= chunk_end_in_full)):
                
                entity_mention = full_text[ann_start:ann_end]
                chunk_entities.append(entity_mention)
        
        return " ".join(chunk_entities)


# Initialize the chunker
chunker = DocumentChunker()

Loading embedding model: Alibaba-NLP/gte-multilingual-base


Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Model loaded successfully
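
`get_entities_from_chunk` relies on two ideas: locating the chunk inside the full text, and testing whether an annotation's `[start, end)` span intersects the chunk's span. For non-empty spans, the three-clause condition in that method is equivalent to the single comparison used below. The sketch also illustrates a caveat: `str.find` returns the first occurrence, so a chunk whose text also appears earlier in the document would be matched at the wrong offset. Passing a search start position, as the hypothetical `locate_chunks` helper does, is one way to reduce that risk; it assumes the chunks are supplied in document order.

```python
def spans_overlap(a_start, a_end, b_start, b_end):
    """True if the half-open spans [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def locate_chunks(full_text, chunks):
    """Return (start, end) offsets for chunks given in document order.

    Each search begins just after the previous chunk's start, so repeated
    text is matched at its next occurrence rather than always the first.
    """
    offsets = []
    search_from = 0
    for chunk in chunks:
        start = full_text.find(chunk, search_from)
        if start == -1:
            offsets.append(None)  # chunk text not found verbatim
            continue
        offsets.append((start, start + len(chunk)))
        search_from = start + 1
    return offsets


# Illustrative text with a repeated sentence (not taken from the dataset)
text = "Article 1 applies. Article 1 applies. The Commission decides."
chunks = ["Article 1 applies.", "Article 1 applies.", "The Commission decides."]
offsets = locate_chunks(text, chunks)
print(offsets)  # [(0, 18), (19, 37), (38, 61)]

# Illustrative annotation covering the mention "Commission"
ann_start = text.find("Commission")
ann_end = ann_start + len("Commission")
hits = [
    i for i, span in enumerate(offsets)
    if span is not None and spans_overlap(ann_start, ann_end, *span)
]
print(hits)  # [2]
```

With a plain `text.find(chunk)`, the second chunk in this example would be reported at offset 0 and would pick up the entities of the first.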


In [10]:
def update_document_with_chunks(doc_id, doc_source, chunker_instance):
    """
    Update a document with chunks and their entities using existing annotations.
    
    Args:
        doc_id: Elasticsearch document ID
        doc_source: Full document source containing text and annotations
        chunker_instance: DocumentChunker instance
    """
    text = doc_source.get("text", "")
    existing_annotations = doc_source.get("annotations", [])
    
    # Generate chunks with embeddings
    chunks = chunker_instance.generate_chunks_with_embeddings(text)
    
    if not chunks:
        print(f"Warning: No chunks generated for document {doc_id}")
        return
    
    # Add entity information to each chunk
    for chunk in chunks:
        chunk_text = chunk[1]
        chunk_entities = chunker_instance.get_entities_from_chunk(
            chunk_text, text, existing_annotations
        )
        chunk[2] = chunk_entities  # Update entities placeholder
    
    # Prepare data for Elasticsearch update
    passages_body = [
        {
            "vectors": {
                "predicted_value": chunk[0],
                "entities": chunk[2], 
                "text": chunk[1]
            }
        } 
        for chunk in chunks
    ]
    
    # Update document in Elasticsearch
    try:
        data = {"doc": {"chunks": passages_body}}
        response = es.update(index=INDEX_NAME, id=doc_id, body=data)
        print(f"Updated document {doc_id} with {len(chunks)} chunks")
    except Exception as e:
        print(f"Error updating document {doc_id}: {e}")


def process_documents_for_chunking(index_name=INDEX_NAME, query=None, batch_size=10000):
    """
    Process documents from Elasticsearch to add chunks and embeddings.
    
    Args:
        index_name: Index to search
        query: Optional query to filter documents
        batch_size: Maximum documents to process
    """
    if query is None:
        query = {"match_all": {}}
    
    search_body = {"query": query}
    
    try:
        print("Searching for documents to process...")
        response = es.search(index=index_name, body=search_body, size=batch_size)
        documents = response["hits"]["hits"]
        
        print(f"Found {len(documents)} documents to process")
        
        # Suppress Elasticsearch transport logging for cleaner output
        logging.getLogger("elastic_transport").setLevel(logging.WARNING)
        
        # Process documents in reverse order (optional)
        documents.reverse()
        
        # Process each document
        for doc in tqdm(documents, desc="Adding chunks and embeddings"):
            doc_id = doc["_id"]
            doc_source = doc["_source"]
            update_document_with_chunks(doc_id, doc_source, chunker)
            
        print("Document processing completed!")
        
    except Exception as e:
        print(f"Error during document processing: {e}")


# Example usage:
process_documents_for_chunking()  # Process all documents
# process_documents_for_chunking(query={"terms": {"id": specific_ids}})  # Process specific documents

Searching for documents to process...




Found 48 documents to process


Adding chunks and embeddings:   2%|▏         | 1/48 [00:02<01:50,  2.36s/it]

Updated document OzmPuZkB6MMnAM0IQDdl with 78 chunks


Adding chunks and embeddings:   4%|▍         | 2/48 [00:09<03:54,  5.11s/it]

Updated document OjmPuZkB6MMnAM0IQDcY with 255 chunks


Adding chunks and embeddings:   6%|▋         | 3/48 [00:14<03:47,  5.07s/it]

Updated document OTmPuZkB6MMnAM0IPzfH with 168 chunks


Adding chunks and embeddings:   8%|▊         | 4/48 [00:15<02:33,  3.48s/it]

Updated document ODmPuZkB6MMnAM0IPzd6 with 29 chunks


Adding chunks and embeddings:  10%|█         | 5/48 [00:19<02:40,  3.72s/it]

Updated document NzmPuZkB6MMnAM0IPzcx with 120 chunks


Adding chunks and embeddings:  12%|█▎        | 6/48 [00:19<01:48,  2.59s/it]

Updated document NjmPuZkB6MMnAM0IPjfs with 6 chunks


Adding chunks and embeddings:  15%|█▍        | 7/48 [00:20<01:18,  1.91s/it]

Updated document NTmPuZkB6MMnAM0IPjeu with 12 chunks


Adding chunks and embeddings:  17%|█▋        | 8/48 [00:22<01:13,  1.84s/it]

Updated document NDmPuZkB6MMnAM0IPjdk with 48 chunks


Adding chunks and embeddings:  19%|█▉        | 9/48 [00:22<00:52,  1.35s/it]

Updated document MzmPuZkB6MMnAM0IPjck with 3 chunks


Adding chunks and embeddings:  21%|██        | 10/48 [00:24<00:57,  1.51s/it]

Updated document MjmPuZkB6MMnAM0IPTfj with 68 chunks


Adding chunks and embeddings:  23%|██▎       | 11/48 [00:25<00:47,  1.28s/it]

Updated document MTmPuZkB6MMnAM0IPTeg with 15 chunks


Adding chunks and embeddings:  25%|██▌       | 12/48 [00:26<00:44,  1.23s/it]

Updated document MDmPuZkB6MMnAM0IPTdg with 34 chunks


Adding chunks and embeddings:  27%|██▋       | 13/48 [00:27<00:46,  1.34s/it]

Updated document LzmPuZkB6MMnAM0IPTcd with 49 chunks


Adding chunks and embeddings:  29%|██▉       | 14/48 [00:30<01:00,  1.77s/it]

Updated document LjmPuZkB6MMnAM0IPDfV with 105 chunks


Adding chunks and embeddings:  31%|███▏      | 15/48 [00:31<00:54,  1.66s/it]

Updated document LTmPuZkB6MMnAM0IPDeN with 44 chunks


Adding chunks and embeddings:  33%|███▎      | 16/48 [00:34<01:02,  1.95s/it]

Updated document LDmPuZkB6MMnAM0IPDdH with 83 chunks


Adding chunks and embeddings:  35%|███▌      | 17/48 [00:35<00:47,  1.53s/it]

Updated document KzmPuZkB6MMnAM0IPDcC with 16 chunks


Adding chunks and embeddings:  38%|███▊      | 18/48 [00:36<00:44,  1.48s/it]

Updated document KjmPuZkB6MMnAM0IOzfB with 37 chunks


Adding chunks and embeddings:  40%|███▉      | 19/48 [00:37<00:41,  1.42s/it]

Updated document KTmPuZkB6MMnAM0IOzeB with 33 chunks


Adding chunks and embeddings:  42%|████▏     | 20/48 [00:40<00:52,  1.87s/it]

Updated document KDmPuZkB6MMnAM0IOzc3 with 113 chunks


Adding chunks and embeddings:  44%|████▍     | 21/48 [00:41<00:41,  1.52s/it]

Updated document JzmPuZkB6MMnAM0IOjfw with 20 chunks


Adding chunks and embeddings:  46%|████▌     | 22/48 [00:43<00:40,  1.54s/it]

Updated document JjmPuZkB6MMnAM0IOjes with 44 chunks


Adding chunks and embeddings:  48%|████▊     | 23/48 [00:43<00:30,  1.24s/it]

Updated document JTmPuZkB6MMnAM0IOjdp with 13 chunks


Adding chunks and embeddings:  50%|█████     | 24/48 [01:01<02:27,  6.17s/it]

Updated document JDmPuZkB6MMnAM0IOjcU with 693 chunks


Adding chunks and embeddings:  52%|█████▏    | 25/48 [01:01<01:43,  4.49s/it]

Updated document IzmPuZkB6MMnAM0IOTfF with 15 chunks


Adding chunks and embeddings:  54%|█████▍    | 26/48 [01:02<01:10,  3.22s/it]

Updated document IjmPuZkB6MMnAM0IOTeH with 5 chunks


Adding chunks and embeddings:  56%|█████▋    | 27/48 [01:03<00:57,  2.74s/it]

Updated document ITmPuZkB6MMnAM0IOTdM with 46 chunks


Adding chunks and embeddings:  58%|█████▊    | 28/48 [01:04<00:41,  2.09s/it]

Updated document IDmPuZkB6MMnAM0IOTcM with 15 chunks


Adding chunks and embeddings:  60%|██████    | 29/48 [01:04<00:30,  1.59s/it]

Updated document HzmPuZkB6MMnAM0IODfG with 9 chunks


Adding chunks and embeddings:  62%|██████▎   | 30/48 [01:04<00:21,  1.21s/it]

Updated document HjmPuZkB6MMnAM0IODd_ with 1 chunks


Adding chunks and embeddings:  65%|██████▍   | 31/48 [01:07<00:28,  1.67s/it]

Updated document HTmPuZkB6MMnAM0IODcx with 66 chunks


Adding chunks and embeddings:  67%|██████▋   | 32/48 [01:16<01:01,  3.83s/it]

Updated document HDmPuZkB6MMnAM0INzfh with 279 chunks


Adding chunks and embeddings:  69%|██████▉   | 33/48 [01:16<00:41,  2.78s/it]

Updated document GzmPuZkB6MMnAM0INzeh with 5 chunks


Adding chunks and embeddings:  71%|███████   | 34/48 [01:22<00:50,  3.59s/it]

Updated document GjmPuZkB6MMnAM0INzdY with 190 chunks


Adding chunks and embeddings:  73%|███████▎  | 35/48 [01:33<01:17,  5.93s/it]

Updated document GTmPuZkB6MMnAM0INjf- with 431 chunks


Adding chunks and embeddings:  75%|███████▌  | 36/48 [01:35<00:55,  4.62s/it]

Updated document GDmPuZkB6MMnAM0INjeq with 36 chunks


Adding chunks and embeddings:  77%|███████▋  | 37/48 [01:36<00:40,  3.71s/it]

Updated document FzmPuZkB6MMnAM0INjdX with 39 chunks


Adding chunks and embeddings:  79%|███████▉  | 38/48 [01:38<00:31,  3.14s/it]

Updated document FjmPuZkB6MMnAM0INjcM with 49 chunks


Adding chunks and embeddings:  81%|████████▏ | 39/48 [01:39<00:22,  2.46s/it]

Updated document FTmPuZkB6MMnAM0INTfM with 22 chunks


Adding chunks and embeddings:  83%|████████▎ | 40/48 [01:40<00:16,  2.08s/it]

Updated document FDmPuZkB6MMnAM0INTd_ with 30 chunks


Adding chunks and embeddings:  85%|████████▌ | 41/48 [01:41<00:11,  1.70s/it]

Updated document EzmPuZkB6MMnAM0INTdC with 18 chunks


Adding chunks and embeddings:  88%|████████▊ | 42/48 [01:44<00:12,  2.00s/it]

Updated document EjmPuZkB6MMnAM0INDf_ with 95 chunks


Adding chunks and embeddings:  90%|████████▉ | 43/48 [01:46<00:09,  1.91s/it]

Updated document ETmPuZkB6MMnAM0INDe3 with 56 chunks


Adding chunks and embeddings:  92%|█████████▏| 44/48 [01:59<00:21,  5.41s/it]

Updated document EDmPuZkB6MMnAM0INDdZ with 557 chunks


Adding chunks and embeddings:  94%|█████████▍| 45/48 [02:00<00:11,  3.91s/it]

Updated document DzmPuZkB6MMnAM0IMzfT with 8 chunks


Adding chunks and embeddings:  96%|█████████▌| 46/48 [02:01<00:06,  3.23s/it]

Updated document DjmPuZkB6MMnAM0IMzdc with 48 chunks


Adding chunks and embeddings:  98%|█████████▊| 47/48 [02:03<00:02,  2.80s/it]

Updated document DTmPuZkB6MMnAM0IMzcZ with 60 chunks


Adding chunks and embeddings: 100%|██████████| 48/48 [02:04<00:00,  2.60s/it]

Updated document DDmPuZkB6MMnAM0IMjey with 43 chunks
Document processing completed!



