A simple notebook that demonstrates how to build and use a retriever over the PDFs from my Zotero library, backed by Elasticsearch. Search uses the BM25 algorithm.
To use a retriever that is already running, skip ahead to the `Inference` section.
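
For intuition about what BM25 ranking rewards, here is a minimal pure-Python sketch of the classic Okapi BM25 score. It is not what Elasticsearch runs internally (Lucene's scoring differs in details). The function name `bm25_scores`, the whitespace tokenization, and the defaults `k1=1.2`, `b=0.75` are assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25.

    Tokenization is a plain lowercase whitespace split (an assumption;
    Elasticsearch analyzers do considerably more).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more matches.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Rare query terms contribute more through the IDF factor, and repeated matches saturate instead of growing linearly, which is why a short chunk that mentions both query words tends to outrank a long chunk that repeats only one of them.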

# Create Retriever

In [1]:
# Start a local single-node Elasticsearch with security disabled (run in a shell):
# docker run -d --name elasticsearch -e "discovery.type=single-node" -e "xpack.security.enabled=false" -p 9200:9200 -p 9300:9300 docker.elastic.co/elasticsearch/elasticsearch:8.13.4

In [2]:
import pandas as pd
data = pd.read_csv('/home/lexi/radarange-orchestrator/data/Library.csv')

In [3]:
# Keep only arXiv entries and turn abstract links (/abs/) into PDF links (/pdf/).
urls = [url.replace('/abs/', '/pdf/') for url in data['Url'].unique() if pd.notna(url) and 'arxiv.org' in url]
len(urls), urls[:5]

(210,
 ['http://arxiv.org/pdf/1910.02893',
  'http://arxiv.org/pdf/2005.06600',
  'http://arxiv.org/pdf/2210.05619',
  'http://arxiv.org/pdf/2003.11080',
  'http://arxiv.org/pdf/1609.04747'])
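
A plain string replacement can also touch other parts of a URL that happen to contain the same substring, and a substring check on the host accepts any URL that merely mentions `arxiv.org`. A stricter variant could parse the URL first. The sketch below uses only the standard library; the helper name `arxiv_abs_to_pdf` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlsplit, urlunsplit


def arxiv_abs_to_pdf(url):
    """Return the PDF URL for an arXiv link, or None for other hosts."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host != "arxiv.org" and not host.endswith(".arxiv.org"):
        return None
    path = parts.path
    # Rewrite only a leading /abs/ segment; leave the rest of the path alone.
    if path.startswith("/abs/"):
        path = "/pdf/" + path[len("/abs/"):]
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

It could be applied as `urls = [u for u in map(arxiv_abs_to_pdf, data['Url'].dropna().unique()) if u]`.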

In [6]:
from langchain_docling import DoclingLoader
from tqdm import tqdm

batch_size = 5
documents = []

for i in tqdm(range(0, len(urls), batch_size), desc="Loading batches"):
    batch_urls = urls[i:i + batch_size]
    loader = DoclingLoader(batch_urls)
    documents.extend(loader.load())

Loading batches:   0%|          | 0/42 [00:00<?, ?it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (671 > 512). Running this sequence through the model will result in indexing errors
Loading batches:   2%|▏         | 1/42 [01:49<1:14:50, 109.53s/it]
Token indices sequence length is longer than the specified maximum sequence length for this model (6760 > 512). Running this sequence through the model will result in indexing errors
Loading batches:   5%|▍         | 2/42 [03:19<1:05:10, 97.75s/it]
Token indices sequence length is longer than the specified maximum sequence length for this model (1159 > 512). Running this sequence through the model will result in indexing errors
Loading batches:   7%|▋         | 3/42 [06:20<1:28:19, 135.90s/it]
Token indices sequence length is longer than the specified maximum sequence length for this model (638 > 512). Running this sequence through the model will result in indexing errors
Loading batches: 
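
The log shows each batch taking between roughly 1.5 and 3 minutes, so the full run is long, and an exception in one batch (for example a failed download) would abort the loop. One possible way to make it more forgiving is to record failed batches and continue. This is a sketch, not the notebook's method: `load_in_batches` and its `load_batch` parameter are hypothetical names.

```python
def load_in_batches(urls, batch_size, load_batch):
    """Load `urls` in batches, collecting failures instead of aborting.

    `load_batch` is any callable that takes a list of URLs and returns a
    list of documents, e.g. `lambda batch: DoclingLoader(batch).load()`.
    """
    documents, failed = [], []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        try:
            documents.extend(load_batch(batch))
        except Exception as exc:  # broad on purpose: any loader error skips the batch
            failed.append((batch, repr(exc)))
    return documents, failed
```

The returned `failed` list can be retried later with a smaller `batch_size` to isolate the URLs that actually cause trouble.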

In [7]:
def clean_metadata(doc):
    if "dl_meta" in doc.metadata:
        # Cast origin.binary_hash to a string so it can be indexed as a keyword field
        if "origin" in doc.metadata["dl_meta"]:
            origin = doc.metadata["dl_meta"]["origin"]
            if "binary_hash" in origin:
                origin["binary_hash"] = str(origin["binary_hash"])
    return doc

documents = [clean_metadata(doc) for doc in documents]
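
The notebook does not say why the hash has to become a string. One plausible reason (an assumption here) is that docling reports `binary_hash` as an unsigned 64-bit integer, while Elasticsearch's `long` field type is a signed 64-bit integer, so large hash values would fall outside its range. Under that assumption, a more general cleanup could stringify any out-of-range integer in the metadata. The helper names below are hypothetical.

```python
ES_LONG_MIN = -(2**63)
ES_LONG_MAX = 2**63 - 1


def fits_es_long(value):
    """Return True if `value` is an int representable as an Elasticsearch `long`."""
    return (
        isinstance(value, int)
        and not isinstance(value, bool)
        and ES_LONG_MIN <= value <= ES_LONG_MAX
    )


def stringify_oversized_ints(obj):
    """Recursively copy `obj`, turning ints outside the `long` range into strings."""
    if isinstance(obj, dict):
        return {k: stringify_oversized_ints(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [stringify_oversized_ints(v) for v in obj]
    if isinstance(obj, int) and not isinstance(obj, bool) and not fits_es_long(obj):
        return str(obj)
    return obj
```

Unlike `clean_metadata`, this returns a new structure instead of mutating the document's metadata in place.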

In [8]:
for d in documents[:3]:
    print(f"- {d.page_content=}")

- d.page_content='Parallel Iterative Edit Models for Local Sequence Transduction\nAbhijeet Awasthi ∗ , Sunita Sarawagi , Rasna Goyal , Sabyasachi Ghosh , Vihari Piratla Department of Computer Science and Engineering, IIT Bombay'
- d.page_content='Abstract\nWe present a Parallel Iterative Edit (PIE) model for the problem of local sequence transduction arising in tasks like Grammatical error correction (GEC). Recent approaches are based on the popular encoder-decoder (ED) model for sequence to sequence learning. The ED model auto-regressively captures full dependency among output tokens but is slow due to sequential decoding. The PIE model does parallel decoding, giving up the advantage of modelling full dependency in the output, yet it achieves accuracy competitive with the ED model for four reasons: 1. predicting edits instead of tokens, 2. labeling sequences instead of generating sequences, 3. iteratively refining predictions to capture dependencies, and 4. factorizing logits over edi

In [9]:
from langchain_elasticsearch import ElasticsearchStore

def create_index_with_mapping(es_url, index_name):
    from elasticsearch import Elasticsearch
    es = Elasticsearch(es_url)

    mapping = {
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "metadata": {
                    "properties": {
                        "dl_meta": {
                            "properties": {
                                "origin": {
                                    "properties": {
                                        "binary_hash": {"type": "keyword"}  # Critical fix
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }

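    # Start from a clean index: any existing index with this name is deleted.
    if es.indices.exists(index=index_name):
        es.indices.delete(index=index_name)
    # elasticsearch-py 8.x accepts `mappings` directly; `body` is deprecated.
    es.indices.create(index=index_name, mappings=mapping["mappings"])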

create_index_with_mapping("http://localhost:9200", "test_index")
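
Re-running the cell above drops the index and everything already stored in it. If that is not wanted, a non-destructive variant could keep an existing index unless told otherwise. This is a sketch assuming the elasticsearch-py 8.x client; `ensure_index` and its `recreate` flag are hypothetical names.

```python
def ensure_index(es, index_name, mappings, recreate=False):
    """Create `index_name` if missing; drop and recreate it only when asked.

    `es` is expected to be an `elasticsearch.Elasticsearch` client (8.x API).
    Returns True if an index was created, False if an existing one was kept.
    """
    if es.indices.exists(index=index_name):
        if not recreate:
            return False
        es.indices.delete(index=index_name)
    es.indices.create(index=index_name, mappings=mappings)
    return True
```

Here `mappings` is the inner dictionary, i.e. the value stored under the `"mappings"` key of the request body.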

In [10]:
db = ElasticsearchStore(
    es_url="http://localhost:9200",
    index_name="test_index",
    strategy=ElasticsearchStore.BM25RetrievalStrategy(),
)

In [11]:
ids = db.add_documents(documents)  # returns the IDs of the indexed chunks

In [15]:
results = db.similarity_search_with_score(query="activation function", k=5)
for doc, score in results:
    print(score)
    print(doc.metadata['source'])
    print(doc)
    print('-----------------------\n')

11.901587
http://arxiv.org/pdf/2310.20360
page_content='1.2.6 Gaussian error linear unit (GELU) activation
Another popular activation function is the GELU activation function first introduced in Hendrycks & Gimpel [201]. This activation function is the subject of the next definition.
Definition 1.2.15 (GELU activation function) . We say that a is the GELU unit activation function (we say that a is the GELU activation function) if and only if it holds that a : R → R is the function from R to R which satisfies for all x ∈ R that
<!-- formula-not-decoded -->' metadata={'source': 'http://arxiv.org/pdf/2310.20360', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/538', 'parent': {'$ref': '#/body'}, 'children': [], 'content_layer': 'body', 'label': 'text', 'prov': [{'page_no': 39, 'bbox': {'l': 56.26, 't': 424.6710146484375, 'r': 515.91, 'b': 395.1370146484375, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 177]}]},
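
Printing the whole `Document` buries the matched text under layout metadata. A compact alternative could pull out just the score, the source, the page numbers, and a short preview. The sketch below assumes the metadata layout visible in the output above (`dl_meta` → `doc_items` → `prov` → `page_no`); `summarize_hit` is a hypothetical helper.

```python
def summarize_hit(doc, score, max_chars=200):
    """Return a one-line summary: score, source, page numbers, text preview."""
    meta = doc.metadata
    # Collect the distinct page numbers referenced by the chunk's items.
    pages = sorted({
        prov["page_no"]
        for item in meta.get("dl_meta", {}).get("doc_items", [])
        for prov in item.get("prov", [])
        if "page_no" in prov
    })
    # Collapse newlines and repeated spaces, then truncate.
    preview = " ".join(doc.page_content.split())[:max_chars]
    return f"{score:.3f} | {meta.get('source', '?')} | pages {pages} | {preview}"
```

In the search loop this would replace the three `print` calls with `print(summarize_hit(doc, score))`.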

# Inference

In [2]:
from langchain_elasticsearch import ElasticsearchStore

db = ElasticsearchStore(
    es_url="http://localhost:9200",
    index_name="test_index",
    strategy=ElasticsearchStore.BM25RetrievalStrategy(),
)

results = db.similarity_search_with_score(query="activation function", k=5)
for doc, score in results:
    print(score)
    print(doc.metadata['source'])
    print(doc)
    print('-----------------------\n')

11.901587
http://arxiv.org/pdf/2310.20360
page_content='1.2.6 Gaussian error linear unit (GELU) activation
Another popular activation function is the GELU activation function first introduced in Hendrycks & Gimpel [201]. This activation function is the subject of the next definition.
Definition 1.2.15 (GELU activation function) . We say that a is the GELU unit activation function (we say that a is the GELU activation function) if and only if it holds that a : R → R is the function from R to R which satisfies for all x ∈ R that
<!-- formula-not-decoded -->' metadata={'source': 'http://arxiv.org/pdf/2310.20360', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/538', 'parent': {'$ref': '#/body'}, 'children': [], 'content_layer': 'body', 'label': 'text', 'prov': [{'page_no': 39, 'bbox': {'l': 56.26, 't': 424.6710146484375, 'r': 515.91, 'b': 395.1370146484375, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 177]}]},
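
Because the index stores chunks rather than whole papers, several of the top hits may come from the same PDF. When the goal is to find relevant papers, the hits could be collapsed to the best chunk per source. This is a sketch; `top_sources` is a hypothetical helper that works on the `(doc, score)` pairs returned by `similarity_search_with_score`.

```python
def top_sources(results, limit=5):
    """Collapse (doc, score) pairs to the best-scoring chunk per source URL."""
    best = {}
    for doc, score in results:
        source = doc.metadata.get("source")
        if source not in best or score > best[source][1]:
            best[source] = (doc, score)
    # Highest score first.
    ranked = sorted(best.values(), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]
```

Requesting a larger `k` from the search (for example `k=50`) before collapsing leaves more distinct papers to choose from.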