# Document Loading & Retrieval

Author: [Abdulazeez Jimoh](https://github.com/abdulazeezoj)


This implementation is based on:

- PyPDF - PDF Document Parsing
- LlamaIndex - Document Loading & Retrieval
- Weaviate - Vector Storage
- Neo4j - Graph Database
- Redis - Ingestion Cache & Document Storage
- OpenAI - Large Language Models


In [18]:
from pprint import pprint

from llama_index.embeddings import OpenAIEmbedding
from llama_index.indices.knowledge_graph.retrievers import KGRetrieverMode
from llama_index.llms import OpenAI
from llama_index.response.schema import Response
from llama_index.response_synthesizers.type import ResponseMode
from llama_index.vector_stores.types import VectorStoreQueryMode

from diplodigst.config import DiploConfig
from diplodigst.tools import DiploDocLoader, DiploDocRetriever
from diplodigst.types import DiploIndex

In [14]:
# load project settings
diplo_config = DiploConfig()

In [None]:
# inspect the loaded settings (the dump includes API keys and passwords)
diplo_config.model_dump()

In [15]:
# initialise LLM and embedding model
llm_model = OpenAI(api_key=diplo_config.OPENAI_API_KEY)
embed_model = OpenAIEmbedding(api_key=diplo_config.OPENAI_API_KEY)

In [16]:
# initialise document loader
doc_loader = DiploDocLoader(
    embed_model=embed_model,
    llm_model=llm_model,
    weaviate_host=diplo_config.WEAVIATE_HOST,
    weaviate_port=diplo_config.WEAVIATE_PORT,
    redis_host=diplo_config.REDIS_HOST,
    redis_port=diplo_config.REDIS_PORT,
    neo4j_host=diplo_config.NEO4J_HOST,
    neo4j_port=diplo_config.NEO4J_PORT,
    neo4j_username=diplo_config.NEO4J_USERNAME,
    neo4j_password=diplo_config.NEO4J_PASSWORD,
    name=diplo_config.PROJECT_NAME,
    chunk_size=500,
    chunk_overlap=25,
    verbose=True,
)

[ INFO ] Initializing vector store
[ INFO ] Vector store initialized
[ INFO ] Initializing graph store


ValueError: Error loading graph store: Could not connect to Neo4j database. Please ensure that the username and password are correct
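The error above names the username and password, but "could not connect" can also mean the Neo4j service is not running or the port is wrong. A minimal sketch for telling the two apart, assuming a plain TCP check is enough to tell whether the server is reachable (`port_is_open` is a hypothetical helper, not part of `diplodigst`):

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        # create_connection resolves the host and opens a TCP socket;
        # the context manager closes it again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

Calling `port_is_open(diplo_config.NEO4J_HOST, int(diplo_config.NEO4J_PORT))` before constructing `DiploDocLoader` gives a quick hint: if it returns `False`, the service or port is the likelier culprit; if it returns `True`, the credentials are worth checking first.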

In [7]:
# create document index
_doc_index: DiploIndex = doc_loader.ingest(
    data_path="../data/test_docs/",
    filename_as_id=False,
)

[ INFO ] Loading documents


Loading files: 100%|██████████| 10/10 [00:05<00:00,  1.74file/s]


[ INFO ] Loaded 114 documents
[ INFO ] Running ingestion pipeline


[ INFO ] Ingested 187 nodes
[ INFO ] Creating vector index


Generating embeddings: 100%|██████████| 187/187 [00:07<00:00, 26.04it/s]


[ INFO ] Vector index created
[ INFO ] Creating knowledge graph index


Processing nodes: 100%|██████████| 187/187 [12:15<00:00,  3.93s/it]


[ INFO ] Knowledge graph index created
[ INFO ] Indices created
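The log shows 114 loaded documents becoming 187 nodes, which is the effect of `chunk_size=500` and `chunk_overlap=25`: longer documents are split into overlapping windows. The sketch below illustrates the idea on a plain list of tokens. It is a simplification under stated assumptions: LlamaIndex splitters count tokens with a tokenizer and prefer sentence boundaries, so real node counts will differ, and `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(tokens, chunk_size=500, chunk_overlap=25):
    """Split a token list into windows of `chunk_size` sharing `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks
```

With these settings a 1,200-token document yields three chunks starting at tokens 0, 475 and 950, so the last 25 tokens of one chunk reappear at the start of the next.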


In [8]:
# load documents index
doc_index: DiploIndex = doc_loader.load()

[ INFO ] Loading vector index
[ INFO ] Vector index loaded
[ INFO ] Loading knowledge graph index
[ INFO ] Knowledge graph index loaded


In [9]:
# initialise document retriever
doc_retriever = DiploDocRetriever(
    llm_model=llm_model,
    embed_model=embed_model,
    doc_index=doc_index,
    retriever_mode="OR",
    response_mode=ResponseMode.TREE_SUMMARIZE,
    vector_query_alpha=0.5,
    vector_query_mode=VectorStoreQueryMode.HYBRID,
    graph_query_mode=KGRetrieverMode.HYBRID,
    verbose=True,
)

[ INFO ] Initializing vector and graph indices
[ INFO ] Vector and graph indices initialized
[ INFO ] Initializing vector and graph retrievers
[ INFO ] Vector and graph retrievers initialized
[ INFO ] Initializing retriever
[ INFO ] Retriever initialized
[ INFO ] Initializing query engine
[ INFO ] Query engine initialized
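With `VectorStoreQueryMode.HYBRID`, the `vector_query_alpha=0.5` setting is presumably forwarded to Weaviate's hybrid search, where `alpha` weights vector similarity against keyword (BM25) relevance: 0 is pure keyword search and 1 is pure vector search. The sketch below shows one way such a blend can work, as a weighted sum of min-max-normalised scores. It is illustrative only: Weaviate's own fusion algorithms differ in detail, and `min_max` and `hybrid_scores` are hypothetical names.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(keyword, vector, alpha=0.5):
    """Blend keyword and vector scores; alpha=0 is keyword only, alpha=1 is vector only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    kw = min_max(keyword) if keyword else {}
    vec = min_max(vector) if vector else {}
    # A document missing from one result list contributes 0 for that side.
    return {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
```

At `alpha=0.5`, as used here, a passage that ranks well on exact terms (for example a country name) and one that ranks well on semantic similarity get equal weight.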


In [10]:
query = "What is Switzerland contribution to the global digital compact?"
response: Response = doc_retriever.query(query)
pprint(
    [
        {"doc": n.metadata.get("file_name"), "page": n.metadata.get("page_label")}
        for n in response.source_nodes
        if n.metadata.get("page_label") is not None
    ]
)
pprint(response.response)

[{'doc': 'GDC-submission_Switzerland.pdf', 'page': '5'},
 {'doc': 'GDC-submission_Switzerland.pdf', 'page': '1'}]
("Switzerland's contribution to the Global Digital Compact is to support the "
 'efforts to increase accountability for discrimination and misleading '
 'content, while protecting human and fundamental rights. Switzerland commits '
 "to contribute as a member of the Catalyst Group to UNESCO's efforts for an "
 'Internet for Trust.')
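The source listing is rebuilt with the same list comprehension for every query in this notebook. A small helper keeps the cells shorter; the sketch below assumes only that each source node exposes a `metadata` dict, as the cell above already relies on. `summarise_sources` is a hypothetical name, and dropping repeated (document, page) pairs is an addition that the original comprehension does not perform.

```python
def summarise_sources(source_nodes):
    """Return unique {"doc", "page"} entries for nodes that carry a page label."""
    seen = set()
    sources = []
    for node in source_nodes:
        doc = node.metadata.get("file_name")
        page = node.metadata.get("page_label")
        # Skip nodes without a page label (as in the original comprehension)
        # and any (doc, page) pair that was already listed.
        if page is None or (doc, page) in seen:
            continue
        seen.add((doc, page))
        sources.append({"doc": doc, "page": page})
    return sources
```

Each query cell can then print its sources with `pprint(summarise_sources(response.source_nodes))`.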


In [12]:
query = "What are the commitments of Switzerland to the global digital compact?"
response: Response = doc_retriever.query(query)
pprint(
    [
        {"doc": n.metadata.get("file_name"), "page": n.metadata.get("page_label")}
        for n in response.source_nodes
        if n.metadata.get("page_label") is not None
    ]
)
pprint(response.response)

[{'doc': 'GDC-submission_Switzerland.pdf', 'page': '5'},
 {'doc': 'GDC-submission_Switzerland.pdf', 'page': '1'}]
('Switzerland commits to support the efforts to increase accountability for '
 'discrimination and misleading content, while protecting human and '
 'fundamental rights. Switzerland also commits to further supporting the '
 'efforts to address connectivity issues, such as through the GIGA Initiative '
 'and other efforts undertaken by ITU, UNESCO, and other UN institutions.')


In [13]:
query = "What is Singapore and Switzerland’s position on the inclusion of digital technologies in the global compact?"
response: Response = doc_retriever.query(query)
pprint(
    [
        {"doc": n.metadata.get("file_name"), "page": n.metadata.get("page_label")}
        for n in response.source_nodes
        if n.metadata.get("page_label") is not None
    ]
)
pprint(response.response)

[{'doc': 'GDC-submission_Republic-of-Singapore.pdf', 'page': '5'},
 {'doc': 'GDC-submission_Switzerland.pdf', 'page': '1'}]
('Singapore and Switzerland both support the inclusion of digital technologies '
 'in the global compact. They believe that the starting point for the Global '
 'Digital Compact should be digital inclusion and connectivity. They emphasize '
 'the importance of providing universal access and connectivity to the '
 'internet by 2030. Switzerland also mentions that the Global Digital Compact '
 'has the potential to strengthen a principles-based order in the area of '
 'digital cooperation. Both countries express their readiness to support the '
 'development of the Global Digital Compact and work towards its '
 'establishment.')
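Every answer above is synthesised from nodes returned by two retrievers, vector and graph, combined under `retriever_mode="OR"`. How `DiploDocRetriever` merges them is not shown in this notebook; the sketch below assumes the common pattern where `"OR"` keeps the union of both result sets and `"AND"` keeps only nodes found by both. `combine_node_ids` is a hypothetical name.

```python
def combine_node_ids(vector_ids, graph_ids, mode="OR"):
    """Merge two ranked lists of node IDs by union ("OR") or intersection ("AND")."""
    vector_set, graph_set = set(vector_ids), set(graph_ids)
    if mode == "OR":
        keep = vector_set | graph_set
    elif mode == "AND":
        keep = vector_set & graph_set
    else:
        raise ValueError("mode must be 'OR' or 'AND'")

    # Preserve first-seen order: vector results first, then graph results.
    ordered = []
    for node_id in list(vector_ids) + list(graph_ids):
        if node_id in keep and node_id not in ordered:
            ordered.append(node_id)
    return ordered
```

Under this assumption `"OR"` favours recall, which suits questions that span several submissions, while `"AND"` would return fewer but doubly confirmed nodes.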
