# Document Loading & Retrieval

Author: [Abdulazeez Jimoh](https://github.com/abdulazeezoj)


This implementation is based on:

- PyPDF - PDF Document Parsing
- LlamaIndex - Document Loading & Retrieval
- Weaviate - Vector Storage
- Neo4j - Graph Database
- Redis - Ingestion Cache & Document Storage
- OpenAI - Large Language Models


In [3]:
from pprint import pprint

from llama_index.embeddings import OpenAIEmbedding
from llama_index.indices.knowledge_graph.retrievers import KGRetrieverMode
from llama_index.llms import OpenAI
from llama_index.response.schema import Response
from llama_index.response_synthesizers.type import ResponseMode
from llama_index.vector_stores.types import VectorStoreQueryMode

from diplodigst.config import DiploConfig
from diplodigst.tools import DiploDocLoader, DiploDocRetriever
from diplodigst.types import DiploIndex

In [3]:
diplo_config = DiploConfig()

In [None]:
diplo_config.model_dump()

In [5]:
# initialise LLM and embedding models
llm_model = OpenAI(api_key=diplo_config.OPENAI_API_KEY)
embed_model = OpenAIEmbedding(api_key=diplo_config.OPENAI_API_KEY)

In [6]:
# initialise document loader
doc_loader = DiploDocLoader(
    embed_model=embed_model,
    llm_model=llm_model,
    weaviate_host=diplo_config.WEAVIATE_HOST,
    weaviate_port=diplo_config.WEAVIATE_PORT,
    redis_host=diplo_config.REDIS_HOST,
    redis_port=diplo_config.REDIS_PORT,
    neo4j_host=diplo_config.NEO4J_HOST,
    neo4j_port=diplo_config.NEO4J_PORT,
    neo4j_username=diplo_config.NEO4J_USERNAME,
    neo4j_password=diplo_config.NEO4J_PASSWORD,
    name=diplo_config.PROJECT_NAME,
    chunk_size=500,
    chunk_overlap=25,
    verbose=True,
)

[ INFO ] Initializing vector store
[ INFO ] Vector store initialized
[ INFO ] Initializing graph store
[ INFO ] Graph store initialized
[ INFO ] Initializing document store
[ INFO ] Document store initialized
[ INFO ] Initializing index store
[ INFO ] Index store initialized
[ INFO ] Initializing ingestion cache
[ INFO ] Ingestion cache initialized
[ INFO ] Initializing LLM and embedding models
[ INFO ] LLM and embedding models initialized
[ INFO ] Initializing service context
[ INFO ] Service context initialized
[ INFO ] Initializing storage context
[ INFO ] Storage context initialized
[ INFO ] Initializing ingestion pipeline
[ INFO ] Ingestion pipeline initialized
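
The loader above is configured with `chunk_size=500` and `chunk_overlap=25`. The internals of `DiploDocLoader` are not shown in this notebook, so the following is only a minimal sketch of what those two settings usually mean, assuming a plain token-window splitter. The function name `chunk_tokens` and the sample token list are hypothetical; LlamaIndex's own splitters also respect sentence boundaries, which this sketch ignores.

```python
def chunk_tokens(tokens, chunk_size=500, chunk_overlap=25):
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # each new chunk starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Hypothetical input: 1200 placeholder tokens.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
print(chunks[0][-25:] == chunks[1][:25])
```

With these settings each chunk repeats the last 25 tokens of the previous one, so a sentence that straddles a chunk boundary is less likely to be cut off from its context.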


In [7]:
# create document index
_doc_index: DiploIndex = doc_loader.ingest(
    data_path="../data/test_docs/",
    filename_as_id=False,
)

[ INFO ] Loading documents


Loading files: 100%|██████████| 10/10 [00:07<00:00,  1.37file/s]


[ INFO ] Loaded 114 documents
[ INFO ] Running ingestion pipeline
[ INFO ] Ingested 187 nodes
[ INFO ] Creating vector index


Generating embeddings: 100%|██████████| 187/187 [00:02<00:00, 80.95it/s]


[ INFO ] Vector index created
[ INFO ] Creating knowledge graph index


Processing nodes: 100%|██████████| 187/187 [25:10<00:00,  8.08s/it]

[ INFO ] Knowledge graph index created
[ INFO ] Indices created





In [8]:
# load document index
doc_index: DiploIndex = doc_loader.load()

[ INFO ] Loading vector index
[ INFO ] Vector index loaded
[ INFO ] Loading knowledge graph index
[ INFO ] Knowledge graph index loaded


In [9]:
# initialise document retriever
doc_retriever = DiploDocRetriever(
    llm_model=llm_model,
    embed_model=embed_model,
    doc_index=doc_index,
    retriever_mode="OR",
    response_mode=ResponseMode.TREE_SUMMARIZE,
    vector_query_alpha=0.5,
    vector_query_mode=VectorStoreQueryMode.HYBRID,
    graph_query_mode=KGRetrieverMode.HYBRID,
    verbose=True,
)

[ INFO ] Initializing vector and graph indices
[ INFO ] Vector and graph indices initialized
[ INFO ] Initializing vector and graph retrievers
[ INFO ] Vector and graph retrievers initialized
[ INFO ] Initializing retriever
[ INFO ] Retriever initialized
[ INFO ] Initializing query engine
[ INFO ] Query engine initialized
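
Two of the retriever settings above, `vector_query_alpha=0.5` and `retriever_mode="OR"`, can be illustrated with a small sketch. It assumes the usual Weaviate convention for hybrid search (`alpha=0` is keyword only, `alpha=1` is vector only) and assumes that `"OR"` means the union of the vector and graph results, with `"AND"` as the intersection. How `DiploDocRetriever` actually implements either is not shown in this notebook, and Weaviate's real score fusion is more involved than a weighted sum. The function names and node ids below are hypothetical.

```python
def hybrid_score(keyword_score, vector_score, alpha=0.5):
    """Blend two normalised scores; alpha=0 is keyword only, alpha=1 is vector only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return (1.0 - alpha) * keyword_score + alpha * vector_score


def combine_node_ids(vector_ids, graph_ids, mode="OR"):
    """Merge node ids from two retrievers: OR keeps the union, AND the intersection."""
    if mode == "OR":
        return set(vector_ids) | set(graph_ids)
    if mode == "AND":
        return set(vector_ids) & set(graph_ids)
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical scores and node ids, for illustration only.
print(hybrid_score(keyword_score=0.2, vector_score=0.8, alpha=0.5))
print(sorted(combine_node_ids(["n1", "n2"], ["n2", "n3"], mode="OR")))
print(sorted(combine_node_ids(["n1", "n2"], ["n2", "n3"], mode="AND")))
```

Under these assumptions, `"OR"` favours recall (a node found by either retriever is kept), while `"AND"` would keep only nodes that both retrievers agree on.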


In [15]:
from IPython.display import Markdown, display

In [16]:
query = "What is Switzerland's contribution to the global digital compact?"
response: Response = doc_retriever.query(query)
pprint(
    [
        {"doc": n.metadata.get("file_name"), "page": n.metadata.get("page_label")}
        for n in response.source_nodes
        if n.metadata.get("page_label") is not None
    ]
)
pprint(response.response)

[{'doc': 'GDC-submission_Switzerland.pdf', 'page': '5'},
 {'doc': 'GDC-submission_Switzerland.pdf', 'page': '1'}]
("Switzerland's contribution to the Global Digital Compact is focused on "
 'several areas. They support the implementation of human rights due diligence '
 'by technology companies and advocate for effective remedies for individuals '
 'and communities exposed to human rights risks. Switzerland also emphasizes '
 'the importance of protecting internet freedom and fundamental human rights '
 'online. They commit to supporting efforts to increase accountability for '
 'discrimination and misleading content, while also protecting human and '
 'fundamental rights. Additionally, Switzerland is ready to support the '
 'Co-Facilitators of the Global Digital Compact and work towards the '
 'establishment of a Geneva-based presence of the Office of the '
 "Secretary-General's Envoy on Technology. They believe that an open and "
 'inclusive process, based on existing work streams an

In [17]:
# Display response
display(Markdown(f"<b>{response.response}</b>"))

<b>Switzerland's contribution to the Global Digital Compact is focused on several areas. They support the implementation of human rights due diligence by technology companies and advocate for effective remedies for individuals and communities exposed to human rights risks. Switzerland also emphasizes the importance of protecting internet freedom and fundamental human rights online. They commit to supporting efforts to increase accountability for discrimination and misleading content, while also protecting human and fundamental rights. Additionally, Switzerland is ready to support the Co-Facilitators of the Global Digital Compact and work towards the establishment of a Geneva-based presence of the Office of the Secretary-General's Envoy on Technology. They believe that an open and inclusive process, based on existing work streams and fora, is crucial for developing a Global Digital Compact. Switzerland also highlights the relevance of existing agreements, such as those of the World Summit on the Information Society (WSIS), and suggests considering the intersection between digitalization and climate change in the compact's themes. They propose that the UN Internet Governance Forum (IGF) could offer a platform for periodic discussions and actions related to the follow-up and implementation of the Global Digital Compact.</b>

In [11]:
query = "What are the commitments of Switzerland to the global digital compact?"
response: Response = doc_retriever.query(query)
pprint(
    [
        {"doc": n.metadata.get("file_name"), "page": n.metadata.get("page_label")}
        for n in response.source_nodes
        if n.metadata.get("page_label") is not None
    ]
)
pprint(response.response)

[{'doc': 'GDC-submission_Switzerland.pdf', 'page': '1'},
 {'doc': 'GDC-submission_Switzerland.pdf', 'page': '1'}]
('Switzerland commits to further supporting the efforts to address '
 'connectivity issues, such as through the GIGA Initiative and other efforts '
 'undertaken by ITU, UNESCO, and other UN institutions. Switzerland also '
 'commits to further international discussions and best practice on enabling '
 'environments for the deployment of broadband.')
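
The source listing above shows the same document page twice, which can happen when several retrieved nodes come from one page. A minimal sketch of a helper that reports each (document, page) pair once is given below. The name `unique_sources` is hypothetical, and `SimpleNamespace` objects stand in for the response's source nodes; the only assumption about a node is that it exposes a `metadata` dict with `file_name` and `page_label` keys, as used in the cells above.

```python
from types import SimpleNamespace


def unique_sources(source_nodes):
    """Return one {"doc", "page"} entry per distinct (file_name, page_label), in first-seen order."""
    seen = set()
    sources = []
    for node in source_nodes:
        page = node.metadata.get("page_label")
        if page is None:
            continue  # same filter as the notebook: skip nodes without a page label
        key = (node.metadata.get("file_name"), page)
        if key not in seen:
            seen.add(key)
            sources.append({"doc": key[0], "page": page})
    return sources


# Stand-in nodes mirroring the listing above, plus one node without metadata.
nodes = [
    SimpleNamespace(metadata={"file_name": "GDC-submission_Switzerland.pdf", "page_label": "1"}),
    SimpleNamespace(metadata={"file_name": "GDC-submission_Switzerland.pdf", "page_label": "1"}),
    SimpleNamespace(metadata={}),
]
print(unique_sources(nodes))
```

In the notebook this could be called as `unique_sources(response.source_nodes)` in place of the list comprehension, assuming the nodes behave as described.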


In [12]:
query = "What is Singapore and Switzerland’s position on the inclusion of digital technologies in the global compact?"
response: Response = doc_retriever.query(query)
pprint(
    [
        {"doc": n.metadata.get("file_name"), "page": n.metadata.get("page_label")}
        for n in response.source_nodes
        if n.metadata.get("page_label") is not None
    ]
)
pprint(response.response)

[{'doc': 'GDC-submission_Republic-of-Singapore.pdf', 'page': '5'},
 {'doc': 'GDC-submission_Switzerland.pdf', 'page': '1'}]
('Singapore and Switzerland both support the inclusion of digital technologies '
 'in the global compact. They believe that the starting point for the Global '
 'Digital Compact should be digital inclusion and connectivity. They emphasize '
 'the importance of providing universal access and connectivity to the '
 'internet by 2030. Switzerland specifically mentions that the Global Digital '
 'Compact has the potential to strengthen a principles-based order in the area '
 'of digital cooperation and that it should build on existing agreements and '
 'outcomes of the World Summit on the Information Society. Both countries '
 'express their readiness to support the development of the Global Digital '
 'Compact and to work towards its implementation.')


In [14]:
# Display response
display(Markdown(f"<b>{response.response}</b>"))

<b>Singapore and Switzerland both support the inclusion of digital technologies in the global compact. They believe that the starting point for the Global Digital Compact should be digital inclusion and connectivity. They emphasize the importance of providing universal access and connectivity to the internet by 2030. Switzerland specifically mentions that the Global Digital Compact has the potential to strengthen a principles-based order in the area of digital cooperation and that it should build on existing agreements and outcomes of the World Summit on the Information Society. Both countries express their readiness to support the development of the Global Digital Compact and to work towards its implementation.</b>