The purpose of this notebook is to explore batch processing of documents (mainly PPTs and PDFs), with appropriate chunking and metadata curation, as the groundwork for a good RAG application.

1. Read and parse PDFs
2. Build a RAG on a vector store (planned as FAISS; the cells below use ChromaDB)
3. Improve the RAG by pre-processing the documents
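
To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap, where each chunk carries its page's metadata forward. It is not the code used by `doc_processing` or `vector_store` (their internals are not shown in this notebook); `Chunk`, `chunk_page`, and the 500/50 character sizes are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of page text plus the metadata needed to trace it back."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_page(text, metadata, chunk_size=500, overlap=50):
    """Split one page into overlapping character windows.

    Hypothetical helper: sizes are in characters, not tokens.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + chunk_size]
        # Copy the page metadata so each chunk can be filtered or cited on its own.
        chunks.append(Chunk(piece, {**metadata, "chunk_index": index}))
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the page
    return chunks


page_chunks = chunk_page("a" * 1200, {"source": "example.pdf", "page": 0})
print(len(page_chunks), [len(c.text) for c in page_chunks])
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.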

In [1]:
from content_extraction import download_bill_pdfs

download_bill_pdfs("resources/legco_bill_index.json", "resources/legco")

File already exists, skipping: resources/legco\b201610141.pdf
File already exists, skipping: resources/legco\b201611111.pdf
File already exists, skipping: resources/legco\b201611181.pdf
File already exists, skipping: resources/legco\b201612021.pdf
File already exists, skipping: resources/legco\b201612301.pdf
File already exists, skipping: resources/legco\b201701271.pdf
File already exists, skipping: resources/legco\b201701272.pdf
File already exists, skipping: resources/legco\b201702221.pdf
File already exists, skipping: resources/legco\b201702241.pdf
File already exists, skipping: resources/legco\b201703031.pdf
File already exists, skipping: resources/legco\b201703101.pdf
File already exists, skipping: resources/legco\b201703102.pdf
File already exists, skipping: resources/legco\b201703241.pdf
File already exists, skipping: resources/legco\b201703311.pdf
File already exists, skipping: resources/legco\b201704071.pdf
File already exists, skipping: resources/legco\b201705051.pdf
File alr
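
The log above suggests that the download step skips files already on disk, which makes the cell safe to re-run. The implementation of `download_bill_pdfs` is not shown here, so the following is only a sketch of that check; `split_existing` and the example URLs are hypothetical, and no network access is performed.

```python
from pathlib import Path
from urllib.parse import urlparse


def split_existing(urls, dest_dir):
    """Partition URLs into (to_fetch, skipped) by whether the target file exists.

    Hypothetical helper: the local file name is taken from the last URL path segment.
    """
    dest = Path(dest_dir)
    to_fetch, skipped = [], []
    for url in urls:
        target = dest / Path(urlparse(url).path).name
        (skipped if target.exists() else to_fetch).append((url, target))
    return to_fetch, skipped


to_fetch, skipped = split_existing(
    ["https://example.org/bills/b201610141.pdf"], "resources/legco"
)
for url, target in skipped:
    print(f"File already exists, skipping: {target}")
```

Building paths with `pathlib` also avoids the mixed `/` and `\` separators visible in the log.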

In [2]:
from doc_processing import load_path
# Directory containing PDF files
pdf_dir = "C:/Users/surface/My Codes/AgentGallery/resources/legco"

# Load all PDFs
all_pages = await load_path(pdf_dir)
print(f"\nTotal pages loaded: {len(all_pages)}")

Found 139 PDF files
Processing b201610141.pdf...
Loaded 6 pages from b201610141.pdf
Processing b201611111.pdf...
Loaded 216 pages from b201611111.pdf
Processing b201611181.pdf...
Loaded 19 pages from b201611181.pdf
Processing b201612021.pdf...
Loaded 17 pages from b201612021.pdf
Processing b201612301.pdf...
Loaded 26 pages from b201612301.pdf
Processing b201701271.pdf...
Loaded 10 pages from b201701271.pdf
Processing b201701272.pdf...
Loaded 13 pages from b201701272.pdf
Processing b201702221.pdf...
Loaded 7 pages from b201702221.pdf
Processing b201702241.pdf...
Loaded 52 pages from b201702241.pdf
Processing b201703031.pdf...
Loaded 13 pages from b201703031.pdf
Processing b201703101.pdf...
Loaded 42 pages from b201703101.pdf
Processing b201703102.pdf...
Loaded 237 pages from b201703102.pdf
Processing b201703241.pdf...
Loaded 6 pages from b201703241.pdf
Processing b201703311.pdf...
Loaded 25 pages from b201703311.pdf
Processing b201704071.pdf...
Loaded 10 pages from b201704071.pdf
Proces

Advanced encoding /UniJIS-UTF16-H not implemented yet


Loaded 7 pages from ci20240220cb1-185-3-e.pdf
Processing ci20240220cb1-221-1-c.pdf...
Loaded 10 pages from ci20240220cb1-221-1-c.pdf
Processing ci20240408cb1-305-3-e.pdf...
Loaded 11 pages from ci20240408cb1-305-3-e.pdf
Processing ci20250218cb2-240-6-e.pdf...
Loaded 14 pages from ci20250218cb2-240-6-e.pdf
Processing ci20250318cb2-445-5-e.pdf...
Loaded 6 pages from ci20250318cb2-445-5-e.pdf
Processing s620192323p1.pdf...
Loaded 13 pages from s620192323p1.pdf

Total pages loaded: 6605


In [1]:
from vector_store import VectorStoreManager

# Initialize vector store manager (no docs loaded yet)
vs_manager = VectorStoreManager(
    index_path="vector_store",
    index_name="test_chromadb",
    max_splits_per_batch=100
)

# Add documents in batches; left commented out because the collection is already populated
# vs_manager.add_documents(all_pages)


ChromaDB instance ready at vector_store (collection: test_chromadb)
Current document count: 12869
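
The `max_splits_per_batch=100` argument suggests that splits are written to the store in fixed-size batches. How `VectorStoreManager` does this internally is not shown in this notebook; the following is a minimal sketch of the idea, with `iter_batches` as a hypothetical helper name.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


splits = [f"split-{i}" for i in range(250)]
for batch in iter_batches(splits, 100):
    # A real pipeline would embed and store each batch here.
    print(len(batch))
```

Batching bounds the memory and request size of each write; if all 12869 entries were added 100 at a time, that would take 129 batches.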


In [2]:

# Inspect collection metadata and stats
collection_info = vs_manager.inspect_collection()
print("\nCollection Information:")
print(f"Collection Name: {collection_info['collection_name']}")
print(f"Total Documents: {collection_info['total_documents']}")
print(f"Metadata Fields: {collection_info['metadata_fields']}")



Collection Information:
Collection Name: test_chromadb
Total Documents: 12869
Metadata Fields: ['page', 'author', 'creationdate', 'keywords', 'category', 'total_pages', 'title', 'producer', 'page_label', 'moddate', 'subject', 'trapped', 'company', 'sourcemodified', 'source', 'date', 'creator', 'comments']
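
Many of the fields listed above (for example `producer`, `trapped`, `creator`) come straight from the PDF header and are unlikely to help retrieval. One possible form of metadata curation is to keep a short allow-list and drop empty values. The sketch below is an assumption, not what the notebook's modules do: `KEEP_FIELDS` and `curate_metadata` are hypothetical names, and the choice of fields is illustrative.

```python
# Hypothetical allow-list; the field names themselves come from the collection above.
KEEP_FIELDS = ("source", "page", "total_pages", "title", "creationdate")


def curate_metadata(meta, keep=KEEP_FIELDS):
    """Return only allow-listed fields, with strings trimmed and empties dropped."""
    curated = {}
    for key in keep:
        value = meta.get(key)
        if isinstance(value, str):
            value = value.strip()
        if value is None or value == "":
            continue  # skip missing or blank fields; page 0 is kept
        curated[key] = value
    return curated


raw = {"source": " resources/legco/b201610141.pdf ", "page": 0, "title": "", "producer": "X"}
print(curate_metadata(raw))
```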


In [1]:
from rag_wrapper import rag_instance
# Initialize RAG system
rag_legco = rag_instance(vector_store_path="vector_store", index_name="test_chromadb")


ChromaDB instance ready at vector_store (collection: test_chromadb)
Current document count: 12867
Loading vector store from vector_store with name test_chromadb...
Vector store loaded successfully
The current document count is 12867
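
Once the store is loaded, answering a question starts with retrieving the chunks whose embeddings are closest to the question's embedding. The sketch below shows that idea with a brute-force cosine-similarity ranking over toy vectors. It is not how ChromaDB or `rag_instance` implement search; `cosine` and `top_k` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.0], toy_index, k=2))
```

The retrieved chunks are then passed to the language model as context, which is why chunk quality and metadata directly affect the answers.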


In [2]:

# Example questions
questions = [
    "What are the key points about national key labs?",
    "What is the government's policy on research funding?",
    "How does the government support innovation?"
]

# Process each question
for question in questions:
    print(f"\nQuestion: {question}")
    response = rag_legco.query(question)
    rag_legco.print_response(response) 


Question: What are the key points about national key labs?



Answer:
The key points about the State Key Laboratories (SKLs) in Hong Kong include:

1. **Background and Management**: Launched in 1984, the SKL scheme is managed by the Mainland's Ministry of Science and Technology (MOST). It focuses on nurturing both basic and applied technology research and development, with a high status signifying recognition from MOST.

2. **Restructuring Initiative**: In 2022, MOST initiated an optimization and restructuring of SKLs, requiring them to address significant national scientific needs and clearly define their mission and positioning. Post-restructuring, the Chinese name for these labs will change from “国家重点实验室” to “全国重点实验室,” though the English title remains unchanged.

3. **Evaluation Criteria**:
   - **Mission and Positioning**: Must align with national development strategies and leverage Hong Kong’s international advantages.
   - **Research Level and Impact**: Should be internationally advanced, supporting major national needs and critical core t