# Document loading via LangChain and LlamaIndex

- [LangChain](https://python.langchain.com/docs/integrations/document_loaders/)
- [LlamaIndex](https://developers.llamaindex.ai/python/framework/module_guides/loading/simpledirectoryreader/)

In [2]:
from langchain_community.document_loaders import TextLoader, PyMuPDFLoader, CSVLoader
from llama_index.core import SimpleDirectoryReader

In [5]:
# LangChain
TextLoader("../data/sample.txt").load()[0]

Document(metadata={'source': '../data/sample.txt'}, page_content='Terms And Conditions\nThese Terms of Use ("Terms") constitute an enforceable contract between you and Euron ("Euron", "we", or "our"), a subsidiary of Engage Sphere Technology Private Limited. By accessing or using our website, mobile applications, and related services (collectively, "Services"), you agree to be bound by these Terms. Please review them carefully as they contain important information about your legal rights, remedies, and obligations.\n\nTable of Contents\nAccounts\nCommunications\nContent Enrollment and Access\nPayments and Refunds\nDigital Product Access & Shipping Policy\nContent and Behavior Rules\nEuron\'s Rights to Content You Post\nUsing Euron at Your Own Risk\nEuron\'s Rights\nSubscription Terms\nMiscellaneous Legal Terms\nDispute Resolution\nUpdating These Terms\nHow to Contact Us\n1. Accounts\nYou need an account for most activities on our platform. Keep your password somewhere safe because you\

In [6]:
TextLoader("../data/sample.txt").load()[0].page_content

'Terms And Conditions\nThese Terms of Use ("Terms") constitute an enforceable contract between you and Euron ("Euron", "we", or "our"), a subsidiary of Engage Sphere Technology Private Limited. By accessing or using our website, mobile applications, and related services (collectively, "Services"), you agree to be bound by these Terms. Please review them carefully as they contain important information about your legal rights, remedies, and obligations.\n\nTable of Contents\nAccounts\nCommunications\nContent Enrollment and Access\nPayments and Refunds\nDigital Product Access & Shipping Policy\nContent and Behavior Rules\nEuron\'s Rights to Content You Post\nUsing Euron at Your Own Risk\nEuron\'s Rights\nSubscription Terms\nMiscellaneous Legal Terms\nDispute Resolution\nUpdating These Terms\nHow to Contact Us\n1. Accounts\nYou need an account for most activities on our platform. Keep your password somewhere safe because you\'re responsible for all activity associated with your account. If

In [9]:
# LangChain
CSVLoader("../data/sample.csv").load()

[Document(metadata={'source': '../data/sample.csv', 'row': 0}, page_content='Course: Python Basics\nCategory: Programming\nInstructor: Prince Katiyar\nDuration (hrs): 10'),
 Document(metadata={'source': '../data/sample.csv', 'row': 1}, page_content='Course: Advanced SQL\nCategory: Database\nInstructor: Boktiar Bappy\nDuration (hrs): 8'),
 Document(metadata={'source': '../data/sample.csv', 'row': 2}, page_content='Course: LangChain Projects\nCategory: AI Agents\nInstructor: Prince Katiyar\nDuration (hrs): 15'),
 Document(metadata={'source': '../data/sample.csv', 'row': 3}, page_content='Course: AI Resume Matcher\nCategory: Career Tools\nInstructor: Boktiar Bappy\nDuration (hrs): 7')]
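
The output above suggests that each CSV row becomes one document whose `page_content` is a newline-separated list of `column: value` pairs, with the row index kept in the metadata. The following is a minimal standard-library sketch of that mapping. It is not `CSVLoader`'s actual implementation; the helper name `rows_to_page_content` is hypothetical, and the sample CSV text is an assumption reconstructed from the first row of the output.

```python
import csv
import io


def rows_to_page_content(csv_text: str) -> list[dict]:
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per field, joined with newlines.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        records.append({"page_content": page_content, "metadata": {"row": row_index}})
    return records


# Assumed file contents, reconstructed from the first row of the output above.
sample = (
    "Course,Category,Instructor,Duration (hrs)\n"
    "Python Basics,Programming,Prince Katiyar,10\n"
)
records = rows_to_page_content(sample)
print(records[0]["page_content"])
```

Keeping one document per row means each course can later be retrieved on its own, with the row number available for tracing a result back to the file.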

In [11]:
# LangChain
PyMuPDFLoader("../data/sample.pdf").load()

[Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': '', 'creationdate': 'D:20250620072739', 'source': '../data/sample.pdf', 'file_path': '../data/sample.pdf', 'total_pages': 1, 'format': 'PDF 1.3', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': 'D:20250620072739', 'page': 0}, page_content='Euron is an AI-powered learning and automation platform offering over 50 courses and 100+\nreal-world projects. \nIt empowers developers with tools, job support, and AI capabilities through projects in RAG, Agentic\nAI, MLOps, and Cloud.')]

In [14]:
# LangChain
from langchain_docling import DoclingLoader

DoclingLoader("https://arxiv.org/pdf/2408.09869").load()

2025-09-18 13:35:53,635 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-09-18 13:35:53,645 - INFO - Going to convert document batch...
2025-09-18 13:35:53,645 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e647edf348883bed75367b22fbe60347
2025-09-18 13:35:53,646 - INFO - Accelerator device: 'mps'
2025-09-18 13:35:55,130 - INFO - Accelerator device: 'mps'
2025-09-18 13:35:56,423 - INFO - Accelerator device: 'mps'
2025-09-18 13:35:56,866 - INFO - Processing document 2408.09869v5.pdf
2025-09-18 13:36:03,635 - INFO - Finished converting document 2408.09869v5.pdf in 10.82 sec.


[Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/3', 'parent': {'$ref': '#/body'}, 'children': [], 'content_layer': 'body', 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 113.643, 't': 481.532, 'r': 498.359, 'b': 439.849, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 295]}]}, {'self_ref': '#/texts/4', 'parent': {'$ref': '#/body'}, 'children': [], 'content_layer': 'body', 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 249.283, 't': 427.545, 'r': 362.717, 'b': 408.084, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 50]}]}], 'headings': ['Version 1.0'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}}, page_content='Version 1.0\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbaue

In [15]:
DoclingLoader("../data/Querying.docx").load()

2025-09-18 13:36:57,729 - INFO - detected formats: [<InputFormat.DOCX: 'docx'>]
2025-09-18 13:36:57,737 - INFO - Going to convert document batch...
2025-09-18 13:36:57,738 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-09-18 13:36:57,738 - INFO - Processing document Querying.docx
2025-09-18 13:36:57,762 - INFO - Finished converting document Querying.docx in 0.04 sec.


[Document(metadata={'source': '../data/Querying.docx', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/0', 'parent': {'$ref': '#/body'}, 'children': [], 'content_layer': 'body', 'label': 'paragraph', 'prov': []}], 'origin': {'mimetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'binary_hash': 12777617756686467286, 'filename': 'Querying.docx'}}}, page_content='**Querying – Sorting, Filtering & Grouping Assignment Questions**'),
 Document(metadata={'source': '../data/Querying.docx', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/1', 'parent': {'$ref': '#/groups/0'}, 'children': [{'$ref': '#/groups/1'}], 'content_layer': 'body', 'label': 'list_item', 'prov': []}, {'self_ref': '#/texts/4', 'parent': {'$ref': '#/groups/0'}, 'children': [{'$ref': '#/groups/2'}], 'content_layer': 'body', 'label': 'list_ite

In [16]:
from langchain_community.document_loaders import WebBaseLoader

WebBaseLoader("https://deepshieldai.com").load()



[Document(metadata={'source': 'https://deepshieldai.com', 'title': 'Top Gen AI Transformation Partner | DeepshieldAI', 'description': 'Unlock the power of your data with our Gen AI based expert solutions | Get AI Ready in 4 weeks!', 'language': 'en'}, page_content='\n\n\n\n\n\n\nTop Gen AI Transformation Partner | DeepshieldAI\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n')]
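
The `page_content` returned for this page is almost entirely blank lines around the page title. Text like this is usually worth cleaning before further processing. Below is a small sketch of one way to do that with the standard library; the helper name `collapse_whitespace` is hypothetical and not part of LangChain, and `raw` is copied from the output above.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Squeeze runs of spaces/tabs, strip each line, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# The page_content string from the WebBaseLoader output above.
raw = "\n\n\n\n\n\n\nTop Gen AI Transformation Partner | DeepshieldAI\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
cleaned = collapse_whitespace(raw)
print(cleaned)
```

For a loaded document, the same helper could be applied to `doc.page_content` and the result used to build a cleaned copy.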

In [17]:
import json
from typing import List

from langchain_core.documents import Document


def load_notion_json(path: str) -> List[Document]:
    """Load a JSON export (a list of entries with 'title' and 'content') as Documents."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    # Build one Document per entry; keep the title and file path as metadata.
    documents = []
    for entry in data:
        page_content = f"Title: {entry['title']}\n\nContent: {entry['content']}"
        metadata = {"title": entry.get("title", ""), "source": path}
        documents.append(Document(page_content=page_content, metadata=metadata))
    return documents

In [19]:
load_notion_json("../data/notion_export.json")

[Document(metadata={'title': 'Euron Features', 'source': '../data/notion_export.json'}, page_content='Title: Euron Features\n\nContent: Over 50 courses, 100+ projects, Resume AI, Focus Mode, Job Portal, EURI AI tools.'),
 Document(metadata={'title': 'Popular Projects', 'source': '../data/notion_export.json'}, page_content='Title: Popular Projects\n\nContent: PDF QA Bot, AI Job Matcher, LangChain Agent with Tools, FastAPI Deployment.')]

In [21]:
# LlamaIndex
SimpleDirectoryReader(input_files=["../data/sample.txt"]).load_data()

[Document(id_='b3f22ae8-d164-4248-84fc-350a7303f1c9', embedding=None, metadata={'file_path': '../data/sample.txt', 'file_name': 'sample.txt', 'file_type': 'text/plain', 'file_size': 4492, 'creation_date': '2025-06-20', 'last_modified_date': '2025-06-20'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Terms And Conditions\nThese Terms of Use ("Terms") constitute an enforceable contract between you and Euron ("Euron", "we", or "our"), a subsidiary of Engage Sphere Technology Private Limited. By accessing or using our website, mobile applications, and related services (collectively, "Services"), you agree to be bound by these Terms. Please 

In [22]:
# LangChain
TextLoader("../data/sample.txt").load()

[Document(metadata={'source': '../data/sample.txt'}, page_content='Terms And Conditions\nThese Terms of Use ("Terms") constitute an enforceable contract between you and Euron ("Euron", "we", or "our"), a subsidiary of Engage Sphere Technology Private Limited. By accessing or using our website, mobile applications, and related services (collectively, "Services"), you agree to be bound by these Terms. Please review them carefully as they contain important information about your legal rights, remedies, and obligations.\n\nTable of Contents\nAccounts\nCommunications\nContent Enrollment and Access\nPayments and Refunds\nDigital Product Access & Shipping Policy\nContent and Behavior Rules\nEuron\'s Rights to Content You Post\nUsing Euron at Your Own Risk\nEuron\'s Rights\nSubscription Terms\nMiscellaneous Legal Terms\nDispute Resolution\nUpdating These Terms\nHow to Contact Us\n1. Accounts\nYou need an account for most activities on our platform. Keep your password somewhere safe because you

## LlamaIndex seems to capture richer metadata by default (file name, type, size, dates) than LangChain's `TextLoader` (source path only)
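
If similar file-level metadata is wanted on the LangChain side, it can be computed separately and merged into each document's `metadata` dict. The sketch below is one way to build such a dict with the standard library. The helper name `file_metadata` is hypothetical, the keys only mirror those seen in the LlamaIndex output above, and this is not how LlamaIndex itself computes them. `creation_date` is left out because `os.stat` does not report a true creation time on every platform.

```python
import mimetypes
import os
import tempfile
from datetime import datetime, timezone


def file_metadata(path: str) -> dict:
    """Collect basic file metadata, using keys similar to the LlamaIndex output."""
    stat = os.stat(path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_path": path,
        "file_name": os.path.basename(path),
        # Guessed from the file extension; None if the type is unknown.
        "file_type": mimetypes.guess_type(path)[0],
        "file_size": stat.st_size,
        "last_modified_date": modified.strftime("%Y-%m-%d"),
    }


# Demo on a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello")
meta = file_metadata(tmp.name)
print(meta)
os.remove(tmp.name)
```

With a LangChain document `doc` loaded by `TextLoader`, something like `doc.metadata.update(file_metadata(doc.metadata["source"]))` would attach these fields alongside the existing `source` key.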

In [24]:
SimpleDirectoryReader(input_files=["../data/notion_export.json"]).load_data()

[Document(id_='f38fc28e-2ac9-4f36-9727-106d68b5dd5c', embedding=None, metadata={'file_path': '../data/notion_export.json', 'file_name': 'notion_export.json', 'file_type': 'application/json', 'file_size': 276, 'creation_date': '2025-06-20', 'last_modified_date': '2025-06-20'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='[\n  {\n    "title": "Euron Features",\n    "content": "Over 50 courses, 100+ projects, Resume AI, Focus Mode, Job Portal, EURI AI tools."\n  },\n  {\n    "title": "Popular Projects",\n    "content": "PDF QA Bot, AI Job Matcher, LangChain Agent with Tools, FastAPI Deployment."\n  }\n]', path=None, url=None, mimetype=None