## Document loaders

Documentation: https://python.langchain.com/v0.2/docs/integrations/document_loaders/#all-document-loaders

## 1- TextLoader
`TextLoader` reads a plain-text (`.txt`) file and returns its contents as a single `Document`.

In [1]:
from langchain_community.document_loaders import TextLoader
# Creating the loader only stores the path; the file is read when load() is called.
loader = TextLoader('sample3.txt')
loader

<langchain_community.document_loaders.text.TextLoader at 0x298a3b77770>

In [2]:
# load() returns a list of Document objects (a single one for a text file).
document = loader.load()
document

[Document(metadata={'source': 'sample3.txt'}, page_content='Quod equidem non reprehendo;\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Quibus natura iure responderit non esse verum aliunde finem beate vivendi, a se principia rei gerendae peti; Quae enim adhuc protulisti, popularia sunt, ego autem a te elegantiora desidero. Duo Reges: constructio interrete. Tum Lucius: Mihi vero ista valde probata sunt, quod item fratri puto. Bestiarum vero nullum iudicium puto. Nihil enim iam habes, quod ad corpus referas; Deinde prima illa, quae in congressu solemus: Quid tu, inquit, huc? Et homini, qui ceteris animantibus plurimum praestat, praecipue a natura nihil datum esse dicemus?\n\nIam id ipsum absurdum, maximum malum neglegi. Quod ea non occurrentia fingunt, vincunt Aristonem; Atqui perspicuum est hominem e corpore animoque constare, cum primae sint animi partes, secundae corporis. Fieri, inquam, Triari, nullo pacto potest, ut non dicas, quid non probes eius, a quo dissentias. Equi
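
To make the shape of that result concrete, here is a minimal standard-library sketch of what a text loader does conceptually: read the file and wrap its contents in one object that carries `page_content` and a `metadata` dict with the `source` path. `SimpleDocument` and `load_text_file` are hypothetical names used only for illustration; this is not LangChain's implementation, and the `utf-8` default is an assumption.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two fields shown in the output above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read one text file and return it as a one-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Write a throwaway file so the example runs without sample3.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "demo.txt"
    sample.write_text("Lorem ipsum dolor sit amet.", encoding="utf-8")
    docs = load_text_file(sample)
    print(len(docs), docs[0].page_content)
```

The list wrapper matters: every loader in this notebook returns a list, so downstream code can treat a one-file text load and a multi-page PDF load the same way.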

## 2- Reading a PDF file

In [4]:
from langchain_community.document_loaders import PyPDFLoader
# PyPDFLoader relies on the pypdf package and returns one Document per page.
loader = PyPDFLoader('n8ncheat.pdf')
document = loader.load()
document

Ignoring wrong pointing object 208 0 (offset 0)


[Document(metadata={'producer': 'macOS Version 15.4 (Build 24E248) Quartz PDFContext', 'creator': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36', 'creationdate': "D:20250418050915Z00'00'", 'title': 'n8n Cheat Sheet Dashboard', 'moddate': "D:20250418050915Z00'00'", 'source': 'n8ncheat.pdf', 'total_pages': 17, 'page': 0, 'page_label': '1'}, page_content='\uf395Docker Self-Hosting \uf0c5\nUse Docker to install, update, and run n8n quickly. Adjust volumes,\nenvironment variables, and ports as needed.\nClone the Starter\ngit clone https://github.com/n8n-io/self-hosted-n8n-starter.git\ncd self-hosted-n8n-starter\nPull the Latest Version\nsudo docker pull n8nio/n8n:latest\nStop Existing Container\nsudo docker stop n8n\nRemove Container\nsudo docker rm n8n\nExample Docker Run\ndocker run -it --rm \\\n--name n8n \\\n-p 5678:5678 \\\n-v ~/.n8n:/home/node/.n8n \\\n-e WEBHOOK_URL="https://n8n.example.com/" \\\n-e N8N_SMTP_HOS

In [7]:
type(document)

list

In [8]:
type(document[0])

langchain_core.documents.base.Document
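
Since `document` is a plain list of `Document` objects, ordinary Python iteration applies. The sketch below defines a hypothetical helper, `page_previews`, that relies only on the `page_content` and `metadata` attributes, so it should also accept the list returned by `loader.load()`. It assumes the `page` metadata key seen in the PDF output above. The `SimpleNamespace` objects are stand-ins so the example runs without a PDF.

```python
from types import SimpleNamespace


def page_previews(docs, width=40):
    """Return (page_number, first `width` characters) for each document."""
    previews = []
    for doc in docs:
        # Fall back to None when a loader does not set a 'page' key.
        page = doc.metadata.get("page")
        # Collapse newlines and repeated spaces so each preview fits on one line.
        text = " ".join(doc.page_content.split())
        previews.append((page, text[:width]))
    return previews


# Stand-in objects with the same two attributes as a LangChain Document.
fake_pages = [
    SimpleNamespace(
        page_content="Docker Self-Hosting\nUse Docker to install n8n.",
        metadata={"source": "n8ncheat.pdf", "page": 0},
    ),
    SimpleNamespace(
        page_content="Second page text",
        metadata={"source": "n8ncheat.pdf", "page": 1},
    ),
]

for page, preview in page_previews(fake_pages, width=25):
    print(page, preview)
```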

## 3- Web-based loaders

In [22]:
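from langchain_community.document_loaders import WebBaseLoader
import bs4
# parse_only keeps only the elements whose CSS class matches one of these names.
# The names must exist in the target page's HTML; if none match, page_content is empty.
loader = WebBaseLoader(
    web_path=('https://medium.com/@aydinKerem/what-is-an-llm-agent-and-how-does-it-work-1d4d9e4381ca',),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(
        class_=("post-title", "post-content", "post-header")
    )),
)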

In [23]:
loader.load()

[Document(metadata={'source': 'https://medium.com/@aydinKerem/what-is-an-llm-agent-and-how-does-it-work-1d4d9e4381ca'}, page_content='')]
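
The empty `page_content` above is what one would expect when no element on the page carries any of the requested class names. That is a likely explanation rather than a confirmed one, since the site could also have served a different page to the request. The sketch below reproduces the effect offline with the standard library's `html.parser` as a stand-in for the class filter. It is not how `bs4.SoupStrainer` is implemented, and the sample HTML and its class names (`story`, `headline`, `body`) are made up for illustration.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect text only from elements whose class attribute matches."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.depth = 0      # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        # Start counting at a matching element, and keep counting its children.
        if self.depth or classes & self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract(html, wanted_classes):
    """Return the text found inside elements with one of the wanted classes."""
    parser = ClassTextExtractor(wanted_classes)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Simplification: the sample has no void tags such as <br>, which this
# depth counter does not handle.
SAMPLE_HTML = (
    '<article class="story">'
    '<h1 class="headline">What is an agent?</h1>'
    '<p class="body">Some text.</p>'
    '</article>'
)

print(repr(extract(SAMPLE_HTML, ("post-title", "post-content", "post-header"))))
print(repr(extract(SAMPLE_HTML, ("headline", "body"))))
```

In practice this suggests inspecting the page's HTML for its real class names, or dropping `bs_kwargs` entirely to load the whole page, before assuming the loader itself failed.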

## 4- Arxiv Loader

In [26]:
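from langchain_community.document_loaders import ArxivLoader
# The query is an arXiv identifier, so it should match a single paper
# even though load_max_docs allows up to 2.
doc = ArxivLoader(
    query="1706.03762",
    load_max_docs=2
).load()
len(doc)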

1

In [27]:
doc

[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntr
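
The metadata dictionary carries the fields most often needed downstream. As a small sketch, the hypothetical helper `short_citation` below builds a one-line reference from the `Title`, `Authors`, and `Published` keys visible in the output above. The sample dictionary copies those values by hand so the example runs offline; other loaders use different key names, so the fallbacks are an assumption about how to handle missing fields.

```python
def short_citation(metadata):
    """Format 'First Author et al., Title (Published)' from loader metadata."""
    raw_authors = metadata.get("Authors", "")
    authors = [name.strip() for name in raw_authors.split(",") if name.strip()]
    if not authors:
        lead = "Unknown author"
    elif len(authors) == 1:
        lead = authors[0]
    else:
        lead = f"{authors[0]} et al."
    title = metadata.get("Title", "Untitled")
    published = metadata.get("Published", "n.d.")
    return f"{lead}, {title} ({published})"


# Values copied from the ArxivLoader output shown above.
sample_metadata = {
    "Published": "2023-08-02",
    "Title": "Attention Is All You Need",
    "Authors": "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, "
               "Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin",
}

print(short_citation(sample_metadata))
```

With a real result, the call would be `short_citation(doc[0].metadata)`.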

## 5- Wikipedia Loader

In [29]:
from langchain_community.document_loaders import WikipediaLoader
# Search Wikipedia for the query and keep at most 2 matching pages.
doc = WikipediaLoader(
    query="Langchain",
    load_max_docs=2
).load()
len(doc)

2

In [30]:
doc

[Document(metadata={'title': 'LangChain', 'summary': "LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.\n\n", 'source': 'https://en.wikipedia.org/wiki/LangChain'}, page_content='LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain\'s use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.\n\n\n== History ==\nLangChain was launched in October 2022 as an open source project by Harrison Chase, while working at machine learning startup Robust Intelligence. The project quickly garnered popularity, with improvements f

In [31]:
doc[0]

Document(metadata={'title': 'LangChain', 'summary': "LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.\n\n", 'source': 'https://en.wikipedia.org/wiki/LangChain'}, page_content='LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into applications. As a language model integration framework, LangChain\'s use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.\n\n\n== History ==\nLangChain was launched in October 2022 as an open source project by Harrison Chase, while working at machine learning startup Robust Intelligence. The project quickly garnered popularity, with improvements fr