# Data Ingestion Using Document Loaders

A Document Loader in LangChain reads data from a source, such as a text file, PDF, web page, or database, and returns it as a list of `Document` objects, each holding the text in `page_content` and details about its origin in `metadata`. The loaded documents can then feed downstream tasks such as question answering, summarization, and chatbots.

- *API*: https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/

## Load a Text File

In [1]:
from langchain_community.document_loaders import TextLoader
textloader = TextLoader("sample.txt")
textloader

<langchain_community.document_loaders.text.TextLoader at 0x21623e1c5e0>

In [2]:
text_doc = textloader.load()
text_doc

[Document(metadata={'source': 'sample.txt'}, page_content="LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs).  \n\nIt provides tools for loading, processing, and managing different types of data sources such as text files, PDFs, web pages, and databases.  \n\nUsing LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize it for various AI-based tasks.\n")]

Here the loader reads the entire content of `sample.txt` and returns it as a single `Document`, with the file path stored under `source` in the metadata.
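To make the "one file, one document" behavior concrete, here is a minimal standard-library sketch of what a text loader does conceptually. `SimpleDocument` and `load_text_file` are hypothetical names used only for illustration; this is not LangChain's implementation, just the same shape of result (a list holding one object with `page_content` and `metadata`).

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read the whole file and wrap it in a single-element list."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]
```

Returning a list even for a single file keeps the interface uniform with loaders that produce many documents, such as one per PDF page.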

## Read a PDF File

In [8]:
from langchain_community.document_loaders import PyPDFLoader 
pdf_loader = PyPDFLoader("Attention_is_ All_You_Need.pdf")
pdf_doc = pdf_loader.load()
pdf_doc

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'Attention_is_ All_You_Need.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoog

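`PyPDFLoader` reads the PDF and returns one `Document` per page. In the metadata, `page` is the zero-based page index and `total_pages` is the page count (15 for this file).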
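Because each page arrives as its own document, a common next step is to look up a single page or stitch the pages back together. The sketch below assumes only that each item has `page_content` and a `metadata` dict with a zero-based `page` key, as in the output above. `join_pages` and `get_page` are hypothetical helper names, and the `SimpleNamespace` objects are stand-ins so the example runs without a PDF.

```python
from types import SimpleNamespace


def join_pages(docs, separator="\n\n"):
    """Concatenate per-page documents into one string, in page order."""
    ordered = sorted(docs, key=lambda d: d.metadata.get("page", 0))
    return separator.join(d.page_content for d in ordered)


def get_page(docs, page_number):
    """Return the document whose zero-based 'page' metadata matches, or None."""
    for d in docs:
        if d.metadata.get("page") == page_number:
            return d
    return None


# Stand-in objects with the same two attributes as the loader's output.
pages = [
    SimpleNamespace(page_content="second page", metadata={"page": 1}),
    SimpleNamespace(page_content="first page", metadata={"page": 0}),
]
full_text = join_pages(pages)
```

With real loader output, the same helpers could be called as `join_pages(pdf_doc)` or `get_page(pdf_doc, 0)`.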

## Load a Web Page

In [9]:
from langchain_community.document_loaders import WebBaseLoader
webloader = WebBaseLoader(web_path=("https://medium.com/@sateeshfrnd/understanding-the-langchain-ecosystem-616b33f5cd15"))
webloader

USER_AGENT environment variable not set, consider setting it to identify your requests.


<langchain_community.document_loaders.web_base.WebBaseLoader at 0x21649c4a0b0>
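The `USER_AGENT` message above is a warning, not an error: the loader still works, but its requests are not identified. One way to address it, assuming the variable is read when the web loader module is first imported, is to set it before that import. The value `my-notebook/0.1` is a placeholder, not something the library requires.

```python
import os

# Set a descriptive User-Agent before importing the web loader.
# "my-notebook/0.1" is a placeholder; setdefault keeps any value already set.
os.environ.setdefault("USER_AGENT", "my-notebook/0.1")

# Import the loader only after the variable is in place, for example:
# from langchain_community.document_loaders import WebBaseLoader
```

In a notebook, this usually means putting the assignment in the first cell and restarting the kernel if the loader was already imported.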

In [15]:
webloader = WebBaseLoader(web_paths=("https://github.com/sateeshfrnd/sateeshfrnd/blob/main/README.md", ),)
webloader

<langchain_community.document_loaders.web_base.WebBaseLoader at 0x2164a28f2b0>

In [16]:
webloader.load()

[Document(metadata={'source': 'https://github.com/sateeshfrnd/sateeshfrnd/blob/main/README.md', 'title': 'sateeshfrnd/README.md at main · sateeshfrnd/sateeshfrnd · GitHub', 'description': 'Config files for my GitHub profile. Contribute to sateeshfrnd/sateeshfrnd development by creating an account on GitHub.', 'language': 'en'}, page_content="\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nsateeshfrnd/README.md at main · sateeshfrnd/sateeshfrnd · GitHub\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\n\n\n\n\n\n\nNavigation Menu\n\nToggle navigation\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n            Sign in\n          \n\n\n\n\n\n\n\n\n        Product\n        \n\n\n\n\n\n\n\n\n\n\n\n\n\nGitHub Copilot\n        Write better code with AI\n      \n\n\n\n\n\n\n\nSecurity\n        Find and fix vulnerabilities\n      \

In [17]:
import bs4
webloader1 = WebBaseLoader(web_paths=("https://github.com/sateeshfrnd/sateeshfrnd/blob/main/README.md", ),
                           bs_kwargs=dict(
                               parse_only = bs4.SoupStrainer(class_ =("markdown-body entry-content container-lg","markdown-heading",))
                            )
                           )
webloader1

<langchain_community.document_loaders.web_base.WebBaseLoader at 0x21649c43a30>

In [18]:
webloader1.load()

[Document(metadata={'source': 'https://github.com/sateeshfrnd/sateeshfrnd/blob/main/README.md'}, page_content="Hello, I am Satish 👋 \nA passionate coder and perpetual learner\n\n🔹 About Me\nHi, I'm Satish Kumar, a Data Engineer. I specialize in building scalable end-to-end data pipelines and working with big data technologies to drive impactful insights. Passionate about Data Engineering, Machine Learning, and Generative AI, I enjoy solving complex data problems and optimizing data workflows.\n🚀 Currently Exploring:\n\nBuilding real-world Data Engineering systems using Kafka, Spark, Airflow, Postgres, and Docker.\nNLP and Generative AI, focusing on Large Language Models.\nData as a Product and Data Governance best practices.\nHow to architectect better data system\n\n✍️ Blog Posts & Learning\n\nMy Medium Blog\nLinkedIn Articles\nMy Wordpress\n\n📚 Skills & Tech Stack\n\nProgramming Languages: HTML/CSS/JavaScript, Java, Python, SQL, Scala\nDatabases: Oracle DB, SQL Server, MySQL, Postgre

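Here we load only selected sections of the page: `bs4.SoupStrainer` restricts parsing to elements whose class matches, so the navigation boilerplate seen in the previous output is dropped.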
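The effect of `parse_only` can be seen in isolation on a small inline page, without any network request. This is a sketch that assumes `beautifulsoup4` is installed; the HTML string and the class names `post-title` and `post-body` are made up for illustration.

```python
import bs4

# Made-up page: a menu and a footer surround the content we care about.
html = (
    '<html><body>'
    '<nav class="menu">Sign in</nav>'
    '<h1 class="post-title">Hello</h1>'
    '<div class="post-body"><p>Body text.</p></div>'
    '<footer class="site-footer">Footer</footer>'
    '</body></html>'
)

# Keep only elements whose class attribute matches one of these values.
strainer = bs4.SoupStrainer(class_=("post-title", "post-body"))
soup = bs4.BeautifulSoup(html, "html.parser", parse_only=strainer)

# Text from the kept elements only; the menu and footer are never parsed.
text = soup.get_text(separator="\n", strip=True)
```

`WebBaseLoader` forwards `bs_kwargs` to the Beautiful Soup parser, which is why the same `parse_only` argument works in the cell above.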

## Arxiv Loader

In [20]:
from langchain_community.document_loaders import ArxivLoader

arxivloader = ArxivLoader(query="1706.03762")  # Load a research paper by its arXiv ID
arxivloader

<langchain_community.document_loaders.arxiv.ArxivLoader at 0x2164a28f520>

In [22]:
arxivloader.load()

[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntr

## Load Wikipedia

In [26]:
from langchain_community.document_loaders import WikipediaLoader

wikidocument = WikipediaLoader(query="Artificial Intelligence", lang="en").load()
wikidocument

[Document(metadata={'title': 'Artificial intelligence', 'summary': 'Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals. Such machines may be called AIs.\nHigh-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); virtual assistants (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., ChatGPT and AI art); and superhuman play and analysis in strategy games (e.g., chess and Go). However, many AI applications are not perceived as AI: "A lot of cutting edge AI has filtered into general applications, often without being ca

In [27]:
len(wikidocument)

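25

The query returned 25 documents, one per matching Wikipedia page. This matches the loader's default `load_max_docs` limit of 25; pass a smaller `load_max_docs` value to fetch fewer pages.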