## Data Ingestion - Document loaders

https://python.langchain.com/v0.3/docs/integrations/document_loaders/

In [5]:
## Text Loader
## Text Loaders can be used to load data from text files. 

from langchain_community.document_loaders import TextLoader

## Initialize the TextLoader
loader = TextLoader('sampletext.txt')

## Load the file contents into a list of Document objects.
text_documents = loader.load()
text_documents

[Document(metadata={'source': 'sampletext.txt'}, page_content='Agentic AI is a class of artificial intelligence that focuses on autonomous systems that can make decisions and perform tasks with or without human intervention. The independent systems automatically respond to conditions, with procedural, algorithmic, and human-like creative steps, to produce process results. The field is closely linked to agentic automation, also known as agent-based process management systems, when applied to process automation. Applications include software development, customer support, cybersecurity and business intelligence. \n\nThe core concept of agentic AI is the use of AI agents to perform automated tasks with or without human intervention.[1] While robotic process automation (RPA) systems automate rule-based, repetitive tasks with fixed logic, agentic AI adapts and learns from data inputs. [2] Agentic AI refers to autonomous systems capable of pursuing complex goals with minimal human interventi
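
To make the shape of the output above concrete, here is a minimal standard-library sketch of what a text loader conceptually does: it reads a file and wraps the contents in an object with `page_content` and a `metadata` dict that records the `source` path. `SimpleDocument` and `load_text` are hypothetical names used only for illustration; this is not LangChain's API or implementation.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document (not LangChain's class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path, encoding="utf-8"):
    """Read one text file and return it as a single-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Usage: write a throwaway file, then load it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Agentic AI adapts and learns from data inputs.", encoding="utf-8")
    demo_docs = load_text(sample)

print(len(demo_docs), demo_docs[0].metadata["source"])
```

Returning a list even for a single file mirrors the output above, where one text file produces a list containing one `Document`.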

In [15]:
## Reading a PDF file
from langchain_community.document_loaders import PyPDFLoader

## Initialize the PDF Loader
pdf_loader = PyPDFLoader('attention.pdf')

## Load the content of the PDF file as a list of Document objects.
pdf_documents = pdf_loader.load()
pdf_documents

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoogle Brain\nlukaszk
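
The metadata above carries `page` and `total_pages`, so the PDF arrives as page-level documents rather than one long string. A rough sketch of stitching such pages back together in page order follows. `join_pages` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `page_content`/`metadata` shape shown above; nothing here is read from `attention.pdf`.

```python
from types import SimpleNamespace


def join_pages(docs, separator="\n\n"):
    """Concatenate page-level documents in page order into one string."""
    ordered = sorted(docs, key=lambda d: d.metadata.get("page", 0))
    return separator.join(d.page_content for d in ordered)


# Stand-in objects mimicking the page_content/metadata shape shown above.
fake_pages = [
    SimpleNamespace(page_content="second page", metadata={"page": 1}),
    SimpleNamespace(page_content="first page", metadata={"page": 0}),
]
full_text = join_pages(fake_pages)
print(full_text)
```

Sorting on `metadata["page"]` keeps the result stable even if the list was filtered or reordered earlier in a pipeline.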

In [16]:
## Web based loader
from langchain_community.document_loaders import WebBaseLoader
import bs4

web_loader = WebBaseLoader(
    web_paths=("https://en.wikipedia.org/wiki/Artificial_intelligence",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("mw-page-title-main", "mw-body-content")  # Parse only these HTML classes
        )
    )
)

web_docs = web_loader.load()
web_docs[0]



In [11]:
## Arxiv loader
## Arxiv Loader can be used to load research papers from arxiv.org
from langchain_community.document_loaders import ArxivLoader
arxiv_loader = ArxivLoader(
    query="1706.03762",  # arXiv ID of "Attention Is All You Need"
    load_max_docs=2  # Limit to 2 documents for this example
)
docs = arxiv_loader.load()
docs

[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntr

In [12]:
## Wikipedia Loader
from langchain_community.document_loaders import WikipediaLoader
wiki_loader = WikipediaLoader(
    query="Artificial Intelligence",
    load_max_docs=1
)
wiki_docs = wiki_loader.load()
wiki_docs

[Document(metadata={'title': 'Artificial intelligence', 'summary': 'Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.\nHigh-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); virtual assistants (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., language models and AI art); and superhuman play and analysis in strategy games (e.g., chess and Go). However, many AI applications are not perceived as AI: "A lot of cutting edge