# Data Ingestion - DocumentLoaders

In [1]:
## Text Loader

from langchain_community.document_loaders import TextLoader

loader = TextLoader('speech.txt')

loader


<langchain_community.document_loaders.text.TextLoader at 0x21cb6a02f80>

In [2]:
text_documents = loader.load()
text_documents

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness 

In [4]:
## Reading a PDF File

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('./attention.pdf')

docs = loader.load()
docs

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-08-03T00:07:29+00:00', 'author': '', 'keywords': '', 'moddate': '2023-08-03T00:07:29+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': './attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGo
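
The PDF output above carries `page` and `total_pages` in its metadata, which suggests one `Document` per page. A minimal sketch of stitching such per-page documents back into one string per source file follows. The `join_pages` helper is hypothetical, and the two sample pages are made-up stand-ins (built with `SimpleNamespace`) rather than real loader output, so the snippet runs without the PDF.

```python
from collections import defaultdict
from types import SimpleNamespace


def join_pages(docs):
    """Hypothetical helper: concatenate per-page documents into one text per source."""
    pages_by_source = defaultdict(list)
    for doc in docs:
        pages_by_source[doc.metadata.get("source")].append(doc)

    joined = {}
    for source, pages in pages_by_source.items():
        # Sort by the 'page' metadata key so the text comes out in reading order.
        ordered = sorted(pages, key=lambda d: d.metadata.get("page", 0))
        joined[source] = "\n".join(d.page_content for d in ordered)
    return joined


# Stand-in objects with the same two attributes a loaded Document exposes.
pages = [
    SimpleNamespace(page_content="second page", metadata={"source": "./attention.pdf", "page": 1}),
    SimpleNamespace(page_content="first page", metadata={"source": "./attention.pdf", "page": 0}),
]

full_text = join_pages(pages)
print(full_text["./attention.pdf"])
```

With the real loader, `join_pages(docs)` could be called directly on the list returned by `loader.load()`, assuming each element exposes `page_content` and `metadata` as shown in the output above.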

In [6]:
type(docs[0])

langchain_core.documents.base.Document
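
Every loader in this notebook returns a list of `Document` objects, each with `page_content` and `metadata`. As a rough sketch, a small helper can summarize any such list without printing the full text. The `summarize` function and the sample documents are hypothetical, and a stand-in dataclass is used if `langchain_core` is not installed so that the snippet runs on its own.

```python
from dataclasses import dataclass, field

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, only so this sketch runs without LangChain.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def summarize(docs, preview_chars=40):
    """Hypothetical helper: one small dict per document instead of the full repr."""
    rows = []
    for doc in docs:
        rows.append({
            "source": doc.metadata.get("source"),
            "page": doc.metadata.get("page"),  # None for loaders that set no page
            "chars": len(doc.page_content),
            "preview": doc.page_content[:preview_chars].replace("\n", " "),
        })
    return rows


# Illustrative sample values, not real loader output.
sample_docs = [
    Document(page_content="Attention Is All You Need\nAshish Vaswani",
             metadata={"source": "./attention.pdf", "page": 0}),
    Document(page_content="The world must be made safe for democracy.",
             metadata={"source": "speech.txt"}),
]

summary = summarize(sample_docs)
for row in summary:
    print(row)
```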

In [15]:
## Web Based Loader
import bs4
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2024-04-12-diffusion-video/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(class_=("post-title", "post-content", "post-header"))
    ),
)  # web_paths is a tuple, so any number of URLs can be listed; a single URL needs the trailing comma
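
The cell above passes `web_paths` a one-element tuple, which is why the URL is followed by a comma. The plain-Python sketch below shows the difference the comma makes. The `as_url_tuple` helper is hypothetical and not part of LangChain.

```python
# With the trailing comma the parentheses build a tuple;
# without it they are just grouping around a string.
single_url = ("https://lilianweng.github.io/posts/2024-04-12-diffusion-video/",)
not_a_tuple = ("https://lilianweng.github.io/posts/2024-04-12-diffusion-video/")

print(type(single_url).__name__)
print(type(not_a_tuple).__name__)


def as_url_tuple(urls):
    """Hypothetical helper: accept one URL or several and always return a tuple."""
    if isinstance(urls, str):
        return (urls,)
    return tuple(urls)
```

A helper like this could be used to normalize the argument before building the loader, so a single URL and a list of URLs are handled the same way.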

In [16]:
loader.load()

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2024-04-12-diffusion-video/'}, page_content='\n\n      Diffusion Models for Video Generation\n    \nDate: April 12, 2024  |  Estimated Reading Time: 20 min  |  Author: Lilian Weng\n\n\nDiffusion models have demonstrated strong results on image synthesis in past years. Now the research community has started working on a harder task—using it for video generation. The task itself is a superset of the image case, since an image is a video of 1 frame, and it is much more challenging because:\n\nIt has extra requirements on temporal consistency across frames in time, which naturally demands more world knowledge to be encoded into the model.\nIn comparison to text or images, it is more difficult to collect large amounts of high-quality, high-dimensional video data, let along text-video pairs.\n\n\n\n🥑 Required Pre-read: Please make sure you have read the previous blog on “What are Diffusion Models?” for image generatio

In [19]:
# Arxiv
from langchain_community.document_loaders import ArxivLoader

# load_max_docs is an upper bound; this query is a single arXiv ID
docs = ArxivLoader(query="1605.08386", load_max_docs=2).load()
len(docs)


1

In [20]:
docs

[Document(metadata={'Published': '2016-05-26', 'Title': 'Heat-bath random walks with Markov bases', 'Authors': 'Caprice Stanley, Tobias Windisch', 'Summary': 'Graphs on lattice points are studied whose edges come from a finite set of allowed moves of arbitrary length. We show that the diameter of these graphs on fibers of a fixed integer matrix can be bounded from above by a constant. We then study the mixing behaviour of heat-bath random walks on these graphs. We also state explicit conditions on the set of moves so that the heat-bath random walk, a generalization of the Glauber dynamics, is an expander in fixed dimension.'}, page_content='arXiv:1605.08386v1  [math.CO]  26 May 2016\nHEAT-BATH RANDOM WALKS WITH MARKOV BASES\nCAPRICE STANLEY AND TOBIAS WINDISCH\nAbstract. Graphs on lattice points are studied whose edges come from a ﬁnite set of\nallowed moves of arbitrary length. We show that the diameter of these graphs on ﬁbers of a\nﬁxed integer matrix can be bounded from above

In [21]:
from langchain_community.document_loaders import WikipediaLoader

docs = WikipediaLoader(query="Generative AI", load_max_docs=2).load()
print(len(docs))  # a bare len(docs) that is not the last line of a cell is not displayed
print(docs)



In [22]:
docs

 Document(metadata={'title': 'Generative artificial intelligence', 'summary': 'Generative artificial intelligence (Generative AI, or GenAI) is a subfield of artificial intelligence that uses generative models to produce text, images, videos, audio, software code or other forms of data. These models learn the underlying patterns and structures of their training data and use them to produce new data based on the input, which often comes in the form of natural language prompts.\nGenerative AI tools have become more common since the AI boom in the 2020s. This boom was made possible by improvements in deep neural networks, particularly large language models (LLMs), which are based on the transformer architecture. Major tools include LLM-based chatbots such as ChatGPT, Claude, Copilot, DeepSeek, Google Gemini and Grok; text-to-image models such as Stable Diffusion, Midjourney, and DALL-E; and text-to-video models such as Veo and Sora. Technology companies developing generative AI include Ali