## Data Ingestion - Document Loaders

In [13]:
# Documentation Link - https://python.langchain.com/v0.2/docs/integrations/document_loaders/

In [14]:
## Text Loader
from langchain_community.document_loaders import TextLoader


# Initialize the text loader with the file path
loader = TextLoader('speech.txt')
print(f"Loader : {loader}")


# Load the file into a list of Document objects
text_documents = loader.load()
print(f"After Converting Into Documents: {text_documents}")
print(f"Type Of Document: {type(text_documents)}")

Loader : <langchain_community.document_loaders.text.TextLoader object at 0x000001D06A4FBDF0>
After Converting Into Documents: [Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fightin
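
The output above shows a list holding one document with `page_content` and a `source` entry in `metadata`. As a rough illustration of that shape only, here is a standard-library sketch. `SimpleDocument` and `load_text_file` are hypothetical names made up for this example and are not part of LangChain; the real `TextLoader` may handle encodings and errors differently.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document (not the LangChain class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> list:
    """Read one text file and wrap it in a single-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Use a throwaway file so the sketch runs without speech.txt being present.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("The world must be made safe for democracy.", encoding="utf-8")
    sample_docs = load_text_file(str(sample_path))

print(type(sample_docs).__name__, len(sample_docs))
print(sample_docs[0].page_content)
```

The point of the sketch is that a loader returns a list even for a single file, which is why `type(...)` in the cell above reports a list rather than a single document.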

In [15]:
## PDF Loader
from langchain_community.document_loaders import PyPDFLoader


# Initialize the PDF loader with the file path
loader = PyPDFLoader('attention.pdf')
print(f"Loader : {loader}")


# Load the PDF into a list of Document objects
text_documents = loader.load()
print(f"After Converting Into Documents: {text_documents}")
print(f"Type Of Document: {type(text_documents)}")

Loader : <langchain_community.document_loaders.pdf.PyPDFLoader object at 0x000001D069846C50>
After Converting Into Documents: [Document(metadata={'source': 'attention.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also con
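
The metadata in the output above carries a `page` key, which suggests the loader yields one entry per PDF page. A minimal sketch of putting such per-page entries back together follows. It uses plain dictionaries that mimic the shape shown above; `summarize_pages` and the sample records are hypothetical and do not come from LangChain or from `attention.pdf`.

```python
def summarize_pages(pages):
    """Return (page_count, combined_text) for a list of per-page records."""
    # Sort by page number so the combined text follows the document order.
    ordered = sorted(pages, key=lambda p: p["metadata"]["page"])
    combined = "\n".join(p["page_content"] for p in ordered)
    return len(ordered), combined


# Hypothetical records, deliberately out of order to show the sorting step.
sample_pages = [
    {"metadata": {"source": "example.pdf", "page": 1}, "page_content": "second page"},
    {"metadata": {"source": "example.pdf", "page": 0}, "page_content": "first page"},
]

page_count, full_text = summarize_pages(sample_pages)
print(page_count)
print(full_text)
```

With real loader output, the same idea would read `doc.metadata["page"]` and `doc.page_content` from each document instead of dictionary keys.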

In [24]:
## Web Based Loader
from langchain_community.document_loaders import WebBaseLoader
import bs4


# Initialize the web-based loader with the page URL
loader = WebBaseLoader(web_path='https://jalammar.github.io/')
print(f"Loader : {loader}")


# Load the page into a list of Document objects
text_documents = loader.load()
print(f"After Converting Into Documents: {text_documents}")
print(f"Type Of Document: {type(text_documents)}")


# We can also keep only selected parts of the page by filtering on CSS class names
loader2 = WebBaseLoader(web_path=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
                        bs_kwargs=dict(parse_only=bs4.SoupStrainer(
                            class_=("post-title","post-content","post-header")
                            ))
                        )

doc = loader2.load()
print(f"Filtered Doc: {doc}")

Loader : <langchain_community.document_loaders.web_base.WebBaseLoader object at 0x000001D0697B52D0>
After Converting Into Documents: [Document(metadata={'source': 'https://jalammar.github.io/', 'title': 'Jay Alammar – Visualizing machine learning one concept at a time.', 'description': 'Visualizing machine learning one concept at a time.', 'language': 'No language found.'}, page_content='\n\n\nJay Alammar – Visualizing machine learning one concept at a time.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJay Alammar\nVisualizing machine learning one concept at a time.@JayAlammar on Twitter. YouTube Channel\n\n\nBlog\nAbout\n\n\n\n\n\n\n\nGenerative AI and AI Product Moats\n\n\n\n\n\nHere are eight observations I’ve shared recently on the Cohere blog and videos that go over them.:\nArticle: What’s the big deal with Generative AI? Is it the future or the present?\n\nArticle: AI is Eating The World\n\n\nRead More\n\n\nRemaking Old Computer Graphics With AI Image Ge
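
The `page_content` above contains long runs of `\n` left over from the page layout. One possible cleanup step, sketched here with only the standard library, collapses those runs before the text is used further. `collapse_blank_lines` is a hypothetical helper, not a LangChain function, and the sample string is a shortened excerpt in the style of the output above.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Collapse runs of blank lines to one blank line and trim both ends."""
    # A newline, optional whitespace, then one or more newlines becomes "\n\n".
    return re.sub(r"\n\s*\n+", "\n\n", text).strip()


raw = (
    "\n\n\nJay Alammar\n"
    "Visualizing machine learning one concept at a time."
    "\n\n\n\n\nBlog\nAbout\n\n\n"
)

cleaned = collapse_blank_lines(raw)
print(repr(cleaned))
```

Single line breaks are left alone, so lines that belong together (such as `Blog` and `About`) stay adjacent.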

In [25]:
# arXiv Loader
from langchain_community.document_loaders import ArxivLoader

docs = ArxivLoader(query="reasoning",
                   load_max_docs=2,
                   doc_content_chars_max=1000).load()
print(docs)

[Document(metadata={'Published': '2024-05-09', 'Title': 'Hypothesis Testing Prompting Improves Deductive Reasoning in Large Language Models', 'Authors': 'Yitian Li, Jidong Tian, Hao He, Yaohui Jin', 'Summary': 'Combining different forms of prompts with pre-trained large language models\nhas yielded remarkable results on reasoning tasks (e.g. Chain-of-Thought\nprompting). However, along with testing on more complex reasoning, these\nmethods also expose problems such as invalid reasoning and fictional reasoning\npaths. In this paper, we develop \\textit{Hypothesis Testing Prompting}, which\nadds conclusion assumptions, backward reasoning, and fact verification during\nintermediate reasoning steps. \\textit{Hypothesis Testing prompting} involves\nmultiple assumptions and reverses validation of conclusions leading to its\nunique correct answer. Experiments on two challenging deductive reasoning\ndatasets ProofWriter and RuleTaker show that hypothesis testing prompting not\nonly significant
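
The metadata printed above includes `Published`, `Title`, and `Authors` keys. As a small sketch of reading those fields, the helper below turns one metadata dictionary into a one-line entry. `format_entry` is a hypothetical name, and the fallback strings are assumptions for cases where a key is missing.

```python
def format_entry(metadata: dict) -> str:
    """Build a short 'Authors (date). Title' line from a metadata dictionary."""
    authors = metadata.get("Authors", "Unknown authors")
    title = metadata.get("Title", "Untitled")
    published = metadata.get("Published", "n.d.")
    return f"{authors} ({published}). {title}"


# Values copied from the metadata shown in the output above.
sample_metadata = {
    "Published": "2024-05-09",
    "Title": "Hypothesis Testing Prompting Improves Deductive Reasoning in Large Language Models",
    "Authors": "Yitian Li, Jidong Tian, Hao He, Yaohui Jin",
}

entry = format_entry(sample_metadata)
print(entry)
```

With real results, the same call could be applied to `doc.metadata` for each document in `docs`.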

In [28]:
# Wikipedia Loader
from langchain_community.document_loaders import WikipediaLoader

docs = WikipediaLoader(query='Magnus Carlsen',
                       load_max_docs=2,
                       lang='en',
                       doc_content_chars_max=1000).load()

print(docs)

[Document(metadata={'title': 'Magnus Carlsen', 'summary': 'Sven Magnus Øen Carlsen (born 30 November 1990) is a Norwegian chess grandmaster. Carlsen is a five-time World Chess Champion, the reigning five-time World Rapid Chess Champion, the reigning seven-time World Blitz Chess Champion, and the reigning Chess World Cup Champion. He has held the No.\u202f1 position in the FIDE world chess rankings since 1 July 2011 and trails only Garry Kasparov in time spent as the highest-rated player in the world. His peak rating of 2882 is the highest in history. He also holds the record for the longest unbeaten streak at an elite level in classical chess at 125 games.\nA chess prodigy, Carlsen finished first in the C group of the Corus chess tournament shortly after he turned 13 and earned the title of grandmaster a few months later. At 15, he won the Norwegian Chess Championship, and later became the youngest ever player to qualify for the Candidates Tournament in 2005. At 17, he finished joint f
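
Both cells above pass `doc_content_chars_max=1000`, which, going by its name, caps the length of each document's content. The sketch below shows that kind of character-based cap in plain Python. `truncate_content` is a hypothetical helper, and treating `None` as "no limit" is an assumption made for this example rather than documented loader behavior.

```python
from typing import Optional


def truncate_content(text: str, max_chars: Optional[int]) -> str:
    """Return at most max_chars characters of text; None means no limit."""
    if max_chars is None:
        return text
    return text[:max_chars]


sample_text = "Sven Magnus Øen Carlsen (born 30 November 1990) is a Norwegian chess grandmaster."

short_text = truncate_content(sample_text, 11)
print(short_text)
print(len(truncate_content(sample_text, None)) == len(sample_text))
```

A cap like this cuts on character count, so the result can end mid-word or mid-sentence, which matters if the text is later split into chunks.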