# Text Loader

* LangChain’s TextLoader loads plain-text files and prepares their content for use with large language models (LLMs). Here’s a quick overview of how it works:

* Basic usage:
The TextLoader reads a text file and wraps its content in a Document object, which holds the text (page_content) together with associated metadata such as the source path. Calling load() returns a list of these Document objects.
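To make the idea concrete, here is a minimal standard-library sketch of what a text loader conceptually does. It is not LangChain's implementation: `SimpleDocument` and `load_text` are hypothetical stand-ins introduced only for illustration, and the sample file is created on the fly.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path, encoding="utf-8"):
    """Read one text file and return it as a single-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Usage: write a small sample file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Théâtre D'opéra Spatial", encoding="utf-8")
    docs = load_text(sample)

print(len(docs), docs[0].metadata["source"])
```

Passing the encoding explicitly is a deliberate choice in this sketch: when the platform default differs from the file's encoding, accented characters can come back garbled.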

In [1]:
# Load a text file; pass the encoding explicitly so accented characters decode correctly
from langchain_community.document_loaders import TextLoader
loader = TextLoader('speech.txt', encoding='utf-8')
loader

<langchain_community.document_loaders.text.TextLoader at 0x220b82d6470>

In [2]:
text_document = loader.load()
text_document

[Document(metadata={'source': 'speech.txt'}, page_content='Not to be confused with Artificial general intelligence. This page focuses on statistical machine learning AI. For other topics, see Algorithmic composition, Algorithm art, Generative art, Procedural generation.\nImpressionistic image of figures in a futuristic opera scene\nThéâtre D\'opéra Spatial, an image made using generative artificial intelligence\nPart of a series on\nArtificial intelligence\n\nMajor goals\nApproaches\nApplications\nPhilosophy\nHistory\nGlossary\nvte\nGenerative artificial intelligence (generative AI, GenAI,[1] or GAI) is artificial intelligence capable of generating text, images, videos, or other data using generative models,[2] often in response to prompts.[3][4] Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.[5][6]\n\nImprovements in transformer-based deep neural networks, particularly large language mod

In [6]:
# Read a PDF; the loader returns a list of Document objects
from langchain_community.document_loaders import PyPDFLoader
pdf_loader = PyPDFLoader('Pdfdoc.pdf')
pdf_document = pdf_loader.load()
pdf_document

[Document(metadata={'source': 'Pdfdoc.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbas

In [7]:
type(pdf_document)

list

In [10]:
type(pdf_document[0])

langchain_core.documents.base.Document

In [19]:
pdf_document[0].metadata

{'source': 'Pdfdoc.pdf', 'page': 0}
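The outputs above show that the PDF loader returns a list of Document objects, each carrying a zero-based page number in its metadata. The sketch below shows one way to work with such a list. It uses `SimpleNamespace` objects as hypothetical stand-ins shaped like the loader's output, so it runs without a PDF; `page_text` and `full_text` are illustrative helpers, not part of LangChain, and the sample page text is made up.

```python
from types import SimpleNamespace

# Hypothetical stand-ins shaped like the loader's output: one item per page.
pages = [
    SimpleNamespace(page_content="Attention Is All You Need",
                    metadata={"source": "Pdfdoc.pdf", "page": 0}),
    SimpleNamespace(page_content="1 Introduction",
                    metadata={"source": "Pdfdoc.pdf", "page": 1}),
]


def page_text(docs, page_number):
    """Return the text of the document whose 'page' metadata matches, or None."""
    for doc in docs:
        if doc.metadata.get("page") == page_number:
            return doc.page_content
    return None


def full_text(docs):
    """Join all pages, in page order, into a single string."""
    ordered = sorted(docs, key=lambda d: d.metadata.get("page", 0))
    return "\n\n".join(d.page_content for d in ordered)


print(page_text(pages, 1))
print(full_text(pages))
```

The same two helpers should work on the real `pdf_document` list, since they only rely on the `page_content` and `metadata` attributes seen in the outputs above.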

In [1]:
# Web-based document: keep only the post title, header, and content elements
import bs4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader(web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
                       bs_kwargs=dict(parse_only=bs4.SoupStrainer(
                           class_=("post-title","post-header","post-content")
                       )))
loader.load()


USER_AGENT environment variable not set, consider setting it to identify your requests.


[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistake
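The USER_AGENT warning printed above can be addressed by setting the environment variable before the loader is imported. This is a minimal sketch, assuming the warning is triggered when the loader module is imported; the value `my-notebook/0.1` is a hypothetical placeholder, so substitute a string that identifies your own application.

```python
import os

# Set USER_AGENT before importing WebBaseLoader.
# "my-notebook/0.1" is a hypothetical placeholder value; setdefault keeps any
# value that is already present in the environment.
os.environ.setdefault("USER_AGENT", "my-notebook/0.1")

print(os.environ["USER_AGENT"])
```

In a notebook, run this in a cell before the one that imports WebBaseLoader, and restart the kernel if the loader was already imported.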

# ArxivLoader
arXiv is an open-access archive of about 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

This notebook shows how to load scientific articles from arXiv.org into the Document format that is used downstream. The query below is an arXiv identifier, so a single document is returned even though load_max_docs allows up to 4.

In [3]:
from langchain_community.document_loaders import ArxivLoader
docs=ArxivLoader(query="2111.07139",load_max_docs=4).load()
len(docs)

1

# Accessing Wikipedia
Wikipedia is a multilingual free online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system called MediaWiki. Wikipedia is the largest and most-read reference work in history.
* Requires the wikipedia package: pip install wikipedia

In [4]:
from langchain_community.document_loaders import WikipediaLoader
loader = WikipediaLoader(query="Machine Learning")
documents = loader.load()
documents

[Document(metadata={'title': 'Machine learning', 'summary': 'Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Quick progress in the field of deep learning, beginning in 2010s, allowed neural networks to surpass many previous approaches in performance.\nML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. The application of ML to business problems is known as predictive analytics.\nStatistics and mathematical optimization (mathematical programming) methods comprise the foundations of machine learning. Data mining is a related field of study, focusing on exploratory data analysis (EDA) via unsupervised learning. \nFrom a theoretical viewpoint, probably approximately correct (PAC) learning pro

In [16]:
# Inspect the metadata of the first Wikipedia result
documents[0].metadata

{'title': 'Machine learning',
 'summary': 'Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Quick progress in the field of deep learning, beginning in 2010s, allowed neural networks to surpass many previous approaches in performance.\nML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. The application of ML to business problems is known as predictive analytics.\nStatistics and mathematical optimization (mathematical programming) methods comprise the foundations of machine learning. Data mining is a related field of study, focusing on exploratory data analysis (EDA) via unsupervised learning. \nFrom a theoretical viewpoint, probably approximately correct (PAC) learning provides a framework 

In [17]:
documents[1].metadata['source']

'https://en.wikipedia.org/wiki/Attention_(machine_learning)'

In [18]:
documents[1].metadata['summary']

'Attention is a machine learning method that determines the relative importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size.\nUnlike "hard" weights, which are computed during the backwards training pass, "soft" weights exist only in the forward pass and therefore change with every step of the input. Earlier designs implemented the attention mechanism in a serial recurrent neural network language translation system, but the later transformer design removed the slower sequential RNN and relied more heavily on the faster parallel attention scheme.\nInspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of leveraging information from the h