# Text Loader

* LangChain's `TextLoader` loads plain-text files so their content can be passed to large language models (LLMs). A quick overview of how it works:

* Basic usage:
`TextLoader` reads a text file and wraps its content in a `Document` object, which holds both the text (`page_content`) and the associated metadata (`metadata`).
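
To make the text-plus-metadata idea concrete, here is a minimal standard-library sketch of what a loader of this kind does. `SimpleDocument` and `load_text` are hypothetical names introduced for illustration, not LangChain's implementation; the sketch only assumes that the loader returns a list with one document whose metadata records the source path.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> list:
    """Read one text file and wrap it in a single-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Demo on a throwaway file so the sketch runs without speech.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Hello, loaders!", encoding="utf-8")
    docs = load_text(str(sample))

print(len(docs), docs[0].page_content, docs[0].metadata)
```

The real `TextLoader` is used the same way in the cells that follow: build the loader from a path, call `load()`, and read `page_content` and `metadata` from each returned item.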

In [None]:
# Load a text file
from langchain_community.document_loaders import TextLoader

loader = TextLoader('speech.txt')
loader

In [None]:
text_document=loader.load()
text_document

In [None]:
# Read a PDF (PyPDFLoader returns one Document per page)
from langchain_community.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader('Pdfdoc.pdf')
pdf_document = pdf_loader.load()
pdf_document

In [None]:
type(pdf_document)

In [None]:
type(pdf_document[0])

In [None]:
pdf_document[0].metadata

In [None]:
# Load a web page, keeping only the post title, header, and content
import bs4
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-title", "post-header", "post-content")
        )
    ),
)
loader.load()


# ArxivLoader
arXiv is an open-access archive of about 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

This notebook shows how to load scientific articles from arXiv.org into the Document format that is used downstream.

In [None]:
from langchain_community.document_loaders import ArxivLoader

# Query by arXiv identifier; load_max_docs caps the number of documents returned
docs = ArxivLoader(query="2111.07139", load_max_docs=4).load()
len(docs)

# Accessing Wikipedia
Wikipedia is a multilingual free online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system called MediaWiki. Wikipedia is the largest and most-read reference work in history.
* Requires the `wikipedia` package: `pip install wikipedia`

In [None]:
from langchain_community.document_loaders import WikipediaLoader

# Search Wikipedia and load the matching pages as Documents
loader = WikipediaLoader(query="Machine Learning")
documents = loader.load()
documents

In [None]:
# Inspect the metadata of the first Wikipedia result
documents[0].metadata

In [None]:
documents[1].metadata['source']

In [None]:
documents[1].metadata['summary']

In [2]:
from langchain_community.document_loaders import WikipediaLoader
loader = WikipediaLoader(query="Narendra Modi")
documents = loader.load()
documents





[Document(metadata={'title': 'Narendra Modi', 'summary': "Narendra Damodardas Modi (born 17 September 1950) is an Indian politician serving as the current prime minister of India since 26 May 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the Member of Parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation. He is the longest-serving prime minister outside the Indian National Congress.\nModi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education. He was introduced to the RSS at the age of eight. At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so. Modi became a full-time worker for the RSS in Gujarat in 1971. The RSS assigned him to the BJP in 1985 and he rose through the 

In [4]:
documents[0].metadata

{'title': 'Narendra Modi',
 'summary': "Narendra Damodardas Modi (born 17 September 1950) is an Indian politician serving as the current prime minister of India since 26 May 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the Member of Parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation. He is the longest-serving prime minister outside the Indian National Congress.\nModi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education. He was introduced to the RSS at the age of eight. At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so. Modi became a full-time worker for the RSS in Gujarat in 1971. The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, b

In [6]:
print(documents[0].page_content)

Narendra Damodardas Modi (born 17 September 1950) is an Indian politician serving as the current prime minister of India since 26 May 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the Member of Parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation. He is the longest-serving prime minister outside the Indian National Congress.
Modi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education. He was introduced to the RSS at the age of eight. At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so. Modi became a full-time worker for the RSS in Gujarat in 1971. The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, becoming general secretary in 1998. In 200

# Question answering with a Transformer model

In [7]:
# Use a Hugging Face Transformers pipeline as the question-answering model
from transformers import pipeline

In [8]:
# Use a local transformer model for question answering
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
qa_pipeline

<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x2a0cf688760>

In [12]:
# Define a question and use the first Wikipedia document as the context
question = "Who is Narendra Modi?"
context = documents[0].page_content
context

"Narendra Damodardas Modi (born 17 September 1950) is an Indian politician serving as the current prime minister of India since 26 May 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the Member of Parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation. He is the longest-serving prime minister outside the Indian National Congress.\nModi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education. He was introduced to the RSS at the age of eight. At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so. Modi became a full-time worker for the RSS in Gujarat in 1971. The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, becoming general secretary in 1998. In 2

In [13]:
# Get the answer
answer = qa_pipeline(question=question, context=context)
answer

{'score': 0.23292073607444763,
 'start': 53,
 'end': 73,
 'answer': 'an Indian politician'}
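
The `start` and `end` values in this result can be read as character offsets into `context`, so slicing the context with them should reproduce the `answer` string. The short check below illustrates this using the output above and only the first sentence of the context (an abbreviation made here so the snippet runs without the pipeline).

```python
# Result copied from the cell output above.
answer = {
    "score": 0.23292073607444763,
    "start": 53,
    "end": 73,
    "answer": "an Indian politician",
}

# First sentence of the context shown earlier; enough to cover the answer span.
context = (
    "Narendra Damodardas Modi (born 17 September 1950) is an Indian politician "
    "serving as the current prime minister of India since 26 May 2014."
)

# Slice the context with the reported character offsets.
span = context[answer["start"]:answer["end"]]
print(span)
```

This is extractive question answering: the model selects a span of the supplied context rather than generating new text, and `score` is its confidence in that span.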

In [16]:
print(f"Question: {question}")
print(f"Answer: {answer['answer']}")

Question: Who is Narendra Modi?
Answer: an Indian politician


In [17]:
# Ask a second question against the same context
question = "when did Narendra Modi Born?"
context = documents[0].page_content
# Get the answer
answer = qa_pipeline(question=question, context=context)
print(f"Question: {question}")
print(f"Answer: {answer['answer']}")

Question: when did Narendra Modi Born?
Answer: 17 September 1950


In [19]:
documents[1].page_content

"The Narendra Modi Stadium is an international cricket stadium located in Ahmedabad, Gujarat, India. With a seating capacity of 132,000, it is the largest stadium in the world.  The stadium, owned by the Gujarat Cricket Association, hosts both domestic and international cricket matches. It is recognized as one of the world's premier cricket venues. The stadium has hosted several high-profile matches such as the 2023 World Cup final. It is named after Prime Minister Narendra Modi.\nThe land which holds Narendra Modi Stadium was formerly occupied by another venue, Sardar Patel Stadium. It served as the arena for domestic and international cricket in the city until its demolition in 2015, hosting matches during the 1987, 1996, and 2011 ICC World Cups. In 2014, it was decided that a new stadium should be built on the same plot. The new stadium, originally named Motera Stadium, was designed by Australia's design firm Populous and built by Larsen and Toubro. It took 5 years to build, at an e

In [22]:
from langchain_community.document_loaders import WikipediaLoader
from transformers import pipeline

# Load documents using WikipediaLoader
loader = WikipediaLoader(query="Narendra Modi")
documents = loader.load()

# Define the QA pipeline with a specific model
qa_pipeline = pipeline("question-answering", model="distilbert/distilbert-base-cased-distilled-squad")

# Concatenate all documents' content into one string
full_context = " ".join([doc.page_content for doc in documents])

# Ask the question against the combined text of all documents
question = "Who is Narendra Modi?"
context = full_context

# Get the answer
answer = qa_pipeline(question=question, context=context)
print(f"Question: {question}")
print(f"Answer: {answer['answer']}")


Question: Who is Narendra Modi?
Answer: Prime Minister
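
With the combined context the answer differs from the single-document run above ("Prime Minister" instead of "an Indian politician"). One alternative to concatenating everything is to query each document separately and keep the highest-scoring result. The sketch below is one possible way to do that, not the notebook's method: `best_answer` is a hypothetical helper, and `fake_qa` is a toy stand-in for `qa_pipeline` so the example runs without downloading a model.

```python
from typing import Callable, Iterable


def best_answer(qa: Callable[..., dict], question: str, contexts: Iterable[str]) -> dict:
    """Run qa on each non-empty context and return the highest-scoring result."""
    results = [qa(question=question, context=c) for c in contexts if c.strip()]
    if not results:
        raise ValueError("no non-empty contexts to search")
    return max(results, key=lambda r: r["score"])


def fake_qa(question: str, context: str) -> dict:
    """Toy stand-in for qa_pipeline (assumption for the demo): longer context, higher score."""
    return {
        "score": len(context) / 100,
        "start": 0,
        "end": len(context),
        "answer": context,
    }


result = best_answer(fake_qa, "Who?", ["short", "", "a longer context"])
print(result["answer"])
```

With the real pipeline, the call would look like `best_answer(qa_pipeline, question, [doc.page_content for doc in documents])`, since `qa_pipeline` accepts the same `question` and `context` keyword arguments used in the cells above.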
