# Text, PDF and Web Loaders

## Text Loader

In [1]:
# TEXT LOADER

from langchain_community.document_loaders import TextLoader

loader = TextLoader('speech2.txt', encoding='utf-8')

text_doc = loader.load()

text_doc

[Document(page_content='I am happy to join with you today in what will go down in history as the greatest\ndemonstration for freedom in the history of our nation.\nFive score years ago, a great American, in whose symbolic shadow we stand today, signed the\nEmancipation Proclamation. This momentous decree came as a great beacon light of hope to millions of\nNegro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to\nend the long night of their captivity.\n\nBut 100 years later, the Negro still is not free. There are those who are asking the devotees of Civil\nRights: “When will you be satisfied?” We can never be satisfied as long as the Negro is the victim of the\nunspeakable horrors of police brutality. We can never be satisfied as long as our children are stripped of\ntheir selfhood and robbed of their dignity by signs stating “For whites only.” No, no we are not satisfied,\nand we will not be satisfied until “justice rolls down li
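The output above is a list holding a single document whose `page_content` is the whole file. As a rough sketch of that behaviour using only the standard library: `SimpleDocument` and `load_text` are hypothetical names, not LangChain classes, and the `source` metadata key is assumed to mirror what the loader records.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path, encoding="utf-8"):
    """Read one text file and wrap it in a single-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Use a throwaway file so the sketch does not depend on speech2.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("I am happy to join with you today.", encoding="utf-8")
    docs = load_text(sample_path)

print(len(docs), docs[0].page_content)
```

Passing the encoding explicitly, as the cell above does with `encoding='utf-8'`, avoids depending on the platform default when the file contains curly quotes or other non-ASCII characters.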

## Data Loader From Web

In [2]:
from langchain_community.document_loaders import WebBaseLoader
import bs4

loader = WebBaseLoader(
    web_paths=('https://en.wikipedia.org/wiki/LangChain',),
    # Parse only elements with these CSS classes instead of the whole page.
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(
        class_=("mw-body-content", "mw-first-heading")
    )),
)




In [16]:
webcontent = loader.load()
webcontent

[Document(page_content='Language model application development framework\nLangChainDeveloper(s)Harrison ChaseInitial releaseOctober 2022Stable release0.1.16[1]\n   / 11 April 2024; 39 days ago\xa0(11 April 2024)\nRepositorygithub.com/langchain-ai/langchainWritten inPython and JavaScriptTypeSoftware framework for large language model application developmentLicenseMIT LicenseWebsiteLangChain.com\nLangChain is a framework designed to simplify the creation of applications using large language models (LLMs). As a language model integration framework, LangChain\'s use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.[2]\n\n\nHistory[edit]\nLangChain was launched in October 2022 as an open source project by Harrison Chase, while working at machine learning startup Robust Intelligence. The project quickly garnered popularity,[3] with improvements from hundreds of contributors on GitHub, trending discussio

## PDF Loader

In [2]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('attention.pdf')


In [3]:
pdf = loader.load()
pdf

[Document(page_content='Attention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗†\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to\nbe superior in quality while being more paralle

# Document Transform


In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunks of at most 1000 characters; neighbouring chunks share up to 200 characters.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

doc = text_splitter.split_documents(pdf)
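To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It is a plain sliding window, not the recursive separator-based strategy that `RecursiveCharacterTextSplitter` applies, and `split_with_overlap` is a hypothetical helper written only for this illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the one before it. The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk.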

## Vector Store (FAISS and Chroma DB)

In [5]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings

# Embed and index only the first 50 chunks.
db1 = FAISS.from_documents(doc[0:50], OllamaEmbeddings(model='mxbai-embed-large'))


In [8]:
query = '''who are the authors of the research paper'''

response = db1.similarity_search(query)

response

[Document(page_content='tensorflow/tensor2tensor .\nAcknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful\ncomments, corrections and inspiration.\n9', metadata={'source': 'attention.pdf', 'page': 8}),
 Document(page_content='convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer,\nthe approach we take in our model.\nAs side beneﬁt, self-attention could yield more interpretable models. We inspect attention distributions\nfrom our models and present and discuss examples in the appendix. Not only do individual attention\nheads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic\nand semantic structure of the sentences.\n5 Training\nThis section describes the training regime for our models.\n5.1 Training Data and Batching\nWe trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million\nsentence pairs. Sentences were encoded using by
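Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below does this by brute force, assuming Euclidean (L2) distance as the metric. The two-dimensional vectors, the chunk labels, and the helper names are all made up for illustration, and a real FAISS index uses far more efficient data structures.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_search(query_vector, index, k=4):
    """Return the k texts whose vectors are closest to query_vector."""
    ranked = sorted(index, key=lambda item: l2_distance(query_vector, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical 2-D "embeddings"; a real embedding model produces
# vectors with many more dimensions.
toy_index = [
    ("authors and affiliations", (0.9, 0.1)),
    ("training data and batching", (0.1, 0.9)),
    ("acknowledgements", (0.7, 0.3)),
]
top = similarity_search((1.0, 0.0), toy_index, k=2)
print(top)
```

The ranking depends entirely on where the embedding model places each chunk, which is why a query can return a chunk that is close in vector space but does not answer the question.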

# Document and Retrieval Chains

In [9]:
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate

llm = Ollama(model='phi3')

prompt = ChatPromptTemplate.from_template(
    """
Answer the following question based only on the provided context.
Think Step by Step before providing a detailed answer.
I will tip you $1000 if the user finds the answer helpful.
<context>
{context}
</context>
question: {input}
""")



## Document Chain

In [10]:
from langchain.chains.combine_documents import create_stuff_documents_chain

doc_chain = create_stuff_documents_chain(llm, prompt)
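"Stuffing" means that every document handed to the chain is concatenated into the `{context}` slot of the prompt before one model call. The sketch below shows only that formatting step in plain Python. The shortened template, the `stuff_documents` helper, and the blank-line separator are assumptions for illustration, and no LLM is called.

```python
PROMPT_TEMPLATE = """Answer the following question based only on the provided context.
<context>
{context}
</context>
question: {input}
"""


def stuff_documents(page_contents, question, separator="\n\n"):
    """Join every chunk into one context string and fill the template."""
    context = separator.join(page_contents)
    return PROMPT_TEMPLATE.format(context=context, input=question)


stuffed_prompt = stuff_documents(
    ["Chunk one text.", "Chunk two text."],
    "What do the chunks say?",
)
print(stuffed_prompt)
```

Because all chunks go into a single prompt, this approach only works while the combined text fits in the model's context window.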

In [11]:
retriever = db1.as_retriever()
retriever

VectorStoreRetriever(tags=['FAISS', 'OllamaEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000002E0BEBE2BD0>)

## Retrieval Chain

In [12]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(retriever, doc_chain)

In [14]:
res = retrieval_chain.invoke({'input': 'An attention function can be described as mapping query'})
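Invoking the retrieval chain runs two steps in sequence: the retriever fetches chunks for the `input` question, then the document chain answers from those chunks. The sketch below imitates that flow with stand-ins, using keyword matching instead of a vector store and a string template instead of an LLM. All function names are hypothetical, and the result keys `input`, `context`, and `answer` are assumed to match what the real chain returns.

```python
def make_retrieval_chain(retrieve, answer_from_context):
    """Compose a retriever and an answering step into one callable."""
    def invoke(inputs):
        context = retrieve(inputs["input"])
        answer = answer_from_context(context, inputs["input"])
        return {"input": inputs["input"], "context": context, "answer": answer}
    return invoke


# Hypothetical stand-in corpus for the indexed PDF chunks.
CHUNKS = [
    "An attention function maps a query and key-value pairs to an output.",
    "We trained on the WMT 2014 English-German dataset.",
]


def toy_retrieve(question):
    """Keep chunks that share at least one word with the question."""
    words = set(question.lower().split())
    return [c for c in CHUNKS if words & set(c.lower().split())]


def toy_answer(context, question):
    """Stand-in for the LLM call; reports how many chunks were retrieved."""
    return f"Found {len(context)} relevant chunk(s) for: {question}"


chain = make_retrieval_chain(toy_retrieve, toy_answer)
result = chain({"input": "attention function"})
print(result["answer"])
```

Keeping the retrieved chunks in the result alongside the answer makes it possible to check which passages the answer was based on.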

In [15]:
res['answer']

"An attention function in the context of TensorFlow's tensor2tensor model and machine translation can be described as a mechanism that maps queries (representations of input sequences) to values, allowing the model to focus on different parts of the input sequence when predicting each part of the output sequence. Based on the provided context, an attention function is characterized by its ability to jointly attend over different representation subspaces at various positions within a sentence or sequence, which can be visualized and understood better through self-attention layers.\n\nThe concept of multi-head attention further expands this mechanism by employing multiple independent attention heads that allow the model to capture diverse aspects of information from an input sequence simultaneously. For example, each head may learn to focus on different syntactic or semantic structures within a sentence, and when combined linearly projected versions (with dimensions dk, dv, and dw respec