## PDF Query Using LangChain

In [40]:
# !pip install langchain
# !pip install openai
# !pip install PyPDF2
# !pip install faiss-cpu
# !pip install tiktoken

In [19]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

In [37]:
import os
# os.environ["OPENAI_API_KEY"] = ""

In [21]:
# Provide the path to the PDF file.
pdfreader = PdfReader('Infosys_Result.pdf')

In [22]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

In [23]:
raw_text

'Index Page No.\nCondensed Consolidated Balance Sheet……………………………………………………………………………….. 1\nCondensed Consolidated Statement of Comprehensive Income……………………………………………………….. 2\nCondensed Consolidated Statement of Changes in Equity ……………………………………..…………………………………….. 3\nCondensed Consolidated Statement of Cash Flows………………………………………………………………………. 5\nOverview and Notes to the Interim Condensed Consolidated Financial Statements\n1. Overview\n1.1 Company overview …………………………………………………….……………………………………………………. 6\n1.2 Basis of preparation of financial statements …………………………………………………….……………………………………………………. 6\n1.3 Basis of consolidation……………………………………………………………………………… 6\n1.4 Use of estimates and judgments…………………………………………………………………. 6\n1.5 Critical accounting estimates and judgments…………………………………………………… 6\n1.6 Recent accounting pronouncements…………………………………………………………….. 7\n2. Notes to the Interim Condensed Consolidated Financial Statements \n2.1 Cash and cash equivalents ……………………………………………………………………….. 8\n2.2 Earmarked

In [24]:
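# Split the text into overlapping chunks with CharacterTextSplitter so that
# each chunk stays small enough for the model's token limit.
# chunk_size and chunk_overlap are measured in characters (length_function=len).
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)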
texts = text_splitter.split_text(raw_text)

In [25]:
len(texts)

278
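To make the effect of `chunk_size` and `chunk_overlap` more concrete, here is a minimal pure-Python sketch of separator-based chunking with overlap. It is a simplified illustration, not LangChain's actual implementation; the function name `split_with_overlap` and the toy input are hypothetical.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters, carrying trailing pieces over as overlap.
    Simplified sketch; a single piece longer than chunk_size is kept whole."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits the overlap budget.
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy input (hypothetical): 50 short lines of 8 characters each.
toy_text = "\n".join(f"line {i:03d}" for i in range(50))
toy_chunks = split_with_overlap(toy_text, chunk_size=40, chunk_overlap=16)
print(len(toy_chunks), repr(toy_chunks[0]), repr(toy_chunks[1]))
```

With these toy settings each chunk begins with the last line of the previous chunk, which is the point of the overlap: a sentence that straddles a chunk boundary still appears intact in at least one chunk.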

In [26]:
# !pip install -U langchain-openai

In [27]:
# Create the OpenAI embeddings client (reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()

In [28]:
document_search = FAISS.from_texts(texts, embeddings)

In [29]:
document_search


<langchain_community.vectorstores.faiss.FAISS at 0x7ae355fd8e20>
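Conceptually, the vector store holds one embedding vector per chunk and answers a query by returning the chunks whose vectors lie closest to the query's vector. The sketch below shows that idea with a brute-force search in plain Python. The three-dimensional vectors are made up for illustration (real OpenAI embeddings have many more dimensions), Euclidean distance is an assumed choice of metric, and FAISS itself uses optimized index structures rather than this loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(query_vec, vectors, texts, k=4):
    """Return the k texts whose vectors are nearest to query_vec."""
    ranked = sorted(zip(texts, vectors), key=lambda tv: l2_distance(query_vec, tv[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical toy data: three chunks with hand-picked "embeddings".
toy_texts = ["dividend declared", "cash and cash equivalents", "attention mechanism"]
toy_vectors = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.1], [0.0, 0.1, 0.9]]

print(brute_force_search([1.0, 0.0, 0.0], toy_vectors, toy_texts, k=2))
# → ['dividend declared', 'cash and cash equivalents']
```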

In [32]:
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import OpenAI

In [33]:
chain = load_qa_chain(OpenAI(), chain_type="stuff")

In [34]:
query = "What happened at The Board of Directors in their meeting held on April 13, 2023"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)


' The Board of Directors recommended a final dividend of ₹17.50/- per equity share for the financial year ended March 31, 2023.'

In [35]:
query = "How much investment did at June 30, 2023 ?"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

'\n\nThe total investment at June 30, 2023 was $1,638 million.'
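The `"stuff"` chain type used above places all retrieved chunks into a single prompt and sends that prompt to the model in one call. The following sketch shows the idea only; the prompt wording and the helper name `build_stuff_prompt` are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one context block and append
    the question. Illustrative template, not LangChain's real prompt."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks standing in for similarity_search results.
example_chunks = ["Chunk one text.", "Chunk two text."]
print(build_stuff_prompt(example_chunks, "What does chunk two say?"))
```

Because every chunk ends up in one prompt, this approach only works while the combined chunks fit within the model's context window, which is one reason the text was split into modest-sized pieces earlier.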

In [36]:
from langchain.document_loaders import OnlinePDFLoader

In [None]:
loader = OnlinePDFLoader("https://arxiv.org/pdf/1706.03762.pdf")

In [38]:
# !pip install unstructured

In [None]:
data = loader.load()

In [None]:
data

[Document(page_content='A WEAK (k, k)-LEFSCHETZ THEOREM FOR PROJECTIVE TORIC ORBIFOLDS\n\nWilliam D. Montoya\n\n3 2 0 2\n\nInstituto de Matem´atica, Estat´ıstica e Computa¸c˜ao Cient´ıﬁca, Universidade Estadual de Campinas (UNICAMP),\n\nb e F 7\n\nRua S´ergio Buarque de Holanda 651, 13083-859, Campinas, SP, Brazil\n\n]\n\nFebruary 9, 2023\n\nG A . h t a m\n\nAbstract\n\nFirstly we show a generalization of the (1, 1)-Lefschetz theorem for projective toric orbifolds and secondly we prove that on 2k-dimensional quasi-smooth hyper- surfaces coming from quasi-smooth intersection surfaces, under the Cayley trick, every rational (k, k)-cohomology class is algebraic, i.e., the Hodge conjecture holds on them.\n\n[\n\n1 v 3 0 8 3 0 . 2 0 3 2 : v i X r a\n\n1\n\nIntroduction\n\nIn [3] we proved that, under suitable conditions, on a very general codimension s quasi- smooth intersection subvariety X in a projective toric orbifold Pd Σ with d + s = 2(k + 1) the Hodge conjecture holds, that is, every

In [None]:
# Create the OpenAI embeddings client (reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()

In [39]:
# !pip install chromadb

In [None]:
from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator().from_loaders([loader])

In [None]:
query = "Explain me about Attention is all you need"
index.query(query)

' Attention is All You Need is a paper published in 2017 by researchers from Google Brain. The paper introduces the Transformer, a model architecture that relies entirely on an attention mechanism to draw global dependencies between input and output, instead of using recurrence. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs. Additionally, self-attention could yield more interpretable models.'