<a href="https://colab.research.google.com/github/willdphan/gpt-pdf/blob/main/pdf_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LOAD PDF

In [1]:
!pip install langchain
!pip install pymupdf
!pip install openai


In [2]:
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = PyMuPDFLoader("https://arxiv.org/pdf/2302.03803.pdf")

In [3]:
data = loader.load()

In [4]:
# PyMuPDFLoader returns one document per page, so data[0] is the first page
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your document')

You have 8 document(s) in your data
There are 1339 characters in your document


In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)

In [6]:
print(f'Now you have {len(texts)} documents')

Now you have 17 documents
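
To make the roles of `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of fixed-size chunking in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` tries to break on separators such as paragraph and line breaks before falling back to smaller pieces, so its chunks will not match these. The helper name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper
    for illustration only; this is not LangChain's splitting algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "a" * 2500
chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=0)
print([len(c) for c in chunks])  # [1000, 1000, 500]
```

With `chunk_overlap=0`, as in the cell above, no text is repeated between neighbouring chunks. A non-zero overlap repeats the tail of each chunk at the start of the next, which can help when a relevant sentence straddles a boundary.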


# CREATE EMBEDDINGS

In [22]:
!pip install pinecone-client
!pip install tiktoken
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone


In [8]:
OPENAI_API_KEY = '...'
PINECONE_API_KEY = '...'
PINECONE_API_ENV = '...'
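
Rather than pasting secrets into the notebook, one option is to read them from environment variables. This is a sketch under assumptions: the helper `read_key` is hypothetical, and the variable names simply mirror the ones in the cell above. If a variable is unset, it falls back to the same `'...'` placeholder.

```python
import os


def read_key(name, default="..."):
    """Return the environment variable `name`, or a placeholder if unset."""
    return os.environ.get(name, default)


# Assumed variable names; set them in the shell or Colab session beforehand
OPENAI_API_KEY = read_key("OPENAI_API_KEY")
PINECONE_API_KEY = read_key("PINECONE_API_KEY")
PINECONE_API_ENV = read_key("PINECONE_API_ENV")
```

This keeps the keys out of the saved notebook, which matters if the notebook is shared or committed to a public repository.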

In [9]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [20]:
# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to api key in console
)
index_name = "pdf-gpt"

In [23]:
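# Embed each chunk and upload it to the index.
# Only page_content is passed, so per-page metadata is not stored with the vectors.
docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)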

# ASK A QUESTION

In [26]:
query = "What is the pdf about?"
docs = docsearch.similarity_search(query, include_metadata=True)
docs

[Document(page_content='Σ and let π ∶ P(E) → Pd\nΣ be the\nprojective space bundle associated to the vector bundle E = L1 ⊕ ⋯ ⊕ Ls. It is known that\nP(E) is a (d + s − 1)-dimensional simplicial toric variety whose fan depends on the degrees\nof the line bundles and the fan Σ. Furthermore, if the Cox ring, without considering the\ngrading, of Pd\nΣ is C[x1,... ,xm] then the Cox ring of P(E) is\nC[x1,... ,xm,y1,... ,ys]\nMoreover for X a quasi-smooth intersection subvariety cut oﬀ by f1,... ,fs with deg(fi) =\n[Li] we relate the hypersurface Y cut oﬀ by F = y1f1 + ⋅⋅⋅ + ysfs which turns out to be\nquasi-smooth. For more details see Section 2 in [7].\n5', metadata={}),
 Document(page_content='Proof. By Proposition 5.3 and Corollary 3.6.\n7', metadata={}),
 Document(page_content='of Σ and each ρ ∈ Σ corresponds to an irreducible T-invariant Weil divisor Dρ on Pd\nΣ. Let\nCl(Σ) be the group of Weil divisors on Pd\nΣ module rational equivalences.\nThe total coordinate ring of Pd\nΣ is the p
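
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with hand-made three-dimensional vectors and cosine similarity in plain Python. Everything in it is an assumption for illustration: the toy vectors stand in for real embeddings, `top_k` and `cosine_similarity` are hypothetical helpers, and the metric a Pinecone index actually uses depends on how the index was created.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy (text, vector) pairs; real embeddings have far more dimensions
toy_index = [
    ("chunk about toric varieties", [0.9, 0.1, 0.0]),
    ("chunk about orbifolds", [0.7, 0.7, 0.0]),
    ("chunk with only a page number", [0.0, 0.0, 1.0]),
]
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
# ['chunk about toric varieties', 'chunk about orbifolds']
```

A ranking like this has no notion of which chunk is most useful to a reader, only which is closest in embedding space. That is one reason a short chunk such as the bare proof reference in the output above can still be returned.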

# USE OPENAI AND ORGANIZE INFO

In [30]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

In [31]:
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

In [33]:
query = "What is the pdf about? Make your response detailed and 1 paragraph long."
docs = docsearch.similarity_search(query, include_metadata=True)

In [34]:
chain.run(input_documents=docs, question=query)

' This pdf is about complex orbifolds. It begins by discussing the Cox ring of a simplicial toric variety and how it relates to the vector bundle associated to it. It then goes on to define the irrelevant ideal of a toric variety and how it is related to the group action of the Cl(Σ)-grading of S. Finally, it gives a brief introduction to complex orbifolds and mentions the needed theorems related to them.'
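
The `"stuff"` chain type places all retrieved documents into a single prompt and sends that to the model in one call. A rough sketch of the idea in plain Python follows. The prompt wording and the helper `build_stuff_prompt` are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(doc_texts, question):
    """Concatenate all document texts and the question into one prompt.

    Hypothetical illustration of the "stuff" strategy; the wording below
    is invented and is not LangChain's template.
    """
    context = "\n\n".join(doc_texts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is the pdf about?",
)
print(prompt)
```

Because every chunk goes into the same prompt, the combined text has to fit within the model's context window. That is a practical reason to keep both the chunk size and the number of retrieved documents small.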