<a href="https://colab.research.google.com/github/sugarforever/LangChain-Tutorials/blob/main/LangChain_PDF_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

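This Python notebook uses LangChain's QA chain together with Chroma to run semantic search over the PDF document Analysis-and-Comparison-between-Optimism-and-StarkNet.pdf.

The PDF has 61 pages. This notebook demonstrates the OpenAI API cost of building a semantic index for a document of that size.

To use it, create a `.env` file locally and set a valid OpenAI API key in it, as shown in `.env.example`.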
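
The cells below read `OPENAI_API_KEY` from `os.environ`, so the contents of `.env` have to reach the process environment first. The `python-dotenv` package's `load_dotenv()` is the usual tool for that. As a dependency-free alternative, here is a minimal sketch, assuming the file holds plain `KEY=VALUE` lines; `load_env_file` is a hypothetical helper and not part of the original notebook.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Copy KEY=VALUE lines from `path` into os.environ (existing variables win)."""
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded  # nothing to load; the caller decides how to react
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")  # drop optional quotes
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


load_env_file()  # no-op when there is no .env in the working directory
```

Because `setdefault` is used, a key that is already exported in the shell takes precedence over the value in the file.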

In [78]:
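%pip install openai > /dev/null
%pip install chromadb > /dev/null
%pip install langchain > /dev/null
# PyMuPDFLoader needs pymupdf; OpenAIEmbeddings uses tiktoken to count tokens
%pip install pymupdf > /dev/null
%pip install tiktoken > /dev/null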




In [79]:
from langchain.document_loaders import PyMuPDFLoader

In [81]:
PDF_NAME='Analysis-and-Comparison-between-Optimism-and-StarkNet.pdf'
def load_pdf():
  return PyMuPDFLoader(PDF_NAME).load()

In [82]:
docs = load_pdf()

In [83]:
print(f'You have {len(docs)} document(s) in your data')
print(f'There are {len(docs[0].page_content)} characters in the first page of your document')

# Total character count across all pages
total = sum(len(doc.page_content) for doc in docs)
print(f'There are {total} characters in your document')

You have 61 document(s) in your data
There are 284 characters in the first page of your document
There are 112626 characters in your document


In [84]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
split_docs = text_splitter.split_documents(docs)

In [85]:
print(f'Now you have {len(split_docs)} documents')

Now you have 143 documents
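
The 61 pages become 143 chunks rather than the roughly 113 that 112626 characters divided by 1000 would suggest, because `split_documents` splits each page on its own and a chunk never spans two pages. To illustrate what a `chunk_size` limit means, here is a simplified sketch that packs whitespace-separated words greedily. It is not the recursive separator logic of `RecursiveCharacterTextSplitter`, and `greedy_chunks` is a hypothetical helper.

```python
from __future__ import annotations


def greedy_chunks(text: str, chunk_size: int = 1000) -> list[str]:
    """Pack whitespace-separated words into chunks of at most chunk_size characters.

    A single word longer than chunk_size is kept whole as its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            current = candidate  # the word still fits (or starts a new chunk)
        else:
            chunks.append(current)  # close the full chunk and start the next one
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = "word " * 500
print([len(chunk) for chunk in greedy_chunks(sample, chunk_size=1000)])
```

With `chunk_overlap=0`, as in the cell above, consecutive chunks share no text, which keeps the number of embedded characters close to the document's length.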


In [86]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
import os

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

In [87]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [95]:
persist_directory = 'starknet'
collection_name = 'starknet_index'

In [89]:
from langchain.callbacks import get_openai_callback

In [96]:
with get_openai_callback() as cb:
    vectorstore = Chroma.from_documents(split_docs, embeddings, collection_name=collection_name, persist_directory=persist_directory)
    vectorstore.persist()
    print(cb)


Using embedded DuckDB with persistence: data will be stored in: starknet


Tokens Used: 0
	Prompt Tokens: 0
	Completion Tokens: 0
Successful Requests: 0
Total Cost (USD): $0.0
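
The callback reports zero tokens and zero cost even though the index was built. A likely explanation is that `get_openai_callback` only tallies LLM completion calls and does not see embedding requests. A rough estimate can be made from the character count printed earlier. The sketch below is a back-of-the-envelope calculation: the 4 characters per token ratio is a common rule of thumb for English text, and the price per 1K tokens is a placeholder assumption that should be checked against current OpenAI pricing. `estimate_embedding_cost` is a hypothetical helper.

```python
import math


def estimate_embedding_cost(n_chars, chars_per_token=4.0, usd_per_1k_tokens=0.0004):
    """Estimate token count and cost for embedding n_chars characters of text.

    Both keyword defaults are assumptions, not measured values.
    """
    tokens = n_chars / chars_per_token
    cost = tokens / 1000 * usd_per_1k_tokens
    return math.ceil(tokens), cost


# 112626 is the total character count printed by the notebook above.
tokens, cost = estimate_embedding_cost(112_626)
print(f"Estimated tokens: {tokens}, estimated cost: ${cost:.4f}")
```

Under these assumptions the whole PDF comes to roughly 28,000 tokens and about one US cent. For an exact count, the chunks could be tokenized with `tiktoken` instead of using the character heuristic.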


In [91]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

In [92]:
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

# Load the vectorstore from disk
vectordb = Chroma(collection_name=collection_name, persist_directory=persist_directory, embedding_function=embeddings)

query = "What is starknet?"
# Query the reloaded store for the 3 most similar chunks.
# Note: this rebinds `docs` from the loaded PDF pages to the search hits.
docs = vectordb.similarity_search(query, k=3)

Using embedded DuckDB with persistence: data will be stored in: starknet
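
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force cosine ranking over toy vectors. It is an illustration only: Chroma's actual distance metric and index structure may differ, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real OpenAI embeddings have many more dimensions.
index = [("chunk a", [1.0, 0.0]), ("chunk b", [0.0, 1.0]), ("chunk c", [1.0, 1.0])]
for score, text in top_k([1.0, 0.0], index, k=2):
    print(f"{score:.3f}  {text}")
```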


In [93]:
print(chain.document_prompt)

input_variables=['page_content'] output_parser=None partial_variables={} template='{page_content}' template_format='f-string' validate_template=True
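
The `document_prompt` template is just `{page_content}`, so each retrieved chunk is inserted verbatim. With `chain_type="stuff"`, the chunks are concatenated into a single context block and sent to the LLM together with the question in one request. A minimal sketch of that assembly follows; the prompt wording is a hypothetical stand-in rather than LangChain's actual template, `build_stuff_prompt` is a hypothetical helper, and plain strings are used in place of `Document` objects.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is starknet?",
)
print(prompt)
```

Because everything goes into one prompt, the number and size of the retrieved chunks must fit within the model's context window, which is one reason to retrieve only a few chunks of about 1000 characters each.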


In [94]:
for doc in docs:
    print(doc.metadata)

{'source': 'Analysis-and-Comparison-between-Optimism-and-StarkNet.pdf', 'file_path': 'Analysis-and-Comparison-between-Optimism-and-StarkNet.pdf', 'page_number': 36, 'total_pages': 61, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'dvips + GPL Ghostscript GIT PRERELEASE 9.22', 'creationDate': "D:20221031205028-04'00'", 'modDate': "D:20221031205028-04'00'", 'trapped': ''}
{'source': 'Analysis-and-Comparison-between-Optimism-and-StarkNet.pdf', 'file_path': 'Analysis-and-Comparison-between-Optimism-and-StarkNet.pdf', 'page_number': 36, 'total_pages': 61, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'dvips + GPL Ghostscript GIT PRERELEASE 9.22', 'creationDate': "D:20221031205028-04'00'", 'modDate': "D:20221031205028-04'00'", 'trapped': ''}
{'source': 'Analysis-and-Comparison-between-Optimism-and-StarkNet.pdf', 'file_path': 'Analysis-a

In [None]:
with get_openai_callback() as cb:
    print(chain.run(input_documents=docs, question=query))
    print(cb)