# Question Answering

## PDF reader

Source book (Crossing the Chasm): https://archive.org/details/crossingthechasm_202002

In [None]:
!pip install langchain openai chromadb tiktoken pypdf llama-index tqdm

In [1]:
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader

In [5]:
import os
import tomli
import openai
with open('../.streamlit/secrets.toml','rb') as f:
    toml_dict = tomli.load(f)
openai.api_key = toml_dict['OPEN_AI_KEY']
os.environ['OPENAI_API_KEY'] = toml_dict['OPEN_AI_KEY']
# os.environ['PINECONE_API_KEY'] = toml_dict['PINECONE_API_KEY']
# os.environ['PINECONE_API_ENV'] = toml_dict['PINECONE_API_ENV']

In [4]:

# load document
loader = PyPDFLoader("../book/Crossing the Chasm.pdf")
documents = loader.load()


In [2]:
from llama_index import download_loader

PDFReader = download_loader("PDFReader")

loader = PDFReader()
documents = loader.load_data("../book/Crossing the Chasm.pdf")

### Create index

In [6]:
from llama_index import GPTSimpleVectorIndex

index = GPTSimpleVectorIndex.from_documents(documents)

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 118920 tokens


In [7]:
query = "what is Vendor-Oriented Pricing?"
response = index.query(query)

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 4188 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 9 tokens




Vendor-oriented pricing is a function of internal issues, beginning with cost of goods, and extending to cost of sales, cost of overhead, cost of capital, promised rate of risk-adjusted return, and any number of other factors. These factors are critical to being able to manage an enterprise profitably on an ongoing basis. To leave the chasm behind, the high-tech enterprise must accept that it is going through a phase and act competently with that knowledge. This means redirecting energy towards the concerns of a vendor-oriented pricing strategy, which includes understanding the market-visible issues that are impacted by the internal factors. This allows the enterprise to make informed decisions about pricing that will maximize profits and ensure the long-term success of the business.


In [9]:
print(response)



Vendor-oriented pricing is a function of internal issues, beginning with cost of goods, and extending to cost of sales, cost of overhead, cost of capital, promised rate of risk-adjusted return, and any number of other factors. These factors are critical to being able to manage an enterprise profitably on an ongoing basis. To leave the chasm behind, the high-tech enterprise must accept that it is going through a phase and act competently with that knowledge. This means redirecting energy towards the concerns of a vendor-oriented pricing strategy, which includes understanding the market-visible issues that are impacted by the internal factors. This allows the enterprise to make informed decisions about pricing that will maximize profits and ensure the long-term success of the business.


### Create Embeddings


In [None]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)
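
The call returns one vector per input string (with the pre-1.0 `openai` client used here, under `res['data'][i]['embedding']`). A vector store ranks documents by comparing such vectors, commonly with cosine similarity. Below is a minimal sketch with toy three-dimensional vectors standing in for real embeddings. `cosine_similarity` is an illustrative helper, not part of any library used in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumption: real embeddings have far more dimensions).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "pricing": [2.0, 0.0, 2.0],   # same direction as the query
    "channels": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Score every document against the query and pick the closest one.
scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
best = max(scores, key=scores.get)
```

Because cosine similarity ignores vector length, the "pricing" vector scores as a perfect match even though it is twice as long as the query vector.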

### Create Pinecone DB

In [None]:
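import pinecone

# Requires PINECONE_API_KEY and PINECONE_API_ENV in os.environ
# (the two lines that set them in the setup cell are commented out).
pinecone.init(api_key=os.environ['PINECONE_API_KEY'], environment=os.environ['PINECONE_API_ENV'])
# pinecone.list_indexes()
index = pinecone.Index('crossing-the-chasm')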

## QA

### load_qa_chain

In [16]:
# load document
loader = PyPDFLoader("book/Crossing the Chasm-202-217.pdf")
documents = loader.load()


In [14]:
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
chat = ChatOpenAI(model_name='gpt-3.5-turbo')

chain = load_qa_chain(llm=chat, chain_type="map_reduce")
query = "what is Vendor-Oriented Pricing?"
chain.run(input_documents=documents, question=query)

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=60).


'The given portion of the document does not provide a clear definition of Vendor-Oriented Pricing. However, it mentions that "vendor-oriented pricing represents the least sound basis for pricing decisions during the chasm period."'
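
With `chain_type="map_reduce"`, the question is first put to each document separately, and the per-document answers are then combined in a second step. Since each page is read in isolation, this may be why the answer above hedges. Below is a minimal sketch of the pattern. `map_reduce_answer` and `fake_llm` are hypothetical names, `fake_llm` stands in for a real model call, and LangChain's actual prompts differ.

```python
def map_reduce_answer(llm, documents, question):
    """Ask the question of each document, then combine the partial answers."""
    # Map step: one call per document, each seeing only its own text.
    partial_answers = [
        llm(f"Use this excerpt to answer the question.\n\n{doc}\n\nQuestion: {question}")
        for doc in documents
    ]
    # Reduce step: one final call over the collected partial answers.
    summaries = "\n".join(partial_answers)
    return llm(f"Combine these partial answers.\n\n{summaries}\n\nQuestion: {question}")


calls = []  # records every prompt sent to the stand-in model


def fake_llm(prompt):
    """Stand-in for a model call: logs the prompt and returns a placeholder."""
    calls.append(prompt)
    return f"[answer from {len(prompt)} characters]"


pages = ["page one text", "page two text", "page three text"]
final = map_reduce_answer(fake_llm, pages, "what is Vendor-Oriented Pricing?")
# Three map calls plus one reduce call were made.
```

The number of model calls grows with the number of documents, which is one reason a long PDF can run into timeouts like the retry message above.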

### RetrievalQA

In [7]:
from langchain.chains import RetrievalQA
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

In [8]:
# split the documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
# select which embeddings we want to use
embeddings = OpenAIEmbeddings()
# create the vector store to use as the index
db = Chroma.from_documents(texts, embeddings)

Using embedded DuckDB without persistence: data will be transient
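
`chunk_size` caps the length of each chunk in characters, and `chunk_overlap` sets how many characters adjacent chunks share. The sketch below is a simplified fixed-window splitter that illustrates only those two parameters. `CharacterTextSplitter` itself splits on a separator and merges the pieces, so its chunk boundaries will differ. `split_fixed` is a hypothetical helper.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij"
no_overlap = split_fixed(sample, chunk_size=4)
with_overlap = split_fixed(sample, chunk_size=4, chunk_overlap=2)
```

With `chunk_overlap=0`, as in the cell above, a sentence that straddles a boundary is cut in two. A nonzero overlap keeps some shared context at the cost of more chunks to embed.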


In [10]:
# expose this index in a retriever interface
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":3})
# create a chain to answer questions 
qa = RetrievalQA.from_chain_type(
    llm=chat, chain_type="stuff", retriever=retriever, return_source_documents=True)
query = "what is Vendor-Oriented Pricing?"
result = qa({"query": query})

In [11]:
result

{'query': 'what is Vendor-Oriented Pricing?',
 'result': 'Vendor-oriented pricing is a pricing strategy that is based on internal factors such as the cost of goods, cost of sales, cost of overhead, cost of capital, promised rate of risk-adjusted return, and any number of other factors critical to managing an enterprise profitably on an ongoing basis. Its impact is on the number of transactions required to create a given amount of annual revenue. It sets the distribution channel decision by establishing a price-point ballpark that puts the product in the direct sales, web self-service, or sales 2.0 camp. The pricing strategy is not the most sound basis for pricing decisions during the chasm period, as it requires being almost entirely externally focused- both on the new demands of the mainstream customer and the new relationship you are trying to build with a mainstream channel.',
 'source_documents': [Document(page_content='208 Crossing the Chasm\nVendor -Oriented Pricing\nVendor-orien

In [12]:
result['result']

'Vendor-oriented pricing is a pricing strategy that is based on internal factors such as the cost of goods, cost of sales, cost of overhead, cost of capital, promised rate of risk-adjusted return, and any number of other factors critical to managing an enterprise profitably on an ongoing basis. Its impact is on the number of transactions required to create a given amount of annual revenue. It sets the distribution channel decision by establishing a price-point ballpark that puts the product in the direct sales, web self-service, or sales 2.0 camp. The pricing strategy is not the most sound basis for pricing decisions during the chasm period, as it requires being almost entirely externally focused- both on the new demands of the mainstream customer and the new relationship you are trying to build with a mainstream channel.'
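
The retriever returns the `k` chunks most similar to the query, and the `stuff` chain type places all of them into a single prompt. Below is a minimal sketch of that flow, using word overlap as a stand-in for embedding similarity. That is an assumption made to keep the example self-contained; the notebook itself uses OpenAI embeddings through Chroma. `overlap_score`, `top_k`, and `stuff_prompt` are hypothetical helpers.

```python
def overlap_score(query, chunk):
    """Count distinct lowercase words shared by the query and the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def top_k(query, chunks, k):
    """Return the k chunks with the highest score, best first."""
    return sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)[:k]


def stuff_prompt(query, chunks):
    """Concatenate every retrieved chunk into one prompt ahead of the question."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


demo_chunks = [
    "vendor-oriented pricing starts from cost of goods",
    "distribution channels reward certain price points",
    "the whole product model",
]
demo_query = "what is vendor-oriented pricing"

retrieved = top_k(demo_query, demo_chunks, k=2)
prompt = stuff_prompt(demo_query, retrieved)
```

Unlike `map_reduce`, this makes a single model call, but every retrieved chunk has to fit in the model's context window together, so `k` and `chunk_size` need to be chosen with that limit in mind.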

### VectorstoreIndexCreator

`VectorstoreIndexCreator` wraps the split, embed, and store steps above in a single call.

Source:

https://python.langchain.com/en/latest/modules/chains/getting_started.html
https://github.com/hwchase17/langchain/blob/master/langchain/indexes/vectorstore.py#L21-L74

In [None]:
index = VectorstoreIndexCreator(
    # split the documents into chunks
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0),
    # select which embeddings we want to use
    embedding=OpenAIEmbeddings(),
    # use Chroma as the vector store to index and search embeddings
    vectorstore_cls=Chroma
).from_loaders([loader])
index.query(llm=chat, question=query, chain_type="stuff")

### ConversationalRetrievalChain

`ConversationalRetrievalChain` combines conversation history with a `RetrievalQA`-style chain.

The chat history is passed in with each question, so follow-up questions can refer to earlier turns.

Source: https://python.langchain.com/en/latest/modules/chains/index_examples/chat_vector_db.html

In [None]:
from langchain.chains import ConversationalRetrievalChain

In [None]:
# split the documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
# select which embeddings we want to use
embeddings = OpenAIEmbeddings()
# create the vector store to use as the index
db = Chroma.from_documents(texts, embeddings)
# expose this index in a retriever interface
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":2})
# create a chain to answer questions 
qa = ConversationalRetrievalChain.from_llm(chat, retriever)
chat_history = []
result = qa({"question": query, "chat_history": chat_history})

In [None]:
result["answer"]

In [None]:
chat_history = [(query, result["answer"])]
query = "How does it differ from Distribution-Oriented Pricing?"
result = qa({"question": query, "chat_history": chat_history})


In [None]:
chat_history

In [None]:
result['answer']
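
The cells above rebuild `chat_history` by hand after each turn. A small helper can do that bookkeeping. The sketch below assumes only that the chain is callable with a dict holding `question` and `chat_history` and returns a dict with an `answer` key, as in the calls above. `ask` and `echo_chain` are hypothetical names, and `echo_chain` is a stand-in so the example runs without an API key.

```python
def ask(chain, question, chat_history):
    """Send a question with the history so far, then record the new turn."""
    result = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result["answer"]


def echo_chain(inputs):
    """Stand-in for the real chain: reports how many earlier turns it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"reply to {inputs['question']!r} after {turns} earlier turn(s)"}


history = []
first = ask(echo_chain, "what is Vendor-Oriented Pricing?", history)
second = ask(echo_chain, "How does it differ from Distribution-Oriented Pricing?", history)
```

With the real chain, the equivalent call would be `ask(qa, query, chat_history)`. Appending each turn, rather than reassigning the list, keeps earlier turns available to later questions.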