## Query Private Docs with GPT & LangChain
Open this notebook in Google Colab to run the code.
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rashlab/AI-Notes/blob/main/langchain/QueryMyDoc.ipynb)


### **1. The Challenges of Querying Private Corpus of Docs with GPT**

I spent a couple of hours today playing with the LangChain open-source package, and I have some findings and thoughts.

I practice law, and I want to be able to apply GPT’s intelligence to a private corpus of legal documents, which can be, of course, very large. I want GPT to do all the dirty work for me: read all the text (cover-to-cover), run a thorough legal analysis of whatever it is asked about, reach correct and reasonable legal conclusions (albeit not necessarily perfect), summarize it all nicely in a well-formatted memo I can use (as a first draft), and complete all that in less than a minute. For the first time, with GPT-4, I believe that it is doable. Today.

So, where are the problems?

Putting aside privacy and proprietary data issues (which are serious concerns for legal professionals), we simply cannot use GPT-4 to review our private corpus of documents; that's not possible using ChatGPT or the OpenAI APIs. Moreover, it makes no sense to send multiple PDFs (tens or hundreds of pages each) to GPT in each query; it's expensive, slow, and inefficient. We need to provide large language models (LLMs) like GPT access to our private corpus in a different way. So that's <u>problem #1</u>; let's call it the "**Accessibility**" problem.

But there's also <u>problem #2</u>, which is the notorious **Alignment** problem. It is well-known that LLMs are not "aligned" – these models frequently hallucinate, make up answers, and in general, cannot be trusted. This is why [Yann LeCun (@ylecun) tweeted today](https://twitter.com/ylecun/status/1643942324672536577?s=20) that LLMs are not reliable as factual information sources. That's true. Unlike standard data stores, when it comes to LLMs, we have a limited understanding of how their memory actually works (we probably don't have a much better understanding of how human memory functions either). Researchers study [where facts are located inside a language model](https://rome.baulab.info/) and the extent to which we can locate and edit factual associations inside an LLM. However, having said all that, my view is that the power of LLMs lies in the intelligence and reasoning capabilities that emerge during training, rather than the factual knowledge they have digested. As [Dan Shipper put it, GPT-4 is a Reasoning Engine](https://every.to/chain-of-thought/gpt-4-is-a-reasoning-engine), not a storage of facts.

These are only some of the challenges, so what can be done about it?

Now I know there are already several commercial legal AI solutions developed using GPT-4, most notably [CoCounsel of Casetext](https://casetext.com/cocounsel) and [Harvi](https://harvi.ai). These products were developed with OpenAI support long before GPT-4 was publicly announced, and I'm sure they have developed at least partial solutions to the Accessibility and Alignment problems mentioned above.

### **2. LangChain** 

Still, I thought it could be an interesting exercise to see what one can build today, mainly with open-source packages, and [Harrison Chase](https://twitter.com/hwchase17) with the help of 350 community members developed [@LangChainAI](https://twitter.com/LangChainAI) exactly for that! So, this is probably the best place to start the exercise.

The [LangChain repo](https://github.com/hwchase17/langchain) is high in the GitHub Trending list, with 22K stars and 2K forks as of early April. LangChain is shipping new components and updates daily. This is truly amazing, considering that the first release was only six months ago. It's no surprise that they have just raised a [$10M seed round](https://blog.langchain.dev/announcing-our-10m-seed-round-led-by-benchmark). 

So what's in it? 

LangChain is a framework for developing applications powered by large language models (LLMs). Their vision is that applications will want not only to make calls to an LLM's API but also connect the model to other data sources and allow the model to interact with its environment. With that in mind, they have developed a set of components, pre-built chains, and agents. Chains - the core concept of the package - allow developers to assemble components in a specific manner to accomplish a particular task, such as summarizing a large PDF document or querying a SQL database. Agents can be thought of as "dynamic chains" in which the sequence of steps taken is determined by a language model on the fly.
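
To make the idea of a chain concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation; `run_chain`, `strip_whitespace`, and `wrap_in_prompt` are hypothetical names used only to show steps being assembled in a fixed order, with each step's output feeding the next.

```python
from typing import Callable, List

# A "step" is any function that takes text and returns text.
Step = Callable[[str], str]


def run_chain(steps: List[Step], text: str) -> str:
    """Pass the input through each step in order, feeding each output to the next step."""
    for step in steps:
        text = step(text)
    return text


# Hypothetical stand-in steps; a real chain would call a loader, a prompt template, and an LLM.
def strip_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def wrap_in_prompt(text: str) -> str:
    """Prefix the text with an instruction for the model."""
    return f"Summarize the following text:\n{text}"


chain_output = run_chain([strip_whitespace, wrap_in_prompt], "  Hello   world \n")
```

An agent, by contrast, would let the model choose which step to run next instead of following a fixed list.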

One of the use-cases demonstrated by LangChain is a Q&A of your own documents with GPT. The basics of this QA-Docs use case are explained [here](https://docs.langchain.com/docs/use-cases/qa-docs).

In a nutshell, the solution to the Alignment problem is to use prompt templates, and the solution to the Accessibility problem is to perform semantic vector search on our private data as a preliminary step, and then send our query, combined with the search results and LangChain prompts, to GPT for processing.

### **3. Prompt Templates** 

GPT hallucinations are well known, but it is also established that by feeding a model with well-designed prompts (as instructions) and a few examples, it (usually) improves its manners. So LangChain developed a framework of prompt templates, which they combine and chain together with the specific question that you may want to ask the model. 

For example, a prompt can be something like the following, simply instructing GPT to behave:

        "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer."

Now this may seem a bit infantile, but apparently it works – injecting such instructions, followed by a chunk of text defined as the context of the question, usually prevents those hallucinations. 
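
A minimal sketch of how such a template could be filled in, using plain string formatting rather than LangChain's own classes. The instruction sentence is the one quoted above; the layout around it (the `Question:` and `Helpful Answer:` labels) and the name `build_prompt` are assumptions made for illustration.

```python
# Instruction text quoted from the prompt above; the layout around it is an assumption.
QA_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_prompt(context: str, question: str) -> str:
    """Fill the template with the retrieved context and the user's question."""
    return QA_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    context="That's why I've proposed a 15% minimum tax rate for corporations.",
    question="should we raise the corporate tax?",
)
```

The model only ever sees the final string, so the instruction, the context, and the question all travel together in a single request.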

There are some default prompt templates, which one can of course modify as needed. The above prompt, for example, is used in the LangChain package by default in the basic method to query GPT. However, if you want to ask GPT to specify the sources that its answer is based on, a different prompt is used: 

        "Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). If you don't know the answer, just say that you don't know. Don't try to make up an answer. ALWAYS return a "SOURCES" part in your answer."

Moreover, in the case of insisting on inclusion of SOURCES in the answer, after the clear instructions, two examples are given to the model – 

- in the first one, there is a proper source in the context provided that GPT can use to answer the query
  
- in the second one, however, there is actually no such source. This is how it's done: the context provided to the model is an excerpt of the president’s State of the Union speech, and the question on that is: “**What did the president say about Michael Jackson**?” (which he did not). The exemplary answer provided is then: “**The president did not mention Michael Jackson**”, and the SOURCES part of the answer is there but it's left empty. The purpose is to teach the model that it’s okay not to provide an answer if it doesn’t know it, and hope that with all that, the model will ‘get it right’ and behave.
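
The two-example pattern described above could be assembled roughly as follows. This is a sketch, not LangChain's actual template: the instruction sentence is quoted from above, while the example contents, the source labels, the field layout, and the function names are placeholders.

```python
from typing import Dict, List

INSTRUCTIONS = (
    "Given the following extracted parts of a long document and a question, "
    'create a final answer with references ("SOURCES"). '
    "If you don't know the answer, just say that you don't know. "
    "Don't try to make up an answer. "
    'ALWAYS return a "SOURCES" part in your answer.'
)

# Hypothetical worked examples; contents and source labels are placeholders.
EXAMPLES: List[Dict[str, str]] = [
    {
        "content": "This Agreement is governed by English law.",
        "source": "contract-1",
        "question": "Which law governs the agreement?",
        "answer": "The agreement is governed by English law.",
        "sources": "contract-1",
    },
    {
        "content": "(an excerpt of the State of the Union speech)",
        "source": "speech-1",
        "question": "What did the president say about Michael Jackson?",
        "answer": "The president did not mention Michael Jackson.",
        "sources": "",  # deliberately empty: nothing in the context supports an answer
    },
]


def format_example(ex: Dict[str, str]) -> str:
    """Render one worked example, including its (possibly empty) SOURCES line."""
    return (
        f"QUESTION: {ex['question']}\n"
        f"Content: {ex['content']}\n"
        f"Source: {ex['source']}\n"
        f"FINAL ANSWER: {ex['answer']}\n"
        f"SOURCES: {ex['sources']}"
    )


def build_few_shot_prompt(content: str, source: str, question: str) -> str:
    """Instructions first, then the worked examples, then the real question left open."""
    shots = "\n\n".join(format_example(ex) for ex in EXAMPLES)
    return (
        f"{INSTRUCTIONS}\n\n{shots}\n\n"
        f"QUESTION: {question}\n"
        f"Content: {content}\n"
        f"Source: {source}\n"
        "FINAL ANSWER:"
    )


few_shot_prompt = build_few_shot_prompt(
    content="That's why I've proposed a 15% minimum tax rate for corporations.",
    source="state_of_the_union.txt",
    question="should we raise the corporate tax?",
)
```

The second example is the interesting one: its `SOURCES` line is present but empty, which shows the model what a "nothing to cite" answer looks like.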

### **4. Running a Preliminary Semantic Search**

Prompt templates are what LangChain is doing to tackle the Alignment problem, but the Accessibility problem requires a different type and scale of solution. Enter semantic search, vector databases (or vectorstores), and sentence embeddings.

The basic idea is this: we will use a (private) database to store our corpus of documents and create an efficient index that we will use to run a preliminary search of our query on our corpus. We will then take the results (which are reasonably-sized), combine them with the query and prompts, and send it all as input to GPT to work its magic. 

This is a two-step approach. The first is an inexpensive database search, and the second is the expensive query to GPT. Thus, GPT is not directly exposed to our own documents but only to our (private) database results.
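
The two steps can be sketched as follows. To keep the example self-contained, the search is a naive word-overlap ranking (a stand-in for the real vector search) and the model is whatever callable is passed in as `ask_llm`; the function names and the sample chunks are hypothetical.

```python
from typing import Callable, List


def search_corpus(chunks: List[str], query: str, k: int = 2) -> List[str]:
    """Step 1 (cheap, local): rank chunks by how many query words they share.

    Word overlap is a stand-in for the real search so the sketch stays self-contained.
    """
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def answer(chunks: List[str], query: str, ask_llm: Callable[[str], str], k: int = 2) -> str:
    """Step 2 (expensive, remote): send only the top-k chunks plus the query to the model."""
    context = "\n\n".join(search_corpus(chunks, query, k))
    return ask_llm(f"{context}\n\nQuestion: {query}")


# Hypothetical corpus of three small chunks.
corpus = [
    "The tenant shall pay rent on the first day of each month.",
    "Either party may terminate this agreement with ninety days notice.",
    "This agreement is governed by the laws of the State of New York.",
]

# A fake model that echoes its input, so we can see exactly what would leave our machine.
sent_to_model = answer(corpus, "when must the tenant pay rent", ask_llm=lambda p: p, k=1)
```

With `k=1`, only the rent clause is sent; the other two chunks never reach the model.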

Now, the database used in these scenarios is usually a vector database (aka vectorstore), which indexes and stores vector embeddings of small text chunks for fast retrieval and similarity search. Note that eventually, what is sent to GPT are not the embedding vectors but plain text, which are chunks of our original docs. So a question can be asked: why use a vectorstore and not a traditional search engine using a ranking algorithm such as [BM25](https://en.wikipedia.org/wiki/Okapi_BM25)? Apparently vector search works better, but there is also a developing trend now to combine traditional search with vector search, or in other words - [Hybrid Search](https://www.pinecone.io/learn/hybrid-search-intro/).

In any event, vectorstore is what LangChain and others are demonstrating. [Pinecone](https://pinecone.io) is a well-known provider of a proprietary (non-free) hosted vectorstore solution, but there are some open-source products such as [Chroma](https://www.trychroma.com), which is the default vectorstore used in the LangChain examples. Apparently, [Not all vector databases are created equal](https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696) as explained by [Dmitry Kan](https://twitter.com/DmitryKan). 

### **5. Putting it all Together**

We mentioned that a vector DB indexes and stores vector embeddings of small chunks of text. That means that the first step is to split our large text files into small chunks (e.g., 1000 chars each). We then need to create embeddings of these text chunks. LangChain conveniently takes care of all this for us. The default embedding provider in LangChain examples is OpenAI’s (non-free) embedding API. We then store these embeddings in ChromaDB and create an index.
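
As a rough illustration of the splitting step, here is a simplified fixed-width splitter. It is not LangChain's `CharacterTextSplitter`, and `split_into_chunks` is a hypothetical helper that only shows what `chunk_size` and `chunk_overlap` mean.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a simplified,
    fixed-width splitter that ignores sentence and paragraph boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "x" * 2500
chunks = split_into_chunks(sample, chunk_size=1000, chunk_overlap=0)
chunk_lengths = [len(c) for c in chunks]  # [1000, 1000, 500]
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when a relevant sentence straddles a chunk boundary.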

After splitting, embedding and storing the vectors in the vectorstore, we are finally ready to query our private documents. When we do this, our query is also converted into an embedding vector, and a vector semantic search is executed over our vectorised corpus (usually using [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)). The best-matching vectors are mapped back to their original plain text (which was saved in the vectorstore alongside the vectors), and that text is then combined with LangChain prompts and our original query before being passed on to GPT for processing. The overall process is shown in the diagram below.

<center><img src='https://github.com/rashlab/AI-Notes/raw/main/filez/langChainProcess.png' width="90%" align="horizontal-center"/></center>
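
The similarity search itself can be sketched in a few lines of plain Python. The three-dimensional vectors below are toy values chosen by hand; real embeddings come from an embedding model and have many more dimensions. `top_k` and the chunk labels are hypothetical, and a real vectorstore uses an index rather than scanning every vector.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], store: List[Tuple[List[float], str]], k: int = 1) -> List[str]:
    """Return the text of the k stored chunks whose vectors are closest to the query."""
    scored = sorted(store, key=lambda item: cosine_similarity(query_vec, item[0]), reverse=True)
    return [text for _, text in scored[:k]]


# Toy store of (embedding, original text) pairs.
store = [
    ([0.9, 0.1, 0.0], "chunk about corporate tax"),
    ([0.0, 0.2, 0.9], "chunk about foreign policy"),
    ([0.1, 0.9, 0.1], "chunk about health care"),
]
query_vec = [1.0, 0.2, 0.0]
best = top_k(query_vec, store, k=1)
```

Note that the function returns the stored text, not the vectors: that plain text is what ends up in the prompt.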

### **6. So How Good is it?** 

The LangChain Q&A document example is based on President Biden’s State of the Union speech, which is a .txt file of several thousand words. The query in the LangChain example is this: "**What did the president say about Ketanji Brown Jackson?**", which is answered quite nicely:

        "The president said that Ketanji Brown Jackson is one of our nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He also said that she is a consensus builder and has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."

I then tried several other queries on different topics mentioned in the speech, such as the situation with Russia, inflation and the economy, which were all answered rather nicely, as expected. 

**HOWEVER**, then I asked, "**should we raise the corporate tax?**", and it turns out that it was a tricky question, because the answer generated by GPT was completely wrong! To understand why, let's first see what Biden actually said: 

        "Just last year, 55 Fortune 500 corporations earned $40 billion in profits and paid zero dollars in federal income tax. That’s simply not fair. That’s why I’ve proposed a 15% minimum tax rate for corporations…"

It seems that the word "minimum" confused the model, because on the question of whether we should raise the corporate tax, it answered:

        "No, the plan proposed by the speaker is to lower the corporate tax rate to a minimum of 15%." 

sad. 

I also asked the same question using the "with SOURCES" option, and this time the answer was quite different: "**The president did not mention raising the corporate tax.**" Apparently GPT took note of the Michael Jackson warning in the “with SOURCES” prompt template.

Then I thought, maybe the model used by default in the example can be replaced with a better one that would demonstrate better understanding of the context? 

It turns out that by default, the LangChain example uses the ***text-davinci-003*** model, which is a version of GPT-3.5 (see https://platform.openai.com/docs/models/gpt-3-5). Unfortunately, I could not find a way to set it to use GPT-4, but it was possible to switch to the ***gpt-3.5-turbo*** model, which is not only the cheapest model to use (according to OpenAI) but also a more intelligent creature, at least based on its answer to my question of whether we should raise the corporate tax, as it provided a reasonable answer:

        "The context suggests that the speaker believes that corporations and the wealthiest Americans should start paying their fair share and has proposed a 15% minimum tax rate for corporations. However, it is not explicitly stated whether the corporate tax should be raised or not."

I also tried the gpt-3.5-turbo to answer this query "with SOURCES," and the answer there was also quite good, although the turbo version got a bit confused between "corporate tax" vs. "taxes for corporations" (didn’t get it that those are actually the same):

        "The president proposes to raise taxes for corporations and the wealthiest Americans to ensure they pay their fair share, but he does not specifically mention raising the corporate tax rate. He also proposes a plan to fight inflation by lowering costs and increasing the productive capacity of the economy. 'sources': 'docs/modules/state_of_the_union.txt'"

### **7. GPT-4**
Finally, I managed to get GPT-4's opinion on the matter by pasting the exact input query that was passed to the 3.5 models into the ChatGPT interface, and I was not disappointed. Here's GPT-4's answer to the question, "Should we raise the corporate tax?"

        "Based on the context provided, the speaker is advocating for a fairer tax system where corporations and the wealthiest Americans pay their fair share. They mention that 55 Fortune 500 corporations earned $40 billion in profits and paid zero dollars in federal income tax, which they believe is not fair. The speaker has proposed a 15% minimum tax rate for corporations. Therefore, the speaker's stance suggests that raising the corporate tax rate could be a viable solution for creating a more equitable tax system and addressing some of the economic challenges mentioned."

### **8. Final thoughts**

1. To be honest, I don't like the prompts solution for the Alignment problem. It's far from elegant, and at best a partial solution; I expect and assume that future GPT versions will be more aligned, at least when it comes to their ability to refrain from hallucinating and to stick to the facts and the context presented to them.
   
2. We need to build elegant, cost-effective, and secure tools and procedures that can apply advanced LLMs' intelligence and reasoning capabilities to our private corpus of documents (and other sources), in a controlled manner, without bias and hallucinations. The LangChain solution is a brave effort, but a very initial one. This is only the beginning, and I believe that soon we shall see better tools and techniques developed by commercial companies and the open-source community.
   
3. Finally, this exercise proves once again that GPT-4 is a very intelligent artificial creature. I don't care much for the terminology debate around AGI, but it's clear to me that a powerful intelligence emerged in its gigantic latent space, which is both scary and beyond amazing. It feels like we are gaining access to an alien super-intelligence from a faraway galaxy.

### **9. The Code**

Here is the code. To run it, you will need an API key from OpenAI, and each query will cost you something like a cent. 

There is also similar code in [test.py](test.py) if you prefer that.

-------------------
**UPDATE April 7, 2023**

Interestingly, I did all my testing on April 6 and wrote this post the day after, and to my surprise, today *text-davinci-003* answered correctly: "**Yes, I've proposed a 15% minimum tax rate for corporations.**". Note, however, that davinci-003's answer is "**I've** proposed...", as if it were Biden himself, which I thought was kind of strange. 

You will have to take my word that this wasn't the case on April 6.
        

In [None]:
!pip install openai tiktoken chromadb langchain

In [None]:
import os
import requests
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

os.environ["OPENAI_API_KEY"] = "YOUR API KEY HERE"

In [None]:
# download the text file we will use - state_of_the_union.txt
input_file = 'state_of_the_union.txt'
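if not os.path.exists(input_file):
    data_url = 'https://github.com/rashlab/AI-Notes/raw/main/langchain/state_of_the_union.txt'
    response = requests.get(data_url)
    response.raise_for_status()  # fail loudly instead of saving an error page as the corpus
    with open(input_file, 'w') as f:
        f.write(response.text)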

loader = TextLoader(input_file)

# Splitting the documents into chunks
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# using OpenAI embeddings, and create a vectorstore using Chroma DB
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(texts, embeddings)

# expose the vectorstore as a retriever that the QA chains can query
retriever = db.as_retriever()

# we will compare two models - the 3.5 Turbo and the Davinci 003 
# we make sure to use temperature=0.0, so that the model will not add any noise to the output
llm_davinci003 = OpenAI(model_name="text-davinci-003", temperature=0.0)
llm_35Turbo = OpenAI(model_name="gpt-3.5-turbo", temperature=0.0)

# create the QA chains - one for each model
# first we create a pair of QA chains w/o sources, and then a pair of QA chains with sources
# QA chains with sources will return the source document, but internally langchain also uses different prompt templates 
qaChain_davinci003 = RetrievalQA.from_chain_type(llm_davinci003, chain_type="stuff", retriever=retriever)
qaChain_35Turbo = RetrievalQA.from_chain_type(llm_35Turbo, chain_type="stuff", retriever=retriever)
qaWithSource_davinci003 = RetrievalQAWithSourcesChain.from_chain_type(llm_davinci003, chain_type="stuff", retriever=retriever)
qaWithSource_35Turbo = RetrievalQAWithSourcesChain.from_chain_type(llm_35Turbo, chain_type="stuff", retriever=retriever)

# we will use the following query, which is a bit tricky for the model...
query = "should we raise the corporate tax?"
# query = "What did the president say about Ketanji Brown Jackson"

In [None]:
qaChain_davinci003.run(query)

In [None]:
qaChain_35Turbo.run(query)

In [None]:
qaWithSource_davinci003({"question": query}, return_only_outputs=False)

In [None]:
qaWithSource_35Turbo({"question": query}, return_only_outputs=False)