# Applying advanced **LangChain** configurations and pipeline

<a target="_blank" href="https://colab.research.google.com/github/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/blob/liors_branch/Chapter9_notebooks/Ch9_Advanced_LangChain_Configurations_and_Pipeline.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**Purpose of this notebook:**  
This notebook enhances the pipeline we built in Chapter 8 in the following notebook:  
**Ch8_Setting_Up_LangChain_Configurations_and_Pipeline.ipynb**

We put together a complete **RAG** pipeline that includes **embeddings** and their storage in a **vector DB**, so that we can run an "in-house" search over physician notes.  
Unlike the previous notebook, here an LLM reviews the search results, which helps avoid mistakes caused by mismatched retrievals.  

**Requirements:**  
* When running in Colab, use this runtime notebook setting: `Python 3, CPU`  
* This code uses OpenAI's API as the LLM of choice, so a paid **API key** is necessary.   

Install:

In [1]:
# REMARK:
# If the code below errors out due to a Python package discrepancy, newer package versions may be the cause.
# In that case, set "default_installations" to False to revert to the original image:
default_installations = True
if default_installations:
    !pip -q install langchain
    !pip -q install langchain-huggingface
    !pip -q install langchain-openai
    !pip -q install sentence_transformers
    !pip -q install faiss-cpu
    !pip -q install openai
else:
    import requests
    text_file_path = "advanced_langchain.txt"
    url = "https://raw.githubusercontent.com/python-devops-sre/nlp/master/requirements/" + text_file_path
    res = requests.get(url)
    with open(text_file_path, "w") as f:
      f.write(res.text)

    !pip install -r advanced_langchain.txt

Imports:

In [2]:
import requests
from langchain.document_loaders import TextLoader
import textwrap
from langchain.text_splitter import CharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

Code Settings:

In [3]:
# In the data file we're using, this short string is the delimiter between different clinical reports:
split_text_by = '"Title: Mocked up record'
chunk_size = 2000
chunk_overlap = 0

Define OpenAI's API key:  
**You must provide a key and paste it as a string!**  

In [4]:
openai_api_key = "..."
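
Pasting the key into the notebook works, but a placeholder left unchanged only fails later, when the LLM is first called. As a hedged alternative, the sketch below falls back to an environment variable and fails early if no key is found. The helper name `resolve_api_key` and its parameters are our own illustration and are not part of LangChain or the OpenAI client; we assume the key is stored under `OPENAI_API_KEY`.

```python
import os


def resolve_api_key(pasted_key, env=os.environ, env_var="OPENAI_API_KEY"):
    """Return the pasted key, or fall back to an environment variable.

    `resolve_api_key` is a hypothetical helper written for this notebook.
    """
    # Use the pasted value unless it is empty or still the "..." placeholder.
    if pasted_key and pasted_key != "...":
        return pasted_key
    key = env.get(env_var, "")
    if not key:
        raise ValueError(
            f"No API key found: paste one as a string or set {env_var}."
        )
    return key


# Demonstration with a stand-in environment and a made-up key:
print(resolve_api_key("...", env={"OPENAI_API_KEY": "sk-made-up-example"}))
```

In the notebook itself, this could be used as `openai_api_key = resolve_api_key(openai_api_key)` right after the cell above.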

### Load Text File With Mocked Physician Notes
These files hold the information we are looking to tap into.  
In this particular example, we concatenated all the mocked reports into a single .csv file, just to keep the loading short and simple:  

In [5]:
text_file_path = "mocked_up_physician_records.csv"
url = "https://raw.githubusercontent.com/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/main/Chapter8_notebooks/" + text_file_path
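res = requests.get(url)
# Fail early if the download did not succeed, rather than saving an error page:
res.raise_for_status()
with open(text_file_path, "w") as f:
  f.write(res.text)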

Load the text content of the file:

In [6]:
# Document Loader
text_loader = TextLoader(text_file_path)
documents = text_loader.load()

Observe the LangChain variable type (knowing it is useful for manipulating the documents later):

In [7]:
print(type(documents[0]))

<class 'langchain_core.documents.base.Document'>


See the count of documents:

In [8]:
len(documents)

1

Showcasing an example of accessing the raw text:

In [9]:
print(documents[0].page_content[0:200])

"Title: Mocked up record
Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain

History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old ma


### Process the data to prepare it for embedding

In [10]:
# Text Splitter
text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=split_text_by)
splitted_docs = text_splitter.split_documents(documents)

In [11]:
len(splitted_docs)

4
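
To make the splitting step more concrete, here is a simplified sketch of the "split on a separator, then merge up to a size limit" idea. It is our own illustration and not LangChain's implementation, which also handles overlap and other options; the function name `split_and_merge` and the toy records are hypothetical.

```python
def split_and_merge(text, separator, chunk_size):
    """Split `text` on `separator`, then merge neighbors while they fit in `chunk_size`."""
    # Split on the separator and drop empty pieces and surrounding whitespace.
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would exceed the limit: close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


toy_text = (
    "## Record\nPatient A: abdominal pain.\n"
    "## Record\nPatient B: pregnancy follow-up.\n"
    "## Record\nPatient C: annual check-up.\n"
)
small_chunks = split_and_merge(toy_text, "## Record", chunk_size=40)
large_chunks = split_and_merge(toy_text, "## Record", chunk_size=200)
print(len(small_chunks), len(large_chunks))
```

With a small size limit, each record stays in its own chunk; with a generous limit, neighboring records are merged. This is why the choice of `chunk_size` relative to the record length matters for keeping one clinical report per chunk.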

In [12]:
print(splitted_docs[0].page_content)

Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain

History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr. Anderson recently returned from a business trip to Europe about two weeks ago. He denies any respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr. Anderson revealed a pertinent family history of cardiovascular disease, with his father having suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type 2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in severity. The pain is exacerbat

### Creating the embeddings that will be stored in the vector database
Using an open-source model from Hugging Face.

In [13]:
from tqdm.autonotebook import tqdm, trange

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")


### Create the vector database

For the vector database, we picked FAISS (Facebook AI Similarity Search).  
More about it here: https://python.langchain.com/en/latest/modules/indexes/vectorstores.html

In [14]:
vector_db = FAISS.from_documents(splitted_docs, embeddings)

### Perform similarity search based on our "in-house" documents

**Question #1: Are there any pregnant patients who are due to deliver in August?**  

In [15]:
query1 = "Are there any pregnant patients who are due to deliver in August?"

docs = vector_db.similarity_search(query1)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: July 10, 2099
Patient ID: 246813579
Chief Complaint: Pregnancy Follow-
up

History of Present Illness:
The patient, Mrs. Emily Adams, a 30-year-old female, presents today
for a routine pregnancy follow-up. She is currently 32 weeks pregnant, with a due date of August
27th, 2099. Mrs. Adams is married and lives with her husband.

During the evaluation, Mrs. Adams
reveals a family history of gestational diabetes, with her mother having developed the condition
during her own pregnancies. She mentions no personal history of significant medical conditions,
surgeries, or complications in previous pregnancies.

Regarding her chief complaint, Mrs. Adams
reports typical discomforts associated with the third trimester of pregnancy, including backache,
frequent urination, and occasional heartburn. She denies any vaginal bleeding, severe abdominal
pain, or significant changes in fetal movements. Mrs. Adams mentions adhering to a well-balanced
diet and regular exercise

**[EXAMPLE OF A MISTAKE!] Question #2: Are there any pregnant patients who are due to deliver in September?**  
This is an example where the similarity search gets it **wrong!**  
It did return text that is similar to the question, but the returned record has an August due date, not a September one. Being similar to the question is not the same as answering it correctly.  

In [16]:
query2 = "Are there any pregnant patients who are due to deliver in September?"

docs = vector_db.similarity_search(query2)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: July 10, 2099
Patient ID: 246813579
Chief Complaint: Pregnancy Follow-
up

History of Present Illness:
The patient, Mrs. Emily Adams, a 30-year-old female, presents today
for a routine pregnancy follow-up. She is currently 32 weeks pregnant, with a due date of August
27th, 2099. Mrs. Adams is married and lives with her husband.

During the evaluation, Mrs. Adams
reveals a family history of gestational diabetes, with her mother having developed the condition
during her own pregnancies. She mentions no personal history of significant medical conditions,
surgeries, or complications in previous pregnancies.

Regarding her chief complaint, Mrs. Adams
reports typical discomforts associated with the third trimester of pregnancy, including backache,
frequent urination, and occasional heartburn. She denies any vaginal bleeding, severe abdominal
pain, or significant changes in fetal movements. Mrs. Adams mentions adhering to a well-balanced
diet and regular exercise
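
The result above can be reproduced in miniature. The sketch below uses a toy bag-of-words "embedding" and cosine similarity as stand-ins for the real embedding model and FAISS; the helper names `embed` and `cosine` and the toy notes are our own assumptions. It shows that a nearest-neighbor search always returns the closest text, even when that text does not satisfy the question.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


toy_notes = [
    "Pregnant patient with a due date in August.",
    "Patient with abdominal pain after a business trip.",
]
toy_query = "Are there any pregnant patients due in September?"

scores = [cosine(embed(toy_query), embed(note)) for note in toy_notes]
best_note = toy_notes[scores.index(max(scores))]
print(best_note)
```

The August note wins because it shares the most wording with the question, yet it never mentions September. Something downstream still has to check whether the retrieved text actually answers the question.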

**Question #3: Which patients have travelled recently?**

In [17]:
query3 = "Which patients have travelled recently?"

docs = vector_db.similarity_search(query3)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain
History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with
a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr.
Anderson recently returned from a business trip to Europe about two weeks ago. He denies any
respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr.
Anderson revealed a pertinent family history of cardiovascular disease, with his father having
suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type
2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or
hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a
dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in
severity. The pain is exacerbate

**Question #4: Which patients require lab work?**  

In [18]:
query4 = "Which patients require lab work?"

docs = vector_db.similarity_search(query4)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain
History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with
a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr.
Anderson recently returned from a business trip to Europe about two weeks ago. He denies any
respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr.
Anderson revealed a pertinent family history of cardiovascular disease, with his father having
suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type
2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or
hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a
dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in
severity. The pain is exacerbate

****
Clear some memory before the next part (relevant for when opting for a locally hosted LLM):

In [19]:
import sys

# List the variables whose own (shallow) size exceeds 999 bytes:
local_vars = list(locals().items())
for var, obj in local_vars:
  if sys.getsizeof(obj) > 999:
    print(var, sys.getsizeof(obj))

TextLoader 1704
CharacterTextSplitter 1704
HuggingFaceEmbeddings 1704
FAISS 1704
tqdm 1704


In [20]:
import gc
del CharacterTextSplitter
del HuggingFaceEmbeddings
del TextLoader
del FAISS
gc.collect()

0
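
One caveat when reading the sizes printed above: `sys.getsizeof` reports only an object's own (shallow) size, not the size of the objects it refers to. The following minimal sketch, with the hypothetical names `big_text` and `wrapper`, shows how a container can look small while holding on to a large object.

```python
import sys

# A large string, and a one-element list that merely refers to it.
big_text = "x" * 1_000_000
wrapper = [big_text]

# Shallow size: the list object itself, excluding what it points to.
shallow_size = sys.getsizeof(wrapper)

# A rough "deep" size: add the sizes of the referenced items.
deep_size = shallow_size + sum(sys.getsizeof(item) for item in wrapper)

print(shallow_size, deep_size)
```

A size filter based on `sys.getsizeof` alone can therefore miss the objects that actually hold most of the memory.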

****

# The enhancement to the Chapter 8 notebook: Setting up an LLM to process the requests
We will now enhance that pipeline. Instead of settling for the results of the similarity search and surfacing them directly to the physicians, we will take the results that were deemed similar in content to the request and employ an LLM to go through them, vet them, and tell us which ones are indeed relevant to the physician.

In [21]:
!pip -q install openai gpt4all

In [22]:
!wget https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML/resolve/main/nous-hermes-13b.ggmlv3.q4_0.bin

'wget' is not recognized as an internal or external command,
operable program or batch file.


In [23]:
import os
import langchain
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import ChatOpenAI
from langchain.llms import GPT4All

### Setting up an LLM: Choose between a paid LLM (OpenAI's GPT) and a free LLM (from Hugging Face)

In [24]:
paid_vs_free = "paid"

# The path to the GPT4all .bin file (this suits running in Google Colab):
path_to_bin = "./nous-hermes-13b.ggmlv3.q4_0.bin"

# The backend LLM:
# "gptj", "llama", etc.
backend_llm = "llama"

In [25]:
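if paid_vs_free == "paid":
  os.environ["OPENAI_API_KEY"] = openai_api_key
  llm = ChatOpenAI()
elif paid_vs_free == "free":
  llm = GPT4All(
    model=path_to_bin,
    max_tokens=1000,
    # backend=backend_llm,
    verbose=False)
else:
  # Guard against typos, which would otherwise leave `llm` undefined:
  raise ValueError(f'paid_vs_free must be "paid" or "free", got {paid_vs_free!r}')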

### Creating a QA chain
`load_qa_chain()` sets up the question-answering step of the RAG pipeline. It does not retrieve anything itself: retrieval is still done by the vector database.  
With `chain_type="stuff"`, it places ("stuffs") all the documents it is given into a single prompt as context, adds the user's question, and sends that prompt to the chosen LLM so it can respond with the necessary context.  
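
As a rough illustration of what the "stuff" chain type does with the documents it receives, the sketch below joins all texts into one context block and appends the question. The function `build_stuff_prompt` and the prompt wording are hypothetical, written for this notebook; they are not LangChain's actual template.

```python
def build_stuff_prompt(doc_texts, question):
    """Join all retrieved texts into one context block and append the question.

    The wording of this prompt is an assumption, not LangChain's template.
    """
    context = "\n\n".join(doc_texts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


toy_docs = [
    "Mrs. Adams is 32 weeks pregnant, with a due date of August 27th.",
    "Mr. Anderson recently returned from a business trip to Europe.",
]
toy_prompt = build_stuff_prompt(
    toy_docs, "Is any patient due to deliver in September?"
)
print(toy_prompt)
```

Because every document ends up in a single prompt, this approach suits a small number of short documents; the combined text has to fit in the LLM's context window.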

In [26]:
chain = load_qa_chain(llm, chain_type="stuff")

### Now search based on the same requirements but using the LLM as the "brain" instead of embeddings similarity

**Question #1: Are there any pregnant patients who are due to deliver in August?**  

In [27]:
import langchain
langchain.debug = True

In [28]:
current_query = query1
docs = vector_db.similarity_search(current_query)
print(chain.run(input_documents=docs, question=current_query))


[32;1m[1;3m[chain/start][0m [1m[chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Are there any pregnant patients who are due to deliver in August?",
  "context": "Physician Name: Dr. ABC\nDate: July 10, 2099\nPatient ID: 246813579\nChief Complaint: Pregnancy Follow-up\n\nHistory of Present Illness:\nThe patient, Mrs. Emily Adams, a 30-year-old female, presents today for a routine pregnancy follow-up. She is currently 32 weeks pregnant, with a due date of August 27th, 2099. Mrs. Adams is married and lives with her husband.\n\nDuring the evaluation, Mrs. Adams reveals a family history of gestational diabetes, with her mother having developed the condition during her own pregnancies. She mentions no personal history of significant medical conditions, surgeries, or complications in previous pregnancies.\n\nRegarding her chief complain

AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: .... You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

**[The LLM by OpenAI caught and avoided the MISTAKE!] Question #2: Are there any pregnant patients who are due to deliver in September?**  
However, some free LLMs that are quantized, and thus "suboptimal", may fail and present the August delivery date as the answer, even though a September delivery date was requested.

In [None]:
current_query = query2
docs = vector_db.similarity_search(current_query)

print(chain.run(input_documents=docs, question=current_query))

**Question #3: Which patients have travelled recently?**

In [None]:
current_query = query3
docs = vector_db.similarity_search(current_query)

print(chain.run(input_documents=docs, question=current_query))

**Question #4: Which patients require lab work?**  

In [None]:
current_query = query4
docs = vector_db.similarity_search(current_query)

print(chain.run(input_documents=docs, question=current_query))

**Refining question #4: Which patients *explicitly* require lab work?**  

In [None]:
current_query = "Which patients explicitly require lab work?"
docs = vector_db.similarity_search(current_query)

print(chain.run(input_documents=docs, question=current_query))