# Setting up **LangChain** configurations and pipeline

<a target="_blank" href="https://colab.research.google.com/github/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/blob/liors_branch/Chapter8_notebooks/Ch8_Setting_Up_LangChain_Configurations_and_Pipeline.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**Purpose of this notebook:**  
This notebook sets up a semantic search pipeline using **LangChain**.  
We show how a physician might use such a pipeline to search through their patient notes and find patients by asking questions such as:  
`"Which patients have travelled recently?"`  

We put together the retrieval part of a **RAG** pipeline: **embeddings** stored in a **vector DB**, which gives us an "in-house" search over physician notes.  
We show that such a pipeline, which relies only on similarity search without an LLM, is suboptimal.  
In the next notebook we upgrade it by incorporating an LLM that enhances the search:  
**Ch9_Advanced_LangChain_Configurations_and_Pipeline.ipynb**  

**Requirements:**  
* When running in Colab, use this runtime notebook setting: `Python 3, CPU`  

>*```Disclaimer: The content and ideas presented in this notebook are solely those of the authors and do not represent the views or intellectual property of the authors' employers.```*

Install:

In [1]:
# REMARK:
# If the code below errors out due to a Python package discrepancy, newer package versions may be the cause.
# In that case, set "default_installations" to False to install from the original requirements file instead:
default_installations = True
if default_installations:
    !pip -q install langchain
    !pip -q install langchain-huggingface
    !pip -q install sentence_transformers
    !pip -q install faiss-cpu
else:
    import requests
    text_file_path = "langchain.txt"
    url = "https://raw.githubusercontent.com/python-devops-sre/nlp/master/requirements/" + text_file_path
    res = requests.get(url)
    with open(text_file_path, "w") as f:
        f.write(res.text)

    !pip install -r langchain.txt

Imports:

In [2]:
import requests
from langchain.document_loaders import TextLoader
import textwrap
from langchain.text_splitter import CharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

Code Settings:

In [3]:
# In the data file we're using, this short string is a delimiter between different clinical reports:
split_text_by = '"Title: Mocked up record'
chunk_size = 2000
chunk_overlap = 0

### Load Text File With Mocked Physician Notes
These files hold the information we are looking to tap into.  
In this particular example, we concatenated all the mocked reports into a single .csv file to keep the loading short and simple:  

In [4]:
text_file_path = "mocked_up_physician_records.csv"
url = "https://raw.githubusercontent.com/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/main/Chapter8_notebooks/" + text_file_path
res = requests.get(url)

with open(text_file_path, "w") as f:
    f.write(res.text)

Load the text content of the file:

In [5]:
# Document Loader
text_loader = TextLoader(text_file_path)
documents = text_loader.load()

Observe the LangChain variable type (knowing the type helps when manipulating the object later):

In [6]:
print(type(documents[0]))

<class 'langchain_core.documents.base.Document'>


See the count of documents:

In [7]:
len(documents)

1

An example of accessing the raw text:

In [8]:
print(documents[0].page_content[0:200])

"Title: Mocked up record
Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain

History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old ma


### Process the data to prepare it for embedding

In [9]:
# Text Splitter
text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=split_text_by)
splitted_docs = text_splitter.split_documents(documents)

In [10]:
len(splitted_docs)

4
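
To make the split step more concrete, here is a simplified sketch of separator-based chunking. It is not LangChain's implementation: `split_by_separator` and the two-record `sample` string are hypothetical. It assumes the splitter cuts the text at every occurrence of the separator, drops empty pieces, and merges neighbouring pieces only while the merged text still fits within `chunk_size`.

```python
from typing import List


def split_by_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Cut text at each separator, then merge neighbours while they fit in chunk_size."""
    # Drop empty pieces, e.g. the one before the very first separator.
    pieces = [piece.strip() for piece in text.split(separator) if piece.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would exceed the limit, so close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical two-record sample in the same spirit as the mocked-up file.
separator = "Title: Mocked up record"
sample = (
    "Title: Mocked up record\nPatient ID: 1\nChief Complaint: Abdominal pain\n"
    "Title: Mocked up record\nPatient ID: 2\nChief Complaint: Pregnancy follow-up\n"
)

small_chunks = split_by_separator(sample, separator, chunk_size=60)
large_chunks = split_by_separator(sample, separator, chunk_size=2000)
print(len(small_chunks), len(large_chunks))
```

In this sketch a small `chunk_size` keeps the two short records apart, while a large one merges them into a single chunk. A single piece that is longer than `chunk_size` is kept whole rather than cut.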

In [11]:
print(splitted_docs[0].page_content)

Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain

History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr. Anderson recently returned from a business trip to Europe about two weeks ago. He denies any respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr. Anderson revealed a pertinent family history of cardiovascular disease, with his father having suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type 2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in severity. The pain is exacerbat

### Creating the embeddings that will be stored in the vector database
We use an open-source model from Hugging Face.

In [13]:
from tqdm.autonotebook import tqdm, trange

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

### Create the vector database

For the vector database we picked FAISS (Facebook AI Similarity Search).  
More about it here: https://python.langchain.com/en/latest/modules/indexes/vectorstores.html

In [14]:
vector_db = FAISS.from_documents(splitted_docs, embeddings)
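
Conceptually, a vector store keeps one vector per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below imitates that idea with plain Python lists and Euclidean distance, which is one common choice of metric. The class `ToyVectorIndex`, the 3-dimensional vectors, and the labels are made up for illustration; a real index stores the model's embeddings and uses an optimized search.

```python
import math
from typing import List, Sequence, Tuple


class ToyVectorIndex:
    """Brute-force nearest-neighbour search over a handful of vectors."""

    def __init__(self) -> None:
        self._items: List[Tuple[Sequence[float], str]] = []

    def add(self, vector: Sequence[float], label: str) -> None:
        self._items.append((vector, label))

    def search(self, query: Sequence[float], k: int = 1) -> List[Tuple[str, float]]:
        # Score every stored vector by its distance to the query (smaller is closer).
        scored = [(label, math.dist(vector, query)) for vector, label in self._items]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]


index = ToyVectorIndex()
# Hypothetical vectors standing in for the embeddings of three notes.
index.add([0.9, 0.1, 0.0], "abdominal pain note")
index.add([0.1, 0.9, 0.0], "pregnancy follow-up note")
index.add([0.0, 0.2, 0.9], "some other note")

results = index.search([0.2, 0.8, 0.1], k=2)
for label, distance in results:
    print(f"{label}: {distance:.3f}")
```

The search always returns the `k` closest vectors, however far away they are. Nothing in this step checks whether the closest chunk actually answers the question.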

### Perform similarity search based on our "in-house" documents

**Question #1: Are there any pregnant patients who are due to deliver in August?**  

In [15]:
query1 = "Are there any pregnant patients who are due to deliver in August?"

docs = vector_db.similarity_search(query1)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: July 10, 2099
Patient ID: 246813579
Chief Complaint: Pregnancy Follow-
up

History of Present Illness:
The patient, Mrs. Emily Adams, a 30-year-old female, presents today
for a routine pregnancy follow-up. She is currently 32 weeks pregnant, with a due date of August
27th, 2099. Mrs. Adams is married and lives with her husband.

During the evaluation, Mrs. Adams
reveals a family history of gestational diabetes, with her mother having developed the condition
during her own pregnancies. She mentions no personal history of significant medical conditions,
surgeries, or complications in previous pregnancies.

Regarding her chief complaint, Mrs. Adams
reports typical discomforts associated with the third trimester of pregnancy, including backache,
frequent urination, and occasional heartburn. She denies any vaginal bleeding, severe abdominal
pain, or significant changes in fetal movements. Mrs. Adams mentions adhering to a well-balanced
diet and regular exercise
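
The odd line breaks in the printed output (for example `Follow-` / `up`) come from the wrapping step, not from the data. With `replace_whitespace=False`, `textwrap.fill` still treats the whole text as one paragraph, so the original newlines end up in the middle of wrapped lines. The Python documentation suggests splitting the text into lines first and wrapping each one separately. The helper below is a minimal sketch of that approach; the name `wrap_preserving_lines` and the sample string are hypothetical.

```python
import textwrap


def wrap_preserving_lines(text: str, width: int = 100) -> str:
    """Wrap each original line separately so existing line breaks stay intact."""
    wrapped_lines = []
    for line in text.splitlines():
        # textwrap.fill returns an empty string for an empty line,
        # which keeps the blank lines between sections.
        wrapped_lines.append(textwrap.fill(line, width=width))
    return "\n".join(wrapped_lines)


sample_note = (
    "Chief Complaint: Pregnancy Follow-up\n"
    "\n"
    "History of Present Illness:\n"
    "The patient presents today for a routine pregnancy follow-up."
)
wrapped = wrap_preserving_lines(sample_note, width=40)
print(wrapped)
```

With the notebook's objects, the same helper could be applied to `docs[0].page_content` in place of the `textwrap.fill(...)` call.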

**[EXAMPLE OF A MISTAKE!] Question #2: Are there any pregnant patients who are due to deliver in September?**  
This is an example where the similarity search gets it **wrong!**  
It did return text that is similar to the question, but the returned patient is due in August, not September. Being similar to the question is not the same as answering it correctly.  

In [16]:
query2 = "Are there any pregnant patients who are due to deliver in September?"

docs = vector_db.similarity_search(query2)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: July 10, 2099
Patient ID: 246813579
Chief Complaint: Pregnancy Follow-
up

History of Present Illness:
The patient, Mrs. Emily Adams, a 30-year-old female, presents today
for a routine pregnancy follow-up. She is currently 32 weeks pregnant, with a due date of August
27th, 2099. Mrs. Adams is married and lives with her husband.

During the evaluation, Mrs. Adams
reveals a family history of gestational diabetes, with her mother having developed the condition
during her own pregnancies. She mentions no personal history of significant medical conditions,
surgeries, or complications in previous pregnancies.

Regarding her chief complaint, Mrs. Adams
reports typical discomforts associated with the third trimester of pregnancy, including backache,
frequent urination, and occasional heartburn. She denies any vaginal bleeding, severe abdominal
pain, or significant changes in fetal movements. Mrs. Adams mentions adhering to a well-balanced
diet and regular exercise
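
One way to see why this happens: the August and September questions differ by a single word, so any scoring that rewards overall overlap places them close to the same note. The sketch below uses a deliberately crude bag-of-words cosine similarity as a stand-in for the embedding model (the real model is a neural network, not word counts). The function names and the `notes` dictionary are hypothetical; the two sentences are taken from the notes shown above.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


notes = {
    "pregnancy": "She is currently 32 weeks pregnant, with a due date of August 27th, 2099.",
    "abdominal": "Mr. Anderson recently returned from a business trip to Europe about two weeks ago.",
}
query_august = "Are there any pregnant patients who are due to deliver in August?"
query_september = "Are there any pregnant patients who are due to deliver in September?"

for query in (query_august, query_september):
    best = max(notes, key=lambda name: bow_cosine(query, notes[name]))
    print(query, "->", best)
```

Both questions pick the pregnancy note, even though it never mentions September. The closest match is returned regardless of whether it satisfies the condition in the question.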

**Question #3: Which patients have travelled recently?**

In [17]:
query3 = "Which patients have travelled recently?"

docs = vector_db.similarity_search(query3)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain
History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with
a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr.
Anderson recently returned from a business trip to Europe about two weeks ago. He denies any
respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr.
Anderson revealed a pertinent family history of cardiovascular disease, with his father having
suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type
2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or
hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a
dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in
severity. The pain is exacerbate

**Question #4: Which patients require lab work?**  

In [18]:
query4 = "Which patients require lab work?"

docs = vector_db.similarity_search(query4)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain
History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with
a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr.
Anderson recently returned from a business trip to Europe about two weeks ago. He denies any
respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr.
Anderson revealed a pertinent family history of cardiovascular disease, with his father having
suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type
2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or
hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a
dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in
severity. The pain is exacerbate
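
Each query above prints only `docs[0]`, the single closest chunk, although the search returns a ranked list. For a question such as "Which patients require lab work?", which may apply to more than one patient, it can help to glance at every returned chunk. The helper below is a hypothetical sketch: `preview_hits` and the stand-in `Hit` class are not part of LangChain, and the code only assumes that each result exposes a `page_content` string, as the documents above do.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    """Stand-in for a search result; only the page_content attribute is used."""
    page_content: str


def preview_hits(hits: Iterable, max_chars: int = 80) -> List[str]:
    """Return one short, ranked summary line per search result."""
    lines = []
    for rank, hit in enumerate(hits, start=1):
        # Join the non-empty lines of the chunk so the header fields fit on one line.
        flat = " | ".join(line for line in hit.page_content.splitlines() if line.strip())
        lines.append(f"{rank}. {flat[:max_chars]}")
    return lines


sample_hits = [
    Hit("Physician Name: Dr. ABC\nDate: June 25, 2099\nPatient ID: 987654321"),
    Hit("Physician Name: Dr. ABC\nDate: July 10, 2099\nPatient ID: 246813579"),
]
preview = preview_hits(sample_hits)
for line in preview:
    print(line)
```

With the notebook's objects this would be called as `preview_hits(docs)`.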