# Research Notebook

Steps:
1. Load data: the Encyclopedia of Infectious Diseases book PDF at `../data/EMD.pdf`.
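2. Embed the text chunks and store them in a Pinecone index.
3. Set up a retriever over the index.
4. Set up the OpenAI LLM and answer questions with a retrieval chain.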

## Step 1: Read the book PDF and extract text

In [2]:
from langchain.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [3]:
# Extract text from PDF files
def load_pdfs_from_directory(directory_path):
    loader = DirectoryLoader(directory_path, glob="*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()
    return documents

In [9]:
extracted_docs = load_pdfs_from_directory("../data")
print(f"Number of documents loaded: {len(extracted_docs)}")
print(extracted_docs[50])  # Print one page (index 50) to verify the extraction

Number of documents loaded: 787
page_content='COLOR PLATES
Fig. 32.6. “Urban pixels” distribution and the number of epidem-
ic months among the 103 sub districts (199 7–1998 Dengue out-
break, Nakhon Pathom Province,Thailand). See text for full caption.
Fig. 32.7. Pig farm study site (obtained by Google° Earth). See text for
full caption.
Fig. 32.10. Breeding site dynamics of Rift Valley Fever vectors: evolution of cumulative rainfall and
abundance variations of A. v . arabiensisand C. poicilipes females. See text for full caption.' metadata={'producer': 'Acrobat Distiller 11.0 (Windows)', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': '2018-10-17T11:50:32+02:00', 'author': 'Tibayrenc, Michel , ed.,', 'keywords': 'AFRIQUE / MALADIE; INFECTION; EPIDEMIOLOGIE; EMERGENCE; SIDA; AGENT PATHOGENE; PHYLOGENIE; BIOLOGIE MOLECULAIRE; IMMUNOLOGIE; ANTIGENE; LEISHMANIOSE; MALADIE DES PLANTES; PALUDISME; PARASITE; HISTOIRE; VACCINATION; ECOLOGIE; VIRUS; HOTE', 'moddate': '2018-10-17T16:2

### Split into smaller chunks

In [33]:
# Split documents into smaller chunks
def split_documents(documents, chunk_size=500, chunk_overlap=50):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    text_chunks = text_splitter.split_documents(documents)
    return text_chunks

In [34]:
text_chunks = split_documents(extracted_docs)
print(f"Number of text chunks created: {len(text_chunks)}")

Number of text chunks created: 8251
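
To make the `chunk_size` and `chunk_overlap` arguments concrete, here is a minimal sketch of overlapping fixed-size windows. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally (that class first tries to break on separators such as paragraphs, lines and words); `sliding_window_chunks` is a hypothetical helper written only for this illustration.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each printed window starts with the last three characters of the previous one. The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk.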


## Step 2: Get the embeddings of the text

In [35]:
from langchain.embeddings import HuggingFaceEmbeddings

def download_embeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Download and return HuggingFace embeddings model."""
    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    return embeddings

embeddings = download_embeddings()
print("Embeddings model downloaded successfully.")

Embeddings model downloaded successfully.


In [36]:
print(embeddings)
print(len(embeddings.embed_query("Disease")))  # Example usage

client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
) model_name='sentence-transformers/all-MiniLM-L6-v2' cache_folder=None model_kwargs={} encode_kwargs={} multi_process=False show_progress=False
384
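
The model output above ends with a `Normalize()` layer, and the Pinecone index in this notebook is created with `metric="cosine"`. As a rough sketch of what that metric measures, the snippet below computes cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for the real 384-dimensional embeddings, and `cosine_similarity` is a hypothetical helper, not part of any library used here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, different length
orthogonal = [0.0, 3.0, 0.0]      # shares no component with the query

score_same = cosine_similarity(query, same_direction)
score_orthogonal = cosine_similarity(query, orthogonal)
print(score_same, score_orthogonal)
```

The first score is approximately 1 and the second is 0: only direction matters, not length. For vectors that are already unit-length, as the `Normalize()` layer suggests, cosine similarity reduces to a plain dot product.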


### Get env keys

In [13]:
from dotenv import load_dotenv
import os
load_dotenv()

True

In [14]:
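PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Fail early with a clear message; assigning None to os.environ raises a TypeError
if not PINECONE_API_KEY or not OPENAI_API_KEY:
    raise RuntimeError("PINECONE_API_KEY and OPENAI_API_KEY must be set in .env")

os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY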

### Set up Pinecone and create index

In [16]:
from pinecone import Pinecone 
pinecone_api_key = PINECONE_API_KEY

pc = Pinecone(api_key=pinecone_api_key)
pc

<pinecone.pinecone.Pinecone at 0x147e1fb10>

In [37]:
from pinecone import ServerlessSpec

index_name = "medical-research"

if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=384,  # Must match the embedding size (384 for all-MiniLM-L6-v2)
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(index_name)
index

<pinecone.db_data.index.Index at 0x3269d3550>

In [38]:
from langchain_pinecone import PineconeVectorStore

# Embed each chunk and add it to the Pinecone index
docsearch = PineconeVectorStore.from_documents(
    documents=text_chunks,
    embedding=embeddings,
    index_name=index_name,
)

In [39]:
# Load the existing index without re-embedding the chunks
from langchain_pinecone import PineconeVectorStore

docsearch = PineconeVectorStore.from_existing_index(
    embedding=embeddings,
    index_name=index_name,
)
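
Re-running the `from_documents` cell embeds and uploads every chunk again, and the retrieved documents in this notebook carry random-looking UUIDs, so a second run would likely store duplicates. One possible mitigation is to derive a deterministic ID from each chunk, sketched below. `stable_chunk_id` is a hypothetical helper. Passing the resulting list to the vector store (for example through an `ids` argument) is an assumption that should be checked against the installed `langchain_pinecone` version.

```python
import hashlib


def stable_chunk_id(source, page, text):
    """Derive a deterministic ID from a chunk's source file, page and text."""
    key = f"{source}|{page}|{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:32]


# The same chunk always maps to the same ID; a different page gives a new ID.
id_a = stable_chunk_id("../data/EMD.pdf", 252, "example chunk text")
id_b = stable_chunk_id("../data/EMD.pdf", 252, "example chunk text")
id_c = stable_chunk_id("../data/EMD.pdf", 253, "example chunk text")
print(id_a == id_b, id_a == id_c)
```

In this notebook the IDs could be built with something like `[stable_chunk_id(c.metadata["source"], c.metadata["page"], c.page_content) for c in text_chunks]`, so that an upload of the same chunk overwrites the earlier vector instead of adding a new one.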

## Set up the retriever

In [42]:
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k": 3})
print(retriever)  # Verify the retriever object

tags=['PineconeVectorStore', 'HuggingFaceEmbeddings'] vectorstore=<langchain_pinecone.vectorstores.PineconeVectorStore object at 0x10e137910> search_kwargs={'k': 3}
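
With `search_type="similarity"` and `k=3`, the retriever returns the three chunks whose embeddings score highest against the query embedding. The sketch below shows the idea as a brute-force search over a toy in-memory index. Pinecone does not work this way internally (it uses its own index structures), and both the two-dimensional vectors and the `top_k` helper are invented for illustration.

```python
import math


def top_k(query_vec, indexed, k=3):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    scored = [(cosine(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return scored[:k]


# Toy index: (chunk text, made-up 2-d embedding)
toy_index = [
    ("influenza overview", [0.9, 0.1]),
    ("dengue outbreak", [0.1, 0.9]),
    ("avian influenza strains", [0.8, 0.3]),
    ("rift valley fever", [0.1, 1.0]),
]

results = top_k([1.0, 0.0], toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Raising `k` gives the LLM more context at the cost of a longer prompt; lowering it risks missing the passage that holds the answer.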


In [50]:
retrieved_docs = retriever.invoke("What is Influenza?")
retrieved_docs

[Document(id='11919bb3-ce02-474c-9c29-c3d573b48d42', metadata={'author': 'Tibayrenc, Michel , ed.,', 'creationdate': '2018-10-17T11:50:32+02:00', 'creator': 'PScript5.dll Version 5.2.2', 'keywords': 'AFRIQUE / MALADIE; INFECTION; EPIDEMIOLOGIE; EMERGENCE; SIDA; AGENT PATHOGENE; PHYLOGENIE; BIOLOGIE MOLECULAIRE; IMMUNOLOGIE; ANTIGENE; LEISHMANIOSE; MALADIE DES PLANTES; PALUDISME; PARASITE; HISTOIRE; VACCINATION; ECOLOGIE; VIRUS; HOTE', 'moddate': '2018-10-17T16:29:56+02:00', 'page': 252.0, 'page_label': '253', 'producer': 'Acrobat Distiller 11.0 (Windows)', 'source': '../data/EMD.pdf', 'subject': '2007, 050EPID', 'title': 'Encyclopedia of infectious diseases : modern methodologies', 'total_pages': 787.0}, page_content='tissues other than those typically infected in birds. Highly\npathogenic avian inﬂuenza strains can evolve from low-\npathogenicity strains by a few point mutations in the HA\ngene; the accumulation of such residues in an evolutionary\nlineage over time is thus a cause fo

In [52]:
retrieved_docs = retriever.invoke("What is treatment of Influenza?")
retrieved_docs

[Document(id='37cb3226-fe55-4bf2-bcab-be3bc50fb177', metadata={'author': 'Tibayrenc, Michel , ed.,', 'creationdate': '2018-10-17T11:50:32+02:00', 'creator': 'PScript5.dll Version 5.2.2', 'keywords': 'AFRIQUE / MALADIE; INFECTION; EPIDEMIOLOGIE; EMERGENCE; SIDA; AGENT PATHOGENE; PHYLOGENIE; BIOLOGIE MOLECULAIRE; IMMUNOLOGIE; ANTIGENE; LEISHMANIOSE; MALADIE DES PLANTES; PALUDISME; PARASITE; HISTOIRE; VACCINATION; ECOLOGIE; VIRUS; HOTE', 'moddate': '2018-10-17T16:29:56+02:00', 'page': 260.0, 'page_label': '261', 'producer': 'Acrobat Distiller 11.0 (Windows)', 'source': '../data/EMD.pdf', 'subject': '2007, 050EPID', 'title': 'Encyclopedia of infectious diseases : modern methodologies', 'total_pages': 787.0}, page_content='available in the meantime. However, even the NA inhibitors\nmay only provide short-term protection. Resistance to\nthese drugs will undoubtedly evolve in due time. The\ninfluenza virus has already shown us its incredible evolu-\ntionary flexibility.There is no predicting 

## Set up the OpenAI LLM

In [44]:
from langchain_openai import ChatOpenAI

chatModel = ChatOpenAI(model="gpt-4o")

In [45]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

In [46]:
system_prompt = (
    "You are a medical assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)


prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [47]:

question_answer_chain = create_stuff_documents_chain(chatModel, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
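
The "stuff" strategy simply places all retrieved chunks into the `{context}` slot of one prompt. The following sketch reproduces that step in plain Python so the final prompt is easy to inspect. `build_messages` is a hypothetical helper, and joining chunks with a blank line is an assumption about the chain's default separator.

```python
def build_messages(system_template, page_contents, question, separator="\n\n"):
    """Join retrieved chunk texts and fill them into the {context} slot."""
    context = separator.join(page_contents)
    return [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]


template = "Use the following context to answer.\n\n{context}"
messages = build_messages(
    template,
    ["Chunk one.", "Chunk two."],
    "What is Influenza?",
)
print(messages[0][1])
```

With `k=3` and `chunk_size=500`, the context added to each request is at most about 1,500 characters, which keeps the prompt small.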

In [49]:
response = rag_chain.invoke({"input": "what is Influenza?"})
print(response["answer"])

Influenza, commonly known as the flu, is a contagious respiratory illness caused by influenza viruses. It primarily affects the nose, throat, and lungs, leading to symptoms such as fever, cough, sore throat, body aches, and fatigue. Influenza viruses are categorized into types A, B, and C, with types A and B causing seasonal flu epidemics.


In [54]:
response = rag_chain.invoke({"input": "what is the treatment of Influenza?"})
print(response["answer"])

The treatment of influenza primarily includes antiviral medications, like neuraminidase inhibitors (e.g., oseltamivir and zanamivir), which can reduce symptoms and duration if taken early. Treatment also involves supportive care, such as rest, hydration, and over-the-counter medications to alleviate symptoms like fever and aches. Vaccination is a key preventive measure and is prioritized over treatment.


In [55]:
response = rag_chain.invoke({"input": "aadfasdf"})
print(response["answer"])

I'm sorry, I don't understand your question. Could you please provide more details or clarify what you need assistance with?
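
The printed answers do not show which passages the chain actually retrieved. The dictionary returned by `create_retrieval_chain` should also carry the retrieved documents under a `"context"` key, so a small helper can list their sources. The sketch below uses `SimpleNamespace` stand-ins in place of real `Document` objects so that it runs on its own; `summarize_sources` is a hypothetical helper.

```python
from types import SimpleNamespace


def summarize_sources(docs):
    """List the source file and page label of each retrieved document."""
    lines = []
    for doc in docs:
        meta = doc.metadata
        lines.append(f"{meta.get('source', '?')} p.{meta.get('page_label', '?')}")
    return lines


# Stand-ins that mimic the metadata seen in the retriever output of this notebook
fake_context = [
    SimpleNamespace(metadata={"source": "../data/EMD.pdf", "page_label": "253"}, page_content="..."),
    SimpleNamespace(metadata={"source": "../data/EMD.pdf", "page_label": "261"}, page_content="..."),
]

sources = summarize_sources(fake_context)
print(sources)
```

In the notebook itself the call would be `summarize_sources(response["context"])`. Comparing the listed pages with the answer text helps check whether a reply is grounded in the book or comes from the model's general knowledge.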
