### Retrieval QA with OpenAI GPT-3.5 w/chaining
### Problem Statement for Generative AI  
A local healthcare company published multiple articles containing healthcare facts, 
information, and tips. It wishes to create a conversational chatbot that can address readers’ 
concerns in natural language using information from the trusted articles and in the 
healthcare context.   
The conversational chatbot should answer readers' queries using only the information from 
the published articles. Where appropriate, it should adopt an empathetic and understanding 
tone.  


The pipeline will be as follows:
   - Create a document collection
   - Embed all documents using an embedder
   - Fetch relevant documents for our question
   - Run an LLM to answer the question

In [1]:
# import langchain
import os
import dotenv
import pandas as pd
import openai
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import OnlinePDFLoader
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory


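# set config/parameters
config = dotenv.dotenv_values("../.env")
openai_api_key = config['OPENAI_API_KEY']
openai.api_key = openai_api_key
# later cells read the key from the environment, so expose it there as well
os.environ.setdefault("OPENAI_API_KEY", openai_api_key)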

#### 1. Read Documents, Chunking:

* Using the provided URLs, the data is collected and extracted with LangChain's document loaders: OnlinePDFLoader reads the PDF URLs and WebBaseLoader reads the HTML pages.
* Documents are split into short, semi-self-contained sections (chunks), which are later converted into embeddings.
* LangChain's RecursiveCharacterTextSplitter splits the documents into chunks with a specified maximum length of 2000 characters.
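
To make the chunking step concrete, here is a minimal pure-Python sketch of the idea behind recursive character splitting: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. It is a simplified illustration, not LangChain's actual implementation; the function name `split_recursive` and its default separators are assumptions made for this example.

```python
def split_recursive(text, max_len, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most max_len characters.

    Simplified illustration only: tries coarse separators first and
    recurses with finer ones for pieces that are still too long.
    """
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # no separators left: fall back to a hard cut every max_len characters
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # the piece alone is too long: split it with the finer separators
            chunks.extend(split_recursive(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb cccc\n\ndddd eeee"
chunks = split_recursive(sample, max_len=10)
print(chunks)
```

With `max_len=10` the first paragraph is too long and is split again on spaces, while the second paragraph fits and stays whole. The notebook itself uses `chunk_size=2000` with `chunk_overlap=0`.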

In [2]:
#read input urls provided in the requirements
infile_path='../urls_list.txt'
with open(infile_path, 'r') as infile:
    urls_data = infile.readlines()

# load each URL with the loader that matches its content type
data_dict = {}
for i, url_link in enumerate(urls_data):
    url_link = url_link.strip()
    if not url_link:
        continue  # skip blank lines in the URL list
    # PDF links (and 'ch-api' links, which this notebook treats as PDFs) use the PDF loader;
    # everything else is read as an HTML page
    if url_link.endswith('pdf') or 'ch-api' in url_link:
        loader = OnlinePDFLoader(url_link)
    else:
        loader = WebBaseLoader(url_link)
    text_data = loader.load()
    text_data[0].metadata['source'] = url_link
    data_dict[i] = text_data

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
all_splits_pypdf_texts = []      # chunk texts
all_splits_pypdf_texts_src = []  # source URL of each chunk, aligned by index
for text_data in data_dict.values():
    texts = text_splitter.split_documents(text_data)
    all_splits_pypdf_texts.extend([d.page_content for d in texts])
    all_splits_pypdf_texts_src.extend([d.metadata['source'] for d in texts])


#### 2. Building Knowledge Base:
* Creating embeddings: an embedding is built for each document chunk using LangChain's OpenAI embeddings.
* Vector DB: the chunk embeddings are stored in a vector store for later retrieval. This task uses FAISS, a library for efficient similarity search and clustering of dense vectors.
* The chunks and their embeddings can also be saved to a CSV file (for large datasets, use a vector database).
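
One detail of the CSV option is that an embedding is written as a string and has to be parsed back into a list of floats when the file is loaded. The sketch below shows that round trip using only the standard library; the helper names `dump_rows` and `load_rows`, the toy vectors, and the `example.com` URLs are placeholders invented for this illustration.

```python
import ast
import csv
import io

# toy rows standing in for the real chunks; the values are made up
rows = [
    {"text": "chunk one", "embedding": [0.1, 0.2, 0.3], "src": "https://example.com/a"},
    {"text": "chunk two", "embedding": [0.4, 0.5, 0.6], "src": "https://example.com/b"},
]


def dump_rows(rows, handle):
    """Write rows as CSV, storing each embedding as its string representation."""
    writer = csv.DictWriter(handle, fieldnames=["text", "embedding", "src"])
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "embedding": repr(row["embedding"])})


def load_rows(handle):
    """Read rows back and convert the embedding strings into lists of floats."""
    loaded = []
    for row in csv.DictReader(handle):
        row["embedding"] = ast.literal_eval(row["embedding"])
        loaded.append(row)
    return loaded


buffer = io.StringIO()  # in-memory file, so the sketch does not touch the disk
dump_rows(rows, buffer)
buffer.seek(0)
restored = load_rows(buffer)
print(restored[0]["embedding"])
```

With pandas, the same conversion is usually done by applying `ast.literal_eval` to the `embedding` column after `pd.read_csv`.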

In [3]:
embedding = OpenAIEmbeddings()
vector_store = FAISS.from_texts(all_splits_pypdf_texts, embedding)

# embed all chunks in one batched call instead of one request per chunk
embed_list = embedding.embed_documents(all_splits_pypdf_texts)

df = pd.DataFrame({"text": all_splits_pypdf_texts, "embedding": embed_list, "src":all_splits_pypdf_texts_src})


# # save document chunks and embeddings
# SAVE_PATH = "data/doc_embedding.csv"
# df.to_csv(SAVE_PATH, index=False)



#### 3. Retrieve Related Documents:
- Takes a user query and a dataframe with text & embedding columns
- Embeds the user query with the OpenAI API
- Uses distance between query embedding and text embeddings to rank the texts
- Returns two lists:
    - The top N texts, ranked by relevance
    - Their corresponding relevance scores
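
The ranking step can be illustrated without any API call. The sketch below scores a few hand-made 3-dimensional vectors against a query vector with cosine similarity and keeps the top results; the vectors, labels, and the helper name `rank_by_relatedness` are invented for this example, whereas the real notebook uses OpenAI embeddings and `scipy.spatial.distance.cosine`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_relatedness(query_vec, docs, top_n=2):
    """Return the top_n (text, score) pairs, most related first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# toy "embeddings": made-up vectors, not real model output
toy_docs = [
    ("diet tips", [1.0, 0.0, 0.0]),
    ("insurance", [0.0, 1.0, 0.0]),
    ("diet and exercise", [0.8, 0.0, 0.6]),
]
ranked = rank_by_relatedness([1.0, 0.0, 0.0], toy_docs)
for text, score in ranked:
    print(f"{score:.2f}  {text}")
```

A score of 1 means the vectors point in the same direction, and 0 means they are orthogonal, so unrelated texts fall to the bottom of the ranking.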

In [4]:
# imports
import ast  # for converting embeddings saved as strings back to arrays
from openai import OpenAI # for calling the OpenAI API
import pandas as pd  # for storing text and embeddings data
import tiktoken  # for counting tokens
import os # for getting API token from env variable OPENAI_API_KEY
from scipy import spatial  # for calculating vector similarities for search

# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

In [5]:
# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 5
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response.data[0].embedding
    strings_and_relatednesses = [
        (row["text"]+"||"+row['src'], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]

#### 4. Question Answering with related documents

Using the retriever, we can automatically retrieve relevant knowledge and insert it into messages to GPT.

Below, we define a function ask that:
- Takes a user query
- Searches for text relevant to the query
- Stuffs that text into a message for GPT
- Sends the message to GPT
- Returns GPT's answer
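
The "stuffing" step is essentially a greedy loop under a token budget: articles are appended in ranked order until the next one would push the message over the limit. Here is a minimal sketch of that loop. It counts whitespace-separated words as a crude stand-in for `tiktoken`, and the names `count_tokens` and `build_message` are assumptions for this illustration.

```python
def count_tokens(text):
    """Crude token count (whitespace-separated words); a stand-in for tiktoken."""
    return len(text.split())


def build_message(intro, articles, question, token_budget):
    """Append ranked articles to the intro until the token budget would be exceeded."""
    message = intro
    for article in articles:
        block = f'\n\nNext article:\n"""\n{article}\n"""'
        if count_tokens(message + block + question) > token_budget:
            break  # this article and all lower-ranked ones are dropped
        message += block
    return message + question


msg = build_message(
    intro="Use the articles.",
    articles=["a b c", "d e f g h"],  # already sorted from most to least related
    question="\n\nQuestion: why?",
    token_budget=15,
)
print(msg)
```

Because the articles arrive sorted by relatedness, stopping at the first article that does not fit keeps the most relevant context and drops the rest.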

In [6]:
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> tuple[str, tuple]:
    """Return a message for GPT, with relevant source texts pulled from a dataframe, plus the (strings, relatednesses) used."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    # print(strings,relatednesses)
    introduction = """Use the below articles on diabetes to answer the subsequent question with respect to healthcare context. \
    If the answer cannot be found in the articles, write a response in an empathetic and understanding tone. \
    For example: "I couldn't find an exact match for your query. Could you rephrase the question related to diabetes?" """
    
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nNext article:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question, (strings,relatednesses)
    
def ask(
        query: str,
        df: pd.DataFrame = df,
        model: str = GPT_MODEL,
        token_budget: int = 4096 - 500,
        print_message: bool = False,
) -> dict:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings; returns the answer and its source docs."""
    message, (strings, relatednesses) = query_message(query, df,
                                                      model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about diabetes."},
        {"role": "user", "content": message},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response.choices[0].message.content

    source_list = []
    for i, docs in enumerate(strings):
        # split on the last "||" only, so a "||" inside the chunk text cannot break the pairing
        text, source = docs.rsplit('||', 1)
        source_list.append({
            "source_doc": text,
            "source": source,
            "relatednesses_score": relatednesses[i]
        })

    return {
        "answer": response_message,
        "source_docs": source_list
    }



### Sample LLM responses:

In [7]:
ans = ask("What is gestational diabetes and how is it diagnosed?", df)
print("Answer:", ans['answer'])


Answer: Gestational diabetes mellitus (GDM) is a type of diabetes that develops during pregnancy. It is characterized by high blood sugar levels that can pose risks to both the mother and the baby. GDM is diagnosed through screening tests conducted during pregnancy, typically between 24 to 28 weeks of gestation. The screening process involves a 3-point 75 g oral glucose tolerance test (OGTT) for all pregnant women, unless they have already been diagnosed with diabetes or pre-diabetes. Women at increased risk of pre-existing diabetes are also screened for diabetes during their first trimester using non-pregnancy glucose thresholds. If GDM is identified, appropriate management strategies, including lifestyle interventions and possibly insulin therapy, are recommended to improve outcomes for both the mother and the baby.


In [8]:
ans = ask("What are some healthy eating tips for people with diabetes?")
print("Answer:", ans['answer'])


Answer: Some healthy eating tips for people with diabetes include:
1. Focus on a balanced diet that includes carbohydrates, protein, and fats, with an emphasis on managing carbohydrate intake to control blood sugar levels.
2. Choose healthier cooking methods like steaming, baking, boiling, or grilling, and use healthier ingredients.
3. Opt for whole grains over refined grains, lean meats, and unsaturated fats while limiting saturated and trans fats.
4. Incorporate plenty of vegetables and fruits into your meals, making them the main components of your plate.
5. Avoid sugary drinks and opt for water as your primary beverage choice.
6. Plan your meals ahead of time, make a shopping list, and practice moderation during festivals and celebrations to maintain a healthy eating routine.
7. Communicate your boundaries politely when faced with peer pressure to indulge in unhealthy food choices.
8. Stay hydrated with water and avoid excessive alcohol consumption, which can impact blood sugar lev

In [9]:
ans = ask("How can my outpatient bill for diabetes be covered? ")
print("Answer:", ans['answer'])

Answer: Based on the articles provided, your outpatient bill for diabetes can be covered through various means such as government subsidies, private medical insurance, and the use of MediSave under the Chronic Disease Management Programme (CDMP). The government subsidies available at public specialist outpatient clinics and polyclinics can help offset your bill, and you can also tap into your private medical insurance benefits if applicable. Additionally, you can utilize your MediSave account to cover a portion of the bill, especially if you are ≥ 60 years old. It is important to explore these options to reduce your out-of-pocket expenses for managing diabetes.


In [10]:
ans = ask("what is the blood sugar level for senior citizens having diabetic condition ?")
print("Answer:", ans['answer'])


Answer: For senior citizens with diabetes, the target blood sugar levels may vary. Generally, a less stringent HbA1c target of ≤8.0% may be appropriate. It is important to individualize the target HbA1c based on the overall health status of the patient, in consultation with their healthcare provider. Older patients, especially if frail, those with a long duration of the disease, short life expectancy, or advanced microvascular or macrovascular complications may benefit from a less stringent target to reduce the risk of hypoglycemia. It is crucial for senior citizens with diabetes to work closely with their healthcare team to determine the most suitable blood sugar targets for their specific health needs.


### Evaluating LLM Performance:

The task is to assess the answers generated by the chatbot, given that no ground-truth Q&A pairs are provided.
For this, I used LangChain's QAEvalChain to evaluate the LLM responses in two ways:

* Manual annotation for 3 sample queries: for the provided sample queries, I annotated reference responses from the extracted docs as ground truth. QAEvalChain assesses the predicted answers against the annotated responses and returns whether each predicted answer is valid.

* LLM-generated ground truth: the response generated by the OpenAI model (LLM1) is treated as ground truth, and the response from LLM2 (RetrievalQA with chaining) is treated as the prediction. QAEvalChain assesses the LLM2 predictions against the LLM1 responses and returns whether each predicted answer is valid.
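
Once the grader has returned a verdict per question, the verdicts can be tallied into a simple accuracy figure. The sketch below assumes each output is a dict whose `results` key holds a string such as `CORRECT` or `INCORRECT`; that key name and format depend on the LangChain version, so treat them as assumptions. The helper `summarize_grades` and the sample verdicts are made up for illustration and are not results from this notebook.

```python
def summarize_grades(eval_outputs, key="results"):
    """Count how many graded answers were marked CORRECT.

    Assumes each item is a dict whose `key` entry is a verdict string;
    the key name is an assumption and may differ between versions.
    """
    grades = [str(out[key]).strip().upper() for out in eval_outputs]
    correct = sum(grade.startswith("CORRECT") for grade in grades)
    accuracy = correct / len(grades) if grades else 0.0
    return {"n": len(grades), "correct": correct, "accuracy": accuracy}


# made-up verdicts for illustration only
sample_outputs = [
    {"results": " CORRECT"},
    {"results": " INCORRECT"},
    {"results": "CORRECT"},
]
summary = summarize_grades(sample_outputs)
print(summary)
```

With only three annotated questions the figure is a sanity check rather than a reliable metric.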

In [11]:
# NOTE: this import rebinds the name OpenAI from the openai client class (imported in cell 4)
# to LangChain's LLM wrapper; the already-created `client` is unaffected
from langchain.llms import OpenAI
# Eval!
from langchain.evaluation.qa import QAEvalChain
from langchain.chains import RetrievalQA

openai_api_key = os.environ["OPENAI_API_KEY"]
llm = OpenAI(temperature=1, openai_api_key=openai_api_key)

ground_truth_question_answers = [
    {'question': "What is gestational diabetes and how is it diagnosed?",
     'answer': 'Gestational diabetes is a type of diabetes that develops during pregnancy and usually goes away after delivery. It is diagnosed using a 3-point 75 g oral glucose tolerance test (OGTT) at 24 to 28 weeks of gestation, unless the woman has already been diagnosed with diabetes or pre-diabetes. It is important to also screen for pre-existing diabetes in the first trimester and after delivery, as women with a history of GDM are at increased risk of developing type 2 diabetes later on in life.'
    },
    {
        'question': "What are some healthy eating tips for people with diabetes?",
        'answer':"Some healthy eating tips for people with diabetes include:\n\n1. Focus on a balanced diet that includes carbohydrates, protein, and fats, with an emphasis on managing carbohydrate intake to control blood sugar levels.\n2. Choose healthier cooking methods like steaming, baking, boiling, or grilling to prepare meals.\n3. Opt for whole grains over refined grains, such as replacing white rice with brown rice.\n4. Select lean meats and remove visible fats before cooking to reduce saturated fat intake.\n5. Use natural seasonings like herbs and spices instead of excessive salt.\n6. Incorporate vegetables and fruits as the main components of your meals, making up at least 50% of your plate.\n7. Stay hydrated with water as your primary drink choice and avoid sugary beverages.\n8. Plan meals ahead, make a shopping list, and opt for healthier products during festivals and celebrations to maintain healthy eating habits.\n9. Communicate your boundaries politely when faced with peer pressure to indulge in unhealthy foods during social gatherings.\n\nRemember, personalized nutritional advice from a healthcare professional, such as a dietitian, can further enhance your diabetes management through tailored dietary recommendations." 
    },
    {
    'question': "How can my outpatient bill for diabetes be covered?",
    'answer': "Your outpatient bill for diabetes can be covered through various means, including government subsidies, employee benefits/private medical insurance, and the use of MediSave through the Chronic Disease Management Programme (CDMP). The bill can be further offset with government subsidies available at public specialist outpatient clinics, polyclinics, and through schemes like the Community Health Assist Scheme (CHAS), Pioneer Generation (PG), and Merdeka Generation (MG) outpatient subsidies. Additionally, patients can tap on accounts of immediate family members for MediSave, and those aged 60 and above can use MediSave for the 15% co-payment under CDMP."
    }
]

    
chain = RetrievalQA.from_chain_type(llm=llm, 
                                    chain_type="stuff", 
                                    retriever=vector_store.as_retriever(), 
                                    input_key="question")

predictions = chain.apply(ground_truth_question_answers)
print(predictions)


# Start your eval chain
eval_chain = QAEvalChain.from_llm(llm)
eval_outputs = eval_chain.evaluate(ground_truth_question_answers,
                                     predictions,
                                     question_key="question",
                                     prediction_key="result",
                                     answer_key='answer')
print(eval_outputs)





[{'question': 'What is gestational diabetes and how is it diagnosed?', 'answer': 'Gestational diabetes is a type of diabetes that develops during pregnancy and usually goes away after delivery. It is diagnosed using a 3-point 75 g oral glucose tolerance test (OGTT) at 24 to 28 weeks of gestation, unless the woman has already been diagnosed with diabetes or pre-diabetes. It is important to also screen for pre-existing diabetes in the first trimester and after delivery, as women with a history of GDM are at increased risk of developing type 2 diabetes later on in life.', 'result': ' Gestational diabetes is diabetes that is first diagnosed during pregnancy. It can be identified through screening using a 3-point 75 g oral glucose tolerance test (OGTT) at 24 to 28 weeks of gestation, unless the woman has already been diagnosed with diabetes or pre-diabetes.'}, {'question': 'What are some healthy eating tips for people with diabetes?', 'answer': 'Some healthy eating tips for people with diab

### LLM Conversation Retrieval Chain using GPT-3.5

Using ConversationalRetrievalChain, a conversational agent was built with limited features. The agent retrieves the top-k documents from the knowledge base and runs the model prediction. Each prediction is added to the chat history, and ConversationBufferMemory keeps track of the conversation.
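
Conceptually, a buffer memory just stores every turn verbatim and replays the turns alongside the next question, so that a follow-up such as "How is it diagnosed?" can be interpreted in context. The sketch below shows that idea in plain Python; `ChatBuffer` and `build_followup_prompt` are hypothetical names for this illustration and are not LangChain classes.

```python
class ChatBuffer:
    """Hypothetical stand-in for a buffer memory: stores every turn verbatim."""

    def __init__(self):
        self.turns = []

    def save_turn(self, question, answer):
        self.turns.append((question, answer))

    def render(self):
        """Format the stored turns as plain text for inclusion in a prompt."""
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


def build_followup_prompt(memory, new_question):
    """Prefix the new question with the chat history, if there is any."""
    history = memory.render()
    if not history:
        return new_question
    return f"Chat history:\n{history}\n\nFollow-up question: {new_question}"


memory = ChatBuffer()
first = build_followup_prompt(memory, "What is gestational diabetes?")
memory.save_turn("What is gestational diabetes?", "Diabetes that develops during pregnancy.")
second = build_followup_prompt(memory, "How is it diagnosed?")
print(second)
```

Since the buffer grows with every turn, long conversations eventually need truncation or summarisation to stay within the model's context window.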

In [45]:
from langchain.chains import ConversationalRetrievalChain
from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
    )

def get_answer(knowledge_base, df):

    openai_api_key = os.environ["OPENAI_API_KEY"]
    
    llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

    general_system_template = """Use the below articles on diabetes to answer the subsequent question with respect to healthcare context. \
    If the answer cannot be found in the articles, write a response in an empathetic and understanding tone. \
    For example: "I couldn't find an exact match for your query. Could you rephrase the question related to diabetes?"
     ----
    {context}
    ----
    """
    general_user_template = "Question:```{question}```"
    messages = [
                SystemMessagePromptTemplate.from_template(general_system_template),
                HumanMessagePromptTemplate.from_template(general_user_template)
    ]
    qa_prompt = ChatPromptTemplate.from_messages( messages )
    
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key='answer'
    )
    
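    pdf_qa = ConversationalRetrievalChain.from_llm(
        llm,
        retriever=knowledge_base.as_retriever(search_kwargs={'k': 3}),
        return_source_documents=True,
        verbose=False,
        memory=memory,
        # pass the custom prompt to the answer-generation step; without this, qa_prompt is never used
        combine_docs_chain_kwargs={"prompt": qa_prompt},
    )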
    
    yellow = "\033[0;33m"
    green = "\033[0;32m"
    white = "\033[0;39m"

    chat_history = []
    print(f"{yellow}---------------------------------------------------------------------------------")
    print('Welcome to the Health care chatBot. You are now ready to start interacting with your documents')
    print('---------------------------------------------------------------------------------')
    while True:
        query = input(f"{green}User Query: ")
        if query in ("exit", "quit", "q", "f"):
            print('Exiting')
            break  # leave the chat loop; sys.exit() fails here because sys is never imported
        if query == '':
            continue
        result = pdf_qa.invoke(
            {"question": query, "chat_history": chat_history})

        print(f"{white}Answer: " + result["answer"])
        chat_history.append((query, result["answer"]))



In [46]:
get_answer(vector_store, df)

---------------------------------------------------------------------------------
Welcome to the Health care chatBot. You are now ready to start interacting with your documents
---------------------------------------------------------------------------------


User Query:  What is gestational diabetes and how it is diagnosed ?


Answer:  Gestational diabetes is a type of diabetes that develops during pregnancy. It is diagnosed through a 3-point 75 g oral glucose tolerance test (OGTT) at 24 to 28 weeks of gestation, unless the woman has already been diagnosed with diabetes or pre-diabetes. Women with a history of gestational diabetes should also be regularly screened for diabetes every 1 to 3 years after delivery.


User Query:  What is gestational diabetes and how it is diagnosed ?


Answer:  Gestational diabetes is diagnosed by screening pregnant women for high blood sugar levels. This is typically done at 24-28 weeks of gestation using a 3-point 75 g oral glucose tolerance test (OGTT). If the results are abnormal, the woman is diagnosed with gestational diabetes. Women who are at increased risk of gestational diabetes may also be screened during their first trimester using non-pregnancy glucose thresholds. After delivery, women with gestational diabetes are re-evaluated using a 2-point 75 g OGTT to assess their glycaemic status. If the results are normal, they should be regularly screened for diabetes every 1-3 years.


User Query:  What is gestational diabetes and how it is diagnosed ?


Answer:  Gestational diabetes is diagnosed through a screening process that involves testing for pre-existing diabetes during the first trimester using non-pregnancy glucose thresholds. If the results are normal, women are re-evaluated for gestational diabetes at 24 to 28 weeks of gestation using a 3-point 75 g oral glucose tolerance test (OGTT). This test is also used to screen all women for gestational diabetes unless they have already been diagnosed with diabetes or pre-diabetes. After delivery, women with diabetes diagnosed during pregnancy should be reassessed using a 2-point 75 g OGTT between 6 to 12 weeks post-delivery. If the results are normal, they should be screened for diabetes every 1 to 3 years (ideally annually) from then on. Women with a history of gestational diabetes are also at increased risk of developing type 2 diabetes and should be regularly screened for diabetes.


User Query:  What is gestational diabetes and how it is diagnosed ?


Answer:  Gestational diabetes is typically diagnosed during the second or third trimester of pregnancy. It is usually screened for at 24 to 28 weeks of gestation using a 3-point 75 g oral glucose tolerance test (OGTT). This involves drinking a glucose solution and having blood drawn at three different time points to measure blood sugar levels. If the results are abnormal, a diagnosis of gestational diabetes is made. Women who are at increased risk of gestational diabetes may also be screened during their first trimester using non-pregnancy glucose thresholds. After delivery, women with gestational diabetes are typically re-evaluated for diabetes using a 2-point 75 g OGTT between 6 to 12 weeks postpartum. If the results are normal, they should be regularly screened for diabetes every 1 to 3 years (ideally annually) from then on.


User Query:  q


Exiting

