### Medical Question and Answering


As I understand it, this task calls for semantic search over the question space. This makes sense because the dataset has no context field, which is typical of extractive or abstractive question answering datasets such as SQuAD. Also, since medical QnA carries a high risk of harm, we should not use GenAI carelessly: it may hallucinate and produce spurious results. Even if we do want to use LLMs, we should put strong guardrails in place. I have therefore based my approach on this assumption of semantic search.
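
To make the idea of semantic search concrete, here is a minimal sketch on toy vectors using only the standard library. The helpers `cosine` and `top_k` and the 3-dimensional vectors are illustrative assumptions, not part of the notebook; the real embeddings come from a sentence encoder further down.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, corpus_vecs, k=1):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "question embeddings" (hypothetical values, for illustration only).
corpus = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]

hits = top_k(query, corpus, k=2)
print(hits)  # corpus rows 0 and 2 rank first and second
```

The answer returned to the user would then be the one stored alongside the best-matching question.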

In [None]:
!pip install -U sentence-transformers



### Necessary Imports

In [None]:
from sentence_transformers import SentenceTransformer, util

In [None]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.tokenize import TreebankWordTokenizer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

### Dataset Understanding


In [None]:
## Loading the data into a Pandas DF
df = pd.read_csv('./files/medDataset_processed.csv')
df.head()

Unnamed: 0,qtype,Question,Answer
0,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,LCMV infections can occur after exposure to fr...
1,symptoms,What are the symptoms of Lymphocytic Choriomen...,LCMV is most commonly recognized as causing ne...
2,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,Individuals of all ages who come into contact ...
3,exams and tests,How to diagnose Lymphocytic Choriomeningitis (...,"During the first phase of the disease, the mos..."
4,treatment,What are the treatments for Lymphocytic Chorio...,"Aseptic meningitis, encephalitis, or meningoen..."


In [None]:
## Checking Basic Info
df.describe()

Unnamed: 0,qtype,Question,Answer
count,16407,16407,16407
unique,16,14979,15817
top,information,What causes Causes of Diabetes ?,This condition is inherited in an autosomal re...
freq,4535,20,348


In [None]:
## Checkout all the types of queries
df.qtype.unique()

array(['susceptibility', 'symptoms', 'exams and tests', 'treatment',
       'prevention', 'information', 'frequency', 'complications',
       'causes', 'research', 'outlook', 'considerations', 'inheritance',
       'stages', 'genetic changes', 'support groups'], dtype=object)

In [None]:
## Checking if we need to handle any Null or NaN values
df.isnull().sum()

qtype       0
Question    0
Answer      0
dtype: int64

### Data Preprocessing
Since I will only search over the questions in the semantic space, I only need to worry about how best to clean and tokenize the questions.

In [None]:
## Tokenizing and Cleaning - the most difficult part in my opinion.
tokenizer = TreebankWordTokenizer()
def tok(sent):
    # Convert entire sent to lowercase
    sent = sent.lower()

    # URLS (must run before punctuation is stripped, otherwise "://" and "."
    # are already gone and the pattern can never match)
    sent = re.sub(r"(https?://[^\s]+)|(www\.[^\s]+)", "", sent)
    # HASHTAGS
    sent = re.sub(r"#\w+", "", sent)

    # Remove the "." after salutations (also has to precede punctuation removal)
    sent = re.sub(r"\b(mrs|mr|ms|dr|prof)\.", r"\1", sent)

    # Replace all newline characters with a space
    sent = re.sub(r"\n", " ", sent)

    # Handle Punctuation
    sent = re.sub(r"[&{\"$@\[%\-\]\|,}()<`^#~\\'*/>:+;=_?!\.]+", "", sent)

    ## Tokenize each sentence
    # tokens = word_tokenize(sent)
    # ## Keep only stemmed words
    # stemmer = SnowballStemmer("english")
    # tokens = [stemmer.stem(word) for word in tokens]
    # lemmatizer = WordNetLemmatizer()
    # tokens = [lemmatizer.lemmatize(word) for word in tokens]
    tokens = tokenizer.tokenize(sent)
    return " ".join(tokens)

In [None]:
df['docs'] = df['Question'].apply(tok)
df.head()

Unnamed: 0,qtype,Question,Answer,docs
0,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,LCMV infections can occur after exposure to fr...,who is at risk for lymphocytic choriomeningiti...
1,symptoms,What are the symptoms of Lymphocytic Choriomen...,LCMV is most commonly recognized as causing ne...,what are the symptoms of lymphocytic choriomen...
2,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,Individuals of all ages who come into contact ...,who is at risk for lymphocytic choriomeningiti...
3,exams and tests,How to diagnose Lymphocytic Choriomeningitis (...,"During the first phase of the disease, the mos...",how to diagnose lymphocytic choriomeningitis lcm
4,treatment,What are the treatments for Lymphocytic Chorio...,"Aseptic meningitis, encephalitis, or meningoen...",what are the treatments for lymphocytic chorio...


For my first approach I planned to use a bidirectional LSTM to encode each sentence into a fixed-dimensional space. For that I need a vocabulary to work with, and the words in it need to be encoded as integer ids.

In [None]:
from collections import Counter

def preprocess(tokens, cut_off=1):
    # Count every occurrence of each word across all sentences
    # (a per-sentence dict update would count a repeated word only once)
    vocab = Counter(w for sent in tokens for w in sent)

    # Words seen more than cut_off times are kept; the rest map to <unk>
    vocabulary = set(k for k,v in vocab.items() if v > cut_off)

    # encoding: ids 0 and 1 are reserved, kept words follow by descending count
    kept = [w for w,_ in vocab.most_common() if w in vocabulary]
    encoding_dict = {"<pad>":0, "<unk>":1}
    encoding_dict.update({w: i+2 for i,w in enumerate(kept)})
    # decoding
    decoding_dict = {v:k for k,v in encoding_dict.items()}

    # Replace rare words with <unk> without mutating the caller's list
    tokens = [[w if w in vocabulary else "<unk>" for w in sent] for sent in tokens]
    return tokens, encoding_dict, decoding_dict
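
Had the BiLSTM route been pursued, the tokenized sentences would still have to become fixed-length integer sequences before reaching an embedding layer. A minimal sketch of that step, assuming an encoding dictionary with `<pad>` at 0 and `<unk>` at 1 as in `preprocess`; the helper name `encode_and_pad`, the toy dictionary, and `max_len=4` are hypothetical:

```python
def encode_and_pad(sentences, encoding_dict, max_len):
    """Map token lists to integer ids, truncating or right-padding to max_len."""
    pad_id = encoding_dict["<pad>"]
    unk_id = encoding_dict["<unk>"]
    encoded = []
    for sent in sentences:
        # Unknown words fall back to the <unk> id; long sentences are truncated.
        ids = [encoding_dict.get(w, unk_id) for w in sent[:max_len]]
        # Short sentences are padded on the right with the <pad> id.
        ids += [pad_id] * (max_len - len(ids))
        encoded.append(ids)
    return encoded

# Toy vocabulary and sentences, for illustration only.
toy_encoding = {"<pad>": 0, "<unk>": 1, "what": 2, "causes": 3, "diabetes": 4}
toy_sentences = [
    ["what", "causes", "diabetes"],
    ["what", "causes", "glaucoma", "in", "adults"],
]

batch = encode_and_pad(toy_sentences, toy_encoding, max_len=4)
print(batch)  # [[2, 3, 4, 0], [2, 3, 1, 1]]
```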

#### Train Test Validation Splits

In [None]:
train_df, val_test_df = train_test_split(df, test_size=0.3, random_state=47)
print(f"Train Length: {len(train_df)}")
val_df, test_df = train_test_split(val_test_df, test_size=0.5, random_state=47)
print(f"Validation Length: {len(val_df)}")
print(f"Test Length: {len(test_df)}")

Train Length: 11484
Validation Length: 2461
Test Length: 2462


In [None]:
train_docs = list(train_df['docs'])

### Loading the Model and Using it to precompute doc embeddings

In [None]:
### WAS INITIALLY WORKING ON BUILDING A LSTM BASED ENCODING, BUT SCRAPPED IT
### AFTER SOME MORE RESEARCH INTO SENTENCE TRANSFORMER BASED MODELS
# class Encoder(nn.Module):
#     def __init__(self, input_size, out_size, vocab_size, embed_size, num_layers=2):
#         super().__init__()
#         self.vocab_size = vocab_size
#         self.embed_size = embed_size
#         self.embed = nn.Embedding(self.vocab_size, self.embed_size)
#         self.lstm = nn.LSTM(input_size=input_size, hidden_size=out_size,
#                             num_layers=num_layers, batch_first=True,
#                             bidirectional=True)

#     def forward(self, X):
#         embed_out = self.embed()
#         pass

#### Sentence Transformer Based Approach
I finally decided to use the sentence-transformers library to encode all the documents in my train set for use in the semantic search.
I used the pretrained "multi-qa-MiniLM-L6-cos-v1" model, which encodes documents into a 384-dimensional dense vector space. This model was also trained specifically for semantic search tasks.

In [None]:
# Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')

In [None]:
# Encode all the Questions/Documents in this case
doc_emb = model.encode(train_docs)

## Now store the embeddings for further inference
torch.save(doc_emb, "./files/Document_Embeddings.mat")

In [None]:
train_df.to_csv('./files/inference.csv')

### Evaluation (Average Cosine Similarity Achieved)

In [None]:
## For evaluation I have considered the Average Cosine Similarity between each query
## and its top-1 match, over both the Validation and Testing datasets.
val_results = util.semantic_search(model.encode(list(val_df['docs'])), doc_emb,
                                   top_k=1)
test_results = util.semantic_search(model.encode(list(test_df['docs'])), doc_emb,
                                    top_k=1)

avg_cos_sim_val = np.average([row[0]['score'] for row in val_results])
avg_cos_sim_test = np.average([row[0]['score'] for row in test_results])
print(f"The average cosine similarity achieved in Validation Set is {avg_cos_sim_val}")
print(f"The average cosine similarity achieved in Test Set is {avg_cos_sim_test}")

The average cosine similarity achieved in Validation Set is 0.8542121650120534
The average cosine similarity achieved in Test Set is 0.8546590789680458
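
Note that the average top-1 similarity measures how close the nearest training question is, not whether the retrieved answer is correct. As a complementary view, one could also report the fraction of queries whose best match clears a similarity threshold. A minimal sketch on hand-written results, laid out in the list-of-lists-of-dicts form that `util.semantic_search` returns; the helper name `summarise_top1`, the toy scores, and the 0.8 threshold are assumptions, not values from this notebook:

```python
from statistics import mean

def summarise_top1(results, threshold=0.8):
    """Summarise the best-match scores of a batch of semantic search results."""
    top1 = [hits[0]["score"] for hits in results]
    return {
        "mean": mean(top1),
        "min": min(top1),
        # Share of queries whose best match reaches the threshold.
        "frac_above": sum(s >= threshold for s in top1) / len(top1),
    }

# Hypothetical results for three queries (top_k=1).
toy_results = [
    [{"corpus_id": 3, "score": 0.95}],
    [{"corpus_id": 0, "score": 0.60}],
    [{"corpus_id": 7, "score": 0.85}],
]

summary = summarise_top1(toy_results)
print(summary)
```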


### Inferencing on our Queries

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
doc_emb = torch.load('./files/Document_Embeddings.mat')
inference_df = pd.read_csv('./files/inference.csv')  # same path the train split was saved to above

In [None]:
query = "What are Tumors?"

In [None]:
query = tok(query)
query_emb = model.encode(query)
result = util.semantic_search(query_emb, doc_emb, top_k=1)[0][0]
print(f"The Answer to your Query, which matched {result['score']*100} % with an existing query is:")
print(inference_df.iloc[result['corpus_id']]['Answer'])

The Answer to your Query, which matched 81.11408352851868 % with an existing query is:
Cancer begins in your cells, which are the building blocks of your body. Normally, your body forms new cells as you need them, replacing old cells that die. Sometimes this process goes wrong. New cells grow even when you don't need them, and old cells don't die when they should. These extra cells can form a mass called a tumor. Tumors can be benign or malignant. Benign tumors aren't cancer while malignant ones are. Cells from malignant tumors can invade nearby tissues. They can also break away and spread to other parts of the body.     Cancer is not just one disease but many diseases. There are more than 100 different types of cancer. Most cancers are named for where they start. For example, lung cancer starts in the lung, and breast cancer starts in the breast. The spread of cancer from one part of the body to another is called metastasis. Symptoms and treatment depend on the cancer type and how adv
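
Given the risk noted at the start, one simple guardrail is to decline to answer when the best match is weak. A minimal sketch, assuming a `result` dict with `corpus_id` and `score` keys as in the cell above and a plain list of answers; the function name `answer_or_abstain`, the 0.7 cut-off, and the fallback message are illustrative choices, not tuned values:

```python
def answer_or_abstain(result, answers, min_score=0.7):
    """Return the matched answer, or a fallback message when the match is weak."""
    if result["score"] < min_score:
        return "No sufficiently similar question was found."
    return answers[result["corpus_id"]]

# Toy answers and search results, for illustration only.
toy_answers = ["Answer A", "Answer B"]
confident = answer_or_abstain({"corpus_id": 1, "score": 0.81}, toy_answers)
uncertain = answer_or_abstain({"corpus_id": 0, "score": 0.42}, toy_answers)

print(confident)  # Answer B
print(uncertain)  # No sufficiently similar question was found.
```

With the notebook's data, `answers` could be `inference_df['Answer'].tolist()`; a suitable cut-off would have to be chosen by inspecting matches on the validation set.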