# Information Retrieval and Extraction on Bio-Medical Text (Question Answering)

## Gather Required Libraries / Data / Files

In [1]:
!gdown --id 1mxVUywvKzvA9bvrUc11RYuOTy7MYcXHF

Downloading...
From: https://drive.google.com/uc?id=1mxVUywvKzvA9bvrUc11RYuOTy7MYcXHF
To: /content/bio-QA.zip
100% 5.48M/5.48M [00:00<00:00, 48.2MB/s]


In [2]:
!gdown --id 1qM-W9poYlsVB1ko6V7iskttrM03kj1xq
!gdown --id 1erkgXVLEZQjFvryGME9Ed8qHSMTNthWD
!gdown --id 16xt49qlnP9ccd4SrqtQKrS2Dp7GBg6oH

Downloading...
From: https://drive.google.com/uc?id=1qM-W9poYlsVB1ko6V7iskttrM03kj1xq
To: /content/BioASQ_data.zip
100% 4.52M/4.52M [00:00<00:00, 71.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1erkgXVLEZQjFvryGME9Ed8qHSMTNthWD
To: /content/BioASQ.py
100% 394/394 [00:00<00:00, 579kB/s]
Downloading...
From: https://drive.google.com/uc?id=16xt49qlnP9ccd4SrqtQKrS2Dp7GBg6oH
To: /content/main.py
100% 1.39k/1.39k [00:00<00:00, 1.20MB/s]


In [3]:
!unzip "/content/BioASQ_data.zip" -d "/content/"

Archive:  /content/BioASQ_data.zip
  inflating: /content/BioASQ_data/BioASQ-test-factoid-4b-1.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-4b-2.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-4b-3.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-4b-4.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-4b-5.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-5b-1.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-5b-2.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-5b-3.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-5b-4.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-5b-5.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-6b-1.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-6b-2.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-6b-3.json  
  inflating: /content/BioASQ_data/BioASQ-test-factoid-6b-4.json  
  inflating: /content/BioASQ_data/BioASQ-

In [4]:
!pip install --quiet sentence_transformers



## Importing Required Libraries

In [5]:
from BioASQ import BioASQDoc
from tqdm import tqdm
from sentence_transformers import SentenceTransformer, util
import torch
import glob
import json
from collections import OrderedDict
import time

## Bio-Medical Data Preparation

In [6]:
doc_paths = glob.glob('/content/BioASQ_data/*')

In [7]:
def get_documents(doc_paths):
    """Load every JSON file in doc_paths and return the parsed documents."""
    documents = []
    for doc in tqdm(doc_paths):
        with open(doc) as file:
            documents.append(json.load(file))

    return documents

In [8]:
json_document_content = get_documents(doc_paths)

100%|██████████| 18/18 [00:00<00:00, 111.87it/s]


In [9]:
bio_docs = []
for doc in json_document_content:
    # Collect the passage text ("context") of every paragraph in the file.
    for paragraph in tqdm(doc['data'][0]['paragraphs']):
        bio_docs.append(paragraph['context'])

100%|██████████| 152/152 [00:00<00:00, 379936.95it/s]
100%|██████████| 94/94 [00:00<00:00, 447011.99it/s]
100%|██████████| 145/145 [00:00<00:00, 684880.72it/s]
100%|██████████| 121/121 [00:00<00:00, 622712.62it/s]
100%|██████████| 121/121 [00:00<00:00, 481509.28it/s]
100%|██████████| 4950/4950 [00:00<00:00, 659272.35it/s]
100%|██████████| 101/101 [00:00<00:00, 577145.37it/s]
100%|██████████| 78/78 [00:00<00:00, 15709.76it/s]
100%|██████████| 4772/4772 [00:00<00:00, 1168021.63it/s]
100%|██████████| 167/167 [00:00<00:00, 820197.62it/s]
100%|██████████| 88/88 [00:00<00:00, 449572.17it/s]
100%|██████████| 100/100 [00:00<00:00, 480998.17it/s]
100%|██████████| 91/91 [00:00<00:00, 401347.70it/s]
100%|██████████| 93/93 [00:00<00:00, 313813.57it/s]
100%|██████████| 88/88 [00:00<00:00, 478107.19it/s]
100%|██████████| 117/117 [00:00<00:00, 543448.03it/s]
100%|██████████| 3266/3266 [00:00<00:00, 711652.39it/s]
100%|██████████| 116/116 [00:00<00:00, 531736.90it/s]
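
The loop above relies on each file nesting its passages under `data[0]['paragraphs'][i]['context']`. A minimal sketch of that assumed layout, using a made-up two-paragraph record (the sample sentences and the helper name `extract_contexts` are illustrative and are not taken from the BioASQ files):

```python
import json

# Hypothetical miniature record mimicking the nesting the notebook indexes into.
sample_json = """
{
  "data": [
    {
      "paragraphs": [
        {"context": "Olaparib is a PARP inhibitor."},
        {"context": "Aspirin inhibits cyclooxygenase."}
      ]
    }
  ]
}
"""


def extract_contexts(document):
    """Return the 'context' string of every paragraph in the first data entry."""
    return [paragraph["context"] for paragraph in document["data"][0]["paragraphs"]]


sample_document = json.loads(sample_json)
sample_contexts = extract_contexts(sample_document)
print(sample_contexts)
```

Only the first entry of `data` is read, which matches what the notebook does; files with more than one entry would need an extra loop.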


In [10]:
bio_docs = list(OrderedDict.fromkeys(bio_docs))
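
`OrderedDict.fromkeys` drops duplicate passages while keeping the first occurrence of each one in its original position, which a plain `set` would not guarantee. A small sketch with made-up strings:

```python
from collections import OrderedDict

passages = ["alpha", "beta", "alpha", "gamma", "beta"]  # illustrative values only

# Keys of an OrderedDict are unique and keep their insertion order.
unique_passages = list(OrderedDict.fromkeys(passages))
print(unique_passages)  # ['alpha', 'beta', 'gamma']
```

On Python 3.7 and later, `dict.fromkeys` preserves insertion order in the same way.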

## Building Information Retriever using Sentence Transformer Encoded Text

In [11]:
model = SentenceTransformer('multi-qa-mpnet-base-cos-v1')


### Encoding Whole Bio-Medical Corpus

In [12]:
start_time = time.time()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
corpus_embeddings = model.encode(bio_docs, convert_to_tensor=True, batch_size=32, device=device)
elapsed_minutes = (time.time() - start_time) / 60
print("--- Whole Bio-Medical Text Encoded in %f minutes using Sentence Transformer 'multi-qa-mpnet-base-cos-v1' ---" % elapsed_minutes)

--- Whole Bio-Medical Text Encoded in 1.909891 minutes using Sentence Transformer 'multi-qa-mpnet-base-cos-v1' ---


### Encoding Query (or Question, to extract the answer later)

In [13]:
query = "Which enzyme is inhibited by Olaparib?"

In [14]:
query_embedding = model.encode(query, convert_to_tensor=True, device=device)

In [16]:
cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
top_results = torch.topk(cos_scores, k=10)
top_results

torch.return_types.topk(values=tensor([0.5702, 0.5470, 0.5072, 0.4968, 0.4885, 0.4847, 0.4816, 0.4811, 0.4783,
        0.4746], device='cuda:0'), indices=tensor([ 837,  830,  429,  323,  828, 3433,  838, 1616, 3427, 3506],
       device='cuda:0'))
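
To make the scoring step concrete, here is a dependency-free sketch of what cosine similarity followed by a top-k selection computes. It is not how `sentence_transformers` or `torch` implement it; the function names `cosine_similarity` and `top_k` and the toy 3-dimensional vectors are assumptions made for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus_vecs, k):
    """Return (score, index) pairs for the k most similar corpus vectors, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


toy_query = [1.0, 0.0, 0.0]
toy_corpus = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees away from the query
]
toy_results = top_k(toy_query, toy_corpus, k=2)
for score, idx in toy_results:
    print(idx, round(score, 4))
```

Because cosine similarity ignores vector length, the second toy vector scores 1.0 even though it is twice as long as the query.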

### Retrieving Top 10 Documents based on Cosine Similarity

In [17]:
relevant_context = []
for score, idx in zip(top_results.values, top_results.indices):
    # Optional filter, currently disabled: keep only passages with `score > 0.5`.
    relevant_context.append(bio_docs[idx])
    print(bio_docs[idx], "(Score: {:.4f})".format(score))

PARP inhibitor olaparib increases the oncolytic activity of dl922-947 in in vitro and in vivo model of anaplastic thyroid carcinoma. PARP inhibitors are mostly effective as anticancer drugs in association with DNA damaging agents. We have previously shown that the oncolytic adenovirus dl922-947 induces extensive DNA damage, therefore we hypothesized a synergistic antitumoral effect of the PARP inhibitor olaparib in association with dl922-947. Anaplastic thyroid carcinoma was chosen as model since it is a particularly aggressive tumor and, because of its localized growth, it is suitable for intratumoral treatment with oncolytic viruses. Here, we show that dl922-947 infection induces PARP activation, and we confirm in vitro and in vivo that PARP inhibition increases dl922-947 replication and oncolytic activity. In vitro, the combination with olaparib exacerbates the appearance of cell death markers, such as Annexin V positivity, caspase 3 cleavage, cytochrome C release and propidium iodi
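
The retrieval loop contains a disabled `score > 0.5` check. As a rough sketch of what enabling it would do, the snippet below applies that cut-off to the scores and indices printed by the top-k cell; the constant name `SCORE_THRESHOLD` is hypothetical, and 0.5 is simply the value from the commented-out line, not a tuned setting.

```python
# Scores and indices copied from the top-k output shown earlier.
scores = [0.5702, 0.5470, 0.5072, 0.4968, 0.4885, 0.4847, 0.4816, 0.4811, 0.4783, 0.4746]
indices = [837, 830, 429, 323, 828, 3433, 838, 1616, 3427, 3506]

SCORE_THRESHOLD = 0.5  # hypothetical cut-off taken from the commented-out line

# Keep only the corpus indices whose similarity exceeds the threshold.
kept_indices = [idx for score, idx in zip(scores, indices) if score > SCORE_THRESHOLD]
print(kept_indices)  # [837, 830, 429]
```

With this query only three of the ten passages would be passed on to the question-answering step.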

## Creating Question Answering Pipeline (Information Extraction)

In [18]:
from transformers import pipeline

In [19]:
bioBert_model = 'dmis-lab/biobert-large-cased-v1.1-squad'
nlp = pipeline(task='question-answering', model=bioBert_model, tokenizer=bioBert_model)


In [20]:
for context in relevant_context:
    print(nlp(question=query, context=context))

{'score': 0.8773244023323059, 'start': 0, 'end': 4, 'answer': 'PARP'}
{'score': 0.15354983508586884, 'start': 190, 'end': 217, 'answer': 'poly(ADP-ribose) polymerase'}
{'score': 0.02180912345647812, 'start': 798, 'end': 802, 'answer': 'IRAP'}
{'score': 0.3676687777042389, 'start': 1169, 'end': 1172, 'answer': 'Akt'}
{'score': 0.4034416377544403, 'start': 194, 'end': 198, 'answer': 'PARP'}
{'score': 0.8109048008918762, 'start': 221, 'end': 247, 'answer': 'poly-ADP-ribose polymerase'}
{'score': 0.9219761490821838, 'start': 1655, 'end': 1659, 'answer': 'PARP'}
{'score': 0.44107967615127563, 'start': 0, 'end': 15, 'answer': 'Peroxiredoxin 2'}
{'score': 0.43723803758621216, 'start': 1282, 'end': 1286, 'answer': 'PARP'}
{'score': 0.09396830201148987, 'start': 1120, 'end': 1126, 'answer': 'mGluR5'}
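
Each call returns one candidate span per passage, so a final answer still has to be chosen. One possible post-processing step (an assumption, not part of the original notebook) is to sum the scores of identical answer strings and rank the totals. The sketch below reuses the ten outputs printed above with scores rounded to four decimals; the helper name `rank_answers` is hypothetical.

```python
from collections import defaultdict

# Answer/score pairs copied from the pipeline output above (scores rounded).
predictions = [
    {"score": 0.8773, "answer": "PARP"},
    {"score": 0.1535, "answer": "poly(ADP-ribose) polymerase"},
    {"score": 0.0218, "answer": "IRAP"},
    {"score": 0.3677, "answer": "Akt"},
    {"score": 0.4034, "answer": "PARP"},
    {"score": 0.8109, "answer": "poly-ADP-ribose polymerase"},
    {"score": 0.9220, "answer": "PARP"},
    {"score": 0.4411, "answer": "Peroxiredoxin 2"},
    {"score": 0.4372, "answer": "PARP"},
    {"score": 0.0940, "answer": "mGluR5"},
]


def rank_answers(preds):
    """Sum the scores of identical answer strings and sort by total, best first."""
    totals = defaultdict(float)
    for pred in preds:
        totals[pred["answer"]] += pred["score"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranked = rank_answers(predictions)
print(ranked[0][0])  # PARP
```

Exact string matching treats "PARP" and "poly-ADP-ribose polymerase" as different answers, so a real system would probably also need to normalise synonyms before aggregating.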
