<a href="https://colab.research.google.com/github/xy2119/COVID19_Knowledge_Graph/blob/main/notebooks/Covid19_Search_Engine_BioBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Biomedical Natural Language Processing | Word Representation and Semantic Search Engine
# Submission to the 2021 Imperial College Data Science Challenge

In the **[2021 Data Science Institute Natural Language Processing Challenge](https://www.imperial.ac.uk/data-science/)**, the dataset released for the text-mining task is [CORD-19](https://github.com/allenai/cord19), a corpus of academic papers on COVID-19 and related coronavirus research, curated and maintained by the Allen Institute for AI. We explore word representations and semantic relations that could serve as a medical and scientific knowledge repository.

This notebook builds a search engine that indexes the COVID-19 literature and retrieves the papers most relevant to a user's query. The text is parsed from JSON files, cleaned (and saved to a CSV file), and tokenized; TF-IDF and a transformer model are then used to create the embeddings used for retrieval.

The extracted associations ultimately contributed to the creation of the COVID-19 Knowledge Graph.

### **What is a Biomedical Entity?**
A biomedical entity is any term that refers to a concept in biomedicine, such as a protein, gene, disease, or medical treatment.

### **BioBERT**
BioBERT is a BERT model pre-trained on biomedical text. Its weights are initialised from the general-domain BERT model and then further pre-trained on biomedical corpora (PubMed abstracts and PMC full-text articles). The resulting domain-specific model can be fine-tuned on smaller datasets. The literature reports that fine-tuned BioBERT outperforms fine-tuned BERT on biomedical NLP tasks.



This notebook is organised as follows:

0. Install and Import Libraries
1. Download CORD 19 Dataset
2. Read JSON Files and Load Articles

    * Create Abstract Dictionary

    * Create Body Text Dictionary

    * Create Metadata Dictionary

    * Create Bibliography Dictionary

    * Save Cleaned Files in CSV

3. Text Processing
4. Embedding with TF-IDF
5. Search Engine based on TF-IDF
6. Embedding with BioBERT
7. Search Engine based on Transformer


Some information that may help before reproducing this notebook:

* For reference, the CORD-19 dataset took 20m 58s to download and unzip from Kaggle on a Colab instance with a Tesla T4 GPU.


Feel free to contact me at xy2119@ic.ac.uk

In [None]:
!pip install sentence_transformers
!pip install faiss-gpu

In [None]:
!wget https://github.com/naver/biobert-pretrained/releases/download/v1.1-pubmed/biobert_v1.1_pubmed.tar.gz

## Install and Import Libraries

In [None]:
import re
import os
import json
import numpy as np
import pandas as pd
import dill as pickle
from pprint import pprint
from copy import deepcopy
from tqdm.notebook import tqdm

import torch
from sentence_transformers import SentenceTransformer

## Read JSON Files and Load Articles

**Note:** Downloading and unzipping this dataset from Kaggle took 20m 58s.

In [None]:
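# create the .kaggle folder in the home directory (-p avoids an error if it already exists)
!mkdir -p ~/.kaggle 
# write Kaggle API credentials to kaggle.json (replace the placeholders with your own; never publish real keys)
!echo '{"username":"YOUR_KAGGLE_USERNAME","key":"YOUR_KAGGLE_API_KEY"}' > ~/.kaggle/kaggle.json 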
# set permissions
!chmod 600 ~/.kaggle/kaggle.json
# install the kaggle library
!pip install kaggle 
# download CORD 19 Kaggle Dataset
!kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge -p ~/../content/CORD-19-research-challenge/data/  
# unzip
!unzip -o ~/../content/CORD-19-research-challenge/data/CORD-19-research-challenge.zip -d ~/../content/CORD-19-research-challenge/data/
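
Credentials typed directly into a notebook are easy to leak when the notebook is shared. The sketch below shows one way to write `kaggle.json` from values held outside the notebook. The helper name `write_kaggle_config` and its `home` parameter are hypothetical and only serve the illustration; the file location and the owner-only permissions mirror the shell commands above.

```python
import json
from pathlib import Path


def write_kaggle_config(username, key, home=None):
    """Write kaggle.json under <home>/.kaggle with owner-only permissions."""
    # `home` is a hypothetical parameter; it defaults to the user's home directory
    config_dir = Path(home or Path.home()) / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "kaggle.json"
    config_path.write_text(json.dumps({"username": username, "key": key}))
    config_path.chmod(0o600)  # same effect as `chmod 600`
    return config_path
```

A possible call, assuming the credentials were exported beforehand as the environment variables `KAGGLE_USERNAME` and `KAGGLE_KEY`, is `write_kaggle_config(os.environ["KAGGLE_USERNAME"], os.environ["KAGGLE_KEY"])`.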

In [None]:
# the archive was unzipped into /content/CORD-19-research-challenge/data/ above
dir = '/content/CORD-19-research-challenge/data/document_parses/pdf_json/'
filenames = os.listdir(dir)
print("Number of articles retrieved:", len(filenames))

Number of articles retrieved: 401214


In [None]:
# Load a batch of 25,000 parsed papers for creating bio_clean_v1.csv
start = 75000
batch_size = 25000
all_files = []
for filename in filenames[start:start + batch_size]:
    # close each file handle as soon as the JSON has been parsed
    with open(dir + filename, 'rb') as f:
        all_files.append(json.load(f))

file = all_files[0]
print("Dictionary keys:", file.keys())

Dictionary keys: dict_keys(['paper_id', 'metadata', 'abstract', 'body_text', 'bib_entries', 'ref_entries', 'back_matter'])
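
Because `os.listdir` returns names in arbitrary order, a batch selected by position may differ between runs or machines. The following sketch sorts the listing first so that a `(start, size)` window always refers to the same papers. The function name `load_batch` is hypothetical and is not part of the notebook.

```python
import json
import os


def load_batch(dirname, start, size):
    """Load `size` parsed papers starting at position `start` of the sorted listing."""
    # sort the file names so the same window is selected on every run
    names = sorted(name for name in os.listdir(dirname) if name.endswith(".json"))
    papers = []
    for name in names[start:start + size]:
        with open(os.path.join(dirname, name), "r", encoding="utf-8") as handle:
            papers.append(json.load(handle))
    return papers
```

With the variables used in this notebook, the call would look like `all_files = load_batch(dir, 75000, 25000)`. Slicing also means that a window reaching past the end of the listing returns fewer papers instead of raising an `IndexError`.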


### Create Abstract Dictionary

In [None]:
pprint(file['abstract'])

### Create Body Text Dictionary

In [None]:
print("body_text type:", type(file['body_text']))
print("body_text length:", len(file['body_text']))
print("body_text keys:", file['body_text'][0].keys())

body_text type: <class 'list'>
body_text length: 85
body_text keys: dict_keys(['text', 'cite_spans', 'ref_spans', 'section'])


In [None]:
print("body_text content:")
pprint(file['body_text'][1],depth=1)

body_text content:
{'cite_spans': [],
 'ref_spans': [],
 'section': '',
 'text': 'The Submitting Author accepts and understands that any supply made '
         'under these terms is made by BMJ to the Submitting Author unless you '
         'are acting as an employee on behalf of your employer or a '
         'postgraduate student of an affiliated institution which is paying '
         'any applicable article publishing charge ("APC") for Open Access '
         'articles. Where the Submitting Author wishes to make the Work '
         'available on an Open Access basis (and intends to pay the relevant '
         'APC), the terms of reuse of such Open Access shall be governed by a '
         'Creative Commons licence -details of these licences and which '
         'Creative Commons licence will apply to this Work are set out in our '
         'licence referred to above.'}


In [None]:
texts = [(di['section'], di['text']) for di in file['body_text']]
texts_di = {di['section']: "" for di in file['body_text']}
for section, text in texts:
    texts_di[section] += text

pprint(list(texts_di.keys()))

['',
 'ABSTRACT Introduction',
 'Methods and analysis',
 'Ethics and dissemination',
 'Strengths and limitations of this study',
 'BACKGROUND',
 'METHODS',
 'Scope of the COS',
 'Patient and public involvement (PPI)',
 'Stakeholders for COS (participants of Delphi surveys and consensus meeting)',
 'i. Patient and public representatives',
 'iii. Researchers',
 'Stage 1: Hierarchical systematic literature search',
 'Search strategy',
 'Study selection and data extraction',
 'Initial list of outcomes',
 'Stage 2: Delphi survey',
 'Pilot study',
 'st Delphi',
 'nd Delphi',
 'rd Delphi',
 'Stage 3: Consensus meeting',
 'Final COS',
 'Strength',
 'Limitation',
 'DISSEMINATION',
 "Authors' contributions:",
 'Competing interests: None declared.',
 'Stage 1: Systematic literature search',
 'Stage 2: Focus groups',
 'Page 7 of 20',
 '3) Researchers',
 'Stage 3: Delphi surveys',
 '1) Patient representatives',
 'Stage 4: Consensus meeting',
 'DISCUSSION']


In [None]:
def format_body(body_text):
    texts = [(di['section'], di['text']) for di in body_text]
    texts_di = {di['section']: "" for di in body_text}
    
    for section, text in texts:
        texts_di[section] += text

    body = ""

    for section, text in texts_di.items():
        body += section
        body += "\n\n"
        body += text
        body += "\n\n"
    
    return body
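
To see what the grouping step in `format_body` does, here is a self-contained sketch on a toy `body_text`. The helper `merge_sections` is a hypothetical, condensed restatement of the dictionary-building part only, and the toy paragraphs are made up for illustration.

```python
def merge_sections(body_text):
    """Concatenate paragraph texts per section, keeping first-seen section order."""
    merged = {}
    for paragraph in body_text:
        # dicts preserve insertion order, so sections stay in document order
        section = paragraph["section"]
        merged[section] = merged.get(section, "") + paragraph["text"]
    return merged


toy_body = [
    {"section": "Introduction", "text": "First paragraph."},
    {"section": "Methods", "text": "We did X."},
    {"section": "Introduction", "text": "Late paragraph."},
]
merged = merge_sections(toy_body)
print(merged)
# {'Introduction': 'First paragraph.Late paragraph.', 'Methods': 'We did X.'}
```

Two properties are worth noting: paragraphs that share a section title are merged even when they are not adjacent, and they are joined without a separator, which is also how `format_body` behaves.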

### Create Metadata Dictionary

In [None]:
print(all_files[0]['metadata'].keys())
print(all_files[0]['metadata']['title'])

dict_keys(['title', 'authors'])
Core outcome set for studies of pregnancy affected by multimorbidity: a protocol Protocol for pregnancy affected by multimorbidity COS Title: Core outcome set for studies of pregnancy affected by multimorbidity: a protocol Mairead Black* 11 * indicates equal contribution Affiliation


In [None]:
authors = all_files[0]['metadata']['authors']
pprint(authors[:2])

[{'affiliation': {'institution': 'University of Birmingham',
                  'laboratory': '',
                  'location': {'country': 'UK', 'settlement': 'Birmingham'}},
  'email': '',
  'first': 'Siang',
  'last': 'Lee',
  'middle': ['Ing'],
  'suffix': ''},
 {'affiliation': {'institution': 'University of St Andrews',
                  'laboratory': "Centre for Public Health, Queen's University "
                                'of Belfast 3',
                  'location': {'country': 'UK'}},
  'email': '',
  'first': 'Kelly-Ann',
  'last': 'Eastwood',
  'middle': [],
  'suffix': ''}]


In [None]:
def format_name(author):
    middle_name = " ".join(author['middle'])
    
    if author['middle']:
        return " ".join([author['first'], middle_name, author['last']])
    else:
        return " ".join([author['first'], author['last']])


def format_affiliation(affiliation):
    text = []
    location = affiliation.get('location')
    if location:
        # e.g. {'settlement': 'Birmingham', 'country': 'UK'} -> ['Birmingham', 'UK']
        text.extend(location.values())
    
    institution = affiliation.get('institution')
    if institution:
        text = [institution] + text
    return ", ".join(text)
    
for author in authors:
    print("Name:", format_name(author))
    print("Affiliation:", format_affiliation(author['affiliation']))

Name: Siang Ing Lee
Affiliation: University of Birmingham, Birmingham, UK
Name: Kelly-Ann Eastwood
Affiliation: University of St Andrews, UK
Name: Ngawai Moss
Affiliation: The University of Manchester
Name: Amaya Azcoaga-Lorenzo
Affiliation: 
Name: Anuradhaa Subramanian
Affiliation: University of Birmingham, Birmingham, UK
Name: Astha Anand
Affiliation: University of Birmingham, Birmingham, UK
Name: Beck Taylor
Affiliation: University of Birmingham, Birmingham, UK
Name: Catherine Nelson-Piercy
Affiliation: The University of Manchester, UK
Name: Christopher Yau
Affiliation: University of Birmingham, Birmingham, UK
Name: Colin Mccowan
Affiliation: 
Name: Dermot O&apos;reilly
Affiliation: University of St Andrews, UK
Name: Holly Hope
Affiliation: The University of Manchester, UK
Name: Jonathan I Kennedy
Affiliation: University of Birmingham, Birmingham, UK
Name: Kathryn Abel
Affiliation: The University of Manchester, UK
Name: Louise Locock
Affiliation: 
Name: Peter Brocklehurst
Affiliatio

In [None]:
def format_authors(authors, with_affiliation=False):
    name_ls = []
    
    for author in authors:
        name = format_name(author)
        if with_affiliation:
            affiliation = format_affiliation(author['affiliation'])
            if affiliation:
                name_ls.append(f"{name} ({affiliation})")
            else:
                name_ls.append(name)
        else:
            name_ls.append(name)
    
    return ", ".join(name_ls)  

authors = all_files[2]['metadata']['authors']
print("Formatting without affiliation:")
print(format_authors(authors, with_affiliation=False))
print("\nFormatting with affiliation:")
print(format_authors(authors, with_affiliation=True))

Formatting without affiliation:
Augustino Isdory, Eunice W Mureithi, David J T Sumpter

Formatting with affiliation:
Augustino Isdory (University of Dar es Salaam, Dar es Salaam, Tanzania), Eunice W Mureithi (University of Dar es Salaam, Dar es Salaam, Tanzania), David J T Sumpter (Uppsala University, Uppsala, Sweden)


### Create Bibliography Dictionary

In [None]:
bibs = list(file['bib_entries'].values())
print("Formatting without affiliation:")
print(format_authors(bibs[1]['authors'], with_affiliation=False))

Formatting without affiliation:
L M Gawron, J N Sanders, K Sward


In [None]:
def format_bib(bibs):
    if isinstance(bibs, dict):
        bibs = list(bibs.values())
    bibs = deepcopy(bibs)
    formatted = []
    
    for bib in bibs:
        bib['authors'] = format_authors(
            bib['authors'], 
            with_affiliation=False
        )
        formatted_ls = [str(bib[k]) for k in ['title', 'authors', 'venue', 'year']]
        formatted.append(", ".join(formatted_ls))

    return "; ".join(formatted)

### Save Cleaned Files in CSV

In [None]:
cleaned_files = []
for file in tqdm(all_files):
    features = [
                file['paper_id'],
                file['metadata']['title'],
                format_authors(file['metadata']['authors']),
                format_authors(file['metadata']['authors'], 
                              with_affiliation=True),
                format_body(file['abstract']),
                format_body(file['body_text']),
                format_bib(file['bib_entries']),
                file['metadata']['authors'],
                file['bib_entries']
                ]
    
    cleaned_files.append(features)

  0%|          | 0/25000 [00:00<?, ?it/s]

In [None]:
def load_files(dirname):
    filenames = os.listdir(dirname)
    raw_files = []

    for filename in tqdm(filenames):
        # os.path.join works whether or not dirname ends with a slash
        with open(os.path.join(dirname, filename), 'rb') as f:
            raw_files.append(json.load(f))
    
    return raw_files

def generate_clean_df(all_files):
    cleaned_files = []
    
    for file in tqdm(all_files):
        features = [
            file['paper_id'],
            file['metadata']['title'],
            format_authors(file['metadata']['authors']),
            format_authors(file['metadata']['authors'], 
                           with_affiliation=True),
            format_body(file['abstract']),
            format_body(file['body_text']),
            format_bib(file['bib_entries']),
            file['metadata']['authors'],
            file['bib_entries']
        ]

        cleaned_files.append(features)

    col_names = ['paper_id', 'title', 'authors',
                 'affiliations', 'abstract', 'text', 
                 'bibliography','raw_authors','raw_bibliography']

    clean_df = pd.DataFrame(cleaned_files, columns=col_names)
    
    return clean_df

col_names = [
              'paper_id', 
              'title', 
              'authors',
              'affiliations', 
              'abstract', 
              'text', 
              'bibliography',
              'raw_authors',
              'raw_bibliography'    
            ]

clean_df = pd.DataFrame(cleaned_files, columns=col_names)
clean_df.head()

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography
0,cb2eb2185d80dfe7e3a801a33e997ad4536647d8,Core outcome set for studies of pregnancy affe...,"Siang Ing Lee, Kelly-Ann Eastwood, Ngawai Moss...","Siang Ing Lee (University of Birmingham, Birmi...",,"\n\nI, the Submitting Author has the right to ...","Clinical knowledge summaries: Multimorbidity, ...","[{'first': 'Siang', 'middle': ['Ing'], 'last':...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Clinica..."
1,3977edc885f15be3d771c67a57720ecc2d8dbab5,and Public Health Article Dengue Outbreaks in ...,"Shin-Yueh Liu, Tsair-Wei Chien, Ting-Ya Yang, ...","Shin-Yueh Liu, Tsair-Wei Chien, Ting-Ya Yang, ...",,Introduction\n\nDengue is a mosquito-borne vir...,"A screening tool for dengue fever in children,...","[{'first': 'Shin-Yueh', 'middle': [], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'A scree..."
2,0a4f344f96d5a21e3d0f33e199983738c37a1631,The Impact of Human Mobility on HIV Transmissi...,"Augustino Isdory, Eunice W Mureithi, David J T...","Augustino Isdory (University of Dar es Salaam,...",Abstract\n\nDisease spreads as a result of peo...,Introduction\n\nSince the emergence of HIV/AID...,Transmission Dynamics and control of Severe Ac...,"[{'first': 'Augustino', 'middle': [], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Transmi..."
3,a2c1521c8de153490e2a4c18cdc4b078c4fe8061,,"Stelvio Tonello, Manuela Rizzi, Erica Matino, ...",Stelvio Tonello (Università del Piemonte Orien...,,"Introduction\n\nSince the end of 2019, the wor...",COVID-19 diagnosis -a review of current method...,"[{'first': 'Stelvio', 'middle': [], 'last': 'T...","{'BIBREF0': {'ref_id': 'b0', 'title': 'COVID-1..."
4,8a5e42c2cb71818b76fa80c3871f7d715ebc3060,Letter to the Editor Analysis of Continuous Bl...,"Adrian H Heald, Mike Stedman, Linda Horne, Rus...","Adrian H Heald (University of Manchester, UK),...","Abstract\n\nSince its appearance in 2019, the ...",Letter to the Editor\n\nSince its appearance i...,Global Forum on Universal Health Coverage and ...,"[{'first': 'Adrian', 'middle': ['H'], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Global ..."


In [None]:
clean_df.to_csv('bio_clean_v1.csv', index=False)

## Text Processing

In [None]:
bio_df = pd.read_csv('/content/drive/MyDrive/COVID19_KG/bio_clean_v1.csv')
bio_lst = bio_df['abstract'].astype(str).to_list()
print(bio_lst[5])

Abstract

Carbon emissions have emerged as an alarming and complex issue causing a long-lasting debate over climate change in the construction, building, and industrial sectors. There is tremendous growth in the construction and building industry, especially in low-middle-income developing countries, that involves rising production and consumption of cement and energy. As such, a growing amount of carbon emissions is becoming a serious challenge for developing economies. This study has assessed the driving factors that influence the critical levels of carbon emissions by employing Kaya identity and logarithmic mean Divisia index (LMDI) decomposition models in the growing cement manufacturing sector of a low-medium developing county, Pakistan, from 2005 to 2020. The results portrayed a typical trend of carbon emissions which are summarized as follows: (a) From 2006 to 2010, a slight increase is shown; (b) a slight decrease in the trend during 2011-2013; (c) from 2014 to 2018, there is a

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import sqlite3
import dill as pickle
import string

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

def remove_url(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

def remove_punct(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

def decontraction(text):
    # Longer patterns run before their prefixes (e.g. "won't've" before "won't");
    # otherwise the shorter rule consumes the match first.
    text = re.sub(r"won\'t\'ve", " will not have", text)
    text = re.sub(r"won\'t", " will not", text)
    text = re.sub(r"can\'t\'ve", " can not have", text)
    text = re.sub(r"can\'t", " can not", text)
    text = re.sub(r"don\'t", " do not", text)
    
    text = re.sub(r"ma\'am", " madam", text)
    text = re.sub(r"let\'s", " let us", text)
    text = re.sub(r"ain\'t", " am not", text)
    text = re.sub(r"sha\'n\'t", " shall not", text)
    text = re.sub(r"shan\'t", " shall not", text)
    text = re.sub(r"o\'clock", " of the clock", text)
    text = re.sub(r"y\'all", " you all", text)

    text = re.sub(r"n\'t\'ve", " not have", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d\'ve", " would have", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll\'ve", " will have", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    text = re.sub(r"\s+", " ", text)  # collapse repeated whitespace
    return text 

def seperate_alphanumeric(text):
    words = text
    words = re.findall(r"[^\W\d_]+|\d+", words)
    return " ".join(words)

def clean_text(text):
    '''
    Make text lowercase,
    remove HTML tags and text in square brackets,
    remove links,
    remove punctuation
    and remove words containing numbers.
    '''
    text = str(text).lower()
    text = re.sub(r"<[^>]*>", "", text)  # remove HTML tags
    text = re.sub(r'\[.*?\]', ' ', text)  # remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # remove links
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\w*\d\w*', ' ', text)  # remove words containing digits
    text = re.sub(r'\W+', ' ', text)  # collapse remaining non-word characters
    return text 

biorx_df=bio_df.copy()
# drop papers with any missing field (e.g. no abstract or no title)
biorx_df=biorx_df.dropna()
#for col in ['title','authors','abstract','text']:
for col in ['abstract']:
    biorx_df[col]=biorx_df[col].apply(lambda x : remove_url(str(x)))
    biorx_df[col]=biorx_df[col].apply(lambda x : remove_punct(str(x)))
    # note: remove_punct has already stripped apostrophes, so at this point
    # decontraction only collapses repeated whitespace
    biorx_df[col]=biorx_df[col].apply(lambda x : decontraction(str(x)))
    biorx_df[col]=biorx_df[col].apply(lambda x : seperate_alphanumeric(str(x)))
    biorx_df[col]=biorx_df[col].apply(lambda x: clean_text(str(x)))

# strip the leading "abstract" heading (8 characters) that format_body prepends
biorx_df['abstract']=biorx_df['abstract'].apply(lambda x: str(x)[8:])
biorx_df[['title','authors','abstract','text']].head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Unnamed: 0,title,authors,abstract,text
0,Extinction and stationary distribution of a st...,"Yuncheng Xu, Xiaojun Sun, Hua Hu, B Hua Hu",by taking full consideration of contact heter...,Introduction\n\nInfectious diseases bring huma...
1,The Effect of a Mindfulness-Based Education Pr...,"Mijung Jung, Mikyoung Lee, Ian Walsh",citation jung m lee m the effect of a mindful...,Introduction\n\nUniversity students experience...
2,PROTOCOL FOR A CANADIAN POPULATION-BASED REGIS...,"Corinne M Hohl, Rhonda J Rosychuk, Andrew Mcra...",foundation and the fondation du chu de québec...,INTRODUCTION (2478)\n\nCoronavirus Disease 201...
3,Locked nucleic acid (LNA): High affinity targe...,"Sakari Kauppinen, Birte Vester, Jesper Wengel",locked nucleic acid lna is a nucleic acid ana...,Introduction\n\nThe future challenges in diagn...
5,Environmental Science and Pollution Research D...,"Rizwan Rasheed, · Fizza Tahir, · Muhammad Afza...",carbon emissions have emerged as an alarming ...,Introduction\n\nClimate change is arguably the...
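
The order of the cleaning steps matters: the loop above removes punctuation before expanding contractions, so the apostrophes that the contraction rules look for are already gone. The toy sketch below illustrates the effect with simplified, hypothetical stand-ins for the notebook's helpers (`strip_punct` and `expand_negation` are not the functions defined above, and the sample sentence is made up).

```python
import re
import string


def strip_punct(text):
    # remove every ASCII punctuation character, apostrophes included
    return text.translate(str.maketrans("", "", string.punctuation))


def expand_negation(text):
    # a single illustrative rule: "n't" -> " not"
    return re.sub(r"n't", " not", text)


sample = "The assay didn't detect RNA."
punct_first = expand_negation(strip_punct(sample))
expand_first = strip_punct(expand_negation(sample))
print(punct_first)   # The assay didnt detect RNA
print(expand_first)  # The assay did not detect RNA
```

Scientific abstracts contain few contractions, so the practical impact on this corpus is probably small. Note also that expanding first would turn possessives such as "patient's" into "patient is" under the `'s` rule, which is a trade-off rather than a clear improvement.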


In [None]:
title = biorx_df['title']
author = biorx_df['authors']
abstract = biorx_df['abstract']
body = biorx_df['text']

## Embedding with TF-IDF 

In [None]:
# Build the stop-word set and the lemmatizer once instead of on every call
nltk_stop_words = set(nltk.corpus.stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokenizer = nltk.RegexpTokenizer(r'[A-Za-z]+')
    # lowercase before the stop-word check so capitalised stop words are removed too
    tokens = [token.lower() for token in tokenizer.tokenize(text)]
    tokens = [token for token in tokens if token not in nltk_stop_words]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    return tokens

# min_df=40 keeps only terms that occur in at least 40 abstracts
vectorizer = TfidfVectorizer(analyzer=preprocess_text, min_df=40)
document_tf_idf_fit = vectorizer.fit(abstract)
document_tf_idf = vectorizer.transform(abstract)
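
For intuition about the weights that `TfidfVectorizer` produces, the sketch below computes them by hand for three toy documents. It assumes the library defaults (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalisation) and leaves out `min_df`; the helper names and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(tokenised_docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(tokenised_docs)
    # document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in tokenised_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = []
    for doc in tokenised_docs:
        weights = {term: count * idf[term] for term, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors


def cosine(u, v):
    # both vectors are unit length, so the dot product is the cosine similarity
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


docs = [["long", "covid", "symptom"], ["covid", "vaccine"], ["cement", "emission"]]
vectors = tfidf_vectors(docs)
query = vectors[0]  # reuse document 0 as the query, purely for illustration
scores = [cosine(query, v) for v in vectors]
best_other = max(range(1, len(docs)), key=lambda i: scores[i])
print(best_other)  # 1
```

Document 1 shares the term "covid" with the query and therefore scores above zero, while document 2 shares no term and scores exactly zero. This is the main limitation of TF-IDF retrieval: relevance depends entirely on vocabulary overlap.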

In [None]:
feature_names = vectorizer.get_feature_names_out()
dense = document_tf_idf.todense().tolist()
tfidf = pd.DataFrame(dense, columns=feature_names)
tfidf.head()



Unnamed: 0,aa,ab,abbreviation,abdominal,ability,able,abnormal,abnormality,absence,absent,...,youth,z,zero,zhang,zika,zinc,zone,zoonotic,zu,zur
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.051615,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
with open('vectorizer.pickle', 'wb') as file:
    pickle.dump(vectorizer, file)
    
with open('document_tf_idf.pickle', 'wb') as file:
    pickle.dump(document_tf_idf, file)

In [None]:
with open('vectorizer.pickle', 'rb') as file:
    vectorizer = pickle.load(file)
        
with open('document_tf_idf.pickle', 'rb') as file:
    document_tf_idf = pickle.load(file)

## Implement Search Engine using TF-IDF 
Enter a question about COVID-19; the search returns the related articles.

In [None]:
query = ['what is long covid?']

In [None]:
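# Reuse the vectorizer and document matrix fitted on the abstracts above;
# refitting on other text would make query vectors incompatible with document_tf_idf.
document_weight = 1

def search(query, threshold=0.5):
  query_tf_idf = vectorizer.transform(query)
  # cosine_similarity returns shape (n_documents, n_queries); keep the first query
  scores = document_weight * cosine_similarity(document_tf_idf, query_tf_idf)[:, 0]
  print(scores.argmax())
  # positional indices of the abstracts whose score exceeds the threshold
  hits = np.flatnonzero(scores > threshold)
  return biorx_df.iloc[hits]
  
results = search(query)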

3078


In [None]:
biorx_df.iloc[3078]['abstract']

' people who have covid can experience symptoms for months studies on long covid in the population lack representative samples and longitudinal data focusing on newonset symptoms occurring with covid while accounting for preinfection symptoms we use a sample representing the us community population from the understanding america study covid survey which surveyed around respondents biweekly from march to march our nal sample includes infected individuals who were interviewed one month before around the time of and weeks after infection about of the sample experienced newonset symptoms during infection which lasted for more than weeks and thus can be considered as having long covid the most common persistent newonset symptoms among those included in the study were headache runny or stuffy nose abdominal discomfort fatigue and diarrhea long covid was more likely among obese individuals or p and those who experienced hair loss or p headache or p and sore throat or p during infection risk w

## Embedding with BioBERT 

In [None]:
model = SentenceTransformer('monologg/biobert_v1.1_pubmed')

Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/monologg_biobert_v1.1_pubmed were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
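
As far as I can tell, loading a plain BERT checkpoint such as this one makes `SentenceTransformer` fall back to averaging the token embeddings (mean pooling) to obtain one vector per text; this is an assumption about the library default, not something configured in this notebook. The sketch below shows mean pooling on toy numbers. The function `mean_pool` and the 2-dimensional vectors are made up for illustration.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# toy 2-dimensional "token embeddings"; the last row stands in for padding
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

The padded position is ignored, so the result depends only on the real tokens. Since BioBERT was not trained specifically for sentence similarity, pooled vectors of different texts tend to lie close together, which may be why the similarity scores further down all fall in a narrow band around 0.8.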


In [None]:
if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))

In [None]:
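def clean_sent(text):
    text = re.sub(r"<[^>]*>", "", text)  # remove HTML tags
    # keep letters, digits, underscores, whitespace and sentence-ending punctuation
    text = re.sub(r'[^a-zA-Z0-9_\s.?!]', '', text)
    text = re.sub(r'\n', ' ', text)
    return text 
    
biorx_df=bio_df.copy()
biorx_df=biorx_df.dropna()

# Unlike the TF-IDF pipeline, casing and sentence punctuation are kept for BioBERT.
# The cleaning is applied to bio_df, so rows without an abstract are kept and
# end up as empty strings once the 8-character "Abstract" prefix is stripped.
col='abstract'
bio_df[col]=bio_df[col].apply(lambda x: clean_sent(str(x)))
bio_df[col]=bio_df[col].apply(lambda x: str(x)[8:])
bio_df[col].head()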

0      By taking full consideration of contact hete...
1      Citation Jung M. Lee M. The Effect of a Mind...
2      Foundation 5357 and the Fondation du CHU de ...
3      Locked nucleic acid LNA is a nucleic acid an...
4                                                     
Name: abstract, dtype: object

In [None]:
# encode the first 2,500 abstracts; pass a plain list of strings to encode()
abstract = bio_df['abstract'][:2500].tolist()
document_embeddings = model.encode(abstract, show_progress_bar=True) 

Batches:   0%|          | 0/79 [00:00<?, ?it/s]

In [None]:
with open('document_biobert.pickle', 'wb') as file :
    pickle.dump(document_embeddings, file)

In [None]:
with open('document_biobert.pickle', 'rb') as file:
    document_embeddings = pickle.load(file)

## Implement Search Engine using Transformer
Return top-ranked documents that are most related to the supplied query.

In [None]:
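import faiss
from sklearn.preprocessing import normalize

num_vectors = document_embeddings.shape[0]
dimension = document_embeddings.shape[1]
num_neighbours = 100
document_weight = 1 

# inner product on L2-normalised vectors equals cosine similarity
document_index = faiss.IndexFlatIP(dimension)
document_index.add(normalize(document_embeddings, norm='l2'))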
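
The index above stores L2-normalised embeddings and compares them by inner product, which makes the score a cosine similarity that ignores vector length. The pure-Python sketch below reproduces that idea on toy vectors with an exhaustive top-k search. The helpers `l2_normalise` and `top_k` are hypothetical and stand in for `normalize` and the index search only conceptually.

```python
import heapq
import math


def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, documents, k):
    """Return (index, score) pairs for the k largest normalised inner products."""
    q = l2_normalise(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, l2_normalise(doc))))
        for i, doc in enumerate(documents)
    ]
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


docs = [[1.0, 0.0], [10.0, 10.0], [-1.0, 0.5]]
hits = top_k([2.0, 2.0], docs, k=2)
print(hits)
```

Document 1 points in the same direction as the query and ranks first with a score of about 1, even though it is five times longer than the query vector. Document 0 comes second and document 2, which points away from the query, is dropped.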

In [None]:
query = ['what is long covid?']
query_embedding = model.encode(query)
query_embedding_normalized = normalize(query_embedding, norm='l2')

In [None]:
# With an inner-product index over L2-normalised vectors, the returned
# "distances" are cosine similarities (higher means more similar).
document_distances, document_indices = document_index.search(query_embedding_normalized, num_neighbours)
# unique ids of the retrieved papers
papers = list(set(document_indices[0]))

papers_dict = {}
for paper in papers:
    if paper in document_indices[0]:
        index = np.where(document_indices[0] == paper)
        document_distance = document_distances[0][index][0]
    else:
        # fallback for ids outside the retrieved set: score against the query directly
        document_distance = cosine_similarity(document_embeddings[paper].reshape(1, -1), query_embedding)[0][0]

    print(paper, document_distance)

    papers_dict[paper] = document_weight * document_distance


1018 0.80913514
1544 0.8116505
1545 0.8147518
1037 0.80130905
2318 0.8084621
1809 0.8157422
785 0.8004762
1303 0.80328774
1053 0.8055978
30 0.80641973
543 0.826722
804 0.7999857
549 0.8099907
1829 0.8032053
1065 0.82132125
43 0.81807554
1838 0.7998309
815 0.8081739
50 0.80560756
1846 0.8104883
1593 0.80738777
2364 0.8217234
1852 0.80660206
1602 0.80068046
1350 0.7990281
2119 0.80271894
839 0.8018321
840 0.811577
1864 0.804547
588 0.8083928
337 0.79909015
1111 0.8012934
2394 0.8007143
2144 0.8160878
2149 0.8113272
1126 0.81172717
2156 0.8143844
112 0.80091757
1139 0.8096751
1142 0.81311107
1399 0.8208518
378 0.79888505
1663 0.80695635
1409 0.82232493
1154 0.80265635
1157 0.83915657
647 0.80398977
1418 0.79885954
1675 0.8162248
1682 0.79976535
402 0.80731475
916 0.80460835
920 0.80118084
1695 0.8028025
671 0.8014089
1696 0.7993308
1951 0.8501102
1189 0.81982875
933 0.8067859
1958 0.80126554
1960 0.8178854
2473 0.81618315
429 0.82405514
2481 0.83144134
2482 0.80328816
2227 0.80514574
1972