<a href="https://colab.research.google.com/github/ravadhani/NLP/blob/main/Word2Vec_SearchEngine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Problem Statement:**

To build a search engine, which will take a query and does a search to give most relevant research papers in this case.

**Objective:**

We are going to build a search engine based on semantic similarity of the query with the abstract of research papers. And then we are going to give top n number of research papers that are relevant based on our query.

In [27]:
!pip install -q langdetect
!gdown 1VkNpuudQnlj7g5uUCNPJ4MKxFdDdh7bZ

Downloading...
From (original): https://drive.google.com/uc?id=1VkNpuudQnlj7g5uUCNPJ4MKxFdDdh7bZ
From (redirected): https://drive.google.com/uc?id=1VkNpuudQnlj7g5uUCNPJ4MKxFdDdh7bZ&confirm=t&uuid=fee76d17-a759-4353-8c98-bbbc81f63fba
To: /content/metadata.csv
100% 1.65G/1.65G [00:26<00:00, 63.1MB/s]


In [28]:
import spacy
import string
import warnings

import numpy as np
import pandas as pd

from pprint import pprint
from IPython.utils import io
from tqdm.notebook import tqdm
from gensim.models import Word2Vec
from langdetect import DetectorFactory, detect
from IPython.core.display import HTML, display
from IPython.display import Image
from spacy.lang.en.stop_words import STOP_WORDS

warnings.filterwarnings('ignore')
tqdm.pandas()


In [29]:
#loading the dataset

df = pd.read_csv("metadata.csv").sample(100000)  #taking only sample set of cases
df.reset_index(inplace=True, drop=True)
df.head(5)


Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id
0,t6nmo2s4,,WHO,Are Schools' Lockdown Drills Really Beneficial...,,,,unk,,2021,"Saggers, Beth; Campbell, Marilyn A; Kelly, Adr...",J Sch Health,,#1236397,,,,,233301026.0
1,401kxcmi,,WHO,Asymptomatic transmission and the infection fa...,,,,unk,Asymptomatic infection occurs for numerous res...,2020,"Vermund, Sten H; Pitzer, Virginia E",Clin. infect. dis,,#614249,,,,,220073054.0
2,j7ra25mo,,WHO,SARS-Cov-2 variants of concern decelerate the ...,,,,unk,,2022,"Yang, Z. W.; Han, Y.; Ding, S. L.; Finzi, A.; ...",Biophysical Journal,,#covidwho-1756157,,,,,246750947.0
3,twrhrrgi,875c2d653809b449b4850eaaec896ce256be04a0,PMC,Acute respiratory distress syndrome caused by ...,10.1186/s13256-021-03023-w,PMC8439957,34521457.0,cc-by,BACKGROUND: Inhalation injury from smoke or ch...,2021-09-15,"Jang, Ji Hoon; Jang, Hang Jea; Kim, Hyun-Kuk; ...",J Med Case Rep,,,,document_parses/pdf_json/875c2d653809b449b4850...,document_parses/pmc_json/PMC8439957.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8...,
4,2tyye3vt,a90e5d56b7ef93544d8992d29496d352923a5c8e,Medline; PMC,CD47 as a Potential Target to Therapy for Infe...,10.3390/antib9030044,PMC7551396,32882841.0,cc-by,The integrin associated protein (CD47) is a wi...,2020-09-01,"Cham, Lamin B.; Adomati, Tom; Li, Fanghui; Ali...",Antibodies (Basel),,,,document_parses/pdf_json/a90e5d56b7ef93544d899...,document_parses/pmc_json/PMC7551396.xml.json,https://www.ncbi.nlm.nih.gov/pubmed/32882841/;...,221496056.0


In [30]:
print(df.columns)

Index(['cord_uid', 'sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id',
       'license', 'abstract', 'publish_time', 'authors', 'journal', 'mag_id',
       'who_covidence_id', 'arxiv_id', 'pdf_json_files', 'pmc_json_files',
       'url', 's2_id'],
      dtype='object')


If we check the columns representing the data, we can see that we can use the "abstract" column as it gives a concise understanding of the publication's research and findings

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   cord_uid          100000 non-null  object 
 1   sha               35229 non-null   object 
 2   source_x          100000 non-null  object 
 3   title             99953 non-null   object 
 4   doi               62164 non-null   object 
 5   pmcid             36747 non-null   object 
 6   pubmed_id         47325 non-null   object 
 7   license           100000 non-null  object 
 8   abstract          77774 non-null   object 
 9   publish_time      99816 non-null   object 
 10  authors           97757 non-null   object 
 11  journal           91722 non-null   object 
 12  mag_id            0 non-null       float64
 13  who_covidence_id  45658 non-null   object 
 14  arxiv_id          1345 non-null    object 
 15  pdf_json_files    35229 non-null   object 
 16  pmc_json_files    297

Will consider only features used to build the semantic search engine

In [32]:
df_covid = pd.DataFrame(columns=['paper_id', 'title', 'abstract', 'doi'])
df_covid['paper_id'] = df.sha
df_covid['title'] = df.title
df_covid['abstract'] = df.abstract
df_covid['doi'] = df.doi

df_covid.head()


Unnamed: 0,paper_id,title,abstract,doi
0,,Are Schools' Lockdown Drills Really Beneficial...,,
1,,Asymptomatic transmission and the infection fa...,Asymptomatic infection occurs for numerous res...,
2,,SARS-Cov-2 variants of concern decelerate the ...,,
3,875c2d653809b449b4850eaaec896ce256be04a0,Acute respiratory distress syndrome caused by ...,BACKGROUND: Inhalation injury from smoke or ch...,10.1186/s13256-021-03023-w
4,a90e5d56b7ef93544d8992d29496d352923a5c8e,CD47 as a Potential Target to Therapy for Infe...,The integrin associated protein (CD47) is a wi...,10.3390/antib9030044


In [33]:
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   paper_id  35229 non-null  object
 1   title     99953 non-null  object
 2   abstract  77774 non-null  object
 3   doi       62164 non-null  object
dtypes: object(4)
memory usage: 3.1+ MB


Looking for null values in the dataframe

In [34]:
df_covid.isnull().sum()/len(df_covid)*100

paper_id    64.771
title        0.047
abstract    22.226
doi         37.836
dtype: float64

Dropping the duplicates and null values from the dataframe

In [35]:
df_covid.drop_duplicates(['abstract'], inplace=True)
df_covid.dropna(inplace=True)
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29475 entries, 3 to 99998
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   paper_id  29475 non-null  object
 1   title     29475 non-null  object
 2   abstract  29475 non-null  object
 3   doi       29475 non-null  object
dtypes: object(4)
memory usage: 1.1+ MB


Our focus in this project is on English papers. So we will drop anything that is not English

In [36]:
#set seed
DetectorFactory.seed = 0

#hold label - language
languages = []

#loop through each text
for i in tqdm(range(0, len(df_covid))):
  text = df_covid.iloc[i]['abstract'].split(" ")

  lang = "en"
  try:
    if len(text) > 50:
      lang = detect(" ".join(text[:50]))
    elif len(text) > 0:
      lang = detect(" ".join(text[:len(text)]))
  except Exception as e:
    all_words = set(text)
    try:
      lang = detect(" ".join(all_words))
    except Exception as e:
      lang = "unknown"
      pass

  #Appending to language label
  languages.append(lang)


  0%|          | 0/29475 [00:00<?, ?it/s]

Count the number of articles for each language

In [37]:
languages_dict = {}
for lang in set(languages):
  languages_dict[lang] = languages.count(lang)

print("Total: {}\n".format(len(languages)))
print(languages_dict)

Total: 29475

{'fr': 57, 'unknown': 1, 'nl': 6, 'pt': 2, 'ro': 1, 'cy': 1, 'it': 4, 'de': 51, 'en': 29295, 'es': 57}


As most of the papers are in English, we can drop the rest.

In [38]:
df_covid['language'] = languages
df_covid = df_covid[df_covid['language'] == 'en']
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29295 entries, 3 to 99998
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   paper_id  29295 non-null  object
 1   title     29295 non-null  object
 2   abstract  29295 non-null  object
 3   doi       29295 non-null  object
 4   language  29295 non-null  object
dtypes: object(5)
memory usage: 1.3+ MB


Now that we filtered the papers by English language, we can drop the language column as it is of no use further.

In [39]:
df_covid = df_covid.drop(['language'], axis = 1)
df_covid.head()

Unnamed: 0,paper_id,title,abstract,doi
3,875c2d653809b449b4850eaaec896ce256be04a0,Acute respiratory distress syndrome caused by ...,BACKGROUND: Inhalation injury from smoke or ch...,10.1186/s13256-021-03023-w
4,a90e5d56b7ef93544d8992d29496d352923a5c8e,CD47 as a Potential Target to Therapy for Infe...,The integrin associated protein (CD47) is a wi...,10.3390/antib9030044
8,87f975153a9bb242b2dfbb614dcd5acd88ec2f8c; 09c5...,Genome-wide characterization of SARS-CoV-2 cyt...,Therapeutic inhibition of critical viral funct...,10.1101/2021.11.23.469747
15,ea42a7a8beaa91069cdac3bdd631e8c3c7aa898a; af49...,Common value: transferring development rights ...,In 2019 floods made up 49 % of disasters and 4...,10.1016/j.envsci.2020.08.017
16,edd2de4f22adac3771e9910e8fa59ddb491bf57c,Knee Pathology before and after SARS-CoV-2 Pan...,SIMPLE SUMMARY: The SARS-CoV-2 pandemic drasti...,10.3390/healthcare9101311


#Text Preprocessing

Use text pre-processing techniques to remove punctuations and stop words

In [40]:
punctuations = string.punctuation
stopwords = list(STOP_WORDS)

Creating custom list of stop words based on our corpus

In [41]:
custom_stop_words = [
    'doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org', 'https', 'et', 'al', 'author', 'figure',
    'rights', 'reserved', 'permission', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.',
    'al.', 'Elsevier', 'PMC', 'CZI', 'www'
]

for w in custom_stop_words:
  if w not in stopwords:
    stopwords.append(w)

Cleaning the text in column - 'abstracts'

In [42]:
def pre_processor(sentence):
  mytokens = sentence.split(' ')
  mytokens = [word.lower() for word in mytokens if word not in stopwords and word not in punctuations]
  mytokens = " ".join([token for token in mytokens])
  return mytokens

df_covid['processed_abstract'] = df_covid['abstract'].progress_apply(pre_processor)

  0%|          | 0/29295 [00:00<?, ?it/s]

In [43]:
df_covid.head()

Unnamed: 0,paper_id,title,abstract,doi,processed_abstract
3,875c2d653809b449b4850eaaec896ce256be04a0,Acute respiratory distress syndrome caused by ...,BACKGROUND: Inhalation injury from smoke or ch...,10.1186/s13256-021-03023-w,background: inhalation injury smoke chemical p...
4,a90e5d56b7ef93544d8992d29496d352923a5c8e,CD47 as a Potential Target to Therapy for Infe...,The integrin associated protein (CD47) is a wi...,10.3390/antib9030044,the integrin associated protein (cd47) widely ...
8,87f975153a9bb242b2dfbb614dcd5acd88ec2f8c; 09c5...,Genome-wide characterization of SARS-CoV-2 cyt...,Therapeutic inhibition of critical viral funct...,10.1101/2021.11.23.469747,therapeutic inhibition critical viral function...
15,ea42a7a8beaa91069cdac3bdd631e8c3c7aa898a; af49...,Common value: transferring development rights ...,In 2019 floods made up 49 % of disasters and 4...,10.1016/j.envsci.2020.08.017,in 2019 floods 49 disasters 43 disaster relate...
16,edd2de4f22adac3771e9910e8fa59ddb491bf57c,Knee Pathology before and after SARS-CoV-2 Pan...,SIMPLE SUMMARY: The SARS-CoV-2 pandemic drasti...,10.3390/healthcare9101311,simple summary: the sars-cov-2 pandemic drasti...


To train the Word2Vec model, we have to convert the sentences into list of words

In [44]:
abstracts = df_covid['processed_abstract'].values

#spacy can we used for faster tokenization
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'ner'])
nlp.add_pipe('sentencizer')

def tokenize_sentences(sentence):
  sentence_corpus = []
  doc = nlp(sentence)
  sentences = [sent.text.strip() for sent in doc.sents]
  for sent in sentences:
    processed_sent_list = sent.split(" ")
    sentence_corpus.append(processed_sent_list)
  return sentence_corpus

df_covid['tokenized_abstract'] = df_covid['processed_abstract'].progress_apply(lambda x: tokenize_sentences(x))

corpus_data = df_covid['tokenized_abstract'].to_list()
word2vec_corpus = [item for items in corpus_data for item in items]

  0%|          | 0/29295 [00:00<?, ?it/s]

We are using Gensim's API to train the Word2Vec.
Using skip-gram here instead of CBOW

In [45]:
model = Word2Vec(word2vec_corpus, min_count=3, vector_size=100, workers=4, window=5, sg=1, negative=5)



*   We next calculate the centroid of each 'abstract' from the word vectors of our corpus
*   after finding the centroid, we calculate the cosine distance between the centroid and the query to find the similarity.



In [67]:
#calculating the centroid of each "abstract"

a = [0.0] * 100
df_covid['centroid'] = [a]*df_covid.shape[0]

for index, row in df_covid.iterrows():
  abstract = row['processed_abstract']
  total_sim = 0
  words = abstract.split(" ")
  centroid = np.array([0.0]*100)
  for word in words:
    try:
      b = model[word]
    except:
      continue
    centroid = np.add(centroid, b)

  df_covid.at[index, 'centroid'] = centroid.tolist()

Function that takes given query as input and retrieves documents and ranks them based on similarity.

In [74]:
def rank_docs(model, query, df_covid, num):

  cosine_list = []

  a = []
  query = query.split(" ")
  for q in query:
    try:
      a.append(model[q])
    except:
      continue

  for index, row in df_covid.iterrows():
    # print("5")
    # print(index)
    # print(row)
    # print(row['centroid'])
    centroid = row['centroid']
    # print("6")
    total_sim = 0
    for a_i in a:
      cos_sim = np.dot(a_i, centroid)/(np.linalg.norm(a_i)*np.linalg.norm(centroid))
      total_sim += cos_sim

    cosine_list.append((row['title'], row['doi'], total_sim))

  cosine_list.sort(key=lambda x:x[2], reverse=True) #ordering in descending order

  papers_list = []
  for item in cosine_list[:num]:
    papers_list.append((item[0], item[1], item[2]))

  return papers_list

Function to retrieve top matching documents

In [77]:
#this method sends the query and number of top matching papers we are looking for
def query(query, top_matches=10):
  model_to_use = model
  df_covid_to_use = df_covid
  return rank_docs(model_to_use, query, df_covid_to_use, top_matches)
  print("end")


Calling on our query

In [78]:
query('origin of corona virus')

[('Acute respiratory distress syndrome caused by carbon monoxide poisoning and inhalation injury recovered after extracorporeal membrane oxygenation along with direct hemoperfusion with polymyxin B-immobilized fiber column: a case report',
  '10.1186/s13256-021-03023-w',
  0),
 ('CD47 as a Potential Target to Therapy for Infectious Diseases',
  '10.3390/antib9030044',
  0),
 ('Genome-wide characterization of SARS-CoV-2 cytopathogenic proteins in the search of antiviral targets',
  '10.1101/2021.11.23.469747',
  0),
 ('Common value: transferring development rights to make room for water',
  '10.1016/j.envsci.2020.08.017',
  0),
 ('Knee Pathology before and after SARS-CoV-2 Pandemic: An Analysis of 1139 Patients',
  '10.3390/healthcare9101311',
  0),
 ('Positive impact of oral hydroxychloroquine and povidone-iodine throat spray for COVID-19 prophylaxis: an open-label randomized trial',
  '10.1016/j.ijid.2021.04.035',
  0),
 ('The role of respiratory viruses in the etiology of bacterial p

**Summary:**



*   We used continuous text representation to capture syntactic and semantic similarity
*   Basic way of doing it is using SVD, but SVD sufferes froms caling. So iterative methods like Word2Vec are used.
*   **Word2Vec** models the data as multi class problem and learns the word representations
*   There are two Word2Vec architechtures
    1.   CBoW
    2.   Skip-Gram **bold text**


*   Training methods are two types. **Negative Sampleing **and** Hierarchical Softmax**. We used Negative sampling here
*   **Gensim** library is used in this project.







