### Install sentence transformers library

In [None]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/6a/e2/84d6acfcee2d83164149778a33b6bdd1a74e1bcb59b2b2cd1b861359b339/sentence-transformers-0.4.1.2.tar.gz (64kB)
Collecting transformers<5.0.0,>=3.1.0
  Downloading https://files.pythonhosted.org/packages/88/b1/41130a228dd656a1a31ba281598a968320283f48d42782845f6ba567f00b/transformers-4.2.2-py3-none-any.whl (1.8MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/14/67/e42bd1181472c95c8cda79305df848264f2a7f62740995a46945d9797b67/sentencepiece-0.1.95-cp36-cp36m-manylinux2014_x86_64.whl (1.2MB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m

### Useful imports

In [None]:
import json,glob,nltk,copy,torch,time
from scipy import spatial
from queue import PriorityQueue
from sentence_transformers import SentenceTransformer,util
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Retrieve dataset

In [None]:
!wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2020-03-13.tar.gz
!tar -xf cord-19_2020-03-13.tar.gz
!tar -xf 2020-03-13/comm_use_subset.tar.gz

--2021-02-04 10:44:53--  https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2020-03-13.tar.gz
Resolving ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com (ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com)... 52.218.153.153
Connecting to ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com (ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com)|52.218.153.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278921140 (266M) [application/x-tar]
Saving to: ‘cord-19_2020-03-13.tar.gz’


2021-02-04 10:44:59 (52.3 MB/s) - ‘cord-19_2020-03-13.tar.gz’ saved [278921140/278921140]



### Prepare the GPU (CUDA)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device available for running: ")
print(device)

Device available for running: 
cuda


### Read JSON files and store the title, abstract and text of each article in a list

In [None]:
data = []

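files = glob.glob('comm_use_subset/*')
number_of_articles = len(files)
# Never read more articles than actually exist, so the later range(bound) loops stay in range
bound = min(9000, number_of_articles)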

for single_file in files[0:bound]:
  with open(single_file, 'r') as f:
    json_file = json.load(f)
    
    # Retrieve title
    title = json_file["metadata"]["title"]

    # Retrieve abstracts
    abstracts = []
    if len(json_file["abstract"]) != 0 :
      for abstract in json_file["abstract"]:
        abstracts.append(abstract["text"])

    # Retrieve texts
    texts = []
    for text in json_file["body_text"]:
      texts.append(text["text"])

    data.append([title,abstracts,texts])

### Convert corpus to sentences

In [None]:
# For each article
for article in range(bound):

  # Title section
  data[article][0] = nltk.sent_tokenize(data[article][0])

  # Abstracts section 
  for abstract in range(len(data[article][1])): 
    data[article][1][abstract] = nltk.sent_tokenize(data[article][1][abstract])
   
  # Texts section
  for text in range(len(data[article][2])):
    data[article][2][text] = nltk.sent_tokenize(data[article][2][text])
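
As a minimal sketch of the nested layout this step produces, the toy record below shows how each article ends up as `[title_sentences, abstract_paragraphs, body_paragraphs]` and how a list of paragraphs can be flattened into one sentence list. The sentences are hypothetical and were split by hand rather than by `nltk.sent_tokenize`, and the helper name `flatten` is an assumption made for illustration.

```python
from itertools import chain

# Hypothetical article after tokenization: every paragraph is a list of sentences
article = [
    ["A toy title."],                                # title sentences
    [["First abstract sentence.", "Second one."]],   # abstract paragraphs
    [["Body sentence A.", "Body sentence B."],       # body paragraphs
     ["Body sentence C."]],
]

def flatten(paragraphs):
    """Flatten a list of paragraphs (lists of sentences) into one sentence list."""
    return list(chain.from_iterable(paragraphs))

abstract_sentences = flatten(article[1])
body_sentences = flatten(article[2])
print(len(abstract_sentences), len(body_sentences))
```

The flattened lists are the kind of input that a sentence encoder expects: one string per sentence, with paragraph boundaries removed.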

### Transform only each article's title and abstract sentences to embeddings with the second sentence transformer model, 'stsb-roberta-base'

In [None]:
second_start_time = time.time()

# Declare our second sentence transformer model and move it to the appropriate device
second_model = SentenceTransformer('stsb-roberta-base').to(device)

# For each article we keep the flattened abstract sentences and their embeddings,
# plus one entry of title embeddings
abstract_embeddings = [[] for i in range(bound)]
all_abstracts = [[] for i in range(bound)]
title_embeddings = []

# For each article
for article in range(bound):
  
  # Get the title and abstract of each article
  title_ = data[article][0]
  abstract_ = data[article][1]

  # Flatten the abstract's paragraphs into one list of sentences per article
  for abstract in abstract_:
    for sentence in abstract:
      all_abstracts[article].append(sentence)
  
  # Transform title to embeddings; keep an empty placeholder for articles without
  # a title so that title_embeddings stays index-aligned with abstract_embeddings
  if len(title_) != 0:
    title_embeddings.append(second_model.encode(title_,convert_to_tensor=True))
  else:
    title_embeddings.append([])

  if len(abstract_) != 0:
    # Transform abstract's sentences to embeddings 
    abstract_embeddings[article].append(second_model.encode(all_abstracts[article],convert_to_tensor=True))

    # Convert all sentence embeddings to a 2D pytorch tensor
    abstract_embeddings[article] = torch.cat(abstract_embeddings[article]) 
  
# Check elapsed time of second model
second_model_time = (time.time() - second_start_time)/60
print("Elapsed time: %s minutes" % (round(second_model_time,1)))

100%|██████████| 461M/461M [00:25<00:00, 18.1MB/s]


Elapsed time: 8.6 minutes


### Declare our queries and transform them to embeddings with our model

In [None]:
queries = ['What are the coronoviruses?','What was discovered in Wuhuan in December 2019?',
           'What is Coronovirus Disease 2019?','What is COVID-19?','What is caused by SARS-COV2?',
           'How is COVID-19 spread?','Where was COVID-19 discovered?','How does coronavirus spread?']

second_queries_embeddings = second_model.encode(queries,convert_to_tensor=True)

print("For 2nd model... Number Of Queries:",len(second_queries_embeddings)," Query Embedding's Length:",len(second_queries_embeddings[0]))

For 2nd model... Number Of Queries: 8  Query Embedding's Length: 768
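
The ranking in this notebook relies on cosine similarity between a query embedding and each sentence embedding. As a minimal sketch of what that score measures, the pure-Python example below uses toy 2-D vectors in place of the 768-dimensional embeddings; the function name `cosine_similarity` and the vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

query = [1.0, 0.0]
same_direction = [3.0, 0.0]   # longer vector, same direction
diagonal = [1.0, 1.0]         # 45 degrees away from the query
orthogonal = [0.0, 2.0]       # 90 degrees away from the query

print(round(cosine_similarity(query, same_direction), 2))  # 1.0
print(round(cosine_similarity(query, diagonal), 2))        # 0.71
print(round(cosine_similarity(query, orthogonal), 2))      # 0.0
```

The score depends only on direction, not on vector length, which is why it can be used to compare sentences of very different sizes.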


### Now we implement a function that searches the body texts of only the top 10% of articles, ranked by the cosine similarity of their abstracts and/or titles to the query

In [None]:
def cos_sim(v1,v2):
  """
  Function which calculates cosine similarity between two vectors.
  """
  res = 1 - spatial.distance.cosine(v1,v2)
  return float(round(res, 2))

def reduce_searching(query_embedding,titles,abstracts,query_sentence,data_,ratio,title_or_abstract):
  """
  Function which prints the closest body-text sentences (metric: cosine similarity) as answers,
  without searching all articles by brute force. We search only the top ratio share
  (e.g. 0.1 for 10%) of articles, ranked by the average cosine similarity of their title section,
  abstract section, or both of them.
  
  Variable title_or_abstract:
  Consider only title if -1
  Consider only abstract if 1
  Consider both of them if 0
  """

  # Declare our priority queues
  pq = PriorityQueue() 
  answer_pq = PriorityQueue()
  
  # For each article
  for index,article in enumerate(zip(titles,abstracts)):

    title_cos_sim = 0
    abstract_cos_sim = 0
    title = article[0]
    abstract = article[1]

    # For each sentence of title section 
    for sentence in title:
      title_cos_sim += cos_sim(query_embedding,sentence)
    # If title isn't empty calculate average title cosine similarity
    if len(title) != 0: 
      title_cos_sim = round(title_cos_sim/len(title),2)   

    # For each text of abstract section 
    for sentence in abstract:
      abstract_cos_sim += cos_sim(query_embedding,sentence)
    # If abstract isn't empty calculate average abstract cosine similarity
    if len(abstract) != 0:
      abstract_cos_sim = round(abstract_cos_sim/len(abstract),2)   

    # Calculate average cosine similarity (of title and abstract) 
    total_cos_sim = round(((title_cos_sim + abstract_cos_sim)/2),2)
        
    # Insert into priority queue pair of cosine similarity of abstract and article's index
    if title_or_abstract == 1:
      pq.put((abstract_cos_sim*(-1),index))
    # Insert into priority queue pair of cosine similarity of title and article's index
    elif title_or_abstract == -1:
      pq.put((title_cos_sim*(-1),index))
    # Insert into priority queue pair of average cosine similarity (of title and abstract) and article's index
    else:
      pq.put((total_cos_sim*(-1),index))

  # Number of articles whose body text we are going to search further
  search_articles = round(ratio*len(data_))

  # Declare empty lists to hold the body-text sentences and their embeddings
  body_text_embeddings = [[] for i in range(search_articles)]
  body_texts = [[] for i in range(search_articles)]

  # Search thoroughly the articles with the highest similarity on title and abstract section
  for i in range(search_articles):
    _,index = pq.get()
    body_text = data_[index][2]

    # Stack all body_text's sentences to a temporary list
    for text in body_text:
      for sentence in text:
        body_texts[i].append(sentence)
    
    # Transform sentences to embeddings 
    body_text_embeddings[i].append(second_model.encode(body_texts[i],convert_to_tensor=True))

    # Convert all sentence embeddings to a 2D pytorch tensor
    body_text_embeddings[i] = torch.cat(body_text_embeddings[i])

    # For each sentence embedding of body_text section 
    for sen_index,sentence in enumerate(body_text_embeddings[i]): 
      answer_pq.put((cos_sim(query_embedding,sentence)*(-1),(index,i,sen_index)))

  # Display results 
  print("Query:", query_sentence,"\n")

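  # Show at most four answers; PriorityQueue.get() would block on an empty queue
  for i in range(min(4, answer_pq.qsize())):
    neg_score,res = answer_pq.get()

    print("Answer",i+1,":",body_texts[res[1]][res[2]])
    print("From article with title:",end=" ")
    for text in data_[res[0]][0]:
      print(text,end =" ")

    print("\nCosine Similarity:", -neg_score,"\n")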
  print("--------------------------------------------------------------------------------------------------------------------\n")
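
The two-stage idea behind `reduce_searching` can also be sketched in a compact, library-free form. The example below is not the notebook's implementation: it swaps `PriorityQueue` for `heapq.nlargest`, ranks articles by abstract similarity only, and uses toy 2-D vectors instead of real embeddings. The names `cosine`, `two_stage_search` and all data are hypothetical.

```python
import heapq
import math

def cosine(v1, v2):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

def two_stage_search(query, abstract_vecs, body_vecs, ratio, top_k):
    """Rank articles by mean abstract similarity, then search only the
    body sentences of the best `ratio` share of articles."""
    # Stage 1: one score per article (0.0 if it has no abstract sentences)
    article_scores = []
    for index, sentences in enumerate(abstract_vecs):
        mean = sum(cosine(query, s) for s in sentences) / len(sentences) if sentences else 0.0
        article_scores.append((mean, index))
    keep = round(ratio * len(abstract_vecs))
    shortlisted = [index for _, index in heapq.nlargest(keep, article_scores)]

    # Stage 2: score every body sentence of the shortlisted articles only
    candidates = [
        (cosine(query, vec), index, sen_index)
        for index in shortlisted
        for sen_index, vec in enumerate(body_vecs[index])
    ]
    return heapq.nlargest(top_k, candidates)

query = [1.0, 0.0]
abstract_vecs = [
    [[1.0, 0.1]],              # article 0: abstract close to the query
    [[0.0, 1.0]],              # article 1: abstract unrelated to the query
    [[1.0, 1.0], [1.0, 0.0]],  # article 2: abstract fairly close on average
    [],                        # article 3: no abstract
]
body_vecs = [
    [[0.0, 1.0], [2.0, 0.0]],
    [[1.0, 0.0]],              # a perfect match that stage 1 filters out
    [[1.0, 1.0]],
    [[1.0, 0.0]],
]

results = two_stage_search(query, abstract_vecs, body_vecs, ratio=0.5, top_k=2)
for score, index, sen_index in results:
    print(round(score, 2), index, sen_index)
```

In this toy run the body of article 1 contains a perfect match, but it is never scored because its abstract ranked poorly. That is the trade-off of the shortlist: less work per query, at the risk of missing answers in articles whose title or abstract does not resemble the query.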

### Test our second model with our non-brute-force function

In [None]:
second_start_time = time.time()

for i in range(len(queries)):
  reduce_searching(second_queries_embeddings[i],title_embeddings,abstract_embeddings,queries[i],data,0.1,0)

second_model_time = (time.time() - second_start_time)/60
print("Elapsed time: %s minutes" % (round(second_model_time,1)))

Query: What are the coronoviruses? 

Answer 1 : With some coronaviruses, e.g.
From article with title: Differential Sensitivity of Bat Cells to Infection by Enveloped RNA Viruses: Coronaviruses, Paramyxoviruses, Filoviruses, and Influenza Viruses 
Cosine Similarity: 0.73 

Answer 2 : Coronaviruses.
From article with title: A viral metagenomic survey identifies known and novel mammalian viruses in bats from Saudi Arabia 
Cosine Similarity: 0.68 

Answer 3 : The Coronaviridae Family.
From article with title: A Genome-Wide Analysis of RNA Pseudoknots That Stimulate Efficient −1 Ribosomal Frameshifting or Readthrough in Animal Viruses 
Cosine Similarity: 0.68 

Answer 4 : coronaviruses.
From article with title: Action Mechanisms of Lithium Chloride on Cell Infection by Transmissible Gastroenteritis Coronavirus 
Cosine Similarity: 0.67 

--------------------------------------------------------------------------------------------------------------------

Query: What was discovered in Wuhuan 

### Notes & Conclusions

>Note that our dataset is the first release of the CORD-19 dataset, 2020-03-13, which is the smallest available dataset, with 9000 articles. 
You can find it here: [CORD-19_Releases](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html)

#### Just a recap of what we presented in this notebook!

>In contrast to the first notebook, where we used a brute-force approach that compared every sentence of each article's abstract 
and body text section with a single query embedding via the cosine similarity metric and returned the most 
'similar' answer, in this notebook we tried to reduce the search in order to save time. 

>Let's explain our whole concept:
>1. Initially, after tokenizing the corpus into sentences, we took every article's title and abstract and transformed them to 
sentence embeddings. 
>2. At this point we hold only limited information about each article, because we have ignored its core text. The 
reduce_searching function then lets us search the body texts of only a specific percentage of all articles. 
We narrow the search by comparing the query embedding only with the sentences that come from the title and abstract 
section of each article.
At this point, note that we have three options: 

>> A) Consider only title section.

>> B) Consider only abstract section.

>> C) Consider both the abstract and title sections.

>>We experimented with each option, but we decided to keep as much information as we could. Also, note that in the 
scenario where we consider only the title section we saved a lot of time, but the model returned titles from papers that didn't match 
the question's context. In contrast, when we consider the abstract section, which holds a short summary of the paper, we are 
more likely to find similar wording, because in most cases it contains more than one sentence.

>3. We should point out that we can choose the percentage of articles to keep, namely those with the best cosine similarity between 
the query's embedding and the sentences of the title and/or abstract section. We decided to set this ratio to 10%.
>4. Given the previous explanations, there is a crucial difference in time and resource usage between our two 
notebooks.

>>a) In the first we followed a brute-force approach, searching every piece of information in the articles, but that 
cost us time: the whole process took approximately 2 hours, with full use of the CUDA GPU. As a reminder, in the 
first notebook we used the pretrained SBERT model 'stsb-bert-base' to transform sentences to embeddings of size 768. 

>>b) On the other hand, in the current, second notebook we tried to save time, but this cost us some accuracy in the answers to our questions 
about COVID-19. To be more specific, in this notebook we managed to halve the time, so the whole process here takes 
approximately 1 hour. Last but not least, in the current notebook we used the 'stsb-roberta-base' model.
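
To make the 10% ratio concrete, the short calculation below mirrors the `round(ratio*len(data_))` line of `reduce_searching` for the 9000 articles used here. The function name `articles_to_search` is hypothetical.

```python
def articles_to_search(ratio, number_of_articles):
    """Number of articles whose body text is encoded and searched."""
    return round(ratio * number_of_articles)

searched = articles_to_search(0.1, 9000)
skipped = 9000 - searched
print(searched, skipped)  # 900 8100
```

Note that `reduce_searching` encodes the shortlisted body texts inside each call, so this work is repeated for every query rather than done once for the whole corpus.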

>Here you can check all the available SBERT pretrained models that can be used to transform a sentence to an embedding: [Sentence Transformers PreTrained Models](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0). 

>In the above link you can see for yourself that 'stsb-bert-base' and 'stsb-roberta-base' don't differ much. In more detail, both models have the same speed, 2300 sentences per second on a V100 GPU, and in terms of performance 'stsb-roberta-base' is slightly better, with a score of 85.44 against 85.14 for 'stsb-bert-base'. Let's emphasize that these models are among the best-performing models that the sentence transformers library provides.

>So we can conclude that, although our two pretrained sentence models don't differ much, we followed a different approach with each one, which resulted in some major differences between our two notebooks. So let's analyze some crucial criteria with which we are going to compare our two approaches:

1. Time 
>The first criterion we pick to compare our two approaches is the time each notebook took. Our first notebook, where we followed a brute-force approach, takes approximately 2 hours in total. In contrast, in our second notebook we tried to save time and achieved that, since the elapsed time was 1 hour. That difference in time was more than expected. Practically, our mindset behind the second notebook was to search a small part (10%) of all articles, selected by a heuristic function built with the help of the title and abstract sections.

2. Computing Power
>The second criterion we pick to compare our two approaches is computing power. By 'computing power' we mean two critical subfactors: the usage percentage of Google Colab's hardware accelerator (CUDA GPU) and Google Colab's memory usage. 

>> In the first notebook, we have to note that while transforming sentences to embeddings we used up all available computational resources: we had full GPU usage, used nearly all available Google Colab RAM (~12GB), and 
to save these embeddings we had to sacrifice approximately 6.5GB of our Google Drive. (Note that Google Drive provides 15GB of storage for free.)

>> On the other hand, in the second notebook, because we initially transform only the title and abstract sections of all articles and then the body texts of only 10% of the articles, we managed to save a lot of memory: we used only ~1GB of storage. Furthermore, by watching the GPU closely we concluded that in this notebook Google Colab didn't give us full use of the CUDA GPU!

>> So, to sum up, we saved a lot of computing power in the second notebook compared with the first.
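
A rough back-of-the-envelope estimate can relate storage figures like these to embedding counts. The sketch assumes 768-dimensional embeddings stored as 32-bit floats with no file-format overhead, which the notebook does not state, and `embedding_storage_gb` is a hypothetical helper.

```python
def embedding_storage_gb(number_of_sentences, dimensions=768, bytes_per_value=4):
    """Approximate raw size in GB (10**9 bytes) of a set of sentence embeddings.

    Assumes float32 values and no container overhead (an assumption,
    not something measured in the notebook).
    """
    return number_of_sentences * dimensions * bytes_per_value / 10**9

# One million sentence embeddings under the assumptions above
print(round(embedding_storage_gb(1_000_000), 2))  # 3.07
```

Under these assumptions each embedding takes 3072 bytes, so storage grows linearly with the number of encoded sentences. This is consistent with the observation that encoding body text for only a fraction of the articles needs far less space.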

3. Accuracy
>The third criterion we could pick to compare our two approaches is the accuracy and quality of our results. Overall, by our informal evaluation, the two notebooks performed quite similarly, with some slight differences. We believe this notebook performed marginally better! Although both notebooks had the same problem (with cosine 
similarity, which we discussed in the previous notebook), there are also some exceptions in which we obtained brilliant answers! We analyzed these cases for the previous notebook, so let us discuss them for the current notebook:

---
> Query: "What was discovered in Wuhuan in December 2019?"

>  Answer: "In December 2019, a cluster of pneumonia of unknown etiology was detected in Wuhan City, Hubei Province of China."

>  With cosine similarity score 0.62.  
---
> Query: "What is Coronovirus Disease 2019?"

> Answer: "Thereafter, this disease was named Coronavirus Disease 2019 (COVID-19) by World Health Organization (WHO), and the causative virus was designated as SARS-CoV-2 by the International Committee on Taxonomy of Viruses."

> With cosine similarity score 0.62. 
---
>Those two answers were really relevant and satisfying, although their cosine similarity isn't very high!
---
>As a last note on the results of both notebooks: we may not retrieve the desired 
answers for all questions, but in most of our examples the retrieved paper's title had COVID-19 as its main concept, so if all this search were for 
information extraction or retrieval, we managed at a high rate to obtain relevant articles/papers! 