In [None]:
# Install required pip packages. Please note that the internet must be switched on to install them in the Kaggle kernel.
!pip install -U kneed

# Imports
import os
from os.path import join as join_path
import numpy as np
rng_seed = 368
np.random.seed(rng_seed)
import pandas as pd

import warnings
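try:
    from numba.core.errors import NumbaPerformanceWarning  # Location in newer numba releases
except ImportError:
    from numba.errors import NumbaPerformanceWarning  # Location in older numba releases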
warnings.filterwarnings("ignore", category=NumbaPerformanceWarning) # Silence NumbaPerformanceWarning for UMAP
import umap
from sklearn.cluster import KMeans

from kneed import KneeLocator
from scipy.spatial.distance import pdist
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
import plotly.express as px

from IPython.display import IFrame

# Coordle - Search Engine using Word2Vec and TF-IDF
Due to the special circumstances of the COVID-19 pandemic, the students of the Selected Topics in Machine Learning (topic being "Deep Learning") course ([INF368 Spring 2020](https://www.uib.no/en/course/INF368?sem=2020v)) at the University of Bergen were asked to participate in the CORD-19 research challenge.

In this notebook, you will find a search engine for the articles in the CORD-19 dataset. We named it Coordle (from Google + CORD), and it was built using TF-IDF and Word2Vec. The search engine can be found in the interactive cell below, or by clicking here: https://coordle.triki.no/.

In [None]:
IFrame('https://coordle.triki.no/', width=800, height=600)

### Table of contents
1. [Installing the Coordle library](#installing_cord_library)
2. [Data preprocessing](#data_preprocessing)
3. [Creating word embeddings from scratch using Gensim](#create_word2vec)
4. [Visualize word embeddings using K-means clustering and UMAP](#visualize_word_embeddings)
5. [Creating a search engine using TF-IDF](#search_engine_tf_idf)
6. [Task results](#task_results)
7. [Pros/cons with the search engine](#pros_cons)
8. [Future work](#future_work)

<a id='installing_cord_library'></a>
# 1. Installing the Coordle library
We have separated the code into two GitHub repositories. [The first one](https://github.com/JonasTriki/inf368-exercise-3-cord-19) is used for the data preprocessing and experimentation. [The second repository](https://github.com/JonasTriki/inf368-exercise-3-coordle) is where the Coordle library is maintained, and it is the one we will use throughout the notebook. Please note that the internet must be switched on to run the cell below, since it installs the Coordle library.

In [None]:
!pip install -U https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz
!pip install -U git+https://github.com/JonasTriki/inf368-exercise-3-coordle.git
    
# Import Coordle modules
from coordle.preprocessing import CORD19Data
from coordle.utils import clean_text
from coordle.backend import QueryAppenderIndex

<a id='data_preprocessing'></a>
# 2. Data preprocessing
To load and preprocess the CORD-19 data, we take inspiration from [Daniel Wolffram's "CORD-19: Create Dataframe" Notebook](https://www.kaggle.com/danielwolffram/cord-19-create-dataframe) and the ["Date updates thread"](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/137474) from the challenge itself. The goal of the data preprocessing is to get a single .csv file with all the cleaned/parsed data in place.

In particular, we first load the `metadata.csv` file using Pandas and perform some cleaning on it (dropping duplicates and articles without metadata). Then, we go through every row of the .csv file and parse it. We ensure that each row has either a PDF or a PMC parse, and we prefer the PMC parse over the PDF parse. For the body text of each article, we remove cite spans from the text since they are useless for creating word embeddings. We observed some false positive articles with fewer than roughly 1000 characters in the body text, and we exclude these.

Next, we remove duplicate articles that have the same abstract/body_text and we detect the language of each article using spaCy. We do this because we only want English articles in our final dataframe. After this, we save the result using Pandas. For more details about how we preprocessed the data, please consult the [cord_19_data.py](https://github.com/JonasTriki/inf368-exercise-3-coordle/blob/master/coordle/preprocessing/cord_19_data.py) file from the Coordle library repository.
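
To make the cite-span removal and the length filter more concrete, here is a minimal sketch. It assumes a paragraph dictionary with a `text` field and a list of `cite_spans` carrying `start`/`end` character offsets, as in the CORD-19 JSON parses. The function name `remove_cite_spans` and the constant `MIN_BODY_CHARS` are hypothetical and not taken from the Coordle library, whose actual implementation may differ.

```python
# Hypothetical threshold, mirroring the "roughly 1000 characters" rule above
MIN_BODY_CHARS = 1000


def remove_cite_spans(paragraph: dict) -> str:
    """Return the paragraph text with every cited span cut out."""
    text = paragraph["text"]
    spans = sorted(paragraph.get("cite_spans", []), key=lambda s: s["start"], reverse=True)
    # Cut from the end of the text so that earlier offsets stay valid
    for span in spans:
        text = text[:span["start"]] + text[span["end"]:]
    # Collapse the double spaces left behind by the removed spans
    return " ".join(text.split())


raw_text = "Coronaviruses infect many hosts [1] and can cause severe disease [2, 3] in humans."
paragraph = {
    "text": raw_text,
    "cite_spans": [
        {"start": raw_text.index("[1]"), "end": raw_text.index("[1]") + len("[1]")},
        {"start": raw_text.index("[2, 3]"), "end": raw_text.index("[2, 3]") + len("[2, 3]")},
    ],
}

cleaned = remove_cite_spans(paragraph)
keep_article = len(cleaned) >= MIN_BODY_CHARS  # False for this toy paragraph
print(cleaned)
```

Removing the spans back to front avoids having to adjust offsets after every cut.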

In [None]:
# Define some constants
kaggle_input_dir = join_path('/', 'kaggle', 'input')
cord_data_raw_dir = join_path(kaggle_input_dir, 'CORD-19-research-challenge')

In [None]:
# Perform preprocessing on the raw data
cord_df = CORD19Data(cord_data_raw_dir).process_data()

In [None]:
#  Sanity check the processed dataframe
cord_df.head()

<a id='create_word2vec'></a>
# 3. Creating word embeddings from scratch using Gensim
To create the word embeddings for the CORD-19 dataset, we used Gensim and the `Word2Vec` class. However, before we can train the model, we first define some helper classes. In particular, we define a data iterator that yields sentences for Word2Vec to train on and a callback that saves an intermediate model after each epoch. The data iterator uses the `clean_text` function from the Coordle library. In short, it cleans the text by converting it to lowercase and removing punctuation, stopwords, numerics and single-character words. Finally, it lemmatizes the text (turning the word `viruses` into `virus`, for instance). The code for the function can be found [by clicking here](https://github.com/JonasTriki/inf368-exercise-3-coordle/blob/master/coordle/utils/utils.py#L40).
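
As a rough illustration of the cleaning steps listed above, the sketch below uses only the Python standard library. It is a simplification and not the Coordle `clean_text` function: the name `simple_clean_text` and the tiny stopword list are assumptions, and lemmatization is left out, so `viruses` stays `viruses`.

```python
import string

# Tiny illustrative stopword list (an assumption; a real list is much longer)
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to"}

# Translation table that replaces every punctuation character with a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def simple_clean_text(text: str) -> list:
    """Lowercase, strip punctuation, and drop stopwords, numerics and 1-char tokens."""
    tokens = []
    for token in text.lower().translate(PUNCT_TO_SPACE).split():
        if token in STOPWORDS or token.isnumeric() or len(token) < 2:
            continue
        tokens.append(token)
    return tokens


tokens = simple_clean_text("The 2 viruses are spreading in Wuhan, a city of 11 million.")
print(tokens)
```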

In [None]:
# Implement the data iterator for Word2Vec
import nltk  # sent_tokenize requires NLTK's Punkt tokenizer data to be available

class CORDDataIteratorWord2Vec():
    def __init__(self, texts: np.ndarray):
        self.texts = texts
    
    def __iter__(self):
        for text in self.texts:
            sentences = nltk.tokenize.sent_tokenize(text)
            cleaned_sentences = [clean_text(sent) for sent in sentences]
            for sentence in cleaned_sentences:
                yield sentence

In [None]:
# Implement the epoch saver for Word2Vec
class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch.'''

    def __init__(self, output_dir: str, prefix: str, start_epoch: int = 1):
        self.output_dir = output_dir
        self.prefix = prefix
        self.epoch = start_epoch

    def on_epoch_end(self, model):
        output_path = join_path(self.output_dir, f'{self.prefix}_epoch_{self.epoch}.model')
        model.save(output_path)
        self.epoch += 1

After we have defined these two classes, we train the model in three steps:
1. Initialize Word2Vec model
2. Build Word2Vec vocabulary
3. Train the model

We split the steps into three parts to further sanity check that we did not make any mistakes on the way.

This is illustrated in the code below, which takes around 10 hours to run. For your convenience, we have imported the final model/weights, obtained after training for 20 epochs, into the kernel in the `input/gensim-word2vec-model` folder.
```python
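import os
from tqdm import tqdm

# Extract English only texts
cord_df_eng = cord_df[cord_df['language'] == 'en']
eng_texts = cord_df_eng['body_text'].values

cord_sentences = CORDDataIteratorWord2Vec(eng_texts)
w2v_saved_models_dir = 'models-word2vec'
saved_models_prefix = 'model'
os.makedirs(w2v_saved_models_dir, exist_ok=True)  # EpochSaver writes into this directory

# Count the sentences once so that tqdm can show progress (requires a full pass over the corpus)
cord_num_sentences = sum(1 for _ in cord_sentences)

# 1. Setup initial model
w2v_model = Word2Vec(
    min_count=20,
    window=2,
    size=300,
    negative=5,
    callbacks=[EpochSaver(w2v_saved_models_dir, saved_models_prefix)]
)

# 2. Build vocabulary
w2v_model.build_vocab(tqdm(cord_sentences, total=cord_num_sentences), progress_per=max(1, cord_num_sentences // 100))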

# 3. Train model
w2v_model.train(
    cord_sentences,
    total_examples=w2v_model.corpus_count,
    epochs=20,
    report_delay=30
)
```

In [None]:
# Load the trained Gensim model
model_path = join_path(kaggle_input_dir, 'gensim-word2vec-model', 'cord-19-w2v.model')
w2v_model = Word2Vec.load(model_path)
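# Note: `trainables.syn1neg` holds the output-layer (negative sampling) weights, with rows in the order of `wv.index2word`;
# the input vectors used by `most_similar` are stored in `w2v_model.wv.vectors`.
word_embedding_matrix = w2v_model.trainables.syn1neg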

### Test word embeddings by finding most similar word vectors

In [None]:
w2v_model.wv.most_similar('covid')

In [None]:
w2v_model.wv.most_similar('virus')

In [None]:
w2v_model.wv.most_similar('pandemic')

We observe above that the word embeddings do indeed make sense when exploring a few examples. Next, we will visualize the embeddings to get a deeper understanding of them.

<a id='visualize_word_embeddings'></a>
# 4. Visualize word embeddings using K-means clustering and UMAP
We decided to use K-means clustering and UMAP to cluster and reduce the dimensionality of the word embeddings. This part is mainly a sanity check to see that the word embeddings we obtained from the Word2Vec algorithm actually make sense. To find the best number of clusters for K-means, we use the elbow method.
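
To give an intuition for the elbow method, the sketch below picks the `k` whose error lies furthest below the straight line joining the first and last points of the (normalized) error curve. This is a simplified stand-in and not the algorithm implemented by `kneed`, which additionally smooths the curve and applies a sensitivity parameter. The function name and the error values are made up for illustration, and the sketch assumes increasing `ks` and a decreasing error curve.

```python
def elbow_by_chord_distance(ks, errors):
    """Return the k with the largest gap below the chord of the normalized curve."""
    k_min, k_max = ks[0], ks[-1]
    e_first, e_last = errors[0], errors[-1]
    # Normalize both axes to [0, 1] so that the chord runs from (0, 1) to (1, 0)
    xs = [(k - k_min) / (k_max - k_min) for k in ks]
    ys = [(e - e_last) / (e_first - e_last) for e in errors]
    # On the chord x + y = 1, so 1 - x - y measures how far a point lies below it
    gaps = [1.0 - x - y for x, y in zip(xs, ys)]
    return ks[gaps.index(max(gaps))]


toy_ks = [2, 3, 4, 5, 6, 7, 8]
toy_errors = [100.0, 60.0, 35.0, 30.0, 27.0, 25.0, 24.0]  # made-up inertia values
elbow_k = elbow_by_chord_distance(toy_ks, toy_errors)
print(elbow_k)
```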

In [None]:
# Cluster
min_k = 2
ks = np.arange(min_k, 21)
errors = np.zeros(len(ks))
clusterings = np.zeros((len(ks), word_embedding_matrix.shape[0]))
for k in ks:
    print(f'Clustering using k={k}...')
    clusterer = KMeans(n_clusters=k, n_jobs=-1)
    pred_labels = clusterer.fit_predict(word_embedding_matrix)
    clusterings[k - min_k] = pred_labels
    errors[k - min_k] = clusterer.inertia_

In [None]:
# Show the elbow plot to determine the best k
kneedle = KneeLocator(ks, errors, S=1.0, curve='convex', direction='decreasing')
kneedle.plot_knee()

# Select best clustering
best_clustering = clusterings[kneedle.knee - min_k]

In [None]:
# Reduce dimensionality using UMAP (with default params)
word_embedding_3d = umap.UMAP(n_components=3).fit_transform(word_embedding_matrix)

In [None]:
# Visualize the words in 3D with Plotly
word_embedding_vis_df = pd.DataFrame({
    'x': word_embedding_3d[:, 0],
    'y': word_embedding_3d[:, 1],
    'z': word_embedding_3d[:, 2],
    'cluster_label': best_clustering,
    'word': w2v_model.wv.index2word
})
fig = px.scatter_3d(word_embedding_vis_df, x='x', y='y', z='z', color='cluster_label', hover_name='word')
fig.show()

By zooming into some of the smaller clusters, we observe that months such as february and august are clustered together. We also observe clusters with date-related words and words that represent temperatures, as well as some outliers here and there, which is not too unexpected.

<a id='search_engine_tf_idf'></a>
# 5. Creating a search engine using TF-IDF

### Motivation and concept
We want to obtain answers to the problems by creating our own search engine for the dataset. Naturally, we want the search engine to be fast and give relevant results when given a query. 

To achieve the desired search speed, we pre-compute an index for the data. The index is a [hash table](https://en.wikipedia.org/wiki/Hash_table) that maps query tokens (e.g. words) to [sets](https://en.wikipedia.org/wiki/Set_(mathematics)) of documents that contain the tokens.

To achieve relevant search results, we analyze the query and add similar tokens by utilizing Word2Vec. For example, given the query "fat cat", we may effectively turn the query into "fat obese cat dog". The sets corresponding to each query token can then, for example, be combined into a larger set. Then, to score the relevance of the retrieved documents, we calculate the [TF-IDF](http://www.tfidf.com/) weights for each document.

For example, when given the query "covid symptoms" the search engine essentially does:
1.  Insert similar tokens into the query, e.g. corona and diagnose, effectively turning the query into "covid corona symptoms diagnose"
2.  Obtain set $A$ of documents that contain "covid"
3.  Obtain set $B$ of documents that contain "corona"
4.  Obtain set $C$ of documents that contain "symptoms"
5.  Obtain set $D$ of documents that contain "diagnose"
6.  Obtain the union $E$ of the sets
7.  Calculate the TF-IDF weights for each document in $E$ with respect to the query tokens
8.  Return the documents sorted by the TF-IDF weights in descending order
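
The steps above can be sketched on a toy collection as follows. This is an illustrative simplification and not the Coordle backend: the documents, the `SIMILAR` lookup (a stand-in for querying Word2Vec for the most similar token) and the function name `search` are all assumptions made for this example.

```python
import math
from collections import Counter

# Toy collection: document id -> cleaned tokens (made-up data)
DOCS = {
    "d1": ["covid", "symptoms", "fever", "cough"],
    "d2": ["corona", "outbreak", "city"],
    "d3": ["influenza", "symptoms", "fever"],
    "d4": ["vaccine", "trial"],
}
# Stand-in for a Word2Vec "most similar token" lookup (hypothetical values)
SIMILAR = {"covid": "corona", "symptoms": "diagnose"}

# Pre-computed index: token -> set of ids of the documents that contain it
INDEX = {}
for uid, doc_tokens in DOCS.items():
    for token in set(doc_tokens):
        INDEX.setdefault(token, set()).add(uid)
COUNTS = {uid: Counter(doc_tokens) for uid, doc_tokens in DOCS.items()}


def search(query: str) -> list:
    """Expand the query, collect matching documents and rank them by TF-IDF."""
    query_tokens = []
    for token in query.split():
        query_tokens.append(token)
        if token in SIMILAR:  # step 1: insert a similar token
            query_tokens.append(SIMILAR[token])
    scores = Counter()
    for token in query_tokens:
        matching = INDEX.get(token, set())  # steps 2-5: one set per token
        if not matching:
            continue
        idf = math.log(len(DOCS) / len(matching))
        for uid in matching:  # steps 6-7: score every document in the union
            scores[uid] += COUNTS[uid][token] / len(DOCS[uid]) * idf
    # step 8: highest score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


results = search("covid symptoms")
for uid, score in results:
    print(uid, round(score, 4))
```

Document `d4` shares no token with the expanded query, so it never enters the union and is not scored.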

### Query syntax with set operators
Sometimes we would like to specify a more refined query. For example, what if we specifically want the documents that contain the words "corona" and "influenza", but not the word "swine"? To give the user this ability, the search engine implements a parser that can parse set operators as well.

<a id='syntax'></a>
The search engine supports three operators: OR (union), AND (intersection) and NOT (difference), listed from highest to lowest precedence, meaning that the OR operator is evaluated before AND, and so on. To override the order of precedence, the user can use parentheses, e.g. "(cat AND dog) OR goose" will be different from "cat AND dog OR goose". To sum it up with another example:

Given the query "cat AND virus" the search engine essentially does:
1.  Insert similar tokens in the query, e.g. dog and disease, effectively turning the query into "(cat OR dog) AND (virus OR disease)" 
2.  Obtain set $A$ of documents that contain "cat"
3.  Obtain set $B$ of documents that contain "dog"
4.  Obtain set $C$ of documents that contain "virus"
5.  Obtain set $D$ of documents that contain "disease"
6.  Let set $E$ be the result of evaluating $(A \cup B) \cap (C \cup D)$
7.  Calculate the TF-IDF weights for each document in $E$ with respect to the query tokens
8.  Return the documents sorted by the TF-IDF weights in descending order

In practice, the search engine automatically adds OR operators between tokens that do not have an explicit operator between them, e.g. "cat dog horse AND goose" $\Rightarrow$ "cat OR dog OR horse AND goose". To make the effect of the precedence explicit, this query is equivalent to "(cat OR dog OR horse) AND goose".

### Indexing process
We will now give a high-level explanation of how the indexing process works. The source code for the indexing process is available [here](https://github.com/JonasTriki/inf368-exercise-3-coordle/blob/master/coordle/backend/coordle_backend.py#L537) if you would like to examine the implementation details.

Given a document $d_i$ from the collection of all documents $D$ that has a unique id $u_i$ and corresponding text $t_i$, we do the following:
1. Clean the text using the function ``clean_text`` mentioned in [section 3](#create_word2vec). The return value is a list containing text tokens.
2. Get the unique tokens $\tau_{ij}$ and their counts $c_{ij}$. ($i$ denotes document, $j$ denotes token index)
3. For the document, create a dictionary $f_i$ that maps each token $\tau_{ij}$ to its corresponding count $c_{ij}$.

We repeat this process for all the documents in $D$. 
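
A minimal sketch of this process, assuming the documents have already been cleaned into token lists, could look as follows. The function name `index_documents` and the two returned dictionaries are illustrative choices and not the data structures of the Coordle backend.

```python
from collections import Counter


def index_documents(documents: dict):
    """Build per-document token counts and a token -> document ids lookup.

    documents maps a unique id u_i to the list of cleaned tokens of d_i.
    """
    uid_to_counts = {}   # u_i -> f_i, the dictionary from step 3
    token_to_uids = {}   # hash table: token -> set of ids of documents containing it
    for uid, doc_tokens in documents.items():
        token_counts = Counter(doc_tokens)  # step 2: unique tokens and their counts
        uid_to_counts[uid] = token_counts
        for token in token_counts:
            token_to_uids.setdefault(token, set()).add(uid)
    return token_to_uids, uid_to_counts


toy_documents = {
    "u1": ["virus", "spread", "virus"],
    "u2": ["virus", "vaccine"],
}
token_to_uids, uid_to_counts = index_documents(toy_documents)
print(token_to_uids["virus"], uid_to_counts["u1"]["virus"])
```

Keeping the counts per document makes the term frequency a dictionary lookup at query time.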

### Parser 
Without going into the details, the [syntax](#syntax) is parsed using [recursive descent parsing](https://en.wikipedia.org/wiki/Recursive_descent_parser). The source code for the parser part of the search engine can be found [here](https://github.com/JonasTriki/inf368-exercise-3-coordle/blob/master/coordle/backend/coordle_backend.py#L421).
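
To illustrate the idea, here is a small recursive descent parser that evaluates a query directly against a toy token-to-documents index. It follows the precedence described in the syntax section (OR binds tightest, then AND, then NOT, with an implicit OR between adjacent terms), but it is only a sketch: the class `ToyQueryParser` is hypothetical, it does not add similar tokens, and the Coordle parser may be structured differently.

```python
import re


class ToyQueryParser:
    """Evaluates queries with OR > AND > NOT precedence against a token index."""

    KEYWORDS = ("AND", "OR", "NOT")

    def __init__(self, index: dict):
        self.index = index  # token -> set of document ids

    def parse(self, query: str) -> set:
        self.tokens = re.findall(r"\(|\)|[^\s()]+", query)
        self.pos = 0
        result = self._difference()
        if self.pos != len(self.tokens):
            raise ValueError(f"Unexpected token {self.tokens[self.pos]!r}")
        return result

    def _peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def _next(self):
        token = self._peek()
        self.pos += 1
        return token

    def _difference(self):  # lowest precedence: a NOT b NOT c
        result = self._intersection()
        while self._peek() == "NOT":
            self._next()
            result = result - self._intersection()
        return result

    def _intersection(self):  # a AND b
        result = self._union()
        while self._peek() == "AND":
            self._next()
            result = result & self._union()
        return result

    def _union(self):  # highest precedence: a OR b, or simply "a b"
        result = self._atom()
        while True:
            token = self._peek()
            if token == "OR":
                self._next()
            elif token is None or token in ("AND", "NOT", ")"):
                break
            result = result | self._atom()  # explicit or implicit OR
        return result

    def _atom(self):  # a single term or a parenthesized sub-query
        token = self._next()
        if token is None or token == ")" or token in self.KEYWORDS:
            raise ValueError(f"Expected a term, got {token!r}")
        if token == "(":
            result = self._difference()
            if self._next() != ")":
                raise ValueError("Missing closing parenthesis")
            return result
        return set(self.index.get(token.lower(), set()))


toy_index = {"cat": {1, 2}, "dog": {2, 3}, "goose": {3, 4}}
parser = ToyQueryParser(toy_index)
print(parser.parse("cat AND dog OR goose"))    # cat AND (dog OR goose)
print(parser.parse("(cat AND dog) OR goose"))
```

Each precedence level gets its own method, and a lower level calls the next higher one, which is what makes OR bind tighter than AND and AND tighter than NOT. An incomplete query such as "virus AND" raises a `ValueError` in this sketch.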

<a id='parser_and_syntax_note'></a>
#### Things to note about the parser and the syntax
As a design choice, when using the NOT operator (set difference), the search engine will not append similar terms to the term that is used for the difference. For example, "cat AND car" may effectively become "(cat OR dog) AND (car OR bus)", while "cat NOT car" becomes "(cat OR dog) NOT car". This is to avoid removing potentially relevant results.

An issue that arises (and is yet to be handled) is that given the query "cat NOT car sheep" (all documents with cats but not cars or sheep), the query effectively becomes, for example, "(cat OR dog) NOT car OR (sheep OR cow)", which is effectively "(cat OR dog) NOT (car OR (sheep OR cow))" because of operator precedence. This may remove too large a subset. For now, the workaround to ensure "proper behaviour" when using NOT is to chain the operators, e.g. "cat NOT car NOT sheep".
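
The difference between the two forms can be checked with plain Python sets. The document ids below are made up, and `dog` and `cow` play the role of the similar tokens that would be appended to `cat` and `sheep`.

```python
# Made-up sets of document ids for each token
cat, dog = {1, 2, 3}, {3, 4, 5}
car, sheep, cow = {1}, {2}, {4, 5}

# "cat NOT car sheep" -> "(cat OR dog) NOT (car OR (sheep OR cow))"
unchained = (cat | dog) - (car | (sheep | cow))

# "cat NOT car NOT sheep" -> "((cat OR dog) NOT car) NOT sheep"
chained = ((cat | dog) - car) - sheep

print(unchained)  # documents mentioning cow are removed as well
print(chained)
```

In the unchained form, documents 4 and 5 are dropped only because they mention the appended token `cow`, which the user never asked to exclude.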

### TF-IDF
[TF-IDF](http://tfidf.com/) stands for *term frequency-inverse document frequency*. It is a weight that is often used in information retrieval and text mining. Given a collection of documents, TF-IDF is a statistical measure of how strongly a word is connected to a specific document relative to all the other documents. Given a word and a document from the collection, we compute TF (term frequency) and multiply it by IDF (inverse document frequency) to obtain its TF-IDF weight. The calculations are as follows:

TF = (number of times the word appears in the document) / (total number of words in said document)

IDF = log(documents in collection / number of documents with the word in it)

The important parts to note for the query process are:
- Given a query $Q$, the search engine must retrieve sets for each query token $q_i$.
- Query $Q$ can have repeated tokens.
- Right after retrieving the set corresponding to $q_i$, it calculates the TF-IDF score for each document in the set.
- If a document is retrieved multiple times, it will accumulate TF-IDF score.
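
As a worked example of the two formulas and of the accumulation rule, consider the made-up numbers below. The natural logarithm is an assumption, since the base is not specified above, and the helper names are illustrative.

```python
import math


def term_frequency(term_count: int, doc_length: int) -> float:
    return term_count / doc_length


def inverse_document_frequency(n_docs: int, n_docs_with_term: int) -> float:
    return math.log(n_docs / n_docs_with_term)  # natural log (assumed base)


# "virus" appears 3 times in a 100-word document,
# and 100 of the 10,000 documents in the collection contain it
tf = term_frequency(3, 100)
idf = inverse_document_frequency(10_000, 100)
weight = tf * idf
print(f"TF={tf:.2f}  IDF={idf:.3f}  TF-IDF={weight:.3f}")

# A repeated query token retrieves the same document again,
# so the document accumulates the weight once per occurrence
query_tokens = ["virus", "virus"]
accumulated = sum(weight for _ in query_tokens)
```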

In [None]:
# To demonstrate how the search engine works, we index on a subset of the documents in the CORD-19 dataframe.
ai_index = QueryAppenderIndex(w2v_model.wv.most_similar, n_similars=1)
ai_index.build_from_df(
    cord_df[:1000],
    'cord_uid',
    'title',
    'body_text', 
    verbose=True, 
    use_multiprocessing=True,
    workers=-1
)

In [None]:
def search_and_show(query: str, max_results: int = 3, max_body_length: int = 500):
    '''Searches using the AI Index and shows the result
    
    Args:
        query: Search query
        max_results: Max results to show for each query
        max_body_length: Max number of body text characters to show for each result
    '''
    docs, scores, errmsgs = ai_index.search(query)
    if errmsgs:
        print('The following errors occurred:', errmsgs)
    else:
        if len(docs) == 0:
            print('Sorry, no results found.')
        else:
            for doc, score in zip(docs[:max_results], scores[:max_results]):
                print(f'{doc.uid}  {str(doc.title)[:70]:<70}  {score:.4f}')
                print('---')
                print(f'{cord_df[cord_df.cord_uid == doc.uid].body_text.values[0][:max_body_length]} ...')
                print('---')

In [None]:
search_and_show('virus')

In [None]:
search_and_show('virus AND')

In [None]:
search_and_show('coronavirus symptoms in humans')

<a id='task_results'></a>
# 6. Task results
Due to computational limitations, we only indexed the first 1000 of the 35k+ articles in our dataset in this notebook. To show the task results, we performed the following queries on the live website, which has all articles indexed. For simplicity, we only show the top 3 results for each query.
1. What do we know about COVID-19 risk factors?
    - covid AND risk AND factors
2. What do we know about vaccines and therapeutics?
    - covid AND vaccines
    - covid AND therapy
3. What has been published about medical care?
    - covid AND medical care
4. What do we know about diagnostics and surveillance?
    - covid AND diagnostics
    - covid AND surveillance

## 6.1. What do we know about COVID-19 risk factors?
![](https://i.imgur.com/j56hQIT.png)

## 6.2. What do we know about vaccines and therapeutics?
![](https://i.imgur.com/nSzQdDT.png)
![](https://i.imgur.com/w1YQ8Yd.png)

## 6.3. What has been published about medical care?
![](https://i.imgur.com/upnLRTU.png)

## 6.4. What do we know about diagnostics and surveillance?
![](https://i.imgur.com/V8jrFc9.png)
![](https://i.imgur.com/vai3KW0.png)

<a id="pros_cons"></a>
# 7. Pros/cons with the search engine
Here we briefly explain some of the pros and cons of the Coordle search engine.

## Pros
+ Easy to use
+ Searching is very fast
+ Good control over search queries (with the introduction of AND, OR and NOT)

## Cons
- Requires a lot of RAM 
- Ambiguity of results because it does not tell the user which extra tokens it appends to the query. However, this should be an easy fix. 

<a id='future_work'></a>
# 8. Future work
There is always room to improve, and here we will explain some of the challenges and improvements we could have done to make our search engine even better.

## Using database indexing instead of in-memory
At the time of writing, we index the articles into memory using our custom-made TF-IDF search engine. The reason for doing this is mostly to improve the search speed. We looked into indexing the articles in a database using [MongoDB](https://www.mongodb.com/), but after some experimentation we found the search time infeasible for any production app (~24 seconds vs. under 1 second when indexed in memory). The downside of indexing all the articles into memory is that the memory requirements are substantially higher than if we were to index the articles into a database or on disk. We estimated that around 16GB of memory is required to store the whole index.

## Highlighting search query words
To further enhance the search experience for the user, we would like to highlight the words from the search query in the web application. This way the user will get a better feeling of which query tokens were used in the search, as they are currently hidden in the Coordle backend (recall that the Coordle backend uses the word embeddings to add similar words to the search query).

## Enhance suggestions using Doc2Vec
As of now we utilize word embeddings to widen the set of possible results by effectively appending similar tokens to given queries. In addition to this, we could for example find documents related to the top searches by using Doc2Vec and include them as well. 

## Alternative parser behavior
An implication of the automatic addition of tokens to the query is that using set operators is technically not precise for the user. Since “cat AND car” may effectively become “(cat OR dog) AND (car OR bus)”, one may be confused by the results, since one won’t necessarily get results that contain both cat and car. A solution could be to display the effective query so one always knows what is actually searched for, and to add an option to toggle the automatic addition of similar tokens.

## Better handling of NOT operator
As explained [here](#parser_and_syntax_note), the parser still needs some work when handling the NOT (difference) operator in the queries.  