*original post [here](https://medium.com/@ahmedbesbes/releasing-corona-papers-an-ai-powered-search-engine-to-explore-covid-19-research-4b18d1259491)*

After participating in this great challenge with @marwandebbiche and @pmlee2017, we took the liberty to bring our software engineering expertise and build a tool. 

Without further ado, meet [Corona Papers](https://covid19.ai2prod.com) 🎉


In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('iVsteYEF0ko', width=800, height=450)

In this post, I’ll:
- **Go through the main functionalities of Corona Papers and emphasize what makes it different from other search engines**
- **Share the code of the data preprocessing and topic detection pipelines so that it can be applied in similar projects**

## What is Corona Papers?

Corona Papers is a search engine that indexes the latest research papers about COVID-19.

If you’ve just watched the video, the following sections dive into more detail. If you haven't watched it yet, all you need to know is here.

### **1 - A curated list of papers and rich metadata 📄**

Corona Papers indexes the COVID-19 Open Research Dataset (CORD-19), available on [Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). This dataset is a regularly updated resource of over 138,000 scholarly articles, including over 69,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.


Corona Papers also integrates additional metadata from [Altmetric](https://www.altmetric.com/) and [Scimago Journal](https://www.scimagojr.com/journalrank.php) to account for the online presence and academic popularity of each article. More specifically, it fetches information such as the number of shares on Facebook walls, the number of mentions on Wikipedia, the number of retweets, the [H-Index](https://en.wikipedia.org/wiki/H-index) of the publishing journal, etc.


The goal of integrating this metadata is to account for each paper’s impact and virality within both the academic and non-academic communities.

<div align="center">
<img width="100%" src="https://miro.medium.com/max/1400/1*jnnO6sc6O8f9zAT-bPhIAw.png">
</div>

### **2 - Automatic topic extraction using a language model 🤖**

Corona Papers **automatically tags each article with a relevant topic using a machine learning pipeline**.

This is done using [CovidBERT](https://huggingface.co/gsarti/covidbert-nli): a state-of-the-art language model fine-tuned on medical data. With the great power of the Hugging Face library, using this model is pretty easy.

<div align="center">
<img src="https://miro.medium.com/max/350/1*rr1CnqBV4_xxDH8oOZigeg.png">
</div>


Let’s break down the topic detection pipeline for more clarity:

- Given that abstracts represent the main content of each article, they’ll be used to discover the topics instead of the full content.<br>
  They are first embedded using CovidBERT. This produces vectors of **768 dimensions**. <br>
  Note that CovidBERT **embeds each abstract as a whole** so that the resulting vector encapsulates the semantics of the full document.
  

- Principal Component Analysis (PCA) is performed on these vectors to reduce their dimension, which removes redundancy and speeds up later computations. **250 components** are retained, which preserves 95% of the explained variance.


- KMeans clustering is applied on top of these PCA components in order to discover topics. After many iterations on the number of clusters, **8 seemed to be the right choice**. <br>
  There are many ways to select the number of clusters. I personally looked at the silhouette plot of each cluster (figure below). <br>
  ⚠️ An assumption has been made in this step: each article is assigned a unique topic, i.e. the dominant one. If you're looking to generate a mixture of topics per paper, the right way would be to use Latent Dirichlet Allocation. **The downside of this approach, however, is that it doesn’t integrate CovidBERT embeddings**.
  

- After generating the clusters, I looked into each one of them to understand the underlying sub-topics. I first tried a word-count and [TF-IDF](https://fr.wikipedia.org/wiki/TF-IDF) scoring to select the most important keywords per cluster. **But what worked best here was extracting those keywords by performing an LDA on the documents of each cluster.** This makes sense because each cluster is itself a collection of sub-topics. <br>
  Several coherent clusters were discovered. Here are some examples, with the corresponding keywords (the cluster names have been manually attributed on the basis of the keywords):
  
<div align="center">
<img src="https://miro.medium.com/max/552/1*pRIQIr6AApNsxy3o-R2iXw.png">
<img src="https://miro.medium.com/max/552/1*HU9NDoexQ5ixGRFNio0LDQ.png">
<img src="https://miro.medium.com/max/552/1*pLO7_8YUW03HVJC__7WnZg.png">
</div>
<br>

- Finally, and mainly for fun, I decided to represent the articles in an interactive 2D map to provide a visual interpretation of the clusters and their separability. To do this, I applied a t-SNE dimensionality reduction on the PCA components. Nothing fancy.

<div align="center">
<img src="https://miro.medium.com/max/1400/1*RFSYt1RaT31y1mOab1g0WA.png">
</div>
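The numeric part of the pipeline described above (PCA, KMeans, a silhouette check, and the t-SNE projection) can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: it uses scikit-learn, random vectors stand in for the CovidBERT embeddings, and the candidate cluster counts `(4, 6, 8)` are arbitrary. The mean silhouette score is used here as a compact substitute for inspecting per-cluster silhouette plots.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Stand-in for CovidBERT abstract embeddings: 300 fake documents x 768 dimensions
embeddings = rng.normal(size=(300, 768))

# Keep as many principal components as needed to explain 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(embeddings)

# Compare a few candidate cluster counts using the mean silhouette score
scores = {}
for k in (4, 6, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced)
    scores[k] = silhouette_score(reduced, labels)

# Final clustering with the number of clusters retained in the post
topics = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(reduced)

# 2D projection of the PCA components, ready to be plotted and colored by topic
coords = TSNE(n_components=2, random_state=0).fit_transform(reduced)

print(reduced.shape, scores, coords.shape)
```

With random input the silhouette scores are close to zero, which is expected: there is no real cluster structure to find. On real embeddings, the scores (together with a qualitative look at the clusters) are what guide the choice of `k`.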

To bring more interactivity to the search experience, I decided to embed the t-SNE visualization into the search results (this is available on Desktop view only).

On each result page, the points on the plot (on the left) represent the same search results (on the right): this gives an idea of how results relate to each other in a semantic space.

<div align="center">
<img src="https://miro.medium.com/max/1400/1*v1OY0kIQJdzyDx00ydfyMg.png">
</div>

### **3 - Recommendation of similar papers**

Once you click on a given paper, Corona Papers will show you detailed information about it such as the title, the abstract, the full content, the URL to the original document, etc.

It also suggests a selection of similar articles that the user can read and bookmark.

These suggestions are based on a similarity measure computed on CovidBERT embeddings.

Here are two examples:

<div align="center">
<img src="https://miro.medium.com/max/1400/1*lJQnvJOSV2WZ-8JJyD6wuQ.png">
</div>

<div align="center">
<img src="https://miro.medium.com/max/1400/1*Wyi57QvG2NPUvmtu0Vpp5w.png">
</div>
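For readers curious about how such recommendations can be computed, here is a minimal sketch that ranks papers by cosine similarity between embedding vectors. Cosine similarity is an assumption on my part (the post only mentions "a similarity measure"), and the function name `most_similar` and the tiny 3-dimensional vectors are purely illustrative.

```python
import numpy as np


def most_similar(embeddings, query_index, top_k=3):
    """Return indices of the top_k rows closest to the query row by cosine similarity."""
    # Normalize rows so that a dot product equals the cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf  # never recommend the paper itself
    return np.argsort(sims)[::-1][:top_k].tolist()


# Toy vectors standing in for CovidBERT embeddings of four papers
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
print(most_similar(toy, query_index=0, top_k=2))  # [1, 3]
```

Paper 1 points in almost the same direction as paper 0, paper 3 is halfway, and paper 2 is orthogonal, so it is ranked last.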




### **4 - A stack of modern web technologies 📲**

Corona Papers is built using modern web technologies:

<div align="center">
   <img src="https://miro.medium.com/max/1400/1*yf_yfEYRxpoP8j8W8t5iMg.png">
</div>


> ### Back-end
 
At its core, it uses Elasticsearch to perform full-text queries, complex aggregations, and sorting. When you type in a list of keywords, for example, Elasticsearch first matches them against the titles and abstracts, and possibly the author names.


Here’s an example of a query:


<div align="center">
   <img src="https://miro.medium.com/max/1400/1*7WQDsOIgnvBMqksgQv4nMw.png">
</div>
<br>
<br>

And here’s a second one that matches an author’s name:

<br>
<br>
<div align="center">
   <img src="https://miro.medium.com/max/1400/1*KkclSa0G2lEfoaaqOA11WA.png">
</div>
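Behind a search box like this one, a keyword search over several fields is usually expressed with Elasticsearch's `multi_match` query. The sketch below builds such a request body as a plain Python dictionary. The field names (`title`, `abstract`, `authors`) and the boost values are hypothetical: the actual index mapping and weights used by Corona Papers are not described in this post.

```python
import json


def build_search_body(keywords, size=10):
    """Build a query body that searches titles, abstracts and author names."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": keywords,
                # Hypothetical field names; "^" boosts a field's weight in the score
                "fields": ["title^3", "abstract^2", "authors"],
            }
        },
    }


body = build_search_body("coronavirus transmission children")
print(json.dumps(body, indent=2))
```

The resulting dictionary is the JSON body you would send to the index's `_search` endpoint. Boosting the title over the abstract is a common design choice, since a keyword in the title is usually a stronger relevance signal.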



> ### Front-end

The front-end interface is built using Material-UI, a great React UI [library](http://material-ui.com/) with a variety of well-designed and robust components.

It has been used to design the different pages, and more specifically the search page with its collapsible panel of search filters:

- publication date
- publishing company (i.e. the source)
- journal name
- peer-reviewed articles
- h-index of the journal
- the topics

<div align="center">
   <img src="https://miro.medium.com/max/1400/1*IBsgPuNG5Eyw4uBK5EMH2w.png">
</div>


Because accessibility matters, I aimed to make Corona Papers a responsive tool that researchers can use on different devices. Using Material-UI helped me design a clean and simple interface.

<div align="center">
   <img src="https://miro.medium.com/max/1400/1*qujEaCIgwTy37JT6ppZA9w.png">
</div>


> ### Cloud and DevOps

I deployed Corona Papers on AWS using docker-compose.

## How to use CovidBERT in practice


Using the **sentence_transformers** package to load CovidBERT and generate embeddings is as easy as writing these few lines:

<div align="center">
   <img src="https://miro.medium.com/max/1400/1*Rl-eEcWPx5Z7Hw-W6h7T5w.png">
</div>
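Since the snippet above is only available as a screenshot, here is a text version you can copy. Treat it as a sketch rather than a transcription of the image: it assumes the `models.Transformer` and `models.Pooling` modules of `sentence_transformers`, and mean pooling over tokens is my assumption. The helper names `load_covidbert` and `embed_abstracts` are hypothetical.

```python
def load_covidbert(model_name="gsarti/covidbert-nli"):
    """Build a sentence encoder: transformer + mean pooling (pooling is an assumption)."""
    # Imported lazily so the heavy dependency is only needed when a model is loaded
    from sentence_transformers import SentenceTransformer, models

    word_embedding_model = models.Transformer(model_name)
    pooling_model = models.Pooling(
        word_embedding_model.get_word_embedding_dimension(),
        pooling_mode_mean_tokens=True,
    )
    return SentenceTransformer(modules=[word_embedding_model, pooling_model])


def embed_abstracts(abstracts, model):
    """Encode each abstract as a single vector using the given encoder."""
    return model.encode(list(abstracts), show_progress_bar=False)


# Usage (downloads the model weights on the first call):
# model = load_covidbert()
# vectors = embed_abstracts(["Coronaviruses are enveloped RNA viruses."], model)
# vectors.shape  # one row per abstract; the post reports 768-dimensional vectors
```

Keeping model loading separate from encoding means the model is loaded once and reused across batches of abstracts.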

If you’re interested in the data processing and topic extraction pipelines, you can look at the code in my GitHub [repository](https://github.com/ahmedbesbes/covidbert-topic-mining).
You’ll find two notebooks:
- **1-data-consolidation.ipynb:**
    - consolidates the CORD database with external metadata from Altmetric, Scimago Journal, and CrossRef
    - generates CovidBERT embeddings from the titles and excerpts


- **2-topic-mining.ipynb:**
    - generates topics using CovidBERT embeddings
    - selects relevant keywords for each cluster

## What key lessons can be learned from this project?


Building Corona Papers has been a fun journey. It was an opportunity to mix up NLP, search technologies, and web design. This was also a playground for a lot of experiments.

Here are some technical and non-technical notes I first kept to myself but am now sharing with you:


- Don’t underestimate the power of Elasticsearch. This tool offers great customizable search capabilities. Mastering it requires a great deal of effort but it’s a highly valuable skill. <br>
  Visit the official [website](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html) to learn more.


- Using language models such as CovidBERT provides efficient representations for text similarity tasks.
  If you’re working on a text similarity task, look for a language model that is pretrained on a corpus that resembles yours. Otherwise, train your own language model.
  There are lots of available models [here](https://huggingface.co/models).


- Docker is the go-to solution for deployment. Pretty neat, clean, and efficient to orchestrate the multiple services of your app.
  Learn more about Docker [here](https://docker-curriculum.com/).


- Composing a UI in React is really fun and not particularly difficult, especially when you play around with libraries such as Material-UI.
  The key is to first start by sketching your app, then design individual components separately, and finally assemble the whole thing.
  This took me a while to grasp because I was new to React, but here are some tutorials I used:
  
    - React official [website](https://reactjs.org/tutorial/tutorial.html)
    - Material UI official [website](https://material-ui.com/) where you can find a bunch of components
    - I also recommend this guy’s channel. It’s awesome, fun, and quickly gets you started with React fundamentals.


In [None]:
YouTubeVideo('dGcsHMXbSOA', width=800, height=450)

- Text clustering is not a fully automatic process. You’ll have to fine-tune the number of clusters almost manually to find the right value. This requires monitoring some metrics and qualitatively evaluating the results.

Of course, there are things I wish I had time to try, like setting up a CI/CD workflow with GitHub Actions and writing unit tests. If you have experience with those tools, I’d really appreciate your feedback.

## Spread the word! Share Corona Papers with your community

If you made it this far, I’d really like to thank you for reading!

If you find Corona Papers useful, please share this link: https://covid19.ai2prod.com with your community.

If you have a feature request or want to report a bug, don’t hesitate to contact me.


I’m looking forward to hearing from you!