# Prioritizing research papers based on abstracts
#### The aim of this project is to implement an efficient system that prioritizes and recommends papers based on a research topic or query. This notebook contains the baseline implementation of the core feature of the Medical Cloud Platform (MCP), which aims to speed up the pace of medical research. This can be achieved by increasing collaboration between researchers around the world which, as stated by the WHO, can speed up [Solidarity clinical trials](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov/solidarity-clinical-trial-for-covid-19-treatments) by 80%. It also aims to save researchers' time so that they can direct all their efforts towards finding new ways to fight COVID-19.

#### In this notebook we focus only on the basic features of the search engine, which include the following steps:
1. Exploratory Data Analysis, EDA
2. Data Cleaning
3. Web Scraping
4. Implementation
5. Performance

##### We start with data exploration and cleaning, then form a hypothesis based on that exploration and on domain knowledge. After that we implement our solution using NLP, ML and scraping libraries such as Gensim, Scikit-Learn and Selenium. For more details about the MCP project, watch the video below.

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo('HctunZLmc10', width=800/1.2, height=450/1.25)

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import re
import os
import string
import pickle


import nltk
import gensim

from nltk.tokenize import word_tokenize, sent_tokenize
from gensim.parsing.preprocessing import remove_stopwords
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

from gensim.test.utils import common_corpus, common_dictionary
from gensim.similarities import MatrixSimilarity

from gensim.test.utils import datapath, get_tmpfile
from gensim.similarities import Similarity

from IPython.display import display, Markdown, Math, Latex, HTML


import seaborn as sns

!pip install webdriverdownloader
from webdriverdownloader import GeckoDriverDownloader

!pip install selenium
from selenium.webdriver.common.by  import By as selenium_By
from selenium.webdriver.support.ui import Select as selenium_Select
from selenium.webdriver.support.ui import WebDriverWait as selenium_WebDriverWait
from selenium.webdriver.support    import expected_conditions as selenium_ec
from IPython.display import Image



from selenium import webdriver as selenium_webdriver
from selenium.webdriver.firefox.options import Options as selenium_options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities as selenium_DesiredCapabilities

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()


# Add your ResearchGate email and password to the Kaggle secrets (as "email" and "pass") to be able to get the scraped data
email = user_secrets.get_secret("email")
password = user_secrets.get_secret("pass")
prox_s = user_secrets.get_secret("proxy_server")

### Exploratory Data Analysis
The directory contains many files: a JSON file for each paper, a description file and a metadata file describing the papers. We will explore the metadata file to see whether it contains features that serve our purpose.

In [None]:
meta = pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv")
print("Cols names: {}".format(meta.columns))
meta.head(7)

In [None]:
plt.figure(figsize=(20,10))
meta.isna().sum().plot(kind='bar', stacked=True)

The bar chart shows that some features, such as 'Microsoft Academic Paper ID' and 'WHO #Covidence', have a very large number of missing values. We will remove such columns (`who_covidence_id` in the metadata version loaded here) so that the missing-value counts of the remaining features are shown on a readable scale.

In [None]:
meta.columns

In [None]:
meta_dropped = meta.drop(['who_covidence_id'], axis = 1)

In [None]:
plt.figure(figsize=(20,10))

meta_dropped.isna().sum().plot(kind='bar', stacked=True)

From the above bar chart we can see that there is a small number of papers with missing urls and a considerable number with missing doi and abstracts. We are interested only in papers that have abstracts and either a doi or a url, so that we are able to recommend them to the researcher. So let's explore some statistics about the papers with missing abstracts and remove them if possible.

In [None]:
miss = meta['abstract'].isna().sum()
print("The number of papers without abstracts is {:0.0f} which represents {:.2f}% of the total number of papers".format(miss, 100* (miss/meta.shape[0])))

As we see from the previous results, the percentage of missing abstracts is considerable, but we can drop these papers because the abstract is one of the most important features in our approach. Now let's see the number of missing doi and missing urls among the papers that have abstracts.

In [None]:
abstracts_papers = meta[meta['abstract'].notna()]
print("The number of papers with abstracts is {:0.0f}".format(abstracts_papers.shape[0]))
missing_doi = abstracts_papers['doi'].isna().sum()
print("The number of papers without doi is {:0.0f}".format(missing_doi))
missing_url = abstracts_papers['url'].isna().sum()
print("The number of papers without url is {:0.0f}".format(missing_url))

We will need to extract the year of publication so that we can take the age of a paper into account. Older papers have had more time to be cited in newer papers, and they might not be sufficient for today's usage (ex: papers from 1955), so we give higher priority to more recent papers later on.

In [None]:
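# Copy the filtered frame so that adding the 'year' column does not modify a slice of `meta`
abstracts_papers = abstracts_papers[abstracts_papers['publish_time'].notna()].copy()
abstracts_papers['year'] = pd.DatetimeIndex(abstracts_papers['publish_time']).year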

Now let's see whether the papers with urls also have a doi or not

In [None]:
missing_url_data = abstracts_papers[abstracts_papers["url"].notna()]
print("The total number of papers with abstracts, urls, but missing doi = {:.0f}".format( missing_url_data.doi.isna().sum()))

Based on the above data exploration, we will remove the papers that don't have a url, so that research scientists can find the papers recommended to them. We keep only the papers that have an abstract, a publication date and a url, as follows:

In [None]:
abstracts_papers = abstracts_papers[abstracts_papers["url"].notna()]

### Web Scraping
The number of citations can be a good indicator of the importance of a research work. This dataset doesn't contain the number of citations, so we are going to use Selenium to get this number from ResearchGate based on the doi of each research paper.
Now I will do some web scraping to get the number of citations for each paper in the top 10 results. To do that I am going to use the code from the great [Notebook](https://www.kaggle.com/dierickx3/kaggle-web-scraping-via-headless-firefox-selenium) by Tom Dierickx.

In [None]:
!cat /etc/os-release

!mkdir "/kaggle/working/firefox"
!ls -l "/kaggle/working"
!cp -a "/kaggle/input/firefox-63.0.3.tar.bz2/firefox-63.0.3/firefox/." "/kaggle/working/firefox"
!ls -l "/kaggle/working/firefox"


!chmod -R 777 "/kaggle/working/firefox"
!ls -l "/kaggle/working/firefox"

gdd = GeckoDriverDownloader()
gdd.download_and_install("v0.23.0")
!apt-get install -y libgtk-3-0 libdbus-glib-1-2 xvfb
os.environ["DISPLAY"] = ":99"  # "!export" runs in a subshell, so the variable would not persist

In [None]:
from selenium.webdriver.common.proxy import Proxy, ProxyType

In [None]:
browser_options = selenium_options()
browser_options.add_argument("--headless")
browser_options.add_argument("--window-size=1920,1080")

proxy = prox_s

capabilities_argument = selenium_DesiredCapabilities().FIREFOX
capabilities_argument["marionette"] = True
capabilities_argument['proxy'] = {
    "proxyType": "MANUAL",
    "httpProxy": proxy,
    "ftpProxy": proxy,
    "sslProxy": proxy
}

browser = selenium_webdriver.Firefox(
    options=browser_options,
    firefox_binary="/kaggle/working/firefox/firefox",
    capabilities=capabilities_argument
)

In [None]:
browser.get("https://www.researchgate.net/login")
browser.find_element_by_id('input-login').send_keys(str(email))
browser.find_element_by_id('input-password').send_keys(str(password))
browser.find_element_by_class_name('action-submit').click()

In [None]:
# ids = browser.find_elements_by_xpath('//*[@class]')

In [None]:
# for ii in ids:
# #     print(ii.tag_name)
#     print (ii.get_attribute('id'))    # id name as string

In [None]:
# browser.find_element_by_class_name('challenge-form').click()

In [None]:
# print(browser.current_url)

Let's look at an example of what the web page looks like. We are only interested in the integer shown on the citations tab.

In [None]:
# We will be using ResearchGate to scrape more data as follows
# browser.get("https://www.researchgate.net/login")
browser.save_screenshot("screenshot.png")
Image("screenshot.png", width=800, height=500)

Now we will use the Selenium library to get the number of citations for each research paper based on its unique doi

In [None]:
def get_citations(top10):
    """Return one citation count per paper; 0.1 is used when no count is found."""
    citations = []
    for doi in top10["doi"]:
        count = 0.1  # small non-zero default so the citation weight never becomes zero
        if isinstance(doi, str):  # a missing doi is stored as NaN (a float)
            browser.get("https://www.researchgate.net/search.Search.html?type=researcher&query=" + doi)

            element = browser.find_elements_by_class_name('nova-c-nav__item-label')
            # The citation count is read from the fourth tab label (index 3),
            # so at least four labels must be present
            if len(element) > 3:
                numbers = re.findall(r"\d+", element[3].text)
                if numbers:
                    count = max(int(numbers[0]), 0.1)
        citations.append(count)

    return citations
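
The parsing step inside `get_citations` can be checked without a browser. The following is a minimal sketch: the helper name `parse_citation_count` and the label format `"Citations (12)"` are assumptions made for illustration, not something taken from the ResearchGate page.

```python
import re

# Fallback used when no count can be read (same value as in get_citations)
DEFAULT_CITATIONS = 0.1


def parse_citation_count(label_text):
    """Extract the first integer from a tab label, falling back to 0.1.

    The label format is hypothetical; only the regex logic is illustrated.
    """
    numbers = re.findall(r"\d+", label_text)
    if not numbers:
        return DEFAULT_CITATIONS
    # A count of 0 is also replaced by 0.1 so that the weight stays non-zero
    return max(int(numbers[0]), DEFAULT_CITATIONS)


for label in ["Citations (12)", "Citations", "Citations (0)"]:
    print(label, "->", parse_citation_count(label))
```

Keeping the parsing in a small pure function like this makes it possible to test the fallback behaviour separately from the scraping.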

### Remove stopwords:
Now we are going to remove the stop words, which are common words such as (the, a, an, etc.). We will also remove punctuation, and then reduce the words to their stems using the well-known Porter stemming algorithm before applying TF-IDF. We take care while removing punctuation, as we don't want to lose the meaning of hyphenated words like (non-medical): such a word is treated like (nonmedical) instead of being dropped, which keeps its meaning with respect to the search algorithm.
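
As a rough illustration of the cleaning described above, here is a minimal standard-library sketch. The stop-word list is a tiny made-up sample, the tokenization is a plain whitespace split, and stemming is left out, so this is a simplification and not the notebook's actual pipeline.

```python
import string

# Tiny illustrative stop-word list (assumption); gensim's remove_stopwords uses a much larger one
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "are"}


def clean_text(text):
    """Lower-case the text, strip punctuation inside tokens and drop stop words."""
    strip_punct = str.maketrans("", "", string.punctuation)
    tokens = []
    for raw in text.lower().split():
        token = raw.translate(strip_punct)  # "non-medical," becomes "nonmedical"
        if token.isalnum() and token not in STOP_WORDS:
            tokens.append(token)
    return tokens


print(clean_text("The use of non-medical masks, and an evaluation."))
```

The hyphenated word survives as a single token, which is the behaviour the paragraph above asks for.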

We will also vectorize the corpus using TF-IDF. The TF-IDF vectors are used in later steps to compute the similarity between the query and the abstracts for prioritization.
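
To make the TF-IDF weighting concrete, here is a small pure-Python sketch on toy, already-stemmed documents. It uses term count times log base 2 of (number of documents / document frequency), followed by unit-length normalization, which is how I understand gensim's `TfidfModel` defaults; treat that correspondence as an assumption. The function name `tfidf_vectors` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, normalized to unit length."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        # Terms that occur in every document get weight 0 and are dropped
        weights = {t: w for t, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [["mask", "transmiss", "viru"], ["vaccin", "viru"], ["mask", "mask", "polici"]]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print(vec)
```

Rare terms receive a larger weight than terms that appear in many abstracts, which is what lets a query match the abstracts that are specifically about its topic.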

### *NOTE: IT TAKES A LOT OF TIME TO CALCULATE THE TFIDF SO WE CALCULATED IT AND STORED IT AS A KAGGLE DATASET FOR FASTER REFERENCE, TO RE-CALCULATE IT YOU CAN UNCOMMENT THE FOLLOWING CELL*


In [None]:
porter = PorterStemmer()
lancaster=LancasterStemmer()

# abstracts_only = abstracts_papers['abstract']
# tokenized_abs = []

# for abst in abstracts_only:
#     tokens_without_stop_words = remove_stopwords(abst)
#     tokens_cleaned = sent_tokenize(tokens_without_stop_words)
#     words = [porter.stem(w.lower()) for text in tokens_cleaned for w in word_tokenize(text) if (w.translate(str.maketrans('', '', string.punctuation))).isalnum()]
#     tokenized_abs.append(words)
    
    
# dictionary = gensim.corpora.Dictionary(tokenized_abs)
# corpus = [dictionary.doc2bow(abstract) for abstract in tokenized_abs]
# tf_idf = gensim.models.TfidfModel(corpus)

# tf_idf.save("tfidf")
# dictionary.save("dict")

# with open("corpus.txt", "wb") as fp:
#     pickle.dump(corpus, fp)


To save time, we will load the TF-IDF model, corpus and dictionary saved in the previous step instead of recomputing them every time we modify the code.

In [None]:
with open("/kaggle/input/tfidfcovid19/corpus.txt", "rb") as fp:
    corpus = pickle.load(fp)
    
dictionary = gensim.corpora.Dictionary.load("/kaggle/input/tfidfcovid19/dict")
tf_idf = gensim.models.TfidfModel.load("/kaggle/input/tfidfcovid19/tfidf")

Now we need to apply to the query itself the same cleaning steps that we applied to the corpus, so that we get consistent results: removing stop words, removing punctuation and stemming. Then we map the words to their integer ids using the dictionary computed before.

In [None]:
def query_tfidf(query):
    
    query_without_stop_words = remove_stopwords(query)
    tokens = sent_tokenize(query_without_stop_words)

    query_doc = [porter.stem(w.lower()) for text in tokens for w in word_tokenize(text) if (w.translate(str.maketrans('', '', string.punctuation))).isalnum()]

    # mapping from words into the integer ids
    query_doc_bow = dictionary.doc2bow(query_doc)
    query_doc_tf_idf = tf_idf[query_doc_bow]
    
    return query_doc_tf_idf

Now we will use the cosine similarity from gensim to compare the query with every abstract. We also tried a text vectorizer instead of TF-IDF, but TF-IDF gave better results. We store the similarity in the same dataframe as the abstracts and sort from the highest similarity to the lowest.
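
For intuition, cosine similarity between a query vector and document vectors can be sketched in pure Python as below. The sparse `{term: weight}` representation and the function names are assumptions for illustration; gensim's `Similarity` index works on its own sparse format and is far more efficient on a large corpus.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (document index, similarity) pairs, most similar first."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


query = {"mask": 1.0}
documents = [{"mask": 1.0, "viru": 1.0}, {"vaccin": 1.0}, {"mask": 2.0}]
ranking = rank_documents(query, documents)
print(ranking)
```

Because the measure depends only on the direction of the vectors, a document that mentions only the query term ranks above one that mixes it with other terms, regardless of length.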

In [None]:
def rankings(query):

    query_doc_tf_idf = query_tfidf(query)
    index_temp = get_tmpfile("index")
    index = Similarity(index_temp, tf_idf[corpus], num_features=len(dictionary))
    similarities = index[query_doc_tf_idf]

    # Store the similarity in the dataframe and sort from high to low similarity
    abstracts_papers["similarity"] = similarities
    abstracts_papers_sorted = abstracts_papers.sort_values(by='similarity', ascending=False)
    abstracts_papers_sorted.reset_index(inplace=True)

    # Keep the 10 most similar papers; copy so that new columns are not written to a slice
    top10 = abstracts_papers_sorted.head(10).copy()
    top10["citations"] = get_citations(top10)

    # Weight the similarity by the citation count relative to the most cited paper
    top10["similarity"] = top10["similarity"] * (top10.citations / top10.citations.max()) * 0.5

    # Penalize older papers by up to 0.1; skipped when all papers share the same year
    norm_range = top10['year'].max() - top10['year'].min()
    if norm_range > 0:
        top10["similarity"] -= (abs(top10['year'] - top10['year'].max()) / norm_range) * 0.1
    top10 = top10.sort_values(by='similarity', ascending=False)
    top10.reset_index(inplace=True)

    return top10

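We also reorder the top 10 papers by giving the papers with more citations a higher priority and by penalizing the older papers, using the equation:

$$score = 0.5 \cdot similarity \cdot {citations \over citations_{max}} - 0.1 \cdot {\lvert year - year_{newest} \rvert \over year_{newest} - year_{oldest}}$$

where `year_newest` and `year_oldest` are the newest and oldest publication years among the top 10 papers.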
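
The re-ranking can be worked through on a few made-up papers. This is only a sketch of the scoring used in `rankings`: the titles, similarities, citation counts and years below are invented for illustration.

```python
def rerank(papers):
    """Re-score papers given as dicts with 'similarity', 'citations' and 'year' keys."""
    max_citations = max(p["citations"] for p in papers)
    newest = max(p["year"] for p in papers)
    oldest = min(p["year"] for p in papers)
    year_range = newest - oldest

    scored = []
    for p in papers:
        # Citation weighting: similarity scaled by citations relative to the most cited paper
        score = 0.5 * p["similarity"] * p["citations"] / max_citations
        # Age penalty of at most 0.1, skipped when all papers share the same year
        if year_range > 0:
            score -= 0.1 * (newest - p["year"]) / year_range
        scored.append({**p, "score": score})
    return sorted(scored, key=lambda p: p["score"], reverse=True)


papers = [
    {"title": "A", "similarity": 0.9, "citations": 0.1, "year": 2020},
    {"title": "B", "similarity": 0.6, "citations": 50, "year": 2015},
    {"title": "C", "similarity": 0.5, "citations": 10, "year": 2010},
]
reranked = rerank(papers)
for p in reranked:
    print(p["title"], round(p["score"], 4))
```

In this toy example the most similar paper (A) loses the first place to a well-cited one (B), which shows how strongly the citation factor can dominate the raw similarity.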

Now we will print the top 10 abstracts from the similarity algorithm

In [None]:
top = rankings("Non-pharmaceutical interventions")

In [None]:
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated
top[['index','abstract']].style.set_properties(**{'text-align': "justify"})

In [None]:
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated
top[['index','url']].style.set_properties(**{'text-align': "left"})

In [None]:
top = rankings("MERS respiratory sendrom")

In [None]:
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated
top[['index','abstract']].style.set_properties(**{'text-align': "justify"})

In [None]:
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated
top[['index','url']].style.set_properties(**{'text-align': "left"})

### Model Performance
We measure the model performance using precision@k. Since the data doesn't contain ground-truth relevance labels, we had to evaluate each recommended paper manually to calculate the precision@k.
#### Precision:
It measures the fraction of the instances identified as positive that are truly positive, and is computed as follows:

$$Precision = {TP \over TP + FP}.$$

Where:<br>
TP stands for True Positives: the number of data points that were correctly marked as relevant.<br>
FP stands for False Positives: the number of data points that were wrongly marked as relevant while they truly belong to the irrelevant class.

#### Precision@K:
It measures the precision at each cut-off k, from `k = 1` to `k = n`, where n is the number of results returned for a query. In our case we only recommend the top 10 research articles, so we have 10 levels from `k = 1` to `k = 10`.
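
As a small sketch of this definition, precision@k can be computed from a list of manual relevance judgements (1 for relevant, 0 for irrelevant) in rank order. The function name and the example judgements are made up for illustration.

```python
def precision_at_k(relevance):
    """Return [P@1, P@2, ..., P@n] for a ranked list of 0/1 relevance labels."""
    hits = 0
    precisions = []
    for k, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant
        precisions.append(hits / k)  # relevant results among the first k
    return precisions


# Hypothetical judgements for four ranked results: only the third one is irrelevant
p_at_k = precision_at_k([1, 1, 0, 1])
print(p_at_k)
```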

For the first query "Social distancing measures query result":

$$P@K = [{{1 \over 1}, {2 \over 2}, {2 \over 3}, {3 \over 4}, {4 \over 5}, {5 \over 6}, {6 \over 7}, {7 \over 8}, {8 \over 9}, {9 \over 10}}].$$


For the second query "MERS respiratory sendrom":

$$P@K = [{{1 \over 1}, {2 \over 2}, {3 \over 3}, {4 \over 4}, { 5 \over 5 }, {6 \over 6}, {7 \over 7}, {8 \over 8}, {9 \over 9}, {10 \over 10}}].$$

It seems for the second query, all the papers are relevant.

Now we will use the Mean Average Precision (MAP) to get an estimate of the precision of our system on multiple queries:

#### Mean Average Precision (MAP):
$$ MAP(Q) = {{\sum_{j=1}^{\lvert Q \rvert} { {1 \over m_j} \sum_{k=1}^{m_j} {precision(R_{j, k})} } } \over \lvert Q \rvert }$$

Where:<br>
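`Q`: the set of queries performed, so `|Q|` is the number of queries.<br>
`m_j`: the total number of relevant results (true positives) for query `j`.<br>
`R_(j, k)`: the ranked results of query `j` from the top down to the `k`-th relevant result, so `precision(R_(j, k))` is the precision at that rank.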


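The first query has 9 relevant results, so its sum of 9 precision values is divided by `m_1 = 9`:

$$ MAP = {{(1 + 1 + 3/4 + 4/5 + 5/6 + 6/7 + 7/8 + 8/9 + 9/10) / 9} + {(1 \cdot 10) / 10} \over 2} = {0.878 + 1 \over 2} \approx 0.94 $$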
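
The MAP definition can also be evaluated programmatically. In the sketch below the relevance lists are read off the P@K sequences reported above (for the first query only the third result was judged irrelevant); the function names are made up for illustration.

```python
def average_precision(relevance):
    """Mean of the precision values taken at the rank of each relevant result."""
    hits = 0
    precisions = []
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)
    # m_j is the number of relevant results, i.e. len(precisions)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(runs):
    """Average of the per-query average precision over all queries."""
    return sum(average_precision(r) for r in runs) / len(runs)


first_query = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]  # 9 relevant results out of 10
second_query = [1] * 10                       # all 10 results relevant

map_score = mean_average_precision([first_query, second_query])
print(round(average_precision(first_query), 3), round(map_score, 2))
```

Note that each query's sum is divided by its own number of relevant results `m_j`, not by the number of retrieved results.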

From the performance that we measured, our direction seems promising, but it can be further improved by retrieving more data about the authors and the impact factors of the journals.