# COVID-19 Challenge

![alt text](../images/covid_wordcloud_full.png "wordcloud")

The goal of this notebook is to analyse data from the [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).

There are various questions asked about the current state of our knowledge about the covid-19 virus and there are also many ways in which the published papers can be useful. We have decided to implement a few simple tools that can make it faster to find answers and which can be helpful in versatile use cases.  

The tools can be used to help with any of the questions, but for demonstrative purposes, we try do address the questions *What do we know about COVID-19 risk factors?* and *What do we know about virus genetics, origin, and evolution?*


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import string
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity  
from nltk.stem import PorterStemmer
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk import word_tokenize
from gensim import corpora, models
import seaborn as sns
import nltk
from itertools import  combinations
import networkx as nx
from networkx.algorithms.centrality import degree_centrality, eigenvector_centrality, closeness_centrality, betweenness_centrality
from networkx.algorithms.components import connected_components, number_connected_components
from itertools import  combinations
from string import digits
nltk.download('stopwords')
nltk.download('punkt')
import os
import json
from pprint import pprint
from copy import deepcopy
from tqdm.notebook import tqdm
from wordcloud import WordCloud
from nltk import FreqDist
import pyLDAvis
import pyLDAvis.gensim
import umap
from sklearn.manifold import TSNE

## Read data
 @xhlulu provided a great notebook for data reading, the loading functions are taken from it: https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv, thanks to @xhlulu.

In [2]:
def format_name(author):
    middle_name = " ".join(author['middle'])
    
    if author['middle']:
        return " ".join([author['first'], middle_name, author['last']])
    else:
        return " ".join([author['first'], author['last']])


def format_affiliation(affiliation):
    text = []
    location = affiliation.get('location')
    if location:
        text.extend(list(affiliation['location'].values()))
    
    institution = affiliation.get('institution')
    if institution:
        text = [institution] + text
    return ", ".join(text)

def format_authors(authors, with_affiliation=False):
    name_ls = []
    
    for author in authors:
        name = format_name(author)
        if with_affiliation:
            affiliation = format_affiliation(author['affiliation'])
            if affiliation:
                name_ls.append(f"{name} ({affiliation})")
            else:
                name_ls.append(name)
        else:
            name_ls.append(name)
    
    return ", ".join(name_ls)

def format_body(body_text):
    texts = [(di['section'], di['text']) for di in body_text]
    texts_di = {di['section']: "" for di in body_text}
    
    for section, text in texts:
        texts_di[section] += text

    body = ""

    for section, text in texts_di.items():
        body += section
        body += "\n\n"
        body += text
        body += "\n\n"
    
    return body

def format_bib(bibs):
    if type(bibs) == dict:
        bibs = list(bibs.values())
    bibs = deepcopy(bibs)
    formatted = []
    
    for bib in bibs:
        bib['authors'] = format_authors(
            bib['authors'], 
            with_affiliation=False
        )
        formatted_ls = [str(bib[k]) for k in ['title', 'authors', 'venue', 'year']]
        formatted.append(", ".join(formatted_ls))

    return "; ".join(formatted)

def load_files(dirname):
    filenames = os.listdir(dirname)
    raw_files = []

    for filename in tqdm(filenames):
        filename = dirname + filename
        file = json.load(open(filename, 'rb'))
        raw_files.append(file)
    
    return raw_files

def generate_clean_df(all_files):
    cleaned_files = []
    
    for file in tqdm(all_files):
        features = [
            file['paper_id'],
            file['metadata']['title'],
            format_authors(file['metadata']['authors']),
            format_authors(file['metadata']['authors'], 
                           with_affiliation=True),
            format_body(file['abstract']),
            format_body(file['body_text']),
            format_bib(file['bib_entries']),
            file['metadata']['authors'],
            file['bib_entries']
        ]

        cleaned_files.append(features)

    col_names = ['paper_id', 'title', 'authors',
                 'affiliations', 'abstract', 'text', 
                 'bibliography','raw_authors','raw_bibliography']

    clean_df = pd.DataFrame(cleaned_files, columns=col_names)
    clean_df.head()
    
    return clean_df

In [3]:
biorxiv_dir = '../input/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv/pdf_json/'
biorxiv_files = load_files(biorxiv_dir)
biorxiv_df = generate_clean_df(biorxiv_files)
biorxiv_df['subset'] = "biorxiv"

pmc_dir = '../input/CORD-19-research-challenge/custom_license/custom_license/pdf_json/'
pmc_files = load_files(pmc_dir)
pmc_df = generate_clean_df(pmc_files)
pmc_df['subset'] = "pmc"

comm_dir = '../input/CORD-19-research-challenge/comm_use_subset/comm_use_subset/pdf_json/'
comm_files = load_files(comm_dir)
comm_df = generate_clean_df(comm_files)
comm_df['subset'] = "comm"

noncomm_dir = '../input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/pdf_json/'
noncomm_files = load_files(noncomm_dir)
noncomm_df = generate_clean_df(noncomm_files)
noncomm_df['subset'] = "noncomm"

HBox(children=(FloatProgress(value=0.0, max=1342.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=1342.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=23152.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=23152.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=9365.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=9365.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=2377.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=2377.0), HTML(value='')))




In [4]:
complete_df = pd.concat([biorxiv_df, pmc_df, comm_df, noncomm_df])
complete_df.reset_index(inplace=True, drop=True)

In [5]:
meta_info_dir = "../input/CORD-19-research-challenge/metadata.csv"
meta_info  = pd.read_csv(meta_info_dir, low_memory=False)
len1 = len(meta_info)

meta_info = meta_info.loc[meta_info["authors"].isnull() == False]
len2 = len(meta_info)

# exclude "duplicate" papers
vars_unique = ['title', 'authors'] # 
meta_info = meta_info.drop_duplicates(vars_unique)
len3 = len(meta_info)
msg = "{} out of {} papers do not mention authors\n{} papers are duplicates w.r.t {}".format(len1-len2, len1, len2-len3, vars_unique)
print(msg)


2109 out of 47298 papers do not mention authors
109 papers are duplicates w.r.t ['title', 'authors']


# Make data help us

We wanted to address some of the potential use-cases of the covid-19 papers. Specifically, we tried to facilitate the following processes:
* Finding articles that refer to some specific topics so that it is possible to concentrate on the most relevant papers.
* Looking for papers that belong to important / credible authors
* Detecting possible areas of research and also new directions of published papers.

In the following we develop tools that could help in the above use cases.

## Topic relevant articles
In the first use case we want to find articles that refer to some specific topics so that we can focus on the most relevant papers. To do that we use Word2Vec model.

First we implement two different preprocessing pipelines to make our tool more robust to details of text preprocessing by just having the ability to analyse results after two different preprocessing methods.

**Pipeline 1**: Here we define a text preprocessing pipeline, using self-defined filters and transforms.

In [6]:
class Selector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]
    
class Tokenizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.apply(lambda x: word_tokenize(x))

class Stemmer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.stemmer = PorterStemmer()
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.apply(lambda x: [self.stemmer.stem(word) for word in x])

class StopWordsFilter(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.stopwords = stopwords.words('english') + list(string.punctuation)
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.apply(lambda x: [word for word in x if word not in self.stopwords])

class LowerCaseTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.apply(lambda x: x.lower())

class ShortWordRemover(BaseEstimator, TransformerMixin):
    def __init__(self, min_length):
        self.min_length=min_length
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.apply(lambda x: [word for word in x if len(word) >= self.min_length])
    

pipe_1 = Pipeline([
        ("select_textual_features", Selector("text")),
        ("to_lower_case", LowerCaseTransformer()),
        ("tokenize", Tokenizer()),
        ("filter_stop_words", StopWordsFilter()),
        ("stemming", Stemmer()),
        ("short_words_remover", ShortWordRemover(min_length=3))
    ])

**Pipeline 2:** Here we make use of the gensim preprocessing library:

In [7]:
from gensim import utils
import gensim.parsing.preprocessing as gsp

class AllInOneTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.filters = [
            gsp.strip_tags, 
            gsp.strip_punctuation,
            gsp.strip_multiple_whitespaces,
            gsp.strip_numeric,
            gsp.remove_stopwords, 
            gsp.strip_short, 
            gsp.stem_text
        ]
    def fit(self, X, y=None):
        return self
    def do_cleaning(self, x, filters):
        x = x.lower()
        x = utils.to_unicode(x)
        for f in filters:
            x = f(x)

        return x
    def transform(self, X):
        return X.apply(lambda x: self.do_cleaning(x, self.filters))

pipe_2 = Pipeline([
        ("select_textual_features", Selector("text")),
        ("all_in_one", AllInOneTransformer()),         
        ("to_list_of_words", Tokenizer()),
        ("short_words_remover", ShortWordRemover(min_length=3))
    ])

**Build Doc2Vec Vector List**

After we fit the word2vec models for the outputs of the preprocessing pipelines, we then use the trained model to calculate an average vector for each document in the corpus. This is done by simply summing up all the word-vectors and then dividing by the number of words.

In [8]:
def calculate_doc2vec(transformed_text: pd.Series, model) -> pd.DataFrame:
    """ The input argument transformed_text should be the result of one of the two pipelines"""
    df = pd.DataFrame()
    df["text"] = transformed_text
    df["doc_vec"] = df["text"].apply(
        lambda x: sum([model.wv[word] for word in x if word in model.wv.vocab]) /
        max(1, sum([1 if word in model.wv.vocab else 0 for word in x]))
    )
    return df

#### Keyword and Word2Vec Filter for Relevant Documents:

We implement a keyword-based relevance-filtering by first finding the document in the corpus that contains the keyword most often. Afterwards we find similiar documents to the found one in terms of the Word2Vec distance of latent representations. We return num_docs relevant documents.

In [9]:
def get_relevant_docs(corpus: pd.DataFrame, keyword: str, min_occurence: int, num_docs: int) -> pd.DataFrame:
    
    # find document that has most keyword occurences:
    max_count = -1
    target_vector = None
    for index, row in corpus.iterrows():
        if row["text"].count(keyword) > max_count:
            max_count = row["text"].count(keyword)
            target_vector = row["doc_vec"]
            
    if target_vector is None:
        print("Could not find keyword, returning empty DataFrame...")
        return pd.DataFrame()
    
    # list of tuples, consisting of (index, distance_to_target_vector)-pairs
    result = [(0, np.inf)] * num_docs
    
    for index, row in corpus.iterrows():
        dist = np.linalg.norm(target_vector - row["doc_vec"])
        if dist < result[-1][1]:
            result.append((index, dist))
            result.sort(key=lambda x : x[1])
            result = result[:-1]
            
    result = [x[0] for x in result]
    result = corpus.loc[result].copy()
    
    return result

We also provide a way to look directly for documents similar to the some phrase by interpreting the phrase as a new document and looking for documents that are close to it in the latent space.

In [10]:
def get_relevant_docs_for_phrase(phrase,corpus, model, pipeline, num_docs) -> pd.DataFrame:
    
    phrase_transformed = pipe_2.fit_transform(pd.DataFrame({"text":[phrase]}))
    target_vector = calculate_doc2vec(phrase_transformed, model)["doc_vec"].iloc[0]
    
    result = [(0, np.inf)] * num_docs
    
    for index, row in corpus.iterrows():
        dist = np.linalg.norm(target_vector - row["doc_vec"])
        if dist < result[-1][1]:
            result.append((index, dist))
            result.sort(key=lambda x : x[1])
            result = result[:-1]
            
    result = [x[0] for x in result]
    result = corpus.loc[result].copy()
    
    return result

### Demonstration

In [11]:
# take a sample of the dataset
sample_df = complete_df.sample(frac=0.02, random_state=42)
sample_df = sample_df[sample_df["title"].apply(lambda x: len(x)) > 0] # use papers with non-empty title

# fit pipelines and word2vec models:
transformed_text_1 = pipe_1.fit_transform(sample_df)
model_1 = Word2Vec(transformed_text_1)
vocab_1 = list(model_1.wv.vocab)

transformed_text_2 = pipe_2.fit_transform(sample_df)
model_2 = Word2Vec(transformed_text_2)
vocab_2 = list(model_2.wv.vocab)

Show which **viruses might be similar to covid**. Perhaps some of them can help with learning about covid evolution and origin.

In [12]:
print("Resulting similarities are slightly different: ")
pprint(model_1.wv.most_similar('virus')[:5])
pprint(model_2.wv.most_similar('virus')[:5])

Resulting similarities are slightly different: 
[('viru', 0.7995303869247437),
 ('coronavirus', 0.7309995889663696),
 ('avian', 0.703175961971283),
 ('paramyxovirus', 0.7028070688247681),
 ('flaviviru', 0.6978926062583923)]
[('viru', 0.7958441376686096),
 ('avian', 0.7379115223884583),
 ('paramyxovirus', 0.6789348125457764),
 ('viral', 0.6589496731758118),
 ('pathogen', 0.6571005582809448)]


Try to detect **papers related to immunity and children** to look at different risk groups.

In [13]:
# calculate document vectors:
with_doc2vec_1 = calculate_doc2vec(transformed_text_1, model_1)
with_doc2vec_2 = calculate_doc2vec(transformed_text_2, model_2)

In [14]:
result_1 = get_relevant_docs(with_doc2vec_1, "immun", -1, 10)
print("*** immunity related papers ***\n")
for title in list(complete_df.loc[result_1.index.values]["title"].values):
    print(title)
    print()

*** immunity related papers ***

The Fourth International Neonatal and Maternal Immunization Symposium (INMIS 2017): Toward Integrating Maternal and Infant Immunization Programs

Specialty Grand challenge in pediatric infectious diseases

Ad Hoc Influenza Vaccination During Years of Significant Antigenic Drift in a Tropical City With 2 Seasonal Peaks A Cross-Sectional Survey Among Health Care Practitioners

Lassa virus diversity and feasibility for universal prophylactic vaccine [version 1; referees: 3 approved]

Invited review Vaccines, adjuvants and autoimmunity

A review of vaccine research and development: Human acute respiratory infections ଝ

Man and Microbe: Fraternizing with the frenemy

Opinion piece

Seasonal influenza vaccination knowledge, risk perception, health beliefs and vaccination behaviours of nurses

Neonatal vaccine effectiveness and the role of adjuvants



In [15]:
result_2 = get_relevant_docs(with_doc2vec_2, "child", -1, 10)
print("*** children related papers ***\n")
for title in list(complete_df.loc[result_2.index.values]["title"].values):
    print(title)
    print()

*** children related papers ***

Respiratory Research The cost of community-managed viral respiratory illnesses in a cohort of healthy preschool-aged children

Household transmission of respiratory viruses -assessment of viral, individual and household characteristics in a population study of healthy Australian adults

Breastfeeding and Respiratory Infections in the First 6 Months of Life: A Case Control Study

Viral respiratory tract infections in young children with cystic fibrosis: a prospective full-year seasonal study

Methods Coccidioidomycosis as a Common Cause of Community-acquired Pneumonia

Burden of respiratory syncytial virus infections in China: Systematic review and meta-analysis

the World Health Organization/Global Outbreak Alert and Response Network Avian Influenza Investigation Team in Vietnam Human Avian Influenza A H5N1, Vietnam Emerging Infectious Diseases • www

Phone: (1) 301-451-9881, jbeigel@niaid.nih.gov

Respiratory Syncytial Virus Outbreak in the Basic Milit

Find **papers about "covid origins, transmission and evolution"**.

In [16]:
phrase_result = get_relevant_docs_for_phrase("covid origins, transmission and evolution", with_doc2vec_2, model_2, pipe_2, 5)
print("*** 'covid origins, transmission and evolution' related papers ***\n")
for title in list(complete_df.loc[phrase_result.index.values]["title"].values):
    print(title)
    print()

*** 'covid origins, transmission and evolution' related papers ***

First respiratory transmitted food borne outbreak?

Unanswered questions about the Middle East respiratory syndrome coronavirus (MERS-CoV)

Assessing the Epidemic Potential of RNA and DNA Viruses SYNOPSIS Get the content you want delivered to your inbox. · Table of Contents · Podcasts · Ahead of Print arƟcles · CME · Specialized Content

The Novel H7N9 Influenza A Virus: Its Present Impact and Indeterminate Future

COVID-19: Zoonotic aspects Fig. 1. Potential transmission cycles of SARS-CoV2 (formerly 2019nCoV). Travel Medicine and Infectious Disease xxx (xxxx) xxxx



#### **tf-idf cosine similarity matching**

Here we additionally implement a naive but widely-used approach for docoument matching. We use it to match papers to queries based on their title and abstract

In [17]:
reformulation_rules = {"what do we know about": "", 
                       "what is known about": "", 
                       " and": "",
                       "?": ""}


def rule_based_reformulation(query, rules=reformulation_rules):
    
    query = query.lower()
    
    for (phrase,substitute) in reformulation_rules.items():
        query = query.replace(phrase, substitute)
    
    query = query.strip()
    return query


def get_tfid (df_text, queries, vectorizer=None, pipeline = pipe_1):
    """
    :param queries: transformed queries
    :param df_text: pd.DataFrame with a variable "text"
    
    """
    transformed_text = pipeline.fit_transform(df_text)
    transformed_queries = pipeline.fit_transform(pd.DataFrame( {"text": queries }))
    
    transformed_text = transformed_text.map(lambda x: " ".join(x))
    transformed_queries = transformed_queries.map(lambda x: " ".join(x))

    if vectorizer is None:
        # concatenate all text together
        phrase_all = transformed_text.append(transformed_queries)
        phrase_all.index = range(len(phrase_all))
        assert len(transformed_text) + len(transformed_queries) == len(phrase_all)
        
        vectorizer = TfidfVectorizer()
        vectorizer.fit(phrase_all)
        
    tfidf_text = vectorizer.transform(transformed_text)
    tfidf_queries = vectorizer.transform(transformed_queries)
    assert tfidf_text.shape[0] == len(df_text), "number of obs before and after tfidf do not match"
    
    return tfidf_text, tfidf_queries, queries, vectorizer


def compute_cosine_similarity(tfid_queries, tfid_text):

    query_num = tfid_queries.shape[0]
    text_num = tfid_text.shape[0] 
    cosine_df = pd.DataFrame(index = range(text_num))
    
    for q in range(query_num):
        cosine_tmp = cosine_similarity(tfid_queries[q], tfid_text)
        assert len(cosine_tmp[0]) == text_num
        cosine_df["query_{}".format(int(q))] = cosine_tmp[0]
    
    return cosine_df


def extract_relevant_papers(cosine_df, num_papers=5, verbose=True):

    num_queries = len(cosine_df.columns)
    
    paper_dict = {}
    
    meta_info_copy = meta_info.copy()
    
    for q in cosine_df.columns.values:
        
        meta_info_copy[q] = cosine_df[q]
        meta_info_copy = meta_info_copy.sort_values(by=q, ascending=False)
        paper_dict[q] =  meta_info_copy["title"].values[:num_papers]

        if verbose:
            print("Selected {} papers, with similarity between {} and {}".format( num_papers, 
                                                                                  min(meta_info_copy[q].values[:num_papers]),
                                                                                  max(meta_info_copy[q].values[:num_papers])))
    return paper_dict  

### Demonstration

For 4 queries below, we will print up to 10 most similair papers according to the cosine similarty based on the abstract and title only. 

In [18]:
QUERIES = [
           "What is known about transmission, incubation, and environmental stability?",
           "What do we know about COVID-19 risk factors?",
           "What do we know about virus genetics, origin, and evolution?",
           "What do we know about vaccines and therapeutics?"
            ]
queries_transformed = [rule_based_reformulation(q) for q in QUERIES]


meta_info["text"] = meta_info["title"] + meta_info["abstract"]
meta_info_idf = meta_info.loc[meta_info["text"].isnull() == False].copy()


tfid_text, tfid_queries, queries, _ = get_tfid (df_text=meta_info_idf, queries=queries_transformed)
cosine_df = compute_cosine_similarity(tfid_queries, tfid_text)
paper_dict = extract_relevant_papers(cosine_df)

Selected 5 papers, with similarity between 0.3064912820151344 and 0.3732382703390843
Selected 5 papers, with similarity between 0.5429319720134561 and 0.6261394960630955
Selected 5 papers, with similarity between 0.33085171886975456 and 0.47726504941489145
Selected 5 papers, with similarity between 0.4153059038644511 and 0.49904314978418485


In [19]:
for i in range(len(QUERIES)):
    print("********************************************************\n")
    print(QUERIES[i])
    print("--------------------------------------------------------\n")
    rel = paper_dict["query_{}".format(i)]
    for j in range(5):
        print("{}\n".format(rel[j]))

********************************************************

What is known about transmission, incubation, and environmental stability?
--------------------------------------------------------

Imported malaria in the UK

Quantitative Proteomics Identifies Host Factors Modulated during Acute Hepatitis E Virus Infection in the Swine Model

Hard decisions will have to be made: view from intensive care

The CD225 Domain of IFITM3 Is Required for both IFITM Protein Association and Inhibition of Influenza A Virus and Dengue Virus Replication

Enteric viruses in HIV-related diarrhoea

********************************************************

What do we know about COVID-19 risk factors?
--------------------------------------------------------

SARS could still affect the United Kingdom, health secretary warns

H5N1 influenza viruses: Facts, not fear

Ebolavirus VP24 Binding to Karyopherins Is Required for Inhibition of Interferon Signaling

H7N9 and Other Pathogenic Avian Influenza Viruses Elici

## **Find key contributers**

Once we have developed a framework for finding relevant paper, there is another important question we want to address: *what is the quality of the content?* 

To answer this question, we examine the co-authorship network and try to determine the key contributers. Their importance is expressed using two simple measures: *degree centrality* and *eigenvector centrality*.

Our roadmap looks as follows:
1. name preprocessing
2. network creation
3. preliminary analysis of connected components
4. computing centrality

#### **Name preprocessing**

Upon examining the author's names, we have noticed several inconsistencies which we wanted to correct before creating the network. Not addressing this issue would result in extra nodes and missing connections between the actual authors. 

In [20]:
meta_info.loc[28501, "authors"] = meta_info.loc[28501, "authors"].replace("Wei,", "Wei")

# upone examination we have seen that some author names
# were not separated neither by , nor by ;
meta_info.loc[23966, "authors"] = meta_info.loc[28501, "authors"].replace("Hu,", "Hu;")
meta_info.loc[23966, "authors"] = meta_info.loc[28501, "authors"].replace("Xin", "Xin;")
meta_info.loc[23968, "authors"] = meta_info.loc[28501, "authors"].replace("Lei", "Lei;")
meta_info.loc[23968, "authors"] = meta_info.loc[28501, "authors"].replace("Wang", "Wang;")
meta_info.loc[31154, "authors"] = meta_info.loc[28501, "authors"].replace("Lee,", "Lei;")
meta_info.loc[31154, "authors"] = meta_info.loc[28501, "authors"].replace("Chiew", "Chiew;")


meta_info["authors"] = list(map(lambda x: str(x).replace("and", ";"), meta_info["authors"]))
meta_info["authors"] = list(map(lambda x: str(x).replace("-", " "), meta_info["authors"]))

# delete scientific titles from the names
meta_info["authors"] = list(map(lambda x: str(x).replace("MPhil", ";"), meta_info["authors"]))
meta_info["authors"] = list(map(lambda x: str(x).replace("PhD", ";"), meta_info["authors"]))
meta_info["authors"] = list(map(lambda x: str(x).replace("M. D.", ";"), meta_info["authors"]))

# delete all numbers
remove_digits = str.maketrans('', '', digits)
meta_info["authors"] = list(map(lambda x: x.translate(remove_digits), meta_info["authors"]))
meta_info["authors"] = list(map(lambda x: x.title(), meta_info["authors"]))

# other smaller inconsistencies which I have noticed during the inspection
meta_info["authors"] = list(map(lambda x: str(x).replace("'T", "T"), meta_info["authors"]))

# some author names were not separated neither by , nor by ;
names_to_separate = [ "A Aziz Aboobaker", "Abd Elaal Ahmed", "A Franco Manuel", "Ali Mohamed", "Abayomi Akin", "Abba Yusuf", "Abbag Lubna F", "Abbott Catherine A",
                      "Aarestrup Frank M", "Aaron Shawn D", "Abd Alla Adly M M", "Abd El Wahed Ahmed",
                      "Abdel Aziz Hatem", "Abdelwhab Elsayed M", "Abdullah Rasedee","Abe Shigeto","Abeyratne Ruwantissa",
                      "Abergel Chantal","Abraham Samuel","Acres S D","Adam Vojtech","Adams David H",
                      "Adams Timothy E","Ader Florence","Afonso Claudio L","Abdollahi Hamed","Abdul Careem Mohamed Faizal"]

for auth in names_to_separate:
    meta_info["authors"] = list(map(lambda x: str(x).replace(auth, "{}; ".format(auth)), meta_info["authors"]))
    
# Sometimes co-authors are separated by ";", sometimes by "," - trying to determine when 
meta_info["commas_per_autor_list"] = list(map(lambda x: len(str(x).split(","))-1, meta_info["authors"]))
meta_info["authors_per_publication"] = list(map(lambda x: len(str(x).split("; ")), meta_info["authors"]))

my_idx = meta_info.query('authors_per_publication == 1 and commas_per_autor_list > 1').index
meta_info.loc[my_idx, "authors"] = list(map(lambda x: str(x).replace(",", "; "), meta_info.loc[my_idx, "authors"] ))

meta_info["authors"] = list(map(lambda x: str(x).replace(",", "."), meta_info["authors"]))
meta_info["authors"] = list(map(lambda x: str(x).replace(".", ""), meta_info["authors"]))

authors = meta_info["authors"].dropna()

list_all_authors = "; ".join( meta_info["authors"].dropna() )
list_authors = list_all_authors.split("; ") # assume co-authors are separated by a ;
list_authors = list(map(lambda x: str(x).strip(), list_authors))

list_authors_unique = np.unique(list_authors)
list_authors_unique = list(filter(lambda x: len(x) > 2, list_authors_unique ))
list_authors_unique = list(filter(lambda x: not ((len(x.replace(" ", ""))==2)&(len(x) == 3)), list_authors_unique ))
list_authors_unique = list(filter(lambda x: str(x)[0] != "@", list_authors_unique ))

In [21]:
author_dict = {}
for i, auth in enumerate(list_authors_unique):
    author_dict[auth] = i
    
author_dict_decode = { k:v for v, k in author_dict.items() }

#### **Create an authorship network**

We represent the connections between authors by an unweighted, where vertices represent authors and an edge indicates the presence of a common paper. 

During the data exploration it has come to other attention that several papers has an excessivly large set of authors (over 800). These entries corresponded to conference proceedings e.g. *XXIV World Allergy Congress 2015: Seoul, Korea. 14-17 October 2015', 'Guidelines for the use and interpretation of assays for monitoring autophagy (3rd edition)*. Since the information about the direct collaborations is not preserved in the metadata we have decided to exclude them from our analysis.

In general, we have decided not too include edges, if the number of contributers to the paper exceeded 100, since the individual contribution of the majority of author is likely to be marginal in this case.

In [22]:
G=nx.Graph()

# add nodes corresponding to authors
for key, value in author_dict.items():
    G.add_node(author_dict.get(key)) 

names_not_found = []
names_found = []
too_many_collaborators = {}

for i in meta_info.index:

    collaborators = str(meta_info.loc[i, "authors"]).split("; ")
    
    # inflates the degree - > proceedings of the conferences
    if( len(collaborators) < 100 ):  # 268

        for auth1, auth2 in combinations(collaborators, 2):

            auth1 = auth1.strip()
            auth2 = auth2.strip()

            # check the name is in the dictionary:
            if (auth1 in author_dict.keys()) and (auth2 in author_dict.keys()):
                
                v1 = min([author_dict.get(auth1), author_dict.get(auth2)])
                v2 = max([author_dict.get(auth1), author_dict.get(auth2)])

                # count all collaborations only once
                if not G.has_edge(v1, v2):
                    G.add_edge(v1, v2)
                names_found.append(auth1)
                names_found.append(auth2)

            elif auth1 not in author_dict.keys():
                names_not_found.append(auth1)

            elif auth2 not in author_dict.keys():
                names_not_found.append(auth2)

Our network contains a small portion of nodes with more than 100 connections. This list predominantly consists of chinese authors. This could be attributed to the fact that the english translation of many different chinese names is the same. Consequently the numbers account for the collaborations of several different researchers. We have decided to exclude this suspicious/ambiguous nodes from the subsequent analysis.

In [23]:
degree_dict_full = { n:d for n, d in G.degree() }
normal_activity_nodes = dict(filter(lambda elem: elem[1] <= 100, degree_dict_full.items()))
G1 = G.subgraph(list(normal_activity_nodes.keys()))

The resulting authorship network is disconnected. Furhtermore, there are 12588 authors that always contribute individually in the current version of the database.

We focus our attention on the largest connected component, that links together 110742 our of 174047 scientists

In [24]:
components = connected_components(G1)
num_comp = number_connected_components(G1)
print( "Number of connected components: {}".format(num_comp) )

# filter the list of components, work only with componets with more than 1 element
component_list = list(filter(lambda x: len(x) > 1, components ))

component_sizes = [len(x) for x in component_list]
largest_component = component_list[np.argmax(component_sizes)]
print("Size of the largest component: {}\n{} components  had just one element".format(len(largest_component), num_comp-len(component_list)))

Number of connected components: 21127
Size of the largest component: 105321
12083 components  had just one element


In [25]:
G1_component = G1.subgraph(list(largest_component))
degree_dict = { n:d for n, d in G1_component.degree() }
assert len(degree_dict) == len(largest_component)

#### **Centrality measures**

For every node in the largest connected component we evaluate two characteristics: **degree centrality** and **eigenvalue centrality**

In [26]:
centrality_degree = degree_centrality(G1_component)
centrality_eigenvector = eigenvector_centrality(G1_component)

assert len(centrality_degree.keys()) == len(centrality_eigenvector.keys()), "length mismatch"
assert len(set(centrality_degree.keys()).difference(set(centrality_eigenvector.keys()))) == 0

In [27]:
author_centrality_pd = pd.DataFrame({"author": [author_dict_decode.get(x) for x in centrality_degree.keys()] })
author_centrality_pd["degree_centrality"] = centrality_degree.values()
author_centrality_pd["eigenvector_centrality"] = centrality_eigenvector.values()

Nodes with the highest **degree centrality** have the largest number of co-authors in the subgraph. Below we list 10 authors with the largest set of diverse collaborations according to our dataset

In [28]:
author_centrality_pd = author_centrality_pd.sort_values(by="degree_centrality", ascending=False, na_position='first')
#author_centrality_pd[["author", "degree_centrality"]].head(10)

print("*** 10 most actively collaborating authors ***\n")
for auth in author_centrality_pd[["author"]].head(10).values:
    print(auth[0])

*** 10 most actively collaborating authors ***

Cauchemez S
Tapparel Caroline
Yang Jyh Yuan
Kao Chuan Liang
Falsey Ann R
Elia Gabriella
Madhi Shabir A
Ping Liu
Lavazza Antonio
Liu Ping


**Eigenvalue centrality** shifts the emphasis from the number of connections to the 'quality' / 'importance' of the neighbors. Higher values are associated with nodes that collaborate with a lot of 'important' (w.r.t degree centrality) nodes. 

In [29]:
author_centrality_pd = author_centrality_pd.sort_values(by="eigenvector_centrality", ascending=False, na_position='first')

print("*** 10 authors with the highest eigenvalue centrality ***\n")
for auth in author_centrality_pd[["author"]].head(10).values:
    print(auth[0])

*** 10 authors with the highest eigenvalue centrality ***

Cauchemez S
Kraemer M U G
Fonseca V
Giovanetti M
De Oliveira T
Pybus O G
Funk S
Faria N R
De Albuquerque C F C
Komninakis S C V


## Detecting unexplored areas and reasearch directions.
We want to show what are the pursued areas of research and also what are the new research directions of published papers.

First we wanted to find possible clusters in the latent space using T-SNE or UMAP

`vectors = np.array([np.array(vec) for vec in with_doc2vec_2["doc_vec"]])`  

`standard_embedding = umap.UMAP(n_neighbors= 25,min_dist = 1.,random_state=42).fit_transform(vectors)`  

`sns.scatterplot(standard_embedding[:,0], standard_embedding[:, 1])
plt.title("UMAP visualisation")
plt.show()`

It detected non-english papers when fitted on the whole corpus: ![alt text](../images/covid_umap.png "umap")However, when trained on english papers only, UMAP with different hyperparameter settings did not reveal any sensible potential clustered latent structure.  

*We include the image of the results as the exact result might depend on GPU randomness*

### LDA model

Using the LDA model we can discover various directions of research by analysing the found topics.

In [30]:
not_stemming_pipeline = Pipeline([
        ("select_textual_features", Selector("text")),
        ("to_lower_case", LowerCaseTransformer()),
        ("tokenize", Tokenizer()),
        ("filter_stop_words", StopWordsFilter()),
        ("remove_short_words", ShortWordRemover(4)),
    ])

def perform_lda(X, pipeline=not_stemming_pipeline, num_topics=5):
    transformed_text = pipeline.fit_transform(X).values
    dictionary_LDA = corpora.Dictionary(transformed_text)
    corpus = [dictionary_LDA.doc2bow(document) for document in transformed_text]
    lda_model = models.LdaModel(corpus, num_topics=num_topics, \
                            id2word=dictionary_LDA, \
                            passes=4, alpha=[0.1]*num_topics, \
                            eta=[0.01]*len(dictionary_LDA.keys()))
  
    return lda_model, corpus, dictionary_LDA

In [31]:
lda_model, corpus, dictionary_LDA = perform_lda(complete_df.sample(frac=0.01, random_state=42))

### Demonstration

In [32]:
vis = pyLDAvis.gensim.prepare(topic_model=lda_model, corpus=corpus, dictionary=dictionary_LDA)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)

The exact topics found depend on the hyperparameters but still exploring the particular topics found might help to see where reasearch is currenly going. For instance, the topic 3 found by one of the models probably **detects a topic about virus origins, relation to covid in animals and ways of transmission**: ![alt text](../images/covid_lda.png "lda")

### Final remarks
We have developed a few simple and interpretable tools and demonstrated, with a couple of examples, how they can help with the posed questions.