# CORD-19: Most Collaborative & Influential Authors 

## Motivations

This notebook seeks to create a methodology to identify and quantify the individuals contributing the most to the literature collected in this dataset. The goal is to help identify the researchers leading this field and having the greatest impact on others' work. The results can be used in other models, and as potential weightings, to highlight work that may deserve additional focus over the general corpus.

## Methodology

There are various means of quantifying such an individual, as well as varying amounts of data to consider. I have split the measures into XXX different metrics, each relying on a foundational “Scoring Mechanism.” If this work proves useful to others, I may in the future also split the rankings into all-time authors versus those writing since the start of the pandemic.

## Suggestions and Other Resources

Building “impact frameworks” like this has always been a deep interest of mine, so please comment with recommendations. I would love to augment this work with other input.
 
## Data Sources

Beyond the provided data, I also took advantage of the CORD19 Metadata Enrichment Dataset. It can be loaded into any notebook using the *+ Add Data* button at the top right of the notebook and the URL linked below.

### [CORD-19 Metadata Enrichment Dataset](https://www.kaggle.com/dannellyz/cord19-metadata-enrichment)

# Starting with only the papers in the CORD-19 Dataset
While this is the most limited dataset, it holds most true to the aim of the research effort: evaluating the given data. It in turn ranks the most influential authors only by their contributions to this set of literature.

## Scoring Mechanism: CORD-19 Impact
Currently this is made up of a simple equation: two points for every paper authored and one point for every reference to the author's work in another paper in the dataset. The equation says that authoring a paper on the subject is twice as significant as having your work referenced in someone else's paper. I am happy to take thoughts on reworking this model. I specifically did not go with the H-Factor due to its controversial nature, and I am happy to discuss this decision as well.

$$\sum_{\text{All Papers}} \Big( 2 \cdot (\text{Authored Paper in CORD-19}) + 1 \cdot (\text{Author's Paper Referenced in Other CORD-19 Paper}) \Big) \rightarrow \text{Normalized}$$
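
As a rough illustration of the scoring idea, here is a minimal sketch on toy data. The author names and counts are hypothetical, and the sketch assumes each component is min-max scaled to [0, 1] before the 2:1 weighting is applied. It uses plain Python rather than pandas so it can be run anywhere.

```python
# Toy data (hypothetical authors and counts, not taken from CORD-19)
papers_authored = {"Author A": 10, "Author B": 4, "Author C": 1}
times_cited = {"Author A": 30, "Author B": 50, "Author C": 0}

def min_max(values):
    """Scale a dict of numbers to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    # Guard against division by zero when all values are equal
    return {k: ((v - lo) / span if span else 0.0) for k, v in values.items()}

authored_norm = min_max(papers_authored)
cited_norm = min_max(times_cited)

# Two points for authorship, one point for citations, on the scaled values
impact = {a: 2 * authored_norm[a] + cited_norm[a] for a in papers_authored}

# Rank authors from highest to lowest impact
ranking = sorted(impact, key=impact.get, reverse=True)
print(ranking)
```

In this toy case Author A ranks first despite being cited less often than Author B, because authorship carries twice the weight of citations.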

### Most Influential Authors

#### All Time
1. Yuen, Kwok-Yung
2. Perlman, Stanley	
3. Baric, Ralph S.	
4. Drosten, Christian	
5. Enjuanes, Luis	

## Count cited works referenced in the CORD dataset

Load in the dataset

In [None]:
import pandas as pd
import numpy as np
#Set max columns expansion
pd.set_option('display.max_columns', None)

#Ensure that the Enrichment Dataset is loaded
enrich_file_path = "/kaggle/input/cord19-metadata-enrichment/"

#Load CORD Metadata
cord_file_path = "/kaggle/input/CORD-19-research-challenge/"
metadata = pd.read_csv(cord_file_path + "metadata.csv", index_col="cord_uid")

#Set the publish time to Datetime
metadata["publish_time"] = pd.to_datetime(metadata["publish_time"])
metadata.head(1)

Get all of the paper paths into a list.

In [None]:
import os
paper_folders = ['noncomm_use_subset/','comm_use_subset/','custom_license/','biorxiv_medrxiv/',]
all_papers = []
for folder in paper_folders:
    #Get all papers in a given folder (2x for traversal)
    current_dir = cord_file_path + 2*folder
    papers = os.listdir(current_dir)
    #Add back in the path for full reference
    paper_dirs = [current_dir + paper for paper in papers]
    #Append to all papers list
    all_papers.extend(paper_dirs)
print("There are a total of {} papers.".format(len(all_papers)))

Count the papers in the various bibliographies. 

In [None]:
%%time
import json
import itertools
import multiprocessing
from tqdm.notebook import tqdm

#Get a list of all unique titles found in CORD-19
cord19_titles = set(metadata.title)

#Read one paper's JSON and return the titles listed in its bibliography
def get_series(paper):
    with open(paper) as f:
        bib_entries = json.load(f)["bib_entries"]
    #Titles not found in CORD-19 are dropped later by the left merge with the metadata
    return [entry.get("title") for entry in bib_entries.values()]

def get_title_counts():
    #Collect every bibliography title from all papers
    #Use multiprocessing to speed up
    #Send to lower for compare
    with multiprocessing.Pool(4) as p:
        all_bibs = pd.Series(list(itertools.chain(*p.map(get_series, all_papers)))).str.lower()
    
    #Count the number of times each title exists
    title_counts = all_bibs.value_counts()
    return title_counts

title_counts = get_title_counts()
title_counts.sort_values(ascending=False).head()
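
The counting step above can be sketched in plain Python on a handful of made-up bibliographies. The titles below are hypothetical; the point is only to show why the titles are lowercased before counting and how missing titles drop out.

```python
from collections import Counter
from itertools import chain

# Hypothetical bibliographies: one list of cited titles per paper
bibliographies = [
    ["A Novel Coronavirus", "Bat Origins of SARS"],
    ["a novel coronavirus", None],  # None stands in for a missing title
    ["Bat origins of SARS", "A novel coronavirus"],
]

# Flatten, skip missing titles, and lowercase so casing differences
# do not split the count for the same work
cited = (t.lower() for t in chain.from_iterable(bibliographies) if t)
title_counts = Counter(cited)

print(title_counts.most_common(2))
```

Exact lowercase matching is still strict: titles that differ in punctuation or spacing would be counted as separate works.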

Merge into the dataset

In [None]:
#Make Title Counts into Dataframe
titles_df = pd.DataFrame(title_counts).reset_index()
titles_df.columns = ["title", "title_counts"]
#Merge with Metadata
#set title to lower in order to merge
metadata["title"] = metadata["title"].str.lower()
score_df = metadata.merge(titles_df, on="title", how="left")

Expand the CORD-19 Dataset to have a row for each author

In [None]:
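#Split the authors column by the ; separator
score_df["authors"] = score_df.authors.str.split(";")

#Explode the df on the authors column
#This makes a duplicate row for each of the items in the authors list
score_df = score_df.explode("authors")
#Strip whitespace left by the split so each author name groups consistently
score_df["authors"] = score_df["authors"].str.strip()
score_df.shape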
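
To make the split-and-explode step concrete, here is a minimal plain-Python sketch of the same idea. The rows are hypothetical, and unlike `DataFrame.explode`, which keeps a row for missing values, this simplified version skips papers with no author field.

```python
# Hypothetical rows: (title, authors string) in "Last, First; Last, First" style
rows = [
    ("paper one", "Doe, Jane; Roe, Richard"),
    ("paper two", "Roe, Richard"),
    ("paper three", None),  # missing author field
]

exploded = []
for title, authors in rows:
    if not authors:
        continue  # nothing to attribute this paper to
    for name in authors.split(";"):
        # Strip so " Roe, Richard" and "Roe, Richard" count as one author
        exploded.append((name.strip(), title))

# Count one authored paper per (author, paper) row
papers_per_author = {}
for name, _ in exploded:
    papers_per_author[name] = papers_per_author.get(name, 0) + 1

print(papers_per_author)
```

Without the `strip()`, the second author of "paper one" would carry a leading space and be counted separately from the sole author of "paper two".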

## Get CORD-19 Impact Score

In [None]:
#Fill title_counts NaNs with 0
score_df["title_counts"] = score_df["title_counts"].fillna(0)

#Drop columns not used in calc
score_df = score_df[["authors", "title_counts"]]

#Group by authors: count = works authored, sum = times those works were cited
score_df = score_df.groupby("authors").agg(["count", "sum"])
#Flatten the MultiIndex columns so the merge below works on a single level
score_df.columns = ["works_authored_raw", "works_cited_raw"]

#Normalize the columns since they are such unique types
from sklearn import preprocessing
x = score_df.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
score_df_norm = pd.DataFrame(x_scaled, index=score_df.index, columns=["works_authored", "works_cited"])

#Calc Impact
score_df_norm["CORD19_impact"] = 2*score_df_norm["works_authored"] + score_df_norm["works_cited"]

#Merge in raw for comparison
cord_impact_final = pd.merge(score_df,score_df_norm, left_index=True, right_index=True)
cord_impact_final.sort_values("CORD19_impact", ascending=False).head(5)

# Paper Significance: Adding Journal Impact and Social Media References
This extends the original dataset with features derived from both the [Microsoft Academic Knowledge API](https://www.kaggle.com/dannellyz/cord19-metadata-enrich-microsoft-academic-api) and the [Altmetric API](https://www.kaggle.com/dannellyz/cord19-metadata-enrich-altmetric-api). Those datasets are integrated into the [enrichment dataset](https://www.kaggle.com/dannellyz/cord19-metadata-enrichment).

## Scoring Mechanism: Paper Significance
This equation takes the square root of both scoring systems, as they are exponentially distributed. This accounts for the viral nature of the Altmetric Score and the insular nature of the Journal Rankings.

$$\sum_{\text{All Papers}} \Big( \sqrt{\text{Scimago Journal } \& \text{ Country Rank}} + \sqrt{\text{Altmetric Score}} \Big) \rightarrow \text{Normalized}$$
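
The effect of the square root can be seen in a small sketch on toy data. The authors and values are hypothetical. The sketch applies the root per paper, as the formula is written, and then min-max scales each component; this ordering is an assumption, and the notebook's own cell further down sums each author's values before taking the root.

```python
import math

# Hypothetical per-paper values: (author, journal SJR, Altmetric score)
papers = [
    ("Author A", 4.0, 10000.0),  # a single viral paper
    ("Author B", 9.0, 400.0),
    ("Author B", 9.0, 400.0),
    ("Author C", 1.0, 0.0),
]

# Sum the square roots per author
sjr_root, alt_root = {}, {}
for author, sjr, altmetric in papers:
    sjr_root[author] = sjr_root.get(author, 0.0) + math.sqrt(sjr)
    alt_root[author] = alt_root.get(author, 0.0) + math.sqrt(altmetric)

def min_max(values):
    """Scale a dict of numbers to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: ((v - lo) / span if span else 0.0) for k, v in values.items()}

sjr_norm, alt_norm = min_max(sjr_root), min_max(alt_root)

# Paper significance is the sum of the two scaled components
significance = {a: sjr_norm[a] + alt_norm[a] for a in sjr_root}
print(significance)
```

Here Author B, with two solid papers, edges out Author A's single viral paper. On the raw, unrooted values the viral Altmetric score would dominate instead.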

### Most Influential Authors

#### All Time
1. Baric, Ralph S.	
2. Yuen, Kwok-Yung	
3. Drosten, Christian	
4. Perlman, Stanley	
5. Daszak, Peter

### Add in Journal Rankings to Dataset

In [None]:
journal_rankings = pd.read_csv(enrich_file_path + "scimago_journal_rankings.csv", encoding="utf-8")
journal_rankings["SJR"] = journal_rankings["SJR"].str.replace(",","").astype("float")
journal_rankings.head(1)

In [None]:
#Send both journal columns to lower in case of odd caps
metadata["journal"] = metadata["journal"].str.lower()
journal_rankings["Title"] = journal_rankings["Title"].str.lower()

#Add journal rankings to the journals
#Read in journal names to fix from 
#https://www.kaggle.com/dannellyz/cord19-meta-enrich-replacing-fixing-journal-names

jrnl_names = pd.read_csv(enrich_file_path + "journal_abrv_replace.csv", names=["metadata_name", "sjr_name"])
jrnl_dict = dict(zip(jrnl_names.metadata_name, jrnl_names.sjr_name))
metadata["journal"] = metadata["journal"].replace(jrnl_dict)
paper_significance = metadata.merge(journal_rankings[["Title","SJR"]], left_on="journal", right_on="Title", how="left")
paper_significance.drop(["Title"], axis=1, inplace=True)
paper_significance.notnull().groupby(["journal", "SJR"]).size()

### Add in Altmetric Score to dataset

In [None]:
#Load Altmetric Data
#Due to licensing this data cannot be shared, but a notebook to get an API key and the data can be found here
#https://www.kaggle.com/dannellyz/cord19-metadata-enrich-altmetric-api
private_file_path = "/kaggle/input/altmetric-private/"
altmetric_metadata = pd.read_csv(private_file_path + "altmetric_metadata.csv", usecols=["doi", "score"])
altmetric_metadata.sort_values(by="score", ascending=False).head(5)

In [None]:
#Add to Dataframe
paper_significance = paper_significance.merge(altmetric_metadata, left_on="doi", right_on="doi", how="left")
paper_significance.drop(["doi"], axis=1, inplace=True)
paper_significance.head(5)

Explode the authors column, as in the previous section.

In [None]:
#Split the authors column by the ; separator
paper_significance["authors"] = paper_significance.authors.str.split(";")

#Explode the df on the authors column
#This makes a duplicate row for each of the items in the authors list
paper_significance_all = paper_significance.explode("authors")
#Strip whitespace left by the split so each author name groups consistently
paper_significance_all["authors"] = paper_significance_all["authors"].str.strip()
paper_significance_all.shape

### Get Paper Significance Score

In [None]:
#Fill SJR and score NaNs with 0
paper_significance_all["SJR"] = paper_significance_all["SJR"].fillna(0)
paper_significance_all["score"] = paper_significance_all["score"].fillna(0)

#Drop columns not used in calc
paper_sig = paper_significance_all[["authors", "SJR", "score"]]

#Group by authors and sum SJR and Altmetric score, then take the root of each total
paper_sig_groups = paper_sig.groupby("authors").sum()
paper_sig_groups["sjr_root"] = np.sqrt(paper_sig_groups["SJR"])
paper_sig_groups["score_root"] = np.sqrt(paper_sig_groups["score"])

#Normalize the columns since they are such unique types
from sklearn import preprocessing
x = paper_sig_groups.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
paper_sig_norm = pd.DataFrame(x_scaled, index=paper_sig_groups.index, columns=["SJR", "score","sjr_root","score_root"])

#Calc Paper Significance from the normalized roots
paper_sig_norm["Paper_Significance"] = paper_sig_norm["sjr_root"] + paper_sig_norm["score_root"] 
#Merge in raw for comparison (overlapping columns get _x for raw and _y for normalized)
paper_sig_final = pd.merge(paper_sig_groups[["SJR", "score"]],paper_sig_norm, left_index=True, right_index=True)
paper_sig_final.sort_values("Paper_Significance", ascending=False).head(5)