## Enrich the metadata by publication exposure using journal ranking

Although a journal's rank is not always a reliable indicator of the quality of an individual paper, it is reasonably safe to assume that papers published in higher-ranked journals receive more exposure, and therefore have more influence on the overall knowledge landscape.

To obtain detailed journal ranking information, we will use the Scimago data set, available from either the Scimago website or Kaggle. First, we will split the COVID-19 metadata CSV into peer-reviewed publications and preprints. Next, we will join the Kaggle COVID-19 metadata with the Scimago data and explore the H-index of the overlapping publications in the Kaggle set. Finally, we will use the H-index to create three exposure categories: "low", "medium", and "high".

In [None]:
import pandas as pd

# load covid metadata table


metadata_path = '/kaggle/input/CORD-19-research-challenge/metadata.csv'

metadata = pd.read_csv(metadata_path)
metadata.head()

In [None]:
# look up papers by a specific author (rows with missing authors are skipped)
df = metadata[metadata['authors'].str.contains("Yakimovich", na=False)]
df.head()

Remarkably, a paper I co-authored is also in the dataset.

In [None]:
# load the Scimago journal ranking table (the CSV is semicolon-separated)
import os

scimago_path = os.path.join('/kaggle/input','scimagojournalcountryrank','scimagojr 2018.csv')

scimago_data = pd.read_csv(scimago_path, sep=';')
scimago_data.head()

In [None]:
# split the metadata into preprints and entries that name a journal
metadata_preprints = metadata[(metadata['source_x']=='biorxiv')|(metadata['source_x']=='medrxiv')]
metadata_journals = metadata.dropna(subset=['journal'])

# join on the journal title; this is an exact, case-sensitive match
enriched_metadata = pd.merge(left=metadata_journals, right=scimago_data, left_on='journal', right_on='Title')
print('total metadata entries: {} (of which preprints {} and journals {}), scimago entries: {}, matching entries: {}'.format(len(metadata), len(metadata_preprints), len(metadata_journals), len(scimago_data), len(enriched_metadata)))
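
Because the join above matches journal titles exactly, differences in case or whitespace can silently drop rows. The following is a minimal sketch of one way to check for that, using invented toy tables rather than the real data. The helper `normalize_title` and the `journal_key` column are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Toy stand-ins for the two tables; all titles and values are invented.
papers = pd.DataFrame({
    "title": ["paper A", "paper B", "paper C"],
    "journal": ["Journal of Examples", "journal of  examples ", "Unlisted Letters"],
})
rankings = pd.DataFrame({
    "Title": ["Journal of Examples"],
    "H index": [42],
})

def normalize_title(titles: pd.Series) -> pd.Series:
    """Lowercase, trim, and collapse internal whitespace (hypothetical helper)."""
    return titles.str.lower().str.strip().str.replace(r"\s+", " ", regex=True)

papers["journal_key"] = normalize_title(papers["journal"])
rankings["journal_key"] = normalize_title(rankings["Title"])

# Exact join, as in the cell above: only the identically spelled title matches.
exact = pd.merge(papers, rankings, left_on="journal", right_on="Title")

# Join on the normalized key; indicator=True adds a "_merge" column that
# records whether each row found a partner in the rankings table.
normalized = pd.merge(papers, rankings, on="journal_key", how="left", indicator=True)
matched = int((normalized["_merge"] == "both").sum())
unmatched = normalized.loc[normalized["_merge"] == "left_only", "journal"]

print("exact matches:", len(exact))
print("normalized matches:", matched)
print("unmatched journals:", unmatched.tolist())
```

Listing the unmatched journals this way gives a quick view of which titles would need further cleaning, such as abbreviations, before they could be ranked.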

In [None]:
enriched_metadata.head()

In [None]:
# let's explore the distribution of the H-index in the Kaggle literature
import matplotlib.pyplot as plt
import seaborn as sns

# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
ax = sns.histplot(enriched_metadata['H index'].values, kde=True)

In [None]:
enriched_metadata[enriched_metadata['H index']>1000].head()

In [None]:
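# bins are (0, 10], (10, 50], (50, 10000]; include_lowest=True keeps an H-index of 0 in "low"
enriched_metadata['exposure'] = pd.cut(enriched_metadata['H index'], bins=[0, 10, 50, 10000], labels=["low", "medium", "high"], include_lowest=True)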

enriched_metadata.to_csv('/kaggle/working/enriched_metadata.csv')
enriched_metadata[enriched_metadata['exposure']=='low'].head()
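
To make the bin boundaries concrete, here is a small sketch of how `pd.cut` assigns the exposure labels at the edges, using made-up H-index values. By default the intervals are closed on the right, so a value of exactly 0 receives no label unless `include_lowest=True` is passed.

```python
import pandas as pd

# Made-up H-index values chosen to sit on and around the bin edges.
h_index = pd.Series([0, 1, 10, 11, 50, 51, 1000])

bins = [0, 10, 50, 10000]
labels = ["low", "medium", "high"]

# Default: intervals are (0, 10], (10, 50], (50, 10000], so 0 is unlabeled (NaN).
default_cut = pd.cut(h_index, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 becomes "low".
inclusive_cut = pd.cut(h_index, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "H index": h_index,
    "default": default_cut,
    "include_lowest": inclusive_cut,
}))
```

The thresholds 10 and 50 fall into the lower category in both variants, since each interval includes its right edge.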