# Introduction 

For this notebook, I took an analytics approach to get insights for COVID-19 experts/stakeholders who want to see recent relevant papers to review for different COVID-19 sub-topics. 

To rank papers, I used [the impact factor](https://en.wikipedia.org/wiki/Impact_factor) of journals, and created a list of keywords to flag for sub-topics. 

This notebook is a proof-of-concept. Possible next steps include: 
- Use Mechanical Turk to expand the journal impact factor labels (current labels cover only ~17% of 2020 papers)
- Create more keyword mappings, or use deep learning techniques, to classify more sub-topics (currently only one sub-topic is labeled: viral shedding)
- Use cloud services to run this notebook either daily or whenever the underlying dataset is updated
- Use analytics or deep learning techniques to populate other desired columns that feed into reporting, such as median days after onset at which COVID-19 is detected, material, method, and study type
- Compare paper rankings with Kaggle's [current insights dashboard](https://app.powerbi.com/view?r=eyJrIjoiODg5ODk5ZGEtYTViMy00ODAzLThiNzMtNWY2MjM5ZWUyNzU3IiwidCI6ImRjMWYwNGY1LWMxZTUtNDQyOS1hODEyLTU3OTNiZTQ1YmY5ZCIsImMiOjEwfQ%3D%3D), and survey medical/research experts to see which ranking is more insightful

I would love to hear your feedback on what would make this better! Email me at zthomas.nc@gmail.com or find me on [Twitter](https://twitter.com/zach_i_thomas). Kudos to Jake Thomas (Duke Medicine '23) for providing feedback on this notebook from the perspective of a medical/research stakeholder. 

# Sample Viral Shedding Insights
See the [Dashboard](https://www.kaggle.com/zthomas/material-studies-summary-analytics-approach?scriptVersionId=36477272#Dashboards) section of this notebook.
- **According to papers from journals with high impact factors**, SARS-CoV-2 may be detectable 50%-59% of the time in stool using PCR ([source](https://www.ncbi.nlm.nih.gov/pubmed/32125362/), [source](https://doi.org/10.1136/bmj.m1443)). SARS-CoV-2 may be present in stool samples longer than in respiratory and serum samples ([source](https://doi.org/10.1136/bmj.m1443)).
- **According to papers published in the last 28 days**, patients with COVID-19 have persistent alterations in the fecal microbiome at the time of hospitalization compared to controls ([source](https://www.ncbi.nlm.nih.gov/pubmed/32442562/)). A study in Italy showed that SARS-CoV-2 RNA could be detected in wastewater samples collected a few days after the first notified Italian case of the disease ([source](https://doi.org/10.1016/j.scitotenv.2020.139652)).
- **According to papers published in the last 28 days that may not be peer-reviewed**, wastewater surveillance could serve as a data source for COVID-19 detection in countries like Pakistan ([source](https://www.medrxiv.org/content/10.1101/2020.06.03.20121426v2)). According to a study of 12 patients in India, both symptomatic and asymptomatic patients could be positive for the SARS-CoV-2 genome in their fecal component ([source](https://www.medrxiv.org/content/10.1101/2020.05.26.20113167v1)).

# Library Import/Install 

In [None]:
# !pip install pandasql
# !pip install plotly==4.8.1

In [None]:
from collections import defaultdict
import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pandasql import sqldf
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
pysqldf = lambda q: sqldf(q, globals())

# Define Mappings

The viral shedding keyword mappings were manually curated from example papers and insights on the competition homepage.

In [None]:
keywords_dict = defaultdict(set)
keywords_dict['viral_shedding_stool'] = { 
      'feces',
      'fecal',
      'stool',
      'faecal'}

Journal Impact scores were found with Google Search. A service like Mechanical Turk could be used to obtain a more complete dataset of Impact Scores (the journals below represent about 17% of papers published in 2020).

In [None]:
journal_impact_dict = {'BMJ': 27.604, 
 'Nature': 21.126,
 'Lancet': 59.102,
 'Science': 41.063,
 'J Med Virol': 2.373,
 'J. med. virol': 2.373,
 'JAMA': 51.273,
 'Int J Infect Dis': 3.538, 
 'Crit Care': 6.7, 
 'Sci Rep': 4.011, 
 'J Infect': 4.603, 
 'N Engl J Med': 70.670, 
 'Information Processing and Management of Uncertainty in Knowledge-Based Systems': 1,
 'Lancet Infect Dis': 27.516,
 'Travel Med Infect Dis': 3.42,
 'Dermatol Ther': 3.810,
 'J Am Acad Dermatol': 7.102, 
 'Med Hypotheses': 1.322, 
 'Sci Total Environ': 5.589,
 'Advances in Information Retrieval': 1,
 'Advances in Knowledge Discovery and Data Mining': 1,
 'Clin Infect Dis': 9.117,
 'Intensive Care Med': 18.967,
 'Head Neck': 2.442, 
 'Clin. infect. dis': 9.117, 
 'Infect Control Hosp Epidemiol': 3.084, 
 'Bull Acad Natl Med': 1, 
 'New Scientist': 1, 
 'Gastroenterology': 20.877, 
 'J Clin Virol': 2.950, 
 'Psychiatry Res': 3.917, 
 'JMIR Public Health Surveill':5.175, 
 'Radiology': 7.608,
 'Viruses': 3.8111, 
 'Infection control and hospital epidemiology':3.084,
 'Brain Behav Immun': 6.306, 
 'Artificial Intelligence Applications and Innovations':1,
 'The New England journal of medicine': 70.670,
 'Scientific reports': 4.011,
 'Lancet Respir Med': 22.992, 
 'Circulation': 23.054, 
 'MMW Fortschr Med': 1,
 'Ann Intern Med': 19.315, 
 'Diabetes Metab Syndr': 3.319, 
 'Responsible Design, Implementation and Use of Information and Communication Technology':1,
 'Asian J Psychiatr':2.030,
 'Emerg Microbes Infect': 6.212,
 'Int J Surg': 3.158, 
 'PLoS One': 2.776, 
 'Anesth Analg': 3.827}
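
The dictionary above lists some journals twice under different spellings (for example `'J Med Virol'` and `'J. med. virol'`). As a rough sketch, one way to avoid maintaining such duplicates is to normalize names before the lookup. The helpers `normalize_journal` and `lookup_impact` are hypothetical and not part of this notebook, and the small dictionary copies just two entries for illustration. This would not reconcile abbreviations with full names (such as `'Sci Rep'` and `'Scientific reports'`).

```python
import re

def normalize_journal(name):
    """Lowercase, drop periods, and collapse whitespace in a journal name."""
    if not isinstance(name, str):  # NaN / None journals stay unmatched
        return None
    return re.sub(r"\s+", " ", name.replace(".", "")).strip().lower()

# Two entries copied from journal_impact_dict, keyed by one spelling each.
impact_by_name = {"J Med Virol": 2.373, "Clin Infect Dis": 9.117}
normalized_impact = {normalize_journal(k): v for k, v in impact_by_name.items()}

def lookup_impact(journal):
    """Return the impact factor for a journal, or None when it is unlabeled."""
    return normalized_impact.get(normalize_journal(journal))

print(lookup_impact("J. med. virol"))      # 2.373
print(lookup_impact("Clin. infect. dis"))  # 9.117
print(lookup_impact("Unknown Journal"))    # None
```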

# Load Data, Feature Engineering 

In [None]:
meta_df = pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv")

In [None]:
meta_df['publish_time'] = pd.to_datetime(meta_df['publish_time'])

Add relevant columns ... 

In [None]:
meta_df['abstract_clean'] = meta_df.apply(lambda x: x.abstract if (
                                            x.abstract==x.abstract and len(x.abstract) > 5) else 
                                                                  x.title, axis=1)
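
The cell above relies on the fact that `NaN != NaN`, so `x.abstract == x.abstract` is `False` for a missing abstract. The same fallback can be written as a plain function, which may be easier to read and test. This is only an illustrative sketch; `clean_abstract` and its `min_len` parameter are hypothetical names.

```python
import math

def clean_abstract(abstract, title, min_len=5):
    """Return the abstract unless it is missing or too short, else the title."""
    missing = abstract is None or (isinstance(abstract, float) and math.isnan(abstract))
    if not missing and len(abstract) > min_len:
        return abstract
    return title

print(clean_abstract(float("nan"), "A title"))  # A title
print(clean_abstract("n/a", "A title"))         # A title
print(clean_abstract("Fecal shedding was observed.", "A title"))  # Fecal shedding was observed.
```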

In [None]:
pr_sources = ['Medline', 'PMC', 'Elsevier', 'WHO']
# Flag a paper as peer-reviewed when its source string names any of these databases
meta_df['peer_review_source'] = meta_df.source_x.apply(
    lambda x: any(pr_source in x for pr_source in pr_sources))

In [None]:
meta_df['journal_impact'] = meta_df.journal.apply(lambda x: journal_impact_dict.get(x))

In [None]:
meta_df['year'] = meta_df.publish_time.dt.year

In [None]:
meta_df['viral_shedding_stool'] = meta_df.abstract_clean.apply(lambda x: 
                                        any([(keyword in str(x)) for keyword in keywords_dict['viral_shedding_stool']]))
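
Note that the substring test above is case-sensitive, so an abstract that begins with "Stool" or "Faecal" would not be flagged. A minimal sketch of a case-insensitive variant, using only the standard library, is shown here; `stool_pattern` and `mentions_stool` are hypothetical names, and the keyword set is copied from `keywords_dict['viral_shedding_stool']`.

```python
import re

# Same terms as keywords_dict['viral_shedding_stool'] in this notebook.
stool_keywords = {"feces", "fecal", "stool", "faecal"}

# One alternation pattern; re.escape guards against regex metacharacters in keywords.
stool_pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(stool_keywords)), re.IGNORECASE
)

def mentions_stool(text):
    """True when any keyword occurs in the text, ignoring case."""
    return bool(stool_pattern.search(str(text)))

print(mentions_stool("Stool samples remained PCR-positive."))  # True
print(mentions_stool("Faecal shedding of viral RNA"))          # True
print(mentions_stool("Respiratory samples only"))              # False
```

If applied to the dataframe, something like `meta_df.abstract_clean.apply(mentions_stool)` should behave like the original cell apart from the case handling.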

# Dashboard Pipeline

In [None]:
max_time = """(SELECT MAX(publish_time) FROM meta_df 
                WHERE viral_shedding_stool = TRUE AND peer_review_source = TRUE and publish_time <= date('now'))"""
pr_source = "AND peer_review_source = TRUE"
pr_source_false = "AND peer_review_source = FALSE"
pr_monthly_sql = """
    SELECT 
        *
    FROM (
        SELECT
            "viral_shedding" category,
            DENSE_RANK() OVER(ORDER BY journal_impact DESC) rank,
            STRFTIME("%m/%d/%Y", DATE({max_time}, '-27 day')) || " - " || STRFTIME("%m/%d/%Y",{max_time}) date_range, 
            STRFTIME("%m/%d/%Y",publish_time) publish_time,
            "<a href='" || CASE WHEN instr(url, ';') > 0 THEN substr(url, 1, instr(url, ';') - 1) ELSE url END || "'>" || title || "</a>" title, 
            abstract abstract, 
            journal || " (" || ROUND(journal_impact,2) || ")" journal
          FROM 
              meta_df 
          WHERE 
              publish_time BETWEEN 
                DATE({max_time},'-27 day') 
                AND
                {max_time}
              AND
              viral_shedding_stool = TRUE
              {pr_source}
    )
    WHERE 
        rank <= 20 
  ORDER BY 
     rank ASC
"""
pr_YTD_sql = """
        SELECT 
        *
        FROM 
        (
        SELECT
            "viral_shedding" category, 
            DENSE_RANK() OVER(ORDER BY journal_impact DESC) rank,
           (SELECT 
               STRFTIME("%m/%d/%Y", MIN(publish_time)) 
               FROM meta_df 
               WHERE year = 2020 ) || " - " || 
               (SELECT 
               STRFTIME("%m/%d/%Y", MAX(publish_time)) 
               FROM meta_df 
               WHERE year = 2020 AND publish_time <= date('now')) date_range, 
            STRFTIME("%m/%d/%Y",publish_time) publish_time,
            "<a href='" || CASE WHEN instr(url, ';') > 0 THEN substr(url, 1, instr(url, ';') - 1) ELSE url END || "'>" || title || "</a>" title, 
            abstract abstract, 
            journal || " (" || ROUND(journal_impact,2) || ")" journal
        FROM 
            meta_df 
        WHERE 
            viral_shedding_stool = TRUE
            AND 
            year = 2020 
            {pr_source}
        )
        WHERE 
            rank <= 20
      ORDER BY 
        rank ASC
"""

last_28_days_pr_dataset = pysqldf(pr_monthly_sql.format(max_time=max_time, pr_source = pr_source))
last_28_days_dataset = pysqldf(pr_monthly_sql.format(max_time=max_time, pr_source =pr_source_false))
ytd_dataset = pysqldf(pr_YTD_sql.format(pr_source = pr_source))
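
To make the ranking logic in the SQL above easier to follow, here is a small pure-Python sketch of the same idea: keep papers from the 28 days ending at the most recent publish date, then dense-rank them by journal impact factor. The toy rows and the `rank_recent` helper are hypothetical. Unlike the SQL, the sketch does not filter by keyword or peer-review flag, and it assumes every row has an impact factor.

```python
from datetime import date, timedelta

# Hypothetical toy rows: (title, publish_date, journal_impact)
papers = [
    ("A", date(2020, 6, 1), 27.604),
    ("B", date(2020, 5, 20), 70.670),
    ("C", date(2020, 5, 25), 27.604),
    ("D", date(2020, 4, 1), 59.102),  # outside the 28-day window
]

def rank_recent(rows, window_days=28, top_n=20):
    """Dense-rank rows from the last `window_days` days by impact, highest first."""
    latest = max(published for _, published, _ in rows)
    start = latest - timedelta(days=window_days - 1)  # mirrors DATE(max_time, '-27 day')
    recent = [r for r in rows if start <= r[1] <= latest]
    # Distinct impact values, highest first; ties share a rank and no ranks are skipped.
    impacts = sorted({r[2] for r in recent}, reverse=True)
    rank_of = {impact: i + 1 for i, impact in enumerate(impacts)}
    ranked = sorted((rank_of[r[2]], r[0]) for r in recent)
    return [(rank, title) for rank, title in ranked if rank <= top_n]

print(rank_recent(papers))  # [(1, 'B'), (2, 'A'), (2, 'C')]
```

Because the rank is dense, `rank <= 20` can return more than 20 rows when several papers share a journal.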

# Dashboards

In [None]:
fig = go.Figure(data=[go.Table(
                columnwidth = [40*1.5,80*1.5,100*1.5,250*1.5,70*1.5], 
                header=dict(values=['Rank', 'Publish Time', 'Title + Link', 'Abstract', 'Journal (I-Factor)']),
                 cells=dict(values=[ytd_dataset['rank'],
                                    ytd_dataset.publish_time, 
                                    ytd_dataset.title, 
                                    ytd_dataset.abstract, 
                                    ytd_dataset.journal])
                        )
                     ])
fig.update_layout(
    title="Most Notable COVID-19 Papers YTD ({})".format(ytd_dataset.date_range.iloc[0])
)

fig.update_layout(
    updatemenus=[
        dict(
            buttons=list([
                dict(
                    args=["category", "viral_shedding"],
                    label="Viral Shedding",
                    method="restyle"
                ),
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.15,
            xanchor="left",
            y=1.065,
            yanchor="top"
        ),
    ]
)

# Add annotation
fig.update_layout(
    annotations=[
        dict(text="Paper Category:", showarrow=False,
        x=0, y=1.05, yref="paper", align="left")
    ]
)

fig.update_layout(
    height=1000,
)

fig.show()

In [None]:
fig = go.Figure(data=[go.Table(
                columnwidth = [40*1.5,80*1.5,100*1.5,250*1.5,70*1.5], 
                header=dict(values=['Rank', 'Publish Time', 'Title + Link', 'Abstract', 'Journal (I-Factor)']),
                 cells=dict(values=[last_28_days_pr_dataset['rank'],
                                    last_28_days_pr_dataset.publish_time, 
                                    last_28_days_pr_dataset.title, 
                                    last_28_days_pr_dataset.abstract, 
                                    last_28_days_pr_dataset.journal])
                        )
                     ])
fig.update_layout(
    title="Most Notable COVID-19 Papers, last 28 days ({})".format(last_28_days_pr_dataset.date_range.iloc[0])
)

fig.update_layout(
    updatemenus=[
        dict(
            buttons=list([
                dict(
                    args=["category", "viral_shedding"],
                    label="Viral Shedding",
                    method="restyle"
                ),
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.15,
            xanchor="left",
            y=1.065,
            yanchor="top"
        ),
    ]
)
# Add annotation
fig.update_layout(
    annotations=[
        dict(text="Paper Category:", showarrow=False,
        x=0, y=1.05, yref="paper", align="left")
    ]
)

fig.update_layout(
    height=1000,
)

fig.show()

In [None]:
fig = go.Figure(data=[go.Table(
                columnwidth = [40*1.5,80*1.5,100*1.5,250*1.5,70*1.5], 
                header=dict(values=['Rank', 'Publish Time', 'Title + Link', 'Abstract', 'Journal']),
                 cells=dict(values=[last_28_days_dataset['rank'],
                                    last_28_days_dataset.publish_time, 
                                    last_28_days_dataset.title, 
                                    last_28_days_dataset.abstract, 
                                    last_28_days_dataset.journal])
                        )
                     ])
fig.update_layout(
    title="Most Notable COVID-19 Papers (not necessarily peer reviewed), last 28 days ({})".format(last_28_days_dataset.date_range.iloc[0])
)

fig.update_layout(
    updatemenus=[
        dict(
            buttons=list([
                dict(
                    args=["category", "viral_shedding"],
                    label="Viral Shedding",
                    method="restyle"
                ),
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.15,
            xanchor="left",
            y=1.065,
            yanchor="top"
        ),
    ]
)
# Add annotation
fig.update_layout(
    annotations=[
        dict(text="Paper Category:", showarrow=False,
        x=0, y=1.05, yref="paper", align="left")
    ]
)

fig.update_layout(
    height=1000,
)

fig.show()
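
The three dashboard cells above differ only in the dataframe and the title, so the shared table settings could be factored into one helper. The sketch below builds a plain dictionary in the shape Plotly uses for figures; `table_figure_spec` is a hypothetical helper, and it leaves out the dropdown menu and annotation for brevity. My understanding is that a dict with `data` and `layout` keys can be passed to `go.Figure(...)`, but treat that step as an assumption and check it in your own environment.

```python
def table_figure_spec(columns, title, height=1000):
    """Build a Plotly-style figure dict for a table from {header: values} columns."""
    widths = [60, 120, 150, 375, 105]  # same values as 40*1.5, 80*1.5, ... above
    return {
        "data": [{
            "type": "table",
            "columnwidth": widths[:len(columns)],
            "header": {"values": list(columns.keys())},
            "cells": {"values": [list(v) for v in columns.values()]},
        }],
        "layout": {"title": {"text": title}, "height": height},
    }

spec = table_figure_spec(
    {"Rank": [1, 2], "Title + Link": ["Paper A", "Paper B"]},
    title="Most Notable COVID-19 Papers YTD",
)
print(spec["data"][0]["header"]["values"])  # ['Rank', 'Title + Link']
```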

# EDA

**Summary of insights**:
- **When were papers published?**: Papers go back to 2003. Prior to 2020, there were two peaks in publication volume: between 2003 and 2009 annual research peaked around 2006, and between 2010 and 2019 annual research peaked in 2015 and 2016.
    - Some papers do not have a full MM-DD-YYYY date and are tagged only with the year “2020”. _That is why, in addition to a view of the best papers for the past 28 days, I also include a year-to-date view, so those papers aren't left out._ 
- **Abstracts:** About 20% of papers since 2010 don’t have an abstract, and that rises to 33% for 2020 papers. That’s a lot. _That is why, in my ETL, I replace missing abstracts and abstracts of 5 characters or fewer with the paper title (`abstract_clean`)._
    - The distribution of abstract character length is not normal. When you exclude papers with no abstract, it skews right.
- **Number of articles**: 100.1k since 2010, 136k overall
- **Journals**: There are 15,612 unique journals in the dataset, 13,223 of them from papers published in or after 2010
    - Of the post-2009 papers, the journals with the most publications are PLoS One and BioRxiv. The top 20 journals account for 10.7% of papers in this time period; in other words, there is a very long tail here. 
    - Even when only looking at papers from 2020, the top 50 journals by paper count only account for 17.2% of 2020 papers

In [None]:
meta_df['publish_time_month_year'] = pd.to_datetime(meta_df.publish_time).dt.to_period('M')

In [None]:
meta_df.query("publish_time_month_year > '1970-01'").publish_time_month_year.value_counts().sort_index().plot()

In [None]:
meta_df.query("publish_time_month_year > '2001-01' and publish_time_month_year < '2020-06'").publish_time_month_year.value_counts().sort_index().plot()

In [None]:
meta_df.query("publish_time_month_year >= '2019-11' and publish_time_month_year <= '2020-06'").publish_time_month_year.value_counts().sort_index().plot()

The spike in January comes from papers with a publish time of just "2020" (no specific date), which get mapped to 2020-01-01.
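
One way to separate these year-only records from papers genuinely published on January 1 is to flag them on the raw strings, before `pd.to_datetime` is applied. The sketch below assumes the raw `publish_time` column stores such values literally as a four-digit year, as described above; `is_year_only` is a hypothetical helper.

```python
import re

def is_year_only(raw):
    """True when a raw publish_time string is just a four-digit year."""
    return isinstance(raw, str) and re.fullmatch(r"\d{4}", raw.strip()) is not None

raw_dates = ["2020", "2020-03-15", "2020-01-01", None]
print([is_year_only(d) for d in raw_dates])  # [True, False, False, False]
```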

In [None]:
meta_df.query("publish_time >= '2019-11-01' and publish_time <= '2020-06-01'").publish_time.value_counts().sort_index().plot(figsize=(10,8))

In [None]:
meta_df.query("publish_time_month_year >= '2019-01' and publish_time_month_year <= '2019-12'").publish_time_month_year.value_counts().sort_index().plot()

In [None]:
meta_df.query("publish_time >= '2019-01-01' and publish_time <= '2019-02-01'").publish_time.value_counts().sort_index().plot(figsize=(10,8))

In [None]:
meta_df.source_x.value_counts().iloc[::-1].plot.barh(figsize=(10,8))

In [None]:
meta_df.journal.nunique()

In [None]:
meta_df.query("publish_time >= '2010-01-01'").journal.nunique()

In [None]:
meta_df.query("publish_time >= '2010-01-01'").journal.value_counts().head(20).sum()/meta_df.query("publish_time >= '2010-01-01'").shape[0]
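
The calculation above divides the paper count of the 20 most frequent journals by the number of all papers in the period, including papers with no journal listed. A small standard-library sketch of the same "top-n share" idea follows; `top_n_share` and the sample list are hypothetical.

```python
from collections import Counter

def top_n_share(journals, n):
    """Fraction of all papers that come from the n most frequent journals."""
    named = [j for j in journals if isinstance(j, str)]  # value_counts() also skips NaN
    top = Counter(named).most_common(n)
    return sum(count for _, count in top) / len(journals)

sample = ["PLoS One", "PLoS One", "bioRxiv", "Viruses", None]
print(top_n_share(sample, 1))  # 0.4
```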

In [None]:
meta_df.journal.value_counts().head(10)

In [None]:
meta_df['abstract_len'] = meta_df.abstract.str.len()

In [None]:
meta_df['abstract_len'].describe()

In [None]:
meta_df['abstract_len'].hist()

In [None]:
meta_df['abstract_words'] = meta_df.abstract.apply(lambda x: len(str(x).split(' ')))

In [None]:
meta_df['abstract_words'].describe()

In [None]:
meta_df.query('abstract_words >1 ')['abstract_words'].describe()

In [None]:
meta_df.query("publish_time >= '2020-01-01'")['abstract_words'].describe()

In [None]:
meta_df.query("publish_time >= '2020-01-01' and abstract_words <10").shape[0]/\
meta_df.query("publish_time >= '2020-01-01'").shape[0]

In [None]:
meta_df.query("publish_time >= '2010-01-01' and abstract_words < 400")['abstract_words'].hist()

In [None]:
meta_df.query("publish_time >= '2010-01-01' and abstract_words < 400")['abstract_words'].value_counts().head(5)

In [None]:
meta_df.query("publish_time >= '2010-01-01' and abstract_words < 10")['abstract_words'].value_counts()

In [None]:
21856/100000

In [None]:
meta_df.query("abstract_words ==7").abstract.head()

In [None]:
print(meta_df.query("publish_time >= '2010-01-01'").shape[0])
print(meta_df.shape[0])

In [None]:
wordcloud_abstract = WordCloud(background_color="white").generate(' '.join(meta_df.query("publish_time >= '2019-10-01'").abstract.astype(str)))


In [None]:
plt.figure(figsize = (10,10))
plt.imshow(wordcloud_abstract, interpolation='bilinear')
plt.axis("off")

In [None]:
wordcloud_titles = WordCloud(background_color="white").generate(' '.join(meta_df.query("publish_time >= '2019-10-01'").title.astype(str)))

In [None]:
plt.figure(figsize = (10,10))
plt.imshow(wordcloud_titles, interpolation='bilinear')
plt.axis("off")