## About this notebook

In this notebook, I quickly explore the `biorxiv` subset of the papers. Because the papers are stored as nested JSON, their structure is too complex to analyze directly. I therefore explore the structure of those files and also provide the following helper functions, which format the inner dictionaries of each file:
* `format_name(author)`
* `format_affiliation(affiliation)`
* `format_authors(authors, with_affiliation=False)`
* `format_body(body_text)`
* `format_bib(bibs)`

Feel free to reuse those functions for your own purpose!

Throughout the EDA, I show you how to use each of those functions. At the end, I show you how to generate a clean version of the `biorxiv` subset as well as all the other datasets, which you can use directly by choosing this notebook as a data source ("File" -> "Add or upload data" -> "Kernel Output File" tab -> search the name of this notebook).

In [1]:
import os
import json
from pprint import pprint
from copy import deepcopy

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

## Helper Functions

Unhide the cell below to find the definition of the following functions:
* `format_name(author)`
* `format_affiliation(affiliation)`
* `format_authors(authors, with_affiliation=False)`
* `format_body(body_text)`
* `format_bib(bibs)`

In [2]:
def format_name(author):
    middle_name = " ".join(author['middle'])
    
    if author['middle']:
        return " ".join([author['first'], middle_name, author['last']])
    else:
        return " ".join([author['first'], author['last']])


def format_affiliation(affiliation):
    text = []
    location = affiliation.get('location')
    if location:
        text.extend(list(affiliation['location'].values()))
    
    institution = affiliation.get('institution')
    if institution:
        text = [institution] + text
    return ", ".join(text)

def format_authors(authors, with_affiliation=False):
    name_ls = []
    
    for author in authors:
        name = format_name(author)
        if with_affiliation:
            affiliation = format_affiliation(author['affiliation'])
            if affiliation:
                name_ls.append(f"{name} ({affiliation})")
            else:
                name_ls.append(name)
        else:
            name_ls.append(name)
    
    return ", ".join(name_ls)

def format_body(body_text):
    texts = [(di['section'], di['text']) for di in body_text]
    texts_di = {di['section']: "" for di in body_text}
    
    for section, text in texts:
        texts_di[section] += text

    body = ""

    for section, text in texts_di.items():
        body += section
        body += "\n\n"
        body += text
        body += "\n\n"
    
    return body

def format_bib(bibs):
    if isinstance(bibs, dict):
        bibs = list(bibs.values())
    bibs = deepcopy(bibs)
    formatted = []
    
    for bib in bibs:
        bib['authors'] = format_authors(
            bib['authors'], 
            with_affiliation=False
        )
        formatted_ls = [str(bib[k]) for k in ['title', 'authors', 'venue', 'year']]
        formatted.append(", ".join(formatted_ls))

    return "; ".join(formatted)

Unhide the cell below to find the definition of the following functions:
* `load_files(dirname)`
* `generate_clean_df(all_files)`

In [3]:
def load_files(dirname):
    filenames = os.listdir(dirname)
    raw_files = []

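    for filename in tqdm(filenames):
        # os.path.join works whether or not dirname ends with a slash
        filepath = os.path.join(dirname, filename)
        # The context manager closes each file handle after it is parsed
        with open(filepath, 'rb') as f:
            raw_files.append(json.load(f))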
    
    return raw_files

def generate_clean_df(all_files):
    cleaned_files = []
    
    for file in tqdm(all_files):
        features = [
            file['paper_id'],
            file['metadata']['title'],
            format_authors(file['metadata']['authors']),
            format_authors(file['metadata']['authors'], 
                           with_affiliation=True),
            format_body(file['abstract']),
            format_body(file['body_text']),
            format_bib(file['bib_entries']),
            file['metadata']['authors'],
            file['bib_entries']
        ]

        cleaned_files.append(features)

    col_names = ['paper_id', 'title', 'authors',
                 'affiliations', 'abstract', 'text', 
                 'bibliography','raw_authors','raw_bibliography']

    clean_df = pd.DataFrame(cleaned_files, columns=col_names)
    
    return clean_df

## Biorxiv: Exploration

Let's first take a quick glance at the `biorxiv` subset of the data. We will also use this opportunity to load all of the JSON files into a list of **nested** dictionaries (each `dict` is an article).

In [4]:
biorxiv_dir = '/kaggle/input/CORD-19-research-challenge/2020-03-13/biorxiv_medrxiv/biorxiv_medrxiv/'
filenames = os.listdir(biorxiv_dir)
print("Number of articles retrieved from biorxiv:", len(filenames))

Number of articles retrieved from biorxiv: 803


In [5]:
all_files = []

for filename in filenames:
    # The context manager closes each file handle after it is parsed
    with open(os.path.join(biorxiv_dir, filename), 'rb') as f:
        all_files.append(json.load(f))

In [6]:
file = all_files[0]
print("Dictionary keys:", file.keys())

Dictionary keys: dict_keys(['paper_id', 'metadata', 'abstract', 'body_text', 'bib_entries', 'ref_entries', 'back_matter'])
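
Before looking at each key in turn, it can help to print an outline of the nesting. The sketch below is not part of the original notebook: `summarize_structure` is a hypothetical helper, `toy_paper` is made-up data that only mimics the keys listed above, and the helper assumes every list is homogeneous, so it inspects only the first element.

```python
def summarize_structure(obj, depth=0, max_depth=3):
    """Return one line per key, giving the type of each value in a nested dict/list."""
    indent = "  " * depth
    lines = []
    if depth >= max_depth:
        return lines
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            lines.extend(summarize_structure(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Assumption: lists are homogeneous, so the first element is representative.
        lines.append(f"{indent}[0]: {type(obj[0]).__name__}")
        lines.extend(summarize_structure(obj[0], depth + 1, max_depth))
    return lines


# Made-up article that only imitates the real key names.
toy_paper = {
    "paper_id": "abc123",
    "metadata": {"title": "A toy title", "authors": []},
    "body_text": [{"section": "Introduction", "text": "Some text."}],
}

print("\n".join(summarize_structure(toy_paper)))
```

On a real article you would call `summarize_structure(file)`, possibly with a larger `max_depth`.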


## Biorxiv: Abstract

The abstract dictionary is fairly simple:

In [7]:
pprint(file['abstract'])

[{'cite_spans': [],
  'ref_spans': [],
  'section': 'Abstract',
  'text': 'Viruses possessing class I fusion proteins require proteolytic '
          'activation by host cell proteases to mediate 18 fusion with the '
          'host cell membrane. The mammalian SPINT2 gene encodes a protease '
          'inhibitor that 19 targets trypsin-like serine proteases. Here we '
          'show the protease inhibitor, SPINT2, restricts cleavage-20 '
          'activation efficiently for a range of influenza viruses and for '
          'human metapneumovirus (HMPV). SPINT2 21 treatment resulted in the '
          'cleavage and fusion inhibition of full-length influenza A/CA/04/09 '
          '(H1N1) HA, 22'}]


## Biorxiv: body text

Let's first probe what the `body_text` dictionary looks like:

In [8]:
print("body_text type:", type(file['body_text']))
print("body_text length:", len(file['body_text']))
print("body_text keys:", file['body_text'][0].keys())

body_text type: <class 'list'>
body_text length: 36
body_text keys: dict_keys(['text', 'cite_spans', 'ref_spans', 'section'])


Let's look at the first part of the `body_text` content. As you will notice, the body text is split into a list of small subsections, each with a `section` and a `text` key. Since multiple subsections can share the same section title, we first need to group the subsections by section before concatenating everything.

In [9]:
print("body_text content:")
pprint(file['body_text'][:2], depth=3)

body_text content:
[{'cite_spans': [],
  'ref_spans': [{...}],
  'section': 'Introduction 9',
  'text': 'Influenza-like illnesses (ILIs) represent a significant burden on '
          'public health and can be caused by 10 a range of respiratory '
          'viruses in addition to influenza virus itself (1). An ongoing goal '
          'of anti-viral drug 11 discovery is to develop broadly-acting '
          'therapeutics that can be used in the absence of definitive '
          'diagnosis, 12 such as in the case of ILIs. For such strategies to '
          'succeed, drug targets that are shared across virus families 13 need '
          'to be identified. 14 For the SPINT2 inhibition assays, trypsin '
          '(which typically resides in the intestinal tract and expresses a '
          'very 1 broad activity towards different HA subtypes and HMPV F) '
          'served as a control (41). In addition, furin was 2 used as a '
          'negative control that is not inhibited by SPINT2. A

Let's see what the grouped section titles are for the example above:

In [10]:
texts = [(di['section'], di['text']) for di in file['body_text']]
texts_di = {di['section']: "" for di in file['body_text']}
for section, text in texts:
    texts_di[section] += text

pprint(list(texts_di.keys()))

['Introduction 9',
 'SPINT2 inhibits HA mediated cell-cell fusion 20',
 'SPINT2 reduces viral growth in cell culture 19',
 'Discussion 10',
 'Materials and Methods 16',
 '293T, VERO and MDCK cells (American Type Culture Collection) used for '
 'influenza experiments 18',
 'Peptide Assays 16',
 'Cytotoxicity assay 3',
 'Metabolic protein labeling and immunoprecipitation of HMPV fusion protein 11',
 'Inhibition of influenza viral infection in cell culture 8',
 'Inhibition of HMPV viral infection in cell culture 16']
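
To make the grouping step concrete, here is a toy example in which the same section title appears twice. The entries in `toy_body_text` are made up for illustration. The sketch relies on Python dictionaries keeping insertion order (guaranteed since Python 3.7), so sections stay in the order in which they are first seen.

```python
# Made-up subsections: "Introduction" appears twice, separated by "Methods".
toy_body_text = [
    {"section": "Introduction", "text": "First paragraph. "},
    {"section": "Methods", "text": "How it was done. "},
    {"section": "Introduction", "text": "Second paragraph."},
]

grouped = {}
for part in toy_body_text:
    # Append the text to whatever has already been collected for this section.
    grouped[part["section"]] = grouped.get(part["section"], "") + part["text"]

print(list(grouped.keys()))
print(grouped["Introduction"])
```

Both "Introduction" fragments end up under a single key, and the text is concatenated without any separator, which is also how `format_body` behaves.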


The following example shows what the final result looks like, after we format each section title with its content:

In [11]:
body = ""

for section, text in texts_di.items():
    body += section
    body += "\n\n"
    body += text
    body += "\n\n"

print(body[:3000])

Introduction 9

Influenza-like illnesses (ILIs) represent a significant burden on public health and can be caused by 10 a range of respiratory viruses in addition to influenza virus itself (1). An ongoing goal of anti-viral drug 11 discovery is to develop broadly-acting therapeutics that can be used in the absence of definitive diagnosis, 12 such as in the case of ILIs. For such strategies to succeed, drug targets that are shared across virus families 13 need to be identified. 14 For the SPINT2 inhibition assays, trypsin (which typically resides in the intestinal tract and expresses a very 1 broad activity towards different HA subtypes and HMPV F) served as a control (41). In addition, furin was 2 used as a negative control that is not inhibited by SPINT2. As none of the peptides used in combination 3 with the aforementioned proteases has a furin cleavage site we tested furin-mediated cleavage on a 4 peptide with a H5N1 HPAI cleavage motif in the presence of 500 nM SPINT2 (Supplementar

The `format_body` function produces the same result in a single call (unhide the output to check that it matches the text above):

In [12]:
print(format_body(file['body_text'])[:3000])

Introduction 9

Influenza-like illnesses (ILIs) represent a significant burden on public health and can be caused by 10 a range of respiratory viruses in addition to influenza virus itself (1). An ongoing goal of anti-viral drug 11 discovery is to develop broadly-acting therapeutics that can be used in the absence of definitive diagnosis, 12 such as in the case of ILIs. For such strategies to succeed, drug targets that are shared across virus families 13 need to be identified. 14 For the SPINT2 inhibition assays, trypsin (which typically resides in the intestinal tract and expresses a very 1 broad activity towards different HA subtypes and HMPV F) served as a control (41). In addition, furin was 2 used as a negative control that is not inhibited by SPINT2. As none of the peptides used in combination 3 with the aforementioned proteases has a furin cleavage site we tested furin-mediated cleavage on a 4 peptide with a H5N1 HPAI cleavage motif in the presence of 500 nM SPINT2 (Supplementar

## Biorxiv: Metadata

Let's first see what keys are contained in the `metadata` dictionary:

In [13]:
print(all_files[0]['metadata'].keys())

dict_keys(['title', 'authors'])


Let's take a look at each of the corresponding values:

In [14]:
print(all_files[0]['metadata']['title'])

SPINT2 inhibits proteases involved in activation of both influenza viruses and metapneumoviruses 1 2


In [15]:
authors = all_files[0]['metadata']['authors']
pprint(authors[:3])

[{'affiliation': {},
  'email': '',
  'first': 'Marco',
  'last': 'Straus',
  'middle': ['R'],
  'suffix': ''},
 {'affiliation': {'institution': 'University of Kentucky',
                  'laboratory': '',
                  'location': {'country': 'United 9 States',
                               'region': 'KY',
                               'settlement': 'Lexington'}},
  'email': '',
  'first': 'Jonathan',
  'last': 'Kinder',
  'middle': ['T'],
  'suffix': ''},
 {'affiliation': {},
  'email': '',
  'first': 'Michal',
  'last': 'Segall',
  'middle': [],
  'suffix': ''}]


The `format_name` and `format_affiliation` functions turn these dictionaries into readable strings:

In [16]:
for author in authors:
    print("Name:", format_name(author))
    print("Affiliation:", format_affiliation(author['affiliation']))
    print()

Name: Marco R Straus
Affiliation: 

Name: Jonathan T Kinder
Affiliation: University of Kentucky, Lexington, KY, United 9 States

Name: Michal Segall
Affiliation: 

Name: Rebecca Ellis Dutch
Affiliation: University of Kentucky, Lexington, KY, United 9 States

Name: Gary R Whittaker
Affiliation: 

Name: M R S 
Affiliation: 

Name: J T K 
Affiliation: 
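
Some names in the output above, such as `M R S `, end with a trailing space, which suggests that one of the name parts (most likely `last`) is an empty string. If that matters for your analysis, one option is to drop empty parts before joining. This is only a sketch: `format_name_compact` is a hypothetical variant, not one of the notebook's helpers, and the example author is an assumed reconstruction of such an entry.

```python
def format_name_compact(author):
    """Join first, middle and last names, skipping any empty part."""
    parts = [author.get("first", ""), *author.get("middle", []), author.get("last", "")]
    return " ".join(part for part in parts if part)


# Assumed shape of an entry whose last name is empty.
toy_author = {"first": "M", "middle": ["R", "S"], "last": "", "suffix": ""}

print(repr(format_name_compact(toy_author)))
```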



Now, let's take as an example a slightly longer list of authors:

In [17]:
pprint(all_files[4]['metadata'], depth=4)

{'authors': [{'affiliation': {'institution': 'Bioinformatics Unit (MF',
                              'laboratory': '',
                              'location': {}},
              'email': '',
              'first': 'Mathias',
              'last': 'Kuhring',
              'middle': [],
              'suffix': ''},
             {'affiliation': {'institution': '',
                              'laboratory': 'Centre for Biological Threats and '
                                            'Special Pathogens, Proteomics and '
                                            'Spectroscopy (ZBS',
                              'location': {}},
              'email': '',
              'first': 'Joerg',
              'last': 'Doellinger',
              'middle': [],
              'suffix': ''},
             {'affiliation': {'institution': 'Robert Koch Institute',
                              'laboratory': '',
                              'location': {...}},
              'email': '',
            

Here, I provide the function `format_authors`, which lets you format a list of authors into a single string, with the option of showing each author's affiliation:

In [18]:
authors = all_files[4]['metadata']['authors']
print("Formatting without affiliation:")
print(format_authors(authors, with_affiliation=False))
print("\nFormatting with affiliation:")
print(format_authors(authors, with_affiliation=True))

Formatting without affiliation:
Mathias Kuhring, Joerg Doellinger, Andreas Nitsche, Thilo Muth, Bernhard Y Renard

Formatting with affiliation:
Mathias Kuhring (Bioinformatics Unit (MF), Joerg Doellinger, Andreas Nitsche (Robert Koch Institute, Berlin, Germany), Thilo Muth (Bioinformatics Unit (MF), Bernhard Y Renard (Bioinformatics Unit (MF)


## Biorxiv: bibliography

Let's take a look at the bibliography section. 

In [19]:
bibs = list(file['bib_entries'].values())
pprint(bibs[:2], depth=4)

[{'authors': [{'first': 'J', 'last': 'Mckerrow', 'middle': [...], 'suffix': ''},
              {'first': 'A', 'last': 'Renslo', 'middle': [...], 'suffix': ''},
              {'first': 'G', 'last': 'Simmons', 'middle': [], 'suffix': ''}],
  'issn': '',
  'other_ids': {},
  'pages': '76--84',
  'ref_id': 'b1',
  'title': 'Protease inhibitors targeting coronavirus and 13 filovirus entry',
  'venue': 'Antiviral Res',
  'volume': '116',
  'year': 2015},
 {'authors': [{'first': 'S', 'last': 'Matsuyama', 'middle': [], 'suffix': ''},
              {'first': 'N', 'last': 'Nagata', 'middle': [], 'suffix': ''},
              {'first': 'K', 'last': 'Shirato', 'middle': [], 'suffix': ''},
              {'first': 'M', 'last': 'Kawase', 'middle': [], 'suffix': ''},
              {'first': 'M', 'last': 'Takeda', 'middle': [], 'suffix': ''},
              {'first': 'F', 'last': 'Taguchi', 'middle': [], 'suffix': ''}],
  'issn': '',
  'other_ids': {},
  'pages': '',
  'ref_id': 'b2',
  'title': 'Efficie

You can reuse the `format_authors` function here:

In [20]:
format_authors(bibs[1]['authors'], with_affiliation=False)

'S Matsuyama, N Nagata, K Shirato, M Kawase, M Takeda, F Taguchi'

The following function lets you format the whole bibliography at once. It extracts only the title, authors, venue, and year, and separates the entries of the bibliography with a `;`.

In [21]:
bib_formatted = format_bib(bibs[:5])
print(bib_formatted)

Protease inhibitors targeting coronavirus and 13 filovirus entry, J H Mckerrow, A R Renslo, G Simmons, Antiviral Res, 2015; Efficient Activation of 15 the Severe Acute Respiratory Syndrome Coronavirus Spike Protein by the Transmembrane 16, S Matsuyama, N Nagata, K Shirato, M Kawase, M Takeda, F Taguchi, , 2010
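
Notice the `, ,` in the second entry above: the venue is empty, but its slot is still emitted. If you would rather skip empty fields, a variant along the following lines could work. This is a sketch only: `format_bib_entry_compact` is a hypothetical name, it formats a single entry rather than a whole bibliography, and `toy_bib` is made-up data.

```python
def format_bib_entry_compact(bib):
    """Format one bibliography entry, skipping empty or missing fields."""
    author_names = []
    for author in bib.get("authors", []):
        parts = [author.get("first", ""), *author.get("middle", []), author.get("last", "")]
        author_names.append(" ".join(p for p in parts if p))

    fields = [bib.get("title"), ", ".join(author_names), bib.get("venue"), bib.get("year")]
    # Keep only fields that carry information.
    return ", ".join(str(f) for f in fields if f not in ("", None))


# Made-up entry with an empty venue.
toy_bib = {
    "title": "A toy title",
    "authors": [{"first": "S", "middle": [], "last": "Matsuyama", "suffix": ""}],
    "venue": "",
    "year": 2010,
}

print(format_bib_entry_compact(toy_bib))
```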


## Biorxiv: Generate CSV

In this section, I show you how to manually generate the CSV files. As you can see, it's now super simple because of the `format_` helper functions. In the next sections, I show you how to generate them in 3 lines using the `load_files` and `generate_clean_df` helper functions.

In [22]:
cleaned_files = []

for file in tqdm(all_files):
    features = [
        file['paper_id'],
        file['metadata']['title'],
        format_authors(file['metadata']['authors']),
        format_authors(file['metadata']['authors'], 
                       with_affiliation=True),
        format_body(file['abstract']),
        format_body(file['body_text']),
        format_bib(file['bib_entries']),
        file['metadata']['authors'],
        file['bib_entries']
    ]
    
    cleaned_files.append(features)

In [23]:
col_names = [
    'paper_id', 
    'title', 
    'authors',
    'affiliations', 
    'abstract', 
    'text', 
    'bibliography',
    'raw_authors',
    'raw_bibliography'
]

clean_df = pd.DataFrame(cleaned_files, columns=col_names)
clean_df.head()

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography
0,4602afcb8d95ebd9da583124384fd74299d20f5b,SPINT2 inhibits proteases involved in activati...,"Marco R Straus, Jonathan T Kinder, Michal Sega...","Marco R Straus, Jonathan T Kinder (University ...",Abstract\n\nViruses possessing class I fusion ...,Introduction 9\n\nInfluenza-like illnesses (IL...,Protease inhibitors targeting coronavirus and ...,"[{'first': 'Marco', 'middle': ['R'], 'last': '...","{'BIBREF1': {'ref_id': 'b1', 'title': 'Proteas..."
1,90b5ecf991032f3918ad43b252e17d1171b4ea63,The role of absolute humidity on transmission ...,"Wei Luo, Maimuna S Majumder, Diambo Liu, Canel...","Wei Luo (Boston Children's Hospital, 02215, Bo...",,"Introduction\n\nSince December 2019, an increa...",A novel coronavirus from patients with pneumon...,"[{'first': 'Wei', 'middle': [], 'last': 'Luo',...","{'BIBREF0': {'ref_id': 'b0', 'title': 'A novel..."
2,d3c2e2839498c613ee95739dce7052109750362c,Long-Term Persistence of IgG Antibodies in SAR...,"Xiaoqin Guo, Zhongmin Guo, Chaohui Duan, Zelia...","Xiaoqin Guo (Sun Yat-sen University, 510080, G...",Abstract\n\n23 BACKGROUND 24 The ongoing world...,\n\nCC-BY-ND 4.0 International license It is m...,"The severe acute respiratory syndrome, J S Pei...","[{'first': 'Xiaoqin', 'middle': [], 'last': 'G...","{'BIBREF0': {'ref_id': 'b0', 'title': 'The sev..."
3,da3aa20131ac2805c0d9e1b29f094683479ab5b7,Ruler elements in chromatin remodelers set nuc...,"Elisa Oberbeckmann, Vanessa Niebauer, Shinya W...",Elisa Oberbeckmann (Martinsried near to Munich...,Abstract\n\nArrays of regularly spaced nucleos...,\n\nArrays of regularly spaced nucleosomes dom...,The Snf2 homolog Fun30 acts as a homodimeric A...,"[{'first': 'Elisa', 'middle': [], 'last': 'Obe...","{'BIBREF0': {'ref_id': 'b0', 'title': 'The Snf..."
4,d2c7ac58cc309df9d3da8e766184d09c32d16dca,TaxIt: An iterative and automated computationa...,"Mathias Kuhring, Joerg Doellinger, Andreas Nit...","Mathias Kuhring (Bioinformatics Unit (MF), Joe...",Abstract\n\nUntargeted accurate strain-level c...,Introduction\n\nPathogenic strain identificati...,Vaccinia Virus Strain Differences in Cell Atta...,"[{'first': 'Mathias', 'middle': [], 'last': 'K...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Vaccini..."


In [24]:
clean_df.to_csv('biorxiv_clean.csv', index=False)
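
One thing to keep in mind when reusing the CSV: the `raw_authors` and `raw_bibliography` columns hold Python lists and dictionaries, which are written to the file as their string representation. When the file is read back, those cells are plain strings. A minimal sketch of how to recover the original objects with the standard library is shown below; `raw_cell` is a shortened, hand-written example of such a cell.

```python
import ast

# A cell as it looks after reading the CSV back: a string, not a list.
raw_cell = "[{'first': 'Marco', 'middle': ['R'], 'last': 'Straus', 'suffix': ''}]"

# literal_eval parses Python literals without executing arbitrary code.
authors = ast.literal_eval(raw_cell)

print(type(authors).__name__, authors[0]["last"])
```

With pandas, the same conversion can be applied to a whole column, for example `df['raw_authors'].apply(ast.literal_eval)`, where `df` is the DataFrame you loaded from the CSV.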

## PMC: Generate CSV

In [25]:
pmc_dir = '/kaggle/input/CORD-19-research-challenge/2020-03-13/pmc_custom_license/pmc_custom_license/'
pmc_files = load_files(pmc_dir)
pmc_df = generate_clean_df(pmc_files)
pmc_df.head()

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography
0,05326cc45fa2898c5850df85d30dad3d2c82acef,References 1. STI Study Group. Syphilis and go...,"J P Doris, K Saha, N P Jones","J P Doris, K Saha, N P Jones",,\n\nA quatic wild birds are the natural reserv...,"Evolution and ecology of influenza A viruses, ...","[{'first': 'J', 'middle': ['P'], 'last': 'Dori...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Evoluti..."
1,4c7759f1da349496e9308d9f4f0d4e071a23c19f,,,,,\n\nP andemic preparedness ideally would inclu...,Avian influenza: assessing the pandemic threat...,[],"{'BIBREF0': {'ref_id': 'b0', 'title': 'Avian i..."
2,4fc3b508543068a9d54f11f0ecf52580df768307,,"Carol Y Rao, Grace W Goryoka, Olga L Henao, Ke...","Carol Y Rao (Emory University, Atlanta (G.W. G...",Abstract\n\nThe Centers for Disease Control an...,\n\nI nfectious disease outbreaks present a se...,US Centers for Disease Control and Prevention ...,"[{'first': 'Carol', 'middle': ['Y'], 'last': '...","{'BIBREF0': {'ref_id': 'b0', 'title': 'US Cent..."
3,867e1b0f6ca8757f2a32a625d99b23888ab40d49,"Comparisons of substitution, insertion and del...","Mazen W Karaman, Susan Groshen, Chi-Chiang Lee...","Mazen W Karaman, Susan Groshen (University of ...",Abstract\n\nAlthough oligonucleotide probes co...,INTRODUCTION\n\nOligonucleotide microarrays ar...,Sequence variation in genes and genomic DNA: m...,"[{'first': 'Mazen', 'middle': ['W'], 'last': '...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Sequenc..."
4,713670f3bba541ae77e6ac7bb32c9ea27c01e19d,Transmission of Middle East Respiratory Syndro...,"Jennifer C Hunter, Duc Nguyen, Bashir Aden, Zy...","Jennifer C Hunter (1600 Clifton Road NE, Mails...",Abstract\n\nMiddle East respiratory syndrome c...,M iddle East respiratory syndrome coronavirus ...,Middle East respiratory syndrome coronavirus i...,"[{'first': 'Jennifer', 'middle': ['C'], 'last'...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Middle ..."


In [26]:
pmc_df.to_csv('clean_pmc.csv', index=False)

## Commercial Use: Generate CSV

In [27]:
comm_dir = '/kaggle/input/CORD-19-research-challenge/2020-03-13/comm_use_subset/comm_use_subset/'
comm_files = load_files(comm_dir)
comm_df = generate_clean_df(comm_files)
comm_df.head()

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography
0,25621281691205eb015383cbac839182b838514f,SMARCA2-regulated host cell factors are requir...,"Dominik Dornfeld, Alexandra H Dudek, Thibaut V...","Dominik Dornfeld (University of Freiburg, 7910...",Abstract\n\nThe human interferon (IFN)-induced...,\n\nInfluenza A viruses (IAV) are severe human...,"Influenza virus evolution, host adaptation, an...","[{'first': 'Dominik', 'middle': [], 'last': 'D...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Influen..."
1,7db22f7f81977109d493a0edf8ed75562648e839,Recombinant Scorpine Produced Using SUMO Fusio...,"Chao Zhang, Xinlong He, Yaping Gu, Huayun Zhou...",Chao Zhang (Medical College of Soochow Univers...,"Abstract\n\nScorpine, a small cationic peptide...",Introduction\n\nThe oldest known scorpions liv...,Oxygen transport proteins: I. Structure and or...,"[{'first': 'Chao', 'middle': [], 'last': 'Zhan...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Oxygen ..."
2,a137eb51461b4a4ed3980aa5b9cb2f2c1cf0292a,The effect of inhibition of PP1 and TNFα signa...,"Jason E Mcdermott, Hugh D Mitchell, Lisa E Gra...","Jason E Mcdermott, Hugh D Mitchell, Lisa E Gra...",Abstract\n\nBackground: The complex interplay ...,Background\n\nThe emergence of Severe Acute Re...,Isolation of a novel coronavirus from a man wi...,"[{'first': 'Jason', 'middle': ['E'], 'last': '...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Isolati..."
3,6c3e1a43f0e199876d4bd9ff787e1911fd5cfaa6,Review Article Microbial Agents as Putative In...,"Rossella Talotta, Piercarlo Sarzi-Puttini, Fab...","Rossella Talotta (University of Milan, Milan, ...",,Introduction\n\nSjögren's syndrome (SS) is a c...,Reviewing primary Sjögren's syndrome: beyond t...,"[{'first': 'Rossella', 'middle': [], 'last': '...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Reviewi..."
4,2ce201c2ba233a562ee605a9aa12d2719cfa2beb,A cluster of adenovirus type B55 infection in ...,"Lina Yi, Lirong Zou, Jing Lu, Min Kang, Yingch...","Lina Yi, Lirong Zou, Jing Lu, Min Kang, Yingch...",Abstract\n\nBackground: Human adenovirus type ...,| INTRODUCTION\n\nHuman adenovirus (HAdV) is a...,Clinical features and treatment of adenovirus ...,"[{'first': 'Lina', 'middle': [], 'last': 'Yi',...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Clinica..."


In [28]:
comm_df.to_csv('clean_comm_use.csv', index=False)

## Non-commercial Use: Generate CSV

In [29]:
noncomm_dir = '/kaggle/input/CORD-19-research-challenge/2020-03-13/noncomm_use_subset/noncomm_use_subset/'
noncomm_files = load_files(noncomm_dir)
noncomm_df = generate_clean_df(noncomm_files)
noncomm_df.head()

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography
0,bbfbefd19c9c471dbacabd1f57124d1ad36f24f5,Selection of Reference Genes for Gene Expressi...,"Jiying Wang, Yanping Wang, Huaizhong Wang, Xia...","Jiying Wang, Yanping Wang, Huaizhong Wang, Xia...",Abstract\n\nInvestigating gene expression of i...,Quantitative\n\nreal-time PCR (qRT-PCR) can si...,Normalization of real-time quantitative revers...,"[{'first': 'Jiying', 'middle': [], 'last': 'Wa...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Normali..."
1,c84aa58b12823cdd6f119b4da3b0c4ba488856b5,,,,,\n\nlogical studies can be used to control dis...,Digital methods in epidemiology can transform ...,[],"{'BIBREF0': {'ref_id': 'b0', 'title': 'Digital..."
2,32b8b12a0247b3413442631d5a1bee05a4e5d689,H O W D O I ? How do we … integrate pathogen r...,"Sara Rutter, Edward L Snyder","Sara Rutter, Edward L Snyder",Abstract\n\nABBREVIATIONS: ARDS = acute respir...,\n\nFor more than 50 years there has been an o...,Cerus Corporation. FDA approves pathogen reduc...,"[{'first': 'Sara', 'middle': [], 'last': 'Rutt...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Cerus C..."
3,52069d14f038d493dce5d6cc1fdcdc7c1f0823f9,,,,,\n\nCanine leishmaniasis is an important infec...,"Veterinary Pathology, which is partially funde...",[],"{'BIBREF0': {'ref_id': 'b0', 'title': 'Veterin..."
4,1f7394c2575f1eb1a196c9f19cd4b57ddd05789b,Quantifying Pathogen Surveillance Using Tempor...,"Joseph M Chan, Raul Rabadan","Joseph M Chan, Raul Rabadan (Columbia Universi...",Abstract\n\nWith the advent of deep sequencing...,\n\nD espite many therapeutic and epidemiologi...,The world health Report 2004 -changing history...,"[{'first': 'Joseph', 'middle': ['M'], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'The wor..."


In [30]:
noncomm_df.to_csv('clean_noncomm_use.csv', index=False)