## Process commercial published papers

**Name: Vidhi Gupta (vg5vc)**

Exploration and preprocessing that combine the metadata and the JSON files into one dataframe and export it as .csv.

Adjusted to data from 2020-04-17.

# Load Packages

In [1]:
import numpy as np 
import pandas as pd

import glob
import json

# Load and Prepare Data

To read the JSON files we follow [COVID EDA: Initial Exploration Tool](https://www.kaggle.com/ivanegapratama/covid-eda-initial-exploration-tool).

In [2]:
root_path = '../dataset/CORD-19-research-challenge/'
meta_df = pd.read_csv(root_path+'metadata.csv', dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head(2)

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url
0,xqhn0vbp,1e1286db212100993d03cc22374b624f7caee956,PMC,Airborne rhinovirus detection and effect of ul...,10.1186/1471-2458-3-5,PMC140314,12525263,no-cc,"BACKGROUND: Rhinovirus, the most common cause ...",2003-01-13,"Myatt, Theodore A; Johnston, Sebastian L; Rudn...",BMC Public Health,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
1,gi6uaa83,8ae137c8da1607b3a8e4c946c07ca8bda67f88ac,PMC,Discovering human history from stomach bacteria,10.1186/gb-2003-4-5-213,PMC156578,12734001,no-cc,Recent analyses of human pathogens have reveal...,2003-04-28,"Disotell, Todd R",Genome Biol,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...


In [3]:
# recursive=True has no effect without a '**' pattern, so it is omitted
all_json = glob.glob(f'{root_path}comm_use_subset/comm_use_subset/pdf_json/*.json')
len(all_json)

9557

In [4]:
# recursive=True has no effect without a '**' pattern, so it is omitted
all_json_pmc = glob.glob(f'{root_path}comm_use_subset/comm_use_subset/pmc_json/*.json')
len(all_json_pmc)

9189

# Commercial use pdf_json

In [5]:
methods = ['methods','method','statistical methods','materials','materials and methods',
                'data collection','the study','study design','experimental design','objective',
                'objectives','procedures','data collection and analysis', 'methodology',
                'material and methods','the model','experimental procedures','main text']

In [6]:
# [''.join(x.lower() for x in m if x.isalpha()) for m in methods]

# for m in methods:
#     print(''.join(x.lower() for x in m if x.isalpha()))

In [7]:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.abstract = []
            self.body_text = []
            self.methods = []
            self.results = []

            # Abstract
            for entry in content['abstract']:
                self.abstract.append(entry['text'])
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            # Methods
            methods = ['methods','method','statistical methods','materials','materials and methods',
                'data collection','the study','study design','experimental design','objective',
                'objectives','procedures','data collection and analysis', 'methodology',
                'material and methods','the model','experimental procedures','main text']
            # Normalize the keywords once: lowercase, letters only
            method_keys = [''.join(x.lower() for x in m if x.isalpha()) for m in methods]
            for entry in content['body_text']:
                section_title = ''.join(x.lower() for x in entry['section'] if x.isalpha())  # remove numbers and spaces
                if any(key in section_title for key in method_keys):
                    self.methods.append(entry['text'])
            # Results
            results_synonyms = ['result']
            for entry in content['body_text']:
                section_title = ''.join(x.lower() for x in entry['section'] if x.isalpha())
                if any(r in section_title for r in results_synonyms) :
                    self.results.append(entry['text'])
                    
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
            self.methods = '\n'.join(self.methods)
            self.results = '\n'.join(self.results)

    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'


first_row = FileReader(all_json[0])
print(first_row)

5e0c586f047ff909c8ed3fe171c8975a90608d08: Background: Porcine epidemic diarrhea virus (PEDV) is emerging as a pathogenic coronavirus that causes a huge economic burden to the swine industry. Interaction of the viral spike (S) surface glycopro... Porcine epidemic diarrhea virus (PEDV), which belongs to the Alphacoronavirus genus of the Coronaviridae family, is an etiological agent of porcine epidemic diarrhea (PED) and causes an enteric diseas...
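
To make the matching rule in `FileReader` concrete, the following minimal sketch applies the same normalization (lowercase, letters only) to a few made-up section titles. The helper names `normalize` and `matches_any`, the shortened keyword list and the sample titles are assumptions for illustration, not part of the notebook.

```python
def normalize(title):
    """Lowercase a section title and keep only alphabetic characters."""
    return ''.join(ch.lower() for ch in title if ch.isalpha())


def matches_any(title, keywords):
    """Return True if any normalized keyword is a substring of the normalized title."""
    normalized = normalize(title)
    return any(normalize(k) in normalized for k in keywords)


# Shortened keyword list, for illustration only.
keywords = ['methods', 'materials and methods', 'study design']

print(normalize('2.1 Materials and Methods'))               # materialsandmethods
print(matches_any('2.1 Materials and Methods', keywords))   # True
print(matches_any('Introduction', keywords))                # False
print(matches_any('Results', ['result']))                   # True
```

Because this is substring matching, a short keyword also matches any longer title that merely contains it, so the extracted methods and results are best treated as approximate.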


In [8]:
dict_ = {'paper_id': [], 'abstract': [], 'body_text': [], 'methods': [], 'results': []}
for idx, entry in enumerate(all_json):
    if idx % max(1, len(all_json) // 10) == 0:  # max(1, ...) avoids division by zero for fewer than 10 files
        print(f'Processing index: {idx} of {len(all_json)}')
    content = FileReader(entry)
    dict_['paper_id'].append(content.paper_id)
    dict_['abstract'].append(content.abstract)
    dict_['body_text'].append(content.body_text)
    dict_['methods'].append(content.methods)
    dict_['results'].append(content.results)

Processing index: 0 of 9557
Processing index: 955 of 9557
Processing index: 1910 of 9557
Processing index: 2865 of 9557
Processing index: 3820 of 9557
Processing index: 4775 of 9557
Processing index: 5730 of 9557
Processing index: 6685 of 9557
Processing index: 7640 of 9557
Processing index: 8595 of 9557
Processing index: 9550 of 9557


In [9]:
papers = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text', 'methods', 'results'])
papers.head()

Unnamed: 0,paper_id,abstract,body_text,methods,results
0,5e0c586f047ff909c8ed3fe171c8975a90608d08,Background: Porcine epidemic diarrhea virus (P...,"Porcine epidemic diarrhea virus (PEDV), which ...",,
1,1579fbff7af9b156c6f49fee0526e48f852ea460,"Currently, live-attenuated IBV vaccines are us...","Generation of rNDVs expressing S1, S2 or S pro...",,"Generation of rNDVs expressing S1, S2 or S pro..."
2,e0668c4b793d0cad26639b070819334a94648123,,The incidence of complete Achilles tendon rupt...,,
3,38aa050ad79d8a1d7022c33535255ce9d47914e5,The new world arenavirus Junín virus (JUNV) is...,Arenaviruses are enveloped RNA viruses with bi...,,
4,61722c462b054f36461375e96e502cbf22648c04,and subtropical countries and is a significant...,"In this study, the anti-dengue activity of nic...",,


In [10]:
papers[(papers.results.str.len() != 0) | (papers.methods.str.len() != 0)].shape

(4606, 5)

In [11]:
df = pd.merge(papers, meta_df, left_on='paper_id', right_on='sha', how='left').drop('sha', axis=1)

In [12]:
df.columns

Index(['paper_id', 'abstract_x', 'body_text', 'methods', 'results', 'cord_uid',
       'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license',
       'abstract_y', 'publish_time', 'authors', 'journal',
       'Microsoft Academic Paper ID', 'WHO #Covidence', 'has_pdf_parse',
       'has_pmc_xml_parse', 'full_text_file', 'url'],
      dtype='object')
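
The `_x` and `_y` suffixes in the column list come from `pd.merge`: both frames have an `abstract` column, so pandas renames them. The toy sketch below reproduces this with invented data (the frames `papers_toy` and `meta_toy` are assumptions) and adds `indicator=True`, which the notebook does not use, to count papers without a metadata match.

```python
import pandas as pd

# Toy stand-ins for `papers` and `meta_df`; all values are invented.
papers_toy = pd.DataFrame({
    'paper_id': ['a1', 'b2', 'c3'],
    'abstract': ['from json A', '', 'from json C'],
})
meta_toy = pd.DataFrame({
    'sha': ['a1', 'b2'],
    'abstract': ['from metadata A', 'from metadata B'],
    'title': ['Paper A', 'Paper B'],
})

merged = pd.merge(papers_toy, meta_toy, left_on='paper_id', right_on='sha',
                  how='left', indicator=True).drop('sha', axis=1)

# Overlapping column names get the suffixes _x (left) and _y (right).
print(list(merged.columns))
# Rows found only in the left frame have no metadata match.
print((merged['_merge'] == 'left_only').sum())
```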

# Commercial use pmc_json

The pmc_json files contain only the full text, no abstracts!

In [13]:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.body_text = []
            self.methods = []
            self.results = []

            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            # Methods
            methods = ['methods','method','statistical methods','materials','materials and methods',
                'data collection','the study','study design','experimental design','objective',
                'objectives','procedures','data collection and analysis', 'methodology',
                'material and methods','the model','experimental procedures','main text']
            # Normalize the keywords once: lowercase, letters only
            method_keys = [''.join(x.lower() for x in m if x.isalpha()) for m in methods]
            for entry in content['body_text']:
                section_title = ''.join(x.lower() for x in entry['section'] if x.isalpha())  # remove numbers and spaces
                if any(key in section_title for key in method_keys):
                    self.methods.append(entry['text'])
            # Results
            results_synonyms = ['result']
            for entry in content['body_text']:
                section_title = ''.join(x.lower() for x in entry['section'] if x.isalpha())
                if any(r in section_title for r in results_synonyms) :
                    self.results.append(entry['text'])
                    
            self.body_text = '\n'.join(self.body_text)
            self.methods = '\n'.join(self.methods)
            self.results = '\n'.join(self.results)

    def __repr__(self):
        return f'{self.paper_id}: {self.body_text[:200]}...'


first_row = FileReader(all_json_pmc[0])
print(first_row)

PMC5396510: Gold nanoparticles (GNPs) have attracted significant interest as a novel platform in nanobiotechnology and biomedicine because of their convenient surface bioconjugation with molecular probes1 and the...


In [14]:
dict_ = {'paper_id': [], 'body_text': [], 'methods': [], 'results': []}
for idx, entry in enumerate(all_json_pmc):
    if idx % max(1, len(all_json_pmc) // 10) == 0:  # max(1, ...) avoids division by zero for fewer than 10 files
        print(f'Processing index: {idx} of {len(all_json_pmc)}')
    content = FileReader(entry)
    dict_['paper_id'].append(content.paper_id)
    dict_['body_text'].append(content.body_text)
    dict_['methods'].append(content.methods)
    dict_['results'].append(content.results)

Processing index: 0 of 9189
Processing index: 918 of 9189
Processing index: 1836 of 9189
Processing index: 2754 of 9189
Processing index: 3672 of 9189
Processing index: 4590 of 9189
Processing index: 5508 of 9189
Processing index: 6426 of 9189
Processing index: 7344 of 9189
Processing index: 8262 of 9189
Processing index: 9180 of 9189


In [15]:
pmc_text = pd.DataFrame(dict_, columns=['paper_id', 'body_text', 'methods', 'results'])
pmc_text.head()

Unnamed: 0,paper_id,body_text,methods,results
0,PMC5396510,Gold nanoparticles (GNPs) have attracted signi...,,
1,PMC5678581,More than 300 novel RNA-binding proteins have ...,EMSAs were performed with an end-labeled pre-l...,TRIM25 is known to associate with RNAs in cell...
2,PMC3795028,Post-translational modification of proteins by...,"DL-Dithiothreitol (DTT, D0632), N-Ethylmaleimi...",The Akata-Bx1 cell line was used to study the ...
3,PMC2860491,"Chikungunya virus (CHIKV), a mosquito-borne pa...",,CHIKV undertakes a complex replication cycle u...
4,PMC3256159,Viruses have evolved as genome packaging machi...,Zaire Ebolavirus was propagated in Vero E6 cel...,We identified filamentous EBOV particles 20 mi...


In [16]:
pmc_text.shape

(9189, 4)

Careful, some of the new texts are empty strings!

In [17]:
pmc_text[pmc_text.body_text == '']

Unnamed: 0,paper_id,body_text,methods,results
144,PMC7088756,,,
180,PMC7078228,,,
200,PMC7108780,,,
687,PMC5854734,,,
939,PMC7089303,,,
...,...,...,...,...
8804,PMC3570808,,,
8819,PMC6258726,,,
8943,PMC7089429,,,
9044,PMC7048180,,,


In [18]:
pmc_text = pmc_text[pmc_text.body_text != '']

In [19]:
pmc_text.shape

(9118, 4)

In [20]:
df.head()

Unnamed: 0,paper_id,abstract_x,body_text,methods,results,cord_uid,source_x,title,doi,pmcid,...,abstract_y,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url
0,5e0c586f047ff909c8ed3fe171c8975a90608d08,Background: Porcine epidemic diarrhea virus (P...,"Porcine epidemic diarrhea virus (PEDV), which ...",,,25tcxise,PMC,Neutralizing antibodies against porcine epidem...,10.1186/s12985-018-1042-3,PMC6117962,...,BACKGROUND: Porcine epidemic diarrhea virus (P...,2018-08-30,"Gong, Lang; Lin, Ying; Qin, Jianru; Li, Qianni...",Virol J,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
1,1579fbff7af9b156c6f49fee0526e48f852ea460,"Currently, live-attenuated IBV vaccines are us...","Generation of rNDVs expressing S1, S2 or S pro...",,"Generation of rNDVs expressing S1, S2 or S pro...",o7y3wygc,PMC,A Recombinant Newcastle Disease Virus (NDV) Ex...,10.1038/s41598-018-30356-2,PMC6086832,...,Infectious bronchitis virus (IBV) causes a hig...,2018-08-10,"Shirvani, Edris; Paldurai, Anandan; Manoharan,...",Sci Rep,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
2,e0668c4b793d0cad26639b070819334a94648123,,The incidence of complete Achilles tendon rupt...,,,6h34gp3x,PMC,‘Hajj: what it means for general practice’,10.3399/bjgpopen18x101493,PMC6184103,...,,2018-04-18,"Mughal, Faraz; Chew-Graham, Carolyn A; Saad, A...",BJGP Open,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
3,38aa050ad79d8a1d7022c33535255ce9d47914e5,The new world arenavirus Junín virus (JUNV) is...,Arenaviruses are enveloped RNA viruses with bi...,,,7smnk0ob,PMC,Potent Inhibition of Junín Virus Infection by ...,10.1371/journal.pntd.0002933,PMC4046933,...,The new world arenavirus Junín virus (JUNV) is...,2014-06-05,"Huang, Cheng; Walker, Aida G.; Grant, Ashley M...",PLoS Negl Trop Dis,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
4,61722c462b054f36461375e96e502cbf22648c04,and subtropical countries and is a significant...,"In this study, the anti-dengue activity of nic...",,,wmfwl2bh,PMC,Neutralization of Acidic Intracellular Vesicle...,10.1038/s41598-019-45095-1,PMC6582152,...,Dengue fever is one of the most important mosq...,2019-06-18,"Jung, Eunhye; Nam, Sangwoo; Oh, Hyeryeon; Jun,...",Sci Rep,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...


In [21]:
df = pd.merge(df, pmc_text, left_on='pmcid', right_on='paper_id', how='left')

In [22]:
df.head(3)

Unnamed: 0,paper_id_x,abstract_x,body_text_x,methods_x,results_x,cord_uid,source_x,title,doi,pmcid,...,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url,paper_id_y,body_text_y,methods_y,results_y
0,5e0c586f047ff909c8ed3fe171c8975a90608d08,Background: Porcine epidemic diarrhea virus (P...,"Porcine epidemic diarrhea virus (PEDV), which ...",,,25tcxise,PMC,Neutralizing antibodies against porcine epidem...,10.1186/s12985-018-1042-3,PMC6117962,...,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6117962,"Porcine epidemic diarrhea virus (PEDV), which ...",The Vero E6 cell line was cultured and maintai...,PEDV concentration was enriched 100 fold and p...
1,1579fbff7af9b156c6f49fee0526e48f852ea460,"Currently, live-attenuated IBV vaccines are us...","Generation of rNDVs expressing S1, S2 or S pro...",,"Generation of rNDVs expressing S1, S2 or S pro...",o7y3wygc,PMC,A Recombinant Newcastle Disease Virus (NDV) Ex...,10.1038/s41598-018-30356-2,PMC6086832,...,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6086832,Infectious Bronchitis (IB) is an acute and hig...,Chicken embryo fibroblast (DF-1) cells and hum...,The expression cassettes containing the codon ...
2,e0668c4b793d0cad26639b070819334a94648123,,The incidence of complete Achilles tendon rupt...,,,6h34gp3x,PMC,‘Hajj: what it means for general practice’,10.3399/bjgpopen18x101493,PMC6184103,...,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6184103,Hajj is the fifth pillar of Islam and is descr...,,


In [23]:
df.columns

Index(['paper_id_x', 'abstract_x', 'body_text_x', 'methods_x', 'results_x',
       'cord_uid', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license',
       'abstract_y', 'publish_time', 'authors', 'journal',
       'Microsoft Academic Paper ID', 'WHO #Covidence', 'has_pdf_parse',
       'has_pmc_xml_parse', 'full_text_file', 'url', 'paper_id_y',
       'body_text_y', 'methods_y', 'results_y'],
      dtype='object')

In [24]:
#df.drop(columns=['has_pdf_parse', 'has_pmc_xml_parse', 'Microsoft Academic Paper ID', 'WHO #Covidence'], inplace=True)

In [25]:
df.head(2)

Unnamed: 0,paper_id_x,abstract_x,body_text_x,methods_x,results_x,cord_uid,source_x,title,doi,pmcid,...,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url,paper_id_y,body_text_y,methods_y,results_y
0,5e0c586f047ff909c8ed3fe171c8975a90608d08,Background: Porcine epidemic diarrhea virus (P...,"Porcine epidemic diarrhea virus (PEDV), which ...",,,25tcxise,PMC,Neutralizing antibodies against porcine epidem...,10.1186/s12985-018-1042-3,PMC6117962,...,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6117962,"Porcine epidemic diarrhea virus (PEDV), which ...",The Vero E6 cell line was cultured and maintai...,PEDV concentration was enriched 100 fold and p...
1,1579fbff7af9b156c6f49fee0526e48f852ea460,"Currently, live-attenuated IBV vaccines are us...","Generation of rNDVs expressing S1, S2 or S pro...",,"Generation of rNDVs expressing S1, S2 or S pro...",o7y3wygc,PMC,A Recombinant Newcastle Disease Virus (NDV) Ex...,10.1038/s41598-018-30356-2,PMC6086832,...,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6086832,Infectious Bronchitis (IB) is an acute and hig...,Chicken embryo fibroblast (DF-1) cells and hum...,The expression cassettes containing the codon ...


# Exploration/Cleaning

### Different Abstract in Metadata and JSON files

abstract_x comes from the JSON files, abstract_y from the metadata.

In [26]:
df[df.abstract_x != df.abstract_y].shape

(7983, 26)

In [27]:
df[df.abstract_x != df.abstract_y][['abstract_x', 'abstract_y', 'url']].tail(10)

Unnamed: 0,abstract_x,abstract_y,url
9545,Background: Anatid herpesvirus 1 (AHV-1) is an...,BACKGROUND: Anatid herpesvirus 1 (AHV-1) is an...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...
9546,Coronaviruses (CoVs) are positive-sense single...,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...
9547,Viruses in avian hosts can pose threats to avi...,Viruses in avian hosts can pose threats to avi...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...
9548,Over 2.5 billion people are exposed to the ris...,BACKGROUND: Over 2.5 billion people are expose...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
9549,Plasmacytoid dendritic cell (pDC)-mediated pro...,Plasmacytoid dendritic cell (pDC)-mediated pro...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...
9551,"Background: Canine distemper, caused by Canine...","BACKGROUND: Canine distemper, caused by Canine...",https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...
9552,We hypothesized that postnatal development of ...,We hypothesized that postnatal development of ...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
9553,"Venereal syphilis is a multi-stage, sexually t...","Venereal syphilis is a multi-stage, sexually t...",https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...
9554,The emergence of Middle East respiratory syndr...,The emergence of Middle East respiratory syndr...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
9556,"Virus host shifts occur frequently, but the wh...","Virus host shifts occur frequently, but the wh...",https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
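
A plain `!=` counts a row as different when either side is NaN and also when the texts differ only in case (for example `Background:` versus `BACKGROUND:` in the output above). As a rough sketch, a normalized comparison can separate these cases from real differences. The helper `normalize_text` and the toy data are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def normalize_text(series):
    """Map missing values to '', lowercase, and collapse whitespace."""
    return (series.fillna('')
                  .str.lower()
                  .str.replace(r'\s+', ' ', regex=True)
                  .str.strip())


toy = pd.DataFrame({
    'abstract_x': ['Background: A virus', 'Same text', 'Only in JSON', ''],
    'abstract_y': ['BACKGROUND: A virus', 'Same text', np.nan, np.nan],
})

# Raw comparison: case differences and NaN both count as "different".
raw_diff = (toy.abstract_x != toy.abstract_y).sum()
# Normalized comparison: only rows with different content remain.
norm_diff = (normalize_text(toy.abstract_x) != normalize_text(toy.abstract_y)).sum()
print(raw_diff, norm_diff)  # 3 1
```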


In [28]:
df[df.abstract_x != df.abstract_y][['abstract_x', 'abstract_y', 'url']].url.iloc[-1]

'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6920936/'

Checking some of the files online, it seems that where the abstract is missing in the metadata, the abstract in the JSON file is simply the beginning of the text.

In [29]:
# Build one aligned mask instead of chaining filters with a mask from the full frame
mask = (df.abstract_x != df.abstract_y) & df.abstract_y.isnull() & (df.abstract_x != '') & df.url.notnull()
df.loc[mask, ['abstract_x', 'abstract_y', 'url', 'body_text_x', 'body_text_y']]

  


Unnamed: 0,abstract_x,abstract_y,url,body_text_x,body_text_y
59,The Drosophila genome encodes 18 canonical nuc...,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,Nuclear receptors constitute a protein superfa...,Epidemiologists have used mathematical models ...
393,Emerging and reemerging infectious diseases (E...,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,"R ecent events have shown that public health, ...","Recent events have shown that public health, a..."
546,"Epidemiological research on the pathogenesis, ...",,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,"Epidemiological research on the pathogenesis, ...","Epidemiological research on the pathogenesis, ..."
696,Human fungal diseases differ fundamentally fro...,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,of all ages suffer from a serious fungal infec...,Human fungal diseases differ fundamentally fro...
874,Essays articulate a specific perspective on a ...,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,"Within the past century, a number of ''emergin...",The human genome is a living document of ancie...
1062,Perspectives\nGlobal atlas of zoonotic viruses...,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,At the Prince Mahidol Awards Conference on 30 ...,Novel viruses usually emerge in regions where ...
1086,Antibody-dependent enhancement (ADE) is a mech...,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,Antibody-dependent enhancement (ADE) is a mech...,Antibody-dependent enhancement (ADE) is a mech...
1307,Please cite this paper as: Chiu et al. (2014) ...,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,To the editor::\nRespiratory viral infections ...,Please cite this paper as: Chiu et al. (2014) ...
1339,Data management and integration are complicate...,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,There is a growing awareness of the need to wo...,Epidemiologists have long used maps to track t...
1784,A number of sentences in the first paragraph o...,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,Aromatase is a cytochrome P-450 dependent enzy...,Aromatase is a cytochrome P-450 dependent enzy...


body_text_x comes from the pdf_json files, body_text_y from the pmc_json files.

In [30]:
df.shape

(9557, 26)

In [31]:
df.abstract_x.isnull().sum(), (df.abstract_x =='').sum() # missing abstracts in json files

(0, 1016)

In [32]:
df.abstract_y.isnull().sum(), (df.abstract_y=='').sum() # missing abstracts in metadata

(1107, 0)

The abstracts from the metadata seem more reliable, so we use them by default and fill the missing values with the abstract extracted from the JSON file.

In [33]:
# fill missing metadata abstracts with non-empty JSON abstracts
mask = df.abstract_y.isnull() & (df.abstract_x != '')
df.loc[mask, 'abstract_y'] = df.loc[mask, 'abstract_x']

In [34]:
df.abstract_y.isnull().sum()

539

The remaining missing values are also empty in the JSON files.

In [35]:
(df.abstract_y.isnull() & (df.abstract_x!='')).sum()

0
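
The same fill could also be written with `Series.fillna`, after turning empty JSON abstracts into NaN so that they never count as a valid fallback. This is a sketch on invented toy data under that assumption, not the notebook's own code.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'abstract_x': ['json A', 'json B', ''],      # extracted from the JSON files
    'abstract_y': ['meta A', np.nan, np.nan],    # taken from the metadata
})

# Empty JSON abstracts become NaN, so they cannot fill a missing value with ''.
fallback = toy['abstract_x'].mask(toy['abstract_x'] == '')
toy['abstract'] = toy['abstract_y'].fillna(fallback)

print(toy['abstract'].tolist())
print(toy['abstract'].isnull().sum())  # 1
```

Rows that are missing in both sources stay NaN, which matches the check above.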

In [36]:
df.rename(columns = {'abstract_y': 'abstract'}, inplace=True)
df.drop('abstract_x', axis=1, inplace=True)

In [37]:
df.columns

Index(['paper_id_x', 'body_text_x', 'methods_x', 'results_x', 'cord_uid',
       'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license', 'abstract',
       'publish_time', 'authors', 'journal', 'Microsoft Academic Paper ID',
       'WHO #Covidence', 'has_pdf_parse', 'has_pmc_xml_parse',
       'full_text_file', 'url', 'paper_id_y', 'body_text_y', 'methods_y',
       'results_y'],
      dtype='object')

We still have to compare the text body from pdf and pmc files.

In [38]:
df.shape

(9557, 25)

# Quick comparison of both texts

In [39]:
df.shape

(9557, 25)

In [40]:
(df.body_text_x != df.body_text_y).sum()

9557

In [41]:
df[(df.body_text_x != df.body_text_y) & df.body_text_y.notnull()][['body_text_x', 'body_text_y']].head(10).iloc[2].values[0][:500]

'The incidence of complete Achilles tendon rupture is 18 per 100 000 patient-years 1 and is usually diagnosed clinically by GPs. The extent of clinical misdiagnosis is unknown in Norway, but may be high. 2 This is important as delayed treatment has unfavourable consequences. 1, 3 We report how a GP, with no clinical ultrasound experience, recorded images with a pocket-sized ultrasound device (PSUD) under supervision to confirm a complete Achilles tendon rupture. This could present a new indicatio'

In [42]:
df[(df.body_text_x != df.body_text_y) & df.body_text_y.notnull()][['body_text_x', 'body_text_y']].head(10).iloc[2].values[1][:500]

"Hajj is the fifth pillar of Islam and is described in the Quran, as Almighty God says: 'Pilgrimage to this House is an obligation by God upon whoever is able among the people' (3:97).3 It is obligatory for every Muslim adult with mental capacity to perform the Hajj once in a lifetime, if reasonably able to do so without excessive hardship. The pilgrimage lasts 5 days, although pilgrims usually travel for longer. The Hajj occurs 10 days earlier each year (adhering to the lunar calendar), and in 2"

In [43]:
df[df.body_text_x != df.body_text_y].head()

Unnamed: 0,paper_id_x,body_text_x,methods_x,results_x,cord_uid,source_x,title,doi,pmcid,pubmed_id,...,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url,paper_id_y,body_text_y,methods_y,results_y
0,5e0c586f047ff909c8ed3fe171c8975a90608d08,"Porcine epidemic diarrhea virus (PEDV), which ...",,,25tcxise,PMC,Neutralizing antibodies against porcine epidem...,10.1186/s12985-018-1042-3,PMC6117962,30165871,...,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6117962,"Porcine epidemic diarrhea virus (PEDV), which ...",The Vero E6 cell line was cultured and maintai...,PEDV concentration was enriched 100 fold and p...
1,1579fbff7af9b156c6f49fee0526e48f852ea460,"Generation of rNDVs expressing S1, S2 or S pro...",,"Generation of rNDVs expressing S1, S2 or S pro...",o7y3wygc,PMC,A Recombinant Newcastle Disease Virus (NDV) Ex...,10.1038/s41598-018-30356-2,PMC6086832,30097608,...,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6086832,Infectious Bronchitis (IB) is an acute and hig...,Chicken embryo fibroblast (DF-1) cells and hum...,The expression cassettes containing the codon ...
2,e0668c4b793d0cad26639b070819334a94648123,The incidence of complete Achilles tendon rupt...,,,6h34gp3x,PMC,‘Hajj: what it means for general practice’,10.3399/bjgpopen18x101493,PMC6184103,30564715,...,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6184103,Hajj is the fifth pillar of Islam and is descr...,,
3,38aa050ad79d8a1d7022c33535255ce9d47914e5,Arenaviruses are enveloped RNA viruses with bi...,,,7smnk0ob,PMC,Potent Inhibition of Junín Virus Infection by ...,10.1371/journal.pntd.0002933,PMC4046933,24901990,...,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,PMC4046933,Arenaviruses are enveloped RNA viruses with bi...,The pathogenic strain Romero JUNV was obtained...,To understand if IFN has direct impact on JUNV...
4,61722c462b054f36461375e96e502cbf22648c04,"In this study, the anti-dengue activity of nic...",,,wmfwl2bh,PMC,Neutralization of Acidic Intracellular Vesicle...,10.1038/s41598-019-45095-1,PMC6582152,31213630,...,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6582152,Dengue virus (DENV) represents a major mosquit...,C6/36 mosquito cells (ATCC® CRL-1660TM) and Ve...,"In this study, the anti-dengue activity of nic..."


In [44]:
#df.iloc[34885].body_text_x[:500]

In [45]:
#df.iloc[34885].body_text_y[:500]

In [46]:
#df.iloc[34885].url

In [47]:
#df.iloc[34888].body_text_x[:500]

In [48]:
#df.iloc[34888].body_text_y[:500]

In [49]:
#df.iloc[34888].url

In [50]:
df.iloc[1337].body_text_x[:500]

'Despite remarkable progress in the control and treatment of infectious diseases, the problem of emerging and re-emerging pathogens is likely to be one of the main issues of medical science and public health in the twenty-first century [1] . In this respect viral diseases are of particular concern, because advances in the field of antiviral drugs have lagged behind those regarding bactericidal drugs and antibiotics. It was shown by the emergence of the severe acute respiratory syndrome (SARS) tha'

In [51]:
df.iloc[1337].body_text_y[:500]

'Despite remarkable progress in the control and treatment of infectious diseases, the problem of emerging and re-emerging pathogens is likely to be one of the main issues of medical science and public health in the twenty-first century [1]. In this respect viral diseases are of particular concern, because advances in the field of antiviral drugs have lagged behind those regarding bactericidal drugs and antibiotics. It was shown by the emergence of the severe acute respiratory syndrome (SARS) that'

In [52]:
df.iloc[1337].url

'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3169512/'

In [53]:
df.iloc[1242].body_text_x[:500]

'The Hospital Authority (HA) is a statutory body established under the Hospital Authority Ordinance in 1990, which manages all the public hospitals and institutes in Hong Kong (HK). The strategic priorities of Hong Kong Hospital Authority (HKHA) are to (i) allay staff shortage and high turnover; (ii) better manage growing service demand; (iii) ensure service quality and safety; (iv) enhance partnership with patients and community; (v) ensure adequate resources to meet the service demand; and (vi)'

In [54]:
df.iloc[1242].body_text_y[:500]

'The Hospital Authority (HA) is a statutory body established under the Hospital Authority Ordinance in 1990, which manages all the public hospitals and institutes in Hong Kong (HK). The strategic priorities of Hong Kong Hospital Authority (HKHA) are to (i) allay staff shortage and high turnover; (ii) better manage growing service demand; (iii) ensure service quality and safety; (iv) enhance partnership with patients and community; (v) ensure adequate resources to meet the service demand; and (vi)'

In [55]:
df.iloc[1242].url

'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5590922/'
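
Every row counts as different in the exact comparison above, although the samples look nearly identical: the PMC3169512 excerpts differ only in details such as the space before a period (`[1] .` versus `[1].`). A hedged sketch with the standard library's `difflib` can quantify how close two texts are. The function name `similarity` and the sample strings are assumptions for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b, n_chars=500):
    """Similarity ratio in [0, 1] of the first n_chars of two texts."""
    # autojunk=False keeps frequent characters (such as spaces) in the comparison.
    return SequenceMatcher(None, a[:n_chars], b[:n_chars], autojunk=False).ratio()


pdf_text = 'public health in the twenty-first century [1] . In this respect'
pmc_text = 'public health in the twenty-first century [1]. In this respect'

print(pdf_text == pmc_text)                    # False
print(similarity(pdf_text, pmc_text) > 0.95)   # True
```

A threshold on this ratio would separate formatting noise from cases where the two parses contain different articles, such as row 2 shown earlier.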

Where available, we use the text from the pmc file (body_text_y), trusting the statement that the pmc parse is of higher quality.

In [56]:
df.body_text_x.isnull().sum(), df.body_text_y.isnull().sum()

(0, 945)

In [57]:
(df.body_text_x == '').sum(), (df.body_text_y == '').sum()

(0, 0)

In [58]:
df.loc[df.body_text_y.notnull(), 'body_text_x'] = df.loc[df.body_text_y.notnull(), 'body_text_y']

In [59]:
df.body_text_x.isnull().sum()

0

In [60]:
df.rename(columns = {'body_text_x': 'body_text'}, inplace=True)
df.drop('body_text_y', axis=1, inplace=True)

In [61]:
df.columns

Index(['paper_id_x', 'body_text', 'methods_x', 'results_x', 'cord_uid',
       'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license', 'abstract',
       'publish_time', 'authors', 'journal', 'Microsoft Academic Paper ID',
       'WHO #Covidence', 'has_pdf_parse', 'has_pmc_xml_parse',
       'full_text_file', 'url', 'paper_id_y', 'methods_y', 'results_y'],
      dtype='object')

In [62]:
df[['methods_x', 'methods_y', 'url']][df.methods_y.notnull()]

Unnamed: 0,methods_x,methods_y,url
0,,The Vero E6 cell line was cultured and maintai...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
1,,Chicken embryo fibroblast (DF-1) cells and hum...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
2,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
3,,The pathogenic strain Romero JUNV was obtained...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
4,,C6/36 mosquito cells (ATCC® CRL-1660TM) and Ve...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
...,...,...,...
9552,Prior to beginning studies involving animals t...,Prior to beginning studies involving animals t...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
9553,The diagnosis of secondary syphilis was based ...,"SS patients from Cali, Colombia were recruited...",https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...
9554,,All animal experiments were approved by the In...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
9555,The following are available online at http://w...,Vero cells were grown in modified Eagle’s medi...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...


In [63]:
(df.methods_x == '').sum(), df.methods_x.isnull().sum()

(6269, 0)

In [64]:
(df.methods_y == '').sum(), df.methods_y.isnull().sum()

(2050, 945)

In [65]:
# use methods_y (from pmc) when it's available
mask = (df.methods_y.notnull()) & (df.methods_y != '')
df.loc[mask, 'methods_x'] = df.loc[mask, 'methods_y']

# same for results
mask = (df.results_y.notnull()) & (df.results_y != '')
df.loc[mask, 'results_x'] = df.loc[mask, 'results_y']
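
The same "prefer the PMC text when it is non-empty" rule can be sketched as a small helper built on `Series.where`. The helper name `prefer_nonempty` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def prefer_nonempty(primary: pd.Series, fallback: pd.Series) -> pd.Series:
    """Return `primary` where it holds non-empty text, else `fallback`."""
    # Treat both NaN and '' as "missing" in the preferred column.
    has_text = primary.notna() & (primary != '')
    return primary.where(has_text, fallback)

# Toy frame (hypothetical values) mirroring the _x (pdf_json) / _y (pmc_json) columns.
toy = pd.DataFrame({
    'methods_x': ['pdf methods', '', 'pdf only', ''],
    'methods_y': ['pmc methods', 'pmc only', '', np.nan],
})
toy['methods'] = prefer_nonempty(toy['methods_y'], toy['methods_x'])
print(toy['methods'].tolist())
```

Rows where both sources are empty stay empty, which matches the masked assignment above.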

In [66]:
(df.results_x == '').sum(), df.results_x.isnull().sum()

(2827, 0)

In [67]:
(df.results_y == '').sum(), df.results_y.isnull().sum()

(2134, 945)

In [68]:
df.rename(columns = {'methods_x': 'methods', 'results_x': 'results'}, inplace=True)
df.drop(columns=['methods_y', 'results_y'], inplace=True)

In [69]:
df.rename(columns = {'paper_id_x': 'paper_id', 'source_x': 'source'}, inplace=True)

In [70]:
df.columns

Index(['paper_id', 'body_text', 'methods', 'results', 'cord_uid', 'source',
       'title', 'doi', 'pmcid', 'pubmed_id', 'license', 'abstract',
       'publish_time', 'authors', 'journal', 'Microsoft Academic Paper ID',
       'WHO #Covidence', 'has_pdf_parse', 'has_pmc_xml_parse',
       'full_text_file', 'url', 'paper_id_y'],
      dtype='object')

In [71]:
df.head()

Unnamed: 0,paper_id,body_text,methods,results,cord_uid,source,title,doi,pmcid,pubmed_id,...,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url,paper_id_y
0,5e0c586f047ff909c8ed3fe171c8975a90608d08,"Porcine epidemic diarrhea virus (PEDV), which ...",The Vero E6 cell line was cultured and maintai...,PEDV concentration was enriched 100 fold and p...,25tcxise,PMC,Neutralizing antibodies against porcine epidem...,10.1186/s12985-018-1042-3,PMC6117962,30165871,...,2018-08-30,"Gong, Lang; Lin, Ying; Qin, Jianru; Li, Qianni...",Virol J,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6117962
1,1579fbff7af9b156c6f49fee0526e48f852ea460,Infectious Bronchitis (IB) is an acute and hig...,Chicken embryo fibroblast (DF-1) cells and hum...,The expression cassettes containing the codon ...,o7y3wygc,PMC,A Recombinant Newcastle Disease Virus (NDV) Ex...,10.1038/s41598-018-30356-2,PMC6086832,30097608,...,2018-08-10,"Shirvani, Edris; Paldurai, Anandan; Manoharan,...",Sci Rep,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6086832
2,e0668c4b793d0cad26639b070819334a94648123,Hajj is the fifth pillar of Islam and is descr...,,,6h34gp3x,PMC,‘Hajj: what it means for general practice’,10.3399/bjgpopen18x101493,PMC6184103,30564715,...,2018-04-18,"Mughal, Faraz; Chew-Graham, Carolyn A; Saad, A...",BJGP Open,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6184103
3,38aa050ad79d8a1d7022c33535255ce9d47914e5,Arenaviruses are enveloped RNA viruses with bi...,The pathogenic strain Romero JUNV was obtained...,To understand if IFN has direct impact on JUNV...,7smnk0ob,PMC,Potent Inhibition of Junín Virus Infection by ...,10.1371/journal.pntd.0002933,PMC4046933,24901990,...,2014-06-05,"Huang, Cheng; Walker, Aida G.; Grant, Ashley M...",PLoS Negl Trop Dis,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,PMC4046933
4,61722c462b054f36461375e96e502cbf22648c04,Dengue virus (DENV) represents a major mosquit...,C6/36 mosquito cells (ATCC® CRL-1660TM) and Ve...,"In this study, the anti-dengue activity of nic...",wmfwl2bh,PMC,Neutralization of Acidic Intracellular Vesicle...,10.1038/s41598-019-45095-1,PMC6582152,31213630,...,2019-06-18,"Jung, Eunhye; Nam, Sangwoo; Oh, Hyeryeon; Jun,...",Sci Rep,,,True,True,comm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6582152


# Duplicates

Some paper ids may be duplicated, so we check before continuing.

In [72]:
len(df)

9557

In [73]:
df.paper_id.nunique()

9557

In [74]:
df[df.duplicated(subset=['paper_id'], keep=False)][['paper_id', 'body_text']]

Unnamed: 0,paper_id,body_text


When paper ids are duplicated, the rows also have the same text body, so we just keep one article per paper_id. In this subset the check above returns no duplicated rows.
For an example of two records with the same content, see [https://www.sciencedirect.com/science/article/pii/S1386653209701295?via%3Dihub](https://www.sciencedirect.com/science/article/pii/S1386653209701295?via%3Dihub) and [https://www.sciencedirect.com/science/article/pii/S1386653209701325?via%3Dihub](https://www.sciencedirect.com/science/article/pii/S1386653209701325?via%3Dihub).

In [75]:
df[df.duplicated(subset=['paper_id', 'body_text'], keep=False)].shape

(0, 22)

In [76]:
df.drop_duplicates(['paper_id', 'body_text'], inplace=True)

In [77]:
len(df)

9557

In [78]:
df[df.duplicated(['paper_id'], keep=False)].head(2)

Unnamed: 0,paper_id,body_text,methods,results,cord_uid,source,title,doi,pmcid,pubmed_id,...,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url,paper_id_y


In [79]:
df.drop_duplicates(['paper_id'], inplace=True)

In [80]:
df.paper_id.nunique()

9557

In [81]:
df.shape

(9557, 22)

Now the paper_id is unique.
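
A minimal sketch of the duplicate check on toy data (the ids and texts are made up). It also shows one way to spot ids whose duplicates disagree on `body_text`, which would need a manual look before dropping.

```python
import pandas as pd

toy = pd.DataFrame({
    'paper_id': ['a', 'a', 'b', 'c', 'c'],
    'body_text': ['same', 'same', 'unique', 'v1', 'v2'],
})

# keep=False flags every member of a duplicate group, not just the repeats.
dup_ids = toy[toy.duplicated(subset=['paper_id'], keep=False)]

# Ids whose duplicates carry different texts.
n_texts = dup_ids.groupby('paper_id')['body_text'].nunique()
conflicting = n_texts[n_texts > 1].index.tolist()

# drop_duplicates keeps the first row of each paper_id by default.
deduped = toy.drop_duplicates(subset=['paper_id']).reset_index(drop=True)
print(conflicting, deduped['paper_id'].is_unique)
```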

In [82]:
df.isnull().sum()

paper_id                          0
body_text                         0
methods                           0
results                           0
cord_uid                        849
source                          849
title                           849
doi                             915
pmcid                           863
pubmed_id                      1045
license                         849
abstract                        539
publish_time                    849
authors                         871
journal                         939
Microsoft Academic Paper ID    9490
WHO #Covidence                 9457
has_pdf_parse                   849
has_pmc_xml_parse               849
full_text_file                  849
url                             849
paper_id_y                      945
dtype: int64

# Some new columns for convenience

In [83]:
# some new columns for convenience
df['publish_year'] = df.publish_time.str[:4].fillna(-1).astype(int) # missing publish_time (849 rows here) becomes -1
# df['link'] = 'http://dx.doi.org/' + df.doi #dataset now has url column
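
The slice-and-cast approach assumes every non-missing `publish_time` starts with four digits. A slightly more defensive sketch, on made-up values, extracts a leading four-digit year with a regular expression so that unexpected strings become -1 instead of raising in `astype(int)`.

```python
import numpy as np
import pandas as pd

# Hypothetical values: a full date, a bare year, a missing value, free text.
publish_time = pd.Series(['2018-08-30', '2014', np.nan, 'Winter 2009'])

# Keep only a leading 4-digit year; everything else becomes NaN.
year = publish_time.str.extract(r'^(\d{4})', expand=False)
publish_year = pd.to_numeric(year).fillna(-1).astype(int)
print(publish_year.tolist())
```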

In [84]:
df.columns

Index(['paper_id', 'body_text', 'methods', 'results', 'cord_uid', 'source',
       'title', 'doi', 'pmcid', 'pubmed_id', 'license', 'abstract',
       'publish_time', 'authors', 'journal', 'Microsoft Academic Paper ID',
       'WHO #Covidence', 'has_pdf_parse', 'has_pmc_xml_parse',
       'full_text_file', 'url', 'paper_id_y', 'publish_year'],
      dtype='object')

In [85]:
df.body_text

0       Porcine epidemic diarrhea virus (PEDV), which ...
1       Infectious Bronchitis (IB) is an acute and hig...
2       Hajj is the fifth pillar of Islam and is descr...
3       Arenaviruses are enveloped RNA viruses with bi...
4       Dengue virus (DENV) represents a major mosquit...
                              ...                        
9552    Early nutritional environment affects long ter...
9553    Syphilis is a sexually transmitted disease (ST...
9554    The MERS-CoV receptor DPP4 is the main host re...
9555    Arenaviruses are enveloped RNA viruses contain...
9556    Emerging infectious diseases (EIDs) can cause ...
Name: body_text, Length: 9557, dtype: object

In [86]:
df['is_covid19'] = df.body_text.str.contains('COVID-19|covid|sar cov 2|SARS-CoV-2|2019-nCov|2019 ncov|SARS Coronavirus 2|2019 Novel Coronavirus|coronavirus 2019| Wuhan coronavirus|wuhan pneumonia|wuhan virus', case=False)

In [87]:
df.is_covid19.sum()

318
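
A long alternation is easier to maintain as a list. The sketch below builds the pattern from a keyword list with `re.escape`, so characters such as `-` are matched literally, and passes `na=False` so missing texts count as non-matches. The term list and the sample texts are assumptions for illustration.

```python
import re
import pandas as pd

# Hypothetical keyword list; extend it with the spellings you want to match.
COVID_TERMS = ['COVID-19', 'SARS-CoV-2', 'sars cov 2', '2019-nCoV',
               '2019 novel coronavirus', 'Wuhan coronavirus']
pattern = '|'.join(re.escape(term) for term in COVID_TERMS)

texts = pd.Series(
    ['A SARS-CoV-2 cohort study.', 'Dengue virus in mosquitoes.',
     None, 'Patients with covid-19 ...'],
    dtype=object,
)
# case=False makes the match case-insensitive; na=False maps missing text to False.
is_covid19 = texts.str.contains(pattern, case=False, regex=True, na=False)
print(is_covid19.tolist())
```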

# Language Detection to remove non-English articles and abstracts

from IPython.utils import io

with io.capture_output() as captured:
    !pip install scispacy
    !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz
    !pip install spacy-langdetect
    !pip install spacy scispacy spacy_langdetect https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.3/en_core_sci_lg-0.2.3.tar.gz

In [88]:
# !pip install spacy-langdetect

In [89]:
import scispacy
import spacy
import en_core_sci_lg
from spacy_langdetect import LanguageDetector

In [90]:
# large scispaCy model (en_core_sci_lg), with the tagger and NER components disabled
nlp = en_core_sci_lg.load(disable=["tagger", "ner"])
nlp.max_length = 2000000
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)

In [91]:
doc = nlp('This is some English text. Das ist ein Haus. This is a house.')
doc._.language

{'language': 'en', 'score': 0.999998063116629}

In [92]:
for s in doc.sents:
    print(s._.language)

{'language': 'en', 'score': 0.9999989411217607}
{'language': 'de', 'score': 0.9999976255795803}
{'language': 'en', 'score': 0.9999979286758364}


In [93]:
#doc = nlp(df[df.paper_id == '1a8a4dbbaa94ced4ef6af69ec7a09d3fa4c0eece'].body_text.iloc[0])

In [94]:
#doc[:500]

In [95]:
doc_engl = ''
for s in doc.sents:
    if (s._.language['language'] == 'en'):
        doc_engl += s.text 

In [96]:
doc_engl[:2000]

'This is some English text.This is a house.'
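
The sentence filter above can be written as a small function that is independent of spaCy. Here `keep_language` and `toy_detect` are hypothetical names; in the notebook the label would come from `s._.language['language']` for each `s` in `doc.sents`. Joining with a space also avoids gluing sentences together, as in `'text.This'`.

```python
from typing import Callable, Iterable

def keep_language(sentences: Iterable[str],
                  detect: Callable[[str], str],
                  lang: str = 'en') -> str:
    """Join the sentences whose detected language equals `lang`."""
    kept = [s.strip() for s in sentences if detect(s) == lang]
    return ' '.join(kept)

# Stand-in detector for illustration only, not a real language model.
def toy_detect(sentence: str) -> str:
    return 'de' if sentence.startswith('Das') else 'en'

sents = ['This is some English text.', 'Das ist ein Haus.', 'This is a house.']
print(keep_language(sents, toy_detect))
```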

Check language of each text body (only use the first 2000 characters).

In [97]:
df['text_language'] = df.body_text.apply(lambda x: nlp(str(x[:2000]))._.language['language'])

df.text_language.value_counts()

en         9540
zh-cn         3
ru            2
fr            2
ko            2
UNKNOWN       2
ro            1
tl            1
id            1
de            1
pt            1
so            1
Name: text_language, dtype: int64
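
Labels that appear only once may be low-confidence guesses rather than real non-English papers (this is an assumption, not something checked here). Since the detector returns a dict with `language` and `score`, one option is to fall back to `UNKNOWN` below a threshold. The helper name and the 0.9 cut-off are hypothetical.

```python
def language_label(result: dict, min_score: float = 0.9) -> str:
    """Map a {'language': ..., 'score': ...} result to a label.

    Guesses scoring below `min_score` are reported as 'UNKNOWN'.
    """
    if result.get('score', 0.0) < min_score:
        return 'UNKNOWN'
    return result.get('language', 'UNKNOWN')

print(language_label({'language': 'en', 'score': 0.999998}))
print(language_label({'language': 'so', 'score': 0.57}))
```

Rows labelled `UNKNOWN` could then be inspected by hand instead of being dropped automatically.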

## Number of non-English texts to drop

In [98]:
df.loc[df[df.text_language != 'en'].index].shape

(17, 25)

In [99]:
df = df.drop(df[df.text_language != 'en'].index)

In [100]:
df.columns

Index(['paper_id', 'body_text', 'methods', 'results', 'cord_uid', 'source',
       'title', 'doi', 'pmcid', 'pubmed_id', 'license', 'abstract',
       'publish_time', 'authors', 'journal', 'Microsoft Academic Paper ID',
       'WHO #Covidence', 'has_pdf_parse', 'has_pmc_xml_parse',
       'full_text_file', 'url', 'paper_id_y', 'publish_year', 'is_covid19',
       'text_language'],
      dtype='object')

In [101]:
df1 = df

In [102]:
df = df[['paper_id', 'title', 'results', 'abstract', 'publish_time', 'journal', 'publish_year', 'is_covid19']]

# Export as .csv

In [103]:
df.head()

Unnamed: 0,paper_id,title,results,abstract,publish_time,journal,publish_year,is_covid19
0,5e0c586f047ff909c8ed3fe171c8975a90608d08,Neutralizing antibodies against porcine epidem...,PEDV concentration was enriched 100 fold and p...,BACKGROUND: Porcine epidemic diarrhea virus (P...,2018-08-30,Virol J,2018,False
1,1579fbff7af9b156c6f49fee0526e48f852ea460,A Recombinant Newcastle Disease Virus (NDV) Ex...,The expression cassettes containing the codon ...,Infectious bronchitis virus (IBV) causes a hig...,2018-08-10,Sci Rep,2018,False
2,e0668c4b793d0cad26639b070819334a94648123,‘Hajj: what it means for general practice’,,,2018-04-18,BJGP Open,2018,False
3,38aa050ad79d8a1d7022c33535255ce9d47914e5,Potent Inhibition of Junín Virus Infection by ...,To understand if IFN has direct impact on JUNV...,The new world arenavirus Junín virus (JUNV) is...,2014-06-05,PLoS Negl Trop Dis,2014,False
4,61722c462b054f36461375e96e502cbf22648c04,Neutralization of Acidic Intracellular Vesicle...,"In this study, the anti-dengue activity of nic...",Dengue fever is one of the most important mosq...,2019-06-18,Sci Rep,2019,False


In [104]:
# work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
df = df.copy()
# treat empty strings as missing, then drop every row with a missing value in any column
df['abstract'] = df['abstract'].replace('', np.nan)
df['results'] = df['results'].replace('', np.nan)
df = df.dropna()


In [105]:
results = df.drop('abstract', axis = 1)
abstract = df.drop('results', axis = 1)

In [106]:
results['section_num'] = 1
abstract['section_num'] = 0

In [107]:
results = results.rename({'results' : 'text'}, axis = 1)
abstract = abstract.rename({'abstract' : 'text'}, axis = 1)

In [108]:
import datetime

In [109]:
commercial_df = pd.concat([abstract, results]) 
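
The drop / rename / concat steps produce a long table with one row per paper and section. As an alternative sketch on toy data, `DataFrame.melt` gives the same shape in one call. Note that `melt` resets the index, whereas `pd.concat` keeps the original row labels.

```python
import pandas as pd

# Toy wide frame (made-up values) with one column per section.
wide = pd.DataFrame({
    'paper_id': ['p1', 'p2'],
    'title': ['T1', 'T2'],
    'abstract': ['abs 1', 'abs 2'],
    'results': ['res 1', 'res 2'],
})

# One row per (paper, section); 'section' holds the original column name.
long_df = wide.melt(id_vars=['paper_id', 'title'],
                    value_vars=['abstract', 'results'],
                    var_name='section', value_name='text')
# Same coding as in the cells above: abstract -> 0, results -> 1.
long_df['section_num'] = long_df['section'].map({'abstract': 0, 'results': 1})
long_df = long_df.drop(columns='section')
print(long_df)
```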

In [110]:
commercial_df['publish_month'] = pd.DatetimeIndex(commercial_df['publish_time']).month
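
`pd.DatetimeIndex` works here because rows with missing values were dropped earlier. If unparseable dates could remain, a sketch like the following (toy values, assuming `YYYY-MM-DD` strings) coerces them to missing and keeps an integer dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical values: two ISO dates, one missing, one unparseable.
publish_time = pd.Series(['2018-08-30', '2014-06-05', np.nan, 'unknown'])

# errors='coerce' turns anything that does not match the format into NaT.
parsed = pd.to_datetime(publish_time, format='%Y-%m-%d', errors='coerce')
# The nullable Int64 dtype keeps months as integers while allowing <NA>.
publish_month = parsed.dt.month.astype('Int64')
print(publish_month.tolist())
```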

In [111]:
commercial_df = commercial_df.drop('publish_time', axis = 1)

In [112]:
commercial_df.to_csv('../dataset/processed-files/commercial.csv', index=False)

In [113]:
commercial_df

Unnamed: 0,paper_id,title,text,journal,publish_year,is_covid19,section_num,publish_month
0,5e0c586f047ff909c8ed3fe171c8975a90608d08,Neutralizing antibodies against porcine epidem...,BACKGROUND: Porcine epidemic diarrhea virus (P...,Virol J,2018,False,0,8
1,1579fbff7af9b156c6f49fee0526e48f852ea460,A Recombinant Newcastle Disease Virus (NDV) Ex...,Infectious bronchitis virus (IBV) causes a hig...,Sci Rep,2018,False,0,8
3,38aa050ad79d8a1d7022c33535255ce9d47914e5,Potent Inhibition of Junín Virus Infection by ...,The new world arenavirus Junín virus (JUNV) is...,PLoS Negl Trop Dis,2014,False,0,6
4,61722c462b054f36461375e96e502cbf22648c04,Neutralization of Acidic Intracellular Vesicle...,Dengue fever is one of the most important mosq...,Sci Rep,2019,False,0,6
5,7107f088cbed45d8a06a026276ccf4d602d50f10,Microglia Play a Major Role in Direct Viral-In...,Microglia are the resident macrophage-like pop...,Clin Dev Immunol,2013,False,0,6
...,...,...,...,...,...,...,...,...
9552,228650bc0429064d800d4b9c5fb0e00c2533a579,Lipidome profiles of postnatal day 2 vaginal s...,The discovery phase identified a total of 1486...,PLoS One,2019,False,1,9
9553,2246e28681bde69c65dc9081df367bb661997f19,"Secondary Syphilis in Cali, Colombia: New Conc...","A total of 57 patients (age 18–64 years, media...",PLoS Negl Trop Dis,2010,False,1,5
9554,577c6a13f9ef70e9756890fc66e98f537c01ac0a,Replication and shedding of MERS-CoV in Jamaic...,The MERS-CoV receptor DPP4 is the main host re...,Sci Rep,2016,False,1,2
9555,c5c2bc7a07670d6fb970d84a59aab3832752a3f1,Role of the ERK1/2 Signaling Pathway in the Re...,"Clade B New World (NW) arenaviruses, including...",Viruses,2018,False,1,4
