
## Load & clean data files
-------------------------------------------------

### Dataset description:

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 57,000 scholarly articles, including over 45,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.

This dataset was created by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine - National Institutes of Health, in coordination with The White House Office of Science and Technology Policy.


Our goal here is to collect the JSON files that contain the full text of each paper, merge them with the metadata.csv file, and write the result to a new .csv file for use in our final capstone.

### Notebook content:


#### 1. Import libraries & load meta data

In [None]:
import numpy as np
import pandas as pd
import os
import json
import glob
import sys
sys.path.insert(0, "../")

root_path = 'CORD-19-research-challenge/'

In [None]:
# List all files in directory:
!ls CORD-19-research-challenge

COVID.DATA.LIC.AGMT.pdf  cord_19_embeddings_4_17  metadata.csv
biorxiv_medrxiv          custom_license           metadata.readme
comm_use_subset          json_schema.txt          noncomm_use_subset


In [None]:
# Load the metadata dataset:
root_path = 'CORD-19-research-challenge'
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url
0,zjufx4fo,b2897e1277f56641193a6db73825f707eed3e4c9,PMC,Sequence requirements for RNA strand transfer ...,10.1093/emboj/20.24.7220,PMC125340,11742998,unk,Nidovirus subgenomic mRNAs contain a leader se...,2001-12-17,"Pasternak, Alexander O.; van den Born, Erwin; ...",The EMBO Journal,,,True,True,custom_license,http://europepmc.org/articles/pmc125340?pdf=re...
1,ymceytj3,e3d0d482ebd9a8ba81c254cc433f314142e72174,PMC,"Crystal structure of murine sCEACAM1a[1,4]: a ...",10.1093/emboj/21.9.2076,PMC125375,11980704,unk,CEACAM1 is a member of the carcinoembryonic an...,2002-05-01,"Tan, Kemin; Zelus, Bruce D.; Meijers, Rob; Liu...",The EMBO Journal,,,True,True,custom_license,http://europepmc.org/articles/pmc125375?pdf=re...
2,wzj2glte,00b1d99e70f779eb4ede50059db469c65e8c1469,PMC,Synthesis of a novel hepatitis C virus protein...,10.1093/emboj/20.14.3840,PMC125543,11447125,no-cc,Hepatitis C virus (HCV) is an important human ...,2001-07-16,"Xu, Zhenming; Choi, Jinah; Yen, T.S.Benedict; ...",EMBO J,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
3,2sfqsfm1,cf584e00f637cbd8f1bb35f3f09f5ed07b71aeb0,PMC,Structure of coronavirus main proteinase revea...,10.1093/emboj/cdf327,PMC126080,12093723,unk,The key enzyme in coronavirus polyprotein proc...,2002-07-01,"Anand, Kanchan; Palm, Gottfried J.; Mesters, J...",The EMBO Journal,,,True,True,custom_license,http://europepmc.org/articles/pmc126080?pdf=re...
4,i0zym7iq,dde02f11923815e6a16a31dd6298c46b109c5dfa,PMC,Discontinuous and non-discontinuous subgenomic...,10.1093/emboj/cdf635,PMC136939,12456663,unk,"Arteri-, corona-, toro- and roniviruses are ev...",2002-12-01,"van Vliet, A.L.W.; Smits, S.L.; Rottier, P.J.M...",The EMBO Journal,,,True,True,custom_license,http://europepmc.org/articles/pmc136939?pdf=re...


In [None]:
# Inspect the shape of the metadata dataset:
meta_df.shape

(57366, 18)

#### 2. Extract data from the JSON files that contain the literature

In [None]:
# Grab all JSON files that contain the literature:
all_json = glob.glob(f'{root_path}/**/*.json', recursive=True)
len(all_json)

68204
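
The recursive glob returns more files (68,204) than there are metadata rows (57,366), so it can help to see how the JSON files are spread across the dataset folders. The following is a minimal sketch, not part of the original notebook: `count_json_by_folder` is a hypothetical helper, and it assumes the folder layout shown in the directory listing above.

```python
from collections import Counter
from pathlib import Path


def count_json_by_folder(root):
    """Count .json files under `root`, grouped by top-level subfolder."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob('*.json'):
        # First path component below root, e.g. 'biorxiv_medrxiv'.
        relative = path.relative_to(root)
        top = relative.parts[0] if len(relative.parts) > 1 else '.'
        counts[top] += 1
    return counts
```

In this notebook it would be called as `count_json_by_folder(root_path)`; files sitting directly in the root folder are grouped under `'.'`.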

In [None]:
# Open 1 json file to see what the structure looks like:
with open(all_json[1]) as file:
    first_entry = json.load(file)
    print(json.dumps(first_entry, indent=4))

{
    "paper_id": "PMC7109766",
    "metadata": {
        "title": "Reply to Tso et al",
        "authors": [
            {
                "first": "Xinchun",
                "middle": [],
                "last": "Chen",
                "suffix": "",
                "email": null,
                "affiliation": {}
            },
            {
                "first": "Boping",
                "middle": [],
                "last": "Zhou",
                "suffix": "",
                "email": null,
                "affiliation": {}
            },
            {
                "first": "Meizhong",
                "middle": [],
                "last": "Li",
                "suffix": "",
                "email": null,
                "affiliation": {}
            },
            {
                "first": "Xiaorong",
                "middle": [],
                "last": "Liang",
                "suffix": "",
                "email": null,
                "affiliation": {}
            },
  

In [None]:
# Build helper function to read files:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']  # extract paper_id
            self.title = content['metadata']['title']  # extract paper title
            self.body_text = []
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])  # extract body_text
            self.body_text = '\n'.join(self.body_text)
            # Extend Here
            #
            #
    def __repr__(self):
        return f'{self.paper_id}: {self.body_text[:300]}...'

# Use the FileReader class to inspect the second entry of all_json, showing only the first 300 characters:
second_row = FileReader(all_json[1])
print(second_row)

PMC7109766: 
To the Editor—We appreciate the letters from Tso et al. [1, 2] that expand on our study [3] of the serologic profile of severe acute respiratory syndrome (SARS). On the basis of the findings of our study, we are certain that IgG antibody persists for at least 60 days after the onset of symptoms. Th...
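
The "Extend Here" placeholder in `FileReader` suggests that more fields can be pulled from each file. As a sketch of one possible extension, the hypothetical function below also extracts an abstract. It assumes that some files carry an `abstract` list shaped like `body_text` (a list of dicts with a `text` key) and that other files omit it; neither assumption is confirmed by the output shown above.

```python
def extract_sections(content):
    """Return paper_id, title, abstract and body text from one parsed JSON dict."""
    def join_text(entries):
        # Each entry is assumed to be a dict with a 'text' key.
        return '\n'.join(entry.get('text', '') for entry in entries)

    return {
        'paper_id': content['paper_id'],
        'title': content['metadata']['title'],
        # Assumption: 'abstract' may be absent, so fall back to an empty list.
        'abstract': join_text(content.get('abstract', [])),
        'body_text': join_text(content['body_text']),
    }


# Invented sample record, used only to exercise the function.
sample = {
    'paper_id': 'demo-1',
    'metadata': {'title': 'Demo paper'},
    'body_text': [{'text': 'First paragraph.'}, {'text': 'Second paragraph.'}],
}
print(extract_sections(sample)['body_text'])
```

Returning a plain dict keeps the result easy to append to a list and pass to `pd.DataFrame`.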


In [None]:
# Load the literature into a dataframe:
dict_ = {'paper_id': [], 'title': [], 'body_text': []}
progress_step = max(1, len(all_json) // 10)  # avoid modulo by zero for fewer than 10 files
for idx, entry in enumerate(all_json):
    if idx % progress_step == 0:
        print(f'Processing index: {idx} of {len(all_json)}')
    content = FileReader(entry)
    dict_['paper_id'].append(content.paper_id)
    dict_['title'].append(content.title)
    dict_['body_text'].append(content.body_text)
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'title', 'body_text'])

Processing index: 0 of 68204
Processing index: 6820 of 68204
Processing index: 13640 of 68204
Processing index: 20460 of 68204
Processing index: 27280 of 68204
Processing index: 34100 of 68204
Processing index: 40920 of 68204
Processing index: 47740 of 68204
Processing index: 54560 of 68204
Processing index: 61380 of 68204
Processing index: 68200 of 68204


#### 3. Data cleaning and feature engineering

In [None]:
# Adding wordcount feature:
df_covid['body_word_count'] = df_covid['body_text'].apply(lambda x: len(x.strip().split()))
df_covid.head()

Unnamed: 0,paper_id,title,body_text,body_word_count
0,PMC2750777,The spike protein of SARS-CoV — a target for v...,,0
1,PMC7109766,Reply to Tso et al,\nTo the Editor—We appreciate the letters from...,252
2,PMC7107989,Acute Respiratory Distress Syndrome and Pneumo...,The sequence from bacterial pneumonia to ARDS ...,2630
3,PMC7121640,Autoimmune Processes in the Central Nervous Sy...,The central nervous system (CNS) has been cons...,7123
4,PMC7094896,Critics slam treatment for SARS as ineffective...,Health authorities in Hong Kong are coming und...,452


In [None]:
df_covid.describe(include='all')

Unnamed: 0,paper_id,title,body_text,body_word_count
count,68204,68204.0,68204.0,68204.0
unique,68204,50029.0,65242.0,
top,5f7a653beae752fae3a898cbfce92cb2854a874a,,,
freq,1,4997.0,2373.0,
mean,,,,4346.978814
std,,,,8869.285139
min,,,,0.0
25%,,,,2066.0
50%,,,,3405.0
75%,,,,5148.0


Looking at the summary statistics of df_covid, we have 65,242 unique body_text values out of 68,204 entries, which indicates duplicate documents. Let's drop all duplicates so that the dataset contains only unique documents.

In [None]:
# Drop duplicate research papers:
df_covid.drop_duplicates(['body_text'], inplace=True)

In [None]:
# Inspect the shape of the dataset after dropping entries:
df_covid.shape

(65242, 4)
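
As a small illustration of what `drop_duplicates` does on `body_text`, the toy example below (invented data, not taken from the dataset) shows that rows sharing the same text collapse to the first occurrence, and that rows with an empty body count as duplicates of each other.

```python
import pandas as pd

# Toy data (invented for illustration): two empty bodies and one repeated body.
toy = pd.DataFrame({
    'paper_id': ['a', 'b', 'c', 'd'],
    'body_text': ['', 'Same text.', '', 'Same text.'],
})

deduped = toy.drop_duplicates(['body_text'])  # keeps the first row per body_text
print(deduped['paper_id'].tolist())  # ['a', 'b']
```

One empty-body row survives this step; in this notebook such rows are removed later by the word-count filter.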

In [None]:
df_covid

Unnamed: 0,paper_id,title,body_text,body_word_count
0,PMC2750777,The spike protein of SARS-CoV — a target for v...,,0
1,PMC7109766,Reply to Tso et al,\nTo the Editor—We appreciate the letters from...,252
2,PMC7107989,Acute Respiratory Distress Syndrome and Pneumo...,The sequence from bacterial pneumonia to ARDS ...,2630
3,PMC7121640,Autoimmune Processes in the Central Nervous Sy...,The central nervous system (CNS) has been cons...,7123
4,PMC7094896,Critics slam treatment for SARS as ineffective...,Health authorities in Hong Kong are coming und...,452
...,...,...,...,...
68199,228650bc0429064d800d4b9c5fb0e00c2533a579,Lipidome profiles of postnatal day 2 vaginal s...,Early nutritional environment affects long ter...,4139
68200,2246e28681bde69c65dc9081df367bb661997f19,"Secondary Syphilis in Cali, Colombia: New Conc...",Syphilis is a sexually transmitted disease (ST...,5621
68201,577c6a13f9ef70e9756890fc66e98f537c01ac0a,Replication and shedding of MERS- CoV in Jamai...,Scientific RepoRts | 6:21878 | DOI: 10 .1038/s...,2832
68202,c5c2bc7a07670d6fb970d84a59aab3832752a3f1,Role of the ERK1/2 Signaling Pathway in the Re...,Arenaviruses are enveloped RNA viruses contain...,4805


In [None]:
# Merge meta_df with df_covid:
covid19 = pd.merge(meta_df, df_covid, how='inner', on='title')
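
Titles are not unique, so an inner merge on `title` can pair one metadata row with several parsed files. The toy example below (invented frames, not the real data) sketches that behaviour and shows how the `validate` argument of `pd.merge` can be used to check the key relationship.

```python
import pandas as pd

# Toy frames (invented): two metadata rows, and two parsed files sharing one title.
meta = pd.DataFrame({'title': ['Paper A', 'Paper B'], 'journal': ['J1', 'J2']})
files = pd.DataFrame({
    'paper_id': ['PMC1', 'sha1', 'sha2'],
    'title': ['Paper A', 'Paper A', 'Paper C'],
})

merged = pd.merge(meta, files, how='inner', on='title')
print(len(merged))  # 2: 'Paper A' matches twice, 'Paper B' and 'Paper C' drop out

# validate= raises MergeError when the key relationship is not the expected one.
try:
    pd.merge(meta, files, how='inner', on='title', validate='one_to_one')
    one_to_one = True
except pd.errors.MergeError:
    one_to_one = False
print(one_to_one)  # False
```

This is why the same title can appear in the merged result with more than one `paper_id`.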

In [None]:
# Only use features that are useful to our analysis:
covid19 = covid19[['sha', 'pmcid', 'paper_id', 'source_x', 'title', 'license', 'abstract', 
                   'publish_time', 'journal', 'body_text', 'body_word_count', 'url']]

In [None]:
# Print out the first 10 rows of the new covid19 dataset:
covid19.head(10)

Unnamed: 0,sha,pmcid,paper_id,source_x,title,license,abstract,publish_time,journal,body_text,body_word_count,url
0,b2897e1277f56641193a6db73825f707eed3e4c9,PMC125340,PMC125340,PMC,Sequence requirements for RNA strand transfer ...,unk,Nidovirus subgenomic mRNAs contain a leader se...,2001-12-17,The EMBO Journal,The genetic information of RNA viruses is orga...,5942,http://europepmc.org/articles/pmc125340?pdf=re...
1,e3d0d482ebd9a8ba81c254cc433f314142e72174,PMC125375,PMC125375,PMC,"Crystal structure of murine sCEACAM1a[1,4]: a ...",unk,CEACAM1 is a member of the carcinoembryonic an...,2002-05-01,The EMBO Journal,Carcinoembryonic antigen (CEA; CD66e) was init...,5673,http://europepmc.org/articles/pmc125375?pdf=re...
2,cf584e00f637cbd8f1bb35f3f09f5ed07b71aeb0,PMC126080,PMC126080,PMC,Structure of coronavirus main proteinase revea...,unk,The key enzyme in coronavirus polyprotein proc...,2002-07-01,The EMBO Journal,Transmissible gastroenteritis virus (TGEV) bel...,5667,http://europepmc.org/articles/pmc126080?pdf=re...
3,dde02f11923815e6a16a31dd6298c46b109c5dfa,PMC136939,PMC136939,PMC,Discontinuous and non-discontinuous subgenomic...,unk,"Arteri-, corona-, toro- and roniviruses are ev...",2002-12-01,The EMBO Journal,Positive (+)-strand RNA viruses have developed...,5075,http://europepmc.org/articles/pmc136939?pdf=re...
4,dde02f11923815e6a16a31dd6298c46b109c5dfa,PMC136939,dde02f11923815e6a16a31dd6298c46b109c5dfa,PMC,Discontinuous and non-discontinuous subgenomic...,unk,"Arteri-, corona-, toro- and roniviruses are ev...",2002-12-01,The EMBO Journal,Positive (+)-strand RNA viruses have developed...,6061,http://europepmc.org/articles/pmc136939?pdf=re...
5,1e1286db212100993d03cc22374b624f7caee956,PMC140314,PMC140314,PMC,Airborne rhinovirus detection and effect of ul...,no-cc,"BACKGROUND: Rhinovirus, the most common cause ...",2003-01-13,BMC Public Health,Rhinoviruses have been associated with 40% to ...,3366,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
6,8ae137c8da1607b3a8e4c946c07ca8bda67f88ac,PMC156578,PMC156578,PMC,Discovering human history from stomach bacteria,no-cc,Recent analyses of human pathogens have reveal...,2003-04-28,Genome Biol,Charles Darwin recognized that the distributio...,1474,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
7,8ae137c8da1607b3a8e4c946c07ca8bda67f88ac,PMC156578,8ae137c8da1607b3a8e4c946c07ca8bda67f88ac,PMC,Discovering human history from stomach bacteria,no-cc,Recent analyses of human pathogens have reveal...,2003-04-28,Genome Biol,Charles Darwin recognized that the distributio...,1781,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
8,23bc55d6f63fab18b02004483888db2b6a0bfa48,PMC169038,PMC169038,PMC,Prokaryotic-style frameshifting in a plant tra...,unk,Ribosomal frameshifting signals are found in m...,2003-08-01,The EMBO Journal,The elongation phase of protein synthesis is a...,5774,http://europepmc.org/articles/pmc169038?pdf=re...
9,,PMC193621,PMC193621,PMC,A new recruit for the army of the men of death,no-cc,"The army of the men of death, in John Bunyan's...",2003-06-27,Genome Biol,"The message was dated Tuesday, 2 June 2003. ""F...",1572,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...


In [None]:
# Print out the shape of the covid19 dataset:
covid19.shape

(42690, 12)

In [None]:
# Remove any literature duplicates & split publish_time column into year and month:
covid19 = covid19.drop_duplicates(subset=['body_text', 'abstract'])
covid19 = covid19.drop(index=1276).reset_index() # this row contains nan value for abstract

# Split publish_time to extract month and year data:
publish_dt = pd.to_datetime(covid19['publish_time'], errors='coerce')  # unparseable dates become NaT
covid19['publish_year'] = publish_dt.dt.year
covid19['publish_month'] = publish_dt.dt.month

# Change datatypes of month and year features:
covid19['publish_year'] = covid19['publish_year'].astype('int64')
covid19['publish_month'] = covid19['publish_month'].astype('int64')

# Drop publish_time column now that we don't need it anymore:
covid19.drop('publish_time', axis=1, inplace=True)
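
The cast to `int64` above only succeeds when every `publish_time` value parses to a date. As a hedged alternative, the sketch below uses the `.dt` accessor together with pandas' nullable `Int64` dtype so that unparseable dates become missing values instead of raising an error. The toy values are invented for illustration.

```python
import pandas as pd

# Toy values (invented): one full date and one unparseable entry.
publish_time = pd.Series(['2001-12-17', 'not a date'])

parsed = pd.to_datetime(publish_time, errors='coerce')  # bad values become NaT
year = parsed.dt.year.astype('Int64')    # nullable integer keeps missing values
month = parsed.dt.month.astype('Int64')
```

Whether missing years are acceptable downstream depends on the later analysis, so this is a design option rather than a required change.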

In [None]:
# Inspect the shape of the dataset:
print('The shape of the dataset is: {}'.format(covid19.shape))
print('')

# Inspect any missing values in percentage:
missing_values = covid19.isnull().sum()/len(covid19)*100
print('The features with missing values are:\n{}'.format(missing_values[missing_values>0]))

The shape of the dataset is: (41949, 14)

The features with missing values are:
sha         2.524494
pmcid       6.631862
abstract    6.548428
journal     4.212258
url         0.009535
dtype: float64


In [None]:
# Data imputation:
covid19['abstract'] = covid19['abstract'].fillna('unknown', axis=0)
covid19['journal'] = covid19['journal'].fillna(covid19['journal'].mode()[0], axis=0)
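
To make the missing-value check and the imputation step concrete, here is a toy sketch with invented values. It computes the share of missing entries per column, fills missing abstracts with a placeholder, and fills missing journals with the most frequent journal, mirroring the two lines above.

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration).
toy = pd.DataFrame({
    'abstract': ['text', np.nan, 'more text', np.nan],
    'journal': ['Virology', 'Virology', np.nan, 'EMBO J'],
})

missing_pct = toy.isnull().mean() * 100  # same as sum() / len() * 100
toy['abstract'] = toy['abstract'].fillna('unknown')
toy['journal'] = toy['journal'].fillna(toy['journal'].mode()[0])
```

Mode imputation assigns the most common journal to every paper with a missing one, so imputed rows can no longer be told apart from real ones; a placeholder such as `'unknown'` would keep them distinguishable.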

In [None]:
covid19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41949 entries, 0 to 41948
Data columns (total 14 columns):
index              41949 non-null int64
sha                40890 non-null object
pmcid              39167 non-null object
paper_id           41949 non-null object
source_x           41949 non-null object
title              41949 non-null object
license            41949 non-null object
abstract           41949 non-null object
journal            41949 non-null object
body_text          41949 non-null object
body_word_count    41949 non-null int64
url                41945 non-null object
publish_year       41949 non-null int64
publish_month      41949 non-null int64
dtypes: int64(4), object(10)
memory usage: 4.5+ MB


In [None]:
# Drop unused columns:
covid19 = covid19.drop(['index', 'sha', 'pmcid'], axis=1)

In [None]:
covid19

Unnamed: 0,paper_id,source_x,title,license,abstract,journal,body_text,body_word_count,url,publish_year,publish_month
0,PMC125340,PMC,Sequence requirements for RNA strand transfer ...,unk,Nidovirus subgenomic mRNAs contain a leader se...,The EMBO Journal,The genetic information of RNA viruses is orga...,5942,http://europepmc.org/articles/pmc125340?pdf=re...,2001,12
1,PMC125375,PMC,"Crystal structure of murine sCEACAM1a[1,4]: a ...",unk,CEACAM1 is a member of the carcinoembryonic an...,The EMBO Journal,Carcinoembryonic antigen (CEA; CD66e) was init...,5673,http://europepmc.org/articles/pmc125375?pdf=re...,2002,5
2,PMC126080,PMC,Structure of coronavirus main proteinase revea...,unk,The key enzyme in coronavirus polyprotein proc...,The EMBO Journal,Transmissible gastroenteritis virus (TGEV) bel...,5667,http://europepmc.org/articles/pmc126080?pdf=re...,2002,7
3,PMC136939,PMC,Discontinuous and non-discontinuous subgenomic...,unk,"Arteri-, corona-, toro- and roniviruses are ev...",The EMBO Journal,Positive (+)-strand RNA viruses have developed...,5075,http://europepmc.org/articles/pmc136939?pdf=re...,2002,12
4,dde02f11923815e6a16a31dd6298c46b109c5dfa,PMC,Discontinuous and non-discontinuous subgenomic...,unk,"Arteri-, corona-, toro- and roniviruses are ev...",The EMBO Journal,Positive (+)-strand RNA viruses have developed...,6061,http://europepmc.org/articles/pmc136939?pdf=re...,2002,12
...,...,...,...,...,...,...,...,...,...,...,...
41944,2d502e7bb600b3f1b16fa83995717993cbe316e7,PMC,Short-Sighted Virus Evolution and a Germline H...,cc-by,With extremely short generation times and high...,Trends Microbiol,With extremely short generation times and high...,6119,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,2017,4
41945,2e237dc7ffdacf42c76c0671bea72fae2a3ec9e1,PMC,Short-Sighted Virus Evolution and a Germline H...,cc-by,With extremely short generation times and high...,Trends Microbiol,Susceptibility to short-sighted evolution will...,5270,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,2017,4
41946,29e61d2288dcd264a7bbca9fdd23f6ed4f09cbc1,Elsevier,An immunosuppressed Syrian golden hamster mode...,els-covid,Abstract Several small animal models have been...,Virology,The severe acute respiratory syndrome coronavi...,4918,https://doi.org/10.1016/j.virol.2008.07.026,2008,10
41947,2c01b65a4468d43dad054457c80d05d5a2715ac6,Elsevier,Structural model of the SARS coronavirus E cha...,els-covid,Abstract Coronaviruses (CoV) cause common cold...,Biochimica et Biophysica Acta (BBA) - Biomembr...,Coronaviruses (CoV) typically affect the respi...,3296,https://doi.org/10.1016/j.bbamem.2018.02.017,2018,6


In [None]:
# Exclude any rows whose body_text has 20 words or fewer:
covid19 = covid19[covid19['body_word_count'] > 20]

In [None]:
covid19.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41843 entries, 0 to 41948
Data columns (total 11 columns):
paper_id           41843 non-null object
source_x           41843 non-null object
title              41843 non-null object
license            41843 non-null object
abstract           41843 non-null object
journal            41843 non-null object
body_text          41843 non-null object
body_word_count    41843 non-null int64
url                41839 non-null object
publish_year       41843 non-null int64
publish_month      41843 non-null int64
dtypes: int64(3), object(8)
memory usage: 3.8+ MB


#### 4. Write the result into a new .csv file to use later

In [None]:
# Write the result into a new .csv file in the current directory
# (to_csv returns None when given a path, so there is nothing to assign):
covid19.to_csv('kaggle_covid-19_clean_format.csv')
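
Because `to_csv` writes the index by default, the saved file gains an unnamed first column. The sketch below uses an in-memory buffer and invented data (instead of the real file) to show how that column can be read back as the index with `index_col=0`.

```python
import io

import pandas as pd

# Toy frame (invented) with a non-contiguous index, as left behind by row filtering.
toy = pd.DataFrame({'paper_id': ['a', 'b'], 'body_word_count': [25, 40]}, index=[0, 2])

buffer = io.StringIO()
toy.to_csv(buffer)          # the index is written as an unnamed first column
buffer.seek(0)

restored = pd.read_csv(buffer, index_col=0)  # read that column back as the index
```

Passing `index=False` to `to_csv` is the other common choice when the index carries no meaning for later steps.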