# Deconstructing CORD-19

>  On March 13th, the White House, along with the Chan Zuckerberg Initiative and the Allen Institute for AI, launched the COVID-19 Open Research Dataset Challenge on Kaggle with the objective of better understanding the novel coronavirus. For this purpose, a massive corpus of more than 45,000 scholarly articles was made available as the competition dataset, called the COVID-19 Open Research Dataset (CORD-19).

> I found this initiative fascinating, and it hints at a world where science is a truly open, collaborative pursuit. For that reason, I decided to understand the story of the dataset better by asking:
> * What can the metadata tell me about the views being represented?
> * What assumptions are built into a corpus shaped by a modern, Western view of science and knowledge?

> ## The Dataset Structure
> The CORD-19 dataset adds up to about 4 GB and can be downloaded. However, the most convenient way to work with it is a Kaggle notebook, which can access the dataset directly in its workspace. Upon inspecting the local filesystem, we see that the dataset is made up of JSON files organized within directories.

In [44]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import json
import os
shown = 0
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if shown < 10:  # print only the first ten paths
            print(os.path.join(dirname, filename))
            shown += 1

/kaggle/input/CORD-19-research-challenge/metadata.csv
/kaggle/input/CORD-19-research-challenge/json_schema.txt
/kaggle/input/CORD-19-research-challenge/metadata.readme
/kaggle/input/CORD-19-research-challenge/COVID.DATA.LIC.AGMT.pdf
/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/d99acb4e99be7852aa61a688c9fbd38d44b5a252.json
/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/748d4c57fe1acc8d9d97cf574f7dea5296f9386c.json
/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/b891efc6e1419713b05ff7d89b26d260478c28df.json
/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/353852971069ad5794445e5c1ab6077ce23da75d.json
/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/76d2990a2663635e195b8a9818f9664872b6d3af.json
/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/5407b1aabede9eeed76c59a7b890be9f513712b7.json


> ## What can the directory structure tell us about embedded views in the corpus, about the standards and requirements that shaped it? 
> To start answering, we look at the unique directories and see that, besides the root, there are four. Three of them seem to correspond to a type of licence (commercial use, non-commercial use and custom licence), while the fourth groups bioRxiv and medRxiv preprints. In a Western, modern view of science, licence rights are often treated as more important than knowledge. Even though this dataset has been made open, its structure acknowledges that importance.

In [52]:
directories = set()
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        directories.add(dirname)
directories

{'/kaggle/input/CORD-19-research-challenge',
 '/kaggle/input/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv',
 '/kaggle/input/CORD-19-research-challenge/comm_use_subset/comm_use_subset',
 '/kaggle/input/CORD-19-research-challenge/custom_license/custom_license',
 '/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset'}
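
> To see how much weight each directory carries, one could count the JSON files it holds. The following is a minimal sketch, not part of the original notebook; the helper name `count_json_per_directory` is an assumption introduced here for illustration.

```python
import os
from collections import Counter


def count_json_per_directory(root):
    """Return a Counter mapping each directory under root to its number of .json files."""
    counts = Counter()
    for dirname, _, filenames in os.walk(root):
        n = sum(1 for name in filenames if name.endswith('.json'))
        if n:  # skip directories that hold no paper files
            counts[dirname] = n
    return counts


# On Kaggle, using the path from the listing above:
# count_json_per_directory('/kaggle/input').most_common()
```

> Sorting with `most_common()` would show which licence group contributes the most papers.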

> ## Structure of an individual paper
> JSON files are made up of nested key-value pairs. By looking at the outermost 'shell' of keys, we see the basic components of a paper. These components correspond to a standardized composition of scholarly articles, so they can tell us a story about the underlying system of knowledge creation. Here, we take a single paper and examine its outermost keys.


In [53]:
with open('/kaggle/input/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv/64b327001f8fa95b83dc23259a2ad617be12498c.json') as f:
    data = json.load(f)
list(data.keys())

['paper_id',
 'metadata',
 'abstract',
 'body_text',
 'bib_entries',
 'ref_entries',
 'back_matter']
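
> Beyond the key names, it can help to see what kind of value sits behind each key. Below is a small sketch under stated assumptions: `describe_top_level` is a hypothetical helper, and `toy_paper` is invented data that only imitates the shape of a real file.

```python
def describe_top_level(paper):
    """Map each top-level key to the type name and size of its value."""
    summary = {}
    for key, value in paper.items():
        size = len(value) if hasattr(value, '__len__') else None
        summary[key] = (type(value).__name__, size)
    return summary


# Toy stand-in for one parsed paper (not real CORD-19 content)
toy_paper = {
    'paper_id': 'abc123',
    'metadata': {'title': 'A toy title', 'authors': []},
    'abstract': [{'text': 'A toy abstract.'}],
    'body_text': [{'text': 'First paragraph.'}, {'text': 'Second paragraph.'}],
}
summary = describe_top_level(toy_paper)
for key, (kind, size) in summary.items():
    print(key, kind, size)
```

> Calling `describe_top_level(data)` on the paper loaded above would show which components are plain strings, which are dictionaries and which are lists of paragraphs.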

> ### Paper id
>  In the Western view of science, each paper, each unit of knowledge contribution, must be unique, distinguishable and identifiable. In contrast, indigenous knowledge, for example, is a more amorphous, fluid concept with no easily discretizable units.


In [54]:
data['paper_id']

'64b327001f8fa95b83dc23259a2ad617be12498c'

> ### Metadata
> The metadata contains the most important factor for ranking a paper's contribution, not by design but de facto: the names of the researchers and their academic affiliations. In modern science, this has become a sort of pedigree. A paper marked with it is deemed to contain a valid contribution.

In [55]:
data['metadata']['authors']

[{'first': 'Maria',
  'middle': [],
  'last': 'Nevot',
  'suffix': '',
  'affiliation': {'laboratory': '',
   'institution': 'Universitat Autònoma de 6 Barcelona (UAB)',
   'location': {'settlement': 'Badalona', 'country': 'Spain'}},
  'email': ''},
 {'first': 'Ana',
  'middle': [],
  'last': 'Jordan-Paiz',
  'suffix': '',
  'affiliation': {},
  'email': ''},
 {'first': 'Glòria',
  'middle': [],
  'last': 'Martrus',
  'suffix': '',
  'affiliation': {'laboratory': '',
   'institution': 'Universitat Autònoma de 6 Barcelona (UAB)',
   'location': {'settlement': 'Badalona', 'country': 'Spain'}},
  'email': ''},
 {'first': 'Cristina',
  'middle': [],
  'last': 'Andrés',
  'suffix': '',
  'affiliation': {'laboratory': '',
   'institution': 'Universitat Autònoma de 6 Barcelona (UAB)',
   'location': {'settlement': 'Badalona', 'country': 'Spain'}},
  'email': ''},
 {'first': 'Damir',
  'middle': [],
  'last': 'García-3 Cehic',
  'suffix': '',
  'affiliation': {'laboratory': 'Current address: V

> ### Abstract 
> Science is atomic, made up of individual, separable and fully differentiable pieces of knowledge. It follows, then, that there must be an abstract that, as the University of Melbourne recommends, 'must be fully self-contained and make sense by itself, without further reference to outside sources or to the actual paper.'

In [56]:
data['abstract'][0]['text']

'One unexplored aspect of HIV-1 genetic architecture is how codon choice influences 27 population diversity and evolvability. Here we compared the development of HIV-1 28 resistance to protease inhibitors (PIs) between wild-type (WT) virus and a synthetic 29 virus (MAX) carrying a codon-pair re-engineered protease sequence including 38 (13%) 30 synonymous mutations. WT and MAX viruses showed indistinguishable replication in 31 MT-4 cells or PBMCs. Both viruses were subjected to serial passages in MT-4 cells 32 with selective pressure from the PIs atazanavir (ATV) and darunavir (DRV). After 32 33 successive passages, both the WT and MAX viruses developed phenotypic resistance to 34'

> ### Text body
> The body text section of the JSON file contains text but also information on references and citations. While science is built from unique contributions, each distinguishable from the rest, every component is interconnected. However, these interconnections must remain fully traceable: references and citations fulfil that purpose.

In [57]:
data['body_text']

[{'text': 'The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/315366 doi: bioRxiv preprint 5 mechanisms within the innate immune response (11, 29, 30) , as well as to resolve the 87 importance of codon usage in the temporal regulation of viral gene expression (31). In (Table 1) . WTp32 and MAXp32, respectively, showed 5-129 fold and 13-fold increases in IC 50 for ATV, and 6-fold and 10-fold increases in IC 50 for 130 DRV (Table 1) . Although MAXp32 displayed a higher resistance to ATV and DRV 131 than WTp32 (Table1), these differences were not significant (P = 0.4816 and P = The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/315366 doi: bioRxiv preprint compared the frequencies of resistant mutations. For each of the two studied viruses and 137 the two tested drugs, we sequenced between 1.9 × 10 7 and 4.1 × 10 7 individual protease 138 nucleotides (Table 2) . Sequence clonal analysis r
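
> The traceability described above could be made concrete by pulling the citation markers out of each paragraph. This is a hedged sketch: the field names `cite_spans`, `start`, `end` and `ref_id` are assumptions about the paragraph schema (they are not shown in the truncated output above and should be checked against `json_schema.txt`), and `toy_body` is invented data.

```python
def cited_refs(body_text):
    """Collect (cited text, ref_id) pairs from every paragraph's citation spans."""
    pairs = []
    for paragraph in body_text:
        # 'cite_spans' is an assumed field name; paragraphs without it are skipped
        for span in paragraph.get('cite_spans', []):
            cited = paragraph['text'][span['start']:span['end']]
            pairs.append((cited, span.get('ref_id')))
    return pairs


# Toy paragraph imitating the assumed structure (not real CORD-19 content)
toy_text = 'Codon usage shapes viral gene expression (31).'
start = toy_text.index('(31)')
toy_body = [{
    'text': toy_text,
    'cite_spans': [{'start': start, 'end': start + 4, 'text': '(31)', 'ref_id': 'BIBREF30'}],
}]
pairs = cited_refs(toy_body)
print(pairs)
```

> If the assumption holds, each `ref_id` would point to an entry in the paper's `bib_entries`, which is what keeps the chain of contributions traceable.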

> ### Bibliography and references
> This is likely the most defining component of modern science. Valid contributions are only those that build on top of other valid contributions.
>**What happens when everything is based on something false and the mistake is never detected?**

In [61]:
data['bib_entries']['BIBREF1']

{'ref_id': 'b1',
 'title': 'Exposing synonymous mutations',
 'authors': [],
 'year': None,
 'venue': 'Trends in genetics : TIG',
 'volume': '30',
 'issn': '',
 'pages': '308--321',
 'other_ids': {}}
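
> The entry above has no authors and no year, which raises the question of how traceable the references really are. One way to probe this, sketched here with a hypothetical helper `missing_field_share` and invented `toy_bib` data, is to measure how many bibliography entries leave a field empty.

```python
def missing_field_share(bib_entries, field):
    """Share of bibliography entries whose `field` is empty (None, '', [] or {})."""
    if not bib_entries:
        return 0.0
    missing = sum(1 for entry in bib_entries.values() if not entry.get(field))
    return missing / len(bib_entries)


# Toy bibliography: the second entry mirrors the empty fields seen in the output above
toy_bib = {
    'BIBREF0': {'title': 'A toy reference', 'authors': [{'last': 'Doe'}], 'year': 2015},
    'BIBREF1': {'title': 'Exposing synonymous mutations', 'authors': [], 'year': None},
}
print(missing_field_share(toy_bib, 'year'))
print(missing_field_share(toy_bib, 'authors'))
```

> Applied as `missing_field_share(data['bib_entries'], 'year')`, this would give a rough sense of how much of a paper's lineage can actually be followed.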

> ## Retrieving the data
> We will now retrieve the data into a Python data structure to work with it. Initially, we will put it all in a list.

In [None]:
all_jsons = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        # Keep only the paper files; skip metadata.csv, the readme, etc.
        if filename.endswith('.json'):
            with open(os.path.join(dirname, filename)) as f:
                all_jsons.append(json.load(f))
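
> Holding every parsed paper in one list keeps the whole corpus in memory at once. As an alternative, here is a minimal sketch of a generator that yields one paper at a time; `iter_papers` is a name introduced here for illustration and is not part of the original notebook.

```python
import json
from pathlib import Path


def iter_papers(root):
    """Yield one parsed paper at a time from every .json file under root."""
    for path in sorted(Path(root).rglob('*.json')):
        with path.open(encoding='utf-8') as f:
            yield json.load(f)


# On Kaggle, using the path from the listing above:
# n_papers = sum(1 for _ in iter_papers('/kaggle/input'))
```

> The trade-off is that a generator can be consumed only once, so each analysis pass re-reads the files from disk.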

> ## What can the metadata tell us?
> First, we take a look at the author information, starting with how many papers don't list any authors: about 12%, a fair amount. Where does this knowledge come from?


In [63]:
count = 0
for i in all_jsons:
    if (i['metadata']['authors'] != []):
        count = count + 1

total_empty = len(all_jsons) - count
empty_proportion = total_empty/len(all_jsons)
print(empty_proportion*100)

12.135922330097088


> ## How many authors are there, or how many human minds have contributed to the corpus?
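> 198,297 author entries are counted. Each appearance on a paper counts once, so a researcher who signs several papers is counted several times.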

In [28]:
authors = 0
for i in all_jsons:
    if (i['metadata']['authors'] != []):
        
        for j in i['metadata']['authors']:
            authors += 1
authors

198297
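
> The figure above counts author entries, not distinct people. A rough way to estimate distinct authors is to deduplicate by name, as in the sketch below. It rests on a loud assumption: two entries with the same first and last name are treated as the same person, which merges homonyms and splits spelling variants. `count_unique_authors` and `toy_papers` are illustrative names and data.

```python
def count_unique_authors(papers):
    """Count distinct (first, last) name pairs, ignoring case and surrounding spaces."""
    names = set()
    for paper in papers:
        for author in paper['metadata']['authors']:
            first = author.get('first', '').strip().lower()
            last = author.get('last', '').strip().lower()
            names.add((first, last))
    return len(names)


# Toy corpus: the same person appears on two papers
toy_papers = [
    {'metadata': {'authors': [{'first': 'Maria', 'last': 'Nevot'},
                              {'first': 'Ana', 'last': 'Jordan-Paiz'}]}},
    {'metadata': {'authors': [{'first': 'maria ', 'last': 'Nevot'}]}},
]
entries = sum(len(p['metadata']['authors']) for p in toy_papers)
unique = count_unique_authors(toy_papers)
print(entries, unique)
```

> Running `count_unique_authors(all_jsons)` would give a lower figure than the raw entry count, though only an approximate one given the name-matching assumption.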

> ## Is each one of them an individual mind? Or are they minds that live in an environment, within an infrastructure, within a social setting?
> Can examining their institutions give us a hint? First, we look at how many of them have an affiliation: about 63%.


In [29]:
no_affil = 0
affil = 0
for i in all_jsons:
    if (i['metadata']['authors'] != []):
        for j in i['metadata']['authors']:
            if(j['affiliation'] == {}):
                no_affil = no_affil + 1
            else:
                affil = affil + 1
                
total_authors = no_affil + affil
no_affil_share = no_affil/total_authors
print((1-no_affil_share)*100)
            

62.69131656051277


> ## Who are those institutions?
> We present a table with the most represented institutions in the authors' affiliations:

In [30]:
institutions_dict = {}

for i in all_jsons:
    for j in i['metadata']['authors']:
        if j['affiliation'] != {}:
            # Tally one appearance per author entry that carries an affiliation
            name = str(j['affiliation'].get('institution', ''))
            institutions_dict[name] = institutions_dict.get(name, 0) + 1

ins_df = pd.DataFrame(
    {
        'Institution': list(institutions_dict.keys()),
        'Times Appeared': list(institutions_dict.values())
    }
)

ins_df.sort_values(by='Times Appeared', ascending = False, inplace = True)
ins_df = ins_df[ins_df['Institution'] != '']
ins_df.iloc[0:20]


     Institution                               Times Appeared
213  Chinese Academy of Sciences               1890
236  National Institutes of Health             1297
47   University of California                  1255
416  The University of Hong Kong               1011
192  Chinese Academy of Agricultural Sciences  816
25   Fudan University                          649
32   Wuhan University                          554
139  University of Washington                  541
539  The Chinese University of Hong Kong       533
393  Utrecht University                        528


> ## What countries are represented by those institutions?
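> We present a table with the most frequent countries among these institutions. We observe that mostly North American, European and East Asian countries are represented. Note that 'USA' and 'United States' appear as separate rows; added together (17,652 + 2,749 = 20,401) they exceed China's count. In the context of finding useful knowledge for fighting the pandemic, how do we know the solutions are applicable to Latin America or Africa?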

In [31]:
countries_dict = {}
for i in all_jsons:
    for j in i['metadata']['authors']:
        # Some affiliations carry no location, and some locations no country
        location = j['affiliation'].get('location', {})
        if 'country' in location:
            country = str(location['country'])
            countries_dict[country] = countries_dict.get(country, 0) + 1
                    
coun_df = pd.DataFrame(
    {
        'Country': list(countries_dict.keys()),
        'Times Appeared': list(countries_dict.values())
    }
)

coun_df.sort_values(by='Times Appeared', ascending = False, inplace = True)
coun_df = coun_df[coun_df['Country'] != '']
coun_df.iloc[0:10]



    Country        Times Appeared
1   China          17981
0   USA            17652
21  Japan          4035
28  France         3863
3   Canada         3730
12  UK             3294
4   Germany        3290
40  United States  2749
9   Italy          2702
46  Taiwan         2423
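
> The table lists 'USA' and 'United States' as separate rows, so the raw ranking depends on how each paper spells a country. A minimal sketch of merging such variants follows; `COUNTRY_ALIASES` is a hypothetical and deliberately incomplete mapping, and the counts are copied from the table above.

```python
from collections import Counter

# Hypothetical alias table; a real one would need many more spelling variants
COUNTRY_ALIASES = {'United States': 'USA'}


def merge_country_counts(counts, aliases=COUNTRY_ALIASES):
    """Sum counts of country names that the alias table maps to the same label."""
    merged = Counter()
    for country, n in counts.items():
        merged[aliases.get(country, country)] += n
    return merged


# Counts taken from the three relevant rows of the table above
table_counts = {'China': 17981, 'USA': 17652, 'United States': 2749}
merged = merge_country_counts(table_counts)
print(merged.most_common())
```

> With these two rows merged, the USA total becomes 20,401 and moves ahead of China, which shows how sensitive the ranking is to unnormalized metadata.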


# Final thoughts
> This has been a first exercise in trying to understand the fascinating COVID-19 Open Research Dataset. While most Kaggle participants are diving right into the abstracts or the text in search of information, it can be useful to take a step back and look at the underlying structure of the data. With that, we can ask better questions and understand the limitations of our answers.