# CORD-19 Software Mentions with the DOIs of the Papers That Mention Them

This notebook links each software package in the random sample dataset to the DOIs of the papers that mention it.

First up, import some packages that may be helpful.

In [None]:
import numpy as np
import pandas as pd
import csv
import ast
import collections
import matplotlib.pyplot as plt

### Define the input and output file paths as variables

We want to enrich the random sample file with information from the CORD-19 file, and finally write a new version of the random sample file with added DOI information.

In [None]:
RANDOM_SAMPLE_CSV = '../data/output/CORD19_software_popularity_sampled_QA.csv'
CORD19_CSVFILE = '../data/cord-19/CORD19_software_mentions.csv'
DOI_FILE = '../data/output/CORD19_software_popularity_sampled_QA_DOI.csv'
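Because every later cell depends on these paths, it can help to confirm that the input files exist before reading them. The following is a minimal sketch; the helper `missing_inputs` is a hypothetical name and not part of the original notebook, and the example path is assumed not to exist.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the given paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Example with a made-up path that is assumed not to exist
print(missing_inputs(['definitely/not/here.csv']))
```

In the notebook, this could be called with `[RANDOM_SAMPLE_CSV, CORD19_CSVFILE]`; an empty result would mean both inputs were found.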

### Read the titles of the sampled software packages as keys into a dict

We will look these titles up later to collect the DOIs of the papers that mention them in the CORD-19 file.

In [None]:
sample_doi_map = {}

with open(RANDOM_SAMPLE_CSV, newline='') as samplecsv:
    samplereader = csv.DictReader(samplecsv)
    for row in samplereader:
        title = row['Title']
        # Prepopulate the map with the title as key and an empty set as value
        sample_doi_map[title] = set()
        
# Sanity check: the sample is expected to contain 100 distinct titles
len_map = len(sample_doi_map)
if len_map != 100:
    raise AssertionError(f'Expected 100 titles, mapped {len_map}')
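To see what the cell above produces without the real file, here is a minimal sketch on an in-memory CSV. The column layout and the titles are illustrative assumptions, not taken from the actual sample file.

```python
import csv
import io

# Made-up stand-in for the random sample file; only the 'Title' column matters here
toy_sample_csv = "ID,Title\n1,SPSS\n2,GraphPad Prism\n3,SPSS\n"

toy_map = {}
for row in csv.DictReader(io.StringIO(toy_sample_csv)):
    # Each title becomes a key with an empty set, as in the cell above
    toy_map[row['Title']] = set()

print(toy_map)
```

Since titles are dict keys, a repeated title collapses into a single entry. This is why the sanity check would fail if the 100 sampled rows contained a duplicate title.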

### Take a quick look at the set of titles

In [None]:
sample_doi_map

### Build a dict from software titles to DOIs

We now iterate through the mentions in the CORD-19 file. Whenever a mention equals a title in our sampled list, we read the DOI of that row and add it to the set stored under that title. The result is a dict in which each key is a title and each value is the set of DOIs of the papers that mention it.

In [None]:
with open(CORD19_CSVFILE, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for cordrow in reader:
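        # The 'software' column holds a Python list literal; deduplicate it per row
        mentions = set(ast.literal_eval(cordrow['software']))
        for mention in mentions:
            # A dict membership test replaces the scan over all sampled titles
            if mention in sample_doi_map:
                sample_doi_map[mention].add(cordrow['doi'])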
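The matching step can be tried on a few rows of toy data. In this sketch the DOIs and software lists are made up; the only thing carried over from the cell above is the assumption that the `software` column stores a Python list literal that `ast.literal_eval` can parse.

```python
import ast
import csv
import io

# Made-up excerpt in the assumed shape of the CORD-19 mentions file
toy_cord_csv = '''doi,software
10.1000/a,"['SPSS', 'R', 'SPSS']"
10.1000/b,"['R']"
10.1000/c,"['SPSS']"
'''

# Two illustrative sampled titles, one of which is never mentioned
toy_map = {'SPSS': set(), 'Stata': set()}

for row in csv.DictReader(io.StringIO(toy_cord_csv)):
    # Parse the list literal and drop repeated mentions within one paper
    for mention in set(ast.literal_eval(row['software'])):
        if mention in toy_map:
            toy_map[mention].add(row['doi'])

print(toy_map)
```

Titles that are never mentioned keep their empty set, so every sampled title still has an entry when the output file is written.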

### Have a peek at the map

In [None]:
sample_doi_map

### Update the sample dataset

Write a copy of the random sample file with a new column "Mentioning DOIs", populated with the semicolon-separated DOIs collected for each sample.

In [None]:
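with open(RANDOM_SAMPLE_CSV, newline='') as samplecsv:
    # newline='' leaves line endings to the csv module on every platform
    with open(DOI_FILE, 'w', newline='') as sample_doi_csv:
        writer = csv.writer(sample_doi_csv)

        for row in csv.reader(samplecsv):
            # The title is expected in the second column
            if row[1] == 'Title':
                # Header row: append the name of the new column
                writer.writerow(row + ['Mentioning DOIs'])
            else:
                # Build a semicolon-separated string of the non-empty DOIs, sorted for reproducible output
                doi_set = sample_doi_map[row[1]]
                writer.writerow(row + [';'.join(sorted(str(doi) for doi in doi_set if doi))])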
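Anyone reading the enriched file later has to turn the semicolon-separated cell back into a set of DOIs. The round trip below is a sketch on an in-memory buffer; `join_dois` and `split_dois` are hypothetical helpers, and the rows and DOIs are made up.

```python
import csv
import io


def join_dois(dois):
    """Join the non-empty DOIs into one sorted, semicolon-separated string."""
    return ';'.join(sorted(d for d in dois if d))


def split_dois(cell):
    """Inverse of join_dois: an empty cell yields an empty set."""
    return set(cell.split(';')) if cell else set()


# Write two made-up rows in the assumed shape of the output file
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['ID', 'Title', 'Mentioning DOIs'])
writer.writerow(['1', 'SPSS', join_dois({'10.1000/c', '10.1000/a', ''})])
writer.writerow(['2', 'Stata', join_dois(set())])

# Read the rows back and recover the DOI sets
buffer.seek(0)
recovered = {
    row['Title']: split_dois(row['Mentioning DOIs'])
    for row in csv.DictReader(buffer)
}
print(recovered)
```

Splitting on `;` assumes that no DOI contains a semicolon itself. Some older DOIs do, so a different separator or one row per DOI may be safer if such DOIs appear in the data.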