# CORD-19 Software Mentions - License Study

## RQ1: Impact of licenses on reference and citation types

Howison and Bullard (2015) define seven types of software mention in publications:
- Cite to publication
- Cite to user's manual
- Cite to project name or website
- Instrument-like
- URL in text
- In-text name mention only
- Not even name mentioned
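
The annotation codes used later in this notebook (`PUB`, `MAN`, `PRO`, `INS`, `URL`, `NAM`, `NOT`) appear to correspond one-to-one to these seven types. The mapping below is an assumption inferred from the abbreviations, not something stated in the dataset, and `MENTION_TYPES` and `describe_mention` are hypothetical names introduced only for illustration.

```python
# Assumed mapping from annotation codes to the Howison and Bullard (2015)
# mention types; inferred from the abbreviations, not confirmed by the dataset.
MENTION_TYPES = {
    "PUB": "Cite to publication",
    "MAN": "Cite to user's manual",
    "PRO": "Cite to project name or website",
    "INS": "Instrument-like",
    "URL": "URL in text",
    "NAM": "In-text name mention only",
    "NOT": "Not even name mentioned",
}


def describe_mention(code_field):
    """Expand a 'Mention Type' cell such as 'PUB, URL' into readable labels."""
    return [MENTION_TYPES[code.strip()] for code in code_field.split(",")]


print(describe_mention("PUB, URL"))
```

A single mention can carry more than one code, which is why the helper returns a list.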

We seek to understand whether the way software is licensed affects the way it is cited or mentioned in publications. We hypothesise that commercial software is more likely to be mentioned by in-text name only or by a citation to the project name or website, whereas open source software is more likely to have a repository or an associated research publication that can be cited, making it easier to credit the authors. The results of this part of the study would provide evidence on whether the increasing prevalence of Open Science / Open Research approaches could improve the quality of software citation.

## Methodology

The original randomly sampled and manually annotated set of 100 candidate software names from the CORD-19 Software Mentions dataset was used as a starting point. Any entries originally marked as UN (Unknown) were re-examined and re-classified.

Of the 10 mentions of Spectra, 7 referred to devices and were discarded; the remaining three referenced two distinct pieces of software, Spectra Calc® and Cobe Spectra software. Similarly, JAM was found to refer to two distinct pieces of software (Jet AA Microscopic Transport Model and JAM: a scalable, Bayesian framework for joint analysis of marginal SNP effects), along with two incorrectly identified mentions, which were discarded. Three more mentions were to preprints of papers already in the dataset and were also discarded. Two groups of references, Ensembl genome browser and ENSEMBL browser, were to the same software and were merged. For the two most frequently mentioned pieces of software in this dataset, Sequencher (123 mentions) and Microsoft Access (114 mentions), a random set of five valid software mentions was chosen; all other software had five valid mentions or fewer.

This resulted in a data set containing 80 mentions of 58 pieces of software, which was manually annotated by NCH with the software license and with the type of software mention, classified according to the scheme of Howison and Bullard (2015). This dataset is available as CSM_sampled_mention_license.csv.

In [23]:
import pandas as pd
import numpy as np

In [24]:
input_data = "../data/license_study/CSM_sampled_mention_license.csv"


In [25]:
df = pd.read_csv(input_data)
print(df)

     ID       Title QACode Software License Mention Type  \
0     1  Sequencher     SC           Closed          INS   
1     1  Sequencher     SC           Closed          NAM   
2     1  Sequencher     SC           Closed          NAM   
3     1  Sequencher     SC           Closed          PRO   
4     1  Sequencher     SC           Closed          INS   
..  ...         ...    ...              ...          ...   
76   94    Adequest     SC           Closed          INS   
77   96   NRSur PHM     SC              MIT          PUB   
78   97       PVSio     ST              GPL          PUB   
79  100      discmo     ST              GPL          NAM   
80  NaN         NaN    NaN              NaN          NaN   

                     Mentioning DOIs  \
0              10.1093/infdis/jiy036   
1       10.1016/j.meegid.2012.09.016   
2     10.1016/j.jviromet.2012.11.014   
3       10.1371/journal.pone.0205209   
4   10.12688/wellcomeopenres.14836.2   
..                               ...   
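
The printed frame ends with a row (index 80) in which every field is `NaN`, which probably comes from a trailing empty line in the CSV; that cause is a guess. A minimal sketch of removing such rows before counting is shown here on a small made-up frame (`demo`) rather than on the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV: two annotated rows plus one empty row.
demo = pd.DataFrame({
    "ID": [1, 96, np.nan],
    "Title": ["Sequencher", "NRSur PHM", np.nan],
    "Software License": ["Closed", "MIT", np.nan],
    "Mention Type": ["INS", "PUB", np.nan],
})

# how="all" drops only rows in which every column is missing,
# so partially annotated rows are kept.
cleaned = demo.dropna(how="all").reset_index(drop=True)
print(len(demo), len(cleaned))
```

The cells that follow are not affected by the empty row, because `count()` ignores missing values and a `NaN` license matches none of the license classes.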

In [29]:
df.groupby(['Software License'])[['ID']].count()

Software License  ID
Academic           4
Apache            12
Artistic           1
BSD                1
Closed            34
GPL                8
LGPL               1
MIT                4
Unknown            8
Unknown (SaaS)     6


In [30]:
df.groupby(['Software License', 'Mention Type'])[['ID']].count()

Software License  Mention Type  ID
Academic          INS            1
Academic          INS, URL       1
Academic          NAM            1
Academic          PUB            1
Apache            INS            1
Apache            NAM            3
Apache            PUB            3
Apache            PUB, URL       1
Apache            URL            4
Artistic          NAM            1


In [79]:
# Python port of the spreadsheet formula used in the earlier analysis, e.g.
# SUM(COUNTIFS('License breakdown (old)'!$E$2:$E$69,"*PUB*",'License breakdown (old)'!$D$2:$D$69,{"Apache","Artistic","BSD","MIT","Unlimited"}))

# License classes: each set lists the individual licenses grouped under it.
lictype_closed = {"Closed"}
lictype_academic = {"Academic"}
lictype_permissive = {"Apache","Artistic","BSD","MIT","Unlimited"}
lictype_copyleft = {"GPL","LGPL"}
lictype_unknown = {"Unknown","Unknown (SaaS)"}

# One entry per license class; mention-type counts are added to each entry below.
results = {"Closed": {"lictypes": lictype_closed},
           "Academic": {"lictypes": lictype_academic},
           "Permissive": {"lictypes": lictype_permissive},
           "Copyleft": {"lictypes": lictype_copyleft},
           "Unknown": {"lictypes": lictype_unknown}
          }

def cluster_mentions(mention, res):
    """Increment res[code] for every mention-type code found in `mention`."""
    types = {"PUB","MAN","PRO","INS","URL","NAM","NOT"}
    for mention_type in types:
        # Substring test: a multi-label cell such as "PUB, URL" counts towards both codes.
        if mention_type in mention:
            res[mention_type] = res.get(mention_type, 0) + 1

# Assign each mention to its license class and tally its mention types.
for i in df.index:
    for classification in results:
        if df['Software License'][i] in results[classification]['lictypes']:
            cluster_mentions(df['Mention Type'][i], results[classification])

print(results)

{'Closed': {'lictypes': {'Closed'}, 'INS': 13, 'NAM': 17, 'PRO': 1, 'URL': 4, 'PUB': 2}, 'Academic': {'lictypes': {'Academic'}, 'NAM': 1, 'INS': 2, 'URL': 1, 'PUB': 1}, 'Permissive': {'lictypes': {'BSD', 'Apache', 'Unlimited', 'Artistic', 'MIT'}, 'NAM': 5, 'PUB': 9, 'URL': 5, 'INS': 1}, 'Copyleft': {'lictypes': {'GPL', 'LGPL'}, 'PUB': 4, 'PRO': 2, 'INS': 1, 'NAM': 2}, 'Unknown': {'lictypes': {'Unknown', 'Unknown (SaaS)'}, 'PUB': 6, 'URL': 3, 'NAM': 6, 'PRO': 1}}
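
The same tally can also be expressed with pandas operations instead of an explicit loop. The following is an alternative sketch, not the method used above: `LICENSE_CLASS` restates the license groups from the `results` dictionary as a flat lookup, `count_mentions` is a hypothetical helper, and `demo` is made-up data used only to show the shape of the result.

```python
import pandas as pd

# License-to-class lookup restating the groups in the `results` dictionary.
LICENSE_CLASS = {
    "Closed": "Closed", "Academic": "Academic",
    "Apache": "Permissive", "Artistic": "Permissive", "BSD": "Permissive",
    "MIT": "Permissive", "Unlimited": "Permissive",
    "GPL": "Copyleft", "LGPL": "Copyleft",
    "Unknown": "Unknown", "Unknown (SaaS)": "Unknown",
}


def count_mentions(frame):
    """Cross-tabulate license class against mention type, one count per label."""
    tidy = frame.dropna(subset=["Software License", "Mention Type"]).copy()
    tidy["License Class"] = tidy["Software License"].map(LICENSE_CLASS)
    # "PUB, URL" becomes two rows so that each label is counted once.
    tidy["Mention Type"] = tidy["Mention Type"].str.split(",")
    tidy = tidy.explode("Mention Type").reset_index(drop=True)
    tidy["Mention Type"] = tidy["Mention Type"].str.strip()
    return pd.crosstab(tidy["License Class"], tidy["Mention Type"])


# Made-up rows, for illustration only.
demo = pd.DataFrame({
    "Software License": ["Closed", "Apache", "Apache", "GPL"],
    "Mention Type": ["INS", "PUB, URL", "NAM", "PUB"],
})
table = count_mentions(demo)
print(table)
```

Unlike the substring test in `cluster_mentions`, this version splits on commas, so it would also count any code outside the seven expected ones rather than silently ignoring it.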


In [80]:
df2 = pd.DataFrame(results).fillna(0).transpose()
print(df2)


                                           lictypes INS NAM PRO URL PUB
Closed                                     {Closed}  13  17   1   4   2
Academic                                 {Academic}   2   1   0   1   1
Permissive  {BSD, Apache, Unlimited, Artistic, MIT}   1   5   0   5   9
Copyleft                                {GPL, LGPL}   1   2   2   0   4
Unknown                   {Unknown, Unknown (SaaS)}   0   6   1   3   6


In [81]:
df2 = df2.drop(['lictypes'], axis=1)


In [83]:
df2['Total'] = df2.sum(axis=1)
print(df2)

           INS NAM PRO URL PUB  Total
Closed      13  17   1   4   2   37.0
Academic     2   1   0   1   1    5.0
Permissive   1   5   0   5   9   20.0
Copyleft     1   2   2   0   4    9.0
Unknown      0   6   1   3   6   16.0
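
Because the license classes differ widely in size, raw counts are hard to compare. One possible next step, sketched here as an assumption rather than as part of the original analysis, is to convert each row to shares of its total. The counts are copied from the table above. A mention tagged with two codes contributes to two columns, which is consistent with the Closed row summing to 37 although only 34 mentions carry a Closed license.

```python
import pandas as pd

# Counts copied from the table above (rows: license class, columns: mention type).
counts = pd.DataFrame(
    {
        "INS": [13, 2, 1, 1, 0],
        "NAM": [17, 1, 5, 2, 6],
        "PRO": [1, 0, 0, 2, 1],
        "URL": [4, 1, 5, 0, 3],
        "PUB": [2, 1, 9, 4, 6],
    },
    index=["Closed", "Academic", "Permissive", "Copyleft", "Unknown"],
)

# Divide each row by its own total so that every row sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0).round(2)
print(shares)
```

With classes as small as 5 or 9 labels, these shares are descriptive only and should be read with caution.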
