# CORD-19 Software Mentions - Access Study

## Relation between citation characteristics and accessibility for analysis (RQ2)

To understand research software projects better, for example to determine
which qualities and characteristics make them successful under a given
definition of success, help make them sustainable, or underlie their collapse,
we need access to their source code repositories.
We wanted to find out which characteristics of software citations enable this access,
and how these characteristics relate to adherence to the software citation principles.

Some observable qualities of software mentions and citations can be linked to
adherence to the software citation principles:

- citing software follows the *Importance* principle, whereas not mentioning the software at all violates it;
- identifying authors in references follows the *Credit and attribution* principle;
- providing access to a persistently archived version of the source code in citations or references follows the *Persistence* principle;
- providing access to the source code more generally follows the *Accessibility* principle, which is the one this question is most concerned with;
- providing version information for the software that was used follows the *Specificity* principle.

The results of this part of the study provide evidence on how suitable
the software citation principles are for enabling software accessibility.

## Methodology

The dataset of 80 software mentions was manually annotated by SD for
adherence to the software citation principles and for the
possibility of accessing the source code of the mentioned software,
either directly or indirectly.
The table below lists the annotation codes for
mention features that were actually found in the dataset. The dataset
itself is available as *CSM\_sampled\_mention\_access.csv*.

  | Code | Description |
  | :--- | :--- |
  | VER | Version information in reference |
  | CVER | Version information near mention |
  | CRE | Information in reference allowing for personal credit |
  | REP | Link to community repository in reference |
  | CACC | Link to source code near mention |

> Table: Annotations for accessibility
  and adherence to the software citation principles for which
  corresponding mention features were found in our dataset.
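
The link between annotation codes and principles can be sketched as a simple lookup. This is only an illustration: the assignment of each code to a principle below is our reading of the descriptions above and is an assumption, not part of the annotated dataset, and `CODE_TO_PRINCIPLE` and `principles_for` are hypothetical names.

```python
# Assumed mapping from annotation codes to software citation principles.
# The codes come from the table above; the assignment is an illustrative
# assumption (e.g. REP could arguably also relate to Persistence).
CODE_TO_PRINCIPLE = {
    "VER": "Specificity",
    "CVER": "Specificity",
    "CRE": "Credit and attribution",
    "REP": "Accessibility",
    "CACC": "Accessibility",
}


def principles_for(codes):
    """Return the set of principles touched by a list of annotation codes."""
    return {CODE_TO_PRINCIPLE[code] for code in codes if code in CODE_TO_PRINCIPLE}


# A mention with a version near the mention and a link to source code.
example = principles_for(["CVER", "CACC"])
print(sorted(example))
```

Unknown codes are ignored rather than raising an error, so the helper can be applied to annotation strings that also contain other markers.
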

In [None]:
import pandas as pd
import numpy as np

In [None]:
input_data = "../data/license_study/CSM_sampled_mention_license.csv"


In [None]:
# Load the annotated sample, reusing the path stored in input_data.
df = pd.read_csv(input_data)
print(df)

In [None]:
df.groupby(['Software License'])[['ID']].count()

In [None]:
df.groupby(['Software License', 'Mention Type'])[['ID']].count()
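
The same two-way breakdown can also be produced as a contingency table with row and column totals. The following is a minimal sketch on hypothetical toy rows; only the column names `Software License` and `Mention Type` are taken from the notebook, the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the study dataset.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Software License": ["MIT", "MIT", "GPL", "Closed"],
    "Mention Type": ["PUB", "NAM", "PUB", "PUB"],
})

# Cross-tabulate license against mention type, adding a "Total" margin.
license_by_mention = pd.crosstab(
    toy["Software License"],
    toy["Mention Type"],
    margins=True,
    margins_name="Total",
)
print(license_by_mention)
```

Note that this counts each distinct `Mention Type` string as its own category, exactly as the `groupby` call does.
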

In [None]:
# Earlier spreadsheet formula for counting PUB mentions of permissively licensed software, kept for reference:
# SUM(COUNTIFS('License breakdown (old)'!$E$2:$E$69,"*PUB*",'License breakdown (old)'!$D$2:$D$69,{"Apache","Artistic","BSD","MIT","Unlimited"}))

# License classes: each groups the individual license labels found in the dataset.
lictype_closed = {"Closed"}
lictype_academic = {"Academic"}
lictype_permissive = {"Apache","Artistic","BSD","MIT","Unlimited"}
lictype_copyleft = {"GPL","LGPL"}
lictype_unknown = {"Unknown","Unknown (SaaS)"}

# One entry per license class; mention type counters are added to each entry later.
results = {"Closed": {"lictypes": lictype_closed},
           "Academic": {"lictypes": lictype_academic},
           "Permissive": {"lictypes": lictype_permissive},
           "Copyleft": {"lictypes": lictype_copyleft},
           "Unknown": {"lictypes": lictype_unknown}
          }

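def cluster_mentions(mention, res):
    """Increment, in res, the counter of every mention type code contained in mention."""
    mention_types = {"PUB","MAN","PRO","INS","URL","NAM","NOT"}
    # Avoid shadowing the built-in `type` with the loop variable.
    for mention_type in mention_types:
        # Substring test: a mention string that lists several codes counts towards each of them.
        if mention_type in mention:
            res[mention_type] = res.get(mention_type, 0) + 1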

# For every row, find the license class it belongs to and count its mention types there.
for i in df.index:
    for classification in results:
        if df.loc[i, 'Software License'] in results[classification]['lictypes']:
            cluster_mentions(df.loc[i, 'Mention Type'], results[classification])

print(results)
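
The counting loop above can also be written without explicit iteration. The following is a sketch under stated assumptions, not the method used in the study: the toy rows and the `", "` separator in `Mention Type` are hypothetical, and `LICENSE_CLASSES`, `MENTION_TYPES` and `count_mentions_by_class` are illustrative names. Like the loop, it uses a substring test for each mention type code.

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the notebook.
toy = pd.DataFrame({
    "Software License": ["MIT", "GPL", "Closed", "BSD", "Unknown (SaaS)"],
    "Mention Type": ["PUB", "NAM, URL", "MAN", "PUB, URL", "NAM"],
})

# Flat lookup from license label to license class (same grouping as above).
LICENSE_CLASSES = {
    "Closed": "Closed",
    "Academic": "Academic",
    "Apache": "Permissive", "Artistic": "Permissive", "BSD": "Permissive",
    "MIT": "Permissive", "Unlimited": "Permissive",
    "GPL": "Copyleft", "LGPL": "Copyleft",
    "Unknown": "Unknown", "Unknown (SaaS)": "Unknown",
}
MENTION_TYPES = ["PUB", "MAN", "PRO", "INS", "URL", "NAM", "NOT"]


def count_mentions_by_class(frame):
    """Count, per license class, how many rows contain each mention type code."""
    classes = frame["Software License"].map(LICENSE_CLASSES)
    # One boolean column per code: does the mention string contain it?
    flags = pd.DataFrame({
        code: frame["Mention Type"].str.contains(code, regex=False, na=False)
        for code in MENTION_TYPES
    })
    counts = flags.groupby(classes).sum()
    counts["Total"] = counts.sum(axis=1)
    return counts


toy_counts = count_mentions_by_class(toy)
print(toy_counts)
```

Rows whose license label is missing from the lookup are dropped by `groupby`, and classes without any rows (here `Academic`) do not appear in the result, which differs from the dictionary-based version where every class is always present.
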

In [None]:
df2 = pd.DataFrame(results).fillna(0).transpose()
print(df2)


In [None]:
# Drop the helper column of license sets and cast the remaining counts to integers.
df2 = df2.drop(columns=['lictypes']).astype(int)


In [None]:
df2['Total'] = df2.sum(axis=1)
print(df2)