# CORD-19 Software Mentions - Access Study

## Relation between citation characteristics and accessibility for analysis (RQ2)

To understand research software projects better, for example to
determine the qualities and characteristics that make them
successful under a given definition of success, that help make them sustainable,
or that underlie their collapse, we need access to their source code repositories.
We therefore wanted to find out which characteristics of software citations enable this access,
and how they relate to adherence to the software citation principles.

Some observable qualities of software mentions and citations can be linked to
adherence to the software citation principles:
- citing software follows the *Importance* principle, whereas not mentioning software that was used violates it;
- identifying authors in references follows the *Credit and attribution* principle;
- providing access to a persistently archived version of the source code in citations or references follows the *Persistence* principle;
- providing access to the source code more generally follows the *Accessibility* principle, which is the one that this question is most concerned with;
- providing version information for software that was used follows the *Specificity* principle.

The results of this part of the study provide evidence on how well the
software citation principles support access to the cited software.

## Methodology

The dataset of 80 software mentions was manually annotated by SD for
adherence to the software citation principles, and for whether the
source code of the mentioned software can be accessed directly or
indirectly.
The table below lists the annotation codes for the
mention features that were actually found in the dataset. The dataset
itself is available as *CSM\_sampled\_mention\_access.csv*.

| Code | Description |
| :--- | :--- |
| VER | Version information in reference |
| CVER | Version information near mention |
| CRE | Information in reference allowing for personal credit |
| REP | Link to community repository in reference |
| CACC | Link to source code near mention |

> Table: Annotations for accessibility
> and adherence to the software citation principles, for which
> respective mention features could be found in our dataset.

In [None]:
# Import dependencies
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns

# sns.set_palette('colorblind')

We read the access dataset as a pandas dataframe, and print it as a sanity check.

In [None]:
df = pd.read_csv(r'../data/access_study/CSM_sampled_mention_access.csv', encoding='unicode_escape', engine='python', index_col=False).fillna(0)
print(df)
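
Printing the dataframe only shows that the file loaded. A stricter check could assert that the columns read later in this notebook exist and that *Access* holds only the codes *D*, *I* and *N*. The following is a minimal sketch on a small made-up dataframe; `check_access_dataset` is a hypothetical helper, and the toy rows are not taken from the dataset.

```python
import pandas as pd

# Columns that the later cells of this notebook read from the dataset.
REQUIRED_COLUMNS = ['Access', 'Accessibility and Principledness']
# Access codes used in the annotation: Direct, Indirect, No access.
VALID_ACCESS_CODES = {'D', 'I', 'N'}


def check_access_dataset(frame):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for column in REQUIRED_COLUMNS:
        if column not in frame.columns:
            problems.append(f'missing column: {column}')
    if 'Access' in frame.columns:
        found = set(frame['Access'].astype(str).str.strip())
        unexpected = found - VALID_ACCESS_CODES
        if unexpected:
            problems.append(f'unexpected access codes: {sorted(unexpected)}')
    return problems


# Toy data for illustration only (not real rows from the dataset).
toy = pd.DataFrame({
    'Access': ['D', 'I', 'N', 'X'],
    'Accessibility and Principledness': ['CACC', 'CRE, VER', 'N', 'CVER'],
})
print(check_access_dataset(toy))
```

In the notebook itself, the same function could be called on `df` right after loading, so that a renamed column or a stray code is reported before any counting takes place.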

### Accessibility

We print the totals for each access type annotation.

The annotations are in column *Access*, where
*D* means *direct access is possible from the mention*,
*I* means _**in**direct access is possible from the mention_ and
*N* means *no access is possible from the mention*.

We also convert the raw values to printable column headers.

In [None]:
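# Map the raw access codes to printable labels.
# Series.replace with a dict matches whole cell values only, so a letter inside
# an already substituted label (e.g. the "N" in "No access") is never replaced again.
access_labels = {'N': 'No access', 'D': 'Direct access', 'I': 'Indirect access'}
access = df['Access'].replace(access_labels)
access_totals = access.value_counts()
access_totals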

We calculate the percentages for each of the totals.

In [None]:
access_percent = access.value_counts(normalize=True)
access_percent100 = access_percent.mul(100).round(1).astype(str)
access_df = pd.DataFrame({'No. of mentions': access_totals, '% of mentions': access_percent100})
access_df
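
To illustrate how the totals and percentages line up, here is a minimal sketch on four made-up access codes (not real rows from the dataset). Unlike the cell above, it keeps the percentages numeric, which is an assumption about what is convenient for checking, not a change to the notebook's output.

```python
import pandas as pd

# Toy access codes for illustration only (not real rows from the dataset).
toy_access = pd.Series(['D', 'D', 'I', 'N'])

counts = toy_access.value_counts()
# normalize=True divides each count by the number of non-missing values.
shares = toy_access.value_counts(normalize=True).mul(100).round(1)

# Both series are indexed by access code, so pandas aligns them row by row.
summary = pd.DataFrame({'No. of mentions': counts, '% of mentions': shares})
print(summary)
```

Keeping the shares numeric makes it easy to verify that they add up to 100 before they are converted to strings for display.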

We convert this new dataframe to LaTeX and print it.

In [None]:
print(access_df.to_latex())

### Adherence to principles

We look at the mention characteristics that we expect to find when the software citation principles are adhered to.

We cannot detect the omission of mentions for software that was used in reported research, but we can detect the existence of
- author names that allow for credit;
- a link or identifier for a persistently archived version of the software that was used;
- a link that allows us to access the source code;
- a version identifier.

The dataset contains annotations for instances where we found this information:

In [None]:
def create_data_structure():
    # Define the annotations for principledness, i.e., adherence to the software citation principles, 
    # in lists of code, description, occurrence count
    credit_ref = ['CRE', 'Creditable author information in reference', 0]
    access_cit = ['CACC', 'Link for access to source code near mention', 0]
    ver_cit = ['CVER', 'Version information near mention', 0]
    ver_ref = ['VER', 'Version information in reference', 0]
    artifrepo_ref = ['REP', 'Link to community repository in reference', 0]
    no_princ = ['N', 'No information in adherence to principles', 0]
    features = [credit_ref, access_cit, ver_cit, ver_ref, artifrepo_ref, no_princ]
    return features
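
The feature lists can also serve as a lookup table when reading raw annotations. The following is a minimal sketch, assuming annotations are stored as comma-separated codes; it uses a trimmed copy of the list, and `describe` is a hypothetical helper that is not part of the notebook.

```python
# Trimmed copy of the feature list, for illustration: [code, description, count].
features = [
    ['CRE', 'Creditable author information in reference', 0],
    ['CACC', 'Link for access to source code near mention', 0],
    ['N', 'No information in adherence to principles', 0],
]

# Map each annotation code to its description.
descriptions = {code: description for code, description, _count in features}


def describe(raw_annotation):
    """Expand a comma-separated annotation string into descriptions."""
    codes = [code.strip() for code in raw_annotation.split(',')]
    return [descriptions.get(code, f'unknown code: {code}') for code in codes]


print(describe('CRE, CACC'))
```

Codes that are missing from the list are reported as unknown instead of raising an error, which helps to spot typos in manually entered annotations.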

Now we count the occurrences of each annotation.

In [None]:
# Data structure: each cluster lists the annotation codes ('annos') it covers;
# per-code counts are added as further keys while counting.
clusters = {
    'Author information': {'annos': {'CRE'}},
    'Software links': {'annos': {'CACC', 'REP'}},
    'Version information': {'annos': {'VER', 'CVER'}},
    'No information': {'annos': {'N'}}
}

for raw_val in df['Accessibility and Principledness']:
    # A cell holds one or more comma-separated codes; str() guards against
    # non-string cells such as the 0 that fillna(0) puts in place of missing values.
    vals = [v.strip() for v in str(raw_val).split(',')]
    for category in clusters.values():
        for val in vals:
            if val in category['annos']:
                # Increment the counter for this code, starting from 0.
                category[val] = category.get(val, 0) + 1

clusters
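
The same counts could also be obtained without an explicit row loop. The following is a sketch of an alternative, not the method used in this notebook: it splits each cell into single codes with `str.split` and `explode`, counts them with `value_counts`, and then sums the counts per cluster. The toy rows and the name `cluster_codes` are made up for illustration.

```python
import pandas as pd

# Toy annotations for illustration only (not real rows from the dataset).
toy = pd.DataFrame({
    'Accessibility and Principledness': ['CRE, VER', 'CACC', 'N', 'CVER,CRE'],
})

# Same cluster definitions as in the notebook, without the counters.
cluster_codes = {
    'Author information': {'CRE'},
    'Software links': {'CACC', 'REP'},
    'Version information': {'VER', 'CVER'},
    'No information': {'N'},
}

# One row per single annotation code, surrounding whitespace removed.
codes = (
    toy['Accessibility and Principledness']
    .astype(str)
    .str.split(',')
    .explode()
    .str.strip()
)
code_counts = codes.value_counts()

# Sum the per-code counts within each cluster; absent codes count as 0.
cluster_totals = {
    name: int(sum(code_counts.get(code, 0) for code in members))
    for name, members in cluster_codes.items()
}
print(cluster_totals)
```

Comparing such totals against the loop-based result is a cheap way to cross-check the counting logic.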

We create a new dataframe to hold the clusters and, for each cluster category, sum the occurrences of its individual features.

In [None]:
df2 = pd.DataFrame(clusters).fillna(0).transpose()
df2 = df2.drop(['annos'], axis=1)
df2.insert(0, 'Total', df2.sum(axis=1, skipna=True).astype(int))
df2

In [None]:
print(df2.to_latex(index=False, column_format='rccccccc'))