# CORD-19 Software Mentions - Access Study

## Relation between citation characteristics and accessibility for analysis (RQ2)

To understand research software projects better, for example to determine
which qualities and characteristics make them successful under a given
definition of success, help make them sustainable, or underlie their collapse,
we need access to their source code repositories.
We wanted to find out which characteristics of software citations enable this access,
and how these characteristics relate to adherence to the software citation principles.

Some observable qualities of software mentions and citations can be linked to
adherence to the software citation principles:
- citing software follows the *Importance* principle, whereas not mentioning software that was used violates it;
- identifying authors in references follows the *Credit and attribution* principle;
- providing access to a persistently archived version of source code in citations or references follows the *Persistence* principle;
- providing access to the source code more generally follows the *Accessibility* principle, which is the one that this question is most concerned with;
- providing version information for software that was used follows the *Specificity* principle.

The results of this part of the study provide evidence on how suitable
the software citation principles are for making software accessible.

## Methodology

SD manually annotated the dataset of 80 software mentions, classifying each
mention for adherence to the software citation principles and for whether
the source code of the mentioned software can be accessed directly or
indirectly.
The table below lists the annotation codes for
mention features that were actually found in the dataset. The dataset
itself is available as *CSM\_sampled\_mention\_access.csv*.

  | Code | Description |
  | :--- | :--- |
  | VER | Version information in reference |
  | CVER | Version information near mention |
  | CRE | Information in reference allowing for personal credit |
  | REP | Link to community repository in reference |
  | CACC | Link to source code near mention |

> Table: Annotations for accessibility
  and adherence to the software citation principles, for which
  respective mention features could be found in our dataset.
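
The later cells split annotation cells on commas, so a single cell may combine several of the codes above. As a minimal sketch of how such a cell can be read, the snippet below expands a cell into the table's descriptions. The helper name `describe_annotations` and the sample cell are hypothetical and not part of the study's code.

```python
# Descriptions copied from the annotation table above
ANNOTATION_CODES = {
    'VER': 'Version information in reference',
    'CVER': 'Version information near mention',
    'CRE': 'Information in reference allowing for personal credit',
    'REP': 'Link to community repository in reference',
    'CACC': 'Link to source code near mention',
}

def describe_annotations(cell):
    '''Expand a comma-separated annotation cell into a list of descriptions.'''
    codes = [code.strip() for code in cell.split(',')]
    # Unknown codes are reported rather than silently dropped
    return [ANNOTATION_CODES.get(code, 'Unknown code: ' + code) for code in codes]

# Hypothetical cell value, NOT taken from the real dataset
for description in describe_annotations('CRE, VER'):
    print(description)
```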

In [None]:
# Import dependencies
import pandas as pd
import numpy as np

We read the access dataset as a pandas dataframe, and print it as a sanity check.

In [None]:
df = pd.read_csv(r'../data/access_study/CSM_sampled_mention_access.csv', encoding='unicode_escape', engine='python', index_col=False).fillna(0)
print(df)
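
Beyond printing the dataframe, it can help to confirm that the columns used by the later cells are present. The sketch below does this with the standard library on a small in-memory CSV. The helper name `missing_columns`, the sample rows, and the mention type values in them are hypothetical; only the three column names come from this notebook.

```python
import csv
import io

# Columns that later cells of this notebook read from the dataset
REQUIRED_COLUMNS = {'Access', 'Accessibility and Principledness', 'Mention Type'}

def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    '''Return the sorted list of required columns absent from the CSV header.'''
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return sorted(required - header)

# Hypothetical sample data, NOT taken from the real dataset
sample_csv = (
    'Access,Accessibility and Principledness,Mention Type\n'
    'D,"CRE, VER",type A\n'
    'N,N,type B\n'
)

print(missing_columns(sample_csv))
```

An empty list means that all expected columns were found.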

### Accessibility

We print the totals for each access type annotation.

The annotations are in column *Access*, where
*D* means *direct access is possible from the mention*,
*I* means _**in**direct access is possible from the mention_ and
*N* means *no access is possible from the mention*.

We also convert the raw values to printable column headers.

In [None]:
access = df['Access']
access = access.str.replace('N', 'No access')
access = access.str.replace('D', 'Direct access')
access = access.str.replace('I', 'Indirect access')
access_totals = access.value_counts()
access_totals

We calculate the percentages for each of the totals.

In [None]:
access_percent = access.value_counts(normalize=True)
access_percent100 = access_percent.mul(100).round(1).astype(str)
access_df = pd.DataFrame({'No. of mentions': access_totals, '% of mentions': access_percent100})
access_df
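
As a cross-check that does not depend on pandas, the same totals and percentages can be computed with the standard library. This is a minimal sketch under the assumption that every cell holds exactly one of the codes *D*, *I* or *N*. The helper name `access_summary` and the sample codes are hypothetical. Looking each code up in a dictionary relabels it in a single pass, so the result does not depend on the order of replacements.

```python
from collections import Counter

# Labels for the raw access codes, as described above
ACCESS_LABELS = {'D': 'Direct access', 'I': 'Indirect access', 'N': 'No access'}

def access_summary(codes):
    '''Map each raw code to its label, then return {label: (count, percent)}.'''
    counts = Counter(ACCESS_LABELS[code.strip()] for code in codes)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}

# Hypothetical codes, NOT the real annotations
sample_codes = ['D', 'N', 'I', 'N', 'D', 'N', 'N', 'I']
summary = access_summary(sample_codes)
for label, (n, pct) in sorted(summary.items()):
    print(f'{label}: {n} ({pct}%)')
```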

We convert this new dataframe to LaTeX and print it.

In [None]:
print(access_df.to_latex())

### Adherence to principles

We look at the mention characteristics that we expect to find when the software citation principles are adhered to.

We cannot detect the omission of mentions for software that was used in reported research, but we can detect the existence of
- author names that allow for credit;
- a link or identifier for a persistently archived version of the software that was used;
- a link that allows us to access the source code;
- a version identifier.

The dataset contains annotations for instances where we found this information:

In [None]:
expanded_annos = {
    'CRE': 'Creditable author information in reference',
    'CACC': 'Link for access to source code near mention',
    'CVER': 'Version information near mention',
    'VER': 'Version information in reference',
    'REP': 'Link to community repository in reference',
    'N': 'No information in adherence to principles',
}

# Render as LaTeX
print(pd.DataFrame(expanded_annos, index = ['Description']).transpose().to_latex(column_format='rl'))

Now count the occurrence of annotations.

In [None]:
# Create data structure with clustering information
clusters = {
    'Author information': {'annos': {'CRE'}},
    'Software links': {'annos': {'CACC', 'REP'}},
    'Version information': {'annos': {'VER', 'CVER'}},
    'No information': {'annos': {'N'}}
}

# Cluster the annotations
for i in df.index:
    for category in clusters:
        raw_val = df['Accessibility and Principledness'][i]
        # Split and strip potentially comma-separated annotations
        vals = raw_val.split(',')
        vals = [v.strip() for v in vals]
        for val in vals:
            if val in clusters[category]['annos']:
                # Increment the count if it already exists, or create the initial count
                if val in clusters[category]:
                    clusters[category][val] += 1
                else:
                    clusters[category][val] = 1

clusters
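
The counting logic above can also be sketched with `collections.Counter`, which keeps the split, strip and count steps in one small function. This is an illustrative alternative, not the code used for the study. The names `CATEGORY_OF` and `count_annotations` and the sample cells are hypothetical; the category membership mirrors the `annos` sets above.

```python
from collections import Counter

# Category membership, mirroring the `annos` sets used above
CATEGORY_OF = {
    'CRE': 'Author information',
    'CACC': 'Software links',
    'REP': 'Software links',
    'VER': 'Version information',
    'CVER': 'Version information',
    'N': 'No information',
}

def count_annotations(cells):
    '''Count single annotations and category totals over comma-separated cells.'''
    per_annotation = Counter()
    per_category = Counter()
    for cell in cells:
        for anno in (part.strip() for part in cell.split(',')):
            # Ignore anything that is not a known annotation code
            if anno in CATEGORY_OF:
                per_annotation[anno] += 1
                per_category[CATEGORY_OF[anno]] += 1
    return per_annotation, per_category

# Hypothetical cells, NOT taken from the real dataset
sample_cells = ['CRE, VER', 'N', 'CVER', 'CACC,REP', 'N']
per_annotation, per_category = count_annotations(sample_cells)
print(dict(per_annotation))
print(dict(per_category))
```

The per-category counter corresponds to a category total, that is, the sum of the counts of the annotations in that category.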

Create a new dataframe to hold the clusters, and sum up the occurrences of the single features in each cluster category.

In [None]:
# NaN values in dataframe should be filled with zeroes, and dataframe should be transposed
df2 = pd.DataFrame(clusters).fillna(0).transpose()
# Drop the unneeded column showing the annotations belonging to a category
df2 = df2.drop(['annos'], axis=1)
# Add a column showing the category total, summing up the counts of the single subcategories
df2.insert(0, 'Total', df2.sum(axis=1, skipna=True).astype(int))
df2

In [None]:
# Render the new dataframe as a LaTeX table
print(df2.to_latex(index=False, column_format='rccccccc'))

Render this more nicely as a stacked bar plot.

In [None]:
%matplotlib inline
# Render the percentage data as a nice stacked bar chart
import matplotlib
matplotlib.use("pgf")
matplotlib.rcParams.update(
    {
        # Adjust to your LaTeX engine
        "pgf.texsystem": "pdflatex",
        "font.family": "serif",
        "text.usetex": True,
        "pgf.rcfonts": False,
        "axes.unicode_minus": False,
    }
)
import matplotlib.pyplot as plt

# Sort by category total in ascending order, so that the largest category ends up at the top of the horizontal bar chart
df2.sort_values(by=['Total'], inplace=True, ascending=True)

# Collect columns to render (exclude Total)
plot_cols = [col for col in df2.columns.tolist() if col not in ['Total']]

# Colourblind-friendly colours adapted from https://gist.github.com/thriveth/8560036
my_colors = ['#4daf4a', '#f781bf', '#e41a1c', '#984ea3', '#999999', '#a65628']

# Create the plot
print(df2[plot_cols])
ax = df2[plot_cols].plot(kind='barh', stacked=True, figsize=(9,3), color=my_colors)
plt.tight_layout()

# Add a title and rotate the x-axis labels to be horizontal
plt.title('Mention features by annotation')
plt.xticks(rotation=0, ha='center')
plt.xlabel('No. of mentions in our sample')

for c in ax.containers:
    print(c.datavalues)
    ax.bar_label(c, labels = ['' if v == 0 else int(v) for v in c.datavalues], label_type='center')
    
# Save the plot
# Fixes cropped labels
plt.tight_layout()
# Save as pgf
plt.savefig('mentions-by-annotation.pgf')
plt.show()

Let's see how the features are distributed over mention types.

In [None]:
# We need to expand a) comma-separated mention types, and then b) comma-separated features
def explode_list(val):
    '''Explodes a comma-separated string list into a list'''
    vals = val.split(',')
    vals = [v.strip() for v in vals]
    return vals

# Data structure
typ_cols = ['cat', 'subcat', 'feature', 'count']
typ_df = pd.DataFrame(columns=typ_cols)

# Re-cluster, but this time also record mention type
# We need a new map from features to mention types, and from single annotations to mention types
for i in df.index:
    acc_raw = df['Accessibility and Principledness'][i]
    acc_vals = explode_list(acc_raw)
    typ_raw = df['Mention Type'][i]
    typ_vals = explode_list(typ_raw)
    for acc in acc_vals:
        for category in clusters:
            if acc in clusters[category]['annos']:
                for typ in typ_vals:
                    # Set up filters
                    mf_1 = typ_df['cat'] == category
                    mf_2 = typ_df['subcat'] == acc
                    mf_3 = typ_df['feature'] == typ
                    # Get the index of the (at most one) row matching this filter
                    indices = typ_df.index[mf_1 & mf_2 & mf_3].tolist()
                    assert len(indices) in [0, 1]
                    if not indices:
                        # First occurrence of this combination: append a new row
                        typ_df.loc[len(typ_df.index)] = [category, acc, typ, 1]
                    else:
                        # Combination already recorded: increment its count
                        typ_df.at[indices[0], 'count'] += 1
                    
typ_df.sort_values(by=['cat', 'subcat', 'feature'], inplace=True)
multi_df = typ_df.set_index(['cat', 'subcat', 'feature', 'count'])
plot_df = typ_df.sort_values(by=['feature', 'cat', 'subcat', 'count'])
multi_df
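
The row-by-row bookkeeping above amounts to counting (category, annotation, mention type) combinations. A minimal sketch of the same idea with `collections.Counter` follows. The names `split_list` and `count_by_type`, the sample rows, and the mention type values `type A` and `type B` are hypothetical and do not come from the dataset.

```python
from collections import Counter

def split_list(val):
    '''Split a comma-separated string into a list of stripped items.'''
    return [v.strip() for v in val.split(',')]

def count_by_type(rows, category_of):
    '''Count (category, annotation, mention type) combinations.

    rows: iterable of (annotation cell, mention type cell) string pairs.
    category_of: dict mapping each annotation code to its category.
    '''
    counts = Counter()
    for annos_cell, types_cell in rows:
        for anno in split_list(annos_cell):
            category = category_of.get(anno)
            if category is None:
                # Skip codes that do not belong to any category
                continue
            for typ in split_list(types_cell):
                counts[(category, anno, typ)] += 1
    return counts

# Hypothetical data, NOT taken from the real dataset
category_of = {
    'CRE': 'Author information',
    'VER': 'Version information',
    'CVER': 'Version information',
}
sample_rows = [
    ('CRE, VER', 'type A'),
    ('CVER', 'type A, type B'),
    ('CRE', 'type A'),
]
type_counts = count_by_type(sample_rows, category_of)
for key, n in sorted(type_counts.items()):
    print(key, n)
```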

In [None]:
# Prepare a set of mention types
types = set()
for values in df['Mention Type'].unique():
    types.update(explode_list(values))
# Prepare a set of accessibility annotations
annos = set()
for values in df['Accessibility and Principledness'].unique():
    annos.update(explode_list(values))

In [None]:
%matplotlib inline
matplotlib.rcParams['axes.prop_cycle'] = matplotlib.cycler(color=my_colors) 
plot_df = plot_df.reindex(columns=['feature', 'cat', 'subcat', 'count'])

import copy

typ_cluster = {
    'Author information': {'CRE': {}},
    'Software links': {'CACC': {}, 'REP': {}},
    'Version information': {'VER': {}, 'CVER': {}},
    'No information': {'N': {}}
}

# Per-type counts; separate names keep `clusters` and `df` from the earlier cells intact,
# so that those cells can be re-run after this one
type_clusters = {}

for typ in types:
    _cl = copy.deepcopy(typ_cluster)
    type_rows = plot_df.loc[plot_df['feature'] == typ]
    for _, row in type_rows.iterrows():
        _cl[row['cat']][row['subcat']] = row['count']
    type_clusters[typ] = _cl

fig, axes = plt.subplots(nrows = 5, sharex = True, figsize=(10,14))
plt.suptitle('Distribution of mention features over mention types', y=1.01, fontsize='x-large')

color_dict = {'CRE': my_colors[0], 'CACC': my_colors[1], 'REP': my_colors[2], 'VER': my_colors[3], 'CVER': my_colors[4], 'N': my_colors[5]}

for i, cl in enumerate(type_clusters.items()):
    typ = cl[0]
    _df = pd.DataFrame(cl[1]).transpose()
    # Keep the Axes returned by pandas in a local variable
    sub_ax = _df.plot(kind='barh', 
                      stacked=True, 
                      color=color_dict,
                      ax=axes[i])
    sub_ax.set_title('Mention features for ' + typ, y=1.03, pad=-14, fontsize='small')
    plt.xticks(rotation=0, ha='center')
    plt.xlabel('Number of features found for mention types.', fontsize='large')

    # Label each bar segment with its count, leaving empty segments unlabelled
    for c in sub_ax.containers:
        sub_ax.bar_label(c, labels = ['' if v == 0 else int(v) for v in c.datavalues], label_type='center')

# TODO Move subplot titles, place legend and title better
for ax in fig.axes:
    ax.get_legend().remove()

# fig.legend(ncol=5, bbox_to_anchor=(0, 1), loc='lower left', fontsize='small', *zip(*unique))
from matplotlib.patches import Patch
legend_elements = []
for anno in annos:
    legend_elements.append(Patch(facecolor=color_dict[anno], label=anno))

# Add the shared legend to the figure
fig.legend(handles=legend_elements, ncol=6, 
           bbox_to_anchor=(.5, .97),
           loc='lower left', 
           fontsize='small')

plt.tight_layout()
plt.savefig('features-over-mentions.pgf')
plt.show()