# CORD-19 Software Mentions - Access Study

## Relation between citation characteristics and accessibility for analysis (RQ2)

To understand research software projects better, for example to
determine the qualities and characteristics that make them
successful under a given definition of success, that help make them sustainable,
or that underlie their collapse, we need access to their source code repositories.
We wanted to find out which characteristics of software citations enable this access,
and how they relate to adherence to the software citation principles.

Some observable qualities of software mentions and citations can be linked to
adherence to the software citation principles:
citing software follows the *Importance* principle, whereas
not mentioning software that was used violates it;
identifying authors in references follows the *Credit and attribution* principle;
providing access to a persistently archived version of the source code in citations or references follows the *Persistence* principle;
providing access to the source code more generally follows the *Accessibility* principle,
which is the one that this question is most concerned with;
and providing version information for software that was used follows the *Specificity* principle.

The results of this part of the study provide evidence on how suitable
the software citation principles are for enabling software accessibility.

## Methodology

The dataset of 80 software mentions was manually annotated by SD to
classify each mention for adherence to the software citation principles and for the
possibility of directly or indirectly accessing the source code of the
mentioned software.
The table below lists the annotation codes for
mention features that were actually found in the dataset. The dataset
itself is available as *CSM\_sampled\_mention\_access.csv*.

  | Code | Description |
  | :--- | :--- |
  | VER | Version information in reference |
  | CVER | Version information near mention |
  | CRE | Information in reference allowing for personal credit |
  | REP | Link to community repository in reference |
  | CACC | Link to source code near mention |

> Table: Annotations for accessibility
  and adherence to the software citation principles, for which
  respective mention features could be found in our dataset.
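
Because the code `VER` is a substring of `CVER`, a plain substring test would count every `CVER` annotation as `VER` as well. The following is a minimal sketch of whole-code tokenisation that avoids this. It assumes that one annotation cell may hold several codes separated by commas or whitespace; the example cells and the helper `split_codes` are hypothetical and not taken from the dataset.

```python
import re
from collections import Counter

# Annotation codes from the table above.
KNOWN_CODES = {"VER", "CVER", "CRE", "REP", "CACC"}


def split_codes(cell):
    """Split one annotation cell into whole codes, whatever the delimiter."""
    tokens = re.findall(r"[A-Z]+", str(cell))
    return [tok for tok in tokens if tok in KNOWN_CODES]


# Hypothetical example cells; the real column format may differ.
cells = ["CVER, CACC", "VER CRE", "CVER", "N"]

# Count whole codes only, so that 'CVER' never counts as 'VER'.
counts = Counter(code for cell in cells for code in split_codes(cell))
print(counts)

# The naive substring test, by contrast, matches 'VER' inside 'CVER'.
naive_ver = sum("VER" in cell for cell in cells)
print("naive VER count:", naive_ver, "| whole-code VER count:", counts["VER"])
```

In this sketch, unknown tokens such as `N` are dropped silently; a stricter variant could raise an error instead.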

In [None]:
# Import dependencies
import pandas as pd
import numpy as np
# matplotlib and seaborn are needed by the plotting cell further down
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_palette('colorblind')

We read the access dataset as a pandas dataframe, and print it as a sanity check.

In [None]:
df = pd.read_csv(r'../data/access_study/CSM_sampled_mention_access.csv', encoding='unicode_escape', engine='python')
print(df)
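
Printing the dataframe only shows that the file was read. A stricter sanity check could verify that the *Access* column holds nothing but the three expected codes. The sketch below uses a small hypothetical dataframe named `sample` in place of the real data; only the column name `Access` and the codes *D*, *I* and *N* come from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the annotated dataset (the real file has 80 rows).
sample = pd.DataFrame({"Access": ["D", "I", "N", "N"]})

# Codes expected in the 'Access' column.
allowed = {"D", "I", "N"}

# Any value outside the expected codes, and any missing value, points to an annotation problem.
unexpected = set(sample["Access"].dropna().unique()) - allowed
missing = int(sample["Access"].isna().sum())

print("unexpected codes:", unexpected)
print("missing values:", missing)
```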

We print the totals for each access type annotation.

The annotations are in column *Access*, where
*D* means *direct access is possible from the mention*,
*I* means _**in**direct access is possible from the mention_ and
*N* means *no access is possible from the mention*.

We also convert the raw values to printable column headers.

In [None]:
access = df['Access']
access = access.str.replace('N', 'No access')
access = access.str.replace('D', 'Direct access')
access = access.str.replace('I', 'Indirect access')
access_totals = access.value_counts()
access_totals
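
The chained `str.replace` calls above depend on none of the replacement labels containing a later search letter. As a possible alternative, a single dictionary lookup with `Series.map` translates each code exactly once. This is a sketch on hypothetical codes, not the notebook's data; note that `map` yields `NaN` for any value missing from the dictionary.

```python
import pandas as pd

# Printable labels for the raw access codes.
labels = {"N": "No access", "D": "Direct access", "I": "Indirect access"}

# Hypothetical codes standing in for df['Access'].
codes = pd.Series(["D", "N", "I", "N"])

# Each value is looked up as a whole, so labels cannot interfere with each other.
labelled = codes.map(labels)
label_totals = labelled.value_counts()
print(label_totals)
```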

We calculate the percentages for each of the totals.

In [None]:
access_percent = access.value_counts(normalize=True)
access_percent100 = access_percent.mul(100).round(1).astype(str) + '% of mentions'
access_df = pd.DataFrame({'No. of mentions': access_totals, '%': access_percent100})
access_df
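
If the percentages are needed for further computation, it may be convenient to keep them numeric and add the unit only when the table is displayed. The sketch below shows this on hypothetical labels (eight mentions, not the real 80); both series are aligned on their index when the dataframe is built.

```python
import pandas as pd

# Hypothetical labels: 5 without access, 2 with direct access, 1 with indirect access.
labelled = pd.Series(
    ["No access"] * 5 + ["Direct access"] * 2 + ["Indirect access"] * 1
)

totals = labelled.value_counts()
# Keep the share as a float so it can still be summed or compared.
share = labelled.value_counts(normalize=True).mul(100).round(1)

# Both series share the same index (the labels), so the columns line up.
summary = pd.DataFrame({"No. of mentions": totals, "%": share})
print(summary)
```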

We convert this new dataframe to LaTeX and print it.

In [None]:
print(access_df.to_latex())

In [None]:
# Annotations
# Principledness
credit_ref = 'CRE'
access_cit = 'CACC'
ver_cit = 'CVER'
ver_ref = 'VER'
repo = 'REP'
none = 'N'

# Accessibility
direct = 'D'
indirect = 'I'

# Counters
direct_int = 0
indirect_int = 0
none_acc_int = 0

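for i in df.index:
    # Use a separate name so the `access` Series from the earlier cell is not overwritten
    access_code = df['Access'][i]
    mention_types = df['Mention Type'][i]  # read for the planned breakdowns below; unused so far
    principles = df['Accessibility and Principledness'][i]  # likewise unused so far
    # Compare string values with ==; `is` tests object identity and is unreliable for strings
    if access_code == direct:
        direct_int += 1
        print('Upped direct by 1 to ' + str(direct_int))
    elif access_code == indirect:
        indirect_int += 1
    elif access_code == none:
        none_acc_int += 1

print(str(none_acc_int) + ' ' + str(indirect_int) + ' ' + str(direct_int))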

# Access overview

raw_data = {
    'access': ['direct access', 'indirect access', 'no access'],
    'counts': [direct_int, indirect_int, none_acc_int]
}


fig, ax = plt.subplots(figsize=(5,4))

# add the plot
sns.barplot(x='access', y='counts', data=raw_data, ax=ax)

# add the annotation
ax.bar_label(ax.containers[-1], fmt='%d')

ax.set(ylabel='No. of mentions', xlabel='Accessibility')
plt.savefig('access.pdf')
plt.show()



# sns.barplot(x='access', y='counts', data=raw_data)

# Access by mention type

# Access by principle type

# Access by mention + principle

# Principle by mention type

# def cluster_access_by_mention(mention, res):
#     types = {"PUB","MAN","PRO","INS","URL","NAM","NOT"}    
#     for type in types:
#         if type in mention:
#             if type in res:
#                 res[type] +=1
#                 print("Assigning to ", type, "value is ", res[type])
#             else:
#                 res[type] = 1
#                 print("Assigning to ", type, "value is ",res[type])
# 
# for i in df.index:
#     for classification in results:
#         if df['Accessibility and Principledness'][i] in results[classification]['lictypes']:
#             #print ("Index: ", i, "Software Title: ", df['Title'][i], "License Type: ", classification, "Mention Type: ", df['Mention Type'][i])
#             cluster_mentions(df['Mention Type'][i], results[classification])
#             #print(results)
#             
#             
# print(results)    
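
The counting loop and the planned "access by mention type" breakdown could also be expressed without explicit counters. The following sketch uses `value_counts` and `pd.crosstab` on a small hypothetical dataframe named `toy`. The column names `Access` and `Mention Type` come from this notebook, but the rows are invented, and the sketch assumes one mention type per row, which may not hold for the real data.

```python
import pandas as pd

# Hypothetical rows; the mention type codes are taken from the commented-out code above.
toy = pd.DataFrame({
    "Access": ["D", "N", "I", "N", "D"],
    "Mention Type": ["URL", "NAM", "PUB", "NAM", "URL"],
})

# Totals per access code, in a fixed order, with 0 for codes that do not occur.
access_counts = toy["Access"].value_counts().reindex(["D", "I", "N"], fill_value=0)
print(access_counts)

# Access by mention type: rows are mention types, columns are access codes.
by_mention = pd.crosstab(toy["Mention Type"], toy["Access"])
print(by_mention)
```

If a cell can hold several mention types, the column would first have to be split into one row per type before cross-tabulating.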

In [None]:
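# `results` is only built by the commented-out clustering code above;
# it must be defined (a dict of per-classification count dicts) before this cell runs.
df2 = pd.DataFrame(results).fillna(0).transpose()
print(df2)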


In [None]:
df2 = df2.drop(['lictypes'], axis=1)


In [None]:
df2['Total'] = df2.sum(axis=1)
print(df2)