# CORD-19 Software Mentions mention type comparison

## Comparison of CSM mention types with Howison & Bullard (2015)

We classified by mention type a subset of 80 software mentions, taken from a random sample of 100 software mention candidates from the CORD-19 dataset.
The mention types are those used in Howison & Bullard 2015 (doi:10.1002/asi.23538).
We did not use the following annotations:

- *Cite to users manual*
- *Not even name mentioned*

The mention types we found are listed in a table extracted from the annotated dataset used in the access study.

In [None]:
import pandas as pd

# Read the dataset
df = pd.read_csv(r'../data/access_study/CSM_sampled_mention_access.csv', encoding='unicode_escape', engine='python', index_col=False).fillna(0)
# Get the raw annotations for mention type
# (cast to str so that cells filled with 0 by fillna() can still be split below)
raw_types = df['Mention Type'].astype(str)

# Have a peep at the mention counts
mentions_total = len(raw_types)
print(mentions_total)

# Build a list of single types, i.e., split and strip comma-separated values
types = []
for separated in raw_types:
    vals = separated.split(',')
    vals = [v.strip() for v in vals]
    types.extend(vals)
# Create a dataframe for just the single types, sorted alphabetically
type_df = pd.DataFrame(data=types, columns=['Type']).sort_values(by='Type')
# Create a new dataframe including the counts for the single types
# (naming the series explicitly yields the column 'our' regardless of pandas version;
# since pandas 2.0, value_counts() names its result 'count' rather than 'Type')
counts_df = type_df['Type'].value_counts().rename('our').to_frame()
# Insert the actual types, which are the index right now, into their own column
counts_df.insert(0, 'Type', counts_df.index)
# Keep only the five types compared here, in a fixed order
counts_df = counts_df.reindex(['PUB', 'PRO', 'INS', 'URL', 'NAM'])
counts_df
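
The split-and-count step above can be checked without the CSV file. The following is a minimal sketch using only the standard library. The annotations in `sample_annotations` are made up for illustration and are not taken from the study data. It also shows that the number of mentions and the number of assigned type labels differ as soon as a mention carries more than one type, which matters when choosing a denominator for percentages.

```python
from collections import Counter

# Made-up annotations, one string per mention (NOT taken from the study data)
sample_annotations = ["PUB, URL", "NAM", "INS", "PUB", "NAM, URL"]

# Split each comma-separated value and strip surrounding whitespace
labels = [label.strip()
          for annotation in sample_annotations
          for label in annotation.split(",")]

# Count how often each single type was assigned
label_counts = Counter(labels)

n_mentions = len(sample_annotations)   # one entry per mention
n_labels = sum(label_counts.values())  # one entry per assigned type

print(n_mentions, n_labels)
print(dict(label_counts))
```

In this toy sample, five mentions yield seven type labels, so percentages based on the label total are not percentages of mentions.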

Add the data from Howison & Bullard 2015, Table 1.

In [None]:
hb_mentions = {
    'PUB': 105,
    'MAN': 6, # Citing user manual
    'PRO': 15,
    'INS': 53,
    'URL': 13,
    'NAM': 90,
    'NEN': 4 # Not even name mentioned
}

# We have no evidence for NEN, and no occurrences of MAN, in our sample, so drop these from the H&B data
del hb_mentions['MAN']
del hb_mentions['NEN']

hb_mentions

Add the Howison & Bullard data to the dataframe.

In [None]:
# Our no. of mentions
print('No. of mentions in our sample: ' + str(mentions_total))

# H & B number of mentions
hb_mentions_total = sum(hb_mentions.values())
print('No. of mentions in Howison & Bullard 2015 data: ' + str(hb_mentions_total))

# Add data to dataframe
counts_df['hb'] = counts_df['Type'].map(hb_mentions)
counts_df

Calculate percentages for both datasets, and add respective columns to the dataframe.

In [None]:
counts_df.insert(2, '%our', counts_df['our'] / counts_df['our'].sum() * 100)
counts_df.insert(4, '%hb', counts_df['hb'] / counts_df['hb'].sum() * 100)
counts_df['%our'] = counts_df['%our'].round(decimals = 1)
counts_df['%hb'] = counts_df['%hb'].round(decimals = 1)
counts_df
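
As a cross-check of the percentage step, the Howison & Bullard column can be recomputed by hand. The sketch below uses plain Python rather than pandas and takes the five counts entered earlier in this notebook, after `MAN` and `NEN` were dropped. The names `hb_counts`, `hb_total`, `hb_percent` and `percent_sum` are illustrative and do not appear elsewhere in the notebook.

```python
# Counts as entered earlier in this notebook, after dropping MAN and NEN
hb_counts = {'PUB': 105, 'PRO': 15, 'INS': 53, 'URL': 13, 'NAM': 90}

# Denominator: the sum over the five remaining types
hb_total = sum(hb_counts.values())

# Percentage per type, rounded to one decimal place as in the dataframe
hb_percent = {t: round(n / hb_total * 100, 1) for t, n in hb_counts.items()}

# Rounded percentages need not add up to exactly 100
percent_sum = round(sum(hb_percent.values()), 1)

print(hb_total)
print(hb_percent)
print(percent_sum)
```

The denominator here is 276, not the 286 obtained when `MAN` and `NEN` are included, so the percentages are relative to the five compared types only. Because each value is rounded separately, the rounded percentages sum to 99.9 rather than 100.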

Transpose the dataframe and print it as a LaTeX table.

In [None]:
df_transposed = counts_df.transpose()
print(df_transposed.to_latex())
df_transposed

Create a horizontal stacked bar plot to compare the mention types across the two datasets.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Transpose dataframe
types_df = counts_df.transpose()

# Output table first
tab_df = types_df.drop(['Type'], axis=0)
tab_df = tab_df.rename(index={'our': 'Totals (our sample)', 'hb': 'Totals (Howison & Bullard (2015))', '%our': '% (our sample)', '%hb': '% (Howison & Bullard (2015))'})
print(tab_df.to_latex())

# Drop unneeded data
types_df = types_df.drop(['Type', 'our', 'hb'], axis=0)
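# Convert the remaining rows to float and round (round() returns a new dataframe, so assign it)
types_df = types_df.astype(float).round(1)

# Rename rows (raw string, so that the LaTeX escape \& needed for the PGF output is kept literally)
types_df = types_df.rename(index={'%our': 'Our sample', '%hb': r'Howison \& Bullard (2015)'})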

# Colourblind-friendly colours adapted from https://gist.github.com/thriveth/8560036
my_colors = ['#4daf4a', '#f781bf', '#e41a1c', '#984ea3', '#999999', '#a65628', '#dede00']

# Create the plot
# fig, ax1 = plt.subplots(nrows = 1)
ax = types_df.plot(kind='barh', 
                   stacked=True,
                   figsize=(8, 3), 
                   color=my_colors)
ax.legend(ncol=5, 
          bbox_to_anchor=(0.16, 1),
          loc='lower left', 
          fontsize='small')
ax.set_xlabel('% of mentions')

# Add a title, keep the x-axis tick labels horizontal, and rotate the y-axis tick labels by 45 degrees
plt.title('Comparison of mention types', y=1.2)
plt.xticks(rotation=0, ha='center')
plt.yticks(rotation=45)

# Add value labels to bar sections
for c in ax.containers:
    ax.bar_label(c, label_type='center')
    
# Format, save, and show the plot
plt.tight_layout()
plt.savefig('mention-type-comparison.pgf')
plt.show()