# CORD-19 Software counting
This notebook is designated to count software mentions based on the CORD19 dataset from: 

Wade, Alex D.; Williams, Ivana (2021), CORD-19 Software Mentions, Dryad, Dataset, https://doi.org/10.5061/dryad.vmcvdncs0

First, relevant packages must be imported into the notebook.

In [1]:
import numpy as np
import pandas as pd
import csv
import ast
import collections
import matplotlib.pyplot as plt
import Levenshtein as lev
from fuzzywuzzy import fuzz 

Get the data and save it to a variable. 

In [2]:
CORD19_CSV = pd.read_csv('../data/cord-19/CORD19_software_mentions.csv' , converters={'software': lambda x: x[1:-1].split(',')})

Show the head of the dataset to inspect all columns and obtain a broad overview. 

In [3]:
CORD19_CSV.head(20)

Unnamed: 0,paper_id,doi,title,source_x,license,publish_time,journal,url,software
0,00006903b396d50cc0037fed39916d57d50ee801,,Urban green space and happiness in developed c...,ArXiv,arxiv,2021-01-04,,https://arxiv.org/pdf/2101.00807v1.pdf,['Google Street View']
1,0000fcce604204b1b9d876dc073eb529eb5ce305,10.1016/j.regg.2021.01.002,La Geriatría de Enlace con residencias en la é...,Elsevier; PMC,els-covid,2021-01-13,Rev Esp Geriatr Gerontol,https://api.elsevier.com/content/article/pii/S...,['SEGG']
2,000122a9a774ec76fa35ec0c0f6734e7e8d0c541,10.1016/j.rec.2020.08.002,Impact of COVID-19 on ST-segment elevation myo...,Elsevier; Medline; PMC,no-cc,2020-09-08,Rev Esp Cardiol (Engl Ed),https://api.elsevier.com/content/article/pii/S...,"['STATA', 'IAMCEST']"
3,0001418189999fea7f7cbe3e82703d71c85a6fe5,10.1016/j.vetmic.2006.11.026,Absence of surface expression of feline infect...,Elsevier; Medline; PMC,no-cc,2007-03-31,Vet Microbiol,https://www.sciencedirect.com/science/article/...,['SPSS']
4,00033d5a12240a8684cfe943954132b43434cf48,10.3390/v12080849,Detection of Severe Acute Respiratory Syndrome...,Medline; PMC,cc-by,2020-08-04,Viruses,https://www.ncbi.nlm.nih.gov/pubmed/32759673/;...,"['R', 'MassARRAY Typer Analyzer']"
5,00035ac98d8bc38fbca02a1cc957f55141af67c0,10.3389/fpsyt.2020.559701,The Psychological Pressures of Breast Cancer P...,Medline; PMC,cc-by,2020-12-15,Front Psychiatry,https://doi.org/10.3389/fpsyt.2020.559701; htt...,"['Wechat', 'SPSS Statistics']"
6,00039b94e6cb7609ecbddee1755314bcfeb77faa,10.1111/j.1365-2249.2004.02415.x,Plasma inflammatory cytokines and chemokines i...,Medline; PMC,bronze-oa,2004-04-01,Clinical & Experimental Immunology,https://onlinelibrary.wiley.com/doi/pdfdirect/...,['Statistical Package for Social Sciences (SPS...
7,0004456994f6c1d5db7327990386d33c01cff32a,10.1186/1471-2334-10-8,Seasonal influenza risk in hospital healthcare...,PMC,cc-by,2010-01-12,BMC Infect Dis,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,"['STATA', 'STATA', 'Statacorp']"
8,00073cb65dd2596249230fab8b15a71c4a135895,10.1086/605034,Risk Parameters of Fulminant Acute Respiratory...,Medline; PMC,no-cc,2009-08-01,J Infect Dis,https://doi.org/10.1086/605034; https://www.nc...,"['SPSS', 'SPSS']"
9,0007f972812bb45abbe5b0edf8db5359d49c23eb,10.1186/s42234-020-00057-1,The role of nicotinic receptors in SARS-CoV-2 ...,Medline; PMC,cc-by,2020-10-28,Bioelectron Med,https://www.ncbi.nlm.nih.gov/pubmed/33292872/;...,"['geNorm', 'GraphPad Prism', 'GraphPad', 'C..."


The dataset contains nine different columns. Thusly, the next lines of this notebook explore the column "software". Therefore, the column software will be saved to an object.

In [4]:
software = CORD19_CSV.software
software

0                    ['Google Street View']
1                                  ['SEGG']
2                     ['STATA',  'IAMCEST']
3                                  ['SPSS']
4        ['R',  'MassARRAY Typer Analyzer']
                        ...                
77443                          ['UpToDate']
77444                   ['SALib',  'Panda']
77445                ['Prism',  'GraphPad']
77446    ['R package circular',  'R',  'R']
77447       ['GRAM',  'R studio',  'Stata']
Name: software, Length: 77448, dtype: object

In [5]:
len_software_multiple_entries = len(software)
len_software_multiple_entries

77448

The software object contains 77448 rows. In each row, there are software mentions. Some rows contain more than one mention. For instance, row two and four have two mentions. Consequently, the object needs to be transformed into an object which contains solely one software entry per row.

In [6]:
software = software.explode(ignore_index = True)

Remove the brackets around the software mentions. 

In [7]:
software = software.str.replace('\'', '')

Control the software object and inspect if each row contains only one entry. 

In [8]:
software

0         Google Street View
1                       SEGG
2                      STATA
3                    IAMCEST
4                       SPSS
                 ...        
558787                     R
558788                     R
558789                  GRAM
558790              R studio
558791                 Stata
Name: software, Length: 558792, dtype: object

In [9]:
len_software_single_entries = len(software)
len_software_single_entries

558792

Now, the object contains solely one entry per row and has a length of 558792.

In [10]:
average_entries_per_row = len_software_single_entries/len_software_multiple_entries
average_entries_per_row

7.215060427641773

Due to the alignment of the software object, it can be obtained that the dataset contains on average 7.2 software mentions per row. Furthermore, the value_counts function will be used to minimise the number of rows by checking identical duplicates.

In [11]:
software.value_counts(dropna=False)

 R                8389
 SPSS             4738
SPSS              4472
 BLAST            3166
 Excel            2666
                  ... 
 EFS Questback       1
AzureSpot            1
 Antigenic           1
AVL BOOST            1
Mystery Miner        1
Name: software, Length: 120264, dtype: int64

The function value_counts reduced the number of rows to 120279. Nevertheless, the dtype counts same software mentions as distinct. For instance, "SPSS" is listed twice with two varied numbers. For this, the datatype will be converted to a dictionary to check for possible empty spaces.

In [12]:
software_dict = software.to_dict()
software_dict

{0: 'Google Street View',
 1: 'SEGG',
 2: 'STATA',
 3: ' IAMCEST',
 4: 'SPSS',
 5: 'R',
 6: ' MassARRAY Typer Analyzer',
 7: 'Wechat',
 8: ' SPSS Statistics',
 9: 'Statistical Package for Social Sciences (SPSS)',
 10: ' BD CBA',
 11: 'STATA',
 12: ' STATA',
 13: ' Statacorp',
 14: 'SPSS',
 15: ' SPSS',
 16: 'geNorm',
 17: ' GraphPad Prism',
 18: ' GraphPad',
 19: ' Cellranger',
 20: ' R',
 21: ' Seurat',
 22: ' ggplot2',
 23: ' LinRegPCR',
 24: 'GramA',
 25: 'R package edgeR',
 26: ' R package edgeR',
 27: ' R package edgeR',
 28: ' STAR',
 29: ' FastQC',
 30: ' R package ALDEx2',
 31: ' ImageJ',
 32: ' PfAlbas',
 33: ' SAM',
 34: 'MORO Praxis',
 35: ' MORO',
 36: 'R2HC',
 37: 'Singapour',
 38: 'PRESET',
 39: 'Google Trends (GT',
 40: ' GT',
 41: 'SINUS',
 42: ' VICTIMES',
 43: 'Spirocall',
 44: ' LIBSVM',
 45: ' MATLAB voicebox',
 46: ' openS',
 47: ' MILE',
 48: 'DP',
 49: ' DP',
 50: ' DSGVO',
 51: ' iOS',
 52: ' DSGVO',
 53: ' PEPP',
 54: ' PT',
 55: ' DP',
 56: ' 3T',
 57: 'REDCap

The dictionary contains mentions with space at the first position. 
This means that the data is not cleaned properly for the value_counts() function. Therefore, the function remove_empty_spaces(d) takes a dictionary and removes spaces at the first position of a string.

In [13]:
def remove_empty_spaces(dic):
    """ Function removing an empty space at the first position of a string. 
    """
    for i in dic:
        if dic[i][:1] == " ":
            dic[i] = dic[i].strip()
    return dic

In [14]:
software_dict = remove_empty_spaces(software_dict)
software_dict

{0: 'Google Street View',
 1: 'SEGG',
 2: 'STATA',
 3: 'IAMCEST',
 4: 'SPSS',
 5: 'R',
 6: 'MassARRAY Typer Analyzer',
 7: 'Wechat',
 8: 'SPSS Statistics',
 9: 'Statistical Package for Social Sciences (SPSS)',
 10: 'BD CBA',
 11: 'STATA',
 12: 'STATA',
 13: 'Statacorp',
 14: 'SPSS',
 15: 'SPSS',
 16: 'geNorm',
 17: 'GraphPad Prism',
 18: 'GraphPad',
 19: 'Cellranger',
 20: 'R',
 21: 'Seurat',
 22: 'ggplot2',
 23: 'LinRegPCR',
 24: 'GramA',
 25: 'R package edgeR',
 26: 'R package edgeR',
 27: 'R package edgeR',
 28: 'STAR',
 29: 'FastQC',
 30: 'R package ALDEx2',
 31: 'ImageJ',
 32: 'PfAlbas',
 33: 'SAM',
 34: 'MORO Praxis',
 35: 'MORO',
 36: 'R2HC',
 37: 'Singapour',
 38: 'PRESET',
 39: 'Google Trends (GT',
 40: 'GT',
 41: 'SINUS',
 42: 'VICTIMES',
 43: 'Spirocall',
 44: 'LIBSVM',
 45: 'MATLAB voicebox',
 46: 'openS',
 47: 'MILE',
 48: 'DP',
 49: 'DP',
 50: 'DSGVO',
 51: 'iOS',
 52: 'DSGVO',
 53: 'PEPP',
 54: 'PT',
 55: 'DP',
 56: '3T',
 57: 'REDCapbased',
 58: 'REDCap',
 59: 'Research

Now, the software mentions within the dictionary do not contain empty spaces at the first position of the string. For the use of value_counts(), the dictionary is converted to a pandas series.

In [15]:
software_series = pd.Series(software_dict)
software_series.value_counts()

R                      10805
SPSS                    9210
GraphPad Prism          3986
Excel                   3856
BLAST                   3674
                       ...  
Clarity integration        1
pathJC                     1
ANCOVA Global Test         1
sound                      1
Ophthalmology Match        1
Length: 102709, dtype: int64

Due to the removal of empty spaces, the length of the object decreased. To further minimise the length of the object, all strings will be capitalized.

In [16]:
software_dict = software_series.to_dict()

In [17]:
def capitalize_mentions(dic):
    """ Function iterating a dictionary and capitalizing all strings.
    """
    for i in dic:
        dic[i] = dic[i].upper()
    return dic

In [18]:
software_dict = capitalize_mentions(software_dict)
software_dict

{0: 'GOOGLE STREET VIEW',
 1: 'SEGG',
 2: 'STATA',
 3: 'IAMCEST',
 4: 'SPSS',
 5: 'R',
 6: 'MASSARRAY TYPER ANALYZER',
 7: 'WECHAT',
 8: 'SPSS STATISTICS',
 9: 'STATISTICAL PACKAGE FOR SOCIAL SCIENCES (SPSS)',
 10: 'BD CBA',
 11: 'STATA',
 12: 'STATA',
 13: 'STATACORP',
 14: 'SPSS',
 15: 'SPSS',
 16: 'GENORM',
 17: 'GRAPHPAD PRISM',
 18: 'GRAPHPAD',
 19: 'CELLRANGER',
 20: 'R',
 21: 'SEURAT',
 22: 'GGPLOT2',
 23: 'LINREGPCR',
 24: 'GRAMA',
 25: 'R PACKAGE EDGER',
 26: 'R PACKAGE EDGER',
 27: 'R PACKAGE EDGER',
 28: 'STAR',
 29: 'FASTQC',
 30: 'R PACKAGE ALDEX2',
 31: 'IMAGEJ',
 32: 'PFALBAS',
 33: 'SAM',
 34: 'MORO PRAXIS',
 35: 'MORO',
 36: 'R2HC',
 37: 'SINGAPOUR',
 38: 'PRESET',
 39: 'GOOGLE TRENDS (GT',
 40: 'GT',
 41: 'SINUS',
 42: 'VICTIMES',
 43: 'SPIROCALL',
 44: 'LIBSVM',
 45: 'MATLAB VOICEBOX',
 46: 'OPENS',
 47: 'MILE',
 48: 'DP',
 49: 'DP',
 50: 'DSGVO',
 51: 'IOS',
 52: 'DSGVO',
 53: 'PEPP',
 54: 'PT',
 55: 'DP',
 56: '3T',
 57: 'REDCAPBASED',
 58: 'REDCAP',
 59: 'RESEARCH

In [19]:
software_series = pd.Series(software_dict)
software_series = software_series.value_counts()
software_series

R                            10805
SPSS                          9229
GRAPHPAD PRISM                4461
EXCEL                         4054
BLAST                         3943
                             ...  
FPC R / CRAN                     1
HCA BROWSER                      1
NETSTAP                          1
POCOVIDSCREEN                    1
NEUROINTERVENTIONAL SUITE        1
Length: 89462, dtype: int64

Due to the capitalization of software mentions, the length of the object could be decreased. Subsequently, the fuzzy-wuzzy compare algorithm will be introduced. This algorithm is based on Levenshtein which checks the similarity of strings by various aspects.

In [20]:
def fuzzy_ratio_compare(str1, str2, th):
    """ Function to compare to strings based on a given threeshold. 
    """
    ratio = fuzz.ratio(str1, str2)
    if(ratio > th):
        return True
    else:
        return False
    
def fuzzy_partial_ratio_compare(str1, str2, th):
    """ Function to compare to strings based on a given threeshold. 
    """
    ratio = fuzz.partial_ratio(str1, str2)
    if(ratio > th):
        return True
    else:
        return False
    
def fuzzy_token_sort_ratio_compare(str1, str2, th):
    """ Function to compare to strings based on a given threeshold. 
    """
    ratio = fuzz.token_sort_ratio(str1, str2)
    if(ratio > th):
        return True
    else:
        return False

def fuzzy_token_set_ratio_compare(str1, str2, th):
    """ Function to compare to strings based on a given threeshold. 
    """
    ratio = fuzz.token_set_ratio(str1, str2)
    if(ratio > th):
        return True
    else:
        return False

Due to performance reasons, further investigation will be conducted with a subset of the initial series based on a limit. For this, the successive analysis will focus on the 10% of the data that have the most mentions. Therefore, the corresponding amount of 1/10 of the dataset is calculated. 

In [21]:
ten_percent_position = int(round(len(software_series)*0.1, 0))
ten_percent_position

8946

The lowest mention within the 10% of most common software is considered for the limit. 

In [22]:
lowest_mention_of_ten_percent = software_series[ten_percent_position]
lowest_mention_of_ten_percent

9

As nine is the lowest mention within the 10% of most common software, all mentions with nine or more matches are picked for further analysis. 

In [23]:
software_series_shaped = software_series[software_series >= lowest_mention_of_ten_percent]
selection_limit = len(software_series_shaped)
software_series_shaped

R                 10805
SPSS               9229
GRAPHPAD PRISM     4461
EXCEL              4054
BLAST              3943
                  ...  
TESSERACT             9
3DRNA                 9
PEPSEQ                9
TBC2TARGET            9
YELP                  9
Length: 9351, dtype: int64

Converting the series to a DataFrame for comparison purposes. The index of the DataFrame is required for selecting rows. 

In [24]:
ts = software_series_shaped.to_frame()
list_soft = []
list_matches = [0]
for i in range(len(ts)):
    list_soft.append(software_series_shaped.index[i])
    list_matches.append(software_series_shaped[i])
df_shaped = pd.DataFrame()
#Insert the column software and matches
df_shaped['Software'] = list_soft
df_shaped['Matches'] = list_matches[1:]
#Sort the DataFrame by numeric by matches and then alphabetical by software
df_shaped = df_shaped.sort_values(by=['Matches', "Software"])
#Reverse the Dataframe to present the most common software at index position 0 
df_shaped = df_shaped[::-1]  
df_shaped.reset_index(drop = True) 

Unnamed: 0,Software,Matches
0,R,10805
1,SPSS,9229
2,GRAPHPAD PRISM,4461
3,EXCEL,4054
4,BLAST,3943
...,...,...
9346,3DRNA,9
9347,2DST,9
9348,- MASK,9
9349,- CNN,9


Replacing special characters to prevent the error "unterminated subpattern" at a later stage of this notebook.

In [25]:
df_shaped['Software'] = df_shaped.Software.str.replace('(','')
df_shaped['Software'] = df_shaped.Software.str.replace(')','')

The following function compares software mentions based on the fuzzy-wuzzy method. As a result, a blacklist with identified duplicates and a modified DataFrame is returned.

In [26]:
def unify_dataframe(df, th):
    """Match software mentions based on fuzzywuzzy algorithm
    """
    df_holder = df
    blacklist = set()
    dict_matches = df.Matches
    for i in range(len(df)):
        for j in range(i + 1, len(df)):
            if df['Software'][i] not in blacklist:
                if(fuzzy_token_set_ratio_compare(df['Software'][i], df['Software'][j], th)):
                    print("Position: "+str(i)+"/"+str(len(df))+" -> '"+df['Software'][i]+"' with " + str(df['Matches'][i]) +
                          " mentions matched with '" + df['Software'][j] + "' mentioned " + str(df['Matches'][j])+ " times")
                    df.iloc[i, df.columns.get_loc('Matches')] = int(df['Matches'].loc[i] + df['Matches'].loc[j])
                    blacklist.add(df['Software'][j])
    df_holder = df
    return df_holder, blacklist

In [27]:
%%time
threeshold = 84
df_returned = unify_dataframe(df_shaped, threeshold)
df_unified = df_returned[0]
blacklist = df_returned[1]

Position: 0/9351 -> 'R' with 10805 mentions matched with 'R PACKAGE' mentioned 313 times
Position: 0/9351 -> 'R' with 11118 mentions matched with 'R FOUNDATION FOR STATISTICAL COMPUTING' mentioned 280 times
Position: 0/9351 -> 'R' with 11398 mentions matched with 'R CORE TEAM' mentioned 207 times
Position: 0/9351 -> 'R' with 11605 mentions matched with 'R STUDIO' mentioned 148 times
Position: 0/9351 -> 'R' with 11753 mentions matched with 'R DEVELOPMENT CORE TEAM' mentioned 86 times
Position: 0/9351 -> 'R' with 11839 mentions matched with 'R SCRIPT' mentioned 84 times
Position: 0/9351 -> 'R' with 11923 mentions matched with 'MASK R - CNN' mentioned 71 times
Position: 0/9351 -> 'R' with 11994 mentions matched with 'R FOUNDATION FOR STATISTICAL' mentioned 63 times
Position: 0/9351 -> 'R' with 12057 mentions matched with 'R CORE' mentioned 62 times
Position: 0/9351 -> 'R' with 12119 mentions matched with 'R PROJECT FOR STATISTICAL COMPUTING' mentioned 57 times
Position: 0/9351 -> 'R' with

Position: 2/9351 -> 'GRAPHPAD PRISM' with 8335 mentions matched with 'PRISM GRAPHPAD' mentioned 48 times
Position: 2/9351 -> 'GRAPHPAD PRISM' with 8383 mentions matched with 'GRAPHPAD PRISM5' mentioned 27 times
Position: 2/9351 -> 'GRAPHPAD PRISM' with 8410 mentions matched with 'GRAPHPAD PRIZM' mentioned 19 times
Position: 2/9351 -> 'GRAPHPAD PRISM' with 8429 mentions matched with 'GRAPHPAD PRISM®' mentioned 19 times
Position: 2/9351 -> 'GRAPHPAD PRISM' with 8448 mentions matched with 'GRAPHPAD PRISM6' mentioned 17 times
Position: 2/9351 -> 'GRAPHPAD PRISM' with 8465 mentions matched with 'GRAPHPADPRISM' mentioned 14 times
Position: 2/9351 -> 'GRAPHPAD PRISM' with 8479 mentions matched with 'GRAPHPAD PRISM7' mentioned 11 times
Position: 2/9351 -> 'GRAPHPAD PRISM' with 8490 mentions matched with 'GRAPHPAD PRISM8' mentioned 9 times
Position: 3/9351 -> 'EXCEL' with 4054 mentions matched with 'MS EXCEL' mentioned 155 times
Position: 3/9351 -> 'EXCEL' with 4209 mentions matched with 'EXCEL

Position: 14/9351 -> 'PYTHON' with 1599 mentions matched with 'PYTHON PACKAGE' mentioned 11 times
Position: 14/9351 -> 'PYTHON' with 1610 mentions matched with 'LEARN PYTHON' mentioned 10 times
Position: 16/9351 -> 'GISAID' with 1429 mentions matched with 'GISAID EPICOV' mentioned 13 times
Position: 17/9351 -> 'PYMOL' with 1256 mentions matched with 'PYMOL MOLECULAR GRAPHICS SYSTEM' mentioned 105 times
Position: 18/9351 -> 'REDCAP' with 1242 mentions matched with 'RESEARCH ELECTRONIC DATA CAPTURE REDCAP' mentioned 106 times
Position: 18/9351 -> 'REDCAP' with 1348 mentions matched with 'REDCAP RESEARCH ELECTRONIC DATA CAPTURE' mentioned 69 times
Position: 18/9351 -> 'REDCAP' with 1417 mentions matched with 'REDCAP®' mentioned 26 times
Position: 18/9351 -> 'REDCAP' with 1443 mentions matched with 'RECAP' mentioned 11 times
Position: 18/9351 -> 'REDCAP' with 1454 mentions matched with 'HREDCAP' mentioned 10 times
Position: 20/9351 -> 'GOOGLE TRENDS' with 1194 mentions matched with 'GOOGLE

Position: 58/9351 -> 'CHIMERA' with 1127 mentions matched with 'CHIMERAX' mentioned 34 times
Position: 58/9351 -> 'CHIMERA' with 1161 mentions matched with 'UCSF CHIMERA PACKAGE' mentioned 11 times
Position: 58/9351 -> 'CHIMERA' with 1172 mentions matched with 'USCF CHIMERA' mentioned 10 times
Position: 60/9351 -> 'CLUSTAL OMEGA' with 496 mentions matched with 'CLUSTAL W' mentioned 385 times
Position: 60/9351 -> 'CLUSTAL OMEGA' with 881 mentions matched with 'CLUSTAL' mentioned 231 times
Position: 60/9351 -> 'CLUSTAL OMEGA' with 1112 mentions matched with 'CLUSTAL X' mentioned 162 times
Position: 60/9351 -> 'CLUSTAL OMEGA' with 1274 mentions matched with 'OMEGA' mentioned 73 times
Position: 60/9351 -> 'CLUSTAL OMEGA' with 1347 mentions matched with 'CLUSTAL V' mentioned 19 times
Position: 60/9351 -> 'CLUSTAL OMEGA' with 1366 mentions matched with 'CLUSTALOMEGA' mentioned 15 times
Position: 61/9351 -> 'MEDCALC' with 495 mentions matched with 'MEDCALC®' mentioned 11 times
Position: 63/93

Position: 99/9351 -> 'STAR' with 637 mentions matched with 'FLOWJO STAR' mentioned 9 times
Position: 99/9351 -> 'STAR' with 646 mentions matched with 'STARS' mentioned 9 times
Position: 101/9351 -> 'HADDOCK' with 358 mentions matched with 'HADDOCK2' mentioned 10 times
Position: 103/9351 -> 'VMD' with 352 mentions matched with 'VISUAL MOLECULAR DYNAMICS VMD' mentioned 47 times
Position: 104/9351 -> 'MODELLER' with 348 mentions matched with 'MODELER' mentioned 50 times
Position: 106/9351 -> 'ROBERTA' with 347 mentions matched with 'ROBETTA' mentioned 60 times
Position: 106/9351 -> 'ROBERTA' with 407 mentions matched with 'ROBERT' mentioned 37 times
Position: 106/9351 -> 'ROBERTA' with 444 mentions matched with 'PROBERTA' mentioned 28 times
Position: 107/9351 -> 'DOCK' with 338 mentions matched with 'ZDOCK' mentioned 129 times
Position: 107/9351 -> 'DOCK' with 467 mentions matched with 'HDOCK' mentioned 69 times
Position: 107/9351 -> 'DOCK' with 536 mentions matched with 'DOC' mentioned 6

Position: 146/9351 -> 'TMHMM' with 297 mentions matched with 'TMHMM SERVER' mentioned 20 times
Position: 146/9351 -> 'TMHMM' with 297 mentions matched with 'TMHMM2' mentioned 14 times
Position: 147/9351 -> 'REVIEW MANAGER' with 284 mentions matched with 'REVIEW MANAGER REVMAN' mentioned 31 times
Position: 147/9351 -> 'REVIEW MANAGER' with 315 mentions matched with 'REVIEW' mentioned 12 times
Position: 147/9351 -> 'REVIEW MANAGER' with 327 mentions matched with 'COCHRANE REVIEW MANAGER' mentioned 10 times
Position: 147/9351 -> 'REVIEW MANAGER' with 337 mentions matched with 'REVIEW MANAGER REVMAN' mentioned 10 times
Position: 149/9351 -> 'TRACETOGETHER' with 281 mentions matched with 'ABTRACETOGETHER' mentioned 13 times
Position: 152/9351 -> 'UMAP' with 280 mentions matched with 'MAP' mentioned 90 times
Position: 153/9351 -> 'STATISTICA' with 280 mentions matched with 'STATISTA' mentioned 108 times
Position: 153/9351 -> 'STATISTICA' with 280 mentions matched with 'STATISTICS' mentioned 

Position: 200/9351 -> 'CNN' with 229 mentions matched with 'MASK R - CNN' mentioned 71 times
Position: 200/9351 -> 'CNN' with 300 mentions matched with 'RCNN' mentioned 25 times
Position: 200/9351 -> 'CNN' with 325 mentions matched with 'CONN' mentioned 23 times
Position: 200/9351 -> 'CNN' with 348 mentions matched with 'FASTER R - CNN' mentioned 22 times
Position: 200/9351 -> 'CNN' with 370 mentions matched with 'CANN' mentioned 14 times
Position: 200/9351 -> 'CNN' with 384 mentions matched with 'MASK R CNN' mentioned 11 times
Position: 200/9351 -> 'CNN' with 395 mentions matched with '- CNN' mentioned 9 times
Position: 205/9351 -> 'SPARK' with 227 mentions matched with 'APACHE SPARK' mentioned 82 times
Position: 205/9351 -> 'SPARK' with 227 mentions matched with 'SPARKY' mentioned 28 times
Position: 205/9351 -> 'SPARK' with 227 mentions matched with 'SPARKS' mentioned 21 times
Position: 205/9351 -> 'SPARK' with 227 mentions matched with 'SPARK SQL' mentioned 18 times
Position: 205/93

Position: 229/9351 -> 'GOOGLE SEARCH' with 214 mentions matched with 'GOOGLE®' mentioned 26 times
Position: 229/9351 -> 'GOOGLE SEARCH' with 214 mentions matched with 'GOOGLE DATASET SEARCH' mentioned 14 times
Position: 229/9351 -> 'GOOGLE SEARCH' with 214 mentions matched with '- SEARCH' mentioned 10 times
Position: 231/9351 -> 'ARIMA' with 211 mentions matched with 'SARIMA' mentioned 137 times
Position: 231/9351 -> 'ARIMA' with 211 mentions matched with 'ARIA' mentioned 25 times
Position: 231/9351 -> 'ARIMA' with 211 mentions matched with 'WARIMA' mentioned 12 times
Position: 232/9351 -> 'JUPYTER' with 223 mentions matched with 'JUPYTER NOTEBOOK' mentioned 48 times
Position: 232/9351 -> 'JUPYTER' with 223 mentions matched with 'JUPYTER NOTEBOOKS' mentioned 12 times
Position: 234/9351 -> 'MCDA' with 210 mentions matched with 'MDA' mentioned 29 times
Position: 234/9351 -> 'MCDA' with 239 mentions matched with 'MCDA INDEX TOOL' mentioned 18 times
Position: 234/9351 -> 'MCDA' with 257 me

Position: 281/9351 -> 'LAMP' with 182 mentions matched with 'LAMPA' mentioned 88 times
Position: 281/9351 -> 'LAMP' with 182 mentions matched with 'LAP' mentioned 22 times
Position: 281/9351 -> 'LAMP' with 182 mentions matched with 'CLAMP' mentioned 20 times
Position: 281/9351 -> 'LAMP' with 182 mentions matched with 'AMP' mentioned 16 times
Position: 281/9351 -> 'LAMP' with 182 mentions matched with 'QLAMP' mentioned 12 times
Position: 282/9351 -> 'CELLQUEST' with 182 mentions matched with 'CELL QUEST' mentioned 54 times
Position: 282/9351 -> 'CELLQUEST' with 182 mentions matched with 'CELLQUEST PRO' mentioned 47 times
Position: 283/9351 -> 'SEQ' with 194 mentions matched with '3SEQ' mentioned 145 times
Position: 283/9351 -> 'SEQ' with 194 mentions matched with 'GSEQ' mentioned 36 times
Position: 283/9351 -> 'SEQ' with 194 mentions matched with 'SEQ2' mentioned 10 times
Position: 284/9351 -> 'PLINK' with 229 mentions matched with 'JPLINK' mentioned 34 times
Position: 284/9351 -> 'PLIN

Position: 351/9351 -> 'MAPPER' with 157 mentions matched with 'QGRS MAPPER' mentioned 18 times
Position: 351/9351 -> 'MAPPER' with 157 mentions matched with 'KEGG MAPPER' mentioned 11 times
Position: 352/9351 -> 'ATLAS' with 157 mentions matched with 'RIPE ATLAS' mentioned 14 times
Position: 355/9351 -> 'SIFT' with 157 mentions matched with 'SWIFT' mentioned 48 times
Position: 355/9351 -> 'SIFT' with 157 mentions matched with 'SIT' mentioned 16 times
Position: 356/9351 -> 'SLACK' with 157 mentions matched with 'SLAC' mentioned 11 times
Position: 357/9351 -> 'GEN' with 157 mentions matched with 'GENE' mentioned 99 times
Position: 357/9351 -> 'GEN' with 256 mentions matched with 'GENV' mentioned 70 times
Position: 357/9351 -> 'GEN' with 326 mentions matched with 'GEN5' mentioned 40 times
Position: 357/9351 -> 'GEN' with 366 mentions matched with 'PGEN' mentioned 10 times
Position: 357/9351 -> 'GEN' with 376 mentions matched with 'GENO' mentioned 10 times
Position: 358/9351 -> 'YASARA' wi

Position: 425/9351 -> 'REMAP' with 138 mentions matched with 'RESMAP' mentioned 19 times
Position: 426/9351 -> 'SIMEX' with 138 mentions matched with 'IMEX' mentioned 32 times
Position: 432/9351 -> 'SEQUENCHER' with 136 mentions matched with 'SEQUENCERR' mentioned 13 times
Position: 432/9351 -> 'SEQUENCHER' with 136 mentions matched with 'SEQUENCER' mentioned 13 times
Position: 435/9351 -> 'PROTEOME DISCOVERER' with 136 mentions matched with 'PROTEOME' mentioned 18 times
Position: 435/9351 -> 'PROTEOME DISCOVERER' with 136 mentions matched with 'PROTEOME DISCOVER' mentioned 9 times
Position: 436/9351 -> 'VIPR' with 135 mentions matched with 'VIPER' mentioned 68 times
Position: 436/9351 -> 'VIPR' with 203 mentions matched with 'VIR' mentioned 52 times
Position: 436/9351 -> 'VIPR' with 255 mentions matched with 'VIP' mentioned 32 times
Position: 437/9351 -> 'HMMER' with 135 mentions matched with 'HMMER3' mentioned 46 times
Position: 439/9351 -> 'SEQMAN' with 134 mentions matched with 'SE

Position: 515/9351 -> 'PHYRE2' with 118 mentions matched with 'PHYRE' mentioned 52 times
Position: 517/9351 -> 'GOOGLE MEET' with 117 mentions matched with 'GOOGLE' mentioned 42 times
Position: 517/9351 -> 'GOOGLE MEET' with 117 mentions matched with 'GOOGLE®' mentioned 26 times
Position: 517/9351 -> 'GOOGLE MEET' with 117 mentions matched with 'GOOGLE SHEET' mentioned 11 times
Position: 518/9351 -> 'SIDARTHE' with 117 mentions matched with 'SIDARE' mentioned 10 times
Position: 527/9351 -> 'FOLDX' with 116 mentions matched with 'FOLD' mentioned 113 times
Position: 528/9351 -> 'AIM' with 116 mentions matched with 'AIML' mentioned 20 times
Position: 528/9351 -> 'AIM' with 136 mentions matched with 'AIMS' mentioned 14 times
Position: 528/9351 -> 'AIM' with 150 mentions matched with 'AIM QUEST' mentioned 10 times
Position: 532/9351 -> 'MAIL' with 115 mentions matched with 'EMAIL' mentioned 15 times
Position: 532/9351 -> 'MAIL' with 115 mentions matched with 'GMAIL' mentioned 11 times
Posit

Position: 622/9351 -> 'NODE' with 101 mentions matched with 'ODE' mentioned 9 times
Position: 629/9351 -> 'ROS' with 110 mentions matched with 'RODS' mentioned 83 times
Position: 629/9351 -> 'ROS' with 110 mentions matched with 'ROCS' mentioned 20 times
Position: 629/9351 -> 'ROS' with 110 mentions matched with 'ROBOT OPERATING SYSTEM ROS' mentioned 16 times
Position: 629/9351 -> 'ROS' with 110 mentions matched with 'ROSE' mentioned 15 times
Position: 629/9351 -> 'ROS' with 110 mentions matched with 'ROST' mentioned 15 times
Position: 630/9351 -> 'HEALTH' with 101 mentions matched with 'MHEALTH' mentioned 54 times
Position: 630/9351 -> 'HEALTH' with 155 mentions matched with 'ONE HEALTH' mentioned 50 times
Position: 630/9351 -> 'HEALTH' with 205 mentions matched with 'HEALTH ARCADE' mentioned 9 times
Position: 631/9351 -> 'DEAL' with 101 mentions matched with 'EAL' mentioned 62 times
Position: 631/9351 -> 'DEAL' with 101 mentions matched with 'DEA' mentioned 17 times
Position: 631/9351

Position: 693/9351 -> 'SCAN' with 125 mentions matched with 'SCA' mentioned 35 times
Position: 693/9351 -> 'SCAN' with 125 mentions matched with 'SCRAN' mentioned 12 times
Position: 695/9351 -> 'RAST' with 137 mentions matched with 'RAT' mentioned 26 times
Position: 695/9351 -> 'RAST' with 137 mentions matched with 'RAS' mentioned 12 times
Position: 696/9351 -> 'RSEM' with 139 mentions matched with 'SEM' mentioned 46 times
Position: 696/9351 -> 'RSEM' with 139 mentions matched with 'REM' mentioned 11 times
Position: 699/9351 -> 'DOCTOR' with 93 mentions matched with 'DOCKTHOR' mentioned 13 times
Position: 700/9351 -> 'WEBLOGO' with 93 mentions matched with 'WEBLOGO3' mentioned 13 times
Position: 701/9351 -> 'MASK' with 93 mentions matched with 'MASK R - CNN' mentioned 71 times
Position: 701/9351 -> 'MASK' with 93 mentions matched with 'MASK R CNN' mentioned 11 times
Position: 701/9351 -> 'MASK' with 93 mentions matched with 'ASK' mentioned 10 times
Position: 701/9351 -> 'MASK' with 93 

Position: 813/9351 -> 'DATA' with 82 mentions matched with 'OPEN DATA KIT' mentioned 11 times
Position: 813/9351 -> 'DATA' with 82 mentions matched with 'OCTET DATA ANALYSIS' mentioned 11 times
Position: 813/9351 -> 'DATA' with 82 mentions matched with 'GDATA' mentioned 10 times
Position: 814/9351 -> 'MAN' with 81 mentions matched with 'EMAN' mentioned 51 times
Position: 814/9351 -> 'MAN' with 81 mentions matched with 'MAIN' mentioned 11 times
Position: 814/9351 -> 'MAN' with 81 mentions matched with 'SMAN' mentioned 11 times
Position: 816/9351 -> 'LOESS' with 81 mentions matched with 'LOWESS' mentioned 41 times
Position: 818/9351 -> 'LISA' with 81 mentions matched with 'ELISA' mentioned 44 times
Position: 818/9351 -> 'LISA' with 81 mentions matched with 'LIA' mentioned 15 times
Position: 818/9351 -> 'LISA' with 81 mentions matched with 'LSA' mentioned 15 times
Position: 818/9351 -> 'LISA' with 81 mentions matched with 'QLISA' mentioned 9 times
Position: 819/9351 -> 'CLARIVATE ANALYTIC

Position: 925/9351 -> 'DIVA' with 74 mentions matched with 'DIA' mentioned 56 times
Position: 925/9351 -> 'DIVA' with 74 mentions matched with 'FACS DIVA' mentioned 45 times
Position: 925/9351 -> 'DIVA' with 74 mentions matched with 'DIVAS' mentioned 17 times
Position: 925/9351 -> 'DIVA' with 74 mentions matched with 'BD FACS DIVA' mentioned 15 times
Position: 925/9351 -> 'DIVA' with 74 mentions matched with 'IVA' mentioned 11 times
Position: 925/9351 -> 'DIVA' with 74 mentions matched with 'DIV' mentioned 9 times
Position: 927/9351 -> 'NANOSTRING' with 89 mentions matched with 'NANOSTRING NCOUNTER' mentioned 22 times
Position: 929/9351 -> 'GENSIM' with 84 mentions matched with 'AGENTSIM' mentioned 18 times
Position: 930/9351 -> 'INF' with 73 mentions matched with 'TAG RNA INF' mentioned 23 times
Position: 939/9351 -> 'LINE' with 73 mentions matched with 'LIN' mentioned 25 times
Position: 943/9351 -> 'ESMO' with 73 mentions matched with 'ESMOS' mentioned 12 times
Position: 943/9351 -> 

Position: 1055/9351 -> 'NETWORKX' with 67 mentions matched with 'NETWORK' mentioned 36 times
Position: 1059/9351 -> 'UCSC GENOME BROWSER' with 66 mentions matched with 'BROWSER' mentioned 37 times
Position: 1059/9351 -> 'UCSC GENOME BROWSER' with 103 mentions matched with 'ENSEMBL GENOME BROWSER' mentioned 27 times
Position: 1059/9351 -> 'UCSC GENOME BROWSER' with 130 mentions matched with 'GENOME BROWSER' mentioned 16 times
Position: 1059/9351 -> 'UCSC GENOME BROWSER' with 146 mentions matched with 'GENOME' mentioned 14 times
Position: 1060/9351 -> 'GLMNET' with 66 mentions matched with 'R PACKAGE GLMNET' mentioned 10 times
Position: 1061/9351 -> 'MICROSOFT ACADEMIC' with 66 mentions matched with 'MICROSOFT' mentioned 16 times
Position: 1062/9351 -> 'SIGMASTAT' with 82 mentions matched with 'SIGMA STAT' mentioned 24 times
Position: 1066/9351 -> 'CODONW' with 66 mentions matched with 'CODON W' mentioned 13 times
Position: 1069/9351 -> 'GENEMARK' with 66 mentions matched with 'GENEMARKS

Position: 1209/9351 -> 'AF' with 60 mentions matched with 'LAS AF' mentioned 20 times
Position: 1209/9351 -> 'AF' with 60 mentions matched with 'LEICA LAS AF' mentioned 14 times
Position: 1213/9351 -> 'UPTODATE' with 60 mentions matched with 'UPTODATE®' mentioned 16 times
Position: 1214/9351 -> 'LIB' with 60 mentions matched with 'DLIB' mentioned 20 times
Position: 1214/9351 -> 'LIB' with 60 mentions matched with 'ZLIB' mentioned 10 times
Position: 1216/9351 -> 'LEAST' with 60 mentions matched with 'LAST' mentioned 28 times
Position: 1216/9351 -> 'LEAST' with 60 mentions matched with 'EAST' mentioned 12 times
Position: 1220/9351 -> 'BIOGRID' with 60 mentions matched with 'ISOGRID' mentioned 17 times
Position: 1222/9351 -> 'GENESPRING' with 77 mentions matched with 'GENESPRING GX' mentioned 29 times
Position: 1222/9351 -> 'GENESPRING' with 77 mentions matched with 'GENESPRING GX11' mentioned 16 times
Position: 1223/9351 -> 'ANALYSE' with 60 mentions matched with 'ANALYST' mentioned 35 t

Position: 1372/9351 -> 'CAPRI' with 54 mentions matched with 'CAPRI APP' mentioned 13 times
Position: 1374/9351 -> 'AGREE' with 54 mentions matched with 'AGREED' mentioned 11 times
Position: 1377/9351 -> 'SCI' with 54 mentions matched with 'SCIP' mentioned 19 times
Position: 1383/9351 -> 'COVEX' with 54 mentions matched with 'COVE' mentioned 16 times
Position: 1385/9351 -> 'SCENIC' with 66 mentions matched with 'PYSCENIC' mentioned 19 times
Position: 1393/9351 -> 'SEMMT' with 54 mentions matched with 'SMMT' mentioned 15 times
Position: 1396/9351 -> 'MOBILE' with 54 mentions matched with 'MOBILE NET' mentioned 14 times
Position: 1398/9351 -> 'DEMO' with 53 mentions matched with 'DEM' mentioned 35 times
Position: 1400/9351 -> 'CADD' with 53 mentions matched with 'CAD' mentioned 20 times
Position: 1400/9351 -> 'CADD' with 53 mentions matched with 'CADDD' mentioned 12 times
Position: 1401/9351 -> 'PROTEIN PREPARATION WIZARD' with 53 mentions matched with 'PROTEIN' mentioned 35 times
Positi

Position: 1625/9351 -> 'SIMPLE' with 47 mentions matched with 'SIMPLEXA' mentioned 32 times
Position: 1631/9351 -> 'ELECTRE' with 47 mentions matched with 'ELECTRA' mentioned 39 times
Position: 1636/9351 -> 'PEDS' with 47 mentions matched with 'PED' mentioned 16 times
Position: 1636/9351 -> 'PEDS' with 47 mentions matched with '- PEDS' mentioned 11 times
Position: 1637/9351 -> 'MOMO' with 47 mentions matched with 'MOO' mentioned 9 times
Position: 1638/9351 -> 'LIGHTCYCLER' with 47 mentions matched with 'LIGHTCYCLER 480' mentioned 16 times
Position: 1645/9351 -> 'EER' with 47 mentions matched with 'ESER' mentioned 34 times
Position: 1645/9351 -> 'EER' with 47 mentions matched with 'ENER' mentioned 17 times
Position: 1646/9351 -> 'VIRUS' with 46 mentions matched with 'VIRTUS' mentioned 14 times
Position: 1647/9351 -> 'PATHOSCOPE' with 46 mentions matched with 'CLINICAL PATHOSCOPE' mentioned 11 times
Position: 1651/9351 -> 'EMPOWER' with 46 mentions matched with 'MPOWER' mentioned 16 time

Position: 1820/9351 -> 'ION TORRENT' with 42 mentions matched with 'ION' mentioned 32 times
Position: 1826/9351 -> 'FMEA' with 42 mentions matched with 'MEA' mentioned 15 times
Position: 1827/9351 -> 'CAFE' with 42 mentions matched with 'CAFFE' mentioned 29 times
Position: 1827/9351 -> 'CAFE' with 42 mentions matched with 'CAFEM' mentioned 15 times
Position: 1830/9351 -> 'EXPLORER' with 42 mentions matched with 'OSIRIS PROPERTY EXPLORER' mentioned 20 times
Position: 1830/9351 -> 'EXPLORER' with 42 mentions matched with 'INTERNET EXPLORER' mentioned 16 times
Position: 1830/9351 -> 'EXPLORER' with 42 mentions matched with 'GPS EXPLORER' mentioned 13 times
Position: 1830/9351 -> 'EXPLORER' with 42 mentions matched with '- EXPLORER' mentioned 12 times
Position: 1830/9351 -> 'EXPLORER' with 42 mentions matched with 'EXPORTER' mentioned 12 times
Position: 1830/9351 -> 'EXPLORER' with 42 mentions matched with 'EXPLORE' mentioned 11 times
Position: 1830/9351 -> 'EXPLORER' with 42 mentions matc

Position: 2020/9351 -> 'GRFT' with 39 mentions matched with 'GFT' mentioned 36 times
Position: 2022/9351 -> 'SOLVE' with 39 mentions matched with 'SOLVER' mentioned 22 times
Position: 2023/9351 -> 'IGV' with 39 mentions matched with 'INTEGRATIVE GENOMICS VIEWER IGV' mentioned 19 times
Position: 2026/9351 -> 'DEVOPS' with 49 mentions matched with 'AZURE DEVOPS' mentioned 10 times
Position: 2027/9351 -> 'NVIDIA' with 39 mentions matched with 'INVIDIA' mentioned 11 times
Position: 2030/9351 -> 'BOOST' with 59 mentions matched with 'BOOT' mentioned 9 times
Position: 2031/9351 -> 'NEOTREE' with 39 mentions matched with 'NETTREE' mentioned 14 times
Position: 2035/9351 -> 'SEABORN' with 38 mentions matched with 'SEAHORN' mentioned 20 times
Position: 2036/9351 -> 'IDENTIF' with 38 mentions matched with 'IDENTIFY' mentioned 24 times
Position: 2038/9351 -> 'SCALEPACK' with 38 mentions matched with 'SCALAPACK' mentioned 19 times
Position: 2040/9351 -> 'COVIDCARE' with 38 mentions matched with 'CO

Position: 2211/9351 -> 'FCS EXPRESS' with 36 mentions matched with 'EXPRESS' mentioned 26 times
Position: 2211/9351 -> 'FCS EXPRESS' with 36 mentions matched with 'FCS' mentioned 15 times
Position: 2219/9351 -> 'SWOT' with 36 mentions matched with 'WOT' mentioned 9 times
Position: 2221/9351 -> 'VITAL' with 36 mentions matched with 'VITA' mentioned 11 times
Position: 2227/9351 -> 'ALT' with 35 mentions matched with 'ALRT' mentioned 9 times
Position: 2230/9351 -> 'SLICER' with 35 mentions matched with '3D SLICER' mentioned 16 times
Position: 2230/9351 -> 'SLICER' with 35 mentions matched with 'SLICE' mentioned 13 times
Position: 2231/9351 -> 'SEQSCAPE' with 35 mentions matched with 'ESCAPE' mentioned 11 times
Position: 2232/9351 -> 'GEOS' with 35 mentions matched with 'GEO' mentioned 27 times
Position: 2236/9351 -> 'APPROXMC3' with 35 mentions matched with 'APPROXMC' mentioned 22 times
Position: 2237/9351 -> 'ROB' with 35 mentions matched with 'COCHRANE ROB' mentioned 21 times
Position: 

Position: 2427/9351 -> 'VIRGO' with 33 mentions matched with 'VIRO' mentioned 11 times
Position: 2429/9351 -> 'FREESTYLE LIBRE' with 33 mentions matched with 'FREESTYLE' mentioned 12 times
Position: 2431/9351 -> 'UCSF' with 33 mentions matched with 'UCSF CHIMERAX' mentioned 22 times
Position: 2431/9351 -> 'UCSF' with 33 mentions matched with 'UCSF CHIMERA PACKAGE' mentioned 11 times
Position: 2441/9351 -> 'ALL' with 33 mentions matched with 'MALL' mentioned 12 times
Position: 2452/9351 -> 'UNACAST' with 33 mentions matched with 'RNACAST' mentioned 9 times
Position: 2457/9351 -> 'EMRINGER' with 33 mentions matched with 'RINGER' mentioned 18 times
Position: 2459/9351 -> 'SPREAD3' with 44 mentions matched with 'SPREAD' mentioned 16 times
Position: 2460/9351 -> 'BASIC' with 33 mentions matched with 'BASIC LOCAL ALIGNMENT SEARCH TOOL' mentioned 31 times
Position: 2460/9351 -> 'BASIC' with 33 mentions matched with 'VISUAL BASIC' mentioned 24 times
Position: 2460/9351 -> 'BASIC' with 33 menti

Position: 2728/9351 -> 'PARTEK GENOMICS SUITE' with 30 mentions matched with 'SUITE' mentioned 29 times
Position: 2728/9351 -> 'PARTEK GENOMICS SUITE' with 30 mentions matched with 'PARTEK' mentioned 18 times
Position: 2729/9351 -> 'EQA' with 30 mentions matched with 'EQAO' mentioned 17 times
Position: 2732/9351 -> 'CONQUEST' with 30 mentions matched with 'CQUEST' mentioned 27 times
Position: 2734/9351 -> 'DYNAMIC' with 30 mentions matched with 'DYNAMICS' mentioned 13 times
Position: 2739/9351 -> 'SPLASHGUARD' with 47 mentions matched with 'SPLASHGUARD CG' mentioned 12 times
Position: 2740/9351 -> 'FACE' with 57 mentions matched with 'FACEX' mentioned 21 times
Position: 2740/9351 -> 'FACE' with 57 mentions matched with 'FACES' mentioned 15 times
Position: 2740/9351 -> 'FACE' with 57 mentions matched with 'FAC' mentioned 11 times
Position: 2745/9351 -> 'SGA' with 30 mentions matched with 'NSGA' mentioned 18 times
Position: 2745/9351 -> 'SGA' with 30 mentions matched with 'SAGA' mentione

Position: 3026/9351 -> 'ITERATIVE' with 27 mentions matched with 'INTERACTIVE' mentioned 11 times
Position: 3035/9351 -> 'GINA' with 27 mentions matched with 'INA' mentioned 26 times
Position: 3035/9351 -> 'GINA' with 27 mentions matched with 'GIN' mentioned 9 times
Position: 3047/9351 -> 'SEPAR' with 27 mentions matched with 'SEPCAR' mentioned 15 times
Position: 3047/9351 -> 'SEPAR' with 27 mentions matched with 'SPAR' mentioned 12 times
Position: 3058/9351 -> 'CIRCLE' with 27 mentions matched with 'CIRCLIZE' mentioned 15 times
Position: 3059/9351 -> 'SOFIA' with 42 mentions matched with 'SOFA' mentioned 12 times
Position: 3063/9351 -> 'ALRC' with 27 mentions matched with 'ARC' mentioned 26 times
Position: 3064/9351 -> 'EMBL' with 27 mentions matched with 'EMB' mentioned 19 times
Position: 3064/9351 -> 'EMBL' with 27 mentions matched with 'XEMBL' mentioned 11 times
Position: 3070/9351 -> 'MISMIS' with 54 mentions matched with 'MISMS' mentioned 16 times
Position: 3079/9351 -> 'LINKER' 

Position: 3342/9351 -> 'LMFIT' with 25 mentions matched with 'ULMFIT' mentioned 23 times
Position: 3348/9351 -> 'UGENE' with 25 mentions matched with 'UNIPRO UGENE' mentioned 15 times
Position: 3358/9351 -> 'HTCONDOR' with 25 mentions matched with 'CONDOR' mentioned 16 times
Position: 3359/9351 -> 'AUPRRE' with 25 mentions matched with 'MAUPRRE' mentioned 19 times
Position: 3362/9351 -> 'ECE' with 25 mentions matched with 'ECEA' mentioned 15 times
Position: 3369/9351 -> 'TOP' with 25 mentions matched with 'STOP' mentioned 22 times
Position: 3391/9351 -> 'ITSA' with 24 mentions matched with 'TSA' mentioned 22 times
Position: 3391/9351 -> 'ITSA' with 24 mentions matched with 'ITS' mentioned 14 times
Position: 3392/9351 -> 'UNIGEN2' with 24 mentions matched with 'UNIGEN3' mentioned 11 times
Position: 3408/9351 -> 'SKMEANS' with 24 mentions matched with 'KMEANS' mentioned 19 times
Position: 3408/9351 -> 'SKMEANS' with 24 mentions matched with 'CKMEANS' mentioned 9 times
Position: 3422/9351

Position: 3712/9351 -> 'MPILEUP' with 23 mentions matched with 'SAMTOOLS MPILEUP' mentioned 20 times
Position: 3712/9351 -> 'MPILEUP' with 23 mentions matched with 'PILEUP' mentioned 16 times
Position: 3727/9351 -> 'EFSA' with 22 mentions matched with 'ESA' mentioned 14 times
Position: 3744/9351 -> 'PHYLOGENY' with 22 mentions matched with 'NGPHYLOGENY' mentioned 11 times
Position: 3754/9351 -> 'TISM' with 22 mentions matched with 'TIS' mentioned 15 times
Position: 3760/9351 -> 'EQIP' with 22 mentions matched with 'EQUIP' mentioned 11 times
Position: 3763/9351 -> 'AMIRO' with 22 mentions matched with 'MIRO' mentioned 11 times
Position: 3764/9351 -> 'EBSO' with 22 mentions matched with 'BSO' mentioned 20 times
Position: 3764/9351 -> 'EBSO' with 22 mentions matched with 'EBSCO' mentioned 18 times
Position: 3767/9351 -> 'GOOGLE HOME' with 22 mentions matched with 'HOME' mentioned 14 times
Position: 3772/9351 -> 'DECT' with 22 mentions matched with 'DEC' mentioned 21 times
Position: 3774/9

Position: 4263/9351 -> 'GRAAL' with 20 mentions matched with 'GRAL' mentioned 10 times
Position: 4265/9351 -> 'UMI' with 20 mentions matched with 'LUMI' mentioned 9 times
Position: 4266/9351 -> 'PSYCHINFO' with 20 mentions matched with 'PSYCINFO' mentioned 10 times
Position: 4268/9351 -> 'DTW' with 31 mentions matched with 'GDTW' mentioned 9 times
Position: 4274/9351 -> 'OCULUS' with 20 mentions matched with 'OCULUS QUEST' mentioned 13 times
Position: 4274/9351 -> 'OCULUS' with 20 mentions matched with 'OCULUS GO' mentioned 11 times
Position: 4284/9351 -> 'KERNEL' with 37 mentions matched with 'LINUX KERNEL' mentioned 10 times
Position: 4288/9351 -> 'PI' with 20 mentions matched with 'RASPBERRY PI' mentioned 18 times
Position: 4295/9351 -> 'CHROMASPRO' with 20 mentions matched with 'CROMASPRO' mentioned 15 times
Position: 4296/9351 -> 'SYNAPPS' with 20 mentions matched with 'SYNAPSE' mentioned 13 times
Position: 4298/9351 -> 'SPA' with 20 mentions matched with 'STPA' mentioned 18 times

Position: 4686/9351 -> 'GOOGLE GLASS' with 18 mentions matched with 'GLASS' mentioned 18 times
Position: 4726/9351 -> 'GRAPHQL' with 18 mentions matched with 'GRAPHML' mentioned 12 times
Position: 4748/9351 -> 'GLIM' with 18 mentions matched with 'GLM' mentioned 9 times
Position: 4764/9351 -> 'FCAP ARRAY' with 18 mentions matched with 'ARRAY' mentioned 11 times
Position: 4772/9351 -> 'ESENSE' with 18 mentions matched with 'SENSE' mentioned 14 times
Position: 4777/9351 -> 'CARD' with 18 mentions matched with 'CARDS' mentioned 12 times
Position: 4777/9351 -> 'CARD' with 18 mentions matched with 'CHLA CARD' mentioned 10 times
Position: 4793/9351 -> 'SOFTWAREKG' with 18 mentions matched with 'SOFTWARE' mentioned 12 times
Position: 4811/9351 -> 'ESEM' with 18 mentions matched with 'ESM' mentioned 12 times
Position: 4812/9351 -> 'SPLIT' with 18 mentions matched with 'SPLINT' mentioned 11 times
Position: 4835/9351 -> 'PLATO' with 17 mentions matched with 'PLATON' mentioned 10 times
Position: 

Position: 5541/9351 -> 'PARTER' with 15 mentions matched with 'PARTNER' mentioned 12 times
Position: 5545/9351 -> 'SEIARD' with 15 mentions matched with 'SEIRD' mentioned 15 times
Position: 5548/9351 -> 'CHIO' with 15 mentions matched with 'CHI' mentioned 9 times
Position: 5553/9351 -> 'PROTIDE' with 15 mentions matched with 'PROTIDENT' mentioned 10 times
Position: 5559/9351 -> 'FACTEST' with 15 mentions matched with 'FASTEST' mentioned 10 times
Position: 5561/9351 -> 'ADT' with 15 mentions matched with 'AUTODOCK TOOLS ADT' mentioned 15 times
Position: 5565/9351 -> 'MAMI' with 15 mentions matched with 'AMI' mentioned 10 times
Position: 5593/9351 -> 'OFANGBM' with 15 mentions matched with 'FANGBM' mentioned 12 times
Position: 5599/9351 -> 'CYC' with 15 mentions matched with 'CYCL' mentioned 14 times
Position: 5610/9351 -> 'SPI' with 15 mentions matched with 'SPRI' mentioned 13 times
Position: 5630/9351 -> 'GLFORTHEL' with 15 mentions matched with 'FORTHEL' mentioned 9 times
Position: 56

Position: 6476/9351 -> '2D' with 13 mentions matched with 'IMAGE MASTER 2D PLATINUM' mentioned 9 times
Position: 6496/9351 -> 'RTWEET' with 13 mentions matched with 'TWEET' mentioned 10 times
Position: 6514/9351 -> 'HFS' with 13 mentions matched with 'HDFS' mentioned 10 times
Position: 6516/9351 -> 'AEGIS' with 13 mentions matched with 'AEIS' mentioned 9 times
Position: 6543/9351 -> 'FACTT' with 13 mentions matched with '- FACT' mentioned 11 times
Position: 6548/9351 -> 'MPLOC' with 13 mentions matched with 'PLOC' mentioned 9 times
Position: 6596/9351 -> 'ESPEN' with 13 mentions matched with 'ESPN' mentioned 9 times
Position: 6600/9351 -> 'GENOMESTUDIO' with 13 mentions matched with 'ILLUMINA GENOMESTUDIO' mentioned 10 times
Position: 6606/9351 -> 'INS' with 13 mentions matched with 'RINS' mentioned 11 times
Position: 6655/9351 -> 'GENET' with 13 mentions matched with 'GELNET' mentioned 12 times
Position: 6656/9351 -> 'BOSSL' with 13 mentions matched with 'BOSS' mentioned 13 times
Posi

Position: 8495/9351 -> 'PLUMX' with 10 mentions matched with 'PLUM' mentioned 9 times
Position: 8519/9351 -> 'REGRNA' with 9 mentions matched with 'REGRNA2' mentioned 9 times
Position: 8547/9351 -> 'ETE3' with 9 mentions matched with 'ETE' mentioned 9 times
Position: 8564/9351 -> 'MMLA' with 9 mentions matched with 'MLA' mentioned 9 times
Position: 8694/9351 -> 'RJIP' with 9 mentions matched with 'RIP' mentioned 9 times
Position: 8763/9351 -> 'CHEMMINE' with 9 mentions matched with 'CHEMMINER' mentioned 9 times
Position: 8805/9351 -> 'RADX' with 9 mentions matched with 'ADX' mentioned 9 times
Position: 8935/9351 -> 'MATRIXDB' with 9 mentions matched with 'MATRIX' mentioned 9 times
Position: 8966/9351 -> 'ECOSCREEN' with 9 mentions matched with 'VECSCREEN' mentioned 9 times
Position: 8973/9351 -> 'AMBER20' with 9 mentions matched with 'AMBER10' mentioned 9 times
Position: 9220/9351 -> 'HEG' with 9 mentions matched with 'PHEG' mentioned 9 times
Wall time: 1h 23min 8s


Next, the DataFrames presents the new aggregation numbers. 

In [28]:
df_unified

Unnamed: 0,Software,Matches
0,R,13163
1,SPSS,11290
2,GRAPHPAD PRISM,8499
3,EXCEL,4319
4,BLAST,6711
...,...,...
9347,3DRNA,9
8894,2DST,9
9108,- MASK,9
8868,- CNN,9


The blacklist contains the matched duplicates which means that they need to be removed from the DataFrame.

In [29]:
for i in blacklist:
    name_of_index = df_unified[ df_unified['Software'] == i ].index
    df_unified.drop(name_of_index, inplace = True)

For comparison purposes, the DataFrame is sorted in descending order by matches and then alphabetical by software.

In [30]:
df_unified = df_unified.sort_values(by=['Matches', "Software"])
#Reverse the Dataframe to present the most common software at index position 0 
df_unified = df_unified[::-1]  
df_unified

Unnamed: 0,Software,Matches
0,R,13163
1,SPSS,11290
2,GRAPHPAD PRISM,8499
4,BLAST,6711
3,EXCEL,4319
...,...,...
8611,7VINCUT,9
8856,6GCVAE,9
9085,4D,9
9347,3DRNA,9


To verify the removal of duplicates, the length of the DataFrame is outputted.

In [31]:
len_df_unified = len(df_unified)
len_df_unified

7168

For the highest percentile of software mentiones, the implemented algorithm leads to a reduction of approximately 23.4%.

In [32]:
reduction = round((1-len_df_unified/selection_limit) * 100, 2)
reduction

23.35

To investigate the position change of software mentions, the following algorithm compares its index position to the sorted position by matches.

In [33]:
list_change = []
for i in range(len(df_unified)):
    dif = df_unified.index[i]-i
    if(dif > 0):
        list_change.append("+"+str(dif))
    else:
        list_change.append(df_unified.index[i]-i)
df_unified['Change'] = list_change
df_unified.head(20)

Unnamed: 0,Software,Matches,Change
0,R,13163,0
1,SPSS,11290,0
2,GRAPHPAD PRISM,8499,0
4,BLAST,6711,1
3,EXCEL,4319,-1
5,STATA,4048,0
10,MEGA,3428,4
6,SAS,3399,-1
12,IMAGEJ,2779,4
7,MATLAB,2710,-2


Assigning the outcome of this notebook to a new DataFrame for the classification notebook. The outcome is stored on an external file.

In [34]:
df_software_mentions = df_unified
df_software_mentions.to_pickle('software_mentions_CS5099.pkl')