# 7. Evaluation of results
In this final notebook we propose an evaluation metric to check the performance of our final system on the git track. Since none of the track repositories had a category that could serve as ground truth for our results, the repositories were labelled manually by an external group of people. This labelled data will be used as ground truth.

## Setup

In [1]:
%run __init__.py

In [2]:
import pandas as pd

GIT_FILE_PATH = os.path.join(NOTEBOOK_1_RESULTS_DIR, 'git_dataframe.pkl')

git_df = pd.read_pickle(GIT_FILE_PATH)

In [3]:
import pandas as pd

LABELLED_GIT_FILE_PATH = os.path.join(NOTEBOOK_7_RESULTS_DIR, 'labelled_git_repos.csv')

labelled_git_df = pd.read_csv(LABELLED_GIT_FILE_PATH, sep=';')
labelled_git_df.head()

Unnamed: 0,repo_url,topics
0,https://github.com/cmungall/LIRICAL/,"Java, diagnosis, human phenotype"
1,https://github.com/cmungall/wikidata_ontomatcher,"Prolog, Wikidata, ontology matching"
2,https://github.com/markwilkinson/ro-crate-ruby,"Ruby, RO Crate, Research Objects"
3,https://github.com/markwilkinson/Misc_Training...,database
4,https://github.com/mikel-egana-aranguren/FAIRi...,"FAIR data, OpenRefine"


## Analyzing the categories
We will begin by joining the repositories dataframe with the manually assigned topics, keeping only the repositories that have at least one topic:

In [4]:
import numpy as np

categories_df = pd.concat([git_df, labelled_git_df], axis=1)
categories_df['topics'] = categories_df['topics'].replace('', np.nan)
categories_df.dropna(subset=['topics'], inplace=True)
categories_df.head(n=4)

Unnamed: 0,gh_id,name,description,owner_name,languages,readme_text,issues_text,commits_text,filenames,comments_text,full_text,full_text_cleaned,num_chars_text,repo_url,topics
0,216602979,LIRICAL,LIkelihood Ratio Interpretation of Clinical Ab...,cmungall,"{'Java': 492423, 'FreeMarker': 13149, 'Python'...",LIRICAL. LIkelihood Ratio Interpretation of C...,,Merge pull request #442 from TheJacksonLaborat...,\nCHANGELOG\nREADME\nhoxc13 output\nlirical to...,note that the Jannovar dependency does not nee...,LIkelihood Ratio Interpretation of Clinical Ab...,LIkelihood Ratio Interpretation of Clinical Ab...,3770,https://github.com/cmungall/LIRICAL/,"Java, diagnosis, human phenotype"
1,199330464,wikidata_ontomatcher,Matches ontology classes against wikidata,cmungall,"{'Prolog': 14691, 'Makefile': 1472, 'Dockerfil...",Match an ontology to Wikidata. This applicatio...,Will help with #1 and with https://github.com/...,Adding skos:altLabel\n\nhttps://github.com/cmu...,\nREADME\ninstall\npack\nwikidata ontomatcher\...,,Matches ontology classes against wikidata. Mat...,Matches ontology classes against wikidata. Mat...,519,https://github.com/cmungall/wikidata_ontomatcher,"Prolog, Wikidata, ontology matching"
2,253207181,ro-crate-ruby,"A Ruby gem for creating, manipulating and read...",markwilkinson,"{'Ruby': 52724, 'HTML': 1319}","ro-crate-ruby. This is a WIP gem for creating,...",,Update LICENSE\nBump version\nTidy up and chec...,\n travis\nGemfile\nREADME\nROCrate\nContact P...,*\n * Expands the tree to the target element a...,"A Ruby gem for creating, manipulating and read...","A Ruby gem for creating, manipulating and read...",2559,https://github.com/markwilkinson/ro-crate-ruby,"Ruby, RO Crate, Research Objects"
3,212556220,Misc_Training_scripts,A place for me to keep various miscellanelous ...,markwilkinson,"{'Shell': 15815, 'Ruby': 9445}",Misc_Training_scripts. A place for me to keep ...,,added new cool 3-federated query\nfinished edi...,README\nSpecies Abundance Pub2015\nSpecies Inf...,,A place for me to keep various miscellanelous ...,A place for me to keep various miscellanelous ...,545,https://github.com/markwilkinson/Misc_Training...,database


In [5]:
repos_categories = {str(aid): [t.strip() for t in categories.split(',')]
                    for aid, categories in zip(categories_df['gh_id'].values,
                                               categories_df['topics'].values)}
repos_categories['253207181']

['Ruby', 'RO Crate', 'Research Objects']

As we can see above, each repository is described by a variable-length list of category terms.

In the following cell we clean the categories, removing those that will not be useful for the evaluation of our models: purely numeric terms and terms that have no match in WordNet, which will be used later on to perform the evaluation. Finally, repositories that are left without any category are removed from the final sample:

In [6]:
from herc_common.evaluation import _get_synset


filtered_repos_categories = {
    k: set([el for el in v if not el.isnumeric()
            and _get_synset(el) is not None])
    for k, v in repos_categories.items()
}

final_repos_categories = {
    k: v
    for k, v in filtered_repos_categories.items()
    if len(v) != 0
}

len(final_repos_categories)

48
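
To make the effect of this filter concrete, here is a minimal sketch of the same rule applied to the labels of the first three repositories. It does not call WordNet: `toy_vocabulary` and `has_synset` are hypothetical stand-ins for the `_get_synset(...) is not None` check, and the vocabulary contents are an assumption chosen to mirror the cleaned output for these repositories.

```python
# Hypothetical stand-in for WordNet: the terms assumed to have a synset.
toy_vocabulary = {'java', 'diagnosis', 'prolog', 'ruby'}


def has_synset(term):
    """Stand-in for `_get_synset(term) is not None` (assumption, not the real lookup)."""
    # Multi-word terms are looked up with underscores instead of spaces.
    return '_'.join(term.split(' ')).lower() in toy_vocabulary


def filter_categories(categories_by_repo):
    """Drop numeric or unmatched terms, then drop repositories left with none."""
    filtered = {
        repo_id: {t for t in terms if not t.isnumeric() and has_synset(t)}
        for repo_id, terms in categories_by_repo.items()
    }
    return {repo_id: terms for repo_id, terms in filtered.items() if terms}


sample = {
    '216602979': ['Java', 'diagnosis', 'human phenotype'],
    '199330464': ['Prolog', 'Wikidata', 'ontology matching'],
    '253207181': ['Ruby', 'RO Crate', 'Research Objects'],
    '000000000': ['2020'],  # hypothetical repository with only a numeric label
}
print(filter_categories(sample))
```

Multi-word labels survive only if their underscore-joined form has a match, which is consistent with 'RO Crate' and 'human phenotype' being absent from the cleaned sample.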

In [7]:
final_repos_categories

{'216602979': {'Java', 'diagnosis'},
 '199330464': {'Prolog'},
 '253207181': {'Ruby'},
 '212556220': {'database'},
 '90349931': {'Java'},
 '126633812': {'classification', 'music'},
 '173520377': {'Python'},
 '103798851': {'Algorithms', 'C'},
 '153249816': {'Python'},
 '170129937': {'R', 'evaluation'},
 '57412597': {'R'},
 '46532803': {'C', 'DNA'},
 '95547920': {'Python'},
 '94098106': {'Java'},
 '83849237': {'vim'},
 '83414669': {'teaching'},
 '56680034': {'Python', 'genetics'},
 '231038084': {'R', 'correlation'},
 '14387064': {'Ruby'},
 '212904362': {'mathematics'},
 '113264497': {'reasoner'},
 '260889871': {'Rust', 'algorithms'},
 '161862375': {'Groovy', 'plants', 'plots', 'proteins'},
 '171842501': {'Ruby', 'docker'},
 '257154635': {'HTML', 'Python'},
 '260966843': {'genome', 'graphs'},
 '257545116': {'Docker', 'biomedicine'},
 '150747903': {'Python', 'TERMite'},
 '151696606': {'Java', 'Termite'},
 '126086017': {'Python'},
 '150911020': {'Python'},
 '254755549': {'Python', 'annotati

As we can see above, 48 repositories in the dataframe have at least one category to be compared against.

## Evaluation
For the evaluation of our system we will use WordNet to obtain a semantic similarity score between the topics predicted by our system and those used as ground truth.

Before we can start calculating these similarity scores, we will obtain the topics predicted by our system. First, we will be loading the final pipeline that has been saved in our previous notebook:

In [8]:
import string

import en_core_sci_lg
import en_core_web_md

from collections import Counter

from tqdm import tqdm

en_core_web_md.load()
en_core_sci_lg.load()

<spacy.lang.en.English at 0x1fa6b9ef708>

In [9]:
from herc_common.utils import load_object

final_pipe = load_object(os.path.join(NOTEBOOK_6_RESULTS_DIR, 'final_pipe.pkl'))

Now, we will select the sample of repositories with at least one ground-truth topic, and obtain the output of our system for those repositories:

In [10]:
repos_keys = [int(k) for k in final_repos_categories.keys()]
X = categories_df.set_index('gh_id', inplace=False).loc[repos_keys]['full_text_cleaned'].values

In [11]:
y_base = final_repos_categories.values()
y_pred = final_pipe.transform(X)

In [12]:
y_pred = [[str(topic[0]) for topic in doc] for doc in y_pred]
y_pred[:5]

[['computer science',
  'software',
  'artificial intelligence',
  'interaction science',
  'engineering',
  'automation',
  'statistics'],
 ['Wikidata',
  'online database',
  'knowledge base',
  'semantic wiki',
  'knowledge graph',
  'Linked Open Data cloud diagram',
  'Semantic Web'],
 ['information',
  'abstract object',
  'advertising',
  'data',
  'creative work',
  'level',
  'message'],
 ['species',
  'taxonomic rank',
  'subgenus',
  'rank',
  'subseries',
  'rank',
  'population'],
 ['minute',
  'branch of science',
  'construction',
  'specialty',
  'occupation',
  'education',
  'statistics']]

## Similarity
In this section we will be calculating the similarity scores between the topics inferred by the model and the ones used as ground truth:

In [13]:
import numpy as np

from nltk.corpus import wordnet as wn
from nltk.corpus.reader.wordnet import WordNetError


def compute_similarity_scores(topics_base, topics_pred, similarity_func):
    scores_matrix = get_scores_matrix(topics_base, topics_pred, similarity_func)
    return obtain_associations_scores(scores_matrix)
        

def get_scores_matrix(topics_base, topics_pred, similarity_func):
    sim_measures = []
    for topic_p in topics_pred:
        p_synset = _get_synset(topic_p)
        if p_synset is None:
            array_len = len(topics_base)
            a = np.empty(array_len)
            a[:] = np.nan
            sim_measures.append(a)
            continue

        topic_sim_measures = []
        for topic_b in topics_base:
            b_synset = _get_synset(topic_b)
            if b_synset is None:
                topic_sim_measures.append(np.nan)
                continue
            try:
                similarity = getattr(p_synset, similarity_func)(b_synset)
                topic_sim_measures.append(similarity)
            except WordNetError:
                # comparing synsets with different POS
                topic_sim_measures.append(np.nan)
                continue
        sim_measures.append(topic_sim_measures)
    return np.array(sim_measures)

def obtain_associations_scores(scores_matrix):
    scores_matrix = _remove_nan_rows(scores_matrix)
    scores_matrix = _remove_nan_cols(scores_matrix)
    n = scores_matrix.shape[0]
    m = scores_matrix.shape[1]
    if n < m:
        sim_measures = np.nanmax(scores_matrix, axis=1)
    else:
        sim_measures = np.nanmax(scores_matrix, axis=0)
    return {
        'max similarity': np.max(sim_measures),
        'min similarity': np.min(sim_measures),
        'mean similarity': np.mean(sim_measures),
        'median similarity': np.median(sim_measures)
    }

def _get_synset(word):
    try:
        word = '_'.join(word.split(' '))
        return wn.synsets(word)[0]
    except IndexError:
        return None

def _remove_nan_rows(m):
    return m[~np.all(np.isnan(m), axis=1), :]

def _remove_nan_cols(m):
    return m[:, ~np.all(np.isnan(m), axis=0)]


scores = [compute_similarity_scores(y_b, y_p, 'lch_similarity')
          for y_b, y_p in zip(y_base, y_pred)]
scores[:5]

[{'max similarity': 1.4403615823901665,
  'min similarity': 1.072636802264849,
  'mean similarity': 1.2564991923275077,
  'median similarity': 1.2564991923275077},
 {'max similarity': 1.3350010667323402,
  'min similarity': 1.3350010667323402,
  'mean similarity': 1.3350010667323402,
  'median similarity': 1.3350010667323402},
 {'max similarity': 1.072636802264849,
  'min similarity': 1.072636802264849,
  'mean similarity': 1.072636802264849,
  'median similarity': 1.072636802264849},
 {'max similarity': 1.55814461804655,
  'min similarity': 1.55814461804655,
  'mean similarity': 1.55814461804655,
  'median similarity': 1.55814461804655},
 {'max similarity': 1.3350010667323402,
  'min similarity': 1.3350010667323402,
  'mean similarity': 1.3350010667323402,
  'median similarity': 1.3350010667323402}]
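
The aggregation in `obtain_associations_scores` can be followed on a small hand-made matrix. The sketch below re-implements the same rule with the standard library only; the function name `best_match_summary` and the numbers in `toy_matrix` are made up for illustration. Rows stand for predicted topics, columns for ground-truth topics, and `nan` marks pairs that could not be compared.

```python
import math
from statistics import mean, median

nan = math.nan


def best_match_summary(matrix):
    """Summarise best-match similarities for one repository.

    matrix[i][j] is the similarity between predicted topic i and
    ground-truth topic j; NaN means the pair could not be compared.
    Assumes at least one comparable pair exists.
    """
    # Drop predicted topics (rows) with no comparable ground-truth topic.
    rows = [r for r in matrix if not all(math.isnan(v) for v in r)]
    # Drop ground-truth topics (columns) with no comparable prediction.
    keep = [j for j in range(len(rows[0]))
            if not all(math.isnan(r[j]) for r in rows)]
    rows = [[r[j] for j in keep] for r in rows]

    # Take the best match along the shorter side, so that every topic on
    # that side contributes exactly one score.
    if len(rows) < len(keep):
        lines = rows
    else:
        lines = [list(col) for col in zip(*rows)]
    best = [max(v for v in line if not math.isnan(v)) for line in lines]
    return {'max similarity': max(best), 'min similarity': min(best),
            'mean similarity': mean(best), 'median similarity': median(best)}


toy_matrix = [
    [1.0, nan],
    [nan, nan],  # prediction without a synset: dropped
    [2.0, 0.5],
    [1.5, 3.0],
]
print(best_match_summary(toy_matrix))
```

With three comparable predictions and two ground-truth topics, each ground-truth topic keeps its best-scoring prediction. The same rule explains why repositories with a single ground-truth topic report identical max, min, mean and median values.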

In [14]:
final_similarity = np.mean([score['mean similarity'] for score in scores])
final_similarity

1.31958798808627
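
To put this number in context, it helps to recall how the Leacock-Chodorow score is defined. The sketch below assumes NLTK's formulation, `-log((d + 1) / (2 * D))`, where `d` is the shortest path distance between two synsets and `D` is the maximum depth of the taxonomy. The value `D = 19` for nouns is an assumption, chosen because it reproduces the scores printed above; `lch_from_distance` is a hypothetical helper and not part of NLTK.

```python
import math

NOUN_TAXONOMY_DEPTH = 19  # assumption: consistent with the scores shown above


def lch_from_distance(distance, depth=NOUN_TAXONOMY_DEPTH):
    """Leacock-Chodorow similarity for a given shortest-path distance."""
    return -math.log((distance + 1) / (2.0 * depth))


# Path distances that reproduce the per-repository scores printed earlier.
for distance in (7, 8, 9, 12):
    print(distance, round(lch_from_distance(distance), 6))

# Upper bound of the scale: a synset compared with itself (distance 0).
print(round(lch_from_distance(0), 6))
```

Under these assumptions two identical synsets score about 3.64, so an average of 1.32 suggests that predicted and ground-truth topics are, roughly, nine edges apart in the WordNet hierarchy.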

## Saving the results
Finally, we are going to save the results. First of all, the predictions will be saved to a new dataframe:

In [15]:
cols_subset = ['name', 'topics']

results_df = categories_df.set_index('gh_id', inplace=False).loc[repos_keys][cols_subset]
results_df['Topics Predicted'] = ['\n'.join(topics) for topics in y_pred]
results_df['topics'] = ['\n'.join(topics) for topics in y_base]
results_df.head()

gh_id,name,topics,Topics Predicted
216602979,LIRICAL,Java\ndiagnosis,computer science\nsoftware\nartificial intelli...
199330464,wikidata_ontomatcher,Prolog,Wikidata\nonline database\nknowledge base\nsem...
253207181,ro-crate-ruby,Ruby,information\nabstract object\nadvertising\ndat...
212556220,Misc_Training_scripts,database,species\ntaxonomic rank\nsubgenus\nrank\nsubse...
90349931,elda,Java,minute\nbranch of science\nconstruction\nspeci...


In [16]:
scores_df = pd.DataFrame.from_records(scores)
scores_df.set_index(results_df.index, inplace=True)
scores_df.head()

gh_id,max similarity,min similarity,mean similarity,median similarity
216602979,1.440362,1.072637,1.256499,1.256499
199330464,1.335001,1.335001,1.335001,1.335001
253207181,1.072637,1.072637,1.072637,1.072637
212556220,1.558145,1.558145,1.558145,1.558145
90349931,1.335001,1.335001,1.335001,1.335001


And now the scores obtained for each repository will be joined to the predictions and saved too:

In [17]:
final_df = results_df.join(scores_df)
final_df.head()

gh_id,name,topics,Topics Predicted,max similarity,min similarity,mean similarity,median similarity
216602979,LIRICAL,Java\ndiagnosis,computer science\nsoftware\nartificial intelli...,1.440362,1.072637,1.256499,1.256499
199330464,wikidata_ontomatcher,Prolog,Wikidata\nonline database\nknowledge base\nsem...,1.335001,1.335001,1.335001,1.335001
253207181,ro-crate-ruby,Ruby,information\nabstract object\nadvertising\ndat...,1.072637,1.072637,1.072637,1.072637
212556220,Misc_Training_scripts,database,species\ntaxonomic rank\nsubgenus\nrank\nsubse...,1.558145,1.558145,1.558145,1.558145
90349931,elda,Java,minute\nbranch of science\nconstruction\nspeci...,1.335001,1.335001,1.335001,1.335001


In [18]:
final_df.to_csv(os.path.join(NOTEBOOK_7_RESULTS_DIR, 'repos_scores.csv'))