# 7. Evaluation of results
In this final notebook we propose an evaluation metric to check the performance of our final system on the protocols track. Our approach relies on the default categories assigned to each protocol. For a more accurate evaluation of our models, the list of subjects used as ground truth could be generated by a human committee. For the scope of this challenge, however, we will rely on the default subjects present in each protocol and leave human annotation as future work.

## Setup

In [1]:
%run __init__.py

INFO:root:Starting logger


In [2]:
import os

import pandas as pd

PROTOCOLS_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'protocols_dataframe.pkl')

protocols_df = pd.read_pickle(PROTOCOLS_FILE_PATH)

## Analyzing the categories
We will begin by copying the protocols dataframe and keeping only the rows that have a non-empty `categories` field:

In [3]:
import numpy as np

categories_df = protocols_df.copy()
# Treat empty strings as missing values, then drop protocols without categories
categories_df['categories'] = categories_df['categories'].replace('', np.nan)
categories_df = categories_df.dropna(subset=['categories'])
categories_df.head(n=4)

Unnamed: 0,pr_id,title,abstract,materials,procedure,equipment,background,categories,authors,full_text,full_text_no_abstract,full_text_cleaned,full_text_no_abstract_cleaned,num_chars_text
0,e100,Scratch Wound Healing Assay,The scratch wound healing assay has been widel...,"Human MDA-MB-231 cell line (ATCC, catalog numb...",Grow cells in DMEM supplemented with 10% FBS.|...,BD Falcon 24-well tissue culture plate (Fisher...,,Cancer Biology|Invasion & metastasis|Cell biol...,Yanling Chen,Scratch Wound Healing Assay. The scratch wound...,Scratch Wound Healing Assay. Grow cells in DME...,Scratch Wound Healing Assay. The scratch wound...,Scratch Wound Healing Assay. Grow cells in DME...,2583
1,e1029,ADCC Assay Protocol,Antibody-dependent cell-mediated cytotoxicity ...,Raji cells (ATCC)|A/California/04/2009 (H1N1) ...,Preperation of Target Cells\n\n\t\t\n\n\n\t\t\...,Round bottomed 96-well plate|Temperature contr...,,Immunology|Immune cell function|Cytotoxicity|C...,Vikram Srivastava|Zheng Yang|Ivan Fan Ngai...,ADCC Assay Protocol. Antibody-dependent cell-m...,ADCC Assay Protocol. Preperation of Target Cel...,ADCC Assay Protocol. Antibody-dependent cell-m...,ADCC Assay Protocol. Preperation of Target Cel...,3824
2,e1072,Catalase Activity Assay in Candida glabrata,Commensal and pathogenic fungi are exposed to ...,Yeast strains \nNote: BG14 was used as the C. ...,Preparation of total soluble extracts\n\t\t\n\...,Orbital incubator shaker|Microfuge tubes|50 ml...,,Microbiology|Microbial biochemistry|Protein|Ac...,Emmanuel Orta-Zavalza|Marcela Briones-Martin...,Catalase Activity Assay in Candida glabrata. C...,Catalase Activity Assay in Candida glabrata. P...,Catalase Activity Assay in Candida glabrata. C...,Catalase Activity Assay in Candida glabrata. P...,4207
3,e1077,RNA Isolation and Northern Blot Analysis,The northern blot is a technique used in molec...,Vero cells (kidney epithelial cells extracted ...,RNA extraction\n\t\t\n\n\t\t\t\tCells were see...,"100 mm cell culture dishes (Corning, catalog n...",,Microbiology|Microbial genetics|RNA|RNA extrac...,Ying Liao|To Sing Fung|Mei Huang|Shouguo Fang|...,RNA Isolation and Northern Blot Analysis. The ...,RNA Isolation and Northern Blot Analysis. RNA ...,RNA Isolation and Northern Blot Analysis. The ...,RNA Isolation and Northern Blot Analysis. RNA ...,6890


In [4]:
protocols_categories = {aid: categories.split('|') 
                       for aid, categories in zip(categories_df['pr_id'].values,
                                                  categories_df['categories'].values)}
protocols_categories['e1029']

['Immunology',
 'Immune cell function',
 'Cytotoxicity',
 'Cell Biology',
 'Cell-based analysis',
 'Flow cytometry']

As we can see above, each protocol has a variable-length list of category terms.

In the following cell we clean the categories to remove those that are not useful for evaluating our models. We discard purely numeric terms and terms that have no match in WordNet, which will be used later to compute the evaluation scores (the `stop_categories` set is left empty here). Finally, protocols left without any category are removed from the final sample:

In [5]:
from herc_common.evaluation import _get_synset

stop_categories = set()

# Keep only non-numeric categories that are not stop categories and have a WordNet synset
filtered_protocols_categories = {
    k: {el for el in v if el.lower() not in stop_categories
        and not el.isnumeric() and _get_synset(el) is not None}
    for k, v in protocols_categories.items()
}

final_protocols_categories = {
    k: v
    for k, v in filtered_protocols_categories.items()
    if len(v) != 0
}

len(final_protocols_categories)

94
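
The filtering step above can be sketched without WordNet as follows. This is a minimal illustration, assuming that `_get_synset` returns `None` for terms without a WordNet entry. The `toy_lookup` function, its vocabulary and `filter_categories` are hypothetical stand-ins and are not part of `herc_common`:

```python
# Hypothetical stand-in for a WordNet lookup: returns None when a term is unknown.
TOY_VOCABULARY = {"immunology", "cytotoxicity", "protein"}


def toy_lookup(term):
    key = term.lower()
    return key if key in TOY_VOCABULARY else None


def filter_categories(categories_by_id, lookup, stop_categories=frozenset()):
    """Keep non-numeric, non-stop categories that the lookup can resolve,
    and drop protocols that are left without any category."""
    filtered = {}
    for protocol_id, categories in categories_by_id.items():
        kept = {
            category
            for category in categories
            if category.lower() not in stop_categories
            and not category.isnumeric()
            and lookup(category) is not None
        }
        if kept:
            filtered[protocol_id] = kept
    return filtered


# Toy input: "p2" has no resolvable category, so it is dropped entirely.
example = {
    "p1": ["Immunology", "Immune cell function", "Cytotoxicity"],
    "p2": ["Flow cytometry", "2020"],
}
result = filter_categories(example, toy_lookup)
print(result)
```

Multi-word terms such as "Immune cell function" are dropped here only because the toy vocabulary does not contain them. Which terms the real lookup resolves depends on WordNet itself.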

In [6]:
final_protocols_categories

{'e1029': {'Cytotoxicity', 'Immunology'},
 'e1072': {'Activity', 'Biochemistry', 'Microbiology', 'Protein'},
 'e1077': {'Microbiology', 'Molecular Biology', 'RNA'},
 'e1090': {'Microbiology'},
 'e1136': {'Biochemistry', 'Microbiology', 'Protein'},
 'e1180': {'Cytokine', 'Immunology', 'Macrophage'},
 'e1183': {'Development', 'Ligand', 'Stem Cell'},
 'e122': {'Biochemistry', 'Immunology', 'Protein'},
 'e1235': {'General', 'Immunology', 'Microbiology'},
 'e1236': {'Biochemistry', 'Carbohydrate', 'Glycoprotein', 'Polysaccharide'},
 'e1287': {'Biochemistry', 'Microbiology', 'Protein'},
 'e1295': {'Microbiology', 'Virus'},
 'e1308': {'Microbiology'},
 'e1374': {'Biochemistry',
  'Detection',
  'Molecular Biology',
  'Protein',
  'Synthesis'},
 'e1428': {'Microbiology', 'Molecular Biology', 'RNA'},
 'e144': {'Stem Cell'},
 'e1467': {'Biochemistry', 'Carotenoid', 'Chlorophyll', 'Microbiology'},
 'e1471': {'Immunology'},
 'e1569': {'Bacterium', 'Microbiology'},
 'e16': {'Microbiology'},
 'e162'

As we can see above, of the initial 100 protocols in the dataframe, 94 have at least one category to compare against.

## Evaluation
For the evaluation of our system we will use WordNet to obtain a semantic similarity score between the topics predicted by our system and those used as ground truth.
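
The metric is selected later in this notebook by the name `'lch_similarity'`, which in WordNet toolkits such as NLTK denotes the Leacock-Chodorow similarity. As a reminder of what that score measures (the symbols $p$ and $D$ are notation introduced here, and the exact path-counting convention depends on the implementation):

$$
\mathrm{sim}_{\mathrm{LCH}}(c_1, c_2) = -\log \frac{p(c_1, c_2)}{2D}
$$

where $p(c_1, c_2)$ is the length of the shortest path between the two synsets in the hypernym taxonomy and $D$ is the maximum depth of that taxonomy. Because the score is the negative logarithm of a ratio, it is not bounded to $[0, 1]$. Larger values mean that the two concepts are closer, and the maximum is reached when both terms map to the same synset.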

Before we can calculate these similarity scores, we need the topics predicted by our system. First, we load the final pipeline that was saved in the previous notebook:

In [7]:
import string

import en_core_sci_lg
import en_core_web_md

from collections import Counter

from tqdm import tqdm

en_core_web_md.load()
en_core_sci_lg.load()

<spacy.lang.en.English at 0x1be0c62e188>

In [8]:
from herc_common.utils import load_object

final_pipe = load_object(os.path.join(NOTEBOOK_6_RESULTS_DIR, 'final_pipe.pkl'))

Now, we will select the sample of protocols with at least one ground truth subject, and obtain the output of our system for those protocols:

In [9]:
# Materialize the keys as a list so they can be used as a label indexer with .loc
protocols_keys = list(final_protocols_categories.keys())
X = categories_df.set_index('pr_id', inplace=False).loc[protocols_keys]['full_text_cleaned'].values

In [10]:
y_base = final_protocols_categories.values()
y_pred = final_pipe.transform(X)


INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.

In [11]:
y_pred = [[str(topic[0]) for topic in doc] for doc in y_pred]
y_pred[:5]

[['protein',
  'biological process',
  'process',
  'carbon dioxide transmembrane transport',
  'death',
  'response to carbon dioxide',
  'cellular response to carbon dioxide'],
 ['botany',
  'protein',
  'chemistry',
  'chemical compound',
  'chemical substance',
  'group or class of proteins',
  'cell type'],
 ['protein',
  'biological process',
  'process',
  'chemical compound',
  'chemistry',
  'chemical substance',
  'science'],
 ['process',
  'chemical compound',
  'biological process',
  'glucose',
  'concept',
  'interaction science',
  'intentional human action'],
 ['interaction science',
  'protein',
  'chemical compound',
  'materials science',
  'technology',
  'cell biology',
  'biopolymer']]

## Similarity
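In this section we calculate the similarity scores between the topics inferred by the model and the ground-truth categories, using the WordNet Leacock-Chodorow measure (`lch_similarity`). For each protocol we obtain the maximum, minimum, mean and median similarity: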

In [12]:
from herc_common.evaluation import compute_similarity_scores

scores = [compute_similarity_scores(y_b, y_p, 'lch_similarity')
          for y_b, y_p in zip(y_base, y_pred)]
scores[:5]

[{'max similarity': 1.4403615823901665,
  'min similarity': 0.9295359586241757,
  'mean similarity': 1.184948770507171,
  'median similarity': 1.184948770507171},
 {'max similarity': 3.6375861597263857,
  'min similarity': 1.55814461804655,
  'mean similarity': 2.4407132240308744,
  'median similarity': 2.283561059175281},
 {'max similarity': 2.538973871058276,
  'min similarity': 2.0281482472922856,
  'mean similarity': 2.1984234552142823,
  'median similarity': 2.0281482472922856},
 {'max similarity': 1.3350010667323402,
  'min similarity': 1.3350010667323402,
  'mean similarity': 1.3350010667323402,
  'median similarity': 1.3350010667323402},
 {'max similarity': 3.6375861597263857,
  'min similarity': 0.8649974374866046,
  'mean similarity': 1.7891936782331985,
  'median similarity': 0.8649974374866046}]
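
The internals of `compute_similarity_scores` are not shown in this notebook. One plausible reading, consistent with the outputs above (for the first protocol, which has two ground-truth categories, the mean is the average of the max and the min), is that each ground-truth category receives the best score it obtains against any predicted topic, and those per-category scores are then summarised. A minimal sketch under that assumption follows. `summarise_scores`, `toy_pair_score` and the score table are hypothetical and the numeric values are made up for illustration:

```python
from statistics import mean, median


def summarise_scores(ground_truth, predicted, pair_score):
    """Score each ground-truth term by its best match among the predictions,
    then summarise the per-term scores (assumed aggregation, not confirmed)."""
    best_per_term = [
        max(pair_score(truth, guess) for guess in predicted)
        for truth in sorted(ground_truth)
    ]
    return {
        "max similarity": max(best_per_term),
        "min similarity": min(best_per_term),
        "mean similarity": mean(best_per_term),
        "median similarity": median(best_per_term),
    }


# Toy pair scores standing in for a WordNet similarity (illustrative values only).
TOY_SCORES = {
    ("Immunology", "protein"): 1.0,
    ("Immunology", "process"): 1.5,
    ("Cytotoxicity", "protein"): 0.5,
    ("Cytotoxicity", "process"): 1.0,
}


def toy_pair_score(truth, guess):
    return TOY_SCORES[(truth, guess)]


summary = summarise_scores(
    {"Cytotoxicity", "Immunology"}, ["protein", "process"], toy_pair_score
)
print(summary)
```

With only two ground-truth terms the mean and the median coincide, which matches the pattern in the first score dictionary above. The sketch raises a `ValueError` if either input is empty, so a real implementation would need to decide how to score such cases.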

In [13]:
final_similarity = np.mean([score['mean similarity'] for score in scores])
final_similarity

1.7950689264796515

## Saving the results
Finally, we are going to save the results. First, the predictions and the filtered ground-truth categories are stored in a new dataframe:

In [14]:
cols_subset = ['title', 'categories']

results_df = categories_df.set_index('pr_id', inplace=False).loc[protocols_keys][cols_subset]
results_df['Topics Predicted'] = ['\n'.join(topics) for topics in y_pred]
results_df['categories'] = ['\n'.join(topics) for topics in y_base]
results_df.head()

Unnamed: 0_level_0,title,categories,Topics Predicted
pr_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
e1029,ADCC Assay Protocol,Cytotoxicity\nImmunology,protein\nbiological process\nprocess\ncarbon d...
e1072,Catalase Activity Assay in Candida glabrata,Microbiology\nActivity\nProtein\nBiochemistry,botany\nprotein\nchemistry\nchemical compound\...
e1077,RNA Isolation and Northern Blot Analysis,Microbiology\nMolecular Biology\nRNA,protein\nbiological process\nprocess\nchemical...
e1090,Flow Cytometric Analysis of Autophagic Activit...,Microbiology,process\nchemical compound\nbiological process...
e1136,Preparation of Parasite Protein Extracts and W...,Microbiology\nProtein\nBiochemistry,interaction science\nprotein\nchemical compoun...


In [15]:
scores_df = pd.DataFrame.from_records(scores)
scores_df.set_index(results_df.index, inplace=True)
scores_df.head()

Unnamed: 0_level_0,max similarity,min similarity,mean similarity,median similarity
pr_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
e1029,1.440362,0.929536,1.184949,1.184949
e1072,3.637586,1.558145,2.440713,2.283561
e1077,2.538974,2.028148,2.198423,2.028148
e1090,1.335001,1.335001,1.335001,1.335001
e1136,3.637586,0.864997,1.789194,0.864997


The scores obtained for each protocol are then joined to the predictions and saved as well:

In [16]:
final_df = results_df.join(scores_df)
final_df.head()

Unnamed: 0_level_0,title,categories,Topics Predicted,max similarity,min similarity,mean similarity,median similarity
pr_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
e1029,ADCC Assay Protocol,Cytotoxicity\nImmunology,protein\nbiological process\nprocess\ncarbon d...,1.440362,0.929536,1.184949,1.184949
e1072,Catalase Activity Assay in Candida glabrata,Microbiology\nActivity\nProtein\nBiochemistry,botany\nprotein\nchemistry\nchemical compound\...,3.637586,1.558145,2.440713,2.283561
e1077,RNA Isolation and Northern Blot Analysis,Microbiology\nMolecular Biology\nRNA,protein\nbiological process\nprocess\nchemical...,2.538974,2.028148,2.198423,2.028148
e1090,Flow Cytometric Analysis of Autophagic Activit...,Microbiology,process\nchemical compound\nbiological process...,1.335001,1.335001,1.335001,1.335001
e1136,Preparation of Parasite Protein Extracts and W...,Microbiology\nProtein\nBiochemistry,interaction science\nprotein\nchemical compoun...,3.637586,0.864997,1.789194,0.864997


In [17]:
final_df.to_csv(os.path.join(NOTEBOOK_7_RESULTS_DIR, 'protocols_scores.csv'))
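
Because the `categories` and `Topics Predicted` cells contain embedded newlines, the resulting CSV has quoted multi-line fields. The following standard-library sketch shows that such fields survive a round trip when they are read with a CSV-aware parser. The sample row is abbreviated from the output above, and `buffer` is an in-memory stand-in for the file on disk:

```python
import csv
import io

row = {
    "pr_id": "e1029",
    "categories": "Cytotoxicity\nImmunology",  # newline-joined, as in results_df
    "mean similarity": "1.184949",
}

# In-memory stand-in for the CSV file; newline="" leaves line endings untouched.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Reading back with the csv module keeps the multi-line field in a single cell.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
categories = restored[0]["categories"].split("\n")
print(categories)
```

Tools that split the file naively on line breaks would break these records apart, so the file should be reopened with a CSV parser rather than processed line by line.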