# Bibliography Categorization: 'BibCat'
## Tutorial: Estimating performance of classifiers in bibcat.



---


## Introduction.

In this tutorial, we will use bibcat to estimate the performance of classifiers on sets of texts.


---

## User Workflow: Estimating the performance of classifiers.


The `Performance` class contains user-friendly methods for estimating the performance of given classifiers and outputting that performance as, e.g., confusion matrices.  The code blocks below give an overview of how these methods can be run.

For this tutorial, we assume that the user has already run the trainML tutorial, and so has generated and saved a machine learning model.

In [1]:
#Import external packages
import re
import os
import sys
import json
import numpy as np


In [2]:
# Set up for fetching necessary bibcat modules for the tutorial
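# Check work directories: src/ is where all source Python scripts are available.
# Note: '__file__' is quoted on purpose. Notebooks do not define __file__, so this line resolves to the current working directory.
current_dir = os.path.dirname(os.path.abspath('__file__'))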
_parent = os.path.dirname(current_dir)
src_dir = os.path.join(_parent, "src")

print(f'Current Directory: {current_dir}')
print(f'Source directory: {src_dir}')

# move to the ../src/ directory to import necessary modules. 
os.chdir(src_dir)

Current Directory: /Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/repo_stsci/bibcat/docs
Source directory: /Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/repo_stsci/bibcat/src
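The cell above changes the working directory so that the bibcat modules can be imported. A possible alternative, sketched here under the same assumption that `src/` sits next to the current folder, is to add that folder to `sys.path` and leave the working directory untouched. The helper names `find_src_dir` and `add_to_import_path` are hypothetical and are not part of bibcat.

```python
import os
import sys


def find_src_dir(start_dir, src_name="src"):
    """Return the path of the folder `src_name` that sits next to `start_dir`."""
    parent = os.path.dirname(os.path.abspath(start_dir))
    return os.path.join(parent, src_name)


def add_to_import_path(path):
    """Prepend `path` to sys.path (only once) so modules stored there can be imported."""
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# Assumption: the notebook is run from a folder (e.g., docs/) that is a sibling of src/
src_dir = find_src_dir(os.getcwd())
add_to_import_path(src_dir)
print(f"Source directory: {src_dir}")
```

With this approach, relative paths used later in a notebook would still be resolved against the original folder rather than against `src/`.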


In [3]:
#Import bibcat packages
import bibcat_classes as bibcat
import bibcat_config as config
import bibcat_constants as preset

Root directory =/Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/repo_stsci/bibcat/src, parent directory=/Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/repo_stsci/bibcat
/Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/repo_stsci/bibcat/src/models folder already exists.
/Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/repo_stsci/bibcat/output folder already exists.


In [4]:
#Fetch filepaths for model and data
name_model = config.name_model
filepath_json = config.path_json
output_path = config.PATH_OUTPUT
dir_model = os.path.join(config.dir_allmodels, name_model)
#
#Set directories for storing performance output
filepath_output = output_path #Where to store the performance output, such as the confusion matrices
#
#Set directories for fetching text
dir_info = dir_model
folder_test = preset.folders_TVT["test"]
dir_test = os.path.join(dir_model, folder_test)

In [5]:
#Set some global variables
seed_test = 10 #Random seed for shuffling text dataset
np.random.seed(seed_test)
do_shuffle = True #Whether or not to shuffle the text dataset
do_real_testdata = True #If True, will use real papers to test performance; if False, will use fake texts below
#
max_tests = 100 #Number of text entries to test the performance for; None for all tests available
is_text_processed = False #Whether the input texts are already preprocessed; must be defined, since it is passed to the evaluation calls below
mode_modif = "skim_trim_anon" #Processing mode to apply to the texts; use None only if the texts are already preprocessed
buffer = 0
#
#Prepare some Keyword objects
kobj_hubble = bibcat.Keyword(
                keywords=["Hubble", "Hubble Telescope",
                          "Hubble Space Telescope"],
                acronyms=["hst", "ht"])
all_kobjs = [kobj_hubble]

Let's build a set of classifiers for which we'd like to test the performance.  We'll then feed each of those classifiers into an instance of the Operator class to handle them.

In [6]:
#Create a list of classifiers
#This can be modified to use whatever classifiers you'd like.
#Load a previously trained ML model
filepath_model = os.path.join(dir_model, (name_model+".npy"))
fileloc_ML = os.path.join(dir_model, (preset.tfoutput_prefix+name_model))
classifier_ML = bibcat.Classifier_ML(filepath_model=filepath_model, fileloc_ML=fileloc_ML, do_verbose=True)
#

#Load a rule-based classifier
classifier_rules = bibcat.Classifier_Rules()
#

INFO:absl:using Lamb optimizer






In [7]:
#Load models into instances of the Operator class
operator_1 = bibcat.Operator(classifier=classifier_ML, mode=mode_modif, keyword_objs=all_kobjs,
                            name="Operator_1", do_verbose=True, load_check_truematch=True, do_verbose_deep=False)
operator_2 = bibcat.Operator(classifier=classifier_rules,
                            name="Operator_2", mode=mode_modif, keyword_objs=all_kobjs, do_verbose=True, do_verbose_deep=False)
list_operators = [operator_1, operator_2] #Feel free to add or remove operators here.
#

Instance of Operator successfully initialized!
Keyword objects:
0: Keyword Object:
Name: Hubble
Keywords: ['Hubble Space Telescope', 'Hubble Telescope', 'Hubble']
Acronyms: ['hst', 'ht']

Instance of Operator successfully initialized!
Keyword objects:
0: Keyword Object:
Name: Hubble
Keywords: ['Hubble Space Telescope', 'Hubble Telescope', 'Hubble']
Acronyms: ['hst', 'ht']



Now, let's fetch some text for our classifiers to classify. For this tutorial, we'll load the texts of the papers that were reserved as the test set for the ML classifier, using the bibcodes recorded during training to look them up in the original dataset.

In [18]:
#For use of real papers from test dataset to test on
if (do_real_testdata):
    #Load information for processed bibcodes reserved for testing
    dict_TVTinfo = np.load(os.path.join(dir_info, "dict_TVTinfo.npy"), allow_pickle=True).item()
    list_test_bibcodes = [key for key in dict_TVTinfo if (dict_TVTinfo[key]["folder_TVT"] == folder_test)]
    
    #Load the original data
    with open(filepath_json, 'r') as openfile:
        dataset = json.load(openfile)
    #
    
    #Extract text information for the bibcodes reserved for testing
    list_test_indanddata_raw = [(ii, dataset[ii]) for ii in range(0, len(dataset))
                                if (dataset[ii]["bibcode"] in list_test_bibcodes)] #Data for test set
    #
    
    #Shuffle, if requested
    if do_shuffle:
        np.random.shuffle(list_test_indanddata_raw)
    #
    
    #Extract target number of test papers from the test bibcodes
    if (max_tests is not None): #Fetch subset of tests
        list_test_indanddata = list_test_indanddata_raw[0:max_tests]
    else: #Use all tests
        list_test_indanddata = list_test_indanddata_raw
    #
    
    #Process the text input into dictionary format for inputting into the codebase
    dict_texts = {} #To hold formatted text entries
    for ii in range(0, len(list_test_indanddata)):
        curr_ind = list_test_indanddata[ii][0]
        curr_data = list_test_indanddata[ii][1]
        #Convert this data entry into dictionary with: key:text,id,bibcode,mission structure
        curr_info = {"text":curr_data["body"], "id":str(curr_ind), "bibcode":curr_data["bibcode"],
                    "missions":{key:{"mission":key, "class":curr_data["class_missions"][key]["papertype"]}
                                for key in curr_data["class_missions"]}}
        #
        #Store this data entry
        dict_texts[str(curr_ind)] = curr_info
    #
    
    #Print some notes about the testing data
    print("Number of texts in text set: {0}".format(len(dict_texts)))
    print("")
    for key in dict_texts:
        print("Entry {0}:".format(key))
        print("ID: {0}".format(dict_texts[key]["id"]))
        print("Bibcode: {0}".format(dict_texts[key]["bibcode"]))
        print("Missions: {0}".format(dict_texts[key]["missions"]))
        print("Start of text:\n{0}".format(dict_texts[key]["text"][0:500]))
        print("-\n")
    #
#

Number of texts in text set: 100

Entry 3039:
ID: 3039
Bibcode: 2020MNRAS.491.3042P
Missions: {'HST': {'mission': 'HST', 'class': 'SCIENCE'}}
Start of text:
1 INTRODUCTION Two independent studies, Lynden-Bell ( 1976 ) and Kunkel Demers ( 1976 ), first described a notable feature in the spatial distribution of the then-known satellite galaxies and distant globular clusters around the Milky Way: they align close to a common great circle on the sky, which is also traced by the Magellanic Stream. In three dimensions, these Milky Way satellites are found close to a common plane. This plane of satellites has since been termed the Vast Polar Structure (VPO
-

Entry 2779:
ID: 2779
Bibcode: 2020MNRAS.498.5183S
Missions: {'HST': {'mission': 'HST', 'class': 'MENTION'}, 'GALEX': {'mission': 'GALEX', 'class': 'SCIENCE'}}
Start of text:
1 INTRODUCTION Large multiband galaxy surveys (SDSS, York et al. 2000 ; GALEX, Martin et al. 2005 ; 2MASS, Skrutskie et al. 2006 ; WISE, Wright et al. 2010 ; DES, Ab
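
The structure of `dict_texts` built in the cell above can be illustrated on a toy dataset. The sketch below is a simplified, self-contained version of the same steps (filter by test bibcodes, shuffle, truncate, reformat). The records, the bibcodes `toy_0` to `toy_2`, and the function `build_dict_texts` are made up for illustration. It also uses NumPy's `default_rng` for shuffling instead of the global `np.random.seed` used in the tutorial.

```python
import numpy as np

# Hypothetical records that mimic the fields read from the JSON dataset above
toy_dataset = [
    {"bibcode": "toy_0", "body": "We present HST observations.",
     "class_missions": {"HST": {"papertype": "SCIENCE"}}},
    {"bibcode": "toy_1", "body": "HST is mentioned; GALEX data are analyzed.",
     "class_missions": {"HST": {"papertype": "MENTION"},
                        "GALEX": {"papertype": "SCIENCE"}}},
    {"bibcode": "toy_2", "body": "This entry is not part of the test set.",
     "class_missions": {"HST": {"papertype": "MENTION"}}},
]
toy_test_bibcodes = {"toy_0", "toy_1"}  # A set gives fast membership checks


def build_dict_texts(dataset, test_bibcodes, max_tests=None, seed=None):
    """Collect test-set entries into a dictionary keyed by dataset index."""
    # Keep (index, entry) pairs for the entries reserved for testing
    pairs = [(ii, entry) for ii, entry in enumerate(dataset)
             if entry["bibcode"] in test_bibcodes]
    # Shuffle reproducibly, if a seed is given
    if seed is not None:
        rng = np.random.default_rng(seed)
        pairs = [pairs[jj] for jj in rng.permutation(len(pairs))]
    # Truncate to the requested number of tests
    if max_tests is not None:
        pairs = pairs[:max_tests]
    # Reformat each entry into the text/id/bibcode/missions structure
    dict_texts = {}
    for ind, entry in pairs:
        missions = {key: {"mission": key, "class": val["papertype"]}
                    for key, val in entry["class_missions"].items()}
        dict_texts[str(ind)] = {"text": entry["body"], "id": str(ind),
                                "bibcode": entry["bibcode"], "missions": missions}
    return dict_texts


toy_dict_texts = build_dict_texts(toy_dataset, toy_test_bibcodes)
for key, info in toy_dict_texts.items():
    print(key, info["bibcode"], info["missions"])
```

Each entry carries one true class per mission, so a single paper can count as, e.g., a mention for one mission and science for another.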

In [None]:
#For use of fake, made-up data entries to test on
if (not do_real_testdata):
    print("Using fake test data for testing.")
    #Make some fake data
    dict_texts_raw = {"science":["We present HST observations in Figure 4.",
                        "The HST stars are listed in Table 3b.",
                        "Despite our efforts to smooth the data, there are still rings in the HST images.",
                        "See Section 8c for more discussion of the Hubble images.",
                        "The supernovae detected with HST tend to be brighter than initially predicted.",
                        "Our spectra from HST fit well to the standard trend first published in Someone et al. 1990.",
                        "We use the Hubble Space Telescope to build an ultraviolet database of the target stars.",
                        "The blue points (HST) exhibit more scatter than the red points (JWST).",
                        "The benefit, then, is the far higher S/N we achieved in our HST observations.",
                        "Here we employ the Hubble Telescope to observe the edge of the photon-dominated region.",
                        "The black line shows that the region targeted with Hubble has an extreme UV signature."],
                 "datainfluenced":["The simulated Hubble data is plotted in Figure 4.",
                       "Compared to the HST observations in Someone et al., our JWST follow-up reached higher S/N.",
                       "We were able to reproduce the luminosities from Hubble using our latest models.",
                       "We overplot Hubble-observed stars from Someone et al. in Figure 3b.",
                       "We built the spectral templates using UV data in the Hubble archive.",
                       "We simulate what our future HST observations will look like to predict the S/N.",
                       "Our work here with JWST is inspired by our earlier HST study published in 2010.",
                       "We therefore use the Hubble statistics from Author et al. to guide our stellar predictions.",
                       "The stars in Figure 3 were plotted based on the HST-fitted trend line in Person et al.",
                       "The final step is to use the HST exposure tool to put our modeled images in context."],
                 "mention":["Person et al. used HST to measure the Hubble constant.",
                        "We will present new HST observations in a future work.",
                        "HST is a fantastic instrument that has revolutionized our view of space.",
                        "The Hubble Space Telescope (HST) has its mission center at the STScI.",
                        "We can use HST to power a variety of science in the ultraviolet regime.",
                        "It is not clear when the star will be observable with HST.",
                        "More data can be found and downloaded from the Hubble archive.",
                        "We note that HST can be used to observe the stars as well, at higher S/N.",
                        "However, we ended up using the JWST rather than HST observations in this work.",
                        "We push the analysis of the Hubble component of the dataset to a future study.",
                        "We expect the HST observations to be released in the fall.",
                        "We look forward to any follow-up studies with, e.g., the Hubble Telescope."]}
    #
    #Convert into dictionary with: key:text,class,id,mission structure
    i_track = 0
    dict_texts = {}
    #Store subheadings by mission, to avoid duplicating and processing the same text across different missions
    mission = operator_1._fetch_keyword_object(lookup="HST")._get_info("name")
    for key in dict_texts_raw:
        curr_set = dict_texts_raw[key]
        for ii in range(0, len(curr_set)):
            dict_texts[str(i_track)] = {"text":curr_set[ii], "id":"{0}_{1}".format(key, ii), "bibcode":str(i_track),
                                        "missions":{mission:{"mission":mission, "class":key}}}
            i_track += 1
    #
    print("Mission: {0}".format(mission))
    print("Number of texts in text set: {0}".format(len(dict_texts)))
    print("")
    for key in dict_texts:
        print(dict_texts[key])
        print("-")
    #
#

Next, let's prepare some additional information for each of these classifiers.  We'll need to set, for example, the uncertainty thresholds for accepting or rejecting each classifier's output.

In [None]:
#Set parameters for each operator and its internal classifier
#Global parameters
do_verify_truematch = False #Off for now, turn on later
do_raise_innererror = False

#For operator 1
mapper_1 = None #Mapper to mask classifications; None if no masking
dict_texts_1 = dict_texts #Dictionary of texts to classify
threshold_1 = 0.70 #Uncertainty threshold for this classifier
buffer_1 = 0 #None since text already preprocessed

#For operator 2
mapper_2 = None #Mapper to mask classifications; None if no masking
dict_texts_2 = dict_texts #Dictionary of texts to classify
threshold_2 = 0.70 #Uncertainty threshold for this classifier
buffer_2 = 0 #None since text already preprocessed

#Gather parameters into lists
list_mappers = [mapper_1, mapper_2]
list_thresholds = [threshold_1, threshold_2]
list_dict_texts = [dict_texts_1, dict_texts_2]
list_buffers = [buffer_1, buffer_2]
list_threshold_arrays = [np.linspace(0.5, 0.95, 20)]*2 #For uncertainty test
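
How a threshold such as `threshold_1` might gate a classifier's output can be sketched as follows. This is a minimal illustration, assuming the classifier returns one probability per class and that a verdict is accepted only when the highest probability reaches the threshold; bibcat's internal logic may differ. The function `apply_threshold`, the class list, and the `"uncertain"` fallback label are hypothetical.

```python
import numpy as np

# Hypothetical class names, taken from the labels of the fake texts in this tutorial
toy_classes = ["science", "datainfluenced", "mention"]


def apply_threshold(probabilities, class_names, threshold, fallback="uncertain"):
    """Return the most probable class, or `fallback` if it is below `threshold`."""
    probs = np.asarray(probabilities, dtype=float)
    i_best = int(np.argmax(probs))
    if probs[i_best] >= threshold:
        return class_names[i_best]
    return fallback


# A confident verdict is accepted; an ambiguous one is rejected
print(apply_threshold([0.80, 0.15, 0.05], toy_classes, threshold=0.70))
print(apply_threshold([0.50, 0.30, 0.20], toy_classes, threshold=0.70))
```

Under this assumption, raising the threshold trades coverage (fewer accepted verdicts) for reliability of the verdicts that remain.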

Now, let's evaluate the performance of these classifiers in different ways.  We will consider these performance tests:
* Basic: We generate confusion matrices for the set of Operators (containing the different classifiers).
* Uncertainty: We plot performance as a function of uncertainty level for the set of Operators.
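
For readers unfamiliar with confusion matrices, the sketch below builds one by hand with NumPy on made-up labels. It is only an illustration of the concept: the function `confusion_matrix` and the toy label lists are hypothetical and do not reflect how the `Performance` class computes or stores its output.

```python
import numpy as np


def confusion_matrix(true_labels, pred_labels, class_names):
    """Count verdicts; rows are true classes, columns are predicted classes."""
    index = {name: ii for ii, name in enumerate(class_names)}
    matrix = np.zeros((len(class_names), len(class_names)), dtype=int)
    for true, pred in zip(true_labels, pred_labels):
        matrix[index[true], index[pred]] += 1
    return matrix


# Made-up true classes and classifier verdicts for five texts
toy_classes = ["science", "datainfluenced", "mention"]
toy_true = ["science", "science", "mention", "datainfluenced", "mention"]
toy_pred = ["science", "mention", "mention", "datainfluenced", "science"]

toy_matrix = confusion_matrix(toy_true, toy_pred, toy_classes)
toy_accuracy = np.trace(toy_matrix) / toy_matrix.sum()  # Diagonal = correct verdicts
print(toy_matrix)
print("Accuracy: {0:.2f}".format(toy_accuracy))
```

Off-diagonal entries show which classes get confused with each other, which an overall accuracy alone would hide.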

In [None]:
#Create an instance of the Performance class
performer = bibcat.Performance()

The Basic evaluation:

In [None]:
#Parameters for this evaluation
fileroot_evaluation = "test_eval_basic" #Root name of the file within which to store the performance evaluation output
fileroot_misclassif = "test_misclassif_basic" #Root name of the file within which to store misclassified text information
figsize = (20, 12)

#Run the pipeline for a basic evaluation of model performance
performer.evaluate_performance_basic(operators=list_operators, dicts_texts=list_dict_texts, mappers=list_mappers,
                                     thresholds=list_thresholds, buffers=list_buffers, is_text_processed=is_text_processed,
                                     do_verify_truematch=do_verify_truematch, do_raise_innererror=do_raise_innererror,
                                     do_save_evaluation=True, do_save_misclassif=True, filepath_output=filepath_output,
                                     fileroot_evaluation=fileroot_evaluation, fileroot_misclassif=fileroot_misclassif,
                                     print_freq=25, do_verbose=True, do_verbose_deep=False, figsize=figsize)

The Uncertainty evaluation:

In [None]:
#Parameters for this evaluation
fileroot_evaluation = "test_eval_uncertainty" #Root name of the file within which to store the performance evaluation output
fileroot_misclassif = "test_misclassif_uncertainty" #Root name of the file within which to store misclassified text information
figsize = (20, 12)

#Run the pipeline for an evaluation of model performance as a function of uncertainty
performer.evaluate_performance_uncertainty(operators=list_operators, dicts_texts=list_dict_texts, mappers=list_mappers,
                                     threshold_arrays=list_threshold_arrays, buffers=list_buffers,
                                     is_text_processed=is_text_processed,
                                     do_verify_truematch=do_verify_truematch, do_raise_innererror=do_raise_innererror,
                                     do_save_evaluation=True, do_save_misclassif=True, filepath_output=filepath_output,
                                     fileroot_evaluation=fileroot_evaluation, fileroot_misclassif=fileroot_misclassif,
                                     print_freq=25, do_verbose=True, do_verbose_deep=False, figsize=figsize)
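
The idea behind measuring performance as a function of uncertainty level can be sketched on toy numbers. The example below assumes that each verdict comes with a top-class probability and is accepted only if that probability reaches the threshold. The function `sweep_thresholds` and the toy values are hypothetical, and the quantities bibcat actually plots may differ.

```python
import numpy as np


def sweep_thresholds(top_probs, is_correct, thresholds):
    """For each threshold, return (threshold, coverage, accuracy of accepted verdicts)."""
    top_probs = np.asarray(top_probs, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    results = []
    for threshold in thresholds:
        accepted = top_probs >= threshold
        n_accepted = int(accepted.sum())
        coverage = n_accepted / len(top_probs)
        # Accuracy is undefined (NaN) if no verdict passes the threshold
        accuracy = float(is_correct[accepted].mean()) if n_accepted > 0 else float("nan")
        results.append((float(threshold), coverage, accuracy))
    return results


# Made-up top-class probabilities and whether each verdict was correct
toy_top_probs = [0.95, 0.85, 0.75, 0.65, 0.55]
toy_is_correct = [True, True, False, True, False]

toy_results = sweep_thresholds(toy_top_probs, toy_is_correct, [0.5, 0.7, 0.9])
for threshold, coverage, accuracy in toy_results:
    print("threshold={0:.2f}  coverage={1:.2f}  accuracy={2:.2f}".format(
        threshold, coverage, accuracy))
```

In this toy case, stricter thresholds accept fewer verdicts, and the accepted ones are more often correct.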

And with that, you should have new confusion matrices summarizing the basic performance of these classifiers, along with the output of the uncertainty evaluation, saved in the directory set by `filepath_output`!

---

In [None]:
#Set end marker for this tutorial.
print("This tutorial completed successfully.")