# Bibliography Categorization: 'BibCat'
## Tutorial: Estimating performance of classifiers in bibcat.



---


## Introduction.

In this tutorial, we will use bibcat to estimate the performance of classifiers on sets of texts.


---

## User Workflow: Estimating the performance of classifiers.


The `Performance` class contains user-friendly methods for estimating the performance of given classifiers and for outputting that performance as, e.g., confusion matrices.  The code blocks below give an overview of how this class can be used.
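As a reminder of what a confusion matrix records, here is a minimal sketch in plain Python. It is not bibcat's implementation: the function name `build_confusion_matrix` and the labels are hypothetical and chosen only for illustration. Rows correspond to the true class and columns to the predicted class.

```python
def build_confusion_matrix(true_labels, pred_labels, class_names):
    """Count texts per (true class, predicted class) pair; rows are true classes."""
    index_of = {name: ii for ii, name in enumerate(class_names)}
    matrix = [[0] * len(class_names) for _ in class_names]
    for true, pred in zip(true_labels, pred_labels):
        matrix[index_of[true]][index_of[pred]] += 1
    return matrix

# Hypothetical labels, for illustration only
class_names = ["science", "datainfluenced", "mention"]
true_labels = ["science", "science", "mention", "datainfluenced", "mention"]
pred_labels = ["science", "mention", "mention", "datainfluenced", "science"]

confusion = build_confusion_matrix(true_labels, pred_labels, class_names)

# Correct classifications sit on the diagonal of the matrix
num_correct = sum(confusion[ii][ii] for ii in range(len(class_names)))
accuracy = num_correct / len(true_labels)

for name, row in zip(class_names, confusion):
    print("{0:>15}: {1}".format(name, row))
print("Accuracy: {0:.2f}".format(accuracy))
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy number hides.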

For this tutorial, we assume that the user has already run the trainML tutorial, and so has generated and saved a machine learning model.

In [1]:
#Import external packages
import re
import os
import sys
import json
import numpy as np
sys.path.append("./../main/")
#
#Import bibcat packages
import bibcat_classes as bibcat
import bibcat_config as config
import bibcat_constants as preset

2023-08-26 04:59:31.623788: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
#Fetch filepaths for model and data
name_model = config.name_model
filepath_json = config.path_json
dir_model = os.path.join(config.dir_allmodels, name_model)
#
#Set directories for storing performance output
filepath_output = dir_model #Where to store the performance output, such as the confusion matrices
fileroot_evaluation = "test_eval" #Root name of the file within which to store the performance evaluation output
fileroot_misclassif = "test_misclassif" #Root name of the file within which to store misclassified text information
#
#Set directories for fetching text
dir_info = dir_model
dir_test = os.path.join(dir_model, "dir_test")

In [3]:
#Set some global variables
seed_test = 10 #Random seed for shuffling text dataset
do_shuffle = True #Whether or not to shuffle the text dataset
max_tests = 100 #Number of text entries to test the performance for; None for all tests available
is_text_processed = False #The example texts defined later in this tutorial are raw sentences, so they are not flagged as preprocessed
mode_modif = "skim_trim_anon" #Processing mode to apply to the raw text; this would be None for text that has already been preprocessed
buffer = 0
#
#Prepare some Keyword objects
kobj_hubble = bibcat.Keyword(
                keywords=["Hubble", "Hubble Telescope",
                          "Hubble Space Telescope"],
                acronyms=["hst", "ht"])
all_kobjs = [kobj_hubble]

Let's build a set of classifiers for which we'd like to test the performance.  We'll then feed each of those classifiers into an instance of the Operator class to handle them.

In [4]:
#Create a list of classifiers
#This can be modified to use whatever classifiers you'd like.
#Load a previously trained ML model
filepath_model = os.path.join(dir_model, (name_model+".npy"))
fileloc_ML = os.path.join(dir_model, (preset.tfoutput_prefix+name_model))
classifier_ML = bibcat.Classifier_ML(filepath_model=filepath_model, fileloc_ML=fileloc_ML, do_verbose=True)
#

#Load a rule-based classifier
classifier_rules = bibcat.Classifier_Rules()
#





In [5]:
#Load models into instances of the Operator class
operator_1 = bibcat.Operator(classifier=classifier_ML, mode=mode_modif, keyword_objs=all_kobjs,
                            name="Operator_1", do_verbose=True, load_check_truematch=True, do_verbose_deep=False)
operator_2 = bibcat.Operator(classifier=classifier_rules,
                            name="Operator_2", mode=mode_modif, keyword_objs=all_kobjs, do_verbose=True, do_verbose_deep=False)
list_operators = [operator_1, operator_2] #Feel free to add more or fewer operators here.
#

Instance of Operator successfully initialized!
Keyword objects:
0: Keyword Object:
Name: Hubble
Keywords: ['Hubble Space Telescope', 'Hubble Telescope', 'Hubble']
Acronyms: ['hst', 'ht']

Instance of Operator successfully initialized!
Keyword objects:
0: Keyword Object:
Name: Hubble
Keywords: ['Hubble Space Telescope', 'Hubble Telescope', 'Hubble']
Acronyms: ['hst', 'ht']



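Now, let's fetch some text for our classifiers to classify. The intended approach is to load previously processed texts from the directory containing the test set for the ML classifier. That loading code is currently disabled (see the comments at the top of the next cell), so for now this tutorial uses the small set of made-up example sentences borrowed from the trainML tutorial.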

In [6]:
#The code below is disabled for now: the papers themselves, rather than their paragraphs, need to be split into training/validation/test (TVT) sets...
#...otherwise, a paper cannot be reprocessed as a whole (because a single paper could contain both training and test data, for example)
"""
#Load information for the processed text
dict_allinfo = np.load(os.path.join(dir_info, "dict_textinfo.npy"), allow_pickle=True).item()

#Prepare filepaths for each class directory of text
list_subdirs = os.listdir(dir_test)
list_joinedfilenames = [os.path.join(item1, item2) for item1 in list_subdirs
                  for item2 in os.listdir(os.path.join(dir_test, item1)) if item2.endswith(".txt")]

#Shuffle the tests, if so requested
if do_shuffle:
    np.random.seed(seed_test)
    np.random.shuffle(list_joinedfilenames)
#

#Truncate the number of tests, if so requested
if (max_tests is not None):
    list_joinedfilenames = list_joinedfilenames[0:max_tests]
#

dict_texts = {}
#Process the tests into a dictionary of texts
for ii in range(0, len(list_joinedfilenames)):
    curr_filename = list_joinedfilenames[ii]
    curr_fileroot = os.path.splitext(os.path.basename(curr_filename))[0] #Remove directory and extension
    curr_info = dict_allinfo[curr_fileroot]

    #Load the text from this file
    with open(os.path.join(dir_test, curr_filename), 'r') as openfile:
        curr_text = openfile.read()
    #
    
    #Store info for this current text entry
    curr_dict = {"text":curr_text, "mission":curr_info["mission"], "forest":curr_info["forest"],
                "class":curr_info["class"], "id":curr_info["id"]}
    dict_texts[str(ii)] = curr_dict
#
#Print some information about the sets of texts
print("Texts were pulled from: {0}".format(dir_test))
print("Number of valid .txt files in this directory: {0}".format(len(list_joinedfilenames)))
print("Number of texts in text set: {0}".format(len(dict_texts)))
#"""

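The comments in the disabled cell point at a data-leakage problem: if paragraphs are split at random, a single paper can end up in both the training and test sets. The sketch below shows one way to split at the paper level instead. It is an illustration under stated assumptions, not bibcat code: the function `split_by_paper`, the `paper_id` field, and the two-way train/test split (with no validation set) are all hypothetical simplifications.

```python
import random

def split_by_paper(paragraphs, frac_test=0.2, seed=10):
    """Assign whole papers to either train or test, so no paper contributes to both."""
    # Sort the unique paper ids first so the shuffle is reproducible for a given seed
    paper_ids = sorted({entry["paper_id"] for entry in paragraphs})
    rng = random.Random(seed)
    rng.shuffle(paper_ids)
    num_test = max(1, int(round(frac_test * len(paper_ids))))
    test_ids = set(paper_ids[:num_test])
    train = [entry for entry in paragraphs if entry["paper_id"] not in test_ids]
    test = [entry for entry in paragraphs if entry["paper_id"] in test_ids]
    return train, test

# Hypothetical data: 5 papers with 3 paragraphs each
paragraphs = [{"paper_id": "paper_{0}".format(ii // 3), "text": "paragraph {0}".format(ii)}
              for ii in range(15)]

train_set, test_set = split_by_paper(paragraphs, frac_test=0.2, seed=10)
train_papers = {entry["paper_id"] for entry in train_set}
test_papers = {entry["paper_id"] for entry in test_set}
print("Train paragraphs: {0}, test paragraphs: {1}".format(len(train_set), len(test_set)))
print("Papers shared between sets: {0}".format(len(train_papers & test_papers)))
```

Because the split is made over paper identifiers, every paragraph of a given paper lands in the same set, and each test paper can be reprocessed as a whole.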

In [11]:
#!!!!!
#Borrowing fake data from trainML tutorial, just for now, to get everything running
#Make some fake data
dict_texts_raw = {"science":["We present HST observations in Figure 4.",
                        "The HST stars are listed in Table 3b.",
                        "Despite our efforts to smooth the data, there are still rings in the HST images.",
                        "See Section 8c for more discussion of the Hubble images.",
                        "The supernovae detected with HST tend to be brighter than initially predicted.",
                        "Our spectra from HST fit well to the standard trend first published in Someone et al. 1990.",
                        "We use the Hubble Space Telescope to build an ultraviolet database of the target stars.",
                        "The blue points (HST) exhibit more scatter than the red points (JWST).",
                        "The benefit, then, is the far higher S/N we achieved in our HST observations.",
                        "Here we employ the Hubble Telescope to observe the edge of the photon-dominated region.",
                        "The black line shows that the region targeted with Hubble has an extreme UV signature."],
                 "datainfluenced":["The simulated Hubble data is plotted in Figure 4.",
                       "Compared to the HST observations in Someone et al., our JWST follow-up reached higher S/N.",
                       "We were able to reproduce the luminosities from Hubble using our latest models.",
                       "We overplot Hubble-observed stars from Someone et al. in Figure 3b.",
                       "We built the spectral templates using UV data in the Hubble archive.",
                       "We simulate what our future HST observations will look like to predict the S/N.",
                       "Our work here with JWST is inspired by our earlier HST study published in 2010.",
                       "We therefore use the Hubble statistics from Author et al. to guide our stellar predictions.",
                       "The stars in Figure 3 were plotted based on the HST-fitted trend line in Person et al.",
                       "The final step is to use the HST exposure tool to put our modeled images in context."],
                 "mention":["Person et al. used HST to measure the Hubble constant.",
                        "We will present new HST observations in a future work.",
                        "HST is a fantastic instrument that has revolutionized our view of space.",
                        "The Hubble Space Telescope (HST) has its mission center at the STScI.",
                        "We can use HST to power a variety of science in the ultraviolet regime.",
                        "It is not clear when the star will be observable with HST.",
                        "More data can be found and downloaded from the Hubble archive.",
                        "We note that HST can be used to observe the stars as well, at higher S/N.",
                        "However, we ended up using the JWST rather than HST observations in this work.",
                        "We push the analysis of the Hubble component of the dataset to a future study.",
                        "We expect the HST observations to be released in the fall.",
                        "We look forward to any follow-up studies with, e.g., the Hubble Telescope."]}
#
#Convert into a dictionary keyed by a running index, where each entry holds: text, id, and missions (mission name and class)
i_track = 0
dict_texts = {}
mission = "HST" #Store subheadings by mission, to avoid duplicating and processing the same text across different missions
for key in dict_texts_raw:
    curr_set = dict_texts_raw[key]
    for ii in range(0, len(curr_set)):
        dict_texts[str(i_track)] = {"text":curr_set[ii], "id":"{0}_{1}".format(key, ii),
                                    "missions":{"mission":mission, "class":key}}
        i_track += 1
#
print("Number of texts in text set: {0}".format(len(dict_texts)))
print("")
for key in dict_texts:
    print(dict_texts[key])
    print("-")
#

Number of texts in text set: 33

{'text': 'We present HST observations in Figure 4.', 'id': 'science_0', 'missions': {'mission': 'HST', 'class': 'science'}}
-
{'text': 'The HST stars are listed in Table 3b.', 'id': 'science_1', 'missions': {'mission': 'HST', 'class': 'science'}}
-
{'text': 'Despite our efforts to smooth the data, there are still rings in the HST images.', 'id': 'science_2', 'missions': {'mission': 'HST', 'class': 'science'}}
-
{'text': 'See Section 8c for more discussion of the Hubble images.', 'id': 'science_3', 'missions': {'mission': 'HST', 'class': 'science'}}
-
{'text': 'The supernovae detected with HST tend to be brighter than initially predicted.', 'id': 'science_4', 'missions': {'mission': 'HST', 'class': 'science'}}
-
{'text': 'Our spectra from HST fit well to the standard trend first published in Someone et al. 1990.', 'id': 'science_5', 'missions': {'mission': 'HST', 'class': 'science'}}
-
{'text': 'We use the Hubble Space Telescope to build an ultraviolet d
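Before evaluating, it can help to check how many texts fall into each class, since a strongly unbalanced test set affects how the confusion matrices should be read. The sketch below counts classes for entries laid out like the ones printed above (`entry["missions"]["class"]`). The helper name `count_classes` is hypothetical, and only three of the example texts are reproduced here.

```python
from collections import Counter

def count_classes(dict_texts):
    """Tally how many texts carry each class label."""
    return Counter(entry["missions"]["class"] for entry in dict_texts.values())

# A small excerpt with the same layout as the entries printed above
example_texts = {
    "0": {"text": "We present HST observations in Figure 4.", "id": "science_0",
          "missions": {"mission": "HST", "class": "science"}},
    "1": {"text": "The HST stars are listed in Table 3b.", "id": "science_1",
          "missions": {"mission": "HST", "class": "science"}},
    "2": {"text": "We will present new HST observations in a future work.", "id": "mention_1",
          "missions": {"mission": "HST", "class": "mention"}},
}

class_counts = count_classes(example_texts)
for name, count in sorted(class_counts.items()):
    print("{0}: {1}".format(name, count))
```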

Next, let's prepare some additional information for each of these classifiers.  We'll need to set, for example, the uncertainty thresholds for accepting or rejecting each classifier's output.

In [8]:
#Set parameters for each operator and its internal classifier
#Global parameters
do_verify_truematch = False #Off for now, turn on later
do_raise_innererror = False

#For operator 1
mapper_1 = None #Mapper to mask classifications; None if no masking
dict_texts_1 = dict_texts #Dictionary of texts to classify
threshold_1 = 0.70 #Uncertainty threshold for this classifier
buffer_1 = 0 #None since text already preprocessed

#For operator 2
mapper_2 = None #Mapper to mask classifications; None if no masking
dict_texts_2 = dict_texts #Dictionary of texts to classify
threshold_2 = 0.70 #Uncertainty threshold for this classifier
buffer_2 = 0 #None since text already preprocessed

#Gather parameters into lists
list_mappers = [mapper_1, mapper_2]
list_thresholds = [threshold_1, threshold_2]
list_dict_texts = [dict_texts_1, dict_texts_2]
list_buffers = [buffer_1, buffer_2]
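To make the role of a threshold concrete, the sketch below shows one common way to use it: accept the most probable class only if its probability reaches the threshold, and otherwise flag the text as uncertain. How bibcat applies its thresholds internally is not shown in this tutorial, so the function `apply_threshold`, the `"uncertain"` label, and the probabilities are assumptions made purely for illustration.

```python
def apply_threshold(class_probs, threshold):
    """Return the most probable class if it meets the threshold; otherwise flag as uncertain."""
    best_class = max(class_probs, key=class_probs.get)
    if class_probs[best_class] >= threshold:
        return best_class
    return "uncertain"  # Hypothetical label for rejected classifications

# Hypothetical classifier outputs: one probability per class for each text
example_probs = [
    {"science": 0.85, "datainfluenced": 0.10, "mention": 0.05},
    {"science": 0.40, "datainfluenced": 0.35, "mention": 0.25},
    {"science": 0.15, "datainfluenced": 0.10, "mention": 0.75},
]

verdicts = [apply_threshold(probs, threshold=0.70) for probs in example_probs]
print(verdicts)
```

A higher threshold rejects more classifications and leaves fewer, more confident ones. A lower threshold does the reverse.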

Now, let's evaluate the performance of these classifiers.

In [9]:
#Create an instance of the Performance class
performer = bibcat.Performance()

In [10]:
#Run the pipeline for a basic evaluation of model performance
performer.evaluate_performance_basic(operators=list_operators, dicts_texts=list_dict_texts, mappers=list_mappers,
                                     thresholds=list_thresholds, buffers=list_buffers, is_text_processed=is_text_processed,
                                     do_verify_truematch=do_verify_truematch, do_raise_innererror=do_raise_innererror,
                                     do_save_evaluation=True, do_save_misclassif=True, filepath_output=filepath_output,
                                     fileroot_evaluation=fileroot_evaluation, fileroot_misclassif=fileroot_misclassif,
                                     print_freq=25, do_verbose=True, do_verbose_deep=False)


> Running evaluate_performance_basic()!
Generating evaluation for the given operators...

> Running _generate_evaluation()!
Iterating through Operators to classify each set of text...
Classifying with Operator #0...

> Running classify_set()!
> Running _fetch_keyword_object() for lookup term Hubble.
Classification for text #1 of 33 complete...
> Running _fetch_keyword_object() for lookup term Hubble.
> Running _fetch_keyword_object() for lookup term Hubble.
> Running _fetch_keyword_object() for lookup term Hubble.
> Running _fetch_keyword_object() for lookup term Hubble.
> Running _fetch_keyword_object() for lookup term Hubble.
> Running _fetch_keyword_object() for lookup term Hubble.
> Running _fetch_keyword_object() for lookup term Hubble.
> Running _fetch_keyword_object() for lookup term Hubble.
> Running _fetch_keyword_object() for lookup term Hubble.
> Running _fetch_keyword_object() for lookup term Hubble.
> Running _fetch_keyword_object() for lookup term Hubble.
> Running _fetc

KeyError: 'class'

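Once this call completes successfully, the confusion matrices summarizing the basic performance of these classifiers are saved in the requested output directory (`filepath_output`). Note that the run captured above stopped with `KeyError: 'class'` and so did not reach that point. The error means that a dictionary lookup for the key `'class'` failed during the evaluation.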
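The evaluation call above also requests that information on misclassified texts be saved (`do_save_misclassif=True`). The file format bibcat writes is not shown in this tutorial, so the sketch below only illustrates the general idea of collecting such records. The function `collect_misclassified`, the flat `"class"` field, and the predictions are hypothetical.

```python
import json

def collect_misclassified(dict_texts, predictions):
    """Return one record per text whose predicted class differs from its labeled class."""
    records = []
    for key, entry in dict_texts.items():
        predicted = predictions[key]
        if predicted != entry["class"]:
            records.append({"id": entry["id"], "text": entry["text"],
                            "true_class": entry["class"], "predicted_class": predicted})
    return records

# Hypothetical labeled texts and classifier predictions, keyed the same way
texts = {
    "0": {"id": "science_0", "text": "We present HST observations in Figure 4.",
          "class": "science"},
    "1": {"id": "mention_1", "text": "We will present new HST observations in a future work.",
          "class": "mention"},
}
predictions = {"0": "science", "1": "science"}

misclassified = collect_misclassified(texts, predictions)
print(json.dumps(misclassified, indent=2))
```

Reading the misclassified sentences alongside the confusion matrices often shows why a classifier fails, for example when future-tense statements are mistaken for reported observations.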

---