# Bibliography Categorization: 'BibCat'
## Tutorial: Machine learning (ML) models in bibcat.



---


## Introduction.

In this tutorial, we will use bibcat to train a machine learning (ML) model on some raw input text.


---

## User Workflow: Training a machine learning (ML) model.


The `Operator` class provides a user-friendly method, `train_model_ML`, that runs the full workflow for training an ML model, from the raw input text through to saving the trained model. The code cells below walk through how to run this method.

This tutorial can be run on one of two sets of data: 1) some short, made-up text for a quick run of the code, or 2) a database of text imported from an external file of the user's choosing. The former is useful for getting a quick sense of how the code works. The latter is useful for building an actual model, but will take much longer for larger databases of text.

In [1]:
#Import external packages
import os
import sys
import json
sys.path.append("./../main/")
#
#Import bibcat packages
import bibcat_classes as bibcat
import bibcat_config as config
import bibcat_constants as preset
#



In [2]:
#Which data would you like to train the ML model on?  Set exactly one of the booleans below to True.
do_quick_run = True #This will train the ML model on short bits of text. Runs pretty quickly.
do_real_run = False #This will train the ML model on external text. Will take longer for larger databases.
#

In [3]:
#The rest of these parameters can be left as-is for a first run-through.
#
num_papers = 500 #None, or an integer; if an integer, will truncate external .json text dataset to this size
#Set num_papers=None to use all available papers in external dataset
#Note: If set to integer, final paper count might be a little more than target num_papers given
#
allowed_classifications = config.allowed_classifications #For external data; classifications to include
#

#Fetch filepaths for model and data
name_model = config.name_model
filepath_json = config.path_json
dir_model = os.path.join(config.dir_allmodels, name_model)
#
#Set values for generating ML model
do_reuse_run = True #Whether or not to reuse any existing output from previous training+validation+testing runs
do_shuffle = True #Whether or not to shuffle contents of training vs validation vs testing datasets
fraction_TVT = [0.8, 0.1, 0.1] #Fractional breakdown of training vs validation vs testing datasets
#
mode_TVT = "uniform" #Either "uniform" or "available"
#"uniform" = all training datasets will have the same number of entries
#"available" = all training datasets will use full fraction (from fraction_TVT) of data available
#
seed_TVT = 10 #Random seed for generating training vs validation vs testing datasets
seed_ML = 8 #Random seed for ML model
mode_modif = "skim_trim_anon" #Mode to use for processing and generating modifs from input raw text
#NOTE: See other modif modes in workflow tutorial
buffer = 0
#
#Prepare some Keyword objects
kobj_hubble = bibcat.Keyword(
                keywords=["Hubble", "Hubble Telescope",
                          "Hubble Space Telescope"],
                acronyms=["hst", "ht"])
all_kobjs = [kobj_hubble]
#
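
To make the `fraction_TVT` and `mode_TVT` settings above more concrete, here is a minimal sketch of how the per-class training, validation, and testing counts could be derived. It is not bibcat's implementation: `partition_counts` is a hypothetical helper, and its rules (size the training and validation sets from the smallest class in "uniform" mode, and send every leftover entry to the testing set) are assumptions inferred from the partition printed in the training log later in this tutorial. The "available" branch is a guess based only on the comment above.

```python
def partition_counts(class_counts, fractions, mode="uniform"):
    """Return {class: [n_train, n_validate, n_test]} (hypothetical helper)."""
    if mode not in ("uniform", "available"):
        raise ValueError("Unrecognized mode: {0}".format(mode))
    n_min = min(class_counts.values())
    result = {}
    for name, n_total in class_counts.items():
        # "uniform": size train/validate from the smallest class.
        # "available": size them from this class's own count (assumed).
        n_base = n_min if mode == "uniform" else n_total
        n_train = int(fractions[0] * n_base)
        n_valid = int(fractions[1] * n_base)
        # Assumption: whatever is left over goes to the testing set.
        result[name] = [n_train, n_valid, n_total - n_train - n_valid]
    return result


# Class counts of the quick-run dataset used in this tutorial
counts = {"mention": 12, "science": 11, "datainfluenced": 10}
partition = partition_counts(counts, [0.8, 0.1, 0.1], mode="uniform")
for name, split in partition.items():
    print("{0}: {1}".format(name, split))
```

With the quick-run class counts, this sketch gives each class 8 training entries and 1 validation entry, with the remaining entries per class assigned to testing.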

In [4]:
#Initialize an empty ML classifier
classifier_ML = bibcat.Classifier_ML(filepath_model=None, fileloc_ML=None,
                                    class_names=None, do_verbose=True)
#

In [5]:
#Initialize an Operator
tabby_ML = bibcat.Operator(classifier=classifier_ML, mode=mode_modif, keyword_objs=all_kobjs,
                           do_verbose=True, load_check_truematch=False, do_verbose_deep=False)
#

Instance of Operator successfully initialized!
Keyword objects:
0: Keyword Object:
Name: Hubble
Keywords: ['Hubble Space Telescope', 'Hubble Telescope', 'Hubble']
Acronyms: ['hst', 'ht']



In [6]:
#Set up data for the quick example case. But in reality, ML models should be trained on MUCH more data than this!!!
if do_quick_run:
    #Make some fake data
    dict_texts_raw = {"science":["We present HST observations in Figure 4.",
                            "The HST stars are listed in Table 3b.",
                            "Despite our efforts to smooth the data, there are still rings in the HST images.",
                            "See Section 8c for more discussion of the Hubble images.",
                            "The supernovae detected with HST tend to be brighter than initially predicted.",
                            "Our spectra from HST fit well to the standard trend first published in Someone et al. 1990.",
                            "We use the Hubble Space Telescope to build an ultraviolet database of the target stars.",
                            "The blue points (HST) exhibit more scatter than the red points (JWST).",
                            "The benefit, then, is the far higher S/N we achieved in our HST observations.",
                            "Here we employ the Hubble Telescope to observe the edge of the photon-dominated region.",
                            "The black line shows that the region targeted with Hubble has an extreme UV signature."],
                     "datainfluenced":["The simulated Hubble data is plotted in Figure 4.",
                           "Compared to the HST observations in Someone et al., our JWST follow-up reached higher S/N.",
                           "We were able to reproduce the luminosities from Hubble using our latest models.",
                           "We overplot Hubble-observed stars from Someone et al. in Figure 3b.",
                           "We built the spectral templates using UV data in the Hubble archive.",
                           "We simulate what our future HST observations will look like to predict the S/N.",
                           "Our work here with JWST is inspired by our earlier HST study published in 2010.",
                           "We therefore use the Hubble statistics from Author et al. to guide our stellar predictions.",
                           "The stars in Figure 3 were plotted based on the HST-fitted trend line in Person et al.",
                           "The final step is to use the HST exposure tool to put our modeled images in context."],
                     "mention":["Person et al. used HST to measure the Hubble constant.",
                            "We will present new HST observations in a future work.",
                            "HST is a fantastic instrument that has revolutionized our view of space.",
                            "The Hubble Space Telescope (HST) has its mission center at the STScI.",
                            "We can use HST to power a variety of science in the ultraviolet regime.",
                            "It is not clear when the star will be observable with HST.",
                            "More data can be found and downloaded from the Hubble archive.",
                            "We note that HST can be used to observe the stars as well, at higher S/N.",
                            "However, we ended up using the JWST rather than HST observations in this work.",
                            "We push the analysis of the Hubble component of the dataset to a future study.",
                            "We expect the HST observations to be released in the fall.",
                            "We look forward to any follow-up studies with, e.g., the Hubble Telescope."]}
    #
    #Convert into dictionary with: key:text,class,id,mission structure
    i_track = 0
    dict_texts = {}
    for key in dict_texts_raw:
        curr_set = dict_texts_raw[key]
        for ii in range(0, len(curr_set)):
            dict_texts[str(i_track)] = {"text":curr_set[ii], "class":key, "id":"{0}_{1}".format(key, ii),
                                       "mission":"HST"}
            i_track += 1
#
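
The conversion in the cell above turns a dictionary of texts grouped by class into a flat dictionary with one entry per text. The sketch below repeats that conversion on two sentences so the resulting structure is easy to inspect. `flatten_texts` and `dict_texts_small` are hypothetical names used only for this illustration; the entry keys (`text`, `class`, `id`, `mission`) are the ones used in the cell above.

```python
def flatten_texts(texts_by_class, mission):
    """Flatten {class: [texts]} into {"0": {...}, "1": {...}, ...}."""
    flat = {}
    for class_name, texts in texts_by_class.items():
        for ii, text in enumerate(texts):
            # The running entry count serves as the (string) key
            flat[str(len(flat))] = {"text": text, "class": class_name,
                                    "id": "{0}_{1}".format(class_name, ii),
                                    "mission": mission}
    return flat


sample = {"science": ["We present HST observations in Figure 4."],
          "mention": ["We will present new HST observations in a future work."]}
dict_texts_small = flatten_texts(sample, mission="HST")
for key, entry in dict_texts_small.items():
    print(key, entry["id"], entry["class"], entry["mission"])
```

Each entry keeps its class label next to its text, and the `id` records the position of the text within its class.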

In [7]:
#Set up data for the external data case.
if do_real_run:
    #Load the original data
    with open(filepath_json, 'r') as openfile:
        dataset = json.load(openfile)
    #
    #Organize a new version of the data with: key:text,class,id,mission structure
    i_track = 0 #Track number of papers kept from original dataset
    dict_texts = {}
    for ii in range(0, len(dataset)):
        #Extract mission classifications for current text
        curr_data = dataset[ii]
        #
        #Skip if no valid text at all for this text
        if ("body" not in curr_data):
            continue
        #
        #Skip if no valid missions at all for this text
        if ("class_missions" not in curr_data):
            continue
        #
        #Otherwise, extract the missions
        curr_missions = curr_data["class_missions"]
        #
        
        #Iterate through missions for this text
        i_mission = 0
        for curr_key in curr_missions:
            #If this is not an allowed mission, skip
            if (curr_missions[curr_key]["papertype"] not in allowed_classifications):
                continue
            #
            #Otherwise, check if this mission is a target mission
            fetched_kobj = tabby_ML._fetch_keyword_object(lookup=curr_key,
                                                          do_verbose=False, do_raise_emptyerror=False)
            #Skip if not a target
            if (fetched_kobj is None):
                continue
            #
            #Otherwise, store classification info for this entry
            curr_class = curr_missions[curr_key]["papertype"]
            new_dict = {"text":curr_data["body"], #Text for this paper
                        "class":curr_class, #Classification for this mission
                        "mission":curr_key, #The mission itself
                        "id":("paper{0}_mission{1}_{2}_{3}".format(ii, i_mission,
                                                                   curr_key, curr_class)) #ID for this entry
                       }
            dict_texts[str(i_track)] = new_dict
            #
            #Increment counters
            i_mission += 1 #Count of kept missions for this paper
            i_track += 1 #Count of kept classifications overall
        #

        #Terminate early if requested number of papers reached
        if ((num_papers is not None) and (i_track >= num_papers)):
            break
    #
#
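
The cell above implies a particular layout for the external .json file: a list of papers, each with a `body` string and a `class_missions` dictionary that maps a mission name to a dictionary containing `papertype`. The sketch below shows a made-up file with that layout and applies the same skip rules. Only the key names come from the cell above; the records are invented, the `id` format is simplified, and the set `target_missions` is a stand-in for the `Keyword` lookup done through the `Operator`.

```python
import json

# Made-up records; only the key names are taken from the tutorial cell.
sample_json = """
[
  {"body": "We present new HST observations of Taurus.",
   "class_missions": {"HST": {"papertype": "science"},
                      "JWST": {"papertype": "mention"}}},
  {"body": "A paper with no mission classifications."},
  {"class_missions": {"HST": {"papertype": "mention"}}}
]
"""
dataset = json.loads(sample_json)
target_missions = {"HST"}         # stand-in for the Keyword lookup
allowed = {"science", "mention"}  # stand-in for config.allowed_classifications

kept = {}
for ii, paper in enumerate(dataset):
    # Skip papers that lack either the text or the classifications
    if ("body" not in paper) or ("class_missions" not in paper):
        continue
    for mission, info in paper["class_missions"].items():
        # Keep only allowed classifications for target missions
        if (info["papertype"] in allowed) and (mission in target_missions):
            kept[str(len(kept))] = {"text": paper["body"],
                                    "class": info["papertype"],
                                    "mission": mission,
                                    "id": "paper{0}_{1}".format(ii, mission)}

print(json.dumps(kept, indent=2))
```

Here only the first paper contributes an entry: the second has no `class_missions`, the third has no `body`, and the JWST classification of the first paper is not a target mission.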

In [8]:
#Throw error if not enough text entries collected
if do_real_run:
    if ((num_papers is not None) and (len(dict_texts) < num_papers)):
        raise ValueError("Err: Something went wrong during initial processing. Insufficient number of texts extracted."
                        +"\nRequested number of texts: {0}\nActual number of texts: {1}"
                        .format(num_papers, len(dict_texts)))
#

In [9]:
#Print a snippet of each entry in the dataset. To disable this block, remove the leading '#' from both triple-quote lines.
#"""
for curr_key in dict_texts:
    print("Text #{0}:".format(curr_key))
    print("Classification: {0}".format(dict_texts[curr_key]["class"]))
    print("Mission: {0}".format(dict_texts[curr_key]["mission"]))
    print("ID: {0}".format(dict_texts[curr_key]["id"]))
    print("Text snippet:")
    print(dict_texts[curr_key]["text"][0:500])
    print("---\n\n")
#"""

Text #0:
Classification: science
Mission: HST
ID: science_0
Text snippet:
We present HST observations in Figure 4.
---


Text #1:
Classification: science
Mission: HST
ID: science_1
Text snippet:
The HST stars are listed in Table 3b.
---


Text #2:
Classification: science
Mission: HST
ID: science_2
Text snippet:
Despite our efforts to smooth the data, there are still rings in the HST images.
---


Text #3:
Classification: science
Mission: HST
ID: science_3
Text snippet:
See Section 8c for more discussion of the Hubble images.
---


Text #4:
Classification: science
Mission: HST
ID: science_4
Text snippet:
The supernovae detected with HST tend to be brighter than initially predicted.
---


Text #5:
Classification: science
Mission: HST
ID: science_5
Text snippet:
Our spectra from HST fit well to the standard trend first published in Someone et al. 1990.
---


Text #6:
Classification: science
Mission: HST
ID: science_6
Text snippet:
We use the Hubble Space Telescope to build an ultraviolet 

In [10]:
#Print number of texts that fell under given parameters
print("Target missions:")
for curr_kobj in all_kobjs:
    print(curr_kobj)
    print("")
print("")
print("Number of valid text entries:")
print(len(dict_texts))

Target missions:
Keyword Object:
Name: Hubble
Keywords: ['Hubble Space Telescope', 'Hubble Telescope', 'Hubble']
Acronyms: ['hst', 'ht']



Number of valid text entries:
33


In [11]:
#Use the Operator instance to train an ML model
tabby_ML.train_model_ML(dir_model=dir_model, name_model=name_model, do_reuse_run=do_reuse_run,
                        seed_ML=seed_ML, seed_TVT=seed_TVT, filename_json=None, dict_texts=dict_texts,
                        buffer=buffer, fraction_TVT=fraction_TVT, mode_TVT=mode_TVT, do_shuffle=do_shuffle,
                        do_verbose=True, do_verbose_deep=None)
#


> Running train_model_ML()!
Loading data from given .json file or dict. of texts...
Text data has been loaded.
Processing text data into modifs...
25 of 33 total texts have been processed...
Text data has been processed into modifs.
Storing the data in train+validate+test directories...

> Running generate_directory_TVT().
Random seed set to: 10

Class breakdown of given dataset:
Counter({'mention': 12, 'science': 11, 'datainfluenced': 10})

Fractions given for TVT split: [0.8 0.1 0.1]
Mode requested: uniform
TVT partition per class:
science: [8 1 2]
datainfluenced: [8 1 1]
mention: [8 1 3]

Indices split per class, per TVT. Shuffling=True.
Created new directories for TVT files.
Stored in: /Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/scratchwork/test_model_2023_08_25a
Files saved to new TVT directories.
dict_keys(['i_clausechain', 'i_clausetrail', 'word', 'wordchunk', 'is_important', 'dict_importance', 'is_useless', 'pos_main'])


ValueError: Object arrays cannot be saved when allow_pickle=False

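And with that, we're done training a new ML model! When the run completes successfully, the model is saved in the `dir_model` directory. Note that the run recorded above stopped with a `ValueError`, so its output is incomplete.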

We can then use the brand new model to classify some new text, like so:

In [None]:
#Set path to the new model output
filepath_model = os.path.join(dir_model, (name_model+".npy"))
fileloc_ML = os.path.join(dir_model, (preset.tfoutput_prefix+name_model))
#Load the new ML model into a new Classifier_ML instance
classifier_ML = bibcat.Classifier_ML(filepath_model=filepath_model, fileloc_ML=fileloc_ML,
                                    do_verbose=True)
#
#Load the instance into a new Operator
tabby_ML = bibcat.Operator(classifier=classifier_ML, mode=mode_modif, keyword_objs=all_kobjs,
                           do_verbose=True, load_check_truematch=True, do_verbose_deep=False)
#

In [None]:
#Run the classifier for some sample text below
lookup = "HST"
text = "In this study, we present our HST observations of stars in the star-forming region Taurus."
threshold = 0.8
#
#Run the classifier
result = tabby_ML.classify(text=text, lookup=lookup, buffer=0, threshold=threshold,
                            do_raise_innererror=False, do_check_truematch=True)
#

In [None]:
#Print the classifier results
print("Modif: {2}\n\nClassification: {0}\n\nUncertainties per class: {1}\n"
      .format(result["verdict"], result["uncertainty"], result["modif"]))
print("Full classification output:\n{0}".format(result))

---