# Bibliography Categorization: 'BibCat'
## Tutorial: Machine learning (ML) models in bibcat.



---


## Introduction.

In this tutorial, we will use bibcat to train a machine learning (ML) model on some raw input text.


---

## User Workflow: Training a machine learning (ML) model.


The `Operator` class provides a user-friendly method, `train_model_ML`, that runs the full workflow for training an ML model, from the raw input text through to saving the trained model.  The code blocks below walk through how to run this method.

This tutorial supports two sets of data: either 1) some short, made-up text for a quick run of the code, or 2) a database of text imported from an external file of the user's choosing. The former is useful for getting a quick sense of how the code works. The latter is useful for building an actual model, but will take much longer for larger databases of text!

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
#Import external packages
import os
import time
import json
import numpy as np

In [3]:
# Set up for fetching necessary bibcat modules for the tutorial
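# Check working directories: src/ is where all of the source Python scripts live.
# Note: '__file__' is deliberately a quoted string, since notebooks do not define __file__.
# The expression below therefore resolves to the current working directory.
current_dir = os.path.dirname(os.path.abspath('__file__'))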
_parent = os.path.dirname(current_dir)
src_dir = os.path.join(_parent, "src")

print(f'Current Directory: {current_dir}')
print(f'Source directory: {src_dir}')

# move to the ../src/ directory to import necessary modules. 
os.chdir(src_dir)

Current Directory: /Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/repo_stsci/bibcat/docs
Source directory: /Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/repo_stsci/bibcat/src


In [4]:
#Import bibcat packages
import bibcat_classes as bibcat
import bibcat_config as config
import bibcat_parameters as params #Temporary file until contents moved elsewhere

Root directory =/Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/repo_stsci/bibcat/src, parent directory=/Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/repo_stsci/bibcat
/Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/repo_stsci/bibcat/src/models folder already exists.
/Users/jamila.pegues/Documents/STScI_Fellowship/Functional/Library/BibTracking/repo_stsci/bibcat/output folder already exists.


In [5]:
#Which data would you like to run the ML model on?  Choose from the booleans below.
do_quick_run = False #This will train the ML model on short bits of text. Runs pretty quickly.
do_real_run = True #This will train the ML model on external text. Will take longer for larger databases.
#

In [6]:
#The rest of these parameters can be left as-is for a first run-through.
#
do_check_truematch = True #A very important parameter - discuss with J.P. first!!!  Set it to either True or False.
#If any papers in the dataset contain ambiguous phrases that the codebase does not recognize...
#...then a note will be printed and those papers will not be used for training-validation-testing.
#...Add the identified ambiguous phrase to the external ambiguous phrase database and rerun to include those papers.
#
num_papers = None #500 #None, or an integer; if an integer, will truncate external .json text dataset to this size
#Set num_papers=None to use all available papers in external dataset
#Note: If set to integer, final paper count might be a little more than target num_papers given
#
allowed_classifications = params.allowed_classifications #For external data; classifications to include
mapper = params.map_papertypes #For masking of classes (e.g., masking 'supermention' as 'mention')
#

#Fetch filepaths for model and data
name_model = config.name_model
filepath_json = config.path_json
dir_model = os.path.join(config.dir_allmodels, name_model)
filesave_error = os.path.join(dir_model,
                              "{0}_processing_errors.txt".format(name_model)) #Where to save processing errors
filesave_unused_bibcodes = os.path.join(dir_model,
                              "{0}_bibcodes_unused_during_trainML.npy".format(name_model)) #Where to save unused bibcodes
#
#Set values for generating ML model
do_reuse_run = True #Whether or not to reuse any existing output from previous training+validation+testing runs
do_shuffle = True #Whether or not to shuffle contents of training vs validation vs testing datasets
fraction_TVT = [0.8, 0.1, 0.1] #Fractional breakdown of training vs validation vs testing datasets
#
mode_TVT = "uniform" #Options: "uniform" or "available"
#"uniform" = all training datasets will have the same number of entries
#"available" = all training datasets will use full fraction (from fraction_TVT) of data available
#
seed_TVT = 10 #Random seed for generating training vs validation vs testing datasets
seed_ML = 8 #Random seed for ML model
mode_modif = "none" #Mode to use for processing and generating modifs from input raw text; other options include "skim_anon" and "skim_trim_anon"
#NOTE: See other modif modes in workflow tutorial
buffer = 0
#
all_kobjs = params.all_kobjs
#
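
To make the `fraction_TVT`, `seed_TVT`, and `do_shuffle` settings above more concrete, here is a minimal sketch of a seeded training/validation/testing split. This is not bibcat's internal implementation: the function name `split_tvt` is hypothetical, and the sketch does not model the `mode_TVT` options. It only illustrates how a fixed seed makes a fractional split reproducible.

```python
import random

def split_tvt(keys, fractions=(0.8, 0.1, 0.1), seed=10, do_shuffle=True):
    """Split keys into training, validation, and testing subsets (illustrative only)."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("Fractions must sum to 1.")
    keys = list(keys)
    if do_shuffle:
        #A dedicated, seeded generator keeps the split reproducible across runs
        random.Random(seed).shuffle(keys)
    num_train = int(round(fractions[0] * len(keys)))
    num_valid = int(round(fractions[1] * len(keys)))
    return {
        "training": keys[:num_train],
        "validation": keys[num_train:(num_train + num_valid)],
        "testing": keys[(num_train + num_valid):],  #Whatever remains goes to testing
    }

example_split = split_tvt([str(ii) for ii in range(20)])
print({name: len(subset) for name, subset in example_split.items()})
```

With 20 keys and the default fractions, the subsets hold 16, 2, and 2 keys, and rerunning with the same seed yields the same assignment.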

In [7]:
#Initialize an empty ML classifier
classifier_ML = bibcat.Classifier_ML(filepath_model=None, fileloc_ML=None, do_verbose=True)
#

In [8]:
#Initialize an Operator
tabby_ML = bibcat.Operator(classifier=classifier_ML, mode=mode_modif, keyword_objs=all_kobjs,
                           do_verbose=True, load_check_truematch=do_check_truematch, do_verbose_deep=False)
#

Instance of Operator successfully initialized!
Keyword objects:
0: Keyword Object:
Name: Hubble
Keywords: ['Hubble Space Telescope', 'Hubble Telescope', 'Hubble']
Acronyms: ['HST', 'HT']
Banned Overlap: ['Hubble Legacy Archive']

1: Keyword Object:
Name: Webb Telescope
Keywords: ['James Webb Space Telescope', 'Webb Space Telescope', 'James Webb Telescope', 'Webb Telescope']
Acronyms: ['JWST', 'JST', 'JT']
Banned Overlap: []

2: Keyword Object:
Name: Transiting Exoplanet Survey Satellite
Keywords: ['Transiting Exoplanet Survey Satellite']
Acronyms: ['TESS']
Banned Overlap: []

3: Keyword Object:
Name: Kepler
Keywords: ['Kepler']
Acronyms: []
Banned Overlap: []

4: Keyword Object:
Name: Pan-STARRS
Keywords: ['Panoramic Survey Telescope and Rapid Response System', 'Pan-STARRS1', 'Pan-STARRS']
Acronyms: ['PanSTARRS1', 'PanSTARRS', 'PS1']
Banned Overlap: []

5: Keyword Object:
Name: Galaxy Evolution Explorer
Keywords: ['Galaxy Evolution Explorer']
Acronyms: ['GALEX']
Banned Overlap: []

6: 

In [9]:
#Set up data for the quick example case. But in reality, ML models should be trained on MUCH more data than this!!!
if do_quick_run:
    #Make some fake data
    dict_texts_raw = {"science":["We present HST observations in Figure 4.",
                            "The HST stars are listed in Table 3b.",
                            "Despite our efforts to smooth the data, there are still rings in the HST images.",
                            "See Section 8c for more discussion of the Hubble images.",
                            "The supernovae detected with HST tend to be brighter than initially predicted.",
                            "Our spectra from HST fit well to the standard trend first published in Someone et al. 1990.",
                            "We use the Hubble Space Telescope to build an ultraviolet database of the target stars.",
                            "The blue points (HST) exhibit more scatter than the red points (JWST).",
                            "The benefit, then, is the far higher S/N we achieved in our HST observations.",
                            "Here we employ the Hubble Telescope to observe the edge of the photon-dominated region.",
                            "The black line shows that the region targeted with Hubble has an extreme UV signature."],
                     "datainfluenced":["The simulated Hubble data is plotted in Figure 4.",
                           "Compared to the HST observations in Someone et al., our JWST follow-up reached higher S/N.",
                           "We were able to reproduce the luminosities from Hubble using our latest models.",
                           "We overplot Hubble-observed stars from Someone et al. in Figure 3b.",
                           "We built the spectral templates using UV data in the Hubble archive.",
                           "We simulate what our future HST observations will look like to predict the S/N.",
                           "Our work here with JWST is inspired by our earlier HST study published in 2010.",
                           "We therefore use the Hubble statistics from Author et al. to guide our stellar predictions.",
                           "The stars in Figure 3 were plotted based on the HST-fitted trend line in Person et al.",
                           "The final step is to use the HST exposure tool to put our modeled images in context."],
                     "mention":["Person et al. used HST to measure the Hubble constant.",
                            "We will present new HST observations in a future work.",
                            "HST is a fantastic instrument that has revolutionized our view of space.",
                            "The Hubble Space Telescope (HST) has its mission center at the STScI.",
                            "We can use HST to power a variety of science in the ultraviolet regime.",
                            "It is not clear when the star will be observable with HST.",
                            "More data can be found and downloaded from the Hubble archive.",
                            "We note that HST can be used to observe the stars as well, at higher S/N.",
                            "However, we ended up using the JWST rather than HST observations in this work.",
                            "We push the analysis of the Hubble component of the dataset to a future study.",
                            "We expect the HST observations to be released in the fall.",
                            "We look forward to any follow-up studies with, e.g., the Hubble Telescope."]}
    #
    #Convert into dictionary with: key:text,class,id,mission structure
    i_track = 0
    dict_texts = {}
    for key in dict_texts_raw:
        curr_set = dict_texts_raw[key]
        for ii in range(0, len(curr_set)):
            dict_texts[str(i_track)] = {"text":curr_set[ii], "class":key, "id":"{0}_{1}".format(key, ii),
                                       "mission":"HST", "bibcode":"{0}_{1}".format(key, ii)}
            i_track += 1
#
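
Both the quick-run and external-data cells build `dict_texts` with the same per-entry fields ("text", "class", "id", "mission", "bibcode"). Before training, it can help to confirm that every entry follows that structure. The helper below is a hypothetical sketch, not part of bibcat, and treating empty text as malformed is an assumption made here for illustration.

```python
REQUIRED_KEYS = ("text", "class", "id", "mission", "bibcode")

def find_malformed_entries(dict_texts):
    """Return the keys of entries that lack a required field or have empty text."""
    malformed = []
    for key, entry in dict_texts.items():
        missing = [name for name in REQUIRED_KEYS if name not in entry]
        has_text = bool(str(entry.get("text", "")).strip())
        if missing or (not has_text):
            malformed.append(key)
    return malformed

#Small made-up example: the second entry has empty text
example_texts = {
    "0": {"text": "We present HST observations in Figure 4.", "class": "science",
          "id": "science_0", "mission": "HST", "bibcode": "science_0"},
    "1": {"text": "", "class": "mention", "id": "mention_0", "mission": "HST",
          "bibcode": "mention_0"},
}
print(find_malformed_entries(example_texts))  # → ['1']
```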

In [10]:
#Set up data for the external data case.
if do_real_run:
    #Load the original data
    with open(filepath_json, 'r') as openfile:
        dataset = json.load(openfile)
        len(dataset)
    #
    #Initialize holder to keep track of bibcodes used (avoids duplicate dataset entries)
    list_bibcodes = []
    dict_unused_indsandbibcodes = {} #Dictionary preserves uniqueness of unused bibcodes
    #
    #Organize a new version of the data with: key:text,class,id,mission structure
    i_track = 0 #Track number of papers kept from original dataset
    dict_texts = {}
    for ii in range(0, len(dataset)):
        #Extract mission classifications for current text
        curr_data = dataset[ii]
        curr_bibcode = curr_data["bibcode"]
        #
        #Skip if no valid text at all for this text
        if ("body" not in curr_data):
            continue
        #
        #Skip if no valid missions at all for this text
        if ("class_missions" not in curr_data):
            dict_unused_indsandbibcodes[curr_bibcode] = ii
            continue
        #
        #Otherwise, extract the missions
        curr_missions = curr_data["class_missions"]
        #print(curr_bibcode)
        #print(curr_missions)
        #
        #Skip if bibcode already encountered (and so duplicate entry)
        if (curr_bibcode in list_bibcodes):
            print("Duplicate bibcode encountered: {0}. Skipping.".format(curr_bibcode))
            continue
        #
        #Iterate through missions for this text
        i_mission = 0
        for curr_key in curr_missions:
            #If this is not an allowed classification, skip
            if (curr_missions[curr_key]["papertype"] not in allowed_classifications):
                dict_unused_indsandbibcodes[curr_bibcode] = ii
                continue
            #
            #Otherwise, check if this mission is a target mission
            fetched_kobj = tabby_ML._fetch_keyword_object(lookup=curr_key,
                                                          do_verbose=False, do_raise_emptyerror=False)
            #Skip if not a target mission
            if (fetched_kobj is None):
                dict_unused_indsandbibcodes[curr_bibcode] = ii
                continue
            #
            #Otherwise, store classification info for this entry
            curr_class = curr_missions[curr_key]["papertype"]
            new_dict = {"text":curr_data["body"], #Text for this paper
                        "bibcode":curr_data["bibcode"], #Bibcode for this paper
                        "class":curr_class, #Classification for this mission
                        "mission":curr_key, #The mission itself
                        "id":("paper{0}_mission{1}_{2}_{3}".format(ii, i_mission,
                                                                   curr_key, curr_class)) #ID for this entry
                       }
            dict_texts[str(i_track)] = new_dict
            #
            #Increment counters
            i_mission += 1 #Count of kept missions for this paper
            i_track += 1 #Count of kept classifications overall
        #
        
        #Record this bibcode as stored
        list_bibcodes.append(curr_bibcode)

        #Terminate early if requested number of papers reached
        if ((num_papers is not None) and (i_track >= num_papers)):
            #Store the remaining bibcodes as unused
            dict_unused_indsandbibcodes.update({dataset[jj]["bibcode"]:jj
                                             for jj in range((ii+1), len(dataset))})
            break
    #
#

60157

Duplicate bibcode encountered: 2022MNRAS.516.5618P. Skipping.
Duplicate bibcode encountered: 2022NatAs...6..141B. Skipping.
Duplicate bibcode encountered: 2022Natur.603..815W. Skipping.
Duplicate bibcode encountered: 2022MNRAS.515.2386R. Skipping.
Duplicate bibcode encountered: 2022MNRAS.515.3336F. Skipping.
Duplicate bibcode encountered: 2022Galax..10...76S. Skipping.
Duplicate bibcode encountered: 2022MNRAS.515.2951S. Skipping.
Duplicate bibcode encountered: 2022PSJ.....3..117G. Skipping.
Duplicate bibcode encountered: 2022MNRAS.516.1573Z. Skipping.
Duplicate bibcode encountered: 2022MNRAS.515.3319R. Skipping.
Duplicate bibcode encountered: 2022MNRAS.515.4201G. Skipping.
Duplicate bibcode encountered: 2022MNRAS.516.1977G. Skipping.
Duplicate bibcode encountered: 2022MNRAS.513.3663K. Skipping.
Duplicate bibcode encountered: 2022JGRE..12706853F. Skipping.
Duplicate bibcode encountered: 2022MNRAS.515.2698A. Skipping.
Duplicate bibcode encountered: 2022ApJ...930....2M. Skipping.
Duplicat

In [11]:
#Throw error if not enough text entries collected
if do_real_run:
    if ((num_papers is not None) and (len(dict_texts) < num_papers)):
        raise ValueError("Err: Something went wrong during initial processing. Insufficient number of texts extracted."
                        +"\nRequested number of texts: {0}\nActual number of texts: {1}"
                        .format(num_papers, len(dict_texts)))
#

In [12]:
#Uncomment the code below to print a snippet of each of the entries in the dataset.
"""
print("Number of processed texts: {0}={1}\n".format(i_track, len(dict_texts)))
for curr_key in dict_texts:
    print("Text #{0}:".format(curr_key))
    print("Classification: {0}".format(dict_texts[curr_key]["class"]))
    print("Mission: {0}".format(dict_texts[curr_key]["mission"]))
    print("ID: {0}".format(dict_texts[curr_key]["id"]))
    print("Bibcode: {0}".format(dict_texts[curr_key]["bibcode"]))
    print("Text snippet:")
    print(dict_texts[curr_key]["text"][0:500])
    print("---\n\n")
#"""


In [13]:
#Print number of texts that fell under given parameters
print("Target missions:")
for curr_kobj in all_kobjs:
    print(curr_kobj)
    print("")
print("")
print("Number of valid text entries:")
print(len(dict_texts))

Target missions:
Keyword Object:
Name: Hubble
Keywords: ['Hubble Space Telescope', 'Hubble Telescope', 'Hubble']
Acronyms: ['HST', 'HT']
Banned Overlap: ['Hubble Legacy Archive']


Keyword Object:
Name: Webb Telescope
Keywords: ['James Webb Space Telescope', 'Webb Space Telescope', 'James Webb Telescope', 'Webb Telescope']
Acronyms: ['JWST', 'JST', 'JT']
Banned Overlap: []


Keyword Object:
Name: Transiting Exoplanet Survey Satellite
Keywords: ['Transiting Exoplanet Survey Satellite']
Acronyms: ['TESS']
Banned Overlap: []


Keyword Object:
Name: Kepler
Keywords: ['Kepler']
Acronyms: []
Banned Overlap: []


Keyword Object:
Name: Pan-STARRS
Keywords: ['Panoramic Survey Telescope and Rapid Response System', 'Pan-STARRS1', 'Pan-STARRS']
Acronyms: ['PanSTARRS1', 'PanSTARRS', 'PS1']
Banned Overlap: []


Keyword Object:
Name: Galaxy Evolution Explorer
Keywords: ['Galaxy Evolution Explorer']
Acronyms: ['GALEX']
Banned Overlap: []


Keyword Object:
Name: K2
Keywords: ['K2']
Acronyms: []
Banned 
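
Beyond the total number of valid entries, it is often useful to see how the entries are distributed across missions and classes, since a strongly imbalanced dataset can affect training. The sketch below counts entries per (mission, masked class) pair. It assumes the mapper is a plain dictionary from raw class to masked class, which may differ from the actual structure of `params.map_papertypes`; the example mapper and texts are made up.

```python
from collections import Counter

def count_masked_classes(dict_texts, mapper=None):
    """Count entries per (mission, masked class) pair."""
    mapper = mapper or {}  #Assumed here: plain dict of raw class -> masked class
    counts = Counter()
    for entry in dict_texts.values():
        #Classes without a mapper entry are kept unchanged
        masked_class = mapper.get(entry["class"], entry["class"])
        counts[(entry["mission"], masked_class)] += 1
    return counts

#Hypothetical mapper and entries, for illustration only
example_mapper = {"supermention": "mention"}
example_entries = {
    "0": {"mission": "HST", "class": "science"},
    "1": {"mission": "HST", "class": "supermention"},
    "2": {"mission": "HST", "class": "mention"},
    "3": {"mission": "JWST", "class": "science"},
}
example_counts = count_masked_classes(example_entries, example_mapper)
for (mission, masked_class), count in sorted(example_counts.items()):
    print("{0} / {1}: {2}".format(mission, masked_class, count))
```

In this made-up example, the "supermention" entry is counted under "mention", so HST ends up with two "mention" entries and one "science" entry.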

In [None]:
#Use the Operator instance to train an ML model
start=time.time()
str_err = tabby_ML.train_model_ML(dir_model=dir_model, name_model=name_model, do_reuse_run=do_reuse_run,
                        do_check_truematch=do_check_truematch,
                        seed_ML=seed_ML, seed_TVT=seed_TVT, dict_texts=dict_texts, mapper=mapper,
                        buffer=buffer, fraction_TVT=fraction_TVT, mode_TVT=mode_TVT, do_shuffle=do_shuffle,
                        do_verbose=True, do_verbose_deep=False)

print(f'Time to train the model = {time.time()-start} seconds.')

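#Save the output error string to a file
#Note: mode 'x' raises FileExistsError if the error file already exists (e.g., from a previous run)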
if (str_err is not None):
    with open(filesave_error, 'x') as openfile:
        openfile.write(str_err)
#

#Save the unused bibcodes to a file
if ((not do_reuse_run) or (not os.path.exists(filesave_unused_bibcodes))):
    np.save(filesave_unused_bibcodes, dict_unused_indsandbibcodes)
#


> Running train_model_ML()!
Processing text data into modifs...

-
Printing Error:
ID: paper2_mission0_HST_SCIENCE
Bibcode: 2023ApJS..265....5H
Mission: HST
Masked class: science
The following err. was encountered in train_model_ML:
NotImplementedError('Err: Unrecognized ambig. phrase:\nHST Advanced Camera\nTaken from this text snippet:\nTo date, deep-field imaging observations have reached detection limits of ≃000 mag in the wavelength range of 000.000–000.000 μ m with the HST Advanced Camera for Surveys (ACS) and the Wide Field Camera 000 (WFC3) instruments in the Hubble Ultra Deep Field (Beckwithetal 000; see Bouwensetal 000 and references therein) with the moderately deep ultraviolet (UV) extension, UVUDF 000.000–000.000 μ m; Windhorstetal 000; Teplitzetal 000.')
Error was noted. Skipping this paper.
-

-
Printing Error:
ID: paper16_mission0_HST_SCIENCE
Bibcode: 2023MNRAS.518.4755A
Mission: HST
Masked class: science
The following err. was encountered in train_model_ML:
NotImplemen

200 of 28950 total texts have been processed...

-
Printing Error:
ID: paper179_mission0_HST_SCIENCE
Bibcode: 2023MNRAS.518.2123Z
Mission: HST
Masked class: science
The following err. was encountered in train_model_ML:
NotImplementedError('Err: Unrecognized ambig. phrase:\nprogram number HST - HFnumeric - A\nTaken from this text snippet:\nFinally, PB was also partially supported through program number HST-HFnumeric-A, provided by NASA through a Hubble Fellowship grant from the Space Telescope Science Institute, under NASA contract NASnumeric.')
Error was noted. Skipping this paper.
-
225 of 28950 total texts have been processed...

-
Printing Error:
ID: paper200_mission0_HST_DATA_INFLUENCED
Bibcode: 2023MNRAS.518..456D
Mission: HST
Masked class: data_influenced
The following err. was encountered in train_model_ML:
NotImplementedError('Err: Unrecognized ambig. phrase:\nHubble ’s constant\nTaken from this text snippet:\nIn total, 000 snapshots spaced in logarithmic intervals of growth fa

And with that, we're done training a new ML model!  If run successfully, the model will be saved in the `dir_model` directory.
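
The training cell stores the unused bibcodes by passing a dictionary to `np.save`. NumPy wraps a dictionary in a zero-dimensional object array and pickles it, so reading it back requires `allow_pickle=True` followed by `.item()`. The sketch below demonstrates the round trip with a made-up dictionary and a temporary file; the filename and placeholder bibcodes are illustrative only.

```python
import os
import tempfile

import numpy as np

#Made-up stand-in for dict_unused_indsandbibcodes (placeholder bibcode -> dataset index)
example_unused = {"bibcode_A": 12, "bibcode_B": 40}

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example_bibcodes_unused.npy")
    np.save(example_path, example_unused)  #Dictionary is stored as a pickled object array
    #allow_pickle=True is required for object arrays; .item() unwraps the dictionary
    loaded_unused = np.load(example_path, allow_pickle=True).item()

print(loaded_unused == example_unused)  # → True
```

Only load pickled `.npy` files from sources you trust, since unpickling can execute arbitrary code.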

We can then use the brand new model to classify some new text, like so:

In [None]:
#Set path to the new model output
filepath_model = os.path.join(dir_model, (name_model+".npy"))
fileloc_ML = os.path.join(dir_model, (config.tfoutput_prefix+name_model))
#Load the new ML model into a new Classifier_ML instance
classifier_ML = bibcat.Classifier_ML(filepath_model=filepath_model, fileloc_ML=fileloc_ML,
                                    do_verbose=True)
#
#Load the instance into a new Operator
tabby_ML = bibcat.Operator(classifier=classifier_ML, mode=mode_modif, keyword_objs=all_kobjs,
                           do_verbose=True, load_check_truematch=True, do_verbose_deep=False)
#

In [None]:
#Run the classifier for some sample text below
lookup = "HST"
text = "In this study, we present our lovely HST observations of bright stars in the nearby star-forming region Taurus."
threshold = 0.8
#
#Run the classifier
result = tabby_ML.classify(text=text, lookup=lookup, buffer=0, #threshold=threshold,
                            do_raise_innererror=False, do_check_truematch=True)
#

In [None]:
#Print the classifier results
print("Modif: {2}\n\nClassification: {0}\n\nUncertainties per class: {1}\n"
      .format(result["verdict"], result["uncertainty"], result["modif"]))
print("Full classification output:\n{0}".format(result))

---

In [None]:
#Set end marker for this tutorial.
print("This tutorial completed successfully.")