# Bibliography Categorization: 'BibCat'
## Tutorial: High-Level Workflow.



---


## Introduction.

### What is bibcat?

In a nutshell, bibcat is a codebase meant to automate the classification of the way a 'mission', e.g. an observatory or telescope, is used.  Institutions find this useful for tracking how the community uses the data from their missions. It is useful for the Space Telescope Science Institute (STScI), for example, to track the number of published papers that use the Hubble Space Telescope (HST) for science, as opposed to, say, just mentioning HST in passing. However, a very large number of papers have been published in astronomy, with new papers appearing each day, so doing this classification by hand can be quite time-consuming.

This codebase, known as "better title pending" and affectionately nicknamed `bibcat`, was designed to automate this process: to accept text, and ultimately return the classification of how the target mission was used in that text, based on the way the text talks about that mission.

### So how does bibcat work?

In short, bibcat performs its classification in three stages:

* Extract the part of the text that refers to the target mission.
* Based on the user's specifications, streamline, simplify, and/or anonymize that subset of the text to remove as much extraneous information as possible.  This stage was implemented to reduce the amount of "noise" or "bias" contained within the text subset.
* Feed that streamlined text subset into one of the available classifier types and return a classification.
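
The three stages above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not bibcat's implementation: the helper names (`extract_paragraph`, `anonymize`, `toy_classify`), the `MISSION` placeholder, and the single cue phrase used for classification are all hypothetical.

```python
import re


def _term_pattern(terms):
    """Compile one case-insensitive regex that matches any term as a whole word."""
    ordered = sorted(terms, key=len, reverse=True)  # try the longest terms first
    joined = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + joined + r")\b", re.IGNORECASE)


def extract_paragraph(text, terms):
    """Stage 1: keep only the sentences that mention one of the mission terms."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())  # naive sentence split
    pattern = _term_pattern(terms)
    return " ".join(s for s in sentences if pattern.search(s))


def anonymize(paragraph, terms, placeholder="MISSION"):
    """Stage 2: replace mission terms with a generic (hypothetical) placeholder."""
    return _term_pattern(terms).sub(placeholder, paragraph)


def toy_classify(paragraph):
    """Stage 3: a stand-in classifier based on a single cue phrase."""
    if not paragraph:
        return None  # no mission sentences found, so nothing to classify
    return "science" if "we present" in paragraph.lower() else "mention"


demo_text = ("Stars are faint. We present new Hubble spectra. "
             "The HST data are in Section 2. Results follow.")
demo_terms = ["Hubble Space Telescope", "Hubble", "HST"]

demo_paragraph = extract_paragraph(demo_text, demo_terms)
demo_modif = anonymize(demo_paragraph, demo_terms)
print(demo_paragraph)
print(demo_modif)
print(toy_classify(demo_modif))
```

The real classifiers are far more involved; the point here is only the order of operations: extract, then streamline, then classify.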

There is quite a lot of technical detail contained within these (extremely!) high-level stages, but for now: let's visualize this process with a demonstration.

---

## User Workflow: Minimum Use-Case.


For our minimum workflow, our goals are to:
* Read in some text.
* Compile a "paragraph" from that text, containing all sentences that refer to our target mission (e.g., to HST).
* Classify that text (e.g., as "science", "mention", etc).

We'll start by importing the module and setting some global variables.

In [None]:
import os 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# Set up for fetching necessary bibcat modules for the tutorial
# Check work directories: src/ is where all source python scripts are available.
# Notebooks do not define __file__, so use the current working directory directly.
current_dir = os.getcwd()
_parent = os.path.dirname(current_dir)
src_dir = os.path.join(_parent, "src")

print(f'Current Directory: {current_dir}')
print(f'Source directory: {src_dir}')

# move to the ../src/ directory to import necessary modules. 
os.chdir(src_dir)

In [None]:
#Import bibcat modules
import bibcat_classes as bibcat #The collection of bibcat classes
import bibcat_config as config  

In [None]:
#Load file locations
filepath_allmodels = config.dir_allmodels
filepath_allmodels
name_model = config.name_model
filepath_model = os.path.join(filepath_allmodels, name_model, (name_model+".npy"))
filepath_model
fileloc_ML = os.path.join(filepath_allmodels, name_model, ("tfoutput_"+name_model))
fileloc_ML

In [None]:
#Set some global variables
lookup = "HST" #"Kepler"
mode = "none" #"none" = Does not simplify or modify the text, just for first tutorial

#Set some probability thresholds for classification
thres_ML = 0.80
thres_rules = 0.55

In [None]:
#Build some fake text to classify
text = (" Modern telescopes allow us to observe the night sky at higher sensitivity than ever."
        +" This has allowed us to push our observations toward fainter and fainter stars."
        +" In this paper, we present a statistical analysis of new Hubble"
        +" ultraviolet spectra and their stark dependence on stellar mass,"
        +" and show that the latest data illustrate strong trends in the so far unexplored"
        +" parameter space."
        +" We ultimately characterize stellar radiation for both faint and bright stars."
        +" The target stars are discussed in Section 1."
        +" The new HST observations are fully presented in Section 2."
        +" The spectra are plotted, and the UV models fitted, in Section 3."
        +" The statistical trends in the data are presented in Section 4 and MCMC fits in Appendix C."
        +" Finally, results are summarized and discussed in Sections 5 and 6, respectively.")
print(text)

Next, we define what our target missions are using the Keyword class.  As an example, let's define the missions HST, Kepler, and K2.

Note: There is no internal hardcoding of any missions within the codebase, so any missions could be used here.  If the target mission does not appear in the text, then the text will just not be classified.

(Advanced Note: There is a subtlety here, where the machine learning model can be trained on a training set containing text about specific missions.  Potential biases can result from this training (e.g., if most HST text in the training set is "science", then all future input HST text might be classified as "science" based on that association).  This bias can be circumvented by having the codebase anonymize the training set (with respect to the missions within) before training the machine learning model on that anonymized text.  But we won't get into those details for this minimum-use case.)

In [None]:
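#Fetch some Keyword objects (for specifying the mission)
#NOTE: `params` is not imported in the cells above. It is assumed here to be the
#bibcat parameters module that defines these Keyword objects; import it first
#(e.g., `import bibcat_parameters as params`, where the module name is an assumption).
#For HST
kobj_hubble = params.keyword_obj_HST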
#For Kepler
kobj_kepler = params.keyword_obj_Kepler
#For K2
kobj_k2 = params.keyword_obj_K2

#Collect the missions into a convenient list
all_keyobjs = [kobj_hubble, kobj_kepler, kobj_k2]

Now, let's initialize our classifiers.  We have two types of classifiers implemented: the 'machine learning' (aka, the 'data-driven') classifier and the 'rule-based' (aka, the 'attempt-at-analytical') classifier.

Initializing the machine learning classifier may take a few seconds, but only has to be done once.  The same classifier can be recycled/reused for multiple texts.

In [None]:
##Initialize classifiers
#The machine learning classifier
print(filepath_model)
print(fileloc_ML)
classifier_ML = bibcat.Classifier_ML(filepath_model=filepath_model, fileloc_ML=fileloc_ML)


In [None]:
#The rule-based classifier
classifier_rules = bibcat.Classifier_Rules()


Finally, let's initialize our operators.  The operators take in the text, process the text if requested (i.e., do any text streamlining or simplification, based on the specified mode), and ultimately classify the text.

Like the classifiers, the operators only have to be initialized once, and then can be recycled/reused for multiple texts.

In [None]:
##Initialize operators
#The machine learning operator
tabby_ML = bibcat.Operator(classifier=classifier_ML,
                        mode=mode, keyword_objs=all_keyobjs, do_verbose=False)

#The rule-based operator
tabby_rules = bibcat.Operator(classifier=classifier_rules,
                        mode=mode, keyword_objs=all_keyobjs,
                        do_verbose=False, do_verbose_deep=False)


Now, we can use our operators to accept and classify some text.

In [None]:
##Run the operators
#The machine learning operator
results_ML = tabby_ML.classify(text=text,
                                    lookup=lookup, buffer=0,
                                    threshold=thres_ML,
                                    do_raise_innererror=False,
                                    do_check_truematch=True)

#The rule-based operator
results_rules = tabby_rules.classify(text=text,
                                    lookup=lookup, buffer=0,
                                    threshold=thres_rules,
                                    do_raise_innererror=False,
                                    do_check_truematch=True)


Let's print the results.

In [None]:
##Print the results
#Print the text and classes as a reminder
print("Text:\n\"\n{0}\n\"\n".format(text))
print("Lookup: {0}\n".format(lookup))

#The machine learning results
print("> Machine learning results:")
print("Paragraph:\n\"\n{0}\n\"".format(results_ML["modif"]))
print("Verdict: {0}".format(results_ML["verdict"]))
print("Probabilities: {0}".format(results_ML["uncertainty"]))
print("-\n")

#The rule-based results
print("> Rule-based results:")
print("Paragraph:\n\"\n{0}\n\"".format(results_rules["modif"]))
print("Verdict: {0}".format(results_rules["verdict"]))
print("Probabilities: {0}".format(results_rules["uncertainty"]))
print("-\n")
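
As a rough illustration of how a probability threshold such as `thres_ML` or `thres_rules` could turn class probabilities into a verdict, here is a minimal sketch. It is an assumption, not bibcat's actual logic: the function name `pick_verdict`, the example probabilities, and the `"low_confidence"` fallback label are all hypothetical.

```python
def pick_verdict(probabilities, threshold, fallback="low_confidence"):
    """Return the most probable class, or `fallback` if it misses the threshold."""
    best_class = max(probabilities, key=probabilities.get)
    if probabilities[best_class] >= threshold:
        return best_class
    return fallback  # hypothetical label for "not confident enough"


# Hypothetical probabilities for a single paragraph
example_probs = {"science": 0.72, "mention": 0.28}

print(pick_verdict(example_probs, threshold=0.55))
print(pick_verdict(example_probs, threshold=0.80))
```

Under this sketch, a stricter threshold trades fewer confident verdicts for (hopefully) fewer wrong ones, which is one plausible reason to set different thresholds for the two classifier types.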

And with that, we're done!

---

## Full Workflow: High-Level Methodology.

The full workflow of bibcat, at a high level, is broken down into steps within the figure below:

![Graphic of the full high-level workflow of bibcat.](workflow_graphic.png "Graphic of the full high-level workflow of bibcat.")

![Procedural steps describing the full high-level workflow of bibcat.](workflow_procedure.png "Procedural steps describing the full high-level workflow of bibcat.")

Figure 1 (top panel) illustrates the full workflow of bibcat at a high level.  The user inputs and outputs are written in bold white and bold gold boxes, respectively.  Classes that are used by the user are colored in blue and outlined in bold.  Classes that are called internally within the code (and are not meant to ever be seen or touched by the user) are colored in purple and outlined with dashed lines.

We describe each bibcat class, and its primary functions, within the next section.

---

## Full Workflow: Classes.

### The _Base Class.

The \_Base class is *not* meant to be used by users.  It is a class that is purely meant to be inherited by other classes.  Essentially, \_Base is a collection of methods that other classes often use.

The primary (internal!) methods and use-cases of \_Base include:
* `_get_info`, `_store_info`: Store and retrieve information (values, booleans, etc.) for a given class instance.
* `_assemble_keyword_wordchunks`: Build noun chunks containing target mission keywords from given text.
* `_check_importance`: Check if some given text contains any important terms (where important terms include mission keywords, 1st-person and 3rd-person pronouns, a paper citation, etc.).
* `_check_truematch`: Check if some given ambiguous text relates to a given mission (e.g. Hubble observations), or is instead likely a false match (e.g. Edwin Hubble).
* `_cleanse_text`: Cleanse some given text of, e.g., excessive whitespace and punctuation. Can also, e.g., replace citations with an 'Authoretal' placeholder of sorts.
* `_extract_core_from_phrase`: Formulate a core representative 'meaning' for some given text.
* `_is_pos_word`: Check if some given word (of the NLP type) has a particular part-of-speech.
* `_process_database_ambig`: Load, process, and store external table of ambiguous mission-related phrases.
* `_search_text`: Search some given text for mission keywords/acronyms (e.g., search for "HST").
* `_streamline_phrase`: Run `_cleanse_text`, and also streamline, e.g., websites by replacing them with uniform placeholders.
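
To give a feel for the kind of cleanup described for `_cleanse_text`, here is a minimal, self-contained sketch. It is not the bibcat implementation: the function name `toy_cleanse_text` and the regular expressions are assumptions chosen for illustration, and the citation pattern only covers simple parenthetical forms such as "(Smith et al. 2020)".

```python
import re

# Simple parenthetical citations: "(Smith 2020)", "(Smith et al. 2020)",
# "(Jones & Lee, 2019a)". Real citation styles are far more varied.
_CITATION = re.compile(
    r"\(\s*[A-Z][A-Za-z\-]+"
    r"(?:\s+(?:et al\.|&\s+[A-Z][A-Za-z\-]+))?"
    r",?\s+\d{4}[a-z]?\s*\)"
)


def toy_cleanse_text(text, citation_placeholder="Authoretal"):
    """Replace simple citations, then collapse repeated punctuation and whitespace."""
    text = _CITATION.sub(citation_placeholder, text)
    text = re.sub(r"([.,;:!?])\1+", r"\1", text)  # e.g. ".." becomes "."
    text = re.sub(r"\s+", " ", text).strip()      # collapse runs of whitespace
    return text


messy = "We  use   HST data (Smith et al. 2020)..   See also (Jones & Lee, 2019a)."
print(toy_cleanse_text(messy))
```

Note that this sketch would also collapse a deliberate ellipsis into a single period, which a production cleaner would likely need to handle separately.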

### The Keyword Class.

The Keyword class is a user-friendly class that serves as a collection of 'terms' (i.e., titles, keywords, and/or acronyms) for a given mission.  "Hubble Space Telescope", "Hubble", and "HST", for example, are all terms that describe the Hubble Space Telescope (HST), and so would be included in a Keyword instance for Hubble.  These terms were used in the demo above, as reproduced below:

In [None]:
#Generate an example Keyword object for HST
kobj_example = bibcat.Keyword(
                keywords=["Hubble", "Hubble Telescope",
                          "Hubble Space Telescope"],
                acronyms=["hst", "ht"])

#Print the Keyword instance
print(kobj_example)

The primary methods and use-cases of Keyword include:
* `get_name`: Return a representative name for this Keyword instance.
* `is_keyword`: Return whether or not some given text contains terms that match to this Keyword instance.
* `replace_keyword`: Replace occurrence of terms that match to this Keyword instance within the given text.

### The Paper Class.

The Paper class is a user-friendly class; that being said, users never have to interact with the Paper class directly, because all calls to it during the User Workflow are handled internally within the codebase.  The main purpose of this class is to extract, from a larger block of text, every sentence that refers to a given mission (or missions).  This collection of sentences for each mission, denoted from here on as a 'paragraph', focuses on the portions of the original text that actually relate to the target mission.  Using paragraphs for classification, instead of the full text, allows us to remove much of the 'noise' inherent to the rest of the text.

Here is a snippet illustrating how the Paper class is used internally to produce paragraphs for each mission (although again, users would never have to run such code themselves):

In [None]:
#Generate an example Paper object for the text and for example Keyword instance
buffer = 0 #+/- sentences to include within paragraph around each sentence with target terms
paper_example = bibcat.Paper(text=text, keyword_objs=[kobj_hubble], do_check_truematch=True)
paper_example.process_paragraphs(buffer=buffer, do_overwrite=False)
paragraphs = paper_example.get_paragraphs()

#Print what is stored within paragraphs
print("Full printout of what is stored within paragraphs:")
print(paragraphs)

#Print the paragraph for the Keyword instance that was passed to Paper above
print("Printout of paragraph stored for example Keyword instance:")
name = kobj_hubble.get_name() #Fetch the name of the same Keyword instance given to Paper
print(paragraphs[name]) #Use name to access paragraph for this Keyword instance

The buffer parameter allows us to provide some context around the sentences with target terms. A buffer of N will include N sentence(s) from the original text before and after each target sentence:

In [None]:
#Run the Paper class again with a different buffer
buffer = 1 #+/- sentences to include within paragraph around each sentence with target terms
paper_example.process_paragraphs(buffer=buffer, do_overwrite=True) #Overwrite stored paragraphs
paragraphs = paper_example.get_paragraphs()

#Print the paragraphs for the example Keyword instance
print("Printout of paragraph stored for example Keyword instance, now with buffer={0}:"
     .format(buffer))
print(paragraphs[name]) #Use name to access paragraph for this Keyword instance

The primary (internal!) methods and use-cases of Paper include:
* `get_paragraphs`: Fetch stored paragraphs previously generated with `process_paragraphs`.
* `process_paragraphs`: Collect all sentences containing mission terms into 'paragraphs'. Buffer each sentence by including +/- surrounding sentences if given buffer is nonzero.
* `_buffer_indices`: Given a set of indices and a buffer value, return the index spans of the applied buffer.
* `_extract_paragraph`: Perform the actual extraction of the paragraph from the text.
* `_split_text`: Split the given text into sentences using naive sentence boundaries.
* `_verify_acronyms`: Scan the entire text to find possible meanings of given acronyms.
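
The buffering idea behind `process_paragraphs` and `_buffer_indices` can be sketched as follows. This is an illustrative assumption, not the bibcat code: the function name `toy_buffer_indices`, its signature, and the choice to merge overlapping or adjacent spans are all hypothetical.

```python
def toy_buffer_indices(indices, buffer, n_sentences):
    """Expand each sentence index by +/- buffer and merge touching spans.

    Returns a list of inclusive (start, end) index spans.
    """
    spans = []
    for idx in sorted(indices):
        start = max(0, idx - buffer)              # clip at the first sentence
        end = min(n_sentences - 1, idx + buffer)  # clip at the last sentence
        if spans and start <= spans[-1][1] + 1:
            # Overlaps or touches the previous span: extend it instead
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([start, end])
    return [tuple(span) for span in spans]


# Suppose sentences 2 and 5 (0-indexed) of a 9-sentence text mention the mission
print(toy_buffer_indices([2, 5], buffer=0, n_sentences=9))
print(toy_buffer_indices([2, 5], buffer=1, n_sentences=9))
```

With a buffer of 0 the two target sentences stay separate, while a buffer of 1 makes their spans touch, so this sketch merges them into one contiguous block.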

### The Grammar Class.

The Grammar class is a user-friendly class on the surface; however, users are not meant to interact with Grammar directly, as much of its content is extremely technical. In a nutshell, the main purpose of the Grammar class is to:
1) Break a paragraph (generated by the Paper class) down into its grammatical components (e.g., parts-of-speech, flagging verbs, the hierarchical structure of each sentence, etc).
2) Streamline, simplify, and/or anonymize the contents of the paragraph, based on the user's specifications.

Essentially, the Grammar class is meant to modify a given paragraph, so as to remove any extra or unnecessary information from within each sentence in that paragraph.  These modifications are meant to further reduce the 'noise' that, e.g., the machine learning classifier must deal with later on in order to classify a given paragraph.  For clarity, from here on we refer to these modified paragraphs as 'modifs'.

There are different modes for modifying a given paragraph (thus producing different modifs), which will do none or some combination of the following operations:
* `skim`: Remove useless extra words (e.g., adjectives) that are likely not important for readability.
* `trim`: Remove clauses and sentence snippets that are likely not relevant to the mission and are not important for readability.
* `anon`: Replace any mission terms with a generic placeholder.

Here is a snippet demonstrating these different modes of modification and producing example modifs for our example paragraph:

In [None]:
#Generate an example Grammar object for the text and for example Keyword instance
grammar_example = bibcat.Grammar(text=text, keyword_obj=kobj_hubble,
                                 do_check_truematch=True, buffer=0
                                ) #NOTE: buffer=0 required for mode=trim

#Run all allowed modifications of paragraph
grammar_example.run_modifications(which_modes=["none", "skim", "trim", "anon", "skim_trim_anon"])

#Fetch modified paragraphs ('modifs') for various modes
#
#For none
try_modif = grammar_example.get_modifs(which_modes=["none"]) #Order/case do not matter
print("none modif: (i.e., no modifications made)")
print(try_modif)
print("")
#
#For skim
try_modif = grammar_example.get_modifs(which_modes=["skim"]) #Order/case do not matter
print("skim modif:")
print(try_modif)
print("")
#
#For trim
try_modif = grammar_example.get_modifs(which_modes=["trim"]) #Order/case do not matter
print("trim modif:")
print(try_modif)
print("")
#
#For anon
try_modif = grammar_example.get_modifs(which_modes=["anon"]) #Order/case do not matter
print("anon modif:")
print(try_modif)
print("")
#
#For skim+trim+anon
try_modif = grammar_example.get_modifs(which_modes=["skim_trim_anon"]) #Order/case do not matter
print("skim_trim_anon modif:")
print(try_modif)
print("")
#
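
The comments in the cell above note that order and case do not matter when naming a mode. One way such behavior could be achieved is to normalize every mode string into a canonical form before looking it up. The sketch below is an assumption about a possible approach, not bibcat's code; `toy_normalize_mode` and its canonical ordering are hypothetical.

```python
ALLOWED_OPERATIONS = ("skim", "trim", "anon")  # canonical order for this sketch


def toy_normalize_mode(mode):
    """Map e.g. 'Anon_SKIM' to 'skim_anon', regardless of order and case."""
    parts = [part for part in mode.lower().split("_") if part]
    if parts == ["none"]:
        return "none"
    unknown = set(parts) - set(ALLOWED_OPERATIONS)
    if not parts or unknown:
        raise ValueError("Unrecognized mode: {0!r}".format(mode))
    # Rebuild the string in canonical order, dropping duplicates
    return "_".join(op for op in ALLOWED_OPERATIONS if op in parts)


print(toy_normalize_mode("Anon_SKIM"))
print(toy_normalize_mode("trim_skim_anon"))
print(toy_normalize_mode("NONE"))
```

Storing modifs under the normalized key would then let any spelling of the same combination retrieve the same result.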

The primary (internal!) methods and use-cases of Grammar include the following (but again, note that the large majority of the Grammar class is extremely technical, and not recommended for the user to mess with):
* `get_modifs`: Fetch stored modifs previously generated with `run_modifications`.
* `run_modifications`: Modify 'paragraphs' according to the user-requested modes and store the resulting modifs.
* `_add_aux`, `_add_verb`, `_add_word`: Internal helpers that record an auxiliary, a verb, or a general word, respectively, while the grammatical structure of a sentence is being built.
* `_get_wordchunk`: Fetch the full wordchunk of a given word.
* `_modify_structure`: Perform actual modification of a given paragraph using a given mode.
* `_recurse_NLP_categorization`: Recurse through the grammar hierarchy of a sentence and categorize+store information (e.g., part-of-speech) for each word within that sentence.
* `_run_NLP`: Run an external natural language processing (NLP) package on given text.
* `_set_wordchunks`: Assign each word in a sentence to a word chunk.

### The _Classifier Class.

Similarly to the \_Base class, the \_Classifier class is *not* meant to be used by users.  It is a class purely meant to be inherited by the various Classifier\_* classes.  In short, the \_Classifier class is a collection of methods that the different classifier types often use.

The primary (internal!) methods and use-cases of \_Classifier include:
* `classify_text`: Base classification method, overwritten by various classifier types during inheritance.
* `_load_text`: Load text from a given filepath.
* `_process_text`: Use the Grammar class (and internally the Paper class) to process given text into modifs.
* `_write_text`: Write a given text file to a given filepath.

### The Classifier_* Classes.

The Classifier\_* classes are the user-friendly classifier types implemented in bibcat.  They each support a different type of classification of a given block of text.  Currently there are two types implemented in bibcat:
* `Classifier_ML`: Classification using a previously trained machine learning (ML) model.
* `Classifier_Rules`: Classification using an internal decision tree.

While the internal workings of the Classifier\_* classes are different, ultimately all Classifier\_* offer the same method for classification:
* `classify_text`: Classify given text using the classification type specific to this Classifier\_* instance.

Although the initialization of the Classifier\_* classes is user-friendly, the inner workings of these classes are quite technical, are managed completely internally, and are not meant to be operated directly by users.
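
The pattern described here, a shared base class whose `classify_text` method is overridden by each classifier type, can be sketched with toy classes. Everything below is hypothetical: the class names, the cue phrases, and the shape of the returned dictionary (which merely mimics the `"verdict"` and `"uncertainty"` keys printed in the demo) are assumptions for illustration, not the bibcat implementation.

```python
class ToyBaseClassifier:
    """Stand-in for a shared base class: subclasses override classify_text."""

    def classify_text(self, text):
        raise NotImplementedError("Subclasses must implement classify_text.")


class ToyRuleClassifier(ToyBaseClassifier):
    """Counts hand-picked cue phrases; a crude stand-in for rule-based logic."""

    SCIENCE_CUES = ("we present", "we observed", "our observations")
    MENTION_CUES = ("see also", "previous work", "e.g.")

    def classify_text(self, text):
        lowered = text.lower()
        # Add-one smoothing keeps both scores nonzero
        science = 1 + sum(lowered.count(cue) for cue in self.SCIENCE_CUES)
        mention = 1 + sum(lowered.count(cue) for cue in self.MENTION_CUES)
        total = science + mention
        probs = {"science": science / total, "mention": mention / total}
        return {"verdict": max(probs, key=probs.get), "uncertainty": probs}


class ToyConstantClassifier(ToyBaseClassifier):
    """Always returns the same label; useful as a baseline."""

    def __init__(self, label):
        self.label = label

    def classify_text(self, text):
        return {"verdict": self.label, "uncertainty": {self.label: 1.0}}


sample = "In this paper, we present new HST spectra."
for clf in (ToyRuleClassifier(), ToyConstantClassifier("mention")):
    print(type(clf).__name__, clf.classify_text(sample))
```

Because both toy classes expose the same method, calling code can swap one for the other without changes, which is the benefit of the shared interface.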

### The Operator Class.

Last but not least, the Operator class is an extremely user-friendly class that is meant to be run by users. The primary purpose of this class is to direct and run the entire workflow of bibcat, from reading in a given block of text to ultimately classifying that block of text.  Using its `classify` method, the Operator class internally handles all calls to the other classes discussed so far (Paper, Grammar, and the given classifier).  Indeed, the `classify` method was used in the very first demo of this write-up, in order to classify the example text.

Formally, the primary methods and use-cases of Operator are:
* `_fetch_keyword_object`: A hidden method for fetching the stored Keyword instance that matches a given term.
* `classify`: A method designed for users that prepares and runs the entire bibcat workflow, from input raw text to classified output.
* `process`: A method designed for users that processes given text into modifs, from input raw text to output modifs. Does not include classification (for that, run `classify`); useful for preprocessing raw text.
* `train_model_ML`: A method designed for users that trains a machine learning (ML) model on input raw text. Under the hood, this calls `process` to carry out the preprocessing of the raw text.

---

### That's all to start!

---

In [None]:
#Set end marker for this tutorial.
print("This tutorial completed successfully.")