## Sequence Tagging Patterns
### Tanya Balaraju
#### Modified from Homework 2 in Class
Matthew Stone, CS 533, to accompany second homework.
Initial version, Spring 2017.  Updated Spring 2018.


In [1]:
from __future__ import print_function
import nltk
import vocabulary
import itertools
import numpy as np
import scipy
import sklearn
import heapq
import tagtools
from nltk.corpus import wordnet
from nltk.tag import pos_tag

In [2]:
reference_train_file, reference_test_file, reference_dev_file = \
  "reference_train.xml", "reference_test.xml", "reference_dev.xml"
reference_xml_item_keyword = "entry"

ingredients_train_file, ingredients_test_file, ingredients_dev_file = \
  "ingredients_small_train.xml", "ingredients_small_test.xml", "ingredients_devset.xml"
#  "ingredients_big_train.xml", "ingredients_big_test.xml", "ingredients_devset.xml"
ingredients_xml_item_keyword = "ingredient"

In [3]:
def tagged_contexts(seq) :
    '''take the tagged tokens in seq and create a new sequence 
    that yields the same tokens but presents them with the full
    context that precedes and follows them'''
    items = [x for x in seq]
    words = [w for (w,_) in items]
    for i, (w, t) in enumerate(items) :
        if i == 0 :
            before = []
        else :
            before = words[i-1::-1]
        after = words[i+1:]
        yield (w, before, after), t

In [4]:
def make_cxt_feature_processor(word_tests, list_tests) :
    def feature_processor(features, item) :
        this, before, after = item
        result = []
        
        def addf(name) :
            if name :
                r = features.add(name)
                if r :
                    result.append(r)

        def add_word_features(w, code) :
            for ff in word_tests :
                addf(ff(w, code))

        def add_list_features(l, code) :
            for ff in list_tests :
                addf(ff(l, code))

        first = lambda l: l[0] if l else None
        
        add_word_features(this, u"w")
        add_list_features(before, u"-")
        add_list_features(after, u"+")
        for wx, cx in [(first(before), u"-w"),
                       (first(after), u"+w")] :
            if wx :
                add_word_features(wx, cx)
    
        return np.array(result)
    return feature_processor

## Writing specific feature detectors

### Model Features

Here are some model features to give a sense of what's involved in writing the pattern matching routines for a feature processor.

* The `identity_feature` constructs a feature for the item itself.  It shows how you can incorporate material from the item into the feature definition you create on a match.  Token identity is always a good default for memorizing arbitrary associations in a classification problem.

* The `all_digits` feature tests whether a word item consists entirely of numerical characters, e.g., matches `[0-9]*`  In the bibliography context, this might be a good starting cue to recognize a date or for the volume number of a journal.  In the recipe domain this might be a good cue for recognizing the quantity of an ingredient.

* The `lonely_initial` feature tests whether something looks like an abbreviated first name, which of course is useful for identifying author and editor fields in bibliographies: a two character token consisting of an upper case letter and a period (e.g,. `A.`)

* The `is_empty` feature tests whether a list context contains no tokens.  When applied to the context features for a target token, this feature fires in one way when the target token is the first token in a text sequence and in another way when the target token is the final token in a text sequence.  This makes it generally useful for identifying material that's consistently placed at the beginning or end of descriptions (e.g., author vs date in bibliography entries, or quantity vs comment in recipe elements).

#### New features added:

* The `is_web_addr` function identifies one of two features: `is_email` and `is_website`, based on the formatting of the `item` passed in. It was useful to include both features in the same function due to the similarity of their formatting.

* The `is_quantity` feature determines whether an item is a fraction--useful for identifying quantities (particularly fractions) in recipes.

* The `is_page_range` feature determines whether an item is a page range, most often denoted by the formatting `<number>-<number>` but also inclusive of other possible formats.

* The `is_year` feature checks whether an item is a four-digit number likely to be a year.

* The `is_institution` feature checks for institution-related words that determine whether the item in question is a university.

* The `is_volume` feature denotes whether an item is a volume, which is usually formatted as `<number>(<number>)`.

* The `is_author` feature uses detection of single initials to determine whether a name belongs to an author. This is a list-based feature, because author names are split up.

* The `is_location` feature uses both Wordnet and a Part of Speech Tagger to determine whether an item is likely to be a location. In particular, it looks for whether the Wordnet definition of an item contains the word "city" and whether the POS Tagger tags the item as a proper noun.

In [5]:
identity_feature = lambda item, code: "{}: {}".format(code, item)

def all_digits(item, code) :
    if item.isdigit() :
        return u"{}: is all digits".format(code)
    
def lonely_initial(item, code) :
    if len(item) == 2 and item[0].isupper and item[1] == '.' :
        return u"{}: lonely initial".format(code)
    
def is_empty(l, code) :
    if not l :
        return u"{}: empty".format(code)

def is_web_addr(l, code):
    domains = {'.com', '.org', '.net', '.edu', '.gov', '.cn', \
                      '.uk', '.eu', '.ru', '.info', '.nl', '.de'}
    def is_email():
        if '@' in l and True in [x in l[l.index('@'):] for x in domains]:
            return True
        return False
    def is_website():
        if True in ([i in l for i in domains]):
            return True
        return False
    if is_email(): return u"{}: is email".format(code)
    if is_website(): return u"{}: is website".format(code)
    
def is_quantity(item, code):
    if item.replace('/', '').replace(' ', '').isdigit() and '/' in item:
        return u"{}: is quantity".format(code)

def is_page_range(item, code):
    if item.replace('-', '').replace(',','').replace('.','').isdigit() and '-' in item:
        return u"{}: is page range".format(code)

def is_year(item, code):
    if len(item.replace('-','').replace(',','')) == 4 and ('19' in item[:1] or '20' in item[:1]):
        return u"{}: is year".format(code)

def is_institution(item, code):
    if 'University' in item or 'Univ' in item or 'Universite' in item or 'MIT' in item:
        return u"{}: is institution".format(code)

def is_author(l, code):
    #print (l)
    if len(l) > 0 and len(l[0]) == 2:
        if l[0][1] == '.' and l[0].replace('.','').isupper():
            return u"{}: is author".format(code)

def is_volume(item, code):
    new = item.replace('(', '').replace(')', '').replace(',','')
    if len(new) <= 3 and new.isdigit():
        return u"{}: is volume".format(code)

def is_location(item, code):
    syns = wordnet.synsets(item)
    if len(syns) > 0:
        if 'town' in syns[0].definition() or 'city' in syns[0].definition():
            pos = pos_tag([item])
            w, t = pos[0]
            if t == 'NNP':
                return u"{}: is location".format(code)

The `feature_processor` call was updated to include the new features.

In [6]:
default_tokenizer = \
    lambda i: tagged_contexts(tagtools.bies_tagged_tokens(i))
default_token_view = lambda i : i[0]
default_feature_processor = \
    make_cxt_feature_processor([all_digits, lonely_initial, 
                                identity_feature, is_quantity, 
                                is_page_range, is_year, is_institution, 
                                is_location, is_volume],
                               [is_empty, is_web_addr, is_author])
def default_features(vocab) :
    return lambda data: vocab

bib_features = vocabulary.Vocabulary()

bib_data = tagtools.DataManager(reference_train_file, 
                                reference_test_file, 
                                reference_dev_file,
                                reference_xml_item_keyword,
                                default_tokenizer,
                                default_token_view,
                                default_features(bib_features),
                                default_feature_processor)

Load the data from the file system

In [7]:
bib_data.initialize()

In [8]:
bib_data.test_features_dev_item(2)

Ventura (b: author)
	-: empty	+w: ,
, (i: author)
	w: ,	-w: Ventura	+w: Dan
Dan (i: author)
	+w: ,	w: Dan	-w: ,
, (e: author)
	w: ,	-w: Dan	+w: (
( (b: date)
	-w: ,	w: (	+w: is all digits	+w: 1995
1995 (i: date)
	w: is all digits	w: 1995	-w: (	+w: )
) (i: date)
	w: )	-w: is all digits	-w: 1995	+w: .
. (e: date)
	w: .	-w: )	+w: On
On (b: title)
	w: On	-w: .	+w: Discretization
Discretization (i: title)
	w: Discretization	-w: On	+w: as
as (i: title)
	w: as	-w: Discretization	+w: a
a (i: title)
	w: a	-w: as	+w: Preprocessing
Preprocessing (i: title)
	w: Preprocessing	-w: a	+w: Step
Step (i: title)
	w: Step	-w: Preprocessing	+w: for
for (i: title)
	w: for	-w: Step	+w: Supervised
Supervised (i: title)
	w: Supervised	-w: for	+w: Learning
Learning (i: title)
	w: Learning	-w: Supervised	+w: Models
Models (i: title)
	+w: ,	w: Models	-w: Learning
, (e: title)
	w: ,	-w: Models	+w: Masters
Masters (b: tech)
	-w: ,	w: Masters	+w: Thesis
Thesis (i: tech)
	+w: ,	w: Thesis	-w: Masters
, (e: tech)
	w: ,

For now, we can use exactly the same family of operations to describe the recipe data.  If you want to explore both recipe data and bibliography data yourself, however, you may want to consider using different features and maybe even different tagging pipelines for the two data sets.

In [9]:
recipe_features = vocabulary.Vocabulary()

recipe_data = tagtools.DataManager(ingredients_train_file, 
                                   ingredients_test_file,
                                   ingredients_dev_file,
                                   ingredients_xml_item_keyword,
                                   default_tokenizer,
                                   default_token_view,
                                   default_features(recipe_features),
                                   default_feature_processor)

Again, we have a separate cell to load the data from the file system.  (The recipe data is very large; this takes a while!)

In [10]:
recipe_data.initialize()

TaggingExperiment was updated with the function `test()` to run the classifier and decoder on the test data. This function is almost exactly the same as `decode_and_validate()`.

In [11]:
class TaggingExperiment(object) :
    '''Organize the process of getting data, building a classifier,
    and exploring new representations'''
    
    def __init__(self, data, features, classifier, decoder) :
        'set up the problem of learning a classifier from a data manager'
        self.data = data
        self.classifier = classifier
        self.features = features
        self.decoder = decoder
        self.initialized = False
        self.trained = False
        self.decoded = False
        
    def initialize(self) :
        'materialize the training data, dev data and test data as matrices'
        if not self.initialized :
            self.train_X, self.train_y, self.train_d = self.data.training_data()
            self.features.stop_growth()
            self.dev_X, self.dev_y, self.dev_d = self.data.dev_data()
            self.test_X, self.test_y, self.test_d = self.data.test_data()
            self.initialized = True
        
    def fit_and_validate(self) :
        'train the classifier and assess predictions on dev data'
        if not self.initialized :
            self.initialize()
        self.classifier.fit(self.train_X, self.train_y)
        self.tagset = self.classifier.classes_
        self.trained = True
        self.dev_predictions = self.classifier.predict(self.dev_X)
        self.accuracy = sklearn.metrics.accuracy_score(self.dev_y, self.dev_predictions)
    
    def visualize_classifier(self, item_number) :
        'show the results of running the classifier on text number item_number'
        if not self.trained :
            self.fit_and_validate()
        w = self.data.dev_item_token_views(item_number)
        s = self.dev_d[item_number]
        e = self.dev_d[item_number+1]
        tagtools.visualize(w, {'actual': self.dev_y[s:e], 
                               'predicted': self.dev_predictions[s:e]})

    def decode_and_validate(self) :
        '''use the trained classifier and beam search to find the consistent
        analyses of all the items in the dev data'''
        if not self.trained :
            self.fit_and_validate()
        self.dev_log_probs = self.classifier.predict_log_proba(self.dev_X)
        results = []
        self.dev_partials = []
        self.dev_exacts = []
        for i in range(len(self.dev_d)-1) :
            s = self.dev_d[i]
            e = self.dev_d[i+1]
            tags, score = self.decoder.search(self.tagset, self.dev_log_probs[s:e])
            p_t = tags[1:-1]
            results.append(p_t)
            self.dev_partials.append(tagtools.agrees(p_t, iter(self.dev_y[s:e]), partial=True))
            self.dev_exacts.append(tagtools.agrees(p_t, iter(self.dev_y[s:e]), partial=False))
        self.dev_decoded = np.concatenate(results)
    
    def test(self):
        if not self.trained :
            self.fit_and_validate()
        self.test_predictions = self.classifier.predict(self.test_X)
        self.test_accuracy = sklearn.metrics.accuracy_score(self.test_y, self.test_predictions)
        self.test_log_probs = self.classifier.predict_log_proba(self.test_X)
        results = []
        self.test_partials = []
        self.test_exacts = []
        for i in range(len(self.test_d)-1) :
            s = self.test_d[i]
            e = self.test_d[i+1]
            tags, score = self.decoder.search(self.tagset, self.test_log_probs[s:e])
            p_t = tags[1:-1]
            results.append(p_t)
            self.test_partials.append(tagtools.agrees(p_t, iter(self.test_y[s:e]), partial=True))
            self.test_exacts.append(tagtools.agrees(p_t, iter(self.test_y[s:e]), partial=False))
        self.test_decoded = np.concatenate(results)
        
    def visualize_decoder(self, item) :
        'show the results of running the classifier and decoder on text number item_number'
        if not self.decoded :
            self.decode_and_validate()
        w = self.data.dev_item_token_views(item)
        s = self.dev_d[item]
        e = self.dev_d[item+1]
        tagtools.visualize(w, {'actual': self.dev_y[s:e], 
                               'best': self.dev_predictions[s:e],
                               'predicted': self.dev_decoded[s:e]})

    @classmethod
    def transform(cls, expt, operation, classifier) :
        'use operation to transform the data from expt and set up new classifier'
        if not expt.initialized :
            expt.initialize()
        result = cls(expt.data, classifier)
        result.train_X, result.train_y, result.train_d = \
            operation(expt.train_X, expt.train_y, expt.train_d)
        result.dev_X, result.dev_y, result.dev_d = \
            operation(expt.dev_X, expt.dev_y, expt.dev_d)
        result.test_X, result.test_y, result.test_d = \
            operation(expt.test_X, expt.test_y, expt.test_d)
        result.initialized = True
        return result


## Searching for Consistent Taggings

The four functions below were modified as follows:

* `initial_status()`: This remains unchanged.

* `is_consistent(t1, t2)`: The basic rules for this function are listed in the comments within the function. In general, an 'i' or 'e' tag must follow a 'b' or 'i' tag, and an 's' or 'b' tag must follow an 'e' or 's' tag. In the former case, the tags (such as 'author') must match; in the latter case, they should not match. This makes sense--'b' or 'i' 'author' tags can logically be followed by another author, but an 'e' or 's' tag for 'author' means that there cannot be another author following it.

* `next_status(status, t1, t2)`: The basic notion behind this function is that the tags journal, booktitle, tech, and quantity MUST be unique in a single entry. However, 'note', 'comment', and 'other' can appear multiple times.

* `do_special(status, t1, j, node, heuristics)`.  `do_special` follows up on `next_status` by addressing the case in which the tags 'note', 'comment', or 'other' appear in an entry. When these are present, the other constraints are no longer enforced. This is also true when a supposedly "unique" tag appears more than once in an entry. Whereas the default heuristic is 100,000, this is changed when the "rules" are "broken," and the value is modified to the heuristic value at index j.


In [12]:
# a good default
def initial_status():
    return frozenset(['START'])

def is_consistent(t1, t2) :
    #(i or e) must follow (b or i) and tags should match
    #(s or b) must follow (e or s) and tags shouldn't match
    if t2 == 'START' and t1 == 'END' : return False
    return (t1 == 'START' and t2[0] == 'b') or \
        (t2 == 'END' and (t1[0] == 'e' or t1[0] == 's')) or \
        (((t1[0] == 'b' or t1[0] == 'i') and (t2[0] == 'i' or t2[0] == 'e')) and t1[3:] == t2[3:]) or \
        (((t1[0] == 'e' or t1[0] == 's') and (t2[0] == 's' or t2[0] == 'b')) and t1[3:] != t2[3:])

def next_status(status, t1, t2) :
    #first dataset: note can be multiple, 
        #have to be individual: journal, booktitle, tech
    #second dataset: comments and other can be multiple, quantities cannot
    unique = {'journal', 'booktitle', 'tech', 'quantity'}
    not_unique = {'note', 'comment', 'other'}
    single = {'author', 'date', 'title', 'volume', 'location', 'publisher', 'institution', 'editor', 'paper', 'date'}
    field1 = t1[3:]
    field2 = t2[3:]
    if field2 in single and field2 not in status: return status.union([field2])
    if field2 in not_unique: return status.union([field2])
    if field2 in unique and len(status & not_unique) == 0: 
        return status.union([field2])
    return status

def do_special(status, t1, j, node, heuristics):
    #model failure: if t1 is in the "not_unique" set from before
    not_unique = {'note', 'comment', 'other'}
    special_node = node    
    new_value = 100000
    for i in range(j, len(heuristics) - 2) :
        special_node = (None, special_node)
    if t1 not in {'START', 'END'} and t1[3:] in status and len(status & not_unique) != 0:
        new_value = heuristics[j]
    return [(special_node, new_value)]
    

## Setting up experiments

The definitions below put everything together.  We create a classifier to learn correlations between features and tags, and crate a decoder to put the learned decisions together into an analyses of complete texts.  Then we set up the infrastructure to explore the results systematically.

In [13]:
bib_classifier = sklearn.linear_model.SGDClassifier(loss="log",
                                           penalty="elasticnet",
                                           n_iter=5)


bib_decoder = tagtools.BeamDecoder(initial_status,
                                   is_consistent,
                                   next_status,
                                   do_special)

bib = TaggingExperiment(bib_data, 
                        bib_features,
                        bib_classifier,
                        bib_decoder)

Load the data

In [14]:
bib.initialize()

Train the classifier and explore it

In [15]:
bib.fit_and_validate()



In [16]:
print (bib.accuracy)

0.8100526008182349


In [17]:
bib.visualize_classifier(20)

Some errors.
('Jeffrey        ', u'\tactual     b: author \tpredicted  b: author ')
('Kuskin         ', u'\tactual     i: author \tpredicted  i: title  ')
(',              ', u'\tactual     i: author \tpredicted  i: author ')
('David          ', u'\tactual     i: author \tpredicted  i: author ')
('Ofelt          ', u'\tactual     i: author \tpredicted  i: author ')
(',              ', u'\tactual     i: author \tpredicted  i: author ')
('Mark           ', u'\tactual     i: author \tpredicted  i: author ')
('Heinrich       ', u'\tactual     i: author \tpredicted  i: author ')
(',              ', u'\tactual     i: author \tpredicted  i: author ')
('John           ', u'\tactual     i: author \tpredicted  i: author ')
('Heinlein       ', u'\tactual     i: author \tpredicted  i: title  ')
(',              ', u'\tactual     i: author \tpredicted  i: author ')
('Richard        ', u'\tactual     i: author \tpredicted  i: author ')
('Simoni         ', u'\tactual     i: author \tpredicted  i: aut

Search for consistent analyses and report details

In [18]:
bib.decode_and_validate()
print (sum([1 for t in bib.dev_partials if t]), "correct with omissions")
print (sum([1 for t in bib.dev_exacts if t]), "fully correct")

15 correct with omissions
15 fully correct


Summarize performance of the classifier

In [19]:
print (tagtools.bieso_classification_report(bib.dev_y, bib.dev_predictions))

                precision    recall  f1-score   support

     b: author       0.96      0.98      0.97        49
     i: author       0.80      0.93      0.86       338
     e: author       0.69      0.92      0.79        49
  e: booktitle       0.68      0.52      0.59        25
  i: booktitle       0.88      0.76      0.82       209
  b: booktitle       0.85      0.88      0.86        25
       i: date       0.81      1.00      0.89        50
       e: date       0.78      0.96      0.86        49
       b: date       0.91      0.98      0.94        49
     b: editor       0.00      0.00      0.00         6
     e: editor       0.00      0.00      0.00         6
     i: editor       0.93      0.30      0.45        44
i: institution       0.73      0.40      0.52        20
b: institution       0.50      0.14      0.22         7
e: institution       1.00      0.43      0.60         7
    i: journal       0.58      0.29      0.39        38
    b: journal       0.90      0.56      0.69  

  'precision', 'predicted', average, warn_for)


Summarize performance of the decoder

In [20]:
print (tagtools.bieso_classification_report(bib.dev_y, bib.dev_decoded))

                precision    recall  f1-score   support

     b: author       1.00      0.92      0.96        49
     i: author       1.00      0.82      0.90       338
     e: author       0.96      0.88      0.91        49
  e: booktitle       0.75      0.36      0.49        25
  i: booktitle       0.99      0.40      0.57       209
  b: booktitle       0.92      0.44      0.59        25
       i: date       0.90      0.72      0.80        50
       e: date       0.93      0.51      0.66        49
       b: date       0.93      0.51      0.66        49
     b: editor       1.00      0.17      0.29         6
     e: editor       1.00      0.17      0.29         6
     i: editor       1.00      0.09      0.17        44
i: institution       1.00      0.35      0.52        20
b: institution       1.00      0.14      0.25         7
e: institution       1.00      0.14      0.25         7
    i: journal       1.00      0.18      0.31        38
    b: journal       1.00      0.31      0.48  

#### Bibligraphy Test Data Results

In [21]:
bib.test()
print ("Classifier Performance \n Accuracy: {}".format(bib.test_accuracy))
print (tagtools.bieso_classification_report(bib.test_y, bib.test_predictions))

Classifier Performance 
 Accuracy: 0.830203916818
                precision    recall  f1-score   support

     b: author       0.99      1.00      0.99       148
     i: author       0.80      0.96      0.87       970
     e: author       0.77      0.92      0.84       148
  e: booktitle       0.74      0.48      0.59        66
  i: booktitle       0.80      0.71      0.75       477
  b: booktitle       0.84      0.85      0.84        66
       i: date       0.92      0.94      0.93       171
       e: date       0.87      0.97      0.91       147
       b: date       0.94      0.95      0.95       147
     b: editor       1.00      0.08      0.15        12
     e: editor       1.00      0.08      0.15        12
     i: editor       0.89      0.23      0.37       139
i: institution       0.75      0.60      0.67        65
b: institution       0.83      0.42      0.56        12
e: institution       1.00      0.33      0.50        12
    i: journal       0.71      0.37      0.48       1

In [22]:
print ("Decoder Performance")
print(tagtools.bieso_classification_report(bib.test_y, bib.test_decoded))

Decoder Performance
                precision    recall  f1-score   support

     b: author       0.99      0.97      0.98       148
     i: author       0.99      0.97      0.98       970
     e: author       0.98      0.97      0.97       148
  e: booktitle       0.88      0.45      0.60        66
  i: booktitle       0.99      0.47      0.63       477
  b: booktitle       0.97      0.50      0.66        66
       i: date       0.98      0.77      0.86       171
       e: date       0.97      0.65      0.78       147
       b: date       0.98      0.66      0.79       147
     b: editor       0.00      0.00      0.00        12
     e: editor       0.00      0.00      0.00        12
     i: editor       0.00      0.00      0.00       139
i: institution       1.00      0.17      0.29        65
b: institution       1.00      0.25      0.40        12
e: institution       1.00      0.25      0.40        12
    i: journal       1.00      0.31      0.47       166
    b: journal       0.93  

#### Recipe Dev Data Analysis

In [23]:
rec_classifier = sklearn.linear_model.SGDClassifier(loss="log",
                                           penalty="elasticnet",
                                           n_iter=5)


rec_decoder = tagtools.BeamDecoder(initial_status,
                                   is_consistent,
                                   next_status,
                                   do_special)

rec = TaggingExperiment(recipe_data, 
                        recipe_features,
                        rec_classifier,
                        rec_decoder)

In [24]:
rec.initialize()

In [25]:
rec.fit_and_validate()



In [26]:
print (rec.accuracy)

0.7989575577066269


In [27]:
rec.visualize_classifier(20)
rec.decode_and_validate()

Analyzed correctly.
('1              ', u'\tactual     s: qty    \tpredicted  s: qty    ')
('cup            ', u'\tactual     s: unit   \tpredicted  s: unit   ')
('coarsely       ', u'\tactual     b: comment\tpredicted  b: comment')
('chopped        ', u'\tactual     e: comment\tpredicted  e: comment')
('leeks          ', u'\tactual     s: name   \tpredicted  s: name   ')


In [28]:
print (tagtools.bieso_classification_report(rec.dev_y, rec.dev_predictions))

              precision    recall  f1-score   support

  e: comment       0.78      0.77      0.77      1063
  b: comment       0.69      0.57      0.63      1063
  s: comment       0.72      0.67      0.69       590
  i: comment       0.72      0.85      0.78      2327
    s: index       0.00      0.00      0.00         1
     i: name       0.88      0.23      0.36       265
     s: name       0.81      0.86      0.83      1118
     e: name       0.82      0.77      0.79       915
     b: name       0.74      0.82      0.78       915
    b: other       0.85      0.19      0.31        57
    s: other       0.69      0.39      0.50       443
    e: other       0.92      0.19      0.32        57
    i: other       0.50      0.20      0.29        45
      b: qty       0.87      1.00      0.93       106
      e: qty       0.72      0.99      0.84       106
      s: qty       0.97      1.00      0.99      1555
b: range_end       0.00      0.00      0.00         1
s: range_end       0.91    

In [29]:
print (tagtools.bieso_classification_report(rec.dev_y, rec.dev_decoded))

              precision    recall  f1-score   support

  e: comment       0.76      0.01      0.02      1063
  b: comment       0.11      0.14      0.12      1063
  s: comment       0.69      0.04      0.08       590
  i: comment       0.49      0.01      0.02      2327
    s: index       0.00      0.00      0.00         1
     i: name       0.94      0.06      0.11       265
     s: name       0.88      0.03      0.06      1118
     e: name       0.87      0.10      0.18       915
     b: name       0.87      0.10      0.18       915
    b: other       0.03      0.28      0.06        57
    s: other       0.89      0.02      0.04       443
    e: other       1.00      0.02      0.03        57
    i: other       0.00      0.00      0.00        45
      b: qty       0.91      0.99      0.95       106
      e: qty       0.91      0.99      0.95       106
      s: qty       0.00      0.00      0.00      1555
b: range_end       0.00      0.00      0.00         1
s: range_end       0.00    

#### Recipe Test Data Results

In [30]:
rec.test()
print ("Classifier Performance \n Accuracy: {}".format(rec.test_accuracy))
print (tagtools.bieso_classification_report(rec.test_y, rec.test_predictions))

Classifier Performance 
 Accuracy: 0.809852388855
              precision    recall  f1-score   support

  e: comment       0.77      0.78      0.77      5185
  b: comment       0.72      0.60      0.65      5185
  s: comment       0.75      0.65      0.70      2912
  i: comment       0.69      0.86      0.76      9295
     i: name       0.82      0.19      0.30      1221
     s: name       0.85      0.88      0.86      5935
     e: name       0.84      0.81      0.82      4256
     b: name       0.76      0.83      0.79      4256
    b: other       0.88      0.14      0.24       270
    s: other       0.72      0.40      0.51      2085
    e: other       0.89      0.12      0.21       270
    i: other       0.73      0.14      0.24       228
      b: qty       0.93      1.00      0.96       560
      e: qty       0.81      1.00      0.89       560
      s: qty       0.98      1.00      0.99      7780
b: range_end       0.00      0.00      0.00        11
s: range_end       0.87      0.

In [31]:
print ("Decoder Performance")
print(tagtools.bieso_classification_report(rec.test_y, rec.test_decoded))

Decoder Performance
              precision    recall  f1-score   support

  e: comment       0.88      0.03      0.05      5185
  b: comment       0.11      0.15      0.13      5185
  s: comment       0.71      0.03      0.06      2912
  i: comment       0.52      0.01      0.02      9295
     i: name       0.94      0.07      0.13      1221
     s: name       0.93      0.05      0.09      5935
     e: name       0.87      0.10      0.19      4256
     b: name       0.88      0.11      0.19      4256
    b: other       0.03      0.26      0.05       270
    s: other       0.71      0.02      0.03      2085
    e: other       0.80      0.01      0.03       270
    i: other       0.00      0.00      0.00       228
      b: qty       0.96      0.99      0.98       560
      e: qty       0.96      0.99      0.98       560
      s: qty       1.00      0.00      0.00      7780
b: range_end       0.00      0.00      0.00        11
s: range_end       1.00      0.01      0.01       150
e: rang

### Conclusion

First, multiple new features were added. These improved the precision of the classifier by around 3%--from 78% to 81%. Many of these features were tailored more toward the bibliography dataset (such as is_author and is_web_addr, among others), with the recipe dataset being more experimental. A feature that would have been beneficial to implement is is_booktitle, which could distinguish titles from booktitles (the classifier seemed to have some trouble with this). However, this seemed close to impossible logistically; there were few consistent differences between the two tags, and because it would need to be a list-level feature, it was difficult to find where exactly the "booktitle" and "title" tags began and ended in the list. Overall, however, the features implemented did demonstrate a 3% improvement in precision over approximately 10 trials.

The development set precision for the bibliography data showed significant improvement after use of the decoder. Precision shot up from 80% to 97% after the four decoder functions were implemented. Accuracy did not improve to this extent, possibly due to multiple items being classified as the same specific feature but not necessarily the most correct one (adding more features would have increased the likelihood of this happening). This pattern was not observed with the recipe data, most likely because this was a more experimental part of this exercise and less features were implemented to detect features in recipe data.

The test set precision for bibliography data also showed improvement (after use of the decoder from 83% to 95%. This pattern was observed consistently for approximately 10 trials. This pattern was not repeated for the recipe data, again most likely due to the prioritization of the bibliography data features. An improvement that could have been made for the recipe data trials is to implement more recipe-related features and separate the is_quantity and is_volume feature implementations between two separate feature processors. (These features appear very similar during processing of individual items.)

Overall, the A* decoder with the improved `is_consistent`, `next_status`, and `do_special` functions greatly improved the precision of this classifier. On average, the dev set precisions improved by 16%, and the test set precisions improved by 15% on average. 