# Zero-shot sequence labelling
*Credit to https://github.com/marekrei/mltagger*
***

As described in the author's GitHub repository above, training can be run with:
> `python experiment.py config_file.conf`

Because the original code is sparsely documented, this notebook breaks down `experiment.py` to explain how it works.

The notebook equivalent of the above command is:
> `experiment.run_experiment(config_path)`

In [1]:
import experiment

In [None]:
config_path = './conf/fcepublic.conf'

<div class="alert alert-block alert-warning">
<b>WARNING!</b> The following will trigger the full training!
</div>

In [2]:
# experiment.run_experiment(config_path = config_path) 

# Prelude
***

In order to illustrate how the code functions, the function `run_experiment` in `experiment.py` has been duplicated in the following cell, together with any other dependencies. Note that the function wrapper has been **removed**, thus exposing the local variables and allowing us to view/print them for tutorial purposes.

<div class="alert alert-block alert-info">
<b>Run the following cell for tutorial purposes.</b> But it's not necessary to read it.
</div>

In [16]:
# For illustration purposes only
# -------------------------------
# For code dependencies to function in jupyter, selected functions are manually imported into this cell

import collections
import numpy
import random

try:
    import ConfigParser as configparser
except ImportError:
    import configparser

from model import MLTModel

def read_input_files(file_paths, max_sentence_length=-1):
    """
    Reads input files in whitespace-separated format.
    Will split file_paths on comma, reading from multiple files.
    """
    sentences = []
    line_length = None
    for file_path in file_paths.strip().split(","):
        with open(file_path, "r") as f:
            sentence = []
            for line in f:
                line = line.strip()
                if len(line) > 0:
                    line_parts = line.split()
                    assert(len(line_parts) >= 2), line
                    assert(len(line_parts) == line_length or line_length == None)
                    line_length = len(line_parts)
                    sentence.append(line_parts)
                elif len(line) == 0 and len(sentence) > 0:
                    if max_sentence_length <= 0 or len(sentence) <= max_sentence_length:
                        sentences.append(sentence)
                    sentence = []
            if len(sentence) > 0:
                if max_sentence_length <= 0 or len(sentence) <= max_sentence_length:
                    sentences.append(sentence)
    return sentences
    
def parse_config(config_section, config_path):
    """
    Reads configuration from the file and returns a dictionary.
    Tries to guess the correct datatype for each of the config values.
    """
    # ConfigParser is used here because SafeConfigParser was removed in Python 3.12
    config_parser = configparser.ConfigParser(allow_no_value=True)
    config_parser.read(config_path)
    config = collections.OrderedDict()
    for key, value in config_parser.items(config_section):
        if value is None or len(value.strip()) == 0:
            config[key] = None
        elif value.lower() in ["true", "false"]:
            config[key] = config_parser.getboolean(config_section, key)
        elif value.isdigit():
            config[key] = config_parser.getint(config_section, key)
        elif is_float(value):
            config[key] = config_parser.getfloat(config_section, key)
        else:
            config[key] = config_parser.get(config_section, key)
    return config

def is_float(value):
    """
    Check whether value can be converted to a float.
    """
    try:
        float(value)
        return True
    except ValueError:
        return False

def create_batches_of_sentence_ids(sentences, batch_equal_size, max_batch_size):
    """
    Groups together sentences into batches
    If max_batch_size is positive, this value determines the maximum number of sentences in each batch.
    If max_batch_size has a negative value, the function dynamically creates the batches such that each batch contains abs(max_batch_size) words.
    Returns a list of lists with sentences ids.
    """
    batches_of_sentence_ids = []
    if batch_equal_size == True:
        sentence_ids_by_length = collections.OrderedDict()
        sentence_length_sum = 0.0
        for i in range(len(sentences)):
            length = len(sentences[i])
            if length not in sentence_ids_by_length:
                sentence_ids_by_length[length] = []
            sentence_ids_by_length[length].append(i)

        for sentence_length in sentence_ids_by_length:
            if max_batch_size > 0:
                batch_size = max_batch_size
            else:
                batch_size = int((-1 * max_batch_size) / sentence_length)

            for i in range(0, len(sentence_ids_by_length[sentence_length]), batch_size):
                batches_of_sentence_ids.append(sentence_ids_by_length[sentence_length][i:i + batch_size])
    else:
        current_batch = []
        max_sentence_length = 0
        for i in range(len(sentences)):
            current_batch.append(i)
            if len(sentences[i]) > max_sentence_length:
                max_sentence_length = len(sentences[i])
            if (max_batch_size > 0 and len(current_batch) >= max_batch_size) \
              or (max_batch_size <= 0 and len(current_batch)*max_sentence_length >= (-1 * max_batch_size)):
                batches_of_sentence_ids.append(current_batch)
                current_batch = []
                max_sentence_length = 0
        if len(current_batch) > 0:
            batches_of_sentence_ids.append(current_batch)
    return batches_of_sentence_ids
    
# The following originally belonged to function 'run_experiment'
# Variables are no longer contained within the function and can be called within this notebook's scope

# def run_experiment(config_path):

config = parse_config("config", config_path)
temp_model_path = config_path + ".model"
if "random_seed" in config:
    random.seed(config["random_seed"])
    numpy.random.seed(config["random_seed"])

for key, val in config.items():
    print(str(key) + ": " + str(val))

data_train, data_dev, data_test = None, None, None
if config["path_train"] != None and len(config["path_train"]) > 0:
    data_train = read_input_files(config["path_train"], config["max_train_sent_length"])
if config["path_dev"] != None and len(config["path_dev"]) > 0:
    data_dev = read_input_files(config["path_dev"])
if config["path_test"] != None and len(config["path_test"]) > 0:
    data_test = []
    for path_test in config["path_test"].strip().split(":"):
        data_test += read_input_files(path_test)

model = MLTModel(config)
model.build_vocabs(data_train, data_dev, data_test, config["preload_vectors"])
model.construct_network()
model.initialize_session()
if config["preload_vectors"] != None:
    model.preload_word_embeddings(config["preload_vectors"])



path_train: data/fce/fce-error-detection/tsv/fce-public.train.original.tsv
path_dev: data/fce/fce-error-detection/tsv/fce-public.dev.original.tsv
path_test: data/fce/fce-error-detection/tsv/fce-public.dev.original.tsv:data/fce/fce-error-detection/tsv/fce-public.test.original.tsv
default_label: O
model_selector: dev_sent_f:high
preload_vectors: embeddings/glove/glove.6B.300d.txt
word_embedding_size: 300
emb_initial_zero: False
train_embeddings: True
char_embedding_size: 100
word_recurrent_size: 300
char_recurrent_size: 100
hidden_layer_size: 50
char_hidden_layer_size: 50
lowercase: True
replace_digits: True
min_word_freq: -1.0
singletons_prob: 0.1
allowed_word_length: -1.0
max_train_sent_length: -1.0
vocab_include_devtest: True
vocab_only_embedded: False
initializer: glorot
opt_strategy: adadelta
learningrate: 1.0
clip: 0.0
batch_equal_size: False
max_batch_size: 32
epochs: 200
stop_if_no_improvement_for_epochs: 7
learningrate_decay: 0.9
dropout_input: 0.5
dropout_word_lstm: 0.5
tf_per_

ValueError: Variable word_embeddings already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

  File "C:\Users\Daniel\Anaconda3\envs\sequence_labeler\lib\site-packages\tensorflow\python\framework\ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
  File "C:\Users\Daniel\Anaconda3\envs\sequence_labeler\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
    op_def=op_def)
  File "C:\Users\Daniel\Anaconda3\envs\sequence_labeler\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)


# Code Breakdown
***

## `run_experiment`

1. Parse the configuration file into a dictionary with `parse_config`, which wraps Python's built-in `configparser` and guesses the datatype of each value.

>```python 
def run_experiment(config_path):
    config = parse_config("config", config_path)
    ...
```
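To make the type guessing concrete, the sketch below applies the same branch order as `parse_config` to a small in-memory config. The names `SAMPLE`, `guess_type` and `toy_config` and the path `data/train.tsv` are hypothetical, and `read_string` stands in for reading a real `.conf` file.

```python
import collections
import configparser

# Hypothetical config text; the keys are borrowed from the printed config above.
SAMPLE = """
[config]
path_train = data/train.tsv
lowercase = True
max_batch_size = 32
learningrate = 1.0
max_train_sent_length = -1
preload_vectors =
"""

def guess_type(value):
    """Mirror the branch order used in parse_config."""
    if value is None or len(value.strip()) == 0:
        return None
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    if value.isdigit():
        return int(value)
    try:
        return float(value)
    except ValueError:
        return value

parser = configparser.ConfigParser(allow_no_value=True)
parser.read_string(SAMPLE)
toy_config = collections.OrderedDict(
    (key, guess_type(value)) for key, value in parser.items("config")
)
for key, value in toy_config.items():
    print("{}: {!r}".format(key, value))
```

Note that `-1` comes back as the float `-1.0`, because `str.isdigit()` rejects the minus sign and the value falls through to the float branch. This is consistent with the `-1.0` values in the printed config.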

2. Read in the data files whose paths are set in the config file. Multiple test files can be given, separated by `:`.

>```python 
def run_experiment(config_path):
    ...    
    if config["path_train"] != None and len(config["path_train"]) > 0:
        data_train = read_input_files(config["path_train"], config["max_train_sent_length"])
    if config["path_dev"] != None and len(config["path_dev"]) > 0:
        data_dev = read_input_files(config["path_dev"])
    if config["path_test"] != None and len(config["path_test"]) > 0:
        data_test = []
        for path_test in config["path_test"].strip().split(":"):
            data_test += read_input_files(path_test)
    ...
```

For example:

In [14]:
print('data_train:\n-----------\nNumber of sentences: {}\nSample sentences:\n{}\n\n{}\n\n{}'
      .format(len(data_train), data_train[0], data_train[7], data_train[50]))

data_train:
-----------
Number of sentences: 28731
Sample sentences:
[['Dear', 'c'], ['Sir', 'c'], ['or', 'c'], ['Madam', 'c'], [',', 'c']]

[['You', 'c'], ['promised', 'c'], ['a', 'c'], ['perfect', 'c'], ['evening', 'c'], ['but', 'c'], ['it', 'c'], ['became', 'c'], ['a', 'c'], ['big', 'c'], ['disastrous', 'i'], ['!', 'c']]

[['If', 'c'], ['weather', 'i'], ['is', 'c'], ['hot', 'c'], ['then', 'c'], ['we', 'c'], ['do', 'c'], ["n't", 'c'], ['have', 'c'], ['to', 'c'], ['wear', 'c'], ['under', 'i'], ['wear', 'i'], ['because', 'c'], ['very', 'c'], ['thin', 'c'], ['and', 'c'], ['light', 'c'], ['clothes', 'c'], ['will', 'c'], ['support', 'c'], ['our', 'c'], ['bodies', 'c'], ['.', 'c']]
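The nested lists above come from a one-token-per-line format in which sentences are separated by blank lines. The sketch below parses a small in-memory sample the same way. `SAMPLE_TSV` and `read_sentences` are hypothetical names, and the tab separator is an assumption based on the `.tsv` extension (the code itself splits on any whitespace).

```python
import io

# Hypothetical two-sentence sample: "token<TAB>label", blank line between sentences.
SAMPLE_TSV = "Dear\tc\nSir\tc\n\nIf\tc\nweather\ti\nis\tc\n"

def read_sentences(handle):
    """Simplified version of read_input_files without the length filter."""
    sentences, sentence = [], []
    for line in handle:
        line = line.strip()
        if line:
            parts = line.split()
            assert len(parts) >= 2, line
            sentence.append(parts)
        elif sentence:
            # A blank line closes the current sentence.
            sentences.append(sentence)
            sentence = []
    if sentence:
        # The last sentence may not be followed by a blank line.
        sentences.append(sentence)
    return sentences

toy_sentences = read_sentences(io.StringIO(SAMPLE_TSV))
print(toy_sentences)
```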


3. Initialize an instance of `MLTModel`, a class defined in `model.py` that serves as the base model on which training, inference, and other operations are performed.

>```python 
def run_experiment(config_path):
    ...
    model = MLTModel(config)
    ...
```

4. Build the lookup structures (the dictionaries `self.word2id` and `self.char2id`, and the set `self.singletons`) that are used to convert input text into indices the model can consume.

>```python 
def run_experiment(config_path):
    ...
    model.build_vocabs(data_train, data_dev, data_test, config["preload_vectors"])
    ...
```
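As a rough illustration of what such lookups contain, the sketch below builds frequency-ordered vocabularies from two toy sentences. It is not the code in `model.py`: the preprocessing in `normalise` is an assumption suggested by the `lowercase` and `replace_digits` settings, and all names here are hypothetical.

```python
import collections
import re

toy_sentences = [
    [["The", "c"], ["show", "c"], ["was", "c"], ["great", "c"], [".", "c"]],
    [["The", "c"], ["show", "c"], ["started", "c"], ["at", "c"], ["19.30", "c"], [".", "c"]],
]

def normalise(word):
    # Assumed preprocessing: lowercase the word and map every digit to 0.
    return re.sub(r"\d", "0", word.lower())

word_counter = collections.Counter(
    normalise(token[0]) for sentence in toy_sentences for token in sentence
)
# Characters are counted on the raw tokens, so case is preserved.
char_counter = collections.Counter(
    char for sentence in toy_sentences for token in sentence for char in token[0]
)

# Index 0 is reserved for unknown items; the rest are ordered by frequency.
toy_word2id = collections.OrderedDict([("<unk>", 0)])
for word, _ in word_counter.most_common():
    toy_word2id[word] = len(toy_word2id)

toy_char2id = collections.OrderedDict([("<cunk>", 0)])
for char, _ in char_counter.most_common():
    toy_char2id[char] = len(toy_char2id)

# Words seen exactly once in the data.
toy_singletons = {word for word, count in word_counter.items() if count == 1}

print(toy_word2id)
print(sorted(toy_singletons))
```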

In [8]:
model.word2id

OrderedDict([('<unk>', 0),
             ('.', 1),
             ('i', 2),
             ('the', 3),
             (',', 4),
             ('to', 5),
             ('and', 6),
             ('you', 7),
             ('in', 8),
             ('a', 9),
             ('of', 10),
             ('it', 11),
             ('was', 12),
             ('that', 13),
             ('is', 14),
             ('for', 15),
             ('my', 16),
             ('have', 17),
             ('we', 18),
             ('be', 19),
             ('at', 20),
             ('would', 21),
             ('but', 22),
             ('your', 23),
             ('because', 24),
             ("n't", 25),
             ('me', 26),
             ('like', 27),
             ('very', 28),
             ('not', 29),
             ('this', 30),
             ('are', 31),
             ('with', 32),
             ('will', 33),
             ('on', 34),
             ('about', 35),
             ('as', 36),
             ('all', 37),
             ('do', 38),

In [9]:
model.char2id

OrderedDict([('<cunk>', 0),
             ('e', 1),
             ('t', 2),
             ('o', 3),
             ('a', 4),
             ('n', 5),
             ('i', 6),
             ('s', 7),
             ('r', 8),
             ('h', 9),
             ('l', 10),
             ('d', 11),
             ('u', 12),
             ('y', 13),
             ('m', 14),
             ('c', 15),
             ('w', 16),
             ('f', 17),
             ('g', 18),
             ('p', 19),
             ('.', 20),
             ('I', 21),
             ('b', 22),
             ('v', 23),
             (',', 24),
             ('k', 25),
             ('T', 26),
             ("'", 27),
             ('A', 28),
             ('S', 29),
             ('E', 30),
             ('O', 31),
             ('D', 32),
             ('H', 33),
             ('M', 34),
             ('W', 35),
             ('x', 36),
             ('N', 37),
             ('F', 38),
             ('R', 39),
             ('L', 40),
             ('C', 41

In [11]:
model.singletons

{'panfulet',
 'show-organization',
 'peopel',
 'sedentary',
 'permision',
 'camper',
 'avenging',
 'conference-organisation',
 'disclosed',
 'easier.',
 'feeding',
 'allowe',
 'rattanakosin',
 'ranked',
 'architerture',
 'unemploeers',
 'techincs',
 'museim',
 'convidated',
 'knowe',
 'coluor',
 'soilors',
 '00.00.0',
 'concentrat',
 'human-being',
 'vaccination',
 'alleys',
 'eate',
 'carrots',
 'goverments',
 'puzzle',
 'definity',
 'tidy.',
 'appartement',
 'mananger',
 'lightened',
 'cream',
 'sky-train',
 'n0c',
 'abandoned',
 'owen',
 'extreamly',
 'funy',
 'excpect',
 'secontly',
 'leter',
 'plausible',
 'cleanest',
 'enumerous',
 'tranfer',
 'wellknown',
 'visites',
 'thusday',
 'contine',
 'batteries',
 'persuately',
 'pepperoni',
 'barkeley',
 'flew',
 'promissing',
 'sigh',
 'handled',
 'asted',
 'tee',
 'prapare',
 'varys',
 'diffences',
 'absolutlly',
 'anis',
 'district',
 'accomotation',
 'disturbing',
 'anser',
 'bangkok',
 'questionarie',
 'climates',
 'convinent',
 'n

5. Build the underlying neural network architecture as a TensorFlow graph.

   Broadly speaking, the word embeddings (optionally preloaded from GloVe) and the character embeddings (randomly initialized) are combined and fed into a bidirectional LSTM, whose output goes into an **attention** layer that is eventually used for unsupervised classification.

   More details of the architecture can be found in the author's paper.

>```python 
def run_experiment(config_path):
    ...
    model.construct_network()
    ...
```
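To give a feel for what an attention layer does with the per-token LSTM outputs, here is a generic pure-Python sketch. It is not the layer in `model.py`: the sigmoid-then-normalise weighting, the function names, and the toy numbers are all assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attend(token_vectors, token_logits):
    """Combine per-token vectors into one sentence vector using normalised weights."""
    raw = [sigmoid(logit) for logit in token_logits]
    total = sum(raw)
    weights = [r / total for r in raw]  # weights sum to 1 over the sentence
    dim = len(token_vectors[0])
    sentence_vector = [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors))
        for d in range(dim)
    ]
    return weights, sentence_vector

# Three tokens with 2-dimensional representations (hypothetical values).
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_logits = [-2.0, 0.0, 3.0]

weights, sentence_vector = attend(token_vectors, token_logits)
print(weights)
print(sentence_vector)
```

The sentence vector is dominated by the tokens with the largest weights, which is why per-token weights of this kind can double as token-level scores.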

6. Initialize the TensorFlow session, including any TensorFlow `Variable`s that have been defined.

>```python 
def run_experiment(config_path):
    ...
    model.initialize_session()
    ...
```

7. If a path to GloVe vectors is specified in `config['preload_vectors']`, the word embeddings initialized in the previous step are overwritten with the GloVe embeddings. Words that cannot be found in GloVe keep their original embedding.

>```python 
def run_experiment(config_path):
    ...
    if config["preload_vectors"] != None:
        model.preload_word_embeddings(config["preload_vectors"])
    ...
```
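The overwrite-if-found behaviour can be sketched without TensorFlow. The example below reads vectors in the GloVe text format (a word followed by space-separated floats) from an in-memory string and replaces the matching rows of a randomly initialized embedding table. All names and values are hypothetical; the real `preload_word_embeddings` operates on the model's TensorFlow variable.

```python
import io
import random

toy_word2id = {"<unk>": 0, "the": 1, "show": 2, "panfulet": 3}
embedding_size = 3

# Random initialization, standing in for the embeddings created with the network.
random.seed(0)
embeddings = [
    [random.uniform(-0.1, 0.1) for _ in range(embedding_size)]
    for _ in toy_word2id
]
original_row = list(embeddings[toy_word2id["panfulet"]])

# Hypothetical pretrained vectors in GloVe text format.
GLOVE_TEXT = "the 0.1 0.2 0.3\nshow 0.4 0.5 0.6\ncat 0.7 0.8 0.9\n"

loaded = set()
for line in io.StringIO(GLOVE_TEXT):
    parts = line.strip().split()
    if len(parts) != embedding_size + 1:
        continue  # skip malformed lines
    word, vector = parts[0], [float(x) for x in parts[1:]]
    if word in toy_word2id and word not in loaded:
        embeddings[toy_word2id[word]] = vector
        loaded.add(word)

print(sorted(loaded))
```

Here `the` and `show` receive pretrained vectors, `cat` is ignored because it is not in the vocabulary, and the misspelling `panfulet` keeps its random row.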

8. Train the model by calling `process_sentences` on the training data; the returned results are then evaluated.

>```python 
def run_experiment(config_path):
    ...
    results_train = process_sentences(data_train, model, is_training=True, learningrate=learningrate, 
                                      config=config, name="train")
    ...
```
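The printed config contains `epochs`, `stop_if_no_improvement_for_epochs`, `learningrate_decay` and a `model_selector` of `dev_sent_f:high`, which suggests an epoch loop with early stopping on a dev score. The sketch below shows one way such a loop could look. The scores are made up, and the rule of decaying the learning rate after every non-improving epoch is an assumption; `experiment.py` may apply the decay differently.

```python
toy_config = {
    "epochs": 200,
    "stop_if_no_improvement_for_epochs": 7,
    "learningrate": 1.0,
    "learningrate_decay": 0.9,
}

# Hypothetical dev scores per epoch, standing in for real evaluation results.
dev_scores = [0.40, 0.45, 0.50, 0.49, 0.48, 0.50, 0.47, 0.46, 0.45, 0.44, 0.43, 0.42]

learningrate = toy_config["learningrate"]
best_score, best_epoch, last_epoch = None, -1, -1

for epoch, score in enumerate(dev_scores[:toy_config["epochs"]]):
    last_epoch = epoch
    if best_score is None or score > best_score:
        # Higher is better; this is where the best model would be saved.
        best_score, best_epoch = score, epoch
    else:
        # Assumed policy: shrink the learning rate when there is no improvement.
        learningrate *= toy_config["learningrate_decay"]
    if epoch - best_epoch >= toy_config["stop_if_no_improvement_for_epochs"]:
        break

print("stopped after epoch {}, best epoch {}, best score {}".format(
    last_epoch, best_epoch, best_score))
```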

## `process_sentences`


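1. Before entering the model, the raw data is split into batches. With the config used here (`batch_equal_size: False`, `max_batch_size: 32`), each batch holds up to 32 sentences in their original order. Each batch stores the indices of the sentences rather than the sentences themselves.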

>```python 
def process_sentences(data, model, is_training, learningrate, config, name):
    ...
    batches_of_sentence_ids = create_batches_of_sentence_ids(data, config["batch_equal_size"], config["max_batch_size"])
    ...
```

For example:

In [24]:
batches_of_sentence_ids = create_batches_of_sentence_ids(data_train, config['batch_equal_size'], config['max_batch_size'])

# First 3 examples
for i in batches_of_sentence_ids[:3]:
    print('length: {}\n{}\n'.format(len(i), i))

length: 32
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]

length: 32
[32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]

length: 32
[64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]
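The output above shows the fixed-count mode. When `max_batch_size` is negative, its absolute value instead acts as a budget on the padded batch size (number of sentences times the longest sentence in the batch). The sketch below is a simplified re-implementation of the `batch_equal_size = False` branch that takes sentence lengths directly; `make_batches` and the toy lengths are hypothetical.

```python
def make_batches(sentence_lengths, max_batch_size):
    """Group sentence ids in order, closing a batch once it is full."""
    batches, current, longest = [], [], 0
    for i, length in enumerate(sentence_lengths):
        current.append(i)
        longest = max(longest, length)
        if max_batch_size > 0:
            full = len(current) >= max_batch_size
        else:
            # Padded size: every sentence is as long as the longest one.
            full = len(current) * longest >= -max_batch_size
        if full:
            batches.append(current)
            current, longest = [], 0
    if current:
        batches.append(current)
    return batches

lengths = [5, 12, 24, 3, 7, 30, 2]

by_count = make_batches(lengths, 3)    # at most 3 sentences per batch
by_words = make_batches(lengths, -24)  # budget of 24 padded words per batch

print(by_count)
print(by_words)
```

With the word budget, long sentences end up in smaller batches, which keeps the memory used per batch roughly bounded.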



2. The model processes the data batch by batch, and the results of each batch are recorded by the evaluator.

>```python 
def process_sentences(data, model, is_training, learningrate, config, name):
    ...
    for sentence_ids_in_batch in batches_of_sentence_ids:
        batch = [data[i] for i in sentence_ids_in_batch]
        cost, sentence_scores, token_scores_list = model.process_batch(batch, is_training, learningrate)
        evaluator.append_data(cost, batch, sentence_scores, token_scores_list)
    ...
```
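The loop can be tried without training anything by substituting a stand-in for the model. In the sketch below, `StubModel` is hypothetical and returns placeholder numbers; the shapes of its return values are assumptions, not the real output of `MLTModel.process_batch`.

```python
class StubModel:
    """Hypothetical stand-in for MLTModel that returns placeholder values."""

    def process_batch(self, batch, is_training, learningrate):
        cost = float(sum(len(sentence) for sentence in batch))
        sentence_scores = [0.5 for _ in batch]
        token_scores = [[0.5 for _ in sentence] for sentence in batch]
        return cost, sentence_scores, token_scores

toy_data = [
    [["Dear", "c"], ["Sir", "c"]],
    [["If", "c"], ["weather", "i"], ["is", "c"]],
    [["!", "c"]],
]
toy_batches = [[0, 1], [2]]  # sentence ids, as produced by the batching step

stub_model = StubModel()
total_cost, sentences_seen = 0.0, 0
for sentence_ids_in_batch in toy_batches:
    # Ids are mapped back to the actual sentences just before processing.
    batch = [toy_data[i] for i in sentence_ids_in_batch]
    cost, sentence_scores, token_scores = stub_model.process_batch(batch, True, 1.0)
    total_cost += cost
    sentences_seen += len(batch)

print(total_cost, sentences_seen)
```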