## Program Induction using Atlas ##

We will use Atlas generators to write a program that extracts initials from people's full names. A traditional program can do this with simple string operations, but things get messy when the names come in a variety of formats (for example, *optionally* prefixed with Mr., Ms. or Mrs.). Hence we'll use ``Select`` operators to extract the initials individually and then combine them into the required format (for example, ``{first-name-initial}.{last-name-initial}.``).
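
For comparison, a hand-written baseline might look like the sketch below. The function name `naive_initials` and the hard-coded `TITLES` set are illustrative assumptions, not part of Atlas. The point is only that every new name format needs another special case.

```python
# Hypothetical hand-written baseline; not part of Atlas.
TITLES = {"Mr.", "Ms.", "Mrs."}  # assumed list of prefixes; real data may contain more


def naive_initials(name: str) -> str:
    """Return initials in the form 'F.L.' using plain string operations."""
    parts = [p for p in name.split() if p not in TITLES]
    if len(parts) < 2:
        raise ValueError(f"cannot find first and last name in {name!r}")
    return f"{parts[0][0]}.{parts[-1][0]}."


print(naive_initials("Alan Turing"))                 # A.T.
print(naive_initials("Mrs. Christianne Bousfield"))  # C.B.
print(naive_initials("Miss Caroline Lemieux"))       # M.L. (wrong: 'Miss' is not in TITLES)
```

The last call shows the weakness: an unanticipated prefix silently produces the wrong initials.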

### Writing the Generator ###

First let us define the generator along the lines of the description above. The input to the generator is the name of the person. We use two ``Select`` calls to choose the initials corresponding to the first and last names respectively. We then construct the final output by following each initial with a `.`

In [1]:
from atlas import generator

@generator
def get_initials(name: str):
    #  Initial corresponding to the first name
    first_initial = Select(name)
    #  Initial corresponding to the last name
    last_initial = Select(name)
    #  Combine the two
    return f"{first_initial}.{last_initial}."

The definition is simple, as expected. However, this generator is not very useful by itself for computing initials: calling ``get_initials.generate`` simply returns every possible pairing of characters from the input, each followed by a `.`.

In [2]:
list(get_initials.generate('Alan Turing'))

['A.A.',
 'A.l.',
 'A.a.',
 'A.n.',
 'A. .',
 'A.T.',
 'A.u.',
 'A.r.',
 'A.i.',
 'A.n.',
 'A.g.',
 'l.A.',
 'l.l.',
 'l.a.',
 'l.n.',
 'l. .',
 'l.T.',
 'l.u.',
 'l.r.',
 'l.i.',
 'l.n.',
 'l.g.',
 'a.A.',
 'a.l.',
 'a.a.',
 'a.n.',
 'a. .',
 'a.T.',
 'a.u.',
 'a.r.',
 'a.i.',
 'a.n.',
 'a.g.',
 'n.A.',
 'n.l.',
 'n.a.',
 'n.n.',
 'n. .',
 'n.T.',
 'n.u.',
 'n.r.',
 'n.i.',
 'n.n.',
 'n.g.',
 ' .A.',
 ' .l.',
 ' .a.',
 ' .n.',
 ' . .',
 ' .T.',
 ' .u.',
 ' .r.',
 ' .i.',
 ' .n.',
 ' .g.',
 'T.A.',
 'T.l.',
 'T.a.',
 'T.n.',
 'T. .',
 'T.T.',
 'T.u.',
 'T.r.',
 'T.i.',
 'T.n.',
 'T.g.',
 'u.A.',
 'u.l.',
 'u.a.',
 'u.n.',
 'u. .',
 'u.T.',
 'u.u.',
 'u.r.',
 'u.i.',
 'u.n.',
 'u.g.',
 'r.A.',
 'r.l.',
 'r.a.',
 'r.n.',
 'r. .',
 'r.T.',
 'r.u.',
 'r.r.',
 'r.i.',
 'r.n.',
 'r.g.',
 'i.A.',
 'i.l.',
 'i.a.',
 'i.n.',
 'i. .',
 'i.T.',
 'i.u.',
 'i.r.',
 'i.i.',
 'i.n.',
 'i.g.',
 'n.A.',
 'n.l.',
 'n.a.',
 'n.n.',
 'n. .',
 'n.T.',
 'n.u.',
 'n.r.',
 'n.i.',
 'n.n.',
 'n.g.',
 'g.A.',
 

What we need is a **trained** ``get_initials`` to return correct values (`'A.T.'` in this case) earlier than the other incorrect values.
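
To make the size of the untrained search space concrete, the sketch below reproduces the same pairings in plain Python with `itertools.product`. It only mimics the order of the output listed above for this example; that Atlas enumerates this way in general is an assumption, and `enumerate_candidates` is a hypothetical helper.

```python
from itertools import product


def enumerate_candidates(name: str):
    """Yield every 'x.y.' pairing, with the first choice varying slowest."""
    for first, last in product(name, repeat=2):
        yield f"{first}.{last}."


candidates = list(enumerate_candidates("Alan Turing"))
print(len(candidates))               # 121 (11 characters, squared)
print(candidates.index("A.T.") + 1)  # 6: position of the correct answer in this order
```

For `'Alan Turing'` the right answer happens to appear early, but for a name such as `'Mr. Shahryar Brickell'` many wrong candidates would come first.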

### The Data-Set ###

First let us prepare a dataset of input-output pairs for the initial extraction task described above. The two files `training.csv` and `validation.csv` contain the training and validation data respectively.

In [3]:
import pandas as pd
train_df = pd.read_csv('https://atlas-tutorial-initial-extraction.s3.us-east-2.amazonaws.com/training.csv', index_col=0)
valid_df = pd.read_csv('https://atlas-tutorial-initial-extraction.s3.us-east-2.amazonaws.com/validation.csv', index_col=0)
train_df.head(5)

|   | name | initials |
|---|------|----------|
| 0 | Traxton Preston | T.P. |
| 1 | Mrs. Christianne Bousfield | C.B. |
| 2 | Ms. Scotlynn Heed | S.H. |
| 3 | Findlay Gael | F.G. |
| 4 | Mr. Shahryar Brickell | S.B. |


As you can see, the data has two columns, ``name`` and ``initials``, holding the input and output values respectively. The ``name`` column contains names in a variety of formats, as the first five entries show, so the initial extraction task is not trivial.

### Creating Demonstration Data ###
To train ``get_initials``, we are going to use imitation learning. That is, we will tell ``get_initials`` which one of its many execution paths (i.e. operator choices) is the correct one for a given input. More concretely, for a given input to ``get_initials`` we will provide the correct choices to be made by both of the ``Select`` operators.

However, such data is not readily available. We only have the right output for a given input to ``get_initials``. Where do we get this data?

One easy approach is to trace the executions of ``get_initials`` when we call ``generate`` on it and keep only the one that returns the right output. The trace contains the operator choices made in that particular run of ``get_initials``, which is exactly what we need. Atlas provides utilities that make this extraction easy.

In [4]:
correct_trace = None
for output, trace in get_initials.with_env(tracing=True).generate('Alan Turing'):
    if output == 'A.T.':
        correct_trace = trace
        break
        
print(correct_trace)


        GeneratorTrace(inputs=(('Alan Turing',), {}),
                       op_traces=[
OpTrace(op_info=OpInfo(sid='/get_initials/Select@@1', gen_name='get_initials', op_type='Select', index=1, gen_group=None, uid=None, tags=None),
        choice='A',
        domain='Alan Turing',
        context=None,
        **{}
       ), 
OpTrace(op_info=OpInfo(sid='/get_initials/Select@@2', gen_name='get_initials', op_type='Select', index=2, gen_group=None, uid=None, tags=None),
        choice='T',
        domain='Alan Turing',
        context=None,
        **{}
       )]



We can see that the trace collected above is an instance of a ``GeneratorTrace`` object which has tracked the input to the generator as well as the various operator choices made along with some other meta-data for Atlas' internal use.

However, doing this for all data points may take too long, as we would be exploring a quadratic (in the length of ``name``) number of possible generator executions. We can speed this up by exploiting the fact that, given the output, it is easy to reverse-engineer the choice each operator must make ourselves: they are the first two elements obtained by splitting the output on `.`. Let us see the API Atlas provides to achieve this.
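
The reverse-engineering step itself needs nothing from Atlas. A minimal sketch, assuming outputs always have the form `X.Y.`; the helper name `replay_choices_from_output` and the dictionary keys `'first'` and `'last'` are illustrative labels:

```python
def replay_choices_from_output(initials: str) -> dict:
    """Map an expected output such as 'A.T.' to per-operator choices.

    The keys 'first' and 'last' are operator labels assumed for illustration.
    """
    parts = initials.split(".")
    # 'A.T.'.split('.') gives ['A', 'T', '']; the trailing empty string
    # comes from the final dot.
    if len(parts) != 3 or parts[2] != "" or len(parts[0]) != 1 or len(parts[1]) != 1:
        raise ValueError(f"unexpected initials format: {initials!r}")
    return {"first": [parts[0]], "last": [parts[1]]}


print(replay_choices_from_output("A.T."))  # {'first': ['A'], 'last': ['T']}
```

This costs one `split` per data point instead of a search over all character pairs.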

#### 1. Add unique labels to operators

In [5]:
@generator
def get_initials(name: str):
    #  Initial corresponding to the first name
    first_initial = Select(name, uid="first")
    #  Initial corresponding to the last name
    last_initial = Select(name, uid="last")
    #  Combine the two
    return f"{first_initial}.{last_initial}."

#### 2. Make the generator replay the right choices 

In [6]:
_, correct_trace = get_initials.with_env(replay={"first": ["A"], "last": ["T"]}, tracing=True).call('Alan Turing')
print(correct_trace)


        GeneratorTrace(inputs=(('Alan Turing',), {}),
                       op_traces=[
OpTrace(op_info=OpInfo(sid='/get_initials/Select@first@1', gen_name='get_initials', op_type='Select', index=1, gen_group=None, uid='first', tags=None),
        choice='A',
        domain='Alan Turing',
        context=None,
        **{}
       ), 
OpTrace(op_info=OpInfo(sid='/get_initials/Select@last@2', gen_name='get_initials', op_type='Select', index=2, gen_group=None, uid='last', tags=None),
        choice='T',
        domain='Alan Turing',
        context=None,
        **{}
       )]



We get the same choices as with the earlier approach; only the operator identifiers now carry the labels. The dictionary passed as the `replay` argument of `with_env` contains the values each operator needs to return, indexed by its label. Now we are ready to create our training/validation data efficiently.

In [7]:
import tqdm
training_traces = []
validation_traces = []

#  First, the training data
for name, initials in tqdm.tqdm(zip(train_df.name, train_df.initials), total=train_df.shape[0]):
    choice_first, choice_last, _ = initials.split('.')
    replay_choices = {'first': [choice_first], 'last': [choice_last]}
    _, trace = get_initials.with_env(replay=replay_choices, tracing=True).call(name)
    training_traces.append(trace)
            
#  Now do the validation data
for name, initials in tqdm.tqdm(zip(valid_df.name, valid_df.initials), total=valid_df.shape[0]):
    choice_first, choice_last, _ = initials.split('.')
    replay_choices = {'first': [choice_first], 'last': [choice_last]}
    _, trace = get_initials.with_env(replay=replay_choices, tracing=True).call(name)
    validation_traces.append(trace)

100%|██████████| 100000/100000 [00:09<00:00, 10875.60it/s]
100%|██████████| 1000/1000 [00:00<00:00, 10398.90it/s]
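
Replaying a choice only makes sense if the labelled initial actually occurs in the name. The sketch below is an optional sanity check one could run before building traces. `is_reproducible` is a hypothetical helper, and the third row is a made-up bad label, not taken from the dataset.

```python
def is_reproducible(name: str, initials: str) -> bool:
    """True if `initials` looks like 'X.Y.' and both X and Y occur in `name`."""
    parts = initials.split(".")
    if len(parts) != 3 or parts[2] != "":
        return False
    first, last = parts[0], parts[1]
    return len(first) == 1 and len(last) == 1 and first in name and last in name


rows = [
    ("Traxton Preston", "T.P."),
    ("Mr. Shahryar Brickell", "S.B."),
    ("Alan Turing", "A.X."),  # made-up bad label: 'X' is not in the name
]
usable = [(n, i) for n, i in rows if is_reproducible(n, i)]
print(len(usable))  # 2
```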


### Defining the Model ###

We now define the neural network model that will guide the choices made by each of the ``Select`` operators. We will use a simple Keras-based RNN model that takes in a string and outputs a single character.

In [8]:
import tensorflow as tf
import string
import numpy as np
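from atlas.models.catalogue import Models
# Assumption: the `operator` decorator used below is importable from atlas.operators
from atlas.operators import operator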

class SelectModel(Models.Keras):
    def __init__(self):
        super().__init__()

        characters = string.ascii_letters + '. '
        self.char_index = {c: idx for idx, c in enumerate(characters, 3)}
        self.char_index['<PAD>'] = 0
        self.char_index['<START>'] = 1
        self.char_index['<END>'] = 2
        self.inverse_char_index = {v: k for k, v in self.char_index.items()}
        self.max_length = 50

    def preprocess(self, data, mode='training'):
        if mode == 'training':
            inputs = [['<START>'] + list(i.domain) + ['<END>'] for i in data]
            inputs = [list(map(lambda x: self.char_index[x], i)) for i in inputs]
            inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs, maxlen=self.max_length, padding='post',
                                                                   value=self.char_index['<PAD>'])
            labels = [self.char_index[i.choice] for i in data]
            return inputs, labels
        
        else:
            inputs = [['<START>'] + list(i) + ['<END>'] for i in data]
            inputs = [list(map(lambda x: self.char_index[x], i)) for i in inputs]
            inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs, maxlen=self.max_length, padding='post',
                                                                   value=self.char_index['<PAD>'])
            return inputs

    def infer(self, domain, *args, **kwargs):
        probabilities = super().infer([domain])[0]
        char_predictions = np.argsort(probabilities)

        predictions = [self.inverse_char_index[i] for i in reversed(char_predictions)]
        return list(filter(lambda x: x in domain, predictions))

    def build(self):
        self.model = tf.keras.Sequential([
            tf.keras.layers.Embedding(len(self.char_index), 64),
            tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(len(self.char_index), activation='softmax')
        ])

        self.model.compile(optimizer='adam',
                           loss='sparse_categorical_crossentropy',
                           metrics=['accuracy'])
        
class GetInitialsModel(Models.Generators.Imitation.IndependentOperators):
    @operator
    def Select(*args, **kwargs):
        return SelectModel()
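
To see what the `preprocess` step feeds the network, the sketch below rebuilds the same vocabulary and encodes one name in plain Python. The `encode` function is a hypothetical stand-in for the `pad_sequences` call and assumes the sequence fits within the maximum length of 50.

```python
import string

# Same vocabulary layout as SelectModel: ids 0-2 are reserved for special tokens.
characters = string.ascii_letters + ". "
char_index = {c: idx for idx, c in enumerate(characters, 3)}
char_index.update({"<PAD>": 0, "<START>": 1, "<END>": 2})
MAX_LENGTH = 50


def encode(domain: str) -> list:
    """Encode a name as <START> + characters + <END>, post-padded with <PAD>."""
    tokens = ["<START>"] + list(domain) + ["<END>"]
    ids = [char_index[t] for t in tokens]
    return ids + [char_index["<PAD>"]] * (MAX_LENGTH - len(ids))


encoded = encode("Alan Turing")
print(encoded[:14])     # [1, 29, 14, 3, 16, 56, 48, 23, 20, 11, 16, 9, 2, 0]
print(len(char_index))  # 57: the size of the embedding input and the softmax output
```

Note that the vocabulary covers only ASCII letters, `.` and the space, so a name containing any other character (a hyphen or an accented letter, say) raises a `KeyError` in this lookup.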

In [9]:
from atlas.models.utils import save_model, restore_model
model = GetInitialsModel()
model.train(training_traces, validation_traces, num_epochs=1)

100%|██████████| 100000/100000 [00:00<00:00, 1115723.00it/s]
100%|██████████| 1000/1000 [00:00<00:00, 956948.21it/s]

[+] Training model for OpInfo(sid='/get_initials/Select@first@1', gen_name='get_initials', op_type='Select', index=1, gen_group=None, uid='first', tags=None)
Train on 100000 samples, validate on 1000 samples
Epoch 00001: val_acc improved from -inf to 1.00000, saving model to /var/folders/s9/__w2d9dx2ljdx9qk865hh5qs559bh9/T/tmp0vzmsklz/model.h5
[+] Training model for OpInfo(sid='/get_initials/Select@last@2', gen_name='get_initials', op_type='Select', index=2, gen_group=None, uid='last', tags=None)
Train on 100000 samples, validate on 1000 samples
Epoch 00001: val_acc improved from -inf to 0.98500, saving model to /var/folders/s9/__w2d9dx2ljdx9qk865hh5qs559bh9/T/tmpkq6gvod_/model.h5


In [10]:
save_model(model, './trained_model.zip')

In [11]:
model = restore_model("./trained_model.zip") 
get_initials.set_default_model(model)

In [12]:
import itertools
list(itertools.islice(get_initials.generate('Miss Caroline Lemieux'), 5))

['C.L.', 'C.C.', 'C.n.', 'C.M.', 'C.r.']
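
The ordering of these results can be understood as each operator ranking the characters of the name by model score, with the rankings then combined. The sketch below illustrates the idea with made-up scores. The helper `rank_in_domain` and the assumption that the first operator's choice varies slowest are illustrative, not a description of Atlas internals.

```python
from itertools import islice, product


def rank_in_domain(scores: dict, domain: str) -> list:
    """Sort characters by descending score and keep only those present in `domain`."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [c for c in ranked if c in domain]


name = "Miss Caroline Lemieux"
# Made-up scores for illustration; a trained model would supply real probabilities.
first_scores = {"C": 0.90, "M": 0.05, "L": 0.03, "Z": 0.02}
last_scores = {"L": 0.80, "C": 0.10, "n": 0.04, "M": 0.03, "r": 0.02, "Q": 0.01}

first_ranked = rank_in_domain(first_scores, name)
last_ranked = rank_in_domain(last_scores, name)

# Assumed enumeration order: the first operator's choice varies slowest.
top5 = [f"{f}.{l}." for f, l in islice(product(first_ranked, last_ranked), 5)]
print(top5)  # ['C.L.', 'C.C.', 'C.n.', 'C.M.', 'C.r.']
```

Characters the model scores highly but that do not occur in the name (`Z` and `Q` here) are filtered out, mirroring the `filter` call in `SelectModel.infer`.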

An interesting by-product of combining known constraints with learned models is that we can edit the constraints (here, the output format) and reuse the models without retraining.

In [13]:
@generator
def get_initials(name: str):
    #  Initial corresponding to the first name
    first_initial = Select(name)
    #  Initial corresponding to the last name
    last_initial = Select(name)
    #  Combine the two
    return f"Initials - {first_initial}.{last_initial}"

get_initials.set_default_model(model)

In [14]:
list(itertools.islice(get_initials.generate('Ms. Caroline Lemieux'), 5))

['Initials - C.L',
 'Initials - C.C',
 'Initials - C.n',
 'Initials - C.M',
 'Initials - C.r']