# Playing with the model

### Loading the data

In `data/data.py`, I created a `Data()` object to extract, format, and navigate the *HaAretz* corpus. Below, I instantiate this object, which gives us access to training, development, and test datasets. Since this project is in its early stages, I will test the model here on the development set and refrain from interacting with the test set.

In [1]:
from data import Data  # data is a local module

data = Data()

Using TensorFlow backend.


Inspecting the training data, we see that `data.train_X` is a list containing the respective inputs to the word and character embedding layers:

In [2]:
print('First sentence vector in word-embedding input:')
print(data.train_X[0][0])
print()
print('First sentence vector in character-embedding input:')
print(data.train_X[1][0])

First sentence vector in word-embedding input:
[18772 12467   374 12828 16803 12500 13790  2517 15816  5001 19409  8682
 12884 13861  2084 11977 10221     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0]

First sentence vector in character-embedding input:
[[ 1 15 46 ...  0  0  0]
 [15 40 35 ...  0  0  0]
 [17  4 31 ...  0  0  0]
 ...
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]]


To accommodate variable-length words and sentences, the input vectors are zero-padded. Every sentence vector is padded to the length of the longest sentence in the corpus (`data.max_sent_len`). Likewise, every word vector is padded to the length of the longest word in the corpus (`data.max_word_len`). Below, I print these values along with the shapes of the datasets.

In [3]:
print('Maximum sentence length:', '\t\t', data.max_sent_len)
print('Maximum word length:', '\t\t\t', data.max_word_len)

print('\n\033[1mTRAINING\033[0m')
print('Number of examples:', '\t\t\t', len(data.train_X[0]))
print('Shape of word embeddings input:', '\t', data.train_X[0].shape)
print('Shape of character embeddings input:', '\t', data.train_X[1].shape)
print('Shape of output labels:', '\t\t', data.train_Y.shape)

print('\n\033[1mDEV\033[0m')
print('Number of examples:', '\t\t\t', len(data.dev_X[0]))
print('Shape of word embeddings input:', '\t', data.dev_X[0].shape)
print('Shape of character embeddings input:', '\t', data.dev_X[1].shape)
print('Shape of output labels:', '\t\t', data.dev_Y.shape)

print('\n\033[1mTEST\033[0m')
print('Number of examples:', '\t\t\t', len(data.test_X[0]))
print('Shape of word embeddings input:', '\t', data.test_X[0].shape)
print('Shape of character embeddings input:', '\t', data.test_X[1].shape)
print('Shape of output labels:', '\t\t', data.test_Y.shape)

Maximum sentence length: 		 75
Maximum word length: 			 16

TRAINING
Number of examples: 			 2822
Shape of word embeddings input: 	 (2822, 75)
Shape of character embeddings input: 	 (2822, 75, 16)
Shape of output labels: 		 (2822, 75, 38)

DEV
Number of examples: 			 353
Shape of word embeddings input: 	 (353, 75)
Shape of character embeddings input: 	 (353, 75, 16)
Shape of output labels: 		 (353, 75, 38)

TEST
Number of examples: 			 352
Shape of word embeddings input: 	 (352, 75)
Shape of character embeddings input: 	 (352, 75, 16)
Shape of output labels: 		 (352, 75, 38)
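
The zero-padding described above can be sketched roughly as follows. This is only an illustration, not the code in `data/data.py`: the function name `pad_sentences` and the toy input format (a list of `(word_index, [char_index, ...])` pairs per sentence) are assumptions made for the example, with index 0 reserved for padding.

```python
import numpy as np


def pad_sentences(sentences, max_sent_len, max_word_len):
    """Zero-pad word and character indices to fixed shapes (hypothetical helper).

    `sentences` is a list of sentences; each sentence is a list of
    (word_index, [char_index, ...]) pairs. Index 0 is assumed to mean padding.
    """
    n = len(sentences)
    word_X = np.zeros((n, max_sent_len), dtype=np.int64)
    char_X = np.zeros((n, max_sent_len, max_word_len), dtype=np.int64)

    for s, sentence in enumerate(sentences):
        for w, (word_idx, char_idxs) in enumerate(sentence[:max_sent_len]):
            word_X[s, w] = word_idx
            chars = char_idxs[:max_word_len]
            char_X[s, w, :len(chars)] = chars

    return word_X, char_X


# Toy corpus: two sentences with made-up indices
toy = [
    [(5, [1, 2]), (7, [3])],
    [(9, [4, 5, 6])],
]
word_X, char_X = pad_sentences(toy, max_sent_len=4, max_word_len=3)
print(word_X.shape)  # (2, 4)
print(char_X.shape)  # (2, 4, 3)
```

With the corpus maxima of 75 and 16, the same procedure would yield the `(n, 75)` and `(n, 75, 16)` shapes printed above.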


### Loading the model

The morphological tagger is implemented in `nn.py` as `MorphTagger`. Below, I use `MorphTagger` to load a previously trained model. If the tagger is instead passed a filename that does not exist, it will train a new model and save it under that name. Whenever the model is instantiated, it prints the Keras model summary. For now, I have hard-coded arbitrary values in `nn.py` for the hyperparameters required by the network, such as the dimensionality of the embedding layers (i.e., 150 for word embeddings and 200 for character embeddings).

In [4]:
from nn import MorphTagger

MODEL_FN = './models/morph_tagger.h5'

model = MorphTagger(
    model_fn=MODEL_FN,
    train_X=data.train_X,
    train_Y=data.train_Y,
    )

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 75, 16)       0                                            
__________________________________________________________________________________________________
input_1 (InputLayer)            (None, 75)           0                                            
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, 75, 16, 200)  11200       input_2[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 75, 150)      3120750     input_1[0][0]                    
__________________________________________________________________________________________________
time_distr

### Playing with predictions

With that, let's generate some predictions with `MorphTagger` and view the model's performance:

In [5]:
predictions = model.predict(data.dev_X)
model.acc(predictions, data.dev_Y)


Prediction accuracies given 6252 words across 353 sentences:
╒══════════════════╤═══════════╕
│ ALL-OR-NOTHING   │   0.69002 │
╞══════════════════╪═══════════╡
│ POS              │   0.81942 │
├──────────────────┼───────────┤
│ PERSON           │   0.93138 │
├──────────────────┼───────────┤
│ GENDER           │   0.90739 │
├──────────────────┼───────────┤
│ NUMBER           │   0.88564 │
├──────────────────┼───────────┤
│ TENSE            │   0.94386 │
├──────────────────┼───────────┤
│ DEFINITENESS     │   0.86644 │
╘══════════════════╧═══════════╛
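
One plausible reading of the ALL-OR-NOTHING row is that a word counts as correct only if every one of its labels matches the gold vector, with padding positions excluded from the count. The sketch below illustrates that reading on toy data. It is an assumption about the metric, not the implementation of `model.acc`, and the name `all_or_nothing_accuracy` is hypothetical.

```python
import numpy as np


def all_or_nothing_accuracy(pred, gold, word_X, threshold=0.5):
    """Share of non-pad words whose thresholded labels all match gold.

    pred, gold: arrays of shape (n_sentences, sent_len, n_labels)
    word_X:     array of shape (n_sentences, sent_len); 0 is assumed to be padding
    """
    mask = word_X != 0                              # True for real words
    pred_bin = pred >= threshold                    # binarize the predictions
    gold_bin = gold >= threshold
    exact = np.all(pred_bin == gold_bin, axis=-1)   # per-word exact match
    return float(exact[mask].mean())


# Toy example: one sentence with two words and one pad position
word_X = np.array([[3, 4, 0]])
gold = np.array([[[1, 0, 1], [0, 1, 0], [0, 0, 0]]], dtype=float)
pred = np.array([[[0.9, 0.1, 0.8], [0.2, 0.7, 0.6], [0.4, 0.3, 0.9]]])

acc = all_or_nothing_accuracy(pred, gold, word_X)
print(acc)  # 0.5
```

Here the first word matches on all three labels, the second has one spurious label, and the pad position is ignored, so one of two words is counted as correct.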


The model can also print a confusion matrix for the POS tags predicted by the network by calling `model.pos_confusion(predictions, data.dev_Y)`. Given the size of the matrix, however, I include only an image of it here.

<img src="./images/pos-confusion.png" alt="POS confusion matrix">
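
For readers who want to see how such a matrix can be built, here is a minimal NumPy sketch. It assumes, purely for illustration, that the first `n_pos` entries of each label vector hold the POS scores and that a boolean `mask` marks the non-pad words; neither assumption is taken from `nn.py`, and the function below is not the actual `pos_confusion` method.

```python
import numpy as np


def pos_confusion_sketch(pred, gold, mask, n_pos):
    """Count (gold POS, predicted POS) pairs over the non-pad words.

    Rows index the gold tag and columns index the predicted tag.
    """
    pred_pos = pred[..., :n_pos].argmax(axis=-1)[mask]
    gold_pos = gold[..., :n_pos].argmax(axis=-1)[mask]

    matrix = np.zeros((n_pos, n_pos), dtype=np.int64)
    np.add.at(matrix, (gold_pos, pred_pos), 1)  # accumulate repeated index pairs
    return matrix


# Toy example: two real words and one pad position, two POS classes
mask = np.array([[True, True, False]])
gold = np.array([[[1, 0, 1], [0, 1, 0], [0, 0, 0]]], dtype=float)
pred = np.array([[[0.9, 0.1, 0.8], [0.6, 0.4, 0.1], [0.5, 0.5, 0.5]]])

matrix = pos_confusion_sketch(pred, gold, mask, n_pos=2)
print(matrix)
```

In this toy case the first word is tagged correctly as class 0, while the second word has gold class 1 but is predicted as class 0, so the off-diagonal cell `matrix[1, 0]` is 1.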

To assess the *quality* of the model's predictions, I gave the `Data()` object a `decipher()` method that returns a human-readable string for a given output vector; e.g., see the gold vector for a `3rd-person, singular, feminine pronoun` below. The `decipher()` method attributes a label like `singular` to a word if the model gives that label a value greater than or equal to a specified threshold (e.g., `threshold=0.5`). The part of speech is the exception: `decipher()` always assigns the POS with the highest value, regardless of whether that value meets the threshold, since every token belongs to some POS class.

In [6]:
import numpy as np

label_vec = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
     0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # היא (she)

data.decipher(label_vec, threshold=0.5)

'pronoun 3 feminine singular'
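
The decoding rule described above (arg-max for the part of speech, a threshold for every other label) can be sketched as follows. The label inventory here is a small, hypothetical stand-in for the real 38-dimensional layout, and `decipher_sketch` is not the method in `data/data.py`.

```python
import numpy as np

# Hypothetical toy label inventory; the real layout is defined in data/data.py
POS_LABELS = ['noun', 'verb', 'pronoun']
FEATURE_LABELS = ['3', 'feminine', 'masculine', 'singular', 'plural']


def decipher_sketch(vec, threshold=0.5):
    """Turn a score vector into a readable label string."""
    vec = np.asarray(vec, dtype=float)
    n_pos = len(POS_LABELS)

    # The POS is always the highest-scoring class, even below the threshold
    pos = POS_LABELS[int(vec[:n_pos].argmax())]

    # Every other label is kept only if its score meets the threshold
    features = [
        label
        for label, value in zip(FEATURE_LABELS, vec[n_pos:])
        if value >= threshold
    ]
    return ' '.join([pos] + features)


result = decipher_sketch([0.1, 0.2, 0.4, 0.9, 0.7, 0.1, 0.6, 0.2])
print(result)  # pronoun 3 feminine singular
```

Note that `pronoun` is selected with a score of only 0.4, which is below the threshold, while `masculine` (0.1) and `plural` (0.2) are dropped.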

For fun, I randomly select a sentence from the predictions here and show the predicted and gold labels side by side for each word in the sentence.

In [8]:
# randomly select a sentence index
i = np.random.randint(0, high=len(data.dev_Y), size=1)[0]

# compare the predictions against the gold labels for the randomly selected sentence
for word_idx, pred, gold in zip(data.dev_X[0][i], predictions[i], data.dev_Y[i]):

    # if the word is not a pad token
    if word_idx > 1:
        print('\033[1m%s\033[0m' % data.idx_word[word_idx])  # the word in Hebrew
        print(data.decipher(pred), '(predicted)')            # the prediction
        print(data.decipher(gold), '(gold)\n')               # the actual/gold label

הוא
pronoun 3 masculine singular indefinite (predicted)
pronoun 3 masculine singular indefinite (gold)

חושב
participle 1 2 3 masculine singular indefinite (predicted)
participle 1 2 3 masculine singular indefinite (gold)

שהם
pronoun 3 masculine plural indefinite (predicted)
pronoun 3 masculine plural indefinite (gold)

עושים
participle 1 2 3 masculine plural indefinite (predicted)
participle 1 2 3 masculine plural indefinite (gold)

סדרה
noun feminine singular indefinite (predicted)
noun feminine singular indefinite (gold)

טובה
adjective feminine singular indefinite (predicted)
adjective feminine singular indefinite (gold)

ומצחיקה
noun feminine singular indefinite (predicted)
adjective feminine singular indefinite (gold)

.
punctuation (predicted)
punctuation (gold)

