# Deep Learning NER task

Tatjana Cucic and Sanna Volanen

https://spacy.io/api/annotation

# Milestones

## 1.1 Predicting word labels independently

* The first part is to train a classifier which assigns a label for each given input word independently. 
* Evaluate the results on token level and entity level. 
* Report your results with different network hyperparameters. 
* Also discuss whether the token level accuracy is a reasonable metric.









In [2]:
# Training data: Used for training the model
!wget -nc https://raw.githubusercontent.com/sainvo/DeepLearning_NER/master/data/train.tsv

# Development/ validation data: Used for testing different model parameters, for example level of regularization needed
!wget -nc https://raw.githubusercontent.com/sainvo/DeepLearning_NER/master/data/dev.tsv

# Test data: Never touched during training / model development, used for evaluating the final model
!wget -nc https://raw.githubusercontent.com/sainvo/DeepLearning_NER/master/data/test.tsv

#saved model
#!wget -nc https://raw.githubusercontent.com/sainvo/DeepLearning_NER/master/saved_models/Adamax90.h5



File ‘train.tsv’ already there; not retrieving.

File ‘dev.tsv’ already there; not retrieving.

File ‘test.tsv’ already there; not retrieving.



In [3]:
import sys 
import csv

csv.field_size_limit(sys.maxsize)

131072

In [111]:
#read tsv data to list of lists of lists: a list of sentences that contain lists of tokens that are lists of unsplit \t lines from the tsv, such as ['attract\tO']
token = {"word":"","entity_label":""}

def read_ontonotes(tsv_file): # 
    current_sent = [] # list of (word,label) lists
    with open(tsv_file) as f:
        tsvreader = csv.reader(f, delimiter= '\n')
        for line in tsvreader:
            #print(line)
            if not line:
                if current_sent:
                    yield current_sent
                    current_sent=[]
                continue
            current_sent.append(line[0]) 
        else:
            if current_sent:
                yield current_sent

full_train_data = list(read_ontonotes('train.tsv'))
size_tr = int(len(full_train_data)/2)
#print(size_tr)
##slice train
train_data_cut = full_train_data[:size_tr]
print(train_data_cut[:30])

#print()
full_dev_data = list(read_ontonotes('dev.tsv'))
size_dv = int(len(full_dev_data)/2)
#print(size_dv)
#slice dev
dev_data_sample = full_dev_data[:size_dv]
#print(dev_data_sample[:5])
#print()
full_test_data = list(read_ontonotes('test.tsv'))
size_ts = int(len(full_test_data)/2)
#print(size_ts)
test_data_sample = full_test_data[:size_ts]



[['Big\tO', 'Managers\tO', 'on\tO', 'Campus\tO'], ['In\tO', 'recent\tB-DATE', 'years\tI-DATE', ',\tO', 'advanced\tO', 'education\tO', 'for\tO', 'professionals\tO', 'has\tO', 'become\tO', 'a\tO', 'hot\tO', 'topic\tO', 'in\tO', 'the\tO', 'business\tO', 'community\tO', '.\tO'], ['With\tO', 'this\tO', 'trend\tO', ',\tO', 'suddenly\tO', 'the\tO', 'mature\tO', 'faces\tO', 'of\tO', 'managers\tO', 'boasting\tO', 'an\tO', 'average\tO', 'of\tO', 'over\tO', 'ten\tB-DATE', 'years\tI-DATE', 'of\tO', 'professional\tO', 'experience\tO', 'have\tO', 'flooded\tO', 'in\tO', 'among\tO', 'the\tO', 'young\tO', 'people\tO', 'populating\tO', 'university\tO', 'campuses\tO', '.\tO'], ['In\tO', 'order\tO', 'to\tO', 'attract\tO', 'this\tO', 'group\tO', 'of\tO', 'seasoned\tO', 'adults\tO', 'pulling\tO', 'in\tO', 'over\tO', 'NT$\tB-MONEY', '1\tI-MONEY', 'million\tI-MONEY', 'a\tO', 'year\tO', 'back\tO', 'to\tO', 'the\tO', 'ivory\tO', 'tower\tO', ',\tO', 'universities\tO', 'have\tO', 'begun\tO', 'to\tO', 'establish\t

In [121]:
import re
from pprint import pprint
#regex for empty space chars, \t \n
#tab = re.compile('[\t]')
#line = re.compile('[\n')

def clean(raw_data): ## input list is list of lists of strings 
    clean_data =[]  #list of lists that have one clean sentence per list
    for sent in raw_data:
        clean_sent = []
        for item in sent: # item is string
            #print('---')
            #print("item type: ", type(item))
            #print("item", item)
            one_sentence = []
            item_list = item.split("\n") # if new lines present
            if len(item_list)== 1: ## item only has one token and tag separated with \t
                #print("item as list", item_list)
                item = item.split("\t")
                one_sentence.append(item)
            elif len(item_list) > 1:   ## item has more than one token and tag pair separated with \t and also \n NOTE! these turned out to be quotes
                print("item as list", item_list)
                for subitem in item_list:
                    subitem_list = subitem.split("\n") ## if even the sublist items have items with \n
                    for subsubitem in subitem_list:
                        sub = subsubitem.split("\t")
                        #print("Splitted subitem: ", sub)
                        if sub[0] =="" : # replacing empty token with missing quote marks
                            #print(sub[0])
                            sub[0] = '\"'
                            #print(sub)
                    else:
                        print("Subitem type:", type(subitem))
                    sublist.append(sub )
                for item in sublist:
                    clean_sent.append(item)
        clean_data.append(clean_sent)        
    return clean_data

#train_data_sample = train_data_cut
train_data_clean = clean(train_data_cut[:50])
#print(len(train_data_clean))
#for item in train_data_clean[:3]:
    #print(item)
item_lengths = []
max_text = 0
for item in train_data_clean:
    #print(item)
    item_lengths.append(len(item))
    if len(item) > max_text:
        max_text = len(item)
        ind = train_data_clean.index(item)
print("Longest sentence:", max_text, "index: ",ind)

item_lengths_sorted = sorted(item_lengths, reverse=True)
max = item_lengths_sorted[0]
print(max)
#pprint(train_data_clean[21367])
pprint(item_lengths_sorted[:100])

item as list ['\tO', 'back\tO', 'flow\tO', 'education\tO', '\tO']
Subitem type: <class 'str'>


NameError: ignored

In [0]:

print('------------------------------------------')
dev_data_clean = clean(dev_data_sample)
print(len(dev_data_clean))
for item in dev_data_clean[:3]:
    print(item)
print('------------------------------------------')
test_data_clean = clean(test_data_sample)
print(len(test_data_clean))
for item in test_data_clean[:3]:
    print(item)
print('------------------------------------------')    

In [41]:
# shape into dicts per sentence

def reshape_sent2dicts(f):
    data_dict = []
    for item in f: # list of lists (tokens)
        #print(item)
        sent_text= [] 
        sent_tags = []
        for token in item:
            if len(token) ==2:
                sent_text.append(token[0])
                sent_tags.append(token[1])
        sent_dict = {'text':sent_text,'tags':sent_tags }
        #print(sent_dict['text'])
        #print(sent_dict['tags'])
        data_dict.append(sent_dict)
    return data_dict

train_data_sent = list(reshape_sent2dicts(train_data_clean[:30000]))
samp = train_data_sent[:2]
print(samp)
print()
dev_data_sent = list(reshape_sent2dicts(dev_data_clean))
samp2 = dev_data_sent[:3]
print(samp2)

[{'text': ['Big', 'Managers', 'on', 'Campus'], 'tags': ['O', 'O', 'O', 'O']}, {'text': ['In', 'recent', 'years', ',', 'advanced', 'education', 'for', 'professionals', 'has', 'become', 'a', 'hot', 'topic', 'in', 'the', 'business', 'community'], 'tags': ['O', 'B-DATE', 'I-DATE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}]

[{'text': ['President', 'Chen', 'Travels', 'Abroad'], 'tags': ['B-WORK_OF_ART', 'I-WORK_OF_ART', 'I-WORK_OF_ART', 'I-WORK_OF_ART']}, {'text': ['(', 'Chang', 'Chiung', '-', 'fang', '/', 'tr.', 'by', 'David', 'Mayer', ')'], 'tags': ['O', 'B-PERSON', 'I-PERSON', 'I-PERSON', 'I-PERSON', 'O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O']}, {'text': ['President', 'Chen', 'Shui', '-', 'bian', 'visited', 'the', 'Nicaraguan', 'National', 'Assembly', 'on', 'August', '17', ',', 'where', 'he', 'received', 'a', 'medal', 'from', 'the', 'president', 'of', 'the', 'assembly', ',', 'Ivan', 'Escobar', 'Fornos'], 'tags': ['O', 'B-PERSON', 'I-PERSON', 'I-PERSON', 'I-PE

In [47]:
item_lengths = []
max_text = 0

for item in train_data_sent:
    item_lengths.append(len(item["text"]))
    if len(item["text"]) > max_text:
        max_text = len(item["text"])
print("Longest sentence:", max_text)

item_lengths_sorted = sorted(item_lengths, reverse=True)
max = item_lengths_sorted[0]
print(max)
for i in range(100):
    print(item_lengths_sorted[i])
for item in train_data_sent:
    if len(item["text"]) == 17040:
        ind = train_data_sent.index(item)
print(ind)

max_text = 0
for item in train_data_sent:
    if len(item["text"]) > max_text:
        max_text = len(item["text"])
print("Longest sentence:", max_text)

Longest sentence: 17040
17040
17040
15708
11463
9411
6619
6433
6053
4500
4450
4277
4274
3937
3851
3726
3689
3670
3352
3346
3012
3006
2995
2754
2741
2412
2393
2352
2174
2151
2118
1992
1942
1833
1669
1619
1590
1560
1537
1525
1456
1358
1305
1240
1232
1215
1196
1185
1118
1116
1091
989
981
941
875
853
834
785
782
759
757
750
741
726
708
704
695
680
660
642
621
619
597
592
547
538
535
530
518
481
442
439
425
410
407
404
394
387
374
361
336
329
323
322
313
305
300
298
297
292
291
273
21367
Longest sentence: 17040


In [19]:
import random
import numpy

#random.seed(123)
#random.shuffle(train_data_sent)
#max_sent = [max(len(i["text"])) for i in train_data_sent]
#print(max_sent)
print(type(train_data_sent))
print(train_data_sent[0]) ##one dict
print()
print(train_data_sent[0]["text"])
print()
print(train_data_sent[0]["tags"])
print('------------')

def typed_listing(data, key):
    listed = []
    max_length = 0
    for item in data: # dictionary {text:"", tags:""}
        #print('Item: ', item)
        #print('Key: ', key, ' content: ', item[key], 'length: ',len(item[key]))
        if len(item[key]) > max_length:
            max = len(item[key])
        listed.append(item[key])
    return listed, max_length

listed_texts= typed_listing(train_data_sent, "text")
train_texts = listed_texts[0]
train_txt_max = listed_texts[1]
listed_labels = typed_listing(train_data_sent, "tags")
train_labels= listed_labels[0]
train_lbl_max = listed_labels[1]
print(train_txt_max)
print(train_texts[0])
print(train_labels[0])


print('-----------------------------')
print(len(train_texts))
print('-----------------------')
print('Text: ', train_texts[0])
print(' Texts length: ',len(train_texts))
print('Label: ', train_labels[0])
print(' Labels length: ',len(train_labels))


TypeError: ignored

Longest sentence: 17040
Longest labels: 17040


In [15]:
## same for validation/dev data
listed_texts= typed_listing(dev_data_sent, "text")
dev_texts = listed_texts[0]
dev_txt_max = listed_texts[1]
listed_labels = typed_listing(dev_data_sent, "tags")
dev_labels= listed_labels[0]
dev_lbl_max = listed_labels[1]
print('Text: ', dev_texts[0])
print(' Texts length: ',len(dev_texts))
print('Label: ', dev_labels[0])
print(' Labels length: ',len(dev_labels))


Text:  ['President', 'Chen', 'Travels', 'Abroad']
 Texts length:  5806
Label:  ['B-WORK_OF_ART', 'I-WORK_OF_ART', 'I-WORK_OF_ART', 'I-WORK_OF_ART']
 Labels length:  5806


In [16]:
# Load pretrained embeddings
!wget -nc https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip


--2020-05-13 17:43:34--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 104.22.74.142, 2606:4700:10::6816:4b8e, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: ‘wiki-news-300d-1M.vec.zip’


2020-05-13 17:44:37 (10.6 MB/s) - ‘wiki-news-300d-1M.vec.zip’ saved [681808098/681808098]



In [0]:
# Give -n argument so that a possible existing file isn't overwritten 
!unzip -n wiki-news-300d-1M.vec.zip

Archive:  wiki-news-300d-1M.vec.zip
  inflating: wiki-news-300d-1M.vec   


In [0]:
from gensim.models import KeyedVectors

vector_model = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec", binary=False, limit=50000)


# sort based on the index to make sure they are in the correct order
words = [k for k, v in sorted(vector_model.vocab.items(), key=lambda x: x[1].index)]
print("Words from embedding model:", len(words))
print("First 50 words:", words[:50])

# Normalize the vectors to unit length
print("Before normalization:", vector_model.get_vector("in")[:10])
vector_model.init_sims(replace=True)
print("After normalization:", vector_model.get_vector("in")[:10])

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


Words from embedding model: 50000
First 50 words: [',', 'the', '.', 'and', 'of', 'to', 'in', 'a', '"', ':', ')', 'that', '(', 'is', 'for', 'on', '*', 'with', 'as', 'it', 'The', 'or', 'was', "'", "'s", 'by', 'from', 'at', 'I', 'this', 'you', '/', 'are', '=', 'not', '-', 'have', '?', 'be', 'which', ';', 'all', 'his', 'has', 'one', 'their', 'about', 'but', 'an', '|']
Before normalization: [-0.0234 -0.0268 -0.0838  0.0386 -0.0321  0.0628  0.0281 -0.0252  0.0269
 -0.0063]
After normalization: [-0.0163762  -0.01875564 -0.05864638  0.02701372 -0.02246478  0.04394979
  0.01966543 -0.0176359   0.01882563 -0.00440898]


In [0]:
# Build vocabulary mappings

# Zero is used for padding in Keras, prevent using it for a normal word.
# Also reserve an index for out-of-vocabulary items.
vocabulary={
    "<PAD>": 0,
    "<OOV>": 1
}

for word in words: # These are words from the word2vec model
    vocabulary.setdefault(word, len(vocabulary))

print("Words in vocabulary:",len(vocabulary))
inv_vocabulary = { value: key for key, value in vocabulary.items() } # invert the dictionary


# Embedding matrix
def load_pretrained_embeddings(vocab, embedding_model):
    """ vocab: vocabulary from our data vectorizer, embedding_model: model loaded with gensim """
    pretrained_embeddings = numpy.random.uniform(low=-0.05, high=0.05, size=(len(vocab)-1,embedding_model.vectors.shape[1]))
    pretrained_embeddings = numpy.vstack((numpy.zeros(shape=(1,embedding_model.vectors.shape[1])), pretrained_embeddings))
    found=0
    for word,idx in vocab.items():
        if word in embedding_model.vocab:
            pretrained_embeddings[idx]=embedding_model.get_vector(word)
            found+=1
            
    print("Found pretrained vectors for {found} words.".format(found=found))
    return pretrained_embeddings

pretrained=load_pretrained_embeddings(vocabulary, vector_model)

Words in vocabulary: 50002
Found pretrained vectors for 50000 words.


Preprocessing

In [0]:
#Labels


not_letter = re.compile(r'[^a-zA-Z]')
# Label mappings
# 1) gather a set of unique labels
label_set = set()
for sentence_labels in train_labels: #loops over sentences 
    #print(sentence_labels)
    for label in sentence_labels: #loops over labels in one sentence
       # match = not_letter.match(label)
        #if match or label== 'O':
        #    break
        #else:    
        label_set.add(label)

# 2) index these
label_map = {}
for index, label in enumerate(label_set):
    label_map[label]=index
    
pprint(label_map)

{'B-CARDINAL': 0,
 'B-DATE': 31,
 'B-EVENT': 11,
 'B-FAC': 33,
 'B-GPE': 6,
 'B-LANGUAGE': 25,
 'B-LAW': 20,
 'B-LOC': 29,
 'B-MONEY': 19,
 'B-NORP': 23,
 'B-ORDINAL': 3,
 'B-ORG': 24,
 'B-PERCENT': 36,
 'B-PERSON': 30,
 'B-PRODUCT': 13,
 'B-QUANTITY': 35,
 'B-TIME': 22,
 'B-WORK_OF_ART': 4,
 'I-CARDINAL': 1,
 'I-DATE': 18,
 'I-EVENT': 8,
 'I-FAC': 26,
 'I-GPE': 5,
 'I-LANGUAGE': 15,
 'I-LAW': 34,
 'I-LOC': 10,
 'I-MONEY': 9,
 'I-NORP': 16,
 'I-ORDINAL': 28,
 'I-ORG': 2,
 'I-PERCENT': 7,
 'I-PERSON': 17,
 'I-PRODUCT': 32,
 'I-QUANTITY': 12,
 'I-TIME': 14,
 'I-WORK_OF_ART': 27,
 'O': 21}


In [0]:
# vectorize the labels
def label_vectorizer(train_labels,label_map):
    vectorized_labels = []
    for label in train_labels:
        vectorized_example_label = []
        for token in label:
            if token in label_map:
                vectorized_example_label.append(label_map[token])
        vectorized_labels.append(vectorized_example_label)
    vectorized_labels = numpy.array(vectorized_labels)
    return vectorized_labels
        

vectorized_labels = label_vectorizer(train_labels,label_map)
validation_vectorized_labels = label_vectorizer(dev_labels,label_map)

print(vectorized_labels[0])

[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21]


In [0]:
## vectorization of the texts
def text_vectorizer(vocab, train_texts):
    vectorized_data = [] # turn text into numbers based on our vocabulary mapping
    sentence_lengths = [] # Number of tokens in each sentence
    
    for i, one_example in enumerate(train_texts):
        vectorized_example = []
        for word in one_example:
            vectorized_example.append(vocab.get(word, 1)) # 1 is our index for out-of-vocabulary tokens

        vectorized_data.append(vectorized_example)     
        sentence_lengths.append(len(one_example))
        
    vectorized_data = numpy.array(vectorized_data) # turn python list into numpy array
    
    return vectorized_data, sentence_lengths

vectorized_data, lengths=text_vectorizer(vocabulary, train_texts)
validation_vectorized_data, validation_lengths=text_vectorizer(vocabulary, dev_texts)

print(train_texts[0])
print(vectorized_data[0])
pprint(type(lengths))
#max = lengths.index(17040)
#print(max)
#pprint(train_texts[11103])

['wrote', ':', 'I', "'ll", 'tell', 'a', 'tale', 'so', 'moving', 'that', 'it', "'s", 'sure', 'to', 'make', 'you', 'snivel', ',', 'A', 'turkey', 'learned', 'to', 'peck', 'the', 'keys', 'and', 'post', 'a', 'pile', 'of', 'drivel', ',', 'How', 'long', 'have', 'you', 'known', 'you', 'are', 'a', 'Turkey', '?']
[789, 11, 30, 1796, 1367, 9, 7233, 59, 1238, 13, 21, 26, 584, 7, 140, 32, 1, 2, 106, 12094, 2533, 7, 1, 3, 5824, 5, 699, 9, 11297, 6, 21599, 2, 979, 389, 38, 32, 456, 32, 34, 9, 1959, 39]
<class 'list'>


In [0]:
# padding for tensor
import tensorflow as tf
### Only needed for me, not to block the whole GPU, you don't need this stuff
#from keras.backend.tensorflow_backend import set_session
#config = tf.ConfigProto()
#config.gpu_options.per_process_gpu_memory_fraction = 0.3
#set_session(tf.Session(config=config))
### ---end of weird stuff

from keras.preprocessing.sequence import pad_sequences
print("Old shape:", vectorized_data.shape)
vectorized_data_padded=pad_sequences(vectorized_data, padding='pre', maxlen=max(lengths))
print("New shape:", vectorized_data_padded.shape)
print("First example:")
print( vectorized_data_padded[0])
# Even with the sparse output format, the shape has to be similar to the one-hot encoding
vectorized_labels_padded=numpy.expand_dims(pad_sequences(vectorized_labels, padding='pre', maxlen=max(lengths)), -1)
print("Padded labels shape:", vectorized_labels_padded.shape)
pprint(label_map)
print("First example labels:")
pprint(vectorized_labels_padded[0])

weights = numpy.copy(vectorized_data_padded)
weights[weights > 0] = 1
print("First weight vector:")
print( weights[0])

# Same stuff for the validation data
validation_vectorized_data_padded=pad_sequences(validation_vectorized_data, padding='pre', maxlen=max(lengths))
validation_vectorized_labels_padded=numpy.expand_dims(pad_sequences(validation_vectorized_labels, padding='pre',maxlen=max(lengths)), -1)
validation_weights = numpy.copy(validation_vectorized_data_padded)
validation_weights[validation_weights > 0] = 1

Old shape: (30000,)


TypeError: ignored

In [0]:
# Evaluation function
import keras

def _convert_to_entities(input_sequence):
    """
    Reads a sequence of tags and converts them into a set of entities.
    """
    entities = []
    current_entity = []
    previous_tag = label_map['O']
    for i, tag in enumerate(input_sequence):
        if tag != previous_tag and tag != label_map['O']: # New entity starts
            if len(current_entity) > 0:
                entities.append(current_entity)
                current_entity = []
            current_entity.append((tag, i))
        elif tag == label_map['O']: # Entity has ended
            if len(current_entity) > 0:
                entities.append(current_entity)
                current_entity = []
        elif tag == previous_tag: # Current entity continues
            current_entity.append((tag, i))
        previous_tag = tag
    
    # Add the last entity to our entity list if the sentences ends with an entity
    if len(current_entity) > 0:
        entities.append(current_entity)
    
    entity_offsets = set()
    
    for e in entities:
        entity_offsets.add((e[0][0], e[0][1], e[-1][1]+1))
    return entity_offsets

def _entity_level_PRF(predictions, gold, lengths):
    pred_entities = [_convert_to_entities(labels[:lengths[i]]) for i, labels in enumerate(predictions)]
    gold_entities = [_convert_to_entities(labels[:lengths[i], 0]) for i, labels in enumerate(gold)]
    
    tp = sum([len(pe.intersection(gold_entities[i])) for i, pe in enumerate(pred_entities)])
    pred_count = sum([len(e) for e in pred_entities])
    
    try:
        precision = tp / pred_count # tp / (tp+np)
        recall = tp / sum([len(e) for e in gold_entities])
        fscore = 2 * precision * recall / (precision + recall)
    except Exception as e:
        precision, recall, fscore = 0.0, 0.0, 0.0
    print('\nPrecision/Recall/F-score: %s / %s / %s' % (precision, recall, fscore))
    return precision, recall, fscore             

def evaluate(predictions, gold, lengths):
    precision, recall, fscore = _entity_level_PRF(predictions, gold, lengths)
    return precision, recall, fscore

class EvaluateEntities(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.losses = []
        self.precision = []
        self.recall = []
        self.fscore = []
    def on_epoch_end(self, epoch, logs={}):
        self.losses.append(logs.get('loss'))
        pred = numpy.argmax(self.model.predict(validation_vectorized_data_padded), axis=-1)
        evaluation_parameters=evaluate(pred, validation_vectorized_labels_padded, validation_lengths)
        self.precision.append(evaluation_parameters[0])
        self.recall.append(evaluation_parameters[1])
        self.fscore.append(evaluation_parameters[2])
        return

In [0]:
from tensorflow.keras.models import load_model

model_EL = 

In [0]:
# model 3 KEEP!
from keras.models import Model
from keras.layers import Input, Dense, Embedding, Activation, TimeDistributed
from keras.optimizers import SGD, Adam, Adamax, Adadelta, Adagrad, Nadam 

example_count, sequence_len = vectorized_data_padded.shape
class_count = len(label_set)
hidden_size = 50

vector_size= pretrained.shape[1]

def build_model(example_count, sequence_len, class_count, hidden_size, vocabulary, vector_size, pretrained):
    inp=Input(shape=(sequence_len,))
    embeddings=Embedding(len(vocabulary), vector_size, mask_zero=True, trainable=False, weights=[pretrained])(inp)
    hidden = TimeDistributed(Dense(hidden_size, activation="sigmoid"))(embeddings) # We change this activation function
    outp = TimeDistributed(Dense(class_count, activation="softmax"))(hidden)
    return Model(inputs=[inp], outputs=[outp])

model3 = build_model(example_count, sequence_len, class_count, hidden_size, vocabulary, vector_size, pretrained)

In [0]:
print(model3.summary())

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 17040)             0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 17040, 300)        15000600  
_________________________________________________________________
time_distributed_1 (TimeDist (None, 17040, 50)         15050     
_________________________________________________________________
time_distributed_2 (TimeDist (None, 17040, 37)         1887      
Total params: 15,017,537
Trainable params: 16,937
Non-trainable params: 15,000,600
_________________________________________________________________
None


In [0]:
# train the model 3 KEEP!!
optimizer=Adadelta(lr=0.01) # define the learning rate
model3.compile(optimizer=optimizer,loss="sparse_categorical_crossentropy", sample_weight_mode='temporal')
evaluation_function=EvaluateEntities()

# train
vanilla_hist=model3.fit(vectorized_data_padded,vectorized_labels_padded, sample_weight=weights, batch_size=100,verbose=2,epochs=10, callbacks=[evaluation_function])

Epoch 1/10
 - 101s - loss: 0.0045

Precision/Recall/F-score: 0.0 / 0.0 / 0.0
Epoch 2/10
 - 100s - loss: 0.0045

Precision/Recall/F-score: 0.0 / 0.0 / 0.0
Epoch 3/10
 - 100s - loss: 0.0044

Precision/Recall/F-score: 0.0 / 0.0 / 0.0
Epoch 4/10


KeyboardInterrupt: ignored

In [0]:
# plot the f scores
%matplotlib inline
import matplotlib.pyplot as plt

def plot_history(fscores):
    print("History:", fscores)
    print("Highest f-score:", max(fscores))
    plt.plot(fscores)
    plt.legend(loc='lower center', borderaxespad=0.)
    plt.show()

plot_history(evaluation_function.fscore)

## 1.2 Expand context

Modify your network in such way that it is able to utilize the surrounding context of the word. This can be done for instance with a convolutional or recurrent layer. Analyze different neural network architectures and hyperparameters. How does utilizing the surrounding context influence the predictions?


In [0]:
#expanding to RNN model with context

from keras.layers import LSTM

example_count, sequence_len = vectorized_data_padded.shape
class_count = len(label_set)
rnn_size = 100

vector_size= pretrained.shape[1]

def build_rnn_model(example_count, sequence_len, class_count, rnn_size, vocabulary, vector_size, pretrained):
    inp=Input(shape=(sequence_len,))
    embeddings=Embedding(len(vocabulary), vector_size, mask_zero=False, trainable=False, weights=[pretrained])(inp)
    rnn = LSTM(rnn_size, activation='relu', return_sequences=True)(embeddings)
    outp=Dense(class_count, activation="softmax")(rnn)
    return Model(inputs=[inp], outputs=[outp])

rnn_model = build_rnn_model(example_count, sequence_len, class_count, rnn_size, vocabulary, vector_size, pretrained)

In [0]:

print(model.summary())

In [0]:

optimizer=Adam(lr=0.01) # define the learning rate
rnn_model.compile(optimizer=optimizer,loss="sparse_categorical_crossentropy", sample_weight_mode='temporal')

evaluation_function=EvaluateEntities()

# train
rnn_hist=rnn_model.fit(vectorized_data_padded,vectorized_labels_padded, sample_weight=weights, batch_size=100,verbose=2,epochs=10, callbacks=[evaluation_function])

In [0]:

%matplotlib inline

plot_history(evaluation_function.fscore)

## 2.1 Use deep contextual representations

Use deep contextual representations. Fine-tune the embeddings with different hyperparameters. Try different models (e.g. cased and uncased, multilingual BERT). Report your results.


## 2.2 Error analysis

Select one model from each of the previous milestones (three models in total). Look at the entities these models predict. Analyze the errors made. Are there any patterns? How do the errors one model makes differ from those made by another?

## 3.1 Predictions on unannotated text

Use the three models selected in milestone 2.2 to do predictions on the sampled wikipedia text.

## 3.2 Statistically analyze the results

Statistically analyze (i.e. count the number of instances) and compare the predictions. You can, for example, analyze if some models tend to predict more entities starting with a capital letter, or if some models predict more entities for some specific classes than others.