<a href="https://colab.research.google.com/github/sainvo/DeepLearning_NER/blob/master/DL_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning NER task

Tatjana Cucic and Sanna Volanen

# Milestones

## 1.1 Predicting word labels independently

* The first part is to train a classifier that assigns a label to each input word independently.
* Evaluate the results at the token level and at the entity level.
* Report your results with different network hyperparameters.
* Also discuss whether token-level accuracy is a reasonable metric.
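
The difference between the two evaluation levels can be sketched in plain Python. The helper names below (`extract_entities`, `token_accuracy`, `entity_prf`) are illustrative and not part of the assignment, and the handling of a stray `I-` tag is an assumption rather than a fixed convention.

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from one BIO tag sequence (end exclusive)."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel "O" closes a trailing entity
        # The open entity continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, ent_type))
            start, ent_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # Assumption: an "I-" tag with no open entity starts a new one.
            start, ent_type = i, tag[2:]
    return entities


def token_accuracy(gold, pred):
    """Share of tokens whose predicted tag equals the gold tag."""
    pairs = [(g, p) for gs, ps in zip(gold, pred) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs)


def entity_prf(gold, pred):
    """Exact-match precision, recall and F1 over entity spans."""
    tp = n_gold = n_pred = 0
    for gs, ps in zip(gold, pred):
        g, p = extract_entities(gs), extract_entities(ps)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# An 18-token sentence with one two-token DATE entity, and a model that predicts only "O".
gold = [["O", "B-DATE", "I-DATE"] + ["O"] * 15]
pred = [["O"] * 18]
print(token_accuracy(gold, pred))
print(entity_prf(gold, pred))
```

In this toy case the all-`O` prediction is right on 16 of 18 tokens (about 0.89) yet finds no entity at all, so its entity-level F1 is 0. This is the kind of gap the last bullet asks you to discuss.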

In [1]:
!wget https://raw.githubusercontent.com/sainvo/DeepLearning_NER/master/train.tsv
!wget https://raw.githubusercontent.com/sainvo/DeepLearning_NER/master/dev.tsv

import sys 
import csv

csv.field_size_limit(sys.maxsize) # raise the per-field size limit; the call returns the previous limit

--2020-05-06 13:35:30--  https://raw.githubusercontent.com/sainvo/DeepLearning_NER/master/train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17252156 (16M) [text/plain]
Saving to: ‘train.tsv’


2020-05-06 13:35:33 (26.6 MB/s) - ‘train.tsv’ saved [17252156/17252156]

--2020-05-06 13:35:35--  https://raw.githubusercontent.com/sainvo/DeepLearning_NER/master/dev.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2419425 (2.3M) [text/plain]
Saving to: ‘dev.tsv’


2020-05-06 13:35:37 (10.1 MB/s) - ‘dev.tsv’ saved [2419425/2419425]



131072

In [9]:
from collections import namedtuple
OneWord=namedtuple("OneWord",["word","entity_label"])

def read_ontonotes(tsv_file):
    """Yield complete sentences, each a list of OneWord(word, entity_label) tuples."""
    current_sentence=[] # list of OneWord tuples
    with open(tsv_file) as f:
        tsvreader = csv.reader(f, delimiter= '\t')
        for line in tsvreader:
            if not line: #sentence break
                if current_sentence: #if we gathered a sentence, yield it, because a new one starts
                    yield current_sentence #much like return, but execution resumes here when the next element is requested
                    current_sentence=[] #...and start a new one
                continue
            #if we made it here, we are on a normal line
            assert len(line)>=2 #we need at least the word and label columns
            columns=line[:2] #keep only the word and its entity label
            current_sentence.append(OneWord(*columns)) #unpack the two columns into the namedtuple fields
        else: #for ... else -> the else part is executed once, when "for" runs out of elements
            if current_sentence: #yield also the last one!
                yield current_sentence

#read the data in as sentences
sentences_train=list(read_ontonotes("train.tsv"))
sentences_dev=list(read_ontonotes("dev.tsv"))

print("First three sentences")
for sent in sentences_train[:3]:
    print(sent)


First three sentences
[OneWord(word='Big', entity_label='O'), OneWord(word='Managers', entity_label='O'), OneWord(word='on', entity_label='O'), OneWord(word='Campus', entity_label='O')]

[OneWord(word='In', entity_label='O'), OneWord(word='recent', entity_label='B-DATE'), OneWord(word='years', entity_label='I-DATE'), OneWord(word=',', entity_label='O'), OneWord(word='advanced', entity_label='O'), OneWord(word='education', entity_label='O'), OneWord(word='for', entity_label='O'), OneWord(word='professionals', entity_label='O'), OneWord(word='has', entity_label='O'), OneWord(word='become', entity_label='O'), OneWord(word='a', entity_label='O'), OneWord(word='hot', entity_label='O'), OneWord(word='topic', entity_label='O'), OneWord(word='in', entity_label='O'), OneWord(word='the', entity_label='O'), OneWord(word='business', entity_label='O'), OneWord(word='community', entity_label='O'), OneWord(word='.', entity_label='O')]

[OneWord(word='With', entity_label='O'), OneWord(word='this', ent

In [14]:
# shape into dicts per sentence
data_dict_train = []
for sentence in sentences_train:
    sent_text= []
    sent_tags = []
    for token in sentence: # don't reuse the name OneWord here: it would shadow the namedtuple class
        sent_text.append(token.word)
        sent_tags.append(token.entity_label)
    sent_dict = {'text':sent_text,'tags':sent_tags }
    data_dict_train.append(sent_dict)
print(data_dict_train[0])

{'text': ['Big', 'Managers', 'on', 'Campus'], 'tags': ['O', 'O', 'O', 'O']}


In [16]:
import random
import numpy

data = data_dict_train # same list object, so the shuffle below also reorders data_dict_train
random.seed(124)
random.shuffle(data)

texts=[example["text"] for example in data]
labels=[example["tags"] for example in data]

print('Text: ', texts[0])
print('Label: ', labels[0])

Text:  ['Nonetheless', ',', 'when', 'taken', 'as', 'a', 'whole', 'it', 'presents', 'the', 'ideas', 'that', 'there', 'are', 'spirits', 'in', 'the', 'earth', 'and', 'sky', ',', 'and', 'that', 'the', 'universe', 'is', 'an', 'organic', 'and', 'connected', 'whole', '.']
Label:  ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


Vectorizer vocab size: 39078
Train shape (1029876, 39078)
Dev shape (250420, 39078)


[LibLinear]

LinearSVC(C=0.05, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=1)

0.8895415701621276

[LibLinear]

0.8925045922849613
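
The cells that produced the outputs above are not included in this section. The reported shapes are consistent with one feature row per token and one column per vocabulary word, but that is an inference. As a minimal sketch under that assumption, the following plain-Python code builds a word-identity vocabulary and maps each token to its column. The names `build_vocab` and `token_rows` and the toy sentences are hypothetical.

```python
def build_vocab(sentences):
    """Map each distinct word to a column index, in order of first appearance."""
    vocab = {}
    for words in sentences:
        for word in words:
            vocab.setdefault(word, len(vocab))
    return vocab


def token_rows(sentences, vocab, unk=None):
    """Return one column index per token; words missing from the vocabulary map to `unk`."""
    return [vocab.get(word, unk) for words in sentences for word in words]


# Toy data standing in for the texts of the train and dev sets.
train = [["Big", "Managers", "on", "Campus"], ["on", "Campus"]]
dev = [["Big", "Campus", "topic"]]

vocab = build_vocab(train)           # fitted on training data only
rows_train = token_rows(train, vocab)
rows_dev = token_rows(dev, vocab)    # "topic" is unseen, so it gets no column

print("Vocab size:", len(vocab))
print("Train shape", (len(rows_train), len(vocab)))
print("Dev shape", (len(rows_dev), len(vocab)))
```

With features like these, each word is classified with no knowledge of its neighbours, which is the setting milestone 1.1 describes.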

## 1.2 Expand context

Modify your network in such a way that it is able to utilize the surrounding context of the word. This can be done, for instance, with a convolutional or recurrent layer. Analyze different neural network architectures and hyperparameters. How does utilizing the surrounding context influence the predictions?
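
A simpler alternative to the layers named above is a fixed-size window of neighbouring words as the input for each position. The sketch below shows only that input preparation, not a network. The function name `context_windows` and the `<PAD>` token are assumptions made for illustration.

```python
def context_windows(words, size=2, pad="<PAD>"):
    """Return one window of 2*size+1 tokens per word, centred on that word."""
    padded = [pad] * size + list(words) + [pad] * size
    return [padded[i:i + 2 * size + 1] for i in range(len(words))]


for window in context_windows(["Big", "Managers", "on", "Campus"], size=1):
    print(window)
```

Each window can be embedded and fed to a dense layer, whereas a convolutional or recurrent layer learns a similar neighbourhood view directly from the full sentence.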


## 2.1 Use deep contextual representations

Use deep contextual representations. Fine-tune the embeddings with different hyperparameters. Try different models (e.g. cased and uncased, multilingual BERT). Report your results.
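
Models such as BERT split words into subword pieces, so the word-level tags have to be aligned with the pieces before fine-tuning. The sketch below shows one common scheme: the first piece of a word keeps the word's tag and the remaining pieces get a placeholder label. `toy_wordpiece` is a hypothetical stand-in for a real tokenizer and the `"X"` placeholder is an assumption; in practice you would use the tokenizer that belongs to the chosen model.

```python
IGNORE = "X"  # hypothetical placeholder label for non-initial pieces


def toy_wordpiece(word, max_len=4):
    """Stand-in tokenizer: fixed-size chunks, with '##' marking continuation pieces."""
    chunks = [word[i:i + max_len] for i in range(0, len(word), max_len)] or [word]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]


def align_labels(words, labels, tokenize=toy_wordpiece):
    """Expand word-level labels to piece level: first piece keeps the label, the rest get IGNORE."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        sub = tokenize(word)
        pieces.extend(sub)
        piece_labels.extend([label] + [IGNORE] * (len(sub) - 1))
    return pieces, piece_labels


pieces, piece_labels = align_labels(["recent", "years"], ["B-DATE", "I-DATE"])
print(list(zip(pieces, piece_labels)))
```

At evaluation time only the predictions on first pieces are read back, so the results stay comparable with the word-level models from the earlier milestones.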


## 2.2 Error analysis

Select one model from each of the previous milestones (three models in total). Look at the entities these models predict. Analyze the errors made. Are there any patterns? How do the errors one model makes differ from those made by another?
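
One way to look for patterns is to sort each predicted entity into a coarse error category. The sketch below assumes that gold and predicted entities are available as sets of `(start, end, type)` spans for one sentence, with `end` exclusive. The category names and the function `categorize_errors` are illustrative choices, not a standard taxonomy.

```python
from collections import Counter


def overlaps(a, b):
    """True if two (start, end, type) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def categorize_errors(gold, pred):
    """Count correct, wrong_type, wrong_boundary, spurious and missed entities."""
    counts = Counter()
    for p in pred:
        if p in gold:
            counts["correct"] += 1
        elif any(g[:2] == p[:2] for g in gold):
            counts["wrong_type"] += 1       # same span, different type
        elif any(overlaps(g, p) for g in gold):
            counts["wrong_boundary"] += 1   # partial overlap; the type may also differ
        else:
            counts["spurious"] += 1         # no gold entity anywhere near
    for g in gold:
        if not any(overlaps(g, p) for p in pred):
            counts["missed"] += 1           # gold entity with no overlapping prediction
    return counts


# Hypothetical spans for a single sentence.
gold = {(1, 3, "DATE"), (5, 6, "PERSON"), (8, 10, "ORG"), (15, 16, "GPE")}
pred = {(1, 3, "DATE"), (5, 6, "ORG"), (8, 9, "ORG"), (12, 13, "GPE")}
print(categorize_errors(gold, pred))
```

Summing these counters per model over the whole dev set gives a table that can be compared across the three selected models.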

## 3.1 Predictions on unannotated text

Use the three models selected in milestone 2.2 to make predictions on the sampled Wikipedia text.

## 3.2 Statistically analyze the results

Statistically analyze (i.e. count the number of instances) and compare the predictions. You can, for example, analyze if some models tend to predict more entities starting with a capital letter, or if some models predict more entities for some specific classes than others.
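
The counts suggested above can be gathered with `collections.Counter`. The sketch assumes each model's output has already been reduced to a list of `(entity_text, entity_type)` pairs. The function name `summarize_predictions` and the example predictions are hypothetical.

```python
from collections import Counter


def summarize_predictions(entities):
    """Return per-type counts and the share of entities that start with a capital letter."""
    per_type = Counter(ent_type for _, ent_type in entities)
    capitalized = sum(1 for text, _ in entities if text[:1].isupper())
    share = capitalized / len(entities) if entities else 0.0
    return per_type, share


# Made-up predictions from one model, for illustration only.
model_a = [("Helsinki", "GPE"), ("recent years", "DATE"), ("Turku", "GPE")]
per_type, share = summarize_predictions(model_a)
print(per_type.most_common())
print(share)
```

Running the same summary for each of the three models gives directly comparable numbers per entity class.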