# Troy&AbedInTheMorning Team Participation in SMM4H-Spanish

Team Participants:
* Sergio Santamaria Carrasco
* Roberto Cuervo Rosillo

In this notebook we describe our proposed system, which is based on an encoder-decoder architecture with an attention mechanism and is fed by a combination of word embeddings, including pre-trained, fine-tuned Spanish BERT embeddings.
The system is our submission to ProfNER-ST, a shared task focused on the recognition of professions and occupations in Spanish Twitter data.

<a href="https://temu.bsc.es/smm4h-spanish/">SMM4H-Spanish ProfNER-ST</a>


## Settings

1. Setting the parent directory
2. Checking that a GPU is available

In [1]:
sst_home = '/home/sergio/Escritorio/ProfNER'

In [None]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

## Text Format

The data of the corpus was obtained from a Twitter crawl that used keywords like “Covid-19”, “epidemia” (epidemic) or “confinamiento” (lockdown), as well as hashtags such as “#yomequedoencasa” (#istayathome), to retrieve relevant tweets.

In [2]:
file_text = sst_home + '/profner-data/subtask-2/brat/train/1242399976644325376.txt' 

with open(file_text, 'r') as file:
  print(file.read())

Cerramos nuestra querida Radio 😢 Nuestros colaboradores y conductores ¡Se quedan en casa! Desde mañana todos los programas de Radio Hoy se harán vía remoto X Skype. Seguimos al aire con el compromiso de siempre, nosotros apoyamos el #QuedateEnCasa #covıd19 #Coronavirus #RadioHoy https://t.co/Q81BBYpbdM


## Annotations Format

The corpus was annotated by expert linguists in an iterative process that included the creation of annotation guidelines specifically for this task. The annotated data is available in both BRAT and BIO formats.

In [3]:
file_ann = sst_home + '/profner-data/subtask-2/brat/train/1242399976644325376.ann' 

with open(file_ann) as file:
  print(file.read())

T1	PROFESION 42 55	colaboradores
T2	PROFESION 58 69	conductores
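
To make the BRAT format concrete, the following is a minimal sketch of how one text-bound annotation line can be split into its fields. The helper `parse_brat_line` and the `BratEntity` tuple are illustrative names, not part of the ProfNER tooling, and the sketch assumes contiguous spans (a single start and end offset) like the ones printed above.

```python
from typing import NamedTuple

class BratEntity(NamedTuple):
    """One text-bound BRAT annotation (hypothetical helper type)."""
    ann_id: str
    label: str
    start: int
    end: int
    text: str

def parse_brat_line(line: str) -> BratEntity:
    # A text-bound line has three tab-separated fields:
    # the id, "LABEL start end", and the covered text.
    ann_id, span, text = line.rstrip("\n").split("\t")
    label, start, end = span.split(" ")
    return BratEntity(ann_id, label, int(start), int(end), text)

tweet = "Cerramos nuestra querida Radio 😢 Nuestros colaboradores y conductores"
entity = parse_brat_line("T1\tPROFESION 42 55\tcolaboradores\n")
# Offsets are character positions in the tweet text.
print(entity.label, tweet[entity.start:entity.end])
```

Slicing the tweet with the annotated offsets recovers the annotated mention, which is a quick sanity check before converting the annotations to token-level tags.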



## Requirements


    Tensorflow 1.14.0
    Keras 2.2.4
    Fasttext 0.2.0
    MeaningCloud-python
    Spacy
    Sklearn_crfsuite
    Keras-contrib
    Pandas
    Pickle5


## Syllabizer

Using the rules of Spanish orthography, the syllabizer allows a word to be broken into its component syllables. The necessary code has been extracted from https://github.com/mabodo/sibilizador/blob/master/Silabizator.ipynb

In [2]:
class char():
    def __init__(self):
        pass
    
class char_line():
    def __init__(self, word):
        self.word = word
        self.char_line = [(char, self.char_type(char)) for char in word]
        self.type_line = ''.join(chartype for char, chartype in self.char_line)
        
    def char_type(self, char):
        if char in set(['a', 'á', 'e', 'é','o', 'ó', 'í', 'ú']):
            return 'V' #strong vowel
        if char in set(['i', 'u', 'ü']):
            return 'v' #weak vowel
        if char=='x':
            return 'x'
        if char=='s':
            return 's'
        else:
            return 'c'
            
    def find(self, finder):
        return self.type_line.find(finder)
        
    def split(self, pos, where):
        return char_line(self.word[0:pos+where]), char_line(self.word[pos+where:])
    
    def split_by(self, finder, where):
        split_point = self.find(finder)
        if split_point!=-1:
            chl1, chl2 = self.split(split_point, where)
            return chl1, chl2
        return self, False
     
    def __str__(self):
        return self.word
    
    def __repr__(self):
        return repr(self.word)

class silabizer():
    def __init__(self):
        self.grammar = []
        
    def split(self, chars):
        rules  = [('VV',1), ('cccc',2), ('xcc',1), ('ccx',2), ('csc',2), ('xc',1), ('cc',1), ('vcc',2), ('Vcc',2), ('sc',1), ('cs',1),('Vc',1), ('vc',1), ('Vs',1), ('vs',1)]
        for split_rule, where in rules:
            first, second = chars.split_by(split_rule,where)
            if second:
                if first.type_line in set(['c','s','x','cs']) or second.type_line in set(['c','s','x','cs']):
                    #print 'skip1', first.word, second.word, split_rule, chars.type_line
                    continue
                if first.type_line[-1]=='c' and second.word[0] in set(['l','r']):
                    continue
                if first.word[-1]=='l' and second.word[-1]=='l':
                    continue
                if first.word[-1]=='r' and second.word[-1]=='r':
                    continue
                if first.word[-1]=='c' and second.word[-1]=='h':
                    continue
                return self.split(first)+self.split(second)
        return [chars]
        
    def __call__(self, word):
        return self.split(char_line(word))
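
The splitting rules above operate on a string of character classes rather than on the letters themselves. As a self-contained illustration, the classification step alone can be sketched as follows; the standalone function `type_line` is an illustrative re-implementation that mirrors the `type_line` attribute of `char_line`, not additional project code.

```python
STRONG = set("aáeéoóíú")   # same strong-vowel set as in char_type
WEAK = set("iuü")          # same weak-vowel set as in char_type

def type_line(word: str) -> str:
    """Map each character to its class: V, v, x, s or c."""
    classes = []
    for ch in word:
        if ch in STRONG:
            classes.append("V")
        elif ch in WEAK:
            classes.append("v")
        elif ch in ("x", "s"):
            classes.append(ch)
        else:
            classes.append("c")   # every other character counts as a consonant
    return "".join(classes)

print(type_line("radio"))
print(type_line("casa"))
```

A rule such as `('Vc', 1)` is then matched against this class string, and the word is cut at the position where the pattern is found plus the rule's offset.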

## Spacy Model

The Spacy model selected is 'es_core_news_md'. Because this model does not keep hashtags as single tokens, a new component (HashtagMerger) is added at the end of the pipeline.

In [None]:
!python -m spacy download es_core_news_md

import spacy
import re
import es_core_news_md

nlp = es_core_news_md.load()


In [4]:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

# We're using a class because the component needs to be initialised with
# the shared vocab via the nlp object
class HashtagMerger(object):
    def __init__(self, nlp):
        # Register a new token extension to flag hashtags
        Token.set_extension("is_hashtag", default=False)
        self.matcher = Matcher(nlp.vocab)
        # Add pattern for a hashtag, i.e. '#' followed by a token that contains letters (accented vowels included)
        self.matcher.add("HASHTAG", None, [{"ORTH": "#"}, {"TEXT": {"REGEX": "[A-Za-záéíóúÁÉÍÓÚ]+"}}])

    def __call__(self, doc):
        # This method is invoked when the component is called on a Doc
        matches = self.matcher(doc)
        spans = []  # Collect the matched spans here
        for match_id, start, end in matches:
            spans.append(doc[start:end])
        with doc.retokenize() as retokenizer:
            for span in spans:
                retokenizer.merge(span)
                for token in span:
                    token._.is_hashtag = True  # Mark token as part of a hashtag
        return doc
    
hashtag_merger = HashtagMerger(nlp)
nlp.add_pipe(hashtag_merger, last = True)  # Add component to the pipeline
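
To see which strings the character class in the matcher pattern covers, a plain `re` sketch can help. It is only an approximation: the pipeline component applies the regular expression to the token that follows `#`, whereas this hypothetical `find_hashtags` helper scans raw text and does not use Spacy at all.

```python
import re

# Same character class as in the Matcher pattern, placed after a literal '#'.
HASHTAG_RE = re.compile(r"#[A-Za-záéíóúÁÉÍÓÚ]+")

def find_hashtags(text: str) -> list:
    """Return the hashtag-like substrings of a raw text (illustrative only)."""
    return HASHTAG_RE.findall(text)

print(find_hashtags("nosotros apoyamos el #QuedateEnCasa #Coronavirus #RadioHoy"))
```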

## BRAT to BIOES annotation

The following code transforms the BRAT annotations into the BIOES format. It returns a dict that maps the character offsets of each entity token to its BIOES tag. In this schema, tokens are annotated with the following tags:


    B: the token that begins a multi-token entity.
    I: a token inside an entity, between its first and last tokens.
    O: a token that does not belong to any entity.
    E: the token that ends a multi-token entity.
    S: a token that forms a single-token entity on its own.
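
As a small illustration of the schema, the tags for one entity depend only on how many tokens it spans. The helper `bioes_tags` below is a hypothetical sketch; it uses the same `TAG_LABEL` naming as the conversion code in this notebook.

```python
def bioes_tags(n_tokens: int, label: str) -> list:
    """Tags for one entity spanning n_tokens consecutive tokens."""
    if n_tokens < 1:
        raise ValueError("an entity needs at least one token")
    if n_tokens == 1:
        return ["S_" + label]
    # First token is B, last token is E, everything in between is I.
    return ["B_" + label] + ["I_" + label] * (n_tokens - 2) + ["E_" + label]

print(bioes_tags(1, "PROFESION"))
print(bioes_tags(3, "PROFESION"))
```

Tokens that are not covered by any entity simply receive the tag `O`.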


In [5]:
############# BIOES NOTATION #####################
BEGIN = 'B'
INSIDE = 'I'
OUTSIDE = 'O'
END = 'E'
SINGLE = 'S'

def getDictEntities(file_ann, ent_classes = ['PROFESION', 'SITUACION_LABORAL']):
  entities = {}
  with open(file_ann) as anns:
    for ann in anns:
      if ann.split('\t')[1].split(' ')[0] in ent_classes:
          ent = ann[:-1].split('\t')[2]
          #print(ent)
          #ent = [token for token in nlp(ent) if not token.is_stop]
          ent = nlp(ent)
          start = int(ann[:-1].split('\t')[1].split(' ')[1])
          end = int(ann[:-1].split('\t')[1].split(' ')[2])
          if (len(ent) == 1):
            entities[(start, end)] = SINGLE + '_' + ann.split('\t')[1].split(' ')[0]
          else:
            entities[(start, start + len(ent[0].text))] = BEGIN + '_' + ann.split('\t')[1].split(' ')[0]
            entities[(end - len(ent[-1].text)), end] = END + '_' + ann.split('\t')[1].split(' ')[0]
            for i in range(len(ent) - 2):
              spaces = (ent[i + 1].idx) - (ent[i].idx + len(ent[i].text))
              start = start + len(ent[i].text) + spaces
              entities[(start, start + len(ent[i + 1].text))] = INSIDE + '_' + ann.split('\t')[1].split(' ')[0]
            
  return entities


In [6]:
import os
import re

def getElements(sst_home, max_len_seq, getTags = True):
  _words = dict()
  _doc_tags = {}
  _entities = {}
  _docs = {}
  _docs_offset = {}
    
  for file in [file[:-4] for file in os.listdir(sst_home) if file.endswith('.txt') and not ' ' in file]:
    file_text = os.path.join(sst_home, file + '.txt')
    if getTags:
      file_ann = os.path.join(sst_home, file + '.ann')
      _entities = getDictEntities(file_ann)
    with open(file_text) as f:
      text = f.read()
      spacy_text = nlp(text)
    
      _tweet = []
      _tweet_tags = []
      _tweet_pos = []
      _tweet_gtags = []
      _tweet_ent = []
      _tweet_offset = []
    
      for token in spacy_text[0:max_len_seq]:
          if not token.like_url:
              _tweet.append(token.text)
              _entity = _entities.get((token.idx, token.idx + len(token.text)), 'O')
              _tweet_tags.append(_entity)
              _words[token.text] = _words.get(token.text, 0) + 1
              _tweet_pos.append(token.pos_)
              _tweet_gtags.append(token.tag_)
              _tweet_offset.append((token.text, token.idx, token.idx + len(token.text)))

    _docs[file] = (_tweet, _tweet_pos, _tweet_gtags, _tweet_ent)
    _doc_tags[file] = _tweet_tags
    _docs_offset[file] = _tweet_offset

  return _words, _docs, _doc_tags, _docs_offset

## Getting the 2idx

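Our model takes vocabularies of words, characters, syllables, and PoS tags as input, so we need dictionaries that assign an integer id to each item seen during training. Id 0 is reserved for padding (`#ENDPAD`) and, for the input features, id 1 is reserved for unknown items (`UNK`).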

In [7]:
def get2idx(words, chars, pos, gramm_tags, sylls, ents):
    
  ET = ['B_SITUACION_LABORAL', 'B_PROFESION', 
        'I_SITUACION_LABORAL', 'I_PROFESION', 
        'E_SITUACION_LABORAL', 'E_PROFESION', 
        'S_SITUACION_LABORAL', 'S_PROFESION',
        'O']

  word2idx = {w: i + 2 for i, w in enumerate(words)}
  word2idx["#ENDPAD"] = 0
  word2idx["UNK"] = 1

  char2idx = {char:i + 2 for i,char in enumerate(chars)}
  char2idx["#ENDPAD"] = 0  # to ignore this by mask_zero = True
  char2idx["UNK"] = 1

  pos2idx = {pos:i + 2 for i,pos in enumerate(pos)}
  pos2idx["#ENDPAD"] = 0  # to ignore this by mask_zero = True
  pos2idx["UNK"] = 1

  grammtags2idx = {tag:i + 2 for i,tag in enumerate(gramm_tags)}
  grammtags2idx["#ENDPAD"] = 0  # to ignore this by mask_zero = True
  grammtags2idx["UNK"] = 1

  sylls2idx = {syll:i + 2 for i,syll in enumerate(sylls)}
  sylls2idx["#ENDPAD"] = 0  # to ignore this by mask_zero = True
  sylls2idx["UNK"] = 1

  tag2idx = {tag:i + 1 for i,tag in enumerate(ET)}
  tag2idx["#ENDPAD"] = 0 

  ent2idx = {ent:i + 2 for i,ent in enumerate(ents)}
  ent2idx["#ENDPAD"] = 0  # to ignore this by mask_zero = True
  ent2idx["UNK"] = 1

  return word2idx, char2idx, tag2idx, pos2idx, grammtags2idx, sylls2idx, ent2idx
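
The effect of the two reserved ids is easiest to see on a toy vocabulary. The following sketch uses a hypothetical `build_index` helper that follows the same convention as `get2idx` (0 for padding, 1 for unknown items); the example words are arbitrary.

```python
def build_index(items):
    """Assign ids from 2 upwards; 0 and 1 are reserved."""
    index = {item: i + 2 for i, item in enumerate(items)}
    index["#ENDPAD"] = 0   # padding id
    index["UNK"] = 1       # id for items unseen during training
    return index

word2idx_demo = build_index(["Radio", "casa"])
# Words missing from the index fall back to the UNK id.
encoded = [word2idx_demo.get(w, word2idx_demo["UNK"]) for w in ["Radio", "Skype", "casa"]]
print(encoded)
```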

# Getting the input Features

The proposed system is fed with the following features:

* Words: Two different 300-dimensional representations based on pre-trained FastText word embeddings are used. Both were selected for the domain-specific knowledge they contribute: the former were generated from Spanish medical corpora and the latter were trained on Spanish Twitter data related to COVID-19. Contextual embeddings generated with a fine-tuned BETO model are also included, since these word representations are dynamically informed by the surrounding words, which improves performance.
* Part-of-speech: This feature is included because of the significant amount of information it offers about a word and its neighbors. It can also help with word sense disambiguation. The PoS tagging model used is the one provided by Spacy. An embedding representation of this feature is learned during training, resulting in a 40-dimensional vector (in the model below, 20 dimensions for the PoS tag and 20 for the grammatical tag).
* Characters: We also add character-level embeddings of the words, learned during training and resulting in a 30-dimensional vector. These have proven useful for domain-specific tasks and morphologically rich languages.
* Syllables: Syllable-level embeddings of the words, learned during training and resulting in a 75-dimensional vector, are also added. Like character-level embeddings, they help to deal with out-of-vocabulary words, capture prefixes and suffixes that are common in the domain, and so contribute to classifying words correctly.
* Cosine similarity: The BETO embeddings of the entities found in the training and validation sets are used to calculate the cosine similarity between each entity and the BETO representation of the word being analyzed, since previous work has shown that this can help to improve results on data extracted from Twitter. This information is encoded as a 3717-dimensional vector.

In [None]:
import numpy as np

### CHARACTERS ###

def getCharacterInput(sentences, max_seq_len, max_chars_len, char2idx):
  X_char = []
  for sentence in sentences:
    sent_seq = []
    for i in range(max_seq_len):
      word_seq = []
      # char sequence for words
      for j in range(max_chars_len):
        try:
          # chars of specific sentence of i
          word_seq.append(char2idx.get(sentence[i][j], 1)) 
        except:  # if char-sequence is out of range , pad it with "PAD" tag
          word_seq.append(char2idx.get("#ENDPAD"))

      sent_seq.append(word_seq)
    X_char.append(np.array(sent_seq))

  return np.array(X_char)

### SYLLABLES ###

def getSyllsInput(sentences, max_seq_len, max_sylls_len, sylls2idx):
  X_syll = []
  for sentence in sentences:
    sent_seq = []
    for i in range(max_seq_len):
      word_seq = []
      syllables = []
      try:
        syllables = silabizer(sentence[i])
      except:
        pass
      for j in range(max_sylls_len):
        try:
          word_seq.append(sylls2idx.get(str(syllables[j]).lower(), 1)) 
        except: 
          word_seq.append(sylls2idx.get("#ENDPAD"))

      sent_seq.append(word_seq)
    X_syll.append(np.array(sent_seq))

  return np.array(X_syll)

from keras.preprocessing.sequence import pad_sequences

### WORDS ###

def getWordInput(sentences, max_seq_len, word2idx):
  X = [[word2idx.get(w,1) for w in s[:max_seq_len]] for s in sentences]
  X = pad_sequences(maxlen = max_seq_len, sequences = X, truncating= 'post', padding ='post', value=0 )
  return np.array(X)

### PoS-Tags ###

def getPosInput(sentences, max_seq_len, pos2idx):
  X_pos = [[pos2idx.get(w,1) for w in s] for s in sentences]
  X_pos = pad_sequences(maxlen = max_seq_len, sequences = X_pos, truncating= 'post', padding ='post', value=0 )
  return np.array(X_pos)

def getGTagInput(sentences, max_seq_len, grammtags2idx):
  X_gramm_tags = [[grammtags2idx.get(w,1) for w in s] for s in sentences]
  X_gramm_tags = pad_sequences(maxlen = max_seq_len, sequences = X_gramm_tags, truncating= 'post', padding ='post', value=0 )
  return np.array(X_gramm_tags)

from keras.utils.np_utils import to_categorical

### NER TAG ###
def getY(tags, max_seq_len, tag2idx):
  y = [[tag2idx[t] for t in sent_tags] for sent_tags in tags]
  y = pad_sequences(maxlen = max_seq_len, sequences = y, padding = "post", truncating = "post", value = tag2idx["#ENDPAD"])
  
  y = [to_categorical(i, num_classes = len(tag2idx)) for i in y]
  return np.array(y)
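
The character and syllable inputs are fixed-size grids of ids, one row per token. As a minimal sketch of that layout for a single sentence, the hypothetical helper `pad_char_ids` below produces the same kind of grid as `getCharacterInput`, using plain lists and explicit length checks instead of exception handling; the tiny `demo_char2idx` vocabulary is made up for the example.

```python
def pad_char_ids(tokens, max_seq_len, max_chars_len, char2idx):
    """Encode one tokenised sentence as a max_seq_len x max_chars_len grid of ids."""
    grid = []
    for i in range(max_seq_len):
        token = tokens[i] if i < len(tokens) else ""
        row = [char2idx.get(ch, 1) for ch in token[:max_chars_len]]  # 1 = UNK
        row += [0] * (max_chars_len - len(row))                      # 0 = #ENDPAD
        grid.append(row)
    return grid

demo_char2idx = {"#ENDPAD": 0, "UNK": 1, "s": 2, "i": 3}
print(pad_char_ids(["si", "no"], max_seq_len=3, max_chars_len=4, char2idx=demo_char2idx))
```

Rows beyond the end of the sentence consist only of padding ids, and characters that are missing from the vocabulary map to the `UNK` id.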

## Getting the total words

Because we are using pre-trained embeddings, our vocabulary consists of the words contained in the training, development, and test sets.

In [9]:
_total_words = set()
sst_home_train = sst_home + '/profner-data/subtask-2/brat/train'
for file in [file[:-4] for file in os.listdir(sst_home_train) if file.endswith('.txt') and not ' ' in file]:
    file_text = os.path.join(sst_home_train, file + '.txt')
    
    with open(file_text) as f:
        text = f.read()
        doc = nlp(text)
        for token in doc:
            _total_words.add(token.text)
           
sst_home_test = sst_home + '/profner-data/subtask-2/brat/valid'
for file in [file[:-4] for file in os.listdir(sst_home_test) if file.endswith('.txt') and not ' ' in file]:
    file_text = os.path.join(sst_home_test, file + '.txt')
    
    with open(file_text) as f:
        text = f.read()
        doc = nlp(text)
        for token in doc:
            _total_words.add(token.text)


sst_home_test = sst_home + '/profner-data/subtask-2/test-background-txt-files'
for file in [file[:-4] for file in os.listdir(sst_home_test) if file.endswith('.txt') and not ' ' in file]:
    file_text = os.path.join(sst_home_test, file + '.txt')
    
    with open(file_text) as f:
        text = f.read()
        doc = nlp(text)
        for token in doc:
            if not token.is_stop:
                _total_words.add(token.text)

_words = list(_total_words)

##  Generating Training Data

The following code generates the training data.

In [10]:
import numpy as np

sst_home_train = sst_home + '/profner-data/subtask-2/brat/train'

MAX_CHARS_LEN = 25
MAX_SEQ_LEN = 75
MAX_SYLLS_LEN = 10

silabizer = silabizer()

words_tr, docs, docs_tags, _ = getElements(sst_home_train, MAX_SEQ_LEN)
words_tr = list(words_tr.keys())
chars = list(set(''.join(words_tr)))
sylls = list(set(str(syll).lower() for word in words_tr for syll in silabizer(word)))

tweets_train = [tweet[0] for tweet in docs.values()]
tweets_pos = [tweet[1] for tweet in docs.values()]
tweets_gtags = [tweet[2] for tweet in docs.values()]

all_pos = set(pos for sent in tweets_pos for pos in sent)
all_gtags = set(gtag for sent in tweets_gtags for gtag in sent)

tags = [tags for tags in docs_tags.values()]

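# get2idx expects an `ents` argument and returns seven mappings; nothing in this notebook
# builds an entity vocabulary, so an empty list is passed here.
word2idx, char2idx, tag2idx, pos2idx, grammtags2idx, sylls2idx, ent2idx = get2idx(_words, chars, all_pos, all_gtags, sylls, [])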

X_char_train = getCharacterInput(tweets_train, MAX_SEQ_LEN, MAX_CHARS_LEN, char2idx)
X_words_train = getWordInput(tweets_train, MAX_SEQ_LEN, word2idx)
X_pos_train = getPosInput(tweets_pos, MAX_SEQ_LEN, pos2idx)
X_tag_train = getGTagInput(tweets_gtags, MAX_SEQ_LEN, grammtags2idx)
X_syll_train = getSyllsInput(tweets_train, MAX_SEQ_LEN, MAX_SYLLS_LEN, sylls2idx)

y_train = getY(tags, MAX_SEQ_LEN, tag2idx)

## BERT Embeddings

We load the BERT embeddings generated in the other notebook.

In [16]:
import pickle5 as pickle
import numpy as np

BERT_DIM = 1536

train_bert = {}
with open(sst_home + '/saved_data/bert_train.pickle', 'rb') as handle:
    train_bert = pickle.load(handle)

X_train_bert = []
sst_home_train = sst_home + '/profner-data/subtask-2/brat/train'

for file in [file[:-4] for file in os.listdir(sst_home_train) if file.endswith('.txt') and not ' ' in file]:
    bert_vector = train_bert[file]
    X_train_bert.append(bert_vector)
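# dtype = 'float32' is required: pad_sequences defaults to int32, which would truncate the embeddings.
X_train_bert = pad_sequences(maxlen = 75, sequences = X_train_bert, dtype = 'float32', truncating= 'post', padding ='post', value=np.zeros(BERT_DIM))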


## Generating Validation Data

The following code generates the validation data.

In [15]:
import numpy as np
MAX_CHARS_LEN = 25
MAX_SEQ_LEN = 75
MAX_SYLLS_LEN = 10

sst_home_test = sst_home + '/profner-data/subtask-2/brat/valid'

_, docs, docs_tags, docs_offset = getElements(sst_home_test, MAX_SEQ_LEN, getTags = True)

tweets_test = [tweet[0] for tweet in docs.values()]
tweets_pos = [tweet[1] for tweet in docs.values()]
tweets_gtags = [tweet[2] for tweet in docs.values()]


tags_test = [tags for tags in docs_tags.values()]

X_char_test = getCharacterInput(tweets_test, MAX_SEQ_LEN, MAX_CHARS_LEN, char2idx)
X_words_test = getWordInput(tweets_test, MAX_SEQ_LEN, word2idx)
X_pos_test = getPosInput(tweets_pos, MAX_SEQ_LEN, pos2idx)
X_tag_test = getGTagInput(tweets_gtags, MAX_SEQ_LEN, grammtags2idx)
X_syll_test = getSyllsInput(tweets_test, MAX_SEQ_LEN, MAX_SYLLS_LEN, sylls2idx)

y_test = getY(tags_test, MAX_SEQ_LEN, tag2idx)

In [None]:
test_bert = {}
with open(sst_home + '/saved_data/test_bert.pickle', 'rb') as handle:
    test_bert = pickle.load(handle)
    
def getBertInput(file_name):
    bert_vector = test_bert[file_name]
    X_test_bert = [bert_vector]
    X_test_bert = pad_sequences(maxlen = 75, sequences = X_test_bert, dtype = 'float32', truncating= 'post', padding ='post', value=np.zeros(1536))  # float32: the default int32 would truncate the embeddings
    return X_test_bert

X_test_bert = []
sst_home_valid = sst_home + '/profner-data/subtask-2/brat/valid'
for file in [file[:-4] for file in os.listdir(sst_home_valid) if file.endswith('.txt') and not ' ' in file]:
    bert_vector = test_bert[file]
    X_test_bert.append(bert_vector)
# dtype = 'float32' is required: pad_sequences defaults to int32, which would truncate the embeddings.
X_test_bert = pad_sequences(maxlen = 75, sequences = X_test_bert, dtype = 'float32', truncating= 'post', padding ='post', value=np.zeros(BERT_DIM))

## Word embeddings from Spanish Medical Corpora & Twitter

In this model we'll use two different pre-trained word embeddings:

1. The word embeddings from Spanish medical corpora can be found at https://www.aclweb.org/anthology/W19-1916/
2. The word embeddings from Spanish Twitter (COVID-19) can be found at https://zenodo.org/record/4449930#.YBbYOtaCE5k

In [11]:
import fasttext
import fasttext.util

def getEmbeddingMatrix(words2idx, emb_dim, model):
  embedding_matrix = np.zeros((len(words2idx), emb_dim))
  for word, i in words2idx.items():
    embedding_matrix[i] = model[word]

  return embedding_matrix
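
The layout of the resulting matrix can be checked without loading a fastText model. In the sketch below, `StubModel` is a hypothetical stand-in that only supports `model[word]` lookups and returns made-up vectors, and `embedding_matrix_for` repeats the logic of `getEmbeddingMatrix` so that the example is self-contained.

```python
import numpy as np

class StubModel:
    """Hypothetical stand-in for a loaded fastText model (returns toy vectors)."""
    def __init__(self, dim):
        self.dim = dim

    def __getitem__(self, word):
        # Deterministic toy vector: every component equals the word length.
        return np.full(self.dim, float(len(word)))

def embedding_matrix_for(word2idx, emb_dim, model):
    matrix = np.zeros((len(word2idx), emb_dim))
    for word, i in word2idx.items():
        matrix[i] = model[word]   # row i holds the vector of the word with id i
    return matrix

demo_word2idx = {"#ENDPAD": 0, "UNK": 1, "casa": 2}
demo_matrix = embedding_matrix_for(demo_word2idx, 4, StubModel(4))
print(demo_matrix.shape)
```

Row `i` of the matrix is the vector of the word whose id is `i`, which is what the `Embedding` layers expect when the matrix is passed as initial weights.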

In [12]:
import numpy as np
sst_home_embeddings = sst_home + '/fast-text-model/'

### SPANISH MEDICAL CORPORA ###
ft = fasttext.load_model(sst_home_embeddings + 'cantemist-resource.bin')
medical_embedding_matrix = getEmbeddingMatrix(word2idx, ft.get_dimension(), ft)
del ft

### SPANISH TWITTER COVID-19 ###
ft = fasttext.load_model(sst_home_embeddings + 'covid_19_es_twitter_cbow_cased.bin')
twitter_embedding_matrix = getEmbeddingMatrix(word2idx, ft.get_dimension(), ft)
del ft



## Saving Data

We save the generated data so that it is not necessary to generate it again next time.

In [None]:
import pickle5 as pickle

train_data = [X_char_train, X_words_train, X_pos_train, X_tag_train, X_syll_train, X_train_bert, y_train]
with open(sst_home + '/saved_data/train.pickle', 'wb') as handle:
    pickle.dump(train_data, handle, protocol=pickle.HIGHEST_PROTOCOL)


test_data = [X_char_test, X_words_test, X_pos_test, X_tag_test, X_syll_test, X_test_bert, y_test]
with open(sst_home + '/saved_data/test.pickle', 'wb') as handle:
    pickle.dump(test_data, handle, protocol=pickle.HIGHEST_PROTOCOL)


embedding_matrix = [medical_embedding_matrix, twitter_embedding_matrix, cosine_matrix, spacy_embedding_matrix]   
with open(sst_home + '/saved_data/we.pickle', 'wb') as handle:
    pickle.dump(embedding_matrix, handle, protocol=pickle.HIGHEST_PROTOCOL)

two2idx = [word2idx, char2idx, tag2idx, pos2idx, grammtags2idx, sylls2idx, ent2idx]
with open(sst_home + '/saved_data/2idx.pickle', 'wb') as handle:
    pickle.dump(two2idx, handle, protocol=pickle.HIGHEST_PROTOCOL)


## Loading Data

If the data has already been saved, we load it here.

In [10]:
import pickle5 as pickle

X_train = []
with open(sst_home + '/saved_data/train.pickle', 'rb') as handle:
    X_train = pickle.load(handle)

X_test = []
with open(sst_home + '/saved_data/test.pickle', 'rb') as handle:
    X_test = pickle.load(handle)

embedding_matrix = []
with open(sst_home + '/saved_data/we.pickle', 'rb') as handle:
    embedding_matrix = pickle.load(handle)

two2idx = []
with open(sst_home + '/saved_data/2idx.pickle', 'rb') as handle:
    two2idx = pickle.load(handle)
    


X_char_train, X_words_train, X_pos_train, X_tag_train, X_syll_train, X_train_bert, y_train = X_train
X_char_test, X_words_test, X_pos_test, X_tag_test, X_syll_test, X_test_bert, y_test = X_test
word2idx, char2idx, tag2idx, pos2idx, grammtags2idx, sylls2idx, ent2idx = two2idx
medical_embedding_matrix, twitter_embedding_matrix, _, spacy_embedding_matrix = embedding_matrix

## BERT Embeddings Entities

The BERT embeddings of the entities found in the training corpus are loaded.

In [11]:
import pickle5 as pickle
import numpy as np

bert_entities = []
with open(sst_home + '/saved_data/bert_entities.pickle', 'rb') as handle:
    bert_entities = pickle.load(handle)
    
bert_entities_ = np.array(bert_entities)

# Deep Learning Architecture

In the proposed system, the character and syllable information is first processed by a convolutional and global max pooling block and then concatenated with the rest of the input features, which together serve as input to an encoder-decoder architecture with an attention mechanism. The context vector and the decoder outputs feed a fully connected dense layer with a $\tanh$ activation function. The last layer is a conditional random field (CRF) layer, selected for its ability to take into account the dependencies between the different labels. The output of this layer is the most probable sequence of labels.

![Architecture of the proposed model for profession and occupation recognition.](./imgs/model2.png)
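
As a quick sanity check on the size of the encoder input, the widths of the concatenated per-token features can be added up. The numbers below are taken from the constants in the model definition in this notebook; the dictionary itself is only an illustrative tally, not part of the model code.

```python
# Widths of the per-token features concatenated before the encoder.
feature_widths = {
    "medical_word_embedding": 300,   # WORD_EMBEDDINGS_SIZE
    "twitter_word_embedding": 300,   # WORD_EMBEDDINGS_SIZE
    "bert_out": 1536,                # BERT_DIM
    "pos_embedding": 20,             # POS_EMBEDDING_SIZE
    "gtags_embedding": 20,           # GTAGS_EMBEDDING_SIZE
    "char_enc": 50,                  # CONV_FILTERS after global max pooling
    "syll_enc": 50,                  # CONV_FILTERS after global max pooling
    "cosine_bert": 3717,             # one similarity per stored entity embedding
}
total_width = sum(feature_widths.values())
print(total_width)
```

Under these values, well over half of the input width comes from the cosine similarity feature.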

In [None]:
!pip install git+https://www.github.com/keras-team/keras-contrib.git

### Cosine Similarity Layer

A custom Keras layer is created to calculate the cosine similarity between the BERT embedding of each word and the embeddings of the entities found in the corpus.

In [23]:
import tensorflow as tf
from keras.layers import Layer
from keras.backend import constant

class CosineSimilarity(Layer):
    def __init__(self):
        super(CosineSimilarity, self).__init__()
        self.result = None
        
    def call(self, inputs):
        entities = constant(bert_entities_)
        bert_input = inputs
        norm_entities_bert = tf.norm(entities, axis = 1)
        norm_bert_input = tf.norm(bert_input, axis =  2)

        cosine = tf.einsum('nd,bmd->bmn', entities, bert_input)
        norm = tf.einsum('bm,n->bmn',norm_bert_input, norm_entities_bert)

        self.result = tf.math.divide_no_nan(cosine, norm)
        
        return self.result
    
    def compute_output_shape(self, input_shape):
        return [(None, 75, 3717)]
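
The two `einsum` calls in the layer can be checked on small arrays with NumPy. The sketch below is an assumed NumPy equivalent, not the layer itself: `cosine_to_entities` is a hypothetical name, and `np.divide` with a `where` mask stands in for `tf.math.divide_no_nan`.

```python
import numpy as np

def cosine_to_entities(tokens, entities):
    """tokens: (batch, seq, dim); entities: (n_entities, dim) -> (batch, seq, n_entities)."""
    # Dot product of every token vector with every entity vector.
    dots = np.einsum("nd,bmd->bmn", entities, tokens)
    # Product of the two norms for every (token, entity) pair.
    norms = np.einsum("bm,n->bmn",
                      np.linalg.norm(tokens, axis=2),
                      np.linalg.norm(entities, axis=1))
    # Return 0 where the denominator is 0, e.g. at all-zero padded positions.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms != 0)

tokens = np.array([[[1.0, 0.0], [0.0, 0.0]]])   # second position is all-zero padding
entities = np.array([[1.0, 0.0], [0.0, 2.0]])
similarities = cosine_to_entities(tokens, entities)
print(similarities.shape)
```

Each token position ends up with one similarity value per entity, which is why the last dimension of the layer output equals the number of stored entity embeddings.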

In [None]:
import tensorflow as tf
from keras.models import Model
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Concatenate, Input, SpatialDropout1D
from keras.layers import Conv1D, MaxPooling1D,Flatten,GlobalMaxPooling1D, Reshape, RepeatVector,  Dot, GRU, Activation
from keras_contrib.layers import CRF
from keras.callbacks import ModelCheckpoint
from keras.backend import constant, variable

CHAR_EMBEDDINGS_SIZE = 30    # Characters Embeddings Size
SYLL_EMBEDDINGS_SIZE = 75    # Syllable Embeddings Size
WORD_EMBEDDINGS_SIZE = 300   # Word Embeddings Size
POS_EMBEDDING_SIZE = 20      # PoS Embedding Size
GTAGS_EMBEDDING_SIZE = 20    # Tag Embedding Size

MAX_CHARS_LEN = 25           # Max sequence char length
MAX_SEQ_LEN = 75             # Max sequence word length Level II
MAX_SYLLS_LEN = 10           # Max sequence sylls length

CONV_FILTERS = 50            # Convolutional Filters in Character and Syllable Convolutional Layer
LSTM_UNITS = 300             # LSTM Units in both LSTM layers
DENSE_UNITS = 200            # Number of units in Dense layer  

BERT_DIM = 1536

word_input = Input(shape=(MAX_SEQ_LEN,), name = 'word_input')
pos_input = Input(shape=(MAX_SEQ_LEN,), name = 'pos_input')
gtag_input = Input(shape=(MAX_SEQ_LEN,), name = 'tag_input')
char_input = Input(shape=(MAX_SEQ_LEN, MAX_CHARS_LEN), name = 'char_input')
sylls_input = Input(shape=(MAX_SEQ_LEN, MAX_SYLLS_LEN), name = 'sylls_input')
bert_input = Input(shape=(MAX_SEQ_LEN, BERT_DIM,), name = 'bert_input')

###################### COSINE SIMILARITY ##################################################################
cosine_bert = CosineSimilarity()(bert_input)

###################### CHARACTER SEQUENCE PROCCESSED BY CONVOLUTIONAL LAYER ################################
char_embedding = TimeDistributed(Embedding(input_dim = len(char2idx), output_dim = CHAR_EMBEDDINGS_SIZE, input_length=MAX_CHARS_LEN, name = 'char_embeddings', trainable = True))(char_input)
conv_1d = TimeDistributed(Conv1D(filters = CONV_FILTERS, kernel_size = 3,padding = "valid", activation = "relu", name="Conv1D_char"))(char_embedding)
conv_1d = TimeDistributed(Dropout(0.4))(conv_1d)
maxpool1d = TimeDistributed(GlobalMaxPooling1D(), name = 'max_pooling')(conv_1d)
char_enc = TimeDistributed(Flatten(), name = 'char_enc')(maxpool1d)

###################### SYLLABLE SEQUENCE PROCCESSED BY CONVOLUTIONAL LAYER ################################
syll_embedding = TimeDistributed(Embedding(input_dim = len(sylls2idx), output_dim = SYLL_EMBEDDINGS_SIZE, input_length=MAX_SYLLS_LEN, name = 'sylls_embeddings', trainable = True))(sylls_input)
conv_1d_syll = TimeDistributed(Conv1D(filters = CONV_FILTERS, kernel_size = 3,padding="valid", activation="relu", name="Conv1D_syll"))(syll_embedding)
conv_1d_syll = TimeDistributed(Dropout(0.4))(conv_1d_syll)
maxpool1d_syll = TimeDistributed(GlobalMaxPooling1D())(conv_1d_syll)
syll_enc = TimeDistributed(Flatten())(maxpool1d_syll)

##################### PoS + TAG EMBEDDINGS ################################################################
pos_embedding = Embedding(input_dim = len(pos2idx), output_dim = POS_EMBEDDING_SIZE, input_length = MAX_SEQ_LEN, name = 'pos_embeddings', trainable = True)(pos_input)
pos_embedding = Dropout(0.4)(pos_embedding)
gtags_embedding = Embedding(input_dim = len(grammtags2idx), output_dim = GTAGS_EMBEDDING_SIZE, input_length = MAX_SEQ_LEN, name = 'gtags_embeddings', trainable = True)(gtag_input)
gtags_embedding = Dropout(0.4)(gtags_embedding)

##################### WORD EMBEDDING LAYER ################################################################
medical_embedding_layer = Embedding(input_dim = len(word2idx), output_dim = WORD_EMBEDDINGS_SIZE, weights=[medical_embedding_matrix], trainable=True, name = 'medical_word_embeddings')
medical_word_embedding = medical_embedding_layer(word_input)

twitter_embedding_layer = Embedding(input_dim = len(word2idx), output_dim = WORD_EMBEDDINGS_SIZE, weights=[twitter_embedding_matrix], trainable=True, name = 'twitter_word_embeddings')
twitter_word_embedding = twitter_embedding_layer(word_input)

bert_out = Dense(BERT_DIM, activation='relu')(bert_input)
bert_out = Dropout(0.4)(bert_out)

################### CONCATENATE INPUT FEATURES ###########################################
x = Concatenate(axis = -1)([medical_word_embedding, twitter_word_embedding, bert_out, pos_embedding, gtags_embedding, char_enc, syll_enc, cosine_bert])

################## BiLSTM ENCODER (MAX SEQUENCE WORD LENGTH 75) ###########################################
encoder_outputs, forward_h, forward_c, backward_h, backward_c = Bidirectional(LSTM(units=LSTM_UNITS//2, return_sequences=True,recurrent_dropout=0, return_state = True, recurrent_activation = 'sigmoid'))(x)
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])
encoder_states = [state_h, state_c]

################## DECODER LSTM + ATTENTION MECHANISM ##################################################
decoder_outputs, _, _ = LSTM(units=LSTM_UNITS, recurrent_dropout=0, return_sequences = True, recurrent_activation = 'sigmoid', return_state = True)(x, initial_state = encoder_states)
attention = Dot(axes = (2,2))([decoder_outputs, encoder_outputs])
attention = Activation('softmax')(attention)
context = Dot(axes=(2,1))([attention, encoder_outputs])
x = Concatenate()([context, decoder_outputs])
################## DENSE LAYER ############################################################################
x = Dropout(0.4)(x)
x = Dense(DENSE_UNITS, activation='tanh')(x)

################## CRF LAYER ####################################################################################
crf = CRF(len(tag2idx), sparse_target = False)
y_output = crf(x)

loss = crf.loss_function

model = Model(inputs = [char_input, word_input, pos_input, gtag_input, sylls_input, bert_input], outputs = y_output)
model.compile(optimizer = "adam", loss = loss, metrics = [crf.accuracy])
model.summary()
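
The attention block in the model is built from two `Dot` layers with a softmax in between. The NumPy sketch below reproduces that computation on toy arrays, assuming that the softmax is applied over the last axis (the encoder positions); `dot_attention` and the array sizes are illustrative choices.

```python
import numpy as np

def dot_attention(decoder_outputs, encoder_outputs):
    """Both inputs: (batch, seq, units). Returns (weights, context)."""
    # Score every decoder step against every encoder step.
    scores = np.einsum("btd,bsd->bts", decoder_outputs, encoder_outputs)
    # Softmax over the encoder axis, shifted for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the encoder outputs for each decoder step.
    context = np.einsum("bts,bsd->btd", weights, encoder_outputs)
    return weights, context

dec = np.linspace(-1.0, 1.0, 80).reshape(2, 5, 8)
enc = np.linspace(1.0, -1.0, 80).reshape(2, 5, 8)
weights, context = dot_attention(dec, enc)
print(weights.shape, context.shape)
```

The context has the same shape as the decoder outputs, so the two can be concatenated along the feature axis before the dense layer.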

In [None]:
NUM_EPOCHS = 4
checkpoint = ModelCheckpoint(sst_home + '/model_weights/final_model_4.hdf5', monitor='loss', verbose=1, save_best_only=True, mode='auto', period=1)
history = model.fit([X_char_train, X_words_train, X_pos_train, X_tag_train, X_syll_train, X_train_bert], y_train,
                    batch_size = 32,
                    epochs = NUM_EPOCHS,
                    callbacks=[checkpoint])

# Generating the annotations

The next step is to generate the annotations in the format required for evaluation. First, we load the trained model.

## Loading the Model

In [14]:
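import tensorflow as tf
from keras.layers import Layer
from keras.backend import constant

class CosineSimilarity(Layer):
    """Cosine similarity between each token's BERT vector and every row of `bert_entities_`."""
    def __init__(self):
        super(CosineSimilarity, self).__init__()
        self.result = None
        
    def call(self, inputs):
        # bert_entities_ is a global (n_entities, dim) matrix that must exist before the layer is called
        entities = constant(bert_entities_)
        bert_input = inputs
        norm_entities_bert = tf.norm(entities, axis = 1)
        norm_bert_input = tf.norm(bert_input, axis =  2)

        # Dot products and norm products, both shaped (batch, tokens, n_entities)
        cosine = tf.einsum('nd,bmd->bmn', entities, bert_input)
        norm = tf.einsum('bm,n->bmn',norm_bert_input, norm_entities_bert)

        # divide_no_nan returns 0 where the norm product is 0 (e.g. zero-padded tokens)
        self.result = tf.math.divide_no_nan(cosine, norm)
        
        return self.result
    
    def compute_output_shape(self, input_shape):
        # 3717 is hard-coded; it presumably matches the number of rows in bert_entities_
        return [(None, None, 3717)]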
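
To make the two `einsum` calls in `CosineSimilarity` easier to follow, here is a small NumPy sketch of the same computation on toy data. The function name `cosine_to_entities` and the example vectors are assumptions for illustration; the real layer works on the BERT vectors and on `bert_entities_`.

```python
import numpy as np

def cosine_to_entities(bert_input, entities):
    """Cosine similarity of every token vector to every entity vector.

    bert_input: (batch, tokens, dim); entities: (n_entities, dim).
    Returns an array of shape (batch, tokens, n_entities).
    """
    dots = np.einsum('nd,bmd->bmn', entities, bert_input)
    norms = np.einsum('bm,n->bmn',
                      np.linalg.norm(bert_input, axis=2),
                      np.linalg.norm(entities, axis=1))
    # Mimic tf.math.divide_no_nan: leave 0 wherever the denominator is 0
    out = np.zeros_like(dots)
    np.divide(dots, norms, out=out, where=norms != 0)
    return out

# Three toy "entity" vectors and one tweet with two tokens (made-up values)
entities = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
tokens = np.array([[[3.0, 0.0], [0.0, 0.0]]])  # second token is zero padding

sims = cosine_to_entities(tokens, entities)
print(sims.shape)  # (1, 2, 3)
print(sims[0])
```

The zero-padded token yields a row of zeros instead of NaN, which is why the layer uses `divide_no_nan` rather than a plain division.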

In [None]:
import tensorflow as tf
from keras.models import Model
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Concatenate, Input, SpatialDropout1D
from keras.layers import Conv1D, MaxPooling1D,Flatten,GlobalMaxPooling1D, Reshape, RepeatVector,  Dot, GRU, Activation
from keras_contrib.layers import CRF
from keras.callbacks import ModelCheckpoint
from keras.backend import constant, variable

CHAR_EMBEDDINGS_SIZE = 30    # Characters Embeddings Size
SYLL_EMBEDDINGS_SIZE = 75    # Syllable Embeddings Size
WORD_EMBEDDINGS_SIZE = 300   # Word Embeddings Size
POS_EMBEDDING_SIZE = 20      # PoS Embedding Size
GTAGS_EMBEDDING_SIZE = 20    # Tag Embedding Size
SPACY_EMBEDDING_SIZE = 20

MAX_CHARS_LEN = 25           # Max sequence char length
MAX_SEQ_LEN = 75             # Max sequence word length Level II
MAX_SYLLS_LEN = 10           # Max sequence sylls length

CONV_FILTERS = 50            # Convolutional Filters in Character and Syllable Convolutional Layer
LSTM_UNITS = 300             # LSTM Units in both LSTM layers
DENSE_UNITS = 200            # Number of units in Dense layer  

BERT_DIM = 1536

word_input = Input(shape=(None,), name = 'word_input')
pos_input = Input(shape=(None,), name = 'pos_input')
gtag_input = Input(shape=(None,), name = 'tag_input')
char_input = Input(shape=(None, MAX_CHARS_LEN), name = 'char_input')
sylls_input = Input(shape=(None, MAX_SYLLS_LEN), name = 'sylls_input')
bert_input = Input(shape=(None, BERT_DIM,), name = 'bert_input')

###################### COSINE SIMILARITY ##################################################################
cosine_bert = CosineSimilarity()(bert_input)

###################### CHARACTER SEQUENCE PROCESSED BY CONVOLUTIONAL LAYER ################################
char_embedding = TimeDistributed(Embedding(input_dim = len(char2idx), output_dim = CHAR_EMBEDDINGS_SIZE, input_length=MAX_CHARS_LEN, name = 'char_embeddings', trainable = True))(char_input)
conv_1d = TimeDistributed(Conv1D(filters = CONV_FILTERS, kernel_size = 3,padding = "valid", activation = "relu", name="Conv1D_char"))(char_embedding)
conv_1d = TimeDistributed(Dropout(0.4))(conv_1d)
maxpool1d = TimeDistributed(GlobalMaxPooling1D(), name = 'max_pooling')(conv_1d)
char_enc = TimeDistributed(Flatten(), name = 'char_enc')(maxpool1d)

###################### SYLLABLE SEQUENCE PROCESSED BY CONVOLUTIONAL LAYER ################################
syll_embedding = TimeDistributed(Embedding(input_dim = len(sylls2idx), output_dim = SYLL_EMBEDDINGS_SIZE, input_length=MAX_SYLLS_LEN, name = 'sylls_embeddings', trainable = True))(sylls_input)
conv_1d_syll = TimeDistributed(Conv1D(filters = CONV_FILTERS, kernel_size = 3,padding="valid", activation="relu", name="Conv1D_syll"))(syll_embedding)
conv_1d_syll = TimeDistributed(Dropout(0.4))(conv_1d_syll)
maxpool1d_syll = TimeDistributed(GlobalMaxPooling1D())(conv_1d_syll)
syll_enc = TimeDistributed(Flatten())(maxpool1d_syll)

##################### PoS + TAG EMBEDDINGS ################################################################
pos_embedding = Embedding(input_dim = len(pos2idx), output_dim = POS_EMBEDDING_SIZE, name = 'pos_embeddings', trainable = True)(pos_input)
pos_embedding = Dropout(0.4)(pos_embedding)
gtags_embedding = Embedding(input_dim = len(grammtags2idx), output_dim = GTAGS_EMBEDDING_SIZE, name = 'gtags_embeddings', trainable = True)(gtag_input)
gtags_embedding = Dropout(0.4)(gtags_embedding)

##################### WORD EMBEDDING LAYER ################################################################
medical_embedding_layer = Embedding(input_dim = len(word2idx), output_dim = WORD_EMBEDDINGS_SIZE, weights=[medical_embedding_matrix], trainable=True, name = 'medical_word_embeddings')
medical_word_embedding = medical_embedding_layer(word_input)

twitter_embedding_layer = Embedding(input_dim = len(word2idx), output_dim = WORD_EMBEDDINGS_SIZE, weights=[twitter_embedding_matrix], trainable=True, name = 'twitter_word_embeddings')
twitter_word_embedding = twitter_embedding_layer(word_input)

bert_out = Dense(BERT_DIM, activation='relu')(bert_input)
bert_out = Dropout(0.4)(bert_out)
################### CONCATENATE INPUT FEATURES ###########################################
x = Concatenate(axis = -1)([medical_word_embedding, twitter_word_embedding, bert_out, pos_embedding, gtags_embedding, char_enc, syll_enc, cosine_bert])

################## BiLSTM ENCODER ##########################################################################
encoder_outputs, forward_h, forward_c, backward_h, backward_c = Bidirectional(LSTM(units=LSTM_UNITS//2, return_sequences=True,recurrent_dropout=0, return_state = True, recurrent_activation = 'sigmoid'))(x)
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])
encoder_states = [state_h, state_c]

################## Attention Mechanism ################################################################
decoder_outputs, _, _ = LSTM(units=LSTM_UNITS, recurrent_dropout=0, return_sequences = True, recurrent_activation = 'sigmoid', return_state = True)(x, initial_state = encoder_states)
attention = Dot(axes = (2,2))([decoder_outputs, encoder_outputs])
attention = Activation('softmax')(attention)
context = Dot(axes=(2,1))([attention, encoder_outputs])

x = Concatenate()([context, decoder_outputs])
################## DENSE LAYER ############################################################################
x = Dropout(0.4)(x)
x = Dense(DENSE_UNITS, activation='tanh')(x)

################## CRF LAYER ####################################################################################
crf = CRF(len(tag2idx), sparse_target = False)
y_output = crf(x)
loss = crf.loss_function

model = Model(inputs = [char_input, word_input, pos_input, gtag_input, sylls_input, bert_input], outputs = y_output)
model.compile(optimizer = "adam", loss = loss, metrics = [crf.accuracy])
model.load_weights(sst_home + '/model_weights/final_model_4.hdf5')
model.summary()

## Annotation Function

In [23]:
SINGLE = 'S'
BEGIN = 'B'
END = 'E'
INSIDE = 'I'
OUT = 'O'

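def annotateV2(y_pred, offset, file_name, extra = None):
    """Turn one tweet's predicted BIOES tags into tab-separated annotation rows.

    y_pred: tag strings such as 'B-<class>' or 'O'; offset: (token, start, end) per token.
    `extra` is accepted but not used by the function body.
    """
    max_index = len(y_pred)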
    if '#ENDPAD' in y_pred:
        max_index = y_pred.index('#ENDPAD')
        
    right_moves = {'B': [ 'I', 'E'],
                   'I': ['I', 'E'],
                   'E': ['B', 'S', 'O'],
                   'S': ['B', 'S', 'O'],
                   'O': ['B', 'S', 'O']}
    text = ''
    entity = ''
    start = -1
    end = -1
    
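    has_error = False
    # Each tag is validated against the next one, so the loop stops one short of
    # max_index and the final token only serves as look-ahead.
    for i in range(max_index - 1):
        # After an invalid transition, skip the remaining I/E tags of the broken span.
        if not (has_error and y_pred[i][0] in ['I','E']):
            has_error = False
            ann = y_pred[i][0]
            info = offset[i]
            next_ann = y_pred[i + 1][0]
            entity_class = y_pred[i][2:]
            if next_ann in right_moves[ann]: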
                if ann == BEGIN:
                    entity = info[0]
                    start = info[1]
                    end = info[2]
                if ann == INSIDE:
                    entity = entity + ' ' + info[0]
                    end = info[2]
                if ann == END:
                    entity = entity + ' ' + info[0]
                    end = info[2]
                    text = text + f'{file_name}\t{start}\t{end}\t{entity_class}\t{entity}\n'
                    entity = ''
                    start = -1
                    end = -1
                if ann == SINGLE:
                    text = text + f'{file_name}\t{info[1]}\t{info[2]}\t{entity_class}\t{info[0]}\n'
                    entity = ''
                    start = -1
                    end = -1
                if ann == OUT:
                    entity = ''
                    start = -1
                    end = -1
            else:
                has_error = True
                
                entity = ''
                start = -1
                end = -1
            
    return text  
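
As a toy walk-through of the output format, the sketch below decodes a hand-written BIOES sequence into the same tab-separated rows. It uses a simplified stand-in, `bioes_to_rows`, which is hypothetical and not equivalent to `annotateV2` (for instance, it does not use look-ahead validation). The sentence, the tweet id and the `PROFESION` label are illustrative assumptions.

```python
import re

def bioes_to_rows(tags, offsets, file_name):
    """Simplified stand-in for annotateV2: keep well-formed B..E and S spans."""
    rows = []
    entity, start = [], None
    for tag, (token, tok_start, tok_end) in zip(tags, offsets):
        prefix, label = tag[0], tag[2:]
        if prefix == 'S':
            rows.append((file_name, tok_start, tok_end, label, token))
            entity, start = [], None
        elif prefix == 'B':
            entity, start = [token], tok_start
        elif prefix in ('I', 'E') and start is not None:
            entity.append(token)
            if prefix == 'E':
                rows.append((file_name, start, tok_end, label, ' '.join(entity)))
                entity, start = [], None
        else:
            # 'O', or an I/E tag with no open span: drop any partial entity
            entity, start = [], None
    return rows

# Made-up example; offsets are (token, start, end) as in the notebook
text = 'Los conductores de autobús y médicos siguen'
offsets = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\S+', text)]
tags = ['O', 'B-PROFESION', 'I-PROFESION', 'E-PROFESION', 'O', 'S-PROFESION', 'O']

rows = bioes_to_rows(tags, offsets, 'tweet_1')
header = 'tweet_id\tbegin\tend\ttype\textraction\n'
tsv = header + ''.join('\t'.join(map(str, row)) + '\n' for row in rows)
print(tsv)
```

Both versions join multi-token entities with a single space, so the extraction matches the original text only when its tokens are separated by single spaces.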

In [None]:
import numpy as np

sst_home_test = sst_home + '/final-profner-data/subtask-2/test-background-txt-files'
idx2tag = {idx:tag for (tag, idx) in tag2idx.items()}

_, docs, _, docs_offset = getElements(sst_home_test, 1000, getTags = False, minit = 25000, maxi = 28000)

result_path = sst_home + '/final-result/'
text = ''

for (file_name, doc_sents) in docs.items():
  MAX_SEQ_LEN = len(docs[file_name][0])
  X_char = getCharacterInput([docs[file_name][0]], MAX_SEQ_LEN, MAX_CHARS_LEN, char2idx)
  X_words = getWordInput([docs[file_name][0]], MAX_SEQ_LEN, word2idx)
  X_pos = getPosInput([docs[file_name][1]], MAX_SEQ_LEN, pos2idx)
  X_tag = getGTagInput([docs[file_name][2]], MAX_SEQ_LEN, grammtags2idx)
  X_syll = getSyllsInput([docs[file_name][0]], MAX_SEQ_LEN, MAX_SYLLS_LEN, sylls2idx)
  X_bert = np.array(getBertInput(file_name))
  X_bert = pad_sequences(maxlen = MAX_SEQ_LEN, sequences = X_bert, truncating= 'post', padding ='post', value=np.zeros(BERT_DIM))

  y_pred = model.predict([X_char, X_words, X_pos, X_tag, X_syll, X_bert])
  print(y_pred)
  y_pred = [list(map(lambda x: idx2tag[np.argmax(x)], sent)) for sent in y_pred][0]
  offset_test = docs_offset[file_name]
  text = text + annotateV2(y_pred, offset_test, file_name, result_path)

text = 'tweet_id\tbegin\tend\ttype\textraction\n' + text
with open(result_path + 'results.tsv', 'w') as f:
    f.write(text)
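
Before submitting, it may help to run a quick sanity check on the generated file. The sketch below is one possible check, not part of the original pipeline: the helper `check_rows` and the sample rows are assumptions, and it only tests the header, the column count, the offsets and the extraction text.

```python
import csv
import io

EXPECTED_HEADER = ['tweet_id', 'begin', 'end', 'type', 'extraction']

def check_rows(tsv_text):
    """Return (line_number, problem) pairs for rows that look malformed."""
    problems = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter='\t', quoting=csv.QUOTE_NONE)
    header = next(reader)
    if header != EXPECTED_HEADER:
        problems.append((1, 'unexpected header'))
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 5:
            problems.append((line_no, 'expected 5 columns'))
        elif not (row[1].isdigit() and row[2].isdigit() and int(row[1]) < int(row[2])):
            # Also catches the -1 placeholders used for unset offsets
            problems.append((line_no, 'bad offsets'))
        elif not row[4].strip():
            problems.append((line_no, 'empty extraction'))
    return problems

# Made-up sample: the second data row has begin > end on purpose
sample = ('tweet_id\tbegin\tend\ttype\textraction\n'
          '123\t4\t26\tPROFESION\tconductores de autobús\n'
          '123\t36\t29\tPROFESION\tmédicos\n')

problems = check_rows(sample)
print(problems)
```

To check the real output, the contents of `result_path + 'results.tsv'` can be read into a string and passed to `check_rows`.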