# Transformer Network Application: Named-Entity Recognition

When faced with a large amount of unstructured text data, named-entity recognition (NER) can help you detect and classify important information in your dataset. For instance, in the running example "Jane visits Africa in September", NER would detect "Jane", "Africa", and "September" as named entities and classify them as person, location, and time.

<font color='blue'>Approach:  
* Use tokenizers and pre-trained models from the Hugging Face library.
* Fine-tune a pre-trained transformer model for named-entity recognition.

In [1]:
import json
import logging  # used by convert_dataturks_to_spacy below
import re

import pandas as pd
import tensorflow as tf
from transformers import DistilBertTokenizerFast
from tensorflow.keras.preprocessing.sequence import pad_sequences


## Dataset of resumes

In [2]:
df = pd.read_json("data/nlp_ner/ner.json", lines=True)
df = df.drop(['extras'], axis=1)
df['content'] = df['content'].str.replace("\n", " ")

In [3]:
df.shape

(220, 2)

In [4]:
df.head()

| | content | annotation |
|---|---|---|
| 0 | Abhishek Jha Application Development Associate... | [{'label': ['Skills'], 'points': [{'start': 12... |
| 1 | Afreen Jamadar Active member of IIIT Committee... | [{'label': ['Email Address'], 'points': [{'sta... |
| 2 | Akhil Yadav Polemaina Hyderabad, Telangana - E... | [{'label': ['Skills'], 'points': [{'start': 37... |
| 3 | Alok Khandai Operational Analyst (SQL DBA) Eng... | [{'label': ['Skills'], 'points': [{'start': 80... |
| 4 | Ananya Chavan lecturer - oracle tutorials Mum... | [{'label': ['Degree'], 'points': [{'start': 20... |


In [5]:
df['content'][0][:300]

"Abhishek Jha Application Development Associate - Accenture  Bengaluru, Karnataka - Email me on Indeed: indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a  • To work for an organization which provides me the opportunity to improve my skills and knowledge for my individual and company's growth in best possibl"

In [6]:
df['annotation'][0][:5]

[{'label': ['Skills'],
  'points': [{'start': 1295,
    'end': 1621,
    'text': '\n• Programming language: C, C++, Java\n• Oracle PeopleSoft\n• Internet Of Things\n• Machine Learning\n• Database Management System\n• Computer Networks\n• Operating System worked on: Linux, Windows, Mac\n\nNon - Technical Skills\n\n• Honest and Hard-Working\n• Tolerant and Flexible to Different Situations\n• Polite and Calm\n• Team-Player'}]},
 {'label': ['Skills'],
  'points': [{'start': 993,
    'end': 1153,
    'text': 'C (Less than 1 year), Database (Less than 1 year), Database Management (Less than 1 year),\nDatabase Management System (Less than 1 year), Java (Less than 1 year)'}]},
 {'label': ['College Name'],
  'points': [{'start': 939, 'end': 956, 'text': 'Kendriya Vidyalaya'}]},
 {'label': ['College Name'],
  'points': [{'start': 883, 'end': 904, 'text': 'Woodbine modern school'}]},
 {'label': ['Graduation Year'],
  'points': [{'start': 856, 'end': 860, 'text': '2017\n'}]}]

In [7]:
df['annotation'][0][1]['label']

['Skills']

## Pre-processing: create an 'entities' column, an ordered list of entity spans built from the 'annotation' column

In [8]:
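def mergeIntervals(intervals):
    """Merge overlapping (start, end, label) spans into a sorted, non-overlapping list."""
    sorted_by_lower_bound = sorted(intervals, key=lambda tup: tup[0])
    merged = []

    for higher in sorted_by_lower_bound:
        if not merged:
            merged.append(higher)
        else:
            lower = merged[-1]
            # Overlap test: the new span starts before the previous one ends.
            if higher[0] <= lower[1]:
                # Compare labels by value; `is` only tests object identity.
                if lower[2] == higher[2]:
                    upper_bound = max(lower[1], higher[1])
                    merged[-1] = (lower[0], upper_bound, lower[2])
                else:
                    if lower[1] > higher[1]:
                        # The new span is nested inside the previous one: keep the previous span.
                        merged[-1] = lower
                    else:
                        # Otherwise extend the span and take the later label.
                        merged[-1] = (lower[0], higher[1], higher[2])
            else:
                merged.append(higher)
    return merged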
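
To make the merge rules concrete, the sketch below restates them in compact form and runs them on made-up spans. The function name `merge_labeled_spans` and the toy labels `A` to `E` are illustrative assumptions and do not appear in the dataset.

```python
def merge_labeled_spans(spans):
    """Compact restatement of the merge rules above (illustrative only)."""
    merged = []
    for start, end, label in sorted(spans, key=lambda s: s[0]):
        if not merged or start > merged[-1][1]:
            # No overlap with the previous span: keep it as a new entry.
            merged.append((start, end, label))
            continue
        prev_start, prev_end, prev_label = merged[-1]
        if label == prev_label:
            # Same label: extend the previous span to cover both.
            merged[-1] = (prev_start, max(prev_end, end), prev_label)
        elif prev_end <= end:
            # Different label and the new span reaches at least as far:
            # stretch the span and take the later label.
            merged[-1] = (prev_start, end, label)
        # Otherwise the new span is nested in the previous one and is dropped.
    return merged


toy_spans = [
    (0, 5, "A"), (3, 8, "A"),      # same label, overlapping
    (10, 12, "B"), (11, 15, "C"),  # different labels, partial overlap
    (20, 30, "D"), (22, 25, "E"),  # different labels, nested
]
result = merge_labeled_spans(toy_spans)
print(result)
```

Note that when two overlapping spans carry different labels, one label is discarded, so the merged list keeps a single label per character range.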

In [9]:
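def get_entities(df):
    """Build a merged, sorted list of (start, end, label) spans for each resume."""
    entities = []

    for i in range(len(df)):
        entity = []

        for annot in df['annotation'][i]:
            try:
                ent = annot['label'][0]
                start = annot['points'][0]['start']
                # The annotated end offset is inclusive; add 1 so the span works with slicing.
                end = annot['points'][0]['end'] + 1
                entity.append((start, end, ent))
            except (KeyError, IndexError, TypeError):
                # Skip annotations with a missing label or missing offsets.
                continue

        entity = mergeIntervals(entity)
        entities.append(entity)

    return entities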

In [10]:
df['entities'] = get_entities(df)
df.head()

| | content | annotation | entities |
|---|---|---|---|
| 0 | Abhishek Jha Application Development Associate... | [{'label': ['Skills'], 'points': [{'start': 12... | [(0, 12, Name), (13, 46, Designation), (49, 58... |
| 1 | Afreen Jamadar Active member of IIIT Committee... | [{'label': ['Email Address'], 'points': [{'sta... | [(0, 14, Name), (62, 68, Location), (104, 148,... |
| 2 | Akhil Yadav Polemaina Hyderabad, Telangana - E... | [{'label': ['Skills'], 'points': [{'start': 37... | [(0, 21, Name), (22, 31, Location), (65, 117, ... |
| 3 | Alok Khandai Operational Analyst (SQL DBA) Eng... | [{'label': ['Skills'], 'points': [{'start': 80... | [(0, 12, Name), (13, 51, Designation), (54, 60... |
| 4 | Ananya Chavan lecturer - oracle tutorials Mum... | [{'label': ['Degree'], 'points': [{'start': 20... | [(0, 13, Name), (14, 22, Designation), (24, 41... |


In [11]:
df['entities'][0]

[(0, 12, 'Name'),
 (13, 46, 'Designation'),
 (49, 58, 'Companies worked at'),
 (60, 69, 'Location'),
 (95, 146, 'Email Address'),
 (372, 405, 'Designation'),
 (407, 416, 'Companies worked at'),
 (727, 770, 'Designation'),
 (771, 814, 'College Name'),
 (856, 861, 'Graduation Year'),
 (883, 905, 'College Name'),
 (939, 957, 'College Name'),
 (993, 1154, 'Skills'),
 (1295, 1622, 'Skills')]
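
Because the end offsets are now exclusive, ordinary Python slicing should recover each entity's text. The sketch below checks this on the opening characters of the first resume and its first four spans, both copied from the outputs above; the variable names are illustrative.

```python
# Opening characters of the first resume, copied from the output of df['content'][0].
text = "Abhishek Jha Application Development Associate - Accenture  Bengaluru, Karnataka"

# The first four (start, end, label) tuples from df['entities'][0].
spans = [
    (0, 12, "Name"),
    (13, 46, "Designation"),
    (49, 58, "Companies worked at"),
    (60, 69, "Location"),
]

# With end-exclusive offsets, text[start:end] is exactly the entity text.
recovered = {label: text[start:end] for start, end, label in spans}
for label, value in recovered.items():
    print(f"{label:>20}: {value!r}")
```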

## Sentences `->` NER English Tags `->` NER Integer Tags

In [12]:
def convert_dataturks_to_spacy(dataturks_JSON_FilePath):
    try:
        training_data = []
        lines=[]
        with open(dataturks_JSON_FilePath, 'r', encoding='utf8') as f:
            lines = f.readlines()

        for line in lines:
            data = json.loads(line)
            text = data['content'].replace("\n", " ")
            entities = []
            data_annotations = data['annotation']
            if data_annotations is not None:
                for annotation in data_annotations:
                    #only a single point in text annotation.
                    point = annotation['points'][0]
                    labels = annotation['label']
                    # handle both list of labels or a single label.
                    if not isinstance(labels, list):
                        labels = [labels]

                    for label in labels:
                        point_start = point['start']
                        point_end = point['end']
                        point_text = point['text']
                        
                        lstrip_diff = len(point_text) - len(point_text.lstrip())
                        rstrip_diff = len(point_text) - len(point_text.rstrip())
                        if lstrip_diff != 0:
                            point_start = point_start + lstrip_diff
                        if rstrip_diff != 0:
                            point_end = point_end - rstrip_diff
                        entities.append((point_start, point_end + 1 , label))
            training_data.append((text, {"entities" : entities}))
        return training_data
    except Exception as e:
        logging.exception("Unable to process " + dataturks_JSON_FilePath + "\n" + "error = " + str(e))
        return None

def trim_entity_spans(data: list) -> list:
    """Removes leading and trailing white spaces from entity spans.

    Args:
        data (list): The data to be cleaned in spaCy JSON format.

    Returns:
        list: The cleaned data.
    """
    invalid_span_tokens = re.compile(r'\s')

    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])
    return cleaned_data  
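
The trimming step can be illustrated on a single span. The helper `trim_span` and the sample text below are hypothetical; unlike `trim_entity_spans`, this variant stops at the span's own boundaries instead of the ends of the text, which is an assumption made to keep the sketch short.

```python
import re

WHITESPACE = re.compile(r"\s")


def trim_span(text, start, end):
    """Shrink the end-exclusive span [start, end) until neither edge is whitespace."""
    while start < end and WHITESPACE.match(text[start]):
        start += 1
    while end > start and WHITESPACE.match(text[end - 1]):
        end -= 1
    return start, end


sample = "Name:  Jane Doe  , Nairobi"
raw_start, raw_end = 5, 17  # span with two spaces of padding on each side
start, end = trim_span(sample, raw_start, raw_end)
print(repr(sample[raw_start:raw_end]), "->", repr(sample[start:end]))
```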

In [13]:
data = trim_entity_spans(convert_dataturks_to_spacy("data/nlp_ner/ner.json"))

In [14]:
# from tqdm.notebook import tqdm - changed from tqdm to list in for loop

def clean_dataset(data):
    """Assign one tag per whitespace-separated word of each resume."""
    rows = []
    for doc_idx in range(len(data)):
        strData = data[doc_idx][0]
        entities = data[doc_idx][1]['entities']
        emptyList = ["Empty"] * len(strData.split())
        lenOfString = len(strData)
        lastIndexOfSpace = strData.rfind(' ')
        start = 0
        numberOfWords = 0
        for pos in range(lenOfString):
            # A space followed by a non-space character marks the end of a word.
            if strData[pos] == " " and pos + 1 < lenOfString and strData[pos + 1] != " ":
                # Search the entities from last to first; the first span covering the word wins.
                for entList in reversed(entities):
                    if start >= int(entList[0]) and pos <= int(entList[1]):
                        emptyList[numberOfWords] = entList[2]
                        break
                start = pos + 1
                numberOfWords += 1
            # Tag the final word, which has no trailing space.
            if pos == lastIndexOfSpace:
                for entList in reversed(entities):
                    if lastIndexOfSpace >= int(entList[0]) and lenOfString <= int(entList[1]):
                        emptyList[numberOfWords] = entList[2]
                        numberOfWords += 1
        rows.append(emptyList)
    # DataFrame.append was removed in pandas 2.0, so build the frame in one step.
    return pd.DataFrame({"setences_cleaned": rows})

In [15]:
cleanedDF = clean_dataset(data)

In [16]:
cleanedDF.head(2)

| | setences_cleaned |
|---|---|
| 0 | [Name, Name, Designation, Designation, Designa... |
| 1 | [Name, Name, Empty, Empty, Empty, Empty, Empty... |


In [17]:
cleanedDF['setences_cleaned'][0][:10]

['Name',
 'Name',
 'Designation',
 'Designation',
 'Designation',
 'Empty',
 'Empty',
 'Empty',
 'Empty',
 'Empty']
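
For comparison, word-level tagging can also be sketched with a regular expression that finds each whitespace-separated word and checks whether it lies entirely inside an entity span. This is a different, simpler technique than `clean_dataset`, not a reimplementation of it, and its output can differ from the notebook's (for example around repeated spaces). The function name `tag_words` and the shortened text are assumptions.

```python
import re


def tag_words(text, entities, default="Empty"):
    """Give each whitespace-separated word the label of the first span that contains it."""
    tags = []
    for match in re.finditer(r"\S+", text):
        label = default
        for start, end, ent_label in entities:
            # Spans are end-exclusive, so a word fits when it starts at or after
            # the span start and ends at or before the span end.
            if start <= match.start() and match.end() <= end:
                label = ent_label
                break
        tags.append(label)
    return tags


text = "Abhishek Jha Application Development Associate - Accenture"
entities = [(0, 12, "Name"), (13, 46, "Designation"), (49, 58, "Companies worked at")]
word_tags = tag_words(text, entities)
print(list(zip(text.split(), word_tags)))
```

Because `re.finditer(r"\S+", ...)` and `str.split()` both split on runs of whitespace, the returned list has one tag per word of `text.split()`.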

In [18]:
df['content'][0][:300]

"Abhishek Jha Application Development Associate - Accenture  Bengaluru, Karnataka - Email me on Indeed: indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a  • To work for an organization which provides me the opportunity to improve my skills and knowledge for my individual and company's growth in best possibl"

### Padding and Generating Tags

In [19]:
unique_tags = set(cleanedDF['setences_cleaned'].explode().unique())
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
id2tag = {id: tag for tag, id in tag2id.items()}

In [20]:
unique_tags

{'College Name',
 'Companies worked at',
 'Degree',
 'Designation',
 'Email Address',
 'Empty',
 'Graduation Year',
 'Location',
 'Name',
 'Skills',
 'UNKNOWN',
 'Years of Experience'}

In [21]:
tag2id

{'UNKNOWN': 0,
 'Companies worked at': 1,
 'Location': 2,
 'Name': 3,
 'Years of Experience': 4,
 'Email Address': 5,
 'Skills': 6,
 'Degree': 7,
 'Empty': 8,
 'Designation': 9,
 'College Name': 10,
 'Graduation Year': 11}
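
One caveat: the iteration order of a Python `set` of strings can change between interpreter sessions, so the ids shown above may differ on a rerun. If the mapping has to stay stable (for example to decode predictions from a saved model), sorting the tags first is one option. The sketch below uses toy data and the hypothetical names `tag2id_sorted` and `id2tag_sorted`.

```python
# Toy tag sequence standing in for the exploded 'setences_cleaned' column.
tags_seen = ["Name", "Empty", "Skills", "Name", "Degree", "Empty"]

# Sorting the unique tags makes the tag-to-id mapping identical on every run.
unique_sorted = sorted(set(tags_seen))
tag2id_sorted = {tag: idx for idx, tag in enumerate(unique_sorted)}
id2tag_sorted = {idx: tag for tag, idx in tag2id_sorted.items()}

print(tag2id_sorted)
```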

In [22]:
cleanedDF.shape

(220, 1)

In [23]:
MAX_LEN = 512
labels = cleanedDF['setences_cleaned'].values.tolist()

tags = pad_sequences([[tag2id.get(l) for l in lab] for lab in labels],
                     maxlen=MAX_LEN, value=tag2id["Empty"], padding="post",truncating="post",
                     dtype="long")  #value - padding value

In [24]:
tags

array([[3, 3, 9, ..., 8, 8, 8],
       [3, 3, 8, ..., 8, 8, 8],
       [3, 3, 3, ..., 8, 6, 8],
       ...,
       [3, 3, 9, ..., 8, 8, 8],
       [3, 3, 9, ..., 8, 8, 8],
       [3, 3, 9, ..., 8, 8, 8]])

In [25]:
tags.shape

(220, 512)
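
The effect of `padding="post"` and `truncating="post"` can be sketched on plain Python lists: short sequences get the pad value appended, and long ones lose their tail. The helper `pad_or_truncate` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
def pad_or_truncate(seq, max_len, pad_value):
    """Mimic post-padding and post-truncation on a plain list (illustrative helper)."""
    clipped = seq[:max_len]  # truncating="post": drop items from the end
    # padding="post": append the pad value until the list has max_len items
    return clipped + [pad_value] * (max_len - len(clipped))


short_row = pad_or_truncate([3, 3, 9], max_len=5, pad_value=8)
long_row = pad_or_truncate([3, 3, 9, 9, 9, 8, 6], max_len=5, pad_value=8)
print(short_row)
print(long_row)
```

In the notebook the pad value is `tag2id["Empty"]`, so padded positions are indistinguishable from words tagged `Empty` until the alignment step assigns -100 to special tokens.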

## Tokenize and Align Labels with 🤗 Library

Before feeding the texts to a Transformer model, we need to tokenize the input using a [🤗 Transformer tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html). The tokenizer must match the Transformer model type we are using: the [DistilBERT fast tokenizer](https://huggingface.co/transformers/model_doc/distilbert.html) in this case. As called below, it truncates or pads every sequence to 512 tokens, padding with zeros. This matches the maximum length we used when creating the tags.

In [26]:
from transformers import DistilBertTokenizerFast 
tokenizer = DistilBertTokenizerFast.from_pretrained('pretrainedmodel/nlp_transformer/tokenizer/')

**Transformer models are often trained with tokenizers that split words into subwords.** For instance, the word 'Africa' might get split into multiple subtokens. This can create a misalignment between the list of tags for the dataset and the list of tokens generated by the tokenizer, **since the tokenizer can split one word into several** or add special tokens. Before training, we therefore align the tags with the tokenizer's output using a `tokenize_and_align_labels()` function, which does the following:

* Truncates sequences that exceed the maximum size allowed by the model, using the parameter `truncation=True`.
* Aligns the tags with the tokens using the tokenizer's `word_ids` method, which returns a list that maps each subtoken to its original word in the sentence and each special token to `None`.
* Sets the labels of all special tokens (`None`) to -100 to prevent them from affecting the loss function.
* Labels the first subtoken of each word with the word's tag. The following subtokens either repeat that tag (when `label_all_tokens` is `True`, as in the cell below) or are set to -100.

In [27]:
label_all_tokens = True

def tokenize_and_align_labels(tokenizer, examples, tags):
    tokenized_inputs = tokenizer(examples, truncation=True, is_split_into_words=False, padding='max_length', max_length=512)
    labels = []
    for i, label in enumerate(tags):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so 
            # they are automatically ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
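
The alignment rule is easier to follow on a toy input where `word_ids` is written by hand instead of coming from a tokenizer. The sketch below assumes a hypothetical tokenization of "Jane visits Africa" in which "Africa" is split into two subtokens; the function `align_labels` is illustrative, and the label ids reuse the `tag2id` values shown earlier (Name = 3, Empty = 8, Location = 2).

```python
def align_labels(word_ids, word_labels, label_all_tokens=True, ignore_index=-100):
    """Expand word-level labels to token-level labels using a word_ids list."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens ([CLS], [SEP], padding) are ignored by the loss.
            aligned.append(ignore_index)
        elif word_id != previous or label_all_tokens:
            # First subtoken of a word, or every subtoken when label_all_tokens is set.
            aligned.append(word_labels[word_id])
        else:
            # Later subtokens of the same word are masked out.
            aligned.append(ignore_index)
        previous = word_id
    return aligned


# Hypothetical tokens: [CLS] jane visits af ##rica [SEP]
word_ids = [None, 0, 1, 2, 2, None]
word_labels = [3, 8, 2]  # Name, Empty, Location

all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)
print(first_only)
```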

In [28]:
# temp = tokenizer(df['content'].values.tolist(), truncation=True, is_split_into_words=False, padding='max_length', max_length=512)
# labels = []

# for i, label in enumerate(tags[:1,:]):
#     print('original labels: ', label[:10])
#     word_ids = temp.word_ids(batch_index=i)
#     print('words :', word_ids[:20])
#     previous_word_idx = None
#     label_ids = []
#     for word_idx in word_ids:
# #         print('word idx :', word_idx)
#         if word_idx is None:
#             label_ids.append(-100)
#         elif word_idx != previous_word_idx:
#             label_ids.append(label[word_idx])
#         else:
#             label_ids.append(label[word_idx] if label_all_tokens else -100)
#         previous_word_idx = word_idx

#     labels.append(label_ids)

# temp["labels"] = labels
# print('revised labels: ', temp['labels'][0][:15])

# # word_ids are like a sequential ids (0, 1, 2, 3) etc.
# # revised labels - essentially reshape original labels using idx from words

In [29]:
df['content'].shape, tags.shape

((220,), (220, 512))

In [30]:
test = tokenize_and_align_labels(tokenizer, df['content'].values.tolist(), tags)

In [31]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    test['input_ids'],
    test['labels']
))

In [32]:
len(test['input_ids']), len(test['labels'])

(220, 220)

In [33]:
len(test['input_ids'][0]), len(test['labels'][0])

(512, 512)

## Optimization

Now we can feed our data into a pretrained 🤗 model. We will fine-tune a DistilBERT model, which matches the tokenizer we used to preprocess the data.

In [34]:
len(unique_tags)

12

In [35]:
from transformers import TFDistilBertForTokenClassification

model = TFDistilBertForTokenClassification.from_pretrained('pretrainedmodel/nlp_transformer/DistilBertTokenClass/',
                                                           num_labels=len(unique_tags))

All model checkpoint layers were used when initializing TFDistilBertForTokenClassification.

All the layers of TFDistilBertForTokenClassification were initialized from the model checkpoint at pretrainedmodel/nlp_transformer/tokenizer/DistilBert/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForTokenClassification for predictions without further training.


### Compile and train the model

In [36]:
model.compute_loss

<bound method TFTokenClassificationLoss.compute_loss of <transformers.models.distilbert.modeling_tf_distilbert.TFDistilBertForTokenClassification object at 0x000001B9A688AF08>>
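
The reason for the -100 labels is that the token-classification loss skips those positions. The sketch below shows the idea in plain Python, assuming the loss is averaged over the remaining tokens; it is not the library's implementation, and `masked_cross_entropy` is a hypothetical name.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over tokens whose label is not ignore_index."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked token: contributes nothing to the loss
        # Numerically stable log-sum-exp for the softmax normaliser.
        max_logit = max(token_logits)
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in token_logits))
        total += log_norm - token_logits[label]
        count += 1
    return total / count if count else 0.0


# Three tokens, three classes; the first token is a special token.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [-100, 1, 1]
loss = masked_cross_entropy(logits, labels)

# Changing the logits of the masked token leaves the loss unchanged.
loss_perturbed = masked_cross_entropy([[9.0, -4.0, 1.0]] + logits[1:], labels)
print(loss, loss_perturbed)
```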

In [37]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy']) # can also use any keras loss fn
# The dataset is already batched, so batch_size is not passed to fit().
model.fit(train_dataset.shuffle(1000).batch(16),
          epochs=5)

Epoch 1/5
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method

Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1b9afdb0488>

In [39]:
# Had to change version of some packages to get the model training to work
# scipy==1.4.1
# gast==0.3.3
# tensorflow-estimator==2.3.0
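
To see whether the current environment matches the pins listed above, the installed versions can be compared with the standard-library `importlib.metadata` module (Python 3.8 or later). The dictionary `PINNED` and the helper `installed_version` are illustrative names; the version numbers are the ones from the comment above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported above as needed to get training to work.
PINNED = {"scipy": "1.4.1", "gast": "0.3.3", "tensorflow-estimator": "2.3.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for package, wanted in PINNED.items():
    found = installed_version(package)
    status = "ok" if found == wanted else "differs"
    print(f"{package}: wanted {wanted}, found {found} ({status})")
```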