<a href="https://colab.research.google.com/github/rahiakela/advanced-natural-language-processing-with-tensorflow-2/blob/main/3-named-entity-recognition/1_named_entity_recognition_with_BiLSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Named Entity Recognition with BiLSTM

One of the fundamental building blocks of NLU is **Named Entity Recognition (NER)**. The names of people, companies, products, and quantities can be tagged in a piece of text with NER, which is very useful in chatbot applications and many other use cases in information retrieval and extraction.

Building and training a model capable of doing NER draws on several techniques, such as **Conditional Random Fields (CRFs)** and **Bi-directional LSTMs (BiLSTMs)**.



## Named Entity Recognition

Given a sentence or a piece of text, the objective of an NER model is to locate and classify text tokens as named entities in categories such as people's names, organizations and companies, physical locations, quantities, monetary quantities, times, dates, and even protein or DNA sequences. 

NER should tag the following sentence:

```
Ashish paid Uber $80 to go to the Twitter offices in San Francisco.
```

as follows:

$$
[Ashish]_{PER} \space paid \space [Uber]_{ORG} \space [\$80]_{MONEY} \space to \space go \space to \space the \space [Twitter]_{ORG} \space offices \space in \space [San Francisco]_{LOC}.
$$

The most common tags are listed in the table below:

| Type | Example Tag | Example |
|---|---|---|
| Person | PER | Gregory went to the castle. |
| Organization | ORG | WHO just issued an epidemic advisory. |
| Location | LOC | She lives in Seattle. |
| Money | MONEY | You owe me twenty dollars. |
| Percentage | PERCENT | Stocks have risen 10% today. |
| Date | DATE | Let's meet on Wednesday. |
| Time | TIME | Is it 5 pm already? |

There are different data sets and tagging schemes that can be used to train NER models. Different data sets will have different subsets of the tags.

There are a few different ways to build an NER model. If the sentence is considered a sequence, then this task can be modeled as a word-by-word labeling task.
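
To make the word-by-word view concrete, here is a minimal sketch that pairs each token of the example sentence with one label. The whitespace-style tokenization and the `O` label for tokens outside any entity are assumptions made for illustration; real data sets define their own tokenization and tag sets.

```python
# Illustrative only: the token split and the "O" (outside) label are assumptions.
tokens = ["Ashish", "paid", "Uber", "$80", "to", "go", "to", "the",
          "Twitter", "offices", "in", "San", "Francisco", "."]
labels = ["PER", "O", "ORG", "MONEY", "O", "O", "O", "O",
          "ORG", "O", "O", "LOC", "LOC", "O"]

# A sequence labeling model predicts exactly one label per token.
assert len(tokens) == len(labels)


def entity_tokens(tokens, labels, outside="O"):
    """Return (token, label) pairs for tokens that are part of an entity."""
    return [(tok, lab) for tok, lab in zip(tokens, labels) if lab != outside]


tagged = entity_tokens(tokens, labels)
print(tagged)
```

Note that a multi-word entity such as "San Francisco" is spread over two tokens, each carrying the same label.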

Hence, models similar to those used for Part of Speech (POS) tagging are applicable. Features can be added to a model to improve labeling. The POS of a word and of its neighboring words are the most straightforward features to add. Word shape features that model capitalization (uppercase versus lowercase letters) can add a lot of information, principally because many entity types involve proper nouns, such as the names of people and organizations.
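
As a rough illustration of a word shape feature, the sketch below maps every character to a coarse class. The function name `word_shape` and the `X`/`x`/`d` encoding are assumptions chosen for this example, not something prescribed by a particular library.

```python
def word_shape(word):
    """Map a word to a coarse shape: X = uppercase, x = lowercase, d = digit.

    Any other character (punctuation, currency symbols) is kept as-is.
    """
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


for w in ["Seattle", "WHO", "offices", "$80"]:
    print(w, "->", word_shape(w))
```

A shape that starts with `X` followed by `x` characters hints at a proper noun, while an all-`X` shape hints at an acronym such as an organization name.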

Another vital feature is whether a word appears in a gazetteer. A gazetteer is like a database of important geographical entities.
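
A gazetteer lookup could be sketched as follows. The tiny `GAZETTEER` set and the helper `gazetteer_flags` are hypothetical and exist only for this illustration; a real gazetteer would be loaded from an external resource and would be far larger.

```python
# Hypothetical, tiny gazetteer of lowercased place names (illustration only).
GAZETTEER = {"seattle", "san francisco", "london"}


def gazetteer_flags(tokens, gazetteer, max_len=2):
    """Flag each token that is part of a phrase found in the gazetteer.

    Phrases of 1..max_len consecutive tokens are joined with spaces and
    lowercased before the lookup, so multi-word names can match.
    """
    flags = [False] * len(tokens)
    for n in range(1, max_len + 1):
        for start in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[start:start + n]).lower()
            if phrase in gazetteer:
                for k in range(start, start + n):
                    flags[k] = True
    return flags


tokens = ["offices", "in", "San", "Francisco", "."]
flags = gazetteer_flags(tokens, GAZETTEER)
print(list(zip(tokens, flags)))
```

The resulting boolean per token can be fed to a model alongside other features such as the word shape.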

## Setup

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_addons as tfa
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras import Model, Input, Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense, Dropout
from tensorflow.keras.layers import Layer
from tensorflow.keras import backend as K

import numpy as np
import pandas as pd

import os
import csv
import collections
import glob
import matplotlib.pyplot as plt

tf.__version__

In [None]:
tfds.__version__

In [None]:
######## GPU CONFIGS FOR RTX 2070 ###############
## Please ignore if not training on GPU       ##
## this is important for running CuDNN on GPU ##

tf.keras.backend.clear_session() #- for easy reset of notebook state

# Check whether TensorFlow can see a GPU
gpus = tf.config.list_physical_devices('GPU')
# Only if you want to see how commands are executed, uncomment below
# tf.debugging.set_log_device_placement(True)
if gpus:
  # Restrict TensorFlow to only use the first GPU
  try:
    tf.config.experimental.set_memory_growth(gpus[0], True)
    tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
  except RuntimeError as e:
    # Visible devices must be set before GPUs have been initialized
    print(e)
###############################################

In [None]:
%%shell

wget https://gmb.let.rug.nl/releases/gmb-2.2.0.zip

# unzip -o gmb-2.2.0.zip  <= use -o to expand and overwrite without prompting
# (no leading "!" here: this cell already runs in a shell via %%shell)
unzip gmb-2.2.0.zip

## Loading the data

In [None]:
data_root = './gmb-2.2.0/data/'

fnames = []
for root, dirs, files in os.walk(data_root):
    for filename in files:
        if filename.endswith(".tags"):
            fnames.append(os.path.join(root, filename))

In [None]:
fnames[:2]

In [None]:
!mkdir ner