<a href="https://colab.research.google.com/github/viniciusrpb/cic0269_natural_language_processing/blob/main/lectures/aula_named_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Application: Named Entity Recognition

Dataset catalog entry: https://www.tensorflow.org/datasets/catalog/conll2003

## CoNLL 2003

The CoNLL 2003 dataset includes 1,393 English and 909 German news articles. The English-language corpus is free of charge, while the German corpus costs $75. Building the English-language corpus requires the Reuters RCV1 corpus, which is also available at no charge: access is granted a couple of days after you submit the organisational and individual agreements.

Entities are annotated with LOC (location), ORG (organisation), PER (person) and MISC (miscellaneous). 
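
To make the annotation scheme concrete, the sketch below tags one sentence by hand with the `B-`/`I-`/`O` prefixes that this notebook combines with the four entity types further down. The sentence is a hypothetical example written for illustration; it is not taken from the corpus.

```python
# Hypothetical sentence, tagged by hand for illustration (not taken from the corpus).
tokens = ["Angela", "Merkel", "visited", "the", "United", "Nations", "in", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC", "I-LOC", "O"]

# One tag per token: "B-" opens an entity, "I-" continues it,
# and "O" marks tokens that are outside any entity.
assert len(tokens) == len(tags)

for token, tag in zip(tokens, tags):
    print(f"{token:<10}{tag}")
```

Multi-token entities such as "United Nations" are recovered by reading a `B-` tag together with the `I-` tags of the same type that follow it.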

In [None]:
!pip install -U tensorflow-datasets
!pip install keras
!pip install tensorflow
!pip install keras-crf

In [2]:
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow_datasets as tfds
from keras_crf import CRFModel
# Import Keras objects from tensorflow.keras only, so that standalone Keras
# classes are not mixed with tf.keras ones.
from tensorflow.keras.layers import (Activation, Bidirectional, Dense, Dropout,
                                     Embedding, Input, LSTM, SimpleRNN)
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

In [3]:
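def preprocessDataFrame(df):
    """Decode the byte-string tokens of a TFDS DataFrame into Python strings."""
    # tfds.as_dataframe returns every token as bytes, so decode them one by one.
    # Tokens are not re-joined and re-split: each sentence keeps exactly one
    # token per NER label.
    tokens = [[tok.decode('utf-8') for tok in sentence] for sentence in df['tokens']]

    res_df = pd.DataFrame({'tokens': tokens})
    res_df['ner'] = df['ner']
    return res_df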

In [4]:
ds_train = tfds.load('conll2003', split='train', shuffle_files=True)
ds_valid = tfds.load('conll2003', split='dev', shuffle_files=False)
ds_test = tfds.load('conll2003', split='test', shuffle_files=False)

Downloading and preparing dataset 959.94 KiB (download: 959.94 KiB, generated: 3.87 MiB, total: 4.80 MiB) to /root/tensorflow_datasets/conll2003/conll2022/1.0.0...


Dataset conll2003 downloaded and prepared to /root/tensorflow_datasets/conll2003/conll2022/1.0.0. Subsequent calls will reuse this data.


In [5]:
df_train = preprocessDataFrame(tfds.as_dataframe(ds_train))
df_valid = preprocessDataFrame(tfds.as_dataframe(ds_valid))
df_test = preprocessDataFrame(tfds.as_dataframe(ds_test))

In [6]:
df_train

| | tokens | ner |
|---|---|---|
| 0 | [", If, they, 're, saying, at, least, 20, perc... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1 | [Lauck, 's, lawyer, vowed, he, would, appeal, ... | [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 2 | [Thailand, 's, powerful, military, thinks, the... | [5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, ... |
| 3 | [A, forensic, scientist, who, examined, the, s... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ... |
| 4 | [Werder, Bremen, 3, 0, 1, 2, 4, 6, 1] | [3, 4, 0, 0, 0, 0, 0, 0, 0] |
| ... | ... | ... |
| 14037 | [", He, was, not, involved, ...] | [0, 0, 0, 0, 0, 0] |
| 14038 | [", It, goes, without, saying, that, we, 're, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ... |
| 14039 | [Bowling, :] | [0, 0] |
| 14040 | [National, League] | [7, 8] |
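
Since every token must carry exactly one label, a quick alignment check is a reasonable sanity test after preprocessing. The helper below is a minimal sketch and not part of the original notebook: `find_misaligned` is a hypothetical name, and the sample rows are copied from the output above.

```python
def find_misaligned(rows):
    """Return the positions of (tokens, tags) pairs whose lengths differ."""
    return [i for i, (tokens, tags) in enumerate(rows) if len(tokens) != len(tags)]


# Three rows copied from the DataFrame output (rows 4, 14039 and 14040).
sample = [
    (["Werder", "Bremen", "3", "0", "1", "2", "4", "6", "1"], [3, 4, 0, 0, 0, 0, 0, 0, 0]),
    (["Bowling", ":"], [0, 0]),
    (["National", "League"], [7, 8]),
]

print(find_misaligned(sample))  # → []
```

On the full data, the same check could be run as `find_misaligned(zip(df_train['tokens'], df_train['ner']))`; an empty list means every sentence is aligned.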


In [26]:
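def label2int():
    """Build the dictionary that maps integer label IDs to IOB tag names.

    Despite its name, the returned mapping goes from int to tag.
    """
    iob_labels = ["B", "I"]
    ner_labels = ["PER", "ORG", "LOC", "MISC"]
    # The entity type varies slowest: B-PER, I-PER, B-ORG, I-ORG, ...
    all_labels = [f"{iob}-{ner}" for ner in ner_labels for iob in iob_labels]
    # Entity tags take IDs 1..8; ID 0 is reserved for 'O' (outside any entity).
    dic = dict(enumerate(all_labels, start=1))
    dic[0] = 'O'
    return dic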

In [27]:
int2tag = label2int()

int2tag

{1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-MISC',
 8: 'I-MISC',
 0: 'O'}

In [29]:
# Invert int2tag so that each tag name maps back to its integer ID.
tag2int = {tag: idx for idx, tag in int2tag.items()}
print(tag2int)

{'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8, 'O': 0}
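
With the ID-to-tag mapping available, a labelled sentence can be decoded back into entity spans. The following is a minimal sketch under stated assumptions: `extract_entities` is a hypothetical helper that is not part of the notebook, and `INT2TAG` simply restates the mapping printed above so that the block runs on its own.

```python
# Same mapping as int2tag above, restated so this block is self-contained.
INT2TAG = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG",
           5: "B-LOC", 6: "I-LOC", 7: "B-MISC", 8: "I-MISC"}


def extract_entities(tokens, tag_ids, int2tag=INT2TAG):
    """Group consecutive B-/I- tags into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag_id in zip(tokens, tag_ids):
        prefix, _, ent_type = int2tag[tag_id].partition("-")
        if prefix == "I" and ent_type == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):
            # "B-" opens an entity; a stray "I-" is treated as an opening tag too.
            current_tokens, current_type = [token], ent_type
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Rows 4 and 14040 from the training DataFrame shown earlier.
print(extract_entities(["Werder", "Bremen", "3", "0", "1", "2", "4", "6", "1"],
                       [3, 4, 0, 0, 0, 0, 0, 0, 0]))
# → [('Werder Bremen', 'ORG')]
print(extract_entities(["National", "League"], [7, 8]))
# → [('National League', 'MISC')]
```

Treating a stray `I-` tag as the start of a new entity is a design choice made here for robustness; a stricter decoder could discard such tags instead.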
