# BLSTM-BR-Morphological-Classifier

This notebook builds and trains a Bidirectional Long Short-Term Memory (BLSTM) model for part-of-speech tagging in Portuguese.

A BLSTM is a recurrent neural network (RNN) that reads a sequence in both directions, left to right and right to left.

It does this with two LSTM layers, each processing the input sequence in a different direction. The outputs of the two layers are then combined to produce the final output.
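
The idea of running two recurrences in opposite directions and combining them per position can be sketched in plain Python. This is only a toy illustration: `run_rnn` is a hypothetical scalar recurrence standing in for an LSTM cell, and the weights are arbitrary. The per-position pairing mirrors concatenation, which is the default `merge_mode` of Keras `Bidirectional`.

```python
import math

def run_rnn(inputs, weight_in=0.5, weight_rec=0.5):
    """Toy recurrence h_t = tanh(w_in * x_t + w_rec * h_{t-1}); a stand-in for an LSTM."""
    state, outputs = 0.0, []
    for x in inputs:
        state = math.tanh(weight_in * x + weight_rec * state)
        outputs.append(state)
    return outputs

def bidirectional(inputs):
    """Run the recurrence in both directions and pair the states position by position."""
    forward = run_rnn(inputs)
    # Process the reversed sequence, then flip the result back to the original order
    backward = run_rnn(inputs[::-1])[::-1]
    return list(zip(forward, backward))

sequence = [1.0, 0.0, -1.0]
combined = bidirectional(sequence)
for position, (fwd, bwd) in enumerate(combined):
    print(position, round(fwd, 3), round(bwd, 3))
```

At the first position, the forward state has seen only the first input, while the backward state already summarizes the whole sequence. That is why a tagger built this way can use context on both sides of a word.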

The training corpus is Mac-Morpho v3, available at: http://nilc.icmc.usp.br/macmorpho/

In [1]:
# Importing the required libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Bidirectional, Dropout, Embedding

# List the GPUs visible to TensorFlow
tf.config.list_physical_devices('GPU')



[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]



In [2]:
# Load the training data (one sentence per line, tokens written as word_TAG)
with open('macmorpho-train.txt', 'r', encoding='utf-8') as f:
    data = f.readlines()

In [3]:
# Preprocessing the data: split each word_TAG token into a lowercased word and its tag
X, Y = [], []
for line in data:
    tokens = line.strip().split()
    # rsplit on the last underscore keeps any underscore that is part of the word
    X.append([t.rsplit('_', 1)[0].lower() for t in tokens])
    Y.append([t.rsplit('_', 1)[1] for t in tokens])

# Creating vocabulary and dictionaries
vocab = sorted({word for sent in X for word in sent})
vocab.append('<UNK>')  # add <UNK> to represent unknown words
# Tag indices start at 0, which is also the value later used to pad Y
tag_dict = {t: i for i, t in enumerate(sorted({tag for sent in Y for tag in sent}))}
reverse_tag_dict = {v: k for k, v in tag_dict.items()}
# Known words get indices 1..N; <UNK> is then moved to index 0.
# Note: 0 is also the padding value, so with mask_zero=True in the Embedding layer
# unknown words are masked in the same way as padding.
word_dict = {w: i + 1 for i, w in enumerate(vocab)}
word_dict['<UNK>'] = 0

# Converting words and tags to numbers
X = [[word_dict[word] for word in sent] for sent in X]
Y = [[tag_dict[tag] for tag in sent] for sent in Y]
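
As a small illustration of the preprocessing step, the sketch below parses one `word_TAG` line and builds a word index. It is not the notebook's scheme: it assumes separate reserved entries `<PAD>` (0) and `<UNK>` (1), so that unknown words do not share an index with padding. The helper names `parse_line` and `build_index` and the sample sentence are invented for the example.

```python
PAD, UNK = '<PAD>', '<UNK>'  # hypothetical reserved entries, not used in the notebook

def parse_line(line):
    """Split a line of word_TAG tokens into lowercased words and their tags."""
    words, tags = [], []
    for token in line.split():
        word, tag = token.rsplit('_', 1)  # split on the last underscore only
        words.append(word.lower())
        tags.append(tag)
    return words, tags

def build_index(sentences):
    """Map each word to an integer, keeping 0 for padding and 1 for unknown words."""
    index = {PAD: 0, UNK: 1}
    for word in sorted({w for sent in sentences for w in sent}):
        index[word] = len(index)
    return index

words, tags = parse_line("O_ART menino_N corre_V")  # made-up sentence
index = build_index([words])
encoded = [index.get(w, index[UNK]) for w in ["o", "gato", "corre"]]
print(words, tags, encoded)
```

With this layout, `gato` (not in the index) maps to 1 instead of 0, so a layer that masks index 0 would still see it as a real token.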

In [4]:
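# Padding the sequences (by default pad_sequences adds zeros at the start of each sequence)
X = tf.keras.preprocessing.sequence.pad_sequences(X)
Y = tf.keras.preprocessing.sequence.pad_sequences(Y)

# Creating train and validation sets
# (assumes the training file has more than 50,000 sentences; otherwise X_val is empty)
X_train, X_val = X[:50000], X[50000:]
Y_train, Y_val = Y[:50000], Y[50000:]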
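
To make the padding step concrete, here is a plain-Python sketch of what `pad_sequences` does with its default settings: it pads at the start of each sequence with 0, up to the length of the longest one. `pad_pre` is a hypothetical helper written for this example, and it returns lists where the Keras function returns a NumPy array.

```python
def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` up to the length of the longest one."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [[value] * (max_len - len(seq)) + list(seq) for seq in sequences]

padded = pad_pre([[1, 2, 3], [4]])
print(padded)
```

Because the zeros go at the start, the real tokens of a short sentence end up at the right-hand end of its row.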

In [5]:
# Defining the model architecture
model = Sequential()
model.add(Embedding(input_dim=len(vocab), output_dim=128, mask_zero=True))
model.add(Bidirectional(LSTM(units=128, return_sequences=True)))
model.add(Dropout(0.7))
model.add(Bidirectional(LSTM(units=64, return_sequences=True)))
model.add(Dropout(0.7))
model.add(Dense(units=len(tag_dict), activation='softmax'))
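
The tensor shapes that flow through this stack can be traced by hand. The sketch below only records the expected shapes. It assumes the default `merge_mode='concat'` of `Bidirectional`, which doubles the number of LSTM units in the output, and it uses a hypothetical tag count of 26 purely for illustration.

```python
def trace_shapes(batch, steps, n_tags):
    """Return the expected output shape after each layer of the model above."""
    return [
        ("input", (batch, steps)),
        ("embedding", (batch, steps, 128)),
        ("bilstm_1", (batch, steps, 2 * 128)),  # forward + backward states concatenated
        ("bilstm_2", (batch, steps, 2 * 64)),
        ("dense_softmax", (batch, steps, n_tags)),  # one tag distribution per token
    ]

shapes = trace_shapes(batch=128, steps=50, n_tags=26)  # 26 is an assumed value
for name, shape in shapes:
    print(name, shape)
```

The `Dropout` layers are left out because they do not change the shape.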


In [6]:
# Compiling the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])
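
For intuition, `sparse_categorical_crossentropy` takes integer tag indices as targets (no one-hot encoding), and for one token the loss is the negative log of the probability the model assigns to the correct tag. The sketch below computes that per-token value by hand. The function is a hypothetical re-implementation for illustration only, and the probabilities are made up.

```python
import math

def sparse_categorical_crossentropy(true_tag, probabilities):
    """Loss for one token: negative log-probability of the correct tag index."""
    return -math.log(probabilities[true_tag])

probs = [0.1, 0.7, 0.2]  # made-up softmax output over three tags
loss_correct = sparse_categorical_crossentropy(1, probs)  # true tag is the likely one
loss_wrong = sparse_categorical_crossentropy(0, probs)    # true tag got low probability
print(loss_correct, loss_wrong)
```

The loss is small when the correct tag receives high probability and grows as that probability approaches zero.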

In [7]:
# Training the model
model.fit(X_train, Y_train, validation_data=(X_val, Y_val), epochs=10, batch_size=128)

Epoch 1/10



Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fb8700fedf0>

In [8]:
# Preprocess test data
with open('macmorpho-test.txt', 'r', encoding='utf-8') as f:
    test_data = f.readlines()

test_X, test_Y = [], []
for line in test_data:
    tokens = line.strip().split()
    test_X.append([word_dict.get(t.rsplit('_', 1)[0].lower(), 0) for t in tokens])  # replace unknown words with <UNK>
    test_Y.append([tag_dict[t.rsplit('_', 1)[1]] for t in tokens])

# Padding the sequences
test_X = tf.keras.preprocessing.sequence.pad_sequences(test_X)
test_Y = tf.keras.preprocessing.sequence.pad_sequences(test_Y)

# Evaluating the model on test data
loss, accuracy = model.evaluate(test_X, test_Y, batch_size=128)
print('Test Loss:', loss)
print('Test Accuracy:', accuracy)

Test Loss: 0.2165159434080124
Test Accuracy: 0.9509977698326111
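
When sequences are padded, it helps to check token-level accuracy over real tokens only. The sketch below does that in plain Python, assuming index 0 in the input marks padding. `masked_accuracy` is a hypothetical helper and the small arrays are invented. In this notebook index 0 also stands for unknown words, so a check like this would skip those too.

```python
def masked_accuracy(gold_tags, predicted_tags, token_ids, pad_id=0):
    """Fraction of non-padding positions where the predicted tag equals the gold tag."""
    correct = total = 0
    for gold_row, pred_row, id_row in zip(gold_tags, predicted_tags, token_ids):
        for gold, pred, token_id in zip(gold_row, pred_row, id_row):
            if token_id == pad_id:
                continue  # ignore padded positions
            total += 1
            correct += int(gold == pred)
    return correct / total if total else 0.0

token_ids = [[0, 0, 5, 7], [3, 4, 9, 2]]   # first row has two padded positions
gold = [[0, 0, 2, 1], [1, 1, 3, 0]]
pred = [[4, 4, 2, 1], [1, 0, 3, 0]]
acc = masked_accuracy(gold, pred, token_ids)
print(acc)
```

Here six positions are real tokens and five of them are tagged correctly; the wrong predictions at the two padded positions do not count.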


In [10]:
# Example text to classify
input_text = "Era uma vez um rapaz chamado Ivan. Num certo dia o Ivan foi à escola ele era tão distraído, que precisava dos professores. Passado algum tempo chegou a hora do lanche, o Ivan enquanto lanchava imitava as pessoas quando tocou para o intervalo. Como estava um dia de sol o Ivan foi jogar futebol como ele era um dos melhores a jogar estava na equipe principal da escola"

# Convert the text into a sequence of word indices
# (whitespace split, so punctuation stays attached to the preceding word)
input_sequence = [word_dict.get(word.lower(), 0) for word in input_text.split()]

# Pad the sequence
padded_input_sequence = tf.keras.preprocessing.sequence.pad_sequences([input_sequence])

# Predict the tag of each word in the text
predicted_tags = model.predict(padded_input_sequence)[0]

# Convert the tag indices to their text labels
predicted_tags_text = [reverse_tag_dict[np.argmax(tag)] for tag in predicted_tags]

# Print each word with its predicted tag
for i, word in enumerate(input_text.split()):
    print(word, predicted_tags_text[i])

Era V
uma ART
vez N
um ART
rapaz N
chamado PCP
Ivan. ADV
Num PREP+ART
certo ADJ
dia N
o ART
Ivan NPROP
foi V
à PREP+ART
escola N
ele PROPESS
era V
tão ADV
distraído, ADV
que KS
precisava V
dos PREP+ART
professores. ADV
Passado N
algum PROADJ
tempo N
chegou V
a ART
hora N
do PREP+ART
lanche, ADV
o ART
Ivan NPROP
enquanto KS
lanchava ADV
imitava ADV
as ART
pessoas N
quando KS
tocou V
para PREP
o PROSUB
intervalo. ADV
Como KS
estava V
um ART
dia N
de PREP
sol N
o ART
Ivan NPROP
foi V
jogar V
futebol N
como KS
ele PROPESS
era V
um PROSUB
dos PREP+ART
melhores ADJ
a PREP
jogar V
estava V
na PREP+ART
equipe N
principal ADJ
da PREP+ART
escola N
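
In the output above, tokens such as `Ivan.` and `lanche,` still carry their punctuation, which likely means they are not found in the vocabulary and fall back to index 0. One possible remedy is to separate punctuation before the lookup. The sketch below is a minimal regex tokenizer written for this example. It is an assumption, not the notebook's method, and Mac-Morpho's own tokenization conventions (for instance for contractions) may differ.

```python
import re

# Words (optionally hyphenated) or single punctuation marks; an illustrative pattern only
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Era uma vez um rapaz chamado Ivan.")
print(tokens)
```

The resulting tokens could then replace `input_text.split()` both when building `input_sequence` and when printing the predictions, so that words and tags stay aligned.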
