<a href="https://colab.research.google.com/github/viniciusrpb/cic0269_natural_language_processing/blob/main/lectures/cap12_2_lang_models_rnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 12 - Recurrent Neural Networks

## 12.2. Language Models Based on Recurrent Neural Networks

This notebook develops language models based on recurrent neural networks. We cover two types of model:

*   word-to-word: treats each word of the *corpus* as a document. Training uses (word, next word) pairs as the text and the label, respectively;
*   sentence-to-word: uses the preceding words of a sentence (and their dependency relations) in the *corpus* to predict the next word.
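
As a rough illustration of the two setups, the sketch below builds both kinds of training pairs from a two-sentence toy corpus in plain Python. The helper names `word_to_word_pairs` and `sentence_to_word_pairs` are made up for this example and, as an assumption, the pairs never cross a sentence boundary.

```python
def word_to_word_pairs(sentences):
    """Return (current word, next word) pairs, sentence by sentence."""
    pairs = []
    for sentence in sentences:
        words = sentence.split()
        pairs.extend(zip(words, words[1:]))
    return pairs


def sentence_to_word_pairs(sentences):
    """Return (prefix of the sentence, next word) pairs."""
    pairs = []
    for sentence in sentences:
        words = sentence.split()
        for i in range(1, len(words)):
            pairs.append((words[:i], words[i]))
    return pairs


# Toy corpus (hypothetical, mirrors the kind of data used in this notebook)
toy_sentences = ["quero jogar futebol hoje", "hoje não tem futebol"]
w2w = word_to_word_pairs(toy_sentences)
s2w = sentence_to_word_pairs(toy_sentences)
print(w2w[:3])
print(s2w[:3])
```

The word-to-word pairs carry a single word of context, whereas each sentence-to-word pair carries the whole prefix seen so far.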



In [1]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout, Bidirectional
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

In [2]:
corpus = "Quero jogar futebol hoje\n Hoje não tem futebol\n"
corpus = corpus.lower()

### Word-to-Word Language Model

In [3]:
# Count term frequencies and keep the token stream in corpus order
termos = {}
tokens = []
for sentence in corpus.split('\n'):
    for word in sentence.split():
        tokens.append(word)
        termos[word] = termos.get(word, 0) + 1
vocab_size = len(termos)  # number of distinct words

In [4]:
vocab_size

6

In [5]:
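# num_words only affects texts_to_sequences, which keeps indices strictly below num_words
tokenizer = Tokenizer(num_words = vocab_size)
# Iterating over the dict yields its keys, so each distinct word is fitted as a one-word text
tokenizer.fit_on_texts(termos)
# Leftover loop variable from the counting cell: the last token read
word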

'futebol'

In [6]:
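# Copy the word index and reserve index 0, which also serves as padding and sentence boundary
word2index = dict(tokenizer.word_index)
word2index['<OOV>'] = 0
# Inverse mapping: index -> word
index2word = {value: key for key, value in word2index.items()}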

In [7]:
index2word

{1: 'quero',
 2: 'jogar',
 3: 'futebol',
 4: 'hoje',
 5: 'não',
 6: 'tem',
 0: '<OOV>'}

In [8]:
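# Build (current word, next word) index pairs over the whole token stream.
# Index 0 marks the start (first input) and the end (last target).
# Note: sentences are concatenated, so the pair (hoje, hoje) spans the sentence boundary.
X_train = [0]
y_train = [word2index[tokens[0]]]
for i in range(len(tokens) - 1):
    X_train.append(word2index[tokens[i]])
    y_train.append(word2index[tokens[i + 1]])
X_train.append(word2index[tokens[-1]])
y_train.append(0)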

In [9]:
X_train

[0, 1, 2, 3, 4, 4, 5, 6, 3]

In [10]:
y_train

[1, 2, 3, 4, 4, 5, 6, 3, 0]

In [11]:
model = Sequential()
model.add(Embedding(input_dim=vocab_size+1,output_dim=32,input_length=1))
model.add(Bidirectional(LSTM(256,activation='relu')))
model.add(Dropout(0.5))
model.add(Dense(vocab_size+1,activation='softmax'))

In [12]:
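import numpy as np

sgd = SGD(learning_rate = 0.001)
model.compile(optimizer=sgd,loss='sparse_categorical_crossentropy',metrics=['accuracy'])
# Pass NumPy arrays: inputs of shape (n_samples, 1) to match input_length=1, integer labels for the sparse loss
model.fit(np.array(X_train).reshape(-1, 1),np.array(y_train),batch_size=16,epochs=3)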

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f9cbbed4c90>

In [13]:
import numpy as np

frase = 'futebol quero jogar'

for w in frase.split():
    idx = word2index[w]
    # predict expects a batch of shape (1, 1); take the argmax over the class probabilities
    prob = model.predict(np.array([[idx]]))
    pal = int(np.argmax(prob))
    print(f'Current word: {index2word[idx]} Next word: {index2word[pal]}')


Current word: futebol Next word: futebol
Current word: quero Next word: quero
Current word: jogar Next word: futebol
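
The predictions printed above do not yet match the corpus (for instance, `quero` is followed by `jogar` in the corpus, yet the model predicts `quero`). A count-based bigram table over the same index pairs gives a simple reference point for what a word-to-word model should learn. This is only a sketch: `bigram_table` and `most_likely_next` are hypothetical helpers, and the two index lists are copied from the `X_train` and `y_train` outputs shown earlier.

```python
from collections import Counter, defaultdict

# Index pairs copied from the X_train / y_train outputs shown earlier.
X_pairs = [0, 1, 2, 3, 4, 4, 5, 6, 3]
y_pairs = [1, 2, 3, 4, 4, 5, 6, 3, 0]


def bigram_table(xs, ys):
    """Count how often each next-word index follows each current index."""
    table = defaultdict(Counter)
    for current, nxt in zip(xs, ys):
        table[current][nxt] += 1
    return table


def most_likely_next(table, current):
    """Return a most frequent follower of `current`."""
    return table[current].most_common(1)[0][0]


table = bigram_table(X_pairs, y_pairs)
print(most_likely_next(table, 1))  # follower of index 1 ('quero')
```

A word-to-word network trained to convergence on this data should approach the same conditional frequencies.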


### Sentence-to-Word Language Model

In this type of model, words are analyzed within their sentences:

In [14]:
# For each sentence, every prefix becomes an input and the word after it the target;
# the full sentence gets target 0 (end of sentence).
novel_corpus = []
y_train = []
for sentence in corpus.split('\n'):
    novos_termos = sentence.split()
    for i in range(len(novos_termos)):
        novel_corpus.append(novos_termos[:i + 1])
        if i < len(novos_termos) - 1:
            y_train.append(word2index[novos_termos[i + 1]])
        else:
            y_train.append(0)

In [15]:
novel_corpus

[['quero'],
 ['quero', 'jogar'],
 ['quero', 'jogar', 'futebol'],
 ['quero', 'jogar', 'futebol', 'hoje'],
 ['hoje'],
 ['hoje', 'não'],
 ['hoje', 'não', 'tem'],
 ['hoje', 'não', 'tem', 'futebol']]

In [16]:
y_train

[2, 3, 4, 0, 5, 6, 3, 0]

In [17]:
max_length = max([len(sentence) for sentence in novel_corpus])

In [18]:
max_length

4

In [19]:
# Encode with word2index directly: the tokenizer was created with num_words=vocab_size,
# so texts_to_sequences would silently drop index 6 ('tem')
train_sequences = [[word2index[w] for w in sentence] for sentence in novel_corpus]

In [20]:
train_sequences

[[1], [1, 2], [1, 2, 3], [1, 2, 3, 4], [4], [4, 5], [4, 5, 6], [4, 5, 6, 3]]

In [21]:
trunc_type = 'post'
padding_type = 'pre'

train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [22]:
train_padded

array([[0, 0, 0, 1],
       [0, 0, 1, 2],
       [0, 1, 2, 3],
       [1, 2, 3, 4],
       [0, 0, 0, 4],
       [0, 0, 4, 5],
       [0, 4, 5, 6],
       [4, 5, 6, 3]], dtype=int32)
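
To make the padding step concrete, here is a small NumPy sketch of what `padding='pre'` combined with `truncating='post'` amounts to. `pad_pre_truncate_post` is a hypothetical helper written for illustration, not the Keras implementation, and the input sequences are made up.

```python
import numpy as np


def pad_pre_truncate_post(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; cut overly long ones at the end."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = list(seq)[:maxlen]                  # 'post' truncation drops the tail
        if kept:
            out[row, maxlen - len(kept):] = kept   # 'pre' padding fills the left
    return out


# Made-up sequences: two short ones and one longer than maxlen
example = pad_pre_truncate_post([[1], [1, 2], [1, 2, 3, 4, 5]], maxlen=4)
print(example)
```

Padding on the left keeps the most recent word in the last time step, which is the position the recurrent layer reads just before predicting.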

In [23]:
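# Fix the number of classes explicitly so the one-hot width always matches the softmax layer
y_train = to_categorical(y_train, num_classes=vocab_size+1)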
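
`to_categorical` turns each integer label into a one-hot row, which is the target format that `categorical_crossentropy` expects. Keeping integer labels and using `sparse_categorical_crossentropy`, as the word-to-word model does, yields the same loss value. The NumPy sketch below illustrates this equivalence; the label list is copied from the `y_train` output above, and the uniform `probs` matrix is a made-up stand-in for model predictions.

```python
import numpy as np

labels = np.array([2, 3, 4, 0, 5, 6, 3, 0])   # next-word indices
num_classes = 7                                # vocab_size + 1

# One-hot encoding: row i has a single 1 in column labels[i].
one_hot = np.eye(num_classes, dtype=np.float32)[labels]

# Hypothetical predictions: a uniform distribution over the 7 classes.
probs = np.full((len(labels), num_classes), 1.0 / num_classes)

# Cross-entropy computed from one-hot targets and from integer targets.
ce_one_hot = -np.sum(one_hot * np.log(probs), axis=1).mean()
ce_sparse = -np.log(probs[np.arange(len(labels)), labels]).mean()
print(one_hot.shape, ce_one_hot, ce_sparse)
```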

In [24]:
model2 = Sequential()
model2.add(Embedding(input_dim=vocab_size+1,output_dim=32,input_length=max_length))
model2.add(Bidirectional(LSTM(256,activation='relu')))
model2.add(Dropout(0.5))
model2.add(Dense(vocab_size+1,activation='softmax'))

In [25]:
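sgd = SGD(learning_rate = 0.001)
# Compile and train model2 (the sentence-to-word network defined above), not the earlier word-to-word model
model2.compile(optimizer=sgd,loss='categorical_crossentropy',metrics=['accuracy'])
model2.fit(train_padded,y_train,batch_size=16,epochs=3)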

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f9cbbd56050>
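
Once trained, a sentence-to-word model can be queried with a left-padded context. The sketch below shows one possible way to do that. `predict_next` is a hypothetical helper, and `FavourIndexModel` is a stand-in that only mimics the `predict` method so the snippet runs without a trained network; the toy dictionaries are likewise assumptions. With the notebook's objects you would pass `model2`, `word2index`, `index2word`, and `max_length` instead.

```python
import numpy as np


def predict_next(model, context, word2index, index2word, max_length):
    """Encode `context`, left-pad it to `max_length`, and return the argmax word."""
    # Assumption: unknown words map to 0 and only the most recent words are kept.
    ids = [word2index.get(w, 0) for w in context.lower().split()][-max_length:]
    padded = np.zeros((1, max_length), dtype=np.int32)
    if ids:
        padded[0, max_length - len(ids):] = ids
    probs = model.predict(padded)          # expected shape: (1, number of classes)
    return index2word[int(np.argmax(probs[0]))]


class FavourIndexModel:
    """Stand-in for a trained model: always favours class index 3."""
    def predict(self, batch):
        probs = np.full((len(batch), 7), 0.1)
        probs[:, 3] = 0.4
        return probs


toy_word2index = {'quero': 1, 'jogar': 2, 'futebol': 3, 'hoje': 4, 'não': 5, 'tem': 6}
toy_index2word = {idx: w for w, idx in toy_word2index.items()}
toy_index2word[0] = '<OOV>'
next_word = predict_next(FavourIndexModel(), 'quero jogar', toy_word2index, toy_index2word, 4)
print(next_word)
```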

## Language Model with the Reuters Corpus

**Homework:** develop a Sentence-to-Word language model using the vocabulary of the Reuters *corpus* (training split). Use the validation and test splits for experiments.

Note: limit ```vocab_size``` so that this notebook can run on Google Colab.

In [26]:
from keras.datasets import reuters

vocab_size = 3000

(x_train,y_train_int),(x_test2,y_test2) = reuters.load_data(num_words=vocab_size,test_split=0.3)
word2index = reuters.get_word_index()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters.npz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters_word_index.json


In [27]:
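index2word = {}

for key,value in word2index.items():
    index2word[value] = key

# Caveat: load_data shifts word indices by index_from=3 by default (0, 1 and 2 are reserved),
# so this direct lookup is misaligned with get_word_index()
print(' '.join([index2word[x] for x in x_train[0]]))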

the of of mln loss for plc said at only ended said of could 1 traders now april 0 a after said from 1985 and from foreign 000 april 0 prices its account year a but in this mln home an states earlier and rise and revs vs 000 its 16 vs 000 a but 3 of oils several and shareholders and dividend vs 000 its all 4 vs 000 1 mln agreed largely april 0 are 2 states will billion total and against 000 pct dlrs
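
The decoded text above reads like word salad. A likely cause is that `reuters.load_data` shifts every word index by `index_from` (3 by default) and reserves 0, 1, and 2 for padding, start-of-sequence, and out-of-vocabulary markers, whereas `get_word_index()` returns the unshifted ranks. A hedged sketch of an offset-aware decoder follows. `decode_sequence` is a hypothetical helper, the marker strings are arbitrary, and `toy_word_index` is a tiny made-up stand-in for the real index so the snippet runs without downloading anything.

```python
def decode_sequence(sequence, word_index, index_from=3):
    """Map shifted integer ids back to words, marking reserved ids."""
    reserved = {0: '<pad>', 1: '<start>', 2: '<oov>'}
    index_to_word = {rank + index_from: word for word, rank in word_index.items()}
    return ' '.join(
        reserved.get(i, index_to_word.get(i, '<unk>')) for i in sequence
    )


# Made-up stand-in for reuters.get_word_index(): word -> frequency rank.
toy_word_index = {'the': 1, 'of': 2, 'said': 3}
decoded = decode_sequence([1, 6, 4, 5, 2], toy_word_index)
print(decoded)
```

With the real data, the call would look like `decode_sequence(x_train[0], reuters.get_word_index())`.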


Remember to build the ``` y_train ``` and ``` X_train ``` variables appropriately for training the models.
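
One possible way to build those variables from integer-encoded documents is sketched below. `make_prefix_pairs` and `context_size` are hypothetical names, and limiting each context to the last few tokens is an assumption made here to keep the arrays small, not something the homework prescribes.

```python
import numpy as np


def make_prefix_pairs(sequences, context_size):
    """Build (left-padded context, next token) arrays from id sequences."""
    contexts, targets = [], []
    for seq in sequences:
        for i in range(1, len(seq)):
            window = seq[max(0, i - context_size):i]   # most recent tokens only
            padded = [0] * (context_size - len(window)) + list(window)
            contexts.append(padded)
            targets.append(seq[i])
    X = np.array(contexts, dtype=np.int32).reshape(-1, context_size)
    y = np.array(targets, dtype=np.int32)
    return X, y


# Made-up id sequences standing in for encoded documents
X_demo, y_demo = make_prefix_pairs([[1, 5, 7, 9], [4, 8]], context_size=3)
print(X_demo)
print(y_demo)
```

Integer targets such as `y_demo` pair naturally with `sparse_categorical_crossentropy`, which avoids materialising a one-hot matrix with `vocab_size` columns.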

