<h3>Load libraries</h3>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from keras.preprocessing import text_dataset_from_directory
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Input, TextVectorization, Embedding
from keras.optimizers import Adam

from tensorflow.strings import regex_replace
from tensorflow import convert_to_tensor

from sklearn.preprocessing import MinMaxScaler



<h3>Read the data from the directory</h3>
According to the Keras documentation, the directory structure must be:
<pre>
main_directory/
...class_a/
......a_text_1.txt
......a_text_2.txt
...class_b/
......b_text_1.txt
......b_text_2.txt
</pre>
See the documentation <a href="https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory">here</a>.
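As a rough illustration of that layout, the sketch below builds a tiny directory tree with the standard library and lists its class folders. The helpers `build_toy_corpus` and `list_classes` and the sample texts are made up for this example; only the folder-per-class convention comes from the Keras documentation. With inferred labels, Keras assigns integer labels to the class folders in alphanumeric order, which the sorted list mimics.

```python
import tempfile
from pathlib import Path


def build_toy_corpus(root, samples):
    """Write one .txt file per sample under root/<class_name>/ (hypothetical helper)."""
    root = Path(root)
    for class_name, texts in samples.items():
        class_dir = root / class_name
        class_dir.mkdir(parents=True, exist_ok=True)
        for i, text in enumerate(texts, start=1):
            (class_dir / f"{class_name}_text_{i}.txt").write_text(text, encoding="utf-8")
    return root


def list_classes(root):
    """Return the class folder names in alphanumeric order."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


toy_root = build_toy_corpus(
    tempfile.mkdtemp(),
    {"class_b": ["bad film"], "class_a": ["good film", "great film"]},
)
class_names = list_classes(toy_root)
# Integer label per class, following the sorted folder order
label_of = {name: index for index, name in enumerate(class_names)}
print(class_names)
print(label_of)
```

Pointing `text_dataset_from_directory` at a folder shaped like `toy_root` should then report two classes, just as the real dataset does further down.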

In [2]:
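def readData(directory):
  # build a labeled text dataset from the class subfolders of `directory`
  data = text_dataset_from_directory(directory)
  return data.map(
    # the review files contain HTML line-break tags; replace them with spaces
    lambda text, label: (regex_replace(text, '<br />', ' '), label),
  )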
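To see what the replacement inside `readData` does to a single review, here is a plain-Python equivalent that uses `re.sub` in place of `regex_replace`. The sample string is invented for illustration and `strip_br` is a hypothetical helper, not part of the notebook.

```python
import re


def strip_br(text):
    """Replace each literal '<br />' tag with a single space."""
    return re.sub(r"<br />", " ", text)


sample = "Great cast.<br /><br />Weak ending."
cleaned = strip_br(sample)
print(repr(cleaned))
```

Two consecutive tags leave two spaces behind. That is harmless here, because the tokenizer used later splits on whitespace.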

These data are movie reviews from the IMDB website and form a dataset originally available at http://ai.stanford.edu/~amaas/data/sentiment/

In [3]:
datadir = "imdb"
data_train = readData(datadir+"/train")
data_test  = readData(datadir+"/test")

text_train = data_train.map(lambda text, label: text)

Found 25000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


<h3>Let's look at some of the data</h3>

In [4]:
for text, label in data_train.take(4):
    print(text.numpy()[0])
    print(label.numpy()[0])

b'Dan, the widowed father of three girls, has his own advice column that will probably go into syndication. After his wife\'s death, he has taken time to raise his daughters. Having known no romance in quite some time, nothing prepares him for the encounter with the radiant Marie, at a local book store in a Rhode Island small town on the ocean, where he has gone to celebrate Thanksgiving with the rest of his big family. After liking Marie at first sight, little prepares him when the gorgeous woman appears at the family compound. After all, she is the date of Dan\'s brother, Mitch.  It is clear from the outset that Dan and Marie are made for one another, and although we sense what the outcome will be, we go for the fun ride that Peter Hedges, the director wants to give us. Mr. Hedges, an author and screenplay writer on his own, has given us two excellent novels, "What\'s Eating Gilber Grapes", and "An Ocean in Iowa", and the delightful indie, "Pieces of April, which he also directed. It



<h3>Let's build our network</h3>

In [5]:
model = Sequential()
model.add(Input(shape=(1,), dtype="string"))

<h3>Let's create a text vectorization layer</h3>
Two important parameters are the vocabulary size and the length of the output vector. Before adding the layer to the model, we need to adapt it to the texts that will be used. Words that do not fit within the maximum vocabulary size are mapped to the "out of vocabulary" (OOV) token.
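The idea behind the vectorization can be sketched in plain Python. This is a simplified stand-in for illustration, not the Keras implementation: `tokenize`, `build_vocabulary`, `vectorize`, and the toy corpus are all hypothetical, and the punctuation handling only roughly mimics the layer's default standardization. It does follow the same conventions as the layer in `int` mode: id 0 is padding, id 1 is the OOV token, and the remaining ids go to the most frequent words.

```python
import re
from collections import Counter

PAD, OOV = "", "[UNK]"


def tokenize(text):
    """Lowercase, drop punctuation, split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def build_vocabulary(texts, max_tokens):
    """Most frequent tokens first; slots 0 and 1 are reserved for padding and OOV."""
    counts = Counter(token for text in texts for token in tokenize(text))
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return [PAD, OOV] + most_common


def vectorize(text, vocabulary, sequence_length):
    """Map tokens to integer ids, then truncate or zero-pad to sequence_length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in tokenize(text)][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


corpus = ["the film was good", "the film was bad", "the film is long", "the end"]
vocab = build_vocabulary(corpus, max_tokens=5)
print(vocab)                                                    # ['', '[UNK]', 'the', 'film', 'was']
print(vectorize("The film was brilliant!", vocab, sequence_length=6))  # [2, 3, 4, 1, 0, 0]
```

The word "brilliant" never appeared in the toy corpus, so it becomes id 1, and the two trailing zeros are padding.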

In [6]:
max_tokens = 2000  # vocabulary size
max_dim = 300 # output sequence length: each text is truncated or zero-padded to max_dim tokens, so only the first max_dim tokens of a text are converted to numbers
vector_layer = TextVectorization(max_tokens=max_tokens, output_mode="int", output_sequence_length=max_dim)
vector_layer.adapt(text_train)
model.add(vector_layer)



Let's look at the most frequent words in the vocabulary

In [7]:
vocabulary = vector_layer.get_vocabulary()
n = 20
for index in range(n):
    print(f'{index}: {vocabulary[index]} ')

0:  
1: [UNK] 
2: the 
3: and 
4: a 
5: of 
6: to 
7: is 
8: in 
9: it 
10: i 
11: this 
12: that 
13: was 
14: as 
15: for 
16: with 
17: movie 
18: but 
19: film 


Let's vectorize a sentence

In [8]:
frase = [["But honestly, it doesn't even matter what the film is about."]]
vetor = vector_layer(frase)
vetor

<tf.Tensor: shape=(1, 300), dtype=int64, numpy=
array([[  18, 1220,    9,  144,   53,  543,   48,    2,   19,    7,   42,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,   

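<h3>Create an embedding layer</h3>
Note that the vocabulary size is set to max_tokens + 1 to leave room for the out-of-vocabulary token. TextVectorization already counts the padding and [UNK] entries within max_tokens, so the extra row is harmless rather than strictly required.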

In [9]:
embedding_layer = Embedding(input_dim = max_tokens + 1, output_dim = 128)
model.add(embedding_layer)

Let's pass the previous vector through the embedding layer and see what comes out

In [10]:
resultado = embedding_layer(vetor)
resultado

<tf.Tensor: shape=(1, 300, 128), dtype=float32, numpy=
array([[[-0.03481696, -0.0008384 , -0.01511191, ...,  0.02641903,
          0.00711516,  0.0178939 ],
        [ 0.02262378,  0.01581805, -0.04367347, ..., -0.01852804,
         -0.03592256, -0.04343842],
        [-0.04170716,  0.01707349, -0.04461306, ...,  0.02186661,
          0.00029119, -0.04977271],
        ...,
        [-0.02977302, -0.0094866 , -0.03631153, ...,  0.02489788,
         -0.00381721,  0.02919066],
        [-0.02977302, -0.0094866 , -0.03631153, ...,  0.02489788,
         -0.00381721,  0.02919066],
        [-0.02977302, -0.0094866 , -0.03631153, ...,  0.02489788,
         -0.00381721,  0.02919066]]], dtype=float32)>
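The repeated trailing rows in this output come from padding: every 0 in the vectorized sentence picks the same row of the embedding matrix. A minimal pure-Python sketch of that table lookup is below. It illustrates the idea and is not how Keras implements the layer; `make_embedding_table` and `embed` are hypothetical helpers and the random values are untrained stand-ins.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """One row of `dim` small random floats per token id (untrained weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(batch_of_ids, table):
    """Replace every integer id with its row from the table."""
    return [[table[token_id] for token_id in ids] for ids in batch_of_ids]


table = make_embedding_table(vocab_size=2001, dim=128)
ids = [[18, 1220, 9] + [0] * 297]  # a short sentence zero-padded to length 300
out = embed(ids, table)

print(len(out), len(out[0]), len(out[0][0]))   # batch size, sequence length, embedding dim
print(out[0][-1] == out[0][-2] == table[0])    # padded positions share row 0
```

The shape `(1, 300, 128)` matches the tensor above: one sentence, 300 token positions, and one 128-dimensional vector per position.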

<h3>Let's add the remaining layers, one of which is an LSTM</h3>

In [11]:
model.add(LSTM(64))
model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

In [12]:
optimizer = Adam()

In [13]:
model.compile(optimizer=optimizer, loss='binary_crossentropy',  metrics=['accuracy'])
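For reference, the `binary_crossentropy` loss named above can be written out by hand. The function below is a small pure-Python sketch of the standard formula, not the Keras implementation; the labels, the predictions, and the clipping constant `eps` are illustrative choices.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
chance = binary_crossentropy(labels, [0.5, 0.5, 0.5, 0.5])     # undecided model
confident = binary_crossentropy(labels, [0.9, 0.1, 0.8, 0.2])  # mostly right
print(round(chance, 4), round(confident, 4))
```

A model that outputs 0.5 for every review scores ln 2, about 0.693, which is a handy baseline when reading the loss reported during training.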

In [14]:
model.fit(data_train, validation_data = data_test, epochs = 10)

Epoch 1/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 69s 86ms/step - accuracy: 0.5141 - loss: 0.6926 - val_accuracy: 0.5015 - val_loss: 0.6945
Epoch 2/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 84s 108ms/step - accuracy: 0.5191 - loss: 0.6872 - val_accuracy: 0.5461 - val_loss: 0.6719
Epoch 3/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 88s 113ms/step - accuracy: 0.5517 - loss: 0.6713 - val_accuracy: 0.7133 - val_loss: 0.5902
Epoch 4/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 87s 111ms/step - accuracy: 0.7780 - loss: 0.4840 - val_accuracy: 0.8414 - val_loss: 0.3609
Epoch 5/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 91s 116ms/step - accuracy: 0.8563 - loss: 0.3441 - val_accuracy: 0.8597 - val_loss: 0.3367
Epoch 6/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 84s 107ms/step - accuracy: 0.8884 - loss: 0.2825 - val_accuracy: 0.8636 - val_loss: 0.3409
Epoch 7/10


<keras.src.callbacks.history.History at 0x7f99cd911640>

<h3>Let's see how the embedding layer looks after training</h3>

In [15]:
resultado = embedding_layer(vetor)
resultado

<tf.Tensor: shape=(1, 300, 128), dtype=float32, numpy=
array([[[ 0.1831876 , -0.12070046, -0.18899943, ...,  0.2586965 ,
         -0.06618347,  0.4126398 ],
        [-0.03309989, -0.16083215, -0.31954482, ...,  0.0948881 ,
          0.24190028,  0.14767277],
        [-0.09027941,  0.16346623,  0.06425152, ..., -0.03135087,
         -0.03991551,  0.02266031],
        ...,
        [ 0.767456  , -0.07145088,  0.76197535, ...,  0.57415056,
          0.06219751, -0.03874456],
        [ 0.767456  , -0.07145088,  0.76197535, ...,  0.57415056,
          0.06219751, -0.03874456],
        [ 0.767456  , -0.07145088,  0.76197535, ...,  0.57415056,
          0.06219751, -0.03874456]]], dtype=float32)>

<h3>Saving the model</h3>
Training a language model takes time and costs CPU/GPU, so you normally don't do it every time. You train your model once and save the result. Let's do that and then reuse the trained model in another network.

In [16]:
file = 'meu_modelo.keras'
model.save(file)

<h3>Let's create a new model and load the trained one to use it</h3>

In [17]:
uso = load_model(file)

In [18]:
meu_review = ["""I had already watched this movie, but I remembered almost nothing, including that several interesting actors are part of it. 
The opening scene is intensely gory and very interesting, with crude visual effects worthy of 2002, but very pleasing to the eyes of fans of B movies from the 90s/00s. 
Without a doubt, it is a film worth watching and rewatching, especially if the viewer enjoys supernatural exploitation films.
"""]

In [19]:
print(uso.predict(convert_to_tensor(meu_review)))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 209ms/step
[[0.01852844]]


In [20]:
meu_review = ["""It has no plot, no comedy, no drama, no passion. You basically waste 2 hours of your time watching these characters you just can't seem to get attached to... 
Watch it only if you're interested in watching all these actors as they were before they got famous, or if you feel REALLY nostalgic about the '80s.
"""]

In [21]:
print(uso.predict(convert_to_tensor(meu_review)))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[[0.00242021]]


In [22]:
meu_review = ["""I usually steer toward Sci-Fi and Fantasy movies, but after watching the The Breakfast Club, I thought I may enjoy seeing actors and actresses like Judd Nelson, and Ally Sheedy working together again. 
Well, I was right. Some movies can make you feel different emotions by being tearful, or violent. St.Elmo's Fire made me cry- not because of the actual plot, but the way it truly played to real life. 
The aspects of this movie letf wishing to see what would happen next, after the character's old lives were left behind."""]

In [23]:
print(uso.predict(convert_to_tensor(meu_review)))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step
[[0.99375916]]
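The sigmoid output is a score between 0 and 1, so turning it into a class name takes one thresholding step. The sketch below assumes the class folders are named `neg` and `pos`, as in the original IMDB archive, so that alphanumeric ordering gives `neg` label 0 and `pos` label 1; the notebook itself does not show the folder names. `to_label` and the 0.5 threshold are illustrative choices.

```python
def to_label(score, class_names=("neg", "pos"), threshold=0.5):
    """Map a sigmoid score to a class name; index 1 is chosen above the threshold."""
    return class_names[int(score > threshold)]


# Scores printed by the three predict() calls above
scores = [0.01852844, 0.00242021, 0.99375916]
predicted = [to_label(s) for s in scores]
print(predicted)  # ['neg', 'neg', 'pos']
```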
