<a href="https://colab.research.google.com/github/thiago2608santana/Natural_Language_Processing_with_Python/blob/main/Text_Generation_with_LSTMipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Helper functions

In [10]:
def read_file(filepath):
  with open(filepath) as f:
    str_text = f.read()

  return str_text

In [11]:
def separate_punc(doc_text):
  return [token.text.lower() for token in nlp(doc_text) if token.text not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']
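The filter in `separate_punc` relies on Python's substring test: `token.text not in '...'` asks whether the token occurs anywhere inside the punctuation string, not whether it equals one listed character. A minimal sketch of that behavior, using a hand-written token list in place of spaCy's output (the `filter_tokens` name and the `sample_tokens` list are illustrative assumptions, not part of the notebook):

```python
# Same punctuation string as in separate_punc.
PUNCT = '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n '

def filter_tokens(token_texts):
    """Lowercase tokens and drop any that occur as a substring of PUNCT."""
    return [t.lower() for t in token_texts if t not in PUNCT]

# Hypothetical token texts, standing in for [token.text for token in nlp(doc_text)].
sample_tokens = ['Call', 'me', ',', 'Ishmael', '.', '\n\n', '--', 'Some', 'years']
filtered = filter_tokens(sample_tokens)
print(filtered)
```

Because the test is on substrings, multi-character runs such as `--` or `\n\n` are dropped only when that exact run appears in the string.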

**Only the first four chapters of Moby-Dick**

In [12]:
#read_file('/content/drive/MyDrive/NLP_with_Python_Datasets/moby_dick_four_chapters.txt')

**The complete book, Moby-Dick**

In [13]:
#read_file('/content/drive/MyDrive/NLP_with_Python_Datasets/melville-moby_dick.txt')

# Library imports

In [14]:
import spacy
import numpy as np
import tensorflow.keras as k
from tensorflow import device
# Import everything from tensorflow.keras so standalone keras and tf.keras are not mixed.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding
from pickle import dump, load

# Main code

In [15]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])

**Raise spaCy's maximum text length (`nlp.max_length`, measured in characters) to work with large texts**

In [16]:
nlp.max_length = 1198623

In [17]:
d = read_file('/content/drive/MyDrive/NLP_with_Python_Datasets/moby_dick_four_chapters.txt')

In [18]:
#from google.colab import drive
#drive.mount('/content/drive')

In [19]:
tokens = separate_punc(d)



In [20]:
len(tokens)

11338

**Use 25 words as input; the network predicts word 26**

In [21]:
train_len = 25 + 1

text_sequences = []

for i in range(train_len, len(tokens)):
  seq = tokens[i-train_len:i]
  text_sequences.append(seq)
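The windowing loop can be traced on a toy input. The sketch below keeps the same loop bounds but uses a six-token list and a window of 3 + 1 instead of 25 + 1; the `sliding_windows` name and the toy sizes are assumptions made for illustration.

```python
def sliding_windows(tokens, train_len):
    """Collect overlapping windows using the same loop bounds as the notebook cell."""
    windows = []
    for i in range(train_len, len(tokens)):
        windows.append(tokens[i - train_len:i])
    return windows

tokens = ['call', 'me', 'ishmael', 'some', 'years', 'ago']
train_len = 3 + 1  # 3 input words plus 1 target word (toy size, not 25 + 1)

windows = sliding_windows(tokens, train_len)
for w in windows:
    # Every word but the last is input; the last word is the prediction target.
    print(w[:-1], '->', w[-1])
```

With these bounds the loop yields `len(tokens) - train_len` windows (11338 - 26 = 11312 for the four chapters), and the window that would end on the very last token is not generated.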

In [22]:
' '.join(text_sequences[0])

'call me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on'

In [23]:
' '.join(text_sequences[1])

'me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore'

# Convert the text sequences into sequences of numbers (tokenizer)

In [24]:
tokenizer = Tokenizer()

In [25]:
tokenizer.fit_on_texts(text_sequences)

In [26]:
sequences = tokenizer.texts_to_sequences(text_sequences)

In [27]:
sequences[0]

[956,
 14,
 263,
 51,
 261,
 408,
 87,
 219,
 129,
 111,
 954,
 260,
 50,
 43,
 38,
 314,
 7,
 23,
 546,
 3,
 150,
 259,
 6,
 2713,
 14,
 24]

**Dictionary mapping each number to its corresponding word**

In [28]:
for i in sequences[0]:
  print(f'{i} : {tokenizer.index_word[i]}')
#tokenizer.index_word

956 : call
14 : me
263 : ishmael
51 : some
261 : years
408 : ago
87 : never
219 : mind
129 : how
111 : long
954 : precisely
260 : having
50 : little
43 : or
38 : no
314 : money
7 : in
23 : my
546 : purse
3 : and
150 : nothing
259 : particular
6 : to
2713 : interest
14 : me
24 : on
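Frequent words such as `and` (3), `to` (6) and `in` (7) receive small numbers because the Keras `Tokenizer` ranks words by descending count and starts numbering at 1. The following is a plain-Python approximation of that ranking, not the Keras implementation; `build_word_index` and the toy sequences are hypothetical names and data.

```python
from collections import Counter

def build_word_index(sequences):
    """Rank words by descending count; indices start at 1, leaving 0 unused."""
    counts = Counter(word for seq in sequences for word in seq)
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1)}

toy_sequences = [['the', 'whale', 'and', 'the'], ['whale', 'and', 'the', 'sea']]

word_index = build_word_index(toy_sequences)
index_word = {index: word for word, index in word_index.items()}

# Replace every word by its index, as texts_to_sequences does for known words.
encoded = [[word_index[word] for word in seq] for seq in toy_sequences]
print(word_index)
print(encoded)
```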


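**Show how many times each word appears in the tokenizer's input (the counts are taken over the overlapping 26-word sequences, so a word is counted once for every sequence it falls in)**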

In [29]:
tokenizer.word_counts

OrderedDict([('call', 27),
             ('me', 2471),
             ('ishmael', 133),
             ('some', 758),
             ('years', 135),
             ('ago', 84),
             ('never', 449),
             ('mind', 164),
             ('how', 321),
             ('long', 374),
             ('precisely', 37),
             ('having', 142),
             ('little', 767),
             ('or', 950),
             ('no', 1003),
             ('money', 120),
             ('in', 5647),
             ('my', 1786),
             ('purse', 71),
             ('and', 9646),
             ('nothing', 281),
             ('particular', 152),
             ('to', 6497),
             ('interest', 24),
             ('on', 1716),
             ('shore', 26),
             ('i', 7150),
             ('thought', 676),
             ('would', 702),
             ('sail', 104),
             ('about', 1014),
             ('a', 10377),
             ('see', 416),
             ('the', 15540),
             ('watery', 26),
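These counts are much larger than the number of times a word occurs in the text, because the tokenizer was fitted on overlapping windows rather than on the raw token list. A small sketch of the effect, using a toy token list and a window of 3 (both assumptions for illustration):

```python
from collections import Counter

tokens = ['a', 'b', 'a', 'c', 'a', 'd']
train_len = 3

# Same loop bounds as the notebook's windowing cell.
windows = [tokens[i - train_len:i] for i in range(train_len, len(tokens))]

counts_over_tokens = Counter(tokens)
counts_over_windows = Counter(word for w in windows for word in w)
print(counts_over_tokens)
print(counts_over_windows)
```

In this toy case `a` occurs 3 times in the text but is counted 5 times across the windows, and the final token `d` never enters a window at all.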
  

In [30]:
vocabulary_size = len(tokenizer.word_counts)

In [31]:
vocabulary_size

2718

# Convert the list of numeric sequences (tokens) into an array in which each row represents one sequence

In [32]:
sequences = np.array(sequences)
sequences

array([[ 956,   14,  263, ..., 2713,   14,   24],
       [  14,  263,   51, ...,   14,   24,  957],
       [ 263,   51,  261, ...,   24,  957,    5],
       ...,
       [ 952,   12,  166, ...,  262,   53,    2],
       [  12,  166, 2712, ...,   53,    2, 2718],
       [ 166, 2712,    3, ...,    2, 2718,   26]])

# Split the data into input features (X) and target labels (y)

In [33]:
sequences[:,:-1]

array([[ 956,   14,  263, ...,    6, 2713,   14],
       [  14,  263,   51, ..., 2713,   14,   24],
       [ 263,   51,  261, ...,   14,   24,  957],
       ...,
       [ 952,   12,  166, ...,   11,  262,   53],
       [  12,  166, 2712, ...,  262,   53,    2],
       [ 166, 2712,    3, ...,   53,    2, 2718]])

In [34]:
sequences[:,-1]

array([  24,  957,    5, ...,    2, 2718,   26])

In [35]:
X = sequences[:,:-1]
y = sequences[:,-1]
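For a single row, the two slices amount to "everything but the last index" and "the last index". A sketch on the first encoded sequence shown earlier, using plain list slicing in place of the NumPy column slices (the `features` and `target` names are illustrative):

```python
# First encoded sequence, copied from the sequences[0] output.
first_sequence = [956, 14, 263, 51, 261, 408, 87, 219, 129, 111, 954, 260, 50,
                  43, 38, 314, 7, 23, 546, 3, 150, 259, 6, 2713, 14, 24]

features = first_sequence[:-1]  # one-row equivalent of sequences[:, :-1]
target = first_sequence[-1]     # one-row equivalent of sequences[:, -1]

print(len(features), target)
```

The 25 input indices end with 14 (`me`) and the target is 24 (`on`), matching the first row of each array above.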

In [36]:
y = k.utils.to_categorical(y, num_classes=vocabulary_size+1)

In [37]:
X.shape

(11312, 25)

In [38]:
y.shape

(11312, 2719)
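The label width of 2719 is `vocabulary_size + 1`: word indices run from 1 to 2718, and position 0 stays unused. The sketch below shows the encoding for one label with a hand-written `one_hot` helper, a hypothetical stand-in for what `to_categorical` does per row rather than the Keras code itself.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    if not 0 <= index < num_classes:
        raise ValueError(f'index {index} is outside 0..{num_classes - 1}')
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

vocabulary_size = 2718             # value reported by the notebook
num_classes = vocabulary_size + 1  # slot 0 is unused; word indices are 1..2718

encoded = one_hot(2718, num_classes)  # the highest word index still fits
print(len(encoded), encoded.index(1.0))
```

Without the extra slot, the highest index (2718) would fall outside a vector of length 2718.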

In [39]:
seq_len = X.shape[1]

# Function to create the model with preset parameters

In [40]:
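def create_model(vocabulary_size, seq_len):
  """Build and compile a two-layer LSTM next-word classifier."""

  model = Sequential(name='Sequential_Text_Generator')
  # The embedding size is set equal to seq_len (25 here); input is seq_len word indices.
  model.add(Embedding(vocabulary_size, seq_len, input_length=seq_len, name='Embedding'))
  # return_sequences=True passes the full sequence on to the second LSTM.
  model.add(LSTM(150, return_sequences=True, name='LSTM_1'))
  model.add(LSTM(150, name='LSTM_2'))
  model.add(Dense(150, activation='relu', name='Dense_1'))
  # One softmax output per word index (including the unused index 0).
  model.add(Dense(vocabulary_size, activation='softmax', name='Dense_2'))

  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

  model.summary()

  return model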

In [41]:
model = create_model(vocabulary_size+1, seq_len)

Model: "Sequential_Text_Generator"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Embedding (Embedding)       (None, 25, 25)            67975     
                                                                 
 LSTM_1 (LSTM)               (None, 25, 150)           105600    
                                                                 
 LSTM_2 (LSTM)               (None, 150)               180600    
                                                                 
 Dense_1 (Dense)             (None, 150)               22650     
                                                                 
 Dense_2 (Dense)             (None, 2719)              410569    
                                                                 
Total params: 787,394
Trainable params: 787,394
Non-trainable params: 0
_________________________________________________________________
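The parameter counts in the summary can be reproduced by hand from the standard formulas for embedding, LSTM and dense layers. The helper names below are illustrative; the sizes are the ones passed to `create_model`.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab, seq_len, units = 2719, 25, 150

counts = {
    'Embedding': embedding_params(vocab, seq_len),
    'LSTM_1': lstm_params(seq_len, units),
    'LSTM_2': lstm_params(units, units),
    'Dense_1': dense_params(units, units),
    'Dense_2': dense_params(units, vocab),
}
total = sum(counts.values())
print(counts, total)
```

More than half of the 787,394 parameters sit in the final softmax layer, whose size grows with the vocabulary.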


# Train the model

In [42]:
with device('/device:GPU:0'):

  model.fit(X, y, batch_size=128, epochs=200)

Epoch 1/200
...
Epoch 78

In [43]:
model.save('my_mobydick_model.h5')

In [44]:
# Use a context manager so the file is closed once the tokenizer has been written.
with open('my_simpletokenizer', 'wb') as f:
  dump(tokenizer, f)
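The saved tokenizer is a pickle file, so it can be read back with `load` from the same module. A minimal round-trip sketch, with a plain dictionary standing in for the fitted tokenizer and a temporary directory standing in for the notebook's working directory (both are assumptions for illustration):

```python
import os
import tempfile
from pickle import dump, load

# Hypothetical stand-in for the fitted Tokenizer object.
stand_in = {'word_index': {'the': 1, 'whale': 2}}

path = os.path.join(tempfile.mkdtemp(), 'my_simpletokenizer')

with open(path, 'wb') as f:
    dump(stand_in, f)

with open(path, 'rb') as f:
    restored = load(f)

print(restored == stand_in)
```

Unpickling executes code stored in the file, so only files from a trusted source should be loaded this way.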