<a href="https://colab.research.google.com/github/vicotrbb/machine_learning/blob/master/projects/wikipedia-nlp/poc_wikipedia_nlp_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sample and test using LSTM

**Model taken and adapted from the examples created for TensorFlow**

In [37]:
import pickle
import math
import pandas as pd
import numpy as np
from numpy import array

from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding

from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping

from pickle import load

In [1]:
import keras
keras.__version__

'2.4.3'

## Text generation with LSTM

This notebook contains the code samples found in Chapter 8, Section 1 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

----

[...]

## Implementing character-level LSTM text generation


Let's put these ideas into practice in a Keras implementation. The first thing we need is a lot of text data that we can use to learn a 
language model. You could use any sufficiently large text file or set of text files -- Wikipedia, the Lord of the Rings, etc. The original 
example uses some of the writings of Nietzsche, the late-19th century German philosopher (translated to English); this adaptation instead uses 
a subset of Wikipedia article content, loaded from `wikipedia-content-dataset.json`. The language model we learn will thus reflect the style 
and topics of those articles, rather than being a more generic model of the English language.

## Preparing the data without using Tokenizer

**I had to reduce the size of the dataset because of memory problems; the final model should be trained on a server with more resources available.**

In [48]:
import keras
import numpy as np
import json

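# Load the dataset; the context manager closes the file afterwards
with open('wikipedia-content-dataset.json') as f:
    data = json.load(f)

content = list(data.values())

# Concatenate every piece of the first 200 entries into one string
text = ''.join(piece for c in content[0:200] for piece in c)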

print('Corpus length:', len(text))

Corpus length: 716167


In [6]:
text[100:1000]

"rd. He was the first black musician to win the competition since its launch in 1978. He played at the wedding of Prince Harry to Meghan Markle on 19 May 2018 under the direction of Christopher Warren-Green.\n\n\n\nKanneh-Mason grew up in Nottingham, England. He was born to Stuart Mason, a luxury hotel business manager from Antigua, and Dr. Kadiatu Kanneh, a former lecturer at the University of Birmingham, from Sierra Leone.  He is the third of seven children and began learning the cello at the age of six with Sarah Huson-Whyte, having briefly played the violin. His love for the cello started when he saw his sister perform in 'Stringwise', an annual weekend course for young Nottingham string players, run by the local music charity Music for Everyone. He then switched from violin to cello and went on to take part in Music for Everyone's Stringwise courses, impressing their conductors with his "


Next, we will extract partially-overlapping sequences of length `maxlen`, one-hot encode them and pack them in a 3D Numpy array `x` of 
shape `(sequences, maxlen, unique_characters)`. Simultaneously, we prepare an array `y` containing the corresponding targets: the one-hot 
encoded characters that come right after each extracted sequence.
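
As a minimal sketch of this windowing and encoding, the same steps can be run on a toy string. The string `toy_text` and the small `maxlen` and `step` values are illustrative assumptions, not the notebook's settings.

```python
import numpy as np

# Toy corpus and window settings (illustrative values only)
toy_text = "hello world"
maxlen = 4   # length of each input window
step = 3     # stride between consecutive windows

# Slide a window over the text; the target is the character right after it
starts = range(0, len(toy_text) - maxlen, step)
windows = [toy_text[i:i + maxlen] for i in starts]
targets = [toy_text[i + maxlen] for i in starts]

# Map each unique character to an integer index
chars = sorted(set(toy_text))
char_indices = {c: i for i, c in enumerate(chars)}

# x is (sequences, maxlen, unique_characters); y is (sequences, unique_characters)
x = np.zeros((len(windows), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(windows), len(chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, c in enumerate(window):
        x[i, t, char_indices[c]] = True
    y[i, char_indices[targets[i]]] = True

print(windows)            # ['hell', 'lo w', 'worl']
print(targets)            # ['o', 'o', 'd']
print(x.shape, y.shape)   # (3, 4, 8) (3, 8)
```

Each row of `x` holds exactly one `True` per time step, and each row of `y` holds exactly one `True` for the follow-up character.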

In [49]:
# Length of extracted character sequences
maxlen = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: i for i, char in enumerate(chars)}

# Next, one-hot encode the characters into binary arrays.
# The built-in `bool` replaces the deprecated `np.bool` alias.
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Number of sequences: 238703
Unique characters: 356
Vectorization...


In [54]:
print(x.shape)
print(y.shape)

(238703, 60, 356)
(238703, 356)
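
These shapes help explain the memory problems mentioned earlier. A rough back-of-the-envelope estimate, using only the figures printed by the notebook and the fact that NumPy stores one byte per boolean element:

```python
# Figures reported by the notebook's own output
corpus_length = 716167
maxlen, step = 60, 3
num_chars = 356

# One window starts every `step` characters
num_sequences = len(range(0, corpus_length - maxlen, step))

# Boolean NumPy arrays use one byte per element
x_bytes = num_sequences * maxlen * num_chars
y_bytes = num_sequences * num_chars

print(num_sequences)                          # 238703
print(x_bytes)                                # 5098696080
print(round(x_bytes / 1024**3, 2), 'GiB for x')
print(round(y_bytes / 1024**2, 1), 'MiB for y')
```

The one-hot input tensor alone takes roughly 5.1 GB even as booleans, and any conversion to a floating-point type would multiply that further.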


## Building the network

Our network is a single `LSTM` layer followed by a `Dense` classifier and softmax over all possible characters. But let us note that 
recurrent neural networks are not the only way to do sequence data generation; 1D convnets also have proven extremely successful at it in 
recent times.

In [50]:
from keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

Since our targets are one-hot encoded, we will use `categorical_crossentropy` as the loss to train the model:

In [51]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

## Training the language model and sampling from it


Given a trained model and a seed text snippet, we generate new text by repeatedly:

* 1) Drawing from the model a probability distribution over the next character given the text available so far
* 2) Reweighting the distribution to a certain "temperature"
* 3) Sampling the next character at random according to the reweighted distribution
* 4) Adding the new character at the end of the available text

This is the code we use to reweight the original probability distribution coming out of the model, 
and draw a character index from it (the "sampling function"):

In [52]:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
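
To see what the reweighting step does on its own, here is a small sketch that applies the same arithmetic without the sampling. The helper name `reweight` and the four-character distribution are hypothetical, chosen only for illustration.

```python
import numpy as np

def reweight(preds, temperature=1.0):
    """Return the temperature-adjusted distribution (no sampling)."""
    preds = np.asarray(preds).astype('float64')
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

# Hypothetical next-character distribution over a 4-character vocabulary
original = [0.5, 0.3, 0.15, 0.05]

for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(reweight(original, temperature), 3))
```

Low temperatures concentrate almost all probability on the most likely character, a temperature of 1.0 leaves the distribution unchanged, and temperatures above 1.0 flatten it, making unlikely characters more probable.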


Finally, this is the loop where we repeatedly train and generate text. We start generating text using a range of different temperatures 
after every epoch. This allows us to see how the generated text evolves as the model starts converging, as well as the impact of 
temperature in the sampling strategy.

## Monitoring training

In [53]:
import random
import sys

for epoch in range(1, 100):
    print('epoch', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')

    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

epoch 1
--- Generating with seed: "s basketball season
WAC Women's Basketball Tournament
2017–1"
------ temperature: 0.2
s basketball season
WAC Women's Basketball Tournament
2017–1994)
[... long run of generated newline characters omitted ...]
------ temperature: 0.5
[... long run of generated newline characters omitted ...]
san and a spart for was a mather the cultury and lack and the the and her other of the the and her small or and an who hime desing the the mare has a for a madion of expers by the for Christion which of the socies that mans the resi

KeyboardInterrupt: ignored

# Other training and preprocessing architectures

## Preparing the data using Tokenizer

In [None]:
import keras
import numpy as np
import json

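# Load the dataset; the context manager closes the file afterwards
with open('wikipedia-content-dataset.json') as f:
    data = json.load(f)

content = list(data.values())

# Concatenate every piece of the first 200 entries into one string
text = ''.join(piece for c in content[0:200] for piece in c)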

print('Corpus length:', len(text))

Corpus length: 716167


In [None]:
text[100:1000]

"rd. He was the first black musician to win the competition since its launch in 1978. He played at the wedding of Prince Harry to Meghan Markle on 19 May 2018 under the direction of Christopher Warren-Green.\n\n\n\nKanneh-Mason grew up in Nottingham, England. He was born to Stuart Mason, a luxury hotel business manager from Antigua, and Dr. Kadiatu Kanneh, a former lecturer at the University of Birmingham, from Sierra Leone.  He is the third of seven children and began learning the cello at the age of six with Sarah Huson-Whyte, having briefly played the violin. His love for the cello started when he saw his sister perform in 'Stringwise', an annual weekend course for young Nottingham string players, run by the local music charity Music for Everyone. He then switched from violin to cello and went on to take part in Music for Everyone's Stringwise courses, impressing their conductors with his "

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

max_words = 500000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(content[0:300]) # For tokenization, we use the list of lists
sequences = tokenizer.texts_to_sequences(content[0:300]) # For tokenization, we use the list of lists
print(sequences[:5])

[[10, 11], [12, 13], [14, 15], [16, 17], [18, 19]]


In [None]:
vocab_size = len(tokenizer.word_index)
print('Vocabulary size: ', vocab_size) # Yields a larger vocabulary than the hand-coded version
print('Number of sequences: ', len(sequences))

Vocabulary size:  591
Number of sequences:  300


In [None]:
sentence_len = 60
pred_len = 3
train_len = sentence_len - pred_len
seq = []
for i in range(len(text)-sentence_len):
    seq.append(text[i:i+sentence_len])
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))

In [None]:
trainX = []
trainy = []
for i in seq:
    trainX.append(i[:train_len])
    trainy.append(i[-1])

In [None]:
trainX[:5]

['Sheku Kanneh-Mason  (born 4 April 1999) is a British cell',
 'heku Kanneh-Mason  (born 4 April 1999) is a British celli',
 'eku Kanneh-Mason  (born 4 April 1999) is a British cellis',
 'ku Kanneh-Mason  (born 4 April 1999) is a British cellist',
 'u Kanneh-Mason  (born 4 April 1999) is a British cellist ']

In [47]:
trainy[:5]

['t', ' ', 'w', 'h', 'o']
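
The outputs above show that each 57-character input is paired with the last character of its 60-character window, so two characters sit between the input and its target: the first input ends in `cell` and its target is `t`, skipping `is`. The sketch below reproduces this on a toy string and contrasts it with taking the character immediately after the input. The toy text and the small window sizes are illustrative assumptions, and whether plain next-character prediction was intended is also an assumption, since the notebook does not say.

```python
# Toy text and window sizes (illustrative assumptions)
toy_text = "abcdefghij"
sentence_len = 6
pred_len = 3
train_len = sentence_len - pred_len  # 3

# Same windowing as the notebook: one window per starting position
windows = [toy_text[i:i + sentence_len]
           for i in range(len(toy_text) - sentence_len)]

for w in windows:
    inputs = w[:train_len]     # what the notebook stores in trainX
    last_char = w[-1]          # what the notebook stores in trainy
    next_char = w[train_len]   # the character immediately after the input
    print(inputs, last_char, next_char)
```

For the first window `abcdef`, the input is `abc`, the notebook-style target is `f`, and the immediate next character is `d`.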

In [59]:
np.asarray(trainX)

array(['Sheku Kanneh-Mason  (born 4 April 1999) is a British cell',
       'heku Kanneh-Mason  (born 4 April 1999) is a British celli',
       'eku Kanneh-Mason  (born 4 April 1999) is a British cellis', ...,
       'nnui-eki) is a railway station in Oshamambe, Hokkaidō, Ja',
       'nui-eki) is a railway station in Oshamambe, Hokkaidō, Jap',
       'ui-eki) is a railway station in Oshamambe, Hokkaidō, Japa'],
      dtype='<U57')

In [58]:
pd.get_dummies(np.asarray(trainy))

Unnamed: 0,\t,\n,Unnamed: 3,!,"""",#,$,%,&,',(,),*,+,",",-,.,/,0,1,2,3,4,5,6,7,8,9,:,;,<,=,>,?,@,A,B,C,D,E,...,る,ク,グ,ゴ,サ,シ,ブ,ボ,ム,ル,ン,・,ー,世,光,勝,区,君,国,坂,宅,安,店,庭,成,暮,楽,界,社,笑,縫,脇,集,音,駅,근,범,성,승,차
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
716102,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
716103,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
716104,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
716105,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Different possible model architectures

### Version 1


1. Embedding layer
    - Helps the model capture the 'meaning' of words by mapping them to a dense vector space instead of arbitrary integer indices
2. Stacked LSTM layers
    - Stacked LSTMs add more depth than additional cells in a single LSTM layer (see paper: https://arxiv.org/abs/1303.5778)
    - The first LSTM layer must have the `return_sequences` flag set to `True` so that it passes its full output sequence to the second LSTM layer instead of only its final state
3. Dense (hidden) layer with ReLU activation
4. Dense layer with Softmax activation
    - Outputs a probability for each word in the vocabulary

In [38]:
model = Sequential([
    Embedding(vocab_size+1, 50, input_length=train_len),
    LSTM(100, return_sequences=True),
    LSTM(100),
    Dense(100, activation='relu'),
    Dense(vocab_size, activation='softmax')
])

In [39]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 57, 50)            29600     
_________________________________________________________________
lstm (LSTM)                  (None, 57, 100)           60400     
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 100)               10100     
_________________________________________________________________
dense_1 (Dense)              (None, 591)               59691     
Total params: 240,191
Trainable params: 240,191
Non-trainable params: 0
_________________________________________________________________
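
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are hypothetical, and `vocab_size = 591` is taken from the tokenizer output earlier in the notebook.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per vocabulary index
    return input_dim * output_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector
    return input_dim * units + units

vocab_size = 591  # from the tokenizer output earlier in the notebook

counts = [
    embedding_params(vocab_size + 1, 50),  # embedding
    lstm_params(50, 100),                  # lstm
    lstm_params(100, 100),                 # lstm_1
    dense_params(100, 100),                # dense
    dense_params(100, vocab_size),         # dense_1
]
print(counts)       # [29600, 60400, 80400, 10100, 59691]
print(sum(counts))  # 240191
```

The second LSTM has more parameters than the first only because its input is the 100-dimensional output of the previous LSTM rather than the 50-dimensional embedding.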


### Version 2

This model is similar to model 1, but we add a dropout layer to reduce overfitting. During training, the dropout layer randomly zeroes a proportion of the activations it receives from the previous layer, forcing the model to learn more robust features.
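
A minimal NumPy sketch of what a dropout layer does during training, assuming the common "inverted dropout" formulation in which surviving activations are scaled by `1 / (1 - rate)`. The `dropout` helper is hypothetical and only illustrates the idea; at inference time a dropout layer passes its inputs through unchanged.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((2, 10))
dropped = dropout(activations, rate=0.1, rng=rng)
print(dropped)
```

Every entry of `dropped` is either 0 (the unit was turned off) or about 1.11 (the unit was kept and rescaled), so the expected sum of the activations stays the same.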

In [None]:
model = Sequential([
    Embedding(vocab_size+1, 50, input_length=train_len),
    LSTM(100, return_sequences=True),
    LSTM(100),
    Dense(100, activation='relu'),
    Dropout(0.1),
    Dense(vocab_size, activation='softmax')
])

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [None]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 19, 50)            785700    
_________________________________________________________________
lstm_4 (LSTM)                (None, 19, 100)           60400     
_________________________________________________________________
lstm_5 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_4 (Dense)              (None, 100)               10100     
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 15713)             1587013   
Total params: 2,523,613
Trainable params: 2,523,613
Non-trainable params: 0
_________________________________________________________________


### Version 3

Model 2 had an additional dropout layer, but the accuracy took a 30% hit.

For model 3, we remove the dropout layer and increase the number of units in the LSTM layers and the hidden Dense layer by 50% (from 100 to 150).

As expected, this resulted in a higher accuracy on the training set of about 40%.

In [43]:
model = Sequential([
    Embedding(vocab_size+1, 50, input_length=train_len),
    LSTM(150, return_sequences=True),
    LSTM(150),
    Dense(150, activation='relu'),
    Dense(vocab_size, activation='softmax')
])

In [41]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 57, 50)            29600     
_________________________________________________________________
lstm_2 (LSTM)                (None, 57, 150)           120600    
_________________________________________________________________
lstm_3 (LSTM)                (None, 150)               180600    
_________________________________________________________________
dense_2 (Dense)              (None, 150)               22650     
_________________________________________________________________
dense_3 (Dense)              (None, 591)               89241     
Total params: 442,691
Trainable params: 442,691
Non-trainable params: 0
_________________________________________________________________


### Compile and fit

In [45]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(np.asarray(trainX), pd.get_dummies(np.asarray(trainy)), batch_size=128, epochs=100)



ValueError: ignored
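
The traceback is not shown, but the call above passes an array of raw strings to a model whose first layer is an `Embedding`, which expects integer indices. One possible repair, sketched under the assumption that a character-level vocabulary is wanted, is to encode each window as a row of integer ids before calling `fit`. The names `toy_windows`, `toy_targets`, and `char_to_id`, as well as the choice to reserve id 0, are illustrative assumptions and not part of the original notebook.

```python
import numpy as np

# Stand-ins for trainX / trainy: equal-length character windows and targets
toy_windows = ["Shek", "heku", "eku "]
toy_targets = ["u", " ", "K"]

# Character vocabulary; id 0 is left unused so it can serve as padding
chars = sorted(set("".join(toy_windows) + "".join(toy_targets)))
char_to_id = {c: i + 1 for i, c in enumerate(chars)}

# Integer matrix of shape (num_windows, window_length) for the Embedding layer
X = np.array([[char_to_id[c] for c in w] for w in toy_windows], dtype=np.int32)

# One-hot targets of shape (num_windows, len(chars) + 1)
y = np.zeros((len(toy_targets), len(chars) + 1), dtype=np.float32)
for row, c in enumerate(toy_targets):
    y[row, char_to_id[c]] = 1.0

print(X.shape, y.shape)  # (3, 4) (3, 8)
```

With this encoding, the `Embedding` input dimension and the size of the final `Dense` layer would both need to be `len(chars) + 1` rather than the tokenizer's `vocab_size`.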

### Monitoring training

In [None]:
import random
import sys

for epoch in range(1, 100):
    print('epoch', epoch)
    model.fit(np.asarray(trainX), 
              pd.get_dummies(np.asarray(trainy)), 
              batch_size=128, 
              epochs=1)

    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')

    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)

        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

# Cleaning the text

In [None]:
import keras
import numpy as np
import json

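# Load the dataset; the context manager closes the file afterwards
with open('wikipedia-content-dataset.json') as f:
    data = json.load(f)

content = list(data.values())

# Concatenate every piece of the first 200 entries into one string
text = ''.join(piece for c in content[0:200] for piece in c)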

print('Corpus length:', len(text))