# Sampling and testing with an LSTM

**Model taken and adapted from the examples created for TensorFlow**

In [None]:
import keras
keras.__version__

'2.4.3'

## Text generation with LSTM

This notebook contains the code samples found in Chapter 8, Section 1 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

----

[...]

## Implementing character-level LSTM text generation


Let's put these ideas into practice in a Keras implementation. The first thing we need is a lot of text data that we can use to learn a 
language model. You could use any sufficiently large text file or set of text files -- Wikipedia, the Lord of the Rings, etc. The original 
example uses some of the writings of Nietzsche, the late-19th century German philosopher (translated to English); this adapted notebook 
instead loads a JSON file of Wikipedia article content. The language model we learn will thus be specifically a model of the style and 
topics of those articles, rather than a more generic model of the English language.

## Preparing the data

**I had to reduce the size of the dataset because of memory problems; the final model should be trained on a server with more available resources.**

In [None]:
import keras
import numpy as np
import json

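# Load the JSON dump; the `with` block closes the file when done
with open('wikipedia-content-dataset (1).json') as f:
    data = json.load(f)

content = list(data.values())

# Concatenate the first 250 entries into a single string
text = ''.join(''.join(c) for c in content[0:250])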

print('Corpus length:', len(text))

Corpus length: 901824


In [None]:
text[1:1000]

"heku Kanneh-Mason  (born 4 April 1999) is a British cellist who won the 2016 BBC Young Musician award. He was the first black musician to win the competition since its launch in 1978. He played at the wedding of Prince Harry to Meghan Markle on 19 May 2018 under the direction of Christopher Warren-Green.\n\n\n\nKanneh-Mason grew up in Nottingham, England. He was born to Stuart Mason, a luxury hotel business manager from Antigua, and Dr. Kadiatu Kanneh, a former lecturer at the University of Birmingham, from Sierra Leone.  He is the third of seven children and began learning the cello at the age of six with Sarah Huson-Whyte, having briefly played the violin. His love for the cello started when he saw his sister perform in 'Stringwise', an annual weekend course for young Nottingham string players, run by the local music charity Music for Everyone. He then switched from violin to cello and went on to take part in Music for Everyone's Stringwise courses, impressing their conductors with 


Next, we will extract partially-overlapping sequences of length `maxlen`, one-hot encode them and pack them in a 3D Numpy array `x` of 
shape `(sequences, maxlen, unique_characters)`. Simultaneously, we prepare an array `y` containing the corresponding targets: the one-hot 
encoded characters that come right after each extracted sequence.
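
To make the windowing concrete, here is a small sketch on a toy string. The names `toy_text`, `toy_maxlen` and `toy_step` and their values are illustrative assumptions, not part of the notebook; only the procedure mirrors the description above.

```python
import numpy as np

# Toy corpus and window settings (illustrative values, not the notebook's)
toy_text = "hello world"
toy_maxlen = 4   # length of each input window
toy_step = 2     # stride between the starts of consecutive windows

windows = []
targets = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i: i + toy_maxlen])   # input sequence
    targets.append(toy_text[i + toy_maxlen])      # character that follows it

toy_chars = sorted(set(toy_text))
toy_index = {ch: k for k, ch in enumerate(toy_chars)}

# One-hot encode: x is (sequences, maxlen, unique_characters), y is (sequences, unique_characters)
toy_x = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, ch in enumerate(window):
        toy_x[i, t, toy_index[ch]] = True
    toy_y[i, toy_index[targets[i]]] = True

print(list(zip(windows, targets)))
print(toy_x.shape, toy_y.shape)
```

With a step smaller than the window length, consecutive windows share characters, which is what "partially-overlapping" refers to. The cell below applies the same procedure to the full corpus.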

In [None]:
# Length of extracted character sequences
maxlen = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: i for i, char in enumerate(chars)}

# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
# `np.bool` was deprecated in NumPy 1.20 and later removed; the builtin `bool` is equivalent
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Number of sequences: 300588
Unique characters: 430
Vectorization...
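
The memory problems mentioned earlier follow directly from these numbers. As a rough back-of-the-envelope sketch, using the sequence and character counts printed above and the fact that a NumPy boolean array stores one byte per element:

```python
import numpy as np

# Values printed by the cell above
n_sequences = 300588
window_len = 60      # same value as `maxlen`
n_chars = 430

bytes_per_bool = np.dtype(bool).itemsize  # NumPy stores one byte per boolean
x_bytes = n_sequences * window_len * n_chars * bytes_per_bool
y_bytes = n_sequences * n_chars * bytes_per_bool

print(f"x: {x_bytes / 1e9:.2f} GB, y: {y_bytes / 1e9:.2f} GB")
```

The input array alone comes to roughly 7.8 GB, and it scales linearly with the corpus length and with the number of unique characters, so a larger sample of the dataset quickly exceeds the memory of a typical workstation.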


## Building the network

Our network is a single `LSTM` layer followed by a `Dense` classifier with a softmax over all possible characters. Note that 
recurrent neural networks are not the only way to generate sequence data; 1D convnets have also proven extremely successful at it in 
recent times.

In [None]:
from keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))
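
As a quick sanity check on the size of this network, the parameter count can be worked out by hand. This is a sketch that assumes the default settings of both layers (biases enabled) and the 430 unique characters reported earlier; the result should match what `model.summary()` reports under those assumptions.

```python
units = 128      # LSTM units, as in the cell above
n_chars = 430    # unique characters reported by the vectorization cell

# An LSTM has 4 gates, each with an input kernel, a recurrent kernel and a bias
lstm_params = 4 * (n_chars * units + units * units + units)
# The Dense layer maps the 128-dimensional LSTM output to one score per character
dense_params = units * n_chars + n_chars

total_params = lstm_params + dense_params
print(lstm_params, dense_params, total_params)
```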

Since our targets are one-hot encoded, we will use `categorical_crossentropy` as the loss to train the model:

In [None]:
# `learning_rate` replaces the deprecated `lr` argument name
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

## Training the language model and sampling from it


Given a trained model and a seed text snippet, we generate new text by repeatedly:

* 1) Drawing from the model a probability distribution over the next character given the text available so far
* 2) Reweighting the distribution to a certain "temperature"
* 3) Sampling the next character at random according to the reweighted distribution
* 4) Adding the new character at the end of the available text

This is the code we use to reweight the original probability distribution coming out of the model, 
and draw a character index from it (the "sampling function"):

In [None]:
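def sample(preds, temperature=1.0):
    # Work in float64 so the renormalized probabilities sum to 1 within
    # the tolerance that np.random.multinomial requires
    preds = np.asarray(preds).astype('float64')
    # Divide the log-probabilities by the temperature, then re-normalize
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample; the result is a one-hot vector, so argmax gives the index
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)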
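
To see what the reweighting step does on its own, the sketch below applies it to a small hand-made distribution. The helper name `reweight` and the values in `toy_preds` are assumptions made for illustration; the arithmetic is the same as in the first lines of `sample`.

```python
import numpy as np

def reweight(preds, temperature=1.0):
    """Return the temperature-scaled distribution, without sampling from it."""
    preds = np.asarray(preds).astype('float64')
    scaled = np.exp(np.log(preds) / temperature)
    return scaled / np.sum(scaled)

# Hypothetical softmax output over three characters
toy_preds = [0.5, 0.3, 0.2]
for temperature in [0.2, 1.0, 1.2]:
    print(temperature, np.round(reweight(toy_preds, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged. Values below 1.0 concentrate probability on the most likely character, giving more repetitive and predictable text, while values above 1.0 flatten the distribution and give more surprising text.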


Finally, this is the loop where we repeatedly train the model and generate text. We generate text using a range of different temperatures 
after every epoch. This allows us to see how the generated text evolves as the model starts converging, as well as the impact of 
temperature on the sampling strategy.

In [None]:
import random
import sys

for epoch in range(1, 60):
    print('epoch', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')

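    # Note: `generated_text` is not reset between temperatures, so each
    # temperature continues from the text produced by the previous one
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)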

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

epoch 1
--- Generating with seed: "cles. Typically, one level of the ship is fitted with railwa"
------ temperature: 0.2
cles. Typically, one level of the ship is fitted with railway the prosided the first and allow and allow and the are the first and the first and the side and as the state and the company the are and the orred the second the state of the are and as a the are desided as the sist of the compotition in the short with the first and the the first and the and the state of the State Chartic and the and the state and shown and the state and as the season and the st
------ temperature: 0.5
e state and shown and the state and as the season and the stall of the Marran Chastic in the Pirthn in 1918 and partion of the nated the officed in 1963. The stany of the win this the are spanding the state and advided and the the that company they in the compot of the areashing and is county of the provided on the County for a midiling resuan his the allowed as the of the (Chillic and Nover

KeyboardInterrupt: ignored
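
In the output above, the text for temperature 0.5 starts with the tail of the text generated at 0.2, because the loop keeps updating the same `generated_text` across temperatures. If the goal is to compare temperatures from an identical starting point, one option is to move generation into a helper that always restarts from the seed. The sketch below is one possible arrangement, not the notebook's code: `generate`, `predict_fn` and the toy alphabet are hypothetical names, a uniform stand-in replaces the trained model so the block runs on its own, and `Generator.choice` is used in place of the `multinomial`/`argmax` pair to draw from the same reweighted distribution.

```python
import numpy as np

def generate(predict_fn, seed, chars, char_indices, length, temperature, rng):
    """Generate `length` characters starting from `seed` (hypothetical helper)."""
    window = seed
    out = []
    for _ in range(length):
        # One-hot encode the current window, as in the notebook's loop
        sampled = np.zeros((1, len(window), len(chars)))
        for t, ch in enumerate(window):
            sampled[0, t, char_indices[ch]] = 1.0
        preds = np.asarray(predict_fn(sampled)[0], dtype='float64')
        # Temperature reweighting followed by a single random draw
        scaled = np.exp(np.log(preds) / temperature)
        probs = scaled / scaled.sum()
        next_char = chars[rng.choice(len(chars), p=probs)]
        out.append(next_char)
        # Slide the window forward by one character
        window = window[1:] + next_char
    return ''.join(out)

# Stand-in for the trained model: a uniform distribution over a toy alphabet
toy_chars = ['a', 'b', 'c']
toy_indices = {ch: i for i, ch in enumerate(toy_chars)}

def uniform_predict(batch):
    return np.full((len(batch), len(toy_chars)), 1.0 / len(toy_chars))

rng = np.random.default_rng(0)
seed = "abca"
for temperature in [0.2, 0.5, 1.0, 1.2]:
    # Every temperature starts again from the same seed
    print(temperature, seed + generate(uniform_predict, seed, toy_chars,
                                       toy_indices, 20, temperature, rng))
```

With the notebook's model, `predict_fn` could be something like `lambda batch: model.predict(batch, verbose=0)`, with `chars` and `char_indices` passed in place of the toy alphabet.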