
<img src = "https://i.imgur.com/UjutVJd.jpg" align = "center">

# Character Level Text Generation

Here we build a character-level language model that generates text from a given input sequence of characters.

In [0]:
import tensorflow as tf

from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import TimeDistributed
from tensorflow.keras.layers import Reshape
from tensorflow.keras.optimizers import RMSprop, Adam
from tensorflow.keras.utils import get_file

import numpy as np
import random
import sys
import io

# Text Data
To get started, we need data to train the model. You can use any text file you like for this.

A few text datasets are provided here:

In [0]:
dataset = {
    'shakespeare'  : 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt',
    'wonderland'   : 'https://www.gutenberg.org/cache/epub/11/pg11.txt',
    'harry'        : 'https://www.linguistik.uzh.ch/dam/jcr:169bff5c-ac13-457b-9acb-4fe7f1ad5cb0/Harry%20Potter%20and%20the%20Sorcerer.txt',
    'nietzsche'    : 'https://s3.amazonaws.com/text-datasets/nietzsche.txt',
    'frankenstein' : 'https://www.gutenberg.org/files/84/84-0.txt'
}

Pick one dataset:

In [0]:
filename = dataset['frankenstein']
path = get_file( filename.split('/')[-1], origin=filename)


We convert the text to lowercase so that we do not have to worry about capitalization in this example.

In [5]:
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()
print('corpus length:', len(text))


# Take a look at the first 250 characters in text
print(text[:250])

corpus length: 440748
﻿
project gutenberg's frankenstein, by mary wollstonecraft (godwin) shelley

this ebook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  you may copy it, give it away or
re-use it under the terms of the projec
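The first character of the preview above is a byte order mark (`\ufeff`), which would end up in the vocabulary as a character of its own. A minimal sketch, assuming you would rather drop it: Python's `utf-8-sig` codec strips a leading BOM on read. The temporary file and the `demo_` names are hypothetical and exist only to keep the example self-contained.

```python
import io
import os
import tempfile

# Hypothetical demo file that starts with a UTF-8 byte order mark (BOM).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_bom.txt')
with io.open(demo_path, 'w', encoding='utf-8') as f:
    f.write('\ufeffProject Gutenberg')

# Plain 'utf-8' keeps the BOM as the first character of the text.
with io.open(demo_path, encoding='utf-8') as f:
    demo_with_bom = f.read().lower()

# 'utf-8-sig' drops a leading BOM if one is present.
with io.open(demo_path, encoding='utf-8-sig') as f:
    demo_without_bom = f.read().lower()

print(repr(demo_with_bom[:8]))
print(repr(demo_without_bom[:8]))
```

If you apply this to the notebook's `path`, the corpus length and the number of unique characters may each come out one smaller than the values printed here.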


# Encoding
Neural networks work with numbers, not text characters, so we need to convert the input characters into numbers.

First, we sort the list of unique characters that appear in the text, then use `enumerate` to assign each character an integer.

Next, we build two dictionaries that store these key-value pairs: one maps each character to its integer, the other maps each integer back to its character.

In [6]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))


total chars: 69
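To see what the two dictionaries do, here is a small round trip on a toy string. This is only an illustrative sketch: `demo_text` and the other `demo_` names are made up for the example and are not part of the notebook's data.

```python
# Tiny stand-in corpus; the notebook builds the same tables from the full text.
demo_text = 'hello world'

demo_chars = sorted(set(demo_text))
demo_char_indices = {c: i for i, c in enumerate(demo_chars)}
demo_indices_char = {i: c for i, c in enumerate(demo_chars)}

# Encode a string into integer ids, then decode the ids back into a string.
encoded = [demo_char_indices[c] for c in 'hello']
decoded = ''.join(demo_indices_char[i] for i in encoded)

print(demo_chars)
print(encoded)
print(decoded)
```

Because the characters are sorted before they are enumerated, the mapping is deterministic for a given text.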


# Sequence Building

Here we set the maximum length of an input character sequence to 40.

To do that, we cut the whole text into semi-redundant (overlapping) sequences of 40 characters each. The window moves forward by a step of 3 characters, and the target of each sequence is the same window shifted one character to the right.

For example, given the text `"saya suka makan nasi"`, building semi-redundant sequences of length 5 with a step of 2 gives us
* `'saya '` with target `'aya s'`
* `'ya su'` with target `'a suk'`
* `' suka'` with target `'suka '`
* `'uka m'` with target `'ka ma'`
* and so on
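The toy example above can be reproduced with a few lines of Python. This is a minimal sketch: `make_windows` is a hypothetical helper name, and it follows the same slicing as the cell below.

```python
def make_windows(text, maxlen, step):
    """Return (inputs, targets); each target is its input shifted by one character."""
    inputs, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        inputs.append(text[i: i + maxlen])
        targets.append(text[i + 1: i + maxlen + 1])
    return inputs, targets


demo_inputs, demo_targets = make_windows('saya suka makan nasi', maxlen=5, step=2)
for source, target in zip(demo_inputs[:4], demo_targets[:4]):
    print(repr(source), '->', repr(target))
```

A smaller step produces more, and more strongly overlapping, training sequences at the cost of memory and training time.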


In [7]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i+1 : i + maxlen+1])
print('nb sequences:', len(sentences))



nb sequences: 146903


In [8]:
for i in range(10):
  print([sentences[i]],[next_chars[i]])

["\ufeff\nproject gutenberg's frankenstein, by m"] ["\nproject gutenberg's frankenstein, by ma"]
["roject gutenberg's frankenstein, by mary"] ["oject gutenberg's frankenstein, by mary "]
["ect gutenberg's frankenstein, by mary wo"] ["ct gutenberg's frankenstein, by mary wol"]
[" gutenberg's frankenstein, by mary wolls"] ["gutenberg's frankenstein, by mary wollst"]
["tenberg's frankenstein, by mary wollston"] ["enberg's frankenstein, by mary wollstone"]
["berg's frankenstein, by mary wollstonecr"] ["erg's frankenstein, by mary wollstonecra"]
["g's frankenstein, by mary wollstonecraft"] ["'s frankenstein, by mary wollstonecraft "]
[' frankenstein, by mary wollstonecraft (g'] ['frankenstein, by mary wollstonecraft (go']
['ankenstein, by mary wollstonecraft (godw'] ['nkenstein, by mary wollstonecraft (godwi']
['enstein, by mary wollstonecraft (godwin)'] ['nstein, by mary wollstonecraft (godwin) ']


Next we build the training data and its targets as one-hot vectors, looking up each character of the sequences we just built in the dictionary.

In [0]:
# print('Vectorization...')
# Use the builtin bool: the np.bool alias was removed in recent NumPy releases
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
        

for i, sentence in enumerate(next_chars):
    for t, char in enumerate(sentence):
        y[i, t, char_indices[char]] = 1
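As an alternative to the nested loops, the same one-hot layout can be built by indexing an identity matrix with an array of integer ids. The sketch below uses two made-up sequences (`demo_sentences`) and is not how the notebook itself builds `x` and `y`. The last lines estimate the memory needed for one of the notebook's arrays, assuming one byte per boolean element.

```python
import numpy as np

demo_sentences = ['abca', 'bcab']
demo_chars = sorted(set(''.join(demo_sentences)))
demo_char_indices = {c: i for i, c in enumerate(demo_chars)}

# Integer ids with shape (num_sequences, sequence_length).
ids = np.array([[demo_char_indices[c] for c in s] for s in demo_sentences])

# Row k of the identity matrix is the one-hot vector for id k, so indexing
# with the id array yields (num_sequences, sequence_length, vocab_size).
one_hot = np.eye(len(demo_chars), dtype=bool)[ids]

print(one_hot.shape)
print(one_hot[0].astype(int))

# Rough size of one notebook array: sequences * maxlen * vocabulary size.
bytes_per_array = 146903 * 40 * 69
print(round(bytes_per_array / 1e6), 'MB per array')
```

At roughly 405 MB each for `x` and `y`, the arrays fit in memory here, but a longer corpus or a smaller step would grow them proportionally.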
        


# LSTM Model
Now we build a simple network from two stacked LSTM layers with 256 units each, both returning the full sequence. After the LSTM layers we add a TimeDistributed Dense layer with softmax, which predicts the next character at each of the 40 input positions.

In [10]:
model = Sequential()
model.add(LSTM(256, input_shape=(maxlen, len(chars)), return_sequences=True))
model.add(LSTM(256, return_sequences=True))
model.add(TimeDistributed(Dense(len(chars), activation='softmax')))
model.add(Reshape((maxlen, len(chars))))


# learning_rate replaces the deprecated lr argument
optimizer = Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
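To get a feel for the size of this network, the trainable parameters can be counted by hand. The sketch below assumes the standard LSTM formulation with bias terms (four gates, each with input weights, recurrent weights and a bias). The helper names are hypothetical, and the result is worth comparing against `model.summary()`.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units


vocab_size = 69  # "total chars" printed earlier in the notebook
units = 256

first_lstm = lstm_params(vocab_size, units)
second_lstm = lstm_params(units, units)
output_layer = dense_params(units, vocab_size)

print(first_lstm, second_lstm, output_layer)
print('total:', first_lstm + second_lstm + output_layer)
```

The TimeDistributed wrapper shares one Dense layer across all 40 time steps, so it adds no parameters beyond those of the Dense layer itself.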



# Sample Probability Function
Below is a helper function that samples an output character from the probabilities produced by the softmax layer.

In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
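The effect of `temperature` is easier to see without the random draw. The sketch below applies the same rescaling as `sample()` to a made-up three-way distribution. `apply_temperature` and `demo_probs` are hypothetical names used only for this illustration.

```python
import numpy as np


def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector the same way sample() does, without sampling."""
    preds = np.asarray(preds).astype('float64')
    scaled = np.log(preds) / temperature
    exp_preds = np.exp(scaled)
    return exp_preds / np.sum(exp_preds)


demo_probs = [0.5, 0.3, 0.2]
for t in (0.2, 0.7, 1.0, 2.0):
    print(t, np.round(apply_temperature(demo_probs, t), 3))
```

Temperatures below 1 sharpen the distribution toward the most likely character, a temperature of 1 leaves it unchanged, and temperatures above 1 flatten it, which makes the generated text more varied but also more error-prone.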


# Training Checkpoint
Next, we add a callback to the training run so that we can see a sample of generated text every 5 epochs.

In [0]:
def on_epoch_end(epoch, _):
  if epoch%5==0:
    # Function invoked at end of each epoch. Prints generated text.
    print('\n---------------------------------------------------------------------')
    print('>>>>> Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    diversity = 0.7
    print('\n>>>>> diversity:', diversity)

    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('>>>>> Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)

    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds[-1], diversity)
        next_char = indices_char[next_index]

        generated += next_char
        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()
    print('\n---------------------------------------------------------------------')
    print('>>>>> Continuing training')
        
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

# Training Process

Now all that is left is to train our text generator model.

In [17]:
model.fit(x, y,
          batch_size=1080,
          epochs=20,
          callbacks=[print_callback])

Epoch 1/20
---------------------------------------------------------------------
>>>>> Generating text after Epoch: 0

>>>>> diversity: 0.7
>>>>> Generating with seed: "he quitted italy with an attendant, a na"
he quitted italy with an attendant, a native country and sense of my parents were all, when he began to ribernal of women, when all was little in the consummation of my friend and delighted; but i will not credit it.”

“septed on the great and spirit with refugent and labour—i was, in beautience i had contented my mind and then clervancing in a difted how to me.  agatha listened to the heaven to the heavens; i found that i could not
mag
---------------------------------------------------------------------
>>>>> Continuing training
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
---------------------------------------------------------------------
>>>>> Generating text after Epoch: 5

>>>>> diversity: 0.7
>>>>> Generating with seed: "in the woods.

“and now, with the world

<tensorflow.python.keras.callbacks.History at 0x7f4cf0a6aa58>

# Testing Process
Once the model is trained, we test it by generating 400 characters of text.

In [15]:
start_index = random.randint(0, len(text) - maxlen - 1)
diversity = 0.7
print('\n>>>>> diversity:', diversity)

generated = ''
sentence = text[start_index: start_index + maxlen]
generated += sentence
print('>>>>> Generating with seed: "' + sentence + '"')
sys.stdout.write(generated)

for i in range(400):
    x_pred = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(sentence):
        x_pred[0, t, char_indices[char]] = 1.

    preds = model.predict(x_pred, verbose=0)[0]
    next_index = sample(preds[-1], diversity)
    next_char = indices_char[next_index]

    generated += next_char
    sentence = sentence[1:] + next_char

    sys.stdout.write(next_char)
    sys.stdout.flush()
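The generation loop appears twice in this notebook, once in the callback and once in the test cell. One possible way to share it is a small function, sketched below. `generate` and `predict_fn` are hypothetical names, and the uniform "model" is only a stand-in so the example runs without a trained network. Unlike `sample()`, this sketch adds a small epsilon before the logarithm and draws with `Generator.choice`.

```python
import numpy as np


def generate(predict_fn, seed, length, char_indices, indices_char,
             temperature=0.7, rng=None):
    """Append `length` sampled characters to `seed`.

    predict_fn takes a (1, len(seed), vocab) one-hot array and returns
    per-position probabilities with shape (len(seed), vocab).
    """
    if rng is None:
        rng = np.random.default_rng()
    vocab = len(char_indices)
    window, out = seed, seed
    for _ in range(length):
        # One-hot encode the current window of characters.
        x_pred = np.zeros((1, len(window), vocab))
        for t, char in enumerate(window):
            x_pred[0, t, char_indices[char]] = 1.0
        # Use only the prediction for the last position.
        probs = np.asarray(predict_fn(x_pred)[-1], dtype='float64')
        scaled = np.log(probs + 1e-12) / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        next_char = indices_char[int(rng.choice(vocab, p=probs))]
        out += next_char
        window = window[1:] + next_char
    return out


# Hypothetical stand-in for the trained network: uniform probabilities.
demo_chars = sorted(set('abc '))
demo_ci = {c: i for i, c in enumerate(demo_chars)}
demo_ic = {i: c for i, c in enumerate(demo_chars)}


def uniform_predict(x):
    return np.full((x.shape[1], x.shape[2]), 1.0 / x.shape[2])


demo_out = generate(uniform_predict, 'abc ', 20, demo_ci, demo_ic,
                    rng=np.random.default_rng(0))
print(repr(demo_out))
```

With the notebook's model, `predict_fn` could be `lambda x: model.predict(x, verbose=0)[0]`, with `char_indices` and `indices_char` passed in unchanged.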



>>>>> diversity: 0.7
>>>>> Generating with seed: "raw near, he aimed a gun,
which he carri"
raw near, he aimed a gun,
which he carried wathful, for it was to be restless.  in darkness; but they were clouded my intended to the heavens; and yet you are the stranger repented all was dry; all sleep.

“it was allied to this genius and uried see presented
in his
family!  precious to a coin, i found my own spirit to the sun was opened and destroy the place in the same time than the cause of her dearest conviction that i could not ref

<p>Copyright &copy; 2019 <a href=https://www.linkedin.com/in/andityaarifianto/>ADF</a> </p>