### Generate Irish Song with RNN 

![image.png](attachment:image.png)

An RNN passes its state along the sequence, one step at a time.<br>
The sequence can be long, but the context weakens as it is carried forward: the word at position 1 has very little influence on the word at position 100.<br>
We need neither purely long-term nor purely short-term memory; we need both, which is what an LSTM (long short-term memory) provides.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

An LSTM uses a cell state, which holds context and can be maintained across many time steps, so it can carry meaning from the beginning of the sentence. An LSTM can also be made bidirectional, so that later words provide context for the current word.

tf.keras.layers.Embedding(tokenizer.vocab_size, 64)<br>
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))

We first define the LSTM layer. It takes a numeric parameter giving the number of hidden nodes within it, which is also the dimensionality of its output. Wrapping it in Bidirectional may not be useful in every case, but it is worth trying.

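When we stack multiple bidirectional layers, every layer except the last must set **return_sequences=True** so that it passes the full sequence on to the next layer<br>
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))<br>
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))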
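
To see why `return_sequences=True` matters when stacking, it can help to trace the output shapes by hand. The sketch below is plain shape arithmetic, not a Keras call: `stacked_bilstm_shapes` is a hypothetical helper, and it assumes the default behaviour of `Bidirectional`, where the forward and backward outputs are concatenated.

```python
def stacked_bilstm_shapes(timesteps, features, layers):
    """Trace output shapes through stacked bidirectional LSTM layers.

    `layers` is a list of (units, return_sequences) pairs.
    Hypothetical helper: shape bookkeeping only, no TensorFlow involved.
    """
    shape = ("batch", timesteps, features)
    shapes = []
    for units, return_sequences in layers:
        # an LSTM consumes 3D input: (batch, timesteps, features)
        if len(shape) != 3:
            raise ValueError("LSTM needs 3D input, got %r" % (shape,))
        if return_sequences:
            # one output vector per time step, forward + backward concatenated
            shape = ("batch", shape[1], 2 * units)
        else:
            # only the final output, forward + backward concatenated
            shape = ("batch", 2 * units)
        shapes.append(shape)
    return shapes


# first layer returns the full sequence, so the second layer can consume it
print(stacked_bilstm_shapes(15, 64, [(64, True), (64, False)]))
```

If the first layer is given `return_sequences=False`, its output is 2D and the helper raises a `ValueError`, which mirrors the kind of shape error the second LSTM would hit.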

#### import libraries 

In [1]:
import tensorflow as tf

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import numpy as np

In [2]:
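physical_devices = tf.config.list_physical_devices('GPU')
# enable memory growth on each GPU found; the loop is a no-op on CPU-only
# machines, where indexing physical_devices[0] would raise an IndexError
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)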

##### change this cell to code if you want to download the dataset

!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/irish-lyrics-eof.txt \
    -O ../datasets/irish_song/irish-lyrics-eof.txt

In [3]:
! ls ../datasets/irish_song/

irish-lyrics-eof.txt


In [4]:
tokenizer = Tokenizer()

# read the whole lyrics file and close it afterwards
with open('../datasets/irish_song/irish-lyrics-eof.txt') as f:
    data = f.read()

# we break at every line break to get the corpus
corpus = data.lower().split("\n")

tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

print(tokenizer.word_index)
print(total_words)


2690


##### When generating text we don't need a validation dataset; we use every bit of data we have<br>
Now we turn our corpus into training data. Training data is built differently when we are generating text.

In [5]:
input_sequences = []

for line in corpus:
    # convert the current line into a sequence of token ids
    token_list = tokenizer.texts_to_sequences([line])[0]
    # build the n-gram sequences: every prefix of the line with at least two tokens
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
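
As a quick illustration of what the inner loop produces, here is a minimal sketch on a single made-up line. The token ids in `line_tokens` are hypothetical and `prefix_sequences` is a helper introduced only for this example.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]


# hypothetical token ids for one tokenized line
line_tokens = [4, 2, 66, 8]

for seq in prefix_sequences(line_tokens):
    print(seq)
```

A line with four tokens yields three training sequences, each one token longer than the previous one.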

#### pad sequences

In [6]:
# length of the longest n-gram sequence, which equals the number of tokens
# in the longest line of the corpus
max_sequence_len = max(len(x) for x in input_sequences)

In [7]:
input_sequences = np.array(pad_sequences(input_sequences,
                                        maxlen=max_sequence_len,
                                        padding='pre'))
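
The effect of `padding='pre'` can be sketched in plain Python. `pad_pre` below is a simplified stand-in written for this example, not the Keras implementation; it assumes zeros as the padding value and, like the Keras default, keeps the last `maxlen` tokens when a sequence is too long.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` so that all have length maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# hypothetical n-gram sequences from one line
sequences = [[4, 2], [4, 2, 66], [4, 2, 66, 8]]
for row in pad_pre(sequences, maxlen=4):
    print(row)
```

Padding on the left keeps the most recent words, and the word to be predicted, aligned at the right-hand end of every row.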

**For each line, our inputs are transformed as shown below**<br>
This is ideal for giving us x and y<br>**we can take all values except the last as x and the last value as y**

![image.png](attachment:image.png)

**create predictors and labels - will generate xs and ys**

In [8]:
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]
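
The same split can be followed on a tiny example without NumPy. The rows in `padded` are hypothetical pre-padded sequences, used here only to show which part becomes the input and which becomes the label.

```python
# hypothetical pre-padded n-gram sequences
padded = [
    [0, 0, 4, 2],
    [0, 4, 2, 66],
    [4, 2, 66, 8],
]

# everything except the last token is the input
xs = [row[:-1] for row in padded]
# the last token is the word the model should predict
labels = [row[-1] for row in padded]

print(xs)
print(labels)
```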

**now we do one hot encoding**

In [9]:
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
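
A minimal sketch of what one-hot encoding does to the labels, in plain Python. `one_hot` is a hypothetical helper for this example, and the small `num_classes` is made up; in the notebook it is `total_words`, which includes index 0 (reserved for padding, since the tokenizer's word indices start at 1).

```python
def one_hot(label, num_classes):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


# hypothetical labels and vocabulary size
labels = [2, 5, 1]
num_classes = 6

ys = [one_hot(label, num_classes) for label in labels]
for row in ys:
    print(row)
```

Each row has one entry per word in the vocabulary, which is why the final `Dense` layer also has `total_words` outputs.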

![image.png](attachment:image.png)

#### train a simple model, unoptimized 

In [10]:
model = Sequential()
# 240 embedding dimensions because the vocabulary is large
# input length is max_sequence_len - 1 because the last token of each sequence is used as the label y
model.add(Embedding(total_words, 240, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))
adam = Adam(learning_rate=0.01)  # `lr` is a deprecated alias for learning_rate

In [11]:
model.compile(loss='categorical_crossentropy', optimizer=adam,
              metrics=['accuracy'])

In [12]:
history = model.fit(xs, ys, epochs=100, verbose=1)

Epoch 1/100
...
Epoch 100/100


This is very unstructured data, so don't expect a great result; we should aim for a training accuracy of around 70-80 percent.

### Generate text 

In [15]:
seed_text = "I made a poetry machine"
next_words = 100
  
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
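    # predict_classes is deprecated; take the argmax of the softmax output instead
    predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
    # map the predicted index back to its word ("" if the index has no word, e.g. padding index 0)
    output_word = tokenizer.index_word.get(predicted, "")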
    seed_text += " " + output_word
print(seed_text)

I made a poetry machine could i oft times too late are you cannot hear me mothers house with whiskey duns alone as bees teacher as any or else ill expire way wid side by side by side by side that proud ran free cheeks unto my lot so
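
The generation loop above always appends the single most likely next word, which is usually called greedy decoding. The sketch below shows the same loop with a toy stand-in for the model so it runs without TensorFlow: `index_word`, `toy_predict` and `generate` are all hypothetical, and the "prediction" is a fixed rule rather than a trained network.

```python
# hypothetical vocabulary; index 0 is left free for padding
index_word = {1: "the", 2: "wind", 3: "blows", 4: "softly"}
word_index = {word: index for index, word in index_word.items()}


def toy_predict(token_ids):
    """Stand-in for model.predict: one score per vocabulary index."""
    next_index = token_ids[-1] % 4 + 1  # toy rule: cycle through 1..4
    scores = [0.0] * 5
    scores[next_index] = 1.0
    return scores


def generate(seed_text, next_words):
    text = seed_text
    for _ in range(next_words):
        # words missing from the vocabulary are skipped
        token_ids = [word_index[w] for w in text.lower().split() if w in word_index]
        scores = toy_predict(token_ids)
        # argmax: index of the highest score
        predicted = max(range(len(scores)), key=scores.__getitem__)
        text += " " + index_word.get(predicted, "")
    return text


print(generate("the wind", 3))
```

Because the highest-scoring word is always chosen, the output can fall into cycles; this may be one reason for repeated phrases such as "side by side by side" in the generated lyrics.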