In [None]:
"""
Text generation can be done simply by predicting the next most likely word, given an input sequence.
This can be repeated by feeding the original input sequence, plus the newly predicted word, back into the model as the next input sequence.
As such, the output generated from a very short original input can be as long as you want it to be.

The only real change to the network is that the output layer now has one node per possible word - so, if you have 1,000 possible words in your corpus, you’d have an output array of length 1,000.
You’ll also need to change the loss function from binary cross-entropy to categorical cross-entropy - before, we had only a 0 or 1 as output; now there are potentially thousands of output “classes” (one per possible word).

A text generation model takes an input sequence and outputs a probability for each possible next word.
"""
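
The feed-back loop described above can be sketched without a neural network. In this minimal illustration the "model" is a hypothetical hand-written lookup table (`next_word_table`) that maps the last word to its most likely successor. A trained network would replace that lookup, but the loop around it would look much the same.

```python
# Toy stand-in for a trained model: maps the last word to its most likely successor.
# The table and its contents are made up purely for illustration.
next_word_table = {
    "i": "went",
    "went": "to",
    "to": "the",
    "the": "beach",
}

def generate(seed_text, num_words):
    """Repeatedly predict one word and append it to the running sequence."""
    words = seed_text.split()
    for _ in range(num_words):
        prediction = next_word_table.get(words[-1])
        if prediction is None:  # the toy table has no successor for this word
            break
        words.append(prediction)  # the output becomes part of the next input
    return " ".join(words)

generated = generate("i", 4)
print(generated)
```

Each pass through the loop sees the original seed plus everything predicted so far, which is why a one-word seed can grow into an arbitrarily long output.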

In [None]:
"""
As noted before, there are hardly any differences in the model itself, other than changing the number of nodes in the output layer and changing the loss function.

The more obvious changes come in working with the input and output data. Each input is a chunk of a sequence, and the word that immediately follows it is split off as its label. So, if we had the sentence “I went to the beach with my dog”, and we had a max input length of five words, we’d get:

Input: I went to the beach

Label: with

Now, that’s not the only sequence that will come from the sentence! We would also get:

Input: went to the beach with

Label: my

And:

Input: to the beach with my

Label: dog

That’s how the N-Grams used in the pre-processing work - a single input sequence might actually become a series of sequences and labels.
"""
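
The windowing described above can be sketched in a few lines of plain Python. The helper name `make_windows` is an assumption made for illustration. Note that the preprocessing code further down builds growing prefixes of each line and pads them, rather than fixed five-word windows, but the split into input and label works the same way.

```python
def make_windows(sentence, max_input_len):
    """Slide a window over the words; the word after each window is its label."""
    words = sentence.split()
    pairs = []
    for start in range(len(words) - max_input_len):
        inputs = words[start:start + max_input_len]
        label = words[start + max_input_len]
        pairs.append((" ".join(inputs), label))
    return pairs

pairs = make_windows("I went to the beach with my dog", 5)
for inputs, label in pairs:
    print("Input:", inputs, "| Label:", label)
```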

In [None]:
###Constructing a Text Generation Model

In [None]:
#Using most of the techniques you've already learned, it's now possible to generate new text by predicting the next word that follows a given seed word. To practice this method, we'll use the Kaggle Song Lyrics Dataset.

In [1]:
#Import TensorFlow and related functions

import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Other imports for processing data
import string
import numpy as np
import pandas as pd

In [2]:
###Get the Dataset
#As noted above, we'll utilize the Song Lyrics dataset on Kaggle.

!wget --no-check-certificate \
    https://drive.google.com/uc?id=1LiJFZd41ofrWoBtW-pMYsfz1w8Ny0Bj8 \
    -O /tmp/songdata.csv

'wget' is not recognized as an internal or external command,
operable program or batch file.


###First 10 Songs
#Let's first look at just 10 songs from the dataset, and see how things perform.

In [4]:
###Preprocessing
#Let's perform some basic preprocessing to get rid of punctuation and make everything lowercase. We'll then split the lyrics up by line and tokenize the lyrics.

def tokenize_corpus(corpus, num_words=-1):
  # Fit a Tokenizer on the corpus
  if num_words > -1:
    tokenizer = Tokenizer(num_words=num_words)
  else:
    tokenizer = Tokenizer()
  tokenizer.fit_on_texts(corpus)
  return tokenizer

def create_lyrics_corpus(dataset, field):
  # Remove all punctuation (a translation table avoids regex-escaping issues)
  dataset[field] = dataset[field].str.translate(str.maketrans('', '', string.punctuation))
  # Make it lowercase
  dataset[field] = dataset[field].str.lower()
  # Make it one long string to split by line
  lyrics = dataset[field].str.cat()
  corpus = lyrics.split('\n')
  # Remove any trailing whitespace
  corpus = [line.rstrip() for line in corpus]
  # Remove any empty lines
  corpus = [line for line in corpus if line != '']

  return corpus
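
To see what these cleaning steps do, here is a minimal sketch that applies the same sequence (strip punctuation, lowercase, split by line, trim, drop empty lines) to a single made-up string using only the standard library. The sample lyric is invented for illustration.

```python
import string

raw_lyrics = "Hello, World!  \n\nIt's a beautiful day...\n"

# Strip punctuation, lowercase, split on newlines, trim, and drop empty lines
no_punct = raw_lyrics.translate(str.maketrans('', '', string.punctuation))
lines = [line.rstrip() for line in no_punct.lower().split('\n')]
cleaned = [line for line in lines if line != '']
print(cleaned)
```

Because apostrophes count as punctuation, contractions lose them ("It's" becomes "its"), which is presumably why the seed text used later is written as "im" rather than "I'm".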

In [3]:
# Read the dataset from csv. Note: this loads every song; append [:10] to the call to keep just the first 10 songs
dataset = pd.read_csv('C:/Users/fortn/anaconda3/songdata.csv')

In [6]:
# Create the corpus using the 'text' column containing lyrics
corpus = create_lyrics_corpus(dataset, 'text')
# Tokenize the corpus
tokenizer = tokenize_corpus(corpus)

total_words = len(tokenizer.word_index) + 1

print(tokenizer.word_index)
print(total_words)



98215


In [8]:
###Create Sequences and Labels
#After preprocessing, we next need to create sequences and labels. Creating the sequences is similar to before with texts_to_sequences, but now also uses N-Grams; the labels are taken from those same sequences and one-hot encoded over all potential output words.

sequences = []
for line in corpus:
	token_list = tokenizer.texts_to_sequences([line])[0]
	for i in range(1, len(token_list)):
		n_gram_sequence = token_list[:i+1]
		sequences.append(n_gram_sequence)
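
As a small illustration of what the loop above produces, the sketch below builds the same growing prefixes for one line, using a hypothetical five-word index in place of the fitted tokenizer. A line of n known words yields n - 1 sequences.

```python
# Hypothetical word index, in the style of tokenizer.word_index (indices start at 1)
word_index = {"i": 1, "went": 2, "to": 3, "the": 4, "beach": 5}

def line_to_prefixes(line):
    """Return every prefix of the line's token list that has at least two tokens."""
    token_list = [word_index[w] for w in line.split() if w in word_index]
    return [token_list[:i + 1] for i in range(1, len(token_list))]

prefixes = line_to_prefixes("i went to the beach")
print(prefixes)
```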

In [10]:
# Pad sequences for equal input length 
max_sequence_len = max([len(seq) for seq in sequences])
sequences = np.array(pad_sequences(sequences, maxlen=max_sequence_len, padding='pre'))

In [11]:
# Split sequences between the "input" sequence and "output" predicted word
input_sequences, labels = sequences[:,:-1], sequences[:,-1]
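
The padding and splitting steps can be traced on a tiny example. This sketch uses a hypothetical helper, `pad_pre`, as a plain-Python stand-in for `pad_sequences(..., padding='pre')`. It only pads and does not truncate, which is enough here because `maxlen` is the longest sequence.

```python
def pad_pre(seqs, maxlen):
    """Left-pad each sequence with zeros to maxlen (stand-in for pad_sequences)."""
    return [[0] * (maxlen - len(s)) + s for s in seqs]

prefixes = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
max_len = max(len(s) for s in prefixes)
padded = pad_pre(prefixes, max_len)

# Everything but the last column is the input; the last column is the label
inputs = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]
print(inputs)
print(labels)
```

Padding on the left keeps the most recent words next to the label, so the final column always holds the word to be predicted.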

In [12]:
print(input_sequences)

[[   0    0    0 ...    0    0  129]
 [   0    0    0 ...    0  129   69]
 [   0    0    0 ...  129   69   67]
 ...
 [   0    0    0 ...   68   19 2197]
 [   0    0    0 ...   19 2197    4]
 [   0    0    0 ...    0    0   30]]


In [14]:
print(labels)

[ 69  67 173 ...   4   3 916]


In [15]:
np.shape([labels])

(1, 10625150)

In [16]:
np.shape([total_words])

(1,)

In [29]:
# One-hot encode the labels
#one_hot_labels = tf.keras.utils.to_categorical(labels, num_classes=total_words)
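
A rough size estimate suggests why one-hot encoding every label is impractical at this scale. The sketch below uses the two numbers printed above (10,625,150 labels and a vocabulary of 98,215) and assumes 4-byte entries. One possible workaround, not used in this notebook, is to keep the integer labels as they are and compile the model with Keras's `sparse_categorical_crossentropy` loss, which accepts integer class indices directly.

```python
num_labels = 10_625_150   # number of label entries, from the shape printed above
vocab_size = 98_215       # total_words printed above
bytes_per_entry = 4       # assumption: 32-bit floats

one_hot_entries = num_labels * vocab_size
one_hot_bytes = one_hot_entries * bytes_per_entry
print(f"{one_hot_entries:,} entries, about {one_hot_bytes / 1e12:.2f} TB")

# Integer labels need one entry per example instead of one per vocabulary word
sparse_bytes = num_labels * bytes_per_entry
print(f"integer labels: about {sparse_bytes / 1e6:.1f} MB")
```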

In [19]:
# Check out how some of our data is being stored
# The Tokenizer has just a single index per word
print(tokenizer.word_index['know'])
print(tokenizer.word_index['feeling'])
# Input sequences will have multiple indexes
print(input_sequences[5])
print(input_sequences[6])
# And the one hot labels will be as long as the full spread of tokenized words
#print(one_hot_labels[5])
#print(one_hot_labels[6])

26
239
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 129  69  67 173  25   6]
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 129
  69  67 173  25   6 995]


In [None]:
"""
Train a Text Generation Model
Building an RNN to train our text generation model will be very similar to the sentiment models you've built previously. The only real change necessary is to make sure to use Categorical instead of Binary Cross Entropy as the loss function - we could use Binary before since the sentiment was only 0 or 1, but now there is one category per word in the vocabulary.

From there, we should also consider using more epochs than before, as text generation can take a little longer to converge than sentiment analysis, and we aren't working with all that much data yet. I'll set it at 200 epochs here since we're only using part of the dataset, and training will tail off quite a bit over that many epochs.
"""
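
To make the loss concrete, here is a minimal sketch of categorical cross-entropy for a single example, written in plain Python rather than with Keras. The four-word vocabulary and the probabilities are made up for illustration. With a one-hot label, the sum reduces to the negative log of the probability assigned to the true word.

```python
import math

def categorical_cross_entropy(one_hot_label, predicted_probs):
    """Loss for one example: -sum(y_i * log(p_i)), which reduces to -log(p_true)."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, predicted_probs) if y > 0)

# Toy 4-word vocabulary; the true next word is at index 2
label = [0, 0, 1, 0]
confident = [0.05, 0.05, 0.85, 0.05]
unsure = [0.25, 0.25, 0.25, 0.25]

loss_confident = categorical_cross_entropy(label, confident)
loss_unsure = categorical_cross_entropy(label, unsure)
print(loss_confident, loss_unsure)
```

The loss is small when the model puts most of its probability on the correct word and grows as that probability shrinks.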

In [22]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional

model = Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(20)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
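
To get a feel for the size of the model defined above, the trainable parameter count can be worked out by hand. This sketch assumes the vocabulary size of 98,215 printed earlier, the standard Keras LSTM layout (four gates, each with input weights, recurrent weights, and one bias vector), and the default behaviour of `Bidirectional`, which concatenates the two directions.

```python
vocab_size = 98_215   # total_words printed earlier
embed_dim = 64
lstm_units = 20

# Embedding: one embed_dim-sized vector per vocabulary index
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional_params = 2 * lstm_params  # forward and backward copies

# Dense: the two LSTM directions are concatenated, so it sees 2 * lstm_units inputs
dense_params = (2 * lstm_units) * vocab_size + vocab_size

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding and output layers, both of which scale with the vocabulary size.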

In [27]:
#history = model.fit(input_sequences, one_hot_labels, epochs=200, verbose=1)

In [28]:
###View the Training Graph
"""
import matplotlib.pyplot as plt

def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.show()

plot_graphs(history, 'accuracy')
"""


In [30]:
"""
Generate new lyrics!
It's finally time to generate some new lyrics from the trained model, and see what we get. To do so, we'll provide some "seed text", or an input sequence for the model to start with. We'll also decide how long an output sequence we want - this could essentially be unlimited, as the input plus the previous output is continuously fed back in for a new output word (once the text exceeds our max sequence length, pad_sequences keeps only the most recent tokens).
"""


In [31]:
seed_text = "im feeling chills"
next_words = 100
  
for _ in range(next_words):
	# Encode the running text and left-pad (or truncate) it to the model's input length
	token_list = tokenizer.texts_to_sequences([seed_text])[0]
	token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
	# Pick the index of the most probable next word
	predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
	# Map the index back to its word (index 0 is padding and has no word)
	output_word = tokenizer.index_word.get(predicted, "")
	seed_text += " " + output_word
print(seed_text)

im feeling chills tminus tminus appeard radioactivity radioactivity winkin jasper decisis decisis biieetch biieetch elem elem elem elem measured measured ghiddy measured wildand iiicy concentrate howd coaldusted sharper rocknrolling swimin organizers bach mij drugzoooh songstress fetus foreplay fiddadidadam conceived conceived prussianblue misrepresented hammocks gs3 beckett trembleon guardo hunn hypodermics angadai burgundy friendboy writhes gajyeoga whipple rainthats slanger reports geilheit hines terrarium peaux pyaar zoloft revolting revolting remniscin matang matang varmints missin antibeat parecerte jalapeno dampest binansagang mailmans licketysplit regurgetated abut eeheeee eponineinterjections kazataima kincade discotecque carrry carrry airports wizardry sipain hardtoget slipside ralla gonethe lickin bydd exuma brrrrrrrr loooveee bionic leprechaun graph folarin
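
With 100 generated words, the running text soon exceeds the model's input length of `max_sequence_len-1` tokens (24, judging by the input rows printed earlier). From that point on, `pad_sequences` drops tokens from the front, so only the most recent words influence each prediction. The sketch below mimics that default behaviour for a single sequence with a hypothetical helper, `pad_or_truncate_pre`.

```python
def pad_or_truncate_pre(tokens, maxlen):
    """Mimic pad_sequences defaults for one sequence: keep the last maxlen tokens,
    and left-pad with zeros if the sequence is shorter."""
    tokens = tokens[-maxlen:]
    return [0] * (maxlen - len(tokens)) + tokens

short = pad_or_truncate_pre([7, 8, 9], 5)
long = pad_or_truncate_pre([1, 2, 3, 4, 5, 6, 7], 5)
print(short)
print(long)
```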
