# Bob Ross Episode Text Generator
> The following shows how to create a text generator using LSTMs in Keras.

- toc: true 
- badges: true
- comments: true
- categories: [nlp, keras]

This project shows how we can gather data and build a model to generate text in the style of Bob Ross.
To gather the data, we'll use a script called [download-yt-playlist.py](bob_ross/scripts/download-yt-playlist.py) that uses the YouTube API to download a Bob Ross playlist. This playlist contains most of the Bob Ross episodes, as well as the transcript for each episode.


In [33]:
!pip install beautifulsoup4

Collecting beautifulsoup4
  Using cached https://files.pythonhosted.org/packages/f8/c7/741c97d7366f4779ca73d244904978b43a81fd37d85fcf05ad19d472c1ce/beautifulsoup4-4.6.3-py2-none-any.whl
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.6.3


In [34]:
import pandas as pd
import tensorflow as tf
from bs4 import BeautifulSoup
import numpy as np

Next, we'll import the dataset that we created using the `download-yt-playlist` script.
The CSV is included in the repo.

We'll then load the dataset into a pandas DataFrame.
Our CSV contains 249 rows, one for each episode returned by the script.
We drop any rows with missing values, since not all of the episodes had a transcript.

In [35]:
df = pd.read_csv('bob_ross/bob_ross_episodes.csv', index_col=0, parse_dates=['snippet.publishedAt'], usecols=['snippet.description', 'snippet.publishedAt', 'snippet.title', 'transcript'])
df.dropna(inplace=True)
# df['snippet.publishedAt'] =pd.to_datetime(df['snippet.publishedAt'])
df.sort_values(by='snippet.publishedAt', inplace=True)
df.head()

Unnamed: 0_level_0,snippet.publishedAt,snippet.title,transcript
snippet.description,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Season 21 of The Joy of Painting with Bob Ross features the following wonderful painting instructions: Valley View, Tranquil Dawn, Royal Majesty, Serenity, Cabin at Trails End, Mountain Rhapsody, Wilderness Cabin, By the Sea, Indian Summer, Blue Winter, Desert Glow, Lone Mountain, and Florida’s Glory.\n\nSubscribe to the official Bob Ross YouTube channel - http://bit.ly/BobRossSubscribe\n\nSeason 21 Playlist: https://www.youtube.com/playlist?list=PLAEQD0ULngi5_UcEWkQZu23WzQP1Tkxq3\n\nThe Joy of Painting Season 20 is now on iTunes! - http://bit.ly/iTunesBobRoss\n\nOfficial Bob Ross website - http://www.BobRoss.com\n\nOfficial Bob Ross Twitch.tv Stream! - http://twitch.tv/BobRoss\n\nAll episodes of Bob Ross are now live on Roku - http://bit.ly/BobRossOnRoku\n\nOriginally aired on 9/5/1990",2015-03-25 16:32:35,Bob Ross - Valley View (Season 21 Episode 1),"(bright music) - Hello, I&#39;m Bob Ross and\n..."
"Season 6 of The Joy of Painting with Bob Ross features the following wonderful painting instructions: Blue River, Nature's Edge, Morning Mist, Whispering Stream, Secluded Forest, Snow Trail, Arctic Beauty, Horizons West, High Chateau, Country Life, Western Expanse, Marshlands, and Blaze of Color.\n\nSubscribe to the official Bob Ross YouTube channel - http://bit.ly/BobRossSubscribe\n\nSeason 6 Playlist: https://www.youtube.com/playlist?list=PLAEQD0ULngi5UR35RJsvL0Xvlm3oeY4Ma\n\nThe Joy of Painting : Season 20 is now on iTunes! http://bit.ly/iTunesBobRoss\n\nOfficial Bob Ross website - http://www.BobRoss.com\n\nOfficial Bob Ross Twitch.tv Stream! - http://twitch.tv/BobRoss",2015-03-27 17:01:20,Bob Ross - Arctic Beauty (Season 6 Episode 7),- Welcome back. Awful glad you could join me t...
"Season 6 of The Joy of Painting with Bob Ross features the following wonderful painting instructions: Blue River, Nature's Edge, Morning Mist, Whispering Stream, Secluded Forest, Snow Trail, Arctic Beauty, Horizons West, High Chateau, Country Life, Western Expanse, Marshlands, and Blaze of Color.\n\nSubscribe to the official Bob Ross YouTube channel - http://bit.ly/BobRossSubscribe\n\nSeason 6 Playlist: https://www.youtube.com/playlist?list=PLAEQD0ULngi5UR35RJsvL0Xvlm3oeY4Ma\n\nThe Joy of Painting : Season 20 is now on iTunes! http://bit.ly/iTunesBobRoss\n\nOfficial Bob Ross website - http://www.BobRoss.com\n\nOfficial Bob Ross Twitch.tv Stream! - http://twitch.tv/BobRoss",2015-03-27 17:24:24,Bob Ross - Horizons West (Season 6 Episode 8),"- Welcome back, I&#39;m awful\nglad to see you..."
"Season 6 of The Joy of Painting with Bob Ross features the following wonderful painting instructions: Blue River, Nature's Edge, Morning Mist, Whispering Stream, Secluded Forest, Snow Trail, Arctic Beauty, Horizons West, High Chateau, Country Life, Western Expanse, Marshlands, and Blaze of Color.\n\nSubscribe to the official Bob Ross YouTube channel - http://bit.ly/BobRossSubscribe\n\nSeason 6 Playlist: https://www.youtube.com/playlist?list=PLAEQD0ULngi5UR35RJsvL0Xvlm3oeY4Ma\n\nThe Joy of Painting : Season 20 is now on iTunes! http://bit.ly/iTunesBobRoss\n\nOfficial Bob Ross website - http://www.BobRoss.com\n\nOfficial Bob Ross Twitch.tv Stream! - http://twitch.tv/BobRoss",2015-03-27 18:16:39,Bob Ross - Blue River (Season 6 Episode 1),"(peaceful instrumental music) - Hello, I&#39;m..."
"Season 5 of The Joy of Painting with Bob Ross features the following wonderful painting instructions: Mountain Waterfall, Twilight Meadow, Mountain Blossoms, Winter Stillness, Quiet Pond, OCean Sunrise, Bubbling Brook, Arizona Splendor, Anatomy of a Wave, The Windmill, Autumn Glory, Indian Girl, and Meadow Stream.\n\nSubscribe to the official Bob Ross YouTube channel - http://bit.ly/BobRossSubscribe\n\nSeason 5 Playlist: https://www.youtube.com/playlist?list=PLAEQD0ULngi6bAFRfcqgpKP4T4SnoxoAz\n\nThe Joy of Painting : Season 20 is now on iTunes! http://bit.ly/iTunesBobRoss\n\nOfficial Bob Ross website - http://www.BobRoss.com\n\nOfficial Bob Ross Twitch.tv Stream! - http://twitch.tv/BobRoss",2015-03-27 18:34:49,Bob Ross - Twilight Meadow (Season 5 Episode 2),"- Hi, welcome back. I&#39;m glad to see you to..."


The following will build our text generator.
We'll do the following:
- load a sample of the dataset (about 30%),
- combine all the transcripts into one long string,
- use BeautifulSoup to remove any HTML tags in the text,
- generate a list of all the characters in the transcripts.
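
As a small illustration of the cleanup and vocabulary steps listed above, here is a sketch on a toy string. It uses the standard library's `html.unescape` as a stand-in for BeautifulSoup (an assumption, so that the snippet runs without extra dependencies), and the sample text and the `toy_*` names are made up for this example.

```python
import html

# Toy transcript fragment (made up) with an HTML entity and a newline,
# similar to the `&#39;` and `\n` sequences visible in the raw transcripts.
toy_raw = "- Hello, I&#39;m Bob Ross and\nwelcome back."

# Stand-in for BeautifulSoup(...).get_text(): decode entities, then flatten newlines.
toy_clean = html.unescape(toy_raw).replace("\n", " ")

# Character vocabulary plus the two lookup tables used for encoding and decoding.
toy_chars = sorted(set(toy_clean))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for c, i in toy_char_indices.items()}

print(toy_clean)
print(len(toy_chars), "unique characters")
```

The two dictionaries are inverses of each other, which is what later lets us turn a predicted index back into a character.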

In [36]:
# only use about 30% of the rows
test_df = df.sample(frac=.3)

In [37]:
len(test_df)

75

In [38]:
# combine all transcripts into one long string
all_transcriptions = ''
for index, row in test_df.iterrows():
    # strip HTML markup/entities and flatten newlines into spaces
    all_transcriptions += BeautifulSoup(row['transcript'], "lxml").get_text().replace('\n', ' ')

In [39]:
len(all_transcriptions)

1522162

Next, we'll display the first 100 characters of `all_transcriptions` to see what it looks like.

In [40]:
all_transcriptions[:100]

"- Hi, welcome back. I'm certainly glad you could join us today. And, as you can see, today I have on"

In [41]:
chars = sorted(list(set(all_transcriptions)))
print('Count of unique characters (i.e., features):', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

Count of unique characters (i.e., features): 81


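Next, we'll generate separate lists of all the strings that we'll feed into the model.
Each entry is a 40-character window of the full text, and consecutive windows start 3 characters apart (`step`). For each window we also record the character that immediately follows it, which is what the model learns to predict.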

In [42]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(all_transcriptions) - maxlen, step):
    sentences.append(all_transcriptions[i: i + maxlen])
    next_chars.append(all_transcriptions[i + maxlen])
print('nb sequences:', len(sentences))

print(sentences[:10], "\n")
print(next_chars[:10])

nb sequences: 507374
["- Hi, welcome back. I'm certainly glad y", "i, welcome back. I'm certainly glad you ", "welcome back. I'm certainly glad you cou", "come back. I'm certainly glad you could ", "e back. I'm certainly glad you could joi", "ack. I'm certainly glad you could join u", ". I'm certainly glad you could join us t", "'m certainly glad you could join us toda", 'certainly glad you could join us today. ', 'tainly glad you could join us today. And'] 

['o', 'c', 'l', 'j', 'n', 's', 'o', 'y', 'A', ',']


We now have 507374 sequences, each containing 40 characters of the text.
The first sequence is `- Hi, welcome back. I'm certainly glad y`, followed by `i, welcome back. I'm certainly glad you `.
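
To make the windowing easier to see, here is a minimal sketch that applies the same slicing to a short made-up string, and then checks the sequence count against the text length reported earlier. The helper name `make_windows` and the toy string are assumptions for illustration.

```python
def make_windows(text, maxlen, step):
    """Return (windows, next_chars) using the same slicing as the cell above."""
    windows, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i: i + maxlen])
        next_chars.append(text[i + maxlen])
    return windows, next_chars

toy_windows, toy_next = make_windows("happy little trees", maxlen=5, step=3)
for window, following in zip(toy_windows, toy_next):
    print(repr(window), "->", repr(following))

# The number of windows equals the number of start positions in the range.
total_chars = 1522162  # len(all_transcriptions) reported earlier
n_sequences = len(range(0, total_chars - 40, 3))
print(n_sequences)  # 507374
```

Because `step` is smaller than `maxlen`, neighbouring windows overlap heavily, which is why the code comment calls them semi-redundant.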

Next, we'll create the tensors `x` and `y`, which hold one-hot encodings of the sequences and of the next characters we've created.

In [43]:
# np.bool was removed in NumPy 1.24; the builtin bool is the equivalent dtype
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
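
The same encoding on a tiny made-up example may help show what `x` and `y` look like. This is only a sketch: the `toy_*` names and the three-character vocabulary are assumptions, and the last lines estimate the size of the full `x` array from the counts reported earlier.

```python
import numpy as np

toy_text = "abcab"
toy_chars = sorted(set(toy_text))                 # ['a', 'b', 'c']
toy_index = {c: i for i, c in enumerate(toy_chars)}

toy_sentences = ["abc", "bca"]                    # two windows of length 3
toy_next = ["a", "b"]                             # character following each window

toy_x = np.zeros((len(toy_sentences), 3, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        toy_x[i, t, toy_index[char]] = True       # one True per time step
    toy_y[i, toy_index[toy_next[i]]] = True       # one True per sample

print(toy_x.shape, toy_y.shape)                   # (2, 3, 3) (2, 3)
print(toy_x[0].astype(int))                       # "abc" encodes as the identity matrix

# A boolean array stores one byte per element, so the full x takes roughly 1.6 GB.
full_bytes = 507374 * 40 * 81
print(full_bytes / 1e9, "GB")
```

The memory estimate is one practical reason to train on a sample of the episodes rather than all of them.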

## Building The Model

Next, we'll build out our model

In [44]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.callbacks import LambdaCallback, ModelCheckpoint
import random
import sys
import io

The following two functions sample the next character from the model's predictions and print generated text after selected epochs (the first and the fifteenth), using several values of `temperature`.

Temperature is defined as follows:

"Temperature is a scaling factor applied to the outputs of our dense layer before applying the softmax activation function. In a nutshell, it defines how conservative or creative the model's guesses are for the next character in a sequence. Lower values of temperature (e.g., 0.2) will generate "safe" guesses whereas values of temperature above 1.0 will start to generate riskier guesses. Think of it as the amount of surprise you'd have at seeing an English word start with "st" versus "sg". When temperature is low, we may get lots of the's and and's; when temperature is high, things get more unpredictable."

-- https://medium.freecodecamp.org/applied-introduction-to-lstms-for-text-generation-380158b29fb3
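
To see the effect numerically, here is a small sketch that applies the same rescaling as the `sample` helper in the next cell (log, divide by temperature, exponentiate, renormalize) to a made-up three-character distribution. The function name `apply_temperature` and the probabilities are assumptions for illustration.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector: log, divide by temperature, renormalize."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    exp = np.exp(logits)
    return exp / exp.sum()

toy_probs = [0.6, 0.3, 0.1]   # made-up next-character probabilities
for t in (0.2, 0.5, 1.0, 1.2):
    print(t, np.round(apply_temperature(toy_probs, t), 3))
```

A temperature of 1.0 leaves the distribution unchanged, values below 1.0 push more probability onto the most likely character, and values above 1.0 flatten the distribution.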


In [50]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def on_epoch_end(epoch, logs):
    # Function invoked for specified epochs. Prints generated text.
    # Using epoch+1 to be consistent with the training epochs printed by Keras
    if epoch+1 == 1 or epoch+1 == 15:
        print()
        print('----- Generating text after Epoch: %d' % epoch)

        start_index = random.randint(0, len(all_transcriptions) - maxlen - 1)
        for diversity in [0.2, 0.5, 1.0, 1.2]:
            print('----- diversity:', diversity)

            generated = ''
            sentence = all_transcriptions[start_index: start_index + maxlen]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            sys.stdout.write(generated)

            for i in range(400):
                x_pred = np.zeros((1, maxlen, len(chars)))
                for t, char in enumerate(sentence):
                    x_pred[0, t, char_indices[char]] = 1.

                preds = model.predict(x_pred, verbose=0)[0]
                next_index = sample(preds, diversity)
                next_char = indices_char[next_index]

                generated += next_char
                sentence = sentence[1:] + next_char

                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()
    else:
        print()
        print('----- Not generating text after Epoch: %d' % epoch)

generate_text = LambdaCallback(on_epoch_end=on_epoch_end)

In [None]:
def build_basic_model():
    model = Sequential()
    # note: `batch_size` is reused here as the number of LSTM units
    model.add(LSTM(batch_size, input_shape=(maxlen, len(chars))))
    model.add(Dense(len(chars)))
    model.add(Activation("softmax"))
    return model

Here, we'll create our model.
After a few tests, I've seen that 2 LSTM layers with `batch_size` set to 256 (which the code uses as the number of units in each LSTM layer) return very good results.
The first model is a basic model with 1 LSTM layer.

In [16]:
batch_size=128
learning_rate = 0.01

model = build_basic_model()
optimizer = RMSprop(lr=learning_rate)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

# define the checkpoint
filepath = "weights.hdf5"
checkpoint = ModelCheckpoint(filepath, 
                             monitor='loss', 
                             verbose=1, 
                             save_best_only=True, 
                             mode='min')

# fit model using our gpu
with tf.device('/gpu:0'):
    model.fit(x, y,
              batch_size=batch_size,
              epochs=15,
              verbose=1,
              callbacks=[generate_text, checkpoint])

Epoch 1/15
 - 168s - loss: 1.4472

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "ainted in your mountain. There. Isn't th"
ainted in your mountain. There. Isn't the back, and there we go the to start out of the back. There we go with the back, and there. There we go. There. There. There. There. And we'll go back and start the background of this one of the back of the base of the back. There. And we'll start one of the brush on the back, and there's a little bit of the back. There we go that one of the back. There we go the back. And we want the back, and th
----- diversity: 0.5
----- Generating with seed: "ainted in your mountain. There. Isn't th"
ainted in your mountain. There. Isn't that easy like that. So let's do that. So let's go the back. Where we got the way of the old bright over the touching. I have to make the let the hang here. I stay over there, and that's run on the base the ingict of right in here lives right there. Okay, let's jus



y're back here in the bright and that lives right there. There we go. And it's all the color in the bristles that live in the bright and little bit of the bright red of the brush. And they come right there. And they're a little bit of the brown and some little things that live out here. And a little bit of the bright and then we have a little bit of the bit 
----- diversity: 0.5
----- Generating with seed: "tly tap this and pull down at the same t"
tly tap this and pull down at the same thing because that's where a little bit of color in there. And we'll go sid it to get into the top, and they come right there. That still the black gesso basic little black, they paint thinner, right over the brush and happening beautiful way of your painting, that live out there. And becommen it back and sort of some big tree that lives real and sort of this side it dark to dry the big mountches a
----- diversity: 1.0
----- Generating with seed: "tly tap this and pull down at the same t"
tly tap this a

You can see that the results were good, but let's go deeper.

## Building a better model

Here, we'll use 2 LSTM layers with dropout. During training, we'll save the best model for later.
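
To get a feel for how much larger the two-layer model is than the basic one, the trainable parameters can be counted by hand. This is a rough sketch that assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias) and default Keras layer settings. The helper names are made up, and dropout layers add no parameters.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus bias vector
    return input_dim * units + units

n_chars = 81  # unique characters reported earlier

# Basic model: one LSTM with 128 units, then a dense softmax layer.
basic = lstm_params(n_chars, 128) + dense_params(128, n_chars)

# Deeper model: two LSTMs with 256 units each, then a dense softmax layer.
deeper = (lstm_params(n_chars, 256)
          + lstm_params(256, 256)
          + dense_params(256, n_chars))

print(basic)   # 117969
print(deeper)  # 892241
```

Under these assumptions the deeper model has roughly 7.5 times as many parameters, which is consistent with it taking noticeably longer to train.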

In [47]:
from keras.layers import Dropout

batch_size=256
learning_rate = 0.01
def build_deeper_model():
    model = Sequential()
    # the first LSTM returns the full sequence so the second LSTM can consume it
    model.add(LSTM(batch_size, input_shape=(maxlen, len(chars)), return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(batch_size))
    model.add(Dropout(0.2))
    model.add(Dense(len(chars), activation='softmax'))
    return model

model = build_deeper_model()
model.compile(loss='categorical_crossentropy', optimizer='adam')


# define the checkpoint
filepath = "bob_ross/weights-deepeer.hdf5"
checkpoint = ModelCheckpoint(filepath, 
                             monitor='loss', 
                             verbose=1, 
                             save_best_only=True, 
                             mode='min')

# fit model using our gpu
with tf.device('/gpu:0'):
    model.fit(x, y,
              batch_size=64,
              epochs=15,
              verbose=1,
              callbacks=[generate_text, checkpoint])

## Loading the Model

Training took about 2 hours on a GCP instance with a Tesla P100 GPU. Afterwards, we load the best model and perform a prediction.

We load the model from the saved checkpoint, and now we can predict.
I chose a temperature of `0.5`, since it seemed to give the best results.
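
The generation loop used in this notebook can also be written as a small reusable function. The sketch below is not the notebook's code: it uses a hypothetical `stub_predict` in place of the trained model, a made-up three-character vocabulary, and NumPy's `Generator.choice` instead of `np.random.multinomial` for sampling.

```python
import numpy as np

def generate(predict_fn, seed, n_chars, vocab, temperature=0.5, rng=None):
    """Extend `seed` by `n_chars` characters, sampling one character at a time."""
    if rng is None:
        rng = np.random.default_rng()
    window, generated = seed, seed
    for _ in range(n_chars):
        probs = np.asarray(predict_fn(window), dtype="float64")
        scaled = np.exp(np.log(probs) / temperature)   # temperature rescaling
        scaled /= scaled.sum()
        next_char = vocab[rng.choice(len(vocab), p=scaled)]
        generated += next_char
        window = window[1:] + next_char   # slide the window forward by one character
    return generated

# Hypothetical stand-in for the trained model: always returns the same distribution.
toy_vocab = ["a", "b", " "]

def stub_predict(window):
    return [0.5, 0.3, 0.2]

toy_text = generate(stub_predict, seed="ab a", n_chars=20, vocab=toy_vocab,
                    rng=np.random.default_rng(0))
print(toy_text)
```

With the real model, `predict_fn` would one-hot encode the window and call `model.predict`, as the generation cell in this notebook does inline.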

In [58]:
# model.load_weights("weights-deepeer.hdf5")
from keras.models import load_model
model = load_model("bob_ross/weights-deepeer.hdf5")
model
# model.compile(loss='categorical_crossentropy', optimizer='adam')

<keras.engine.sequential.Sequential at 0x7f78bb4f7748>

In [59]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [71]:
int_to_char = dict((i, c) for i, c in enumerate(chars))

In [78]:
start_index = 0


for diversity in [0.5]:
    print('----- diversity:', diversity)

    generated = ''
    sentence = all_transcriptions[start_index: start_index + maxlen]
    generated += sentence
#     print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)

    for i in range(1000):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]

        generated += next_char
        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()

----- diversity: 0.5
- Hi, welcome back. I'm certainly glad you can do this black canvas. I have the same clouds that the light on that little bushes that lives on the brush, and I'm gonna go up in here. There, something like that. There, and we'll just put a little bit of this but of the Prussian blue to think on the brush here. We'll just push in some little bushes. And I wanna see what you looks like that, let's go back into the bright red. And you can make it a little bit of the little bushes and sidight to have a little bit of the little light color. Just a little bit of the background color to the colors on the brush, and I wanna do is in the background, I'm gonna put a little bit of black in here and there. Just sort of lay the color. There, that easy. And we can see it in a little more of the lighter and they go right into the one of the lay of the paintings that you have the colors that you go. And we got a little bit of lighter on the canvas on the canvas, and we can see the 

As you can see above, we've generated a good amount of text from our transcriptions.
Notice that the model picked up color names (Prussian blue), and you get the idea that it's a story about painting.
