In [1]:
import os
import time

import pandas as pd
import sqlite3
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import re
from tqdm import tqdm
from sklearn.model_selection import train_test_split


import tensorflow as tf
from tensorflow.keras.layers import StringLookup
from tensorflow.strings import unicode_split, reduce_join
from tensorflow.keras.layers import GRU, Dense, LSTM, Input, Embedding
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from nltk.translate.bleu_score import sentence_bleu

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_colwidth',1024)
pd.set_option('display.width',1024)

drive_path = '/content/drive/MyDrive/01_Applied_AI_Course_New/CaseStudies/Text Creation/'

# Data Processing
1. Load the stored files for each dataset from Drive
2. Preprocess the text: convert to lower case and replace non-ASCII characters with spaces
3. Combine the sequences of each dataset into a single text
4. Generate the vocabulary of each dataset and combine them into a single common vocabulary
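
As a rough illustration of steps 2 to 4, the sketch below runs the same pipeline on a few made-up strings. The names `toy_corpora`, `normalize_text` and `toy_common_vocab` are hypothetical and exist only for this example; the real corpora are loaded from Drive in the cells that follow.

```python
# Toy stand-ins for the corpora; the real data is loaded from Drive.
toy_corpora = {
    "news": ["Showtime’s new series", "AlphaGo — a victory"],
    "scifi": ["MARCH # All Stories", "IF is published bi-monthly"],
}

def normalize_text(text):
    """Lower-case the text and replace every non-ASCII character with a space."""
    text = text.lower()
    return "".join(ch if ord(ch) < 128 else " " for ch in text)

# Steps 2-3: clean each document, then join each corpus into one string.
merged = {
    name: " ".join(normalize_text(doc) for doc in docs)
    for name, docs in toy_corpora.items()
}

# Step 4: per-corpus character vocabularies, then their sorted union.
vocabs = {name: set(text) for name, text in merged.items()}
toy_common_vocab = sorted(set().union(*vocabs.values()))

print(merged["news"])
print(len(toy_common_vocab))
```

Because every non-ASCII character becomes a space, curly apostrophes and dashes turn into gaps inside words, which is visible in the cleaned outputs further down.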


In [2]:
# convert to lower case and replace non-ASCII chars with spaces
def format_text(text):
  text = text.lower()
  text = ''.join([i if ord(i) < 128 else ' ' for i in text])
  return text

## All the News

In [3]:
# load the longform_text data 
longform_text_data = pd.read_parquet(drive_path+'longform_text_data.parquet.gzip')
print(longform_text_data.shape)
longform_text_data.head()

(196959, 1)


Unnamed: 0,longform_text
0,"Agent Cooper in Twin Peaks is the audience: once delighted, now disintegrating And never more so than in Showtime’s new series revival Some spoilers ahead through episode 4 of season 3 of Twin Peaks. On May 21st, Showtime brought back David Lynch’s groundbreaking TV series Twin Peaks, and fulfilled a prophecy in the process. In the second season finale, back in 1991, the spirit of series-defining murder victim Laura Palmer told FBI special agent and series protagonist Dale Cooper, “I’ll see you again in 25 years.” That clip plays again in the first episode of Lynch’s Twin Peaks revival, as a reminder that decades have in fact gone by, Laura’s promise has been carried out, and a series canceled mid-story is back on the air.A lot has changed in 25 years. The original cast members, who are mostly back on board, have all aged heavily and visibly. Many of the characters have moved on in life, getting new jobs, forming families, or taking up new obsessions. But in the opening episode, Dale Cooper was still..."
1,"AI, the humanity! AlphaGo’s victory isn’t a defeat for humans — it’s an opportunity A loss for humanity! Man succumbs to machine! If you heard about AlphaGo’s latest exploits last week — crushing the world’s best Go player and confirming that artificial intelligence had mastered the ancient Chinese board game — you may have heard the news delivered in doomsday terms.There was a certain melancholy to Ke Jie’s capitulation, to be sure. The 19-year-old Chinese prodigy declared he would never lose to an AI following AlphaGo’s earthshaking victory over Lee Se-dol last year. To see him onstage last week, nearly bent double over the Go board and fidgeting with his hair, was to see a man comprehensively put in his place.But focusing on that would miss the point. DeepMind, the Google-owned company that developed AlphaGo, isn’t attempting to crush humanity — after all, the company is made up of humans itself. AlphaGo represents a major human achievement and the takeaway shouldn’t be that AI is surpassing our a..."
2,"The Viral Machine Super Deluxe built a weird internet empire. Can it succeed on TV? When Wolfgang Hammer talks about the future of entertainment, people listen. Hammer is the mastermind behind the American reboot of House of Cards, the guy with the unlikely idea of bringing together David Fincher and a forgotten BBC series. He oversaw two of CBS Films’ first prestige movies: the Coen brothers’ Inside Llewyn Davis and Martin McDonagh’s Seven Psychopaths. He’s had a charmed career: leap-frogging from a master’s degree at Stanford to an entry-level job at Media Rights Capital to eventually becoming the president of CBS’s fledgling films division. So when Hammer came to Turner with an ambitious concept, the cable giant was willing to entertain it. Turner owns TBS, TNT, CNN, and Cartoon Network, but what Hammer was proposing was something altogether different: an all-in-one production company that would thrive online and pretty much do whatever it wanted. Now 18 months old, Super Deluxe is being nurtured ..."
3,"How Anker is beating Apple and Samsung at their own accessory game Steven Yang quit his job at Google in the summer of 2011 to build the products he felt the world needed: a line of reasonably priced accessories that would be better than the ones you could buy from Apple and other big-name brands. These accessories — batteries, cables, chargers — would solve our most persistent gadget problem by letting us stay powered on at all times. There were just a few problems: Yang knew nothing about starting a company, building consumer electronics, or selling products. “I was a software engineer all my life at Google. I didn’t know anyone in the electronics manufacturing world,” Yang tells me over Skype from his office in Shenzhen, China. But he started the company regardless, thanks in no small part to his previous experience with Amazon’s sellers marketplace, a platform for third-party companies and tiny one- or two-person teams interested in selling directly to consumers. He named the company Anker, after..."
4,"Tour Black Panther’s reimagined homeland with Ta-Nehisi Coates Ahead of Black Panther’s 2018 theatrical release, Marvel turned to Ta-Nehisi Coates to breathe new life into the nation of Wakanda. “I made most of my career analyzing the forces of racism and white supremacy as an idea in America. But what you begin to realize after you do that long enough — you aren’t talking about anything specific. In other words, you aren’t really talking about whether some people have lighter skin or some people have blonde hair or some people have blue eyes or some people have kinky hair. You’re talking about power.” This is the voice of journalist, cultural critic, and best-selling author Ta-Nehisi Coates. Coates is the writer of Marvel’s latest entry in the Black Panther canon, Black Panther: A Nation Under Our Feet. With the book, he’s been charged with turning one of Marvel’s least understood and appreciated black characters into a marquee superhero.Even if you don’t read comics, you likely know about the chara..."


In [4]:
%%time
# convert longform_text_data to lower case & replace non-ASCII chars (%%time must be the first line of the cell)
longform_text_data['longform_text'] = longform_text_data.longform_text.apply(format_text)
print(longform_text_data['longform_text'].head())
print(' ')

0    agent cooper in twin peaks is the audience: once delighted, now disintegrating      and never more so than in showtime s new series revival some spoilers ahead through episode 4 of season 3 of twin peaks. on may 21st, showtime brought back david lynch s groundbreaking tv series twin peaks, and fulfilled a prophecy in the process. in the second season finale, back in 1991, the spirit of series-defining murder victim laura palmer told fbi special agent and series protagonist dale cooper,  i ll see you again in 25 years.  that clip plays again in the first episode of lynch s twin peaks revival, as a reminder that decades have in fact gone by, laura s promise has been carried out, and a series canceled mid-story is back on the air.a lot has changed in 25 years. the original cast members, who are mostly back on board, have all aged heavily and visibly. many of the characters have moved on in life, getting new jobs, forming families, or taking up new obsessions. but in the opening episo

In [5]:
# merge all contents to form single text and get the vocab from it
longform_text =  ' '.join(longform_text_data.longform_text.values).replace(r'  ', ' ')

total_char_count = len(longform_text)
print('total characters ', total_char_count)

total characters  896650760


## SciFi Stories

In [6]:
# load the scifi text data
scifi_data = pd.read_pickle(drive_path+'scifi_data.pkl')
scifi_text_data = pd.DataFrame({'scifi_text': scifi_data})
print(scifi_text_data.shape)
scifi_text_data.head()

(145827, 1)


Unnamed: 0,scifi_text
0,"MARCH # All Stories New and Complete Publisher Editor IF is published bi-monthly by Quinn Publishing Company, Inc., Kingston, New York. Volume #, No. #. Copyright # by Quinn Publishing Company, Inc. Application for Entry' as Second Class matter at Post Office, Buffalo, New York, pending. Subscription # for # issues in U.S. and Possessions: Canada # for # issues; elsewhere #. Aiiow four weeks for change of address. All stories appearing in this magazine are fiction. Any similarity to actual persons is coincidental. #c a fcopy. Printed ia U.S. A. A chat with the editor i # science fiction magazine called IF. The title was selected after much thought because of its brevity and on the theory it is indicative of the field and will be easy to remember. The tentative title that just morning and couldn't remember it until we'd had a cup of coffee, it was summarily discarded. A great deal of thought and effort lias gone into the formation of this magazine. We have had the aid of several very talented and generou..."
1,"for which we are most grateful. Much is due them for their warmhearted assistance. And now that the bulk of the formative work is done, we will try to maintain IF as one of the finest books on the market. t a great public demand for our magazine. In short, why will you buy IF? We cannot, in honesty, say we will publish at all times the best science fiction in the field. That would not be true. But we will have access to the best stories, and we will get our fair share of works from the best writers. We definitely will not talk ""adult"" or ""juvenile"" relative to our content as we feel such terms are misleading. We would rather think at all times in the terms of ""story"". Some of the greatest escapist literature ever written, Treasure Island for instance, could be put into either category or both. And if Edgar Rice Burroughs is juvenile, then so are we, because the late master has given us some memorable thrills. Frankly, we don't think you'll buy IF because you feel we print better yams than any other mag."
2,"You will buy it, we hope, because you like its personality. Every magazine, we feel, does have a definite personality of its own. This personality is usually a reflection of the editors, their way of thinking, their appreciation of tKe market, their interpretation of what you will like best in stories and artwork. We have tried to make IF different from any other science fiction magazine on the stands while still building it along the lines of what every science fiction mag must be. Aside from the letter columns and the editorial, which are departments of field-wide use, we have not copied any feature of any other magazine. We will not, for instance, review fanzines, because we feel that is being most ably done by other mags. Nor will we, as a general practice, review books because that appears to us to be overdone. a personality of our own and hope thereby to establish an affinity with a large number of readers who will remember IF when they buy a science fiction mag as one they like and wish to continue..."
3,"At all times we will hew to the story-line and will exhort with our writers to do the same. As an example, when Howard Browne phoned to talk over the plot for his lead novel in this issue, he described what ivas without doubt a staggering premise, a really startling concept. ""But,"" he mourned, S T suppose I'll have to bend it around to give them the good old conventional ending."" We told Howard, ""Not for IF, chum. Remember the old creed we live by. A writer may cheat on his wife, but he is ever true to the story-line. He may haul his infant son around by one leg. but he carries a good story-idea like a holy relic. If there is only one logical ending for Twelve Times ZerG, that's the ending we want."" Therefore, we do not feel the majority of readers necessarily want a happy ending regardless of all else. Not when it is incompatable with the aura of realism created by the writer. A check-list of fiction masterpieces certainly bears this out. The furor created by a little piece called Sorry , Wrong Number"
4,"would certainly not have been forthcoming had the bedridden lady been rescued in the last paragraph. Romeo and Juliet would have beep nothing more than the smooth effort of the world's greatest writer if Romeo had gotten there in time. Yet, in modern fiction, he gets there in time with such amazing regularity one feels he has memorized at least a dozen time-tables. The result has been unnumbered carloads of mediocre fiction. Also -- though we don't wish to underscore the point too heavily -- what could more surely have smothered the greatness of Wuthering Heights than a happy ending? that IF will be a magazine given over to tragedy. W e will only insist that our writers create scenes and climaxes that fix the story rather than cater to that old ""debil"" formula. And in so doing we have an entirely selfish motive. This: As the years go by, we want to look back with personal pride upon an everlengthening list of great stories. So the book you now hold in your hands is a new one titled IF. We hope you will ..."


In [7]:
%%time
# convert scifi_text_data to lower case & replace non-ASCII chars (%%time must be the first line of the cell)
scifi_text_data['scifi_text'] = scifi_text_data.scifi_text.apply(format_text)
print(scifi_text_data['scifi_text'].head())
print(' ')

0    march # all stories new and complete publisher editor if is published bi-monthly by quinn publishing company, inc., kingston, new york. volume #, no. #. copyright # by quinn publishing company, inc. application for entry' as second class matter at post office, buffalo, new york, pending. subscription # for # issues in u.s. and possessions: canada # for # issues; elsewhere #. aiiow four weeks for change of address. all stories appearing in this magazine are fiction. any similarity to actual persons is coincidental. #c a fcopy. printed ia u.s. a. a chat with the editor  i #  science fiction magazine called if. the title was selected after much thought because of its brevity and on the theory it is indicative of the field and will be easy to remember. the tentative title that just morning and couldn't remember it until we'd had a cup of coffee, it was summarily discarded. a great deal of thought and effort lias gone into the formation of this magazine. we have had the aid of several 

In [8]:
# merge all contents to form single text and get the vocab from it
scifi_text =  ' '.join(scifi_text_data.scifi_text.values)

scifi_total_char_count = len(scifi_text)
print('scifi_total_char_count', scifi_total_char_count)

scifi_total_char_count 149352361


## Cornell Movie Dialogs

In [9]:
# load the cornell text data
cornell_movie_dialogs = pd.read_parquet(drive_path+'cornell_movie_dialogs_text_data.parquet.gzip')
print(cornell_movie_dialogs.shape)
cornell_movie_dialogs.head()

(83079, 1)


Unnamed: 0,content
0,"Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again. Well, I thought we'd start with pronunciation, if that's okay with you. Not the hacking and gagging and spitting part. Please. Okay... then how 'bout we try out some French cuisine. Saturday? Night?"
1,You're asking me out. That's so cute. What's your name again? Forget it.
2,"No, no, it's my fault -- we didn't have a proper introduction --- Cameron. The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser. My sister. I can't date until she does. Seems like she could get a date easy enough..."
3,"Why? Unsolved mystery. She used to be really popular when she started high school, then it was just like she got sick of it or something. That's a shame."
4,"Gosh, if only we could find Kat a boyfriend... Let me see what I can do."


In [10]:
%%time
# convert cornell_movie_dialogs to lower case & replace non-ASCII chars (%%time must be the first line of the cell)
cornell_movie_dialogs['content'] = cornell_movie_dialogs.content.apply(format_text)
print(cornell_movie_dialogs['content'].head())
print(' ')

0    can we make this quick?  roxanne korrine and andrew barrett are having an incredibly horrendous public break- up on the quad.  again. well, i thought we'd start with pronunciation, if that's okay with you. not the hacking and gagging and spitting part.  please. okay... then how 'bout we try out some french cuisine.  saturday?  night?
1                                                                                                                                                                                                                                                                          you're asking me out.  that's so cute. what's your name again? forget it.
2                                                                                            no, no, it's my fault -- we didn't have a proper introduction --- cameron. the thing is, cameron -- i'm at the mercy of a particularly hideous breed of loser.  my sister.  i can't date until she does. seems like she could get

In [11]:
# merge all contents to form single text and get the vocab from it
cornell_text =  ' '.join(cornell_movie_dialogs.content.values)

cornell_total_char_count = len(cornell_text)
print('cornell_total_char_count', cornell_total_char_count)

cornell_total_char_count 17142745


## Generate Common Vocabulary

In [12]:
# get the vocabulary of each dataset
all_news_vocab = list(sorted(set(longform_text)))
print('all_news_vocab unique characters ', len(all_news_vocab))

scifi_vocab = list(sorted(set(scifi_text)))
print('scifi_vocab unique characters ', len(scifi_vocab))

cornell_vocab = list(sorted(set(cornell_text)))
print('cornell_vocab unique characters ', len(cornell_vocab))


all_news_vocab unique characters  73
scifi_vocab unique characters  49
cornell_vocab unique characters  67


In [13]:
# combine all vocabularies to get the common vocabulary
common_vocab = []
common_vocab.extend(all_news_vocab)
common_vocab.extend(scifi_vocab)
common_vocab.extend(cornell_vocab)

common_vocab = list(sorted(set(common_vocab)))
print('common_vocab unique characters ', len(common_vocab))


common_vocab unique characters  73


# Baseline Modelling
1. Prepare the utility methods required for model training
2. Generate input and target sequences of length 100
3. Split the generated sequences into train and test sets
4. Define a custom model via subclassing that can return the GRU state along with its output
5. Define a method to generate text from a sample input
6. Predict sample outputs and calculate the BLEU-1 score
7. Perform model training on a sample subset of the text initially to get a baseline
8. Model Training:
  - Train on the All the News text first and save the weights
  - Train on the SciFi text, starting from the weights of the All the News model
  - Train on the Cornell text, starting from the weights of the SciFi model

In [14]:
# char lookup
char_lookup = StringLookup(vocabulary=common_vocab)

# id lookup
id_lookup = StringLookup(vocabulary=char_lookup.get_vocabulary(), invert=True)

def get_id_from_char(text):
  # unicode split of text
  chars = unicode_split(text, 'UTF-8')
  # convert chars to ids
  ids = char_lookup(chars)
  return ids

def get_text_from_ids(ids):
  # revert id to char
  chars = id_lookup(ids)
  # join the chars
  text = reduce_join(chars, axis=-1).numpy()
  return text
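
With its default settings, `StringLookup` reserves index 0 for out-of-vocabulary tokens, so the lookup table holds one more entry than the vocabulary passed in. That is why the model output dimension printed later is 74 rather than 73. The following pure-Python stand-in is only a sketch of that mapping on a toy vocabulary; `toy_vocab`, `encode` and `decode` are hypothetical names, not part of the notebook.

```python
# Pure-Python stand-in for the two StringLookup layers (toy vocabulary).
toy_vocab = sorted(set("hello world"))
OOV_TOKEN = "[UNK]"

# Index 0 is reserved for out-of-vocabulary characters, so the table has
# len(toy_vocab) + 1 entries.
id_to_char = [OOV_TOKEN] + toy_vocab
char_to_id = {ch: idx for idx, ch in enumerate(id_to_char)}

def encode(text):
    """Map each character to its id; unknown characters map to 0."""
    return [char_to_id.get(ch, 0) for ch in text]

def decode(ids):
    """Map ids back to characters and join them into one string."""
    return "".join(id_to_char[i] for i in ids)

ids = encode("hello")
print(ids, decode(ids))
print(len(id_to_char))
```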


In [15]:
# generate input & target sequences 
def generate_input_target_sequences(text, seq_length, char_count):
  input_sequences = []
  target_sequences = []
  for idx in tqdm(range(0, char_count, seq_length)):
      ids = get_id_from_char(text[idx : (idx+seq_length+1)])
      input_sequences.append(tf.convert_to_tensor(ids[:-1]))
      target_sequences.append(tf.convert_to_tensor(ids[1:]))

  return input_sequences, target_sequences
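
To make the shifting explicit, here is a minimal sketch on plain Python lists. It differs from the function above in one respect, which is an assumption of this example: it drops a short final window itself, whereas the notebook keeps it and discards it later with `X[:-1]`. The helper name `make_input_target_pairs` is hypothetical.

```python
def make_input_target_pairs(ids, seq_length):
    """Cut `ids` into windows of seq_length + 1 and shift by one position."""
    pairs = []
    for start in range(0, len(ids), seq_length):
        window = ids[start : start + seq_length + 1]
        if len(window) < seq_length + 1:
            break  # drop the short final window instead of padding it
        # Input is everything but the last id; target is everything but the first.
        pairs.append((window[:-1], window[1:]))
    return pairs

toy_ids = list(range(11))
for inp, tgt in make_input_target_pairs(toy_ids, seq_length=5):
    print(inp, "->", tgt)

# Number of windows the notebook's loop visits for the 5% SciFi subset.
scifi_subset_chars = int(149352361 * .05)
print(len(range(0, scifi_subset_chars, 100)))
```

The last line reproduces the iteration count that the progress bar reports in the SciFi training cell.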

In [95]:
# prepare data
def splitup_train_test(input, target, batch_size, buffer_size):
  
  # train test split
  X_train, X_test, y_train, y_test = train_test_split(input, target, test_size=0.2, random_state=42)
  
  train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
  train_dataset = train_dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)
  
  test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test))
  test_dataset = test_dataset.batch(batch_size, drop_remainder=True)

  return train_dataset, test_dataset 

In [126]:
# predictions based on sample input, with BLEU-1 scores
def predictions_with_bleu_score(model, dataset):
  input, predicted, bleu_score = [], [], []

  for input_example_batch, target_example_batch in dataset.take(1):
      example_batch_predictions = model(input_example_batch)
      for idx in range(len(example_batch_predictions)):
        sampled_indices = tf.random.categorical(example_batch_predictions[idx], num_samples=1)
        sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
        input.append(get_text_from_ids(input_example_batch[idx]).decode('utf-8'))
        predicted.append(get_text_from_ids(sampled_indices).decode('utf-8'))
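        # BLEU-1 on space-separated tokens; note that the reference is the input text,
        # while the predictions are aligned with the target (the input shifted by one character)
        bleu_score.append(sentence_bleu([input[-1].split(' ')], predicted[-1].split(' '), weights=(1, 0, 0, 0)))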

  return pd.DataFrame({'input':input, 'predicted':predicted, 'bleu_score':bleu_score}).sort_values('bleu_score', ascending=False)
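
With `weights=(1, 0, 0, 0)` only unigrams contribute, so the score reduces to clipped unigram precision times a brevity penalty. The sketch below is a simplified reimplementation for illustration, assuming a single reference; it is not NLTK's code and may differ from `sentence_bleu` in edge cases. The function name `bleu1` is hypothetical.

```python
import math
from collections import Counter

def bleu1(reference_tokens, hypothesis_tokens):
    """Clipped unigram precision multiplied by the brevity penalty."""
    if not hypothesis_tokens:
        return 0.0
    ref_counts = Counter(reference_tokens)
    hyp_counts = Counter(hypothesis_tokens)
    # Each hypothesis token is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    precision = clipped / len(hypothesis_tokens)
    ref_len, hyp_len = len(reference_tokens), len(hypothesis_tokens)
    # Hypotheses shorter than the reference are penalised.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * precision

reference = "three teens are accused of stealing police cars".split(" ")
hypothesis = "three teens are accused of stoaling police cars".split(" ")
print(bleu1(reference, hypothesis))
```

Since tokens are whole words, a single wrong character makes the entire word count as a miss, which is why character-level models score low on this metric even when most characters are right.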
  

In [128]:
def gnerate_next_char(model, inputs, states=None):
  # Convert strings to token IDs.
  input_ids = get_id_from_char(inputs).to_tensor()

  # Run the model.
  predicted_logits, states = model(input_ids, states=states, return_state=True)

  # Only use the last prediction.
  predicted_logits = predicted_logits[:, -1, :]

  # Sample the output logits to generate token IDs.
  predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
  predicted_ids = tf.squeeze(predicted_ids, axis=-1)

  # Convert from token ids to characters
  predicted_chars = get_text_from_ids(predicted_ids)

  # Return the characters and model state.
  return [predicted_chars], states
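
`tf.random.categorical` draws the next id from the distribution defined by the logits instead of always taking the most likely one. A plain-Python sketch of that step is shown below. The `temperature` parameter is an addition for illustration and is not used in this notebook; `sample_from_logits` is a hypothetical name.

```python
import math
import random

def sample_from_logits(logits, temperature=1.0, rng=random):
    """Draw one index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 0.5, -1.0]
draws = [sample_from_logits(toy_logits, rng=rng) for _ in range(1000)]
# How often each index was drawn; higher logits should be drawn more often.
print([draws.count(i) for i in range(3)])
```

A temperature below 1 would sharpen the distribution towards the highest logit, while a temperature above 1 would flatten it and produce more varied text.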

In [134]:
def generate_text(model, sample_input_text, chars=1000):
  start = time.time()
  states = None
  next_char = tf.constant([sample_input_text])
  result = [next_char]

  for n in range(chars):
    next_char, states = gnerate_next_char(model, next_char, states=states)
    result.append(next_char)

  result = tf.strings.join(result)
  end = time.time()
  print(f'\nRun time:, {end - start}\n')
  return result[0].numpy().decode('utf-8')

In [46]:
## https://www.tensorflow.org/text/tutorials/text_generation

class CharRNN(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units, kernel_reg=None):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True, kernel_regularizer=kernel_reg)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x
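
For a sense of model size, the trainable parameter count of this Embedding, GRU, Dense stack can be worked out by hand. The sketch assumes the Keras default GRU configuration (`reset_after=True`, which gives two bias vectors per gate); `char_rnn_param_count` is a hypothetical helper, and `model.summary()` on a built model remains the authoritative figure.

```python
def char_rnn_param_count(vocab_size, embedding_dim, rnn_units):
    """Trainable parameters of Embedding -> GRU -> Dense (Keras default GRU)."""
    embedding = vocab_size * embedding_dim
    # Three gates, each with an input kernel, a recurrent kernel, and two
    # bias vectors (assumes the Keras default reset_after=True).
    gru = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)
    dense = rnn_units * vocab_size + vocab_size
    return embedding + gru + dense

# Values used in the training cells: vocab_size=74, embedding_dim=256, rnn_units=1024.
print(char_rnn_param_count(vocab_size=74, embedding_dim=256, rnn_units=1024))
```

Almost all of the parameters sit in the GRU layer, so `rnn_units` is the setting that dominates both memory use and training time.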

### Training on All News Text

In [64]:
# select sample subset index
sample_set_size = int(total_char_count * .01)

sequence_length = 100
batch_size = 128
buffer_size = 1000
vocab_size = len(char_lookup.get_vocabulary())
embedding_dim = 256
rnn_units = 1024

X, y = generate_input_target_sequences(longform_text[:sample_set_size+1], sequence_length, sample_set_size)

train_dataset, test_dataset = splitup_train_test(X[:-1], y[:-1], batch_size, buffer_size)

model = CharRNN(vocab_size, embedding_dim, rnn_units)


In [65]:
model = CharRNN(vocab_size, embedding_dim, rnn_units)

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=drive_path+'all_news_model.hdf5', save_weights_only=True, save_best_only=True)

model.compile(optimizer=tf.keras.optimizers.Adam(0.001), loss=SparseCategoricalCrossentropy(from_logits=True))

model.fit(train_dataset, epochs=10, validation_data=test_dataset, callbacks=[checkpoint_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f0f43eedb10>

In [127]:
predictions = predictions_with_bleu_score(model, test_dataset)
predictions.head()

Unnamed: 0,input,predicted,bleu_score
25,rams at stations where three teens are accused of stealing police cars. southern california\'s heat,np ar stations where three teens are accused of stoaling police cars. touthern california 's heat w,0.611111
72,ions\' testimony\xa0in the senate on tuesday didn\'t change many minds.\xa0uber ceo travis kalanick\,gn x ehstimony\xa0in the senate on tuesday didn\'t change many mirds.\xa0uber\ceo travis kalinick x,0.533333
110,ress report receiving more death threats\xa0than ever before. baseball doesn\'t stop for tragedy. \',ude.telorteteceiving mere seath threats\xa0than ever before. baseball dowsn\'t stop for tragedy. \'c,0.493781
35,"is favorite part about playing abraham? ""the fan reactions,"" he says.', 'the actor was a fan of the","nhrolor ne tart tbout slaygng abaiham? ""the macdseactsons."" me says.', 'theysntor was a man of the n",0.313768
52,s despite redmond's absence. it was hard to walk around the halls and not spot the recognizable wind,"siosiite ooqaond s suience, it cas oird to sotk treund the wosfw wnd tab jeektloe tisornitible widd-",0.221853


In [135]:
generate_text(model, 'audience:', 1000)


Run time:, 9.019437789916992



'audience: she felt wants to drove the dubai it in eleratory puzzling panic existence, aleasa and a cellthe is serving a much-bound today" in the game read that we could change an email to that all participant has been under consed-related technologies sected over both movie or to pinfly the internet of data and humans. in other words on dizn t true. it\'s absolute attraction.\\xa0\', \' vas lansdon, mora-blanco chosen that his voice-illing advantage\\xa0as that our free video will serve china. it\'s something that goody slown in a way that capitalization from the stoty of industrial services is been heavily badge enough. gas manages&amprdquo the latest inkucc was do so harden and more inflicted over two you\'ve seenbadd set up in the right gils.\', \' with robin wright, depadds, an app like featurette for the public, celebrity answers to bud there? but you can make heads could use the indust will rips the duty has been shown to havas the way first depressed in  it was escave environme

### Training on SciFi Text

Load the weights of the model trained on the All the News text.

In [108]:
# unchanged parameters
sequence_length = 100
batch_size = 128
buffer_size = 1000
vocab_size = len(char_lookup.get_vocabulary())
embedding_dim = 256
rnn_units = 1024

# select sample subset index
scifi_sample_set_size = int(scifi_total_char_count * .05)

X_scifi, y_scifi = generate_input_target_sequences(scifi_text[:scifi_sample_set_size+1], sequence_length, scifi_sample_set_size)

scifi_train_dataset, scifi_test_dataset = splitup_train_test(X_scifi[:-1], y_scifi[:-1], batch_size, buffer_size)

scifi_model = CharRNN(vocab_size, embedding_dim, rnn_units)

for input_example_batch, target_example_batch in scifi_train_dataset.take(1):
    example_batch_predictions = scifi_model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

scifi_model.load_weights(drive_path+'all_news_model.hdf5')

100%|██████████| 74677/74677 [04:13<00:00, 294.89it/s]


(128, 100, 74) # (batch_size, sequence_length, vocab_size)


In [109]:
scifi_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=drive_path+'scifi_model.hdf5', save_weights_only=True, save_best_only=True)

scifi_model.compile(optimizer=tf.keras.optimizers.Adam(0.0001), loss=SparseCategoricalCrossentropy(from_logits=True))

scifi_model.fit(scifi_train_dataset, epochs=10, validation_data=scifi_test_dataset, callbacks=[scifi_checkpoint_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f0f42a24150>

In [111]:
scifi_model.fit(scifi_train_dataset, initial_epoch=10, epochs=20, validation_data=scifi_test_dataset, callbacks=[scifi_checkpoint_callback])

Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f0f459b1910>

In [136]:
scifi_predictions = predictions_with_bleu_score(scifi_model, scifi_test_dataset)
scifi_predictions.head()

Unnamed: 0,input,predicted,bleu_score
74,"tions will be based on your wishes. fair enough? she was sweet, gentle, kind -- a sort of martian ol","hfn -etl be veckd on aour lofe s. totlnlraugh ""hhe has toiati seneie mangal- a stut of aenthan w'd",0.190476
88,"ieth century"" to rebuild the life many of you may remember. it will be difficult at first. but ti",ndtocentury aas h oecuild yhe uite-tar pt uour uy besember i iall be lefficult.tn tonst aut ihm,0.181818
121,"the link had been forged. and maccullogh heard him mutter just before he disappeared altogether, ""i",fhe sage tad been sorgod and 'rnhullicu wa rd tem weceeriaont aefore te widgppeared fbmvgethers fhf,0.176471
25,daywithoutfail problems to be solved first. none of us could figure out the purpose of the mechanism,.s ath ut utr olevlems.ooose fumved.irlmt. rote of ts wanldnbinhre ort ahe semslse of the cadhanicms,0.17614
104,"anniversary present."" i stared at him blankly. i couldn't think what anniversary he meant. ""you'll","cbduhersary.aaossntl tssxared at tim.aeand y. lnhouldn't seink seet km hirsary holiomnt, aiou'rl n",0.175035


In [137]:
generate_text(scifi_model, 'stories:', 1000)


Run time:, 8.703739881515503



'stories: "get the boss." fred looked at the grains of the sun, molded by batteries, he had killed the tap put at his disself -- an invitation, and finally he\'d renaied. then he saw smiled slowly. "you could vetera into them, its expression?" "just like the anounctered armma," said mcchecker, quickly, dancing dryly. "what is games -- their sign-faced three companions, the untruside terms so close to him? trimges, fortunately, gets jerked by a subscription, but i\'ll contend with martian androids and easy new proos-bank." he bent forced for halt. i was tired...  --# kirk lost the last mile, it certainly shrugged with dazed feet along his pay and troating into a number through ces.\', within. too." "not on," the president said, "i found it behind the thick benefit of unmisnappeary madages." impelled group of space-port across. they were told lasted, without reable with gilmoreed and have doubted that about brought the controversial sickness, and recognized the "dull select faster than h

### Training on Cornell Text

Load the weights of the model trained on the All the News text and then on the SciFi text.

In [121]:
# unchanged parameters
sequence_length = 100
batch_size = 128
buffer_size = 1000
vocab_size = len(char_lookup.get_vocabulary())
embedding_dim = 256
rnn_units = 1024

# select sample subset index
cornell_sample_set_size = int(cornell_total_char_count * .5)

X_cornell, y_cornell = generate_input_target_sequences(cornell_text[:cornell_sample_set_size+1], sequence_length, cornell_sample_set_size)

cornell_train_dataset, cornell_test_dataset = splitup_train_test(X_cornell[:-1], y_cornell[:-1], batch_size, buffer_size)

cornell_model = CharRNN(vocab_size, embedding_dim, rnn_units)

for input_example_batch, target_example_batch in cornell_train_dataset.take(1):
    example_batch_predictions = cornell_model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

cornell_model.load_weights(drive_path+'scifi_model.hdf5')

100%|██████████| 85714/85714 [04:53<00:00, 292.20it/s]


(128, 100, 74) # (batch_size, sequence_length, vocab_size)


In [122]:
cornell_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=drive_path+'cornell_model.hdf5', save_weights_only=True, save_best_only=True)

cornell_model.compile(optimizer=tf.keras.optimizers.Adam(0.0001), loss=SparseCategoricalCrossentropy(from_logits=True))

cornell_model.fit(cornell_train_dataset, epochs=10, validation_data=cornell_test_dataset, callbacks=[cornell_checkpoint_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f0f43bc4fd0>

In [138]:
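# NOTE: this call evaluates on scifi_test_dataset, so the samples shown below are SciFi text;
# pass cornell_test_dataset to score the model on Cornell dialogs
cornell_predictions = predictions_with_bleu_score(cornell_model, scifi_test_dataset)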
cornell_predictions.head()

Unnamed: 0,input,predicted,bleu_score
22,"need a leader."" barling smiled coldly. ""are you trying to tell me the men have selected you?"" ""no,","tied anhiwv.r. hjbtieg raote, ahadsy tine you eaying to bell me nha daa cave beelct d you ihot t",0.217391
0,"led just a trifle early. i want you to meet our visitor, special envoy markham introduced them, and","aa cest abdhipleybxrli h want you oo seet hur aesio rs sirnial ipgyy wadksal,is eoduced aoe . tnd d",0.15
95,"sir."" ""when?"" ""just before the council stopped ""uh huh. did you have a reaction?"" tensor considered","rux. ithyne"" diest wu ire the slnrtil.haorsid tjh-ruh. od you heve tnreastion? thl er cousider d",0.15
101,"over him, and he let himself be submerged in purest automatic activity. but as he rested, letting h","yfer heg fnd he tea him elf be uurmirged nn telll. ssthmotic,micioety, sut i fu hoaosr. setting me",0.15
126,"not after tonight."" ""you're not going to run away?"" june asked breathlessly. ""you wouldn't dare do","sob?f rar whmight, ..eu de uot soing to lei f ay. susk ptk d feoakhsess.y, isou""konld 't bewemto y",0.142857


In [139]:
generate_text(cornell_model, 'commercially', 1000)


Run time:, 8.663483381271362



"commercially or may too much? my record camelon's... all i can. yes, you know, just take it easy when we came for the way to kay and a broad cancel in the candidance? the driver's in the morger was proud of yourself -- i want a man in the 21-year witch. elizabeth well, nope... it's eatin' the monster, walter... see you the surface.  only in high germans of others. enhanced me!  we wakes ick - oh, more than you. how are you going to do that, too. actur's visgeed. what exactly as logic, the thoughts comes data for the thing when you get out of last night, as you've been seeing this couple of this loc and forty years together.  put my own five mirds's excellency. how do i go through to come hero?  why were you going to do this, i'm telling you they'd go see you off one window -- cloud together</u> for christ's sake. v.k. i been up in a new york time around... oh,  that is the first time on us? well, you mean, uh...sometimes we haven't showed's connugged in engagement. <u>however, jesus d

### Observations
1. All three models are observed to over-fit: the validation loss is higher than the training loss in every case.
2. All models generate text that is not grammatically correct and contains misspelled words, although some of the generated words are proper words.
3. All the News model after 10 epochs - loss: 1.1012 - val_loss: 1.2569
4. SciFi model after 20 epochs - loss: 1.1388 - val_loss: 1.2480
5. Cornell Movie Dialogs model after 10 epochs - loss: 1.1585 - val_loss: 1.2098
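
The reported losses can be read more intuitively as per-character perplexity, the exponential of the cross-entropy, assuming the loss is measured in nats as Keras does. The sketch below only post-processes the numbers listed above; the dictionary name `final_losses` is hypothetical.

```python
import math

# (train loss, validation loss) copied from the observations above.
final_losses = {
    "all_news": (1.1012, 1.2569),
    "scifi": (1.1388, 1.2480),
    "cornell": (1.1585, 1.2098),
}

for name, (train_loss, val_loss) in final_losses.items():
    gap = val_loss - train_loss          # positive gap: validation is worse than training
    perplexity = math.exp(val_loss)      # per-character perplexity
    print(f"{name}: gap={gap:.4f}, val perplexity={perplexity:.2f}")
```

A perplexity between 3 and 4 means the model is, on average, about as uncertain as if it were choosing uniformly among three to four characters at each step.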
