Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 3
---

Task 3: Feedforward Neural Language Model (60 points)
--------------------------

For this task, you will create and train neural LMs for both your word-based embeddings and your character-based ones. You should write functions when appropriate to avoid excessive copy+pasting.

### a) First, encode  your text into integers (5 points)

In [1]:
# Importing utility functions from Keras
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# necessary
from keras.models import Sequential
from keras.layers import Dense

# optional
# from keras.layers import Dropout

# if you want fancy progress bars
from tqdm import notebook
from IPython.display import display

# your other imports here
import time
import neurallm_utils as nutils

import numpy as np


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shash\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# constants you may find helpful. Edit as you would like.
EMBEDDINGS_SIZE = 50
NGRAM = 3 # The ngram language model you want to train

In [3]:
# load in necessary data
TRAIN_FILE = 'spooky_author_train.csv'
data_train_word = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=False)
data_train_char = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=True)

In [4]:
# Initialize a Tokenizer and fit on your data
# do this for both the word and character data
word_tokenizer = Tokenizer(char_level=False)
char_tokenizer = Tokenizer(char_level=True)

word_tokenizer.fit_on_texts(data_train_word)
word_encoded = word_tokenizer.texts_to_sequences(data_train_word)

char_tokenizer.fit_on_texts(data_train_char)
char_encoded = char_tokenizer.texts_to_sequences(data_train_char)

# It is used to vectorize a text corpus. Here, it just creates a mapping from 
# word to a unique index. (Note: Indexing starts from 0)
# Example:
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(data)
# encoded = tokenizer.texts_to_sequences(data)


In [5]:
# print out the size of the word index for each of your tokenizers
# this should match what you calculated in Task 2 with your embeddings
print("char_tokenizer counts: ", len(char_tokenizer.word_counts))
print("word_tokenizer counts: ",len(word_tokenizer.word_counts))



char_tokenizer counts:  60
word_tokenizer counts:  25374


### b) Next, prepare the sequences to train your model from text (5 points)

#### Fixed n-gram based sequences

In [6]:
def generate_ngram_training_samples(encoded: list, ngram: int) -> list:
    '''
    Takes the encoded data (list of lists) and 
    generates the training samples out of it.
    Parameters:
    up to you, we've put in what we used
    but you can add/remove as needed
    return: 
    list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    '''

    ngrams = []
    for sentence in encoded:
        ngrams += [sentence[i:i+ngram] for i in range(0, len(sentence)-NGRAM+1)]

    print(ngrams[:5])
    return ngrams


# generate your training samples for both word and character data
# print out the first 5 training samples for each
# we have displayed the number of sequences
# to expect for both characters and words
#
# Spooky data by character should give 2957553 sequences
# [21, 21, 3]
# [21, 3, 9]
# [3, 9, 7]
# ...
# Spooky data by words shoud give 634080 sequences
# [1, 1, 32]
# [1, 32, 2956]
# [32, 2956, 3]
# ...
word_train = generate_ngram_training_samples(word_encoded, NGRAM)
char_train = generate_ngram_training_samples(char_encoded, NGRAM)
print(len(word_train))
print(len(char_train))


[[1, 1, 32], [1, 32, 2956], [32, 2956, 3], [2956, 3, 155], [3, 155, 3]]
[[21, 21, 3], [21, 3, 9], [3, 9, 7], [9, 7, 8], [7, 8, 1]]
634080
2957553


### c) Then, split the sequences into X and y and create a Data Generator (20 points)

In [7]:
# 2.5 points
# Note here that the sequences were in the form: 
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate it into [[x1, x2, ... , x(n-1)], ...], [y1, y2, ...]]
# do that here
word_x = [sublist[:-1] for sublist in word_train]
word_y = [sublist[-1] for sublist in word_train]

char_x = [sublist[:-1] for sublist in char_train]
char_y = [sublist[-1] for sublist in char_train]


# print out the shapes to verify that they are correct
print(len(word_x), " ", len(word_y))
print(len(char_x), " ", len(char_y))



634080   634080
2957553   2957553


In [8]:
# 2.5 points

# Initialize a function that reads the word embeddings you saved earlier
# and gives you back mappings from words to their embeddings and also 
# indexes from the tokenizers to their embeddings

def read_embeddings(filename: str, tokenizer: Tokenizer) -> (dict, dict):
    '''Loads and parses embeddings trained in earlier.
    Parameters:
        filename (str): path to file
        Tokenizer: tokenizer used to tokenize the data (needed to get the word to index mapping)
    Returns:
        (dict): mapping from word to its embedding vector
        (dict): mapping from index to its embedding vector
    '''
    # YOUR CODE HERE
    word_to_embedding = {}
    index_to_embedding = {}

    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            tokens = line.split()
            if len(tokens) == 2:
                continue
            embedding_vector = [float(x) for x in tokens[1:]]
        
            if tokens[0] in tokenizer.word_index:
                word_to_embedding[tokens[0]] = embedding_vector
                index_to_embedding[tokenizer.word_index[tokens[0]]] = embedding_vector
    
    return word_to_embedding, index_to_embedding


In [9]:
# NECESSARY FOR CHARACTERS

# the "0" index of the Tokenizer is assigned for the padding token. Initialize
# the vector for padding token as all zeros of embedding size
# this adds one to the number of embeddings that were initially saved
# (and increases your vocab size by 1)
_, word_embedding_index = read_embeddings('spooky_embedding_word.txt', word_tokenizer)
word_embedding_index[0] = [0] * len(word_embedding_index[1])

_, char_embedding_index = read_embeddings('spooky_embedding_char.txt', char_tokenizer)
char_embedding_index[0] = [0] * len(char_embedding_index[1])

In [10]:
# 10 points
def data_generator(X: list, y: list, num_sequences_per_batch: int, index_2_embedding: dict, for_feedforward: bool=True) -> (list,list):
    '''
    Returns data generator to be used by feed_forward
    https://wiki.python.org/moin/Generators
    https://realpython.com/introduction-to-python-generators/
    
    Yields batches of embeddings and labels to go with them.
    Use one hot vectors to encode the labels 
    (see the to_categorical function)
    
    If for_feedforward is True: 
    Returns data generator to be used by feed_forward
    else: Returns data generator for RNN model
    '''
    
    for i in range(0, len(X), num_sequences_per_batch):

        if for_feedforward:
            batch_X = X[i:i + num_sequences_per_batch]
            batch_y = y[i:i + num_sequences_per_batch]
        else: 
            batch_X = X[i:]
            batch_y = y[i:]
            for j in range((i +  num_sequences_per_batch) - len(X)):
                n_gram = len(batch_X[0])
                batch_X.append([0] * n_gram)
                batch_y.append(0)

        embeddings = []
        for x_vector in batch_X:
            cur_vector = []
            for token in x_vector:
                cur_vector.extend(index_2_embedding[token])
            embeddings.append(cur_vector)

        one_hot_vectors = to_categorical(batch_y, num_classes=len(index_2_embedding))
        yield np.array(embeddings), one_hot_vectors

In [11]:
# 5 points

# initialize your data_generator for both word and character data
# print out the shapes of the first batch to verify that it is correct for both word and character data

# word_embedding, word_index_embedding = read_embeddings('spooky_embedding_word.txt', word_tokenizer)
# word_generator = data_generator(word_x, word_y, 128, word_embedding)

# char_embedding, char_index_embedding = read_embeddings('spooky_embedding_char.txt', char_tokenizer)
# char_generator = data_generator(char_x, char_y, 128, char_embedding)

# sample = next(char_generator)
# print(sample[0].shape)

# steps_per_epoch = len(word_x)//128  # Number of batches per epoch

num_sequences_per_batch = 128 # this is the batch size
steps_per_epoch = len(word_x)//num_sequences_per_batch  # Number of batches per epoch
steps_per_epoch_char = len(char_x)//num_sequences_per_batch  # Number of batches per epoch


word_generator = data_generator(word_x, word_y, num_sequences_per_batch, word_embedding_index)
sample = next(word_generator)

char_generator = data_generator(char_x, char_y, num_sequences_per_batch, char_embedding_index)
sample = next(word_generator)


print(sample[0].shape)

print(sample[1].shape)

# Examples:
# num_sequences_per_batch = 128 # this is the batch size
# steps_per_epoch = len(sequences)//num_sequences_per_batch  # Number of batches per epoch
# train_generator = data_generator(X, y, num_sequences_per_batch)

# sample=next(train_generator) # this is how you get data out of generators
# sample[0].shape # (batch_size, (n-1)*EMBEDDING_SIZE)  (128, 200)
# sample[1].shape   # (batch_size, |V|) to_categorical


(128, 100)
(128, 25375)


### d) Train & __save__ your models (15 points)

In [12]:
# 15 points 

# code to train a feedforward neural language model for 
# both word embeddings and character embeddings
# make sure not to just copy + paste to train your two models
# (define functions as needed)

# train your models for between 3 & 5 epochs
# on Felix's machine, this takes ~ 24 min for character embeddings and ~ 10 min for word embeddings
# DO NOT EXPECT ACCURACIES OVER 0.5 (and even that is very for this many epochs)
# We recommend starting by training for 1 epoch

# Define your model architecture using Keras Sequential API
# Use the adam optimizer instead of sgd
# add cells as desired

# def create_feedforward_model():
#     model = Sequential()
#     model.add(Dense(units=3, input_dim=50, activation='relu'))
#     model.add(Dense(25375, activation='sigmoid'))
#     model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#     return model

# word_model = create_feedforward_model()
# char_model = create_feedforward_model()
def create_ff_model(units: int):
    model = Sequential()
    model.add(Dense(units=3, activation='relu', input_dim=100))
    model.add(Dense(units=units, activation='sigmoid'))
    model.summary()
    model.compile(optimizer='adam',  # You can choose an optimizer (e.g., 'adam', 'sgd')
                loss='categorical_crossentropy',  # Specify the loss function for classification
                metrics=['accuracy'])  # Optional: Specify metrics for evaluation
    return model

word_model = create_ff_model(25375)
char_model = create_ff_model(61)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 3)                 303       
                                                                 
 dense_1 (Dense)             (None, 25375)             101500    
                                                                 
Total params: 101803 (397.67 KB)
Trainable params: 101803 (397.67 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_2 (Dense)             (None, 3)                 303       
                                                                 
 dense_3 (Dense)             (None, 61)                244       
                                                            

In [13]:
# Here is some example code to train a model with a data generator
# model.fit(x=train_generator, 
#           steps_per_epoch=steps_per_epoch,
#           epochs=1)


# word_model.fit(x=word_generator, steps_per_epoch=steps_per_epoch, epochs=1)
# char_model.fit(x=char_generator, steps_per_epoch=steps_per_epoch, epochs=1)

word_model.fit(x=word_generator, 
          steps_per_epoch=steps_per_epoch-1,
          epochs=3)

char_model.fit(x=char_generator, 
          steps_per_epoch=steps_per_epoch_char-1,
          epochs=3)


Epoch 1/3
Epoch 2/3
Epoch 1/3
Epoch 2/3


<keras.src.callbacks.History at 0x2c27902e650>

In [14]:

# spooky data model by character for 5 epochs takes ~ 24 min on Felix's computer
# with adam optimizer, gets accuracy of 0.3920

# spooky data model by word for 5 epochs takes 10 min on Felix's computer
# results in accuracy of 0.2110


In [15]:
# save your trained models so you can re-load instead of re-training each time
# also, you'll need these to generate your sentences!

word_model.save('word_model.h5')
char_model.save('char_model.h5')


  saving_api.save_model(


### e) Generate Sentences (15 points)

In [16]:
# load your models if you need to


In [27]:
# 10 points
import random
# generate a sequence from the model until you get an end of sentence token
# This is an example function header you might use
def generate_seq(model: Sequential, 
                 tokenizer: Tokenizer, 
                 seed: list):
    '''
    Parameters:
        model: your neural network
        tokenizer: the keras preprocessing tokenizer
        seed: [w1, w2, w(n-1)]
    Returns: string sentence
    '''
    sentence = []
    sentence.extend(seed)

    while True:
        x = np.array([seed])
        y = model.predict(x, verbose=False)
        next_token = np.argmax(y)

        if next_token == 2:  # Check for the end of sentence token
            break

        sentence.append(next_token)
        seed = seed[1:] + [next_token]

    return sentence

word_generated = generate_seq(word_model, word_tokenizer, [random.randint(1, 2000)] * 100)
char_generated = generate_seq(char_model, char_tokenizer, [random.randint(1, 60)] * 100)
# prediction = 
# while prediction != word_tokenizer.word_index['</s>']:
#     prediction = np.argmax(model_word.predict(np.array([random.choices([n for n in range(29000)], k=100)])))
#     print(prediction)

In [25]:
# 5 points

# generate and display one sequence from both the word model and the character model
# do not include <s> or </s> in your displayed sentences
# make sure that you can read the output easily (i.e. don't just print out a list of tokens)
words = ""
for word in word_generated:
    words += " " + word_tokenizer.index_word[word]

chars = ''
for char in char_generated:
    chars += " " + char_tokenizer.index_word[char]

print(words)
print(chars)
# you may leave _ as _ or replace it with a space if you prefer

 , , , , , , , , , , , , , , , , , , , , , , ,
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ t _ _ t t t t t t t t _ _ n _ _ t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ n </s> h h f _ h h _ t t _ t t _ _ _ _ t </s> _ </s> t _ o _ _ h _ h _ _ _ _ _ _ o h h _ _ _ _ _ _ _ _ t _ t _ _ _ _ _ h h _ t _ _ _ _ _ t _ t t _ _ _ _ h h h


In [32]:
# generate 100 example sentences with each model and save them to a file, one sentence per line
# do not include <s> and </s> in your saved sentences (you'll use these sentences in your next task)
# this will produce two files, one for each model

for i in range(5):
    word_generated = generate_seq(word_model, word_tokenizer, [random.randint(1, 25374)] * 100)
    char_generated = generate_seq(char_model, char_tokenizer, [random.randint(1, 60)] * 100)

    words = ""
    for word in word_generated:
        words += " " + word_tokenizer.index_word[word]
    words += "\n"

    chars = ""
    for char in char_generated:
        chars += " " + char_tokenizer.index_word[char]
    chars += "\n"
    
    with open('sentencesWord.txt', 'a') as file:
        file.write(words)
    with open('sentencesChar.txt', 'a') as file:
        file.write(chars)