For this homework, make sure that you format your notebook nicely and cite all sources in the appropriate sections. Programmatically generate or embed any figures or graphs that you need.

Names: __Zhen Guo and Vikram Chowdhary__

## DO THIS

Step 1: Train your own word embeddings
--------------------------------

We chose to use the provided Spooky Author dataset. It contains text from works of fiction written by "spooky authors" in the public domain: Edgar Allan Poe, HP Lovecraft, and Mary Shelley. The features in this dataset are:
- id - a unique identifier for each sentence
- text - some text written by one of the authors
- author - the author of the sentence (EAP: Edgar Allan Poe; HPL: HP Lovecraft; MWS: Mary Wollstonecraft Shelley)

The training portion of this dataset has 19579 texts, and the testing portion has 8392.

Describe what data set you have chosen to compare and contrast with your chosen provided dataset. Make sure to describe where it comes from and its general properties.

The dataset we selected was found on Kaggle, and consists of 50,000 IMDB reviews. https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [1]:
# import your libraries here
import re

import nltk
import pandas as pd
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')
# !pip install gensim

[nltk_data] Downloading package wordnet to /Users/zhenguo/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### 0) Pre-processing and text-normalization

The following pre-processing steps are inspired by https://towardsdatascience.com/text-normalization-for-natural-language-processing-nlp-70a314bfa646.

We also pre-processed the data so that each sentence begins with a < s> token and ends with a < /s> token, inspired by this answer: https://stackoverflow.com/questions/37605710/tokenize-a-paragraph-into-sentence-and-then-into-words-in-nltk

In [2]:
# expand English contractions using regular expressions
# code from https://gist.github.com/yamanahlawat/4443c6e9e65e74829dbb6b47dd81764a

# replacement strings are raw strings so that \g<1> is not treated
# as an (invalid) string escape sequence
replacement_patterns = [
  (r'won\'t', 'will not'),
  (r'can\'t', 'cannot'),
  (r'i\'m', 'i am'),
  (r'ain\'t', 'is not'),
  (r'(\w+)\'ll', r'\g<1> will'),
  (r'(\w+)n\'t', r'\g<1> not'),
  (r'(\w+)\'ve', r'\g<1> have'),
  (r'(\w+)\'s', r'\g<1> is'),
  (r'(\w+)\'re', r'\g<1> are'),
  (r'(\w+)\'d', r'\g<1> would')
]

patterns = [(re.compile(regex), repl) for (regex, repl) in replacement_patterns]

def replace(text):
    s = text
    for (pattern, repl) in patterns:
        s = pattern.sub(repl, s)
    return s

In [3]:
def process_text(text):
    """
    Tokenize a paragraph into sentences. Each sentence starts with <s> and
    ends with </s>, and its words are tokenized and normalized.
    Returns one flat list of tokens for the whole paragraph.
    """
    sent_text = nltk.sent_tokenize(text) # this gives us a list of sentences

    # stemmer and lemmatizer are created once and reused for every sentence
    ps = SnowballStemmer("english")
    lemmatizer = WordNetLemmatizer()

    # now loop over each sentence and tokenize it separately
    s = []
    for sentence in sent_text:
        # expand contractions
        sentence = replace(sentence)
        # tokenize sentence
        tokenized_text = nltk.word_tokenize(sentence)

        new_sent = ['<s>']
        for word in tokenized_text:
            # skip punctuation and other non-alphabetic tokens
            if not word.isalpha():
                continue
            # stemming:
            word = ps.stem(word)
            # lemmatizing
            word = lemmatizer.lemmatize(word)
            new_sent.append(word)

        # add begin and end
        new_sent.append('</s>')
        s.extend(new_sent)
    return s

def process_data(series):
    # returns text in this format:
    # data = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
    # 			['this', 'is', 'the', 'second', 'sentence'],
    # 			['yet', 'another', 'sentence'],
    # 			['one', 'more', 'sentence'],
    # 			['and', 'the', 'final', 'sentence']]
    sentences = []
    for _,row in series.items():
        sentences.append(process_text(row))
    
    return sentences

In [4]:
# nltk.download('omw-1.4')

# Read the files and prepare the data in the format
# shown in process_data (a list of token lists)
spooky_authors_train = pd.read_csv('./spooky-author-identification/train.csv')
spooky_authors_test = pd.read_csv('./spooky-author-identification/test.csv')
df_imdb = pd.read_csv('IMDB_dataset.csv')

given_data_train = process_data(spooky_authors_train['text'])
given_data_test = process_data(spooky_authors_test['text'])
our_data = process_data(df_imdb['review'])

In [7]:
# # save the variables for later use
# # should be commented out before submission
# import pickle

# # with open('our_data.pickle', 'wb') as f:
# #     pickle.dump(our_data, f)

# with open('our_data.pickle', 'rb') as file: 
#     # Call load method to deserialize
#     our_data = pickle.load(file)

### a) Train embeddings on GIVEN dataset

In [5]:
from gensim.models import Word2Vec

# The dimension of word embedding. 
# This variable will be used throughout the program
# you may vary this as you desire
EMBEDDINGS_SIZE = 200

# Train the Word2Vec model from Gensim. 
# Below are the hyperparameters that are most relevant, 
# but feel free to explore other options too:
# sg = 1
# window = 5
# vector_size = EMBEDDINGS_SIZE  (named `size` before gensim 4.0)
# min_count = 1
# train model on spooky authors training data
model = Word2Vec(sentences = given_data_train, 
                 vector_size = EMBEDDINGS_SIZE,
                 sg = 1, 
                 window = 5,  
                 min_count = 1)

In [6]:
# if you save your Word2Vec as the variable model, this will 
# print out the vocabulary size
# model.wv.vocab was removed in gensim 4.0, so the original line fails:
# print('Vocab size {}'.format(len(model.wv.vocab)))
# https://stackoverflow.com/questions/35596031/gensim-word2vec-find-number-of-words-in-vocabulary
print('Vocab size {}'.format(len(model.wv)))

Vocab size 14996


In [7]:
# You can save file in txt format, then load later if you wish.
# model.wv.save_word2vec_format('embeddings.txt', binary=False)

### b) Train embedding on YOUR dataset

In [8]:
# then do a second data set
# given data is roughly 70/30 train/test
our_data_train = our_data[:35000]
our_data_test = our_data[35000:]
our_model = Word2Vec(sentences = our_data_train, 
                     vector_size = EMBEDDINGS_SIZE,
                     sg = 1, 
                     window = 5,  
                     min_count = 1)

In [9]:
print('Vocab size {}'.format(len(our_model.wv)))

Vocab size 56862


In [12]:
# You can save file in txt format, then load later if you wish.
# our_model.wv.save_word2vec_format('imdb_embeddings.txt', binary=False)

What text-normalization and pre-processing did you do and why? 

__We processed the text so that each sentence begins with < s > and ends with < /s >, in the hope that explicit sentence boundaries help the model generate sentences more accurately. We also expanded contractions using regular expressions, for example "we're" to "we are". Because the embeddings are the model's input, mapping words that have the same meaning to a single form keeps their occurrences from being split across several vectors. For the same reason we also stemmed and lemmatized the words.__
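
To make the contraction expansion and sentence markers concrete, here is a minimal sketch that uses only the standard library. It assumes a trimmed-down rule list and a plain whitespace split in place of NLTK tokenization, stemming, and lemmatization, so `expand` and `mark_sentence` are illustrative names and not the functions used in this notebook.

```python
import re

# A few contraction rules (illustrative subset), written as raw strings.
RULES = [
    (re.compile(r"won't"), "will not"),
    (re.compile(r"can't"), "cannot"),
    (re.compile(r"(\w+)n't"), r"\g<1> not"),
    (re.compile(r"(\w+)'re"), r"\g<1> are"),
]

def expand(text: str) -> str:
    """Apply each contraction rule in order."""
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

def mark_sentence(sentence: str) -> list:
    """Expand contractions, keep alphabetic tokens, add <s> and </s>."""
    tokens = [t for t in expand(sentence.lower()).split() if t.isalpha()]
    return ["<s>"] + tokens + ["</s>"]

print(mark_sentence("We're sure it won't rain"))
```

The specific rules (`won't`, `can't`) come before the generic `n't` rule so that they are not rewritten as "wo not" or "ca not".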

Step 2: Evaluate the differences between the word embeddings
----------------------------

(make sure to include graphs, figures, and paragraphs with full sentences)

In [13]:
# model.wv.index_to_key
# get 10 most common and uncommon words/vectors? plot difference between them?

Get 10 most frequent words. Inspired by https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial

In [10]:
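from collections import defaultdict  # For word frequency

from nltk.corpus import stopwords

nltk.download('stopwords')
# Our tokens are plain strings, so spaCy's token.is_stop is not available
# ('str' object has no attribute 'is_stop'). Filter with NLTK's English
# stop-word list and the sentence markers instead. The tokens are stemmed,
# so stop words whose stem differs from the listed form are not filtered.
STOP = set(stopwords.words('english')) | {'<s>', '</s>'}

def top10(data):
    word_freq = defaultdict(int)
    for sentence in data:
        for token in sentence:
            if token not in STOP:
                word_freq[token] += 1
    return sorted(word_freq, key=word_freq.get, reverse=True)[:10]

top_spooky = top10(given_data_train)
top_movie = top10(our_data)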

In [None]:
top_movie

In [None]:
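# NOTE: these keys come from the Google News example this cell is adapted from.
# Our vocabulary is lowercased and stemmed, so keep only keys that are present;
# swap in words from the corpus (for example top_spooky) for a meaningful plot.
keys = ['Paris', 'Python', 'Sunday', 'Tolstoy', 'Twitter', 'bachelor', 'delivery', 'election', 'expensive',
        'experience', 'financial', 'food', 'iOS', 'peace', 'release', 'war']
keys = [word for word in keys if word in model.wv.key_to_index]

embedding_clusters = []
word_clusters = []
for word in keys:
    embeddings = []
    words = []
    # in gensim 4.x, similarity queries and vector lookups go through model.wv
    for similar_word, _ in model.wv.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(model.wv[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)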
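
gensim's `most_similar` ranks neighbours by cosine similarity between word vectors. A minimal NumPy sketch of that ranking on a made-up three-word table follows; the vectors and the helper `nearest` are illustrative and are not taken from either trained model.

```python
import numpy as np

# Toy embedding table; real vectors would come from model.wv.
vectors = {
    "ghost": np.array([1.0, 0.0]),
    "spirit": np.array([0.9, 0.1]),
    "movie": np.array([0.0, 1.0]),
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(nearest("ghost"))
```

Because cosine similarity ignores vector length, two words can be close neighbours even when one is far more frequent than the other.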

In [None]:
from sklearn.manifold import TSNE
import numpy as np

embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)

In [None]:

import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline


def tsne_plot_similar_words(title, labels, embedding_clusters, word_clusters, a, filename=None):
    plt.figure(figsize=(16, 9))
    colors = cm.rainbow(np.linspace(0, 1, len(labels)))
    for label, embeddings, words, color in zip(labels, embedding_clusters, word_clusters, colors):
        x = embeddings[:, 0]
        y = embeddings[:, 1]
        plt.scatter(x, y, color=color, alpha=a, label=label)
        for i, word in enumerate(words):
            plt.annotate(word, alpha=0.5, xy=(x[i], y[i]), xytext=(5, 2),
                         textcoords='offset points', ha='right', va='bottom', size=8)
    plt.legend(loc=4)
    plt.title(title)
    plt.grid(True)
    if filename:
        plt.savefig(filename, format='png', dpi=150, bbox_inches='tight')
    plt.show()


tsne_plot_similar_words('Similar words from the Spooky Author embeddings', keys, embeddings_en_2d, word_clusters, 0.7,
                        'similar_words.png')

## Write down your analysis:

Cite your sources:
-------------

Step 3: Feedforward Neural Language Model
--------------------------

### a) First, encode your text into integers

In [12]:
# Importing utility functions from Keras
from keras.preprocessing.text import Tokenizer

# Initializing a Tokenizer
# It is used to vectorize a text corpus. Here, it just creates a mapping from 
# word to a unique index. (Note: indexing starts from 1; index 0 is reserved)
# Example:
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(data)
# encoded = tokenizer.texts_to_sequences(data)


2022-11-27 17:35:04.941779: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [13]:
spooky_tokenizer = Tokenizer()
imdb_tokenizer = Tokenizer()

# spooky authors data
# use our own tokenization (the pre-processed token lists)
spooky_train_list = given_data_train
spooky_tokenizer.fit_on_texts(spooky_train_list)
spooky_train_encoded = spooky_tokenizer.texts_to_sequences(spooky_train_list)

# our data: use the same pre-processed tokens that the IMDB embeddings were
# trained on, so every word the tokenizer knows also has an embedding
imdb_train_list = our_data_train
imdb_tokenizer.fit_on_texts(imdb_train_list)
imdb_train_encoded = imdb_tokenizer.texts_to_sequences(imdb_train_list)
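
The Keras `Tokenizer` assigns indices by descending word frequency, starting at 1 (index 0 is reserved). The sketch below mimics that mapping in plain Python so the encoding step can be inspected without Keras; `build_word_index` and `encode` are illustrative helpers, not Keras functions, and the two toy sentences are made up.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each token to a 1-based index, most frequent token first."""
    counts = Counter(tok for sent in sentences for tok in sent)
    # sorted() is stable, so ties keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {w: i for i, w in enumerate(ranked, start=1)}

def encode(sentences, word_index):
    """Replace every token by its integer index."""
    return [[word_index[t] for t in sent] for sent in sentences]

data = [["<s>", "the", "raven", "</s>"], ["<s>", "the", "night", "</s>"]]
word_index = build_word_index(data)
print(word_index)
print(encode(data, word_index))
```

Since the smallest index is 1, any array indexed by these ids needs either `len(word_index) + 1` slots or an explicit shift by one.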

### b) Next, prepare your sequences from text

#### Fixed ngram based sequences 

In [14]:
import itertools

In [15]:

def generate_ngram_training_samples(ngram_len: int, data: list) -> list:
    '''
    Takes the tokenized data (list of lists) and 
    generates the training samples out of it.
    Parameters:
    ngram_len: length n of each n-gram
    data: list of token lists, one per text
    return: 
    list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    '''
    # All texts are flattened into one stream before windowing, so some
    # n-grams span a text boundary (for example ['</s>', '<s>']).

    combined_text = list(itertools.chain.from_iterable(data))
    final_ngrams = []
    for idx in range(len(combined_text) - ngram_len + 1):
        ngram_list = list(combined_text[idx:idx+ngram_len])
        final_ngrams.append(ngram_list)
        
    return final_ngrams
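
A sketch of the same sliding-window idea on toy data, including the split into contexts and targets. It also shows a side effect of flattening: windows cross sentence boundaries, producing pairs such as `['</s>', '<s>']`. The helper `ngrams` and the toy sentences are stand-alone illustrations.

```python
from itertools import chain

def ngrams(n, sentences):
    """Slide a window of length n over the flattened token stream."""
    flat = list(chain.from_iterable(sentences))
    return [flat[i:i + n] for i in range(len(flat) - n + 1)]

toy = [["<s>", "a", "b", "</s>"], ["<s>", "c", "</s>"]]
bigrams = ngrams(2, toy)

# split every n-gram into its context (first n-1 tokens) and its target
X = [gram[:-1] for gram in bigrams]
y = [gram[-1] for gram in bigrams]

print(bigrams)
print(X[0], y[0])
```

A stream of T tokens yields T - n + 1 windows, which is why the loop bound is `len(flat) - n + 1`.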


### c) Then, split the sequences into X and y and create a Data Generator

In [16]:
# Note here that the sequences were in the form: 
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate it into [[x1, x2, ... , x(n-1)], ...], [y1, y2, ...]
def split_ngrams(ngram_list: list) -> tuple:
    x = [] # these are the context words that we need to get embeddings for
    y = []
    for ngram in ngram_list:
        y.append(ngram[-1])
        x.append(ngram[:-1])
    return x, y

In [17]:
def read_embeddings(text, embeddings, tokenizer):
    '''Looks up the embeddings trained earlier for each word in text.

    Parameters:
    text: a list of pre-processed words (not the raw list of lists)
    embeddings: the trained gensim Word2Vec model
    tokenizer: the fitted Keras Tokenizer, used for its word_index

    Returns two dicts:
    word to embedding : {'the':[0....], ...}
    index to embedding : {1:[0....], ...}
    '''
    word_to_embedding = dict()
    index_to_embedding = dict()
    tok_w_i = tokenizer.word_index
    for word in text:
        # the data is already pre-processed, so words can be looked up directly
        if word not in word_to_embedding:
            word_to_embedding[word] = embeddings.wv[word]
            index_to_embedding[tok_w_i[word]] = embeddings.wv[word]
                
    return word_to_embedding, index_to_embedding

In [18]:
# a, b = read_embeddings(spooky_train_list[0], model, spooky_tokenizer)

In [19]:
import numpy as np
from keras.utils import to_categorical

def data_generator(X: list, y: list, num_sequences_per_batch: int, embeddings, tokenizer):
    '''
    Returns data generator to be used by feed_forward
    https://wiki.python.org/moin/Generators
    https://realpython.com/introduction-to-python-generators/
    
    Yields batches of embeddings and one-hot labels to go with them.

    X, y: context words and target words as returned by split_ngrams
    num_sequences_per_batch: the batch size (number of n-grams per yield)

    The generator makes a single pass over the data and drops a trailing
    partial batch.
    '''
    tok_w_i = tokenizer.word_index
    vocab_size = len(tok_w_i)

    cur_idx = 0
    while cur_idx <= len(X) - num_sequences_per_batch:
        X_temp = X[cur_idx:cur_idx + num_sequences_per_batch]
        y_temp = y[cur_idx:cur_idx + num_sequences_per_batch]

        X_out = []
        y_out = []
        for context, word_y in zip(X_temp, y_temp):
            # concatenate the embeddings of the n-1 context words, in order,
            # so each row has length (n-1) * EMBEDDINGS_SIZE
            w_2_e, _ = read_embeddings(context, embeddings, tokenizer)
            X_out.append(np.concatenate([w_2_e[word] for word in context]))

            # one-hot label; word_index is 1-based, so shift to a 0-based
            # position (the original unshifted index could run past the end)
            y_vect = [0] * vocab_size
            y_vect[tok_w_i[word_y] - 1] = 1
            y_out.append(y_vect)

        cur_idx += num_sequences_per_batch
        yield np.array(X_out, dtype=np.float32), np.array(y_out)
        



In [20]:
# The size of the ngram language model you want to train
# change as needed for your experiments
N_GRAM = 2

In [21]:
# Examples
spooky_n_gram_temp = generate_ngram_training_samples(N_GRAM, spooky_train_list)[:200]
X, y = split_ngrams(spooky_n_gram_temp)

# initialize data_generator
num_sequences_per_batch = 20 # batch size (small for this quick check; e.g. 128 for real training)
steps_per_epoch = len(spooky_n_gram_temp)//num_sequences_per_batch  # Number of batches per epoch
train_generator = data_generator(X, y, num_sequences_per_batch, model, spooky_tokenizer)

sample=next(train_generator) # this is how you get data out of generators
sample[0].shape # (batch_size, (n-1)*EMBEDDINGS_SIZE), here (20, 200)
sample[1].shape # (batch_size, |V|) one-hot labels

(20, 14996)

In [22]:
sample[0][0].shape

(200,)

In [23]:
sample[0].shape

(20, 200)

In [24]:
sample[1].shape

(20, 14996)
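
The shapes above follow from how a batch is assembled: each context contributes n-1 embeddings concatenated into one row, and each label becomes a one-hot row of vocabulary size. A NumPy sketch with a random stand-in embedding table follows; the toy sizes and the `make_batch` helper are assumptions for illustration only.

```python
import numpy as np

EMB, VOCAB = 4, 6  # toy embedding size and vocabulary size
rng = np.random.default_rng(0)
# row i is the (random, stand-in) embedding of word id i
table = rng.normal(size=(VOCAB, EMB)).astype(np.float32)

def make_batch(contexts, targets):
    """Build (batch, (n-1)*EMB) inputs and (batch, VOCAB) one-hot labels."""
    X = np.stack([np.concatenate([table[i] for i in ctx]) for ctx in contexts])
    y = np.zeros((len(targets), VOCAB), dtype=np.float32)
    y[np.arange(len(targets)), targets] = 1.0
    return X, y

# three trigram samples: two context ids each, one target id each
X, y = make_batch([[0, 1], [1, 2], [5, 5]], [2, 3, 0])
print(X.shape, y.shape)
```

Note that a context with a repeated word (`[5, 5]`) still produces a full-width row, because embeddings are looked up per position rather than per distinct word.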

### d) Train your models

Code to train a feedforward neural language model on a set of given word embeddings.

Make sure not to just copy + paste to train your two models.

Sources used:
https://pyimagesearch.com/2021/05/06/implementing-feedforward-neural-networks-with-keras-and-tensorflow/

https://www.tensorflow.org/api_docs/python/tf/keras/Model

https://medium.com/analytics-vidhya/understanding-embedding-layer-in-keras-bbe3ff1327ce

In [25]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN
from keras.layers import Embedding
from keras.layers import Input
from keras.layers import Flatten

In [26]:
len(spooky_tokenizer.word_index)+1

14997

In [27]:
len(model.wv)

14996

In [28]:
hidden_units = 20

In [31]:
# Define the model architecture using Keras Sequential API
spooky_nn_model = Sequential()

# input_dim is vocab size, 
# embedding_layer = Embedding(input_dim=len(model.wv),output_dim=(N_GRAM-1)*EMBEDDINGS_SIZE, input_length=N_GRAM-1)
# spooky_nn_model.add(embedding_layer)

spooky_nn_model.add(Dense(hidden_units, input_shape=((N_GRAM-1)*EMBEDDINGS_SIZE,), 
                          activation='sigmoid'))
# hidden layer
# spooky_nn_model.add(Dense(units=(N_GRAM-1)*num_sequences_per_batch, activation='sigmoid', 
#                          input_shape=(EMBEDDINGS_SIZE,(N_GRAM-1))))

# spooky_nn_model.add(Dense(EMBEDDINGS_SIZE, activation='sigmoid'))
                    
# output layer: one unit per vocabulary word; softmax gives a
# probability distribution over the next word
# x = Input(shape=((N_GRAM-1)*EMBEDDINGS_SIZE,))
spooky_nn_model.add(Dense(len(model.wv), activation='softmax'))
# spooky_nn_model.add(Dense((N_GRAM-1), activation='softmax'))


# configure the learning process
# The generator yields one-hot labels of shape (batch, |V|), so the loss is
# categorical_crossentropy. sparse_categorical_crossentropy expects integer
# class ids and fails here with
# "logits and labels must have the same first dimension".
spooky_nn_model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

In [93]:
# use pre-trained word embedding
# inspired by tutorial : https://keras.io/examples/nlp/pretrained_word_embeddings/

In [32]:
# Start training the model
# re-create the generator, since one batch was already consumed by next() above
train_generator = data_generator(X, y, num_sequences_per_batch, model, spooky_tokenizer)
spooky_nn_model.fit(x=train_generator, 
                    steps_per_epoch=steps_per_epoch,
                    epochs=1)
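
Keras offers two cross-entropy losses that differ only in the label format they expect: `categorical_crossentropy` takes one-hot rows of shape (batch, |V|), while `sparse_categorical_crossentropy` takes integer class ids of shape (batch,). The NumPy sketch below computes both by hand on made-up probabilities to show that they agree when the labels describe the same classes; the function names are illustrative.

```python
import numpy as np

# predicted distributions for two samples over a 3-word vocabulary (made up)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
ids = np.array([0, 2])       # sparse labels, shape (batch,)
one_hot = np.eye(3)[ids]     # the same labels as one-hot rows, shape (batch, 3)

def categorical_ce(y_one_hot, p):
    """Mean cross-entropy from one-hot labels."""
    return float(np.mean(-np.sum(y_one_hot * np.log(p), axis=1)))

def sparse_ce(y_ids, p):
    """Mean cross-entropy from integer class ids."""
    return float(np.mean(-np.log(p[np.arange(len(y_ids)), y_ids])))

print(categorical_ce(one_hot, probs), sparse_ce(ids, probs))
```

Passing one-hot rows to the sparse variant makes it treat every entry as a separate class id, which is how a (20, 14996) label batch turns into 299920 labels.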

### e) Generate Sentences

In [None]:
# generate a sequence from the model
def generate_seq(model: Sequential, 
                 tokenizer: Tokenizer, 
                 seed: list, 
                 n_words: int):
    '''
    Parameters:
        model: your neural network
        tokenizer: the keras preprocessing tokenizer
        seed: [w1, w2, w(n-1)]
        n_words: generate a sentence of length n_words
    Returns: string sentence
    '''
    pass
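
One possible shape for the generation loop, sketched without Keras. A hypothetical `predict_next` callable stands in for `model.predict` and returns a probability distribution over word ids given the most recent context ids. The toy bigram table, the id-to-word map, and greedy argmax decoding are all assumptions made for illustration; sampling from the distribution would be an equally valid choice.

```python
import numpy as np

# hypothetical vocabulary (0-based ids for this toy example)
index_word = {0: "<s>", 1: "the", 2: "raven", 3: "</s>"}
word_index = {w: i for i, w in index_word.items()}

# Toy bigram model: row i is P(next word | current word i).
BIGRAM = np.array([
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.0, 0.8, 0.2],
    [0.0, 0.1, 0.0, 0.9],
    [1.0, 0.0, 0.0, 0.0],
])

def predict_next(context_ids):
    """Stand-in for the trained network: distribution over the next id."""
    return BIGRAM[context_ids[-1]]

def generate(seed, n_words, context_len=1):
    """Greedily extend `seed` until n_words tokens or an end marker."""
    ids = [word_index[w] for w in seed]
    while len(ids) < n_words:
        probs = predict_next(ids[-context_len:])
        ids.append(int(np.argmax(probs)))
        if index_word[ids[-1]] == "</s>":
            break
    return " ".join(index_word[i] for i in ids)

print(generate(["<s>"], n_words=10))
```

With a real network, `predict_next` would first turn the context ids into concatenated embeddings, exactly as the data generator does for training batches.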

### f) Compare your generated sentences

In [37]:
def accuracy(y, y_hat):
    """
    Measure the accuracy of our model, print the results.
    Parameters:
    y (array): true binary labels (0 or 1)
    y_hat (array): model estimates, thresholded at 0.5
    Returns:
    None
    """
    count = 0
    for i in range(len(y)):
        guess = 1 if y_hat[i] > 0.5 else 0
        if guess == y[i]:
            count += 1
    print("Accuracy:", count / len(y))
Sources Cited
----------------------------
