# Assignment 2: Text Classification with BERT

**Description:** This assignment covers various neural network architectures and components, largely used in the context of classification. You will compare Deep Averaging Networks, Deep Weighted Averaging Networks using Attention, and BERT-based models. You should also be able to develop an intuition for:

*   Working with the CLS token
*   Adding a DAN output head to a BERT model
*   Adding a CNN output head to a BERT model


The assignment notebook closely follows the lesson notebooks. We will use the IMDB dataset and will leverage some of the models, or part of the code, for our current investigation.

The initial part of the notebook is purely setup. We will then evaluate how Attention can make Deep Averaging networks better.

This notebook should be run on Google Colab with a GPU. By default, when you open the notebook in Colab it will try to use a GPU. Please note that the GPU is required for Section 3 but not for Sections 1 and 2.
Since Colab places constraints on its free GPU access, you might want to turn off GPU access (Edit -> Notebook Settings) until you get to Section 3. Total runtime of the entire notebook (with solutions and a Colab GPU) should be about 1h, with the majority of that time spent in Section 3.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2024-fall-main/blob/master/assignment/a2/Text_classification_BERT.ipynb)

The overall assignment structure is as follows:


0. Setup
  
  0.1 Libraries, Embeddings, & Helper Functions

  0.2 Data Acquisition

  0.3. Data Preparation

      0.3.1 Training/Test Sets for Word2Vec-based Models

      0.3.2 Training/Test Sets for BERT-based models


1. Classification with BERT

  1.1. BERT Basics

  1.2 CLS-Token-based Classification

  1.3 Averaging of BERT Outputs

  1.4. Adding a CNN on top of BERT



**INSTRUCTIONS:**

* Questions are always indicated as **QUESTION**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1.  Please do **not** remove the output from your notebooks when you submit them as we'll look at the output as well as your code for grading purposes.  We cannot award points if the output cells are empty.

* **### YOUR CODE HERE** indicates that you are supposed to write code.

* If you want to, you can run all of the cells in section 0 in bulk. This is setup work and there are no questions in it. At the end of section 0 we list all of the relevant variables that were defined and created there.

* Finally, unless otherwise indicated your validation accuracy will be 0.65 or higher if you have correctly implemented the model.



## 0. Setup

### 0.1. Libraries and Helper Functions

This notebook requires the TensorFlow dataset and other prerequisites that you must download.  This notebook uses Keras 2 and its functional API.  Do NOT change the version numbers in the pip install commands.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
#@title Installs
!pip install pydot --quiet
!pip install gensim --quiet
!pip install tensorflow==2.15.0 --quiet
!pip install tf_keras==2.15.0 --quiet
!pip install tensorflow-datasets==4.8 --quiet
!pip install tensorflow-text==2.15.0 --quiet
!pip install transformers==4.17 --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorstore 0.1.65 requires ml-dtypes>=0.3.1, but you have ml-dtypes 0.2.0 which is incompatible.
tf-ker

Now we are ready to do the imports.

In [3]:
#@title Imports

import numpy as np
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.layers import Embedding, Input, Dense, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K
import tensorflow_datasets as tfds
import tensorflow_text as tf_text
import transformers

from transformers import BertTokenizer, TFBertModel
from transformers import logging
logging.set_verbosity_error()

import sklearn as sk
import os
import nltk
from nltk.data import find

import matplotlib.pyplot as plt

import re

import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

In [4]:
def print_version(library_name):
    try:
        lib = __import__(library_name)
        version = getattr(lib, '__version__', 'Version number not found')
        print(f"{library_name} version: {version}")
    except ImportError:
        print(f"{library_name} not installed.")
    except Exception as e:
        print(f"An error occurred: {e}")


In [5]:
#confirm versions
print_version('numpy')
print_version('transformers')
print_version('tensorflow')
print_version('keras')

numpy version: 1.26.4
transformers version: 4.17.0
tensorflow version: 2.15.0
keras version: 2.15.0


Below is a helper function to plot histories. Before continuing, make sure the versions printed above are tensorflow 2.15.0, keras 2.15.0, and transformers 4.17.0.

In [6]:
#@title Plotting Function

# 4-window plot. Small modification from matplotlib examples.

def make_plot(axs,
              model_history1,
              model_history2,
              model_1_name='model 1',
              model_2_name='model 2',
              ):
    box = dict(facecolor='yellow', pad=5, alpha=0.2)

    for i, metric in enumerate(['loss', 'accuracy']):
        # small adjustment to account for the 2 accuracy measures in the Weighted Averaging Model with Attention
        if 'classification_%s' % metric in model_history2.history:
            metric2 = 'classification_%s' % metric
        else:
            metric2 = metric

        y_lim_lower1 = np.min(model_history1.history[metric])
        y_lim_lower2 = np.min(model_history2.history[metric2])
        y_lim_lower = min(y_lim_lower1, y_lim_lower2) * 0.9

        y_lim_upper1 = np.max(model_history1.history[metric])
        y_lim_upper2 = np.max(model_history2.history[metric2])
        y_lim_upper = max(y_lim_upper1, y_lim_upper2) * 1.1

        for j, model_history in enumerate([model_history1, model_history2]):
            model_name = [model_1_name, model_2_name][j]
            model_metric = [metric, metric2][j]
            ax1 = axs[i, j]
            ax1.plot(model_history.history[model_metric])
            ax1.plot(model_history.history['val_%s' % model_metric])
            ax1.set_title('%s - %s' % (metric, model_name))
            ax1.set_ylabel(metric, bbox=box)
            ax1.set_ylim(y_lim_lower, y_lim_upper)

Next, we get the word2vec model from NLTK to use as our embeddings.

In [7]:
#@title NLTK & Word2Vec

nltk.download('word2vec_sample')

word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))

wvmodel = KeyedVectors.load_word2vec_format(datapath(word2vec_sample), binary=False)

[nltk_data] Downloading package word2vec_sample to /root/nltk_data...
[nltk_data]   Unzipping models/word2vec_sample.zip.


Now that the embedding **model** is defined, let's see how many words are in the vocabulary:

In [8]:
len(wvmodel)

43981

What do the word vectors look like? As expected:

In [9]:
wvmodel['great'][:20]

array([ 0.0306035 ,  0.0886877 , -0.0121269 ,  0.0761965 ,  0.0566269 ,
       -0.0424702 ,  0.0410129 , -0.0497567 , -0.00364328,  0.0632889 ,
       -0.0142608 , -0.0791111 ,  0.0174877 , -0.0383064 ,  0.00926433,
        0.0295626 ,  0.0770293 ,  0.0949334 , -0.0428866 , -0.0295626 ],
      dtype=float32)

We can now build the embedding matrix and a vocabulary dictionary:

In [10]:
EMBEDDING_DIM = len(wvmodel['university'])      # we know... it's 300

# initialize embedding matrix and word-to-id map:
embedding_matrix = np.zeros((len(wvmodel) + 1, EMBEDDING_DIM))
vocab_dict = {}

# build the embedding matrix and the word-to-id map:
for i, word in enumerate(wvmodel.index_to_key):
    embedding_vector = wvmodel[word]

    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
        vocab_dict[word] = i

# we can use the last index at the end of the vocab for unknown tokens
vocab_dict['[UNK]'] = len(vocab_dict)

In [11]:
embedding_matrix.shape

(43982, 300)

In [12]:
embedding_matrix[:5, :5]

array([[ 0.0891758 ,  0.121832  , -0.0671959 ,  0.0477279 , -0.013659  ],
       [ 0.0526281 ,  0.013157  , -0.010104  ,  0.0540819 ,  0.0386715 ],
       [ 0.0786419 ,  0.0373911 , -0.0131472 ,  0.0347375 ,  0.0288273 ],
       [-0.00157585, -0.0564239 ,  0.00320281,  0.0422498 ,  0.15264399],
       [ 0.0356899 , -0.00367283, -0.065534  ,  0.0213832 ,  0.00788408]])

The last row consists of all zeros. We will use that for the UNK token, the placeholder token for unknown words.
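
To make the layout concrete, here is a toy version of the same construction. It is only a sketch: `toy_vectors`, `toy_matrix`, `toy_vocab`, and `lookup` are hypothetical names with made-up 2-dimensional vectors, not part of the notebook. It shows one row per known word, one extra all-zero row at the end, and `[UNK]` mapped to that last row.

```python
# Toy stand-in for wvmodel: hypothetical words and 2-d vectors (not real word2vec values).
toy_vectors = {"great": [0.1, 0.2], "movie": [0.3, 0.4], "bad": [0.5, 0.6]}
dim = 2

# One row per known word plus one extra all-zero row reserved for [UNK].
toy_matrix = [[0.0] * dim for _ in range(len(toy_vectors) + 1)]
toy_vocab = {}
for i, (word, vector) in enumerate(toy_vectors.items()):
    toy_matrix[i] = list(vector)
    toy_vocab[word] = i
toy_vocab["[UNK]"] = len(toy_vocab)  # index of the last (zero) row

def lookup(word):
    """Return the embedding row for a word, falling back to the [UNK] row."""
    return toy_matrix[toy_vocab.get(word, toy_vocab["[UNK]"])]

print(lookup("movie"))   # [0.3, 0.4]
print(lookup("unseen"))  # [0.0, 0.0]
```

Any word missing from the vocabulary therefore contributes a zero vector.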

### 0.2 Data Acquisition


We will use the IMDB dataset delivered as part of the TensorFlow-datasets library, and split into training and test sets. For expedience, we will limit ourselves in terms of train and test examples.

In [13]:
train_data, test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:80%]', 'test[80%:]'),
    as_supervised=True)

train_examples, train_labels = next(iter(train_data.batch(20000)))
test_examples, test_labels = next(iter(test_data.batch(5000)))

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


It is always highly recommended to look at the data. What do the records look like? Are they clean or do they contain a lot of cruft (potential noise)?

In [14]:
train_examples[:4]

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell a

In [15]:
train_labels[:4]

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([0, 0, 0, 1])>

For convenience, in this assignment we will define a sequence length and truncate all records at that length. For records that are shorter than our defined sequence length we will add padding tokens to ensure that our input shapes are consistent across all records.

In [16]:
MAX_SEQUENCE_LENGTH = 100
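
As a small illustration of the truncate-or-pad rule described above, the sketch below applies it to short lists of ids. The helper name `pad_or_truncate` and the id values are hypothetical; a length of 5 is used instead of 100 so the result is easy to read.

```python
def pad_or_truncate(ids, max_len, pad_id):
    """Cut a list of ids to max_len, or right-pad it with pad_id up to max_len."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

# Hypothetical ids; 99 plays the role of the padding id.
print(pad_or_truncate([5, 8, 2], max_len=5, pad_id=99))           # [5, 8, 2, 99, 99]
print(pad_or_truncate([5, 8, 2, 7, 1, 4], max_len=5, pad_id=99))  # [5, 8, 2, 7, 1]
```

Either way, every record ends up with exactly `max_len` entries, which is what lets the records be stacked into one rectangular array.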

### 0.3. Data Preparation

### 0.3.1. Training/Test Sets for Word2Vec-based Models

First, we tokenize the data:

In [17]:
tokenizer = tf_text.WhitespaceTokenizer()
train_tokens = tokenizer.tokenize(train_examples)
test_tokens = tokenizer.tokenize(test_examples)

Let's look at some tokens.  Do they look acceptable?

In [18]:
train_tokens[0]

<tf.Tensor: shape=(116,), dtype=string, numpy=
array([b'This', b'was', b'an', b'absolutely', b'terrible', b'movie.',
       b"Don't", b'be', b'lured', b'in', b'by', b'Christopher', b'Walken',
       b'or', b'Michael', b'Ironside.', b'Both', b'are', b'great',
       b'actors,', b'but', b'this', b'must', b'simply', b'be', b'their',
       b'worst', b'role', b'in', b'history.', b'Even', b'their', b'great',
       b'acting', b'could', b'not', b'redeem', b'this', b"movie's",
       b'ridiculous', b'storyline.', b'This', b'movie', b'is', b'an',
       b'early', b'nineties', b'US', b'propaganda', b'piece.', b'The',
       b'most', b'pathetic', b'scenes', b'were', b'those', b'when',
       b'the', b'Columbian', b'rebels', b'were', b'making', b'their',
       b'cases', b'for', b'revolutions.', b'Maria', b'Conchita',
       b'Alonso', b'appeared', b'phony,', b'and', b'her', b'pseudo-love',
       b'affair', b'with', b'Walken', b'was', b'nothing', b'but', b'a',
       b'pathetic', b'emotional', b

Yup... looks right. Of course we will need to take care of the encoding later.

Next, we define a simple function that converts the tokens above into the appropriate word2vec index values so we can retrieve the embedding vector associated with the word.   

In [19]:
def docs_to_vocab_ids(tokenized_texts_list):
    """
    Convert tokenized documents into a 2-D array of vocab ids,
    truncated or padded to MAX_SEQUENCE_LENGTH.
    """
    texts_vocab_ids = []
    for i, token_list in enumerate(tokenized_texts_list):

        # Get the vocab id for each token in this doc ([UNK] if not in vocab)
        vocab_ids = []
        for token in list(token_list.numpy()):
            decoded = token.decode('utf-8', errors='ignore')
            if decoded in vocab_dict:
                vocab_ids.append(vocab_dict[decoded])
            else:
                vocab_ids.append(vocab_dict['[UNK]'])

        # Truncate text to max length, add padding up to max length
        vocab_ids = vocab_ids[:MAX_SEQUENCE_LENGTH]
        n_padding = (MAX_SEQUENCE_LENGTH - len(vocab_ids))
        # For simplicity in this model, we'll just pad with unknown tokens
        vocab_ids += [vocab_dict['[UNK]']] * n_padding
        # Add this example to the list of converted docs
        texts_vocab_ids.append(vocab_ids)

        if i % 5000 == 0:
            print('Examples processed: ', i)

    # i is the last zero-based index, so this prints (number of examples - 1)
    print('Total examples: ', i)
    return np.array(texts_vocab_ids)

Now we can create training and test data that can be fed into the models of interest.  We need to convert all of the tokens into their respective input ids.

In [20]:
train_input_ids = docs_to_vocab_ids(train_tokens)
test_input_ids = docs_to_vocab_ids(test_tokens)

train_input_labels = np.array(train_labels)
test_input_labels = np.array(test_labels)

Examples processed:  0
Examples processed:  5000
Examples processed:  10000
Examples processed:  15000
Total examples:  19999
Examples processed:  0
Total examples:  4999


Let's convince ourselves that the data looks correct:

In [21]:
train_input_ids[:2]

array([[21531, 25272, 12291,  7427, 37254, 43981,  6891, 12917, 38232,
        16915, 12929, 16182, 43981, 20526, 23487, 43981, 23807, 42958,
        35058, 43981, 19123, 35029, 41270, 29275, 12917, 32597, 20659,
          638, 16915, 43981,   174, 32597, 35058, 39971,  2326,  3636,
        22434, 35029, 43981, 33922, 43981, 21531, 34710, 16908, 12291,
        36880, 28137,  5376, 28038, 43981, 15402, 29155, 18063, 24951,
        17433, 17595,  8856, 14193, 43981, 43248, 17433,  6290, 32597,
         9001, 11511, 43981, 21807, 39168, 43981, 16856, 43981, 43981,
        23245, 43981,  8889,  1331, 43981, 25272, 31976, 19123, 43981,
        18063, 36309, 24099, 16915, 43981, 34710, 36633, 25272, 20413,
        43981, 33458, 14926, 43981, 12139, 12289, 39617, 36633,  9483,
        42958],
       [12139,  7841, 19666, 31757, 43981, 17853, 25745, 15445, 43981,
        19123, 35029, 16908, 21113, 21068, 43981, 43981,  5668, 43981,
        33456, 43981, 34554, 43981,  1200, 27498, 43981, 1880

### 0.3.2. Training/Test Sets for BERT-based models

We already imported the BERT model and the Tokenizer libraries. Now, let's load the pretrained BERT model and tokenizer. Always make sure to load the tokenizer that goes with the model you're going to use.

In [22]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bert_model = TFBertModel.from_pretrained('bert-base-cased')


Next, we will preprocess our train and test data for use in the BERT model. We need to convert our documents into vocab IDs, like we did above with the Word2Vec vocabulary. But this time we'll use the BERT tokenizer, which has a different vocabulary specific to the BERT model we're going to use.

In [23]:
#@title BERT Tokenization of training and test data

train_examples_str = [x.decode('utf-8') for x in train_examples.numpy()]
test_examples_str = [x.decode('utf-8') for x in test_examples.numpy()]

bert_train_tokenized = bert_tokenizer(train_examples_str,
              max_length=MAX_SEQUENCE_LENGTH,
              truncation=True,
              padding='max_length',
              return_tensors='tf')
bert_train_inputs = [bert_train_tokenized.input_ids,
                     bert_train_tokenized.token_type_ids,
                     bert_train_tokenized.attention_mask]
bert_train_labels = np.array(train_labels)

bert_test_tokenized = bert_tokenizer(test_examples_str,
              max_length=MAX_SEQUENCE_LENGTH,
              truncation=True,
              padding='max_length',
              return_tensors='tf')
bert_test_inputs = [bert_test_tokenized.input_ids,
                     bert_test_tokenized.token_type_ids,
                     bert_test_tokenized.attention_mask]
bert_test_labels = np.array(test_labels)

Overall, here are the key variables and sets that we encoded for word2vec and BERT and that may be used moving forward. If the variable naming does not make it obvious, we also state the purpose:

#### Parameters:

* MAX_SEQUENCE_LENGTH (100)


#### BERT:


* bert_train(/test)_inputs: list of input_ids, token_type_ids and attention_mask for the training(/test) sets for BERT models
* bert_train(/test)_labels: the corresponding labels for BERT

**NOTE:** We recommend you inspect these variables if you have not gone through the code.


## 1. BERT-based Classification Models

Now we turn to classification with BERT. We will perform classifications with various models that are based on pre-trained BERT models.  If you have turned off GPU access, make sure you change the Notebook settings so you can access a GPU again.


### 1.1. Basics

Let us first explore some basics of BERT.

We've already loaded the pretrained BERT model and tokenizer that we'll use ('bert-base-cased').

Now, consider this input:

In [24]:
test_input = ['this bank is closed on Sunday', 'the steepest bank of the river is dangerous']

Apply the BERT tokenizer to tokenize it:

In [25]:
tokenized_input = bert_tokenizer(test_input,
                                 max_length=12,
                                 truncation=True,
                                 padding='max_length',
                                 return_tensors='tf')

tokenized_input

{'input_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[ 101, 1142, 3085, 1110, 1804, 1113, 3625,  102,    0,    0,    0,
           0],
       [ 101, 1103, 9458, 2556, 3085, 1104, 1103, 2186, 1110, 4249,  102,
           0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]], dtype=int32)>}

 **QUESTION:**

 1.a  Why do the attention_masks have 4 and 1 zeros, respectively?  Choose the correct one and enter it in the answers file.

  *  For the first example the last four tokens belong to a different segment. For the second one it is only the last token.

  *  For the first example 4 positions are padded while for the second one it is only one.

------
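
To see how `input_ids` and `attention_mask` line up, the sketch below rebuilds the mask from the ids printed above. It assumes that id 0 is the `[PAD]` token, as the printed output suggests. This is purely illustrative, since the tokenizer already returns the mask.

```python
PAD_ID = 0  # assumed id of the [PAD] token, as seen in the input_ids printed above

# input_ids copied from the tokenizer output above
input_ids = [
    [101, 1142, 3085, 1110, 1804, 1113, 3625, 102, 0, 0, 0, 0],
    [101, 1103, 9458, 2556, 3085, 1104, 1103, 2186, 1110, 4249, 102, 0],
]

# 1 for real tokens (including [CLS]=101 and [SEP]=102), 0 for padding positions
attention_mask = [[int(tok != PAD_ID) for tok in row] for row in input_ids]
for row in attention_mask:
    print(row, "zeros:", row.count(0))
```

The rebuilt rows can be compared with the `attention_mask` shown in the tokenizer output.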


Next, let us look at the BERT outputs for these 2 sentences:

In [26]:
### YOUR CODE HERE

# bert_output = ...

bert_output = bert_model(tokenized_input)

### END YOUR CODE

 **QUESTION:**

 1.b How many outputs are there?

 Enter your code below.

In [27]:
### YOUR CODE HERE

#b. -> print it out

print(bert_output)

### END YOUR CODE

TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=<tf.Tensor: shape=(2, 12, 768), dtype=float32, numpy=
array([[[ 0.39452153,  0.04198513,  0.06480418, ...,  0.05045479,
          0.22358865,  0.24238198],
        [-0.09458949,  0.06673875, -0.03607525, ...,  0.21925776,
         -0.06967184,  0.7444838 ],
        [ 0.00561068,  0.31316522, -0.179827  , ...,  0.19563268,
         -0.10614735,  0.47773603],
        ...,
        [ 0.22268742, -0.11558606,  0.15854394, ...,  0.30025312,
          0.01634067,  0.51333976],
        [ 0.31638372, -0.10986984,  0.23661843, ...,  0.10924138,
         -0.14340357,  0.328354  ],
        [ 0.34834057, -0.10076538,  0.26903224, ...,  0.12707612,
         -0.18430144,  0.26176235]],

       [[ 0.44506398,  0.22264998, -0.09972468, ..., -0.23736233,
          0.12722543,  0.07778168],
        [ 0.07407635, -0.31805816, -0.11924681, ..., -0.06680164,
         -0.30617064,  0.46923554],
        [ 0.31458062,  0.6265879 ,  0.00606293, ..

**QUESTION:**

 1.c Which output do we need to use to get token-level embeddings?

 * the first

 * the second

 Put your answer in the answers file.

**QUESTION:**

 1.d In the tokenized input, which input_id number (i.e. the vocabulary id) corresponds to 'bank' in the two sentences? ('bert_tokenizer.tokenize()' may come in handy.. and don't forget the CLS token! )


**QUESTION:**

 1.e In the array of tokens, which position index number corresponds to 'bank' in the first sentence? ('bert_tokenizer.tokenize()' may come in handy.. and don't forget the CLS token! )

In [28]:
### YOUR CODE HERE

#d/e. -> Look at tokens generated by the bert tokenizer for the first example

# Tokenize the word 'bank'
bank_token = bert_tokenizer.tokenize("bank")
print(f"Token for 'bank': {bank_token}")

# Find the input ID for 'bank'
bank_id = bert_tokenizer.convert_tokens_to_ids(bank_token)
print(f"Input ID for 'bank': {bank_id}")

# Tokenize first input sentence
tokenized_input_sentence1 = bert_tokenizer.tokenize("this bank is closed on Sunday")

# Find position index
position_index = tokenized_input_sentence1.index("bank")
print(f"Position index of 'bank' in the first sentence: {position_index}")

### END YOUR CODE

Token for 'bank': ['bank']
Input ID for 'bank': [3085]
Position index of 'bank' in the first sentence: 1


**QUESTION:**

1.f Which array position index number corresponds to 'bank' in the second sentence?

In [29]:
### YOUR CODE HERE

#f. -> Look at tokenization for the second example

# Tokenize second sentence
tokenized_input_sentence2 = bert_tokenizer.tokenize("the steepest bank of the river is dangerous")

# Find position index
position_index_2 = tokenized_input_sentence2.index("bank")
print(f"Position index of 'bank' in the second sentence: {position_index_2}")

### END YOUR CODE

Position index of 'bank' in the second sentence: 3
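
Note that `bert_tokenizer.tokenize()` returns only the word pieces and does not add `[CLS]` or `[SEP]`. A position found in its output is therefore one less than the position of the same token in `input_ids`, and in the BERT output computed from them. As a cross-check, the sketch below searches the `input_ids` printed earlier for the id of 'bank' (3085):

```python
CLS_ID, BANK_ID = 101, 3085  # ids taken from the outputs above

# input_ids copied from the tokenizer output above (with [CLS], [SEP], and padding)
input_ids = [
    [101, 1142, 3085, 1110, 1804, 1113, 3625, 102, 0, 0, 0, 0],
    [101, 1103, 9458, 2556, 3085, 1104, 1103, 2186, 1110, 4249, 102, 0],
]

for n, row in enumerate(input_ids, start=1):
    assert row[0] == CLS_ID    # position 0 holds [CLS] in both rows
    pos = row.index(BANK_ID)   # position within the model input
    print(f"sentence {n}: 'bank' is at input position {pos} "
          f"(position {pos - 1} in the tokenize() output)")
```

When indexing into the token-level BERT output, the input position (2 and 4 here) is the one that lines up with 'bank'.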


**QUESTION:**

 1.g What is the cosine similarity between the BERT embeddings for the two occurrences of 'bank' in the two sentences?

In [30]:
### YOUR CODE HERE

#g.  -> get the vectors and calculate cosine similarity between the two 'bank' BERT embeddings

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import tensorflow as tf

# BERT embeddings
bert_output = bert_model(tokenized_input)

# Extract embeddings
embedding_bank_1 = bert_output[0][0][position_index]
embedding_bank_2 = bert_output[0][1][position_index_2]

# Reshape embeddings
embedding_bank_1 = tf.reshape(embedding_bank_1, (1, -1))
embedding_bank_2 = tf.reshape(embedding_bank_2, (1, -1))

# Calculate cosine similarity
cos_sim = cosine_similarity(embedding_bank_1, embedding_bank_2)[0][0]
print(f"Cosine similarity between the two 'bank' embeddings: {cos_sim}")

### END YOUR CODE

Cosine similarity between the two 'bank' embeddings: 0.5775851607322693


**QUESTION:**

1.h How does this relate to the cosine similarity of 'this' (in sentence 1) and the first 'the' (in sentence 2)? Compute their cosine similarity.


In [31]:
### YOUR CODE HERE

#h.  -> get the vectors and calculate cosine similarity

# Extract embeddings
embedding_this = bert_output[0][0][1]
embedding_the = bert_output[0][1][1]

# Reshape embeddings
embedding_this = tf.reshape(embedding_this, (1, -1))
embedding_the = tf.reshape(embedding_the, (1, -1))

# Calculate cosine similarity
cosine_sim = cosine_similarity(embedding_this, embedding_the)[0][0]

print(f"Cosine similarity between 'this' in sentence 1 and 'the' in sentence 2: {cosine_sim}")

### END YOUR CODE

Cosine similarity between 'this' in sentence 1 and 'the' in sentence 2: 0.811026930809021
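
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below spells that out in plain Python. The helper name `cosine_similarity_1d` and the vectors are hypothetical, and it is not a replacement for the scikit-learn function used above.

```python
import math

def cosine_similarity_1d(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-d vectors, not BERT outputs.
print(cosine_similarity_1d([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity_1d([1.0, 0.0, 0.0], [0.0, 2.0, 0.0]))  # 0.0 (orthogonal)
```

Values close to 1 mean the two vectors point in nearly the same direction, regardless of their magnitudes.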


### 2 CLS-Token-based Classification

In the live session we discussed classification with BERT using the pooled token. We now will do the same but extract the [CLS] token output for each example and use that for classification purposes.

Consult the model from the live session and change accordingly. Make sure the BERT model is fully trainable.

**HINT:**
You will want to extract the output of the [CLS] token from the BERT output similarly to what we did above to get the output for 'bank', etc.
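
The token-level BERT output has shape `(batch, sequence, hidden)`, and `[CLS]` sits at sequence position 0. The sketch below mimics the tensor slice `[:, 0, :]` on a small nested list. The values are hypothetical and the shapes are tiny so the result can be checked by eye.

```python
# Hypothetical BERT-like output: 2 examples, 3 token positions, hidden size 2.
last_hidden_state = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # example 0: [CLS], token, token
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],  # example 1
]

# Equivalent of the tensor slice last_hidden_state[:, 0, :]:
# keep every example, take token position 0, keep every hidden dimension.
cls_vectors = [example[0] for example in last_hidden_state]
print(cls_vectors)  # [[0.1, 0.2], [0.7, 0.8]]
```

The result has one vector per example, which is the shape a dense classification layer expects.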


In [32]:
def create_bert_cls_model(bert_base_model,
                          max_sequence_length=MAX_SEQUENCE_LENGTH,
                          hidden_size = 100,
                          dropout=0.3,
                          learning_rate=0.00005):
    """
    Build a simple classification model with BERT. Use the CLS Token output for classification purposes.
    """

    ### YOUR CODE HERE

    # Input layers
    input_ids = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int32, name="attention_mask")

    # BERT model
    bert_output = bert_base_model(input_ids, attention_mask=attention_mask)

    # Extract the [CLS] token output
    cls_token = bert_output.last_hidden_state[:, 0, :]

    # Add Dense layer
    hidden_layer = tf.keras.layers.Dense(hidden_size, activation='relu')(cls_token)

    # Apply dropout
    dropout_layer = tf.keras.layers.Dropout(dropout)(hidden_layer)

    # Output layer: a single unit with sigmoid activation for binary classification
    output = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(dropout_layer)

    # Define model
    classification_model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)

    # Compile model
    classification_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                                 loss='binary_crossentropy',
                                 metrics=['accuracy'])

    ### END YOUR CODE

    return classification_model

Now create the model and train for 2 epochs. Use batch size 8 and the appropriate validation/test set. (We don't make a distinction here between validation and test although we might in other contexts.)


In [33]:
### YOUR CODE HERE

# Create model
bert_cls_model = create_bert_cls_model(bert_base_model=bert_model,
                                       max_sequence_length=MAX_SEQUENCE_LENGTH,
                                       hidden_size=100,
                                       dropout=0.3,
                                       learning_rate=0.00005)

# Train model
history = bert_cls_model.fit(
    x={'input_ids': bert_train_inputs[0], 'attention_mask': bert_train_inputs[2]},
    y=bert_train_labels,
    validation_data=(
        {'input_ids': bert_test_inputs[0], 'attention_mask': bert_test_inputs[2]},
        bert_test_labels
    ),
    epochs=2,
    batch_size=8
)

# Print results
evaluation_results = bert_cls_model.evaluate(
    x={'input_ids': bert_test_inputs[0], 'attention_mask': bert_test_inputs[2]},
    y=bert_test_labels
)
print(f"Test Loss: {evaluation_results[0]}")
print(f"Test Accuracy: {evaluation_results[1]}")

### END YOUR CODE

Epoch 1/2




Epoch 2/2
Test Loss: 0.45227837562561035
Test Accuracy: 0.7972000241279602


 **QUESTION:**

2.a What is the final validation accuracy that you observed for the [CLS]-classification model after training for 2 epochs? (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.5678 or 0.8765)




### 3 Classification by Averaging the BERT outputs

Instead of using only the output vector for the [CLS] token, we will now average the output vectors from BERT for all of the tokens in the full sequence.

**HINT:**
You will want to get the full sequence of token output vectors from the BERT model and then apply an average across the tokens. You may want to use:

```
tf.math.reduce_mean()
```
but you can also do it in other ways.
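
To illustrate what averaging over the token axis does, here is a plain-Python sketch for a single example. Both helper names and all values are hypothetical. The masked variant, which skips padded positions, is an optional alternative and is not something the assignment asks for.

```python
def mean_over_tokens(sequence):
    """Average a list of token vectors into one vector (a mean over the token axis)."""
    n_tokens = len(sequence)
    hidden = len(sequence[0])
    return [sum(tok[d] for tok in sequence) / n_tokens for d in range(hidden)]

def masked_mean_over_tokens(sequence, mask):
    """Optional variant: average only the positions where mask == 1."""
    kept = [tok for tok, m in zip(sequence, mask) if m == 1]
    return mean_over_tokens(kept)

# Hypothetical output for one example: 3 positions, hidden size 2; last position is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [8.0, 0.0]]
mask = [1, 1, 0]

print(mean_over_tokens(tokens))               # [4.0, 2.0]
print(masked_mean_over_tokens(tokens, mask))  # [2.0, 3.0]
```

With a `(batch, sequence, hidden)` tensor, `tf.math.reduce_mean(..., axis=1)` performs the unmasked version for every example at once and returns a `(batch, hidden)` tensor.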



In [34]:
def create_bert_avg_model(bert_a_model,
                          max_sequence_length=MAX_SEQUENCE_LENGTH,
                          hidden_size = 100,
                          dropout=0.3,
                          learning_rate=0.00005):
    """
    Build a simple classification model with BERT. Use the average of the BERT output tokens
    """

    ### YOUR CODE HERE

    # Input layers
    input_ids = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int32, name="attention_mask")

    # BERT model
    bert_output = bert_a_model(input_ids, attention_mask=attention_mask)

    # Extract token outputs
    token_outputs = bert_output.last_hidden_state

    # Average token outputs
    avg_output = tf.math.reduce_mean(token_outputs, axis=1)

    # Add Dense layer
    hidden_layer = tf.keras.layers.Dense(hidden_size, activation='relu')(avg_output)

    # Apply dropout
    dropout_layer = tf.keras.layers.Dropout(dropout)(hidden_layer)

    # Output layer: a single unit with sigmoid activation for binary classification
    output = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(dropout_layer)

    # Define model
    classification_model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)

    # Compile model
    classification_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                                 loss='binary_crossentropy',
                                 metrics=['accuracy'])

    ### END YOUR CODE

    return classification_model

Now create the model and train it for 2 epochs. Use a batch size of 8 and the appropriate validation/test set. (As before, we do not distinguish between validation and test data here.) Remember that all layers of the BERT model should be trainable.

In [35]:
### YOUR CODE HERE

# Create model
bert_avg_model = create_bert_avg_model(bert_a_model=bert_model,
                                       max_sequence_length=MAX_SEQUENCE_LENGTH,
                                       hidden_size=100,
                                       dropout=0.3,
                                       learning_rate=0.00005)

# Train model
history = bert_avg_model.fit(
    x={'input_ids': bert_train_inputs[0], 'attention_mask': bert_train_inputs[2]},
    y=bert_train_labels,
    validation_data=(
        {'input_ids': bert_test_inputs[0], 'attention_mask': bert_test_inputs[2]},
        bert_test_labels
    ),
    epochs=2,
    batch_size=8
)

# Print results
evaluation_results = bert_avg_model.evaluate(
    x={'input_ids': bert_test_inputs[0], 'attention_mask': bert_test_inputs[2]},
    y=bert_test_labels
)
print(f"Test Loss: {evaluation_results[0]}")
print(f"Test Accuracy: {evaluation_results[1]}")

### END YOUR CODE

Epoch 1/2
Epoch 2/2
Test Loss: 0.5324796438217163
Test Accuracy: 0.823199987411499


 **QUESTION:**

3.a What is the final validation accuracy that you observed for the BERT-averaging-classification model after training for 2 epochs? (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.5678 or 0.8765)
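
The value asked for here is the validation accuracy logged at the end of the second epoch. Assuming the `history` object returned by `fit()` follows the usual Keras convention, where `history.history` is a dict of per-epoch lists and compiling with `metrics=['accuracy']` plus passing `validation_data` yields a `val_accuracy` key, it could be read as sketched below. The helper name `final_val_accuracy` and the numbers in the stand-in dict are made up for illustration and are not results from this notebook.

```python
def final_val_accuracy(history_dict, key="val_accuracy"):
    """Return the last-epoch value of a per-epoch metric list."""
    values = history_dict[key]
    if not values:
        raise ValueError(f"no epochs recorded for {key!r}")
    return values[-1]

# Stand-in for `history.history`; the numbers are placeholders, not real results.
example_history = {
    "loss": [0.61, 0.42],
    "accuracy": [0.70, 0.81],
    "val_loss": [0.50, 0.47],
    "val_accuracy": [0.78, 0.80],
}

print(f"{final_val_accuracy(example_history):.4f}")
# In the notebook, the call would be: final_val_accuracy(history.history)
```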




### 4 Adding a CNN on top of BERT

Can we also combine advanced architectures? Absolutely! In the end we are dealing with tensors and it does not matter whether they are coming from static word2vec embeddings or context-based BERT embeddings. (Whether we want to is another question, but let's try it here.)


**HINT:**
You should appropriately stitch together the BERT-based components and the CNN components from the lesson notebook. Remember that BERT provides a sequence of contextualized token embeddings as its main output, and a CNN takes a sequence of vectors as input.

Use the provided hyperparameters for CNN filter sizes and numbers of filters. Keep the same hyperparameters for the rest of the model, including a dropout layer and dense layer after the CNN, with the provided dropout rate and hidden_size. Again make sure the entire BERT model is trainable.
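
To make the stitching concrete, the NumPy sketch below traces the shapes through one multi-kernel convolution block: each branch slides a kernel over the token axis with 'same' zero padding, max-pools over the sequence, and the pooled vectors are concatenated. The helper `conv1d_same_relu`, the random weights, and the small toy sizes are illustrative assumptions. The sketch mirrors the shapes that `Conv1D(padding='same')` followed by `GlobalMaxPooling1D` would produce; it is not the Keras implementation.

```python
import numpy as np

def conv1d_same_relu(x, kernel, bias):
    """1D convolution over the sequence axis with 'same' zero padding and ReLU.

    x:      (seq_len, in_dim) token vectors for one example
    kernel: (kernel_size, in_dim, filters)
    bias:   (filters,)
    """
    k = kernel.shape[0]
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    padded = np.pad(x, ((pad_left, pad_right), (0, 0)))  # zero-pad the sequence axis only
    out = np.stack([
        np.tensordot(padded[t:t + k], kernel, axes=([0, 1], [0, 1])) + bias
        for t in range(x.shape[0])
    ])                                                   # (seq_len, filters)
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
seq_len, in_dim = 12, 8                # toy sizes (BERT-base, for example, has hidden size 768)
num_filters = [131, 127, 51, 23, 17]   # hyperparameters from the assignment
kernel_sizes = [2, 3, 4, 5, 7]

tokens = rng.normal(size=(seq_len, in_dim))
pooled = []
for n, k in zip(num_filters, kernel_sizes):
    kernel = rng.normal(size=(k, in_dim, n))
    feature_map = conv1d_same_relu(tokens, kernel, np.zeros(n))  # (seq_len, n)
    pooled.append(feature_map.max(axis=0))                       # global max pool -> (n,)

features = np.concatenate(pooled)
print(features.shape)  # (349,) = 131 + 127 + 51 + 23 + 17
```

The concatenated vector, 349 features per example with these filter counts, is what the dense and dropout layers after the CNN receive.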

In [36]:
def create_bert_cnn_model(bert_cnn_model,
                          max_sequence_length=MAX_SEQUENCE_LENGTH,
                          num_filters = [131, 127, 51, 23, 17],
                          kernel_sizes = [2, 3, 4, 5, 7],
                          dropout = 0.3,
                          hidden_size = 275, #100
                          learning_rate=0.00005):
    """
    Build a classification model with BERT, where you apply CNN layers to the BERT output
    """

    ### YOUR CODE HERE

    # Input layers
    input_ids = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int32, name="attention_mask")

    # BERT model
    bert_output = bert_cnn_model(input_ids, attention_mask=attention_mask)

    # Extract token outputs
    token_outputs = bert_output.last_hidden_state

    # Apply CNN layers with multiple kernel sizes
    conv_layers = []
    for num_filter, kernel_size in zip(num_filters, kernel_sizes):
        conv_layer = tf.keras.layers.Conv1D(filters=num_filter,
                                            kernel_size=kernel_size,
                                            activation='relu',
                                            padding='same')(token_outputs)
        pooled_layer = tf.keras.layers.GlobalMaxPooling1D()(conv_layer)
        conv_layers.append(pooled_layer)

    # Concatenate all CNN outputs
    concatenated_output = tf.keras.layers.Concatenate()(conv_layers)

    # Add Dense layer
    hidden_layer = tf.keras.layers.Dense(hidden_size, activation='relu')(concatenated_output)

    # Apply Dropout
    dropout_layer = tf.keras.layers.Dropout(dropout)(hidden_layer)

    # Output layer with sigmoid activation for binary classification
    output_layer = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(dropout_layer)

    # Define complete model
    classification_model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output_layer)

    # Compile model
    classification_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                                 loss='binary_crossentropy',
                                 metrics=['accuracy'])

    ### END YOUR CODE

    return classification_model

Train this model for 2 epochs as well, with a mini-batch size of 8:

In [37]:
### YOUR CODE HERE

# Create model
bert_cnn_model = create_bert_cnn_model(bert_cnn_model=bert_model,
                                       max_sequence_length=MAX_SEQUENCE_LENGTH,
                                       num_filters=[131, 127, 51, 23, 17],
                                       kernel_sizes=[2, 3, 4, 5, 7],
                                       dropout=0.3,
                                       hidden_size=275,
                                       learning_rate=0.00005)

# Train model
history = bert_cnn_model.fit(
    x={'input_ids': bert_train_inputs[0], 'attention_mask': bert_train_inputs[2]},
    y=bert_train_labels,
    validation_data=(
        {'input_ids': bert_test_inputs[0], 'attention_mask': bert_test_inputs[2]},
        bert_test_labels
    ),
    epochs=2,
    batch_size=8
)

# Print results
evaluation_results = bert_cnn_model.evaluate(
    x={'input_ids': bert_test_inputs[0], 'attention_mask': bert_test_inputs[2]},
    y=bert_test_labels
)
print(f"Test Loss: {evaluation_results[0]}")
print(f"Test Accuracy: {evaluation_results[1]}")

### END YOUR CODE

Epoch 1/2
Epoch 2/2
Test Loss: 0.694669246673584
Test Accuracy: 0.8027999997138977


In [38]:
from google.colab import drive
drive.flush_and_unmount()

 **QUESTION:**

4.a What is the final validation accuracy that you observed for the BERT-CNN-classification model after 2 epochs?  (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.5678 or 0.8765)


## Congratulations... You are done!
## We hope you learned a ton about how BERT works!