# Exercise Sheet 6 – Natural Language Processing with BERT 

 * Deep Learning – Winter term 2019/20
 * Instructor: Alexander Ecker
 * Tutor: Lindrit Kqiku <kqiku@cs.uni-goettingen.de>
 * Due date: Feb 3, 2020 at noon


## Introduction

The aim of this exercise is to build a model that can read textual data and make predictions about the *sentiment* of that text. We will use the IMDB movie data set to design and implement a model that is able to differentiate between positive and negative reviews of a movie.

By completing this exercise, you will be able to
- Understand and apply transfer learning techniques in NLP
- Use state-of-the-art embedding techniques as part of the embedding layer



# IMPORTANT SUBMISSION INSTRUCTIONS

- When you're done, download the notebook and rename it to \<surname1\>_\<surname2\>_\<surname3\>.ipynb
- Only submit the ipynb file, no other file is required
- Submit only once
- The deadline is strict
- You are required to present your solution in the tutorial; submission of the notebook alone is not sufficient

# Setup and Dataset


## Using Colab GPU for Training 

A GPU can be added by going to the menu and selecting 


```
Edit --> Notebook Settings --> Hardware accelerator --> (GPU)
```



In [1]:
import torch

# Check whether a GPU is available and select the device accordingly
train_on_gpu = torch.cuda.is_available()
device = torch.device("cuda" if train_on_gpu else "cpu")

if train_on_gpu:
    print('Training on GPU.')
    print('There is/are %d GPU(s) available.' % torch.cuda.device_count())
    print()
    print("We will use the GPU: ", torch.cuda.get_device_name(0))
    print("with the following properties: ")
    print(torch.cuda.get_device_properties(0))
else:
    print('No GPU available, training on CPU instead.')

Training on GPU.
There is/are 1 GPU(s) available.

We will use the GPU:  Tesla T4
with the following properties: 
_CudaDeviceProperties(name='Tesla T4', major=7, minor=5, total_memory=15079MB, multi_processor_count=40)


## Installing the Hugging Face Library 

Next, let's install the transformers package from Hugging Face, which gives us a PyTorch interface to implementations of state-of-the-art embedding layers. The library contains interfaces for pretrained language models such as BERT, XLNet, OpenAI's GPT, and GPT-2.

**More details about the library**
> [GitHub page of the library](https://github.com/huggingface/transformers)

> [Paper](https://arxiv.org/abs/1910.03771v3)





In [2]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ee/fc/bd726a15ab2c66dc09306689d04da07a3770dad724f0883f0a4bfb745087/transformers-2.4.1-py3-none-any.whl (475kB)

Getting the data can be done automatically or manually.

## Downloading Datasets and Creating Folders

Executing the following cells runs a Python script and a shell script that download the dataset needed for this task and create the data folders. *NOTE: You do not need to understand the code in this section to complete the exercise.*

If you are using Google Colab, just execute the cells below. 
Otherwise, please follow these steps:
1.   Create a Python script named download.py and copy the contents of the download.py cell below into it. Note: You will need to install the requests library (pip install requests)
2.   Create a shell script by copy-pasting the shell cell that follows it, and execute it



Alternatively, you can follow the link below and create the directory structure by yourself

> Save the data: [Imdb Data set](https://drive.google.com/file/d/13RliAESnCKvPA7_6PUUBYqUJB_ecMp7s/view?usp=sharing) in the folder: /content/data/imdb-ds-rating






In [14]:
%%shell
pip install requests





In [3]:
%%writefile download.py
# Creates a python script named download.py to download our datasets 
# from google drive files
#
# CREDITS: [1] https://stackoverflow.com/a/39225039
#          [2] Natural Language Processing with PyTorch - Build Intelligent Language Applications Using Deep Learning - Delip Rao & Brian McMahan


import requests
def progress_bar(some_iter):
    try:
        from tqdm import tqdm
        return tqdm(some_iter)
    except ModuleNotFoundError:
        return some_iter

def download_file_from_google_drive(id, destination):
    print("Downloading {}".format(destination))

    def get_confirm_token(response):
        for key, value in response.cookies.items():
            if key.startswith('download_warning'):
                return value

        return None

    def save_response_content(response, destination):
        CHUNK_SIZE = 32768

        with open(destination, "wb") as f:
            for chunk in progress_bar(response.iter_content(CHUNK_SIZE)):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)

    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)


if __name__ == "__main__":
    import sys
    if len(sys.argv) != 3:
        print("Usage: python download.py drive_file_id destination_file_path")
    else:
        # TAKE ID FROM SHAREABLE LINK
        file_id = sys.argv[1]
        # DESTINATION FILE ON YOUR DISK or CLOUD
        destination = sys.argv[2]
        download_file_from_google_drive(file_id, destination)


Writing download.py


In [4]:
%%shell
#! /bin/bash

# For each file, a download.py line is added to call the previous script
# Any additional processing on the downloaded file
HERE="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

# IMDB Reviews Dataset
mkdir -p $HERE/data/imdb-ds
if [ ! -f $HERE/data/imdb-ds/IMDB_Dataset.npz ]; then
    python download.py 10qLP8pckM_oTs2cqN48Km-whkOBBXj4c $HERE/data/imdb-ds/IMDB_Dataset.npz
fi

Downloading /content/data/imdb-ds/IMDB_Dataset.npz
1985it [00:00, 2006.02it/s]




In [5]:
ls /content/data/imdb-ds/ -alh

total 63M
drwxr-xr-x 2 root root 4.0K Jan 31 20:56 ./
drwxr-xr-x 3 root root 4.0K Jan 31 20:55 ../
-rw-r--r-- 1 root root  63M Jan 31 20:56 IMDB_Dataset.npz


In [0]:
# !rm /content/data/imdb-ds/IMDB_Dataset.npz

# Sentiment Analysis

## Data Pre-processing

In [0]:
# Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm

In [0]:
# load the training data
file_name = '/content/data/imdb-ds/IMDB_Dataset.npz'
data = np.load(file_name, allow_pickle=True)
reviews, labels = data['sentences'], data['labels']

### TODO: Explore the dataset
- What is its shape?
- Is the dataset balanced?
- Visualize the lengths of some of the reviews


In [8]:
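print(len(data))     # number of arrays stored in the .npz file
print(reviews.size)  # number of reviews
print(labels.size)   # number of labels

# Balanced means the same number of positive and negative reviews in this case.
# One class is encoded as 1 and the other as 0, so the sum counts the reviews labeled 1.
sum(labels)  # half of the reviews are positive, the other half negative

len(reviews[1])  # length of the second review in characters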

2
50000
50000


962
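The exploration questions above (balance and review lengths) can be approached along the following lines. This is only a minimal sketch on hypothetical toy data: `toy_reviews` and `toy_labels` are stand-ins for the `reviews` and `labels` arrays, and the encoding 1 = positive, 0 = negative is an assumption.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy stand-ins for the `reviews` and `labels` arrays loaded above.
toy_reviews = ["great film", "terrible", "loved every minute of it", "not my cup of tea"]
toy_labels = [1, 0, 1, 0]  # assumed encoding: 1 = positive, 0 = negative

# Class balance: count how often each label occurs.
label_counts = Counter(toy_labels)

# Review length in characters (what `len(reviews[1])` measures for a string).
char_lengths = [len(r) for r in toy_reviews]
# Review length in whitespace-separated words.
word_lengths = [len(r.split()) for r in toy_reviews]

print("label counts:", dict(label_counts))
print("mean / median length in characters:", mean(char_lengths), median(char_lengths))
print("longest review in words:", max(word_lengths))
```

On the real data, the same length lists could be passed to a histogram plot to visualize the distribution.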


## BERT: Tokenization & Input Formatting

We will use the excellent [Transformers library](https://github.com/huggingface/transformers) by Hugging Face to work with BERT.

The first step is to tokenize the reviews and bring them into the format that BERT expects. This includes

- Tokenization
- Adding special tokens: [CLS], [SEP]
- Trimming sentences to maximum length
- Padding [PAD] in case of shorter sentences

Documentation of `BertTokenizer`: https://huggingface.co/transformers/model_doc/bert.html#berttokenizer
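To make the input format concrete, the four steps listed above can be imitated in plain Python. This is an illustrative sketch, not the library's implementation: `toy_encode` and `toy_vocab` are hypothetical, and whitespace splitting replaces the WordPiece tokenization that `BertTokenizer` actually performs. The special-token IDs (0, 100, 101, 102) are those of `bert-base-uncased`.

```python
# Special token IDs as used by bert-base-uncased.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102

# Hypothetical toy vocabulary; the real tokenizer uses a WordPiece vocabulary.
toy_vocab = {"the": 1996, "movie": 3185, "was": 2001, "funny": 6057}

def toy_encode(text, max_len):
    """Lower-case, split on whitespace, add [CLS]/[SEP], truncate, and pad."""
    token_ids = [toy_vocab.get(tok, UNK_ID) for tok in text.lower().split()]
    # Reserve two positions for [CLS] and [SEP] before truncating.
    token_ids = token_ids[: max_len - 2]
    ids = [CLS_ID] + token_ids + [SEP_ID]
    # Pad on the right up to max_len.
    ids = ids + [PAD_ID] * (max_len - len(ids))
    # Attention mask: 1 for real tokens, 0 for padding.
    mask = [1 if i != PAD_ID else 0 for i in ids]
    return ids, mask

padded_ids, padded_mask = toy_encode("The movie was funny", max_len=8)
trimmed_ids, _ = toy_encode("The movie was funny", max_len=4)
print(padded_ids)   # [101, 1996, 3185, 2001, 6057, 102, 0, 0]
print(padded_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
print(trimmed_ids)  # [101, 1996, 3185, 102]
```

Note that truncation removes content tokens but keeps both special tokens, so every sequence still starts with `[CLS]` and ends its real content with `[SEP]`.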


In [9]:
# Get rid of Colab warning about Tensorflow 2.0
%tensorflow_version 1.x

from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading the BERT tokenizer...')

# We will use bert-base-uncased model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Loading the BERT tokenizer...


HBox(children=(IntProgress(value=0, description='Downloading', max=231508, style=ProgressStyle(description_wid…




In [10]:
# Maximum length of a sequence
MAX_LEN = 128

# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []

# For every sentence...
for review in tqdm(reviews):
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad to maximum length if the sequence is shorter
    sequence = tokenizer.encode_plus(
                    review,                      # Review to encode.
                    add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                    max_length=MAX_LEN,
                    pad_to_max_length=True,
    )
    input_ids.append(sequence['input_ids'])

input_ids = np.array(input_ids)

100%|██████████| 50000/50000 [03:56<00:00, 211.71it/s]


In [23]:
# Print example sentence 10, now as a list of IDs.
i = 10
print('Original: ', reviews[i])
print('Token IDs:', input_ids[i])

Original:  Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines.At first it was very odd and pretty funny but as the movie progressed I didn't find the jokes or oddness funny anymore.Its a low budget film (thats never a problem in itself), there were some pretty interesting characters, but eventually I just lost interest.I imagine this film would appeal to a stoner who is currently partaking.For something similar but better try "Brother from another planet"
Token IDs: [  101  6316  1996  7344  2003  2028  1997  2216 21864 15952  3152  2073
  1996 17211  2003  2241  2105  1996  5976  2791  1997  2673  2738  2084
  5025  8595 12735  1012  2012  2034  2009  2001  2200  5976  1998  3492
  6057  2021  2004  1996  3185 12506  1045  2134  1005  1056  2424  1996
 13198  2030  5976  2791  6057  4902  1012  2049  1037  2659  5166  2143
  1006  2008  2015  2196  1037  3291  1999  2993  1007  1010  2045  2020
  2070  3

## Training, Validation, Test Split

Hold out 1000-2000 reviews each for the validation set and the test set. Use the rest for training.

In [11]:
# First draw 4000 random indices between 0 and 49999 (inclusive), then split them into test and validation sets at the end.
# All indices that are not among them go into the training set.

## ToDo: Split data into training, validation, and test data (features and labels, x and y)

import random
random.seed(1234)  # for reproducibility
num = random.sample(range(50000), k=4000)
mask = np.ones(50000, dtype=bool)
mask[num] = False

train_features, train_labels = input_ids[mask,...], labels[mask,...]
test_features, test_labels = input_ids[num[0:2000]], labels[num[0:2000]]
val_features, val_labels = input_ids[num[2000:]], labels[num[2000:]]


## ToDo: Print out the shapes of your split train, validation, and test sets
print(train_features.shape, train_labels.shape)
print(test_features.shape, test_labels.shape)
print(val_features.shape, val_labels.shape)

(46000, 128) (46000,)
(2000, 128) (2000,)
(2000, 128) (2000,)
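A quick way to check that a split like this does not leak data is to verify that the three index sets are disjoint and together cover the whole dataset. The helper below is a hypothetical sketch (`split_indices` is not part of the exercise) that follows the same sampling idea as the cell above.

```python
import random

def split_indices(n_total, n_val, n_test, seed=1234):
    """Draw disjoint validation/test indices; everything else is training."""
    rng = random.Random(seed)
    # Sampling without replacement guarantees unique held-out indices.
    held_out = rng.sample(range(n_total), k=n_val + n_test)
    test_idx = held_out[:n_test]
    val_idx = held_out[n_test:]
    held_out_set = set(held_out)
    train_idx = [i for i in range(n_total) if i not in held_out_set]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(50000, n_val=2000, n_test=2000)

# Sanity checks: no overlap between the sets, and nothing is lost.
assert set(val_idx).isdisjoint(test_idx)
assert set(train_idx).isdisjoint(val_idx)
assert set(train_idx).isdisjoint(test_idx)
assert len(train_idx) + len(val_idx) + len(test_idx) == 50000

print(len(train_idx), len(val_idx), len(test_idx))  # 46000 2000 2000
```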


## Create DataLoader

You can use `TensorDataset` to create a dataset holding (reviews, labels).

In [0]:
# The square brackets are not needed: wrapping an array in a list would add
# a leading dimension of size 1, so TensorDataset would see a single sample
train_features = torch.tensor(train_features)
train_labels = torch.tensor(train_labels)

test_features = torch.tensor(test_features)
test_labels = torch.tensor(test_labels)

val_features = torch.tensor(val_features)
val_labels = torch.tensor(val_labels)

In [0]:
import torch.utils.data

train_dataset = torch.utils.data.TensorDataset(train_features, train_labels)
test_dataset = torch.utils.data.TensorDataset(test_features, test_labels)
val_dataset = torch.utils.data.TensorDataset(val_features, val_labels)

In [0]:
# Define the data loaders
batch_size = 75
train_loader  = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader   = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
val_loader    = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=True)


# Save data set sizes for later
train_size    = len(train_dataset)
test_size     = len(test_dataset)
val_size      = len(val_dataset)
train_batches = len(train_loader)
test_batches  = len(test_loader)
val_batches   = len(val_loader)


# Sentiment Network - Classification Model

We will build a sentiment classifier by using transfer learning from a pre-trained BERT model. For simplicity and to speed up the training process, we will freeze the weights of the BERT encoder (code already included below).

Use the 768-dimensional embedding corresponding to the [CLS] token and add a binary classifier on top. Thus, your linear layer should map a (BATCH_SIZE x 768) tensor onto (BATCH_SIZE, ) or (BATCH_SIZE, 2), depending on whether you use a sigmoid activation and scalar outputs or code the two classes one-hot and use a softmax layer.
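The two output options can be compared on a random tensor before BERT is involved at all. The following is a minimal sketch under stated assumptions: `cls_embeddings` is a random stand-in for the `[CLS]` embeddings, and the variable names are illustrative. Note that `nn.CrossEntropyLoss` takes integer class indices rather than one-hot vectors and applies the softmax internally; likewise `nn.BCEWithLogitsLoss` applies the sigmoid internally.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, hidden_size = 4, 768

# Random stand-in for the [CLS] embeddings that BERT would produce.
cls_embeddings = torch.randn(batch_size, hidden_size)

# Option 1: one logit per review, sigmoid + binary cross-entropy.
head_scalar = nn.Linear(hidden_size, 1)
logits_scalar = head_scalar(cls_embeddings).squeeze(-1)  # shape: (4,)
targets_float = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss_bce = nn.BCEWithLogitsLoss()(logits_scalar, targets_float)

# Option 2: two logits per review, softmax + cross-entropy.
head_two = nn.Linear(hidden_size, 2)
logits_two = head_two(cls_embeddings)  # shape: (4, 2)
targets_long = torch.tensor([1, 0, 1, 0])
loss_ce = nn.CrossEntropyLoss()(logits_two, targets_long)

print(logits_scalar.shape, logits_two.shape)
```

In both cases the linear layer's input size is 768, independent of the batch size; only the output size (1 or 2) and the matching loss differ.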



### Implement a Binary Classifier 

Look up how to instantiate a pre-trained `BertModel`. A good starting point is the quickstart guide of the Transformers library: https://huggingface.co/transformers/quickstart.html

Documentation of `BertModel`: https://huggingface.co/transformers/model_doc/bert.html#bertmodel

In [0]:
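from transformers import BertModel
import torch.nn as nn
from torch.nn import CrossEntropyLoss

class SentimentClassifier(nn.Module):
    def __init__(self, fine_tune=False):
        super(SentimentClassifier, self).__init__()

        # TODO: BertModel as an embedding layer
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Turn gradients for BertModel on/off
        self.bert.requires_grad_(fine_tune)

        # TODO: Linear binary classification layer
        # Maps the 768-dimensional [CLS] embedding to two class logits
        # (for use with CrossEntropyLoss, which applies the softmax internally)
        self.Layer = nn.Linear(768, 2)

    def forward(self, x):
        # TODO
        # Ignore padding positions ([PAD] has ID 0 in bert-base-uncased)
        attention_mask = (x != 0).long()
        # The first output is the last hidden state: (BATCH_SIZE, MAX_LEN, 768)
        hidden_states = self.bert(x, attention_mask=attention_mask)[0]
        # Keep only the embedding of the [CLS] token (position 0)
        cls_embedding = hidden_states[:, 0, :]
        return self.Layer(cls_embedding)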




## Instantiate the Network 


In [0]:
# ToDo: Instantiate the model
net = SentimentClassifier()

print(net)

## Training and Testing

Make sure to choose an appropriate learning rate. Regularly output training and validation loss + accuracy. Since the dataset is large, you want to output statistics in regular intervals also within an epoch.
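One way to report statistics at regular intervals is to accumulate loss and accuracy over batches and print the running averages every few steps. The helper below is a hypothetical sketch (`RunningStats` is not part of the exercise or of PyTorch); it assumes each batch loss is the mean over that batch, as is the default for PyTorch loss functions.

```python
class RunningStats:
    """Accumulates loss and accuracy over batches and reports averages."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.loss_sum = 0.0
        self.correct = 0
        self.count = 0

    def update(self, batch_loss, n_correct, n_samples):
        # batch_loss is assumed to be the mean loss over the batch,
        # so it is weighted by the batch size before summing.
        self.loss_sum += batch_loss * n_samples
        self.correct += n_correct
        self.count += n_samples

    @property
    def mean_loss(self):
        return self.loss_sum / self.count

    @property
    def accuracy(self):
        return self.correct / self.count


# Usage with made-up numbers for two batches of 75 reviews each.
stats = RunningStats()
stats.update(batch_loss=0.5, n_correct=60, n_samples=75)
stats.update(batch_loss=0.3, n_correct=65, n_samples=75)
print("loss %.3f, accuracy %.3f" % (stats.mean_loss, stats.accuracy))
```

Calling `reset()` after each report gives per-interval statistics instead of a running average over the whole epoch.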

In [0]:
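import torch.optim as optim

# Assumed starting value for training only the linear layer; tune as needed
learning_rate = 1e-3

net = net.to(device)
criterion = nn.CrossEntropyLoss()  # expects (BATCH_SIZE, 2) logits and integer class labels
optimizer = optim.Adam(net.parameters(), lr=learning_rate, weight_decay=1e-5)

# ToDo: Train & validate
def train(model, train_data, log_every=100):
  model.train()
  train_loss = []
  for step, (x, y) in enumerate(train_data):
    x, y = x.to(device), y.to(device).long()
    optimizer.zero_grad()
    y_hat = model(x)

    # Use a separate name so that the loss function is not overwritten
    batch_loss = criterion(y_hat, y)
    batch_loss.backward()
    optimizer.step()

    train_loss.append(batch_loss.item())
    # Report statistics in regular intervals within an epoch
    if step % log_every == 0:
      print('batch %d: training loss %.4f' % (step, batch_loss.item()))

  return train_loss

def validate(model, val_data):
  model.eval()
  val_loss, correct, total = [], 0, 0
  with torch.no_grad():
    for x, y in val_data:
      x, y = x.to(device), y.to(device).long()
      y_hat = model(x)
      val_loss.append(criterion(y_hat, y).item())
      correct += (y_hat.argmax(dim=1) == y).sum().item()
      total += y.size(0)

  # Mean loss per batch and overall accuracy
  return sum(val_loss) / len(val_loss), correct / total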

Plot the training and validation loss and report accuracy on the test set.

In [0]:
# TODO

## Inference Testing

In this section we test the trained model at inference time, i.e., we use it to classify new reviews as positive or negative.
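For a model with two output logits, turning the raw output into a readable answer amounts to a softmax followed by picking the most probable class. The sketch below works on plain Python lists so that it runs without a trained model; `describe_prediction` is a hypothetical helper, and the mapping of index 0 to "negative" and index 1 to "positive" is an assumption that must match the label encoding of the dataset.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def describe_prediction(logits, class_names=("negative", "positive")):
    """Return the most probable class name and its probability.

    Assumes index 0 = negative and index 1 = positive.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

# Made-up logits standing in for the output of the trained net on one review.
label, confidence = describe_prediction([0.0, 2.0])
print("Predicted sentiment: %s (p = %.2f)" % (label, confidence))
```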

In [0]:
# TODO:
## Create a function that takes in the trained net, a text review, 
## and a sequence length, and predicts the sentiment as positive/negative

def predict(net, test_review, sequence_length=MAX_LEN):
    '''
    params:
    net - The trained net 
    test_review - a review as a string
    sequence_length - the padded length of a review
    '''
    
    
    # print custom response based on whether test_review is pos/neg

In [0]:
# TODO: Choose any of your preferred movies, find a review from imdb.com or
#       any other source and compare your model result against the source rating

## [Optional] Fine-tune BERT model

Can you improve performance by fine-tuning BERT in addition to training the linear classifier?
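When fine-tuning, a common choice is a much smaller learning rate for the pre-trained encoder than for the freshly initialized classifier, which PyTorch optimizers support through parameter groups. The sketch below uses a tiny hypothetical model (`TinyClassifier`, with `encoder` playing the role of BERT) so that it runs without downloading weights; the learning rates 2e-5 and 1e-3 are assumptions to be tuned, not prescribed values. In the notebook's `SentimentClassifier`, passing `fine_tune=True` to the constructor turns the BERT gradients on.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in: `encoder` plays the role of BERT, `head` the classifier."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 8)
        self.head = nn.Linear(8, 2)

    def forward(self, x):
        return self.head(torch.relu(self.encoder(x)))

torch.manual_seed(0)
model = TinyClassifier()

# Make sure the encoder is trainable (the equivalent of fine_tune=True).
for p in model.encoder.parameters():
    p.requires_grad = True

# Separate parameter groups with their own learning rates (assumed values).
optimizer = torch.optim.Adam([
    {"params": model.encoder.parameters(), "lr": 2e-5},  # pre-trained part
    {"params": model.head.parameters(), "lr": 1e-3},     # new classifier
])

# One optimization step on random data to show that both parts are updated.
x = torch.randn(4, 16)
y = torch.tensor([0, 1, 0, 1])
loss = nn.CrossEntropyLoss()(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Fine-tuning all of BERT needs considerably more GPU memory than training only the linear layer, so a smaller batch size may be necessary.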

In [0]:
# TODO