<a href="https://colab.research.google.com/github/rahiakela/coursera-natural-language-processing-specialization/blob/4-natural-language-processing-with-attention-models/week-3/1_assignment_3_question_answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 3: Question Answering

Welcome to this week's assignment of Course 4. In this assignment you will explore question answering. You will implement the "Text-to-Text Transfer Transformer" (better known as T5). Since you implemented transformers from scratch last week, you will now be able to use them.

<img src='https://github.com/rahiakela/img-repo/blob/master/deeplearning.ai-NLPS/qa.png?raw=1' width='800'/>

## Outline

- [Overview](#0)
- [Part 0: Importing the Packages](#0)
- [Part 1: C4 Dataset](#1)
    - [1.1 Pre-Training Objective](#1.1)
    - [1.2 Process C4](#1.2)
        - [1.2.1 Decode to natural language](#1.2.1)
    - [1.3 Tokenizing and Masking](#1.3)
        - [Exercise 01](#ex01)
    - [1.4 Creating the Pairs](#1.4)
- [Part 2: Transformer](#2)
    - [2.1 Transformer Encoder](#2.1)
        - [2.1.1 The Feedforward Block](#2.1.1)
            - [Exercise 02](#ex02)
        - [2.1.2 The Encoder Block](#2.1.2)
            - [Exercise 03](#ex03)
        - [2.1.3 The Transformer Encoder](#2.1.3)            
            - [Exercise 04](#ex04)

<a name='0'></a>
## Overview

This assignment will be different from the two previous ones. Due to memory and time constraints of this environment you will not be able to train a model and use it for inference. Instead you will create the necessary building blocks for the transformer encoder model and will use a pretrained version of the same model in two ungraded labs after this assignment.

After completing these 3 labs (1 graded and 2 ungraded) you will be able to:
* Implement the code necessary for Bidirectional Encoder Representations from Transformers (BERT).
* Understand how the C4 dataset is structured.
* Use a pretrained model for inference.
* Understand how the "Text-to-Text Transfer Transformer" (T5) model works.

<a name='0'></a>
## Part 0: Importing the Packages

In [None]:
!pip install trax==1.3.4

In [2]:
import ast
import string
import textwrap
import itertools
import numpy as np

import trax 
from trax import layers as tl
from trax.supervised import decoding

# Will come in handy later.
wrapper = textwrap.TextWrapper(width=70)

# Set random seed
np.random.seed(42)

<a name='1'></a>
## Part 1: C4 Dataset

[C4](https://www.tensorflow.org/datasets/catalog/c4) is a huge dataset. For the purpose of this assignment you will use a few examples from it, which are present in `data.txt`. C4 is based on the [Common Crawl](https://commoncrawl.org/) project. Feel free to read more on their website.

Run the cell below to see what the examples look like.

In [3]:
!wget https://raw.githubusercontent.com/rahiakela/coursera-natural-language-processing-specialization/4-natural-language-processing-with-attention-models/week-3/data.txt

--2020-11-22 08:47:07--  https://raw.githubusercontent.com/rahiakela/coursera-natural-language-processing-specialization/4-natural-language-processing-with-attention-models/week-3/data.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5513 (5.4K) [text/plain]
Saving to: ‘data.txt’


2020-11-22 08:47:07 (76.0 MB/s) - ‘data.txt’ saved [5513/5513]



In [4]:
# Load the examples: each line of data.txt is parsed as a Python literal
with open("data.txt") as f:
    example_jsons = [ast.literal_eval(line) for line in f]
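The examples are read with `ast.literal_eval` rather than a JSON parser because each line is a Python dict literal whose values are `b'...'` byte strings, which strict JSON cannot express. A minimal sketch of the difference, using a shortened, made-up line in the same shape as the data:

```python
import ast
import json

# A shortened, illustrative line in the same shape as the entries of data.txt.
line = "{'content-type': b'text/plain', 'text': b'Beginners BBQ Class'}"

# Strict JSON rejects single quotes and bytes literals.
try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval safely evaluates Python literals, including bytes.
record = ast.literal_eval(line)

print(parsed_as_json)
print(record["text"])
```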

In [5]:
# Print the examples to see what the data looks like
for i in range(5):
  print(f"example number {i + 1}: \n\n{example_jsons[i]} \n")

example number 1: 

{'content-length': b'1970', 'content-type': b'text/plain', 'text': b'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.', 'timestamp': b'2019-04-25T12:57:54Z', 'url': b'https://klyq.com/beginners-bbq-class-taking-place-in-missoula/'} 

example number 2: 

{'content-length': b'120

Notice the `b` before each string? It means the data comes as `bytes` rather than `str`. A `bytes` object holds the encoded form of a string (here UTF-8), so for the rest of the assignment the name `strings` will be used to describe the data.

To check this run the following cell:

In [6]:
type(example_jsons[0].get("text"))

bytes
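To make the relationship between `bytes` and `str` concrete, here is a small sketch that decodes a fragment taken from one of the examples. The variable names are illustrative only.

```python
# A bytes fragment as it appears in the C4 examples.
raw = b"what\xe2\x80\x99s"

# Decoding as UTF-8 turns the three bytes \xe2\x80\x99 into a single
# right single quotation mark character.
text = raw.decode("utf-8")

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
print(text)
```

The byte length and the character length differ because one non-ASCII character can occupy several bytes in UTF-8.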

<a name='1.1'></a>
###  1.1 Pre-Training Objective

**Note:** The word "mask" will be used throughout this assignment in the context of hiding/removing word(s).

You will be implementing the BERT loss as shown in the following image. 

<img src='https://github.com/rahiakela/img-repo/blob/master/deeplearning.ai-NLPS/loss.png?raw=1' width='800'/>

Assume you have the following text: <span style = "color:blue"> **Thank you <span style = "color:red">for inviting </span> me to your party <span style = "color:red">last</span>  week** </span> 


Now as input you will mask the words in red in the text: 

<span style = "color:blue"> **Input:**</span> Thank you  **X** me to your party **Y** week.

<span style = "color:blue">**Output:**</span> The model should predict the word(s) for **X** and **Y**. 

**Z** is used to represent the end.
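The masking above can be sketched in plain Python. This is only an illustration: `mask_spans` is a hypothetical helper that works on whole words rather than token ids, and the target layout (each sentinel followed by the words it hides, closed by **Z**) is assumed from the T5 convention rather than taken from this assignment's code.

```python
def mask_spans(words, spans, sentinels="XYZ"):
    """Replace each (start, end) word span with a sentinel.

    Assumes spans are sorted, non-overlapping, and that there is one
    more sentinel than spans (the last one marks the end of the target).
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs.extend(words[cursor:start])  # keep the unmasked words
        inputs.append(sentinel)             # hide the span in the input
        targets.append(sentinel)            # announce the span in the target
        targets.extend(words[start:end])    # followed by the hidden words
        cursor = end
    inputs.extend(words[cursor:])
    targets.append(sentinels[len(spans)])   # closing sentinel
    return " ".join(inputs), " ".join(targets)


words = "Thank you for inviting me to your party last week".split()
inputs, targets = mask_spans(words, [(2, 4), (8, 9)])
print(inputs)
print(targets)
```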

<a name='1.2'></a>
### 1.2 Process C4

C4 only has the plain string `text` field, so you will tokenize it and derive `inputs` and `targets` from it for supervised learning. Given your inputs, the goal is to predict the targets during training.

You will now take the `text` and convert it to `inputs` and `targets`.

In [7]:
# Grab text field from dictionary
natural_language_texts = [example_json["text"] for example_json in example_jsons]

# Fifth text example (index 4)
natural_language_texts[4]

b'The Denver Board of Education opened the 2017-18 school year with an update on projects that include new construction, upgrades, heat mitigation and quality learning environments.\nWe are excited that Denver students will be the beneficiaries of a four year, $572 million General Obligation Bond. Since the passage of the bond, our construction team has worked to schedule the projects over the four-year term of the bond.\nDenver voters on Tuesday approved bond and mill funding measures for students in Denver Public Schools, agreeing to invest $572 million in bond funding to build and improve schools and $56.6 million in operating dollars to support proven initiatives, such as early literacy.\nDenver voters say yes to bond and mill levy funding support for DPS students and schools. Click to learn more about the details of the voter-approved bond measure.\nDenver voters on Nov. 8 approved bond and mill funding measures for DPS students and schools. Learn more about what\xe2\x80\x99s incl

<a name='1.2.1'></a>
#### 1.2.1 Decode to natural language

The following functions will help you `detokenize` and `tokenize` the text data.  

The `sentencepiece` vocabulary was used to convert from text to ids. This vocabulary file is loaded and used in these helper functions.

`natural_language_texts` has the text from the examples we gave you. 

Run the cells below to see what is going on. 

In [10]:
# Use the raw URL: the github.com/.../blob/... address returns an HTML page, not the model file
!wget https://raw.githubusercontent.com/rahiakela/coursera-natural-language-processing-specialization/4-natural-language-processing-with-attention-models/week-3/sentencepiece.model



In [12]:
!pip install sentencepiece
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt

--2020-11-22 09:28:56--  https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278779 (272K) [text/plain]
Saving to: ‘botchan.txt’


2020-11-22 09:28:56 (34.0 MB/s) - ‘botchan.txt’ saved [278779/278779]



In [8]:
# Special tokens
PAD, EOS, UNK = 0, 1, 2

def detokenize(np_array):
  return trax.data.detokenize(
      np_array,
      vocab_type="sentencepiece",
      vocab_file="sentencepiece.model",
      vocab_dir="."
  )

def tokenize(s):
  # The trax.data.tokenize function operates on streams, that's why we have to create 1-element stream with iter
  # and later retrieve the result with next.
  return next(trax.data.tokenize(
      iter([s]),
      vocab_type="sentencepiece",
      vocab_file="sentencepiece.model",
      vocab_dir="."
  ))

In [None]:
# printing the encoding of each word to see how subwords are tokenized
tokenized_text = [(tokenize(word).tolist(), word) for word in natural_language_texts[0].split()]
print(tokenized_text, "\n")

In [None]:
# We can see that detokenize successfully undoes the tokenization
print(f"tokenized: {tokenize('Beginners')}\ndetokenized: {detokenize(tokenize('Beginners'))}")

As you can see above, you were able to take a string and tokenize it.

Now you will create `input` and `target` pairs that will allow you to train your model. T5 uses the ids at the end of the vocab file as sentinels. For example, it will replace: 
   - `vocab_size - 1` by `<Z>`
   - `vocab_size - 2` by `<Y>`
   - and so forth. 
   
Each sentinel is assigned a single character (`chr`).

The `pretty_decode` function below, which you will use in a bit, helps in handling the type when decoding. Take a look and try to understand what the function is doing.


Notice that:
```python
string.ascii_letters == 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
```

**NOTE:** Targets may have more than the 52 sentinels we replace, but this is just to give you an idea of things.
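As a rough sketch of the sentinel scheme described above, the mapping from the last ids of the vocabulary to letter labels could look like the following. The helper `sentinel_labels` and the vocabulary size of 32000 are assumptions for illustration, not values taken from the assignment.

```python
import string


def sentinel_labels(vocab_size):
    """Map the last 52 vocabulary ids to labels '<Z>', '<Y>', ..., '<a>'."""
    return {
        vocab_size - i: f"<{char}>"
        for i, char in enumerate(reversed(string.ascii_letters), start=1)
    }


# 32000 is a made-up vocabulary size used only for this example.
labels = sentinel_labels(vocab_size=32000)
print(labels[31999], labels[31998], labels[31997])
print(len(labels))
```

Reversing `string.ascii_letters` makes the highest id map to `Z`, the next to `Y`, and so on, which matches the order in the list above.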