# Assignment 3: Question Answering with a Language Model

**Description:** This assignment covers question answering with a language model. There are many ways to formulate the question answering task, and this is one of them. You will use T5's mask token to build prompt sentences that lead the model to the correct answer more than 75% of the time. You should also develop an intuition for:


* Working with masked language models
* Working with prompt based models
* The depths and limits of knowledge in these large models


This notebook does NOT require a GPU to run in a timely fashion. Run it on Google Colab anyway; by default, Colab will not configure a GPU when you open the notebook.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2023-summer-main/blob/master/assignment/a3/QuestionAnswering_test.ipynb)


**INSTRUCTIONS:**

* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1 and a2.





In [1]:
!pip install -q sentencepiece

In [2]:
!pip install -q transformers

In [3]:
from collections import Counter
import numpy as np
import tensorflow as tf
from tensorflow import keras

In [4]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

In [5]:
t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-base')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

t5_model.summary()

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (Embedding)          multiple                  24674304  
                                                                 
 encoder (TFT5MainLayer)     multiple                  109628544 
                                                                 
 decoder (TFT5MainLayer)     multiple                  137949312 
                                                                 
Total params: 222,903,552
Trainable params: 222,903,552
Non-trainable params: 0
_________________________________________________________________


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


"\<extra_id_0\>" is the special token we can use with T5 to invoke its ability to predict masked words. This means we can construct sentences, like a fill-in-the-blank test, that probe the knowledge the model acquired during pre-training. Here's an example that works well: using the special token, we can write the prompt sentence "A poodle is a type of \<extra_id_0\>." We expect the model to fill in the word 'dog' when it predicts the missing word. Note that it may also predict 'pet', since a poodle can be a type of pet. Remember that the "\<extra_id_0\>" token can appear anywhere in the sentence, not just at the end.

In [6]:
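# Fill-in-the-blank prompt; <extra_id_0> marks the span T5 should predict.
PROMPT_SENTENCE = "A poodle is a type of <extra_id_0> ."
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
# Beam search with 9 beams; return the 3 highest-scoring sequences.
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                   num_beams=9,
                                   no_repeat_ngram_size=1,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=5)

# Decode each returned sequence, dropping special tokens such as the sentinel.
print([t5_tokenizer.decode(g, skip_special_tokens=True,
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])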

['poodle', 'dog', 'dog .']
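
The strings printed above have had their special tokens removed by `skip_special_tokens=True`. Before that stripping, T5 frames each predicted span with sentinel tokens, roughly `<extra_id_0> dog <extra_id_1>`. As a minimal sketch, the hypothetical helper `fill_sentinels` below (not part of the assignment or of `transformers`) substitutes such spans back into the prompt. The raw string is a hand-written example of the format, not a recorded model output.

```python
import re

SENTINEL = re.compile(r"<extra_id_(\d+)>")

def fill_sentinels(prompt, raw_output):
    """Replace each sentinel in `prompt` with the span that follows it in `raw_output`."""
    # Drop the padding and end-of-sequence markers that a raw decode may include.
    cleaned = raw_output.replace("<pad>", "").replace("</s>", "")
    # Splitting on a pattern with one capture group yields
    # [text_before, id_0, span_0, id_1, span_1, ...].
    parts = SENTINEL.split(cleaned)
    spans = {int(parts[i]): parts[i + 1].strip()
             for i in range(1, len(parts) - 1, 2)}
    # Leave a sentinel untouched when the output holds no span for it.
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), prompt)

# Hand-written example of the raw format (an assumption, not a recorded output).
raw = "<pad><extra_id_0> dog<extra_id_1></s>"
filled = fill_sentinels("A poodle is a type of <extra_id_0> .", raw)
print(filled)
```

Reading the filled-in sentence back as a whole is a quick way to judge whether a prediction is a sensible completion of the prompt.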


In [7]:
PROMPT_SENTENCE = ( "A beagle is a type of <extra_id_0> .")
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                   num_beams=9,
                                   no_repeat_ngram_size=1,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=5)

print([t5_tokenizer.decode(g, skip_special_tokens=True,
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['beagle', 'hawk', 'bird']


After you've run it once, try substituting 'beagle' for 'poodle' and you'll see the model gets confused: in the output above it proposes 'hawk' and 'bird' instead of 'dog'.

Notice too that we use beam search to generate multiple candidates and keep the top three rather than just the first choice. `num_beams=9` tracks nine candidate sequences during the search, `num_return_sequences=3` returns the three highest-scoring ones, and `min_length=1` and `max_length=5` bound the length of each returned sequence in subword tokens.
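
To make the roles of `num_beams` and `num_return_sequences` concrete, here is a deliberately simplified beam search over a toy next-token table. Everything in it is an assumption for illustration: the probabilities in `TOY_NEXT` are invented, `toy_beam_search` is a hypothetical function, and the real `generate` method adds details (such as length penalties and n-gram blocking) that this sketch omits.

```python
import math

# Invented next-token probabilities keyed by the previous token.
# These numbers are for illustration only and do not come from T5.
TOY_NEXT = {
    "<s>": {"dog": 0.5, "pet": 0.3, "poodle": 0.2},
    "dog": {"</s>": 0.7, ".": 0.3},
    "pet": {"</s>": 0.9, ".": 0.1},
    "poodle": {"</s>": 1.0},
    ".": {"</s>": 1.0},
}

def toy_beam_search(num_beams, num_return_sequences, max_length):
    """Return the best finished sequences found by a simplified beam search."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        # Extend every live beam by every possible next token.
        candidates = [
            (score + math.log(p), tokens + [tok])
            for score, tokens in beams
            for tok, p in TOY_NEXT[tokens[-1]].items()
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep only the num_beams best candidates; set aside those that ended.
        beams = []
        for cand in candidates[:num_beams]:
            (finished if cand[1][-1] == "</s>" else beams).append(cand)
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    # Strip the start and end markers before returning the text.
    return [" ".join(tokens[1:-1]) for _, tokens in finished[:num_return_sequences]]

top_three = toy_beam_search(num_beams=9, num_return_sequences=3, max_length=5)
print(top_three)
```

With `num_beams=1` the same function reduces to greedy decoding and can return only one sequence, which is why a wider beam is needed to get three distinct candidates.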

With the growth of text generation models, developing a good prompt is an increasingly important skill.

**QUESTION:**

1.1 Let's test the actual knowledge encoded in the T5 model by constructing prompts that return provably true or false facts, like you might see on a fill-in-the-blank test. Given the following ten countries (England, France, Germany, Russia, Egypt, Thailand, Japan, Canada, India, China), construct **two** different PROMPT_SENTENCEs using the special token and the values of the countries list so that in at least 7 of the 10 cases one of the top three answers is a provably correct fact. Use the string COUNTRY to stand in for each of the elements in the list. For example, "\<extra_id_0\> is the chief export of COUNTRY".

Note that a fact usually takes the form of a (noun phrase, verb phrase, noun phrase) triple in which one of the noun phrases is the country value.
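
One way to turn a template containing COUNTRY into concrete prompts is plain string substitution. The sketch below assumes a hypothetical helper, `build_prompts`, and reuses the example template from the question; an f-string inside a loop works just as well.

```python
countries = ['England', 'France', 'Germany',
             'Russia', 'Egypt', 'Thailand',
             'Japan', 'Canada', 'India', 'China']

# Example template from the question text; COUNTRY is the placeholder.
TEMPLATE = "<extra_id_0> is the chief export of COUNTRY"

def build_prompts(template, values, placeholder="COUNTRY"):
    """Return one prompt per value, with the placeholder substituted."""
    return [template.replace(placeholder, value) for value in values]

prompts = build_prompts(TEMPLATE, countries)
print(prompts[0])
```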

In [11]:
countries = ['England', 'France', 'Germany',
             'Russia', 'Egypt', 'Thailand',
             'Japan', 'Canada', 'India', 'China']

In [14]:
#Use this space to craft your first sentence.  You do NOT need to modify the hyperparameters!
for country in countries:
  PROMPT_SENTENCE = (f"<extra_id_0> is the capital of {country}.")
  t5_input_text = PROMPT_SENTENCE
  t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
  t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                    num_beams=9,
                                    no_repeat_ngram_size=2,
                                    num_return_sequences=3,
                                    min_length=1,
                                    max_length=3)

  print([t5_tokenizer.decode(g, skip_special_tokens=True,
                            clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['London', 'Manchester', 'Here']
['Paris', 'Strasbourg', 'Nice']
['Berlin', 'Munich', 'Frankfurt']
['Moscow', 'Russia', 'Kiev']
['Cairo', 'Alexandria', 'Egypt']
['Bangkok', 'Thailand', 'Phuket']
['Tokyo', 'Kyoto', 'It']
['Toronto', 'Ottawa', 'Montreal']
['Mumbai', 'Kolkata', 'Bangalore']
['Shanghai', 'Beijing', '']
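
To check the "at least 7 of 10" requirement, the top-three lists printed above can be scored against reference answers. This is a minimal sketch: the predictions are copied from the output above, while the `capitals` dictionary and the helper `top_k_hits` are additions for illustration and use exact string matching only.

```python
# Top-three predictions copied from the output of the capital prompt above.
predictions = {
    "England": ["London", "Manchester", "Here"],
    "France": ["Paris", "Strasbourg", "Nice"],
    "Germany": ["Berlin", "Munich", "Frankfurt"],
    "Russia": ["Moscow", "Russia", "Kiev"],
    "Egypt": ["Cairo", "Alexandria", "Egypt"],
    "Thailand": ["Bangkok", "Thailand", "Phuket"],
    "Japan": ["Tokyo", "Kyoto", "It"],
    "Canada": ["Toronto", "Ottawa", "Montreal"],
    "India": ["Mumbai", "Kolkata", "Bangalore"],
    "China": ["Shanghai", "Beijing", ""],
}

# Reference answers supplied here for scoring; they are not part of the notebook.
capitals = {
    "England": "London", "France": "Paris", "Germany": "Berlin",
    "Russia": "Moscow", "Egypt": "Cairo", "Thailand": "Bangkok",
    "Japan": "Tokyo", "Canada": "Ottawa", "India": "New Delhi",
    "China": "Beijing",
}

def top_k_hits(predictions, gold):
    """Return the keys whose gold answer appears among the predictions."""
    return [key for key, preds in predictions.items() if gold[key] in preds]

hits = top_k_hits(predictions, capitals)
misses = [c for c in predictions if c not in hits]
print(f"{len(hits)}/{len(predictions)} correct; missed: {misses}")
```

Under exact matching, this prompt clears the threshold with 9 of 10 countries; only India is missed, because 'New Delhi' is not among its three predictions.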


In [19]:
#Use this space to craft your second sentence.  You do NOT need to modify the hyperparameters!
for country in countries:
  PROMPT_SENTENCE2 = (f"<extra_id_0> is the language spoken in {country}.")
  t5_input_text = PROMPT_SENTENCE2
  t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
  t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                    num_beams=9,
                                    no_repeat_ngram_size=2,
                                    num_return_sequences=3,
                                    min_length=1,
                                    max_length=3)

  print([t5_tokenizer.decode(g, skip_special_tokens=True,
                            clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['English', 'english', '.']
['French', 'français', 'Français']
['German', 'English', 'german']
['Russian', 'Ru', 'English']
['Arabic', 'Egyptian', '']
['Thai', 'Tha', '']
['Japanese', 'English', '']
['English', 'Canadian', 'French']
['Hindi', 'English', 'Tamil']
['Chinese', 'Mandarin', '']
