<a href="https://colab.research.google.com/github/srilamaiti/SM_MIDS_W266_HW/blob/main/QuestionAnswering_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 3: Question Answering with a Language Model

**Description:** This assignment covers question answering with a language model. There are many ways to formulate the question answering task, and this is one of them. You will use the masked token with T5 to develop a sentence construct that allows the model to answer the question correctly more than 75% of the time. You should also be able to develop an intuition for:


* Working with masked language models 
* Working with prompt based models 
* The depths and limits of knowledge in these large models 

 
This notebook does NOT require a GPU to run in a timely fashion. It should still be run in Google Colab. By default, Colab does not configure a GPU when you open the notebook.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2023-spring-main/blob/master/assignment/a3/QuestionAnswering_test.ipynb)


**INSTRUCTIONS:** 

* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1 and a2.





In [23]:
!pip install -q sentencepiece

In [24]:
!pip install -q transformers

In [25]:
from collections import Counter
import numpy as np
import tensorflow as tf
from tensorflow import keras

In [26]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

In [27]:
t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-base')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

t5_model.summary()

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Model: "tft5_for_conditional_generation_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (Embedding)          multiple                  24674304  
                                                                 
 encoder (TFT5MainLayer)     multiple                  109628544 
                                                                 
 decoder (TFT5MainLayer)     multiple                  137949312 
                                                                 
Total params: 222,903,552
Trainable params: 222,903,552
Non-trainable params: 0
_________________________________________________________________


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
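The three rows of the model summary above add up to more than the reported total. One reading that is consistent with the numbers (an assumption, since the summary does not say so) is that the `shared` embedding is counted again inside both `encoder` and `decoder`, so it has to be subtracted twice. The quick check below uses only the figures printed in the summary; the hidden size of 768 is the commonly documented width of `t5-base` and is also an assumption here.

```python
# Parameter counts copied from the model summary above.
shared = 24_674_304
encoder = 109_628_544
decoder = 137_949_312
reported_total = 222_903_552

# Assumption: encoder and decoder each include the shared embedding in their own count.
naive_sum = shared + encoder + decoder
deduplicated = naive_sum - 2 * shared

print(naive_sum)                        # 272252160
print(deduplicated)                     # 222903552
print(deduplicated == reported_total)   # True

# Assumption: hidden size 768, which would imply this many embedding rows.
hidden_size = 768
print(shared // hidden_size, shared % hidden_size)  # 32128 0
```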


"\<extra_id_0\>" is the special token we can use with T5 to invoke its masked word modeling ability.  This means we can construct sentences, like a fill in the blank test, that allow us to probe the knowledge embedded in the model based on its pre-training.  Here's an example that works well.  We can construct with the special token a prompt sentence that says "A poodle is a type of "\<extra_id_0\>"".  We expect the model to fill in the word 'dog' as it predicts the missing word.  Note that it may also predict 'pet' as another possibility as a poodle can be a type of pet.  Remember the 
"\<extra_id_0\>" token can appear anywhere in the sentence, not just at the end.

In [28]:
PROMPT_SENTENCE = ( "A poodle is a type of <extra_id_0> .")
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                   num_beams=9,
                                   no_repeat_ngram_size=1,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=5)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['poodle', 'dog', 'dog .']


In [29]:
t5_summary_ids

<tf.Tensor: shape=(3, 5), dtype=int32, numpy=
array([[    0, 32099,     3,   102, 14957],
       [    0, 32099,  1782, 32098,     3],
       [    0, 32099,  1782,     3,     5]], dtype=int32)>
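The raw ids above are easier to read once the special ids are named. The sketch below assumes the standard `t5-base` tokenizer layout: 32,100 tokens in total, with the 100 special mask tokens at the top of the vocabulary so that `<extra_id_0>` is id 32099 and `<extra_id_1>` is id 32098, and with id 0 as the pad token that starts every generated sequence. `describe_id` is a hypothetical helper written for this illustration, not a `transformers` function.

```python
VOCAB_SIZE = 32_100     # assumption: size of the t5-base tokenizer vocabulary
NUM_SENTINELS = 100     # assumption: <extra_id_0> ... <extra_id_99>

def describe_id(token_id):
    """Name the special ids; leave ordinary subword ids as plain numbers."""
    if token_id == 0:
        return "<pad>"
    if token_id == 1:
        return "</s>"
    first_sentinel = VOCAB_SIZE - NUM_SENTINELS
    if first_sentinel <= token_id < VOCAB_SIZE:
        return f"<extra_id_{VOCAB_SIZE - 1 - token_id}>"
    return f"token {token_id}"

# The three rows printed for t5_summary_ids above.
rows = [
    [0, 32099, 3, 102, 14957],
    [0, 32099, 1782, 32098, 3],
    [0, 32099, 1782, 3, 5],
]
for row in rows:
    print([describe_id(i) for i in row])
```

Read this way, every candidate begins with the start token followed by `<extra_id_0>`, and the second candidate also contains `<extra_id_1>`. Decoding with `skip_special_tokens=True` drops these ids, which is why only the filled-in words appear in the printed answers.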

In [30]:
#Use this space to craft your first sentence.  You do NOT need to modify the hyperparameters!
PROMPT_SENTENCE = ( "<extra_id_0> Lily is a type of Lily.")
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                   num_beams=9,
                                   no_repeat_ngram_size=2,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=3)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['Li', '.', '']


In [31]:
#Use this space to craft your first sentence.  You do NOT need to modify the hyperparameters!
PROMPT_SENTENCE = ( "<extra_id_0> is a species of hamsters.")
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                   num_beams=9,
                                   no_repeat_ngram_size=2,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=3)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['', '.', 'ham']


In [32]:
#Use this space to craft your first sentence.  You do NOT need to modify the hyperparameters!
PROMPT_SENTENCE = ( "<extra_id_0> in India is one of the seven wonders of the world.")
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                   num_beams=9,
                                   no_repeat_ngram_size=2,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=3)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['The', 'India', '.']


In [33]:
#Use this space to craft your first sentence.  You do NOT need to modify the hyperparameters!
PROMPT_SENTENCE = ( "Who is the <extra_id_0> of the USA now?")
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                   num_beams=9,
                                   no_repeat_ngram_size=2,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=3)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['President', 'president', 'leader']


After you've run it once, try substituting 'beagle' for 'poodle' and you'll see the model gets confused.

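Notice too that we use beam search to generate multiple candidates and keep the top three rather than only the best one. We ask for three answer sequences to be returned (`num_return_sequences=3`), each between 1 and 5 subword tokens long (`min_length=1`, `max_length=5`). As the raw `t5_summary_ids` output shows, that length includes the leading start token (id 0) and the special mask token itself.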
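To make the beam search idea concrete, here is a minimal toy sketch. It is not how `generate` is implemented: it ignores end-of-sequence handling, length penalties, and `no_repeat_ngram_size`, and the probabilities in `TOY_MODEL` are invented for illustration only and are NOT T5 outputs. The parameter names simply mirror the ones used in this notebook.

```python
import math

# Invented next-token probabilities, keyed by the sequence generated so far.
TOY_MODEL = {
    (): {"dog": 0.6, "pet": 0.3, "poodle": 0.1},
    ("dog",): {".": 0.9, "breed": 0.1},
    ("pet",): {".": 0.7, "dog": 0.3},
    ("poodle",): {".": 1.0},
}

def beam_search(model, num_beams, max_length, num_return_sequences):
    """Keep the num_beams best partial sequences at each step, scored by log-probability."""
    beams = [((), 0.0)]
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            next_tokens = model.get(seq, {})
            if not next_tokens:
                # Nothing to extend with: carry the sequence forward unchanged.
                candidates.append((seq, score))
                continue
            for token, prob in next_tokens.items():
                candidates.append((seq + (token,), score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[:num_return_sequences]

for seq, score in beam_search(TOY_MODEL, num_beams=3, max_length=2,
                              num_return_sequences=3):
    print(" ".join(seq), round(math.exp(score), 2))
```

With these toy numbers the three returned sequences are `dog .`, `pet .`, and `poodle .`, in that order. A larger `num_beams` keeps more partial sequences alive at each step, while `num_return_sequences` only controls how many of the final beams are reported.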

With the growth of text generation models, developing a good prompt is an increasingly important skill. 

**QUESTION:**

1.1 Let's test the actual knowledge encoded in the T5 model. Let's construct prompts that return provably true or false facts like you might see on a fill in the blank test.  Given the following ten countries (England, France, Germany, Russia, Egypt, Thailand, Japan, Canada, India, China) construct **two** different PROMPT_SENTENCEs using the special token and the values of the countries list so that in at least 7 of the 10 cases one of the top three answers is a provably correct fact.  Use the string COUNTRY to stand in for each of the elements in the list.  For example, "\<extra_id_0\> is the chief export of COUNTRY".

Note that a fact usually takes the form of a noun phrase - verb phrase - noun phrase triple where one of those noun phrases will consist of the country value.   

In [43]:
country_list = ['England', 'France', 'Germany', 'Russia', 'Egypt', 'Thailand', 'Japan', 'Canada', 'India', 'China']
PROMPT_SENTENCE1 = ( "The capital of country_name is <extra_id_0>.")
PROMPT_SENTENCE2 = ( "The official language of country_name is <extra_id_0>.")

In [35]:
# Use this space to craft your first sentence.  You do NOT need to modify the hyperparameters!
for country in country_list:
    print(f"Country : {country}")

    # Replace the country_name placeholder with the actual country name, then show the prompt
    PROMPT_SENTENCE = PROMPT_SENTENCE1.replace("country_name", country)
    
    print(f"Prompt sentence : \'{PROMPT_SENTENCE}\'")
    
    # Running the model
    t5_input_text = PROMPT_SENTENCE
    t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
    t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                       num_beams=9,
                                       no_repeat_ngram_size=2,
                                       num_return_sequences=3,
                                       min_length=1,
                                       max_length=3)
    # Showing results
    print(f"Possible answers are: {[t5_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in t5_summary_ids]}")

Country : England
Prompt sentence : 'The capital of England is <extra_id_0>.'
Possible answers are: ['London', 'Manchester', 'the']
Country : France
Prompt sentence : 'The capital of France is <extra_id_0>.'
Possible answers are: ['Paris', 'Strasbourg', 'Marseille']
Country : Germany
Prompt sentence : 'The capital of Germany is <extra_id_0>.'
Possible answers are: ['Berlin', 'Frankfurt', 'Hamburg']
Country : Russia
Prompt sentence : 'The capital of Russia is <extra_id_0>.'
Possible answers are: ['Moscow', 'Kiev', 'the']
Country : Egypt
Prompt sentence : 'The capital of Egypt is <extra_id_0>.'
Possible answers are: ['Cairo', 'Alexandria', 'Abu']
Country : Thailand
Prompt sentence : 'The capital of Thailand is <extra_id_0>.'
Possible answers are: ['Bangkok', 'Phuket', 'Ph']
Country : Japan
Prompt sentence : 'The capital of Japan is <extra_id_0>.'
Possible answers are: ['Tokyo', 'Kyoto', 'Nag']
Country : Canada
Prompt sentence : 'The capital of Canada is <extra_id_0>.'
Possible answers ar

The T5 model returned the correct capital city among its top three answers for every country in the country list.
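The "at least 7 of 10" criterion can be checked mechanically rather than by eye. The sketch below is one possible way to score a prompt; `hit_rate` is a hypothetical helper, and the `expected` answers are supplied by hand. The sample predictions are the first four results printed above for the capital prompt.

```python
def hit_rate(predictions, expected):
    """Fraction of countries whose top answers contain an accepted answer."""
    hits = sum(
        1
        for country, answers in predictions.items()
        if any(answer.strip() in expected[country] for answer in answers)
    )
    return hits / len(predictions)

# Top-three answers copied from the output above (first four countries only).
predictions = {
    "England": ["London", "Manchester", "the"],
    "France": ["Paris", "Strasbourg", "Marseille"],
    "Germany": ["Berlin", "Frankfurt", "Hamburg"],
    "Russia": ["Moscow", "Kiev", "the"],
}
# Hand-written reference answers; a set allows more than one accepted spelling.
expected = {
    "England": {"London"},
    "France": {"Paris"},
    "Germany": {"Berlin"},
    "Russia": {"Moscow"},
}

print(hit_rate(predictions, expected))  # 1.0
```

Collecting the decoded answers for all ten countries into `predictions` and checking `hit_rate(predictions, expected) >= 0.7` would test the assignment's threshold directly.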

In [44]:
#Use this space to craft your second sentence.  You do NOT need to modify the hyperparameters!
for country in country_list:
    print(f"Country : {country}")

    # Replace the country_name placeholder with the actual country name, then show the prompt
    PROMPT_SENTENCE = PROMPT_SENTENCE2.replace("country_name", country)
    
    print(f"Prompt sentence : \'{PROMPT_SENTENCE}\'")
    
    # Running the model
    t5_input_text = PROMPT_SENTENCE
    t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
    t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                       num_beams=9,
                                       no_repeat_ngram_size=2,
                                       num_return_sequences=3,
                                       min_length=1,
                                       max_length=3)
    # Showing results
    print(f"Possible answers are: {[t5_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in t5_summary_ids]}")

Country : England
Prompt sentence : 'The official language of England is <extra_id_0>.'
Possible answers are: ['English', 'english', 'Welsh']
Country : France
Prompt sentence : 'The official language of France is <extra_id_0>.'
Possible answers are: ['French', 'English', 'Français']
Country : Germany
Prompt sentence : 'The official language of Germany is <extra_id_0>.'
Possible answers are: ['German', 'English', 'german']
Country : Russia
Prompt sentence : 'The official language of Russia is <extra_id_0>.'
Possible answers are: ['Russian', 'English', '']
Country : Egypt
Prompt sentence : 'The official language of Egypt is <extra_id_0>.'
Possible answers are: ['Arabic', 'English', 'Egyptian']
Country : Thailand
Prompt sentence : 'The official language of Thailand is <extra_id_0>.'
Possible answers are: ['Thai', 'English', '']
Country : Japan
Prompt sentence : 'The official language of Japan is <extra_id_0>.'
Possible answers are: ['English', 'Japanese', '']
Country : Canada
Prompt sente

The T5 model returned the correct official language among its top three answers for every country in the country list.