In [1]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import pandas as pd
import random
import checklist_plus
from checklist_plus.editor import Editor
from checklist_plus.expect import Expect
from checklist_plus.pred_wrapper import PredictorWrapper
from checklist_plus.test_types import MFT
from typing import List
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Initialize random seed
# Remove this code to experiment with random samples
random.seed(123)
torch.manual_seed(456)

<torch._C.Generator at 0x7f075dfff110>

# MFTs: Introduction
In this notebook, we will create Minimum Functionality Tests (MFTs) for a generative language model. MFTs test one specific function of a language model. They are analogous to unit tests in traditional software engineering.
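
To make the unit-test analogy concrete, here is a minimal sketch in plain Python. The names `run_mft`, `toy_model`, and `is_upper` are hypothetical and are not part of Checklist; the "model" is a stand-in that upper-cases its input.

```python
from typing import Callable, List


def run_mft(predict: Callable[[List[str]], List[str]],
            prompts: List[str],
            passes: Callable[[str, str], bool]) -> float:
    """Return the failure rate of `predict` on `prompts` under the rule `passes`."""
    predictions = predict(prompts)
    failures = [p for p, out in zip(prompts, predictions) if not passes(p, out)]
    return len(failures) / len(prompts)


# Hypothetical stand-in for a model: it upper-cases each prompt.
def toy_model(prompts: List[str]) -> List[str]:
    return [p.upper() for p in prompts]


# The single capability under test: the output must be upper case.
def is_upper(prompt: str, output: str) -> bool:
    return output.isupper()


fail_rate = run_mft(toy_model, ["hello", "good morning"], is_upper)
print(f"Fail rate: {fail_rate:.0%}")
```

As with a unit test, the result is a pass or fail per input plus an overall failure rate; the tests later in this notebook follow the same shape with a real model.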

## Setup generative model
Before we can test anything, we need to set up our language model. We will use the Hugging Face Transformers library to load a GPT2 model.

First, we create a tokenizer. The tokenizer splits a string into tokens (whole words or pieces of words) and maps each token to an integer ID that the model can process.
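
The idea can be sketched with a toy tokenizer. The vocabulary and the greedy longest-match rule below are assumptions made for illustration; the real GPT2 tokenizer uses byte-level byte pair encoding with a much larger vocabulary.

```python
from typing import Dict, List

# Hypothetical six-entry vocabulary; real IDs and tokens differ.
TOY_VOCAB: Dict[str, int] = {
    "Where": 0, "fore": 1, " art": 2, " thou": 3, " Romeo": 4, "?": 5,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def toy_encode(text: str) -> List[int]:
    """Greedily match the longest known token at each position."""
    ids = []
    pos = 0
    while pos < len(text):
        match = max(
            (tok for tok in TOY_VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"No token matches at position {pos}")
        ids.append(TOY_VOCAB[match])
        pos += len(match)
    return ids


def toy_decode(ids: List[int]) -> str:
    """Map IDs back to their token strings and join them."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = toy_encode("Wherefore art thou Romeo?")
print(ids)
print(toy_decode(ids))
```

Note that a single word can be split into several tokens, and that leading spaces are part of a token.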

In [3]:
# Load pretrained model tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Demonstrate what the tokenizer does
tokenizer.encode("Wherefore art thou Romeo?")

[8496, 754, 1242, 14210, 43989, 30]

Our tokenizer has turned the human-readable text into a list of numbers that the model understands. Next, let's load the GPT2 model.

In [4]:
# Load pretrained model (weights)
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU when no GPU is available
model.eval()
model.to(device)
"Model loaded"

'Model loaded'

Generating text with the model requires a bit of work. Let's write a function `generate_sentences` to handle the text generation.

`generate_sentences` has 1 parameter, `prompts`, which is a list of strings. A prompt is a string that the model will use as a starting point for generating new text. It gives the model context about what kind of text should be generated.

`generate_sentences` returns a list containing one generated continuation per prompt, with the prompt text itself removed.

In [5]:
def generate_sentences(prompts: List[str]) -> List[str]:
    sentences = []
    for prompt in prompts:
        token_tensor = tokenizer.encode(prompt, return_tensors='pt').to(device) # return_tensors = "pt" returns a PyTorch tensor
        out = model.generate(
            token_tensor,
            do_sample=True,
            min_length=10,
            max_length=50,
            num_beams=1,
            temperature=1.0,
            no_repeat_ngram_size=2,
            early_stopping=False,
            output_scores=True,
            return_dict_in_generate=True)
        text = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
        sentences.append(text[len(prompt):])
    return sentences

In [6]:
generate_sentences(["Wherefore art thou Romeo?"])

[' if it were not, then thou art a devil, with his body made of the fire: thy body he that is not with fire makes of it. But thou, however, have done evil and have been condemned.']
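
Because `do_sample=True`, each run can produce a different continuation. The sketch below illustrates the general idea of temperature sampling over next-token scores. It is a simplified illustration with hypothetical scores, not the code path inside `model.generate`, which applies further processing such as `no_repeat_ngram_size`.

```python
import math
import random
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float],
                             temperature: float = 1.0) -> Dict[str, float]:
    """Turn raw scores into probabilities; lower temperatures sharpen them."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_score = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - max_score) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_next_token(logits: Dict[str, float],
                      temperature: float = 1.0,
                      rng=random) -> str:
    """Draw one token at random, weighted by its probability."""
    probs = softmax_with_temperature(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical scores for three candidate next tokens
toy_logits = {" English": 2.0, " Spanish": 1.0, " not": 0.5}
rng = random.Random(123)
print(sample_next_token(toy_logits, temperature=1.0, rng=rng))
```

With `temperature=1.0`, as used in `generate_sentences`, the scores are left unscaled before they are converted to probabilities.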

Now that everything is ready, we can write our first MFT.

## MFT - Language prompt
For this MFT, we will expect the model to create a reasonable continuation of a prompt. The model will be prompted with strings like "The most commonly spoken language in {country} is " where {country} is a placeholder for a country such as Spain.

We need to create a rule to determine if the model passes our test. The criteria for passing or failing the test are entirely user defined. We will consider this MFT to pass if the model's output contains any language name. This will demonstrate that the model understands the general context of the prompt. The mentioned language doesn't have to be accurate - for example, "In Spain the most commonly spoken language is Indonesian" would pass our test, because Indonesian is a language. The language may also be located anywhere in the output - for example, "In Spain the most commonly spoken language is not easy to learn. Spanish has many complicated conjugations." would also pass our test.

In a later section of this notebook, there is another version of this MFT that is stricter, requiring the correct language to be mentioned in the response.
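
As a quick sanity check of the lenient rule, the sketch below applies it to the two example sentences above plus one that mentions no language. The shortened `LANGUAGES` list and the function name are hypothetical.

```python
from typing import List

# Hypothetical, shortened language list for illustration only
LANGUAGES: List[str] = ["English", "Spanish", "Indonesian", "French"]


def mentions_any_language(text: str, languages: List[str] = LANGUAGES) -> bool:
    """Pass if any language name appears anywhere in the text."""
    return any(language in text for language in languages)


examples = [
    "In Spain the most commonly spoken language is Indonesian",
    "In Spain the most commonly spoken language is not easy to learn. "
    "Spanish has many complicated conjugations.",
    "In Spain the most commonly spoken language is hard to pin down.",
]
for example in examples:
    print(mentions_any_language(example), "-", example)
```

The first two sentences pass even though the first is factually wrong; the third fails because it names no language at all.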

### Handwritten MFT
First, we will write the MFT by hand. Then, we'll use Checklist's MFT class to demonstrate how Checklist helps us create the MFT much more quickly.

#### Generate prompts from template
We will use Checklist's Editor class to quickly create the prompts. For a detailed explanation of generating data, see the "1. Generating data" tutorial notebook.

In [7]:
editor = Editor()
# Note: remove the country parameter to generate prompts with random countries
prompt_strs = editor.template("The most commonly spoken language in {country} is", country = ["United States", "France", "Guatemala", "Mongolia", "Japan"])
prompt_strs.data

['The most commonly spoken language in United States is',
 'The most commonly spoken language in France is',
 'The most commonly spoken language in Guatemala is',
 'The most commonly spoken language in Mongolia is',
 'The most commonly spoken language in Japan is']
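
Conceptually, filling a template means substituting every listed value into the placeholder. The plain-Python sketch below shows that idea with a hypothetical helper, `fill_template`; it is not how Checklist's Editor is implemented, and the Editor offers many more features.

```python
from itertools import product
from typing import List


def fill_template(template: str, **options: List[str]) -> List[str]:
    """Return one string per combination of placeholder values."""
    names = list(options)
    combos = product(*(options[name] for name in names))
    return [template.format(**dict(zip(names, combo))) for combo in combos]


prompts_sketch = fill_template(
    "The most commonly spoken language in {country} is",
    country=["France", "Japan"],
)
print(prompts_sketch)
```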

#### Language CSV
We need a list of languages to check if the model's output contains a language. To save some time, we will read language names from a CSV file. The data comes from the ISO Language Codes dataset (https://datahub.io/core/language-codes).

In [8]:
import urllib.request
urllib.request.urlretrieve('https://datahub.io/core/language-codes/r/language-codes.csv', 'language-codes.csv')
lang_codes_csv = pd.read_csv('language-codes.csv')
lang_codes_csv

Unnamed: 0,alpha2,English
0,aa,Afar
1,ab,Abkhazian
2,ae,Avestan
3,af,Afrikaans
4,ak,Akan
...,...,...
179,yi,Yiddish
180,yo,Yoruba
181,za,Zhuang; Chuang
182,zh,Chinese
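
Some cells in the `English` column hold several names separated by a semicolon, such as `Zhuang; Chuang` in the output above. A substring check against the whole cell would only match a response containing that exact string. The sketch below shows one possible way to split such cells into separate names. It uses a small inline sample in place of the downloaded file, and `load_language_names` is a hypothetical helper.

```python
import csv
import io
from typing import List

# Inline sample mirroring a few rows of the CSV shown above
SAMPLE_CSV = """alpha2,English
aa,Afar
za,Zhuang; Chuang
zh,Chinese
"""


def load_language_names(csv_text: str) -> List[str]:
    """Return every language name, splitting cells that list alternatives."""
    names = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in row["English"].split(";"):
            name = name.strip()
            if name:
                names.append(name)
    return names


print(load_language_names(SAMPLE_CSV))
```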


#### Run the MFT
Now we're ready to create the MFT. We will create 3 Pandas dataframes, one each for prompts, responses, and results. Then, we will loop over the prompts, send each prompt to the model, and determine if it passes or fails the test. Each prompt and its test result will be recorded in the dataframes.

In [9]:
langs = lang_codes_csv["English"].tolist()

model_responses = generate_sentences(prompt_strs.data)

# DataFrame.append was removed in pandas 2.0, so collect the rows in lists
# and build each dataframe once after the loop
prompt_rows, response_rows, result_rows = [], [], []

for (i, response) in enumerate(model_responses):
    pf = 'fail'

    # Check if any language from the CSV data is in the generated string
    for l in langs:
        if l in response:
            pf = 'pass'
            break

    prompt_rows.append({"id": i, "prompt": prompt_strs.data[i]})
    response_rows.append({"id": i, "response": response})
    result_rows.append({"id": i, "p/f": pf})

prompts = pd.DataFrame(prompt_rows, columns=["id", "prompt"])
responses = pd.DataFrame(response_rows, columns=["id", "response"])
results = pd.DataFrame(result_rows, columns=["id", "p/f"])

#### Show test results
Now let's look at the results of our test.

In [10]:
pd.set_option("max_colwidth", 250)

In [11]:
prompts

Unnamed: 0,id,prompt
0,0.0,The most commonly spoken language in United States is
1,1.0,The most commonly spoken language in France is
2,2.0,The most commonly spoken language in Guatemala is
3,3.0,The most commonly spoken language in Mongolia is
4,4.0,The most commonly spoken language in Japan is


In [12]:
responses

Unnamed: 0,id,response
0,0.0,English. Nearly all U.S. adult males speak English well above grade level and in the U-1 category the average verbal IQ is 98 percent. For higher grade language learners such as children and others
1,1.0,"""l'homme"", or what one French word for ""father"", means.\n\nBut in the United States, which had more than 60,000 native-born people in 2010, there was nothing"
2,2.0,"Guatemalan and has a slightly older pronunciation (14 years old).\n\nMost of these languages have more or less the same vocabulary, as they come from different parts of the world. Most common is English ("
3,3.0,"Ainu, but an interesting source for these terms is the Yuktai language spoken in the country near the border with Turkmenistan. Another official official word is Shum, and it is easy to"
4,4.0,"""Fuku wo kun-mitsu."" It means ""A man cannot run."" However, there are other ways of speaking the word—such as ""kimasu.""\n\nThe Kansai"


In [13]:
results

Unnamed: 0,id,p/f
0,0.0,pass
1,1.0,pass
2,2.0,pass
3,3.0,pass
4,4.0,fail


We can merge all the dataframes to make the results easier to read.

In [14]:
merged = pd.merge(responses, results, on="id")
merged = pd.merge(prompts, merged, on="id")
merged

Unnamed: 0,id,prompt,response,p/f
0,0.0,The most commonly spoken language in United States is,English. Nearly all U.S. adult males speak English well above grade level and in the U-1 category the average verbal IQ is 98 percent. For higher grade language learners such as children and others,pass
1,1.0,The most commonly spoken language in France is,"""l'homme"", or what one French word for ""father"", means.\n\nBut in the United States, which had more than 60,000 native-born people in 2010, there was nothing",pass
2,2.0,The most commonly spoken language in Guatemala is,"Guatemalan and has a slightly older pronunciation (14 years old).\n\nMost of these languages have more or less the same vocabulary, as they come from different parts of the world. Most common is English (",pass
3,3.0,The most commonly spoken language in Mongolia is,"Ainu, but an interesting source for these terms is the Yuktai language spoken in the country near the border with Turkmenistan. Another official official word is Shum, and it is easy to",pass
4,4.0,The most commonly spoken language in Japan is,"""Fuku wo kun-mitsu."" It means ""A man cannot run."" However, there are other ways of speaking the word—such as ""kimasu.""\n\nThe Kansai",fail


Finally, let's display the failing tests.

In [15]:
merged.loc[merged['p/f'] == 'fail']

Unnamed: 0,id,prompt,response,p/f
4,4.0,The most commonly spoken language in Japan is,"""Fuku wo kun-mitsu."" It means ""A man cannot run."" However, there are other ways of speaking the word—such as ""kimasu.""\n\nThe Kansai",fail
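
One caveat about the check `l in response`: it is a substring match, so a language name that appears inside a longer word also counts. For instance, if the language list includes "Turkmen" (an assumption about the CSV contents), the word "Turkmenistan" in the Mongolia response would be enough for a pass. A stricter variant, sketched below with a hypothetical helper, matches whole words only.

```python
import re
from typing import List


def contains_language_word(text: str, languages: List[str]) -> bool:
    """Pass only if a language name appears as a whole word in the text."""
    for language in languages:
        # \b marks a word boundary; re.escape guards names with special characters
        if re.search(r"\b" + re.escape(language) + r"\b", text):
            return True
    return False


print(contains_language_word("near the border with Turkmenistan", ["Turkmen"]))
print(contains_language_word("English. Nearly all U.S. adult males", ["English"]))
```

Whether the stricter rule is preferable depends on what the test is meant to show; the rest of this notebook keeps the substring check.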


### Test with Checklist

Next, let's try running the MFT with Checklist. We will no longer need to keep track of results in Pandas dataframes, since Checklist will track the results for us.

#### Create the expectation function
In order to determine if an example passes or fails the test, Checklist uses an expectation function. An expectation function receives an example (the input, the model's prediction, and its confidence, plus an optional label and metadata) and returns `True` if the example passes the test or `False` if it fails.

In [16]:
def response_contains_language(x, pred, conf, label=None, meta=None):
    for l in langs:
        if l in pred:
            return True
    return False

We will wrap this function with `Expect.single`, which causes the expectation function to be called for each example. In other cases, you might want to have an expectation function that checks multiple examples simultaneously. See the tutorial notebook "3. Test types, expectation functions, running tests" for detailed information about expectation functions.

In [17]:
contains_language_expect_fn = Expect.single(response_contains_language)
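
The effect of wrapping a per-example function can be pictured as follows. This is a simplified stand-in written for illustration, with the hypothetical name `expect_single`; it is not Checklist's implementation of `Expect.single`.

```python
from typing import Any, Callable, List, Optional


def expect_single(fn: Callable[..., bool]) -> Callable[..., List[bool]]:
    """Lift a per-example check into one that handles a whole batch."""
    def expect(xs: List[str], preds: List[str], confs: List[float],
               labels: Optional[List[Any]] = None,
               metas: Optional[List[Any]] = None) -> List[bool]:
        n = len(xs)
        labels = labels if labels is not None else [None] * n
        metas = metas if metas is not None else [None] * n
        # Call the per-example function once for each example
        return [fn(x, pred, conf, label, meta)
                for x, pred, conf, label, meta
                in zip(xs, preds, confs, labels, metas)]
    return expect


def mentions_english(x, pred, conf, label=None, meta=None):
    return "English" in pred


check = expect_single(mentions_english)
print(check(["prompt a", "prompt b"], ["English.", "Guatemalan."], [1.0, 1.0]))
```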

Now we can feed our prompts and expectation function into the MFT constructor.

In [18]:
test = MFT(**prompt_strs, name='Language in response', description='The response contains a language.', expect=contains_language_expect_fn)

In order to run the test, Checklist also needs a function that generates the model's predictions for the inputs. The function receives all inputs (prompts) as a list, and must return the results in a tuple `(model_predictions, confidences)`, where `model_predictions` is a list of all the predictions, and `confidences` is a list of the model's scores for those predictions.

We will not be using confidences in this test. Checklist provides a wrapper function `PredictorWrapper.wrap_predict()` that outputs a tuple with a confidence score of 1 for any prediction. We can use it to wrap `generate_sentences` so the predictions will have a confidence score as needed.

In [19]:
wrapped_generator = PredictorWrapper.wrap_predict(generate_sentences)
wrapped_generator(["The most commonly spoken language in Brazil is "])

(['русского.\n\nThe Portuguese word for "pursuit" is a sort of "suit," also in the form уми, a kind of shell'],
 array([1.]))
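
The behavior seen above can be approximated in a few lines. The helper below is a hypothetical stand-in, not Checklist's code; it returns a plain list of confidences, whereas the real wrapper returned a NumPy array in the output above.

```python
from typing import Callable, List, Tuple


def wrap_with_unit_confidence(
    predict_fn: Callable[[List[str]], List[str]]
) -> Callable[[List[str]], Tuple[List[str], List[float]]]:
    """Return a function that pairs each prediction with a confidence of 1.0."""
    def wrapped(inputs: List[str]) -> Tuple[List[str], List[float]]:
        preds = predict_fn(inputs)
        return preds, [1.0] * len(preds)
    return wrapped


# Hypothetical predictor used only to exercise the wrapper
def shout(inputs: List[str]) -> List[str]:
    return [s.upper() for s in inputs]


wrapped_shout = wrap_with_unit_confidence(shout)
print(wrapped_shout(["hello", "world"]))
```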

Now we're ready to run the test. The first argument to the `test.run()` function is the generator function we just created. We will also set the optional parameter `overwrite=True` so the test can be re-run without an error. If `overwrite=False`, then Checklist will reject subsequent test runs to prevent us from accidentally overwriting our test results.

In [20]:
test.run(wrapped_generator, overwrite=True)

Predicting 5 examples


To see the results, we can use the `summary` function.

In [21]:
def format_example(x, pred, conf, label=None, meta=None): 
    return 'Prompt:      %s\nCompletion:      %s' % (x, pred) 

In [22]:
test.summary(format_example_fn = format_example)

Test cases:      5
Fails (rate):    1 (20.0%)

Example fails:
Prompt:      The most commonly spoken language in Guatemala is
Completion:       Guatemalan.

Gabor Banda was born in Buntamas the day he was arrested. The elder Babor (35) came to Guatemala in 1892 but left after the war, and stayed
----


Test results can also be explored visually by using the `visual_summary` function.

In [23]:
test.visual_summary()

TestSummarizer(stats={'npassed': 4, 'nfailed': 1, 'nfiltered': 0}, summarizer={'name': 'Language in response',…

## MFT - Language prompt with accurate response

Let's make our test a little stricter to better understand the model's behavior. We will now require the model to respond with the correct language instead of any language in general. By using the `meta=True` argument for `editor.template()`, the country associated with each prompt will be stored in the `country_prompts` object.


In [24]:
country_prompts = editor.template("The most commonly spoken language in {country} is  ", country = ["United States", "France", "Guatemala", "Mongolia", "Japan"], meta=True)
correct_responses = {
    "United States": "English",
    "France": "French",
    "Guatemala": "Spanish",
    "Mongolia": "Mongolian",
    "Japan": "Japanese"
}

The country metadata can be accessed with `country_prompts.meta`.

In [25]:
country_prompts.meta

[{'country': 'United States'},
 {'country': 'France'},
 {'country': 'Guatemala'},
 {'country': 'Mongolia'},
 {'country': 'Japan'}]
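
The metadata list lines up with the prompt list by position: entry `i` of one describes entry `i` of the other. The sketch below reproduces that pairing with plain lists and a hypothetical helper, `fill_with_meta`, to show how the expected language can be looked up for each prompt.

```python
from typing import Dict, List, Tuple


def fill_with_meta(template: str,
                   countries: List[str]) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return the filled prompts and, at the same index, the value used for each."""
    data = [template.format(country=c) for c in countries]
    meta = [{"country": c} for c in countries]
    return data, meta


# Shortened copy of the country-to-language mapping defined earlier
expected_language = {"France": "French", "Japan": "Japanese"}

data, meta = fill_with_meta(
    "The most commonly spoken language in {country} is",
    ["France", "Japan"],
)
for prompt, m in zip(data, meta):
    print(prompt, "->", expected_language[m["country"]])
```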

### Handwritten Test

In [26]:
model_responses = generate_sentences(country_prompts.data)

# DataFrame.append was removed in pandas 2.0, so collect the rows in lists
# and build each dataframe once after the loop
prompt_rows, response_rows, result_rows = [], [], []

for (i, response) in enumerate(model_responses):
    pf = 'fail'
    country = country_prompts.meta[i]["country"]

    # Check if the correct language is in the response
    language = correct_responses[country]
    if language in response:
        pf = 'pass'

    prompt_rows.append({"id": i, "prompt": country_prompts.data[i]})
    response_rows.append({"id": i, "response": response})
    result_rows.append({"id": i, "p/f": pf})

prompts = pd.DataFrame(prompt_rows, columns=["id", "prompt"])
responses = pd.DataFrame(response_rows, columns=["id", "response"])
test_results = pd.DataFrame(result_rows, columns=["id", "p/f"])


#### Show test results
Let's look at our test results. The first dataframe contains the prompts given to the model.

In [27]:
prompts

Unnamed: 0,id,prompt
0,0.0,The most commonly spoken language in United States is
1,1.0,The most commonly spoken language in France is
2,2.0,The most commonly spoken language in Guatemala is
3,3.0,The most commonly spoken language in Mongolia is
4,4.0,The most commonly spoken language in Japan is


The next dataframe shows the model's response to each prompt (not including the prompt itself).

In [28]:
responses

Unnamed: 0,id,response
0,0.0,"ˈeˌs, which is pronounced by its singular, plural form. While both ī and ē can sometimes be pronounced English, ɒ is generally not. Instead, I"
1,1.0,"ˈmʒbɛt (to translate, speak, etc.). The verb ""to say"" stands for in English, which has become the first Latin language where the meaning and pronunciation is"
2,2.0,"əʒ (i.e., yam) which means ""one who sits up and speaks as a yahtzee."" Yaht is spoken by two groups of individuals: the"
3,3.0,"̈̄̇̅ (the Mongolian language), which in many other parts also means I (of). Mongolians also call the English spoken by the Greeks ""gok"", while"
4,4.0,"ō. The noun is often translated as ""unpleasant"", ""sore"", and ""disgusting"". It's an English word that makes sense if you consider it as an American noun"


The final dataframe shows the pass/fail status of each test case.

In [29]:
test_results

Unnamed: 0,id,p/f
0,0.0,pass
1,1.0,fail
2,2.0,fail
3,3.0,pass
4,4.0,fail


### Testing with Checklist
Now let's run the test with Checklist. All we need is a new expectation function. The rest of the process is the same as before.

In [30]:
def response_contains_correct_language(x, pred, conf, label=None, meta=None):
    country = meta['country']
    language = correct_responses[country]
    return language in pred

In [31]:
correct_language_expect_fn = Expect.single(response_contains_correct_language)

In [32]:
test = MFT(**country_prompts, name='Correct language in response', description='The response contains the correct language for the country in the prompt.', expect=correct_language_expect_fn)

In [33]:
test.run(wrapped_generator, overwrite=True)

Predicting 5 examples


In [34]:
test.summary(format_example_fn = format_example)

Test cases:      5
Fails (rate):    2 (40.0%)

Example fails:
Prompt:      The most commonly spoken language in Mongolia is  
Completion:      ike ("labor") and ki ("work").  This is a traditional term which means for any of three things:
 ikki means the work done on one's personal work 
----
Prompt:      The most commonly spoken language in Guatemala is  
Completion:      ͡°  as اجث الرجة ااتلوحهى لشاس ملا حم
----


In [35]:
test.visual_summary()

TestSummarizer(stats={'npassed': 3, 'nfailed': 2, 'nfiltered': 0}, summarizer={'name': 'Correct language in re…