In [None]:
!pip install transformers

# Easy BERT for EdTech

As a teacher who makes bespoke tests for my students, I find that coming up with wrong-but-tempting 'alternatives' for multiple-choice questions can be surprisingly time-consuming. Off-the-shelf BERT does a great job of streamlining this task.

This simple script takes a sentence and the answer word to blank out, then uses BERT's masked-word predictions to generate the alternatives for a multiple-choice question.

I use this myself when making exams and it certainly speeds up the process.
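
To make the input concrete: the sentence is turned into a fill-in-the-blank prompt by swapping the answer for BERT's `[MASK]` token. Here is a minimal sketch of that step, assuming a hypothetical helper called `mask_answer` (it is not part of the script) that matches whole words only, so an answer like "sail" does not blank out part of "sails":

```python
import re

def mask_answer(text, answer, mask_token="[MASK]"):
    """Replace the first whole-word occurrence of `answer` with `mask_token`.

    Hypothetical helper for illustration; raises ValueError if the answer
    does not occur in the text as a whole word.
    """
    pattern = r"\b" + re.escape(answer) + r"\b"
    masked, count = re.subn(pattern, mask_token, text, count=1)
    if count == 0:
        raise ValueError(f"{answer!r} not found as a whole word in the text")
    return masked

print(mask_answer("The captain told the crew to raise the sails.", "captain"))
```

Masking only the first match keeps exactly one `[MASK]` in the prompt, which is what the rest of the script expects.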



In [2]:
import torch
from transformers import BertTokenizer, BertForMaskedLM


In [7]:
import random


In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [37]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()



def generate_multiple_choice_questions(text, answer, top_k=2000, offset=500, choices=5):
    """Print a multiple-choice question that blanks out `answer` in `text`."""
    original = text
    # Mask only the first occurrence so the input contains exactly one [MASK].
    text = text.replace(answer, "[MASK]", 1)
    text = "[CLS] %s [SEP]" % text
    tokenized_text = tokenizer.tokenize(text)
    masked_index = tokenized_text.index("[MASK]")
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])

    with torch.no_grad():
        outputs = model(tokens_tensor)
        predictions = outputs[0]

    # Rank the vocabulary by predicted probability at the masked position.
    probs = torch.nn.functional.softmax(predictions[0, masked_index], dim=-1)
    _, top_k_indices = torch.topk(probs, top_k, sorted=True)
    alternatives = tokenizer.convert_ids_to_tokens(top_k_indices.tolist())

    # Drop word-piece fragments and the answer itself, then skip the `offset`
    # most likely candidates, which tend to be too close to the correct answer.
    alternatives = [x for x in alternatives if '##' not in x and x != answer.lower()]
    # Sample without replacement so that no alternative is listed twice.
    alternatives = random.sample(alternatives[offset:], k=choices) + [answer]

    random.shuffle(alternatives)
    question = original.replace(answer, " ___________ ", 1)

    print(f'QUESTION: {question}')
    print()
    for idx, x in enumerate(alternatives):
        # Label the options A., B., C., ...
        print(f'{chr(ord("A") + idx)}. {x}')

    print()
    print(f'CORRECT ANSWER: {answer}')

    





### Some example outputs

Notice that the alternative answers are sometimes grammatically plausible but are clearly weaker choices than the correct response. This is exactly the result we're looking for, and it has potential for classroom use as well as ed-tech applications.
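
To see why the offset produces weaker alternatives, here is a minimal sketch of the selection step on a made-up ranked list. The words, their order, and the `pick_distractors` helper are illustrative assumptions, not real model output:

```python
import random

def pick_distractors(ranked_tokens, answer, offset, choices, seed=None):
    """Pick `choices` distinct distractors from a best-first ranked list.

    Hypothetical helper: drops word-piece fragments ('##...') and the answer,
    skips the `offset` highest-ranked candidates, then samples without
    replacement from the rest.
    """
    rng = random.Random(seed)
    candidates = [t for t in ranked_tokens
                  if not t.startswith("##") and t != answer]
    pool = candidates[offset:]
    if len(pool) < choices:
        raise ValueError("not enough candidates left after applying the offset")
    return rng.sample(pool, choices)

# Made-up ranking for "... the ___ instructed his crew ..." (not real BERT output).
ranked = ["captain", "commander", "skipper", "##man", "admiral",
          "officer", "pilot", "letter", "crown", "falcon", "goldsmith"]
options = pick_distractors(ranked, "captain", offset=5, choices=3, seed=0)
print(options)
```

With `offset=5`, near-synonyms such as "commander" and "skipper" are skipped, so the sampled options come from the less plausible tail of the list. A smaller offset makes the question harder, and a larger one makes it easier.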

In [38]:
text = "The ship was headed northwards, and the captain instructed his crew to raise the sails."
answer = "captain"
top_k = 500 ## number of top-ranked predictions to consider
offset = 100 ## number of most likely (too plausible) predictions to skip
choices = 5 ## number of alternative answers

generate_multiple_choice_questions(text, answer, top_k=top_k, offset=offset, choices=choices)

QUESTION: The ship was headed northwards, and the  ___________  instructed his crew to raise the sails.

A. letter
B. adventurous
C. captain
D. crown
E. goldsmith
F. falcon

CORRECT ANSWER: captain


In [39]:
text = "The visit had not been announced, but did not come as a complete surprise. For at least a fortnight rumours have swirled around Kyiv that the leader of the free world might extend a planned trip to Poland to Ukraine, too."
answer = "swirled"
top_k = 500 ## number of top-ranked predictions to consider
offset = 100 ## number of most likely (too plausible) predictions to skip
choices = 5 ## number of alternative answers

generate_multiple_choice_questions(text, answer, top_k=top_k, offset=offset, choices=choices)

QUESTION: The visit had not been announced, but did not come as a complete surprise. For at least a fortnight rumours have  ___________  around Kyiv that the leader of the free world might extend a planned trip to Poland to Ukraine, too.

A. arose
B. awakened
C. hardened
D. roaming
E. swirled
F. stalked

CORRECT ANSWER: swirled


In [40]:
text = "After three years of covid-19 restrictions, this wanderlust is understandable. But alongside the obvious motives—sun, sea, sand and study—is another unstated one: spiriting money out of the country. "
answer = "wanderlust"
top_k = 500 ## number of top-ranked predictions to consider
offset = 100 ## number of most likely (too plausible) predictions to skip
choices = 5 ## number of alternative answers

generate_multiple_choice_questions(text, answer, top_k=top_k, offset=offset, choices=choices)

QUESTION: After three years of covid-19 restrictions, this  ___________  is understandable. But alongside the obvious motives—sun, sea, sand and study—is another unstated one: spiriting money out of the country. 

A. wanderlust
B. device
C. ultimately
D. lifestyle
E. end
F. provision

CORRECT ANSWER: wanderlust
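
One caveat when tuning these parameters: `top_k` cannot exceed the model's vocabulary size (30,522 tokens for `bert-base-uncased`), because `torch.topk` raises an error when asked for more entries than exist. The `offset` must also leave at least `choices` candidates to sample from. A minimal sketch of a sanity check, assuming a hypothetical `check_sampling_params` helper that is not part of the script above:

```python
BERT_BASE_UNCASED_VOCAB_SIZE = 30522  # vocabulary size of bert-base-uncased

def check_sampling_params(top_k, offset, choices,
                          vocab_size=BERT_BASE_UNCASED_VOCAB_SIZE):
    """Raise ValueError for parameter combinations that cannot work.

    Hypothetical helper. The second check is necessary but not sufficient:
    filtering out word-piece fragments can shrink the pool further.
    """
    if not 0 < top_k <= vocab_size:
        raise ValueError(f"top_k must be between 1 and {vocab_size}, got {top_k}")
    if offset < 0 or offset + choices > top_k:
        raise ValueError("offset + choices must not exceed top_k")

check_sampling_params(top_k=500, offset=100, choices=5)  # passes silently
```

Calling it at the top of the generator function would turn an obscure tensor error into a readable message.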
