# Roberta Cryptonite
We will try to use RoBERTa to predict the answers to Cryptonite cryptic crossword clues. The plan is to first use RoBERTa to pick the answer from a set of choices. Then we may try masked-token prediction, and after that we may integrate some prompt engineering techniques such as chain of thought.  
Here is the link to the implementation:  
[https://huggingface.co/docs/transformers/en/model_doc/roberta](https://huggingface.co/docs/transformers/en/model_doc/roberta)

In [1]:
modified_train_fp = '../datasets/cryptonite-official-split/cryptonite-train-choice.jsonl'
modified_val_fp = "../datasets/cryptonite-official-split/cryptonite-val-choice.jsonl"
modified_test_fp = '../datasets/cryptonite-official-split/cryptonite-test-choice.jsonl'

In [2]:
# !pip install datasets
from datasets import load_dataset
# dataset = load_dataset("aviaefrat/cryptonite", "cryptonite")
dataset = load_dataset('json', data_files={'train': modified_train_fp, 'validation': modified_val_fp, 'test': modified_test_fp})

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [3]:
# !pip install transformers
# !pip install torch torchvision
dataset['train'][77]


{'publisher': 'Times',
 'date': 971913600000,
 'author': '',
 'number': '1',
 'orientation': 'across',
 'clue': 'got rid of piece of hi-fi, a particular horror of mine (8)',
 'answer': 'firedamp',
 'enumeration': '(8)',
 'quick': False,
 'sub_publisher': 'The Times',
 'choice1': 'Aktivist',
 'choice2': 'rentaler',
 'choice3': 'seemably'}

## Multiple Choice Roberta (Test)

In [8]:
from transformers import AutoTokenizer, RobertaForMultipleChoice
import torch

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
model = RobertaForMultipleChoice.from_pretrained("FacebookAI/roberta-base")

# take one fixed example (index 77) from the train set
sample = dataset['train'][77]

prompt = sample['clue']
choice0 = sample['choice1']
choice1 = sample['answer']  # the true answer is placed at index 1 here
choice2 = sample['choice2']
choice3 = sample['choice3']

prompts = [prompt, prompt, prompt, prompt]
choices = [choice0, choice1, choice2, choice3]
# choice1 (the true answer) is correct, so the label is 1; batch size 1
labels = torch.tensor(1).unsqueeze(0)
# Each copy of the prompt is paired with a different choice: [prompt, prompt, ...], [choice0, choice1, ...]
encoding = tokenizer(prompts, choices, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()}, labels=labels)  # batch size is 1

# the linear classifier still needs to be trained
loss = outputs.loss
logits = outputs.logits

Some weights of RobertaForMultipleChoice were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
logits

tensor([[-0.0078, -0.0090, -0.0136, -0.0077]])

Let's test the accuracy on the first 1000 training samples, assuming that a higher logit means the model assigns a higher probability to the corresponding choice. The code below is quick and unpolished.
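
To make the assumption above concrete, here is a small sketch in plain Python (no `torch` needed) that applies softmax to the four logits printed above. The helper name `softmax` is ours; the numbers are copied from the cell output.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# logits copied from the output of the untrained model above
logits = [-0.0078, -0.0090, -0.0136, -0.0077]
probs = softmax(logits)

# softmax is monotonic, so the largest logit is also the largest probability
predicted = max(range(len(logits)), key=lambda i: logits[i])
print([round(p, 4) for p in probs], predicted)
```

Because softmax preserves ordering, comparing logits directly is enough to pick the predicted choice. Here all four probabilities are close to 0.25, which is roughly what one would expect from a classifier head that has not been trained yet.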

In [6]:
predict_correct_count = 0
total_count = 0
for i in range(1000):
    # take the i-th example from the train set
    sample = dataset['train'][i]
    
    prompt = sample['clue']
    choice0 = sample['answer']
    choice1 = sample['choice1']
    choice2 = sample['choice2']
    choice3 = sample['choice3']
    
    prompts = [prompt, prompt, prompt, prompt]
    choices = [choice0, choice1, choice2, choice3]
    # choice0 (the true answer) is always the correct one here, so the label is 0; batch size 1
    labels = torch.tensor(0).unsqueeze(0)
    # Each copy of the prompt is paired with a different choice: [prompt, prompt, ...], [choice0, choice1, ...]
    encoding = tokenizer(prompts, choices, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()}, labels=labels)  # batch size is 1
    
    # the linear classifier still needs to be trained
    loss = outputs.loss
    logits = outputs.logits[0]

    # see if the first choice is the highest probability
    if logits[0] > torch.max(logits[1:]):
        predict_correct_count += 1
    total_count += 1


In [7]:
print(f'correct predict {predict_correct_count} out of {total_count}, accuracy: {predict_correct_count/total_count}')

correct predict 198 out of 1000, accuracy: 0.198


So without training, the accuracy (19.8%) is below the 25% expected from random guessing among four choices. The next step is to train on a small dataset and see whether the loss improves.
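
As a rough check on how far 198 out of 1000 is from chance, the sketch below computes the standard error of a pure random guesser with four choices. It assumes each prediction is an independent guess with success probability 0.25, which is only an approximation for an untrained model; the variable names are ours.

```python
import math

n_samples = 1000  # number of evaluated clues (from the run above)
n_correct = 198   # correct predictions (from the run above)
n_choices = 4     # the answer plus three distractors

chance = 1 / n_choices
accuracy = n_correct / n_samples
# standard error of the accuracy of a random guesser (binomial model)
std_err = math.sqrt(chance * (1 - chance) / n_samples)
# distance from chance, measured in standard errors
z_score = (accuracy - chance) / std_err

print(f"accuracy={accuracy:.3f} chance={chance:.2f} se={std_err:.4f} z={z_score:.2f}")
```

Under that assumption the result sits several standard errors below chance, so the untrained head is not behaving like a uniform random guesser on these samples. Training is needed before reading anything more into the number.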

In [11]:
encoding.keys()

dict_keys(['input_ids', 'attention_mask'])

## Multiple Choice Roberta (Train)

In [40]:
import os
from datasets import load_from_disk

# preprocess the whole dataset up front
def preprocess_function(sample):
    prompt = sample['clue']
    choice0 = sample['answer']
    choice1 = sample['choice1']
    choice2 = sample['choice2']
    choice3 = sample['choice3']
    
    prompts = [prompt, prompt, prompt, prompt]
    choices = [choice0, choice1, choice2, choice3]
  
    # Each copy of the prompt is paired with a different choice: [prompt, prompt, ...], [choice0, choice1, ...]
    encoding = tokenizer(prompts, choices, return_tensors="pt", padding=True)
    # choice0 (the true answer) is always the correct one, so the label is 0
    labels = torch.tensor(0).unsqueeze(0) 
    # add labels to encoding
    encoding['labels'] = labels
    return encoding

processed_dataset_dir = '../datasets/processed_dataset'
if not os.path.exists(processed_dataset_dir):
    dataset.set_format(columns=['clue', 'answer', 'choice1', 'choice2', 'choice3'])
    processed_dataset = dataset.map(preprocess_function) # takes about 4 mins
    # save the processed dataset so that we don't need to process it again
    processed_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
    processed_dataset.save_to_disk(processed_dataset_dir)
# a dataset written with save_to_disk is read back with load_from_disk, not load_dataset
processed_dataset = load_from_disk(processed_dataset_dir)
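
One thing to watch for when batching: `padding=True` inside `preprocess_function` only pads the four choices of a single clue to the same length, so different clues can end up with different sequence lengths. A minimal sketch of a collate step that pads a batch to a common length is shown below using plain Python lists. The function name `pad_batch` and the toy token ids are hypothetical; the pad id of 1 is an assumption meant to match `roberta-base` and should be read from `tokenizer.pad_token_id` in practice.

```python
def pad_batch(batch, pad_id=1):
    """Pad every choice in every example to the longest sequence in the batch.

    batch: list of examples, each a list of choices, each a list of token ids.
    Returns (input_ids, attention_mask), both shaped [batch][choices][max_len].
    """
    max_len = max(len(choice) for example in batch for choice in example)
    input_ids, attention_mask = [], []
    for example in batch:
        ids_rows, mask_rows = [], []
        for choice in example:
            n_pad = max_len - len(choice)
            ids_rows.append(choice + [pad_id] * n_pad)
            mask_rows.append([1] * len(choice) + [0] * n_pad)  # 0 marks padding
        input_ids.append(ids_rows)
        attention_mask.append(mask_rows)
    return input_ids, attention_mask

# two toy examples with two choices each (made-up token ids)
toy_batch = [
    [[0, 11, 12, 2], [0, 13, 2]],
    [[0, 21, 22, 23, 24, 2], [0, 25, 26, 2]],
]
ids, mask = pad_batch(toy_batch)
print(ids[0][1], mask[0][1])
```

The resulting nested lists can be turned into tensors of shape `(batch_size, num_choices, sequence_length)`, which is the layout the multiple-choice model was fed in the cells above.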


In [None]:
from torch import nn, optim

# model defined above: model = RobertaForMultipleChoice.from_pretrained("FacebookAI/roberta-base")
# define hyperparameters
lr = 0.001
# define the optimizer and the loss criterion
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)



## Masked Roberta

What can we do with masking? Probably a lot, including chain of thought. We can also train on simple statements such as "The word *mask* is nine letters long."
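
As a sketch of what such training statements could look like, the snippet below builds a masked sentence and its target word from an answer. The helper `make_length_prompt` and the number-word table are hypothetical, and the sketch glosses over the fact that a single `<mask>` stands for one token while many answers may be split into several tokens by the tokenizer.

```python
NUMBER_WORDS = {
    1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
    6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten",
    11: "eleven", 12: "twelve", 13: "thirteen", 14: "fourteen", 15: "fifteen",
}

def make_length_prompt(answer, mask_token="<mask>"):
    """Return (masked_sentence, target) for a single-word answer."""
    n_letters = sum(ch.isalpha() for ch in answer)  # ignore hyphens and spaces
    # fall back to digits when the length is outside the small table
    length_text = NUMBER_WORDS.get(n_letters, str(n_letters))
    unit = "letter" if n_letters == 1 else "letters"
    sentence = f"The word '{mask_token}' is {length_text} {unit} long."
    return sentence, answer

# 'firedamp' is the answer of the training example shown earlier
sentence, target = make_length_prompt("firedamp")
print(sentence, "->", target)
```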

In [96]:
from transformers import AutoTokenizer, RobertaForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
model = RobertaForMaskedLM.from_pretrained("FacebookAI/roberta-base")

inputs = tokenizer("The word '<mask>' is nine letters long.", return_tensors="pt")

with torch.no_grad():
    # raw scores before softmax: we only need the argmax, so softmax is not required here
    logits = model(**inputs).logits

In [97]:
logits.shape

torch.Size([1, 12, 50265])

From [https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/output#transformers.modeling_outputs.MaskedLMOutput](https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/output#transformers.modeling_outputs.MaskedLMOutput)
logits: (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax)  
batch_size = 1 because we are sending in one sentence.  
sequence_length is the number of tokens (not words).  
The last dimension is the vocabulary size, so for each token position the model gives a score for every vocabulary entry (about 50k here), and we only want the scores at the masked position. 
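
The indexing described above can be illustrated with a toy tensor written as nested Python lists. The sizes (4 token positions, 5 vocabulary entries), the scores, and the mask position are made up for illustration; the real output has shape (1, 12, 50265).

```python
# toy logits with shape (batch_size=1, sequence_length=4, vocab_size=5)
toy_logits = [[
    [0.1, 0.2, 0.0, -0.3, 0.5],   # position 0
    [1.2, -0.4, 0.3, 0.0, 0.1],   # position 1
    [-0.2, 0.9, 2.1, 0.4, -1.0],  # position 2 (pretend this is <mask>)
    [0.0, 0.1, 0.2, 0.3, 0.4],    # position 3
]]
mask_position = 2  # hypothetical index of the masked token

# select the row of vocabulary scores at the masked position
mask_scores = toy_logits[0][mask_position]
# the predicted token id is the vocabulary index with the highest score
predicted_token_id = max(range(len(mask_scores)), key=lambda i: mask_scores[i])
print(len(toy_logits), len(toy_logits[0]), len(mask_scores), predicted_token_id)
```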

In [98]:
# retrieve the index of <mask>: there might be more than one, so mask_token_index is a 1-D tensor of positions
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
# this is taking the most likely answer, we can let it return the top 10 answers
predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
tokenizer.decode(predicted_token_id)

'fuck'

In [99]:
# above is taking the most likely answer, we can let it return the top 10 answers too
masked_token_logits = logits[0, mask_token_index, :]
# we could also calculate the probability of each (apply softmax on the logit - get probability of each word)
probabilities = torch.softmax(masked_token_logits, dim=-1)[0]

k = 10
# torch.topk returns top k values, so .indices returns top k values' indices
top_k_tokens = torch.topk(masked_token_logits, k, dim=1).indices[0].tolist()
for token_idx in top_k_tokens:
    predicted_word = tokenizer.decode(token_idx)
    word_prob = probabilities[token_idx]
    print(f"{predicted_word:<15} {word_prob:<15.6}")

fuck            0.0117678      
no              0.0100174      
I               0.00918356     
a               0.00591128     
mother          0.00542913     
we              0.00518939     
it              0.00500547     
love            0.00497648     
you             0.00490687     
home            0.00414038     
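
None of the ten predictions above has nine letters, so one possible follow-up is to filter candidates by length before ranking them. The sketch below does this in plain Python for a list of (word, probability) pairs. `parse_enumeration` and `filter_by_length` are hypothetical helpers; three candidates are copied from the output above, and the `firedamp` entry with its probability is made up for illustration.

```python
import re

def parse_enumeration(enumeration):
    """Turn an enumeration string such as '(8)' into a list of word lengths."""
    return [int(n) for n in re.findall(r"\d+", enumeration)]

def filter_by_length(candidates, enumeration):
    """Keep single-word candidates whose length matches the enumeration."""
    lengths = parse_enumeration(enumeration)
    if len(lengths) != 1:
        return []  # multi-word answers are out of scope for this sketch
    return [(w, p) for w, p in candidates if len(w.strip()) == lengths[0]]

candidates = [
    ("mother", 0.00542913),   # copied from the output above
    ("love", 0.00497648),     # copied from the output above
    ("home", 0.00414038),     # copied from the output above
    ("firedamp", 0.001),      # made-up entry for illustration
]

# '(8)' is the enumeration format used in the dataset example shown earlier
print(filter_by_length(candidates, "(8)"))
print(filter_by_length(candidates, "(4)"))
```

Filtering after the fact only helps if a word of the right length appears among the candidates, so in practice `k` would probably need to be much larger than 10.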
