# Idefics2 finetuning of FrozenLake descriptions

We load the model with 4-bit quantization (NF4, float16 compute).

In [1]:
from datasets import load_dataset
from evaluate import load
from transformers import Idefics2ForConditionalGeneration, BitsAndBytesConfig, AutoProcessor
import torch
from tqdm import tqdm
import json
import colorama

In [2]:
vanilla_idefics2_path = "HuggingFaceM4/idefics2-8b"
finetuned_idfics2_path = "dawoz/idefics2-frozenlake"

In [3]:
def load_model(model_path):

    processor = AutoProcessor.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        do_image_splitting=False
    )
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16
    )
    model = Idefics2ForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        quantization_config=bnb_config
    )
        
    return model, processor

In [4]:
def compute_predictions(model, processor, *, dataset, batch_size=4):
    model.eval()
    
    true_answers = []
    predicted_answers = []
    start_indexes = []

    for i in tqdm(range(0, len(dataset), batch_size)):
        examples = dataset[i: i + batch_size]
        true_answers.extend(examples["answer"])
        images = [[im] for im in examples["image"]]
        
        texts = []
        for instruction in examples["instruction"]:
            messages = [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": instruction},
                        {"type": "image"},
                    ]
                }
            ]
            text = processor.apply_chat_template(messages, add_generation_prompt=True)
            texts.append(text.strip())
            start_indexes.append(0)
            
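        # Move the batch to the model's device so generate() does not mix CPU and GPU tensors.
        inputs = processor(text=texts, images=images, return_tensors="pt", padding=True).to(model.device)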
        generated_ids = model.generate(**inputs, max_new_tokens=64)
        generated_texts = processor.batch_decode(generated_ids[:, inputs["input_ids"].size(1):], skip_special_tokens=True)
        predicted_answers.extend(generated_texts)
    
    return {
        "true_answers": true_answers,
        "predicted_answers": predicted_answers,
        "start_indexes": start_indexes
    }    

### Average Normalized Levenshtein Similarity (useful??)

During training, we tracked the loss on the evaluation split. It is also interesting to measure performance with the "true metric" used for DocVQA.

That metric is the *Average Normalized Levenshtein Similarity* (ANLS), proposed by [Biten+ ICCV'19](https://arxiv.org/abs/1905.13648). It captures OCR mistakes smoothly: a response that is intended correctly but badly recognized receives only a slight penalty. A threshold of 0.5 on the normalized Levenshtein distance decides the score of each answer: below the threshold, the score is 1 minus the normalized distance; otherwise it is 0. The threshold separates answers that were correctly selected but imperfectly recognized from answers where the wrong text was selected and returned.
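For reference, the definition can be written out in the notation that the code uses, where $a_{ij}$ is the $j$-th acceptable answer for question $i$, $o_{q_i}$ is the model's prediction, $NL$ is the Levenshtein distance divided by the length of the longer string, and $\tau = 0.5$:

$$
\mathrm{ANLS} = \frac{1}{N} \sum_{i=1}^{N} \max_{j} \, s(a_{ij}, o_{q_i}),
\qquad
s(a_{ij}, o_{q_i}) =
\begin{cases}
1 - NL(a_{ij}, o_{q_i}) & \text{if } NL(a_{ij}, o_{q_i}) < \tau \\
0 & \text{otherwise}
\end{cases}
$$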

We first define a few utilities to compute the ANLS.

In [5]:
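import Levenshtein  # third-party package providing Levenshtein.distance

def normalized_levenshtein(s1, s2):
    # Edit distance divided by the longer length, so the result lies in [0, 1].
    len_s1, len_s2 = len(s1), len(s2)
    if max(len_s1, len_s2) == 0:
        return 0.0  # two empty strings are identical
    distance = Levenshtein.distance(s1, s2)
    return distance / max(len_s1, len_s2)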

def similarity_score(a_ij, o_q_i, tau=0.5):
    nl = normalized_levenshtein(a_ij, o_q_i)
    return 1 - nl if nl < tau else 0

def average_normalized_levenshtein_similarity(ground_truth, predicted_answers):
    assert len(ground_truth) == len(predicted_answers), "Length of ground_truth and predicted_answers must match."

    N = len(ground_truth)
    total_score = 0

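    for i in range(N):
        # a_i must be a list of acceptable answers; a bare string would be iterated character by character.
        a_i = ground_truth[i]
        o_q_i = predicted_answers[i]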
        if o_q_i == "":
            print("Warning: Skipped an empty prediction.")
            max_score = 0
        else:
            max_score = max(similarity_score(a_ij, o_q_i) for a_ij in a_i)

        total_score += max_score

    return total_score / N
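To see how the threshold behaves on answers shaped like the FrozenLake ones, here is a minimal self-contained sketch. It is not the cell above: `edit_distance` and `anls` are hypothetical helper names, and the edit distance is a plain dynamic-programming version so the example runs without the `Levenshtein` package.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via dynamic programming (illustrative helper)."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def anls(ground_truth, predictions, tau=0.5):
    """ANLS where ground_truth[i] is a list of acceptable answers (assumed layout)."""
    total = 0.0
    for answers, prediction in zip(ground_truth, predictions):
        best = 0.0
        for answer in answers:
            longest = max(len(answer), len(prediction))
            nl = edit_distance(answer, prediction) / longest if longest else 0.0
            # Keep the similarity only when the normalized distance is below tau.
            best = max(best, 1 - nl if nl < tau else 0.0)
        total += best
    return total / len(ground_truth)


gold = [["The picture shows a hole"], ["The picture shows an ice cell"]]
pred = ["The picture shows a hole", "The picture shows a hole"]
print(anls(gold, pred))
```

The second prediction names the wrong tile, yet it still earns partial credit because both sentences share a long template. With templated answers like these, ANLS can therefore look optimistic compared with exact match.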

### Squad metric (exact match + F1 score)

Exact match: the percentage of predictions that are identical to the ground truth after normalization.

F1 score (token level):
- precision: {num predicted tokens in ground truth} / {num predicted tokens}
- recall: {num predicted tokens in ground truth} / {num ground truth tokens}
- F1 = 2 * (prec * rec) / (prec + rec)

https://huggingface.co/learn/nlp-course/chapter7/7?fw=pt#post-processing
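The token-level F1 described above can be sketched in a few lines. This is a simplified illustration, not the `squad` metric itself: `token_f1` is a hypothetical name, and it only lowercases and splits on whitespace, whereas the official metric also strips punctuation and the articles "a", "an" and "the" before comparing.

```python
from collections import Counter


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between one prediction and one reference (simplified)."""
    pred_tokens = prediction.lower().split()
    true_tokens = ground_truth.lower().split()
    # Multiset intersection: each shared token counts at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("The picture shows a hole", "The picture shows a cracked hole")
print(f"{score:.3f}")
```

Here all five predicted tokens occur in the reference (precision 1) and five of the six reference tokens are recovered (recall 5/6), so F1 is 10/11. This shows how a prediction with the wrong tile type can fail exact match while still scoring a high F1.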

In [6]:
def eval_frozen_knowledge(model, processor, *, dataset_path='dawoz/frozenlake_prompts_dataset', eval_batch_size=4):
    dataset = load_dataset(dataset_path, split='test')

    output = compute_predictions(model, processor, dataset=dataset, batch_size=eval_batch_size)
    true_answers = output["true_answers"]
    predicted_answers = output["predicted_answers"]
    start_indexes = output["start_indexes"]
        
    squad = load('squad')

    predictions = [{"id": str(i), "prediction_text": e} for i, e in enumerate(predicted_answers)]
    references = [{"id": str(i), "answers": {'text': [e], "answer_start": [s]}}
                  for i, (e, s) in enumerate(zip(true_answers, start_indexes))
                  ]

    res = squad.compute(predictions=predictions, references=references)
    
    # save predictions and references
    res['predictions'] = predictions
    res['references'] = references
    
    # ANLS (?)

    return res

## Start evaluation

In [7]:
model, processor = load_model(vanilla_idefics2_path)

output_vanilla = eval_frozen_knowledge(model, processor)

with open('eval_output/output_vanilla.json', 'w') as f:
    json.dump(output_vanilla, f, indent=4)
    
del model, processor
torch.cuda.empty_cache()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


100%|██████████| 25/25 [02:04<00:00,  4.97s/it]


In [8]:
model, processor = load_model(finetuned_idfics2_path)

output_finetuned = eval_frozen_knowledge(model, processor)

with open('eval_output/output_finetuned.json', 'w') as f:
    json.dump(output_finetuned, f, indent=4)
    
del model, processor
torch.cuda.empty_cache()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


100%|██████████| 25/25 [01:59<00:00,  4.77s/it]


Observe results

In [9]:
with open('eval_output/output_vanilla.json', 'r') as f:
    output_vanilla = json.load(f)
    
with open('eval_output/output_finetuned.json', 'r') as f:
    output_finetuned = json.load(f)

In [10]:
print('Idefics2 before training:')
print(f'Exact match: {output_vanilla["exact_match"]:5.2f}%')
print(f'         F1: {output_vanilla["f1"]:5.2f}%')

print('\nIdefics2 after training:')
print(f'Exact match: {output_finetuned["exact_match"]:5.2f}%')
print(f'         F1: {output_finetuned["f1"]:5.2f}%')

Idefics2 before training:
Exact match:  0.00%
         F1: 17.86%

Idefics2 after training:
Exact match: 18.00%
         F1: 73.17%


Observe single predictions

In [11]:
preds_vanilla = [p['prediction_text'] for p in output_vanilla['predictions']]
preds_finetuned = [p['prediction_text'] for p in output_finetuned['predictions']]
trues = [r['answers']['text'][0] for r in output_vanilla['references']]

for i, (pv, pf, t) in enumerate(zip(preds_vanilla, preds_finetuned, trues)):
    pv = pv.replace('\n', '\\n')[:200]
    pf = pf.replace('\n', '\\n')[:200]
    t = t.replace('\n', '\\n')[:200]    
    
    print(colorama.Fore.YELLOW + f'      Gold:   {t}' + colorama.Style.RESET_ALL)
    print(f'   Vanilla:   {pv}')
    print(colorama.Fore.GREEN + f'Fine-tuned:   {pf}' + colorama.Style.RESET_ALL)
    print()

      Gold:   The picture shows an ice cell
   Vanilla:   The tile in the image is a blue and white tile with a repeating pattern.
Fine-tuned:   The picture shows a hole

      Gold:   The picture shows a hole
   Vanilla:   The tile in the image is a blue tile with a white outline.
Fine-tuned:   The picture shows a hole

      Gold:   The picture shows a cracked hole
   Vanilla:   The tile in the image is a blue tile with a black shape in the middle.
Fine-tuned:   The picture shows a hole

      Gold:   The picture shows the player (facing north)
   Vanilla:   The tile shows a small elf standing on a snowy surface.
Fine-tuned:   The picture shows the player (facing south)

      Gold:   The picture shows the player (facing south)
   Vanilla:   The tile shows a small elf with a green hat and red and green outfit.
Fine-tuned:   The picture shows the player (facing south)

      Gold:   The pict