# Model Finetuning 👩🏽‍🔧

In this series of two exercises, we will fine-tune models from our previous usecases in order to improve predictions.

You may want to run this exercise on [Google colab](https://colab.research.google.com/) as finetuning LLM models requires a level of memory and computing power that might exceed your personal machine's capacity.

First we are going to nee to install the needed dependencies.

In [None]:
#pip install datasets transformers evaluate rouge_score -q

## 🚀 Sentiment Analysis on Financial Tweets Fine-Tuning 💸🐦

Using the **Twitter Financial News** dataset! We'll fine-tune a model in order to more precisely predict the financial tweets sentiment ⚡️



1. Start by importing the `"zeroshot/twitter-financial-news-sentiment"` dataset

In [1]:
from datasets import load_dataset
import tqdm as notebook_tqdm

import pandas as pd

twitter_train = load_dataset("zeroshot/twitter-financial-news-sentiment")
twitter_train

README.md:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

sent_train.csv:   0%|          | 0.00/859k [00:00<?, ?B/s]

sent_valid.csv:   0%|          | 0.00/217k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9543 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2388 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 9543
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2388
    })
})

In [2]:
twitter_train["train"][0]

{'text': '$BYND - JPMorgan reels in expectations on Beyond Meat https://t.co/bd0xbFGjkT',
 'label': 0}

2. Use the `"bert-base-uncased"` tokenizer in order to prepare the dataset for model fine-tuning.
At the end, print out the sequence length of a batch of data to make sure they're all of the same length.

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, padding=True)

tokenized_datasets = twitter_train.map(tokenize_function, batched=True)
tokenized_datasets

Map:   0%|          | 0/9543 [00:00<?, ? examples/s]

Map:   0%|          | 0/2388 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9543
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2388
    })
})

In [4]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [5]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[82, 82, 82, 82, 82, 82, 82, 82]

3. Now let's prepare our model for finetuning. Get inspiration from the lecture on finetuning for the code. You'll need to use the `Trainer` class for this. [Find documentation here](https://huggingface.co/docs/transformers/en/main_classes/trainer)

In [6]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer",report_to="none")

In [7]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
)

In [9]:
trainer.train()

Step,Training Loss


KeyboardInterrupt: 

4. Generate predictions from the model.

In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

In [None]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

5. Print out the classification report, what do you think of the results ?

In [None]:
from sklearn.metrics import classification_report
true_labels = tokenized_datasets["validation"]["label"]
print(classification_report(true_labels, preds))

6. Display the confusion matrix, what conclusions can you draw? Did the fine-tuning deliver the expected results?

In [None]:
from sklearn.metrics import confusion_matrix
import plotly.express as px
confusion = pd.DataFrame(confusion_matrix(true_labels, preds),
             index=["NEGATIVE","POSITIVE","NEUTRAL"],
             columns=["NEGATIVE","POSITIVE","NEUTRAL"])
px.imshow(confusion,
          text_auto=True)


## 📚 Summarizing Scientific Papers with Fine-Tuning 🧠✨

In this part, we'll follow up on the **long-form document summarization** challenge, using research papers from **ArXiv**. 🧪🔬 Your mission: Fine-tune a model that creates powerful, concise summaries of lengthy academic papers! 📄➡️✂️

1. Load the `"ccdv/arxiv-summarization"` dataset.

In [1]:
from datasets import load_dataset

dataset = load_dataset("ccdv/arxiv-summarization",split="train")
dataset_val = load_dataset("ccdv/arxiv-summarization",split="validation")

# Preview the data
dataset[0]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


{'article': 'additive models @xcite provide an important family of models for semiparametric regression or classification . some reasons for the success of additive models are their increased flexibility when compared to linear or generalized linear models and their increased interpretability when compared to fully nonparametric models . \n it is well - known that good estimators in additive models are in general less prone to the curse of high dimensionality than good estimators in fully nonparametric models . \n many examples of such estimators belong to the large class of regularized kernel based methods over a reproducing kernel hilbert space @xmath0 , see e.g. @xcite . in the last years \n many interesting results on learning rates of regularized kernel based models for additive models have been published when the focus is on sparsity and when the classical least squares loss function is used , see e.g. @xcite , @xcite , @xcite , @xcite , @xcite , @xcite and the references therein

2. Display the first few characters from the article, and abstract data.

In [2]:
print(dataset[0]['article'][:1000])  # First 1000 characters
print("\nAbstract:\n", dataset[0]['abstract'])

additive models @xcite provide an important family of models for semiparametric regression or classification . some reasons for the success of additive models are their increased flexibility when compared to linear or generalized linear models and their increased interpretability when compared to fully nonparametric models . 
 it is well - known that good estimators in additive models are in general less prone to the curse of high dimensionality than good estimators in fully nonparametric models . 
 many examples of such estimators belong to the large class of regularized kernel based methods over a reproducing kernel hilbert space @xmath0 , see e.g. @xcite . in the last years 
 many interesting results on learning rates of regularized kernel based models for additive models have been published when the focus is on sparsity and when the classical least squares loss function is used , see e.g. @xcite , @xcite , @xcite , @xcite , @xcite , @xcite and the references therein . of course , t

3. Extract a subset of 1000 observations form the train set, and 200 observations from the validation set.

In [3]:
dataset_small = dataset.select(range(1000))
dataset_val_small = dataset_val.select(range(200))

4. Use the `"google-t5/t5-small"` tokenizer to preprocess the data. Make sure to truncate and pad the inputs so they all share the same length.

In [4]:
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

prefix = "summarize: "


def tokenize_function(examples):
    inputs = [prefix + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    labels = tokenizer(text_target=examples["abstract"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
tokenized_datasets = dataset_small.map(tokenize_function, batched=True)
tokenized_datasets_val = dataset_val_small.map(tokenize_function, batched=True)

tokenized_datasets

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Dataset({
    features: ['article', 'abstract', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1000
})

In [5]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

In [6]:
samples = tokenized_datasets[:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[512, 512, 512, 512, 512, 512, 512, 512]

5. Load the pretrained model and finetune it. You can draw inspiration from [this demo](https://huggingface.co/docs/transformers/en/tasks/summarization) for the code.

In [9]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [25]:
import numpy as np

import evaluate

metric = evaluate.load("rouge")  # or another metric

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) if isinstance(v, float) else v for k, v in result.items()}

In [26]:
training_args = Seq2SeqTrainingArguments(
    output_dir="summarization_model",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    report_to="none"
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets_val,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,No log,3.183356,0.1354,0.0387,0.1111,0.111,20.0
2,No log,3.159598,0.1353,0.0375,0.1101,0.1101,20.0
3,No log,3.146155,0.1384,0.0402,0.1129,0.1129,20.0
4,No log,3.14279,0.1375,0.0395,0.1122,0.1121,20.0


TrainOutput(global_step=252, training_loss=3.3635418604290677, metrics={'train_runtime': 216.6853, 'train_samples_per_second': 18.46, 'train_steps_per_second': 1.163, 'total_flos': 541367205888000.0, 'train_loss': 3.3635418604290677, 'epoch': 4.0})

6. Compare a reference abstract from the validation set a model prediction, what do you think?

In [50]:
text = dataset_val[0]["article"]
text

"the interest in anchoring phenomena and phenomena in confined nematic liquid crystals has largely been driven by their potential use in liquid crystal display devices . \n the twisted nematic liquid crystal cell serves as an example . \n it consists of a nematic liquid crystal confined between two parallel walls , both providing homogeneous planar anchoring but with mutually perpendicular easy directions . in this case \n the orientation of the nematic director is tuned by the application of an external electric or magnetic field . \n a precise control of the surface alignment extending over large areas is decisive for the functioning of such devices . \n most studies have focused on nematic liquid crystals in contact with laterally uniform substrates . on the other hand substrate inhomogeneities \n arise rather naturally as a result of surface treatments such as rubbing . \n thus the nematic texture near the surface is in fact non - uniform . \n this non - uniformity , however , is s

In [51]:
inputs = tokenizer(text, return_tensors="pt").input_ids

Token indices sequence length is longer than the specified maximum sequence length for this model (11919 > 512). Running this sequence through the model will result in indexing errors


In [None]:
outputs = model.to("cpu").generate(inputs, max_new_tokens=100, do_sample=False)

In [47]:
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
summary

"the inflation reduction Act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in American history. it will ask the ultra-wealthy and corporations to pay their fair share."

In [None]:
# Compare generated summary vs reference
reference = dataset_val[0]["abstract"]
reference