# Fine-Tuning Models
> Fine-tuning using your own data

In this notebook, we'll use the HuggingFace guide on [fine-tuning with custom datasets](https://huggingface.co/transformers/custom_datasets.html) as a reference for our work. We'll take the HuggingFace dataset we've already created and use it directly!

### Install required packages
Note that this is mostly required if you're on Google Colab.

In [None]:
#! pip install transformers
#! pip install datasets

### Import packages of interest

In [None]:
import numpy as np
import pandas as pd

from datasets import load_dataset, load_metric, Dataset
from transformers import pipeline
from transformers import DataCollatorWithPadding
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

from huggingface_hub import notebook_login

# 0. Log into HuggingFace CLI
Why are we doing this? Below, we'll use our own user accounts to grab datasets and upload models. If we don't log in, we'll have to pass the auth token explicitly each time. That isn't bad, but let's streamline our efforts!

In [None]:
#!git config --global credential.helper store

In [None]:
notebook_login()


# 1. Load data from HuggingFace Hub or from disk

In [None]:
ds_path = 'charreaubell/demo_data'
demo_ds = load_dataset(ds_path, use_auth_token=True)

Using custom data configuration charreaubell--demo_data-e1481c242d53578c
Reusing dataset parquet (/Users/bellcs1/.cache/huggingface/datasets/parquet/charreaubell--demo_data-e1481c242d53578c/0.0.0/1638526fd0e8d960534e2155dc54fdff8dce73851f21f031d2fb9c2cf757c121)



# 2. Pre-process inputs
What's a tokenizer and what does it do? Let's learn more from HuggingFace's [course section on tokenizers](https://huggingface.co/course/chapter2/4?fw=pt). Then, let's try it on our own!

In [None]:
#instantiate tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
tokenizer.name_or_path

'distilbert-base-cased'

In [None]:
#define tokenizing function
def tokenize_inputs(example):
    return tokenizer(example['text'], truncation = True)

In [None]:
#do the tokenizing using map function
tokenized_ds = demo_ds.map(tokenize_inputs, batched=True,
                           remove_columns = ['age', 'article_id', 'college_major',
                                             'first_name', 'last_name', 'years_of_journalism',
                                             'text'])

Loading cached processed dataset at /Users/bellcs1/.cache/huggingface/datasets/parquet/charreaubell--demo_data-e1481c242d53578c/0.0.0/1638526fd0e8d960534e2155dc54fdff8dce73851f21f031d2fb9c2cf757c121/cache-26562d7589b246b0.arrow



Loading cached processed dataset at /Users/bellcs1/.cache/huggingface/datasets/parquet/charreaubell--demo_data-e1481c242d53578c/0.0.0/1638526fd0e8d960534e2155dc54fdff8dce73851f21f031d2fb9c2cf757c121/cache-799da250e96ca01b.arrow


In [None]:
tokenized_ds

DatasetDict({
    valid: Dataset({
        features: ['attention_mask', 'input_ids', 'label'],
        num_rows: 4
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label'],
        num_rows: 4
    })
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label'],
        num_rows: 12
    })
})

## An aside on tokenizer functionality
Tokenizers do more than encode text: they can also convert input IDs back into tokens and strings, and they expose their vocabulary. Let's check out these outputs further.

In [None]:
#check out input IDs
print(tokenized_ds['train']['input_ids'][0])

#compare against the text
print(demo_ds['train']['text'][0])

[101, 107, 16409, 18220, 1106, 1143, 1254, 1725, 146, 5380, 1204, 28015, 136, 107, 1119, 1455, 119, 107, 1398, 1103, 1639, 1202, 1105, 8582, 1518, 3370, 14897, 1111, 1833, 1177, 119, 146, 1431, 1301, 1164, 1217, 2816, 3196, 1106, 28015, 1468, 1272, 146, 1221, 1115, 146, 1274, 1204, 136, 1337, 1116, 1184, 1240, 1162, 3344, 1143, 136, 102]
"Explain to me again why I shouldnt cheat?" he asked. "All the others do and nobody ever gets punished for doing so. I should go about being happy losing to cheaters because I know that I dont? Thats what youre telling me?


In [None]:
#check out the length of the list of lists
print(len(tokenized_ds['train']['input_ids']))

#check out the length of a single element
print(len(tokenized_ds['train']['input_ids'][0]))

12
58


In [None]:
#convert input_ids to token representation
input0_tokens = tokenizer.convert_ids_to_tokens(tokenized_ds['train']['input_ids'][0])
print(input0_tokens)

['[CLS]', '"', 'Ex', '##plain', 'to', 'me', 'again', 'why', 'I', 'shouldn', '##t', 'cheat', '?', '"', 'he', 'asked', '.', '"', 'All', 'the', 'others', 'do', 'and', 'nobody', 'ever', 'gets', 'punished', 'for', 'doing', 'so', '.', 'I', 'should', 'go', 'about', 'being', 'happy', 'losing', 'to', 'cheat', '##ers', 'because', 'I', 'know', 'that', 'I', 'don', '##t', '?', 'That', '##s', 'what', 'your', '##e', 'telling', 'me', '?', '[SEP]']


In [None]:
#see what this looks like as a string
print(tokenizer.convert_tokens_to_string(input0_tokens))

#another method directly from the input ids
tokenizer.decode(tokenized_ds['train']['input_ids'][0])

[CLS] " Explain to me again why I shouldnt cheat? " he asked. " All the others do and nobody ever gets punished for doing so. I should go about being happy losing to cheaters because I know that I dont? Thats what youre telling me? [SEP]


'[CLS] " Explain to me again why I shouldnt cheat? " he asked. " All the others do and nobody ever gets punished for doing so. I should go about being happy losing to cheaters because I know that I dont? Thats what youre telling me? [SEP]'
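To see why the `##` pieces disappear in the decoded string, here is a minimal sketch of WordPiece-style merging in plain Python. It is a simplified illustration rather than the tokenizer's actual implementation: the helper name `merge_wordpieces` is made up for this example, and it skips the punctuation clean-up that the real decoder also applies.

```python
def merge_wordpieces(tokens):
    """Glue '##'-prefixed continuation pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # A continuation piece: append it (minus the '##') to the last word
            words[-1] += token[2:]
        else:
            # A piece that starts a new word
            words.append(token)
    return " ".join(words)


# The first few tokens from the example above (special tokens left out)
pieces = ['Ex', '##plain', 'to', 'me', 'again', 'why', 'I', 'shouldn', '##t', 'cheat']
merged = merge_wordpieces(pieces)
print(merged)  # Explain to me again why I shouldnt cheat
```

Pieces without the `##` prefix start a new word, which is why `shouldn` and `##t` come back together as `shouldnt`.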

In [None]:
#other information about tokenizer
print(tokenizer.vocab_size)

#see actual tokenizer vocab (we've abbreviated here)
#tokenizer.vocab
pd.DataFrame({'tokens': tokenizer.vocab.keys(), 'inds': tokenizer.vocab.values()}).set_index('inds').head(10)

28996


inds,tokens
28063,##egro
18645,Counsel
7819,Borough
25663,Marcia
13867,forewings
14891,embraced
9993,fists
28785,##』
5338,emerged
768,ἀ


## An aside on dynamically padded batch size
HF can dynamically pad your batches so that each input is padded only to the length of the longest input in its batch, rather than to the longest input in the whole dataset. This helps with memory. You can learn more [here](https://huggingface.co/course/chapter3/2?fw=pt). For now, we'll simply instantiate a data collator and use it during training to demonstrate how we can do this.

In [None]:
#Instantiate data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
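The idea behind dynamic padding can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not what `DataCollatorWithPadding` literally runs: `pad_batch` is a hypothetical helper, the padding ID is assumed to be 0, and the token IDs are toy values (only 101 and 102 are taken from the `[CLS]`/`[SEP]` IDs seen earlier).

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in one batch to that batch's longest sequence."""
    max_len = max(len(ids) for ids in batch)
    return {
        # Real tokens first, then padding IDs up to max_len
        "input_ids": [ids + [pad_id] * (max_len - len(ids)) for ids in batch],
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch],
    }


# Two batches with different longest sequences (toy IDs, assumed for illustration)
short_batch = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
long_batch = pad_batch([[101, 102], [101] + [7] * 20 + [102]])

print(len(short_batch["input_ids"][0]))  # 5
print(len(long_batch["input_ids"][0]))   # 22
```

Each batch is only as wide as it needs to be, so a batch of short texts does not pay for the longest text in the dataset.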

# 3. Train model

In [None]:
#recall that the dataset carries information about the classes
demo_ds['train'].features['label']

ClassLabel(num_classes=5, names=['engineering', 'humanities', 'prelaw', 'premed', 'science'], names_file=None, id=None)

In [None]:
#get the number of classes and label conversions
no_classes = demo_ds['train'].features['label'].num_classes
id2label = {ind:label for ind, label in enumerate(demo_ds['train'].features['label'].names)}
label2id = {label:ind for ind, label in id2label.items()}

In [None]:
#check it out
print(id2label)
label2id

{0: 'engineering', 1: 'humanities', 2: 'prelaw', 3: 'premed', 4: 'science'}


{'engineering': 0, 'humanities': 1, 'prelaw': 2, 'premed': 3, 'science': 4}

## Define model and task architecture

In [None]:
# Choose the model type and instantiate it for the task
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased",
                                                           num_labels=no_classes,
                                                           id2label=id2label,
                                                           label2id=label2id)
model.name_or_path

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weigh

'distilbert-base-cased'

## Define settings for basic model training and train

In [None]:
#set training arguments
training_args = TrainingArguments("test-trainer",
                                 logging_strategy='epoch')

#setup training loop with arguments
trainer = Trainer(model=model,
                  args=training_args,
                  tokenizer=tokenizer,
                  data_collator=data_collator,
                  train_dataset=tokenized_ds['train'],
                  eval_dataset=tokenized_ds['valid'])

#train
trainer.train()

***** Running training *****
  Num examples = 12
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6


Step,Training Loss
2,1.6039
4,1.5358
6,1.5008




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=6, training_loss=1.546820878982544, metrics={'train_runtime': 12.0564, 'train_samples_per_second': 2.986, 'train_steps_per_second': 0.498, 'total_flos': 928356357240.0, 'train_loss': 1.546820878982544, 'epoch': 3.0})
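The `Total optimization steps = 6` line in the log can be reproduced with a little arithmetic. The sketch below assumes a single device, gradient accumulation of 1 (as the log reports), and that the last partial batch of each epoch is kept; `total_optimization_steps` is a helper name made up for this example.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer updates, assuming one device and no gradient accumulation."""
    # The last batch of an epoch may be smaller, but it still counts as a step
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


steps = total_optimization_steps(num_examples=12, batch_size=8, num_epochs=3)
print(steps)  # 6
```

With 12 examples and a batch size of 8, each epoch takes two steps (one full batch and one batch of 4), so three epochs give six steps.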

### Reflect and Discuss
* Practically speaking, how is the model performing?
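One rough yardstick for this question, assuming the reported loss is the usual cross-entropy: a classifier that assigns equal probability to each of the 5 classes has a loss of ln(5). The snippet below only computes that baseline so you can compare it with the training loss in the log; it is a back-of-the-envelope check, not a formal evaluation.

```python
import math

num_classes = 5  # from the ClassLabel feature shown earlier

# Cross-entropy of a uniform guess: -ln(1 / num_classes) = ln(num_classes)
chance_loss = math.log(num_classes)
print(round(chance_loss, 4))  # 1.6094
```

A training loss that sits near this value suggests the model is not yet doing much better than guessing.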

## Training with performance metrics and saving checkpoints of the model

In [None]:
#load a metric
metric = load_metric("accuracy")

#define the metric behavior
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
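To make the two steps inside `compute_metrics` concrete, here is a plain-Python sketch of the same idea: take the index of the largest logit per row, then count how often it matches the label. The helper name `accuracy_from_logits` and the toy logits are assumptions for illustration; the notebook itself relies on NumPy and the loaded metric.

```python
def accuracy_from_logits(logits, labels):
    """Argmax each row of logits, then compare against the reference labels."""
    # Index of the largest value in each row is the predicted class
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy values: 4 examples, 3 classes
toy_logits = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
toy_labels = [1, 0, 1, 2]

toy_result = accuracy_from_logits(toy_logits, toy_labels)
print(toy_result)  # {'accuracy': 0.5}
```

The predicted classes here are 1, 0, 2, 0, so two of the four match the labels.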


In [None]:
#set new training arguments
training_args = TrainingArguments("test-trainer",
                                  logging_strategy = "epoch",
                                  evaluation_strategy="epoch",
                                  save_strategy="epoch",
                                  report_to='all')

#setup training loop
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['valid'],
    compute_metrics=compute_metrics
)

#train model
trainer.train()

PyTorch: setting up devices
***** Running training *****
  Num examples = 12
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6


Epoch,Training Loss,Validation Loss,Accuracy
1,1.3961,1.622399,0.25
2,1.3409,1.647581,0.0
3,1.1837,1.661458,0.0


***** Running Evaluation *****
  Num examples = 4
  Batch size = 8
***** Running Evaluation *****
  Num examples = 4
  Batch size = 8
***** Running Evaluation *****
  Num examples = 4
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=6, training_loss=1.3068776925404866, metrics={'train_runtime': 13.3811, 'train_samples_per_second': 2.69, 'train_steps_per_second': 0.448, 'total_flos': 928356357240.0, 'train_loss': 1.3068776925404866, 'epoch': 3.0})

### Reflect and Discuss
* What new observations are present during model training?
* What comments can you make on the performance of the model now?
* What metrics are appropriate for your application?
* Consider that model training happens in place (the model weights are updated in memory rather than returned as a new object), and that both of our `Trainer`s trained the same `model`. After the basic training run and the training run with metrics and checkpoints above, for how many epochs has the model been trained?

## A brief aside on resuming training from checkpoints

In [None]:
#update the number of epochs (or steps) that you want to train for
trainer.args.num_train_epochs = 6

In [None]:
#train some more, resuming from checkpoint
trainer.train(resume_from_checkpoint=True)
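With `resume_from_checkpoint=True`, training picks up from the last checkpoint saved in the output directory. The sketch below illustrates what "last checkpoint" means for folders named `checkpoint-<step>`. It is a simplified stand-in written for this notebook, not the library's own code, and `latest_checkpoint` is a hypothetical helper.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            # Compare steps as integers so that 10 sorts after 6
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    return os.path.join(output_dir, max(candidates)[1])


# Demo in a throwaway directory with empty checkpoint folders
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (2, 4, 6, 10):
        os.mkdir(os.path.join(demo_dir, f"checkpoint-{step}"))
    found = os.path.basename(latest_checkpoint(demo_dir))

print(found)  # checkpoint-10
```

Parsing the step as a number matters: sorting the folder names as strings would put `checkpoint-10` before `checkpoint-2`.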

## A brief aside on performance metrics
You may want to use performance metrics other than accuracy. Here are some [metrics available through HuggingFace](https://huggingface.co/metrics). If you check out the metrics folder in the [HuggingFace datasets](https://github.com/huggingface/datasets) repository, you'll be able to see what's necessary if you need to define another metric. Let's try a different metric!

In [None]:
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    
    #get predictions by using index of max logit
    predictions = np.argmax(logits, axis=-1)
    
    #calculate macro-averaged precision, recall, and F-score
    perfs = precision_recall_fscore_support(labels, predictions, average='macro', zero_division=0)
    perf_dict = dict(zip(['precision', 'recall', 'fscore'], perfs[:3]))
    
    #return dictionary
    return perf_dict
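For intuition about what `average='macro'` with `zero_division=0` computes, here is a small hand-rolled version in plain Python. It is a sketch for illustration only: `macro_precision_recall_f1` is a made-up helper, the labels and predictions are toy values, and it assumes the classes considered are those appearing in either the labels or the predictions.

```python
def macro_precision_recall_f1(labels, predictions):
    """Per-class precision/recall/F1, averaged with equal weight per class."""
    classes = sorted(set(labels) | set(predictions))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        # Mirror zero_division=0: an undefined ratio counts as 0
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {"precision": sum(precisions) / n, "recall": sum(recalls) / n, "fscore": sum(f1s) / n}


# Toy case: four examples from four classes, and a model that always predicts class 1
toy_scores = macro_precision_recall_f1(labels=[0, 1, 2, 3], predictions=[1, 1, 1, 1])
print(toy_scores)
```

Only class 1 is ever predicted, so it alone contributes (precision 0.25, recall 1.0, F1 0.4). Averaged over four classes, that comes to a precision of 0.0625, a recall of 0.25, and an F-score of 0.1, which shows how a model that collapses onto one class scores under macro averaging.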

In [None]:
#setup training loop
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['valid'],
    compute_metrics=compute_metrics
)

trainer.train()

***** Running training *****
  Num examples = 12
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6


Epoch,Training Loss,Validation Loss,Precision,Recall,Fscore
1,1.2231,1.548289,0.166667,0.333333,0.222222
2,1.1247,1.543248,0.166667,0.25,0.2
3,1.0441,1.558989,0.166667,0.25,0.2


***** Running Evaluation *****
  Num examples = 4
  Batch size = 8
Saving model checkpoint to test-trainer/checkpoint-2
Configuration saved in test-trainer/checkpoint-2/config.json
Model weights saved in test-trainer/checkpoint-2/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-2/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-2/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 4
  Batch size = 8
Saving model checkpoint to test-trainer/checkpoint-4
Configuration saved in test-trainer/checkpoint-4/config.json
Model weights saved in test-trainer/checkpoint-4/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-4/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-4/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 4
  Batch size = 8
Saving model checkpoint to test-trainer/checkpoint-6
Configuration saved in test-trainer/checkpoint-6/config.json
Model w

TrainOutput(global_step=6, training_loss=1.1306577920913696, metrics={'train_runtime': 15.7668, 'train_samples_per_second': 2.283, 'train_steps_per_second': 0.381, 'total_flos': 928356357240.0, 'train_loss': 1.1306577920913696, 'epoch': 3.0})

## A brief aside on model training - TRY IT YOURSELF!
One of several points of ambiguity when training models is how long they should train for. One way to approach this is to monitor the models and run them repeatedly, starting from the last checkpoint. Another way is to train for a number of epochs (if your model trains quickly enough) and then always load the best model according to some metric at the end. Let's take a look at this.

We can realize this through `TrainingArguments`! In your breakout rooms, add the parameters which will enable the following:
1. Load the best model at the end
2. Set the metric for using the best model to one of our evaluation metrics
3. Examine the `greater_is_better` parameter. Do you need to change it?
4. Change the number of training epochs to something larger.
5. Decrease the training batch size.
6. Decrease the eval batch size.
7. How can you change the logging, evaluation, and save strategies from `epoch` to `steps`? What else might you need to change depending on the step interval at which you want these activities to occur?

Make sure this works, so run the cell!

In [None]:
#set new training arguments
training_args = TrainingArguments("test-trainer",
                                  overwrite_output_dir=True,
                                  logging_strategy = "epoch",
                                  evaluation_strategy="epoch",
                                  save_strategy='epoch',
                                  load_best_model_at_end = True,
                                  metric_for_best_model='fscore',
                                  greater_is_better=True,
                                  per_device_train_batch_size = 4,
                                  per_device_eval_batch_size = 4,
                                  num_train_epochs=5,
                                  report_to='all')

#setup training loop
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['valid'],
    compute_metrics=compute_metrics
)

#train model
trainer.train()

PyTorch: setting up devices
***** Running training *****
  Num examples = 12
  Num Epochs = 5
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 15


Epoch,Training Loss,Validation Loss,Precision,Recall,Fscore
1,0.9841,1.532724,0.166667,0.25,0.2
2,0.7854,1.607552,0.125,0.125,0.125
3,0.5758,1.576705,0.125,0.125,0.125
4,0.4921,1.553939,0.125,0.125,0.125
5,0.4644,1.553442,0.125,0.125,0.125


***** Running Evaluation *****
  Num examples = 4
  Batch size = 4
Saving model checkpoint to test-trainer/checkpoint-3
Configuration saved in test-trainer/checkpoint-3/config.json
Model weights saved in test-trainer/checkpoint-3/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-3/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-3/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 4
  Batch size = 4
Saving model checkpoint to test-trainer/checkpoint-6
Configuration saved in test-trainer/checkpoint-6/config.json
Model weights saved in test-trainer/checkpoint-6/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-6/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-6/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 4
  Batch size = 4
Saving model checkpoint to test-trainer/checkpoint-9
Configuration saved in test-trainer/checkpoint-9/config.json
Model w

TrainOutput(global_step=15, training_loss=0.6603476365407308, metrics={'train_runtime': 27.8783, 'train_samples_per_second': 2.152, 'train_steps_per_second': 0.538, 'total_flos': 1458254300280.0, 'train_loss': 0.6603476365407308, 'epoch': 5.0})
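The effect of `load_best_model_at_end`, `metric_for_best_model`, and `greater_is_better` can be sketched with the numbers from the table above. This is a simplified illustration of the selection logic, not the `Trainer`'s internals: `pick_best` is a hypothetical helper, and the way ties are broken here (first best wins) is an assumption.

```python
# (epoch, fscore) pairs copied from the evaluation table above
eval_history = [(1, 0.2), (2, 0.125), (3, 0.125), (4, 0.125), (5, 0.125)]

# 12 training examples with a batch size of 4 gives 3 steps per epoch
steps_per_epoch = 3


def pick_best(history, greater_is_better=True):
    """Return the (epoch, score) row with the best score."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda row: row[1])


best_epoch, best_score = pick_best(eval_history)
best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"
print(best_epoch, best_score)  # 1 0.2
print(best_checkpoint)         # checkpoint-3
```

For a metric where lower is better, such as a loss, the same helper would be called with `greater_is_better=False`.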

# 4. Using trained model with `Trainer`
## Evaluate

In [None]:
eval_ds = trainer.evaluate(tokenized_ds['train'])
eval_ds

***** Running Evaluation *****
  Num examples = 12
  Batch size = 8


{'eval_loss': 1.5361217260360718,
 'eval_precision': 0.06666666666666667,
 'eval_recall': 0.2,
 'eval_fscore': 0.1,
 'eval_support': None,
 'eval_runtime': 1.0496,
 'eval_samples_per_second': 11.433,
 'eval_steps_per_second': 1.905,
 'epoch': 3.0}

## Predict

In [None]:
preds = trainer.predict(tokenized_ds['train'])
preds

***** Running Prediction *****
  Num examples = 12
  Batch size = 8


PredictionOutput(predictions=array([[-6.0932573e-02,  2.5887752e-01,  2.0878818e-02, -1.0038625e-02,
         4.1899920e-02],
       [-7.0849620e-02,  2.6222461e-01,  3.7979402e-02, -7.9516068e-02,
         7.8266852e-02],
       [-6.0193807e-02,  2.6336738e-01,  3.2106131e-02, -7.4920192e-02,
         3.0398801e-02],
       [-6.9419034e-02,  2.2632037e-01,  5.2801825e-02, -6.1321128e-02,
         4.6445504e-02],
       [-5.5189542e-02,  2.4058674e-01,  4.0477935e-02, -3.9512221e-02,
         4.5452416e-03],
       [-5.4554872e-02,  2.2160059e-01,  1.0810974e-01, -8.2729846e-02,
         6.7920238e-04],
       [-8.0790840e-02,  2.3849653e-01,  7.1875341e-03, -7.5528726e-02,
         1.2777004e-01],
       [-6.0694143e-02,  2.8064272e-01,  2.6836798e-02, -5.2206773e-02,
         3.5131097e-02],
       [-5.4284588e-02,  2.8006762e-01,  1.9655362e-02, -3.9448284e-02,
         3.0258402e-02],
       [-5.6016058e-02,  2.2985230e-01,  1.1007115e-04, -7.0286907e-02,
         1.0518595e-01],
 

# 5. Sharing and saving your model
## Using `Trainer`
When training with the `Trainer` class, you can also upload your model directly to the HuggingFace Hub as it trains. Read more about this process in the [HF course documentation](https://huggingface.co/course/chapter4/3?fw=pt).

Let's check out how to do this. It's as simple as modifying our `TrainingArguments`! Make sure you have already logged in with your authorization token, or use the `use_auth_token` parameter to access your HF account. You'll need to have git-lfs installed to use this feature, so if you're on Google Colab, you can execute the line below. You can also `conda install -c conda-forge git-lfs` if you're using a conda environment.

In [None]:
#!apt-get install git-lfs

In [None]:
#set new training arguments
training_args = TrainingArguments("test-trainer",
                                  overwrite_output_dir=True,
                                  logging_strategy = "epoch",
                                  evaluation_strategy="epoch",
                                  save_strategy='epoch',
                                  push_to_hub=True,
                                  hub_model_id='charreaubell/distilbert-magazine-classifier',
                                  report_to='all')

#setup training loop
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['valid'],
    compute_metrics=compute_metrics
)

#train model
trainer.train()

#### Reflect
Visit your repository and take a look to make sure your model uploaded. Answer the following questions:
* Where did your model save locally (directory)?
* What are the contents of the saved model?
* Investigate your uploaded model.

In [None]:
#it's recommended to push the final version to HF after training completes.
trainer.push_to_hub(commit_message='end of training 3 epochs')

#### Reflect
Visit your repository once more (you'll likely need to refresh) and check out the changes.
* What is different from the uploads during training?
* What do you observe about the model cards?

## Fine-grained save/push access
You can also push the model and/or tokenizer directly using the `push_to_hub` methods in their classes. You can learn more about this [in the HuggingFace course](https://huggingface.co/course/chapter4/3?fw=pt). An example of using the trainer to save your entire model locally is shown below.

In [None]:
trainer.save_model('test-trainer')

Saving model checkpoint to test-trainer
Configuration saved in test-trainer/config.json
Model weights saved in test-trainer/pytorch_model.bin
tokenizer config file saved in test-trainer/tokenizer_config.json
Special tokens file saved in test-trainer/special_tokens_map.json


# 6. Using your fine-tuned model

In [None]:
#create pipeline from your classifier
mag_classifier = pipeline('text-classification', model='test-trainer')

#optionally, load from HF
#mag_classifier = pipeline('text-classification', model='charreaubell/distilbert-magazine-classifier', use_auth_token=True)

#get output
mag_class = mag_classifier('The cat is prettier than any cat I have ever seen.')

In [None]:
mag_class

[{'label': 'humanities', 'score': 0.2947634756565094}]
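The `score` in the pipeline output is a probability rather than a raw logit. A plausible reading, assuming the pipeline applies a softmax over the five class logits and reports the top class, is sketched below. The logits are rounded from the first row of the `predict` output earlier in the notebook (a training example, not the cat sentence), so the resulting score will not match the one above.

```python
import math

# Label mapping as defined earlier in the notebook
id2label = {0: 'engineering', 1: 'humanities', 2: 'prelaw', 3: 'premed', 4: 'science'}


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Rounded logits for one example (illustrative values)
example_logits = [-0.0609, 0.2589, 0.0209, -0.0100, 0.0419]

probs = softmax(example_logits)
top = max(range(len(probs)), key=probs.__getitem__)
result = {"label": id2label[top], "score": probs[top]}
print(result)
```

With five classes, a score close to 0.2 means the model is barely preferring its top label over the others.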

## Reflect and discuss: Breakout Rooms
You've successfully trained a model - great job!! Now, let's focus on what YOU need to do for your task. Browse the [Transformer Notebooks](https://huggingface.co/docs/transformers/notebooks), open one with its `Open in Colab` badge, and explore what your task looks like. Note that even if your modality is different, you may still be able to use these notebooks directly with a few changes!