# Fine-Tuning Models
> Fine-tuning using your own data

In this notebook, we'll use the HuggingFace guide at https://huggingface.co/transformers/custom_datasets.html as a reference for our work. We'll take the HuggingFace dataset we've already created and use it directly!

### Install required packages
Note that this is mostly required if you're on Google Colab.

In [None]:
#! pip install transformers
#! pip install datasets

### Import packages of interest

In [None]:
import numpy as np
import pandas as pd

from datasets import load_dataset, load_metric, Dataset
from transformers import pipeline
from transformers import DataCollatorWithPadding
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

from huggingface_hub import notebook_login

# 0. Log into HuggingFace CLI
Why are we doing this? Below, we'll use our own user accounts to grab datasets and upload models. If we don't log in, we'll have to pass the auth token explicitly each time. This isn't bad, but let's streamline our efforts!

In [None]:
!git config credential.helper store

In [None]:
notebook_login()

# 1. Load data from HuggingFace Hub or from disk

In [None]:
ds_path = 'charreaubell/demo_data'
demo_ds = load_dataset(ds_path, use_auth_token=True)

Using custom data configuration charreaubell___demo_data-cdb143897cf94e86
Reusing dataset parquet (/Users/bellcs1/.cache/huggingface/datasets/parquet/charreaubell___demo_data-cdb143897cf94e86/0.0.0/1638526fd0e8d960534e2155dc54fdff8dce73851f21f031d2fb9c2cf757c121)


# 2. Pre-process inputs
What's a tokenizer and what does it do? Let's learn more using HuggingFace's [course chapter on tokenizers](https://huggingface.co/course/chapter2/4?fw=pt). Then, let's try it on our own!

In [None]:
#instantiate tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
tokenizer.name_or_path

'distilbert-base-cased'

In [None]:
#define tokenizing function
def tokenize_inputs(example):
    return tokenizer(example['text'], truncation = True)

In [None]:
#do the tokenizing using map function
tokenized_ds = demo_ds.map(tokenize_inputs, batched=True,
                           remove_columns = ['age', 'article_id', 'college major',
                                             'first_name', 'last_name', 'years_of_journalism',
                                             'text'])

In [None]:
tokenized_ds

DatasetDict({
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label'],
        num_rows: 4
    })
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label'],
        num_rows: 16
    })
})

What's this `truncation` argument and this `batched` argument? Let's take a look.
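
As a rough illustration, the sketch below mimics both arguments with a toy whitespace tokenizer. It is not the real DistilBERT tokenizer: `toy_tokenize` and `MAX_LENGTH` are hypothetical names, and the limit of 5 tokens is chosen only to keep the example small (the model used here accepts up to 512 positions). `truncation=True` cuts over-long inputs down to the limit, and `batched=True` means the function receives a dict of lists (many rows at once) instead of a single row.

```python
# Toy stand-in for a tokenizer: splits on whitespace. NOT the real DistilBERT tokenizer.
MAX_LENGTH = 5  # hypothetical limit, kept tiny for illustration

def toy_tokenize(texts, truncation=False):
    """Tokenize a list of strings, optionally cutting each one to MAX_LENGTH tokens."""
    tokenized = [text.split() for text in texts]
    if truncation:
        tokenized = [tokens[:MAX_LENGTH] for tokens in tokenized]
    return {"tokens": tokenized}

# With batched=True, `map` passes a dict of lists, so one call handles many rows.
batch = {"text": ["a short one", "this sentence has more than five tokens in it"]}

result = toy_tokenize(batch["text"], truncation=True)
lengths = [len(tokens) for tokens in result["tokens"]]
print(lengths)  # [3, 5]
```

The short input is left alone, while the long one is cut to the limit. Processing rows in batches mostly matters for speed, since the tokenizer is called once per batch rather than once per row.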

## An aside on dynamically padded batch size
HF can dynamically pad your batches so that each input is only as long as the longest input in its batch, rather than the longest input in the whole dataset. This helps with memory. You can learn more [here](https://huggingface.co/course/chapter3/2?fw=pt). For now, we'll simply instantiate a data collator and use it during training to demonstrate how we can do this.

In [None]:
#Instantiate data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
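
To make the idea concrete, here is a minimal pure-Python sketch of dynamic padding. It is not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and a padding id of 0 is assumed.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one in that batch."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids, for illustration only
batch_a = [[101, 7, 8, 102], [101, 9, 102]]
batch_b = [[101, 5, 102], [101, 102]]

padded_a = pad_batch(batch_a)
padded_b = pad_batch(batch_b)

print(padded_a["input_ids"])  # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(padded_b["input_ids"])  # [[101, 5, 102], [101, 102, 0]]
```

Each batch is padded only to its own longest sequence (4 tokens for the first batch, 3 for the second), which is where the memory savings come from.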

# 3. Train model

In [None]:
#get the number of classes
no_classes = len(set(demo_ds['train']['label']))

## Define model and task architecture

In [None]:
# Choose the model type and instantiate it for the task
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=no_classes)
model.name_or_path

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.b

'distilbert-base-cased'

## Define settings for basic model training and train

In [None]:
#set training arguments
training_args = TrainingArguments("test_trainer",
                                 logging_strategy='epoch')

#setup training loop with arguments
trainer = Trainer(model=model,
                  args=training_args,
                  tokenizer=tokenizer,
                  data_collator=data_collator,
                  train_dataset=tokenized_ds['train'],
                  eval_dataset=tokenized_ds['test'])

#train
trainer.train()

***** Running training *****
  Num examples = 16
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6


Step,Training Loss
2,1.6302
4,1.5681
6,1.5161




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=6, training_loss=1.5714763800303142, metrics={'train_runtime': 16.9209, 'train_samples_per_second': 2.837, 'train_steps_per_second': 0.355, 'total_flos': 1374422789760.0, 'train_loss': 1.5714763800303142, 'epoch': 3.0})

### Reflect and Discuss
* How many epochs of training did this undergo? Why do you think it stopped at this number of epochs?
* What if you wanted to train the model more? How do you think you could change the number of epochs?
* Practically speaking, how is the model performing?
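
As a starting point for the discussion, the numbers in the training log can be reproduced with a little arithmetic. This sketch assumes a single device and no gradient accumulation, as reported in the log above. The epoch count comes from the `num_train_epochs` setting of `TrainingArguments`, which was left at its default here.

```python
import math

num_examples = 16  # "Num examples" in the log
batch_size = 8     # "Instantaneous batch size per device"
num_epochs = 3     # "Num Epochs"

# One optimization step consumes one batch, so an epoch takes ceil(16 / 8) steps
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 2 6
```

This matches the "Total optimization steps = 6" line and explains why the loss is logged at steps 2, 4, and 6 when logging once per epoch.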

## Training with performance metrics

In [None]:
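#load a metric (newer `datasets` releases deprecate `load_metric` in favor of `evaluate.load` from the `evaluate` library)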
metric = load_metric("accuracy")

#define the metric behavior
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
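
To see what `compute_metrics` does with the `(logits, labels)` pair, here is a small pure-Python sketch of the same two steps: take the index of the largest logit in each row, then compare against the labels. The logits and labels are made-up values, and `argmax` is a hypothetical helper standing in for `np.argmax(..., axis=-1)`.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

# Made-up logits for 4 examples and 3 classes, plus their true labels
logits = [[0.2, 1.5, -0.3],
          [2.0, 0.1, 0.4],
          [0.0, 0.3, 0.9],
          [1.1, 0.2, 0.5]]
labels = [1, 0, 1, 0]

predictions = [argmax(row) for row in logits]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions, accuracy)  # [1, 0, 2, 0] 0.75
```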

In [None]:
#set new training arguments
training_args = TrainingArguments("test-trainer",
                                  logging_strategy = "epoch",
                                  evaluation_strategy="epoch")

#setup training loop
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['test'],
    compute_metrics=compute_metrics
)

#train model
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 16
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6


Epoch,Training Loss,Validation Loss,Accuracy
1,1.4447,1.569747,0.25
2,1.404,1.565926,0.25
3,1.2864,1.582657,0.25


***** Running Evaluation *****
  Num examples = 4
  Batch size = 8
***** Running Evaluation *****
  Num examples = 4
  Batch size = 8
***** Running Evaluation *****
  Num examples = 4
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=6, training_loss=1.378400206565857, metrics={'train_runtime': 18.0177, 'train_samples_per_second': 2.664, 'train_steps_per_second': 0.333, 'total_flos': 1374422789760.0, 'train_loss': 1.378400206565857, 'epoch': 3.0})

### Reflect and Discuss
* What new observations are present during model training?
* What comments can you make on the performance of the model now?
* What metrics are appropriate for your application?
* Consider that model training is done in-memory (the model weights are updated in memory, but not returned), and both of our `Trainer`s trained our model `model`. After the basic training run and the training run with performance metrics, how many epochs has the model been trained?
* What are some advantages and disadvantages of the in-memory training?

## A brief aside on performance metrics
You may want to use other performance metrics than accuracy. Here are some [metrics available through Huggingface](https://huggingface.co/metrics). If you check out the metrics folder on the [Huggingface datasets](https://github.com/huggingface/datasets) repository, you'll be able to see what's necessary if you need to define another metric. Let's try a different metric!

In [None]:
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    
    #get predictions by using index of max logit
    predictions = np.argmax(logits, axis=-1)
    
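    #calculate macro-averaged precision, recall, and F-score
    #(support is None when an average is requested, so that column is empty in the logs)
    perfs = precision_recall_fscore_support(labels, predictions, average='macro', zero_division=0)
    perf_dict = dict(zip(['precision', 'recall', 'fscore', 'support'], perfs))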
    
    #return dictionary
    return perf_dict

#setup training loop
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['test'],
    compute_metrics=compute_metrics
)

trainer.train()

***** Running training *****
  Num examples = 16
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6


Epoch,Training Loss,Validation Loss,Precision,Recall,Fscore,Support
1,1.1019,1.570975,0.083333,0.25,0.125,
2,1.0715,1.561957,0.083333,0.333333,0.133333,
3,0.9284,1.606322,0.083333,0.333333,0.133333,


***** Running Evaluation *****
  Num examples = 4
  Batch size = 8
***** Running Evaluation *****
  Num examples = 4
  Batch size = 8
***** Running Evaluation *****
  Num examples = 4
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=6, training_loss=1.0339383681615193, metrics={'train_runtime': 17.9536, 'train_samples_per_second': 2.674, 'train_steps_per_second': 0.334, 'total_flos': 1374422789760.0, 'train_loss': 1.0339383681615193, 'epoch': 3.0})
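
To build intuition for the macro-averaged numbers, the sketch below computes them by hand: per-class precision, recall, and F-score are calculated first and then averaged with equal weight per class, with undefined ratios set to 0 (mirroring `zero_division=0`). `macro_prf` is a hypothetical helper and the labels and predictions are made up. They describe a model that predicts class 0 for every example, which is one way (not necessarily what happened in this run) to end up with values like those in the table.

```python
def macro_prf(labels, predictions):
    """Macro-averaged precision, recall, and F-score; undefined ratios count as 0."""
    classes = sorted(set(labels) | set(predictions))
    precisions, recalls, fscores = [], [], []
    for c in classes:
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        fscore = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        fscores.append(fscore)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(fscores) / n

# Made-up example: three classes in the labels, one class predicted everywhere
labels = [0, 1, 2, 2]
predictions = [0, 0, 0, 0]

precision, recall, fscore = macro_prf(labels, predictions)
print(round(precision, 6), round(recall, 6), round(fscore, 6))  # 0.083333 0.333333 0.133333
```

Because every class counts equally, classes the model never predicts pull the macro average down sharply, even if the one predicted class has perfect recall.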

## A brief aside on model training
One of several points of ambiguity when training models is how long they should train for. One way to approach this is to monitor the model and run training repeatedly, starting from the last checkpoint. Another way is to train for a number of epochs (if your model trains quickly enough) and then always load the best model according to some metric at the end. Let's take a look at this.

We can realize this through `TrainingArguments`!

In [None]:
#set new training arguments
training_args = TrainingArguments("test-trainer",
                                  overwrite_output_dir=True,
                                  logging_strategy = "epoch",
                                  evaluation_strategy="epoch",
                                  save_strategy='epoch',
                                  load_best_model_at_end = True,
                                  metric_for_best_model='fscore',
                                  greater_is_better=True)

#setup training loop
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['test'],
    compute_metrics=compute_metrics
)

#train model
trainer.train()
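
Conceptually, loading the best model at the end amounts to saving a checkpoint after every evaluation and then picking the one with the best score. The sketch below shows only that selection step. The history values and checkpoint names are hypothetical, and `pick_best` is an illustrative helper, not part of the `Trainer` API.

```python
def pick_best(history, metric, greater_is_better=True):
    """Return the history entry with the best value of `metric`."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda entry: entry[metric])

# Hypothetical per-epoch evaluation results, one checkpoint saved per epoch
eval_history = [
    {"epoch": 1, "eval_fscore": 0.40, "eval_loss": 1.57, "checkpoint": "checkpoint-2"},
    {"epoch": 2, "eval_fscore": 0.55, "eval_loss": 1.52, "checkpoint": "checkpoint-4"},
    {"epoch": 3, "eval_fscore": 0.50, "eval_loss": 1.60, "checkpoint": "checkpoint-6"},
]

best_by_fscore = pick_best(eval_history, "eval_fscore", greater_is_better=True)
best_by_loss = pick_best(eval_history, "eval_loss", greater_is_better=False)

print(best_by_fscore["checkpoint"], best_by_loss["checkpoint"])  # checkpoint-4 checkpoint-4
```

This is why `greater_is_better` matters: for a score such as F-score the largest value wins, while for a loss the smallest value wins.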

# 4. Using trained model with `Trainer`
## Evaluate

In [None]:
trainer.evaluate(tokenized_ds['train'])

***** Running Evaluation *****
  Num examples = 16
  Batch size = 8


{'eval_loss': 0.8418077826499939,
 'eval_precision': 0.71,
 'eval_recall': 0.8,
 'eval_fscore': 0.7492063492063492,
 'eval_support': None,
 'eval_runtime': 1.4741,
 'eval_samples_per_second': 10.854,
 'eval_steps_per_second': 1.357,
 'epoch': 3.0}

## Predict

In [None]:
trainer.predict(tokenized_ds['train'])

***** Running Prediction *****
  Num examples = 16
  Batch size = 8


PredictionOutput(predictions=array([[ 0.19887984, -0.5085467 ,  0.15752687,  0.80598766, -0.7313549 ],
       [ 0.18377188,  0.80033433, -0.20015214, -0.40164483, -0.65034115],
       [ 1.2173357 , -0.28954273,  0.02110876, -0.48217937, -0.7644893 ],
       [ 0.17263918, -0.61649   ,  1.2267463 , -0.08063987, -0.6052612 ],
       [ 0.1341124 , -0.4403972 ,  1.0039518 , -0.21165301, -0.5031114 ],
       [ 1.2421515 , -0.53989077, -0.20045453, -0.24576914, -0.6787317 ],
       [ 0.06871343, -0.5793819 ,  1.1940726 , -0.08527828, -0.78305995],
       [ 1.2404475 , -0.359777  , -0.31060284, -0.36036062, -0.8554643 ],
       [ 0.18704903,  0.7343005 , -0.21609876, -0.277732  , -0.47223613],
       [ 0.15008287, -0.14654635,  0.1811359 , -0.15184018, -0.19645688],
       [ 1.180792  , -0.57255185,  0.11821648, -0.40183204, -0.9713236 ],
       [ 1.258645  , -0.56001073, -0.06982525, -0.33898708, -0.8627219 ],
       [ 0.10763383,  0.89667153, -0.3150844 , -0.40278837, -0.6743192 ],
       [-
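
The `predictions` array holds raw logits, one row per example and one column per class. A minimal sketch of turning a row into class probabilities and a predicted class, using the first row of the output above and a hand-written `softmax` helper (an illustrative function, not a library call):

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# First row of the predictions array shown above
first_row = [0.19887984, -0.5085467, 0.15752687, 0.80598766, -0.7313549]

probs = softmax(first_row)
predicted_class = max(range(len(probs)), key=lambda i: probs[i])

print(predicted_class)  # 3
```

The predicted class is simply the position of the largest logit. Softmax does not change which class wins, it only rescales the scores so they can be read as probabilities.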

# 5. Sharing and saving your model
## Using `Trainer`
While training with the `Trainer` class, you can also upload your model directly to HuggingFace Hub as it trains. Read more about this process on the [HF course documentation](https://huggingface.co/course/chapter4/3?fw=pt).

Let's check out how to do this. It's as simple as modifying our `TrainingArguments`! Don't forget to log in with your authorization token first, or use the `use_auth_token` parameter to access your HF account. You'll need to have git-lfs installed to use this feature, so if you're on Google Colab, you can execute the line below. You can also `conda install -c conda-forge git-lfs` if you're using a conda environment.

In [None]:
#!apt-get install git-lfs

In [None]:
#set new training arguments
training_args = TrainingArguments("test-trainer",
                                  overwrite_output_dir=True,
                                  logging_strategy = "epoch",
                                  evaluation_strategy="epoch",
                                  save_strategy='epoch',
                                  push_to_hub=True,
                                  hub_model_id='charreaubell/distilbert-magazine-classifier')

#setup training loop
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['test'],
    compute_metrics=compute_metrics
)

#train model
trainer.train()

#### Reflect
Visit your repository and take a look to make sure your model uploaded. Answer the following questions:
* Where did your model save locally (directory)?
* What are the contents of the saved model?
* Investigate your uploaded model.

In [None]:
#it's recommended to push the final version to HF after training completes.
trainer.push_to_hub(commit_message='end of training 3 epochs')

#### Reflect
Visit your repository once more (you'll likely need to refresh) and check out the changes.
* What is different from the uploads during training?
* What do you observe about the model cards?

## Fine-grained save/push access
You can also push the model and/or tokenizer directly using the `push_to_hub` methods in their classes. You can learn more about this [in the Huggingface docs](https://huggingface.co/course/chapter4/3?fw=pt). An example of using `Trainer` to save the model to a local directory is shown below.

In [None]:
trainer.save_model('demo-distilbert')

# 6. Using your fine-tuned model

In [None]:
#create pipeline from your classifier
mag_classifier = pipeline('text-classification', model='test-trainer', use_auth_token=True)

#get output
mag_class = mag_classifier('The cat is prettier than any cat I have ever seen.')
mag_class

loading configuration file test-trainer/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-cased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.11.3",
  "vocab_size": 28996
}

loading configuration file test-trainer/config.json

[{'label': 'LABEL_0', 'score': 0.2668711543083191}]