# Fine-Tuning Models
> Fine-tuning using your own data

In this notebook, we'll use the following reference as a guide for our work: https://huggingface.co/transformers/custom_datasets.html. We'll take the HuggingFace dataset we've already created and use it directly!

### Install required packages
Note that this is mostly required if you're on Google Colab.

In [None]:
#! pip install transformers
#! pip install datasets

### Import packages of interest

In [None]:
import numpy as np
import pandas as pd

from datasets import load_dataset, load_metric, Dataset
from transformers import pipeline
from transformers import DataCollatorWithPadding
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

from huggingface_hub import notebook_login

# 0. Log into HuggingFace CLI
Why are we doing this? Below, we'll use our own user accounts to grab datasets and upload models. If we don't log in, we'll have to pass our auth token explicitly each time. That isn't bad, but let's streamline our efforts!

In [None]:
#!git config --global credential.helper store

In [None]:
notebook_login()

# 1. Load data from HuggingFace Hub or from disk

In [None]:
#ds_path = 
#demo_ds = 

# 2. Pre-process inputs
What's a tokenizer and what does it do? Let's learn more using Huggingface's [instruction on tokenizers](https://huggingface.co/course/chapter2/4?fw=pt). Then, let's try it on our own!
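
To make the idea concrete before working with a real tokenizer, here is a minimal sketch of the round trip from text to tokens to input IDs and back. It is a toy whitespace tokenizer with a made-up vocabulary (`toy_vocab`, `encode`, and `decode` are hypothetical names), not the subword tokenizer that HuggingFace models use. The point is only that a tokenizer turns text into a list of integer IDs, one per token, and can map those IDs back.

```python
# Toy illustration only: a whitespace tokenizer with a hand-built vocabulary.
# Real HuggingFace tokenizers split text into subword pieces instead.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2, "model": 3, "reads": 4, "text": 5}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def encode(text):
    """Lowercase, split on whitespace, and map each token to its ID."""
    tokens = text.lower().split()
    # words missing from the vocabulary fall back to the [UNK] ID
    return [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]


def decode(input_ids):
    """Map IDs back to tokens and join them into a string."""
    return " ".join(id_to_token[i] for i in input_ids)


input_ids = encode("The model reads magazines")
print(input_ids)          # [2, 3, 4, 1]
print(decode(input_ids))  # the model reads [UNK]
```

Note that decoding does not recover the original text exactly: the casing is gone and the unknown word has been replaced, which is one reason real tokenizers use subword vocabularies.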

In [None]:
#instantiate tokenizer


In [None]:
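#define tokenizing function
#assumes `tokenizer` was instantiated in the cell above and the raw text lives in the 'text' column
def tokenize_inputs(example):
    return tokenizer(example['text'], truncation=True)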

In [None]:
#do the tokenizing using map function
tokenized_ds = demo_ds.map(tokenize_inputs, batched=True,
                           remove_columns = ['age', 'article_id', 'college_major',
                                             'first_name', 'last_name', 'years_of_journalism',
                                             'text'])

## An aside on tokenizer functionality
We can do many things with tokenizers to help us to tokenize our data and process it. Let's check out these outputs further.

In [None]:
#check out input IDs


#compare against the text


In [None]:
#check out the length of the list of lists


#check out the length of a single element


In [None]:
#convert input_ids to token representation


In [None]:
#see what this looks like as a string


#another method directly from the input ids


In [None]:
#other information about tokenizer


#see actual tokenizer vocab (we've abbreviated here)


## An aside on dynamically padded batch size
HF can dynamically pad your batches so that each input is padded only to the length of the longest input in its batch, rather than to the longest input in the whole dataset. This helps with memory. You can learn more [here](https://huggingface.co/course/chapter3/2?fw=pt). For now, we'll simply instantiate a data collator and use it during training to demonstrate how this works.

In [None]:
#Instantiate data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
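
To see why dynamic padding saves memory, the sketch below pads two batches by hand in plain Python. The helper `pad_batch` and the example ID lists are hypothetical and only mimic the idea; `DataCollatorWithPadding` does the real work during training.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # attention mask: 1 for real tokens, 0 for padding
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, mask


# Made-up input IDs: one batch of short inputs, one containing a long input
short_batch = [[101, 7, 102], [101, 7, 8, 9, 102]]
long_batch = [[101, 102], [101] + [7] * 10 + [102]]

padded_short, mask_short = pad_batch(short_batch)
padded_long, _ = pad_batch(long_batch)

print(len(padded_short[0]))  # 5
print(len(padded_long[0]))   # 12
print(mask_short[0])         # [1, 1, 1, 0, 0]
```

Each batch gets its own width, so the short batch is never padded out to the length of the longest input in the dataset.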

# 3. Train model

In [None]:
#recall that the dataset carries information about the classes


In [None]:
#get the number of classes


In [None]:
#get label conversions
id2label = {ind:label for ind, label in enumerate(demo_ds['train'].features['label'].names)}
label2id = {label:ind for ind, label in id2label.items()}
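
As a quick illustration of what these two dictionaries contain, here is the same construction run on a stand-in list of class names. The names in `label_names` are made up; in the notebook they come from the dataset's `label` feature.

```python
# Hypothetical class names standing in for
# demo_ds['train'].features['label'].names
label_names = ["fashion", "sports", "tech"]

id2label = {ind: label for ind, label in enumerate(label_names)}
label2id = {label: ind for ind, label in id2label.items()}

print(id2label)       # {0: 'fashion', 1: 'sports', 2: 'tech'}
print(label2id)       # {'fashion': 0, 'sports': 1, 'tech': 2}
print(len(id2label))  # 3
```

Mappings like these are commonly passed to `AutoModelForSequenceClassification.from_pretrained` together with `num_labels`, so that the saved model reports class names instead of bare indices.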

In [None]:
#check it out


## Define model and task architecture

In [None]:
# Choose the model type and instantiate it for the task


## Define settings for basic model training and train

In [None]:
#set training arguments


#setup training loop with arguments


#train


### Reflect and Discuss
* Practically speaking, how is the model performing?

## Training with performance metrics and saving checkpoints of the model

In [None]:
#load a metric


#define the metric behavior
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
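
It can help to see what `compute_metrics` receives. The `Trainer` hands it an object that unpacks into the logits, shaped (number of examples, number of classes), and the integer labels. The sketch below uses made-up numbers and computes accuracy by hand with NumPy in place of a loaded metric object; `compute_accuracy`, `fake_logits`, and `fake_labels` are hypothetical names.

```python
import numpy as np


def compute_accuracy(eval_pred):
    """Same structure as compute_metrics, with accuracy computed by hand."""
    logits, labels = eval_pred
    # the predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Made-up logits for 4 examples and 3 classes
fake_logits = np.array([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.3, 0.2, 0.9],
                        [1.1, 0.4, 0.2]])
fake_labels = np.array([0, 1, 2, 1])

result = compute_accuracy((fake_logits, fake_labels))
print(result)  # {'accuracy': 0.75}
```

The function returns a dictionary because the `Trainer` logs each key as a separate evaluation metric.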

In [None]:
#set new training arguments
training_args = TrainingArguments("test-trainer",
                                  logging_strategy = "epoch"
                                  #fill in other arguments here
                                 )

#setup training loop
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['valid'],
    #fill in other arguments here
)

#train model
trainer.train()

### Reflect and Discuss
* What new observations are present during model training?
* What comments can you make on the performance of the model now?
* What metrics are appropriate for your application?
* Consider that model training happens in place (the model's weights are updated in memory rather than returned as a new object), and both of our `Trainer`s trained the same `model`. After the basic training run and the training run with metrics above, how many epochs has the model been trained for?

## A brief aside on resuming training from checkpoints

In [None]:
#update the number of epochs (or steps) that you want to train for


In [None]:
#train some more, resuming from checkpoint
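
The `Trainer` writes each checkpoint into a folder named `checkpoint-<step>` inside the output directory. As a rough sketch of how to locate the most recent one, the hypothetical helper `latest_checkpoint` below picks the folder with the highest step number, demonstrated on a throwaway directory rather than on real checkpoints.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            # compare steps as integers so that 1500 sorts after 500
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    return os.path.join(output_dir, max(candidates)[1])


# Demonstrate on a temporary directory with fake checkpoint folders
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 1500):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    print(os.path.basename(latest_checkpoint(tmp)))  # checkpoint-1500
```

A path like this can be passed as `trainer.train(resume_from_checkpoint=path)`, and passing `resume_from_checkpoint=True` asks the `Trainer` to find the last checkpoint in its output directory itself. Because the number of epochs in `TrainingArguments` is a total, it needs to be raised before resuming, otherwise there is nothing left to train.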


## A brief aside on performance metrics
You may want to use performance metrics other than accuracy. Here are some [metrics available through Huggingface](https://huggingface.co/metrics). If you check out the metrics folder on the [Huggingface datasets](https://github.com/huggingface/datasets) repository, you'll be able to see what's necessary if you need to define another metric. Let's try a different metric!

In [None]:
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    
    #get predictions by using index of max logit
    predictions = np.argmax(logits, axis=-1)
    
    #calculate classification report
    perfs = precision_recall_fscore_support(labels, predictions, average='macro', zero_division=0)
    perf_dict = dict(zip(['precision', 'recall', 'fscore'], perfs[:3]))
    
    #return dictionary
    return perf_dict
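
To unpack what `average='macro'` means, the sketch below computes per-class precision and recall with NumPy and takes their unweighted mean, returning 0 for a class where the ratio is undefined (the role `zero_division=0` plays above). The helper `macro_precision_recall` and the label arrays are made up for illustration; it is a simplified stand-in, not a replacement for the scikit-learn function.

```python
import numpy as np


def macro_precision_recall(labels, predictions, num_classes):
    """Unweighted mean of per-class precision and recall (0 when undefined)."""
    precisions, recalls = [], []
    for c in range(num_classes):
        true_pos = np.sum((predictions == c) & (labels == c))
        predicted_c = np.sum(predictions == c)  # how often class c was predicted
        actual_c = np.sum(labels == c)          # how often class c truly occurs
        precisions.append(true_pos / predicted_c if predicted_c else 0.0)
        recalls.append(true_pos / actual_c if actual_c else 0.0)
    return float(np.mean(precisions)), float(np.mean(recalls))


# Made-up, imbalanced example: four items of class 0, two of class 1
labels = np.array([0, 0, 0, 0, 1, 1])
predictions = np.array([0, 0, 0, 1, 1, 0])

precision, recall = macro_precision_recall(labels, predictions, num_classes=2)
print(precision, recall)  # 0.625 0.625
```

Every class counts equally in a macro average, so the weaker scores on the rare class pull the result down more than they would in plain accuracy.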

In [None]:
#setup training loop
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['valid'],
    compute_metrics=compute_metrics
)

trainer.train()

## A brief aside on model training - TRY IT YOURSELF!
One of several points of ambiguity when training models is how long they should train for. One way to approach this is to monitor the models and run them repeatedly, starting from the last checkpoint. Another is to train for a number of epochs (if your model trains quickly enough) and then always load the best model according to some metric at the end. Let's take a look at this.

We can realize this through `TrainingArguments`! In your breakout rooms, add the parameters which will enable the following:
1. Load the best model at the end
2. Set the metric for using the best model to one of our evaluation metrics
3. Examine the `greater_is_better` parameter. Do you need to change it?
4. Change the number of training epochs to something larger.
5. Decrease the training batch size.
6. Decrease the eval batch size.
7. How can you change the logging, evaluation, and save strategies to step? What else might you need to change depending on the interval of steps that you want these activities to occur?
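
For item 7, step-based strategies count optimizer steps rather than epochs, so it helps to know how many steps one epoch takes. The sketch below is rough arithmetic that assumes a single device and no gradient accumulation; the dataset size and batch size are made-up numbers and `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps in one pass over the data (a partial last batch counts)."""
    return math.ceil(num_examples / batch_size)


num_train_examples = 1000  # made-up dataset size
train_batch_size = 16      # made-up per-device batch size

per_epoch = steps_per_epoch(num_train_examples, train_batch_size)
eval_every = per_epoch // 2  # evaluate roughly twice per epoch

print(per_epoch)   # 63
print(eval_every)  # 31
```

Numbers like these would feed `logging_steps`, `eval_steps`, and `save_steps`. When loading the best model at the end, the save and evaluation strategies have to match, and the save interval has to be a round multiple of the evaluation interval.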

Make sure this works, so run the cell!

In [None]:
#set new training arguments
training_args = TrainingArguments("test-trainer",
                                  logging_strategy = "epoch",
                                  evaluation_strategy="epoch",
                                  save_strategy='epoch',
                                  #add other parameters here!
                                  #more parameters
                                  #use as many lines as you need!
                                  )

#setup training loop
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['valid'],
    compute_metrics=compute_metrics
)

#train model
trainer.train()

# 4. Using trained model with `Trainer`
## Evaluate

## Predict
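
`trainer.predict` returns raw logits along with the labels and metrics, so a little post-processing is needed to get class names. The sketch below shows one way to do it with NumPy; the logits and the `id2label` mapping are made up, and `softmax` is written by hand here.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical label mapping and logits for two examples
id2label = {0: "fashion", 1: "sports", 2: "tech"}
fake_logits = np.array([[0.1, 2.3, 0.4],
                        [1.9, 0.2, 0.3]])

probs = softmax(fake_logits)
pred_ids = probs.argmax(axis=-1)
pred_labels = [id2label[int(i)] for i in pred_ids]

print(pred_labels)  # ['sports', 'fashion']
```

The softmax step only matters if you want probabilities; taking the argmax of the logits directly gives the same predicted classes.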

# 5. Sharing and saving your model
## Using `Trainer`
During training and using the Trainer class, you can also upload your model directly to HuggingFace Hub as it trains. Read more about this process on the [HF course documentation](https://huggingface.co/course/chapter4/3?fw=pt).

Let's check out how to do this. It's as simple as modifying our `TrainingArguments`! Don't forget to log in with your authorization token first, or use the `use_auth_token` parameter to access your HF account. You'll need to have git-lfs installed to use this feature, so if you're on Google Colab, you can execute the line below. You can also run `conda install -c conda-forge git-lfs` if you're using a conda environment.

In [None]:
#!apt-get install git-lfs

In [None]:
#set new training arguments
training_args = TrainingArguments("test-trainer",
                                  overwrite_output_dir=True,
                                  logging_strategy = "epoch",
                                  evaluation_strategy="epoch",
                                  save_strategy='epoch',
                                  #new arguments
                                  report_to='all',
                                  push_to_hub=True)  #upload to the HF Hub as the model trains

#setup training loop
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['valid'],
    compute_metrics=compute_metrics
)

#train model
trainer.train()

#### Reflect
Visit your repository and take a look to make sure your model uploaded. Answer the following questions:
* Where did your model save locally (directory)?
* What are the contents of the saved model?
* Investigate your uploaded model.

In [None]:
#it's recommended to push the final version to HF after training completes.


#### Reflect
Visit your repository once more (you'll likely need to refresh) and check out the changes.
* What is different from the uploads during training?
* What do you observe about the model cards?

## Fine-grained save/push access
You can also push the model and/or tokenizer directly using the `push_to_hub` methods in their classes. You can learn more about this [in the Huggingface docs.](https://huggingface.co/course/chapter4/3?fw=pt) An example of using trainer to save your entire model locally is shown below.

In [None]:
trainer.save_model('test-trainer')

# 6. Using your fine-tuned model

In [None]:
#create pipeline from your classifier


#optionally, load from HF
#mag_classifier = pipeline('text-classification', model='charreaubell/distilbert-magazine-classifier', use_auth_token=True)

#get output


In [None]:
#do inference using trained model


## Reflect and discuss: Breakout Rooms
You've successfully trained a model - great job!! Now, let's focus on what YOU need to do for your task. Open the [Transformer Notebooks](https://huggingface.co/docs/transformers/notebooks), use the `Open in Colab` badge, and explore what your task looks like. Note that even if your modality is different, you may still be able to use these notebooks directly with a few changes!