<a href="https://colab.research.google.com/github/vanderbilt-data-science/repo-template-huggingface/blob/main/30-model-training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 30-model-training
> Training models

In this notebook, we fine-tune a pretrained transformer model to predict the labels of a text classification dataset. This code is strictly a template: make sure you add all appropriate markdown and change the parameters to suit your data.

In [None]:
%%capture
!pip install transformers datasets

#### Common helpful packages

In [None]:
#Data analysis and processing
import pandas as pd
import numpy as np

#machine learning
from sklearn.model_selection import train_test_split

#transformers base
from datasets import load_dataset, load_metric, Dataset
from huggingface_hub import notebook_login
from transformers import AutoTokenizer, TrainingArguments, Trainer, DataCollatorWithPadding

#transformers task
from transformers import AutoModelForSequenceClassification

#### Notebook constants
The following cell contains most of the variables that the user may choose to modify for their particular dataset or to choose a different model.

In [None]:
#datasets
dataset_id = 'tweet_eval'
dataset_config = 'emotion' #can also be None (the data type, not the string)

#model and tokenizers
model_name = "distilbert-base-cased"
tokenizer_name = "distilbert-base-cased"

#model outputs
model_directory_name = 'trained_model'

#whether to push to hub
push_to_hub = False

## Huggingface Hub login
You don't need to log in unless you plan to pull from private repos on the Hub or push to the Hub.

In [None]:
!git config --global credential.helper store
notebook_login()

# Load data
The following code assumes that you're loading data from the Huggingface Hub. However, you can also use local data (for example, files uploaded to your Colab session).

In [None]:
#Load tweet_eval dataset, emotion configuration
ds = load_dataset(dataset_id, dataset_config)
ds

In [None]:
#Make data labels for classification
num_classes = ds['train'].features['label'].num_classes
id2label = {ind:val for ind, val in enumerate(ds['train'].features['label'].names)}
label2id = {val:key for key, val in id2label.items()}
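
To make the two mappings concrete, here is a minimal sketch in plain Python. The label names are assumed for illustration only; check `ds['train'].features['label'].names` for the names in your own data.

```python
# Hypothetical label names, standing in for ds['train'].features['label'].names
label_names = ['anger', 'joy', 'optimism', 'sadness']

# id2label maps integer class ids to label strings
id2label = {ind: val for ind, val in enumerate(label_names)}

# label2id is the inverse mapping, from label strings back to class ids
label2id = {val: key for key, val in id2label.items()}

print(id2label)
print(label2id)
```

Passing both dictionaries to the model lets its predictions be reported as readable label names instead of bare integers.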

# Tokenization

In [None]:
#load tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

In [None]:
#define tokenizing function
def tokenize_inputs(example):
    return tokenizer(example['text'], truncation = True)

In [None]:
#do tokenization
tokenized_ds = ds.map(tokenize_inputs, batched=True, remove_columns=['text'])

In [None]:
#Instantiate data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Create and train model
Although the code below is provided as a basis for Trainer and TrainingArguments, _you should certainly change the TrainingArguments for your particular case._

In [None]:
#define target task using relevant class
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=num_classes,
                                                           id2label=id2label,
                                                           label2id=label2id)

## Define metric

In [None]:
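#we'll use accuracy from HF hub as an example
#note: load_metric is deprecated and has been removed from recent datasets releases;
#on newer versions, install the evaluate package and use evaluate.load("accuracy") instead
metric = load_metric("accuracy")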

#function to utilize the metric we've loaded
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
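
As a rough illustration of what `compute_metrics` does, the sketch below reproduces the argmax-then-compare logic with NumPy alone. The logits and labels are made-up values, and the accuracy is computed by hand rather than through the loaded metric.

```python
import numpy as np

# Hypothetical logits for 3 examples over 4 classes (values made up for illustration)
logits = np.array([[2.0, 0.1, -1.0, 0.3],
                   [0.2, 0.1, 3.5, 0.0],
                   [0.0, 1.2, 0.4, 0.9]])
labels = np.array([0, 2, 3])

# Pick the highest-scoring class in each row, as compute_metrics does
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the reference labels
accuracy = float((predictions == labels).mean())
print(predictions, accuracy)
```

Here the first two rows are classified correctly and the third is not, so two of the three predictions match.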

## Train model
Note that you will likely need to train for a great deal longer than 3 epochs, depending on the size of your data. If your dataset is notably large, consider replacing `num_train_epochs` with a step budget (`max_steps`) and switching the logging, evaluation, and save strategies from `"epoch"` to `"steps"`, so that you can monitor training at regular intervals.

During experimentation, you should likely keep `push_to_hub=False` until you're ready for a full training run.
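
To get a feel for how epochs translate into steps, the sketch below does the arithmetic by hand. The dataset size and device count are hypothetical, and it assumes no gradient accumulation and that the last partial batch is kept.

```python
import math

# Hypothetical numbers, chosen only for illustration
num_examples = 10_000
per_device_batch_size = 4   # matches per_device_train_batch_size in this notebook
num_devices = 1             # assumed: a single GPU
num_epochs = 3

# One optimizer step consumes one batch on every device
steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

With these numbers, epoch-level evaluation would report only once every 2,500 steps, which is why a step-based schedule can be easier to monitor on large datasets.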

In [None]:
#set parameters around training
training_args = TrainingArguments(model_directory_name,
                                  logging_strategy='epoch',
                                  #newer transformers releases rename this argument to eval_strategy
                                  evaluation_strategy='epoch',
                                  save_strategy='epoch',
                                  load_best_model_at_end=True,
                                  metric_for_best_model='accuracy',
                                  greater_is_better=True,
                                  per_device_train_batch_size=4,
                                  per_device_eval_batch_size=4,
                                  num_train_epochs=3,
                                  push_to_hub=push_to_hub,
                                  hub_strategy='checkpoint',
                                  report_to='all')

In [None]:
#set up trainer
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['validation'],
    compute_metrics=compute_metrics
)

#actually train model
trainer.train()

## Final push
Confident in your model? Do a final push (remember that the best checkpoint is the one loaded at this point if you've used `load_best_model_at_end=True`). Don't forget that you'll need to be logged in to Huggingface using `notebook_login`.

In [None]:
#push to hub
trainer.push_to_hub(commit_message='end of training 3 epochs')

# Prediction and evaluation
Note that the following code looks at the validation set; however, during training it is often useful to make sure your model can learn on the data by first inspecting the performance on the training dataset.


## Evaluate

In [None]:
#Perform evaluation over entire dataset
eval_ds = trainer.evaluate(tokenized_ds['validation'])
eval_ds

## Predict

In [None]:
#use trainer to predict
preds = trainer.predict(tokenized_ds['validation'])
preds

# Model saving
Once the model is saved by the cell below, you can load it with the `pipeline` function from `transformers` and use it for inference.

In [None]:
trainer.save_model('bert-tuned-model')
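
Loading the saved model for inference might look like the sketch below. The helper name `load_classifier` is hypothetical, and the sketch assumes the directory written by `trainer.save_model` contains both the model and the tokenizer (which is the case when a tokenizer is passed to `Trainer`, as above).

```python
def load_classifier(model_dir='bert-tuned-model'):
    """Build a text-classification pipeline from a saved model directory."""
    # Imported here so that defining the helper does not itself load transformers
    from transformers import pipeline
    return pipeline('text-classification', model=model_dir)

# Example usage (run only after trainer.save_model has written the directory):
# classifier = load_classifier()
# classifier("I can't wait for the weekend!")
```

The pipeline returns a list of dictionaries with a `label` and a `score` for each input, where the label names come from the `id2label` mapping stored with the model.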