# **LAB 2: FINE TUNING**

Fine-tuning refers to the process of taking a pre-trained language model and 
training it further on a specific task or domain to improve its performance on that task.  
<br />
It is an important technique used to adapt LLMs to specific tasks and domains.  
<br />
In this lab we will explore basic ways to fine-tune large language models using
open source tools. First we look at an example of doing this by hand with the open source 🤗 Transformers
Python library. Familiarity with the 🤗 Transformers package is helpful once we
introduce additional tools with more flexibility, such as H2O LLM Studio.  
<br />
In this notebook, we will explore how to fine-tune a foundation large language
model so that it can generate LinkedIn posts in the style of known influencers
on the platform.

Use the prepared dataset from the prior lab: `/kaggle/input/influencers-data-prepared-csv`
- Click on `Add Data`,
- Select `Your Datasets`,
- Grab the `requirements`, and
- Grab the `influencers_data_prepared.csv` dataset.

# Using Hugging Face 

## Understanding the `transformers` and `datasets` libraries

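- Load the WNLI dataset from the General Language Understanding Evaluation (GLUE)
benchmark (https://gluebenchmark.com/).

As described in the GLUE paper, the Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task
in which a system must read a sentence with a pronoun and select the referent of that pronoun from
a list of choices. WNLI recasts it as sentence-pair classification: `sentence2` replaces the pronoun in
`sentence1` with a candidate referent, and the label says whether `sentence2` follows from `sentence1`.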

**If you are on Kaggle, set the Accelerator to "GPU-T4"**

**References**:
- Datasets library: https://pypi.org/project/datasets/

In [1]:
import warnings
warnings.filterwarnings('ignore')

# set flag for training environment
TRAINING = True

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "wnli")
checkpoint = "bert-base-uncased"



In [7]:
glue_dataset = load_dataset("glue", "wnli")
glue_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 635
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 71
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 146
    })
})

In [5]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 635
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 71
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 146
    })
})

In [22]:
raw_datasets['test'].__dir__()

['_info',
 '_split',
 '_indexes',
 '_data',
 '_indices',
 '_format_type',
 '_format_kwargs',
 '_format_columns',
 '_output_all_columns',
 '_fingerprint',
 '__module__',
 '__doc__',
 '__init__',
 'from_file',
 'from_buffer',
 'from_pandas',
 'from_dict',
 'from_csv',
 'from_json',
 'from_parquet',
 'from_text',
 '__del__',
 '__enter__',
 '__exit__',
 'save_to_disk',
 '_build_local_temp_path',
 'load_from_disk',
 'data',
 'cache_files',
 'num_columns',
 'num_rows',
 'column_names',
 'shape',
 'unique',
 'class_encode_column',
 'flatten',
 'cast',
 'cast_column',
 'remove_columns',
 'rename_column',
 'rename_columns',
 '__len__',
 '_iter',
 '__iter__',
 '__repr__',
 'format',
 'formatted_as',
 'set_format',
 'reset_format',
 'set_transform',
 'with_format',
 'with_transform',
 'prepare_for_task',
 '_getitem',
 '__getitem__',
 'cleanup_cache_files',
 '_get_cache_file_path',
 'map',
 '_map_single',
 'filter',
 'flatten_indices',
 '_new_dataset_with_indices',
 'select',
 'sort',
 'shuffle',


In [39]:
from pprint import pprint
# use the public `info` property rather than the private `_info` attribute
pprint(raw_datasets['test'].info)

DatasetInfo(description='GLUE, the General Language Understanding Evaluation '
                        'benchmark\n'
                        '(https://gluebenchmark.com/) is a collection of '
                        'resources for training,\n'
                        'evaluating, and analyzing natural language '
                        'understanding systems.\n'
                        '\n',
            citation='@inproceedings{levesque2012winograd,\n'
                     '  title={The winograd schema challenge},\n'
                     '  author={Levesque, Hector and Davis, Ernest and '
                     'Morgenstern, Leora},\n'
                     '  booktitle={Thirteenth International Conference on the '
                     'Principles of Knowledge Representation and Reasoning},\n'
                     '  year={2012}\n'
                     '}\n'
                     '@inproceedings{wang2019glue,\n'
                     '  title={{GLUE}: A Multi-Task Benchmark and Analysis '
 

In [44]:
raw_datasets['train'][2:4]

{'sentence1': ['The police arrested all of the gang members. They were trying to stop the drug trade in the neighborhood.',
  "Steve follows Fred's example in everything. He influences him hugely."],
 'sentence2': ['The police were trying to stop the drug trade in the neighborhood.',
  'Steve influences him hugely.'],
 'label': [1, 0],
 'idx': [2, 3]}
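
Slicing a `Dataset` returns a dict of columns (one list per feature) rather than a list of examples. The sketch below, using the slice output above copied by hand, shows one way to regroup it into per-example rows; `to_rows` is a hypothetical helper, not part of the `datasets` library. For GLUE WNLI, label 1 is usually documented as entailment and 0 as not entailment; `raw_datasets['train'].features['label'].names` should confirm this in your environment.

```python
# Column-oriented batch, copied from the slice output above.
batch = {
    "sentence1": [
        "The police arrested all of the gang members. They were trying to stop the drug trade in the neighborhood.",
        "Steve follows Fred's example in everything. He influences him hugely.",
    ],
    "sentence2": [
        "The police were trying to stop the drug trade in the neighborhood.",
        "Steve influences him hugely.",
    ],
    "label": [1, 0],
    "idx": [2, 3],
}


def to_rows(columns):
    """Turn a dict of equal-length lists into a list of per-example dicts.

    Hypothetical helper for illustration only.
    """
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*(columns[k] for k in keys))]


rows = to_rows(batch)
for row in rows:
    print(row["idx"], row["label"], "|", row["sentence2"])
```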

# Tokenizer

We can automatically load the tokenizer that matches the pretrained checkpoint
via `AutoTokenizer`.

In [2]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = 'this will be fun!'

tokenizer.tokenize(sequence)


['this', 'will', 'be', 'fun', '!']

# Tokenizer Output

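Let's take a look at the integer IDs (`input_ids`) assigned to each token in the sequence,
as well as the other fields the tokenizer returns. The `attention_mask` marks which positions the
attention mechanism should attend to (1) or ignore (0), for example padding added to bring the
sequences in a batch to the same length. The `token_type_ids` distinguish the first and second
sentence when a pair is encoded. The tokenizer also adds special tokens: the IDs 101 and 102 at
either end are BERT's `[CLS]` and `[SEP]`.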

In [3]:
tokenizer(sequence)

{'input_ids': [101, 2023, 2097, 2022, 4569, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
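
To make the mapping concrete, here is a minimal sketch that reproduces the output above with a toy vocabulary. The vocabulary holds only the IDs visible in that output, and `toy_encode` is a hypothetical stand-in for illustration; the real tokenizer also handles lowercasing, WordPiece splitting, truncation, and sentence pairs.

```python
# Toy vocabulary: only the ids visible in the tokenizer output above,
# not the full bert-base-uncased vocabulary.
toy_vocab = {
    "[CLS]": 101,
    "[SEP]": 102,
    "this": 2023,
    "will": 2097,
    "be": 2022,
    "fun": 4569,
    "!": 999,
}


def toy_encode(tokens, vocab):
    """Hypothetical single-sentence encoder: wrap tokens in [CLS] ... [SEP]."""
    ids = [vocab["[CLS]"]] + [vocab[t] for t in tokens] + [vocab["[SEP]"]]
    return {
        "input_ids": ids,
        "token_type_ids": [0] * len(ids),  # single sentence, so all segment 0
        "attention_mask": [1] * len(ids),  # no padding, so attend to everything
    }


encoded = toy_encode(["this", "will", "be", "fun", "!"], toy_vocab)
print(encoded)
```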

In [4]:
# Tokenize each sentence pair, truncating inputs longer than the model's maximum length
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
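
`tokenize_function` deliberately does not pad. `DataCollatorWithPadding` pads each batch to the length of its longest sequence when the batch is assembled. The sketch below illustrates that idea in plain Python; `pad_batch` is a hypothetical helper, and the padding ID of 0 is an assumption (it matches `[PAD]` in `bert-base-uncased`, but other tokenizers differ).

```python
def pad_batch(features, pad_id=0):
    """Pad every sequence in a batch to the batch's longest length.

    Hypothetical helper illustrating dynamic padding; pad_id=0 is an
    assumption that holds for bert-base-uncased.
    """
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # 1 = real token, 0 = padding the model should ignore
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


features = [
    {"input_ids": [101, 2023, 2097, 2022, 4569, 999, 102]},
    {"input_ids": [101, 4569, 999, 102]},
]
padded = pad_batch(features)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Padding per batch rather than to one global maximum keeps short batches short, which saves compute when sequence lengths vary.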


In [6]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

# Load pretrained model weights

In [7]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Create a Trainer object to begin fine tuning

In [8]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [9]:
trainer.train()


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



TrainOutput(global_step=120, training_loss=0.7066105524698894, metrics={'train_runtime': 103.3614, 'train_samples_per_second': 18.43, 'train_steps_per_second': 1.161, 'total_flos': 72598609616940.0, 'train_loss': 0.7066105524698894, 'epoch': 3.0})
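
The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults of 3 epochs and a per-device batch size of 8, and further assumes an effective batch size of 16 (for example, two GPUs). That last value is inferred from `global_step=120`; the output does not state it.

```python
import math

num_examples = 635          # size of the WNLI train split
num_epochs = 3              # TrainingArguments default, matches 'epoch': 3.0
effective_batch_size = 16   # ASSUMPTION: e.g. default 8 per device on 2 GPUs
train_runtime = 103.3614    # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("samples/s:", round(samples_per_second, 2))
print("steps/s:", round(steps_per_second, 3))
```

Under these assumptions the computed step count and throughput agree with the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`.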

# Generate predictions on the validation data

In [10]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(71, 2) (71,)


# Model Output

As we can see, the model outputs raw logits: one unnormalized score per class for each of the 71 validation examples.

In [11]:
print(predictions.predictions[:10, :10])

[[ 0.06643731  0.23543179]
 [ 0.14411297  0.22523996]
 [ 0.11989371  0.24879391]
 [ 0.2413762   0.22958633]
 [ 0.16321541  0.18696262]
 [ 0.11199773  0.2538352 ]
 [ 0.12315273  0.23112205]
 [ 0.19100055  0.18159585]
 [-0.14954926  0.10492721]
 [-0.07249562  0.05665697]]
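
Logits can be turned into class probabilities with a softmax. The sketch below applies a numerically stable softmax to the ten rows printed above (copied by hand); the `softmax` function is written out here for illustration rather than taken from a library.

```python
import numpy as np

# First ten rows of logits, copied from the output above.
logits = np.array([
    [0.06643731, 0.23543179],
    [0.14411297, 0.22523996],
    [0.11989371, 0.24879391],
    [0.2413762, 0.22958633],
    [0.16321541, 0.18696262],
    [0.11199773, 0.2538352],
    [0.12315273, 0.23112205],
    [0.19100055, 0.18159585],
    [-0.14954926, 0.10492721],
    [-0.07249562, 0.05665697],
])


def softmax(x):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


probs = softmax(logits)
print(np.round(probs, 3))
print(probs.argmax(axis=-1))
```

Softmax preserves the ordering within each row, so taking the argmax of the probabilities gives the same labels as taking the argmax of the logits. The two probabilities in each row stay close to 0.5, which suggests the model is not very confident on these examples.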


# Turn into label predictions

In [12]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
preds

array([1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 1, 1])
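
A natural next step is to summarize these predictions. The sketch below counts how often each class is predicted, using the array above copied by hand, and defines a small `accuracy` helper. The helper is hypothetical; in the notebook it would be called as `accuracy(preds, predictions.label_ids)`, since the true labels are not printed here.

```python
import numpy as np

# Predicted labels, copied from the output above.
preds = np.array([
    1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
    1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,
    1, 1, 0, 1, 1,
])


def accuracy(predicted, labels):
    """Fraction of positions where the predicted label equals the true label."""
    predicted = np.asarray(predicted)
    labels = np.asarray(labels)
    return float((predicted == labels).mean())


# How many times each class (0 and 1) was predicted.
counts = np.bincount(preds, minlength=2)
print("predictions per class:", counts)
print("share predicted as class 1:", round(float(counts[1]) / len(preds), 3))
```

A strong skew toward one class, together with a training loss near ln 2 ≈ 0.693, is worth checking against the label distribution before reading much into the accuracy figure.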

---