# Experiment notes on fine-tuning

In [1]:
import torch

if torch.cuda.is_available():
    print("GPU is available")
else:
    print("GPU is not available")

GPU is available


In [None]:
# !pip install datasets
# !pip install evaluate
# !pip install accelerate

In [4]:
# See python-version

from datasets import load_dataset
from transformers import (
  GPT2Tokenizer,
  GPT2ForSequenceClassification,
  TrainingArguments,
  Trainer
)
import evaluate
import pandas as pd
import numpy as np

## Get dataset from HuggingFace

`load_dataset()` is the standard way to download datasets from Hugging Face
repos.

Hugging Face datasets are usually divided into splits such as training,
validation, and testing. This dataset provides only a train and a test split.

In [5]:
ds = load_dataset("mteb/tweet_sentiment_extraction")

Downloading readme:   0%|          | 0.00/22.0 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.63M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/465k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/27481 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3534 [00:00<?, ? examples/s]

In [None]:
ds = load_dataset("ic-fspml/stock_news_sentiment")
ds

In [None]:
# def enum_label(x) -> int:
#   if x == "neutral":
#     return 0
#   elif x == "strongly bearish":
#     return -2
#   elif x == "mildly bearish":
#     return -1
#   elif x == "mildly bullish":
#     return 1
#   elif x == "strongly bullish":
#     return 2

# def new_col(x):
#   x["sentiment"] = x["label"]
#   x["label"] = enum_label(x["label"])
#   return x


In [None]:
# # Enumerate labels since Transformers can't use str as labels
# new_ds = ds.map(new_col)

# df["label_enum"] = df["label"].apply(enum_label)

Find the number of unique labels.

In [6]:
# View test
pd.DataFrame(ds["test"])

Unnamed: 0,id,text,label,label_text
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,1,neutral
1,96d74cb729,Shanghai is also really exciting (precisely -...,2,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",0,negative
3,01082688c6,happy bday!,2,positive
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,2,positive
...,...,...,...,...
3529,e5f0e6ef4b,"its at 3 am, im very tired but i can`t sleep ...",0,negative
3530,416863ce47,All alone in this old house again. Thanks for...,2,positive
3531,6332da480c,I know what you mean. My little dog is sinkin...,0,negative
3532,df1baec676,_sutra what is your next youtube video gonna b...,2,positive
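The cell above displays the test split but does not itself count the labels. A minimal sketch of that count follows, using only the `(label, label_text)` pairs copied from the rows visible above (a small sample, not the full split); the variable names are illustrative.

```python
from collections import Counter

# (label, label_text) pairs copied from the rows displayed above.
sample_rows = [
    (1, "neutral"),
    (2, "positive"),
    (0, "negative"),
    (2, "positive"),
    (2, "positive"),
    (0, "negative"),
    (2, "positive"),
    (0, "negative"),
    (2, "positive"),
]

# Map each integer label to its text name and count how often it occurs.
label_names = dict(sample_rows)
label_counts = Counter(label for label, _ in sample_rows)
num_labels = len(label_names)

print(num_labels)                   # 3
print(sorted(label_names.items()))  # [(0, 'negative'), (1, 'neutral'), (2, 'positive')]
```

On the full dataset, `ds["train"].unique("label")` should give the same set of values. The count is what `num_labels` needs to be when the model is loaded.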


In [7]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

## Tokenizer

A tokenizer is a core component of machine learning pipelines for text: it
translates data from human-readable (strings) to computer-readable (numbers).

A tokenizer takes a string and breaks it into tokens that an algorithm can use.
Tokenizers, much like the algorithms themselves, vary in their characteristics
and behavior.
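As a rough illustration of the string-to-number translation, here is a toy sketch with a hand-written vocabulary. The vocabulary, `encode`, and `decode` are hypothetical and far simpler than a real tokenizer, which learns its vocabulary from data.

```python
# Toy vocabulary (hypothetical); real tokenizers have tens of thousands of entries.
vocab = {"<unk>": 0, "hello": 1, ",": 2, "how": 3, "are": 4, "you": 5, "?": 6}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(tokens):
    """Map each token to its integer id, falling back to <unk> for unknown tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]


def decode(ids):
    """Map integer ids back to their tokens."""
    return [id_to_token[token_id] for token_id in ids]


ids = encode(["hello", ",", "how", "are", "you", "?"])
print(ids)                         # [1, 2, 3, 4, 5, 6]
print(encode(["hello", "world"]))  # [1, 0]  ("world" is not in the vocabulary)
print(decode(ids))
```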

### Efficiency

Converting text to numbers makes processing more efficient, since numbers are
faster to process and store than strings.

### Examples

Examples of tokenizers include byte-level BPE (byte-pair encoding), which GPT-2
uses. Another is Hugging Face's Tokenizers library.

### Process

Steps for a tokenizer include:

1. Normalization: Cleans up the text, for example by removing extra whitespace,
   converting to lowercase, and removing accents.

   `"Héllò, hôw are yoü?"` -> `"hello, how are you?"`

1. Pre-tokenization: Splits the string into smaller chunks such as words. In the
   following example, the character offsets of each chunk are tracked as well.

   `"hello, how are you?"` -> `[('hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (15, 18)), ('?', (18, 19))]`

1. Modeling: Maps the chunks to tokens from the model's vocabulary. A BERT
   tokenizer tokenizes the sentence like this:

   `["hello", ",", "how", "are", "you", "?"]`

1. Post-processing: Adds the special tokens that the model expects.

   `["[CLS]", "hello", ",", "how", "are", "you", "?", "[SEP]"]`

   `[CLS]` is the classification token and `[SEP]` is the separator token that
   marks the end of a sentence.

https://medium.com/@awaldeep/hugging-face-understanding-tokenizers-1b7e4afdb154
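The four steps can be sketched with the standard library alone. This is a simplified illustration, not how any Hugging Face tokenizer is implemented: `normalize`, `pre_tokenize`, and `post_process` are hypothetical helpers, and the modeling step is skipped by assuming every word is already in the vocabulary.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.split())


def pre_tokenize(text):
    """Split into words and punctuation marks, keeping (start, end) offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def post_process(tokens):
    """Wrap the token list in BERT-style special tokens."""
    return ["[CLS]", *tokens, "[SEP]"]


normalized = normalize("Héllò, hôw are yoü?")
pieces = pre_tokenize(normalized)
# Modeling step skipped: each chunk is assumed to be a vocabulary token as-is.
tokens = [piece for piece, _ in pieces]
final = post_process(tokens)

print(normalized)  # hello, how are you?
print(pieces)
print(final)
```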


In [8]:
# GPT-2 has no padding token, so reuse the end-of-sequence token for padding.
tokenizer.pad_token = tokenizer.eos_token


def tokenize(examples):
    """Returns tokenized data for a batch of rows.

    Args:
        examples: Batch of rows containing a "text" field.
    """
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [9]:
# new_ds.map(tokenize, batched=True)
tokenized_ds = ds.map(tokenize, batched=True)

Map:   0%|          | 0/27481 [00:00<?, ? examples/s]

Map:   0%|          | 0/3534 [00:00<?, ? examples/s]

In [18]:
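# Look up the `input_ids` feature. It exists only on the tokenized dataset;
# the raw `ds` has no such column, so the same lookup there returns None.
tokenized_ds["train"].features.get("input_ids")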

The tokenizer will add `input_ids` and `attention_mask` fields.
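To make the two fields concrete, here is a dependency-free sketch of what `padding="max_length"` and `truncation=True` do to a single row. `pad_and_mask` is a hypothetical helper, the token ids are made up, and right-side padding is assumed; 50256 is GPT-2's end-of-text id, which the notebook reuses as the padding token.

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    kept = token_ids[:max_length]
    padding = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * padding,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * padding,
    }


PAD_ID = 50256  # GPT-2 end-of-text id, reused as padding
encoded = pad_and_mask([10, 20, 30], max_length=6, pad_id=PAD_ID)  # hypothetical ids
truncated = pad_and_mask(list(range(10)), max_length=4, pad_id=PAD_ID)

print(encoded["input_ids"])       # [10, 20, 30, 50256, 50256, 50256]
print(encoded["attention_mask"])  # [1, 1, 1, 0, 0, 0]
print(truncated["input_ids"])     # [0, 1, 2, 3]
```

In the notebook every row is padded to the tokenizer's maximum length (which should be 1024 for GPT-2), so short tweets consist mostly of padding.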

In [19]:
# Training set
small_train_dataset = tokenized_ds["train"].shuffle(seed=42).select(range(1000))
# Testing set
small_test_dataset = tokenized_ds["test"].shuffle(seed=42).select(range(1000))

## Use the pre-trained model

This section fine-tunes a pre-trained model. Since it's hard to train a model
that contains general knowledge from scratch, we'll start from an existing one.

In [20]:
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)  # One class per value in the `label` column

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Testing the algorithm

In [21]:
metric = evaluate.load("accuracy")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

## Run training

Training uses Hugging Face's
[Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer).

The Trainer runs on PyTorch and supports distributed training on NVIDIA GPUs.


In [22]:
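def compute_metrics(eval_pred):
    # `eval_pred` unpacks into model outputs (logits) and true labels.
    # Renamed from `eval` to avoid shadowing the Python built-in.
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)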
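# A dependency-free sketch of what the function above computes: take the
# argmax of each row of logits, then report the fraction of matches. The
# logits and labels below are made-up values for illustration only.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four examples over three classes.
logits = [
    [2.0, 0.1, -1.0],  # predicts class 0
    [0.2, 0.3, 1.5],   # predicts class 2
    [0.1, 0.9, 0.4],   # predicts class 1
    [1.2, 0.8, 0.1],   # predicts class 0
]
labels = [0, 2, 1, 2]

result = accuracy(logits, labels)
print(result)  # {'accuracy': 0.75}
```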

In [23]:
training_args = TrainingArguments(
    output_dir="test_trainer",
    # evaluation_strategy="epoch",
    per_device_train_batch_size=1,  # Reduce batch size here
    per_device_eval_batch_size=1,  # Optionally, reduce for evaluation as well
    gradient_accumulation_steps=4,
)


trainer = Trainer(
    # Model for training
    model=model,
    # Takes instance of `TrainingArguments`
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_test_dataset,
    compute_metrics=compute_metrics,
)

In [24]:
trainer.train()

Step,Training Loss
500,0.8525


TrainOutput(global_step=750, training_loss=0.7261580098470052, metrics={'train_runtime': 463.919, 'train_samples_per_second': 6.467, 'train_steps_per_second': 1.617, 'total_flos': 1567794659328000.0, 'train_loss': 0.7261580098470052, 'epoch': 3.0})
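The `global_step=750` in the output follows from the training arguments. A short arithmetic sketch, assuming a single GPU and the default of 3 epochs (consistent with `'epoch': 3.0` in the output):

```python
num_examples = 1000        # size of small_train_dataset
per_device_batch_size = 1  # per_device_train_batch_size
grad_accum_steps = 4       # gradient_accumulation_steps
num_epochs = 3             # assumed default; matches 'epoch': 3.0
num_devices = 1            # assumption: a single GPU
train_runtime = 463.919    # seconds, copied from the output above

# Gradients from 4 batches of 1 are accumulated before each optimizer step.
effective_batch_size = per_device_batch_size * num_devices * grad_accum_steps
batches_per_epoch = num_examples // (per_device_batch_size * num_devices)
steps_per_epoch = batches_per_epoch // grad_accum_steps
total_steps = steps_per_epoch * num_epochs

print(effective_batch_size)                               # 4
print(total_steps)                                        # 750
print(round(total_steps / train_runtime, 3))              # 1.617
print(round(num_examples * num_epochs / train_runtime, 3))  # 6.467
```

The last two values match `train_steps_per_second` and `train_samples_per_second` in the output. Since the first progress log appears at step 500, only one training-loss row is shown in the table.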

In [25]:
trainer.evaluate()

{'eval_loss': 1.0517683029174805,
 'eval_accuracy': 0.734,
 'eval_runtime': 47.7095,
 'eval_samples_per_second': 20.96,
 'eval_steps_per_second': 20.96,
 'epoch': 3.0}