# Introduction

* Datasets:
    * https://huggingface.co/datasets/tatsu-lab/alpaca?row=1
* Models:
    * https://huggingface.co/distilbert/distilgpt2
 
***Note:*** *Here we preprocess each sample manually before it reaches the model. The formatting function is passed to the SFT trainer through its `formatting_func` argument.*

In [1]:
!pip install -U accelerate peft bitsandbytes transformers trl datasets



In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    TrainingArguments,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer

## Configuration

In [3]:
batch_size = 4
num_workers = os.cpu_count()
max_steps = 13000
bf16 = True
fp16 = False
gradient_accumulation_steps = 8
context_length = 1024
logging_steps = 500
save_steps = 500
learning_rate = 0.0001
model_name = 'distilbert/distilgpt2'
out_dir = 'outputs/distilgpt2_alpaca_preprocess_fn'
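As a rough sanity check on these settings, the effective batch size per optimizer step can be worked out from the values above. This is a sketch under two assumptions that the notebook does not state: training runs on a single device (`num_devices` is a name introduced here for illustration), and every sequence is filled to `context_length` tokens, which makes the token counts upper bounds.

```python
# Values copied from the configuration cell above.
batch_size = 4
gradient_accumulation_steps = 8
context_length = 1024
max_steps = 13000

# Assumption: one device. Change this for multi-GPU runs.
num_devices = 1

# Sequences that contribute to a single optimizer update.
sequences_per_step = batch_size * gradient_accumulation_steps * num_devices
# Upper bound on tokens per update, assuming full-length sequences.
tokens_per_step = sequences_per_step * context_length
total_tokens = tokens_per_step * max_steps

print(f"{sequences_per_step} sequences per optimizer step")
print(f"{tokens_per_step:,} tokens per optimizer step (upper bound)")
print(f"{total_tokens:,} tokens over {max_steps:,} steps (upper bound)")
```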

## Load Dataset

In [4]:
dataset = load_dataset('tatsu-lab/alpaca')

In [5]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 52002
    })
})


In [6]:
print(dataset['train']['text'][0])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.


In [7]:
print(dataset['train'][0])

{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}


In [8]:
full_dataset = dataset['train'].train_test_split(test_size=0.05, shuffle=True)
dataset_train = full_dataset['train']
dataset_valid = full_dataset['test']
 
print(dataset_train)
print(dataset_valid)

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 49401
})
Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 2601
})
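The split sizes printed above can be reproduced by hand. The sketch below assumes that the test count is rounded up and the remainder goes to the training set, which is consistent with the 49401 / 2601 split shown here.

```python
import math

n_samples = 52002   # rows in the Alpaca train split, from the output above
test_size = 0.05    # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest becomes training data.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Because `shuffle=True` is used without a seed, the rows that land in each split change between runs. Passing a fixed `seed` to `train_test_split` should make the split repeatable.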


In [9]:
for i in range(10):
    sample = dataset_train[i]
    print(sample)
    print('****************')

    # Build the same prompt layout that the formatting function produces.
    instruction = '### Instruction:\n' + sample['instruction']
    inputs = '\n\n### Input:\n' + sample['input']
    response = '\n\n### Response:\n' + sample['output']

    final_text = instruction + inputs + response
    print(final_text)
    print('#'*50)

{'instruction': 'Given the following text, list 5 facts about dolphins:\n\nDolphins are one of the most intelligent animals on Earth and have fascinated humans since ancient times.', 'input': '', 'output': '1. Dolphins are highly social animals and live in groups called pods. \n2. They communicate using a wide range of vocalizations. \n3. Dolphins can recognize themselves in a mirror. \n4. They can learn to follow directions and even perform tricks for entertainment. \n5. Dolphins can use tools to help them find food or protect themselves.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGiven the following text, list 5 facts about dolphins:\n\nDolphins are one of the most intelligent animals on Earth and have fascinated humans since ancient times.\n\n### Response:\n1. Dolphins are highly social animals and live in groups called pods. \n2. They communicate using a wide range of vocalizations. \n3. 

In [10]:
# def preprocess_function(example):
#     """
#     Formatting function returning a list of samples (kind of necessary for SFT API).
#     """
#     output_texts = []
#     for i in range(len(example['instruction'])):
#         instruction = '### Instruction:\n' + example['instruction'][i]
#         inputs = '\n\n### Input:\n' + example['input'][i]
#         response = '\n\n### Response:\n' + example['output'][i]
        
#         final_text = instruction + inputs + response
#         output_texts.append(final_text)
#     return output_texts

def preprocess_function(example):
    """
    Formatting function for the SFT API: takes one sample (a dict with
    'instruction', 'input' and 'output') and returns a single prompt string.
    """
    text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    return text
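Many Alpaca samples have an empty `input` field, so the template above often produces an `### Input:` header with nothing under it. One possible variant, shown only as a sketch and not used for the training run in this notebook, drops that block when the field is empty. The name `format_example` is hypothetical.

```python
def format_example(example):
    """Build one training string; skip the Input block when it is empty."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    if example["input"].strip():
        parts.append(f"### Input:\n{example['input']}")
    parts.append(f"### Response:\n{example['output']}")
    return "\n\n".join(parts)


# A sample with an empty input, in the same shape as the dataset rows.
sample = {
    "instruction": "Suggest a location to visit in the United States.",
    "input": "",
    "output": "Grand Canyon National Park, Arizona",
}
print(format_example(sample))
```

If a variant like this were used for training, the inference prompt would need to follow the same layout.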

## Model

In [11]:
if bf16:
    model = AutoModelForCausalLM.from_pretrained(model_name).to(dtype=torch.bfloat16)
else:
    model = AutoModelForCausalLM.from_pretrained(model_name)

In [12]:
print(model)
# Total parameters and trainable parameters.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
81,912,576 total parameters.
81,912,576 training parameters.
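The total above can be cross-checked against the printed architecture. The sketch below assumes the standard GPT-2 layout, which the `Conv1D()` lines do not spell out: `c_attn` maps 768 to 3 × 768, the MLP expands to 4 × 768, every projection has a bias, and `lm_head` shares its weight with `wte` so it adds no parameters of its own.

```python
vocab, n_pos, d, n_layers = 50257, 1024, 768, 6

embeddings = vocab * d + n_pos * d            # wte + wpe
layer_norm = 2 * d                            # weight + bias
attn = (d * 3 * d + 3 * d) + (d * d + d)      # c_attn + c_proj
mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # c_fc + c_proj
block = 2 * layer_norm + attn + mlp           # ln_1, ln_2, attention, MLP

# Assumption: lm_head is tied to wte, so it is not counted a second time.
total = embeddings + n_layers * block + layer_norm  # last term is ln_f
print(f"{total:,} total parameters.")
```

Almost half of the parameters sit in the token embedding matrix, which is typical for a model this small with a 50k vocabulary.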


## Tokenizer

In [13]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True,
    use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token

## Training

In [14]:
training_args = TrainingArguments(
    output_dir=f"{out_dir}/logs",
    evaluation_strategy='steps',  # newer transformers releases call this `eval_strategy`
    weight_decay=0.01,
    load_best_model_at_end=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_strategy='steps',
    save_strategy='steps',
    logging_steps=logging_steps,
    save_steps=save_steps,
    save_total_limit=2,
    bf16=bf16,
    fp16=fp16,
    report_to='tensorboard',
    max_steps=max_steps,
    dataloader_num_workers=num_workers,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    lr_scheduler_type='constant',
)
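The interaction between the step settings is easy to get wrong, so a quick check may help. This sketch assumes that, because `eval_steps` is not passed, evaluation runs at the same interval as `logging_steps`, and that `load_best_model_at_end=True` requires every checkpoint to coincide with an evaluation.

```python
logging_steps = 500
save_steps = 500
max_steps = 13000

# Assumption: with no explicit eval_steps, evaluation follows logging_steps.
eval_steps = logging_steps

# Each saved checkpoint needs a validation score to be comparable.
assert save_steps % eval_steps == 0, "save_steps must be a multiple of eval_steps"

num_evaluations = max_steps // eval_steps
num_checkpoints_written = max_steps // save_steps
print(num_evaluations, num_checkpoints_written)
```

With `save_total_limit=2`, older checkpoints are deleted as new ones are written, so only a small number remain on disk at any time.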

In [15]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_train,
    eval_dataset=dataset_valid,
    max_seq_length=context_length,
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=preprocess_function,
    packing=True
)

Token indices sequence length is longer than the specified maximum sequence length for this model (1495 > 1024). Running this sequence through the model will result in indexing errors




In [16]:
dataloader = trainer.get_train_dataloader()
for i, sample in enumerate(dataloader):
    print(tokenizer.decode(sample['input_ids'][0]))
    print('#'*50)
    if i == 5:
        break

 with a hint of abstraction, allowing the viewer's imagination to create stories of their own.<|endoftext|>### Instruction:
Calculate the length of an arc with the following parameters.

### Input:
Angle of arc = 90°
Radius = 5m

### Response:
The length of an arc with an angle of 90° and a radius of 5m is 8.66m. This can be calculated using the formula L = rθ, where L is the arc length, r is the radius, and θ is the angle in radians. In this case, θ = pi/2, so the arc length is 8.66m.<|endoftext|>### Instruction:
Suggest a location to visit in the United States.

### Input:


### Response:
Grand Canyon National Park, Arizona<|endoftext|>### Instruction:
Compare and contrast socialism and communism.

### Input:


### Response:
Socialism and communism are both political and economic ideologies that promote the collective and equal ownership of resources and the means of production. However, socialism emphasizes economic democracy and a gradual transition from capitalism, while communism
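The decoded batch above starts and ends in the middle of a sample because `packing=True` joins formatted samples with `<|endoftext|>` and cuts the stream into fixed-length blocks. The following is a simplified illustration of that idea with toy token IDs, not the trainer's actual implementation. The function name `pack_sequences` is hypothetical.

```python
from typing import Iterable, List


def pack_sequences(tokenized: Iterable[List[int]], eos_id: int, block_size: int) -> List[List[int]]:
    """Concatenate token lists, separated by eos_id, and cut into equal blocks."""
    buffer: List[int] = []
    for ids in tokenized:
        buffer.extend(ids)
        buffer.append(eos_id)
    # Drop the trailing remainder that does not fill a whole block.
    n_blocks = len(buffer) // block_size
    return [buffer[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Toy token IDs; 0 plays the role of <|endoftext|>.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_sequences(examples, eos_id=0, block_size=4)
print(blocks)
```

The second block begins with the tail of one sample and ends with the head of the next, which mirrors what the decoded batch shows.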

In [17]:
history = trainer.train()

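| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 2.3553 | 2.158659 |
| 1000 | 2.2219 | 2.117892 |
| 1500 | 2.1815 | 2.097776 |
| 2000 | 2.1571 | 2.088081 |
| 2500 | 2.143  | 2.080245 |
| 3000 | 2.1321 | 2.075031 |
| 3500 | 2.1252 | 2.07196  |
| 4000 | 2.1193 | 2.068825 |
| 4500 | 2.1144 | 2.067222 |
| 5000 | 2.1114 | 2.065731 |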
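The losses in the log are easier to interpret as perplexity. The sketch below assumes the reported validation loss is the mean per-token cross-entropy in nats, in which case perplexity is simply its exponential. Only three rows of the log are copied here.

```python
import math

# Validation losses copied from the log above (step -> loss).
val_loss = {500: 2.158659, 2500: 2.080245, 5000: 2.065731}

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexity = {step: math.exp(loss) for step, loss in val_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step}: loss {val_loss[step]:.4f}, perplexity {ppl:.2f}")
```

The improvement between step 2500 and step 5000 is small compared with the first 2500 steps, which suggests the validation loss is flattening out.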


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


In [18]:
model.save_pretrained(f"{out_dir}/best_model")
tokenizer.save_pretrained(f"{out_dir}/best_model")

('outputs/distilgpt2_alpaca_preprocess_fn/best_model/tokenizer_config.json',
 'outputs/distilgpt2_alpaca_preprocess_fn/best_model/special_tokens_map.json',
 'outputs/distilgpt2_alpaca_preprocess_fn/best_model/vocab.json',
 'outputs/distilgpt2_alpaca_preprocess_fn/best_model/merges.txt',
 'outputs/distilgpt2_alpaca_preprocess_fn/best_model/added_tokens.json')

## Inference

In [4]:
from transformers import (
    AutoModelForCausalLM, 
    logging, 
    pipeline,
    AutoTokenizer
)

In [5]:
model = AutoModelForCausalLM.from_pretrained('outputs/distilgpt2_alpaca_preprocess_fn/best_model/')
tokenizer = AutoTokenizer.from_pretrained('outputs/distilgpt2_alpaca_preprocess_fn/best_model/')

tokenizer.pad_token = tokenizer.eos_token

In [6]:
# logging.set_verbosity(logging.CRITICAL)

In [7]:
pipe = pipeline(task='text-generation', model=model, tokenizer=tokenizer, max_length=256)

In [8]:
prompt = """### Instruction:
Write a resignation email to my boss.

### Input:


### Response:
"""

In [9]:
print(prompt)

### Instruction:
Write a resignation email to my boss.

### Input:


### Response:



In [12]:
result = pipe(
    prompt
)
print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


### Instruction:
Write a resignation email to my boss.

### Input:


### Response:
Dear [General Manager],

I would like to apologize to my colleagues at [Management Organization] tomorrow. I am particularly disappointed to hear that [Management Organization] is taking legal action and is taking an unusual approach. We are confident we can continue to serve as a leading partner in this area and I deeply apologize for any inconvenience or potential inconvenience that may arise from my involvement.

Sincerely, [Management Organization]
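The pipeline output repeats the prompt before the generated answer. A small helper can keep only the text after the response header. This is a sketch that assumes the prompt template used above, and `extract_response` is a hypothetical name.

```python
RESPONSE_MARKER = "### Response:\n"


def extract_response(generated_text: str) -> str:
    """Return the text after the first response marker, or the input if absent."""
    _, marker, tail = generated_text.partition(RESPONSE_MARKER)
    return tail.strip() if marker else generated_text.strip()


# Shortened stand-in for result[0]['generated_text'].
generated = (
    "### Instruction:\nWrite a resignation email to my boss.\n\n"
    "### Input:\n\n\n### Response:\nDear [General Manager],\n\nSincerely, [Management Organization]"
)
print(extract_response(generated))
```

The text-generation pipeline also accepts `return_full_text=False`, which should return only the newly generated part without any post-processing.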
