# Introduction

* Datasets:
    * https://huggingface.co/datasets/tatsu-lab/alpaca?row=1
* Models:
    * https://huggingface.co/openai-community/gpt2
 
***Note:*** *Instead of relying on the dataset's prebuilt `text` column, this notebook builds each training string manually from the raw columns and passes that builder to the SFT trainer through `formatting_func`.*

In [1]:
!pip install -U accelerate peft bitsandbytes transformers trl datasets



In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    TrainingArguments,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer

## Configuration

In [3]:
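batch_size = 16  # per-device batch size for training and evaluation
num_workers = os.cpu_count()  # dataloader worker processes
max_steps = 3000  # only used by the commented-out step-based configuration
bf16 = True  # train in bfloat16 (needs hardware support)
fp16 = False
gradient_accumulation_steps = 2  # effective batch size = 16 * 2 = 32 per device
context_length = 256  # maximum sequence length in tokens
logging_steps = 500  # only used by the commented-out step-based configuration
save_steps = 500  # only used by the commented-out step-based configuration
learning_rate = 5e-5
model_name = 'openai-community/gpt2'
out_dir = 'outputs/gpt2_alpaca_preprocssed_fn'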

## Load Dataset

In [4]:
dataset = load_dataset('tatsu-lab/alpaca')

In [5]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 52002
    })
})


In [6]:
print(dataset['train']['text'][0])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.


In [7]:
print(dataset['train'][0])

{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}


In [8]:
full_dataset = dataset['train'].train_test_split(test_size=0.05, shuffle=True)
dataset_train = full_dataset['train']
dataset_valid = full_dataset['test']
 
print(dataset_train)
print(dataset_valid)

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 49401
})
Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 2601
})
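
The split sizes printed above follow from simple arithmetic, and the same numbers give a rough sense of how long one epoch is. The sketch below assumes the validation split is rounded up (which matches the 2601 rows shown) and a single training device; the variable names are illustrative only.

```python
import math

total_rows = 52002   # rows in the original train split
test_size = 0.05     # fraction passed to train_test_split

# Assumption: the validation split is rounded up, which matches the printed sizes.
n_valid = math.ceil(total_rows * test_size)
n_train = total_rows - n_valid
print(n_train, n_valid)  # 49401 2601

# Samples consumed per optimizer update on one device (values from the Configuration cell).
batch_size = 16
gradient_accumulation_steps = 2
effective_batch = batch_size * gradient_accumulation_steps
print(effective_batch)  # 32

# Approximate number of optimizer updates in one epoch (about 1544).
print(n_train / effective_batch)
```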


In [9]:
for i in range(10):
    sample = dataset_train[i]
    print(sample)
    print('****************')

    # Rebuild the prompt from the raw columns instead of using the 'text' column.
    instruction = '### Instruction:\n' + sample['instruction']
    inputs = '\n\n### Input:\n' + sample['input']
    response = '\n\n### Response:\n' + sample['output']

    final_text = instruction + inputs + response
    print(final_text)
    print('#'*50)

{'instruction': 'Reword the following sentence in another way: "This person wore a face mask."', 'input': '', 'output': 'This individual had a face covering on.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nReword the following sentence in another way: "This person wore a face mask."\n\n### Response:\nThis individual had a face covering on.'}
****************
### Instruction:
Reword the following sentence in another way: "This person wore a face mask."

### Input:


### Response:
This individual had a face covering on.
##################################################
{'instruction': 'Generate 8 unique alphanumeric characters', 'input': '', 'output': '5gA1zV7m', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGenerate 8 unique alphanumeric characters\n\n### Response:\n5gA1zV7m'}
****************
### Instruct

In [10]:
def preprocess_function(example):
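    """
    Formatting function for the SFT API.

    Receives a batch as a dict of column lists and returns a list with one
    formatted string per sample, which is the shape the SFT API used here expects.
    """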
    output_texts = []
    for i in range(len(example['instruction'])):
        instruction = '### Instruction:\n' + example['instruction'][i]
        inputs = '\n\n### Input:\n' + example['input'][i]
        response = '\n\n### Response:\n' + example['output'][i]
        
        final_text = instruction + inputs + response
        output_texts.append(final_text)
    return output_texts
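
Before handing a formatting function to the trainer, it can help to run it on a tiny hand-made batch. The sketch below restates the same template under a hypothetical name, `format_batch`, so that it runs on its own; the two rows are made up for illustration.

```python
def format_batch(batch):
    """Same template as preprocess_function, restated so this snippet is self-contained."""
    texts = []
    for instruction, inp, output in zip(
        batch['instruction'], batch['input'], batch['output']
    ):
        texts.append(
            '### Instruction:\n' + instruction
            + '\n\n### Input:\n' + inp
            + '\n\n### Response:\n' + output
        )
    return texts

# Toy batch in dict-of-lists layout (one list per column); the rows are invented.
toy_batch = {
    'instruction': ['Add the numbers.', 'Say hello.'],
    'input': ['2 and 3', ''],
    'output': ['5', 'Hello!'],
}

formatted = format_batch(toy_batch)
print(len(formatted))  # 2
print(formatted[0])
```

Note that a row with an empty `input` still gets an `### Input:` header followed by a blank section, exactly as in the printed samples above.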

## Model

In [11]:
if bf16:
    model = AutoModelForCausalLM.from_pretrained(model_name).to(dtype=torch.bfloat16)
else:
    model = AutoModelForCausalLM.from_pretrained(model_name)

In [12]:
print(model)
# Total parameters and trainable parameters.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
124,439,808 total parameters.
124,439,808 training parameters.
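
The total can be reproduced by hand from the layer shapes. The printout hides the `Conv1D` sizes, so the sketch below assumes the standard GPT-2 small dimensions (a fused query/key/value projection of width 3 x 768 and an MLP width of 4 x 768) and assumes that `lm_head` shares its weight matrix with `wte`, so it is not counted twice.

```python
d_model, vocab, n_positions, n_layers = 768, 50257, 1024, 12

def dense(n_in, n_out):
    """Weights plus bias of one projection layer."""
    return n_in * n_out + n_out

layer_norm = 2 * d_model  # scale and shift vectors

embeddings = vocab * d_model + n_positions * d_model  # wte + wpe

block = (
    layer_norm                      # ln_1
    + dense(d_model, 3 * d_model)   # attn.c_attn (assumed fused Q, K, V)
    + dense(d_model, d_model)       # attn.c_proj
    + layer_norm                    # ln_2
    + dense(d_model, 4 * d_model)   # mlp.c_fc (assumed 4x expansion)
    + dense(4 * d_model, d_model)   # mlp.c_proj
)

# lm_head is assumed to reuse the wte matrix, so it adds nothing here.
total = embeddings + n_layers * block + layer_norm  # last term is ln_f
print(f"{total:,}")  # 124,439,808
```

The shared embedding matrix is presumably also why a checkpoint can omit `lm_head.weight` without losing information.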


## Tokenizer

In [13]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True,
    use_fast=False
)
# GPT-2 has no padding token by default, so reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token

## Training

In [14]:
# Step-based alternative (disabled); it uses max_steps, logging_steps, and save_steps.
# training_args = TrainingArguments(
#     output_dir=f"{out_dir}/logs",
#     evaluation_strategy='steps',
#     weight_decay=0.01,
#     load_best_model_at_end=True,
#     per_device_train_batch_size=batch_size,
#     per_device_eval_batch_size=batch_size,
#     logging_strategy='steps',
#     save_strategy='steps',
#     logging_steps=logging_steps,
#     save_steps=save_steps,
#     save_total_limit=2,
#     bf16=bf16,
#     fp16=fp16,
#     report_to='tensorboard',
#     max_steps=max_steps,
#     dataloader_num_workers=num_workers,
#     gradient_accumulation_steps=gradient_accumulation_steps,
#     learning_rate=learning_rate,
#     lr_scheduler_type='constant',
# )

training_args = TrainingArguments(
    output_dir=f"{out_dir}/logs",
    evaluation_strategy='epoch',
    weight_decay=0.01,
    load_best_model_at_end=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_strategy='epoch',
    save_strategy='epoch',
    save_total_limit=2,
    bf16=bf16,
    fp16=fp16,
    report_to='tensorboard',
    dataloader_num_workers=num_workers,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    lr_scheduler_type='constant',
    num_train_epochs=1
)

In [15]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_train,
    eval_dataset=dataset_valid,
    # peft_config=peft_params,
    # dataset_text_field='text',
    max_seq_length=context_length,
    tokenizer=tokenizer,
    args=training_args,
    # packing=False,
    formatting_func=preprocess_function
)


In [16]:
history = trainer.train()

Epoch,Training Loss,Validation Loss
1,2.189,1.983096


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


In [17]:
model.save_pretrained(f"{out_dir}/best_model")
tokenizer.save_pretrained(f"{out_dir}/best_model")

('outputs/gpt2_alpaca_preprocssed_fn/best_model/tokenizer_config.json',
 'outputs/gpt2_alpaca_preprocssed_fn/best_model/special_tokens_map.json',
 'outputs/gpt2_alpaca_preprocssed_fn/best_model/vocab.json',
 'outputs/gpt2_alpaca_preprocssed_fn/best_model/merges.txt',
 'outputs/gpt2_alpaca_preprocssed_fn/best_model/added_tokens.json')

In [18]:
# from tensorboard import notebook
# log_dir = f"{out_dir}/logs"
# notebook.start("--logdir {} --port 4000".format(log_dir))

## Inference

In [19]:
from transformers import (
    AutoModelForCausalLM, 
    logging, 
    pipeline,
    AutoTokenizer
)

In [20]:
model = AutoModelForCausalLM.from_pretrained('outputs/gpt2_alpaca_preprocssed_fn/best_model/')
tokenizer = AutoTokenizer.from_pretrained('outputs/gpt2_alpaca_preprocssed_fn/best_model/')

tokenizer.pad_token = tokenizer.eos_token

In [21]:
# logging.set_verbosity(logging.CRITICAL)

In [22]:
pipe = pipeline(task='text-generation', model=model, tokenizer=tokenizer)

In [23]:
prompt = """### Instruction:
Give three points for staying healthy.

### Input:


### Response:
"""

In [24]:
print(prompt)

### Instruction:
Give three points for staying healthy.

### Input:


### Response:
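
The prompt has to match the training template character for character, including the empty `### Input:` section, so building it with a small helper is less error-prone than typing it by hand. `build_prompt` below is a hypothetical helper, not part of any library.

```python
def build_prompt(instruction, input_text=''):
    """Build an inference prompt that stops right where the model should start answering."""
    return (
        '### Instruction:\n' + instruction
        + '\n\n### Input:\n' + input_text
        + '\n\n### Response:\n'
    )

prompt = build_prompt('Give three points for staying healthy.')
print(prompt)
```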



In [25]:
result = pipe(
    prompt, 
    max_new_tokens=128,
    # early_stopping=True,
    # num_beams=5
)
print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


### Instruction:
Give three points for staying healthy.

### Input:


### Response:
1. Stay hydrated. You can get an amount of fluids out of a bottle, but keep your temperature at 35 degrees Fahrenheit. 2. Eat whole fruit. 3. Drink coffee only when you are thirsty and drink cold water.

### Response:
4. Keep your calories low, especially if you have chronic diseases, such as diabetes, cancer, and osteoporosis. 5. Exercise and take breaks. 6. Exercise often helps to maintain the hydration of your body.

### Response:
7. Eat a variety of fruits and vegetables. 8. Try your favorite snacks and drink lots of water.
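
The model keeps emitting new `### Response:` sections until it runs out of tokens. A likely reason is that the training strings never end with an end-of-sequence token, so the model has no learned stopping point; appending `tokenizer.eos_token` inside the formatting function would be one way to test that assumption. A cheaper workaround is to trim the output after generation. The sketch below uses a hypothetical helper, `first_response`, and an invented sample string.

```python
def first_response(generated_text, marker='### Response:'):
    """Return only the text between the first response marker and the next section header."""
    # Everything before the first marker is the echoed prompt.
    _, _, after = generated_text.partition(marker)
    # Cut at the next '###' header, if the model kept generating.
    answer, _, _ = after.partition('\n###')
    return answer.strip()

sample = (
    '### Instruction:\nSay hi.\n\n### Input:\n\n\n'
    '### Response:\nHi there.\n\n### Response:\nHi again.'
)
print(first_response(sample))  # Hi there.
```

With the pipeline above, this would be called as `first_response(result[0]['generated_text'])`.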
