# Introduction

* Datasets:
    * https://huggingface.co/datasets/tatsu-lab/alpaca?row=1
* Models:
    * https://huggingface.co/distilbert/distilgpt2
 
***Note:*** *Here we format each sample manually instead of relying on the dataset's prebuilt `text` column. The formatting function is passed to `SFTTrainer` through its `formatting_func` argument.*

In [1]:
!pip install -U accelerate peft bitsandbytes transformers trl datasets



In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    TrainingArguments,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer

## Configuration

In [3]:
batch_size = 8
num_workers = os.cpu_count()
max_steps = 30000
bf16 = True
fp16 = False
gradient_accumulation_steps = 1
context_length = 512
logging_steps = 1000
save_steps = 1000
learning_rate = 0.0001
model_name = 'distilbert/distilgpt2'
out_dir = 'outputs/distilgpt2_alpaca_preprocssed_fn'
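
As a rough sanity check on these settings, the token budget per optimizer step can be worked out directly. This is a sketch that assumes a single device and that every sequence is filled to `context_length` (which holds when packing is enabled, as it is in the trainer cell further down); `tokens_per_step` and `total_tokens` are illustrative names, not part of the notebook.

```python
# Values copied from the configuration cell above.
batch_size = 8
gradient_accumulation_steps = 1
context_length = 512
max_steps = 30000

# Sequences seen per optimizer step (single device assumed).
sequences_per_step = batch_size * gradient_accumulation_steps

# Upper bound on tokens per step, assuming every sequence is full length.
tokens_per_step = sequences_per_step * context_length
total_tokens = tokens_per_step * max_steps

print(f"{tokens_per_step:,} tokens per step")
print(f"{total_tokens:,} tokens over {max_steps:,} steps")
```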

## Load Dataset

In [4]:
dataset = load_dataset('tatsu-lab/alpaca')

In [5]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 52002
    })
})


In [6]:
print(dataset['train']['text'][0])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.


In [7]:
print(dataset['train'][0])

{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}


In [8]:
full_dataset = dataset['train'].train_test_split(test_size=0.05, shuffle=True)
dataset_train = full_dataset['train']
dataset_valid = full_dataset['test']
 
print(dataset_train)
print(dataset_valid)

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 49401
})
Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 2601
})
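
The row counts above can be reproduced by hand. The sketch below assumes the fractional test size is rounded up and the training size is rounded down, which is consistent with the printed numbers; it does not call the `datasets` library.

```python
import math

n_samples = 52002   # rows in the original train split
test_size = 0.05    # fraction passed to train_test_split

# Assumed rounding: test rows rounded up, train rows rounded down.
n_valid = math.ceil(test_size * n_samples)
n_train = math.floor((1 - test_size) * n_samples)

print(n_train, n_valid, n_train + n_valid)
```

Because `shuffle=True` is used without a seed, the rows that land in each split can differ between runs even though the counts stay the same.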


In [9]:
for i in range(10):
    print(dataset_train[i])
    print('****************')
    
    text = dataset_train[i]
    instruction = '### Instruction:\n' + text['instruction']
    inputs = '\n\n### Input:\n' + text['input']
    response = '\n\n### Response:\n' + text['output']
    
    final_text = instruction + inputs + response
    print(final_text)
    print('#'*50)

{'instruction': 'How does artificial intelligence affect human employment?', 'input': '', 'output': 'Artificial intelligence has had a profound impact on human employment. It has enabled the automation of many tasks and enabled businesses to utilize data to make more informed decisions. Additionally, AI has made it possible for employees to work remotely and has enabled companies to hire a more diverse workforce. While AI has opened up new job opportunities, it also has the potential to replace human labor in certain sectors, leading to job losses and displacing certain roles.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nHow does artificial intelligence affect human employment?\n\n### Response:\nArtificial intelligence has had a profound impact on human employment. It has enabled the automation of many tasks and enabled businesses to utilize data to make more informed decisions. Additionally, A

In [10]:
# Batched variant (takes a dict of lists, returns a list of strings), kept for reference:
# def preprocess_function(example):
#     """
#     Formatting function returning a list of samples (kind of necessary for SFT API).
#     """
#     output_texts = []
#     for i in range(len(example['instruction'])):
#         instruction = '### Instruction:\n' + example['instruction'][i]
#         inputs = '\n\n### Input:\n' + example['input'][i]
#         response = '\n\n### Response:\n' + example['output'][i]
        
#         final_text = instruction + inputs + response
#         output_texts.append(final_text)
#     return output_texts

def preprocess_function(example):
    """
    Format a single sample (a dict of strings) into one prompt string.
    Passed to SFTTrainer as `formatting_func`.
    """
    text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    return text
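
A quick way to confirm the template is to run the formatting function on one hand-written sample. The sketch below redefines the function so it runs on its own; the `sample` dict is a shortened, made-up example in the dataset's schema, not an actual row.

```python
def preprocess_function(example):
    """Format a single sample into one prompt string."""
    return (
        f"### Instruction:\n{example['instruction']}"
        f"\n\n### Input:\n{example['input']}"
        f"\n\n### Response:\n{example['output']}"
    )

# Hypothetical sample with an empty `input` field, as is common in Alpaca.
sample = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet.",
}

formatted = preprocess_function(sample)
print(formatted)
```

Note that an empty `input` still produces the `### Input:` header followed by blank lines, which is the pattern visible in the packed batches printed later.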

## Model

In [11]:
if bf16:
    model = AutoModelForCausalLM.from_pretrained(model_name).to(dtype=torch.bfloat16)
else:
    model = AutoModelForCausalLM.from_pretrained(model_name)

In [12]:
print(model)
# Total parameters and trainable parameters.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
81,912,576 total parameters.
81,912,576 training parameters.
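
The total can be checked by hand from the layer shapes. This sketch assumes the standard GPT-2 layout, which the printed `Conv1D()` modules do not show explicitly: `c_attn` projects 768 to 3 x 768, the MLP expands to 4 x 768, every projection has a bias, and `lm_head` shares its weight with `wte` so it is not counted twice.

```python
# Sizes read from the printed model summary.
vocab_size, n_positions, d_model, n_layers = 50257, 1024, 768, 6

def linear(n_in, n_out):
    """Parameter count of a projection with bias."""
    return n_in * n_out + n_out

layer_norm = 2 * d_model  # scale + shift

# One transformer block (assumed GPT-2 layout).
block = (
    layer_norm
    + linear(d_model, 3 * d_model)   # attn.c_attn (fused q, k, v)
    + linear(d_model, d_model)       # attn.c_proj
    + layer_norm
    + linear(d_model, 4 * d_model)   # mlp.c_fc
    + linear(4 * d_model, d_model)   # mlp.c_proj
)

embeddings = vocab_size * d_model + n_positions * d_model  # wte + wpe

# lm_head is assumed tied to wte, so it adds nothing here.
total = embeddings + n_layers * block + layer_norm  # + final ln_f
print(f"{total:,}")
```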


## Tokenizer

In [13]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True,
    use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token

## Training

In [14]:
training_args = TrainingArguments(
    output_dir=f"{out_dir}/logs",
    evaluation_strategy='steps',
    weight_decay=0.01,
    load_best_model_at_end=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_strategy='steps',
    save_strategy='steps',
    logging_steps=logging_steps,
    save_steps=save_steps,
    save_total_limit=2,
    bf16=bf16,
    fp16=fp16,
    report_to='tensorboard',
    max_steps=max_steps,
    dataloader_num_workers=num_workers,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    lr_scheduler_type='constant',
)

In [15]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_train,
    eval_dataset=dataset_valid,
    max_seq_length=context_length,
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=preprocess_function,
    packing=True
)

Generating train split: 0 examples [00:00, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1495 > 1024). Running this sequence through the model will result in indexing errors





In [16]:
dataloader = trainer.get_train_dataloader()
for i, sample in enumerate(dataloader):
    print(tokenizer.decode(sample['input_ids'][0]))
    print('#'*50)
    if i == 5:
        break

 is composed of molecules made of two hydrogen atoms and one oxygen atom and has a lower density than oil. Conversely, oil is composed of molecules of carbon and hydrogen and has a higher density than water. Additionally, water is an excellent conductor of electricity, whereas oil is an insulator of electricity. Water also has a low viscosity, whereas oil is highly viscous.<|endoftext|>### Instruction:
Write fifty words in French introducing yourself.

### Input:


### Response:
Je m'appelle Julie. J'ai 24 ans et je suis étudiante en commerce. J'adore voyager et j'ai visité plusieurs pays. J'aime lire et passer du temps avec mes amis. J'apprends le français depuis deux ans et je progresse petit à petit. J'aime écouter de la musique, pratiquer le yoga et aller à la plage. Je m'efforce de vivre ma vie à fond et de profiter chaque jour.<|endoftext|>### Instruction:
Provide specific examples of the nitrogen cycle.

### Input:


### Response:
The nitrogen cycle is the process of nitrogen be
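
The decoded batch above starts in the middle of one sample and runs through several others separated by `<|endoftext|>`, which is the effect of `packing=True`. The following is a simplified illustration of the idea using toy integer token IDs. It is not TRL's implementation; `pack`, `eos_id`, and `block_size` are names chosen for this sketch.

```python
from itertools import chain

def pack(tokenized_samples, eos_id, block_size):
    """Concatenate samples (each followed by EOS) and cut fixed-size blocks."""
    stream = list(chain.from_iterable(ids + [eos_id] for ids in tokenized_samples))
    n_blocks = len(stream) // block_size  # any trailing remainder is dropped
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Three toy "tokenized" samples of different lengths.
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack(samples, eos_id=0, block_size=4)

for block in blocks:
    print(block)
```

In this toy run the second block begins with the end of one sample and ends with the start of the next, mirroring how the real batches cut across sample boundaries.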

In [17]:
history = trainer.train()

Step,Training Loss,Validation Loss
1000,2.4101,2.217997
2000,2.2885,2.176106
3000,2.2518,2.158459
4000,2.2289,2.146298
5000,2.2172,2.140784
6000,2.2037,2.134933
7000,2.1984,2.131148
8000,2.1932,2.127623
9000,2.1872,2.12482
10000,2.1845,2.122992
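
The validation losses can be converted to perplexity for an easier reading. This sketch assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training; the two values are copied from the first and last rows of the table above.

```python
import math

# Validation loss at selected steps, copied from the log above.
val_loss = {1000: 2.217997, 10000: 2.122992}

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexity = {step: math.exp(loss) for step, loss in val_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step}: perplexity {ppl:.2f}")
```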


Checkpoint destination directory outputs/distilgpt2_alpaca_preprocssed_fn/logs/checkpoint-6000 already exists and is non-empty. Saving will proceed but saved results may be invalid.
There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


In [26]:
model.save_pretrained(f"{out_dir}/best_model")
tokenizer.save_pretrained(f"{out_dir}/best_model")

Non-default generation parameters: {'max_length': 50, 'do_sample': True}


('outputs/distilgpt2_alpaca_preprocssed_fn/best_model/tokenizer_config.json',
 'outputs/distilgpt2_alpaca_preprocssed_fn/best_model/special_tokens_map.json',
 'outputs/distilgpt2_alpaca_preprocssed_fn/best_model/vocab.json',
 'outputs/distilgpt2_alpaca_preprocssed_fn/best_model/merges.txt',
 'outputs/distilgpt2_alpaca_preprocssed_fn/best_model/added_tokens.json',
 'outputs/distilgpt2_alpaca_preprocssed_fn/best_model/tokenizer.json')

## Inference

In [27]:
from transformers import (
    AutoModelForCausalLM, 
    logging, 
    pipeline,
    AutoTokenizer
)

In [28]:
model = AutoModelForCausalLM.from_pretrained('outputs/distilgpt2_alpaca_preprocssed_fn/best_model/')
tokenizer = AutoTokenizer.from_pretrained('outputs/distilgpt2_alpaca_preprocssed_fn/best_model/')

tokenizer.pad_token = tokenizer.eos_token

In [29]:
# logging.set_verbosity(logging.CRITICAL)

In [30]:
pipe = pipeline(task='text-generation', model=model, tokenizer=tokenizer, max_length=256)

In [35]:
prompt = """### Instruction:
Give three tips to stay healthy.
### Input:


### Response:
"""

In [36]:
print(prompt)

### Instruction:
Give three tips to stay healthy.
### Input:


### Response:
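
The training template in `preprocess_function` separates the instruction from `### Input:` with a blank line, whereas the handwritten prompt above does not. A minimal sketch of a helper that rebuilds the prompt from the same template, so that inference prompts match the training format, could look like this; `build_prompt` is a hypothetical helper, not part of the original notebook.

```python
def build_prompt(instruction, input_text=""):
    """Build an inference prompt that mirrors the training template."""
    return (
        f"### Instruction:\n{instruction}"
        f"\n\n### Input:\n{input_text}"
        f"\n\n### Response:\n"
    )

templated_prompt = build_prompt("Give three tips to stay healthy.")
print(templated_prompt)
```

Whether the extra blank line changes the generations noticeably for a model of this size has not been measured here.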



In [37]:
result = pipe(
    prompt
)
print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


### Instruction:
Give three tips to stay healthy.
### Input:


### Response:
1. Stay organized. 
2. Keep the body healthy and avoid taking much time to eat fruits or vegetables.
3. Practice eating regular and moderate fruits.
4. Stick to a strict diet consisting of low-fat foods, fruits and vegetables.
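
The pipeline returns the prompt together with the continuation. To keep only the model's answer, the text after the response header can be split off. This is a sketch; `extract_response` and the shortened `generated` string are illustrative and not part of the original notebook.

```python
def extract_response(generated_text, marker="### Response:\n"):
    """Return the text after the response header, or the full text if absent."""
    _, sep, tail = generated_text.partition(marker)
    return tail.strip() if sep else generated_text.strip()

# Shortened stand-in for result[0]['generated_text'].
generated = (
    "### Instruction:\nGive three tips to stay healthy.\n### Input:\n\n\n"
    "### Response:\n1. Stay organized. \n"
)

answer = extract_response(generated)
print(answer)
```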
