# Introduction

* Datasets:
    * https://huggingface.co/datasets/tatsu-lab/alpaca?row=1
* Models:
    * https://huggingface.co/distilbert/distilgpt2
 
***Note:*** *Here we preprocess the input manually before feeding it to the model: the formatting is done by a function passed as `formatting_func` to the SFT API (`SFTTrainer`).*

In [1]:
!pip install -U accelerate peft bitsandbytes transformers trl datasets

Collecting peft
  Downloading peft-0.9.0-py3-none-any.whl.metadata (13 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl.metadata (9.9 kB)
Collecting transformers
  Downloading transformers-4.38.2-py3-none-any.whl.metadata (130 kB)
Collecting trl
  Downloading trl-0.7.11-py3-none-any.whl.metadata (10 kB)
Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl.metadata (20 kB)
Collecting tyro>=0.5.11 (from trl)
  Downloading tyro-0.7.3-py3-none-any.whl.metadata (7.7 kB)
Collecting pyarrow>=12.0.0 (from datasets)
  Downloading pyarrow-15.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Collecting shtab>=1.5.6 (from tyro>=0.5.11->trl)
  Downloading shtab-1.7.0-py3-none-any.whl.metadata (7

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    TrainingArguments,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer

2024-03-07 07:13:44.330644: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-07 07:13:44.330749: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-07 07:13:44.473883: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Configuration

In [3]:
batch_size = 32
num_workers = os.cpu_count()
max_steps = 1000
bf16 = False
fp16 = True
gradient_accumulation_steps = 32
context_length = 256
logging_steps = 100
save_steps = 100
learning_rate = 0.00001
model_name = 'distilbert/distilgpt2'
out_dir = 'outputs/distilgpt2_alpaca_preprocssed_fn'

## Load Dataset

In [4]:
dataset = load_dataset('tatsu-lab/alpaca')


In [5]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 52002
    })
})


In [6]:
print(dataset['train']['text'][0])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.


In [7]:
print(dataset['train'][0])

{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}


In [8]:
full_dataset = dataset['train'].train_test_split(test_size=0.05, shuffle=True)
dataset_train = full_dataset['train']
dataset_valid = full_dataset['test']
 
print(dataset_train)
print(dataset_valid)

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 49401
})
Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 2601
})


In [9]:
for i in range(10):
    print(dataset_train[i])
    print('****************')
    
    text = dataset_train[i]
    instruction = '### Instruction:\n' + text['instruction']
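    # Only the exact string 'No input.' and the empty string are skipped;
    # variants such as 'No input' (no period) are kept as real input.
    if text['input'] != 'No input.' and text['input'] != '':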
        inputs = '\n### Input:\n' + text['input']
    else:
        inputs = ''
    response = '\n### Response:\n' + text['output']
    
    final_text = instruction + inputs + response
    print(final_text)
    print('#'*50)

{'instruction': 'Construct a sentence using similes. Output the sentence.', 'input': 'No input', 'output': 'His plans were as fragile as a house of cards.', 'text': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nConstruct a sentence using similes. Output the sentence.\n\n### Input:\nNo input\n\n### Response:\nHis plans were as fragile as a house of cards.'}
****************
### Instruction:
Construct a sentence using similes. Output the sentence.
### Input:
No input
### Response:
His plans were as fragile as a house of cards.
##################################################
{'instruction': 'Find the average age of the participants in this survey.', 'input': '25, 27, 32, 29, 21', 'output': 'The average age of the participants in this survey is 26.4.', 'text': 'Below is an instruction that describes a task, paired with an input that provides further conte

In [10]:
def preprocess_function(example):
    """
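    Format a batch of examples into prompt strings.

    `example` is a batch (a dict of lists); the function returns a list with
    one formatted string per sample, as the SFT API expects from `formatting_func`.
    """
    output_texts = []
    for i in range(len(example['instruction'])):
        instruction = '### Instruction:\n' + example['instruction'][i]
        # Only the exact string 'No input.' and the empty string are skipped;
        # variants such as 'No input' (no period) are kept as real input.
        if example['input'][i] != 'No input.' and example['input'][i] != '':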
            inputs = '\n### Input:\n' + example['input'][i]
        else:
            inputs = ''
        response = '\n### Response:\n' + example['output'][i]
        
        final_text = instruction + inputs + response
        output_texts.append(final_text)
    return output_texts
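
The sample output printed earlier shows a row whose `input` is `'No input'` (no period) passing the `'No input.'` check and ending up in the prompt. A minimal sketch of a more tolerant variant follows. The names `EMPTY_INPUT_MARKERS`, `is_empty_input` and `format_batch`, and the set of placeholder spellings, are assumptions for illustration; they are not part of the original notebook or of the TRL API.

```python
# Hypothetical, more tolerant variant of the formatting function.
# The placeholder spellings below are an assumption; extend them as needed.
EMPTY_INPUT_MARKERS = {"", "no input", "no input."}


def is_empty_input(value):
    """Return True when the input field carries no real content."""
    return value.strip().lower() in EMPTY_INPUT_MARKERS


def format_batch(batch):
    """Turn a batch (dict of lists) into a list of prompt strings."""
    texts = []
    for instruction, inp, output in zip(
        batch["instruction"], batch["input"], batch["output"]
    ):
        text = "### Instruction:\n" + instruction
        if not is_empty_input(inp):
            text += "\n### Input:\n" + inp
        text += "\n### Response:\n" + output
        texts.append(text)
    return texts


sample = {
    "instruction": ["Construct a sentence using similes. Output the sentence."],
    "input": ["No input"],
    "output": ["His plans were as fragile as a house of cards."],
}
print(format_batch(sample)[0])
```

Such a function could be passed as `formatting_func=format_batch`. Doing so changes the training text slightly, so the loss values would not match the ones reported in this notebook.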

## Model

In [11]:
if bf16:
    model = AutoModelForCausalLM.from_pretrained(model_name).to(dtype=torch.bfloat16)
else:
    model = AutoModelForCausalLM.from_pretrained(model_name)


In [12]:
print(model)
# Total parameters and trainable parameters.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
81,912,576 total parameters.
81,912,576 training parameters.
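
The total printed above can be reproduced by hand. The sketch below assumes the standard GPT-2 layer shapes (a fused query/key/value projection of width 3 × 768 and an MLP hidden width of 4 × 768), which the `Conv1D()` lines in the printout do not show. It also assumes that `lm_head` shares its weight with `wte` and therefore adds no parameters of its own.

```python
# Back-of-the-envelope parameter count for distilgpt2.
# Layer shapes are assumed from the standard GPT-2 architecture.
vocab, positions, d_model, n_layers = 50257, 1024, 768, 6


def linear(n_in, n_out):
    """Weights plus bias of one dense (Conv1D) layer."""
    return n_in * n_out + n_out


layer_norm = 2 * d_model  # scale and shift
block = (
    layer_norm                       # ln_1
    + linear(d_model, 3 * d_model)   # attn.c_attn (fused q, k, v)
    + linear(d_model, d_model)       # attn.c_proj
    + layer_norm                     # ln_2
    + linear(d_model, 4 * d_model)   # mlp.c_fc
    + linear(4 * d_model, d_model)   # mlp.c_proj
)
total = (
    vocab * d_model        # wte (assumed shared with lm_head)
    + positions * d_model  # wpe
    + n_layers * block
    + layer_norm           # ln_f
)
print(f"{total:,} parameters")  # 81,912,576 parameters
```

The match with the printed total supports the weight-sharing assumption: counting `lm_head` separately would add another 50257 × 768 parameters.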


## Tokenizer

In [13]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True,
    use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token


## Training

In [14]:
training_args = TrainingArguments(
    output_dir=f"{out_dir}/logs",
    evaluation_strategy='steps',
    weight_decay=0.01,
    load_best_model_at_end=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_strategy='steps',
    save_strategy='steps',
    logging_steps=logging_steps,
    save_steps=save_steps,
    save_total_limit=2,
    bf16=bf16,
    fp16=fp16,
    report_to='tensorboard',
    max_steps=max_steps,
    dataloader_num_workers=num_workers,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    lr_scheduler_type='constant',
)
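
These arguments interact with the values from the Configuration section. As a rough worked example, assuming a single device (the notebook does not state the device count) and the 49,401-row training split printed earlier:

```python
# Effective batch size and approximate number of passes over the data.
# Assumes one device; with more devices the effective batch scales up.
batch_size = 32
gradient_accumulation_steps = 32
max_steps = 1000
num_train_rows = 49401  # size of the training split shown earlier

effective_batch = batch_size * gradient_accumulation_steps
samples_seen = effective_batch * max_steps
approx_epochs = samples_seen / num_train_rows

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen:,}")
print(f"approximate epochs: {approx_epochs:.1f}")
```

Under these assumptions, 1,000 optimizer steps correspond to roughly twenty passes over the training split.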

In [15]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_train,
    eval_dataset=dataset_valid,
    # peft_config=peft_params,
    # dataset_text_field='text',
    max_seq_length=context_length,
    tokenizer=tokenizer,
    args=training_args,
    # packing=False,
    formatting_func=preprocess_function
)


In [16]:
history = trainer.train()

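| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 2.8248 | 2.439374 |
| 200 | 2.4775 | 2.302728 |
| 300 | 2.3993 | 2.263635 |
| 400 | 2.3571 | 2.237714 |
| 500 | 2.3265 | 2.217748 |
| 600 | 2.2995 | 2.201749 |
| 700 | 2.2781 | 2.188146 |
| 800 | 2.259 | 2.175823 |
| 900 | 2.24 | 2.166024 |
| 1000 | 2.2239 | 2.156904 |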
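
Assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for causal language models in `transformers`), it can be converted to perplexity with an exponential. The snippet below applies this to the first and last validation losses from the log above.

```python
import math

# Validation losses copied from the training log above.
val_loss = {100: 2.439374, 1000: 2.156904}

# Perplexity = exp(mean cross-entropy), assuming the loss is in nats per token.
perplexity = {step: math.exp(loss) for step, loss in val_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step}: validation perplexity ~ {ppl:.2f}")
```

Perplexity is easier to compare across runs than raw loss, but it is only meaningful between models that share the same tokenizer and prompt format.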


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


In [17]:
model.save_pretrained(f"{out_dir}/best_model")
tokenizer.save_pretrained(f"{out_dir}/best_model")

('outputs/distilgpt2_alpaca_preprocssed_fn/best_model/tokenizer_config.json',
 'outputs/distilgpt2_alpaca_preprocssed_fn/best_model/special_tokens_map.json',
 'outputs/distilgpt2_alpaca_preprocssed_fn/best_model/vocab.json',
 'outputs/distilgpt2_alpaca_preprocssed_fn/best_model/merges.txt',
 'outputs/distilgpt2_alpaca_preprocssed_fn/best_model/added_tokens.json')

In [18]:
# from tensorboard import notebook
# log_dir = f"{out_dir}/logs"
# notebook.start("--logdir {} --port 4000".format(log_dir))

## Inference

In [1]:
from transformers import (
    AutoModelForCausalLM, 
    logging, 
    pipeline,
    AutoTokenizer
)

In [2]:
model = AutoModelForCausalLM.from_pretrained('outputs/distilgpt2_alpaca_preprocssed_fn/best_model/')
tokenizer = AutoTokenizer.from_pretrained('outputs/distilgpt2_alpaca_preprocssed_fn/best_model/')

tokenizer.pad_token = tokenizer.eos_token

In [3]:
# logging.set_verbosity(logging.CRITICAL)

In [4]:
pipe = pipeline(task='text-generation', model=model, tokenizer=tokenizer)

In [13]:
prompt = """### Instruction:
Write a short essay on rain water harvesting.
### Response:
"""

In [14]:
print(prompt)

### Instruction:
Write a short essay on rain water harvesting.
### Response:



In [21]:
result = pipe(
    prompt, 
    max_new_tokens=128,
    # early_stopping=True,
    num_beams=1
)
print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


### Instruction:
Write a short essay on rain water harvesting.
### Response:
This essay is about rain water harvesting in an interesting and relevant way. It outlines the process of harvesting rainwater on various stages in order to conserve and protect the environment.

The process of harvesting rainwater is often repeated using large amounts of water. In order to successfully remove large amounts of rain from a large area, the rainwater must first drain the soil and then turn it, and then the soil must replenish the water during the process. During this process, it is possible to use water into irrigation and agricultural irrigation, as well as other methods such as irrigation and irrigation to grow and restore water. As the rainwater is
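
The pipeline returns the prompt together with the continuation. A small helper, sketched here under the hypothetical name `extract_response`, could strip everything up to the `### Response:` marker so that only the model's answer remains. It relies only on the prompt format used in this notebook.

```python
RESPONSE_MARKER = "### Response:\n"


def extract_response(generated_text):
    """Return the text after the first response marker, or the full text if absent."""
    _, found, tail = generated_text.partition(RESPONSE_MARKER)
    return tail.strip() if found else generated_text.strip()


example = (
    "### Instruction:\n"
    "Write a short essay on rain water harvesting.\n"
    "### Response:\n"
    "This essay is about rain water harvesting."
)
print(extract_response(example))
```

With the notebook's objects this would be called as `extract_response(result[0]['generated_text'])`. The text-generation pipeline also accepts `return_full_text=False`, which should return only the continuation.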


In [16]:
# !zip -r /kaggle/working/outputs /kaggle/working/outputs