# Introduction

This notebook follows the typical language modeling approach: the model sees the Q&A samples and predicts the next word. The formatting function (`preprocess_function`) simply builds the samples one after the other, and the model is trained to predict the next word in each of them.
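As a rough illustration of next-word (next-token) prediction, the sketch below lines up inputs and targets for a causal language model: the target at each position is simply the token that follows it. The token ids are made up for the example.

```python
# Toy token ids standing in for a tokenized Q&A sample (hypothetical values).
token_ids = [101, 7, 42, 9, 13, 102]

# For causal language modeling, position t is trained to predict token t + 1.
inputs = token_ids[:-1]
targets = token_ids[1:]

for seen, expected in zip(inputs, targets):
    print(f"after seeing {seen:>3} -> predict {expected}")
```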

## Setup

In [1]:
!pip install -U transformers datasets bitsandbytes evaluate accelerate trl



## Imports

In [2]:
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)
from trl import SFTTrainer

import os
import torch
from peft import LoraConfig

## Configuration

In [3]:
batch_size = 4
num_workers = os.cpu_count()
max_steps = 50000
bf16 = False
fp16 = False
gradient_accumulation_steps = 1
context_length = 512
logging_steps = 1000
save_steps = 1000
learning_rate = 0.00001
model_name = 'distilbert/distilgpt2'
out_dir = 'distilgpt2_squad_language_modeling_assistant/best_model_run_2'
logs = 'distilgpt2_squad_language_modeling_assistant/logs_run_2'

## Dataset Preparation

In [4]:
train_raw = load_dataset('squad', split='train')
valid_raw = load_dataset('squad', split='validation')
print(train_raw)
print(valid_raw)

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10570
})


In [5]:
print(train_raw[0])
print('-' * 50)
print(train_raw[1])
print('-' * 50)
print('-' * 50)
print(train_raw[2])
print('-' * 50)
print('-' * 50)
print(train_raw[3])
print('-' * 50)
print('-' * 50)
print(train_raw[4])

{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
--------------------------------------------------
{'i

## Tokenization

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [7]:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
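GPT-2 style tokenizers ship without a padding token, which is why the EOS token is reused above. The sketch below shows, with plain lists, what right padding and the matching attention mask look like. `pad_right` is a hypothetical helper written for this illustration, not a tokenizer API; the id 50256 is the EOS id reported later in the generation warning.

```python
EOS_ID = 50256  # GPT-2 end-of-text id, reused here as the padding id

def pad_right(batch, pad_id=EOS_ID):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    padded, masks = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)       # padding goes on the right
        masks.append([1] * len(ids) + [0] * n_pad)  # 0 marks padded positions
    return padded, masks

padded, masks = pad_right([[11, 12, 13], [21]])
print(padded)
print(masks)
```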

In [8]:
text = '### context: ' + train_raw[0]['context'] + '\n' + \
'### question: ' + train_raw[0]['question'] + '\n' + \
'### answer: ' + train_raw[0]['answers']['text'][0]

In [9]:
print(train_raw[0]['answers']['text'][0])

Saint Bernadette Soubirous


In [10]:
print(text)

### context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
### question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
### answer: Saint Bernadette Soubirous


In [11]:
def preprocess_function(example):
    """
    Formatting function for the SFT API: takes a batch of examples and
    returns a list of formatted text samples, one per example.
    """
    output_texts = []
    for i in range(len(example['context'])):
        question_context = f"### context: {example['context'][i]}\n### question: {example['question'][i]}\n"
        # Only the first reference answer is used.
        answers = f"### answer: {example['answers'][i]['text'][0]}"
        output_texts.append(question_context + answers)
    return output_texts
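To see what the formatting function produces, here is a small sketch that runs the same logic on a made-up batch with the SQuAD column layout. `format_batch` and `toy_batch` are names invented for this example; the function body repeats the logic of `preprocess_function` so that the block runs on its own.

```python
def format_batch(example):
    # Same logic as preprocess_function, repeated so this block is self-contained.
    texts = []
    for i in range(len(example["context"])):
        texts.append(
            f"### context: {example['context'][i]}\n"
            f"### question: {example['question'][i]}\n"
            f"### answer: {example['answers'][i]['text'][0]}"
        )
    return texts

# A made-up batch: each column is a list, as in a batched dataset call.
toy_batch = {
    "context": ["The sky is blue.", "Water boils at 100 C at sea level."],
    "question": ["What color is the sky?", "At what temperature does water boil?"],
    "answers": [
        {"text": ["blue"], "answer_start": [11]},
        {"text": ["100 C"], "answer_start": [15]},
    ],
}

texts = format_batch(toy_batch)
print(texts[0])
```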

## Model

In [12]:
compute_dtype = torch.float16

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

In [13]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map={"": 0}
)
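# Disable the key/value cache during training.
model.config.use_cache = False
# pretraining_tp is a Llama-style config field; it likely has no effect on GPT-2.
model.config.pretraining_tp = 1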

In [14]:
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Linear4bit(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear4bit(in_features=768, out_features=768, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Linear4bit(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear4bit(in_features=3072, out_features=768, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=

In [15]:
# Total parameters and trainable parameters.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

60,678,912 total parameters.
39,403,776 training parameters.
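One way to account for the two numbers above is sketched below. It assumes that the output head shares the token-embedding matrix (as in GPT-2) and that the 4-bit linear weights are stored packed, two weights per element, so that `numel()` reports half of their logical size. Both points are assumptions of this sketch, not something the notebook states.

```python
# Dimensions read off the printed model: 6 blocks, hidden size 768.
vocab, positions, hidden, n_blocks = 50257, 1024, 768, 6

embeddings = vocab * hidden + positions * hidden  # wte + wpe
layer_norms = (2 * n_blocks + 1) * 2 * hidden     # ln_1 and ln_2 per block, plus ln_f

# (in_features, out_features) of the four Linear4bit layers in each block.
linear_shapes = [(768, 2304), (768, 768), (768, 3072), (3072, 768)]
linear_weights = n_blocks * sum(i * o for i, o in linear_shapes)
linear_biases = n_blocks * sum(o for _, o in linear_shapes)

# Assumption: two 4-bit weights are packed into each stored element.
packed_weights = linear_weights // 2

not_quantized = embeddings + layer_norms
total = not_quantized + packed_weights + linear_biases
print(f"{not_quantized:,} embedding and LayerNorm parameters")
print(f"{total:,} stored elements in total")
```

Under these assumptions the two counts match the printed values, which suggests that the trainable figure consists of the embeddings and LayerNorms, and that the total undercounts the quantized linear weights by a factor of two.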


## Training

In [16]:
peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
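To get a feel for what `r=64` and `lora_alpha=16` imply, the sketch below estimates the adapter size and the LoRA scaling factor. It assumes that only the fused attention projection `c_attn` is adapted in each block, since no `target_modules` are given in the config; the actual set of adapted layers depends on the PEFT defaults for this architecture.

```python
r, lora_alpha = 64, 16
hidden, n_blocks = 768, 6

def lora_params(in_features, out_features, rank):
    """LoRA adds two low-rank matrices per layer: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)

# Assumption: only c_attn (768 -> 2304) is adapted in each of the 6 blocks.
per_block = lora_params(hidden, 3 * hidden, r)
adapter_total = n_blocks * per_block

# The low-rank update is scaled by lora_alpha / r before it is added to the layer output.
scaling = lora_alpha / r
print(f"{adapter_total:,} adapter parameters, scaling = {scaling}")
```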

In [17]:
training_args = TrainingArguments(
    output_dir=logs,
    evaluation_strategy='steps',  # renamed to eval_strategy in newer transformers releases
    weight_decay=0.01,
    load_best_model_at_end=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_strategy='steps',
    save_strategy='steps',
    logging_steps=logging_steps,
    save_steps=save_steps,
    save_total_limit=2,
    bf16=bf16,
    fp16=fp16,
    report_to='tensorboard',
    max_steps=max_steps,
    dataloader_num_workers=num_workers,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    optim="paged_adamw_32bit",
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    group_by_length=True
)


# training_args = TrainingArguments(
#     output_dir=logs,
#     num_train_epochs=1,
#     per_device_train_batch_size=batch_size,
#     gradient_accumulation_steps=gradient_accumulation_steps,
#     optim="paged_adamw_32bit",
#     save_steps=save_steps,
#     logging_steps=logging_steps,
#     learning_rate=learning_rate,
#     weight_decay=0.001,
#     fp16=fp16,
#     bf16=bf16,
#     max_grad_norm=0.3,
#     max_steps=-1,
#     warmup_ratio=0.03,
#     group_by_length=True,
#     lr_scheduler_type="constant",
#     report_to="tensorboard",
#     save_total_limit=2
# )
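The step-based settings above can be translated into more familiar quantities with a little arithmetic. The sketch below assumes a single device (so one optimizer step consumes `batch_size * gradient_accumulation_steps` samples) and uses the train split size printed earlier.

```python
import math

train_rows = 87_599  # size of the SQuAD train split printed earlier
batch_size = 4
gradient_accumulation_steps = 1
max_steps = 50_000
warmup_ratio = 0.03

# Assumption: a single device, so this many samples are consumed per optimizer step.
samples_per_step = batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(train_rows / samples_per_step)
approx_epochs = max_steps * samples_per_step / train_rows
warmup_steps = round(max_steps * warmup_ratio)

print(f"{steps_per_epoch:,} steps per epoch")
print(f"about {approx_epochs:.2f} epochs in {max_steps:,} steps")
print(f"{warmup_steps:,} warmup steps")
```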

In [18]:
trainer = SFTTrainer(
    model=model,
    train_dataset=train_raw,
    eval_dataset=valid_raw,
    peft_config=peft_params,
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=preprocess_function,
    packing=False
)




In [19]:
history = trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

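| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1000 | 3.9317 | 3.696845 |
| 2000 | 3.7474 | 3.494168 |
| 3000 | 3.6019 | 3.439988 |
| 4000 | 3.5715 | 3.425056 |
| 5000 | 3.5627 | 3.415097 |
| 6000 | 3.5471 | 3.40799 |
| 7000 | 3.5407 | 3.401382 |
| 8000 | 3.5332 | 3.395782 |
| 9000 | 3.5289 | 3.391074 |
| 10000 | 3.5306 | 3.387321 |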
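A common way to read these losses is to convert them to perplexity. The sketch below does so for three of the logged validation losses, assuming the reported value is the mean per-token cross-entropy in nats.

```python
import math

# Validation losses copied from the training log (step -> loss).
val_loss = {1000: 3.696845, 5000: 3.415097, 10000: 3.387321}

# Assumption: the loss is mean per-token cross-entropy in nats, so perplexity = exp(loss).
perplexity = {step: math.exp(loss) for step, loss in val_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step:>5}: perplexity ~ {ppl:.1f}")
```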



In [21]:
trainer.model.save_pretrained(out_dir)
trainer.tokenizer.save_pretrained(out_dir)

('distilgpt2_squad_language_modeling_assistant/best_model_run_2/tokenizer_config.json',
 'distilgpt2_squad_language_modeling_assistant/best_model_run_2/special_tokens_map.json',
 'distilgpt2_squad_language_modeling_assistant/best_model_run_2/vocab.json',
 'distilgpt2_squad_language_modeling_assistant/best_model_run_2/merges.txt',
 'distilgpt2_squad_language_modeling_assistant/best_model_run_2/added_tokens.json',
 'distilgpt2_squad_language_modeling_assistant/best_model_run_2/tokenizer.json')

## Inference

In [22]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

import torch

In [23]:
model = AutoModelForCausalLM.from_pretrained(
    out_dir,
    device_map='cuda'
)
tokenizer = AutoTokenizer.from_pretrained(out_dir)



In [26]:
prompt = """### context: Convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns feature engineering by itself via filters (or kernel) optimization. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections.[1][2] For example, for each neuron in the fully-connected layer 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution (or cross-correlation) kernels,[3][4] only 25 neurons are required to process 5x5-sized tiles.[5][6] Higher-layer features are extracted from wider context windows, compared to lower-layer features.

They have applications in:

image and video recognition,[7]
recommender systems,[8]
image classification,
image segmentation,
medical image analysis,
natural language processing,[9]
brain–computer interfaces,[10] and
financial time series.[11]
CNNs are also known as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN), based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps.[12][13] Counter-intuitively, most convolutional neural networks are not invariant to translation, due to the downsampling operation they apply to the input.[14]

Feed-forward neural networks are usually fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The "full connectivity" of these networks make them prone to overfitting data. Typical ways of regularization, or preventing overfitting, include: penalizing parameters during training (such as weight decay) or trimming connectivity (skipped connections, dropout, etc.) Robust datasets also increases the probability that CNNs will learn the generalized principles that characterize a given dataset rather than the biases of a poorly-populated set.[15]

Convolutional networks were inspired by biological processes[16][17][18][19] in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are hand-engineered. This independence from prior knowledge and human intervention in feature extraction is a major advantage. \n
### question: What are the applications of CNNs?"""

In [27]:
print(prompt)

### context: Convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns feature engineering by itself via filters (or kernel) optimization. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections.[1][2] For example, for each neuron in the fully-connected layer 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution (or cross-correlation) kernels,[3][4] only 25 neurons are required to process 5x5-sized tiles.[5][6] Higher-layer features are extracted from wider context windows, compared to lower-layer features.

They have applications in:

image and video recognition,[7]
recommender systems,[8]
image classification,
image segmentation,
medical image analysis,
natural language processing,[9]
brain–computer interfaces,[10] and
financial time series.[11]
CNNs are also known as 

In [28]:
input_tokens = tokenizer(
    prompt, 
    return_tensors='pt'
)

In [29]:
# print(input_tokens)

In [30]:
# Pass the attention mask together with the input ids and set the pad token id explicitly.
input_tokens = input_tokens.to('cuda')
generated_ids = model.generate(
    **input_tokens,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id
)


In [31]:
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(input_tokens.input_ids, generated_ids)
]
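The slicing in the cell above relies on the generated sequence starting with the prompt ids. With plain lists and made-up ids, the same idea looks like this:

```python
# Hypothetical ids: the generated sequence is the prompt ids followed by the new ids.
prompt_ids = [[5, 6, 7]]
full_output_ids = [[5, 6, 7, 80, 81, 82]]

# Drop the first len(prompt) ids of each output to keep only the newly generated part.
new_token_ids = [
    output[len(prompt):] for prompt, output in zip(prompt_ids, full_output_ids)
]
print(new_token_ids)
```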

In [32]:
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

In [33]:
print(response)


### answer: algorithms that use pre-processing compared to other image classification algorithms. This independence from prior knowledge and human intervention in feature extraction is a major advantage. ### answer: algorithms that use pre-processing compared to other image classification algorithms. This independence from prior knowledge and human intervention in feature extraction is a major advantage. ### answer: algorithms that use pre-processing compared to other image classification algorithms. This independence from prior knowledge and human intervention in feature extraction is a major advantage. ### answer: algorithms that use pre-processing compared to other image classification algorithms. This independence from prior knowledge and human intervention in feature extraction is a major advantage. ### answer: algorithms that use pre-processing compared to other image classification algorithms. This independence from prior knowledge and human intervention in feature extraction i
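The response above repeats the same `### answer:` block until the token budget runs out. One simple post-processing option, sketched here under the assumption that only the first answer block is wanted, is to cut the text at the next `###` marker. `extract_first_answer` is a hypothetical helper, not part of any library.

```python
def extract_first_answer(response, marker="### answer:"):
    """Return the text of the first answer block, cut at the next '###' marker."""
    _, found, tail = response.partition(marker)
    if not found:
        # No answer marker at all: return the text unchanged, minus whitespace.
        return response.strip()
    # Stop at the next section marker so that repeated blocks are dropped.
    answer, _, _ = tail.partition("###")
    return answer.strip()

sample = "\n### answer: first answer text. ### answer: first answer text. ### ans"
print(extract_first_answer(sample))
```

This only cleans up the decoded string; it does not change how many tokens the model generates.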

### Pipeline Generated

In [34]:
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=1024)
result = pipe(prompt)
print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


### context: Convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns feature engineering by itself via filters (or kernel) optimization. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections.[1][2] For example, for each neuron in the fully-connected layer 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution (or cross-correlation) kernels,[3][4] only 25 neurons are required to process 5x5-sized tiles.[5][6] Higher-layer features are extracted from wider context windows, compared to lower-layer features.

They have applications in:

image and video recognition,[7]
recommender systems,[8]
image classification,
image segmentation,
medical image analysis,
natural language processing,[9]
brain–computer interfaces,[10] and
financial time series.[11]
CNNs are also known as 