# Introduction to LLMs

Dataset: https://huggingface.co/datasets/eli5 <br>
Models:
- DistilGPT2: https://huggingface.co/distilgpt2
- GPT2: https://huggingface.co/docs/transformers/model_doc/gpt2 <br>

This notebook is modified from: https://huggingface.co/docs/transformers/tasks/language_modeling

In [1]:
! pip install transformers transformers[torch] datasets evaluate rouge_score

Collecting transformers
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.2-py3-none-any.whl (294 kB)
Collecting tokenizer

In [2]:
!nvidia-smi

Thu Sep 21 11:19:49 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
import torch
import evaluate

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

In [29]:
# Load ELI5 dataset
eli5 = load_dataset("eli5", split="train_asks[:5000]")
eli5 = eli5.train_test_split(test_size=0.2)
eli5 = eli5.flatten()
print(f'dataset:\n{eli5}')

dataset:
DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers.a_id', 'answers.text', 'answers.score', 'title_urls.url', 'selftext_urls.url', 'answers_urls.url'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers.a_id', 'answers.text', 'answers.score', 'title_urls.url', 'selftext_urls.url', 'answers_urls.url'],
        num_rows: 1000
    })
})


In [40]:
dash_line = '====='*5
prompt_id = 300
original_context = eli5["train"][prompt_id]["answers.text"][0]
prompt = original_context[:50]

print(dash_line)
print(f'PROMPT CONTEXT:\n{prompt}')
print(dash_line)
print(f'COMPLETE CONTEXT:\n{original_context}')

PROMPT CONTEXT:
That's is a good question but it has a very simple
COMPLETE CONTEXT:
That's is a good question but it has a very simple answer. Plant sterols are very [poorly absorbed] (_URL_1_) by the small intestine (0-5%). 

Those that do get in the cell eventually get gut pumped back into the lumen of the intestine for excretion.

Humans inability to process plant sterols can be demonstrated by the rare genetic disorder [Sistosterolemia](_URL_2_), which results in a defective copy of the efflux transporter protein (ABCG5/8) that would pump sterols of the cell into the lumen. Because this protein is no longer function sterols build in the body due to the fact that they cannot be processed like cholesterol 

Interestingly, the reason why plant sterols are recommended for lowering the risk for CVD is because they are theorized to act as a  competitive inhibitor for membrane bound protein (NPC1L1) that takes cholesterol in from the lumen of the intestine to intestinal cell (enterocyte)

## Preprocess data

In [5]:
model_name = "distilgpt2"
# model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
block_size = 128

def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)

# The tokenized sequences vary in length, and long ones are costly to train on.
# A second preprocessing function concatenates all the sequences and
# splits them into shorter chunks of block_size tokens,
# which should be both shorter than the maximum input length and short enough for your GPU RAM.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder. Padding could be added instead if the model supported it;
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_eli5 = eli5.map(
  preprocess_function,
  batched=True,
  num_proc=4,
  remove_columns=eli5["train"].column_names,
)
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Downloading (…)lve/main/config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]
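
To see what `group_texts` does to a batch, here is a minimal sketch on toy token ids. The helper name `group_into_blocks`, the toy batch, and the block size of 4 are made up for illustration; the logic mirrors the function above but takes `block_size` as an argument instead of reading the global.

```python
from itertools import chain


def group_into_blocks(batch, block_size):
    """Concatenate every column of the batch, then cut it into fixed-size blocks."""
    # Flatten each column (a list of token lists) into one long list.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in batch.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [seq[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, seq in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = [list(block) for block in result["input_ids"]]
    return result


# Hypothetical toy batch: three "documents" with 3, 4 and 3 tokens (10 in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Tokens 9 and 10 fall into the incomplete final block and are dropped. The labels are a plain copy of the inputs; GPT-2 style causal LM heads in `transformers` shift them by one position internally when computing the loss.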

In [6]:
# Load pre-trained model
model = AutoModelForCausalLM.from_pretrained(model_name)

def count_parameters(model):
    trainable_params = 0
    all_params = 0
    for _, param in model.named_parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    def num_to_str(num):
        return format(num, ',')
    return f"trainable_params: {num_to_str(trainable_params)}\nall_params: {num_to_str(all_params)}\npercentage of trainable params: {100*trainable_params/all_params}%"
print(count_parameters(model))

Downloading model.safetensors:   0%|          | 0.00/353M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

trainable_params: 81,912,576
all_params: 81,912,576
percentage of trainable params: 100.0%


# Test the pre-trained model with zero-shot inference

In [23]:
from tqdm import tqdm
import pandas as pd

rogue = evaluate.load('rouge')
def evaluate_model(num_sample=10):
    df_test = pd.DataFrame()
    for idx in tqdm(range(num_sample)):
        human_baseline_summaries = eli5["test"][idx]['answers.text'][0]
        prompt = human_baseline_summaries[:20]
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids
        outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=5, pad_token_id=tokenizer.eos_token_id)
        original_model_summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
        # Truncate both texts to the same number of characters before scoring.
        max_length = min(len(original_model_summaries), len(human_baseline_summaries))
        print(f'max_length: {max_length}')
        original_model_results = rogue.compute(
            predictions=original_model_summaries[:max_length] ,
            references=human_baseline_summaries[:max_length],
            use_aggregator=True,
            use_stemmer=True,
        )
        logger = {
            'prompt': prompt,
            'human_baseline_summaries': human_baseline_summaries,
            'original_model_summaries': original_model_summaries,
            'rouge1': original_model_results['rouge1'],
            'rouge2': original_model_results['rouge2'],
            'rougeL': original_model_results['rougeL'],
            'rougeLsum': original_model_results['rougeLsum'],
        }
        df_test = pd.concat([df_test, pd.DataFrame([logger])])
    return df_test
df_test = evaluate_model(num_sample=10)
df_test

100%|██████████| 10/10 [00:59<00:00,  5.96s/it]


| | prompt | human_baseline_summaries | original_model_summaries | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|---|---|---|
| 0 | Absolutely any psych | Absolutely any psychological variable is a res... | Absolutely any psychotherapist will ever try a... | 0.21 | 0.0 | 0.21 | 0.21 |
| 0 | Here's an older, sim | Here's an older, simplified diagram in terms o... | Here's an older, similiar diagram for the wate... | 0.21 | 0.0 | 0.21 | 0.21 |
| 0 | Your main limiting f | Your main limiting factor will be battery powe... | Your main limiting fw is to get the energy fro... | 0.23 | 0.0 | 0.23 | 0.23 |
| 0 | Do you have somethin | Do you have something specific in mind when yo... | Do you have somethin's answer?\n\nIt depends, ... | 0.19 | 0.0 | 0.19 | 0.19 |
| 0 | Mawrth Vallis and ot | Mawrth Vallis and other similar features on Ma... | Mawrth Vallis and ototemology. They also found... | 0.23 | 0.0 | 0.23 | 0.23 |
| 0 | Verbal tourettes is | Verbal tourettes is just a very small part of ... | Verbal tourettes is a common one. I don´t kn... | 0.19 | 0.0 | 0.19 | 0.19 |
| 0 | Frequentist hypothes | Frequentist hypothesis testing is objective (a... | Frequentist hypothesizing that the brain's rew... | 0.29 | 0.0 | 0.29 | 0.29 |
| 0 | It basically makes l | It basically makes looks the substrate look sl... | It basically makes lizards look like snakes. T... | 0.18 | 0.0 | 0.18 | 0.18 |
| 0 | It's not that having | It's not that having smaller/less functional e... | It's not that having the technology to do that... | 0.21 | 0.0 | 0.21 | 0.21 |
| 0 | Generally they need | Generally they need to co-exist. Most of the t... | Generally they need ~~more energy to move ~~th... | 0.19 | 0.0 | 0.19 | 0.19 |


In [24]:
df_test['rouge1'].mean()

0.213
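
A note on reading these scores: `evaluate`'s ROUGE metric takes lists of strings for `predictions` and `references`. Bare strings are passed above, so the metric appears to iterate over them character by character, which would explain why `rouge2` is 0.0 and `rougeL` equals `rouge1` in every row. For a word-level score, one would wrap each text in a list, e.g. `predictions=[generated_text]` (a hypothetical variable name). The sketch below shows what a word-level ROUGE-1 F1 measures. It is a simplified stand-in (lowercased whitespace tokens, no stemming), not the `rouge_score` implementation, and the function name `rouge1_f1` is made up for illustration.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two texts, using whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each word counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example sentences, loosely based on the ELI5 answer shown earlier.
pred = "plant sterols are poorly absorbed by the gut"
ref = "plant sterols are very poorly absorbed by the small intestine"
score = rouge1_f1(pred, ref)
print(round(score, 3))  # 0.778
```

Seven of the eight predicted words also occur in the ten-word reference, giving precision 7/8, recall 7/10, and F1 of 7/9.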

# Fine-tuning the pre-trained model

In [9]:
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

run_name = "finetune-distilgpt2-eli5"
training_args = TrainingArguments(
    output_dir=run_name,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",
    num_train_epochs=3,
    learning_rate=6e-4,
    weight_decay=0.01,
    logging_steps=10,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()
trainer.save_model(run_name)
tokenizer.save_pretrained(run_name)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 3.8806 | 3.958879 |
| 2 | 3.4589 | 3.929691 |
| 3 | 3.0123 | 4.035157 |


('finetune-distilgpt2-eli5/tokenizer_config.json',
 'finetune-distilgpt2-eli5/special_tokens_map.json',
 'finetune-distilgpt2-eli5/vocab.json',
 'finetune-distilgpt2-eli5/merges.txt',
 'finetune-distilgpt2-eli5/added_tokens.json',
 'finetune-distilgpt2-eli5/tokenizer.json')

In [10]:
eval_results = trainer.evaluate()
eval_results

{'eval_loss': 4.035157203674316,
 'eval_runtime': 15.3103,
 'eval_samples_per_second': 133.832,
 'eval_steps_per_second': 16.786,
 'epoch': 3.0}
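
The evaluation loss is (approximately) the average per-token cross-entropy in nats, so exponentiating it gives the perplexity, a common way to report language-model quality. A minimal sketch, with the loss value copied from the output above rather than read from `eval_results`:

```python
import math

# 'eval_loss' copied from the trainer.evaluate() output above.
eval_loss = 4.035157203674316

# Perplexity is the exponential of the average cross-entropy loss.
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")
```

In the notebook itself, `math.exp(eval_results["eval_loss"])` would give the same number. Lower is better.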

# Testing the fine-tuned model

In [17]:
from transformers import AutoTokenizer, AutoModelForCausalLM
# load fine-tuned model
sft_model = AutoModelForCausalLM.from_pretrained(run_name)
del model
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to('cpu')
model.eval()
sft_model.device, model.device

(device(type='cpu'), device(type='cpu'))

In [54]:
def generate(prompt, model, tokenizer):
    inputs = tokenizer(prompt, return_tensors="pt").input_ids
    outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=5, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

dash_line = '====='*5
prompt_id = 500 # <------- change me
original_context = eli5["test"][prompt_id]["answers.text"][0]
prompt = original_context[:40]
original_model_summaries = generate(prompt, model, tokenizer)[0]
sft_model_summaries = generate(prompt, sft_model, tokenizer)[0]
print(dash_line)
print(f'PROMPT CONTEXT:\n{prompt}')
print(dash_line)
print(f'COMPLETE CONTEXT:\n{original_context}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_summaries}')
print(dash_line)
print(f'SFT MODEL:\n{sft_model_summaries}')

PROMPT CONTEXT:
Elite marathoners aren't a great example
COMPLETE CONTEXT:
Elite marathoners aren't a great example because they are unlikely to exhaust glycogen in that amount of time.  So let's call it a 50 mile running race or longer.  The majority of fat that you would use for immediate fuel in a case like that is actually stored between muscle fibers, so it goes straight into the cells, it is unlikely that much of it would end up in circulation.  Once you've depleted that a meaningful amount then yes, your plasma triglyceride levels change. _URL_0_
Assuming you didn't quit 5 hours ago and have been consuming enough carbs to keep the fat fire burning.
ORIGINAL MODEL:
Elite marathoners aren't a great example.  The fastest you can run on a marathon is about 8.5km/h, so a runner with a good distance will probably be far more fatigued than someone running on a marathon.  So, for the record, the fastest you can run is about 8.5km/h.  The fastest you can run on a marathon is about 5km/h 
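
`generate` is called with `do_sample=True, top_k=5`, meaning each new token is drawn only from the five highest-scoring candidates. Here is a minimal sketch of that idea in plain Python, with made-up logits and a hypothetical helper `top_k_sample`. The real implementation in `transformers` works on tensors and may differ in detail.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring entries of logits."""
    # Indices of the k largest logits.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (shifted by the max for stability).
    max_logit = max(logits[i] for i in top_ids)
    weights = [math.exp(logits[i] - max_logit) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
toy_logits = [0.1, 5.0, 2.0, -1.0, 3.0, 4.0]  # made-up scores for a 6-token vocabulary
samples = [top_k_sample(toy_logits, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only ids among the three best (1, 4, 5) can appear
```

A small `top_k` keeps the output away from very unlikely tokens, at the cost of less varied text.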

In [55]:
# Evaluate the model quantitatively using ROUGE
max_length = min(len(original_model_summaries), len(original_context))
print(f'max_length: {max_length}')
original_model_results = rogue.compute(
    predictions=original_model_summaries[:max_length] ,
    references=original_context[0:max_length],
    use_aggregator=True,
    use_stemmer=True,
)
max_length = min(len(sft_model_summaries), len(original_context))
sft_model_results = rogue.compute(
    predictions=sft_model_summaries[:max_length],
    references=original_context[0:max_length],
    use_aggregator=True,
    use_stemmer=True,
)
print(f'original model results:\n{original_model_results}')
print(f'sft model results:\n{sft_model_results}')

original model results:
{'rouge1': 0.195, 'rouge2': 0.0, 'rougeL': 0.195, 'rougeLsum': 0.195}
sft model results:
{'rouge1': 0.22, 'rouge2': 0.0, 'rougeL': 0.225, 'rougeLsum': 0.22}


In [56]:
# Try a prompt of your own
prompt = "Somatic hypermutation allows the immune system to"
generate(prompt, sft_model, tokenizer)[0]

'Somatic hypermutation allows the immune system to recognize the foreign protein as a potential candidate.\n\nThe reason for this is that the immune system does not recognize the foreign protein, but instead recognize the foreign protein as the candidate. This is because the immune system recognizes the foreign protein as a potential candidate in a similar fashion to the target protein, but not the target protein, and attacks it.\n\nIn addition, the foreign protein is already known to be an important candidate. The immune system recognizes the foreign protein and attacks it, attacking the'