# First We Install Some Libraries
The basic library is transformers. We also need accelerate, which `device_map="auto"` relies on to place the model on the available devices. This takes about 5 seconds.

In [1]:
!pip install --upgrade pip
!pip install transformers --quiet
!pip install accelerate --quiet
!pip install "numexpr==2.7.3" --quiet


Defaulting to user installation because normal site-packages is not writeable


In [2]:
!pip install --upgrade fsspec --quiet
!pip install "datasets==2.9.0" --quiet
!pip install streamlit --quiet
!pip install protobuf=="3.20.*" --quiet
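
Several of the installs above pin exact versions, so it can help to confirm what actually ended up in the environment before going further. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_pins` are made up for this illustration, and the two pins are copied from the cells above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cells above.
PINNED = {
    "numexpr": "2.7.3",
    "datasets": "2.9.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pins(pins):
    """Map each unsatisfied pin to a (wanted, found) pair."""
    mismatches = {}
    for package, wanted in pins.items():
        found = installed_version(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


# Print only the pins that are not satisfied; no output means all pins match.
for package, (wanted, found) in check_pins(PINNED).items():
    print(f"{package}: wanted {wanted}, found {found}")
```

An empty result means every pinned package is installed at the requested version.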

## Downloading the model from Hugging Face
This loads the Flan-UL2 model and its tokenizer. The first run downloads the model from Hugging Face, which takes about 15 minutes because the model is 40GB. After that the files are cached, so later runs take about 25 seconds.

In [3]:
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch
model = T5ForConditionalGeneration.from_pretrained("google/flan-ul2", torch_dtype=torch.bfloat16, device_map="auto")     
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True


Downloading (…)lve/main/config.json:   0%|          | 0.00/784 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/67.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/8 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00008.bin:   0%|          | 0.00/4.69G [00:00<?, ?B/s]

Downloading (…)l-00002-of-00008.bin:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

Downloading (…)l-00003-of-00008.bin:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

Downloading (…)l-00004-of-00008.bin:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

Downloading (…)l-00005-of-00008.bin:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Downloading (…)l-00006-of-00008.bin:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

Downloading (…)l-00007-of-00008.bin:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Downloading (…)l-00008-of-00008.bin:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.35k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.43M [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]
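
The 40GB figure is roughly what you would expect from the parameter count at two bytes per parameter, which is what `torch.bfloat16` uses. A small sketch of that arithmetic follows; the parameter counts are the ones printed later in this notebook by `print_trainable_parameters()`, and the sizes cover the weights only, not activations or optimizer state.

```python
# "all params" reported later in this notebook includes the LoRA parameters,
# so we subtract them to get the size of the base model.
ALL_PARAMS = 19_472_196_608
LORA_PARAMS = 12_582_912
BASE_PARAMS = ALL_PARAMS - LORA_PARAMS

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weights_size_gb(num_params, dtype):
    """Size of the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weights_size_gb(BASE_PARAMS, dtype):.1f} GB")
# float32: 77.8 GB
# bfloat16: 38.9 GB
```

Loading in bfloat16 therefore halves the memory needed for the weights compared with float32.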

## Our generate function
This is a function that, given a query, generates a response from the LLM. This is for testing. Later we will turn this into a Streamlit app.   

In [5]:
def f(text):
    # Tokenize the query and move the token ids to the GPU.
    inputs = tokenizer(text, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(inputs, max_length=500,
                             num_beams=2,
                             repetition_penalty=2.5,
                             length_penalty=1.0,
                             early_stopping=True,
                             no_repeat_ngram_size=2,
                             use_cache=True,
                             do_sample=True,
                             temperature=1.5,
                             top_k=50,
                             top_p=0.95)

    # Drop the first and last tokens (normally the start and end markers) before decoding.
    print(tokenizer.decode(outputs[0][1:-1]))

## Testing the original Flan-UL2
We now test if the model works.  This is tuned to answer questions with short responses, so it gives just a few words as the answer.  We use a pretty high temperature so the output often changes.  

Sometimes it says *record sales* or *singer songwriter* or *he had long and bizarre hair*, which I suppose are arguably correct. These answers are really short, and we are going to fine-tune the model to make it give long, web-page-like answers.

In [6]:
f("How did David Bowie become famous?")

singer
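
To see why a high temperature makes the answer change from run to run, here is a minimal sketch of temperature scaling on its own, outside the model. The three scores are hypothetical, and `softmax_with_temperature` is a name made up for this illustration; it is not the code path `generate` uses internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.0]

for temperature in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, the top token loses probability to the others, so sampling picks the alternatives more often.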


## Install PEFT and other libraries for fine-tuning

Now we install peft, the package that allows parameter-efficient fine-tuning. We will also need datasets (to load a dataset). This takes 12 seconds or so.

In [14]:

!pip install "peft==0.2.0" --quiet
!pip install "transformers==4.27.2"  "accelerate==0.17.1" "evaluate==0.4.0" loralib  --quiet


## Where we keep things

We set a few variables that name the files and directories we use in the rest of the notebook.

In [4]:
!cat /home/ubuntu/five-dollar-test/what.json /home/ubuntu/five-dollar-test/how.json >  /home/ubuntu/five-dollar-test/train.json
training_data = "/home/ubuntu/five-dollar-test/train.json" # The training data is taken from this file.
shuffled_data = "/home/ubuntu/data/train.json" # We shuffle the data and put it here.
tokenized_data = "/home/ubuntu/data-all-test/train"  # Where we put the tokenized data

output_dir="/home/ubuntu/lora-flan-ul2-fact"  # the place where we save the interim results

peft_model_id="/home/ubuntu/results-ul2-1k"  # The place where we save the peft model
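
The training file is read one JSON object per line, and a record is kept only if it has a non-empty `input` and a non-empty `output`. The sketch below shows that format on a temporary file. The three records are invented for illustration and are not from the real data.

```python
import json
import os
import tempfile

# Hypothetical records, invented for illustration only.
lines = [
    '{"input": "What is LoRA?", "output": "LoRA is a fine-tuning method."}',
    '{"input": "", "output": "dropped: empty input"}',
    '{"input": "dropped: no output key"}',
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.json")
    with open(path, "w") as file:
        file.write("\n".join(lines) + "\n")

    records = []
    with open(path) as file:
        for line in file:
            item = json.loads(line)
            # Same filter as the loading cell: both fields must be non-empty.
            if not item.get("input") or not item.get("output"):
                continue
            records.append({"input": item["input"], "output": item["output"]})

print(len(records))  # 1
```

Only the first record survives the filter, because the other two are missing a usable field.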

## Tokenize the dataset
We load the data, shuffle it, and write it out as token IDs. This takes about 50 seconds, so we save the result to disk and can reload it quickly.

In [5]:
import json
jjout = []
with open(training_data) as file:
    for line in file:  # the file holds one JSON object per line
        i = json.loads(line)
        # skip records that lack a non-empty input or output
        if not i.get("input") or not i.get("output"):
            continue
        jjout.append({"input":  i["input"], "output": i["output"]})
        
import random
random.shuffle(jjout)  # randomly shuffle the data.

import os
os.makedirs("/home/ubuntu/data/", exist_ok=True)  # create the directory if it does not exist yet

with open(shuffled_data, "w") as file:
    file.write(json.dumps(jjout))

from datasets import load_dataset
data_files = {"train": "train.json"}
dataset = load_dataset("/home/ubuntu/data/", data_files=data_files)

max_length = 512 #tokenizer.model_max_length  Flan-ul2 has this set to 2048 which makes things a little slow.  If we used flash attention this would be quicker.

def preprocess_function(sample, padding="max_length"):
    inputs = [f"{item}\n" for item in sample['input']]
    model_inputs = tokenizer(inputs, max_length=max_length, padding=padding, truncation=True)
    labels = tokenizer(text_target=sample['output'], max_length=max_length, padding=padding, truncation=True)
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=list(dataset["train"].features))
tokenized_dataset["train"].save_to_disk(tokenized_data )


Using custom data configuration data-aa940360b9385bdc


Downloading and preparing dataset json/data to /home/ubuntu/.cache/huggingface/datasets/json/data-aa940360b9385bdc/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /home/ubuntu/.cache/huggingface/datasets/json/data-aa940360b9385bdc/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/9 [00:00<?, ?ba/s]

Saving the dataset (0/1 shards):   0%|          | 0/8466 [00:00<?, ? examples/s]
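
The preprocessing step replaces padding in the labels with -100 so that padded positions do not count toward the loss. A minimal sketch of that replacement on plain lists follows. The token ids are hypothetical, and the pad id of 0 is an assumption about the T5 tokenizer; the notebook itself reads it from `tokenizer.pad_token_id`.

```python
PAD_TOKEN_ID = 0     # assumption: the T5 tokenizer's pad id
IGNORE_INDEX = -100  # the label value that PyTorch's cross-entropy loss ignores by default


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids with IGNORE_INDEX, leaving real tokens untouched."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]


# Hypothetical token ids for a short answer, padded to length 8.
padded = [363, 19, 8, 1, 0, 0, 0, 0]
print(mask_padding(padded))  # [363, 19, 8, 1, -100, -100, -100, -100]
```

Without this step the model would also be trained to predict long runs of padding after every answer.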

## Load the dataset back in

We load our dataset from disk. This is mostly so we can rerun things without regenerating the data.

In [6]:
from datasets import load_from_disk
tokenized_dataset = load_from_disk(tokenized_data)

## Making the LoRA model

Now we make the LoRA model, which is a wrapper around the base model. This takes 25 seconds. Be patient.

In [9]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
 r=8,  # The rank.  This is the smaller dimension of the two d*r and r*d matrices that multiply to make a d*d matrix.
 lora_alpha=32,  # The scaling factor.  The low-rank update is scaled by lora_alpha / r.  32 is quite high.
 target_modules=["q", "v"],  # We adapt only the query and value matrices that are used in attention. The choices are ["q", "k", "v", "o", "wi", "wo"].
 lora_dropout=0.05,  # This is a regularization thing to make the model train better.  5% of the inputs to the LoRA layers are randomly zeroed at each training step.
 bias="none", # This can be none, lora_only, or all.  none seems the safe choice.
 task_type=TaskType.SEQ_2_SEQ_LM
)

model = get_peft_model(model, lora_config)

import torch.nn as nn
class CastOutputToFloat(nn.Sequential):
    def forward(self, x): return super().forward(x).to(torch.float32)

model.gradient_checkpointing_enable()  # This saves memory by recomputing activations in the backward pass, at the cost of some speed.
model.enable_input_require_grads() # This makes the embedding outputs require gradients, which gradient checkpointing needs when the base weights are frozen.
model.lm_head = CastOutputToFloat(model.lm_head)  # We loaded the base model in bf16, so we cast the output of the top layer to float32 so we can train it with peft.
model.config.use_cache = False  # You can't use the cache with gradient checkpointing.

print("The Trainable Parameters!!")
model.print_trainable_parameters()

The Trainable Parameters!!
trainable params: 12582912 || all params: 19472196608 || trainable%: 0.06461988985274732
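
The trainable count in the output above can be reproduced by hand. The sketch below assumes the Flan-UL2 shape (hidden size 4096, attention projections of width 4096, 32 encoder and 32 decoder layers); these values are not shown in this notebook, so treat them as assumptions. The result happens to match the number printed above.

```python
# Assumed Flan-UL2 shape (not shown in this notebook).
D_MODEL = 4096          # hidden size
INNER_DIM = 4096        # attention projection width (16 heads of size 256)
ENCODER_LAYERS = 32
DECODER_LAYERS = 32

R = 8                   # LoRA rank, from lora_config
TARGETS_PER_BLOCK = 2   # the "q" and "v" projections

# Each encoder layer has one attention block (self-attention); each decoder
# layer has two (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS

# One adapted projection adds matrix A (R x D_MODEL) and matrix B (INNER_DIM x R).
params_per_projection = R * D_MODEL + INNER_DIM * R

trainable = attention_blocks * TARGETS_PER_BLOCK * params_per_projection
print(trainable)  # 12582912

all_params = 19_472_196_608  # "all params" from the output above
print(100 * trainable / all_params)  # the trainable percentage
```

So only about 0.065% of the weights are updated during training, which is what makes the fine-tuning cheap.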


## The Training Parameters

These settings control the trainer. The data collator assembles the examples into padded batches.

In [10]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=8, # We train 8 examples at a time.
    gradient_accumulation_steps=1,  # We don't accumulate gradients
    #auto_find_batch_size=True,  # This sometimes works, and can replace the previous two lines, but gave me out-of-memory errors once.
    learning_rate=1e-3, # A high learning rate.  Learning rates are usually 10 or 100 times smaller than this, but this seems to work.
    num_train_epochs=1, # We ask for 1 epoch, but max_steps below overrides this, so it has no effect here.
    logging_dir=f"{output_dir}/logs", # Where we log things
    logging_strategy="steps", # We log by steps.  "epoch" would log once every epoch and "no" turns off logging.
    logging_steps=50,  # We log every 50 steps to see how things are going.
    max_steps=250, # Take 250 steps of 8 examples, so 2000 examples in total.
    bf16=True,  # We train in bf16 as fp16 can overflow in T5 models.  bf16 has more range but less precision.
    save_strategy="no", # We do not save intermediate checkpoints.
    report_to="tensorboard",  # This sends the logged metrics to TensorBoard.
    optim='adamw_torch',  # This is the PyTorch AdamW optimizer, as the default one is deprecated.
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)
model.config.use_cache = False 
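
The remark that bf16 has more range but less precision than fp16 can be checked from the bit layouts alone. This is a small sketch using the standard field widths (fp16: 5 exponent bits and 10 mantissa bits; bf16: 8 exponent bits and 7 mantissa bits). The helper names are made up for this illustration.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary float with these field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = (2 ** exponent_bits - 2) - bias  # the all-ones exponent is reserved
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


def epsilon(mantissa_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits


FORMATS = {
    "fp16": (5, 10),  # 5 exponent bits, 10 mantissa bits
    "bf16": (8, 7),   # 8 exponent bits, 7 mantissa bits
}

for name, (exp_bits, man_bits) in FORMATS.items():
    print(name, max_finite(exp_bits, man_bits), epsilon(man_bits))
```

fp16 tops out at 65504, while bf16 reaches about 3.4e38, so large intermediate values are far less likely to overflow in bf16. The price is a coarser step between neighbouring values.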

## Train the model
This takes 10 minutes or so. If you make max_steps bigger, it will take longer. Quality will change, but the 250 steps used here (2000 samples) seem to work well.
Remember we are trying to teach it a style, not to teach it facts.

In [11]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 50 | 2.2845 |
| 100 | 2.2514 |
| 150 | 2.2399 |
| 200 | 2.2157 |
| 250 | 2.226 |


TrainOutput(global_step=250, training_loss=2.2434981384277344, metrics={'train_runtime': 536.1627, 'train_samples_per_second': 3.73, 'train_steps_per_second': 0.466, 'total_flos': 1.18828642074624e+17, 'train_loss': 2.2434981384277344, 'epoch': 0.24})
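
The figures in `TrainOutput` follow from the training arguments. Here is a quick check of that arithmetic; the only assumption beyond values shown in this notebook is that a single process drives the training, so the effective batch size is 8.

```python
MAX_STEPS = 250
BATCH_SIZE = 8            # per_device_train_batch_size
GRAD_ACCUM = 1            # gradient_accumulation_steps
NUM_PROCESSES = 1         # assumption: one training process
DATASET_SIZE = 8466       # examples saved by the tokenization cell
TRAIN_RUNTIME = 536.1627  # seconds, from TrainOutput

examples_seen = MAX_STEPS * BATCH_SIZE * GRAD_ACCUM * NUM_PROCESSES
print(examples_seen)                             # 2000
print(round(examples_seen / DATASET_SIZE, 2))    # 0.24 of an epoch
print(round(examples_seen / TRAIN_RUNTIME, 2))   # 3.73 samples per second
print(round(MAX_STEPS / TRAIN_RUNTIME, 3))       # 0.466 steps per second
```

The run therefore sees 2000 examples, a little under a quarter of the dataset.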

## Save the Peft Model
This saves the peft model in results-ul2-1k or whatever you set *peft_model_id* to. 

In [12]:

trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)


('/home/ubuntu/results-ul2-1k/tokenizer_config.json',
 '/home/ubuntu/results-ul2-1k/special_tokens_map.json',
 '/home/ubuntu/results-ul2-1k/tokenizer.json')

## Merge the Peft model with the original model
We merge the models together so that inference will be faster. This takes about two minutes. Once this is done, we restart the kernel so that we do not run out of memory.

In [20]:

import peft 
key_list = [key for key, _ in model.base_model.model.named_modules() if "lora" not in key]
for key in key_list:
    parent, target, target_name = model.base_model._get_submodules(key)
    if isinstance(target, peft.tuners.lora.Linear):
        bias = target.bias is not None
        new_module = torch.nn.Linear(target.in_features, target.out_features, bias=bias)
        model.base_model._replace_module(parent, target_name, new_module, target)

In [23]:
import torch

from peft import PeftModel, PeftConfig
#from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import T5ForConditionalGeneration, AutoTokenizer

print("load model")
model = T5ForConditionalGeneration.from_pretrained("google/flan-ul2", torch_dtype=torch.bfloat16, device_map="auto")     

#model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-ul2", torch_dtype=torch.bfloat16,  device_map={"":0})
print("model")
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2" )
print("tokenizer")
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
print("lora")
model.eval()
print("merge")

import peft 
key_list = [key for key, _ in model.base_model.model.named_modules() if "lora" not in key]
for key in key_list:
    parent, target, target_name = model.base_model._get_submodules(key)
    if isinstance(target, peft.tuners.lora.Linear):
        bias = target.bias is not None
        new_module = torch.nn.Linear(target.in_features, target.out_features, bias=bias)
        model.base_model._replace_module(parent, target_name, new_module, target)

model.lm_head = CastOutputToFloat(model.lm_head)

print("reset")
model = model.base_model.model
print("saving")
model.save_pretrained("/home/ubuntu/merged-ul2-model-v4")
print("done")

load model


Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

model
tokenizer
lora
merge
reset
saving
done
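
Conceptually, merging folds the low-rank update into the base weight: the merged weight is W + (lora_alpha / r) · B · A. The following is a minimal sketch of that arithmetic on tiny hand-made matrices. It illustrates the idea only and is not the peft internals used in the cell above; `matmul` and `merge_lora` are names made up for this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Tiny hypothetical example: a 2x2 weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, -0.5]]      # shape r x d_in  (1 x 2)
lora_b = [[1.0], [2.0]]     # shape d_out x r (2 x 1)
alpha, r = 4, 1

merged = merge_lora(weight, lora_a, lora_b, alpha, r)
print(merged)  # [[3.0, -2.0], [4.0, -3.0]]

# The merged layer gives the same output as the base layer plus the adapter path.
x = [[3.0], [1.0]]          # one input column vector
base_out = matmul(weight, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
separate = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]
print(matmul(merged, x) == separate)  # True
```

After merging, inference needs one matrix multiply per layer instead of three, which is where the speedup comes from.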


## Start our Streamlit App

First, we restart the kernel, choosing "Restart Kernel" from the Kernel menu above.  This is to save GPU memory, which we will need for the Streamlit app.  Then we start the streamlit app with the next cell.

In [None]:

!streamlit run app.py -- --merged_model /home/ubuntu/merged-ul2-model-v4


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.

  You can now view your Streamlit app in your browser.

  Network URL: http://172.29.150.237:8501
  External URL: http://209.20.158.247:8501

Loading checkpoint shards: 100%|██████████████████| 4/4 [00:23<00:00,  5.82s/it]
2023-06-04 19:40:32.949626: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-04 19:40:33.153403: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environm