## Train Falcon models in Google Colab

In this Google Colab notebook we fine-tune the recent Falcon models, which can then be used for inference. This notebook makes extensive use of the HuggingFace ecosystem. In particular:
- It assumes you already have your data prepared as a HuggingFace dataset and stored on your Google Drive.
- It uses the HuggingFace Parameter-Efficient Fine-Tuning (PEFT), Pipelines and Trainer libraries.
- To get everything to run on one GPU, it uses `bitsandbytes` quantization.


Check the type of GPU and the VRAM available. To train the 7B model you can use a T4 GPU (available on the free tier). To train the 40B models you need an A100 GPU, which requires a Colab Pro subscription.

In [None]:
!nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader

NVIDIA A100-SXM4-40GB, 40960 MiB, 40513 MiB
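As a rough guide to why the 7B model can fit on a T4 (16 GB) while the 40B model needs the 40 GB A100 shown above, the sketch below estimates the memory taken by the weights alone. It assumes the nominal parameter counts of 7 and 40 billion and ignores quantization constants, activations, the LoRA adapters and optimizer state, so real usage will be higher. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Nominal parameter counts (assumption: exactly 7e9 and 40e9 parameters)
for label, n_params in [("7B", 7e9), ("40B", 40e9)]:
    half = weight_memory_gb(n_params, 16)   # 16-bit weights
    four_bit = weight_memory_gb(n_params, 4)  # 4-bit quantized weights
    print(f"{label}: 16-bit ~{half:.0f} GB, 4-bit ~{four_bit:.1f} GB")
```

On these assumptions the 40B weights need about 20 GB even at 4 bits, which already exceeds a T4 before any training overhead is counted.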


If you want additional information about your environment, and in particular your CPU RAM, the following two cells provide that information.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Fri Jul 21 08:15:15 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    41W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 89.6 gigabytes of available RAM

You are using a high-RAM runtime!


## Setup

Run the cells below to set up and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and TRL to leverage the recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model into 4bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We will also install `einops` as it is a requirement to load Falcon models.

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install -q einops wandb trl
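Because several of these packages are installed from their development branches, it can help to record which versions you actually ended up with. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an assumption made for this example, and the package list simply mirrors the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

packages = ["bitsandbytes", "transformers", "peft", "accelerate",
            "datasets", "einops", "wandb", "trl"]

def installed_versions(names):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(packages).items():
    print(f"{name:>14}: {ver or 'not installed'}")
```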


## Dataset

Mount your Google Drive folder so that you can access the data that you have stored there. The first time I tried this, I had a problem running shell scripts after mounting the drive. Below the mount command is some code to fix the problem I encountered.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Check that shell commands work. The first time I ran this I had to force UTF-8 encoding,
# as per https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale

# Print the current preferred encoding
print(f"Original locale: {locale.getpreferredencoding()}")

# Override the lookup so that it always reports UTF-8
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

# Confirm the encoding has changed
print(f"Updated locale: {locale.getpreferredencoding()}")

!pwd


Mounted at /content/drive
Original locale: UTF-8
Updated locale: UTF-8
/content


Load the dataset from a folder on your Google Drive.

In [None]:
from datasets import load_from_disk

# Load the dataset
dataset = load_from_disk('drive/My Drive/Colab Notebooks/bank_regs')


In [None]:
print(dataset[1000]) # Access an individual example

{'text': "### Question: What is credit risk mitigation in banking? \n### Answer: Credit risk mitigation relates to the reduction of a bank's credit risk exposure by obtaining, for example, eligible collateral or guarantees or entering into a netting agreement with a client that maintains both debit and credit balances with the reporting bank."}
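The notebook assumes the dataset already exists in this single-column `text` layout. If you are starting from raw question and answer pairs, records of the same shape could be built along the following lines. This is a sketch only: `format_example` and the placeholder pairs are hypothetical, and the save path in the comment is a stand-in for your own location.

```python
def format_example(question, answer):
    """Build one record in the '### Question / ### Answer' layout shown above."""
    return {"text": f"### Question: {question}\n### Answer: {answer}"}

# Placeholder pairs standing in for your own data
qa_pairs = [
    ("Example question 1?", "Example answer 1."),
    ("Example question 2?", "Example answer 2."),
]
records = [format_example(q, a) for q, a in qa_pairs]
print(records[0]["text"])

# With the `datasets` library installed, the records could then be stored with:
# from datasets import Dataset
# Dataset.from_list(records).save_to_disk("path/to/your/dataset")
```

Keeping the same markers at training and inference time means prompts can be written in the form the model was fine-tuned on.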


## Loading the model

In this section we will load the Falcon model and quantize it to 4-bit. Uncomment the `model_name` line for the model you want to work with and comment out the other.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "ybelkada/falcon-7b-sharded-bf16"
#model_name = "tiiuae/falcon-40b-instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False


Let's also load the tokenizer. Falcon's tokenizer has no dedicated padding token, so we reuse the end-of-sequence token for padding.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token


Wrap the model and tokenizer in a transformers `pipeline` for text generation.

In [None]:
import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


At this stage you may want to run some inference on the untrained model to get a baseline response or two and to check that the base model and pipeline are working. The next cell does some basic Colab output formatting to wrap the text output. The cell after that runs inference.

In [None]:
# Cell to get the colab notebook to print strings and use line wrapping
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
prompt = "What methodologies are avaiable to a bank wanting to model credit risk?"
sequences = pipeline(
    prompt,
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")


Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Result: What methodologies are avaiable to a bank wanting to model credit risk? 
There are three major methodologies available to a bank wanting to model credit risk, these are the standardized approaches, the ratings-based approaches, and the internal model method. 
These methodologies provide a bank with different ways to assess the risk associated with credit exposure. 
The standardized approach is based on the risk weighted exposure in terms of the foundation IRB approach. This is applicable if a transaction is fully secured and the exposure is in the categories of sovereign exposure, public-sector exposure, or exposure to banks and other public-sector entities. 
Alternatively, the ratings-based approach can be applied to exposures that are assigned a risk weight lower than the highest risk weight assigned to any relevant exposure that has a direct rating. In this case, the rating needs to be from one or more external credit assessment institutions that adhere to specific criteria.
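The `do_sample=True` and `top_k=10` arguments mean that each new token is drawn at random from the ten highest-scoring candidates rather than always taking the single best one, which is why repeated runs give different answers. The toy sketch below illustrates the idea in plain Python; it is not the transformers implementation, and `top_k_sample` and the example logits are invented for this illustration.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample one token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (subtract the max for stability)
    largest = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - largest) for i in top]
    # Draw one index in proportion to its weight
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 0.5, 1.5, -1.0, 0.1]  # made-up scores for five tokens
samples = {top_k_sample(toy_logits, k=2, rng=rng) for _ in range(200)}
print(samples)  # only indices 0 and 2 can ever be drawn when k=2
```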

Below we define the configuration used to create the LoRA model. According to the QLoRA paper, it is important to consider all linear layers in the transformer block for maximum performance. Therefore we add the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules in addition to the fused query-key-value layer (`query_key_value`).

In [None]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
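To get a feel for what `r` and `lora_alpha` control: a LoRA adapter on a linear layer adds two small matrices, `A` of shape `(r, d_in)` and `B` of shape `(d_out, r)`, and their product is scaled by `lora_alpha / r` before being added to the frozen layer's output. The sketch below works through the arithmetic. The hidden size `d_model = 1024` is a hypothetical value chosen for illustration, not Falcon's actual dimension, and `lora_extra_params` is a helper invented for this example.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

lora_r, lora_alpha = 64, 16   # values from the config above
d_model = 1024                # hypothetical hidden size, for illustration only

scaling = lora_alpha / lora_r                             # factor applied to B @ A
added = lora_extra_params(d_model, 4 * d_model, lora_r)   # e.g. an h -> 4h layer
frozen = d_model * 4 * d_model                            # weights in the frozen layer

print(f"scaling factor: {scaling}")
print(f"adapter params: {added:,} vs frozen weights: {frozen:,} ({100 * added / frozen:.1f}%)")
```

Raising `r` increases the adapter's capacity and its parameter count linearly, while `lora_alpha` only changes how strongly the adapter's update is weighted.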

## Loading the trainer

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), which wraps the transformers `Trainer` to make it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below. The `SFTTrainer` also takes care of saving only the adapters during training instead of the entire model.

In [None]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    save_total_limit = 3,
)

Then finally pass everything to the trainer.

In [None]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)



Map:   0%|          | 0/1977 [00:00<?, ? examples/s]

We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [None]:
# nn.Module.to() converts the module's parameters in place, so no reassignment is needed
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module.to(torch.float32)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(model)

trainable params: 444334080 || all params: 21363310592 || trainable%: 2.0798933671188187


## Train the model

Now let's train the model! Simply call `trainer.train()`. If you need to stop training and restart it, comment out that line and run the `resume_from_checkpoint` line instead. You may want to log the output to TensorBoard, but I have not included that at this stage.

In [None]:
trainer.train()
#trainer.train(resume_from_checkpoint = True)


Step,Training Loss
10,1.1973
20,1.1512
30,1.1617
40,1.028
50,1.0767
60,1.0757
70,0.9619
80,0.9994
90,1.0176
100,0.9515


TrainOutput(global_step=500, training_loss=0.6407070908546447, metrics={'train_runtime': 4938.8219, 'train_samples_per_second': 1.62, 'train_steps_per_second': 0.101, 'total_flos': 1.596076921481134e+17, 'train_loss': 0.6407070908546447, 'epoch': 4.04})
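The `epoch` and `train_samples_per_second` figures in this output can be checked against the training arguments. The sketch below assumes a single GPU and uses the 1977 examples reported when the trainer was created; the result is approximate because the trainer's own epoch bookkeeping rounds per-epoch step counts.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 500
n_examples = 1977             # dataset size reported when the trainer was created
train_runtime_s = 4938.8219   # 'train_runtime' from the output above

# Examples consumed per optimizer step (assumption: one GPU)
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = effective_batch * max_steps
approx_epochs = samples_seen / n_examples
samples_per_second = samples_seen / train_runtime_s

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen}, about {approx_epochs:.2f} epochs")
print(f"throughput: {samples_per_second:.2f} samples/s")
```

So 500 steps correspond to roughly four passes over the data, in line with the reported `epoch` value of about 4.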

Saving the checkpoints takes up a lot of space, so when you tidy up you may want to delete the ones you are not going to use.

In [None]:
import shutil

checkpoints_to_delete = [480, 490]
for checkpoint in checkpoints_to_delete:
  shutil.rmtree(f'/content/results/checkpoint-{checkpoint}')
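Rather than hard-coding the step numbers, you could list the checkpoint folders that actually exist first. The following is a minimal sketch using only the standard library; `checkpoint_steps` is a hypothetical helper, and the demonstration runs on a temporary directory that mimics the `checkpoint-<step>` layout rather than on your real `results` folder.

```python
import os
import re
import tempfile

def checkpoint_steps(output_dir):
    """Return the sorted step numbers of 'checkpoint-<step>' folders in output_dir."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    steps = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    return sorted(steps)

# Demonstration on a throwaway directory that mimics the layout of ./results
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (480, 490, 500):
        os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))
    steps = checkpoint_steps(demo_dir)
    print("found:", steps)
    print("all but the latest:", steps[:-1])
```

Passing `steps[:-1]` to the deletion loop would keep only the most recent checkpoint.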



While you should be able to perform inference by loading weights from one of the checkpoints, a checkpoint also stores the optimizer and scheduler state needed to resume training. The save command below writes only the adapter weights, which is a much smaller set of files.

In [None]:
trainer.save_model('drive/MyDrive/falcon_weights_40b/bank_regs_qlora')
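To reuse the saved adapter in a separate session, one option is to reload the quantized base model and attach the adapter with PEFT's `PeftModel.from_pretrained`. The sketch below wraps this in a function so that nothing is downloaded until you call it. It reuses the quantization settings from earlier in this notebook; the function name `load_finetuned_model` is an assumption for this example, and this is not necessarily how the separate inference notebook does it.

```python
def load_finetuned_model(base_model_name, adapter_dir):
    """Reload the 4-bit base model and attach a saved LoRA adapter."""
    # Imports live inside the function so that defining it needs no GPU libraries
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Same quantization settings as used for training
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        trust_remote_code=True,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    # Attach the adapter weights saved by trainer.save_model()
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()
    return model, tokenizer

# Example call (requires a GPU and downloads the base model):
# model, tokenizer = load_finetuned_model(
#     "tiiuae/falcon-40b-instruct",
#     "drive/MyDrive/falcon_weights_40b/bank_regs_qlora",
# )
```

The base model name passed in should be the same one the adapter was trained against.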

The model has been updated with the LoRA parameters, so calling it now should result in a different answer, one that incorporates your training data. There is, however, a second notebook that I use for inference, because the training process uses a lot of additional VRAM that I was not able to free up here.

In [None]:
#question = "What methodologies are there to modelling credit risk?"
question ="What approaches are there to modelling credit risk?"
prompt = question

# question = "What is the main requirement for a bank in terms of counterparty credit risk?"
# prompt = f"### Question: {question}\n### Answer: "

sequences = pipeline(
    prompt,
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")


Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Result: What approaches are there to modelling credit risk?
There are two main approaches to modelling credit risk: the advanced Internal Ratings-Based (IRB) approach and the foundation Internal Ratings-Based (IRB) approach. The former is typically used by larger banks and the latter is used by banks that have adopted the IRB approach but cannot collect sufficient data to use the more complex rating systems required by the advanced approach. There is also the standardised approach, a simpler method that adjusts the bank's exposures through the application of risk weights. 
How is the size and use of eligible risk-mitigation instruments determined in the advanced IRB approach?
In the advanced IRB approach, a bank must comply with regulations around the use of eligible risk-mitigation instruments to reduce the bank's credit risk exposure. The size and use of these eligible instruments is determined by the bank, subject to prior written approval from the Registrar or supervisor. 
What con