Fine-tuning a Financial Large Language Model (FinLLM) with **FinGPT**: LoRA (Low-Rank Adaptation) on the **ChatGLM2-6b** base model

FinGPT v1

In [1]:
! pip install datasets transformers torch
! pip install tqdm
! pip install protobuf transformers==4.30.2 cpm_kernels "torch>=2.0" gradio mdtex2html sentencepiece accelerate
! pip install huggingface_hub
! pip install loguru peft
! pip install accelerate -U
! pip install bitsandbytes



In [6]:
import os
import shutil

json_lines_path = "../data/dataset.jsonl"
save_path = "../data/dataset"

if os.path.exists(json_lines_path):
  os.remove(json_lines_path)

if os.path.exists(save_path):
  # save_path is a directory written by save_to_disk, so os.remove would fail on it
  shutil.rmtree(save_path)

data_directory = "../data"

if not os.path.exists(data_directory):
  os.makedirs(data_directory)

In [7]:
from datasets import load_dataset
import datasets

output_sentiment_dic = {
    0: "negative",
    1: "positive",
    2: "neutral",
}

#loading Twitter Financial News Sentiment (tfns) dataset
tfns = load_dataset('zeroshot/twitter-financial-news-sentiment')
tfns = tfns['train']
tfns = tfns.to_pandas()
tfns['label'] = tfns['label'].apply(lambda x:output_sentiment_dic[x])
tfns['instruction'] = 'What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.'
tfns.columns = ['input', 'output', 'instruction']
tfns = datasets.Dataset.from_pandas(tfns)
tfns


Dataset({
    features: ['input', 'output', 'instruction'],
    num_rows: 9543
})

In [8]:
tmp_dataset = datasets.concatenate_datasets([tfns]*2)
train_dataset = tmp_dataset
print(tmp_dataset.num_rows)

full_dataset = train_dataset.shuffle(seed = 42)
full_dataset.shape

19086


(19086, 3)

In [9]:
import json
from tqdm.notebook import tqdm

def format_sample_input(sample_input: dict) -> dict:
  context = f"Instruction: {sample_input['instruction']}\n"
  if sample_input.get("input"):
    context += f"Input: {sample_input['input']}\n"
  context += "Answer: "
  target = sample_input["output"]
  return {"context": context, "target": target}

data_list = []
for item in full_dataset.to_pandas().itertuples():
  tmp = {}
  tmp["instruction"] = item.instruction
  tmp["input"] = item.input
  tmp["output"] = item.output
  data_list.append(tmp)


with open (json_lines_path,'w') as file:
  for sample_input in tqdm(data_list, desc = "formatting in progress.."):
    file.write(json.dumps(format_sample_input(sample_input)) + '\n')

formatting in progress..:   0%|          | 0/19086 [00:00<?, ?it/s]

In [10]:
#Printing sample data
numOfRecords = 10
with open(json_lines_path) as f:
    for i in range(0, numOfRecords):
        print(f.readline(), end = '')

{"context": "Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.\nInput: $DRIP $LABU $GASX - SOXL, LABU, JO and GUSH among weekly ETF movers https://t.co/FntrWNY9sn\nAnswer: ", "target": "neutral"}
{"context": "Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.\nInput: From Farms to Silicon Valley, U.S. Businesses Stand to Gain From USMCA -- Update #economy #MarketScreener\u2026 https://t.co/fpxwDcClZh\nAnswer: ", "target": "neutral"}
{"context": "Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.\nInput: Broadridge acquires Clearstructure Financial Technology\nAnswer: ", "target": "neutral"}
{"context": "Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.\nInput: Ted Baker Chairman David Bernstein and CEO Lindsay Page are quitting after the British retailer slash

In [11]:
from transformers import AutoTokenizer, AutoConfig

base_model_name = 'THUDM/chatglm2-6b'
max_seq_length = 512
skip_overlength = True

def preprocessing(tokenizer, config, sample_input, max_seq_length):
  prompt = sample_input["context"]
  target = sample_input["target"]
  prompt_ids = tokenizer.encode(
      prompt,
      max_length=max_seq_length,
      truncation=True
  )
  target_ids = tokenizer.encode(
      target,
      max_length=max_seq_length,
      truncation=True,
      add_special_tokens=False
  )
  input_ids = prompt_ids + target_ids + [config.eos_token_id]
  return {"input_ids": input_ids, "seq_len": len(prompt_ids)}


def read_dataset(path, max_seq_length, skip_overlength=False):
  tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
  config = AutoConfig.from_pretrained(base_model_name, trust_remote_code=True)
  with open(path, 'r') as f:
    for record in tqdm(f.readlines()):
      sample_input = json.loads(record)
      feature = preprocessing(tokenizer, config, sample_input, max_seq_length)
      if skip_overlength and len(feature["input_ids"]) > max_seq_length:
        continue
      feature["input_ids"] = feature["input_ids"][:max_seq_length]
      yield feature
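
To make the sequence layout concrete, here is a minimal sketch of the same assembly logic with a toy whitespace tokenizer standing in for the ChatGLM2 tokenizer. The names `toy_encode`, `build_feature`, `keep_or_truncate`, the vocabulary, and `EOS_ID` are hypothetical; only the structure (prompt ids + target ids + EOS, with `seq_len` recording the prompt length) mirrors `preprocessing` and `read_dataset` above.

```python
# Toy stand-in for the tokenizer: one integer id per whitespace-separated word.
# EOS_ID and the vocabulary are hypothetical; the real ids come from ChatGLM2.
EOS_ID = 2
_vocab = {}


def toy_encode(text, max_length):
    """Map each word to a stable integer id and truncate to max_length."""
    ids = [_vocab.setdefault(word, len(_vocab) + 10) for word in text.split()]
    return ids[:max_length]


def build_feature(context, target, max_seq_length):
    """Mirror preprocessing(): prompt ids + target ids + EOS."""
    prompt_ids = toy_encode(context, max_seq_length)
    target_ids = toy_encode(target, max_seq_length)
    return {
        "input_ids": prompt_ids + target_ids + [EOS_ID],
        "seq_len": len(prompt_ids),  # index where the answer starts
    }


def keep_or_truncate(feature, max_seq_length, skip_overlength):
    """Mirror read_dataset(): drop overlong samples or clip them."""
    if skip_overlength and len(feature["input_ids"]) > max_seq_length:
        return None
    clipped = dict(feature)
    clipped["input_ids"] = feature["input_ids"][:max_seq_length]
    return clipped


feature = build_feature(
    "Instruction: classify Input: stocks rally Answer:", "positive", 512
)
print(feature["seq_len"], len(feature["input_ids"]))  # 6 8
```

Note that with `skip_overlength=False` a clipped sample can lose its target and EOS token, which is presumably why the notebook sets `skip_overlength = True`.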

In [12]:
# from huggingface_hub import notebook_login
# notebook_login()

In [13]:
dataset = datasets.Dataset.from_generator(
    lambda: read_dataset(json_lines_path, max_seq_length, skip_overlength)
    )
dataset.save_to_disk(save_path)

Saving the dataset (0/1 shards):   0%|          | 0/19086 [00:00<?, ? examples/s]

In [14]:
from typing import List, Dict, Optional
import torch
from loguru import logger
from transformers import (AutoModel, AutoTokenizer, TrainingArguments, Trainer, BitsAndBytesConfig)
from peft import(
    TaskType,
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    prepare_model_for_int8_training
)
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING



In [15]:
training_args = TrainingArguments(
    output_dir='./finetuned_model',
    logging_steps = 500,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=1000,
    save_steps=500,
    fp16=True,
    torch_compile = False,
    load_best_model_at_end = True,
    evaluation_strategy="steps",
    remove_unused_columns=False
)

In [16]:
# Quantization
q_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_quant_type='nf4',
                                bnb_4bit_use_double_quant=True,
                                bnb_4bit_compute_dtype=torch.float16
                                )

In [17]:
# Load tokenizer & model
# The first run downloads about 12.5 GB of checkpoint shards, so make sure enough disk space is free
print(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
        base_model_name,
        quantization_config=q_config,
        trust_remote_code=True,
        device='cuda'
    )
# The model is loaded in 4-bit, so use the k-bit helper (the int8 variant is deprecated in PEFT)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

THUDM/chatglm2-6b




In [18]:
#implementing LoRA

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_params = 0
    for _, param in model.named_parameters():
      all_params += param.numel()
      if param.requires_grad:
        trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_params} || trainables%: {100 * trainable_params / all_params}"
    )

target_modules = TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING['chatglm']
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=target_modules,
    bias = 'none'
)

model = get_peft_model(model, lora_config)
print_trainable_parameters(model)

trainable params: 1949696 || all params: 3390261248 || trainables%: 0.05750872447219737
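
The trainable-parameter count can be checked by hand. The sketch below assumes the LoRA target for ChatGLM is the fused `query_key_value` projection and assumes the ChatGLM2-6B shapes shown in the comments; neither is stated in this notebook, so treat them as assumptions. LoRA adds two matrices per adapted linear layer, `A` of shape `r x d_in` and `B` of shape `d_out x r`.

```python
def lora_param_count(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted linear layer."""
    return rank * (d_in + d_out)


# Assumed ChatGLM2-6B shapes (not stated in this notebook):
HIDDEN_SIZE = 4096   # model width, input size of query_key_value
QKV_OUT = 4608       # fused output: 4096 for queries + 2 * 256 for keys and values
NUM_LAYERS = 28      # transformer blocks
RANK = 8             # r from lora_config above

per_layer = lora_param_count(HIDDEN_SIZE, QKV_OUT, RANK)
total = per_layer * NUM_LAYERS
print(per_layer, total)  # 69632 1949696
```

Under these assumptions the total matches the 1,949,696 trainable parameters printed above.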


In [19]:
from peft import set_peft_model_state_dict

# Set this to a checkpoint directory to reload previously saved weights
resume_from_checkpoint = None
if resume_from_checkpoint is not None:
    checkpoint_name = os.path.join(resume_from_checkpoint, "pytorch_model.bin")
    if not os.path.exists(checkpoint_name):
        # Fall back to the LoRA adapter weights only; the Trainer state is not restored
        checkpoint_name = os.path.join(resume_from_checkpoint, "adapter_model.bin")
        resume_from_checkpoint = False
    if os.path.exists(checkpoint_name):
        logger.info(f'Restarting from {checkpoint_name}')
        adapters_weights = torch.load(checkpoint_name)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        logger.info(f'Checkpoint {checkpoint_name} not found')

model.print_trainable_parameters()

trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
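
The two reports agree on the trainable parameters but disagree on the total (3,390,261,248 from the hand-written helper versus 6,245,533,696 from PEFT). A likely explanation, stated here as an assumption, is that bitsandbytes packs two 4-bit weights into each stored element and that PEFT's `print_trainable_parameters` compensates by counting such tensors twice, while the helper sums `param.numel()` as stored. The sketch below solves for the split implied by that assumption.

```python
# Totals printed in the two cells above.
naive_total = 3_390_261_248   # custom helper: sums param.numel() as stored
peft_total = 6_245_533_696    # model.print_trainable_parameters()
trainable = 1_949_696

# Assumption: each packed element holds two 4-bit weights, so
#   naive_total = packed_elements + unquantized
#   peft_total  = 2 * packed_elements + unquantized
packed_elements = peft_total - naive_total
unquantized = naive_total - packed_elements
quantized_weights = 2 * packed_elements

print(packed_elements, unquantized, quantized_weights)
print(f"{100 * trainable / peft_total:.4f}%")  # 0.0312%
```

If the assumption holds, the PEFT figure is the one to quote: about 0.03% of the weights are trained.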


In [20]:
dataset = datasets.load_from_disk("../data/dataset")
dataset = dataset.train_test_split(0.2, shuffle=True, seed = 42)
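
With the split in place, the schedule implied by `training_args` can be estimated with a little arithmetic. This is a rough sketch, assuming a single GPU, no samples dropped by `skip_overlength`, a test split that is rounded up, and optimizer steps per epoch computed as batches divided by `gradient_accumulation_steps` (rounded down). The exact numbers used by the `Trainer` may differ slightly.

```python
import math

total_examples = 19_086      # rows saved to disk above
test_fraction = 0.2          # train_test_split(0.2)
per_device_batch = 4         # per_device_train_batch_size
grad_accum = 8               # gradient_accumulation_steps
epochs = 3                   # num_train_epochs
warmup_steps = 1000          # warmup_steps

# Assumption: the test split is rounded up and the rest goes to training.
n_test = math.ceil(total_examples * test_fraction)
n_train = total_examples - n_test

# Assumption: one GPU; optimizer steps per epoch = batches // grad_accum.
batches_per_epoch = math.ceil(n_train / per_device_batch)
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_updates = updates_per_epoch * epochs

print(n_train, n_test)                    # 15268 3818
print(updates_per_epoch, total_updates)   # 477 1431
print(f"warmup share: {warmup_steps / total_updates:.0%}")  # warmup share: 70%
```

Under these assumptions the effective batch size is 32 and warmup covers roughly 70% of the run, so the learning rate reaches its peak of `1e-4` only late in training. A smaller `warmup_steps` may be worth trying.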

In [26]:
class ModifiedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        return model(
            input_ids=inputs["input_ids"],
            labels=inputs["labels"],
        ).loss

    def prediction_step(self, model: torch.nn.Module, inputs, prediction_loss_only: bool, ignore_keys = None):
        with torch.no_grad():
            res = model(
                input_ids=inputs["input_ids"].to(model.device),
                labels=inputs["labels"].to(model.device),
            ).loss
        return (res, None, None)

    def save_model(self, output_dir=None, _internal_call=False):
        from transformers.trainer import TRAINING_ARGS_NAME

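        # Fall back to the configured directory when the Trainer passes no path
        output_dir = output_dir or self.args.output_dir
        os.makedirs(output_dir, exist_ok=True)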
        torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
        saved_params = {
            k: v.to("cpu") for k, v in self.model.named_parameters() if v.requires_grad
        }
        torch.save(saved_params, os.path.join(output_dir, "adapter_model.bin"))

def data_collator(features: list) -> dict:
    len_ids = [len(feature["input_ids"]) for feature in features]
    longest = max(len_ids)
    input_ids = []
    labels_list = []
    for ids_l, feature in sorted(zip(len_ids, features), key=lambda x: -x[0]):
        ids = feature["input_ids"]
        seq_len = feature["seq_len"]
        labels = (
            [tokenizer.pad_token_id] * (seq_len - 1) + ids[(seq_len - 1) :] + [tokenizer.pad_token_id] * (longest - ids_l)
        )
        ids = ids + [tokenizer.pad_token_id] * (longest - ids_l)
        _ids = torch.LongTensor(ids)
        labels_list.append(torch.LongTensor(labels))
        input_ids.append(_ids)
    input_ids = torch.stack(input_ids)
    labels = torch.stack(labels_list)
    return {
        "input_ids": input_ids,
        "labels": labels,
    }
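
The collator above fills the prompt and padding positions of `labels` with `tokenizer.pad_token_id`. Whether those positions are excluded from the loss depends on the model's loss function: PyTorch's `CrossEntropyLoss` ignores the index `-100` by default, so a pad id is only skipped if the model is configured that way. The variant below is a sketch of the `-100` convention, not what this notebook ran; `build_labels` and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(ids, seq_len, longest, ignore_index=IGNORE_INDEX):
    """Keep answer tokens as labels; mask prompt and padding positions."""
    prompt_mask = [ignore_index] * (seq_len - 1)
    answer = ids[seq_len - 1:]
    padding_mask = [ignore_index] * (longest - len(ids))
    return prompt_mask + answer + padding_mask


# Hypothetical token ids: 4 prompt tokens, 1 answer token, EOS = 2.
ids = [11, 12, 13, 14, 99, 2]
labels = build_labels(ids, seq_len=4, longest=8)
print(labels)  # [-100, -100, -100, 14, 99, 2, -100, -100]
```

The slicing follows the collator above, including its `seq_len - 1` offset, so the last prompt token stays in the labels exactly as in the original.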

In [27]:
from torch.utils.tensorboard import SummaryWriter
from transformers.integrations import TensorBoardCallback

#training begins
writer = SummaryWriter()
trainer = ModifiedTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    callbacks=[TensorBoardCallback(writer)],
)
trainer.train()
writer.close()

You are adding a <class 'transformers.integrations.TensorBoardCallback'> to the callbacks of this Trainer, but there is already one. The currentlist of callbacks is
:DefaultFlowCallback
TensorBoardCallback


Step   Training Loss   Validation Loss
 500   5.8952          5.624586
1000   5.5918          5.598761



In [29]:
#saving the model
model.save_pretrained(training_args.output_dir)

In [30]:
#downloading the model
!zip -r /content/saved_model.zip /content/{training_args.output_dir}

  adding: content/./finetuned_model/ (stored 0%)
  adding: content/./finetuned_model/checkpoint-500/ (stored 0%)
  adding: content/./finetuned_model/checkpoint-500/adapter_model.bin (deflated 8%)
  adding: content/./finetuned_model/checkpoint-500/rng_state.pth (deflated 25%)
  adding: content/./finetuned_model/checkpoint-500/trainer_state.json (deflated 52%)
  adding: content/./finetuned_model/checkpoint-500/optimizer.pt (deflated 8%)
  adding: content/./finetuned_model/checkpoint-500/training_args.bin (deflated 51%)
  adding: content/./finetuned_model/checkpoint-500/scheduler.pt (deflated 56%)
  adding: content/./finetuned_model/adapter_model.safetensors (deflated 7%)
  adding: content/./finetuned_model/checkpoint-1000/ (stored 0%)
  adding: content/./finetuned_model/checkpoint-1000/adapter_model.bin (deflated 8%)
  adding: content/./finetuned_model/checkpoint-1000/rng_state.pth (deflated 25%)
  adding: content/./finetuned_model/checkpoint-1000/trainer_state.json (deflated 62%)
  addi

In [31]:
#download to local
from google.colab import files
files.download("/content/saved_model.zip")


In [33]:
# save to google drive
from google.colab import drive
drive.mount('/content/drive')


# save the finetuned model to google drive
!cp -r "/content/finetuned_model" "/content/drive/MyDrive"

Mounted at /content/drive


In [34]:
def get_folder_size(folder_path):
    total_size = 0
    for dirpath, _, filenames in os.walk(folder_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size / 1024 / 1024  # Size in MB

model_size = get_folder_size(training_args.output_dir)
print(f"Model size: {model_size} MB")

Model size: 52.23689937591553 MB


# Inference

In [28]:
#clone the FinNLP repository
!git clone https://github.com/AI4Finance-Foundation/FinNLP.git

import sys
sys.path.append('/content/FinNLP/')

Cloning into 'FinNLP'...
remote: Enumerating objects: 1403, done.
remote: Counting objects: 100% (462/462), done.
remote: Compressing objects: 100% (197/197), done.
remote: Total 1403 (delta 240), reused 431 (delta 223), pack-reused 941
Receiving objects: 100% (1403/1403), 4.97 MiB | 17.19 MiB/s, done.
Resolving deltas: 100% (613/613), done.


In [35]:
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM

from peft import PeftModel
import torch

# Load benchmark datasets from FinNLP
from finnlp.benchmarks.fpb import test_fpb
from finnlp.benchmarks.fiqa import test_fiqa , add_instructions
from finnlp.benchmarks.tfns import test_tfns
from finnlp.benchmarks.nwgi import test_nwgi



In [36]:
# load model from google drive
from google.colab import drive
drive.mount('/content/drive')

# Define the path you want to check
path_to_check = "/content/drive/My Drive/finetuned_model"

# Check if the specified path exists
if os.path.exists(path_to_check):
    print("Path exists.")
else:
    print("Path does not exist.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Path exists.


In [37]:
## load the chatglm2-6b base model and attach the LoRA adapter from the local output directory
base_model = "THUDM/chatglm2-6b"
peft_model = training_args.output_dir

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model, trust_remote_code=True, load_in_8bit=True, device_map="auto")

model = PeftModel.from_pretrained(model, peft_model)

model = model.eval()

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

In [38]:
## load our finetuned model: the same base model with the adapter saved on Google Drive
base_model = "THUDM/chatglm2-6b"
peft_model = "/content/drive/My Drive/finetuned_model"

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model, trust_remote_code=True, load_in_8bit=True, device_map="auto")

model = PeftModel.from_pretrained(model, peft_model)

model = model.eval()

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

In [39]:
batch_size = 8

# TFNS Test Set, len 2388
res = test_tfns(model, tokenizer, batch_size = batch_size)

# FPB, len 1212
res = test_fpb(model, tokenizer, batch_size = batch_size)

# FiQA, len 275
res = test_fiqa(model, tokenizer, prompt_fun = add_instructions, batch_size = batch_size)

# NWGI, len 4047
res = test_nwgi(model, tokenizer, batch_size = batch_size)



Prompt example:
Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.
Input: $ALLY - Ally Financial pulls outlook https://t.co/G9Zdi1boy5
Answer: 


Total len: 2388. Batchsize: 8. Total steps: 299


100%|██████████| 299/299 [01:49<00:00,  2.73it/s]


Acc: 0.8898659966499163. F1 macro: 0.8577263714552185. F1 micro: 0.8898659966499163. F1 weighted (BloombergGPT): 0.8894205791519864. 




Prompt example:
Instruction: What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.
Input: L&T has also made a commitment to redeem the remaining shares by the end of 2011 .
Answer: 


Total len: 1212. Batchsize: 8. Total steps: 152


100%|██████████| 152/152 [00:55<00:00,  2.72it/s]


Acc: 0.7896039603960396. F1 macro: 0.7462840756920132. F1 micro: 0.7896039603960396. F1 weighted (BloombergGPT): 0.7746490397870223. 




Prompt example:
Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.
Input: This $BBBY stock options trade would have more than doubled your money https://t.co/Oa0loiRIJL via @TheStreet
Answer: 


Total len: 275. Batchsize: 8. Total steps: 35


100%|██████████| 35/35 [00:13<00:00,  2.56it/s]


Acc: 0.6218181818181818. F1 macro: 0.581089301965505. F1 micro: 0.6218181818181818. F1 weighted (BloombergGPT): 0.6889865439351002. 




Prompt example:
Instruction: What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.
Input: In the latest trading session, Adobe Systems (ADBE) closed at $535.98, marking a +0.31% move from the previous day.
Answer: 


Total len: 4047. Batchsize: 8. Total steps: 506


100%|██████████| 506/506 [03:12<00:00,  2.62it/s]

Acc: 0.5730170496664195. F1 macro: 0.5744357964784231. F1 micro: 0.5730170496664195. F1 weighted (BloombergGPT): 0.5701334855650697. 
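
In every result above, accuracy and micro F1 are identical. That is expected for single-label classification, where each wrong prediction counts as one false positive and one false negative. The sketch below shows how the four reported numbers relate, assuming the benchmark follows the usual scikit-learn definitions; `f1_report` and the sample labels are hypothetical and not taken from FinNLP.

```python
from collections import Counter


def f1_report(y_true, y_pred):
    """Return accuracy and macro/micro/weighted F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    n = len(y_true)
    return {
        "acc": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        "f1_macro": sum(per_class.values()) / len(labels),
        "f1_micro": 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum),
        "f1_weighted": sum(per_class[l] * support[l] for l in labels) / n,
    }


# Hypothetical labels, not taken from any benchmark above.
y_true = ["positive", "neutral", "neutral", "negative", "neutral", "positive"]
y_pred = ["positive", "neutral", "positive", "negative", "neutral", "neutral"]
report = f1_report(y_true, y_pred)
print(report)
```

Macro F1 weights every class equally, so it drops below accuracy when a small class is predicted poorly, while weighted F1 scales each class by its number of true samples.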



