Financial Large Language Models (FinLLMs) with **FinGPT** using LoRA (Low-Rank Adaptation) base model :: **LLaMA-7B**

FinGPT v2

# Importing required packages


In [1]:
! pip install datasets torch
! pip install tqdm
! pip install protobuf transformers==4.27.1 cpm_kernels torch>=2.0 gradio mdtex2html sentencepiece accelerate
! pip install huggingface_hub
! pip install loguru peft
! pip install accelerate -U
! pip install bitsandbytes



In [2]:
from typing import List, Dict, Optional

import datasets
import torch
from loguru import logger
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)
from peft import (
    TaskType,
    LoraConfig,
    get_peft_model,
    set_peft_model_state_dict,
    prepare_model_for_kbit_training,
)
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING

# Hugging Face login
(required to access the model)


In [3]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

# Loading Data

In [None]:
import os
import shutil

json_lines_path = "../data/dataset.jsonl"
save_path = "../data/dataset"

if os.path.exists(json_lines_path):
  os.remove(json_lines_path)

if os.path.exists(save_path):
  os.remove(save_path)

data_directory = "../data"

if not os.path.exists(data_directory):
  os.makedirs(data_directory)

In [5]:
from datasets import load_dataset
import datasets

output_sentiment_dic = {
    0: "negative",
    1: "positive",
    2: "neutral",
}

#loading Twitter Financial News Sentiment (tfns) dataset
tfns = load_dataset('zeroshot/twitter-financial-news-sentiment')
tfns = tfns['train']
tfns = tfns.to_pandas()
tfns['label'] = tfns['label'].apply(lambda x:output_sentiment_dic[x])
tfns['instruction'] = 'What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.'
tfns.columns = ['input', 'output', 'instruction']
tfns = datasets.Dataset.from_pandas(tfns)
tfns

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/859k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/217k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['input', 'output', 'instruction'],
    num_rows: 9543
})

In [6]:
tmp_dataset = datasets.concatenate_datasets([tfns]*2)
train_dataset = tmp_dataset
print(tmp_dataset.num_rows)

full_dataset = train_dataset.shuffle(seed = 42)
full_dataset.shape

19086


(19086, 3)

# Making Dataset


In [7]:
import json
from tqdm.notebook import tqdm

In [8]:
def format_sample_input(sample_input: dict) -> dict:
  context = f"Instruction: {sample_input['instruction']}\n"
  if(sample_input.get("input")):
    context += f"Input: {sample_input['input']}\n"
  context += "Answer: "
  target = sample_input["output"]
  return {"context": context, "target": target}

In [9]:
data_list = []
for item in full_dataset.to_pandas().itertuples():
  tmp = {}
  tmp["instruction"] = item.instruction
  tmp["input"] = item.input
  tmp["output"] = item.output
  data_list.append(tmp)

In [10]:
with open (json_lines_path,'w') as file:
  for sample_input in tqdm(data_list, desc = "formatting in progress.."):
    file.write(json.dumps(format_sample_input(sample_input)) + '\n')

formatting in progress..:   0%|          | 0/19086 [00:00<?, ?it/s]

In [11]:
#Printing sample data
numOfRecords = 10
with open(json_lines_path) as f:
    for i in range(0, numOfRecords):
        print(f.readline(), end = '')

{"context": "Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.\nInput: $DRIP $LABU $GASX - SOXL, LABU, JO and GUSH among weekly ETF movers https://t.co/FntrWNY9sn\nAnswer: ", "target": "neutral"}
{"context": "Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.\nInput: From Farms to Silicon Valley, U.S. Businesses Stand to Gain From USMCA -- Update #economy #MarketScreener\u2026 https://t.co/fpxwDcClZh\nAnswer: ", "target": "neutral"}
{"context": "Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.\nInput: Broadridge acquires Clearstructure Financial Technology\nAnswer: ", "target": "neutral"}
{"context": "Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.\nInput: Ted Baker Chairman David Bernstein and CEO Lindsay Page are quitting after the British retailer slash

# Tokenizing

In [12]:
from transformers import AutoTokenizer, AutoConfig

base_model_name = 'meta-llama/Llama-2-7b-chat-hf'
max_seq_length = 512
skip_overlength = True

def preprocessing(tokenizer, config, sample_input, max_seq_length):
  prompt = sample_input["context"]
  target = sample_input["target"]
  prompt_ids = tokenizer.encode(
      prompt,
      max_length=max_seq_length,
      truncation=True
  )
  target_ids = tokenizer.encode(
      target,
      max_length=max_seq_length,
      truncation=True,
      add_special_tokens=False
  )
  input_ids = prompt_ids + target_ids + [config.eos_token_id]
  return {"input_ids": input_ids, "seq_len": len(prompt_ids)}


def read_dataset(path, max_seq_length, skip_overlength=False):
  tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
  config = AutoConfig.from_pretrained(base_model_name, trust_remote_code=True, device_map='auto')
  with open(path, 'r') as f:
    for record in tqdm(f.readlines()):
      sample_input = json.loads(record)
      feature = preprocessing(tokenizer, config, sample_input, max_seq_length)
      if skip_overlength and len(feature["input_ids"]) > max_seq_length:
        continue
      feature["input_ids"] = feature["input_ids"][:max_seq_length]
      yield feature

In [13]:
dataset = datasets.Dataset.from_generator(
    lambda: read_dataset(json_lines_path, max_seq_length, skip_overlength)
    )
dataset.save_to_disk(save_path)

Generating train split: 0 examples [00:00, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

  0%|          | 0/19086 [00:00<?, ?it/s]

Saving the dataset (0/1 shards):   0%|          | 0/19086 [00:00<?, ? examples/s]

# Training

In [14]:
training_args = TrainingArguments(
    output_dir='./finetuned_model',
    logging_steps = 500,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=1000,
    save_steps=500,
    fp16=True,
    torch_compile = False,
    load_best_model_at_end = True,
    evaluation_strategy="steps",
    remove_unused_columns=False
)

In [15]:
# Load tokenizer & model
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
model =  AutoModelForCausalLM.from_pretrained(
        model_name,
        load_in_8bit = True,
        trust_remote_code=True,
        device_map='auto',
    )
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [16]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [17]:
# LoRA
target_modules = TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING['llama']
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=target_modules,
    bias='none',
)
model = get_peft_model(model, lora_config)
print_trainable_parameters(model)

trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199


In [18]:
resume_from_checkpoint = None
if resume_from_checkpoint is not None:
    checkpoint_name = os.path.join(resume_from_checkpoint, 'pytorch_model.bin')
    if not os.path.exists(checkpoint_name):
        checkpoint_name = os.path.join(
            resume_from_checkpoint, 'adapter_model.bin'
        )
        resume_from_checkpoint = False
    if os.path.exists(checkpoint_name):
        logger.info(f'Restarting from {checkpoint_name}')
        adapters_weights = torch.load(checkpoint_name)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        logger.info(f'Checkpoint {checkpoint_name} not found')

In [19]:
model.print_trainable_parameters()

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199


In [20]:
# load data
dataset = datasets.load_from_disk("../data/dataset")
dataset = dataset.train_test_split(0.2, shuffle=True, seed = 42)

In [21]:
class ModifiedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        return model(
            input_ids=inputs["input_ids"],
            labels=inputs["labels"],
        ).loss

    def prediction_step(self, model: torch.nn.Module, inputs, prediction_loss_only: bool, ignore_keys = None):
        with torch.no_grad():
            res = model(
                input_ids=inputs["input_ids"].to(model.device),
                labels=inputs["labels"].to(model.device),
            ).loss
        return (res, None, None)

    def save_model(self, output_dir=None, _internal_call=False):
        from transformers.trainer import TRAINING_ARGS_NAME

        os.makedirs(output_dir, exist_ok=True)
        torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
        saved_params = {
            k: v.to("cpu") for k, v in self.model.named_parameters() if v.requires_grad
        }
        torch.save(saved_params, os.path.join(output_dir, "adapter_model.bin"))

def data_collator(features: list) -> dict:
    len_ids = [len(feature["input_ids"]) for feature in features]
    longest = max(len_ids)
    input_ids = []
    labels_list = []
    for ids_l, feature in sorted(zip(len_ids, features), key=lambda x: -x[0]):
        ids = feature["input_ids"]
        seq_len = feature["seq_len"]
        labels = (
            [tokenizer.pad_token_id] * (seq_len - 1) + ids[(seq_len - 1) :] + [tokenizer.pad_token_id] * (longest - ids_l)
        )
        ids = ids + [tokenizer.pad_token_id] * (longest - ids_l)
        _ids = torch.LongTensor(ids)
        labels_list.append(torch.LongTensor(labels))
        input_ids.append(_ids)
    input_ids = torch.stack(input_ids)
    labels = torch.stack(labels_list)
    return {
        "input_ids": input_ids,
        "labels": labels,
    }

In [22]:
from torch.utils.tensorboard import SummaryWriter
from transformers.integrations import TensorBoardCallback


In [23]:
# Train
writer = SummaryWriter()
trainer = ModifiedTrainer(
    model=model,
    args=training_args,             # Trainer args
    train_dataset=dataset["train"], # Training set
    eval_dataset=dataset["test"],   # Testing set
    data_collator=data_collator,    # Data Collator
    callbacks=[TensorBoardCallback(writer)],
)
trainer.train()
writer.close()
# save model
model.save_pretrained(training_args.output_dir)

You are adding a <class 'transformers.integrations.TensorBoardCallback'> to the callbacks of this Trainer, but there is already one. The currentlist of callbacks is
:DefaultFlowCallback
TensorBoardCallback
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss,Validation Loss
500,1.9925,0.004259




In [24]:
#downloading the model
!zip -r /content/saved_model.zip /content/{training_args.output_dir}

  adding: content/./finetuned_model/ (stored 0%)
  adding: content/./finetuned_model/README.md (deflated 66%)
  adding: content/./finetuned_model/runs/ (stored 0%)
  adding: content/./finetuned_model/runs/Mar22_20-47-32_d30db01e519a/ (stored 0%)
  adding: content/./finetuned_model/runs/Mar22_20-47-32_d30db01e519a/events.out.tfevents.1711140559.d30db01e519a.1143.1 (deflated 57%)
  adding: content/./finetuned_model/checkpoint-500/ (stored 0%)
  adding: content/./finetuned_model/checkpoint-500/optimizer.pt (deflated 8%)
  adding: content/./finetuned_model/checkpoint-500/trainer_state.json (deflated 53%)
  adding: content/./finetuned_model/checkpoint-500/adapter_model.bin (deflated 8%)
  adding: content/./finetuned_model/checkpoint-500/scheduler.pt (deflated 56%)
  adding: content/./finetuned_model/checkpoint-500/training_args.bin (deflated 51%)
  adding: content/./finetuned_model/checkpoint-500/rng_state.pth (deflated 25%)
  adding: content/./finetuned_model/adapter_config.json (deflated 

In [25]:
#download to local
from google.colab import files
files.download("/content/saved_model.zip")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [32]:
# save to google drive
from google.colab import drive
drive.mount('/content/drive')


# save the finetuned model to google drive
!cp -r "/content/finetuned_model" "/content/drive/MyDrive"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [28]:
def get_folder_size(folder_path):
    total_size = 0
    for dirpath, _, filenames in os.walk(folder_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size / 1024 / 1024  # Size in MB

model_size = get_folder_size(training_args.output_dir)
print(f"Model size: {model_size} MB")

Model size: 64.16181945800781 MB


# Inference

In [18]:
#clone the FinNLP repository
!git clone https://github.com/AI4Finance-Foundation/FinNLP.git

import sys
sys.path.append('/content/FinNLP/')

fatal: destination path 'FinNLP' already exists and is not an empty directory.


In [19]:
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM

from peft import PeftModel
import torch

# Load benchmark datasets from FinNLP
from finnlp.benchmarks.fpb import test_fpb
from finnlp.benchmarks.fiqa import test_fiqa , add_instructions
from finnlp.benchmarks.tfns import test_tfns
from finnlp.benchmarks.nwgi import test_nwgi



In [4]:
# load model from google drive
from google.colab import drive
drive.mount('/content/drive')

# Define the path you want to check
path_to_check = "/content/drive/My Drive/finetuned_model"

# Check if the specified path exists
if os.path.exists(path_to_check):
    print("Path exists.")
else:
    print("Path does not exist.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Path exists.


In [1]:
!pip install git+https://github.com/huggingface/transformers
from transformers import LlamaForCausalLM

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-6v22sbt2
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-6v22sbt2
  Resolved https://github.com/huggingface/transformers to commit c5f0288bc7d76f65996586f79f69fba8867a0e67
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


In [10]:
from peft import (
    TaskType,
    LoraConfig,
    get_peft_model,
    set_peft_model_state_dict,
    prepare_model_for_kbit_training,
)
from peft import PeftModel


In [14]:
import torch

In [15]:
## load our finetuned model
base_model = "meta-llama/Llama-2-7b-chat-hf"
peft_model = "/content/drive/My Drive/finetuned_model"

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
model = LlamaForCausalLM.from_pretrained(base_model, trust_remote_code=True, device_map = "auto", load_in_8bit = True,)
model = PeftModel.from_pretrained(model, peft_model)
model = torch.compile(model)  # Please comment this line if your platform does not support torch.compile
model = model.eval()

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [21]:
# Rerun the code
batch_size = 8

# FPB
res = test_fpb(model, tokenizer, batch_size = batch_size)

# TFNS Test Set, len 2388
res = test_tfns(model, tokenizer, batch_size=batch_size)

# FiQA, len 275
res = test_fiqa(model, tokenizer, prompt_fun=add_instructions, batch_size=batch_size)

# NWGI, len 4047
res = test_nwgi(model, tokenizer, batch_size=batch_size)



Prompt example:
Instruction: What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.
Input: L&T has also made a commitment to redeem the remaining shares by the end of 2011 .
Answer: 


Total len: 1212. Batchsize: 8. Total steps: 152


100%|██████████| 152/152 [01:29<00:00,  1.69it/s]


Acc: 0.6435643564356436. F1 macro: 0.41017286629293465. F1 micro: 0.6435643564356436. F1 weighted (BloombergGPT): 0.5456706604834383. 


Prompt example:
Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.
Input: $ALLY - Ally Financial pulls outlook https://t.co/G9Zdi1boy5
Answer: 


Total len: 2388. Batchsize: 8. Total steps: 299


100%|██████████| 299/299 [02:55<00:00,  1.70it/s]


Acc: 0.6842546063651591. F1 macro: 0.39793310156518685. F1 micro: 0.6842546063651591. F1 weighted (BloombergGPT): 0.5939952516222438. 


Downloading data:   0%|          | 0.00/161k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/16.7k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/25.3k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]



Prompt example:
Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.
Input: This $BBBY stock options trade would have more than doubled your money https://t.co/Oa0loiRIJL via @TheStreet
Answer: 


Total len: 275. Batchsize: 8. Total steps: 35


100%|██████████| 35/35 [00:21<00:00,  1.64it/s]


Acc: 0.17454545454545456. F1 macro: 0.18200937125422248. F1 micro: 0.17454545454545456. F1 weighted (BloombergGPT): 0.19358602038693573. 


Downloading readme:   0%|          | 0.00/682 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/10.8M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.73M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/16184 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/4047 [00:00<?, ? examples/s]



Prompt example:
Instruction: What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.
Input: In the latest trading session, Adobe Systems (ADBE) closed at $535.98, marking a +0.31% move from the previous day.
Answer: 


Total len: 4047. Batchsize: 8. Total steps: 506


100%|██████████| 506/506 [05:03<00:00,  1.67it/s]

Acc: 0.44180874722016306. F1 macro: 0.3035030501637554. F1 micro: 0.44180874722016306. F1 weighted (BloombergGPT): 0.3354436329182804. 



