# Train Llama 3 with LoRA
[Source of this notebook](https://www.philschmid.de/fsdp-qlora-llama3#3-fine-tune-the-llm-with-pytorch-fsdp-q-lora-and-sdpa)

This notebook is designed to run in a Kaggle notebook environment with two NVIDIA T4 GPUs.

### Environment setup
- Set your `HF_TOKEN` at `Add-ons -> Secrets`

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
HF_TOKEN = user_secrets.get_secret("HF_TOKEN")
# HF_TOKEN_WRITE = user_secrets.get_secret("HF_TOKEN_WRITE")

!huggingface-cli login --token $HF_TOKEN

In [3]:
# Install PyTorch for FSDP and FA/SDPA
%pip install "torch==2.2.2" tensorboard
 
# Install Hugging Face libraries
%pip install  --upgrade "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"

Collecting torch==2.2.2
  Downloading torch-2.2.2-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.2.2)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.2.2)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.2.2)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.2.2)
  Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.2.2)
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.2.2)
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylin

### Load and prepare the dataset

In [4]:
from datasets import load_dataset, DatasetDict
 
# Convert dataset to OAI messages
system_message = """You are a seasoned stock market analyst. What is the summary of this financial text"""

def create_conversation(sample):
#     return {
#         "messages": [
#             {"role": "system", "content": system_message},
#             {"role": "user", "content": sample["document"]},
#             {"role": "assistant", "content": sample["summary"]}
#         ]
#     }
    return {
        "messages": [
            {"role": "system", "content": sample["Instruction"]},
            {"role": "user", "content": sample["Input"]},
            {"role": "assistant", "content": sample["Output"]}
        ]
    }
 
# Load dataset from the hub
dataset = load_dataset("ECS289L/Stocksense-Prediction-Current-Week-6stock-llama3", split="train")
# print(dataset)
# dataset = dataset.select(range(0, 50))
 
# Convert dataset to OAI messages
dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)
# hold out 100 samples for testing; the remaining samples are used for training
dataset = dataset.train_test_split(test_size=100, seed=42)

print(dataset["train"][123]["messages"])

# save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records", force_ascii=False)
dataset["test"].to_json("test_dataset.json", orient="records", force_ascii=False)

Downloading data: 100%|██████████| 1.97M/1.97M [00:00<00:00, 9.05MB/s]


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/2083 [00:00<?, ? examples/s]

Creating json from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

105366
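To make the conversion step concrete, here is a minimal sketch that applies the same mapping to a single hand-written record. The field values are invented for illustration; only the column names `Instruction`, `Input`, and `Output` come from the dataset used above.

```python
# Same mapping as create_conversation above, applied to one made-up record.
def create_conversation(sample):
    return {
        "messages": [
            {"role": "system", "content": sample["Instruction"]},
            {"role": "user", "content": sample["Input"]},
            {"role": "assistant", "content": sample["Output"]},
        ]
    }

# Hypothetical record; the real rows come from the Hub dataset.
example = {
    "Instruction": "Predict this week's stock price movement.",
    "Input": "Positive headlines: ... Negative headlines: ...",
    "Output": "increased in 1.2%",
}

conversation = create_conversation(example)
roles = [message["role"] for message in conversation["messages"]]
print(roles)
```

Each row ends up as one three-turn conversation, which is the shape the chat template in the training script expects.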

### Load base model and setup training parameters

In [5]:
from datasets import load_dataset

# Load jsonl data from disk
dataset = load_dataset("json", data_files="train_dataset.json", split="train")

Generating train split: 0 examples [00:00, ? examples/s]

In [6]:
%%writefile llama_3_8b_fsdp_qlora.yaml
# script parameters
model_id: "ECS289L/Stocksense-Plus-Full" # Hugging Face model id
dataset_path: "."                      # path to dataset
max_seq_length: 4096 # 2048            # max sequence length for model and packing of the dataset (key must match the ScriptArguments field name)
# training parameters
output_dir: "/home/jupyter/llama-3-8b-FinGPT" # Temporary output directory for model checkpoints
report_to: "tensorboard"               # report metrics to tensorboard
learning_rate: 0.0003                  # learning rate (3e-4)
lr_scheduler_type: "constant"          # learning rate scheduler
num_train_epochs: 2                    # number of training epochs
per_device_train_batch_size: 1         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps before performing a backward/update pass
optim: adamw_torch                     # use torch adamw optimizer
logging_steps: 10                      # log every 10 steps
save_strategy: epoch                   # save checkpoint every epoch
evaluation_strategy: epoch             # evaluate every epoch
max_grad_norm: 0.3                     # max gradient norm
warmup_ratio: 0.03                     # warmup ratio
bf16: false                            # bfloat16 mixed precision disabled (T4 GPUs lack native bf16 support)
tf32: false                            # tf32 disabled (requires Ampere or newer GPUs)
gradient_checkpointing: true           # use gradient checkpointing to save memory
hub_private_repo: true
# FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
fsdp: "full_shard auto_wrap" # add "offload" if GPU memory is insufficient
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"

Writing llama_3_8b_fsdp_qlora.yaml
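As a quick sanity check on the batch settings, the sketch below computes the effective batch size per optimizer step. It assumes two GPUs, matching the Kaggle 2x T4 setup this notebook targets; the other values are copied from the YAML above.

```python
# Values copied from the YAML config above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
num_gpus = 2  # assumption: one process per T4 GPU

# Number of sequences that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 4
```

Because the training script enables packing, each of these sequences is a packed block of tokens rather than a single conversation.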


In [7]:
%%writefile run_fsdp_qlora.py
import logging
from dataclasses import dataclass, field
import os
import random
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    set_seed,
)
from trl import SFTTrainer
from trl.commands.cli_utils import TrlParser
from peft import LoraConfig

# Uncomment the line below to use the Llama 3 instruct template, and make sure to add modules_to_save to the LoRA config
# LLAMA_3_CHAT_TEMPLATE="{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"

# Anthropic/Vicuna-like template that does not need special tokens
LLAMA_3_CHAT_TEMPLATE = (
    "{% for message in messages %}"
        "{% if message['role'] == 'system' %}"
            "{{ message['content'] }}"
        "{% elif message['role'] == 'user' %}"
            "{{ '\n\nHuman: ' + message['content'] +  eos_token }}"
        "{% elif message['role'] == 'assistant' %}"
            "{{ '\n\nAssistant: '  + message['content'] +  eos_token  }}"
        "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}"
    "{{ '\n\nAssistant: ' }}"
    "{% endif %}"
)

# LLAMA_3_CHAT_TEMPLATE = (
# "{% set ns = namespace(found=false) %}"
#   "{% for message in messages %}"
#       "{% if message['role'] == 'system' %}"
#           "{% set ns.found = true %}"
#       "{% endif %}"
#   "{% endfor %}"
#   "{% if not ns.found %}"
#       "{{ '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n' + 'You are a helpful assistant' + '<|eot_id|>' }}"
#   "{% endif %}"
#   "{% for message in messages %}"
#       "{% if message['role'] == 'system' %}"
#           "{{ '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' }}"
#       "{% else %}"
#           "{% if message['role'] == 'user' %}"
#               "{{ '<|start_header_id|>user<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>'}}"
#           "{% else %}"
#               "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' + message['content'] + '<|eot_id|>' }}"
#           "{% endif %}"
#       "{% endif %}"
#   "{% endfor %}"
#   "{% if add_generation_prompt %}"
#       "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
#   "{% endif %}"
# )

# ACCELERATE_USE_FSDP=1 FSDP_CPU_RAM_EFFICIENT_LOADING=1 torchrun --nproc_per_node=4 ./scripts/run_fsdp_qlora.py --config llama_3_70b_fsdp_qlora.yaml

@dataclass
class ScriptArguments:
    dataset_path: str = field(
        default=None,
        metadata={
            "help": "Path to the dataset"
        },
    )
    model_id: str = field(
        default=None, metadata={"help": "Model ID to use for SFT training"}
    )
    max_seq_length: int = field(
        default=512, metadata={"help": "The maximum sequence length for SFT Trainer"}
    )


def training_function(script_args, training_args):
    ################
    # Dataset
    ################
    
    train_dataset = load_dataset(
        "json",
        data_files=os.path.join(script_args.dataset_path, "train_dataset.json"),
        split="train",
    )
    test_dataset = load_dataset(
        "json",
        data_files=os.path.join(script_args.dataset_path, "test_dataset.json"),
        split="train",
    )

    ################
    # Model & Tokenizer
    ################

    # Tokenizer        
    tokenizer = AutoTokenizer.from_pretrained(script_args.model_id, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.chat_template = LLAMA_3_CHAT_TEMPLATE
    
    # template dataset
    def template_dataset(examples):
        return {"text": tokenizer.apply_chat_template(examples["messages"], tokenize=False)}
    
    train_dataset = train_dataset.map(template_dataset, remove_columns=["messages"])
    test_dataset = test_dataset.map(template_dataset, remove_columns=["messages"])
    
    # print random sample
    with training_args.main_process_first(
        desc="Log a few random samples from the processed training set"
    ):
        for index in random.sample(range(len(train_dataset)), 2):
            print(train_dataset[index]["text"])

    # Model    
    torch_dtype = torch.bfloat16
    quant_storage_dtype = torch.bfloat16

    quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch_dtype,
            bnb_4bit_quant_storage=quant_storage_dtype,
        )

    model = AutoModelForCausalLM.from_pretrained(
        script_args.model_id,
        quantization_config=quantization_config,
        attn_implementation="sdpa", # use sdpa, alternatively use "flash_attention_2"
        torch_dtype=quant_storage_dtype,
        use_cache=False if training_args.gradient_checkpointing else True,  # this is needed for gradient checkpointing
    )
    
    if training_args.gradient_checkpointing:
        model.gradient_checkpointing_enable()

    ################
    # PEFT
    ################

    # LoRA config based on QLoRA paper & Sebastian Raschka experiment
    peft_config = LoraConfig(
        lora_alpha=8,
        lora_dropout=0.05,
        r=16,
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM",
#         modules_to_save = ["lm_head", "embed_tokens"] # add if you want to use the Llama 3 instruct template
    )

    ################
    # Training
    ################
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        dataset_text_field="text",
        eval_dataset=test_dataset,
        peft_config=peft_config,
        max_seq_length=script_args.max_seq_length,
        tokenizer=tokenizer,
        packing=True,
        dataset_kwargs={
            "add_special_tokens": False,  # We template with special tokens
            "append_concat_token": False,  # No need to add additional separator token
        },
    )
    if trainer.accelerator.is_main_process:
        trainer.model.print_trainable_parameters()

    ##########################
    # Train model
    ##########################
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    trainer.train(resume_from_checkpoint=checkpoint)

    ##########################
    # Save model
    ##########################
    if trainer.is_fsdp_enabled:
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
    trainer.save_model()
    
if __name__ == "__main__":
    parser = TrlParser((ScriptArguments, TrainingArguments))
    script_args, training_args = parser.parse_args_and_config()    
    
    # set use_reentrant explicitly for gradient checkpointing
    if training_args.gradient_checkpointing:
        training_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
    # set seed
    set_seed(training_args.seed)
  
    # launch training
    training_function(script_args, training_args)


Writing run_fsdp_qlora.py
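To get a feel for what `r=16` and `lora_alpha=8` mean in the `LoraConfig` above, here is a small arithmetic sketch. It counts the parameters one LoRA adapter adds to a single linear layer and computes the standard `lora_alpha / r` scaling factor. The 4096 x 4096 layer shape is an assumption chosen for illustration, not something read from the model.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

r, lora_alpha = 16, 8      # values from LoraConfig above
d_in = d_out = 4096        # assumed layer shape, for illustration only

added = lora_trainable_params(d_in, d_out, r)
full = d_in * d_out
scaling = lora_alpha / r   # factor applied to the low-rank update

print(added, full, round(100 * added / full, 2), scaling)
```

Under these assumptions the adapter trains well under 1% of the weights of that layer, and the low-rank update is scaled by 0.5 before being added to the frozen weights.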


### Train Model

##### Release unreferenced memory in Python

In [8]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

##### Start training with torchrun

In [9]:
!ACCELERATE_USE_FSDP=1 FSDP_CPU_RAM_EFFICIENT_LOADING=1 torchrun --nproc_per_node=2 ./run_fsdp_qlora.py --config llama_3_8b_fsdp_qlora.yaml

2024-06-02 23:54:19.161143: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-02 23:54:19.161146: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-02 23:54:19.161209: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-02 23:54:19.161269: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-02 23:54:19.252121: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory

### Inference

In [10]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
 
peft_model_id = "/home/jupyter/llama-3-8b-FinGPT"
 
# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
  peft_model_id,
  torch_dtype=torch.float16,
  quantization_config= {"load_in_4bit": True},
  device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [27]:
from datasets import load_dataset
from random import randint
 

# Load our test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = 4
messages = eval_dataset[rand_idx]["messages"][:2]

# Test on sample
input_ids = tokenizer.apply_chat_template(messages,add_generation_prompt=True,return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id= tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]

print(f"**System Prompt:**\n{eval_dataset[rand_idx]['messages'][0]['content']}\n")
print(f"**Query:**\n{eval_dataset[rand_idx]['messages'][1]['content']}\n")
print(f"**Original Answer:**\n{eval_dataset[rand_idx]['messages'][2]['content']}\n")
print(f"**Generated Answer:**\n{tokenizer.decode(response,skip_special_tokens=True)}")

# **Query:**
# How long was the Revolutionary War?
# **Original Answer:**
# The American Revolutionary War lasted just over seven years. The war started on April 19, 1775, and ended on September 3, 1783.
# **Generated Answer:**
# The Revolutionary War, also known as the American Revolution, was an 18th-century war fought between the Kingdom of Great Britain and the Thirteen Colonies. The war lasted from 1775 to 1783.

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


KeyboardInterrupt: 
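The prompt built in the cell above keeps only the system and user messages (`[:2]`) and appends the assistant cue via `add_generation_prompt=True`. Assuming the saved tokenizer carries the Human/Assistant template set in the training script, the rendered text should look like the output of this plain-Python sketch. `render_chat` and the `<eos>` placeholder are hypothetical names used only for illustration; the real tokenizer inserts its own EOS token.

```python
def render_chat(messages, eos_token, add_generation_prompt=False):
    """Plain-Python equivalent of the Human/Assistant Jinja template."""
    text = ""
    for message in messages:
        if message["role"] == "system":
            text += message["content"]
        elif message["role"] == "user":
            text += "\n\nHuman: " + message["content"] + eos_token
        elif message["role"] == "assistant":
            text += "\n\nAssistant: " + message["content"] + eos_token
    if add_generation_prompt:
        text += "\n\nAssistant: "
    return text

# Made-up messages; "<eos>" stands in for the tokenizer's EOS token.
messages = [
    {"role": "system", "content": "You are a seasoned stock market analyst."},
    {"role": "user", "content": "Headlines for this week ..."},
]
prompt = render_chat(messages, eos_token="<eos>", add_generation_prompt=True)
print(repr(prompt))
```

Printing the rendered prompt like this before calling `generate` is a cheap way to confirm that inference uses the same format as training.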

### Save the model

##### Zip the LoRA adapter
Download the archive manually from the output section in the sidebar.

In [12]:
# !ls
# !zip -0 -r llama-3-8b-FinGPT.zip /home/jupyter/llama-3-8b-FinGPT

##### Merge PEFT and base model

In [13]:
# !ls /home/jupyter

In [14]:
# Merge the PEFT adapter into the base model
from peft import AutoPeftModelForCausalLM
import torch
from transformers import AutoTokenizer
 
# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    "/home/jupyter/llama-3-8b-FinGPT",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
# Merge LoRA and base model and save
merged_model = model.merge_and_unload()

# Save locally
merged_model.save_pretrained("/home/jupyter/llama-3-8b-FinGPT-Merged",safe_serialization=True, max_shard_size="2GB")
tokenizer = AutoTokenizer.from_pretrained("/home/jupyter/llama-3-8b-FinGPT")
tokenizer.save_pretrained("/home/jupyter/llama-3-8b-FinGPT-Merged")
# !zip -0 -r llama-3-8b-FinGPT-Merged.zip /home/jupyter/llama-3-8b-FinGPT-Merged

# # Publish to Huggingface
# merged_model.push_to_hub("ECS289L/Stocksense-Plus-Prediction", safe_serialization=True)
# peft_model_id = "/home/jupyter/llama-3-8b-FinGPT"
# tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
# tokenizer.push_to_hub("ECS289L/Stocksense-Plus-Prediction")

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


('/home/jupyter/llama-3-8b-FinGPT-Merged/tokenizer_config.json',
 '/home/jupyter/llama-3-8b-FinGPT-Merged/special_tokens_map.json',
 '/home/jupyter/llama-3-8b-FinGPT-Merged/tokenizer.json')
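Conceptually, merging folds each low-rank update into its frozen weight matrix, `W + (lora_alpha / r) * B @ A`, so the adapter is no longer needed at inference time. The toy sketch below illustrates that formula on a 2x2 matrix with invented numbers. It is not PEFT's implementation, and `matmul` and `merge_lora` are hypothetical helper names.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A as a new matrix."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy 2x2 layer with a rank-1 adapter (all numbers invented).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (r, d_in) = (1, 2)
B = [[0.5], [1.0]]    # shape (d_out, r) = (2, 1)

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

After merging, the result is an ordinary dense weight matrix, which is why the merged checkpoint can be converted to GGUF like any other Llama model.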

### Convert to GGUF

In [15]:
# !pip install --upgrade huggingface_hub
# from huggingface_hub import HfApi, list_models, snapshot_download
# snapshot_download("ECS289L/Stocksense-Plus-Prediction", local_dir_use_symlinks=False, local_dir="/home/jupyter/llama-3-8b-FinGPT")

In [16]:
!git clone https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp && make -j 16

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Cloning into 'llama.cpp'...
remote: Enumerating objects: 26191, done.[K
remote: Counting objects: 100% (6112/6112), done.[K
remote: Compressing objects: 100% (262/262), done.[K
remote: Total 26191 (delta 5973), reused 5870 (delta 5850), pack-reused 20079[K
Receiving objects: 100% (26191/26191), 49.94 MiB | 27.99 MiB/s, done.
Resolving deltas: 100% (18707/18707), done.


I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE 
I NVCCFLAGS: -std=c++11 -O3 
I LDFLAGS:    
I CC:        cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
I CXX:       c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D

In [17]:
#!cd llama.cpp && python3 -m pip install -r requirements.txt
!cd llama.cpp && python3 convert-hf-to-gguf.py /home/jupyter/llama-3-8b-FinGPT-Merged

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Writing: 100%|███████████████████████████| 16.1G/16.1G [01:28<00:00, 182Mbyte/s]


In [18]:
!ls /home/jupyter/llama-3-8b-FinGPT-Merged

config.json			  model-00006-of-00009.safetensors
generation_config.json		  model-00007-of-00009.safetensors
ggml-model-f16.gguf		  model-00008-of-00009.safetensors
model-00001-of-00009.safetensors  model-00009-of-00009.safetensors
model-00002-of-00009.safetensors  model.safetensors.index.json
model-00003-of-00009.safetensors  special_tokens_map.json
model-00004-of-00009.safetensors  tokenizer.json
model-00005-of-00009.safetensors  tokenizer_config.json


In [19]:
!cd llama.cpp && ./quantize /home/jupyter/llama-3-8b-FinGPT-Merged/ggml-model-f16.gguf /home/jupyter/llama-3-8b-FinGPT-Merged/Stocksense-Plus-Prediction-Q4_K_M.gguf Q4_K_M

main: build = 3070 (3413ae21)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: quantizing '/home/jupyter/llama-3-8b-FinGPT-Merged/ggml-model-f16.gguf' to '/home/jupyter/llama-3-8b-FinGPT-Merged/Stocksense-Plus-Prediction-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from /home/jupyter/llama-3-8b-FinGPT-Merged/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = llama-3-8b-FinGPT-Merged
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                  

In [20]:
from huggingface_hub import HfApi
api = HfApi()

model_id = "ECS289L/Stocksense-Plus-Prediction-GGUF"
api.create_repo(model_id, exist_ok=True, repo_type="model")
api.upload_file(
    path_or_fileobj="/home/jupyter/llama-3-8b-FinGPT-Merged/Stocksense-Plus-Prediction-Q4_K_M.gguf",
    path_in_repo="Stocksense-Plus-Prediction-Q4_K_M.gguf",
    repo_id=model_id,
)

Stocksense-Plus-Prediction-Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/ECS289L/Stocksense-Plus-Prediction-GGUF/commit/2f5eec45db80d8709468208bd1a48ed9742a5d66', commit_message='Upload Stocksense-Plus-Prediction-Q4_K_M.gguf with huggingface_hub', commit_description='', oid='2f5eec45db80d8709468208bd1a48ed9742a5d66', pr_url=None, pr_revision=None, pr_num=None)
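The two file sizes reported above give a rough idea of what Q4_K_M quantization buys. The sketch below is back-of-the-envelope arithmetic only: it assumes both progress bars report sizes in the same decimal units, and it ignores metadata and any tensors not stored at 2 bytes per weight in the f16 file.

```python
f16_size = 16.1e9   # "16.1G" reported by convert-hf-to-gguf.py above
q4_size = 4.92e9    # "4.92G" reported by the upload progress bar above

approx_params = f16_size / 2                   # f16 stores 2 bytes per weight
bits_per_weight = q4_size * 8 / approx_params  # average over the whole file
compression = f16_size / q4_size

print(f"{bits_per_weight:.2f} bits/weight, {compression:.2f}x smaller")
```

The average comes out above 4 bits per weight, which is consistent with Q4_K_M being a mixed scheme rather than a uniform 4-bit format.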

### Accuracy Test

In [21]:
!apt install systemctl -y
!curl -fsSL https://ollama.com/install.sh | sh
!systemctl start ollama
!pip install ollama

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  adwaita-icon-theme gir1.2-glib-2.0 gir1.2-packagekitglib-1.0
  glib-networking-common glib-networking-services gtk-update-icon-cache
  hicolor-icon-theme humanity-icon-theme iso-codes libargon2-1 libcap2
  libcap2-bin libcolord2 libcryptsetup12 libdconf1 libdevmapper1.02.1
  libepoxy0 libgirepository-1.0-1 libglib2.0-bin libgstreamer1.0-0 libip4tc2
  libjson-c4 libjson-glib-1.0-0 libjson-glib-1.0-common liblmdb0
  libpackagekit-glib2-18 libpolkit-agent-1-0 libpolkit-gobject-1-0 libproxy1v5
  libstemmer0d libxdamage1 libyaml-0-2 python-apt-common python3-apt
  python3-certifi python3-chardet python3-dbus python3-gi python3-idna
  python3-requests python3-requests-unixsocket python3-six
  python3-software-properties python3-urllib3 ubuntu-mono
Use 'apt autoremove' to remove them.
The following additional packages

>>> Downloading ollama...
######################################################################## 100.0%#=#=#                                                                          
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> NVIDIA GPU installed.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


Collecting ollama
  Downloading ollama-0.2.0-py3-none-any.whl.metadata (4.1 kB)
Downloading ollama-0.2.0-py3-none-any.whl (9.5 kB)
Installing collected packages: ollama
Successfully installed ollama-0.2.0


In [30]:
%%writefile accuracy.py
import ollama
from datasets import load_dataset
import tqdm


modelfile='''
FROM /home/jupyter/llama-3-8b-FinGPT-Merged/Stocksense-Plus-Prediction-Q4_K_M.gguf
SYSTEM You are a seasoned stock market analyst. Your task is to predict the companies' stock price movement for this week based on this week's positive headlines and negative headlines. Give me answer in the format of {increased/decreased/flat} in {X}%
TEMPLATE "{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"
PARAMETER num_keep 24
PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>
'''

# modelfile='''
# FROM llama3:latest
# SYSTEM What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.
# TEMPLATE "{{ if .System }}<|start_header_id|>system<|end_header_id|>

# {{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

# {{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

# {{ .Response }}<|eot_id|>"
# PARAMETER num_keep 24
# PARAMETER stop <|start_header_id|>
# PARAMETER stop <|end_header_id|>
# PARAMETER stop <|eot_id|>
# '''

ollama.create(model='stocksense-plus-test', modelfile=modelfile)

# Load jsonl data from disk
dataset = load_dataset("json", data_files="test_dataset.json", split="train")
userPrompts = []
answers = []
generatedAnswers = []
correctResponse = 0

for d in dataset:
    userPrompts.append([{'role': 'user', 'content': d['messages'][1]['content'] + " Just give me the answer in the format of {increased/decreased/flat} in {X}%. Don't say other things."}])
    # userPrompts.append([{'role': 'user', 'content': d['messages'][1]['content']}])
    groundTruth = d['messages'][2]['content'].split(' ')
    answers.append((groundTruth[0], groundTruth[2]))


NUM_RUNS = 3  # repeat the evaluation to average over sampling randomness
for _ in range(NUM_RUNS):
    for idx, prompts in enumerate(userPrompts):
        response = ollama.chat(model='stocksense-plus-test', messages=prompts)
        content = response['message']['content']
        print(content, answers[idx])
        # Expected format: "<direction> in <X>%"; the first word is the direction
        words = content.split()
        upDown = words[0].lower() if words else ''
        if answers[idx][0] == upDown:
            correctResponse += 1

# Directional accuracy over all runs, in percent
print(correctResponse / (len(answers) * NUM_RUNS) * 100, '%')

Overwriting accuracy.py


In [32]:
!python accuracy.py

increased in 3.05% ('decreased', '5.35%')
increased in 1.34% ('decreased', '5.3%')
increased in 1.39% ('decreased', '1.58%')
increased in 0.46% ('increased', '2.73%')
increased in 0.6% ('decreased', '0.52%')
decreased in 0.37% ('increased', '1.69%')
increased in 5.45% ('increased', '1.23%')
increased in 6.02% ('decreased', '1.28%')
increased in 0.25% ('increased', '0.75%')
increased in 0.45% ('decreased', '2.49%')
increased in 0.55% ('increased', '3.17%')
decreased in 2.01% ('decreased', '1.28%')
decreased in 1.0% ('decreased', '5.84%')
decreased in 0.43% ('decreased', '1.07%')
decreased in 6.33% ('increased', '11.33%')
increased in 0.95% ('decreased', '1.46%')
increased in 1.6% ('increased', '2.2%')
increased in 2.25% ('increased', '1.83%')
decreased in 1.01% ('decreased', '1.49%')
decreased in 2.09% ('increased', '2.21%')
increased in 2.31% ('increased', '3.84%')
increased in 2.4% ('decreased', '3.32%')
decreased in 5.42% ('increased', '0.33%')
increased in 1.22% ('decreased', '0.11%
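Splitting on spaces works as long as the model sticks exactly to the requested format. A slightly more forgiving option is a regular expression that finds the `{direction} in {X}%` pattern anywhere in the response. The sketch below is one possible approach, not part of the original evaluation; `parse_prediction` and `direction_accuracy` are hypothetical helper names, and the ground-truth strings are reconstructed from the tuples printed above.

```python
import re

# Matches e.g. "increased in 3.05%" or "Flat in 0%"; case-insensitive.
PREDICTION_RE = re.compile(
    r"(increased|decreased|flat)\s+in\s+(-?\d+(?:\.\d+)?)\s*%", re.IGNORECASE
)

def parse_prediction(text):
    """Return (direction, percent) or None when the text does not match."""
    match = PREDICTION_RE.search(text)
    if match is None:
        return None
    return match.group(1).lower(), float(match.group(2))

def direction_accuracy(pairs):
    """pairs: (generated, reference) strings; unparseable outputs count as wrong."""
    pairs = list(pairs)
    correct = 0
    for generated, reference in pairs:
        pred, truth = parse_prediction(generated), parse_prediction(reference)
        if pred is not None and truth is not None and pred[0] == truth[0]:
            correct += 1
    return 100 * correct / len(pairs)

# First four lines of the output above, as (generated, ground truth) pairs.
sample = [
    ("increased in 3.05%", "decreased in 5.35%"),
    ("increased in 1.34%", "decreased in 5.3%"),
    ("increased in 1.39%", "decreased in 1.58%"),
    ("increased in 0.46%", "increased in 2.73%"),
]
print(direction_accuracy(sample))
```

Parsing the percentage as a float also makes it straightforward to add a magnitude metric, such as mean absolute error, alongside directional accuracy.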

## Useful sources
- https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms
- https://www.philschmid.de/fsdp-qlora-llama3
- https://www.philschmid.de/fsdp-qlora-llama3#3-fine-tune-the-llm-with-pytorch-fsdp-q-lora-and-sdpa
- https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#3-create-and-prepare-the-dataset