# LLaMa - 3 Financial Analyst

### Imports

In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
!pip install sec_api
!pip install -U langchain
!pip install -U langchain-community
!pip install -U sentence-transformers
!pip install -U faiss-gpu
!pip install rouge_score

Get a Hugging Face token here: https://huggingface.co/settings/tokens  
Get a free SEC API key here: https://sec-api.io/

In [4]:
from google.colab import userdata

In [2]:
# HuggingFace token, required for accessing gated models (like LLaMa 3 8B Instruct)
hf_token = userdata.get('HF_TOKEN')
# SEC-API Key
sec_api_key = userdata.get('SEC_API_KEY')

In [3]:
# Fine Tuning Related Packages
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Pipeline & RAG Related Packages
from sec_api import ExtractorApi, QueryApi
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

import pandas as pd
from rouge_score import rouge_scorer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.8.0+cu128 with CUDA 1208 (you have 2.6.0+cu124)
    Python  3.9.23 (you have 3.11.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!


### Load Data
Load the base model, then load the dataset and split it into training and test sets.

In [4]:
# Load the model and tokenizer from the pre-trained FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    # Specify the pre-trained model to use
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    # Specifies the maximum number of tokens (words or subwords) that the model can process in a single forward pass
    max_seq_length = 2048,
    # Data type for the model. None means auto-detection based on hardware, Float16 for specific hardware like Tesla T4
    dtype = None,
    # Enable 4-bit quantization, By quantizing the weights of the model to 4 bits instead of the usual 16 or 32 bits, the memory required to store these weights is significantly reduced. This allows larger models to be run on hardware with limited memory resources.
    load_in_4bit = True,
    # Access token for gated models, required for authentication to use models like Meta-Llama-3-8B-Instruct
    token = hf_token,
)


==((====))==  Unsloth 2025.8.6: Fast Llama patching. Transformers: 4.55.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/220 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

In [5]:
# Defining the expected prompt
ft_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Below is a user question, paired with retrieved context. Write a response that appropriately answers the question,
include specific details in your response. <|eot_id|>

<|start_header_id|>user<|end_header_id|>

### Question:
{}

### Context:
{}

<|eot_id|>

### Response: <|start_header_id|>assistant<|end_header_id|>
{}"""

# Grab the end-of-sequence (EOS) special token
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

# Function for formatting the above prompt with entries from the Financial QA dataset
def formatting_prompts_func(examples, intent='train'):
    questions = examples["question"]
    contexts  = examples["context"]
    # At test time the response slot is left empty so the model has to generate it
    responses = examples["answer"] if intent == 'train' else [""] * len(questions)
    texts = []
    for question, context, response in zip(questions, contexts, responses):
        text = ft_prompt.format(question, context, response)
        # Must add EOS_TOKEN to training examples, otherwise your generation will go on forever!
        if intent == 'train':
            text += EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }


In [6]:
#Load the data
dataset_train = load_dataset("virattt/llama-3-8b-financialQA", split = "train")

# Split it into 2 parts (90% train and 10% test)
split_datasets = dataset_train.train_test_split(test_size=0.1)

dataset = split_datasets['train']
dataset_test = split_datasets['test']

dataset = dataset.map(formatting_prompts_func, batched = True,)
dataset_test = dataset_test.map(formatting_prompts_func, batched = True,fn_kwargs={"intent": "test"})

README.md:   0%|          | 0.00/419 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/1.59M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7000 [00:00<?, ? examples/s]

Map:   0%|          | 0/6300 [00:00<?, ? examples/s]

Map:   0%|          | 0/700 [00:00<?, ? examples/s]

### Pre-Fine-Tuning Inference

In [None]:
FastLanguageModel.for_inference(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128255)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSN

In [25]:
# Main Inference Function, handles generating and decoding tokens
def inference(prompt,max_tokens=64):
  inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

  # Generate tokens for the input prompt using the model, with a maximum of 64 new tokens.
  # The `use_cache` parameter enables faster generation by reusing previously computed values.
  # The `pad_token_id` is set to the EOS token to handle padding properly.
  outputs = model.generate(**inputs, max_new_tokens = max_tokens, use_cache = True, pad_token_id=tokenizer.eos_token_id)
  response = tokenizer.batch_decode(outputs) # Decoding tokens into english words
  return response

In [26]:
# Function for extracting just the language model generation from the full response
def extract_response(text):
    text = text[0]
    start_token = "### Response: <|start_header_id|>assistant<|end_header_id|>"

    # find() returns -1 when the marker is missing, so check before adding the offset
    marker_index = text.find(start_token)
    if marker_index == -1:
        return None

    start_index = marker_index + len(start_token)
    # No end token is searched for here: everything after the marker is returned
    return text[start_index:].strip()

In [None]:
# print(dataset_test[0]['text'])

In [None]:
# resp = inference(dataset_test[0]['text'])
# parsed_response = extract_response(resp)
# print(parsed_response)

In [27]:
from tqdm.notebook import tqdm

In [28]:
def run_evaluation(dataset, model, max_length=1024):

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    records = []

    for i,data in tqdm(enumerate(dataset),total=len(dataset)):
        input_passage = data['text']
        true_analysis = data['answer']

        resp = inference(input_passage,max_length)
        model_output = extract_response(resp)

        scores = scorer.score(target=true_analysis, prediction=model_output)

        records.append({
            "input_passage": input_passage,
            "true_analysis": true_analysis,
            "generated_analysis": model_output,
            "rouge1_precision": scores["rouge1"].precision,
            "rouge1_recall": scores["rouge1"].recall,
            "rouge1_fmeasure": scores["rouge1"].fmeasure,
            "rouge2_precision": scores["rouge2"].precision,
            "rouge2_recall": scores["rouge2"].recall,
            "rouge2_fmeasure": scores["rouge2"].fmeasure,
            "rougeL_precision": scores["rougeL"].precision,
            "rougeL_recall": scores["rougeL"].recall,
            "rougeL_fmeasure": scores["rougeL"].fmeasure
        })

    return pd.DataFrame(records)


In [None]:
res = run_evaluation(dataset_test, model, max_length=1024)

  0%|          | 0/700 [00:00<?, ?it/s]

In [None]:
res.describe()

Unnamed: 0,rouge1_precision,rouge1_recall,rouge1_fmeasure,rouge2_precision,rouge2_recall,rouge2_fmeasure,rougeL_precision,rougeL_recall,rougeL_fmeasure
count,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0
mean,0.216155,0.909578,0.312323,0.15833,0.684123,0.226585,0.191705,0.826274,0.277457
std,0.191919,0.1215,0.214037,0.170581,0.281886,0.196147,0.176175,0.178979,0.196047
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.083333,0.870169,0.151372,0.046512,0.5,0.085562,0.074172,0.708333,0.134295
50%,0.142857,0.949359,0.245002,0.095238,0.722222,0.162162,0.126633,0.868993,0.222222
75%,0.283285,1.0,0.422069,0.195255,1.0,0.300203,0.237802,1.0,0.356289
max,0.909091,1.0,0.928571,0.857143,1.0,0.923077,0.894737,1.0,0.928571


### Fine Tuning

In [None]:
# Load the model and tokenizer from the pre-trained FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    # Specify the pre-trained model to use
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    # Specifies the maximum number of tokens (words or subwords) that the model can process in a single forward pass
    max_seq_length = 2048,
    # Data type for the model. None means auto-detection based on hardware, Float16 for specific hardware like Tesla T4
    dtype = None,
    # Enable 4-bit quantization, By quantizing the weights of the model to 4 bits instead of the usual 16 or 32 bits, the memory required to store these weights is significantly reduced. This allows larger models to be run on hardware with limited memory resources.
    load_in_4bit = True,
    # Access token for gated models, required for authentication to use models like Meta-Llama-3-8B-Instruct
    token = hf_token,
    device_map="auto"
)


==((====))==  Unsloth 2025.8.5: Fast Llama patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [7]:
# Apply LoRA (Low-Rank Adaptation) adapters to the model for parameter-efficient fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    # Rank of the adaptation matrix. Higher values can capture more complex patterns. Suggested values: 8, 16, 32, 64, 128
    r = 16,
    # Specify the model layers to which LoRA adapters should be applied
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    # Scaling factor for LoRA. Controls the weight of the adaptation. Typically a small positive integer
    lora_alpha = 16,
    # Dropout rate for LoRA. A value of 0 means no dropout, which is optimized for performance
    lora_dropout = 0,
    # Bias handling in LoRA. Setting to "none" is optimized for performance, but other options can be used
    bias = "none",
    # Enables gradient checkpointing to save memory during training. "unsloth" is optimized for very long contexts
    use_gradient_checkpointing = "unsloth",
    # Seed for random number generation to ensure reproducibility of results
    random_state = 3407,
)

Unsloth 2025.8.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.



1. **r**: The rank of the low-rank adaptation matrix. This determines the capacity of the adapter to capture additional information. Higher ranks allow capturing more complex patterns but also increase computational overhead.

2. **target_modules**: List of module names within the model to which the LoRA adapters should be applied. These modules typically include the projections within transformer layers (e.g., query, key, value projections) and other key transformation points.
  - **q_proj**: Projects input features to query vectors for attention mechanisms.
  - **k_proj**: Projects input features to key vectors for attention mechanisms.
  - **v_proj**: Projects input features to value vectors for attention mechanisms.
  - **o_proj**: Projects the concatenated attention-head outputs back to the model's hidden dimension.
  - **gate_proj**: Produces the gate of the feed-forward block; its output passes through the SiLU activation and is multiplied element-wise with the output of up_proj.
  - **up_proj**: Projects features to the higher-dimensional feed-forward space (4096 to 14336 in this model).
  - **down_proj**: Projects the feed-forward features back down to the hidden dimension (14336 to 4096 in this model).

These layers are typically involved in transformer-based models, facilitating various projections and transformations necessary for the attention mechanism and feed-forward processes.

3. **lora_alpha**: A scaling factor for the LoRA adapter. In standard LoRA the low-rank update is scaled by lora_alpha / r, so with lora_alpha = 16 and r = 16 the scale is 1. Larger values give the adapter more influence on the model's outputs.

4. **lora_dropout**: Dropout rate applied to the LoRA adapters. Dropout helps in regularizing the model, but setting it to 0 means no dropout, which is often optimal for performance.

5. **bias**: This specifies how biases should be handled in the LoRA adapters. Setting it to "none" indicates no bias is used, which is optimized for performance, although other options are available depending on the use case.

6. **use_gradient_checkpointing**: Enables gradient checkpointing, which helps to save memory during training by not storing all intermediate activations. "unsloth" is a setting optimized for very long contexts, but it can also be set to True.

7. **random_state**: A seed for the random number generator to ensure reproducibility. This makes sure that the results are consistent across different runs of the code.
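To make the effect of `r` and `target_modules` concrete, the sketch below counts the parameters that LoRA adds for the settings above. The layer shapes are copied from the printed model architecture; the helper name `lora_param_count` is only illustrative. Under these assumptions the total comes out to the same 41,943,040 trainable parameters that the training log reports.

```python
# Shapes (in_features, out_features) of the targeted projections in one decoder layer,
# copied from the printed LlamaForCausalLM architecture.
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32
R = 16
LORA_ALPHA = 16


def lora_param_count(in_features, out_features, r):
    # LoRA adds two small matrices next to each frozen weight:
    # A with shape (r, in_features) and B with shape (out_features, r).
    return r * in_features + out_features * r


per_layer = sum(lora_param_count(i, o, R) for i, o in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # standard LoRA scales the update by lora_alpha / r

print(f"LoRA parameters per decoder layer: {per_layer:,}")
print(f"Total trainable LoRA parameters:   {total_trainable:,}")
print(f"LoRA scaling factor:               {scaling}")
```

Doubling `r` doubles the adapter size, which is why rank is the main lever for trading capacity against memory.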

### **Preparing the Fine Tuning Dataset**

We will be using a Hugging Face dataset of financial Q&A over Form 10-Ks, provided by user [Virat Singh](https://github.com/virattt) here: https://huggingface.co/datasets/virattt/llama-3-8b-financialQA

The `formatting_prompts_func` defined earlier formats each entry into the prompt template for training, being careful to add the special tokens. In this case our end-of-sequence token is <|eot_id|>. More LLaMa 3 special tokens are listed [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/).

In [8]:
# len(dataset)

### **Defining the Trainer Arguments**

We will be setting up and using HuggingFace Transformer Reinforcement Learning (TRL)'s [Supervised Fine-Tuning Trainer](https://huggingface.co/docs/trl/sft_trainer)

**Supervised fine-tuning** is a process in machine learning where a pre-trained model is further trained on a specific dataset with labeled examples. During this process, the model learns to make predictions or classifications based on the labeled data, improving its performance on the specific task at hand. This technique leverages the general knowledge the model has already acquired during its initial training phase and adapts it to perform well on a more targeted set of examples. Supervised fine-tuning is commonly used to customize models for specific applications, such as sentiment analysis, object recognition, or language translation, by using task-specific annotated data.

In [9]:
trainer = SFTTrainer(
    # The model to be fine-tuned
    model = model,
    # The tokenizer associated with the model
    tokenizer = tokenizer,
    # The dataset used for training
    train_dataset = dataset,
    # The field in the dataset containing the text data
    dataset_text_field = "text",
    # Maximum sequence length for the training data
    max_seq_length = 2048,
    # Number of processes to use for data loading
    dataset_num_proc = 2,
    # Whether to use sequence packing, which can speed up training for short sequences
    packing = False,
    args = TrainingArguments(
        # Batch size per device during training
        per_device_train_batch_size = 2,
        # Number of gradient accumulation steps to perform before updating the model parameters
        gradient_accumulation_steps = 4,
        # Number of warmup steps for learning rate scheduler
        warmup_steps = 5,
        # Total number of training steps
        max_steps = 60,
        # Number of training epochs; can be used instead of max_steps. For this notebook one epoch is ~790 steps (6,300 examples / effective batch size of 8)
        # num_train_epochs = 1,
        # Learning rate for the optimizer
        learning_rate = 2e-4,
        # Use 16-bit floating point precision for training if bfloat16 is not supported
        fp16 = not is_bfloat16_supported(),
        # Use bfloat16 precision for training if supported
        bf16 = is_bfloat16_supported(),
        # Number of steps between logging events
        logging_steps = 1,
        # Optimizer to use (in this case, AdamW with 8-bit precision)
        optim = "adamw_8bit",
        # Weight decay to apply to the model parameters
        weight_decay = 0.01,
        # Type of learning rate scheduler to use
        lr_scheduler_type = "linear",
        # Seed for random number generation to ensure reproducibility
        seed = 3407,
        # Directory to save the output models and logs
        output_dir = "outputs",
        no_cuda = False,

    ),
)


Unsloth: Tokenizing ["text"]:   0%|          | 0/6300 [00:00<?, ? examples/s]

1. **model**: The model to be fine-tuned. This is the pre-trained model that will be adapted to the specific training data.

2. **tokenizer**: The tokenizer associated with the model. It converts text data into tokens that the model can process.

3. **train_dataset**: The dataset used for training. This is the collection of labeled examples that the model will learn from during the fine-tuning process.

4. **dataset_text_field**: The field in the dataset containing the text data. This specifies which part of the dataset contains the text that the model will be trained on.

5. **max_seq_length**: The maximum sequence length for the training data. This limits the number of tokens per input sequence to ensure they fit within the model's processing capacity.

6. **dataset_num_proc**: The number of processes to use for data loading. This can speed up data loading by parallelizing it across multiple processes.

7. **packing**: A boolean indicating whether to use sequence packing. Sequence packing can speed up training by concatenating multiple short examples into a single sequence of up to max_seq_length tokens, so fewer padding tokens are processed.

8. **args**: A set of training arguments that configure the training process. These include various hyperparameters and settings:

    - **per_device_train_batch_size**: The batch size per device during training. This determines how many examples are processed together in one forward/backward pass.
    
    - **gradient_accumulation_steps**: The number of gradient accumulation steps to perform before updating the model parameters. This allows for effectively larger batch sizes without requiring more memory.
    
    - **warmup_steps**: The number of warmup steps for the learning rate scheduler. During these steps, the learning rate increases gradually from 0 to the configured learning_rate.
    
    - **max_steps**: The total number of optimizer update steps. Each step processes per_device_train_batch_size × gradient_accumulation_steps examples per device. When set to a positive value, it overrides num_train_epochs.
    
    - **num_train_epochs**: The number of training epochs (commented out in the example). This defines how many times the entire training dataset will be passed through the model.
    
    - **learning_rate**: The learning rate for the optimizer. This controls how much to adjust the model's weights with respect to the gradient during training.
    
    - **fp16**: A boolean that enables 16-bit floating point (fp16) mixed-precision training. Here it is enabled only when bfloat16 is not supported, as on the Tesla T4 used in this run. This can speed up training and reduce memory usage.
    
    - **bf16**: A boolean that enables bfloat16 mixed-precision training when the hardware supports it. bfloat16 keeps the same exponent range as 32-bit floats, which makes it less prone to overflow than fp16, at the cost of fewer mantissa bits (lower precision).
    
    - **logging_steps**: The number of steps between logging events. This controls how frequently training progress and metrics are logged.
    
    - **optim**: The optimizer to use. In this case, AdamW with 8-bit precision, which can improve efficiency for large models.
    
    - **weight_decay**: The weight decay to apply to the model parameters. This is a regularization technique to prevent overfitting by penalizing large weights.
    
    - **lr_scheduler_type**: The type of learning rate scheduler to use. This controls how the learning rate changes over time during training.
    
    - **seed**: The seed for random number generation. This ensures reproducibility of results by controlling the randomness in training.
    
    - **output_dir**: The directory to save the output models and logs. This specifies where the trained model and training logs will be stored.
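The interaction between batch size, gradient accumulation, and `max_steps` is easier to see with the numbers from this notebook. The following sketch is plain arithmetic under the settings above (6,300 training examples, one GPU in use); the steps-per-epoch figure is an approximation, since the trainer's own rounding may differ by a step.

```python
import math

num_examples = 6300               # size of the training split
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1                   # the training log reports one GPU in use
max_steps = 60

# Examples contributing to each optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# How much of the dataset the 60-step run actually covers
examples_seen = effective_batch_size * max_steps
fraction_of_epoch = examples_seen / num_examples

# Approximate number of optimizer steps needed for one full epoch
steps_per_epoch = math.ceil(num_examples / effective_batch_size)

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen in {max_steps} steps: {examples_seen} "
      f"({fraction_of_epoch:.1%} of one epoch)")
print(f"Approximate steps per epoch: {steps_per_epoch}")
```

With `max_steps = 60` the model sees well under a tenth of the training split, so switching to `num_train_epochs = 1` would be a much longer run.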

# **Now We're Ready to Train!** 🎉

In [10]:
# Add this in your training loop or right after trainer.train()
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"GPU memory cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

GPU memory allocated: 5.89 GB
GPU memory cached: 5.90 GB


In [11]:
print(f"Model device: {next(model.parameters()).device}")
print(f"Trainer device: {trainer.args.device}")
print(f"Model parameters device: {next(trainer.model.parameters()).device}")

Model device: cuda:0
Trainer device: cuda:0
Model parameters device: cuda:0


In [12]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 6,300 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: risheshg (risheshg-jadavpur-) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,4.6778
2,4.3665
3,4.2334
4,3.6174
5,2.7318
6,2.6989
7,2.3514
8,2.0588
9,1.8196
10,1.4473


---
### **Saving Your Fine-Tuned Model Locally**

First click on Files, then mount Drive to connect Google Drive. Create a folder there and replace the path below with its location.

In [None]:
kaggle = False
huggingface = True
colab = True
### If using Kaggle


In [20]:
if kaggle:
  import os
  import kagglehub
  from kaggle_secrets import UserSecretsClient

  # Retrieve the Kaggle username and key from the stored secrets (never hard-code credentials)
  user_secrets = UserSecretsClient()
  kaggle_username = user_secrets.get_secret("KAGGLE_USERNAME")
  kaggle_key = user_secrets.get_secret("KAGGLE_KEY")
  os.environ["KAGGLE_USERNAME"] = kaggle_username
  os.environ["KAGGLE_KEY"] = kaggle_key

  # Save the LoRA adapter and tokenizer to a local directory
  preset_dir = "/kaggle/working/llama_finagent"
  model.save_pretrained(preset_dir)
  tokenizer.save_pretrained(preset_dir)

  # Upload the saved directory to Kaggle Models
  # (assumes the kagglehub package and a "transformers" framework handle)
  kaggle_handle = f"{kaggle_username}/llm_finance_agent/transformers/trained_step60"
  kagglehub.model_upload(kaggle_handle, preset_dir)

In [16]:
if colab:
  model.save_pretrained("/content/drive/MyDrive/l3_finagent/l3_finagent_step60") # Local saving
  tokenizer.save_pretrained("/content/drive/MyDrive/l3_finagent/l3_finagent_step60")

('/content/drive/MyDrive/l3_finagent/l3_finagent_step60/tokenizer_config.json',
 '/content/drive/MyDrive/l3_finagent/l3_finagent_step60/special_tokens_map.json',
 '/content/drive/MyDrive/l3_finagent/l3_finagent_step60/chat_template.jinja',
 '/content/drive/MyDrive/l3_finagent/l3_finagent_step60/tokenizer.json')

In [23]:
if huggingface:
  model.push_to_hub("rgarg/l3_finagent_step60")
  tokenizer.push_to_hub("rgarg/l3_finagent_step60")

README.md:   0%|          | 0.00/590 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/rgarg/l3_finagent_step60


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

### **Function for Loading New Model**
The model can be loaded either from the saved Drive folder or from the Hugging Face Hub.

In [24]:
model_name = 'rgarg/l3_finagent_step60' # Hub repo id; pass this as model_name below to load from the Hub instead of Drive

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/content/drive/MyDrive/l3_finagent/l3_finagent_step60", # Path into where you saved your model
    max_seq_length = 2048, # Existing arguments from when we loaded earlier
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

==((====))==  Unsloth 2025.8.6: Fast Llama patching. Transformers: 4.55.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096, padding_idx=128255)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor

### **Setting Up Functions for Running Inference**

**Inference** refers to the process of using a trained machine learning model to make predictions or generate outputs based on new, unseen input data. It involves feeding the input data into the model and obtaining the model's predictions, classifications, or generated text, depending on the task the model is designed for. Inference is the application phase of the model, as opposed to the training phase.

In [43]:
inference(dataset_test[1]['text'],64)

['<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\nBelow is a user question, paired with retrieved context. Write a response that appropriately answers the question,\ninclude specific details in your response. <|eot_id|>\n\n<|start_header_id|>user<|end_header_id|>\n\n### Question:\nWhere does the section on Financial Statements and Supplementary Data begin in the Form 10-K?\n\n### Context:\nIn the Form 10-K, the section on Financial Statements and Supplementary Data begins on page F-1.\n\n<|eot_id|>\n\n### Response: <|start_header_id|>assistant<|end_header_id|>\nThe section on Financial Statements and Supplementary Data begins on page F-1 in the Form 10-K.<|eot_id|>']

In [35]:
dataset_test[1]['answer']

'Page F-1'
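The generation above is correct but far longer than the reference answer, which is why ROUGE precision for this example is low while recall is perfect. The sketch below reimplements ROUGE-1 by hand to show the arithmetic. It is a simplified stand-in for `rouge_scorer` (no stemming, and the `tokenize` and `rouge1` helpers are illustrative), but for this pair it gives the same precision, recall, and F-measure that the notebook's evaluation reports.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumeric characters (no stemming in this sketch)
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(target, prediction):
    target_counts = Counter(tokenize(target))
    prediction_counts = Counter(tokenize(prediction))
    # Clipped unigram overlap between reference and prediction
    overlap = sum((target_counts & prediction_counts).values())
    precision = overlap / max(sum(prediction_counts.values()), 1)
    recall = overlap / max(sum(target_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


target = "Page F-1"
prediction = (
    "The section on Financial Statements and Supplementary Data "
    "begins on page F-1 in the Form 10-K."
)

precision, recall, f1 = rouge1(target, prediction)
print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f}")
```

All three reference tokens appear in the prediction (recall 1.0), but they account for only 3 of its 18 tokens, so precision is 3/18. Short reference answers therefore penalize verbose but correct generations.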

In [40]:
# Function for extracting just the language model generation from the full response
def extract_response(text):
    text = text[0]
    start_token = "### Response: <|start_header_id|>assistant<|end_header_id|>"
    end_token = "<|eot_id|>"

    # find() returns -1 when the marker is missing, so check before adding the offset
    marker_index = text.find(start_token)
    if marker_index == -1:
        return None

    start_index = marker_index + len(start_token)
    end_index = text.find(end_token, start_index)
    if end_index == -1:
        return None

    return text[start_index:end_index].strip()

In [41]:
extract_response(['<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\nBelow is a user question, paired with retrieved context. Write a response that appropriately answers the question,\ninclude specific details in your response. <|eot_id|>\n\n<|start_header_id|>user<|end_header_id|>\n\n### Question:\nWhere does the section on Financial Statements and Supplementary Data begin in the Form 10-K?\n\n### Context:\nIn the Form 10-K, the section on Financial Statements and Supplementary Data begins on page F-1.\n\n<|eot_id|>\n\n### Response: <|start_header_id|>assistant<|end_header_id|>\nThe section on Financial Statements and Supplementary Data begins on page F-1.<|eot_id|>'])

'The section on Financial Statements and Supplementary Data begins on page F-1.'

In [44]:
res2 = run_evaluation(dataset_test, model, max_length=1024)

  0%|          | 0/700 [00:00<?, ?it/s]

In [45]:
res2

Unnamed: 0,input_passage,true_analysis,generated_analysis,rouge1_precision,rouge1_recall,rouge1_fmeasure,rouge2_precision,rouge2_recall,rouge2_fmeasure,rougeL_precision,rougeL_recall,rougeL_fmeasure
0,<|begin_of_text|><|start_header_id|>system<|en...,"$8,516 million","$8,516 million",1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
1,<|begin_of_text|><|start_header_id|>system<|en...,Page F-1,The section on Financial Statements and Supple...,0.166667,1.000000,0.285714,0.117647,1.000000,0.210526,0.166667,1.000000,0.285714
2,<|begin_of_text|><|start_header_id|>system<|en...,"HP's long-term debt decreased from $10,796 mil...","HP's long-term debt decreased from $10,796 mil...",0.666667,0.888889,0.761905,0.565217,0.764706,0.650000,0.666667,0.888889,0.761905
3,<|begin_of_text|><|start_header_id|>system<|en...,The FDA label update for Yescarta includes ove...,The FDA label update for Yescarta includes ove...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
4,<|begin_of_text|><|start_header_id|>system<|en...,20%,20%,1.000000,1.000000,1.000000,0.000000,0.000000,0.000000,1.000000,1.000000,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
695,<|begin_of_text|><|start_header_id|>system<|en...,Employee groups at AT&T help to reflect and fo...,AT&T has employee groups that reflect its dive...,0.400000,0.583333,0.474576,0.117647,0.173913,0.140351,0.314286,0.458333,0.372881
696,<|begin_of_text|><|start_header_id|>system<|en...,The increase in operating income for the Compa...,The absence of $5.8 billion of opioid litigati...,0.530303,0.686275,0.598291,0.461538,0.600000,0.521739,0.348485,0.450980,0.393162
697,<|begin_of_text|><|start_header_id|>system<|en...,The federal government has used the FCA to pro...,The federal government has used the FCA to pro...,0.695652,0.820513,0.752941,0.555556,0.657895,0.602410,0.695652,0.820513,0.752941
698,<|begin_of_text|><|start_header_id|>system<|en...,If foreign earnings declared as indefinitely r...,If foreign earnings are repatriated to the U.S...,0.857143,0.827586,0.842105,0.703704,0.678571,0.690909,0.821429,0.793103,0.807018


In [46]:
res2.describe()

| | rouge1_precision | rouge1_recall | rouge1_fmeasure | rouge2_precision | rouge2_recall | rouge2_fmeasure | rougeL_precision | rougeL_recall | rougeL_fmeasure |
|---|---|---|---|---|---|---|---|---|---|
| count | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 |
| mean | 0.735684 | 0.781256 | 0.700917 | 0.590991 | 0.63127 | 0.559708 | 0.691789 | 0.73471 | 0.657425 |
| std | 0.265505 | 0.25627 | 0.247689 | 0.321286 | 0.32265 | 0.299134 | 0.276373 | 0.273098 | 0.257527 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.549405 | 0.664894 | 0.555556 | 0.342105 | 0.4 | 0.329765 | 0.482759 | 0.548387 | 0.470391 |
| 50% | 0.802439 | 0.875 | 0.75 | 0.6 | 0.676945 | 0.58435 | 0.728501 | 0.806285 | 0.678269 |
| 75% | 1.0 | 1.0 | 0.901547 | 0.9 | 0.970694 | 0.8 | 1.0 | 1.0 | 0.875762 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
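
To make the ROUGE-1 columns concrete, here is a minimal pure-Python sketch of unigram precision, recall, and F-measure. It is a simplification and not the `rouge_score` implementation: the tokenizer (lowercase, keep alphanumeric runs) and the helper name `rouge1` are assumptions made for illustration. It shows why a short reference paired with a longer but correct answer gets perfect recall and low precision.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed simplification: lowercase and keep alphanumeric runs only
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge1(reference, candidate):
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Clipped overlap: a unigram counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f

reference = "Page F-1"
candidate = "The section on Financial Statements and Supplementary Data begins on page F-1."
p, r, f = rouge1(reference, candidate)
print(f"precision={p:.3f} recall={r:.3f} f={f:.3f}")
```

All three reference tokens appear among the 13 candidate tokens, so recall is 1.0 while precision is only 3/13.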


---
# **Part 2: Setting Up SEC 10-K Data Pipeline & Retrieval Functionality**

Set up the RAG pipeline to inject relevant context into each generation.

The flow is as follows:
*User Question* -> *Context Retrieval from 10-K* -> *LLM Answers User Question Using Context*

To do this, we need to:
1. Gather specific sections from 10-K filings
2. Parse and chunk their text
3. Vectorize and embed the chunks into a vector database
4. Set up a retriever that semantically searches the database with the user's question and returns relevant context


In [49]:
# Extract Filings Function
def get_filings(ticker):
    global sec_api_key

    # Finding Recent Filings with QueryAPI
    queryApi = QueryApi(api_key=sec_api_key)
    query = {
      "query": f"ticker:{ticker} AND formType:\"10-K\"",
      "from": "0",
      "size": "1",
      "sort": [{ "filedAt": { "order": "desc" } }]
    }
    filings = queryApi.get_filings(query)

    # Getting 10-K URL
    filing_url = filings["filings"][0]["linkToFilingDetails"]

    # Extracting Text with ExtractorAPI
    extractorApi = ExtractorApi(api_key=sec_api_key)
    onea_text = extractorApi.get_section(filing_url, "1A", "text") # Section 1A - Risk Factors
    seven_text = extractorApi.get_section(filing_url, "7", "text") # Section 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations

    # Joining Texts
    combined_text = onea_text + "\n\n" + seven_text

    return combined_text
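
`get_filings` indexes `filings["filings"][0]` directly, which raises an `IndexError` when the query matches nothing (for example, a mistyped ticker). Below is a minimal sketch of a guard, assuming only the response shape used above: a dict with a `"filings"` list whose items carry `"linkToFilingDetails"`. The helper name `first_filing_url` and the sample URL are hypothetical.

```python
def first_filing_url(response, ticker):
    # `response` is assumed to look like the dict returned by queryApi.get_filings above
    filings = response.get("filings", [])
    if not filings:
        raise ValueError(f"No 10-K filings found for ticker {ticker!r}")
    return filings[0]["linkToFilingDetails"]

# Stand-in responses for illustration; real ones come from the SEC API
sample = {"filings": [{"linkToFilingDetails": "https://example.com/10-k.htm"}]}
print(first_filing_url(sample, "AAPL"))

try:
    first_filing_url({"filings": []}, "ZZZZ")
except ValueError as err:
    print(err)
```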

### **Setting Up Embeddings Locally**

In [50]:
# HF Model Path
modelPath = "BAAI/bge-large-en-v1.5"
# Model configuration options: run the embedding model on the GPU via CUDA
model_kwargs = {'device':'cuda'}
# Encoding options: normalize each embedding to unit length
encode_kwargs = {'normalize_embeddings': True}

# Initialize an instance of LangChain's HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,     # Provide the pre-trained model's path
    model_kwargs=model_kwargs, # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
)
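
Setting `normalize_embeddings` to `True` scales each embedding to unit length, so the dot product of two embeddings equals their cosine similarity. The sketch below illustrates this with made-up 3-dimensional vectors (real `bge-large-en-v1.5` embeddings have 1024 dimensions); the helper names are chosen for illustration only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    # Scale the vector so its Euclidean length is 1
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = normalize(a), normalize(b)

# Both values agree up to floating-point error
print(cosine(a, b), dot(a_unit, b_unit))
```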

### **Processing & Defining the Vector Database**

In this flow, we get the data from the SEC API function defined above and then go through three steps:
1. Text Splitting
2. Vectorizing
3. Retrieval Function Setup

In [57]:
!pip install faiss-cpu
# !conda install -c pytorch faiss-gpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0


In [58]:
# Prompt the user to input the stock ticker they want to analyze
ticker = input("What Ticker Would you Like to Analyze? ex. AAPL: ")

print("-----")
print("Getting Filing Data")
# Retrieve the filing data for the specified ticker
filing_data = get_filings(ticker)

print("-----")
print("Initializing Vector Database")
# Initialize a text splitter to divide the filing data into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,         # Maximum size of each chunk
    chunk_overlap = 500,       # Number of characters to overlap between chunks
    length_function = len,     # Function to determine the length of the chunks
    is_separator_regex = False # Whether the separator is a regex pattern
)
# Split the filing data into smaller, manageable chunks
split_data = text_splitter.create_documents([filing_data])

# Create a FAISS vector database from the split data using embeddings
db = FAISS.from_documents(split_data, embeddings)

# Create a retriever object to search within the vector database
retriever = db.as_retriever()

print("-----")
print("Filing Initialized")


What Ticker Would you Like to Analyze? ex. AAPL: AAPL
-----
Getting Filing Data
-----
Initializing Vector Database
-----
Filing Initialized
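
With `chunk_size = 1000` and `chunk_overlap = 500`, consecutive chunks can share up to half of their text. This keeps sentences that straddle a boundary retrievable, at the cost of more chunks to embed. The sketch below uses a plain fixed-width sliding window to show the effect. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, and the helper `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=5)
for chunk in chunks:
    print(chunk)
```

Each printed chunk starts with the last five characters of the previous one, which is the overlap at work.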


In [59]:
# Retrieval Function
def retrieve_context(query):
    global retriever
    retrieved_docs = retriever.invoke(query) # Invoke the retriever with the query to get relevant documents
    context = []
    for doc in retrieved_docs:
        context.append(doc.page_content) # Collect the content of each retrieved document
    return context

In [60]:
context = retrieve_context("How have currency fluctuations impacted the company's net sales and gross margins?")
print(context)

['The weakening of foreign currencies relative to the U.S. dollar adversely affects the U.S. dollar value of the Company&#8217;s foreign currency&#8211;denominated sales and earnings, and generally leads the Company to raise international pricing, potentially reducing demand for the Company&#8217;s products. In some circumstances, for competitive or other reasons, the Company may decide not to raise international pricing to offset the U.S. dollar&#8217;s strengthening, which would adversely affect the U.S. dollar value of the gross margins the Company earns on foreign currency&#8211;denominated sales.', 'The Company&#8217;s profit margins vary across its products, services, geographic segments and distribution channels. For example, the gross margins on the Company&#8217;s products and services vary significantly and can change over time. The Company&#8217;s gross margins are subject to volatility and downward pressure due to a variety of factors, including: continued industry-wide glo

In [63]:
len(context), context[1]

(4,
 'The Company&#8217;s profit margins vary across its products, services, geographic segments and distribution channels. For example, the gross margins on the Company&#8217;s products and services vary significantly and can change over time. The Company&#8217;s gross margins are subject to volatility and downward pressure due to a variety of factors, including: continued industry-wide global product pricing pressures and product pricing actions that the Company may take in response to such pressures; increased competition; the Company&#8217;s ability to effectively stimulate demand for certain of its products and services; compressed product life cycles; supply shortages; potential increases in the cost of components, outside manufacturing services, and developing, acquiring and delivering content for the Company&#8217;s services; the Company&#8217;s ability to manage product quality and warranty costs effectively; shifts in the mix of products and services, or in the geographic,')
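
The retrieved chunks still contain HTML entities such as `&#8217;` and `&#8211;` carried over from the filing text. One option, sketched here as an assumption rather than as part of the original pipeline, is to decode them with the standard library's `html.unescape` before the text is chunked or placed in the prompt:

```python
import html

# Fragment taken from the retrieved context above
raw = "the Company&#8217;s foreign currency&#8211;denominated sales"
clean = html.unescape(raw)  # Decodes numeric and named HTML entities
print(clean)
```

Applying this to `combined_text` inside `get_filings` would clean every chunk at once.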

---
# **Main Script: Putting it All Together!**

Now we'll string everything together into a simple while loop that takes the user's question, retrieves context from the vector DB populated with the chosen company's Form 10-K, and runs inference through our fine-tuned model to generate a response. Enter `x` to exit the loop. Give it a shot!

In [70]:
while True:
    question = input(f"What would you like to know about {ticker}'s form 10-K? ")
    if question == "x": # Enter "x" to exit the loop
        break

    context = retrieve_context(question) # Context Retrieval

    prompt = ft_prompt.format(
        question,
        context,
        "", # output - leave this blank for generation!
    )

    resp = inference(prompt, 1024) # Running Inference
    # print(resp)
    parsed_response = extract_response(resp) # Parsing Response
    print(f"L3 Agent: {parsed_response}")
    print("-----\n")


What would you like to know about AAPL's form 10-K? How have currency fluctuations impacted the company's net sales and gross margins?
L3 Agent: The weakening of foreign currencies relative to the U.S. dollar has a negative impact on the company's net sales and gross margins, as it results in the U.S. dollar value of foreign currency-denominated sales and earnings decreasing. Conversely, a strengthening of foreign currencies relative to the U.S. dollar can cause the company to reduce international pricing or incur losses on foreign currency derivative instruments, thereby limiting the benefit.
-----

What would you like to know about AAPL's form 10-K? What region contributes most to international sales?
L3 Agent: Europe
-----

What would you like to know about AAPL's form 10-K? Where is outsourcing located currently? 
L3 Agent: The Company outsources manufacturing and logistical services to partners primarily located in Asia, with a significant concentration of manufacturing in China m

What region contributes most to international sales?  
Where is outsourcing located currently?  
Does the US dollar weakening help or hurt the company?  
What are significant announcements of products during fiscal year 2023?  
iPhone Net Sales?
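
One detail of the loop above: `retrieve_context` returns a list, so `ft_prompt.format(question, context, "")` inserts the list's Python representation (brackets, quotes, and escape sequences) into the prompt. A possible refinement is to join the chunks into plain text first. This sketch assumes the prompt template takes three positional fields, as in the loop. The template below is a shortened stand-in, not the notebook's `ft_prompt`, and `build_prompt` is a hypothetical helper.

```python
# Shortened stand-in template with the same three positional fields
template = "### Question:\n{}\n\n### Context:\n{}\n\n### Response:\n{}"

def build_prompt(question, chunks, template=template):
    # Join retrieved chunks with blank lines instead of passing the list itself
    context_text = "\n\n".join(chunks)
    return template.format(question, context_text, "")

chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_prompt("What are the main risks?", chunks)
print(prompt)
```

Whether this improves answers depends on how the context was formatted in the fine-tuning data, so it is worth comparing both variants.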
