## Install Required Libraries
Install any necessary libraries that are not pre-installed in the environment.

In [1]:
!pip install -U transformers datasets trl accelerate peft bitsandbytes



## Load the Model and Tokenizer
Many large language models (LLMs) are available on the Hugging Face Hub. Load the model you need and its matching tokenizer with the `AutoModelForCausalLM` and `AutoTokenizer` classes.

In [2]:
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

# The 8B variant is too big for this environment
# base_model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
base_model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

# Load a pre-trained model
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

model.to("cuda")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1536)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=1536, out_features=1536, bias=True)
          (k_proj): Linear(in_features=1536, out_features=256, bias=True)
          (v_proj): Linear(in_features=1536, out_features=256, bias=True)
          (o_proj): Linear(in_features=1536, out_features=1536, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=1536, out_features=8960, bias=False)
          (up_proj): Linear(in_features=1536, out_features=8960, bias=False)
          (down_proj): Linear(in_features=8960, out_features=1536, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((1536,), eps=1e-06)
    (rotary_emb): Qw

## Prepare the Input
To generate text autoregressively, you need to provide an initial prompt or input sequence.

In [3]:
# Define a prompt
def apply_prompt(example):
    prompt = f"""
    ### Question: {example['question']}
    ### Context: {example['context']}
    ### Answer: {example['answer']}
    """
    return prompt.strip()  # Remove leading/trailing whitespace
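
To check what the template produces, the sketch below repeats `apply_prompt` as defined above and applies it to a made-up example (the `example` dictionary is hypothetical, not taken from the dataset). It also shows `apply_prompt_dedented`, an assumed variant that uses `textwrap.dedent` to drop the indentation the triple-quoted string leaves on the second and third lines.

```python
import textwrap

def apply_prompt(example):
    prompt = f"""
    ### Question: {example['question']}
    ### Context: {example['context']}
    ### Answer: {example['answer']}
    """
    return prompt.strip()  # strips only the outer whitespace

def apply_prompt_dedented(example):
    # Hypothetical variant: remove the common indentation before stripping
    prompt = f"""
    ### Question: {example['question']}
    ### Context: {example['context']}
    ### Answer: {example['answer']}
    """
    return textwrap.dedent(prompt).strip()

# Made-up example with an empty context and answer, as used at inference time
example = {"question": "What is 2 + 2?", "context": "", "answer": ""}
original = apply_prompt(example)
dedented = apply_prompt_dedented(example)
print(repr(original))
print(repr(dedented))
```

With an empty answer, the prompt ends in `### Answer:`, which is why splitting the generated text on that marker isolates the model's continuation.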

In [4]:
question = "What are the financial consequences of exchange rate fluctuations?"

# Tokenize the input
inputs = tokenizer(apply_prompt({"question":question, "context":"", "answer":""}), return_tensors="pt").to("cuda")
# return_tensors="pt" returns PyTorch tensors; use return_tensors="tf" for TensorFlow

## Generate Text
Use the generate method to produce text. You can control the behavior of text generation using various parameters:

In [5]:
# Generate text
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,  # Maximum number of newly generated tokens
    use_cache=True,       # Reuse cached attention keys/values between decoding steps
    # num_beams=5,          # Beam search: keep the 5 best candidate sequences
    # temperature=0.7,      # Controls randomness (lower = more deterministic); only used when sampling is enabled (do_sample=True)
    # early_stopping=True,  # With beam search, stop once num_beams finished candidates exist
)

# Decode the generated tokens to text
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
answer = answer.split("### Answer:")[-1].strip()
print(answer)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Exchange rate fluctuations can have various financial consequences. For example, if a country's currency degrades, the value of its exports increases, leading to higher profits. However, if the same country's currency appreciates, the value of its exports decreases, leading to lower profits. Additionally, if a country's currency degrades, the cost of imports increases, leading to higher costs for consumers. If a country's currency appreciates, the cost of imports decreases, leading to lower costs for consumers. Furthermore, if a country's currency degrades, the value of its foreign savings increases, leading to higher returns on foreign savings. However, if the same country's currency appreciates, the value of its foreign savings decreases, leading to lower returns on foreign savings. Additionally, if a country's currency degrades, the value of its foreign debt increases, leading to higher costs for the country. However, if the same country's currency appreciates, the value of its fore
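
As a rough illustration of what the `temperature` parameter does when sampling is enabled, the sketch below applies a temperature-scaled softmax to a few made-up scores. The `logits` values and the helper `softmax_with_temperature` are hypothetical and are not part of the `transformers` API; they only mirror the idea of dividing the scores by the temperature before normalizing.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

# Greedy decoding simply picks the highest-scoring token
greedy_choice = max(range(len(logits)), key=lambda i: logits[i])

# Sampling draws from these probabilities instead
probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_sharp = softmax_with_temperature(logits, temperature=0.7)
print(greedy_choice, probs_default, probs_sharp)
```

Lowering the temperature moves probability mass toward the top-scoring token, so sampling behaves more like greedy decoding.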

## Load the Dataset
Load your dataset. You can use the datasets library from Hugging Face for easy loading and preprocessing.

In [6]:
from datasets import load_dataset
dataset = load_dataset("virattt/financial-qa-10K", split="train")

dataset = dataset.map(lambda example: {"prompt": apply_prompt(example)})
print(dataset[0]["prompt"])

### Question: What area did NVIDIA initially focus on before expanding to other computationally intensive fields?
    ### Context: Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.
    ### Answer: NVIDIA initially focused on PC graphics.


## Preprocessing the Dataset
Tokenize the dataset using the tokenizer:

In [11]:
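# Tokenize the dataset
def tokenize_function(examples):
    output = tokenizer(examples["prompt"], truncation=True, padding="max_length", max_length=128)
    # Copy the input IDs as labels, but set padding positions to -100 so the loss ignores them
    output["labels"] = [
        [tok if keep else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(output["input_ids"], output["attention_mask"])
    ]
    return output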

tokenized_dataset = dataset.map(tokenize_function, batched=True)
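
To make the effect of `truncation=True`, `padding="max_length"` and the labels column concrete, here is a minimal sketch on toy token IDs. The helper `pad_and_label`, the pad ID `0` and the right-side padding are assumptions for illustration; a real tokenizer may pad on the other side. The value `-100` is the index that PyTorch's cross-entropy loss ignores by default, so padded positions do not contribute to the training loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def pad_and_label(token_ids, max_length, pad_id):
    """Truncate or right-pad token_ids to max_length; build attention mask and labels."""
    ids = token_ids[:max_length]                   # truncation
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,       # padding
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        "labels": ids + [IGNORE_INDEX] * n_pad,    # padded positions are masked out
    }

short = pad_and_label([11, 12, 13], max_length=5, pad_id=0)
long = pad_and_label([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
print(short)
print(long)
```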

## Train Model

In [14]:
# Fine-tune with LoRA adapters via PEFT

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
from trl import SFTTrainer

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # Rank of the LoRA update matrices
    lora_alpha=16,  # Scaling factor; updates are scaled by lora_alpha / r
    lora_dropout=0.05,  # Dropout probability for LoRA layers
    bias="none",  # Which bias terms to train: 'none', 'all' or 'lora_only'
    task_type="CAUSAL_LM",  # Task type
)

# Prepare model for k-bit training if using 4-bit quantization (BitsAndBytes)
# model = prepare_model_for_kbit_training(model)

# Get PEFT model
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()

# Define training arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",  # Output directory
    num_train_epochs=1,  # Number of training epochs
    per_device_train_batch_size=2,  # Batch size per device during training
    gradient_accumulation_steps=1,  # Number of batches to accumulate gradients over before each optimizer update
    save_steps=1000,  # Save a checkpoint every 1000 update steps
    logging_steps=25,  # Log every 25 update steps
    learning_rate=2e-4,  # Learning rate
    weight_decay=0.001,  # Weight decay
    fp16=True,  # Use mixed precision training
    max_grad_norm=0.3, # Max gradient norm
    max_steps=-1, # Use num_train_epochs instead
    warmup_ratio=0.03, # Linear warmup ratio
    group_by_length=True, # Group samples by length
    lr_scheduler_type="cosine", # Learning rate scheduler
    report_to="none", # Reporting tool
)

# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_dataset,
    peft_config=lora_config,
    dataset_text_field="prompt", # Field in dataset containing the text
    max_seq_length=512, # Maximum sequence length
    tokenizer=tokenizer,
    args=training_args,
    packing=False, # Do not pack several short examples into one sequence
)

# Start training
trainer.train()

KeyboardInterrupt: 
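
For a sense of how few weights LoRA trains, the sketch below counts the adapter parameters implied by `r=16` and the layer shapes printed earlier. It assumes that only `q_proj` and `v_proj` are adapted, since `LoraConfig` above sets no `target_modules`; that choice is an assumption here, and the authoritative number is whatever `model.print_trainable_parameters()` reports.

```python
def lora_params(in_features, out_features, r):
    # Each adapted Linear layer gets A (r x in_features) and B (out_features x r)
    return r * (in_features + out_features)

r = 16
lora_alpha = 16
num_layers = 28

# Assumed target modules, with (in_features, out_features) from the printed model
target_shapes = {"q_proj": (1536, 1536), "v_proj": (1536, 256)}

per_layer = sum(lora_params(i, o, r) for i, o in target_shapes.values())
total_lora = per_layer * num_layers
scaling = lora_alpha / r  # factor applied to the low-rank update

print(per_layer, total_lora, scaling)
```

Under these assumptions the adapters add about 2.2 million trainable parameters, a small fraction of the base model.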

In [12]:
from transformers import TrainingArguments, Trainer

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    # evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none"
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Fine-tune the model
trainer.train()

OutOfMemoryError: CUDA out of memory. Tried to allocate 892.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 534.12 MiB is free. Process 5822 has 14.22 GiB memory in use. Of the allocated memory 13.66 GiB is allocated by PyTorch, and 429.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
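
The out-of-memory error is consistent with a back-of-the-envelope estimate. The sketch below counts only the parameters visible in the printed architecture (the printout is truncated before the output head, which is therefore left out) and assumes full fine-tuning with 32-bit weights, 32-bit gradients and two 32-bit AdamW moment buffers, ignoring activations. These are assumptions for illustration, not measurements.

```python
GIB = 1024 ** 3

def linear(in_features, out_features, bias):
    """Parameter count of a Linear layer."""
    return in_features * out_features + (out_features if bias else 0)

hidden, kv, mlp, vocab, layers = 1536, 256, 8960, 151936, 28

per_layer = (
    linear(hidden, hidden, True)       # q_proj
    + 2 * linear(hidden, kv, True)     # k_proj and v_proj
    + linear(hidden, hidden, False)    # o_proj
    + 3 * linear(hidden, mlp, False)   # gate_proj, up_proj, down_proj (same size each)
    + 2 * hidden                       # two RMSNorm weight vectors
)
total_params = vocab * hidden + layers * per_layer + hidden  # embeddings + layers + final norm

# Assumed bytes per parameter: 4 (weight) + 4 (gradient) + 8 (two AdamW moments)
bytes_per_param = 4 + 4 + 8
full_ft_gib = total_params * bytes_per_param / GIB

print(f"{total_params:,} parameters, about {full_ft_gib:.1f} GiB before activations")
```

Even this partial count exceeds the 14.74 GiB reported for the GPU, which is why training only the LoRA adapters is the more realistic option here.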

## Model inference after fine-tuning

In [None]:
question = "What are the financial consequences of exchange rate fluctuations?"

# Tokenize the input
inputs = tokenizer(apply_prompt({"question":question, "context":"", "answer":""}), return_tensors="pt").to("cuda")

# Generate text
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)

# Decode the generated tokens to text
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
answer = answer.split("### Answer:")[-1].strip()
print(answer)

## Save model locally

In [None]:
new_model_name = "DeepSeek-R1-Financial"
model.save_pretrained(new_model_name) # Local saving
tokenizer.save_pretrained(new_model_name)
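# save_pretrained_merged is an Unsloth helper and is not defined on a plain PEFT model.
# With PEFT, merge the adapter into a copy (needs memory for a second copy) so `model` keeps its adapter.
import copy
merged_model = copy.deepcopy(model).merge_and_unload()
merged_model.save_pretrained(f"{new_model_name}-merged")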

## Push model to HuggingFace hub

In [None]:
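# This notebook runs on Colab (see the HF_TOKEN message above), so read the token from Colab secrets;
# on Kaggle, use kaggle_secrets.UserSecretsClient().get_secret(...) instead
from google.colab import userdata
hf_token = userdata.get("HUGGINGFACE_TOKEN")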


from huggingface_hub import login
login(hf_token)

In [None]:
hf_model_name = f"wangbn/{new_model_name}"

model.push_to_hub(hf_model_name) # Online saving
tokenizer.push_to_hub(hf_model_name) # Online saving
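# push_to_hub_merged is an Unsloth helper; with PEFT, merge a copy of the model and push that
import copy
copy.deepcopy(model).merge_and_unload().push_to_hub(f"{hf_model_name}-merged")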