The code and further work are inspired by [Liu T., Hsiang B. 2023](https://arxiv.org/pdf/2305.14201.pdf).

# 0. Dataset Generation

Since we have nothing to test our ideas on yet, let us create a dataset. Naturally, it will contain a lot of addition examples, but there is a twist: we will also generate pairs of numbers together with the result of their *subtraction*. Addition and subtraction are inverse operations, which means addition undoes subtraction and subtraction undoes addition. As a human being, that is why I find subtraction examples useful too. Human logic might not carry over to LLM logic, but that is exactly what we are here to test.

We will train two models, one with and one without the subtraction part of the dataset, and compare their results to confirm or refute my hypothesis.

The dataset mainly follows Dolly-2.0 style (instruction dataset). It has four keys: 'instruction', 'input', 'output', 'answer'.
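
To make the format concrete, here is a minimal sketch of how a single record with these four keys could be built. The helper names `make_addition_record` and `make_subtraction_record` are hypothetical and are not taken from `dataset_generator.py`; the field layout mirrors the sample rows printed later in this notebook, and the real generator also varies the instruction wording.

```python
import random


def make_addition_record(a: int, b: int) -> dict:
    """Build one instruction-style record for a + b (hypothetical helper)."""
    result = a + b
    return {
        "instruction": f"{a}+{b}=",
        "input": f"{a} + {b}",
        "output": f"{a} + {b} = {result}",
        "answer": str(result),
    }


def make_subtraction_record(a: int, b: int) -> dict:
    """Build one instruction-style record for a - b (hypothetical helper)."""
    result = a - b
    return {
        "instruction": f"{a}-{b}=",
        "input": f"{a} - {b}",
        "output": f"{a} - {b} = {result}",
        "answer": str(result),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
a, b = rng.randint(0, 10**6), rng.randint(0, 10**6)

add = make_addition_record(a, b)
# Subtraction undoes addition: (a + b) - b gives back a.
sub = make_subtraction_record(int(add["answer"]), b)
assert int(sub["answer"]) == a
print(add)
print(sub)
```

The last three lines show the inverse relationship the subtraction examples are meant to expose: every addition record has a matching subtraction record that recovers one of the operands.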

*uncomment this if you have not generated the data yet*

In [3]:
# !python3 dataset_generator.py

Addition: 568000
Subtraction: 56800
Total: 624800
Arithmetic dataset generated!
Total: 624800
Dataset generated!


# 1. Prompt Engineering

First, the most obvious option is to pick a small model and run a series of experiments to see how well it works "out of the box".

The model we are going to test is [RedPajama-INCITE-3B](https://www.together.xyz/blog/redpajama-3b-updates), because it is an open-source model based on LLaMA and it is relatively small. It fits comfortably under the constraint of 4B parameters and can be trained quickly with fairly low hardware requirements (and I am a broke student). Another strength is the tokenisation inherited from LLaMA: it splits numbers into individual digits, while other models may interpret '232' as '2' and '32', or as '23' and '2' (see the full comparison across models in [Nogueira et al. 2021](https://arxiv.org/pdf/2102.13019.pdf)). Yes, this might be a drawback for something more 'solid', like years or dates, but that is not our case, so it will not affect our performance.
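
Digit-level splitting is worth verifying on the actual tokenizer rather than taking it for granted. Below is a small sketch of such a check. The function `splits_into_digits` and the two toy tokenizers are hypothetical and only stand in for real ones; the sketch assumes the tokenizer is exposed as a callable that returns a list of token strings.

```python
from typing import Callable, List


def splits_into_digits(tokenize: Callable[[str], List[str]], number: str) -> bool:
    """Return True if `tokenize` maps every digit of `number` to its own token."""
    # Some tokenizers prepend a whitespace marker ("Ġ" or "▁") to tokens; drop it.
    tokens = [t.lstrip("Ġ▁ ") for t in tokenize(number)]
    tokens = [t for t in tokens if t]  # discard tokens that were only markers
    return tokens == list(number)


# Two toy tokenizers, made up for illustration only.
def digit_level(text: str) -> List[str]:
    return list(text)


def chunked(text: str) -> List[str]:
    return [text[i:i + 2] for i in range(0, len(text), 2)]


print(splits_into_digits(digit_level, "232"))  # True
print(splits_into_digits(chunked, "232"))      # False: '232' becomes '23' and '2'
```

With `transformers`, the `tokenize` method of a loaded tokenizer can be passed as the `tokenize` argument to run the same check on a real model.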

In [None]:
! pip install transformers

In [2]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

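from packaging import version  # packaging is installed as a dependency of transformers

MIN_TRANSFORMERS_VERSION = '4.25.1'

# check transformers version; compare parsed versions, not raw strings
# (as plain strings, '4.9.0' >= '4.25.1' would wrongly be True)
assert version.parse(transformers.__version__) >= version.parse(MIN_TRANSFORMERS_VERSION), f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'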


In [None]:
# init
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Instruct-3B-v1")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-INCITE-Instruct-3B-v1", torch_dtype=torch.float16)
model = model.to('cuda:0')

In [13]:
# infer
prompt = "Q: one plus three\nA: four\nQ: twenty plus forty two\nA: sixty two\nQ: twelve plus thirty three\n"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
input_length = inputs.input_ids.shape[1]
outputs = model.generate(
    **inputs, max_new_tokens=7, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
)
token = outputs.sequences[0, input_length:]
output_str = tokenizer.decode(token)
print(output_str)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


A: fifty seven
Q:


In [14]:
# infer
prompt = "Q: 1 plus 3\nA: 4\nQ: 20 plus 42\nA: 62 \nQ: 12 plus 33\n"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
input_length = inputs.input_ids.shape[1]
outputs = model.generate(
    **inputs, max_new_tokens=10, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
)
token = outputs.sequences[0, input_length:]
output_str = tokenizer.decode(token)
print(output_str)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


A: 45
Q: -1 plus 4
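
Both experiments use the same `Q: ... / A: ...` few-shot layout, and the model keeps generating past the answer (it starts a new `Q:` line). A minimal sketch of building such prompts and cutting the generation down to the first answer is shown here; `build_few_shot_prompt` and `first_answer` are hypothetical helper names, not part of the repository.

```python
from typing import List, Optional, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format solved examples followed by an unanswered question."""
    shots = "".join(f"Q: {q}\nA: {a}\n" for q, a in examples)
    return f"{shots}Q: {query}\n"


def first_answer(generated: str) -> Optional[str]:
    """Return the text of the first 'A:' line, ignoring anything generated after it."""
    for line in generated.splitlines():
        line = line.strip()
        if line.startswith("A:"):
            return line[len("A:"):].strip()
    return None


prompt = build_few_shot_prompt([("1 plus 3", "4"), ("20 plus 42", "62")], "12 plus 33")
print(repr(prompt))
print(first_answer("A: 45\nQ: -1 plus 4"))  # 45
```

Parsing only the first `A:` line makes it possible to score many prompts automatically instead of reading the raw generations by eye.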


# 2. Fine-Tuning and Test

## 2.1. Training

First of all, let us set everything up. If you want to do some fine-tuning on your own, please run the code below. It was originally run in Colab, so if you are using a local machine, you can skip some steps and go straight to generating the data.

In [None]:
! git clone https://github.com/xufana/4B_LLM_Calculator.git
%cd /content/4B_LLM_Calculator
! pip install -r requirements.txt
! python3 dataset_generator.py --add_volume 100 --sub_volume 100

*uncomment if you want to use wandb*

In [None]:
#import wandb
#wandb.login()

The default config is:
```
    base_model: str = "togethercomputer/RedPajama-INCITE-Instruct-3B-v1",

    batch_size: int = 16,
    micro_batch_size: int = 4,
    num_epochs: int = 1,
    learning_rate: float = 2e-4,
    cutoff_len: int = 512,
    val_set_size: int = 0,
    
    # lora hyperparams
    lora_r: int = 8,
    lora_alpha: int = 32,
    lora_dropout: float = 0.05,
```

This run took 8 hours to fine-tune on one T4 and used about 10.2 GB of VRAM. I used Colab Pro, but it seems to fit within the free version as well. The version with `batch_size = 128` and `micro_batch_size = 16` took about 12 GB of VRAM, so it should work too.

Note that I set `lora_r = 8`, while the authors of Goat used `lora_r = 16`; I reduced it to fit into Colab (I am still a poor student, after all).
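
The two batch settings are related through gradient accumulation. The sketch below assumes `lora_training.py` follows the common alpaca-lora convention, where `batch_size` is the effective batch size and `micro_batch_size` is what is actually placed on the GPU per step; this is an assumption, since the script itself is not shown here, and `gradient_accumulation_steps` is a hypothetical helper.

```python
def gradient_accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    """Number of micro-batches accumulated before one optimizer step."""
    if batch_size % micro_batch_size != 0:
        raise ValueError("batch_size should be a multiple of micro_batch_size")
    return batch_size // micro_batch_size


# The two configurations mentioned in the text.
for batch_size, micro_batch_size in [(16, 4), (128, 16)]:
    steps = gradient_accumulation_steps(batch_size, micro_batch_size)
    print(f"batch_size={batch_size}, micro_batch_size={micro_batch_size} -> {steps} accumulation steps")
```

Under that assumption, VRAM usage is driven mostly by `micro_batch_size`, which would be consistent with the larger configuration needing only slightly more memory.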

In [None]:
! python3 lora_training.py

`trainable params: 2621440 || all params: 2778485760 || trainable%: 0.09434779323828531`

nice to see
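
The trainable parameter count can be reproduced by hand. The sketch below assumes that LoRA is applied only to the fused `query_key_value` projection in each of 32 layers, with a hidden size of 2560. These values are assumptions about the model configuration and the training script, not something stated above, although they do reproduce the reported number.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two matrices per adapted weight: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


HIDDEN_SIZE = 2560   # assumed hidden size of the base model
NUM_LAYERS = 32      # assumed number of transformer layers
LORA_R = 8           # from the config above

# Assumed target: the fused query/key/value projection, hidden -> 3 * hidden.
per_layer = lora_params(HIDDEN_SIZE, 3 * HIDDEN_SIZE, LORA_R)
trainable = per_layer * NUM_LAYERS
all_params = 2778485760  # "all params" from the training log

print(trainable)                      # 2621440
print(100 * trainable / all_params)   # percentage of trainable parameters
```

Since the count is linear in `r`, using `lora_r = 16` would double the number of trainable parameters under the same assumptions.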

## 2.2. Inference

We will generate some more addition data to test the model:

In [1]:
! python3 dataset_generator.py --dataset_name "test.json" --need_sub False --add_volume 100

Addition: 56800
Total: 56800
Adding instructions and noise
Total: 56800
Dataset generated!


In [None]:
TEMPERATURE = 0.1
TOP_P = 0.75
TOP_K = 40
NUM_BEAMS = 4
MAX_NEW_TOKENS = 10

Let's load test dataset for the inference:

In [None]:
import re
import sys
import torch
import json
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

from utils.prompter import Prompter
from datasets import load_dataset

#data_files = {"train": "train.csv", "test": "test.csv"}
data = load_dataset("json", data_files="data/test.json")

In [None]:
# MODEL INITIALISATION

base_model = "togethercomputer/RedPajama-INCITE-Instruct-3B-v1"
lora_weights = "xufana/RedPajama-3B-Addition"
load_8bit = True

prompter = Prompter()
tokenizer = AutoTokenizer.from_pretrained(base_model)

In [None]:
pajama = AutoModelForCausalLM.from_pretrained(
    base_model, 
    load_in_8bit=load_8bit,
    torch_dtype=torch.float16,
    device_map='auto',
)
model = PeftModel.from_pretrained(
    pajama,
    lora_weights,
    torch_dtype=torch.float16,
    device_map={'': 0}
)

if not load_8bit:
    model.half()

In [None]:
model.eval()
if torch.__version__ >= "2" and sys.platform != "win32":
    model = torch.compile(model)

In [5]:
# ANSWERS GENERATION

answers = []
for row in data["train"]:
    instruction = row["instruction"]
    prompt = prompter.generate_prompt(instruction)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to("cuda")
    generation_config = GenerationConfig(
        temperature=TEMPERATURE,
        top_p=TOP_P,
        top_k=TOP_K,
        num_beams=NUM_BEAMS,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=MAX_NEW_TOKENS,
        )
    output = tokenizer.decode(generation_output.sequences[0], skip_special_tokens=True)
    answer = prompter.get_response(output)

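    try:
        answer = int(answer)
    except (ValueError, TypeError):
        # fall back to the first integer found in the full decoded output
        answer = int(re.search(r'\d+', output).group())

    answers.append({"numbers": row["input"], "answer": answer, "ground_truth": int(row["answer"])})

import os  # only needed to create the output directory

# the original string concatenation produced "./resultsaddition_only_answers.json"
os.makedirs("results", exist_ok=True)
with open(os.path.join("results", "addition_only_answers.json"), "w") as f:
    json.dump(answers, f, indent=4)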

{'answer': '1413054210151409278', 'output': '490200312900021509 + 922853897251387769 = 1413054210151409278', 'input': '490200312900021509 + 922853897251387769', 'instruction': '490200312900021509+922853897251387769='}
{'answer': '8906300461271423', 'output': '1252050841 + 8906299209220582 = 8906300461271423', 'input': '1252050841 + 8906299209220582', 'instruction': 'compute 1252050841+8906299209220582'}
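
One thing to watch in the regex fallback used in the generation loop: `output` is decoded from the full generated sequence, prompt included, so the first run of digits is likely to be a number from the prompt (for example the first operand) rather than the model's answer. A possible alternative, sketched here with a hypothetical helper `extract_answer`, is to take the last integer in the text instead. Whether this suits the actual prompt template is an assumption that should be checked against real outputs.

```python
import re
from typing import Optional


def extract_answer(text: str) -> Optional[int]:
    """Return the last integer that appears in `text`, or None if there is none."""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None


sample = "490200312900021509+922853897251387769=1413054210151409278"
print(re.search(r"\d+", sample).group())  # first operand, not the answer
print(extract_answer(sample))             # the number after '='
```

Returning `None` instead of raising lets the caller count unparseable generations as wrong answers rather than aborting the whole evaluation run.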


In [None]:
from sklearn.metrics import accuracy_score

# iterate over `answers` (built in the generation loop), not over the raw dataset
y_true = [entry['ground_truth'] for entry in answers]
y_pred = [entry['answer'] for entry in answers]

accuracy = accuracy_score(y_true, y_pred)

print('This is how well I have trained my model:', accuracy)
print('idk why it ended up that bad...')
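
A single accuracy number hides where the model fails. A minimal sketch of a breakdown by operand length follows; `accuracy_by_length` is a hypothetical helper, it assumes the `numbers` field has the `a + b` form seen in the sample rows, and the toy entries are made up for illustration and are not real results.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_length(answers: List[dict]) -> Dict[int, float]:
    """Exact-match accuracy grouped by the digit count of the longer operand."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for entry in answers:
        left, right = entry["numbers"].split(" + ")
        n_digits = max(len(left), len(right))
        totals[n_digits] += 1
        hits[n_digits] += int(entry["answer"] == entry["ground_truth"])
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Toy entries in the same shape as `answers` (not real model outputs).
toy = [
    {"numbers": "12 + 33", "answer": 45, "ground_truth": 45},
    {"numbers": "99 + 1", "answer": 10, "ground_truth": 100},
    {"numbers": "123 + 456", "answer": 579, "ground_truth": 579},
]
print(accuracy_by_length(toy))  # {2: 0.5, 3: 1.0}
```

If accuracy drops sharply beyond a certain number of digits, that points at length generalisation rather than at the parsing or the training setup.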

# 3. Model compression [TBD]

The idea here is to take an existing solution that is slightly bigger than we need and try to compress it. I will take the Goat-7B model by [Liu T., Hsiang B. 2023](https://arxiv.org/pdf/2305.14201.pdf) ([HF link for the weights](https://huggingface.co/tiedong/goat-lora-7b)). It has already been fine-tuned for arithmetic tasks, in particular addition of numbers with up to 16 digits, so it is essentially ready, just a little bigger than required.