# Fine-tune an LLM on your database

<!-- TABS -->
## Connect to superduper

:::note
Note that this is only relevant if you are running superduper in development mode.
Otherwise refer to "Configuring your production system".
:::

In [1]:
APPLY = True
TABLE_NAME = 'sample_llm_finetuning'

In [2]:
from superduper import superduper

db = superduper('mongomock://test_db', force_apply=True)

2025-Jan-13 13:15:39.10| INFO     | Duncans-MBP.fritz.box| superduper.misc.plugins:13   | Loading plugin: mongodb
2025-Jan-13 13:15:39.19| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:64   | Building Data Layer
2025-Jan-13 13:15:39.19| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:79   | Data Layer built
2025-Jan-13 13:15:39.19| INFO     | Duncans-MBP.fritz.box| superduper.backends.base.cluster:99   | Cluster initialized in 0.00 seconds.
2025-Jan-13 13:15:39.19| INFO     | Duncans-MBP.fritz.box| superduper.base.build:184  | Configuration: 
+---------------+---------------------+
| Configuration |        Value        |
+---------------+---------------------+
|  Data Backend | mongomock://test_db |
+-------

<!-- TABS -->
## Get LLM Finetuning Data

The following example loads a small sample of training data in plain-text format, where each record stores a whole conversation in a single `text` field.

In [3]:
from datasets import load_dataset
from superduper.base.document import Document

def getter():
    """Return a small sample of the guanaco dataset, tagged by fold."""
    dataset_name = "timdettmers/openassistant-guanaco"
    dataset = load_dataset(dataset_name)

    train_dataset = dataset["train"]
    eval_dataset = dataset["test"]

    # Keep only a few records so the example runs quickly;
    # `_fold` marks each record as training or validation data.
    train_documents = [{**example, "_fold": "train"} for example in train_dataset][:10]
    eval_documents = [{**example, "_fold": "valid"} for example in eval_dataset][:5]

    data = train_documents + eval_documents

    return data
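
As a small, self-contained sketch of what `getter` returns, the snippet below applies the same `_fold` tagging to made-up records instead of the downloaded dataset. The helper name `tag_fold` and the sample texts are illustrative assumptions, not part of superduper or the dataset.

```python
from collections import Counter


def tag_fold(examples, fold, limit):
    """Copy up to `limit` records and mark each with the given fold."""
    return [{**example, "_fold": fold} for example in examples[:limit]]


# Made-up records standing in for the real dataset rows.
fake_train = [{"text": f"### Human: question {i}### Assistant: answer {i}"} for i in range(20)]
fake_eval = [{"text": f"### Human: check {i}### Assistant: reply {i}"} for i in range(8)]

sample = tag_fold(fake_train, "train", 10) + tag_fold(fake_eval, "valid", 5)

# Count how many records ended up in each fold.
fold_counts = Counter(record["_fold"] for record in sample)
print(dict(fold_counts))
```

Each record keeps its original `text` field and gains a `_fold` field, which is the shape that gets inserted into the table later on.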

In [4]:
if APPLY:
    data = getter()

Repo card metadata block was not found. Setting CardData to empty.


Because this dataset stores each conversation in a single `text` field, we train on that field directly and do not need a transform.

In [5]:
transform = None
key = 'text'
training_kwargs = dict(dataset_text_field="text")

The following cell splits one record into an example prompt (`input_text`) and response (`output_text`).

In [6]:
if APPLY:
    d = data[0]
    input_text, output_text = d["text"].rsplit("### Assistant: ", maxsplit=1)
    input_text += "### Assistant: "
    output_text = output_text.rsplit("### Human:")[0]
    print("Input: --------------")
    print(input_text)
    print("Response: --------------")
    print(output_text)

Input: --------------
### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: 
Response: --------------
"Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.

Recent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited
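
The same splitting logic can be wrapped in a helper, sketched here on a short made-up conversation. `split_prompt_response` is a hypothetical name, and the sketch assumes the text contains at least one `### Assistant: ` marker (otherwise the unpacking raises a `ValueError`).

```python
ASSISTANT_MARKER = "### Assistant: "
HUMAN_MARKER = "### Human:"


def split_prompt_response(text):
    """Split a conversation into (prompt, final assistant response)."""
    # Everything up to the last assistant marker is the prompt.
    prompt, response = text.rsplit(ASSISTANT_MARKER, maxsplit=1)
    prompt += ASSISTANT_MARKER
    # Drop any trailing human turn that follows the response.
    response = response.split(HUMAN_MARKER)[0]
    return prompt, response


# Made-up two-turn conversation in the same marker format.
conversation = "### Human: Hi### Assistant: Hello!### Human: Bye### Assistant: Goodbye"
prompt, response = split_prompt_response(conversation)
print(repr(prompt))
print(repr(response))
```

Splitting from the right keeps earlier turns in the prompt, so the model is asked to continue the full conversation rather than only the last question.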

<!-- TABS -->
## Insert simple data

With auto_schema enabled, we can insert data directly: superduper inspects the records, infers the data type of each field, and creates the table with a matching schema.
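
As a rough illustration of what schema inference involves, the sketch below maps each field to the name of its Python type. This is not superduper's implementation; `infer_schema` and the sample records are assumptions made for the example.

```python
def infer_schema(records):
    """Map each field name to the name of its Python type."""
    schema = {}
    for record in records:
        for field, value in record.items():
            type_name = type(value).__name__
            # Refuse to guess when one field holds mixed types.
            if schema.setdefault(field, type_name) != type_name:
                raise TypeError(f"Field {field!r} has mixed types")
    return schema


# Made-up records with the same fields as the fine-tuning data.
records = [
    {"text": "### Human: Hi### Assistant: Hello!", "_fold": "train"},
    {"text": "### Human: Bye### Assistant: Goodbye", "_fold": "valid"},
]
print(infer_schema(records))
```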

In [7]:
table_or_collection = db[TABLE_NAME]

if APPLY:
    table_or_collection.insert(data).execute()

select = table_or_collection.select()

2025-Jan-13 13:15:41.75| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:593  | Component (table, sample_llm_finetuning) not found in cache, loading from db
2025-Jan-13 13:15:41.75| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:599  | Load (('table', 'sample_llm_finetuning')) from metadata...
2025-Jan-13 13:15:41.75| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:331  | Table sample_llm_finetuning does not exist, auto creating...
2025-Jan-13 13:15:42.40| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:337  | Creating table sample_llm_finetuning with schema {('text', 'str'), ('_fold', 'str')}
2025-Jan-13 13:15:42.40| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:593  | Co

## Select a Model

In [8]:
model_name = "Qwen/Qwen2.5-0.5B"
model_kwargs = dict()
tokenizer_kwargs = dict()

<!-- TABS -->
## Build A Trainable LLM

**Create an LLM Trainer for training**

`LLMTrainer` accepts essentially the same parameters as `transformers.TrainingArguments`, plus a few extra ones (such as `use_lora`) that simplify training setup.

In [9]:
from superduper_transformers import LLM, LLMTrainer

trainer = LLMTrainer(
    identifier="llm-finetune-trainer",
    output_dir="output/finetune",
    overwrite_output_dir=True,
    num_train_epochs=2,
    save_total_limit=10,
    logging_steps=1,
    evaluation_strategy="steps",
    save_steps=100,
    eval_steps=100,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    key=key,
    select=select,
    transform=transform,
    training_kwargs=training_kwargs,
    use_lora=True,
)
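
To get a feel for how these settings interact, here is a back-of-the-envelope calculation. It assumes a single device and the 10 training records used in this example, and it only approximates how the underlying trainer counts steps.

```python
import math

num_train_examples = 10           # training records kept in this example
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
num_train_epochs = 2
eval_steps = 100
num_devices = 1                   # assumption: training on a single device

# Micro-batches seen per epoch across all devices.
batches_per_epoch = math.ceil(
    num_train_examples / (per_device_train_batch_size * num_devices)
)
# One optimizer update happens every `gradient_accumulation_steps` micro-batches.
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = updates_per_epoch * num_train_epochs

# Number of examples contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# How many step-based evaluations fit into the run.
step_based_evals = total_updates // eval_steps

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer updates:    {total_updates}")
print(f"step-based evals:     {step_based_evals}")
```

Under these assumptions the run finishes after roughly 10 optimizer updates, so an evaluation or checkpoint every 100 steps would likely never trigger. On a dataset this small, lower `eval_steps` and `save_steps` if you want intermediate metrics.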

Create a trainable LLM and apply it to the database; the training job then runs automatically.

In [10]:
llm = LLM(
    identifier="llm",
    model_name_or_path=model_name,
    trainer=trainer,
    model_kwargs=model_kwargs,
    tokenizer_kwargs=tokenizer_kwargs,
)

In [11]:
if APPLY:
    db.apply(llm, force=True)

2025-Jan-13 13:15:43.70| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:593  | Component (trainer, llm-finetune-trainer) not found in cache, loading from db
2025-Jan-13 13:15:43.70| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:599  | Load (('trainer', 'llm-finetune-trainer')) from metadata...
2025-Jan-13 13:15:43.70| INFO     | Duncans-MBP.fritz.box| superduper.base.apply:359  | Found new trainer:llm-finetune-trainer:43d4b6d421394594
2025-Jan-13 13:15:43.70| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:593  | Component (model, llm) not found in cache, loading from db
2025-Jan-13 13:15:43.70| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:599  | Load (('model', 'llm')) from metada

Device set to use mps:0


2025-Jan-13 13:15:45.01| INFO     | Duncans-MBP.fritz.box| superduper_transformers.training:284  | Start training, experiment_id: f38a3a5dcec325a3f02d9b31e476e9fb1f77c6f5
2025-Jan-13 13:15:45.02| INFO     | Duncans-MBP.fritz.box| superduper_transformers.training:487  | Start training LLM model
2025-Jan-13 13:15:45.02| INFO     | Duncans-MBP.fritz.box| superduper_transformers.training:488  | training_args: LLMTrainer(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
bits=N



2025-Jan-13 13:15:46.09| INFO     | Duncans-MBP.fritz.box| superduper_transformers.training:519  | tokenizer_kwargs: %s {'pretrained_model_name_or_path': 'Qwen/Qwen2.5-0.5B'}
2025-Jan-13 13:15:46.41| INFO     | Duncans-MBP.fritz.box| superduper_transformers.training:534  | Preparing LoRA training
trainable params: 4,399,104 || all params: 498,431,872 || trainable%: 0.8826
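
The LoRA summary line above can be checked by hand. The sketch below recomputes the trainable percentage from the two logged counts and, as a side note, shows the standard LoRA parameter count for a single linear layer, where the weight update is factored into two rank-`r` matrices. The layer sizes passed to `lora_param_count` are hypothetical and are not the model's actual dimensions.

```python
# Counts copied from the logged summary line.
trainable_params = 4_399_104
all_params = 498_431_872

trainable_percent = 100 * trainable_params / all_params
print(f"trainable%: {trainable_percent:.4f}")


def lora_param_count(d_in, d_out, rank):
    """Parameters added by LoRA to one linear layer: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer: a 1024 x 1024 projection adapted with rank 8.
full_layer_params = 1024 * 1024
adapter_params = lora_param_count(1024, 1024, 8)
print(f"full layer: {full_layer_params}, LoRA adapter: {adapter_params}")
```

Because only the small adapter matrices are trained, less than 1% of the parameters receive gradient updates in this run.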



Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.



2025-Jan-13 13:15:46.68| INFO     | Duncans-MBP.fritz.box| superduper_transformers.training:550  | Add callback <superduper_transformers.training.LLMCallback object at 0x2aa392380>


Step,Training Loss,Validation Loss


2025-Jan-13 13:15:57.31| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:562  | Component 42d4b7ba8f204ee6 not found in cache, loading from db with uuid
2025-Jan-13 13:15:57.31| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:562  | Component 43d4b6d421394594 not found in cache, loading from db with uuid
2025-Jan-13 13:15:57.33| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:584  | Adding trainer:llm-finetune-trainer:43d4b6d421394594 to cache
2025-Jan-13 13:15:57.33| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:562  | Component c7450d30113d4fd0 not found in cache, loading from db with uuid
2025-Jan-13 13:15:57.33| INFO     | Duncans-MBP.fritz.box| superduper.base.datalayer:584  | Add

Device set to use mps:0
The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'DiffLlamaForCausalLM', 'ElectraForCausalLM', 'Emu3ForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GlmForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'Jamb

2025-Jan-13 13:15:59.15| INFO     | Duncans-MBP.fritz.box| superduper.backends.local.queue:120  | Consumed all events


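We can also package this fine-tuning setup as a reusable `Template`, in which the table name, model name, `use_lora`, and `num_train_epochs` become variables, and then export it to the current directory.

In [None]: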
from superduper import Template, Table, Schema, Application
from superduper.components.dataset import RemoteData

llm.trainer.use_lora = "<var:use_lora>"
llm.trainer.num_train_epochs = "<var:num_train_epochs>"

app = Application(identifier="llm", components=[llm])

t = Template(
    'llm-finetune',
    template=app,
    substitutions={
        TABLE_NAME: 'table_name',
        model_name: 'model_name',
    },
    default_table=Table(
        'sample_llm_finetuning',
        schema=Schema(
            'sample_llm_finetuning/schema',
            fields={'text': 'str', '_fold': 'str'},
        ),
        data=RemoteData(
            'llm_finetuning',
            getter=getter,
        ),
    ),
    template_variables=['table_name', 'model_name', 'use_lora', 'num_train_epochs'],
    types={
        'collection': {
            'type': 'str',
            'default': 'dataset',
        },
        'model_name': {
            'type': 'str',
            'default': 'Qwen/Qwen2.5-0.5B',
        },
        'use_lora': {
            'type': 'bool',
            'default': True,
        },
        'num_train_epochs': {
            'type': 'int',
            'default': 3
        },
        'table_name': {
            'type': 'str',
            'default': 'sample_llm_finetuning',
        }
    }
)

t.export('.')
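
To illustrate the idea behind `<var:...>` placeholders, here is a minimal, hypothetical sketch of filling template variables in a nested configuration. It is a conceptual illustration only, not superduper's implementation; `fill_variables` and the `config` dictionary are assumptions made for the example.

```python
import re

VAR_PATTERN = re.compile(r"<var:(\w+)>")


def fill_variables(obj, values):
    """Recursively replace `<var:name>` placeholders with concrete values."""
    if isinstance(obj, dict):
        return {key: fill_variables(value, values) for key, value in obj.items()}
    if isinstance(obj, list):
        return [fill_variables(item, values) for item in obj]
    if isinstance(obj, str):
        whole = VAR_PATTERN.fullmatch(obj)
        if whole:
            # The string is a single placeholder: keep the value's type.
            return values[whole.group(1)]
        # Placeholder embedded in a longer string: substitute as text.
        return VAR_PATTERN.sub(lambda match: str(values[match.group(1)]), obj)
    return obj


# Hypothetical, simplified stand-in for an exported template.
config = {
    "model_name_or_path": "<var:model_name>",
    "output_dir": "output/<var:table_name>",
    "trainer": {
        "use_lora": "<var:use_lora>",
        "num_train_epochs": "<var:num_train_epochs>",
    },
}

filled = fill_variables(
    config,
    {
        "model_name": "Qwen/Qwen2.5-0.5B",
        "table_name": "sample_llm_finetuning",
        "use_lora": True,
        "num_train_epochs": 3,
    },
)
print(filled)
```

In this sketch a placeholder that makes up the whole string keeps the type of its value, so `use_lora` stays a boolean and `num_train_epochs` stays an integer, which mirrors the `bool` and `int` entries declared under `types`.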

## Load the trained model
There are two methods to load a trained model:

- **Load the model directly**: This loads the model with the best metrics (if the transformers best-model saving strategy is enabled) or, otherwise, the latest version of the model.
- **Use a specified checkpoint**: This downloads the specified checkpoint, initializes the base model, and then merges the checkpoint into the base model. Because the base model is initialized separately, this approach supports custom operations during initialization, such as resetting flash attention or quantizing the model.

In [None]:
if APPLY:
    llm = db.load("model", "llm")

In [None]:
if APPLY:
    llm.predict(input_text, max_new_tokens=200)