# Fine-tune NuExtract-1.5-tiny for Resume Extraction

Fine-tunes `numind/NuExtract-1.5-tiny` (Qwen2.5-0.5B based) using Unsloth + QLoRA on a custom resume extraction dataset.

Runs on a **free** Tesla T4 Google Colab instance.

In [1]:
%%capture
!pip install -q unsloth
!pip install -q --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo

### Load NuExtract-1.5-tiny

NuExtract-1.5-tiny is a fine-tune of Qwen2.5-0.5B by NuMind for structured information extraction. It supports long documents and multiple languages.

In [2]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096
dtype = None  # Auto-detect
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="numind/NuExtract-1.5-tiny",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.1.4: Fast Qwen2 patching. Transformers: 4.57.6.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


numind/NuExtract-1.5-tiny does not have a padding token! Will use pad_token = <|vision_pad|>.


### Add LoRA adapters

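QLoRA trains only a small fraction of the parameters (typically 1-10%). With the settings below, 8,798,208 of 502,830,976 parameters (1.75%) are trainable.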

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Unsloth 2026.1.4 patched 24 layers with 24 QKV layers, 24 O layers and 24 MLP layers.
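
The trainable-parameter count that the training banner reports later (8,798,208) can be reproduced by hand. A LoRA adapter on a linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` weights. The sketch below assumes the published Qwen2.5-0.5B dimensions (hidden size 896, MLP size 4864, 2 key/value heads of dimension 64, 24 layers). These shapes are not stated anywhere in this notebook, so treat them as assumptions.

```python
# LoRA adds two matrices per adapted linear layer: A (r x d_in) and B (d_out x r).
R = 16
NUM_LAYERS = 24  # matches "patched 24 layers" in the log above

# Assumed Qwen2.5-0.5B shapes (taken from the base model's config, not from this notebook).
HIDDEN = 896
INTERMEDIATE = 4864
KV_DIM = 2 * 64  # 2 key/value heads x head dimension 64

# (d_in, d_out) for each target module in one transformer layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, r=R):
    """Number of weights a rank-r LoRA adapter adds to a d_in -> d_out linear layer."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out) for d_in, d_out in SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"{per_layer:,} LoRA parameters per layer")
print(f"{total:,} LoRA parameters in total")
print(f"{100 * total / 502_830_976:.2f}% of 502,830,976 parameters")
```

Doubling `r` doubles the adapter size, so `r=32` would train roughly 3.5% of the parameters under the same assumptions.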


### Prepare Training Data

Load the resume extraction dataset and format it using NuExtract's native prompt format:
```
<|input|>
### Template:
{template_json}
### Text:
{resume_text}
<|output|>
{filled_json}<|end-output|>
```

In [8]:
import json
from datasets import load_dataset

# Template matching your app/core/templates.py
RESUME_TEMPLATE = {
    "personal_information": {
        "name": "",
        "email": "",
        "phone": "",
        "location": "",
    },
    "skills": [],
    "work_experience": [
        {
            "employer": "",
            "job_title": "",
            "start_date": "",
            "end_date": "",
            "location": "",
        }
    ],
    "education": [
        {
            "degree": "",
            "institution": "",
            "graduation_year": "",
        }
    ],
}

TEMPLATE_STR = json.dumps(RESUME_TEMPLATE, indent=4)
EOS_TOKEN = "<|end-output|>"


def formatting_prompts_func(examples):
    texts = []
    for resume_text, output_json in zip(examples["text"], examples["output"]):
        # Format output JSON with indentation for readability
        formatted_output = json.dumps(json.loads(output_json), indent=4)

        prompt = (
            f"<|input|>\n"
            f"### Template:\n{TEMPLATE_STR}\n"
            f"### Text:\n{resume_text}\n\n"
            f"<|output|>\n"
            f"{formatted_output}{EOS_TOKEN}"
        )
        texts.append(prompt)
    return {"text": texts}


# Load the training data
# Upload data/train.jsonl to Colab or change path accordingly
dataset = load_dataset("json", data_files="data/train.jsonl", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)

print(f"Training examples: {len(dataset)}")
print(f"\n--- Sample prompt ---\n{dataset[0]['text'][:1000]}...")

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/70 [00:00<?, ? examples/s]

Training examples: 70

--- Sample prompt ---
<|input|>
### Template:
{
    "personal_information": {
        "name": "",
        "email": "",
        "phone": "",
        "location": ""
    },
    "skills": [],
    "work_experience": [
        {
            "employer": "",
            "job_title": "",
            "start_date": "",
            "end_date": "",
            "location": ""
        }
    ],
    "education": [
        {
            "degree": "",
            "institution": "",
            "graduation_year": ""
        }
    ]
}
### Text:
OLIVIA DUBOIS
olivia.dubois@orange.fr | +33 6 12 34 56 78 | Paris, France

EXPERIENCE PROFESSIONNELLE

Blockchain Developer — Ledger
Jan 2021 – Present | Paris, France
- Developed smart contracts for DeFi protocols using Solidity.
- Audited internal codebases for security vulnerabilities.
- Integrated hardware wallet support into Web3 dApps.

Full Stack Developer — Dassault Systèmes
Sep 2018 – Dec 2020 | Vélizy-Villacoublay, France
-
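
Because `formatting_prompts_func` calls `json.loads` on every `output` value, a single malformed record stops the `map` step. A small pre-flight check can surface such records first. The sketch below assumes each line of `data/train.jsonl` is an object with a string `text` field and a JSON-encoded string `output` field, as the loading cell implies, and that every output uses the four top-level keys of `RESUME_TEMPLATE`. `validate_records` is a hypothetical helper, not part of the notebook.

```python
import json

# Assumption: training outputs follow the top-level layout of RESUME_TEMPLATE.
EXPECTED_KEYS = {"personal_information", "skills", "work_experience", "education"}


def validate_records(lines):
    """Return (line_number, problem) tuples for malformed JSONL lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON line: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "line is not a JSON object"))
            continue
        text = record.get("text")
        if not isinstance(text, str) or not text.strip():
            problems.append((number, "missing or empty 'text'"))
        try:
            output = json.loads(record.get("output", ""))
        except (TypeError, json.JSONDecodeError):
            problems.append((number, "'output' is not a JSON string"))
            continue
        missing = EXPECTED_KEYS - set(output) if isinstance(output, dict) else EXPECTED_KEYS
        if missing:
            problems.append((number, f"output lacks keys: {sorted(missing)}"))
    return problems


# Two in-memory records: the first is well formed, the second has a broken output.
sample = [
    json.dumps({"text": "Jane Doe ...", "output": json.dumps({k: [] for k in EXPECTED_KEYS})}),
    json.dumps({"text": "John Roe ...", "output": "{not json"}),
]
print(validate_records(sample))
```

To check the real file, pass an open file handle instead of `sample`, for example `validate_records(open("data/train.jsonl", encoding="utf-8"))`.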

### Train

Uses `SFTTrainer` with up to 12 epochs plus early stopping, since the dataset is small (70 examples, split 59/11 into training and validation sets).
With 30+ examples, 3 epochs is usually enough. With ~10, use more epochs.
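
The step count the trainer reports follows from the batch settings. Here is a quick sketch of the arithmetic, using the 59-example training split and the values passed to `TrainingArguments` in the next cell. The rounding shown is an assumption about how a partial accumulation window is counted, chosen because it reproduces the logged `Total steps = 96`.

```python
import math

num_examples = 59      # size of the training split
per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
epochs = 12            # num_train_epochs

# Mini-batches seen per epoch; the last one holds a single example.
batches_per_epoch = math.ceil(num_examples / per_device_batch)

# Optimizer updates per epoch, counting the final partial accumulation window.
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)

total_steps = updates_per_epoch * epochs
effective_batch = per_device_batch * grad_accum

print(f"batches per epoch:  {batches_per_epoch}")
print(f"updates per epoch:  {updates_per_epoch}")
print(f"total steps:        {total_steps}")
print(f"effective batch:    {effective_batch}")
```

With 8 updates per epoch, a run that early stopping ends after 4 epochs performs only 32 of the 96 scheduled steps.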

In [20]:
from trl import SFTTrainer
from transformers import TrainingArguments, EarlyStoppingCallback
from unsloth import is_bfloat16_supported

split_dataset = dataset.train_test_split(test_size=0.15, seed=3407)
train_ds = split_dataset["train"]
eval_ds  = split_dataset["test"]

print(f"→ Training on {len(train_ds)} examples | Validating on {len(eval_ds)} examples")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,

    args=TrainingArguments(

        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=12,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,

        logging_steps=5,
        output_dir="outputs",
        report_to="none",

        eval_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,

    ),

    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=3,
            early_stopping_threshold=0.0005
        )
    ],
)

trainer_stats = trainer.train()

# Quick summary
print("Training finished.")
print(f"Best eval loss: {trainer.state.best_metric:.4f}")
print(f"Total epochs run: {trainer.state.epoch:.1f}")

→ Training on 59 examples | Validating on 11 examples


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 59 | Num Epochs = 12 | Total steps = 96
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 8,798,208 of 502,830,976 (1.75% trained)


Epoch,Training Loss,Validation Loss
1,0.2943,0.326408
2,0.2372,0.333588
3,0.2128,0.348774
4,0.1496,0.385741


Unsloth: Not an error, but Qwen2ForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


Training finished.
Best eval loss: 0.3264
Total epochs run: 4.0
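
The run stops after epoch 4 because validation loss never improves on its epoch-1 value. The rule can be sketched in plain Python: an evaluation counts as an improvement only if it beats the best loss so far by more than the threshold, and training stops once `patience` consecutive evaluations fail to improve. `epochs_until_stop` is a hypothetical helper that mimics this behaviour on a list of losses. It is a simplified illustration, not the `EarlyStoppingCallback` implementation.

```python
def epochs_until_stop(eval_losses, patience=3, threshold=0.0005):
    """Return (epoch at which training stops, best eval loss seen so far)."""
    best = None
    evals_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        # Lower is better; the gain must exceed the threshold to count.
        improved = best is None or (best - loss) > threshold
        if improved:
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
        best = loss if best is None else min(best, loss)
        if evals_without_improvement >= patience:
            return epoch, best
    return len(eval_losses), best


# Validation losses from the table above.
logged = [0.326408, 0.333588, 0.348774, 0.385741]
stop_epoch, best_loss = epochs_until_stop(logged)
print(f"stops after epoch {stop_epoch}, best eval loss {best_loss:.4f}")
```

Training loss keeps falling while validation loss rises from the first epoch onward, which suggests the model starts overfitting this dataset after roughly one epoch. `load_best_model_at_end=True` restores the epoch-1 checkpoint in that case.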


In [21]:
trainer_stats = trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 59 | Num Epochs = 12 | Total steps = 96
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 8,798,208 of 502,830,976 (1.75% trained)


Epoch,Training Loss,Validation Loss
1,0.2497,0.320913
2,0.214,0.345624
3,0.1912,0.363174
4,0.1213,0.406136


In [11]:
print(f"{trainer_stats.metrics['train_runtime']:.1f} seconds used for training.")
print(f"{trainer_stats.metrics['train_runtime']/60:.1f} minutes used for training.")
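# `train_loss` is the average training loss over the whole run, not the loss at the last step
print(f"Average training loss: {trainer_stats.metrics['train_loss']:.4f}")

156.7 seconds used for training.
2.6 minutes used for training.
Average training loss: 0.4586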


### Test Inference

Test with a sample resume to verify extraction quality.
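
Valid JSON is only part of extraction quality: every value should also come from the resume itself. One way to check this is to confirm that each extracted string occurs in the source text after light normalization. The sketch below is a rough heuristic under that assumption; `find_ungrounded_values` and its helpers are hypothetical names, and the inline `source` and `extracted` values are a shortened version of the test case used in the next cell.

```python
import re


def _normalize(value):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[\W_]+", "", value.lower())


def iter_leaf_strings(node, path=""):
    """Yield (path, value) for every non-empty string in nested dicts and lists."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_leaf_strings(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_leaf_strings(child, f"{path}[{index}]")
    elif isinstance(node, str) and node.strip():
        yield path, node


def find_ungrounded_values(extracted, source_text):
    """Return (path, value) pairs whose normalized value is absent from the source."""
    haystack = _normalize(source_text)
    return [
        (path, value)
        for path, value in iter_leaf_strings(extracted)
        if _normalize(value) not in haystack
    ]


source = "Email: shelton.cooper@caltech.edu\nPhone 626-555-0112\nPhD Physics - Caltech (2003)"
extracted = {
    "personal_information": {"email": "sheldon.cooper@caltech.edu", "phone": "626-555-0112"},
    "education": [
        {"degree": "PhD Physics", "graduation_year": "2003"},
        {"degree": "", "graduation_year": "1999"},
    ],
}
for path, value in find_ungrounded_values(extracted, source):
    print(f"not found in source: {path} = {value!r}")
```

Applied to `parsed` and `test_resume` from the cell below, this check flags two values in the recorded output: the email (the resume says `shelton.cooper@caltech.edu`, the model returns `sheldon.cooper@caltech.edu`) and the BS graduation year `1999`, which does not appear in the resume. Substring matching is deliberately crude, so it will miss values that are reformatted legitimately, such as normalized dates.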

In [25]:
FastLanguageModel.for_inference(model)

test_resume = """Sheldon Cooper
Dr. Sheldon L. Cooper

Pasadena CA
Email: shelton.cooper@caltech.edu
Phone 626-555-0112

Summary:
Theoretical physicist specializing in string theory. Holds multiple
advanced degrees and an IQ well above average (187).

Skills:
String theory, math, physics, quantum mechanics, lectures,
whiteboard usage, research

Experience
Caltech - Theoretical Physicist (2003-present)
Research in string theory & related areas
Teaching grad students occasionally
Publications available upon request

Education:
PhD Physics - Caltech (2003)
BS Physics / Math - University of Texas at Austin
"""

test_prompt = (
    f"<|input|>\n"
    f"### Template:\n{TEMPLATE_STR}\n"
    f"### Text:\n{test_resume}\n\n"
    f"<|output|>\n"
)

inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
result = tokenizer.decode(outputs[0], skip_special_tokens=False)

# Extract just the output part
output_text = result.split("<|output|>")[1].split("<|end-output|>")[0].strip()
print(output_text)

# Validate JSON
try:
    parsed = json.loads(output_text)
    print("\n--- Parsed successfully ---")
    print(json.dumps(parsed, indent=2))
except json.JSONDecodeError as e:
    print(f"\n--- JSON parse error: {e} ---")

{
    "personal_information": {
        "name": "Sheldon Cooper",
        "email": "sheldon.cooper@caltech.edu",
        "phone": "626-555-0112",
        "location": "Pasadena, CA"
    },
    "skills": [
        "String theory",
        "math",
        "physics",
        "quantum mechanics",
        "lectures",
        "whiteboard usage",
        "research"
    ],
    "work_experience": [
        {
            "employer": "Caltech",
            "job_title": "Theoretical Physicist",
            "start_date": "2003",
            "end_date": "present",
            "location": "Pasadena, CA"
        }
    ],
    "education": [
        {
            "degree": "PhD Physics",
            "institution": "Caltech",
            "graduation_year": "2003"
        },
        {
            "degree": "BS Physics / Math",
            "institution": "University of Texas at Austin",
            "graduation_year": "1999"
        }
    ]
}

--- Parsed successfully ---
{
  "personal_information": {
    "na

### Save LoRA Adapters

In [26]:
model.save_pretrained("nuextract-resume-lora")
tokenizer.save_pretrained("nuextract-resume-lora")

('nuextract-resume-lora/tokenizer_config.json',
 'nuextract-resume-lora/special_tokens_map.json',
 'nuextract-resume-lora/chat_template.jinja',
 'nuextract-resume-lora/vocab.json',
 'nuextract-resume-lora/merges.txt',
 'nuextract-resume-lora/added_tokens.json',
 'nuextract-resume-lora/tokenizer.json')

### Export to GGUF

Export the fine-tuned model to GGUF format for use with llama-cpp-python in the ResuMap API.

In [27]:
# Export to GGUF q8_0 (smaller, ~500MB)
model.save_pretrained_gguf("nuextract-resume-gguf-q8", tokenizer, quantization_method="q8_0")

Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `nuextract-resume-gguf-f16`: 100%|██████████| 1/1 [00:04<00:00,  4.20s/it]


Successfully copied all 1 files from cache to `nuextract-resume-gguf-f16`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 8160.12it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:04<00:00,  4.54s/it]


Unsloth: Merge process complete. Saved to `/content/nuextract-resume-gguf-f16`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages
Unsloth: Successfully installed llama.cpp!
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into f16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['NuExtract-1.5-tiny.F16.gguf'

Unsloth: Copying 1 files from cache to `nuextract-resume-gguf-q8`: 100%|██████████| 1/1 [00:03<00:00,  3.38s/it]


Successfully copied all 1 files from cache to `nuextract-resume-gguf-q8`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 8981.38it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:04<00:00,  4.64s/it]


Unsloth: Merge process complete. Saved to `/content/nuextract-resume-gguf-q8`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: llama.cpp found in the system. Skipping installation.
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into f16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['NuExtract-1.5-tiny.F16.gguf']
Unsloth: [2] Converting GGUF f16 into q8_0. This might take 10 minutes...
Unsloth: Model files cleanup...
Unsloth: All GGUF conversions completed successfully!
Generated files: ['NuExtract-1.5-tiny.Q8_0.gguf']
Unsloth: No Ollama template mapping found for model 'numind/NuExtract-1.5-tiny

{'save_directory': 'nuextract-resume-gguf-q8',
 'gguf_files': ['NuExtract-1.5-tiny.Q8_0.gguf'],
 'modelfile_location': None,
 'want_full_precision': False,
 'is_vlm': False,
 'fix_bos_token': False}

In [37]:
# Export to GGUF f16 (best quality, ~1GB)
model.save_pretrained_gguf("nuextract-resume-gguf-f16", tokenizer, quantization_method="f16")

Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `nuextract-resume-gguf-f16`: 100%|██████████| 1/1 [00:17<00:00, 17.49s/it]


Successfully copied all 1 files from cache to `nuextract-resume-gguf-f16`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 5223.29it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:20<00:00, 20.65s/it]


Unsloth: Merge process complete. Saved to `/content/nuextract-resume-gguf-f16`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: llama.cpp found in the system. Skipping installation.
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into f16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['NuExtract-1.5-tiny.F16.gguf']
Unsloth: Model files cleanup...
Unsloth: All GGUF conversions completed successfully!
Generated files: ['NuExtract-1.5-tiny.F16.gguf']
Unsloth: No Ollama template mapping found for model 'numind/NuExtract-1.5-tiny'. Skipping Ollama Modelfile
Unsloth: example usage for text only LLMs: lla

{'save_directory': 'nuextract-resume-gguf-f16',
 'gguf_files': ['NuExtract-1.5-tiny.F16.gguf'],
 'modelfile_location': None,
 'want_full_precision': True,
 'is_vlm': False,
 'fix_bos_token': False}
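
Before deploying, the exported file can be smoke-tested with llama-cpp-python. This is a minimal sketch, assuming `llama-cpp-python` is installed and that the prompt layout matches the one used for training. `build_prompt`, `load_model` and `extract` are hypothetical helpers; how ResuMap actually loads the model is not shown in this notebook, and whether markers such as `<|input|>` are tokenized as special tokens may depend on the llama-cpp-python version.

```python
import json

EOS_TOKEN = "<|end-output|>"


def build_prompt(template, text):
    """Reproduce the prompt layout used for training and in the test cell."""
    template_str = json.dumps(template, indent=4)
    return f"<|input|>\n### Template:\n{template_str}\n### Text:\n{text}\n\n<|output|>\n"


def load_model(model_path, n_ctx=4096):
    """Load a GGUF file; n_ctx mirrors max_seq_length from training."""
    # Imported lazily so build_prompt works without llama-cpp-python installed.
    from llama_cpp import Llama

    return Llama(model_path=model_path, n_ctx=n_ctx, verbose=False)


def extract(llm, template, text, max_tokens=512):
    """Run one extraction and parse the generated JSON."""
    completion = llm.create_completion(
        build_prompt(template, text),
        max_tokens=max_tokens,
        temperature=0.0,   # greedy decoding for repeatable output
        stop=[EOS_TOKEN],  # stop at the end-of-output marker
    )
    return json.loads(completion["choices"][0]["text"].strip())
```

A typical call would be `extract(load_model("nuextract-resume-gguf-q8/NuExtract-1.5-tiny.Q8_0.gguf"), RESUME_TEMPLATE, test_resume)`. Loading the model once and reusing the `llm` object avoids paying the load cost on every request.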

### Deploy to ResuMap

After exporting, download the Q8_0 GGUF file from Colab and add it to your ResuMap project:

```bash
# In Colab, the exported file is at:
# nuextract-resume-gguf-q8/NuExtract-1.5-tiny.Q8_0.gguf

# Download it to your local machine, then:
cp /path/to/NuExtract-1.5-tiny.Q8_0.gguf models/NuExtract-1.5-tiny.Q8_0.gguf

# Add to Git LFS and commit (assumes *.gguf is already tracked, e.g. via: git lfs track "*.gguf")
git add models/*.gguf
git commit -m "Add fine-tuned NuExtract Q8_0 GGUF model via LFS"
git push
```

The model filename in `app/config.py` is already configured to use `NuExtract-1.5-tiny.Q8_0.gguf`.

Then restart your API server. No code changes are needed.

In [35]:
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(
    repo_id="",
    repo_type="model",
    exist_ok=True,
)


RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id='')