redspottedbittern/formIE

VLM fine-tuning for Document IE

Built with vllm, unsloth, and hydra.

⭐ Fine-tuning VLMs on synthetic data for Visual Form Document Information Extraction. ⭐

This repository was created for a project that evaluated the performance gain from fine-tuning with mostly synthetic data. You can now use it to run your own fine-tuning experiments. Here is how:

1. ⚙️ Setup

The code uses unsloth and vllm, whose dependencies can conflict. It is recommended to use conda and uv and to install the pinned package versions as shown below. Optionally, use setup_symlinks.sh to create symlinks for the models/, outputs/, and .cache/ directories that point to a separate storage directory.

# 1. Setup
conda env create -f conda_vlm.yml
uv pip install -r requirements.lock

# 2. (Optional) configure storage
bash setup_symlinks.sh /path/to/storage

# 3. Run full pipeline
python main.py \
    train.model_for_training='unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit'

2. 🚀 Pipeline

The main entry point for the pipeline is main.py. It runs the whole pipeline of training, inference, and evaluation automatically. Start it with the required argument train.model_for_training:

python main.py \
    train.model_for_training='unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit'

The configuration is managed with hydra. To change settings for a run, override them on the command line or edit the config files. Each pipeline stage has its own config namespace (train, predict, evaluate), so every override passed to main.py must be prefixed with the namespace of the stage it applies to:

python main.py \
    +train.dataset=bigdataset \
    +predict.dataset=smalldataset
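To illustrate how the stage prefixes keep settings apart, here is a minimal sketch that parses dotted overrides into a nested dictionary. It is not how hydra is implemented, and the function name parse_overrides is hypothetical; it only shows that train. and predict. keys end up in separate namespaces.

```python
def parse_overrides(overrides):
    """Turn dotted key=value overrides into a nested dict (illustration only)."""
    config = {}
    for item in overrides:
        # In hydra syntax a leading "+" adds a new key; this sketch just strips it.
        key, value = item.lstrip("+").split("=", 1)
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Walk down (or create) one nested dict per namespace level.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


config = parse_overrides(["+train.dataset=bigdataset", "+predict.dataset=smalldataset"])
print(config)
# {'train': {'dataset': 'bigdataset'}, 'predict': {'dataset': 'smalldataset'}}
```

Because both stages have a dataset key, an unprefixed dataset=... passed to main.py would be ambiguous, which is why the prefix is needed there.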

Each pipeline step can be run alone by calling the corresponding file in src/.

2.1 📦 Outputs

models/     # trained models
outputs/    # logs, predictions, and evaluation results
.cache/     # intermediate data for unsloth

2.2 🏋️ Train

python src/train.py \
    model_for_training='models/<model_dir>' \
    dataset=dataset_name

For training, select a model and a dataset. Take care to select an unsloth hyperparameter set that fits your experiment and model.

2.3 🔮 Predict

python src/predict.py \
    model_for_inference='models/<model_dir>' \
    dataset=dataset_name

For inference, select a local model or a Hugging Face model and a dataset. Take care to select a vllm config that fits the model. The optional argument stop=foo ends inference after the first foo samples of the dataset.

2.4 🧹 Postprocess

python src/postprocess.py \
    predictions_to_postprocess='outputs/<model_dir>/predictions.json'

Postprocessing the predictions is optional. At the moment the Postprocessor module is empty, so implement your own postprocessor to use this step.
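As a starting point, a postprocessor could look like the minimal sketch below. Everything in it is an assumption: the repository does not document the interface of Postprocessor or the layout of predictions.json, so the process method and the record keys field_type and prediction are hypothetical.

```python
class Postprocessor:
    """Hypothetical postprocessor; the repository's real interface may differ."""

    # Assumed spellings a model might use for a ticked checkbox.
    CHECKBOX_TRUE = {"true", "yes", "checked", "x", "1"}

    def process(self, record):
        # Assumed record layout: {"field_type": ..., "prediction": ...}
        # Collapse runs of whitespace and trim the ends.
        cleaned = " ".join(str(record["prediction"]).split())
        if record["field_type"] == "Checkbox":
            # Map free-form checkbox answers onto two canonical strings.
            cleaned = "True" if cleaned.lower() in self.CHECKBOX_TRUE else "False"
        return {**record, "prediction": cleaned}


records = [
    {"field_type": "Freetext", "prediction": "  Jane   Doe \n"},
    {"field_type": "Checkbox", "prediction": "Yes"},
]
processed = [Postprocessor().process(r) for r in records]
```

Returning a new dictionary instead of mutating the input keeps the raw predictions available for comparison.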

2.5 📊 Eval

python src/evaluate.py \
    predictions_to_evaluate='outputs/<model_dir>/predictions.json' \
    skip_chatgpt=False

The evaluation step calculates metrics separately for every field type. For Freetext fields, a Likert score is calculated with ChatGPT acting as an LLM-as-a-judge. This requires a .chatgpt-key file in the repository root. Use the option skip_chatgpt=True to skip the Likert score.
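The idea of scoring each field type separately can be sketched as follows. This uses exact-match accuracy as a stand-in metric; the metrics the repository actually computes are not listed here, and the function name accuracy_per_type and the prediction key are assumptions.

```python
from collections import defaultdict


def accuracy_per_type(rows):
    """Exact-match accuracy grouped by field_type (stand-in metric, illustration only)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row["field_type"]] += 1
        # Count a hit when prediction and ground truth agree after trimming whitespace.
        if row["prediction"].strip() == row["groundtruth"].strip():
            hits[row["field_type"]] += 1
    return {field_type: hits[field_type] / totals[field_type] for field_type in totals}


rows = [
    {"field_type": "Checkbox", "groundtruth": "True", "prediction": "True"},
    {"field_type": "Checkbox", "groundtruth": "False", "prediction": "True"},
    {"field_type": "Combfield", "groundtruth": "12345", "prediction": "12345"},
]
scores = accuracy_per_type(rows)
print(scores)
# {'Checkbox': 0.5, 'Combfield': 1.0}
```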

2.6 📈 Comparing different runs

The script compare_runs.py is not part of the pipeline. It creates plots that compare more than one model. Pass it a list of the results_per_sample.csv files created during evaluation, one per run:

python compare_runs.py \
    results_to_compare='[outputs/<run_dir>/results/results_per_sample.csv, outputs/<run_dir>/results/results_per_sample.csv]'
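With many runs, typing the list by hand gets tedious. The sketch below builds the argument from every run directory found under a root folder. The helper name build_compare_argument is hypothetical, and it assumes the outputs/<run_dir>/results/results_per_sample.csv layout shown above and paths without commas.

```python
from pathlib import Path


def build_compare_argument(output_root):
    """Collect all results_per_sample.csv files and format them as a bracketed list."""
    csv_files = sorted(Path(output_root).glob("*/results/results_per_sample.csv"))
    return "[" + ", ".join(path.as_posix() for path in csv_files) + "]"


# Prints "[]" when no evaluated runs exist under outputs/.
print(build_compare_argument("outputs"))
```

The returned string can be pasted after results_to_compare= on the command line, wrapped in single quotes as in the command above.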

3. 🗂️ Dataset requirements

This project was originally built with a proprietary dataset. In order to run the whole pipeline, your dataset must conform to certain criteria.

The pipeline expects a Hugging Face dataset with these columns:

| Column | Type | Description |
|---|---|---|
| image | array | PIL image of the PDF page |
| file | string | the name of the file |
| field_id | string | a unique id for the current field |
| groundtruth | string | the value of the field |
| field_type | string | which type the field is |

Additionally, every field must have one of three types: Freetext, Combfield, or Checkbox. The types are evaluated separately during evaluation.
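Before converting your data to a Hugging Face dataset, you may want to check each row against these requirements. The following sketch does that for plain dictionaries. The function name validate_row and the sample values are hypothetical; the column names and field types are the ones listed above.

```python
REQUIRED_COLUMNS = {"image", "file", "field_id", "groundtruth", "field_type"}
ALLOWED_FIELD_TYPES = {"Freetext", "Combfield", "Checkbox"}


def validate_row(row):
    """Return a list of problems found in one dataset row (empty if the row is valid)."""
    problems = [f"missing column: {name}" for name in sorted(REQUIRED_COLUMNS - set(row))]
    field_type = row.get("field_type")
    if field_type is not None and field_type not in ALLOWED_FIELD_TYPES:
        problems.append(f"unknown field_type: {field_type}")
    return problems


# Sample rows with made-up values; image is left as None to keep the sketch dependency-free.
good_row = {"image": None, "file": "form_001.pdf", "field_id": "name",
            "groundtruth": "Jane Doe", "field_type": "Freetext"}
bad_row = {"file": "form_002.pdf", "field_id": "zip",
           "groundtruth": "12345", "field_type": "Number"}

print(validate_row(good_row))  # []
print(validate_row(bad_row))   # ['missing column: image', 'unknown field_type: Number']
```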

You can add a JSON file resources/name_mapping.json that maps each field_id value in the dataset to a more descriptive name for the field, which improves the extraction capabilities of the model. To mark files as machine-written, add a file name pattern in config/file_names/base.yaml. This pattern is used during evaluation.
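A name mapping could look like the sketch below. It assumes a flat JSON object from field_id to description, which the repository does not spell out; the field ids, the descriptions, and the helper describe_field are made up for illustration.

```python
import json

# Hypothetical field ids and descriptions; replace them with those of your dataset.
name_mapping = {
    "f_017": "Date of birth of the applicant",
    "f_018": "Checkbox: applicant is self-employed",
}

# This is the text that would be stored in resources/name_mapping.json.
mapping_json = json.dumps(name_mapping, indent=2)
loaded = json.loads(mapping_json)


def describe_field(field_id, mapping):
    """Return the descriptive name if one exists, otherwise fall back to the raw id."""
    return mapping.get(field_id, field_id)


print(describe_field("f_017", loaded))  # Date of birth of the applicant
print(describe_field("f_999", loaded))  # f_999
```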
