InferLength is an installable Python package for static output-length prediction on local LLM inference workloads. It leverages on-the-fly activations and token entropy to predict output length accurately at negligible cost, as introduced in *Predicting LLM Output Length via Entropy-Guided Representations*.
The package supports:
- training on the ForeLen benchmark, custom prompt data, or merged datasets
- length prediction from a saved model artifact on disk, without retraining
- a generic decorator that can attach predictions and local runtime logging to functions
The paper's Progressive Length Prediction method is out of scope for this repository; the implementation here covers static EGTP only.
Requirements:

- Python 3.9+
- PyTorch 2.4+
Install directly from the public repo:

```bash
pip install git+https://github.com/vastava/InferLength.git
```

For local development:

```bash
git clone https://github.com/vastava/InferLength.git
cd InferLength
pip install -e .
```

Fetch and normalize a ForeLen config:
```bash
egtp dataset fetch-forelen \
  --config-name qwen2.5-0.5b-longseq \
  --cache-dir .cache/forelen \
  --output outputs/forelen_qwen_longseq.jsonl
```

Normalize proprietary data:
```bash
egtp dataset normalize \
  --custom-file my_data.csv \
  --prompt-column prompt \
  --response-column response \
  --model-id /models/my-local-llm \
  --output outputs/my_data.jsonl
```

Train a predictor from ForeLen:
```bash
egtp train \
  --data-mode forelen \
  --forelen-config qwen2.5-0.5b-longseq \
  --cache-dir .cache/forelen \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --artifact-dir outputs/qwen-longseq-artifact
```

Train a predictor from proprietary data:
```bash
egtp train \
  --data-mode custom \
  --model-id /models/my-local-llm \
  --normalized-data outputs/my_data.jsonl \
  --artifact-dir outputs/my-artifact
```

Predict from a saved artifact:
```bash
egtp predict \
  --artifact-dir outputs/my-artifact \
  --prompt "Summarize the attached report in five bullets."
```

Use the Python API directly:

```python
from egtp import EGTPDataset, EGTPTrainer, TrainConfig, TrainedPredictor, with_length_prediction

dataset = EGTPDataset.load_custom(
    "my_data.csv",
    column_map={"prompt_text": "prompt", "response_text": "response"},
    model_id_or_path="/models/my-local-llm",
)

trainer = EGTPTrainer(device="auto")
predictor = trainer.fit(
    dataset=dataset,
    model_id_or_path="/models/my-local-llm",
    train_config=TrainConfig(epochs=20, patience=3),
)
predictor.save("artifacts/my-local-llm")

loaded = TrainedPredictor.load("artifacts/my-local-llm")

@with_length_prediction(
    predictor=loaded,
    prompt_extractor=lambda args, kwargs: kwargs["prompt"],
    response_extractor=lambda result: result["text"],
    log_path="outputs/runtime_log.jsonl",
)
def run_inference(*, prompt: str):
    return {"text": "hello world"}
```

Notes:

- ForeLen is downloaded on demand from Hugging Face and is not bundled into the wheel.
- Custom rows must provide prompt text and either response text or numeric target length.
- Target length is defined as output token count under the training model tokenizer.
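The target-length rule above can be sketched in a few lines. This is an illustrative sketch, not the package's implementation: the field names (`prompt`, `response`, `target_length`) are assumptions, and the whitespace tokenizer stands in for the training model's real tokenizer so the example is self-contained.

```python
# Sketch of the target-length rule: a row must carry prompt text plus
# either a numeric target length or response text to derive one from.
# NOTE: the default whitespace tokenizer is a stand-in; the real pipeline
# counts tokens under the training model's tokenizer.

def target_length(row, tokenize=lambda text: text.split()):
    """Return the training target for one normalized row."""
    if not row.get("prompt"):
        raise ValueError("row must provide prompt text")
    if row.get("target_length") is not None:
        return int(row["target_length"])       # explicit numeric target wins
    if row.get("response"):
        return len(tokenize(row["response"]))  # else: token count of the response
    raise ValueError("row needs response text or a numeric target_length")

print(target_length({"prompt": "p", "target_length": 128}))                 # 128
print(target_length({"prompt": "p", "response": "four short words here"}))  # 4
```

Rows that provide both fields use the explicit numeric target, since it may have been computed under the exact training tokenizer.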
Examples:

- Quickstart: custom data, train, save, load, predict.
- Qwen Evaluation: the main realistic evaluation path with ForeLen plus MAE/RMSE.
- Runtime Integration: decorator usage, callbacks, and JSONL logging.
- Llama Evaluation: optional gated-model path for users with Hugging Face Llama access.
If you use this package, please cite the paper:

```bibtex
@inproceedings{xie2026predicting,
  title={Predicting LLM Output Length via Entropy-Guided Representations},
  author={Xie, Huanyi and Chen, Yubin and Wang, Liangyu and Hu, Lijie and Wang, Di},
  booktitle={International Conference on Learning Representations},
  year={2026}
}
```