mLoRA (a.k.a. Multi-LoRA Fine-Tune) is an open-source framework designed for efficient fine-tuning of multiple Large Language Models (LLMs) using LoRA and its variants. Key features of mLoRA include:
- Concurrent fine-tuning of multiple LoRA adapters.
- Shared base model among multiple LoRA adapters.
- Support for multiple LoRA variant algorithms and various base models.
- Exclusive Mo-LoRA (Mixture of LoRAs) optimization for MixLoRA and its variants.
You can try m-LoRA with Google Colab before local installation.
This is an actively developed fork of the official m-LoRA repository, focusing on PEFT algorithms and related improvements, maintained by the authors of m-LoRA. Please note that the functions, interfaces, and performance of this fork differ slightly from the original m-LoRA, and we cannot guarantee compatibility. For production use, please prefer the original m-LoRA.
| OS | Backend | Model Precision | Quantization | Flash Attention |
|---|---|---|---|---|
| Linux | CUDA | FP32, FP16, TF32, BF16 | 8-bit and 4-bit | ✓ |
| Windows | CUDA | FP32, FP16, TF32, BF16 | 8-bit and 4-bit | - |
| macOS | MPS | FP32, FP16, BF16 | ✗ | ✗ |
| All | CPU | FP32, FP16, BF16 | ✗ | ✗ |
You can use the `MLORA_BACKEND_TYPE` environment variable to force m-LoRA to use a specific backend. For example, if you want m-LoRA to run only on CPU, set `MLORA_BACKEND_TYPE=CPU` before importing `mlora`, as shown below.
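A minimal shell session forcing the CPU backend might look like this (the variable only needs to be exported before the Python process imports `mlora`; the `--help` invocation is a placeholder for your actual command):

# Force m-LoRA to run on the CPU backend
export MLORA_BACKEND_TYPE=CPU
python mlora.py --help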
| | Model | # Parameters |
|---|---|---|
| ✓ | LLaMA 1/2/3 | 7B/8B/13B/70B |
| ✓ | TinyLLaMA | 1.1B |
| ✓ | Qwen 1.5/2 | 1.5B/4B/7B/57B/72B |
| ✓ | Gemma | 2B/7B |
| ✓ | Mistral | 7B |
| ✓ | Phi 2 | 2.7B |
| ✓ | ChatGLM 1/2/3/4 | 6B |
| | LoRA Variants | Arguments* |
|---|---|---|
| ✓ | QLoRA | See *Quantize Methods* |
| ✓ | LoRA+ | `loraplus_lr_ratio: 20.0` |
| ✓ | DoRA | `use_dora: true` |
| ✓ | rsLoRA | `use_rslora: true` |
| ✓ | MixLoRA | See *MixLoRA* |

*: Arguments in the configuration file; see the sketch below for an example.
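For illustration, an adapter entry in the JSON configuration file might carry these arguments. Field names other than the documented arguments above (e.g. `name`) are assumptions; see the files in the `template` folder for authoritative examples:

{
  "name": "lora_plus_0",
  "loraplus_lr_ratio": 20.0,
  "use_rslora": true
}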
| | Attention Methods | Name | Arguments* |
|---|---|---|---|
| ✓ | Scaled Dot Product | "eager" | `--attn_impl eager` |
| ✓ | Flash Attention 2 | "flash_attn" | `--attn_impl flash_attn` |

*: Arguments of `mlora.py`
m-LoRA supports only scaled dot-product attention ("eager") out of the box. Flash attention requires manual installation of the following additional dependencies:
pip3 install ninja
pip3 install flash-attn==2.5.8 --no-build-isolation
If no attention method is specified, flash attention is used when available.
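For example, to select flash attention explicitly (the base model and configuration file here are taken from the fine-tuning example below):

python mlora.py \
    --base_model meta-llama/Llama-2-7b-hf \
    --config ./config/alpaca.json \
    --attn_impl flash_attn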
| | Quantize Methods | Arguments* |
|---|---|---|
| ✓ | Full Precision (FP32) | by default |
| ✓ | Tensor Float 32 | `--tf32` |
| ✓ | Half Precision (FP16) | `--fp16` |
| ✓ | Brain Float 16 | `--bf16` |
| ✓ | 8-bit Quantization | `--load_8bit` |
| ✓ | 4-bit Quantization | `--load_4bit` |

*: Arguments of `mlora.py`
m-LoRA supports various model precisions and quantization methods. By default, m-LoRA uses full precision (Float32), but you can opt for half precision (Float16) with `--fp16` or BrainFloat16 with `--bf16`. Half precision halves the model size, and for further reduction, quantization methods can be employed.
Quantization can be activated using `--load_4bit` for 4-bit quantization or `--load_8bit` for 8-bit quantization. However, when only quantization is enabled, m-LoRA utilizes Float32 for calculations. To achieve memory savings during training, combine quantization with a half-precision mode, as shown below.
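For example, combining 4-bit quantization of the base model with BrainFloat16 computation (the base model and configuration file are illustrative):

# load the base model in 4-bit and compute in BF16 to save memory
python mlora.py \
    --base_model meta-llama/Llama-2-7b-hf \
    --config ./config/alpaca.json \
    --load_4bit \
    --bf16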
To enable quantization support, please manually install `bitsandbytes`:
pip3 install bitsandbytes==0.43.1
Note that, regardless of these settings, LoRA weights are always calculated and stored at full precision: m-LoRA mandates full-precision calculations wherever accuracy is imperative.
For users with NVIDIA Ampere or newer GPU architectures, the `--tf32` option can be used to accelerate full-precision calculations.
- Quantization of Qwen2 has no effect (same as with `transformers`).
- Applying quantization with DoRA results in higher memory and computation costs (same as with PEFT).
Please refer to the m-LoRA Install Guide.
You can conveniently use m-LoRA via `launch.py`. The following example demonstrates a streamlined approach to training a dummy model with m-LoRA.
# Generating configuration
python launch.py gen --template lora --tasks ./data/dummy_data.json
# Running the training task
python launch.py run --base_model TinyLlama/TinyLlama_v1.1
# Try with gradio web ui
python inference.py \
--base_model TinyLlama/TinyLlama_v1.1 \
--template ./template/alpaca.json \
--lora_weights ./casual_0
For further detailed usage information, please refer to the `help` command:
python launch.py help
The `mlora.py` script is a starting point for finetuning on various datasets.
Basic command for finetuning a baseline model on the Alpaca Cleaned dataset:
python mlora.py \
--base_model meta-llama/Llama-2-7b-hf \
--config ./config/alpaca.json \
--bf16
You can check the template finetune configurations in the `template` folder.
For further detailed usage information, please use the `--help` option:
python mlora.py --help
First, ensure that you have installed Docker Engine and the NVIDIA Container Toolkit correctly.
After that, you can launch the container using the following typical command:
docker run --gpus all -it --rm mikecovlee/mlora
You can check all available tags at mikecovlee/mlora/tags.
Please note that this container only provides a proper environment to run m-LoRA; the m-LoRA source code is not included.
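If you want to run your own checkout of m-LoRA inside the container, you can mount it as a volume (the mount point `/mlora` below is an arbitrary choice, not something provided by the image):

# mount the current directory into the container and start a shell there
docker run --gpus all -it --rm -v "$(pwd)":/mlora -w /mlora mikecovlee/mlora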
Copyright © 2023-2024 KDDE Lab, Sichuan University
This project is licensed under the Apache 2.0 License.