This repository has been archived by the owner on Jul 13, 2024. It is now read-only.

scukdde-llm/mLoRA

 
 

m-LoRA: An Efficient "Factory" to Build Multiple LoRA Adapters

mLoRA (a.k.a. Multi-LoRA Fine-Tune) is an open-source framework for efficiently fine-tuning multiple Large Language Models (LLMs) using LoRA and its variants. Key features of mLoRA include:

  • Concurrent fine-tuning of multiple LoRA adapters.

  • Shared base model among multiple LoRA adapters.

  • Support for multiple LoRA variant algorithms and various base models.

  • Exclusive Mo-LoRA (Mixture of LoRAs) optimization for MixLoRA and its variants.

You can try m-LoRA with Google Colab before local installation.

Note from the maintainer of this repository

This is an actively developed fork of the official m-LoRA repository, maintained by the authors of m-LoRA and focused on PEFT algorithms and related improvements. Please note that the functions, interfaces, and performance of this fork differ slightly from the original m-LoRA, and we cannot guarantee compatibility. For production use, please prefer the original m-LoRA.

Supported Platform

| OS | Backend | Model Precision | Quantization | Flash Attention |
|---|---|---|---|---|
| Linux | CUDA | FP32, FP16, TF32, BF16 | 8bit and 4bit | |
| Windows | CUDA | FP32, FP16, TF32, BF16 | 8bit and 4bit | - |
| macOS | MPS | FP32, FP16, BF16 | | |
| All | CPU | FP32, FP16, BF16 | | |

You can use the MLORA_BACKEND_TYPE environment variable to force m-LoRA to use a specific backend. For example, if you want m-LoRA to run only on CPU, you can set MLORA_BACKEND_TYPE=CPU before importing mlora.
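As a minimal sketch of the ordering requirement above, the variable can also be set from Python itself, provided this happens before the first `import mlora`. The helper name `force_backend` is hypothetical, and the import is left commented out so the snippet runs without m-LoRA installed.

```python
import os


def force_backend(backend: str) -> str:
    """Set MLORA_BACKEND_TYPE; call this before `import mlora`."""
    os.environ["MLORA_BACKEND_TYPE"] = backend
    return os.environ["MLORA_BACKEND_TYPE"]


# Request the CPU backend, as in the example above.
force_backend("CPU")

# import mlora  # import only after the variable has been set
```

Setting the variable in the shell before launching Python has the same effect and avoids any dependence on import order inside the script.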

Supported Pre-trained Models

| Model | # Parameters |
|---|---|
| LLaMA 1/2/3 | 7B/8B/13B/70B |
| TinyLLaMA | 1.1B |
| Qwen 1.5/2 | 1.5B/4B/7B/57B/72B |
| Gemma | 2B/7B |
| Mistral | 7B |
| Phi 2 | 2.7B |
| ChatGLM 1/2/3/4 | 6B |

Supported LoRA Variants

| LoRA Variants | Arguments* |
|---|---|
| QLoRA | See Quantize Methods |
| LoRA+ | loraplus_lr_ratio: 20.0 |
| DoRA | use_dora: true |
| rsLoRA | use_rslora: true |
| MixLoRA | See MixLoRA |

*: Arguments of configuration file
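For illustration, the arguments in the table could appear in a configuration file roughly as follows. This is only a sketch: the keys `loraplus_lr_ratio`, `use_dora`, and `use_rslora` come from the table, while the `name` field and the one-object-per-adapter layout are assumptions. Check the files in the template folder for the actual schema.

```python
import json

# Hypothetical adapter entries; only the variant keys are documented above.
lora_plus = {"name": "lora_plus_0", "loraplus_lr_ratio": 20.0}
dora = {"name": "dora_0", "use_dora": True}
rslora = {"name": "rslora_0", "use_rslora": True}

# Python's True serializes to JSON `true`, matching the table's notation.
fragment = json.dumps([lora_plus, dora, rslora], indent=2)
print(fragment)
```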

Supported Attention Methods

| Attention Methods | Name | Arguments* |
|---|---|---|
| Scaled Dot Product | "eager" | --attn_impl eager |
| Flash Attention 2 | "flash_attn" | --attn_impl flash_attn |

*: Arguments of mlora.py

Out of the box, m-LoRA supports only scaled dot-product attention (eager). Flash attention requires manually installing the following dependencies:

pip3 install ninja
pip3 install flash-attn==2.5.8 --no-build-isolation

If no attention method is specified, flash attention is used when it is available.
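The selection rule described here can be sketched as below. This illustrates the described behavior and is not m-LoRA's actual code: the function names are hypothetical, and checking that the `flash_attn` package is importable is used as a simple proxy for "available" (a real check would likely also consider the GPU and backend).

```python
import importlib.util
from typing import Optional


def flash_attn_available() -> bool:
    """Return True if the flash_attn package can be found for import."""
    return importlib.util.find_spec("flash_attn") is not None


def pick_attn_impl(requested: Optional[str] = None) -> str:
    """Honor an explicit choice; otherwise prefer flash attention."""
    if requested is not None:
        if requested not in ("eager", "flash_attn"):
            raise ValueError(f"unknown attention method: {requested}")
        return requested
    return "flash_attn" if flash_attn_available() else "eager"


print(pick_attn_impl())         # result depends on the local environment
print(pick_attn_impl("eager"))  # an explicit choice is returned unchanged
```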

Supported Quantize Methods

| Quantize Methods | Arguments* |
|---|---|
| Full Precision (FP32) | by default |
| Tensor Float 32 | --tf32 |
| Half Precision (FP16) | --fp16 |
| Brain Float 16 | --bf16 |
| 8bit Quantize | --load_8bit |
| 4bit Quantize | --load_4bit |

*: Arguments of mlora.py

m-LoRA supports several model precisions and quantization methods. By default, m-LoRA uses full precision (Float32); users can opt for half precision (Float16) with --fp16 or BrainFloat16 with --bf16. Half precision halves the model size, and quantization reduces it further.

Quantization can be activated using --load_4bit for 4-bit quantization or --load_8bit for 8-bit quantization. However, when only quantization is enabled, m-LoRA still performs calculations in Float32. To save memory during training, combine quantization with a half-precision mode.
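To make the size reductions concrete, the following back-of-the-envelope sketch estimates the storage needed for the base-model weights alone in each format. It is an approximation under stated assumptions: it ignores quantization metadata, activations, optimizer state, and the LoRA weights themselves, and the names `BYTES_PER_PARAM` and `weight_gib` are illustrative.

```python
# Bytes per weight for each storage format discussed above.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "8bit": 1.0,
    "4bit": 0.5,
}


def weight_gib(num_params: float, fmt: str) -> float:
    """Approximate base-model weight storage in GiB (weights only)."""
    return num_params * BYTES_PER_PARAM[fmt] / 2**30


# Example: a model with 7 billion parameters.
for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: {weight_gib(7e9, fmt):.1f} GiB")
```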

To enable quantization support, please manually install bitsandbytes:

pip3 install bitsandbytes==0.43.1

Regardless of these settings, LoRA weights are always calculated and stored at full precision. Where accuracy is critical, the m-LoRA framework likewise performs calculations at full precision.

On NVIDIA Ampere or newer GPU architectures, the --tf32 option can be used to accelerate full-precision calculations.
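For readers unfamiliar with TF32, the sketch below shows how it is typically enabled in PyTorch on a capable GPU. Whether --tf32 sets exactly these flags inside m-LoRA is an assumption; the helper `supports_tf32` is hypothetical, and the PyTorch import is guarded so the snippet also runs where PyTorch is not installed.

```python
def supports_tf32(compute_capability_major: int) -> bool:
    """Ampere GPUs report compute capability 8.x; TF32 needs 8 or higher."""
    return compute_capability_major >= 8


try:
    import torch
except ImportError:  # keep the sketch runnable without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    major, _minor = torch.cuda.get_device_capability()
    if supports_tf32(major):
        # Allow FP32 matmuls and cuDNN convolutions to use TF32 tensor cores.
        torch.backends.cuda.matmul.allow_tf32 = True
        torch.backends.cudnn.allow_tf32 = True
```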

Known issues

  • Quantization with Qwen2 has no effect (same as with transformers).
  • Applying quantization with DoRA results in higher memory and computation cost (same as with PEFT).

Installation

Please refer to m-LoRA Install Guide.

Quickstart

The simplest way to use m-LoRA is through launch.py. The following example trains a dummy model with m-LoRA.

# Generating configuration
python launch.py gen --template lora --tasks ./data/dummy_data.json
# Running the training task
python launch.py run --base_model TinyLlama/TinyLlama_v1.1
# Try the result in the Gradio web UI
python inference.py \
  --base_model TinyLlama/TinyLlama_v1.1 \
  --template ./template/alpaca.json \
  --lora_weights ./casual_0

For further detailed usage information, please refer to the help command:

python launch.py help

m-LoRA

The mlora.py script is a starting point for fine-tuning on various datasets. The basic command for fine-tuning a baseline model on the Alpaca Cleaned dataset is:

python mlora.py \
  --base_model meta-llama/Llama-2-7b-hf \
  --config ./config/alpaca.json \
  --bf16

You can find template fine-tuning configurations in the template folder.

For more detailed usage information, use the --help option:

python mlora.py --help
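When launching several runs from a script, the command line can be assembled programmatically. The sketch below uses only flags listed in this README (--base_model, --config, --tf32, --fp16, --bf16, --load_8bit, --load_4bit, --attn_impl); the helper `build_mlora_command` is hypothetical, and limiting each call to one precision flag and one quantization flag is an assumption, not a documented constraint.

```python
import shlex
from typing import List, Optional

PRECISIONS = ("tf32", "fp16", "bf16")
QUANTIZATIONS = ("load_8bit", "load_4bit")
ATTN_IMPLS = ("eager", "flash_attn")


def build_mlora_command(
    base_model: str,
    config: str,
    precision: Optional[str] = None,
    quantize: Optional[str] = None,
    attn_impl: Optional[str] = None,
) -> List[str]:
    """Build an argument list for mlora.py from flags named in this README."""
    cmd = ["python", "mlora.py", "--base_model", base_model, "--config", config]
    if precision is not None:
        if precision not in PRECISIONS:
            raise ValueError(f"unknown precision: {precision}")
        cmd.append(f"--{precision}")
    if quantize is not None:
        if quantize not in QUANTIZATIONS:
            raise ValueError(f"unknown quantization: {quantize}")
        cmd.append(f"--{quantize}")
    if attn_impl is not None:
        if attn_impl not in ATTN_IMPLS:
            raise ValueError(f"unknown attention method: {attn_impl}")
        cmd += ["--attn_impl", attn_impl]
    return cmd


cmd = build_mlora_command(
    "meta-llama/Llama-2-7b-hf", "./config/alpaca.json", precision="bf16"
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run to launch the job
```

Building a list instead of a single string avoids shell quoting problems when paths contain spaces.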

Use Docker

First, make sure that Docker Engine and the NVIDIA Container Toolkit are installed correctly.

You can then launch the container with the following typical command:

docker run --gpus all -it --rm mikecovlee/mlora

You can check all available tags from: mikecovlee/mlora/tags

Please note that this container only provides a suitable environment for running m-LoRA. The m-LoRA code itself is not included.

Copyright

Copyright © 2023-2024 KDDE Lab, Sichuan University

This project is licensed under the Apache 2.0 License.
