:shipit: Efficient and Scalable Fine-Tune of Language Models for Genome Understanding

Parameter-Efficient Fine-Tuning (PEFT) has become the de facto approach to fine-tuning PFMs while reducing computational costs. Current PEFT methods include:

  1. Prefix tuning methods, e.g., Prefix-Tuning: Optimizing Continuous Prompts for Generation and P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
  2. Prompt tuning methods, e.g., The Power of Scale for Parameter-Efficient Prompt Tuning
  3. Low-rank adaptation methods, e.g., LoRA: Low-Rank Adaptation of Large Language Models and AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Among these methods, we opt for adaptive rank sampling to address the data heterogeneity issue and for LINGO: Language prefix fINe-tuning for GenOmes to leverage the in-context learning ability of LLMs. The framework is as follows:

The framework
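
As a concrete reference point, a minimal AdaLoRA setup with the Hugging Face peft library might look like the sketch below. The base model (facebook/opt-125m) and all hyperparameter values are illustrative assumptions, not the repository's exact configuration, and the adaptive rank sampling coupling implemented in peftnew/ is not shown here.

# Hedged sketch of an AdaLoRA configuration (illustrative values only).
from transformers import AutoModelForSequenceClassification
from peft import AdaLoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained("facebook/opt-125m", num_labels=2)

adalora = AdaLoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence-level classification, e.g., promoter detection
    init_r=12,                   # initial rank of every adapted weight matrix
    target_r=4,                  # average rank budget after adaptive allocation
    tinit=200,                   # warm-up steps before rank pruning starts
    tfinal=1000,                 # steps over which the rank budget is annealed
    deltaT=10,                   # interval (in steps) between budget updates
    total_step=5000,             # total training steps (required by recent peft releases)
    lora_alpha=32,
    lora_dropout=0.05,
)

model = get_peft_model(base, adalora)
model.print_trainable_parameters()  # only the low-rank adapters are trainable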

The repository is organized as follows:

  1. dataset/: the directory of data sets. We applied our adaptive rank sampling to a comprehensive set of genome understanding tasks on various LLMs, i.e., promoter detection and epigenetic mark prediction in yeast and in multiple human cell types. The link is here.
  2. finetune/: fine-tuning LLMs and pre-trained DNA foundation models for single-label and multi-label tasks using DSP with BBPE-tokenized embeddings and one-hot embeddings.
  3. peftnew/: coupling adaptive rank sampling (RS) with the AdaLoRA method.
  4. scripts/: SLURM batch scripts to run the .py files.
  5. demos/: minimal demos that run AdaLoRA + RS with DSP on OPT and on 4-bit quantized Llama. See llama_dna_sequential_finetune_QLoRA.ipynb.
  6. In addition, this link contains two fine-tuned checkpoints. See link. Replace "/path/to/your/local/model" with the actual file path to the saved model on your local system:
model_name_or_path: Optional[str] = field(default="/path/to/your/local/model")
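
As a quick sanity check after downloading a checkpoint, it can be loaded with the standard transformers API. This sketch assumes the checkpoint is saved in Hugging Face format; the path is a placeholder.

# Hedged sketch: load a downloaded fine-tuned checkpoint (path is a placeholder).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name_or_path = "/path/to/your/local/model"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)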

Setting up the environment

Typically, the setup process on a standard PC requires several tens of minutes to complete.

conda env create -f dna_llm.yml
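
Then activate the environment before running any scripts. The name dna_llm below is an assumption; check the name: field in dna_llm.yml for the actual environment name.

conda activate dna_llm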

For fine-tuning

sbatch run_llm_lora.sh data_path
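
For example, to fine-tune on one of the provided data sets, point the argument at the corresponding directory under dataset/; the path below is only illustrative.

sbatch run_llm_lora.sh dataset/promoter_detection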

Models support matrix

Find models that are supported out of the box below.

Supported models: 1000G-500M, DNABERT-2, OPT, and LLaMA.

Supported methods: LoRA, AdaLoRA, adaptive rank sampling, LINGO + one-hot, and LINGO + BBPE.

Figures

The Pareto front

The MCC changes over time

Cite

@inproceedings{zhan2023parameter,
  title={Parameter-Efficient Fine-Tune on Open Pre-trained Transformers for Genomic Sequence},
  author={Zhan, Huixin and Zhang, Zijun Frank},
  booktitle={NeurIPS 2023 Generative AI and Biology (GenBio) Workshop},
  year={2023}
}
