Dr.E (Dual-Level Residual with Embedding) is a novel framework that bridges Graph Neural Networks (GNNs) with Large Language Models (LLMs) for text-attributed graph learning. The key innovation lies in the dual-level residual quantization mechanism that enables effective alignment between continuous graph representations and discrete token embeddings.
- Multi-View Graph Encoding: Captures structural information at different hop levels (1-hop, 2-hop, 3-hop neighborhoods)
- Dual-Level Residual Quantization:
- Intra-Layer Residual: Generates K codes per view using residual quantization
- Inter-Layer Residual: Propagates quantized embeddings across GNN layers
- Token-Level Alignment: Uses frozen LLM token embeddings as the codebook, enabling seamless integration with the LLM vocabulary
- Parameter-Efficient Fine-Tuning: Leverages LoRA for efficient LLM adaptation
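The intra-layer residual step above can be sketched in a few lines. This is an illustrative NumPy example of residual quantization against a frozen codebook, not the repository's actual implementation: all shapes and the greedy cosine-similarity selection are assumptions for demonstration.

```python
import numpy as np

def residual_quantize(z, codebook, K=3):
    """Greedily pick K codes from a frozen codebook by cosine similarity,
    subtracting each selected code from the running residual.
    Illustrative sketch only -- not the authors' exact procedure."""
    residual = z.astype(np.float64).copy()
    codes = []
    recon = np.zeros_like(residual)
    # Normalize codebook rows once so a dot product gives cosine similarity.
    cb_norm = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    for _ in range(K):
        r_norm = residual / (np.linalg.norm(residual) + 1e-8)
        idx = int(np.argmax(cb_norm @ r_norm))  # nearest code by cosine similarity
        codes.append(idx)
        recon += codebook[idx]
        residual = z - recon  # quantize what the chosen codes have not explained yet
    return codes, recon

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 8))  # stand-in for frozen LLM token embeddings
z = rng.normal(size=8)               # stand-in for one GNN node embedding
codes, recon = residual_quantize(z, codebook, K=3)
```

Each of the K selected indices corresponds to a token in the LLM vocabulary, which is how a continuous node embedding becomes a short discrete token sequence.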
The framework consists of three main components:
- GNN Encoder: A 3-layer SAGEConv network with Inter-Layer Residual connections
- Vector Quantization Module: Maps continuous GNN embeddings to discrete tokens using cosine similarity
- LLM Decoder: Llama-2-7B with LoRA fine-tuning for node classification
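For the LoRA fine-tuning component, a configuration along the following lines is typical. This is a hedged sketch of what configs/peft.py might contain; the hyperparameter values and target modules below are illustrative choices, not the repository's actual settings.

```python
from peft import LoraConfig, TaskType

# Illustrative LoRA config for Llama-2-7B as a causal-LM decoder.
# r, lora_alpha, dropout, and target_modules are assumed values, not the repo's.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # low-rank adapter dimension
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # common choice for Llama attention
)
```

Only the adapter weights (and the GNN encoder) are trained; the base LLM and its token-embedding codebook stay frozen.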
- Python >= 3.9
- CUDA >= 11.7
- PyTorch >= 2.0.0
Download Llama-2-7B-hf from Hugging Face and update the llm_path in configs/training.py.
Download the pre-computed embeddings file x_emb.pt from Google Drive:
First, generate the codebook from your datasets:
python scripts/create_codebook.py --llm_path /path/to/Llama-2-7b-hf

Then launch training on your chosen dataset:

# Train on Cora
python train.py --llm_path /path/to/Llama-2-7b-hf --dataset cora_dataset
# Train on PubMed
python train.py --llm_path /path/to/Llama-2-7b-hf --dataset pubmed_dataset
# Train on ogbn-arxiv
python train.py --llm_path /path/to/Llama-2-7b-hf --dataset ogbn_arxiv_dataset --batch_size 2

| Argument | Default | Description |
|---|---|---|
| `--llm_path` | - | Path to Llama-2-7B-hf model |
| `--dataset` | cora_dataset | Dataset name |
| `--batch_size` | 4 | Training batch size |
| `--num_epochs` | 20 | Number of training epochs |
| `--llm_lr` | 1e-4 | Learning rate for LLM (LoRA) |
| `--gnn_lr` | 8e-4 | Learning rate for GNN |
| `--device` | cuda:0 | Training device |
| `--quantization` | 8bit | LLM quantization (4bit/8bit) |
| Dataset | Test Accuracy |
|---|---|
| Cora | 91.33% |
| PubMed | 96.70% |
| ogbn-arxiv | 76.45% |
.
├── README.md
├── requirements.txt
├── .gitignore
├── train.py # Main training script
├── configs/
│ ├── training.py # Training configuration
│ ├── peft.py # LoRA configuration
│ ├── datasets.py # Dataset paths
│ └── quantization.py # Quantization config
├── models/
│ ├── model.py # Dr.E model architecture
│ └── vq.py # Vector Quantization module
├── datasets/
│ ├── cora_dataset.py # Cora dataset loader
│ └── pubmed_dataset.py # PubMed dataset loader
├── utils/
│ ├── train_utils.py # Training utilities
│ ├── dataset_utils.py # Dataset preprocessing
│ ├── config_utils.py # Configuration utilities
│ └── memory_utils.py # Memory tracking
├── scripts/
│ └── create_codebook.py # Codebook generation
└── imgs/ # Paper figures
If you find this work useful, please cite our paper:
@inproceedings{liu2025multi,
title={Multi-view empowered structural graph wordification for language models},
author={Liu, Zipeng and Wu, Likang and He, Ming and Guan, Zhong and Zhao, Hongke and Feng, Nan},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={39},
number={23},
pages={24714--24722},
year={2025}
}

This project is licensed under the MIT License - see the LICENSE file for details.
