SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks

SpikeGPT is a lightweight generative language model with pure binary, event-driven spiking activation units. The SpikeGPT paper is available on arXiv (arXiv:2302.13939).

If you are interested in SpikeGPT, feel free to join our Discord using this link!

This repo is inspired by RWKV-LM.

If you are struggling with environment configuration, consider using the Docker image for SpikeGPT available on GitHub.

Training on Enwik8

Download the enwik8 dataset and extract it.
Update the training, validation, and test set paths in train.py to point to the directory where you extracted the files. For example, if you extracted them to a directory named enwik8_data, set the paths as follows:
```
# Set the paths for the datasets
datafile_train = "path/to/enwik8_data/train"
datafile_valid = "path/to/enwik8_data/validate"
datafile_test = "path/to/enwik8_data/test"
```
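A quick check before launching training can catch a mistyped path early. The sketch below is not part of the repo: the helper name `missing_dataset_files` is hypothetical, and the file names `train`, `validate`, and `test` are taken from the example paths above.

```python
from pathlib import Path


def missing_dataset_files(data_dir, names=("train", "validate", "test")):
    """Return the expected split files that are absent from data_dir."""
    root = Path(data_dir)
    return [str(root / name) for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    # "path/to/enwik8_data" is the placeholder directory used in train.py above.
    missing = missing_dataset_files("path/to/enwik8_data")
    if missing:
        print("Missing dataset files:", ", ".join(missing))
    else:
        print("All three enwik8 split files were found.")
```

If the list is non-empty, correct the paths in train.py before starting a run.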

Pre-training on large corpus

Preparing the Corpus:
- To begin, pre-tokenize your corpus data.
- For custom data, use the jsonl2binidx tool to convert your data.
- If you prefer pre-tokenized data, consider using the pre-tokenized version of The Pile, which was tokenized with the 20B tokenizer used in GPT-NeoX and Pythia.
- If resources are limited, you may use just one file from the dataset instead of the entire collection.
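For custom data, the corpus first has to be in JSON Lines form before conversion. The sketch below writes one JSON object per line with a single `"text"` field. That field name is an assumption about what the jsonl2binidx tool expects, so check the tool's own documentation before relying on it. The helper name `write_jsonl` is hypothetical.

```python
import json


def write_jsonl(documents, out_path):
    """Write one JSON object per line, each holding a single "text" field.

    Returns the number of documents written. Empty documents are skipped.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for doc in documents:
            doc = doc.strip()
            if not doc:
                continue  # skip blank entries so no empty records are written
            # ASSUMPTION: the converter reads the document body from a "text" key.
            f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A call such as `write_jsonl(["First document.", "Second document."], "my_corpus.jsonl")` would produce a two-line file that can then be passed to the converter.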
Configuring the Training Script:
- In train.py, uncomment line 82 to enable MMapIndexedDataset as the dataset class.
- Change datafile_train to the filename of your binidx file.
- Important: Do not include the .bin or .idx file extensions.
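Because the dataset is addressed by its shared prefix rather than by either file, it is easy to paste a full file name by mistake. The sketch below strips a trailing `.bin` or `.idx` and checks that both files exist. Both helper names are hypothetical and not part of the repo.

```python
from pathlib import Path


def binidx_prefix(path):
    """Drop a trailing .bin or .idx so the result can be used as datafile_train."""
    p = Path(path)
    if p.suffix in (".bin", ".idx"):
        return str(p.with_suffix(""))
    return str(p)


def has_binidx_pair(prefix):
    """Return True only if both <prefix>.bin and <prefix>.idx exist."""
    return Path(prefix + ".bin").is_file() and Path(prefix + ".idx").is_file()
```

For example, `binidx_prefix("my_corpus.idx")` yields the prefix to assign to `datafile_train` (the file name here is only an illustration), and `has_binidx_pair` confirms that both halves of the dataset are present next to each other.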
Starting Multi-GPU Training:
- Use Hugging Face's Accelerate to launch training on multiple GPUs, typically by running accelerate config once and then accelerate launch train.py.

Fine-Tuning on WikiText-103

Downloading Pre-Tokenized WikiText-103:
- You can obtain the pre-tokenized WikiText-103 dataset binidx file from this Hugging Face dataset link.
Fine-Tuning the Model:
- Fine-tune the model on this dataset using the same procedure as in pre-training.
- Important: Set a smaller learning rate than during the pre-training stage to avoid catastrophic forgetting. A recommended learning rate is around 3e-6.
- Adjust the batch size to suit your hardware and task.

Inference with Prompt

You can run inference with either your own customized model or our pre-trained model. Our pre-trained model is available here. It was trained on 5B tokens from OpenWebText.

Download our pre-trained model and place it in the root directory of this repo.
Set the 'context' variable in run.py to your custom prompt.
Run run.py.

Fine-Tune with NLU tasks

In the script in the 'NLU' folder, change the path on line 17 to your model path.
Run the script.

Citation

If you find SpikeGPT useful in your work, please cite the following source:

@article{zhu2023spikegpt,
        title   = {SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks},
        author  = {Zhu, Rui-Jie and Zhao, Qihang and Li, Guoqi and Eshraghian, Jason K.},
        journal = {arXiv preprint arXiv:2302.13939},
        year    = {2023}
}
