thecodingmage/Diffusion-SLM

Download Dataset

Dataset: FineWeb

Download script:

from huggingface_hub import snapshot_download

folder = snapshot_download(
    "HuggingFaceFW/fineweb",
    repo_type="dataset",
    local_dir="./fineweb/",
    # replace "sample/10BT/*" with "sample/100BT/*" to use the 100BT sample
    allow_patterns="sample/10BT/*",
)
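The allow_patterns argument restricts the download to files matching a shell-style glob, which is why only the 10BT sample is fetched instead of the full dataset. A minimal sketch of that matching behavior (the shard names below are illustrative, not real FineWeb file names):

```python
from fnmatch import fnmatch

# Illustrative repo paths; actual FineWeb shard names differ.
repo_files = [
    "data/CC-MAIN-2023-50/000_00000.parquet",
    "sample/10BT/000_00000.parquet",
    "sample/100BT/000_00000.parquet",
]

pattern = "sample/10BT/*"
selected = [f for f in repo_files if fnmatch(f, pattern)]
print(selected)  # only the 10BT sample shard matches
```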

References

  1. Llada-from-Scratch

Utils

  1. Avoid fragmentation: export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
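The same setting can be applied from inside a training script instead of the shell; it must take effect before the first CUDA allocation, so place it at the very top, before importing torch:

```python
import os

# Equivalent to the shell export above; set before torch is imported so the
# allocator picks it up at its first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```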

Working Guide

  1. Create a venv: python -m venv venv, then activate it (venv\Scripts\activate on Windows, source venv/bin/activate on Linux/macOS)
  2. Run the setup script: cd setup && ./setup.sh
  3. Place the training data, then run python prepareData.py
  4. Train the tokenizer: python tokenizer.py
  5. Train the LLaDA model: CUDA_VISIBLE_DEVICES=1 python train2.py
  6. Eval 1: python sample.py
  7. Eval 2: python eval.py
  8. Launch the app: python app.py
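Step 5 trains a LLaDA-style masked-diffusion model. The core of that training objective can be sketched without any framework: sample a masking ratio t uniformly, mask each token independently with probability t, and train the model to recover the masked positions from bidirectional context. A minimal sketch (MASK_ID is a hypothetical placeholder; the real tokenizer defines its own mask token):

```python
import random

MASK_ID = 0  # hypothetical mask-token id; the trained tokenizer defines the real one

def forward_mask(token_ids, t, rng=random):
    """LLaDA-style forward process: mask each token independently with prob. t.

    Returns the corrupted sequence and, for each masked position, the
    original token the model should predict (None elsewhere).
    """
    masked = [MASK_ID if rng.random() < t else tok for tok in token_ids]
    targets = [tok if m == MASK_ID else None for tok, m in zip(token_ids, masked)]
    return masked, targets

# Each training step draws its own masking ratio t ~ Uniform(0, 1).
t = random.random()
```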

About

A Diffusion-based Small Language Model (SLM) built using the LLaDA framework. Currently featuring a 1B-parameter architecture trained on FineWeb 10BT with a focus on bidirectional context and denoising efficiency.
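The denoising direction mentioned above can be sketched as iterative unmasking: generation starts from an all-mask sequence, the model fills in every masked position in parallel, and the lowest-confidence predictions are remasked for the next step until the sequence is fully resolved. A simplified sketch with a stand-in predictor, not the repo's sample.py:

```python
def diffusion_sample(predict, length, steps, mask_id=-1):
    """Iteratively denoise an all-mask sequence.

    predict(seq) must return a (token, confidence) pair for every position.
    Uses a linear unmasking schedule: about length // steps more tokens are
    kept fixed after each step; everything else is remasked and retried.
    """
    seq = [mask_id] * length
    for step in range(steps, 0, -1):
        preds = predict(seq)
        # Fill every masked position with the model's current prediction.
        for i, tok in enumerate(seq):
            if tok == mask_id:
                seq[i] = preds[i][0]
        # Keep only the highest-confidence tokens; remask the rest.
        keep = length - (step - 1) * (length // steps)
        order = sorted(range(length), key=lambda i: -preds[i][1])
        for i in order[keep:]:
            seq[i] = mask_id
    return seq

# Stand-in predictor that always proposes token i+10 at position i.
def toy_predict(seq):
    return [(i + 10, 1.0) for i in range(len(seq))]

print(diffusion_sample(toy_predict, 4, 2))  # fully unmasked after 2 steps
```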
