- Download script:
from huggingface_hub import snapshot_download

# Download only the 10BT sample of FineWeb into ./fineweb/
folder = snapshot_download(
    "HuggingFaceFW/fineweb",
    repo_type="dataset",
    local_dir="./fineweb/",
    # replace "sample/10BT/*" with "sample/100BT/*" to use the 100BT sample
    allow_patterns="sample/10BT/*",
)
- Avoid fragmentation:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
- Create a venv:
python -m venv venv
venv\Scripts\activate
- Run the setup script:
cd setup
./setup.sh
- Put the training data and
python prepareData.py
- Train tokenizer:
python tokenizer.py
- Train LLADA model:
CUDA_VISIBLE_DEVICES=1 python train2.py
- Eval_1:
python sample.py
- Eval_2:
python eval.py
- Launch App:
python app.py
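
Before running `prepareData.py`, it can help to confirm that the download step actually produced files. The snippet below is a minimal sketch, not part of this repo: `summarize_download` is a hypothetical helper, and it assumes the 10BT sample lands as `.parquet` files under `./fineweb/sample/10BT/`.

```python
from pathlib import Path


def summarize_download(local_dir="./fineweb/", subdir="sample/10BT"):
    """Return (file_count, total_bytes) for Parquet files in local_dir/subdir.

    Hypothetical helper; the directory layout is an assumption based on the
    local_dir and allow_patterns values used in the download script.
    """
    root = Path(local_dir) / subdir
    files = sorted(root.glob("*.parquet"))
    total = sum(f.stat().st_size for f in files)
    return len(files), total


if __name__ == "__main__":
    count, total = summarize_download()
    print(f"{count} parquet files, {total / 1e9:.2f} GB")
```

A count of zero usually means the download was interrupted or `local_dir` points somewhere else.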
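
The steps from data preparation through evaluation can also be chained from one script. This is only a sketch under a few assumptions: the script names are the ones listed above, they are run in the listed order from the repo root, and `STEPS`, `step_env` and `run_all` are illustrative names that do not exist in the repo. It prints the commands by default and only executes them when called with `--run`.

```python
import os
import subprocess
import sys

# (label, script, extra environment variables) for each step listed above.
# CUDA_VISIBLE_DEVICES=1 is applied to the training step only, as in the README.
STEPS = [
    ("prepare data", "prepareData.py", {}),
    ("train tokenizer", "tokenizer.py", {}),
    ("train model", "train2.py", {"CUDA_VISIBLE_DEVICES": "1"}),
    ("eval 1", "sample.py", {}),
    ("eval 2", "eval.py", {}),
]


def step_env(extra):
    """Copy the current environment and add the allocator setting plus extras."""
    env = dict(os.environ)
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    env.update(extra)
    return env


def run_all(dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    commands = []
    for label, script, extra in STEPS:
        cmd = [sys.executable, script]
        commands.append(cmd)
        print(f"[{label}] {' '.join(cmd)}")
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so a failed step stops the remaining ones.
            subprocess.run(cmd, check=True, env=step_env(extra))
    return commands


if __name__ == "__main__":
    run_all(dry_run="--run" not in sys.argv)
```

`app.py` is left out on purpose, since launching the app is an interactive step rather than part of the batch run.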