Tutorial: LLM basics from scratch, with step-by-step explanations.
Change to the data folder:

    cd data

Initialize Git LFS for large files:

    git lfs install

Clone the dataset:

    git clone https://huggingface.co/datasets/Skylion007/openwebtext

Unzip the dataset:

    bash unzip.sh

Back in the root folder, run the following command:

    python convert_data.py

It converts all the .xz files in data/openwebtext/subsets and puts the converted .txt files in the folder data/extracted.
We are using neetbox for monitoring. Open localhost:20202 (neetbox's default port) in your browser to check the progress. If you are working on a remote server, you can forward the port to your local machine with ssh -L 20202:localhost:20202 user@remotehost, or you can access the server's IP address directly with the port number. You will then see all the running processes.
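If the page does not load, it can help to confirm that something is actually listening on the port before debugging the browser or the SSH tunnel. The following is a minimal sketch using only the Python standard library; the helper name port_is_open is hypothetical and is not part of neetbox.

```python
import socket


def port_is_open(host="localhost", port=20202, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection tries every address the host name resolves to.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    state = "reachable" if port_is_open() else "not reachable"
    print(f"localhost:20202 is {state}")
```

Run it on your local machine after starting the ssh -L forward: a reachable port suggests the tunnel and the monitoring server are both up.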
The script will also ask whether you'd like to delete the original .xz files to save disk space. If you want to keep them, type n and press Enter.
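To make the conversion step concrete, here is a minimal sketch of decompressing .xz files into .txt files with the standard lzma module. It is not the code of convert_data.py: the function name extract_xz_files and its delete_original flag are hypothetical, and the sketch assumes each .xz file is a plain xz-compressed text stream. If an archive holds tar members instead, the tarfile module would be needed.

```python
import lzma
import shutil
from pathlib import Path


def extract_xz_files(src_dir, dst_dir, delete_original=False):
    """Decompress every .xz file in src_dir into a .txt file in dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for xz_path in sorted(src_dir.glob("*.xz")):
        txt_path = dst_dir / (xz_path.stem + ".txt")
        # Stream the data in chunks so large archives never sit fully in memory.
        with lzma.open(xz_path, "rb") as src, open(txt_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        if delete_original:
            # Mirrors the optional cleanup step: drop the archive once converted.
            xz_path.unlink()
        written.append(txt_path)
    return written


if __name__ == "__main__":
    # Paths taken from the text above; adjust them to your checkout.
    files = extract_xz_files("data/openwebtext/subsets", "data/extracted")
    print(f"converted {len(files)} files")
```

Deleting is off by default in this sketch, so the archives are only removed when the flag is set explicitly.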
Train the model:

    python train.py --config config/gptv1_s.toml

Since we are using neetbox for monitoring, open localhost:20202 (neetbox's default port) in your browser to check the progress.
Run inference:

    python inference.py --config config/gptv1_s.toml

Open localhost:20202 (neetbox's default port) in your browser and feed text to your model via the action button.
For more information, see also LLM basics from scratch.


