- Review the `lora.yml` config
  - Update your HuggingFace repository + remember to log in via the CLI (`huggingface-cli login`)
  - Update your Weights & Biases project + remember to log in via the CLI (`wandb login`)
  - Update the dataset path + output path
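
The two logins only need to be done once per machine. In Axolotl configs the values to update usually live under keys like `hub_model_id`, `wandb_project`, the dataset `path`, and `output_dir`, but check your `lora.yml` for the exact names. The CLI logins:

```bash
# Run once on the training machine so Axolotl can push the adapter and log the run.
huggingface-cli login   # paste a HuggingFace token with write access
wandb login             # paste your Weights & Biases API key
```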
- Generate the dataset with `prepare_data.jl`. It will create `leaderboard_code_gen_data_trainset11.jsonl`.
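
For example (assuming the script runs standalone from the project root; you may need to activate the project environment first):

```bash
# Generates the Axolotl training file in the project root.
julia prepare_data.jl
# -> leaderboard_code_gen_data_trainset11.jsonl
```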
**You require a GPU from here onward.**
- Fine-tune with Axolotl. Follow the 3 steps in `axolotl_script.sh`
  - All steps are executed from the `axolotl` folder (the main folder of the Docker container / VS Code)
  - Preprocess the datasets
  - Fine-tune the model based on the `lora.yml` config
  - Merge the adapter into the base model
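
The three steps roughly map onto the standard Axolotl CLI; a minimal sketch (check `axolotl_script.sh` for the exact flags and output paths used here):

```bash
# Run from the axolotl/ folder.
python -m axolotl.cli.preprocess lora.yml          # 1) tokenize & cache the dataset
accelerate launch -m axolotl.cli.train lora.yml    # 2) LoRA fine-tune
python -m axolotl.cli.merge_lora lora.yml \
  --lora_model_dir="./lora-out"                    # 3) merge the adapter into the base model
                                                   #    (path = output_dir from lora.yml)
```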
You could stop here.
- The adapter will be automatically uploaded to the HuggingFace Hub (it is only c. 85 MB).
- The training run will be recorded by Weights & Biases, so you can investigate your loss curves (and pick the right checkpoint).
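
If you do stop here, you can pull the adapter back down on any machine; a sketch (the repo id is whatever you set in `lora.yml`, shown here as a placeholder):

```bash
# Download the LoRA adapter that Axolotl pushed to the Hub (replace with your repo id).
huggingface-cli download your-username/cheater-7b-lora
```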
- Convert your model into GGUF and quantize it - script `gguf_conversion.sh`
  - It sets up a new Python environment and changes the current directory to `llama.cpp/`
  - Downloads and compiles llama.cpp
  - Installs its Python dependencies
  - Converts the LoRA-merged model into an FP16 file
  - Quantizes the model into the selected format (e.g., Q5_K_M)
  - Uploads to the HuggingFace Hub. Remember to change the HuggingFace repo name!
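
A rough sketch of what the script does (file names, paths, and repo names below are placeholders; see `gguf_conversion.sh` for the exact ones, and note that the conversion/quantization entry points have been renamed in newer llama.cpp releases):

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make              # build llama.cpp, including the quantize tool
pip install -r requirements.txt   # Python deps for the conversion script

# Convert the merged (base + LoRA) model to a 16-bit file, then quantize it.
python convert.py ../merged-model --outtype f16 --outfile cheater-7b.fp16.gguf
./quantize cheater-7b.fp16.gguf cheater-7b.Q5_K_M.gguf Q5_K_M

# Upload the quantized file to your own HF repo (change the repo name!).
huggingface-cli upload your-username/cheater-7b-gguf cheater-7b.Q5_K_M.gguf
```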
## [OPTIONAL] Benchmarking the results
- Install Julia if you're running remotely - script `julia_install.sh`
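
A minimal sketch of a remote install via juliaup, the official installer (`julia_install.sh` may pin a specific version instead):

```bash
# Installs juliaup and the latest stable Julia (interactive prompt).
curl -fsSL https://install.julialang.org | sh
```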
- Install vLLM and launch the server - script `vllm_server.sh` (executed from the main project directory `axolotl/`)
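
A minimal sketch of launching an OpenAI-compatible vLLM server; the model id and flags below are assumptions, `vllm_server.sh` has the exact ones:

```bash
pip install vllm
# Serve the base model (swap in your merged model as needed).
python -m vllm.entrypoints.openai.api_server \
  --model cognitivecomputations/dolphin-2.6-mistral-7b \
  --port 8000
```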
- Benchmark the base model (easiest with the vLLM server) - see `benchmark_finetune.jl`
- Benchmark the base model + LoRA adapter - see `benchmark_finetune.jl`
You can shut down the vLLM server now.
- Start the llama.cpp server - see `llama_server.sh` (executed from the directory `llama.cpp/`)
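
A minimal sketch of serving the quantized file (the file name is a placeholder; the server binary is `./server` in older llama.cpp builds and `llama-server` in newer ones):

```bash
# Run from llama.cpp/ after building; serves an HTTP API on port 8080 by default.
./server -m cheater-7b.Q5_K_M.gguf -c 2048 --port 8080
```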
Benchmark the GGUF quantized model - see
benchmark_finetune.jl
## Tips
- If you get GPU / CUDA memory errors, reduce the `micro_batch_size`
- If your loss quickly starts increasing, your `learning_rate` is too high. Halve it.
- Watch your GPU utilization and GPU memory stats with `watch -n0.1 nvidia-smi`. If utilization is not close to 100%, you're wasting time :)
- Set the LoRA hyperparameters based on https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2
## Training dataset
- File: `leaderboard_code_gen_data_trainset11.jsonl`
- It contains 11 of the 14 test cases, using the top 50 samples that scored full points (100 points).
- The held-out test cases are `q_and_a_extractor`, `extract_julia_code`, and `pig_latinify`.
## Overall Results
| Model | Elapsed | Score | Score Std Deviation | Count Zero Score | Count Full Score | Cost Cents |
|---|---|---|---|---|---|---|
| cheater-7b.Q5_K_M | 2.1 | 86.7 | 26.3 | 5 | 265 | 0.0 |
| cheater-7b | 6.1 | 85.0 | 27.4 | 11 | 242 | 0.0 |
| gpt-4-1106-preview | 22.5 | 68.9 | 35.0 | 50 | 149 | 1.21 |
| dolphin-2.6-mistral | 8.2 | 57.9 | 30.4 | 32 | 66 | 0.0 |
## By test case
| Test case | cheater-7b | cheater-7b.Q5_K_M | dolphin-2.6-mistral | gpt-4-1106-preview |
|---|---|---|---|---|
| timezone_bumper | 97.0 | 100.0 | 84.4 | 91.1 |
| FloatWithUnits | 100.0 | 100.0 | 82.0 | 67.9 |
| clean_column | 95.8 | 100.0 | 70.8 | 80.8 |
| count_model_rows | 96.0 | 100.0 | 51.5 | 87.9 |
| weather_data_analyzer | 98.0 | 92.0 | 57.8 | 86.2 |
| keep_only_names | 100.0 | 100.0 | 48.5 | 81.2 |
| wrap_string | 89.5 | 85.0 | 67.4 | 87.4 |
| event_scheduler | 97.8 | 100.0 | 56.6 | 63.0 |
| ispersonal | 100.0 | 100.0 | 60.0 | 50.0 |
| add_yearmonth | 95.0 | 100.0 | 45.2 | 65.0 |
| audi_filter | 93.8 | 100.0 | 55.5 | 51.8 |
| q_and_a_extractor | 55.0 | 73.3 | 42.3 | 47.6 |
| extract_julia_code | 48.4 | 38.5 | 55.6 | 50.0 |
| pig_latinify | 23.5 | 25.0 | 32.8 | 54.8 |