ValueError: test_size=2000 should be either positive and smaller than the number of samples 78 or a float in the (0, 1) range #470

happy-zhangbo · 2023-05-22T08:37:47Z

(llm) root@autodl-container-91b911ad00-2f77ab00:~/autodl-tmp/alpaca-lora# python finetune.py --base_model /root/autodl-tmp/models/decapoda-research/llama-7b-hf --data_path /root/autodl-tmp/alpaca-lora/data.json --output_dir './lora-alpaca-musk' --micro_batch_size 64 --batch_size 1024

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /root/miniconda3/envs/llm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so
/root/miniconda3/envs/llm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: /root/miniconda3/envs/llm did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/root/miniconda3/envs/llm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('Asia/Shanghai')}
warn(msg)
/root/miniconda3/envs/llm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//127.0.0.1'), PosixPath('7890'), PosixPath('http')}
warn(msg)
/root/miniconda3/envs/llm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/vscode-ipc-03416a64-f716-47fe-a0a5-8e9cdca081d7.sock')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/root/miniconda3/envs/llm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 118
/root/miniconda3/envs/llm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /root/miniconda3/envs/llm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so...
Training Alpaca-LoRA model with params:
base_model: /root/autodl-tmp/models/decapoda-research/llama-7b-hf
data_path: /root/autodl-tmp/alpaca-lora/data.json
output_dir: ./lora-alpaca-musk
batch_size: 1024
micro_batch_size: 64
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
add_eos_token: False
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:22<00:00, 1.45it/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-c39e4147dc9b25ec/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 299.61it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Traceback (most recent call last):
File "/root/autodl-tmp/alpaca-lora/finetune.py", line 283, in <module>
fire.Fire(train)
File "/root/miniconda3/envs/llm/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/root/miniconda3/envs/llm/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/root/miniconda3/envs/llm/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/root/autodl-tmp/alpaca-lora/finetune.py", line 214, in train
train_val = data["train"].train_test_split(
File "/root/miniconda3/envs/llm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 543, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/root/miniconda3/envs/llm/lib/python3.9/site-packages/datasets/fingerprint.py", line 511, in wrapper
out = func(dataset, *args, **kwargs)
File "/root/miniconda3/envs/llm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 4365, in train_test_split
raise ValueError(
ValueError: test_size=2000 should be either positive and smaller than the number of samples 78 or a float in the (0, 1) range

Why does this error occur? I don't understand it.


tulunlxj2017 · 2023-05-29T10:28:37Z

@happy-zhangbo I got the same problem. How did you fix it?

happy-zhangbo · 2023-05-29T10:47:07Z

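> @happy-zhangbo I got the same problem. How did you fix it?

The problem is that the dataset is too small. My data.json has only 78 samples, but val_set_size was left at 2000, so the requested validation split is larger than the whole dataset. The validation size has to be smaller than the number of samples.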
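For anyone hitting the same error, here is a minimal sketch of how a valid validation size could be chosen for a small dataset. The helper `pick_val_set_size` and its `max_fraction` parameter are hypothetical names made up for this example; they are not part of alpaca-lora or the `datasets` library. The only rule taken from the error message is that an integer `test_size` must be positive and smaller than the number of samples.

```python
def pick_val_set_size(num_samples: int, requested: int, max_fraction: float = 0.1) -> int:
    """Return a validation-set size that fits a dataset of num_samples rows.

    Hypothetical helper for illustration only.
    A return value of 0 means "do not hold out a validation set".
    """
    if num_samples < 2:
        # With fewer than two rows, no valid split exists.
        return 0
    # Hold out at most max_fraction of the data, but allow at least one row.
    cap = max(1, int(num_samples * max_fraction))
    # Never return more than was requested, and never a negative number.
    return max(0, min(requested, cap))


size = pick_val_set_size(num_samples=78, requested=2000)
print(size)

# Assumed usage with a Hugging Face dataset (not executed here):
# if size > 0:
#     train_val = data["train"].train_test_split(test_size=size, shuffle=True, seed=42)
```

With the 78-sample file from this issue, the result could then be passed on the command line as `--val_set_size`. As far as I can tell, finetune.py only calls `train_test_split` when `val_set_size` is greater than 0, so `--val_set_size 0` should skip the split entirely, but please check that against your copy of the script.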

happy-zhangbo closed this as completed May 23, 2023
