
I can not reproduce 7b 6.09 Wiki2 PPL. #78

Closed
USBhost opened this issue Mar 24, 2023 · 14 comments

Comments

USBhost (Contributor) commented Mar 24, 2023

I cannot seem to get that number. It's either smaller or a little bigger. Can you guys provide the command you used to get it?

In my tests I ended up getting Wiki2 6.29, 6.25, and 5.9 with different settings. Also, what is the correct way to run these tests: the benchmark check, or the normal perplexity printed at the Evaluating stage when saving?

```
python -u ../repositories/GPTQ-for-LLaMa/llama.py llama-7b wikitext2 --new-eval --wbits 4 --act-order --true-sequential --save_safetensors llama-7b-4bit.safetensors

c4-new: 7.843033313751221
ptb-new: 10.846735000610352
wikitext2: 5.92544412612915
```

```
python -u ../repositories/GPTQ-for-LLaMa/llama.py llama-7b wikitext2 --new-eval --wbits 4 --act-order --true-sequential --load llama-7b-4bit.safetensors --benchmark 2048 --check

Median: 0.0950855016708374
PPL: 6.688839912414551
max memory(MiB): 1712.3349609375
```

USBhost changed the title from "I can not preproduce 7b 6.09 Wiki2 PPL." to "I can not reproduce 7b 6.09 Wiki2 PPL." on Mar 24, 2023
@qwopqwop200 (Owner)

Depending on the GPUs/drivers, there may be a difference in performance, which decreases as the model size increases (IST-DASLab/gptq#1).

USBhost (Contributor, Author) commented Mar 25, 2023

> Depending on the GPUs/drivers, there may be a difference in performance, which decreases as the model size increases (IST-DASLab/gptq#1).

Yes, I get that, but which command do you use to properly get that number: --benchmark, or the normal perplexity printed at the Evaluating stage? I am confused about which one I should use to gauge the models I convert.

Qubitium (Contributor) commented Mar 25, 2023

@USBhost Which GPU and Nvidia driver version are you using? Maybe we can track this GPU/driver difference as other users report more scores.

Also, your score is better: for perplexity, lower is better, not higher. That's a good thing, right?

USBhost (Contributor, Author) commented Mar 25, 2023

> @USBhost Which GPU and Nvidia driver version are you using? Maybe we can track this GPU/driver difference as other users report more scores.
>
> Also, your score is better: for perplexity, lower is better, not higher. That's a good thing, right?

GPU: RTX A6000
Driver: 530.30.02

If you notice, in the OP I ran with wikitext2 as the calibration set instead of the default, C4. When I ran with C4 I got Wiki2 6.29, and when I ran with groupsize 128 + --true-sequential I got Wiki2 6.25.

@Qubitium (Contributor)

@USBhost Looking at the code and reading the arXiv paper, here are my thoughts on the variance.

  1. A default of 128 calibration samples is used. You can increase this, but VRAM usage during quantization will explode.
  2. This means that if you selected c4, it will randomly select 128 samples from that dataset to establish a calibration baseline for comparison and also to assist quantization.

By design, the code will produce a different score every single time, since the calibration samples are randomized.
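
For context, here is a minimal sketch of how that calibration sampling typically looks in GPTQ-style pipelines. The function name and the fixed seed are assumptions for illustration, not the exact GPTQ-for-LLaMa code; the point is that if the random generator is seeded the same way every run, the "random" selection repeats exactly.

```python
# Hypothetical sketch of GPTQ-style calibration sampling (not the exact repo code).
# nsamples random windows of seqlen tokens are drawn from the calibration corpus.
import random
import torch

def get_calibration_samples(token_ids: torch.Tensor, nsamples: int = 128,
                            seqlen: int = 2048, seed: int = 0):
    # token_ids: [1, N] tensor of token ids for the whole calibration corpus.
    # Seeding here makes the "random" selection identical on every run,
    # which would explain bit-identical PPL across repeated quantizations.
    rng = random.Random(seed)
    samples = []
    for _ in range(nsamples):
        start = rng.randint(0, token_ids.shape[1] - seqlen - 1)
        samples.append(token_ids[:, start:start + seqlen])
    return samples
```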

USBhost (Contributor, Author) commented Mar 30, 2023

> @USBhost Looking at the code and reading the arXiv paper, here are my thoughts on the variance.
>
> 1. A default of 128 calibration samples is used. You can increase this, but VRAM usage during quantization will explode.
> 2. This means that if you selected c4, it will randomly select 128 samples from that dataset to establish a calibration baseline for comparison and also to assist quantization.
>
> By design, the code will produce a different score every single time, since the calibration samples are randomized.

But I am able to reproduce my results, though. That randomness doesn't seem to be doing what we think it does.

@Qubitium (Contributor)

> But I am able to reproduce my results, though. That randomness doesn't seem to be doing what we think it does.

Interesting. I never tried to run the same config more than once myself, since it takes forever. Are you getting the exact values, down to the last significant digit, when re-quantizing with the same config?

USBhost (Contributor, Author) commented Mar 30, 2023

> But I am able to reproduce my results, though. That randomness doesn't seem to be doing what we think it does.
>
> Interesting. I never tried to run the same config more than once myself, since it takes forever. Are you getting the exact values, down to the last significant digit, when re-quantizing with the same config?

To the exact last digit.

Xiuyu-Li commented Apr 1, 2023

I have the same issue using an A6000 GPU.

@qwopqwop200 (Owner)

This value comes from GPTQ. Please ask the GPTQ authors for details.

Xiuyu-Li commented Apr 2, 2023

Nvm. I found that the reported results can be reproduced by using --eval --new-eval instead of --benchmark 2048 --check.

USBhost (Contributor, Author) commented Apr 2, 2023

For 7b, really, @Xiuyu-Li?

--new-eval does not affect Wiki2.

Xiuyu-Li commented Apr 2, 2023

> For 7b, really, @Xiuyu-Li?

Yes. I got 5.78 when calibrated with wiki2 and 5.83 when calibrated with c4, both using a group size of 128. You need to call llama_eval with --eval to be consistent with what GPTQ did; --benchmark evaluates the model on a different set of data.
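
For anyone else comparing the two numbers, here is a rough sketch of what an --eval-style perplexity pass computes. It assumes a HuggingFace-style causal LM interface and pre-tokenized test text, and it is not the exact llama_eval implementation: the whole test split is cut into fixed-length windows and the average negative log-likelihood over all of them is exponentiated, whereas --benchmark reports PPL on a single chunk of different data.

```python
# Rough sketch of full-dataset perplexity (what an --eval pass computes);
# not the exact llama_eval code from GPTQ-for-LLaMa.
import math
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor, seqlen: int = 2048) -> float:
    # input_ids: [1, N] token ids for the whole test set (e.g. the wikitext-2 test split)
    nlls = []
    nchunks = input_ids.shape[1] // seqlen
    for i in range(nchunks):
        chunk = input_ids[:, i * seqlen:(i + 1) * seqlen]
        out = model(chunk, labels=chunk)        # HF-style call returns the mean token loss
        nlls.append(out.loss.float() * seqlen)  # undo the mean to get the window's total NLL
    return math.exp(torch.stack(nlls).sum().item() / (nchunks * seqlen))
```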

USBhost (Contributor, Author) commented Apr 2, 2023

> For 7b, really, @Xiuyu-Li?
>
> Yes. I got 5.78 when calibrated with wiki2 and 5.83 when calibrated with c4, both using a group size of 128. You need to call llama_eval with --eval to be consistent with what GPTQ did; --benchmark evaluates the model on a different set of data.

Oh, a group size.
