Problem when reproducing experiment #5

Closed
cat538 opened this issue Mar 27, 2024 · 1 comment
Comments


cat538 commented Mar 27, 2024

Thanks for the great work!
I'm having a small problem reproducing the PPL results in the paper. I used the code snippet from the GPTQ repo for measuring PPL and was able to reproduce the FP16 baselines reported in the paper for the LLaMA family, but I was unable to reproduce the FP16 baseline for Mistral-7B using the same test code:

import torch
import torch.nn as nn
from datasets import load_dataset
from tqdm import tqdm

# `model`, `tokenizer`, and `input_len` (the evaluation sequence length) are assumed
# to be defined already, e.g. a Mistral-7B checkpoint loaded with transformers.

testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
testenc = tokenizer("\n\n".join(testdata['text']), return_tensors='pt')["input_ids"]

nsamples = testenc.numel() // input_len
nlls = []

loss_fct = nn.CrossEntropyLoss()
for i in tqdm(range(nsamples)):
    batch = testenc[:, (i * input_len) : ((i + 1) * input_len)].to(model.device)
    outputs = model.model(batch)
    hidden_states = outputs[0]
    logits = model.lm_head(hidden_states)
    # Next-token prediction: drop the last logit and the first label
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = batch[:, 1:].to(model.lm_head.weight.device)
    loss = loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    neg_log_likelihood = loss.float() * input_len
    nlls.append(neg_log_likelihood)

ppl = torch.exp(torch.stack(nlls).sum() / (nsamples * input_len)).item()

Specifically, I used Mistral-7B-v0.1 and tried both seqlen=8000 and seqlen=8192; in both cases the result is slightly lower than the number in the paper, which has been giving us a bit of trouble.

Would you be willing to release the code you used for measuring PPL?

@chooper1
Collaborator

chooper1 commented Apr 5, 2024

Thank you for your interest in our work!

I was able to reproduce this. The gap in scores depends on which attention implementation Transformers uses. I measured our PPL numbers with seqlen=8192 using "_attn_implementation": "eager" in the config.json file. Newer versions of transformers default to "_attn_implementation": "sdpa" instead. When using "sdpa" I get 4.73 instead of 4.76. Let me know if this doesn't fix the issue.
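
For reference, the same setting can also be forced at load time instead of editing config.json. This is just a minimal sketch, assuming a recent transformers version (>= 4.36, which supports the attn_implementation argument) and the standard Hugging Face model id:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # assumed model id, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="eager",  # match the paper's eager setting rather than the "sdpa" default
    device_map="auto",
)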

@chooper1 chooper1 closed this as completed Apr 5, 2024