
Fix misalignment between token offsets returned from the API and samples in the UI #821

Merged
2 commits merged into main on Dec 15, 2023

Conversation

jonathangomesselman
Contributor

Issue Description

Right now, for Decoder-Only models, there is a small misalignment bug between what is processed by the DQ client and what we see in the UI. Essentially, in many situations there is a disconnect between what the model "sees" (the input it returns alignment / token data on) and the string that we display in the UI.

Current Flow

  • User logs string column - this is what will be displayed in the UI and what we expect to compute aligned token info for. However, we cannot directly compute alignment / token info from it, since decoder-only models process the full input + target together, through what we call the formatted_prompt
  • User logs formatted_prompt
    • Locate response template
    • Slice off all characters / tokens following the response template
    • These tokens (and their direct string representation) are what the model is seeing and what we use to compute token alignment and token likelihoods.

From this flow, we need the logged text and the sliced response text (from formatted_prompt) to EXACTLY match. But currently they don't always, for example in the case below:

response_template = "### response:"
target = "This is a test"
formatted_prompt = f"Input ### response: {target}"
...
sliced_response_text = " This is a test"

Here we see there is an added space in the sliced_response_text. While subtle, and honestly often hard to see in the UI, for certain cases this off-by-one can really screw things up.
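The off-by-one can be reproduced with a few lines of plain Python. This is a minimal sketch of the slicing step described above, not the actual DQ client code:

```python
# Minimal sketch of the response-template slicing step (illustrative only,
# not the actual DQ client implementation).
response_template = "### response:"
target = "This is a test"
formatted_prompt = f"Input ### response: {target}"

# Slice off everything after the response template -- this is the string
# whose tokens the model is scored on.
start = formatted_prompt.index(response_template) + len(response_template)
sliced_response_text = formatted_prompt[start:]

print(repr(sliced_response_text))       # ' This is a test' -- leading space!
print(sliced_response_text == target)   # False: off by one character
```

The leading space exists because the response template in the formatted prompt is followed by a separator character that is not part of the logged target string.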

Solution

The proposed solution is to stop having the user log labels / targets, and instead directly infer them as the sliced_response_text. This way, there will be no discrepancy between what the model sees and what is in the UI!

This change works quite well with the current system design because, for computing token alignment, we already decode the response tokens in the function align_response_tokens_to_character_spans.
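The idea can be sketched as follows, with a toy whitespace "tokenizer" standing in for the model's real tokenizer (the real code path decodes inside align_response_tokens_to_character_spans; `toy_encode` / `toy_decode` are hypothetical stand-ins for illustration only). The key property: the target shown in the UI is decoded from the response tokens themselves, so UI text and model input cannot drift apart.

```python
# Hedged sketch of the proposed fix. toy_encode / toy_decode are stand-ins
# for the model tokenizer's encode / decode -- NOT real dataquality APIs.

def toy_encode(text: str) -> list[str]:
    # Stand-in for tokenizer(text)["input_ids"]
    return text.split()

def toy_decode(tokens: list[str]) -> str:
    # Stand-in for tokenizer.decode(tokens)
    return " ".join(tokens)

formatted_prompt = "Input ### response: This is a test"
response_template = "### response:"

# Same slicing as before: the response tokens are everything after the template.
start = formatted_prompt.index(response_template) + len(response_template)
response_tokens = toy_encode(formatted_prompt[start:])

# No user-logged label: the target displayed in the UI *is* the decoded tokens.
inferred_target = toy_decode(response_tokens)
print(inferred_target)  # "This is a test"
```

With a real tokenizer the decoded string would also surface special tokens such as EOS, which is what enables the first benefit listed below.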

Other benefits are:

  • We get, out of the box, the ability to show the EOS token and other special tokens, which we have wanted to make the default anyway! NOTE this is just for Decoder-Only models.
  • The user no longer needs to log labels! Just Inputs and Formatted Prompt

Demonstration

Example run with the error: run

If you look at the token DEPs you can see that the highlighted tokens don't make sense / are clearly off by one. The tokenizer would never encode {"m as one token. It really means to encode  {" but the space is missing!

Example run with the fix: run

In the first sample you can see that  {" is properly tokenized! Looking at the other samples, you can also see that the space appears before tokens, NOT after. In retrospect, this was another sign I should have caught in earlier runs but did not notice!

@codecov-commenter

codecov-commenter commented Dec 15, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (657e1cb) 86.59% compared to head (6d2745f) 86.57%.

Files Patch % Lines
...aquality/loggers/data_logger/seq2seq/formatters.py 20.00% 4 Missing ⚠️
...uality/loggers/data_logger/seq2seq/seq2seq_base.py 66.66% 2 Missing ⚠️
dataquality/utils/seq2seq/offsets.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #821      +/-   ##
==========================================
- Coverage   86.59%   86.57%   -0.03%     
==========================================
  Files         196      196              
  Lines       15773    15779       +6     
==========================================
+ Hits        13658    13660       +2     
- Misses       2115     2119       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jonathangomesselman
Contributor Author


This pull request has been linked to Shortcut Story #10287: [DQ] Fix alignment issue for Decoder-Only models.

@elboy3 elboy3 merged commit 908af1d into main Dec 15, 2023
3 checks passed
@elboy3 elboy3 deleted the fix/decoder-alignment branch December 15, 2023 23:30