
Fix misalignment between token offsets returned from the API and samples in the UI #821

Merged
2 commits merged into main on Dec 15, 2023

Conversation

jonathangomesselman
Contributor

Issue Description

Right now, for Decoder-Only models, there is a small misalignment bug between what is processed by the DQ client and what we see in the UI. Essentially, in many situations there is a disconnect between what the model "sees" (the input it returns alignment / token data on) and the string that we display in the UI.

Current Flow

  • User logs string column - this is what will be displayed in the UI and what we expect to compute aligned token info for. However, we cannot directly compute alignment / token info from it, since decoder-only models process the full input + target together, through what we call the formatted_prompt
  • User logs formatted_prompt
    • Locate response template
    • Slice off all characters / tokens following the response template
    • These tokens (and their direct string representation) are what the model is seeing and what we use to compute token alignment and token likelihoods.

From this flow, we need the logged text and the sliced response text (from formatted_prompt) to EXACTLY match. But currently they don't always, for example in the case below:

response_template = "### response:"
target = "This is a test"
formatted_prompt = f"Input ### response: {target}"
...
sliced_response_text = " This is a test"

Here we see there is an added space in the sliced_response_text. While subtle, and honestly often hard to see in the UI, for certain cases this off-by-one can really screw things up.
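The off-by-one can be reproduced with a few lines of plain Python. This is a minimal sketch of the slicing step described above, not the actual DQ client code:

```python
# Minimal sketch of the response-template slicing step (illustrative only,
# not the actual DQ client implementation).
response_template = "### response:"
target = "This is a test"
formatted_prompt = f"Input ### response: {target}"

# Slice off everything after the response template -- this is the string
# whose tokens the model is scored on.
start = formatted_prompt.index(response_template) + len(response_template)
sliced_response_text = formatted_prompt[start:]

print(repr(sliced_response_text))       # ' This is a test' -- leading space!
print(sliced_response_text == target)   # False: off by one character
```

The leading space exists because the response template in the formatted prompt is followed by a separator character that is not part of the logged target string.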

Solution

The proposed solution is to stop having the user log labels / targets, and instead directly infer them as the sliced_response_text. This way, there will be no discrepancy between what the model sees and what is in the UI!

This change works quite well with the current system design because, for computing token alignment, we already decode the response tokens in the function align_response_tokens_to_character_spans.
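The idea can be sketched as follows, with a toy whitespace "tokenizer" standing in for the model's real tokenizer (the real code path decodes inside align_response_tokens_to_character_spans; `toy_encode` / `toy_decode` are hypothetical stand-ins for illustration only). The key property: the target shown in the UI is decoded from the response tokens themselves, so UI text and model input cannot drift apart.

```python
# Hedged sketch of the proposed fix. toy_encode / toy_decode are stand-ins
# for the model tokenizer's encode / decode -- NOT real dataquality APIs.

def toy_encode(text: str) -> list[str]:
    # Stand-in for tokenizer(text)["input_ids"]
    return text.split()

def toy_decode(tokens: list[str]) -> str:
    # Stand-in for tokenizer.decode(tokens)
    return " ".join(tokens)

formatted_prompt = "Input ### response: This is a test"
response_template = "### response:"

# Same slicing as before: the response tokens are everything after the template.
start = formatted_prompt.index(response_template) + len(response_template)
response_tokens = toy_encode(formatted_prompt[start:])

# No user-logged label: the target displayed in the UI *is* the decoded tokens.
inferred_target = toy_decode(response_tokens)
print(inferred_target)  # "This is a test"
```

With a real tokenizer the decoded string would also surface special tokens such as EOS, which is what enables the first benefit listed below.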

Other benefits are:

  • We get, out of the box, the ability to show the EOS token and other special tokens, which we have wanted to make the default anyway! NOTE this is just for Decoder-Only models.
  • The user no longer needs to log labels! Just Inputs and Formatted Prompt

Demonstration

Example run with the error: run

If you look at the token DEPs you can see that the highlighted tokens don't make sense / are clearly off by one. The tokenizer would never encode {"m as one token. It really means to encode  {" but the space is missing!

Example run with the fix: run

In the first sample you can see that  {" is properly tokenized! Looking at the other samples, you can also see that the space appears before tokens, NOT after. In retrospect, this was another sign I should have caught in earlier runs but did not notice!

@codecov-commenter

codecov-commenter commented Dec 15, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (657e1cb) 86.59% compared to head (6d2745f) 86.57%.

Files Patch % Lines
...aquality/loggers/data_logger/seq2seq/formatters.py 20.00% 4 Missing ⚠️
...uality/loggers/data_logger/seq2seq/seq2seq_base.py 66.66% 2 Missing ⚠️
dataquality/utils/seq2seq/offsets.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #821      +/-   ##
==========================================
- Coverage   86.59%   86.57%   -0.03%     
==========================================
  Files         196      196              
  Lines       15773    15779       +6     
==========================================
+ Hits        13658    13660       +2     
- Misses       2115     2119       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jonathangomesselman
Contributor Author


This pull request has been linked to Shortcut Story #10287: [DQ] Fix alignment issue for Decoder-Only models.

@elboy3 elboy3 merged commit 908af1d into main Dec 15, 2023
3 checks passed
@elboy3 elboy3 deleted the fix/decoder-alignment branch December 15, 2023 23:30