learn_bpe.py code question #116

Closed
lzp-man opened this issue Jul 16, 2022 · 1 comment

lzp-man commented Jul 16, 2022

In learn_bpe.py, the function prune_stats contains the following code:
    for item,freq in list(stats.items()):
        if freq < threshold:
            del stats[item]
            if freq < 0:
                big_stats[item] += freq
            else:
                big_stats[item] = freq
Why can freq be below zero? What is this conditional for?

@rsennrich
Owner

For efficiency reasons, we keep two dictionaries:

  • big_stats, which is the full collection of symbol pairs
  • stats, which is a pruned version of big_stats that initially only contains the most frequent symbol pairs, but is regularly synced with big_stats. Most operations (updating symbol pair frequencies; finding the most frequent one) are done on stats.

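If a symbol pair is not among the most frequent pairs, it may be pruned from stats (where the default frequency is 0). However, the frequency of a symbol pair can still decrease as neighbouring symbols are merged: if you have "a b c" and "b c" is merged, the frequency of "a b" decreases. When that happens to a pair that has already been pruned, its entry in stats starts again from 0 and becomes negative. That's how frequencies in stats can become negative, and why the code you're quoting makes sure to update big_stats correspondingly: a negative value is a change accumulated since pruning, so it is added to the count in big_stats rather than overwriting it.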
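To make this concrete, here is a minimal sketch of the effect. The loop body is the one quoted in the question; the pair counts, the threshold of 10, and the use of defaultdict(int) for both dictionaries are assumptions chosen for illustration, and the "merge" is simulated by hand rather than by running the real script.

```python
from collections import defaultdict


def prune_stats(stats, big_stats, threshold):
    """Same logic as the snippet quoted in the question."""
    for item, freq in list(stats.items()):
        if freq < threshold:
            del stats[item]
            if freq < 0:
                # Negative value: a change accumulated after the pair was
                # pruned, so apply it on top of the stored full count.
                big_stats[item] += freq
            else:
                # Non-negative value: the pair's current count.
                big_stats[item] = freq


# Toy counts (illustrative only).
big_stats = defaultdict(int, {("a", "b"): 5, ("b", "c"): 40})
stats = defaultdict(int, big_stats)

# First prune: ("a", "b") is rare, so it is dropped from stats.
prune_stats(stats, big_stats, threshold=10)

# Simulate merging "b c" in two places where it was preceded by "a":
# the count of ("a", "b") drops by 2. The pair is no longer in stats,
# so the default of 0 is used and the entry becomes -2.
stats[("a", "b")] -= 2

# Second prune: the negative entry is removed again and folded into big_stats.
prune_stats(stats, big_stats, threshold=10)

print(big_stats[("a", "b")])  # 5 + (-2)
```

After the second call, big_stats holds the corrected count for ("a", "b"), while stats again contains only the frequent pair ("b", "c").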
