learn_bpe.py code question #116

lzp-man · 2022-07-16T01:31:48Z

In learn_bpe.py, the function prune_stats contains the following code:
    for item,freq in list(stats.items()):
        if freq < threshold:
            del stats[item]
            if freq < 0:
                big_stats[item] += freq
            else:
                big_stats[item] = freq
Why can freq be below zero? What is this conditional check for?

rsennrich · 2022-07-19T08:03:26Z

For efficiency reasons, we keep two dictionaries:

- big_stats, which is the full collection of symbol pairs
- stats, which is a pruned version of big_stats that initially only contains the most frequent symbol pairs, but is regularly synced with big_stats. Most operations (updating symbol pair frequencies; finding the most frequent one) are done on stats.

If a symbol pair is not among the most frequent pairs, it may be pruned from stats, after which its frequency in stats defaults to 0. However, the frequency of a symbol pair can still decrease as neighbouring symbols are merged (if you have "a b c" and "b c" is merged, then the frequency of "a b" decreases). That's how frequencies in stats can become negative: a negative value is the change accumulated since the pair was pruned, which is why the code you're quoting adds it to big_stats instead of overwriting the stored count.
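
To make this concrete, here is a minimal sketch of the behaviour described above. The body of prune_stats is the code quoted in the question (indentation is assumed, and the function signature is inferred from the names used in the body); the toy pairs, counts, and threshold are made up for illustration and do not come from learn_bpe.py.

```python
from collections import defaultdict

def prune_stats(stats, big_stats, threshold):
    """Drop pairs below `threshold` from `stats`, syncing them into `big_stats`."""
    for item, freq in list(stats.items()):
        if freq < threshold:
            del stats[item]
            if freq < 0:
                # Negative: a change accumulated since the pair was last pruned.
                big_stats[item] += freq
            else:
                # Non-negative: the pair's current full count.
                big_stats[item] = freq

# Hypothetical toy counts; both dictionaries default missing pairs to 0.
stats = defaultdict(int, {("a", "b"): 3, ("x", "y"): 10})
big_stats = defaultdict(int, stats)

# First prune: ("a", "b") falls below the threshold and leaves stats.
prune_stats(stats, big_stats, threshold=5)
pruned_once = ("a", "b") not in stats

# Suppose a later merge of "b c" removes two occurrences of "a b".
# The pair is no longer in stats, so the update starts from the default 0.
stats[("a", "b")] -= 2
delta = stats[("a", "b")]

# Second prune: the negative delta is added to the stored full count.
prune_stats(stats, big_stats, threshold=5)
final_count = big_stats[("a", "b")]
```

In this toy run the pair ("a", "b") is stored in big_stats with count 3, picks up a delta of -2 in stats, and ends with a count of 1 in big_stats after the second prune. Overwriting with `big_stats[item] = freq` in the negative case would have replaced the true count with the delta.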

rsennrich closed this as completed Jul 22, 2022
