Trailing \x02 byte on restore_key result #10

Closed
mjwillson opened this issue Feb 6, 2014 · 9 comments

Comments

@mjwillson

Like so:

In [46]: marisa_trie.Trie([u'foo', u'bar']).restore_key(0)
Out[46]: u'bar\x02'

This doesn't happen if I first get the key_id for that key:

In [48]: t = marisa_trie.Trie([u'foo', u'bar'])

In [49]: t.key_id(u'bar')
Out[49]: 0

In [50]: t.restore_key(0)
Out[50]: u'bar'

If it's part of the contract that key_id must be called before restore_key, then that should probably be documented. Ideally it would raise some kind of exception when the contract is violated rather than silently return an incorrect result.
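A check along the following lines would surface the symptom instead of letting a wrong key pass silently. It is only a sketch: the helper name find_bad_restores is hypothetical and not part of marisa_trie, it uses only the restore_key method shown above, and it assumes that unique keys are assigned the IDs 0 to n - 1.

```python
def find_bad_restores(trie, keys):
    """Return (key_id, restored) pairs whose restored key is not an original key.

    `trie` is assumed to expose restore_key(key_id), as marisa_trie.Trie
    does in the session above.
    """
    expected = set(keys)
    bad = []
    # Assumption: unique keys get the IDs 0 .. len(expected) - 1.
    for key_id in range(len(expected)):
        restored = trie.restore_key(key_id)
        if restored not in expected:
            bad.append((key_id, restored))
    return bad
```

For a correctly behaving trie built from `keys`, the returned list should be empty; a result such as u'bar\x02' would show up as a (key_id, restored) pair.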

@kmike
Member

kmike commented Feb 6, 2014

No, this is not part of the contract AFAIK; this looks like a bug.

@mjwillson
Author

Cheers
Actually the thing with doing a key_id beforehand may be a red herring -- the bug seems to disappear with the following innocuous change too:

In [81]: marisa_trie.Trie([u'foo', u'bar']).restore_key(0)
Out[81]: u'bar\x02'

In [82]: t = marisa_trie.Trie([u'foo', u'bar']); t.restore_key(0)
Out[82]: u'bar'

I'm guessing perhaps the Trie is getting garbage-collected in the first instance, but is returning a string whose memory is backed by that freed up space?

@mjwillson
Author

It seems to be a slightly weird intermittent bug anyway (or at least it's hard to pin down what triggers it).

@mjwillson
Author

Not sure it's GC-related either, as it still happens if I call gc.disable().

Sometimes it happens on all runs after the first run:

In [3]: t = marisa_trie.Trie([u'foo', u'bar']); t.restore_key(0)
Out[3]: u'bar'

In [4]: t = marisa_trie.Trie([u'foo', u'bar']); t.restore_key(0)
Out[4]: u'bar\x02'

Sometimes I'm getting this error too:

In [2]: t = marisa_trie.Trie([u'foo', u'bar']); t.restore_key(0)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-663a54335fc8> in <module>()
----> 1 t = marisa_trie.Trie([u'foo', u'bar']); t.restore_key(0)

/usr/local/lib/python2.7/dist-packages/marisa_trie.so in marisa_trie.Trie.restore_key (src/marisa_trie.cpp:4794)()

/usr/local/lib/python2.7/dist-packages/marisa_trie.so in marisa_trie.Trie.restore_key (src/marisa_trie.cpp:4728)()

UnicodeDecodeError: 'utf8' codec can't decode byte 0x85 in position 3: invalid start byte
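Both symptoms are consistent with one or more stray bytes being decoded along with the key. The sketch below uses made-up byte strings (not output from marisa_trie) to show how a stray low byte decodes silently while a stray high byte such as 0x85 makes the UTF-8 decoder fail at position 3.

```python
# Hypothetical byte strings: the key "bar" followed by one stray byte.
with_stray_low = b"bar\x02"
with_stray_high = b"bar\x85"

# 0x02 is a valid (control) character in UTF-8, so it decodes silently.
decoded = with_stray_low.decode("utf8")
print(repr(decoded))

# 0x85 is a UTF-8 continuation byte and cannot start a character.
error = None
try:
    with_stray_high.decode("utf8")
except UnicodeDecodeError as exc:
    error = exc
    print("%s at position %d" % (exc.reason, exc.start))
```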

@kmike
Member

kmike commented Feb 7, 2014

This is reproducible, and it's a weird bug! I'll try to get to it this weekend.

@sisukapalli

Hi, I had the same problem just now and ended up at this page. I have a somewhat large trie (2G), and found that it was not working under IPython but was fine from the command line:

The following two work fine:
(1) echo 0 | marisa-reverse-lookup -r TRIEFILE.marisa
(2) python -c "from marisa_trie import Trie; print Trie().load('TRIEFILE.marisa').restore_key(0)"

The third one (in a running instance of IPython) fails:
(3) marisa_trie.Trie().load('TRIEFILE.marisa').restore_key(0)

However, it works in a new IPython instance.

Nothing very informative, but one more data point.

@kmike
Member

kmike commented Feb 8, 2014

Thanks for the extra info.

It is interesting that this issue can be reproduced in an IPython shell, but doesn't manifest itself in a regular Python shell. A test case for it also doesn't fail.

I tried different IPython versions; @mjwillson's example works fine in IPython 0.10 but fails in 0.11+.

@kmike
Member

kmike commented Feb 8, 2014

Also, it works fine in IPython 1.1 under Python 3.3.

@kmike kmike closed this as completed in 97f41d8 Feb 8, 2014
@kmike
Member

kmike commented Feb 8, 2014

@mjwillson @sisukapalli thanks for the info! This bug should be fixed in 0.5.2. It turned out that IPython vs. plain Python was a red herring: the restore_key method was building the result incorrectly.

Maybe the memory layout is different when code is executed in IPython, with more non-zero bytes in memory. That could be why the problem shows up only in the IPython shell. When the byte after the end of the string was zero, restore_key returned the correct result.
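The effect described above can be illustrated with ctypes. This is only a sketch of the general failure mode (reading a buffer up to the first zero byte instead of using a known length); it is not the actual marisa-trie code or the fix, and the buffer contents and key_length value are made up.

```python
import ctypes

# Simulated memory: the key bytes followed by a stray non-zero byte.
# create_string_buffer appends a terminating NUL after the given bytes.
raw = ctypes.create_string_buffer(b"bar\x02")
key_length = 3  # assumed to be known separately from the buffer contents

address = ctypes.addressof(raw)

# Reading up to the first NUL byte picks up whatever follows the key.
until_nul = ctypes.string_at(address)

# Reading exactly key_length bytes returns only the key.
by_length = ctypes.string_at(address, key_length)

print(repr(until_nul))
print(repr(by_length))
```

If the byte right after the key happens to be zero, both reads agree, which would match the observation that the result was correct in that case.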
