Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding edges() and iteredges() Functions for DAWGs #1

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

EliFinkelshteyn
Copy link

As discussed at pytries/marisa-trie#20, this is support for adding the edges() and iteredges() methods for CompletionDAWG. If this looks good, I'll add similar support for RecordDAWGs and ByteDAWGs. The code isn't as optimized as it could be, but it works, it's clean (IMO), and it's fast enough for me.

@coveralls
Copy link

Coverage Status

Coverage decreased (-2.82%) to 91.02% when pulling fa6cd76 on EliFinkelshteyn:master into 3a56922 on kmike:master.

2 similar comments
@coveralls
Copy link

Coverage Status

Coverage decreased (-2.82%) to 91.02% when pulling fa6cd76 on EliFinkelshteyn:master into 3a56922 on kmike:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-2.82%) to 91.02% when pulling fa6cd76 on EliFinkelshteyn:master into 3a56922 on kmike:master.

@EliFinkelshteyn
Copy link
Author

These latest additions add edges() and iteredges() functionality for all applicable DAWGs and clean up the code since the original pull request. They complete all the work I planned to implement that we originally discussed. Would love to hear your thoughts @kmike.

if prefix:
index = self.dct.follow_bytes(prefix, index)
if not index:
return res
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is backwards incompatible - .items should return an empty list, not None here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Will fix (and add a test for future).

@coveralls
Copy link

Coverage Status

Coverage decreased (-4.37%) to 89.46% when pulling 30bf53b on EliFinkelshteyn:master into 3a56922 on kmike:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage decreased (-4.37%) to 89.46% when pulling 30bf53b on EliFinkelshteyn:master into 3a56922 on kmike:master.

@@ -95,6 +95,77 @@ def size(self):
return len(self._units)


class EdgeFollower(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for separating Completer and EdgeFollower

@coveralls
Copy link

Coverage Status

Coverage decreased (-4.2%) to 89.64% when pulling 15355be on EliFinkelshteyn:master into 3a56922 on kmike:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage decreased (-4.2%) to 89.64% when pulling 15355be on EliFinkelshteyn:master into 3a56922 on kmike:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-4.03%) to 89.81% when pulling dee560c on EliFinkelshteyn:master into 3a56922 on kmike:master.

yield item

def edges(self, prefix=""):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that .edges method should return the same data regardless of DAWG class. It it returns a list of strings in a base class it should return a list of strings in all subclasses.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For BytesDAWG it could make sense to filter out edges leading to the values.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's similar data for all. It never returns a list of strings. It always returns a list of 2-tuples. For dawgs with no data, the tuples are (str, True) for terminal edges and (str, False) for non-terminals.

For dawgs with data, they're (str, data) for terminal edges, and (str, False) for non-terminals. Since data evaluates to true in a boolean situation, this seems most logical to me. If you want the data in an edge, you have it. If you want to just use the edges and know whether they're terminals or not, you can do that the same way across dawgs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we really want them to be the same, we could make them return (str, True) for terminal edges always, and just add an extra edges_with_data() method for dawgs that provide any kind of data storage. That actually seems most consistent to me. If you agree, I'll make that addition.

@coveralls
Copy link

Coverage Status

Coverage decreased (-3.85%) to 89.98% when pulling 2a93173 on EliFinkelshteyn:master into 3a56922 on kmike:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage decreased (-3.85%) to 89.98% when pulling 2a93173 on EliFinkelshteyn:master into 3a56922 on kmike:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-4.96%) to 88.87% when pulling 8cb08f3 on EliFinkelshteyn:master into 3a56922 on kmike:master.

@EliFinkelshteyn
Copy link
Author

@kmike latest change makes edges() and iter_edges() always return (str, boolean) to denote what the edge key is and whether or not it's a terminal. It also adds edges_data() and edgesiter_data() for dawgs that store data which always return (str, data or None).

Everything looks done as far as I can tell. I'd like to start using this in prod soon. I can use my fork, but if you plan to merge soon, that would be even better.

@coveralls
Copy link

Coverage Status

Coverage decreased (-4.93%) to 88.91% when pulling f3baac8 on EliFinkelshteyn:master into 3a56922 on kmike:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-3.93%) to 89.91% when pulling ae7472a on EliFinkelshteyn:master into 3a56922 on kmike:master.

@kmike
Copy link
Member

kmike commented Apr 20, 2015

Hi @EliFinkelshteyn,

Everything looks done as far as I can tell. I'd like to start using this in prod soon. I can use my fork, but if you plan to merge soon, that would be even better.

It seems the main complexity is that some characters are represented by multiple transitions, right? You solved it by trying to decode data until is succeeds, which is reasonable for UTF-8.

Regarding the API - so .edges is like .keys, but it only traverses graph to depth of 1 unicode character, and also returns if the result is terminal or not? I think it is reasonable.

One question is whether it should return full keys or partial keys, without the prefix. You've implemented it the same way as Completer, which looks fine.

Could you please add more tests? For example, based on https://coveralls.io/builds/2376072/source?filename=dawg_python%2Fwrapper.py, the code which handles UnicodeDecodeErrors is untested; some conditions are also missing in dawgs.py (see https://coveralls.io/builds/2376072/source?filename=dawg_python%2Fdawgs.py).

Thanks for your PR! It is wel-written 👍 But I need a bit more time to review it. I'm not sure I'll be able to finish the review during this work week; weekend is more likely.

@EliFinkelshteyn
Copy link
Author

So, there's actually an issue here. When unicode chars share the same first bytes, this will only return one of the chars. I am working on fixing that now. I realized you can tell exactly how many bytes are in a unicode char by how many leading ones the first byte has, so I can use this to speed up the whole thing a bit as well.

@kmike
Copy link
Member

kmike commented Apr 27, 2015

A good catch. For some reason I thought that UTF8 synchronization is enough to make repeated decoding work, but it is not.

@coveralls
Copy link

Coverage Status

Coverage decreased (-2.12%) to 91.72% when pulling 5462916 on EliFinkelshteyn:master into 3a56922 on kmike:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage decreased (-2.12%) to 91.72% when pulling 5462916 on EliFinkelshteyn:master into 3a56922 on kmike:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.82%) to 92.02% when pulling 0b81a9f on EliFinkelshteyn:master into 3a56922 on kmike:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.82%) to 92.02% when pulling 2cbd340 on EliFinkelshteyn:master into 3a56922 on kmike:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage decreased (-1.82%) to 92.02% when pulling 2cbd340 on EliFinkelshteyn:master into 3a56922 on kmike:master.

@EliFinkelshteyn
Copy link
Author

That was my bad. I didn't know python3.2 has an issue with 'u'. I'm also just tired, so this took way longer than it should have. Should all be fixed and working now though.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.82%) to 92.02% when pulling f56e2b9 on EliFinkelshteyn:master into 3a56922 on kmike:master.

@EliFinkelshteyn
Copy link
Author

Ping here. Anything else you want done for this to be merged in?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants