Unique Method #9

Closed
RyPeck opened this issue Jun 6, 2016 · 3 comments
@RyPeck

RyPeck commented Jun 6, 2016

Hey Shawn - at PyCon 2016, one of the problems you mentioned was guaranteeing that all integers in a list are unique, in a way that stays efficient for large sets of data, right?

@shawnbrown
Owner

shawnbrown commented Jun 6, 2016

Yes! Thanks for filing this issue--I should have added it soon after the discussion at the conference but was distracted with talks and events.

I hesitated to implement an "assert unique" method because I didn't want to add a feature that would break on larger-than-memory data sets, but I didn't want to just persist everything to a temp file either. (The method should assert uniqueness for any given data type, not just integers, though integers were the use case we discussed.)

I did some thinking about it and I'm wondering if I could use a Bloom filter to take care of most of the work and then make a second pass over the data to eliminate any false positives. There may be an even better approach that I can't see right now, so if you have an idea, I'd be glad to hear it. The Bloom filter approach seems promising, though.
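
For anyone following along, here is a minimal sketch of how the two-pass Bloom filter idea described above could work. It is not datatest code: the `BloomFilter` class, `find_duplicates()`, and the `make_iter` callable are all hypothetical names invented for illustration, and the filter size and hash count are arbitrary defaults. It assumes the data can be iterated twice and that equal values share the same `repr()`.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter backed by a bytearray (illustrative, not tuned)."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive the bit positions from one SHA-256 digest (double hashing).
        # Assumption: equal items have identical repr() strings.
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_and_check(self, item):
        """Add item; return True if it was *possibly* already present."""
        possibly_seen = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                possibly_seen = False
                self.bits[byte] |= mask
        return possibly_seen


def find_duplicates(make_iter, num_bits=1 << 20, num_hashes=4):
    """Return the set of values that occur more than once.

    make_iter is a hypothetical zero-argument callable that returns a
    fresh iterator over the data each time it is called.
    """
    # Pass 1: collect candidates (true duplicates plus false positives).
    bloom = BloomFilter(num_bits, num_hashes)
    candidates = set()
    for item in make_iter():
        if bloom.add_and_check(item):
            candidates.add(item)

    # Pass 2: count only the candidates exactly, which drops false positives.
    counts = Counter(item for item in make_iter() if item in candidates)
    return {item for item, n in counts.items() if n > 1}
```

Only the bit array and the candidate set are held in memory, so this helps when duplicates are rare. If most values are flagged as candidates (many duplicates, or a filter that is too small), the candidate set grows toward the size of the data itself.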

@shawnbrown
Owner

There's a bit of refactoring that I want to do first (some magic removal) but I'll look at implementing this "assert unique" behavior soon after.

@shawnbrown
Owner

shawnbrown commented Jun 22, 2016

The initial work for this is done: 055f438. The new assertSubjectUnique() method (originally named assertDataUnique()) can be used for asserting that column values, or combinations of column values, are unique. Duplicate values will be raised as "Extra" differences.

Currently, the implementation is unoptimized--it cannot run on data larger than available RAM. I've opened issue #13 for the planned, future optimization.
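
To illustrate why an unoptimized check is bound by RAM, here is a rough sketch of a simple in-memory approach. This is not the datatest implementation: `find_duplicates_in_memory()` and its `rows` and `columns` parameters are hypothetical, and rows are assumed to be dictionaries keyed by column name.

```python
def find_duplicates_in_memory(rows, columns):
    """Return repeated column-value combinations, in order of discovery.

    Every distinct key is stored in `seen`, so memory use grows with
    the number of unique values in the data.
    """
    seen = set()
    duplicates = []
    for row in rows:
        # Build a hashable key from the selected column values.
        key = tuple(row[col] for col in columns)
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
    return duplicates
```

For example, checking a single column would pass `["id"]`, while checking a combination would pass something like `["id", "name"]`.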

If anyone needs this method before the next release, you can install the development release with:

pip install --upgrade https://github.com/shawnbrown/datatest/archive/master.zip

shawnbrown added this to the 0.7.0 milestone Jul 7, 2016