Unique Method #9

RyPeck · 2016-06-06T01:51:02Z

Hey Shawn - one of the problems you were speaking about at PyCon 2016 was guaranteeing that all integers in a list are unique, in a way that stays efficient for large sets of data, right?

shawnbrown · 2016-06-06T02:39:43Z

Yes! Thanks for filing this issue--I should have added it soon after the discussion at the conference but was distracted with talks and events.

I hesitated to implement an "assert unique" method because I didn't want to add a feature that would break on larger-than-memory sets of data, but I also didn't want to just persist everything to a temp file. (The method should assert uniqueness for any given data type, not just integers, though integers were the use case discussed.)

I did some thinking about it and I'm wondering if I could use a Bloom filter to take care of most of the work and then make a second pass over the data to eliminate any false positives. But maybe there's an even better approach that I just can't see right now. If you have an idea yourself, I'd be glad to hear it. I think the Bloom filter approach is promising, though.
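
For illustration, here is a minimal sketch of the two-pass Bloom filter idea described above. It is not datatest code: the BloomFilter class, the find_duplicates() function, the bit-array size, and the hash count are all hypothetical choices made for this example. It assumes the data can be iterated twice (make_iter is a callable that returns a fresh iterator each time) and that equal values share the same repr().

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter for illustration; the default sizes are arbitrary."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Assumption: equal values have identical repr() output.
        data = repr(item).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def find_duplicates(make_iter, bloom=None):
    """Return the set of values that occur more than once."""
    bloom = bloom if bloom is not None else BloomFilter()

    # Pass 1: anything the filter claims to have seen is only a candidate,
    # because a Bloom filter can report false positives (never false negatives).
    candidates = set()
    for item in make_iter():
        if item in bloom:
            candidates.add(item)
        else:
            bloom.add(item)

    # Pass 2: count exact occurrences of the candidates only, which
    # eliminates the false positives from the first pass.
    counts = Counter(item for item in make_iter() if item in candidates)
    return {item for item, n in counts.items() if n > 1}
```

Only the candidate set has to fit in memory alongside the fixed-size bit array, so the sketch stays small as long as true duplicates and false positives are rare. A filter that is too small for the data produces many false positives and loses that advantage.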

shawnbrown · 2016-06-07T01:51:15Z

There's a bit of refactoring that I want to do first (some magic removal) but I'll look at implementing this "assert unique" behavior soon after.

shawnbrown · 2016-06-22T15:05:54Z

The initial work for this is done: 055f438. The new ~~assertDataUnique()~~ assertSubjectUnique() method can be used for asserting that column values, or combinations of column values, are unique. Duplicate values will be raised as "Extra" differences.

Currently, the implementation is unoptimized--it cannot run on data larger than available RAM. I've opened issue #13 for the planned, future optimization.
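
To show why an unoptimized check is bounded by RAM, here is a rough sketch of an in-memory approach. It is not the actual datatest implementation: duplicate_values() and its parameters are hypothetical names, and rows are assumed to be dictionaries keyed by column name (as csv.DictReader would produce).

```python
def duplicate_values(rows, columns):
    """Return repeated values for a column or combination of columns.

    Every distinct value is kept in a set, so memory use grows with the
    number of distinct values in the data.
    """
    seen = set()
    duplicates = []
    for row in rows:
        # A single column gives a 1-tuple; several columns give a combination.
        key = tuple(row[col] for col in columns)
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
    return duplicates


rows = [
    {"id": "1", "state": "OH"},
    {"id": "2", "state": "OH"},
    {"id": "1", "state": "OH"},
]
print(duplicate_values(rows, ["id"]))           # [('1',)]
print(duplicate_values(rows, ["id", "state"]))  # [('1', 'OH')]
```

Each repeat beyond the first occurrence is reported once, which roughly mirrors the idea of surfacing duplicates as extra values.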

If anyone needs this method before the next release, you can install the development release with:

pip install --upgrade https://github.com/shawnbrown/datatest/archive/master.zip

shawnbrown added the enhancement label Jun 6, 2016

shawnbrown mentioned this issue Jun 22, 2016

Optimize assertSubjectUnique() method. #13

Closed

shawnbrown closed this as completed Jun 22, 2016

shawnbrown added this to the 0.7.0 milestone Jul 7, 2016
