Scala Hadoop Shred: deduplicate event_ids with different event_fingerprints (synthetic duplicates) #24
Note that achieving this isn't currently possible with our Hive serde-based row-level deserialisation. We would need either:
/cc @yalisassoon
I was imagining the following alternative, which I was hoping would work with Cascading / Scalding (so not strictly speaking stream-based, rather batch-based), but which is in essence a streaming-based approach:
We will need to have a number of lookup tables referenced as part of a Scalding ETL, especially as we build out an OLAP-friendly version of the events table. I imagined it would make sense to use Amazon SimpleDB or RDS for the tables for simplicity, but probably worth discussing on a dedicated thread...
Yep, that could well work... Hopefully once we have built the caches required by the OLAP cube, the approach to deduplication will become clearer. One thing we should probably do at some stage is move from storing the…
To handle dupes cleanly, we need to store the hashes for all messages from the last batch, which could be duplicated in our time period. We can store these in a table in HBase: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hbase.html
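For illustration only, a minimal sketch of that idea against the standard HBase client API (the table name, column family and key layout below are assumptions, not something this ticket specifies): keep one row per event hash, and before loading a batch check whether each hash is already present.

```scala
// Sketch only: assumes an HBase table "event_hashes" with column family "m".
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object EventHashStore {
  private val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  private val table = conn.getTable(TableName.valueOf("event_hashes"))

  /** Has this event hash been seen in a previous batch? */
  def seenBefore(eventHash: String): Boolean =
    table.exists(new Get(Bytes.toBytes(eventHash)))

  /** Record the hash against the batch that first carried it. */
  def record(eventHash: String, batchId: String): Unit = {
    val put = new Put(Bytes.toBytes(eventHash))
    put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("batch"), Bytes.toBytes(batchId))
    table.put(put)
  }
}
```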
One of our oldest open tickets now. In a stream processing world, you have to make an assumption that all dupes will come from the same shard key (IP address?), and then keep all event_ids in your shard's local storage for N minutes. (Otherwise you would have to re-partition based on incoming event_id, pretty painful.) There could be an interesting abstraction in Scala Common Enrich where, as well as an EnrichmentRegistry, it is supplied with read/write access to a KV store - the specifics of the KV store are abstracted (Samza's KV store impl? Storehaus?) but Common Enrich can use it to manage dupes, sessionization, etc.
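As a rough sketch of what that abstraction might look like (the trait, class and TTL parameter below are hypothetical, not part of Scala Common Enrich), the enrichment process could be handed a minimal KV interface and use it to decide whether an event_id has already been seen on its shard within the window:

```scala
// Hypothetical sketch - not part of Scala Common Enrich. Assumes the caller
// supplies some KV store implementation (a Samza KV store, Storehaus, etc.).
trait DedupeKvStore {
  def get(key: String): Option[String]
  def put(key: String, value: String, ttlMinutes: Int): Unit
}

class ShardLocalDeduper(store: DedupeKvStore, windowMinutes: Int = 60) {

  /** Returns true the first time an event_id is seen within the window,
    * false for any repeat - the repeat can then be dropped or flagged. */
  def isFirstSighting(eventId: String): Boolean =
    store.get(eventId) match {
      case Some(_) => false
      case None =>
        store.put(eventId, "seen", windowMinutes)
        true
    }
}
```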
Renaming from "Scala Common Enrich:" to "Unnamed new module:", because in a stream processing world, any stage can introduce natural copies. So we have to embed this module in each storage sink that cares about duplicate event_ids.
Then we came up with this taxonomy:
We quickly realized that it is essential to define the ur event (ur meaning most primitive) correctly. @yalisassoon to add his whiteboard photo... |
For existing code: 8c09ade |
When you observe two events in your event stream with the same event_id, one of three things could be happening:
- a natural copy: the same event has simply been recorded or processed more than once
- a synthetic copy: duplicates of an event carrying different event_fingerprints (as in the title of this ticket)
- an ID collision: two unrelated events that happen to have been assigned the same event_id
Thinking about this further, a simple de-duplication algorithm would be:
With this approach, distinguishing between ID collisions and synthetic copies can still be done (if needed) at analysis time.
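Because the steps of the algorithm are not spelled out above, the following is only a minimal sketch of one plausible reading, with plain Scala collections standing in for the real Scalding pipeline and with field names assumed: collapse rows that agree on both event_id and event_fingerprint, and leave rows that share an event_id but differ in fingerprint untouched, so ID collisions and synthetic copies stay distinguishable at analysis time.

```scala
// Hypothetical sketch of a simple de-duplication pass (field names assumed).
case class EnrichedEvent(eventId: String, eventFingerprint: String, payload: String)

def dedupe(events: Seq[EnrichedEvent]): Seq[EnrichedEvent] =
  events
    .groupBy(e => (e.eventId, e.eventFingerprint)) // natural copies share both
    .values
    .map(_.head)                                   // keep one representative per group
    .toSeq
// Rows with the same eventId but different fingerprints all survive, so
// ID collisions and synthetic copies remain distinguishable downstream.
```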
Could this be done using Bloom filters? (http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf) Not directly:
Rather than delete some false positives (causing data loss), it would be safer to err on the side of caution:
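To make the err-on-the-side-of-caution idea concrete, here is a hedged sketch using Guava's BloomFilter: an event whose hash might have been seen before is flagged for a later exact check rather than dropped, since a Bloom filter can return false positives but never false negatives. The capacity, error rate and surrounding names are illustrative.

```scala
// Sketch: flag possible duplicates rather than deleting them outright,
// because a Bloom filter membership test can return false positives.
import com.google.common.hash.{BloomFilter, Funnels}
import java.nio.charset.StandardCharsets

object ProbabilisticDupeCheck {
  private val seenHashes: BloomFilter[CharSequence] =
    BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10000000, 0.001)

  /** Returns the event hash paired with a "possible duplicate" flag. */
  def checkAndRecord(eventHash: String): (String, Boolean) = {
    val possibleDupe = seenHashes.mightContain(eventHash) // may be a false positive
    seenHashes.put(eventHash)
    (eventHash, possibleDupe) // flagged events get an exact KV-store check later
  }
}
```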
But this approach will still cause inflated counts with at-least-once processing systems. So a hybrid model, using a KV cache of N days of events and another KV cache of N days of event hashes:
This prevents natural copies within an N-day window and safely renames ID collisions and synthetic copies all the way back through time.
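A minimal sketch of how that hybrid check might behave (the store interface, field names and the renaming step are assumptions, since the comment above does not pin down the exact behaviour): an event whose full hash has been seen within N days is treated as a natural copy and dropped, while an event that only shares an event_id with something already seen is kept but given a fresh event_id so it no longer collides.

```scala
// Hypothetical hybrid de-duplication sketch; interfaces and names are illustrative.
import java.util.UUID

trait TtlKvStore {
  def contains(key: String): Boolean
  def put(key: String, ttlDays: Int): Unit
}

case class Event(eventId: String, eventHash: String, payload: String)

class HybridDeduper(idCache: TtlKvStore, hashCache: TtlKvStore, windowDays: Int) {

  /** Returns None for a natural copy, otherwise the event (renamed if its id was seen). */
  def process(e: Event): Option[Event] = {
    if (hashCache.contains(e.eventHash)) {
      None // same hash seen within N days: a natural copy, safe to drop
    } else {
      hashCache.put(e.eventHash, windowDays)
      if (idCache.contains(e.eventId)) {
        // Same event_id but a different hash: an ID collision or synthetic copy,
        // so keep the event and assign it a fresh event_id instead of deleting it.
        Some(e.copy(eventId = UUID.randomUUID.toString))
      } else {
        idCache.put(e.eventId, windowDays)
        Some(e)
      }
    }
  }
}
```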
Title was: Occasionally CloudFront registers an event twice
Content was:
Presumably because it hits two separate nodes at the same time.
Update the ETL to dedupe this - the algorithm would be to hash the querystring and check for duplicates within an X-minute timeframe. (Hashing the full raw querystring would implicitly include txn_id, the random JavaScript-side identifier, in the uniqueness check.)
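For illustration only (the hash function and the windowing mechanism are not specified in the ticket), hashing the full raw querystring and checking it against an X-minute window could look like the sketch below, with a simple in-memory map standing in for whatever lookup the ETL would actually use:

```scala
// Sketch: hash the full raw querystring (txn_id included) and treat a repeat
// within the window as a CloudFront double-registration. Windowing is illustrative.
import java.security.MessageDigest
import scala.collection.mutable

object QuerystringDeduper {
  private val windowMillis = 15 * 60 * 1000L         // the "X minutes", e.g. 15
  private val seen = mutable.Map.empty[String, Long]  // hash -> first-seen timestamp

  private def sha256(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString

  /** True if the same raw querystring was already seen within the window. */
  def isDuplicate(rawQuerystring: String, nowMillis: Long): Boolean = {
    val hash = sha256(rawQuerystring)
    seen.get(hash) match {
      case Some(first) if nowMillis - first <= windowMillis => true
      case _ =>
        seen.put(hash, nowMillis)
        false
    }
  }
}
```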