Scala Hadoop Shred: deduplicate event_ids with different event_fingerprints (synthetic duplicates) #24
Note that achieving this isn't currently possible with our Hive serde-based row-level deserialisation. We would need either:
/cc @yalisassoon
I was imagining the following alternative, which I was hoping would work with Cascading / Scalding (so not strictly speaking stream-based, rather batch-based), but which is in essence a streaming-based approach:
We will need to have a number of lookup tables referenced as part of a Scalding ETL, especially as we build out an OLAP-friendly version of the events table. I imagined it would make sense to use Amazon SimpleDB or RDS for the tables for simplicity, but probably worth discussing on a dedicated thread...
Yep, that could well work... Hopefully once we have built the caches required by the OLAP cube, the approach to deduplication will become clearer. One thing we should probably do at some stage is move from storing the…
To handle dupes cleanly, we need to store the hashes for all messages from the last batch, which could be duplicated in our time period. We can store these in a table in HBase: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hbase.html
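For illustration only, a minimal sketch of that idea against the standard HBase client API (the table name, column family and key layout below are assumptions, not something this ticket specifies): keep one row per event hash, and before loading a batch check whether each hash is already present.

```scala
// Sketch only: assumes an HBase table "event_hashes" with column family "m".
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object EventHashStore {
  private val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  private val table = conn.getTable(TableName.valueOf("event_hashes"))

  /** Has this event hash been seen in a previous batch? */
  def seenBefore(eventHash: String): Boolean =
    table.exists(new Get(Bytes.toBytes(eventHash)))

  /** Record the hash against the batch that first carried it. */
  def record(eventHash: String, batchId: String): Unit = {
    val put = new Put(Bytes.toBytes(eventHash))
    put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("batch"), Bytes.toBytes(batchId))
    table.put(put)
  }
}
```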
One of our oldest open tickets now. In a stream processing world, you have to make an assumption that all dupes will come from the same shard key (IP address?), and then keep all event_ids in your shard's local storage for N minutes. (Otherwise you would have to re-partition based on incoming event_id, pretty painful.) There could be an interesting abstraction in Scala Common Enrich where, as well as an EnrichmentRegistry, it is supplied with read/write access to a KV store - the specifics of the KV store are abstracted (Samza's KV store impl? Storehaus?) but Common Enrich can use it to manage dupes, sessionization, etc.
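As a rough sketch of what that abstraction might look like (the trait, class and TTL parameter below are hypothetical, not part of Scala Common Enrich), the enrichment process could be handed a minimal KV interface and use it to decide whether an event_id has already been seen on its shard within the window:

```scala
// Hypothetical sketch - not part of Scala Common Enrich. Assumes the caller
// supplies some KV store implementation (a Samza KV store, Storehaus, etc.).
trait DedupeKvStore {
  def get(key: String): Option[String]
  def put(key: String, value: String, ttlMinutes: Int): Unit
}

class ShardLocalDeduper(store: DedupeKvStore, windowMinutes: Int = 60) {

  /** Returns true the first time an event_id is seen within the window,
    * false for any repeat - the repeat can then be dropped or flagged. */
  def isFirstSighting(eventId: String): Boolean =
    store.get(eventId) match {
      case Some(_) => false
      case None =>
        store.put(eventId, "seen", windowMinutes)
        true
    }
}
```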
Renaming from "Scala Common Enrich:" to "Unnamed new module:", because in a stream processing world, any stage can introduce natural copies. So we have to embed this module in each storage sink that cares about duplicate event_ids.
Then we came up with this taxonomy:
We quickly realized that it is essential to define the ur event (ur meaning most primitive) correctly. @yalisassoon to add his whiteboard photo... |
For existing code: 8c09ade |
When you observe two events in your event stream with the same event_id, one of three things could be happening:
- a natural copy: the same event has simply been recorded or processed more than once
- a synthetic copy: duplicates of an event carrying different event_fingerprints (as in the title of this ticket)
- an ID collision: two unrelated events that happen to have been assigned the same event_id
Thinking about this further, a simple de-duplication algorithm would be:
With this approach, distinguishing between ID collisions and synthetic copies can still be done (if needed) at analysis time.
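Because the steps of the algorithm are not spelled out above, the following is only a minimal sketch of one plausible reading, with plain Scala collections standing in for the real Scalding pipeline and with field names assumed: collapse rows that agree on both event_id and event_fingerprint, and leave rows that share an event_id but differ in fingerprint untouched, so ID collisions and synthetic copies stay distinguishable at analysis time.

```scala
// Hypothetical sketch of a simple de-duplication pass (field names assumed).
case class EnrichedEvent(eventId: String, eventFingerprint: String, payload: String)

def dedupe(events: Seq[EnrichedEvent]): Seq[EnrichedEvent] =
  events
    .groupBy(e => (e.eventId, e.eventFingerprint)) // natural copies share both
    .values
    .map(_.head)                                   // keep one representative per group
    .toSeq
// Rows with the same eventId but different fingerprints all survive, so
// ID collisions and synthetic copies remain distinguishable downstream.
```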
Could this be done using Bloom filters? (http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf) Not directly:
Rather than delete some false positives (causing data loss), it would be safer to err on the side of caution:
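To make the err-on-the-side-of-caution idea concrete, here is a hedged sketch using Guava's BloomFilter: an event whose hash might have been seen before is flagged for a later exact check rather than dropped, since a Bloom filter can return false positives but never false negatives. The capacity, error rate and surrounding names are illustrative.

```scala
// Sketch: flag possible duplicates rather than deleting them outright,
// because a Bloom filter membership test can return false positives.
import com.google.common.hash.{BloomFilter, Funnels}
import java.nio.charset.StandardCharsets

object ProbabilisticDupeCheck {
  private val seenHashes: BloomFilter[CharSequence] =
    BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10000000, 0.001)

  /** Returns the event hash paired with a "possible duplicate" flag. */
  def checkAndRecord(eventHash: String): (String, Boolean) = {
    val possibleDupe = seenHashes.mightContain(eventHash) // may be a false positive
    seenHashes.put(eventHash)
    (eventHash, possibleDupe) // flagged events get an exact KV-store check later
  }
}
```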
But this approach will still cause inflated counts with at-least-once processing systems. So a hybrid model, using a KV cache of N days of events and another KV cache of N days of event hashes:
This prevents natural copies within an N-day window and safely renames ID collisions and synthetic copies all the way back through time.
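A minimal sketch of how that hybrid check might behave (the store interface, field names and the renaming step are assumptions, since the comment above does not pin down the exact behaviour): an event whose full hash has been seen within N days is treated as a natural copy and dropped, while an event that only shares an event_id with something already seen is kept but given a fresh event_id so it no longer collides.

```scala
// Hypothetical hybrid de-duplication sketch; interfaces and names are illustrative.
import java.util.UUID

trait TtlKvStore {
  def contains(key: String): Boolean
  def put(key: String, ttlDays: Int): Unit
}

case class Event(eventId: String, eventHash: String, payload: String)

class HybridDeduper(idCache: TtlKvStore, hashCache: TtlKvStore, windowDays: Int) {

  /** Returns None for a natural copy, otherwise the event (renamed if its id was seen). */
  def process(e: Event): Option[Event] = {
    if (hashCache.contains(e.eventHash)) {
      None // same hash seen within N days: a natural copy, safe to drop
    } else {
      hashCache.put(e.eventHash, windowDays)
      if (idCache.contains(e.eventId)) {
        // Same event_id but a different hash: an ID collision or synthetic copy,
        // so keep the event and assign it a fresh event_id instead of deleting it.
        Some(e.copy(eventId = UUID.randomUUID.toString))
      } else {
        idCache.put(e.eventId, windowDays)
        Some(e)
      }
    }
  }
}
```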
Title was: Occasionally CloudFront registers an event twice
Content was:
Presumably because it hits two separate nodes at the same time.
Update the ETL to dedupe this - the algorithm would be to hash the querystring and check for duplicates within an X-minute timeframe. (Hashing the full raw querystring would implicitly include txn_id, the random JavaScript-side identifier, in the uniqueness check.)
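For illustration only (the hash function and the windowing mechanism are not specified in the ticket), hashing the full raw querystring and checking it against an X-minute window could look like the sketch below, with a simple in-memory map standing in for whatever lookup the ETL would actually use:

```scala
// Sketch: hash the full raw querystring (txn_id included) and treat a repeat
// within the window as a CloudFront double-registration. Windowing is illustrative.
import java.security.MessageDigest
import scala.collection.mutable

object QuerystringDeduper {
  private val windowMillis = 15 * 60 * 1000L         // the "X minutes", e.g. 15
  private val seen = mutable.Map.empty[String, Long]  // hash -> first-seen timestamp

  private def sha256(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString

  /** True if the same raw querystring was already seen within the window. */
  def isDuplicate(rawQuerystring: String, nowMillis: Long): Boolean = {
    val hash = sha256(rawQuerystring)
    seen.get(hash) match {
      case Some(first) if nowMillis - first <= windowMillis => true
      case _ =>
        seen.put(hash, nowMillis)
        false
    }
  }
}
```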