txn_id is not sufficiently unique - need a UUID event_id too #89

alexanderdean · 2012-11-16T12:04:42Z

TBD whether UUID is set in ETL or tracker.

There are a couple of potential issues with setting the UUID in the tracker:

JavaScript has pretty poor UUID capabilities (http://stackoverflow.com/questions/105034/how-to-create-a-guid-uuid-in-javascript) - and we need this UUID to be really, well, unique :-)
Different trackers using different UUID capabilities might increase risk of a collision down the line (I haven't really thought this one through)

So I'm leaning towards:

Keeping txn_id as a 'local' transaction ID to detect dupes (see issue #24, "Scala Hadoop Shred: deduplicate event_ids with different event_fingerprints (synthetic duplicates)"), and
Adding a properly unique event_id into the ETL layer using a proper Java UUID library.
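
A minimal sketch of that split, for illustration only. It is written in Python with the standard `uuid` module rather than the Java UUID library proposed above, and the function name `assign_event_ids`, the record layout, and the rule "same txn_id and otherwise identical means duplicate" are all assumptions, not the actual ETL logic:

```python
import uuid


def assign_event_ids(records):
    """Drop records logged twice, then give each survivor a UUID event_id.

    Assumption: two records are the same event logged twice when every
    field, including txn_id, matches. The real ETL rule may differ.
    """
    seen = set()
    enriched = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the record
        if key in seen:
            continue  # same txn_id and same payload: treat as a duplicate
        seen.add(key)
        # uuid4() is a random (version 4) UUID, generated once in the ETL layer
        enriched.append({**record, "event_id": str(uuid.uuid4())})
    return enriched


# Hypothetical sample rows; field names other than txn_id are made up.
records = [
    {"txn_id": "128574", "page": "/home", "tstamp": "2012-11-16 09:51:23"},
    {"txn_id": "128574", "page": "/home", "tstamp": "2012-11-16 09:51:23"},  # logged twice
    {"txn_id": "903311", "page": "/home", "tstamp": "2012-11-16 09:51:23"},  # distinct event
]
events = assign_event_ids(records)
```

Here txn_id only has to tell apart near-identical records that arrive close together, while event_id carries the global uniqueness.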

See email thread below:

On 16 November 2012 11:14, Yali yali@snowplowanalytics.com wrote:
Hi Michael,

You're right - we've not been clear on transaction ids.

We originally added them because occasionally CloudFront registers the same event twice. The transaction id became a way of telling whether two similar-looking events are really the same event recorded twice; as such, the requirement was only to be unique within a narrow time frame.

However, having a UUID for each transaction is very desirable for analysis across the full SnowPlow data set for a particular web property. The question we're working through (and would appreciate input into) is: should that UUID be generated by the tracker (the way the current transaction id is)? Or would it make more sense to generate it at the ETL phase (in which case the current transaction_id would be one input into the creation of a more robust id that really was globally unique)?

If you have any thoughts on pros / cons of each approach we'd welcome your feedback. Otherwise, I'll update our current documentation to clarify the limitations of the transaction id, as currently implemented.

All the best,

Yali

On Friday, November 16, 2012 9:51:23 AM UTC, Michael Bell wrote:
Hi,

We've recently added SnowPlow to our site and have found the transaction id (txn_id) being captured is not as unique as we'd understood from the documentation. Docs state the txn_id field is:

A unique event ID. If two or more records have the same txn_id, one is a duplicate record
The JavaScript that generates the txn_id appears to simply take a 6-character substring of a random number, which is unlikely to be unique with a large enough dataset. Have we misunderstood the intention of this field, or are we missing something else?

Regards,

Mike
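
Mike's concern can be put in rough numbers with the standard birthday-problem approximation. The sketch below assumes the 6-character substring is six decimal digits, giving 10^6 possible values drawn uniformly; the tracker's real ID space may differ, and `collision_probability` is a hypothetical helper name:

```python
import math


def collision_probability(n_events, id_space):
    """Approximate chance that at least two of n_events share an ID.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2N)), which
    assumes IDs are drawn uniformly at random from id_space values.
    """
    exponent = n_events * (n_events - 1) / (2 * id_space)
    return -math.expm1(-exponent)  # 1 - exp(-exponent), computed stably


ID_SPACE = 10 ** 6  # assumption: six decimal digits

for n in (100, 1_000, 5_000):
    print(n, round(collision_probability(n, ID_SPACE), 4))
```

Under that assumption, 1,000 events already give about a 39% chance of at least one repeated txn_id, which fits the original goal of uniqueness within a narrow time frame but not across a full data set.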

ghost assigned alexanderdean Nov 16, 2012

alexanderdean closed this as completed Dec 26, 2012
