txn_id is not sufficiently unique - need a UUID event_id too #89

Closed
alexanderdean opened this issue Nov 16, 2012 · 0 comments
Comments

@alexanderdean
Member

TBD whether UUID is set in ETL or tracker.

There are a couple of potential issues with setting the UUID in the tracker:

So I'm leaning towards:

  1. Keeping txn_id as a 'local' transaction ID to detect dupes (issue #24, "Scala Hadoop Shred: deduplicate event_ids with different event_fingerprints (synthetic duplicates)"), and
  2. Adding a properly unique event_id into the ETL layer using a proper Java UUID library.
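
A minimal sketch of option 2, assuming the ETL assigns a random (version 4) UUID using the JDK's built-in `java.util.UUID`. The class and method names (`EventIdSketch`, `newEventId`) are hypothetical and only illustrate the idea; they are not part of SnowPlow.

```java
import java.util.UUID;

public class EventIdSketch {

    // Hypothetical ETL-side helper: returns a random (version 4) UUID
    // as a 36-character string, e.g. for an event_id column.
    public static String newEventId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // The tracker's txn_id would be kept alongside for local dupe detection;
        // event_id is assigned independently of it in this sketch.
        System.out.println("event_id = " + newEventId());
    }
}
```

Generating the ID in the ETL means it does not depend on the quality of the random number generator in each visitor's browser, at the cost of the ID not existing until the event has been processed.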

See email thread below:


On 16 November 2012 11:14, Yali <yali@snowplowanalytics.com> wrote:
Hi Michael,

You're right - we've not been clear on transaction ids.

We originally added them because, occasionally, Cloudfront registers the same event twice. The transaction id became a way of telling whether two similar-looking events are really the same event recorded twice; as such, it was only required to be unique within a narrow time frame.

However, having a UUID for each transaction is very desirable for analysis across the full SnowPlow data set for a particular web property. The question we're working through (and would appreciate input into) is: should that UUID be generated by the tracker (the way the current transaction id is)? Or would it make more sense to generate it at the ETL phase (in which case the current transaction_id would be one input into the creation of a more robust id that really was globally unique)?

If you have any thoughts on pros / cons of each approach we'd welcome your feedback. Otherwise, I'll update our current documentation to clarify the limitations of the transaction id, as currently implemented.

All the best,

Yali

On Friday, November 16, 2012 9:51:23 AM UTC, Michael Bell wrote:
Hi,

We've recently added SnowPlow to our site and have found the transaction id (txn_id) being captured is not as unique as we'd understood from the documentation. Docs state the txn_id field is:

"A unique event ID. If two or more records have the same txn_id, one is a duplicate record"

The JavaScript that generates the txn_id appears to simply take a 6-character substring of a random number, which is unlikely to be unique with a large enough dataset. Have we misunderstood the intention of this field, or are we missing something else?

Regards,

Mike
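
To put rough numbers on Mike's point, here is a small sketch using the standard birthday-problem approximation, p ≈ 1 - exp(-n(n-1) / (2N)). It assumes the 6-character txn_id is drawn uniformly from decimal digits, giving N = 10^6 possible values. That assumption is ours; the thread only says it is a 6-character substring of a random number. The class and method names are hypothetical.

```java
public class TxnIdCollision {

    // Birthday-problem approximation: probability that at least two of n
    // uniformly random IDs collide when there are `space` possible values.
    public static double collisionProbability(long n, double space) {
        return 1.0 - Math.exp(-((double) n * (n - 1)) / (2.0 * space));
    }

    public static void main(String[] args) {
        // Assumption: 6 decimal digits, i.e. 10^6 distinct txn_id values.
        double space = 1_000_000.0;
        for (long n : new long[] {100, 1_000, 10_000}) {
            System.out.printf("%d events -> collision probability %.4f%n",
                    n, collisionProbability(n, space));
        }
    }
}
```

Under that assumption, about a thousand events already give roughly a 39% chance of at least one repeated txn_id. That is consistent with the field being usable only for spotting duplicates within a narrow time frame, not as a globally unique event ID.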
