Don’t assign unique identifiers/fingerprints to visitors by default #14
Hey @da2x, thanks for the suggestion and the well-thought-out comment. This is great stuff! We've been debating whether to use fingerprinting or a cookie and came to the same conclusion as you: a cookie is actually more privacy-friendly than fingerprinting, because users can delete the cookie in order to be forgotten. Eventually we want to support "visitor paths" and I have yet to think of a way to do that without assigning a unique identifier to each user. This can and should definitely be anonymous though, imo. Do you have any ideas on how you'd tackle visitor paths without the use of identifiers?
@dannyvankooten, how is information about the visitor path more useful than just storing the referrer information for a given page? If you can tell me more about the specific requirements of this feature, I may be able to come up with another solution to get the same data. Just use the
(Speaking of aggregate data and referrer information: unique referrer data should be dropped after a week. E.g. if only 2 clicks came from example.com/webmail, then delete the link after a week. If 600 people came from reddit.com then that should be considered non-unique and you can keep storing it.) |
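The retention rule suggested above could be sketched roughly as follows. This is only an illustration: the row shape (`url`, `clicks`, `firstSeen`) and the `minClicks` threshold are assumptions, since the comment only gives 2 and 600 clicks as examples.

```javascript
// Hypothetical row shape: a referrer URL, a click count, and a first-seen timestamp (ms).
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Keep rows that are either younger than a week or popular enough to count as non-unique.
// minClicks is an assumed threshold, not a value taken from the discussion.
function pruneReferrers(rows, now, minClicks = 10) {
  return rows.filter(
    (row) => now - row.firstSeen < WEEK_MS || row.clicks >= minClicks
  );
}

const now = Date.parse('2018-05-01T00:00:00Z');
const eightDaysAgo = now - 8 * 24 * 60 * 60 * 1000;
const rows = [
  { url: 'example.com/webmail', clicks: 2, firstSeen: eightDaysAgo },
  { url: 'reddit.com', clicks: 600, firstSeen: eightDaysAgo },
  { url: 'example.org/new', clicks: 1, firstSeen: now - 60 * 60 * 1000 },
];
const kept = pruneReferrers(rows, now).map((r) => r.url);
```

Here the week-old, rarely used webmail link is dropped, while the popular referrer and the still-fresh one are kept.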
This is an interesting discussion and I hope you don't mind me both chipping in and following. @dannyvankooten - you look like you're WAY more into this than I am. But I'm currently working out this exact same thing on my (much smaller and probably less viable) Kownter project. In fact, one of the reasons I'm building it is to see what can be done without cookies. Some kind of aggregate flow between pages, as @da2x suggests, is pretty much the thinking that I came to. I don't need to see individuals' paths if I can see the aggregate drop-off as people move through the user journey. In this post I attempted to explain this, saying: "we can still report the ratio of conversions against page views." OK, so there are probably some advanced cases where you want to know more than that, but if you want that then you probably actually want GA, right? I've also been wondering if it makes sense, is useful, and is performant enough (server-side) to store a cookie, but to recycle it on each visit. And, if you do this, is it any more private than just setting a cookie and leaving it there? I'm not sure it is. I am not actually convinced that a session ID is personally identifiable, as there's no reverse-lookup. If you were storing a session ID alongside an IP address then that would be different. I know GDPR says something about web-tokens, but I think what you/we are doing here is way within the spirit of GDPR. Interested to see how this progresses.
Hey @rosswintle, I definitely do not mind - quite the opposite. Thank you for chiming in! Kownter looks super interesting, I'm glad that there are more people thinking about solving analytics in a better way and providing more options besides "just use GA, even though you don't use 90% of what they're collecting". I'm going to go through all of your posts as there's most likely a ton to learn - I'm only just getting back into this project and forgot about a lot of the decisions that went into this when I started 18 months ago.
Same for me, although there may be other off-site identifiers that will still give away this particular user (eg cross referencing timestamped actions in app?). Anyway, at this stage I lack the intricate details to really add anything to this discussion, so I'll get right to reading your posts and experimenting to see if there's something we can do to tackle this. Dropping the unique visitor ID entirely is definitely worth striving for, I'd say! |
Great! I think with the backing of @pjrvs and whatever design skills you have that are way better than mine, your project will be much less of a toy/experiment than mine. Plus, writing it in Go probably helps with the scaling a LOT (though I think the downside is that self-hosted deployment might be harder for some people). Great that some of what I've done developing the early stages in the open might help. Always happy to talk about the experience! I'll leave you and @da2x to work out the specific issue listed here. |
@rosswintle, put this on repeat and get back to working on Kownter. It’ll be a great alternative (more alternatives is great!) and you just need to stay motivated. Make it your own and it’ll probably turn out great. Nick some stylesheets and graphs from Fathom if you like the visuals and stuff your own data in them. This isn’t legal advice and I’m not anyone’s lawyer. The following could very well be totally wrong: Fewer magical identifiers means more transparency. It also means people won’t contact the operator of the analytics service to ask for a copy of the data belonging to $magical_token or ask to have the data of $magical_token deleted. I specifically suggested cookie names that were named after the data they contain instead of a single magical cookie containing all the data. Individual cookies are easier for people to inspect. Opting out of this is as simple as disabling cookies, and their use is easy to document in a privacy policy. The following is provided for context (GDPR sections mentioning identifiers as relevant to this discussion): GDPR Recital 30:
GDPR Article 4 Definitions: Point 1:
…implicity and privacy. see #14.
Quick progress update: Fathom now relies on storing all visitor-specific data on the client side so that the server only has to keep track of aggregated data. Not only did this allow for much simpler code and better privacy (as discussed), it'll scale a whole lot better too. So thanks @da2x and @rosswintle for lending your brains here. Super helpful! Right now visitors are still assigned a random ID but it is only stored for a theoretical maximum of 30 minutes (the expiration time of a session). If a visitor visits multiple pages, all pageview hits except the last one are deleted within 5 minutes (the time between aggregation). The visitor ID is only needed to give an indication of realtime visitors (that is: distinct visitors that did a pageview or performed another event within the last 5 minutes). I haven't yet been able to come up with a way to do that without storing some kind of short-lived identifier on the server, but let me know if you have any suggestions here please. Besides the visitor ID (a random string), no other identifiable data is stored anymore. 🎉 🍾
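To make the realtime-visitors idea concrete, here is a minimal sketch of counting distinct short-lived IDs inside a 5-minute window. It is not Fathom's actual code: the in-memory `Map`, the function names, and the sample IDs are all assumptions for illustration.

```javascript
const WINDOW_MS = 5 * 60 * 1000; // the 5-minute realtime window described above

// lastSeen maps a short-lived random visitor ID to the time (ms) of its latest hit.
function recordHit(lastSeen, visitorId, now) {
  lastSeen.set(visitorId, now);
}

// Forget IDs that fell outside the window, then count what is left.
function realtimeVisitors(lastSeen, now) {
  for (const [id, seenAt] of lastSeen) {
    if (now - seenAt > WINDOW_MS) lastSeen.delete(id);
  }
  return lastSeen.size;
}

const lastSeen = new Map();
recordHit(lastSeen, 'a1', 0);
recordHit(lastSeen, 'b2', 4 * 60 * 1000);
recordHit(lastSeen, 'a1', 4 * 60 * 1000 + 30 * 1000); // same visitor, second pageview
const active = realtimeVisitors(lastSeen, 6 * 60 * 1000);
const later = realtimeVisitors(lastSeen, 10 * 60 * 1000);
```

Only the latest timestamp per ID is kept, so nothing about the path a visitor took survives on the server in this sketch.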
Could you, for the purposes of real-time unique visitors, just ignore any hits with an internal (same site) referring page? (This was a super-quick thought I wanted to jot down. Will properly read and think another time.) |
This also gives me the thought that you can set cookies to identify returning users without having an ID in the cookie. You just set `tracked-with-fathom = 1` and pick it up server side. The PECR rules would still (currently) need the existence of the cookie to be disclosed to users. But it should keep you free of GDPR “personal data” rules. Hmm. 🤔
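A rough sketch of the presence-only cookie idea, using the `tracked-with-fathom = 1` name mentioned in this thread. The helper names and the one-year `Max-Age` are assumptions, not anything taken from Fathom.

```javascript
// Parse a raw Cookie request header into a name -> value object.
function parseCookies(header) {
  const cookies = {};
  for (const part of (header || '').split(';')) {
    const idx = part.indexOf('=');
    if (idx === -1) continue;
    cookies[part.slice(0, idx).trim()] = part.slice(idx + 1).trim();
  }
  return cookies;
}

// Report whether the visitor has been seen before, plus the header to send back.
// The cookie carries no identifier: its mere presence is the signal.
function classifyVisit(cookieHeader) {
  const returning = parseCookies(cookieHeader)['tracked-with-fathom'] === '1';
  return {
    returning,
    setCookie: 'tracked-with-fathom=1; Path=/; Max-Age=31536000', // assumed lifetime
  };
}

const first = classifyVisit('');
const second = classifyVisit('theme=dark; tracked-with-fathom=1');
```

Every visitor ends up with an identical cookie value, so the server can bump a "returning visitors" counter without being able to tell two returning visitors apart.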
Greatly appreciated.
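I was about to suggest the same. The `Referer` header holds this information. I don't see what a random ID adds here.
That serves the same purpose as the `last-visit` cookie I suggested earlier.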
(This isn't legal advice, and I'm not a lawyer.) The ePrivacy Directive requirements vary greatly from country to country. Some require consent and opt-in, some require a prompt and opt-out, some just require information, and some require browser settings to be respected (cookies, DNT). The ePrivacy Regulation (late 2018/early 2019?) changes this mess to the last option (browser settings) plus detailed documentation (privacy policy). I'm personally aligning all my thinking with the ePrivacy Regulation + GDPR. The ePrivacy Directive and GDPR are mutually exclusive as far as I understand it, so the current situation is unclear. However, aiming for transparency (purposeful cookie names and matching privacy policies) and aiming for data minimization and avoiding any kind of user-profiling should be the way to go to be compliant with the intent of European regulations.
Returning visitors are indeed already tracked using the cookie (and not using the identifier), but the ID is used to get a list of the number of _distinct_ visitors active in the last 5 minutes (but they might have arrived earlier than that).
I don't see how the server would be able to tell how many distinct visitors are online using the referer header, especially if they have been browsing the site for a little while (so they are not a "new visitor" but they are still "online"). Am I overlooking something here?
…On May 7, 2018 10:12:36 PM GMT+02:00, Daniel Aleksandersen ***@***.***> wrote:
> Quick progress update [...]
Greatly appreciated.
> Could you, for the purposes of real-time unique visitors, just ignore
any hits with an internal (same site) referring page?
I was about to suggest the same. The `Referer` header holds this
information. I don't see what a random ID adds here.
> This also gives me the thought that you can set cookies to identify
returning users without having an ID in the cookie. You just set
`tracked-with-fathom = 1` and pick it up server side.
That serves the same purpose as the `last-visit` cookie I suggested
earlier.
> The PECR rules would still (currently) need the existence of the
cookie to be disclosed to users.
(This isn't legal advice, and I'm not a lawyer.) The ePrivacy Directive
requirements vary greatly from country to country. Some require
consent and opt-in, some require a prompt and opt-out, some just
require information, and some require browser settings to be respected
(cookies, DNT). The ePrivacy Regulation (late 2018/early 2019?) changes
this mess to the last option (browser settings) plus detailed
documentation (privacy policy).
I'm personally aligning all my thinking with the ePrivacy Regulation +
GDPR. The ePrivacy Directive and GDPR are mutually exclusive as far as
I understand it, so the current situation is unclear. However, aiming
for transparency (purposeful cookie names and matching privacy
policies) and aiming for data minimization and avoiding any kind of
user-profiling should be the way to go to be compliant with the intent
of European regulations.
That’s neat...I’d not thought of using cookie existence to track a return visit. I like that. And yeah, you’re right. Without IDs you could track NEW live visitors, but not those that had been around for a while...this could be an explained limitation...or just use your temporary IDs! Great discussion and thanks for the update. |
Use the last-visit timestamp cookie: record the lastvisit offset in minutes. E.g. add a plus one count in table active_visitors with columns datetime (minute precision), 0min, 1min, 2min, ... 15min. For any minute of the day, you'd be able to see how many active sessions/people there are versus total pageviews by comparing pageview counts to the active_visitors table. This table should be cleared regularly, of course, and data stored in aggregate in a more practical table. This should get you the info you want from just a timestamp cookie.
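One way to read the active_visitors idea is sketched below. The in-memory `Map` stands in for the table, and two details are assumptions: the cookie would need minute (not daily) precision for this to work, and a missing or too-old cookie is counted here as a fresh session under `0min`.

```javascript
const MAX_OFFSET_MIN = 15;

// rows: Map keyed by a minute-precision timestamp -> array of counters for 0min..15min.
function recordPageview(rows, nowMs, lastVisitMs) {
  const minuteKey = new Date(nowMs).toISOString().slice(0, 16); // e.g. "2018-05-08T10:07"
  if (!rows.has(minuteKey)) {
    rows.set(minuteKey, new Array(MAX_OFFSET_MIN + 1).fill(0));
  }
  // Minutes since the previous hit; no cookie or an expired one counts as 0 (assumption).
  let offset = lastVisitMs == null ? 0 : Math.floor((nowMs - lastVisitMs) / 60000);
  if (offset < 0 || offset > MAX_OFFSET_MIN) offset = 0;
  rows.get(minuteKey)[offset] += 1;
  return minuteKey;
}

const rows = new Map();
const t = Date.parse('2018-05-08T10:07:30Z');
recordPageview(rows, t, null);                 // no cookie yet
recordPageview(rows, t, t - 3 * 60000);        // last hit 3 minutes ago
recordPageview(rows, t, t - 3 * 60000 - 1000); // last hit just over 3 minutes ago
const row = rows.get('2018-05-08T10:07');
```

Summing a row gives the pageviews for that minute, while the spread across the offset columns hints at how many of those hits came from sessions that were already active.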
I'm 100% in favor of using no identifiers, but you could still store travel path
Quick update: current For a new visitor, the data sent to the Fathom backend will now look something like this. Given 3 pageviews:
The "previous ID" is used to update the previous pageview (not a bounce, update time on page) but is not stored, so that each row in the
That's very cool/smart. I guess the ID is stored in the cookie for use in the next page view, right? But what does the ID actually represent and how is it generated? Is it some kind of hash of timestamp and device fingerprint? |
(I'd read the code, but I'll be AFK very soon) |
The relevant part is in assets/src/js/tracker.js. Mostly:

    const d = {
      id: util.randomString(20),
      pid: data.previousPageviewId || '',
      ....

So it's just a (truly) random string of 20 characters, leaving a small chance of collisions, but given that the pageview table is cleared every minute I'd say the chances of that happening are nil. And even if it happens, it won't have any real consequences.
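For a feel of the numbers, here is a hedged sketch of what a helper like `randomString(20)` might look like, together with a birthday-bound estimate of the collision chance. The real `util.randomString` is not shown in this thread, so the 62-character alphabet and the use of `Math.random` (which is pseudo-random, not cryptographically random) are assumptions.

```javascript
// Assumed alphabet: upper case, lower case and digits (62 characters).
const ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

// Build a string by picking `length` characters at random from ALPHABET.
function randomString(length) {
  let out = '';
  for (let i = 0; i < length; i++) {
    out += ALPHABET.charAt(Math.floor(Math.random() * ALPHABET.length));
  }
  return out;
}

// Birthday-bound approximation: p ≈ n² / (2 · alphabetSize^length),
// where n is the number of IDs that exist at the same time.
function collisionProbability(n, alphabetSize, length) {
  return (n * n) / (2 * Math.pow(alphabetSize, length));
}

const id = randomString(20);
const p = collisionProbability(1e6, ALPHABET.length, 20);
```

Even with a million IDs alive at once, the estimate under these assumptions stays many orders of magnitude below anything that would matter in practice, which is consistent with the "chances are nil" remark above.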
Disclaimer: I’m not a lawyer and this isn’t legal advice.
It would be useful for General Data Protection Regulation (GDPR) compliance to not store IP addresses, cookie identifiers, or other unique fingerprints. The current unique identifiers can be decoded back to IP addresses. See EU GDPR and personal data in web server logs for context.
I’d like to see this as the default mode, but at least make it an option. This could be a unique selling point.
IP addresses also aren’t all that useful any more for assigning a unique identifier, as mobile devices roam between different networks several times in a normal day (at home, mobile carrier, work, café, etc.). IPv6 reduces the usefulness of this further by assigning new addresses periodically (usually once per reboot or reconnect, or every 48, 24, or 12 hours depending on operating system and network environment).
Here are some ideas and alternative approaches to get the same data in aggregate without assigning unique identifiers to each user:
Pageviews per session and number of unique users:
- `/collect` that runs an incremental short-lived/session pageview counter. E.g. `Set-Cookie: ana_pageviews=1; path=/collect; max-age=3600` (1 hour session). Second request you send back the same cookie with a value of 2, etc.

User-retention/repeat visitors:
- `/collect` that includes an imprecise timestamp (e.g. only daily precision to avoid them being too unique). E.g. `Set-Cookie: ana_lastvisit=2018-04-23; path=/collect; max-age=7776000` (3 months). Reset on every visit.
- `now() - $cookie['ana_lastvisit']`. Maybe don’t track this within an active session (`$cookie['ana_pageviews']` is set)?

What else is needed to track?
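The two cookies proposed above (`ana_pageviews` and `ana_lastvisit`) could be handled along these lines. This is a sketch under assumptions: the `collect` function, its return shape, and the counter names are made up for illustration, and the request cookies are assumed to be parsed already.

```javascript
// cookies: already-parsed request cookies, e.g. { ana_pageviews: '2' }.
// today: a Date. Returns Set-Cookie values plus the aggregate counters to increment.
function collect(cookies, today) {
  const day = today.toISOString().slice(0, 10); // daily precision only
  const pageviews = (parseInt(cookies.ana_pageviews, 10) || 0) + 1;
  const counters = { pageviews: 1, sessions: pageviews === 1 ? 1 : 0 };

  // Only measure retention at the start of a session, as suggested above.
  if (pageviews === 1 && cookies.ana_lastvisit) {
    const days = Math.round(
      (Date.parse(day) - Date.parse(cookies.ana_lastvisit)) / 86400000
    );
    if (Number.isFinite(days)) counters.returningAfterDays = days;
  }

  return {
    counters,
    setCookie: [
      `ana_pageviews=${pageviews}; path=/collect; max-age=3600`,
      `ana_lastvisit=${day}; path=/collect; max-age=7776000`,
    ],
  };
}

const when = new Date('2018-04-30T12:00:00Z');
const firstHit = collect({ ana_lastvisit: '2018-04-23' }, when);
const secondHit = collect({ ana_pageviews: '1', ana_lastvisit: '2018-04-30' }, when);
```

The server only ever adds to aggregate counters; the per-visitor state lives in the visitor's own cookies, where it can be inspected or deleted.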
On the use of cookies: The cookies are transparent (even self-explanatory), their use is easy to explain in a privacy policy, and in my opinion they should be GDPR-friendly. They’re not used to track the behaviour of individual users, just the movements and trends in the herd.
Disclaimer: I’m not a lawyer and this isn’t legal advice.