Docs metrics trial with Plausible #204

Closed
hugovk opened this issue Jul 10, 2023 · 9 comments

Comments

@hugovk
Member

hugovk commented Jul 10, 2023

In the Docs Community monthly meetings, we’ve discussed the need for gathering some metrics of page views from the docs:

  • This would help translators prioritise and focus their effort on the most popular pages
  • Likewise it would help us see which pages could most benefit from a rewrite or update

Plausible looks a promising choice:

  • open source
  • simple and lightweight, easy-to-use
  • no cookies and compliant with GDPR, CCPA and PECR
  • not as invasive as Google Analytics
  • paid option, hosted in the EU
  • free version, for self-hosting

The Docs Community would like to run a 30-day trial of the hosted version for https://docs.python.org/. This would tell us how many page views we get, so we can see which pricing level we’d need for paid hosting, and whether to consider self-hosting.

I asked first on DPO and there were no objections (and 10x👍).

Would a 30-day trial be okay with the Steering Council?

Thanks!

@JulienPalard
Member

Missed the thread from discuss, I don't spend a lot of time connected these days, but I'm against all forms of user tracking:

  • It costs resources (energy, server, storage, human time, and network bandwidth for every user, which also has "hidden" costs).
  • I'm happy to see the proposal is not Google Analytics, which is illegal here in France, but if not self-hosted I bet it's still a leak of personal information (IP address) to a 3rd party, which requires the permission of the user: I don't want GDPR popups on d.p.o.
  • We already have stats about which pages are most frequently visited, via a simple awk|sort|uniq in the logs, and it probably does not change often, so a dump every other year should give us enough visibility.
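For readers unfamiliar with the awk|sort|uniq approach mentioned above, here is a rough Python equivalent. It is only a sketch: the Common Log Format of the sample lines and the `top_pages` helper are assumptions for illustration, not the actual format or tooling used for the docs.python.org logs.

```python
from collections import Counter


def top_pages(log_lines, n=10):
    """Count GET requests per path, like `awk | sort | uniq -c | sort -rn`.

    Assumes Common Log Format, where the request is the first
    double-quoted field, e.g. "GET /path HTTP/1.1".
    """
    counts = Counter()
    for line in log_lines:
        try:
            request = line.split('"')[1]
            method, path, _protocol = request.split()
        except (IndexError, ValueError):
            continue  # skip lines that do not look like access-log entries
        if method == "GET":
            # Drop the query string so variants of one page are counted together.
            counts[path.split("?")[0]] += 1
    return counts.most_common(n)


# Made-up sample lines (documentation-range IP addresses).
sample = [
    '203.0.113.5 - - [10/Jul/2023:10:00:00 +0000] "GET /3/library/json.html HTTP/1.1" 200 76400',
    '203.0.113.6 - - [10/Jul/2023:10:00:01 +0000] "GET /3/library/json.html?highlight=json HTTP/1.1" 200 76400',
    '203.0.113.7 - - [10/Jul/2023:10:00:02 +0000] "GET /3/tutorial/index.html HTTP/1.1" 200 12000',
    "not a log line",
]

for path, hits in top_pages(sample):
    print(hits, path)
```

Note that a script like this reads raw IP addresses along with the paths, so the anonymisation concern raised elsewhere in this thread still applies to the input data.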

@egeakman

  • It costs resources (energy, server, storage, human time, and network bandwidth for every user, which also has "hidden" costs).

I think the investment in these resources is worthwhile because it keeps translators connected with the impact of their work. Letting them monitor recent views of their translations encourages their engagement and improves the quality of their contributions. The value gained from this investment outweighs its costs.

  • We already have stats about which pages are most frequently visited, via a simple awk|sort|uniq in the logs, and it probably does not change often, so a dump every other year should give us enough visibility.

This is what Ee said in our conversation about the traffic data of docs.python.org:

There’s really no “easy” way to pull this, which has effectively become the barrier. It’s just gigabytes of data in S3 that needs to be anonymized and processed. Past pulls were more doable, but now the volume of data has grown significantly and it is not tenable as a “one off”.

Ideally we would include some form of analytics on docs.python.org to collect this in aggregate, which would need to be discussed/decided upon. I’m sorry I can’t be of more help as is. If the docs working group can decide on an acceptable analytics platform, I can assist in getting it setup and access granted for the various docs teams and translators.

I believe going with Plausible is the best option when considering its advantages and disadvantages.

@hugovk
Member Author

hugovk commented Jul 10, 2023

  • It costs resources (energy, server, storage, human time, and network bandwidth for every user, which also has "hidden" costs).

Plausible say:

Lightweight script that keeps your site speed fast

Plausible is lightweight analytics. Our script is 45 times smaller than Google Analytics. Your page weight will be cut down, your site will load faster and you'll reduce your carbon footprint for a greener and more sustainable web. A site with 10,000 monthly visitors can save 4.5 kg of CO2 emissions per year by switching.

And https://plausible.io/lightweight-web-analytics says the https://plausible.io/js/script.js is <1 KB.

Picking a docs page at random, 76.4 KB is transferred for https://docs.python.org/3/library/json.html. 77.4 KB instead should be a negligible difference.
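As a quick check of the "negligible" claim, the relative overhead can be worked out from the figures above. This is a back-of-the-envelope sketch: it takes 1 KB as an assumed upper bound for the script and counts transferred bytes only, not connection setup.

```python
page_kb = 76.4   # transfer size quoted above for library/json.html
script_kb = 1.0  # assumed upper bound, from the "<1 KB" claim

overhead_percent = script_kb / page_kb * 100
print(f"Added transfer: {overhead_percent:.1f}% of the page")
```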


  • I'm happy to see the proposition is not Google Analytics, which is illegal here in France, but if not self-hosted I bet it's still a leak of personal information (IP address) to a 3rd party, which requires the permission of the user: I don't want GDPR popups on d.p.o.

Plausible say of their hosted option:

No need for cookie banners or GDPR consent

Plausible is privacy-friendly analytics. All the site measurement is carried out absolutely anonymously. Cookies are not used and no personal data is collected. There are no persistent identifiers. No cross-site or cross-device tracking either. Your site data is not used for any other purposes. All visitor data is exclusively processed with servers owned and operated by European companies and it never leaves the EU.

Their Data Processing Agreement says (again about their hosted option):

We do not attempt to generate a device-persistent identifier because they are considered personal data under GDPR. We do not use cookies, browser cache nor the local storage. We do not store, retrieve nor extract anything from visitor’s devices. The data we process cannot be used to identify any single individual.

Every single HTTP request sends the IP address and the User-Agent to the server so that’s what we use. We generate a daily changing identifier using the visitor’s IP address and User-Agent. To anonymize these datapoints and make them impossible to relate back to the user, we run them through a hash function with a rotating salt.

hash(daily_salt + website_domain + ip_address + user_agent)

This generates a random string of letters and numbers that is used to calculate unique visitor numbers for the day. The raw data IP address and User-Agent are never stored in our logs, databases or anywhere on disk at all.

Old salts are deleted every 24 hours to avoid the possibility of linking visitor information from one day to the next. Forgetting used salts also removes the possibility of the original IP addresses being revealed in a brute-force attack. The raw IP address and User-Agent are rendered completely inaccessible to anyone, including ourselves.
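To make the quoted scheme concrete, here is a minimal Python sketch of a daily rotating salted hash. The quote does not name the hash function or the salt format, so SHA-256, the 16-byte random salt, and the helper names `new_daily_salt` and `visitor_id` are assumptions for illustration, not Plausible's actual implementation.

```python
import hashlib
import secrets


def new_daily_salt():
    """Generate a fresh random salt; the previous day's salt would be discarded."""
    return secrets.token_hex(16)


def visitor_id(daily_salt, website_domain, ip_address, user_agent):
    """hash(daily_salt + website_domain + ip_address + user_agent), using SHA-256 here."""
    data = (daily_salt + website_domain + ip_address + user_agent).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


# Hypothetical visitor (documentation-range IP address).
today = new_daily_salt()
tomorrow = new_daily_salt()
args = ("docs.python.org", "203.0.113.5", "Mozilla/5.0")

same_day_a = visitor_id(today, *args)
same_day_b = visitor_id(today, *args)
next_day = visitor_id(tomorrow, *args)

print(same_day_a == same_day_b)  # same visitor, same day: identifiers match
print(same_day_a == next_day)    # new salt: identifiers no longer link up
```

Within one day the identifier is stable, so unique visitors can be counted; once the salt is deleted, the same visitor produces an unrelated identifier and the old one cannot be recomputed.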


  • We already have stats about which pages are most frequently visited, via a simple awk|sort|uniq in the logs, and it probably does not change often, so a dump every other year should give us enough visibility.

As @egeakman mentioned, we started off asking Ee if it's possible to get fresh stats out, but unfortunately it's no longer feasible.

Thanks for pointing out the old top 50, that gives a good rough idea, but the docs have changed in the past 7 years and there are new pages. It would also be good to know more than just the top 50 (which includes Python 2 pages which we're no longer interested in).

@JulienPalard
Member

I agree that Plausible seems way way way better than GA, yet it hardly changes my mind.

About the costs for example, OK it's way smaller than GA, yet it's infinitely bigger than no tracking.

The costs are far from negligible: for every page view it's a DNS query, a 3-way TCP handshake, the certificate verification (which costs non-negligible CPU usage), potentially another DNS query if the analytics script, at runtime, hits another domain, ...

Calculating the exact cost is near impossible. The AFNIC (.fr operator) tried to compute the cost of a single DNS query (and answer), and they went far down that road, even trying to compute the amortized cost of network hardware along the path. The conclusion IIRC is: we keep adding things, which forces the network to adapt in the long term (upgrading hardware, which itself has an (amortized) cost), but removing one thing does not allow us to "downgrade"/"reimburse" already installed hardware.

OHHH you know what could change my mind a little bit about the costs? Client-side Bernoulli sampling.

If we could host the client-side script on d.p.o (to avoid a DNS query and a new TCP connection) and make it log only 1 out of n visits, it would divide some of the involved costs by n. Not the human costs though. Put n=10_000 or so and we get non-negligible "gains" vs 1 query per page view (again, infinitely more than not having analytics at all).
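The sampling idea can be sketched as follows. A real version would run as JavaScript in the browser; this is a Python simulation, and the `should_log` and `estimate_views` helpers and the simulated traffic figures are hypothetical.

```python
import random


def should_log(n, rng=random):
    """One Bernoulli trial per page view: True with probability 1/n."""
    return rng.random() < 1.0 / n


def estimate_views(logged_count, n):
    """Scale the sampled count back up to an estimate of total page views."""
    return logged_count * n


# Simulate 100,000 page views with 1-in-100 sampling (made-up numbers).
rng = random.Random(0)  # seeded so the simulation is repeatable
n = 100
true_views = 100_000
logged = sum(should_log(n, rng) for _ in range(true_views))
estimate = estimate_views(logged, n)

print(f"analytics requests sent: {logged}")
print(f"estimated page views: {estimate} (true: {true_views})")
```

The trade-off is precision: rarely visited pages may log no hits at all in a given period, so a large n such as 10_000 mostly preserves the ranking of popular pages rather than exact counts.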

@astrojuanlu

The costs are far from negligible: for every page view it's a DNS query, a 3-way TCP handshake, the certificate verification (which costs non-negligible CPU usage), potentially another DNS query if the analytics script, at runtime, hits another domain, ...

That sounds like a lot, but opening the network inspector in any browser, I bet the cost of downloading just the Python logo is way more than that. When people say "negligible", I suppose they mean "negligible with respect to what's already in there".

@hugovk
Member Author

hugovk commented Jul 11, 2023

Measuring https://hugovk-cpython.readthedocs.io/en/plausible/ which has Plausible's script.js (but the trial there has ended):

  • script.js: 1.3 kB, 8 ms
  • py.svg: 1.6 kB, 20 ms

Here's analysis from PageSpeed Insights (like Lighthouse in Chrome):

https://pagespeed.web.dev/analysis/https-hugovk-cpython-readthedocs-io-en-plausible-library-json-html/wa134xnlve?form_factor=desktop

I think the main problem is the page itself is so big at 2,636 DOM elements (16.3 kB, 33 ms) and not caching our own static assets. No mention of script.js in the various audits.

Looking at a "treemap" of the JavaScript, script.js is one of the lighter files:

[image: JavaScript treemap]

@JelleZijlstra
Member

Seems like this discussion should be happening at https://discuss.python.org/t/docs-metrics-trial-with-plausible/28896?u=hugovk, not here.

@gpshead
Member

gpshead commented Jul 11, 2023

The steering council resolved this internally via chat, we all agree: Go ahead with a Plausible trial for the docs.

@merwok
Member

merwok commented Dec 29, 2023

I think the main problem is the page itself is so big […] and not caching our own static assets.

Can that be discussed in the appropriate tracker? (not sure if it’s sphinx, cpython, psf infra…)
