Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get top 1000 viewed pages per site per month #6

Closed
vipulnaik opened this issue Jun 27, 2019 · 6 comments
Closed

Get top 1000 viewed pages per site per month #6

vipulnaik opened this issue Jun 27, 2019 · 6 comments
Assignees

Comments

@vipulnaik
Copy link
Collaborator

@vipulnaik vipulnaik commented Jun 27, 2019

I would like to have a page that has a table view that can stretch across months, where the rows are pages, the columns are months, and the cells give one of these:

  • Either a pageview count, if the page was in the top 1000 pages in that month
  • Or a not in top 1000 to indicate that we don't have a pageview count (so it's likely small anyways)

We should have a column for total count, and a row for total across the top 1000 pages in a month; the next row can give the grand total in the month so we can see what percentage is covered by the top 1000 pages (most likely it should be 90%+).

We can refrain from getting this data for the current month, so we will fetch the data only for completed months (this will keep use of the analytics api to a minimum, because we will fetch data for each month only once).

@riceissa

This comment has been minimized.

Copy link
Owner

@riceissa riceissa commented Oct 25, 2019

I don't understand why you want to show "not in top 1000" for the low-pageviews pages. If we are querying for these pageviews anyway, why not show the count? The options that make sense to me are either (1) show all pageview counts, even for low-pageviews pages, or (2) only show the top 1000 pages. Here (1) has the advantage of showing all pages, and (2) has the advantage of conserving space.

The only reason I can think of for preferring your approach is if GA allows sorting prior to querying, so that we wouldn't have to "waste queries" on low-pageviews pages, but we would still be able to show users all the pages that exist on the site.

@riceissa

This comment has been minimized.

Copy link
Owner

@riceissa riceissa commented Oct 25, 2019

I'm wondering how to structure this in the database. Currently we store (project, date, pageviews) tuples in a single table. But this doesn't allow us to tell how many pageviews a specific page gets, which is what we need here, so we need to make some modifications to the database structure. The two options that make sense to me are:

  1. We scrap the old table and start storing (project, date, pagepath, pageviews) tuples instead.
  2. We keep the old table, but we add a new table that stores (project, month, pagepath, pageviews) tuples. In addition to ga:date (YYYYMMDD), GA has ga:yearMonth (YYYYMM), so it seems possible to query just the total pageviews for a given month/pagepath/project combination, which would cut down on the number of queries.

Option (1) allows for more flexibility (e.g. if we suddenly care about the number of views some specific page got on some specific day, we would be able to display that). The main concern with (1) is that we might go over GA's query limit, especially when we initially store all the historical data.

Personally, (1) is more elegant, and I've been trying it so far. I haven't run into query limits with the limited number of sites whose GA I have access to, but some of the larger Subwikis (or maybe the sheer number of sites) could get us in trouble with (1).

One idea is to try to do (1), but split querying over multiple days if we go over the quota. Assuming a single month's worth of queries fits within the limit, this would allow us to make the appropriate queries on a single day each month. The only problem is that at the start we may need to split querying over multiple days to fill in the historical data.

Just to give you an idea, GA allows 50,000 queries per day. Row counts for (1) are:

mysql> select project_title,count(*) from pageviews group by project_title;
+---------------------------+----------+
| project_title             | count(*) |
+---------------------------+----------+
| AI Watch                  |     3073 |
| Cause Prioritization Wiki |    12883 |
| Org Watch                 |     1468 |
| Timelines Wiki            |    53112 |
+---------------------------+----------+
4 rows in set (0.04 sec)

(70k total rows)

Timelines Wiki has around 1.3k pages (plus things like revision history, meta pages, discussion pages, etc. that have low pageview counts) and has been around 1k days. This gives an upper bound of 1300k rows, but of course the actual count is only 53k (because not all pages existed at the start and because many pages get 0 views on a given day so aren't recorded in the database).

Groupprops has around 14k pages.

The GA API has a pageSize parameter, which is set to 1000 by default, so the actual number of queries is approximately the number of rows divided by 1000. Apparently this can be increased to 100,000 ("The Analytics Core Reporting API returns a maximum of 100,000 rows per request, no matter how many you ask for"). So as long as the number of sites * average number of pages per site * average number of days of non-zero pageviews per page / 100,000 is below 50,000, it seems like we can query for everything in a single day.

@vipulnaik

This comment has been minimized.

Copy link
Collaborator Author

@vipulnaik vipulnaik commented Oct 27, 2019

So I'm ok with a variant of option 1; I'd still like to keep a smaller table with just the site-level pageviews, but I'm happy to have another larger table at the (page, date) granularity for all pages (and not just limiting to the top 1000). (Basically I don't want the calculation of total pageviews for each day or month to require an expensive summation operation).

@vipulnaik vipulnaik mentioned this issue Nov 2, 2019
4 of 4 tasks complete
riceissa added a commit that referenced this issue Dec 11, 2019
@riceissa

This comment has been minimized.

Copy link
Owner

@riceissa riceissa commented Dec 12, 2019

This is basically done. Here's what it looks like:

Screenshot at 2019-12-11 21-21-33

Notes:

  • I'm still not doing the "not in top 1000" thing, because I don't think it saves any computation, and I think you clarified in chat that the saving computation happens on the main/index page, not this top pages page. This only matters if limit_pagepaths in the URL parameter is set to a value over 1000.
  • You can use some URL parameters e.g. /top-pages.php?project_title=Timelines+Wiki&start_date=2019-03-01&end_date=2019-10-31&limit_pagepaths=10
  • You said in OP "We can refrain from getting this data for the current month, so we will fetch the data only for completed months (this will keep use of the analytics api to a minimum, because we will fetch data for each month only once)." But currently the same fetch_pageviews.py script does the fetching in the same way as the main/index page data (i.e. it checks for the most recent pageviews data per project, then fetches everything new until 4 days ago). I'm not sure what GA API savings you're thinking of.
  • Everything works locally for me, but I can't guarantee everything will work for you, since you will be doing more queries to GA and will insert a larger number of rows. I'm particularly wondering about Groupprops. If something does fail here, we can split queries over multiple days or something.

Things I'd like a response on:

  1. What do you think of not doing the "not in top 1000" thing?
  2. Elaborate on "this will keep use of the analytics api to a minimum"?
  3. What do you think of potential privacy issues, since pagepaths can sometimes contain sensitive information?
@vipulnaik

This comment has been minimized.

Copy link
Collaborator Author

@vipulnaik vipulnaik commented Dec 13, 2019

  1. What do you think of not doing the "not in top 1000" thing?
    I'm fine with not doing it -- you've convinced me.
  2. I thought getting data at monthly granularity would reduce the number of API calls. But it looks like this is a non-issue for most sites, and we'll probably figure something out for Groupprops. So I'm ok with going daily.
  3. Yes, so I don't think that applies to any of the sites currently in the list, but I'll check the list for each site.
    • It may be good to add a boolean flag in the SQL table listing the sites that can turn the computation of top pages on/off for sensitivity reasons.
    • Alternatively or additionally, we can modify "grouping/filtering" logic to address the page paths.

For Groupprops, yes, let's try and see how it goes.

@vipulnaik vipulnaik mentioned this issue Dec 15, 2019
8 of 8 tasks complete
@vipulnaik

This comment has been minimized.

Copy link
Collaborator Author

@vipulnaik vipulnaik commented Dec 15, 2019

LGTM; it's all working, thanks @riceissa!

@vipulnaik vipulnaik closed this Dec 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.