riceissa / analytics-table Public
Get top 1000 viewed pages per site per month #6
Comments
I don't understand why you want to show "not in top 1000" for the low-pageviews pages. If we are querying for these pageviews anyway, why not show the count? The options that make sense to me are either (1) show all pageview counts, even for low-pageviews pages, or (2) only show the top 1000 pages. Here (1) has the advantage of showing all pages, and (2) has the advantage of conserving space. The only reason I can think of for preferring your approach is if GA allows sorting prior to querying, so that we wouldn't have to "waste queries" on low-pageviews pages, but we would still be able to show users all the pages that exist on the site.
I'm wondering how to structure this in the database. Currently we store (project, date, pageviews) tuples in a single table. But this doesn't allow us to tell how many pageviews a specific page gets, which is what we need here, so we need to make some modifications to the database structure. The two options that make sense to me are:
Option (1) allows for more flexibility (e.g. if we suddenly care about the number of views some specific page got on some specific day, we would be able to display that). The main concern with (1) is that we might go over GA's query limit, especially when we initially store all the historical data. Personally, I find (1) more elegant, and it's what I've been trying so far. I haven't run into query limits with the limited number of sites whose GA I have access to, but some of the larger Subwikis (or maybe the sheer number of sites) could get us in trouble with (1). One idea is to try to do (1), but split querying over multiple days if we go over the quota. Assuming a single month's worth of queries fits within the limit, this would allow us to make the appropriate queries on a single day each month. The only problem is that at the start we may need to split querying over multiple days to fill in the historical data. Just to give you an idea, GA allows 50,000 queries per day. Row counts for (1) are:
(70k total rows) Timelines Wiki has around 1.3k pages (plus things like revision history, meta pages, discussion pages, etc. that have low pageview counts) and has been around for about 1k days. This gives an upper bound of 1300k rows (1.3k pages × 1k days), but the actual count is only 53k, because not all pages existed at the start and because many pages get 0 views on a given day and so aren't recorded in the database. Groupprops has around 14k pages. The GA API has a
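The idea of splitting the historical backfill over multiple days when the quota would be exceeded can be sketched roughly as below. This is only an illustration: the 50,000 figure is the one quoted above, while `split_into_daily_batches` and the mapping of one query per (site, day) are hypothetical and not taken from the repository.

```python
from typing import Iterable, List

DAILY_QUERY_LIMIT = 50_000  # per-day figure quoted in the comment above


def split_into_daily_batches(
    queries: Iterable[str], limit: int = DAILY_QUERY_LIMIT
) -> List[List[str]]:
    """Group pending queries into batches, one batch per day, none above the limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    pending = list(queries)
    return [pending[i:i + limit] for i in range(0, len(pending), limit)]


# Hypothetical backfill: one query per (site, day) for 3 sites over 5 days,
# with an artificially small limit of 4 so the batching is visible.
pending = [f"site{s}:day{d}" for s in range(3) for d in range(5)]
batches = split_into_daily_batches(pending, limit=4)
```

Each inner list would be run on a separate day; once the backfill is done, a single month's worth of queries would ideally fit in one batch.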
So I'm OK with a variant of option 1; I'd still like to keep a smaller table with just the site-level pageviews, but I'm happy to have another larger table at the (page, date) granularity for all pages (and not just limiting to the top 1000). (Basically, I don't want the calculation of total pageviews for each day or month to require an expensive summation operation.)
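A minimal sketch of that two-table layout, using Python's built-in `sqlite3` purely for illustration. The table names `site_pageviews` and `page_pageviews`, the column names, and the sample rows are assumptions, not the project's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Small table: one row per (project, date), as in the existing design.
    CREATE TABLE site_pageviews (
        project   TEXT NOT NULL,
        date      TEXT NOT NULL,
        pageviews INTEGER NOT NULL,
        PRIMARY KEY (project, date)
    );
    -- Larger table: one row per (project, page, date) with nonzero views.
    CREATE TABLE page_pageviews (
        project   TEXT NOT NULL,
        page      TEXT NOT NULL,
        date      TEXT NOT NULL,
        pageviews INTEGER NOT NULL,
        PRIMARY KEY (project, page, date)
    );
    """
)

# Made-up sample rows for a hypothetical project.
conn.executemany(
    "INSERT INTO page_pageviews VALUES (?, ?, ?, ?)",
    [
        ("examplewiki", "/Main_Page", "2018-01-01", 100),
        ("examplewiki", "/Other", "2018-01-01", 25),
    ],
)
conn.execute(
    "INSERT INTO site_pageviews VALUES (?, ?, ?)",
    ("examplewiki", "2018-01-01", 130),
)


def site_total(conn: sqlite3.Connection, project: str, date: str) -> int:
    """Read the site-level count directly, without summing page rows."""
    row = conn.execute(
        "SELECT pageviews FROM site_pageviews WHERE project = ? AND date = ?",
        (project, date),
    ).fetchone()
    return row[0] if row else 0


def page_total(conn: sqlite3.Connection, project: str, page: str, date: str) -> int:
    """Look up a single page's count for a single day."""
    row = conn.execute(
        "SELECT pageviews FROM page_pageviews"
        " WHERE project = ? AND page = ? AND date = ?",
        (project, page, date),
    ).fetchone()
    return row[0] if row else 0
```

Keeping the site-level row separate means the daily total is a primary-key lookup rather than a `SUM` over every page row for that day.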
This is basically done. Here's what it looks like: Notes:
Things I'd like a response on:
For Groupprops, yes, let's try and see how it goes.
LGTM; it's all working, thanks @riceissa!
I would like to have a page that has a table view that can stretch across months, where the rows are pages, the columns are months, and the cells give one of these:
not in top 1000
to indicate that we don't have a pageview count (so it's likely small anyway).
We should have a column for total count, and a row for total across the top 1000 pages in a month; the next row can give the grand total in the month so we can see what percentage is covered by the top 1000 pages (most likely it should be 90%+).
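As a rough sketch of the table described above: rows are pages, columns are months, missing cells read "not in top 1000", and two summary rows give the top-pages total and the grand total. The function names, input shapes, and numbers are hypothetical and chosen only to illustrate the layout.

```python
from typing import Dict, List

NOT_IN_TOP = "not in top 1000"


def build_table(
    top_pages: Dict[str, Dict[str, int]], grand_totals: Dict[str, int]
) -> List[List[str]]:
    """top_pages maps month -> {page: views}; grand_totals maps month -> site total."""
    months = sorted(top_pages)
    pages = sorted({p for counts in top_pages.values() for p in counts})
    rows = [["page"] + months + ["total"]]
    for page in pages:
        cells = [top_pages[m].get(page) for m in months]
        total = sum(c for c in cells if c is not None)
        rows.append(
            [page]
            + [NOT_IN_TOP if c is None else str(c) for c in cells]
            + [str(total)]
        )
    top_totals = [sum(top_pages[m].values()) for m in months]
    grand = [grand_totals[m] for m in months]
    rows.append(["top pages total"] + [str(t) for t in top_totals] + [str(sum(top_totals))])
    rows.append(["grand total"] + [str(g) for g in grand] + [str(sum(grand))])
    return rows


def coverage_percent(
    top_pages: Dict[str, Dict[str, int]], grand_totals: Dict[str, int], month: str
) -> float:
    """Share of the month's pageviews that the listed top pages account for."""
    if grand_totals[month] == 0:
        return 0.0
    return 100.0 * sum(top_pages[month].values()) / grand_totals[month]


# Made-up numbers for two months.
top_pages = {"2018-01": {"/A": 90, "/B": 5}, "2018-02": {"/A": 80}}
grand_totals = {"2018-01": 100, "2018-02": 100}
table = build_table(top_pages, grand_totals)
```

Comparing the last two rows of `table` month by month gives the coverage figure the issue expects to be 90% or more.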
We can refrain from getting this data for the current month, so we will fetch the data only for completed months (this will keep use of the analytics api to a minimum, because we will fetch data for each month only once).
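One way to express "fetch only completed months, and each month only once" is sketched below. The helper names and the idea of tracking already-fetched months as a set of `YYYY-MM` strings are assumptions for illustration, not the project's implementation.

```python
import datetime
from typing import List, Set


def completed_months(first: datetime.date, today: datetime.date) -> List[str]:
    """List months as YYYY-MM from the month of `first` up to, but excluding, the current month."""
    months = []
    year, month = first.year, first.month
    while (year, month) < (today.year, today.month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def months_to_fetch(
    first: datetime.date, today: datetime.date, already_fetched: Set[str]
) -> List[str]:
    """Completed months that have not been fetched yet."""
    return [m for m in completed_months(first, today) if m not in already_fetched]


# Hypothetical dates: data starts mid-November 2017, "today" is early February 2018.
start = datetime.date(2017, 11, 15)
now = datetime.date(2018, 2, 3)
todo = months_to_fetch(start, now, already_fetched={"2017-11"})
```

Because the current month is always excluded, rerunning the job later in the same month would issue no new requests.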