Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to not count downloads < 1000 bytes #45

Open
lindbrook opened this issue Oct 25, 2019 · 9 comments
Open

Option to not count downloads < 1000 bytes #45

lindbrook opened this issue Oct 25, 2019 · 9 comments

Comments

@lindbrook
Copy link
Contributor

To compute the number of package downloads for a given day, cranlogs::cran_downloads() counts the number of entries (number of rows) for that package in CRAN's download logs. Would it be possible to add an optional argument so that observations with sizes less than 1000 bytes do not count toward the number of package downloads?

Two reasons. First, it's hard to say that such observations really represent a package download. Second, while much of this may just be unsuccessful/aborted downloads, I think that some of this is more than random noise.

Using 2019-10-23 as an example, here's what I found.

My back-of-envelope estimate is that around 5% (233,722 / 5,097,912) of all downloads on 2019-10-23 was smaller than 1000 bytes, typically around 500 bytes.

Here's an example. On 2019-10-23 'rstan' was downloaded 2,574 times.

> cranlogs::cran_downloads("rstan", from = "2019-10-23", to = "2019-10-23")
date count package
1 2019-10-23 2574 rstan

But if you look at the logs (RStudio's CRAN mirror at http://cran-logs.rstudio.com), you'll see that there are 40 entries smaller than 1000 bytes:

date time size package version country ip_id
1438000 2019-10-23 19:49:15 531 rstan 2.11.1 US 7
1438001 2019-10-23 19:49:15 537 rstan 2.12.1 US 7
1438002 2019-10-23 19:49:15 537 rstan 2.13.2 US 7
1438003 2019-10-23 19:49:15 531 rstan 2.14.1 US 7
1438004 2019-10-23 19:49:15 537 rstan 2.14.2 US 7
1438005 2019-10-23 19:49:15 537 rstan 2.15.1 US 7
1438006 2019-10-23 19:49:15 531 rstan 2.16.2 US 7
1438007 2019-10-23 19:49:15 537 rstan 2.17.2 US 7
1438008 2019-10-23 19:49:15 531 rstan 2.17.3 US 7
1438009 2019-10-23 19:49:15 537 rstan 2.17.4 US 7
1438010 2019-10-23 19:49:15 537 rstan 2.18.1 US 7
1438011 2019-10-23 19:49:15 537 rstan 2.18.2 US 7
1438012 2019-10-23 19:49:15 533 rstan 2.7.0-1 US 7
1438013 2019-10-23 19:49:15 533 rstan 2.8.0 US 7
1438014 2019-10-23 19:49:15 533 rstan 2.8.1 US 7
1438015 2019-10-23 19:49:15 539 rstan 2.8.2 US 7
1438016 2019-10-23 19:49:15 539 rstan 2.9.0-3 US 7
1438017 2019-10-23 19:49:15 539 rstan 2.9.0 US 7
1438121 2019-10-23 19:49:14 537 rstan 2.10.1 US 7
3607030 2019-10-23 10:59:51 534 rstan 2.19.2 <NA> 5
3702500 2019-10-23 20:16:37 537 rstan 2.10.1 US 7
3702501 2019-10-23 20:16:38 537 rstan 2.11.1 US 7
3702502 2019-10-23 20:16:38 537 rstan 2.12.1 US 7
3702503 2019-10-23 20:16:38 531 rstan 2.13.2 US 7
3702504 2019-10-23 20:16:38 531 rstan 2.14.1 US 7
3702505 2019-10-23 20:16:38 537 rstan 2.14.2 US 7
3702506 2019-10-23 20:16:38 537 rstan 2.15.1 US 7
3702507 2019-10-23 20:16:38 531 rstan 2.16.2 US 7
3702508 2019-10-23 20:16:38 537 rstan 2.17.2 US 7
3702509 2019-10-23 20:16:38 537 rstan 2.17.3 US 7
3702510 2019-10-23 20:16:38 531 rstan 2.17.4 US 7
3702511 2019-10-23 20:16:38 531 rstan 2.18.1 US 7
3702512 2019-10-23 20:16:38 537 rstan 2.18.2 US 7
3702513 2019-10-23 20:16:38 539 rstan 2.7.0-1 US 7
3702514 2019-10-23 20:16:39 539 rstan 2.8.0 US 7
3702515 2019-10-23 20:16:39 539 rstan 2.8.1 US 7
3702516 2019-10-23 20:16:39 539 rstan 2.8.2 US 7
3702517 2019-10-23 20:16:39 539 rstan 2.9.0-3 US 7
3702518 2019-10-23 20:16:39 533 rstan 2.9.0 US 7
4186380 2019-10-23 18:29:29 530 rstan 2.19.2 US 7

For what it's worth, here's the code for the above log data using packageLog() from the development version of 'packageRank' (v0.3.0.9000) on https://github.com/lindbrook/packageRank

rstan.log <- packageRank::packageLog("rstan", "2019-10-23")
vars <- c("date", "time", "size", "package", "version", "country", "ip_id")
rstan.log[rstan.log$size < 1000, vars]

While 40 of 2574 downloads is small, percentage-wise (1.6%), you'll see that the overwhelming majority of these observations comes from a single IP address that is "downloading" different (possible all) versions of 'rstan'.

While people may, of course, be interested in previous versions of a package and while many people are using network address translation (NAT), this kind of activity is not an isolated event. You'll find it across many packages intermittently throughout the month.

It even extends to "archived" packages (those that are not included on CRAN's main listing). For example, we see that 'bim' was downloaded 12 times:

> cranlogs::cran_downloads("bim", from = "2019-10-23", to = "2019-10-23")
date count package
1 2019-10-23 12 bim

But all 12 of those downloads looks like this:

date time size package version country ip_id
1746102 2019-10-23 19:39:06 539 bim 0.92-3 US 7
1746103 2019-10-23 19:39:06 533 bim 0.93-1 US 7
1746105 2019-10-23 19:39:06 537 bim 1.01-1 US 7
1746107 2019-10-23 19:39:06 537 bim 1.01-3 US 7
1746108 2019-10-23 19:39:06 537 bim 1.01-4 US 7
1746109 2019-10-23 19:39:06 537 bim 1.01-5 US 7
1937591 2019-10-23 19:12:29 533 bim 0.92-3 US 7
1937592 2019-10-23 19:12:29 539 bim 0.93-1 US 7
1937593 2019-10-23 19:12:29 537 bim 1.01-1 US 7
1937594 2019-10-23 19:12:29 537 bim 1.01-3 US 7
1937595 2019-10-23 19:12:29 537 bim 1.01-4 US 7
1937596 2019-10-23 19:12:29 537 bim 1.01-5 US 7

bim.log <- packageRank::packageLog("bim", "2019-10-23")
vars <- c("date", "time", "size", "package", "version", "country", "ip_id")
bim.log[, vars]

In this case, it's hard to say that this package was really downloaded.

@gaborcsardi
Copy link
Contributor

It turns out that these are HTTP HEAD requests, from an lftp client. I guess it is some feature of lftp. I'll try to update the original script that creates the log files on cran-logs.rstudio.com, to filter these out.

We can think about cleaning up the current DB retrospectively. I guess it would make sense, but it will also change the reports generated in the past, etc.

@gaborcsardi
Copy link
Contributor

It seems that I cannot fix this in the original script. I'll filter them out in my DB update script then. The smallest CRAN package tarball is 1189 bytes currently, and I doubt that it is practically possible to create a smaller one, so 1000 bytes seems like a good limit. Filter must come somewhere here: https://github.com/r-hub/cranlogs.app/blob/4ee2355d0739abb4c4d273e4e891a3a01a6165bf/db/update.sh#L30

@lindbrook
Copy link
Contributor Author

If possible, it might be better to leave the logs as "raw" as possible for those who might be interested, and just update/amend the DB. Since I've put in some time on this, if you want, I'm willing to write up my findings as a note/vignette to explain the discrepancy. Essentially, what's happening is that beyond the random HTTP requests to individual packages, each Wednesday (+ additional days) someone is making requests to all packages on CRAN (all active current versions and all archived inactive past versions).

Let me just check if I need to original DB data before you update it.

@gaborcsardi
Copy link
Contributor

If possible, it might be better to leave the logs as "raw" as possible for those who might be interested

However without the additional information that these are HEAD requests and not downloads, those log entries are misleading.

I don't think this needs any writeup, especially because the situation may change any time. I think this issue is a good place to record the issue.

Let me just check if I need to original DB data before you update it.

It will not happen overnight, I'll probably migrate the service before this, so no worries just yet.

@lindbrook
Copy link
Contributor Author

I'm fine with filtering out these records. But I am curious. Besides HEAD requests, why else would we see these small "downloads"? Are aborted "downloads" in the log?

For what it's worth, since yesterday was a Wednesday, here's the top of the filtered log for cranlogs:

vars <- c("date", "time", "size", "package", "version", "country", "ip_id")

log.filtered <- packageRank::packageLog("cranlogs", filter = TRUE)
head(log.filtered[, vars])

date time size package version country ip_id
246036 2019-11-13 13:26:43 27060 cranlogs 2.1.1 US 14760
325235 2019-11-13 19:29:32 24417 cranlogs 2.1.1 US 18769
340204 2019-11-13 18:08:12 24417 cranlogs 2.1.1 US 572
872939 2019-11-13 20:31:34 26972 cranlogs 2.1.1 FR 34238
928117 2019-11-13 14:00:44 24287 cranlogs 2.1.1 TR 3298
963961 2019-11-13 09:09:16 27163 cranlogs 2.1.1 JP 42282

And all 7 of the records that are filtered out:

log.audit <- packageRank::packageLog("cranlogs", filter = -1000)
log.audit[, vars]

date time size package version country ip_id
714794 2019-11-13 20:22:52 527 cranlogs 2.0.0 US 18559
714795 2019-11-13 20:22:52 528 cranlogs 2.1.0 US 18559
1459096 2019-11-13 19:59:32 527 cranlogs 2.0.0 US 18559
1459097 2019-11-13 19:59:32 528 cranlogs 2.1.0 US 18559
2123148 2019-11-13 08:39:09 0 cranlogs 2.1.1 US 2499
3988585 2019-11-13 07:41:52 529 cranlogs 2.1.1 DE 27828
4143146 2019-11-13 19:22:07 528 cranlogs 2.1.1 US 18559

@gaborcsardi
Copy link
Contributor

But I am curious. Besides HEAD requests, why else would we see these small "downloads"? Are aborted "downloads" in the log?

Every request is in the log I think, so yes. But aborted downloads are rare. The reasons I can think of:

  1. HEAD request
  2. request with If-modified-*, that results a 304 response
  3. aborted downloads, but even these probably send more than ~500 bytes, as far as the web server is concerned.

I think 3. is probably very rare. 1. should be filtered out, ideally, but 2. is a proper download (attempt).

@lindbrook
Copy link
Contributor Author

Would you happen to know if the lftp client is part of R and/or RStudio?

@gaborcsardi
Copy link
Contributor

I don't think so.

@lindbrook
Copy link
Contributor Author

A little more digging...

~500 byte downloads manifest themselves in three ways.

  1. as standalone entries:

log <- packageRank::packageLog("cholera", "2020-06-03")
log[log$time == "18:48:17", ]

date time size r_version r_arch r_os package version country
24 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.2.1 US
25 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.3.0 US
26 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.4.0 US
27 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.5.0 US
28 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.5.1 US
29 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.6.0 US
30 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.6.5 US
ip_id
24 4724
25 4724
26 4724
27 4724
28 4724
29 4724
30 4724

  1. as part of a pair (same machine, same time stamp): a ~500 byte entry plus a full download.

log <- packageRank::packageLog("sp", "2020-05-06")
log[log$time == "08:14:34", ]

date time size r_version r_arch r_os package version country
4263 2020-05-06 08:14:34 367 <NA> <NA> <NA> sp 1.4-1 <NA>
4264 2020-05-06 08:14:34 1134627 <NA> <NA> <NA> sp 1.4-1 <NA>
ip_id
4263 6
4264 6

  1. as part of a triplet (same machine, same time stamp). For packages >= 1 MB in size, it seems you'll see a ~500 byte download, a full download, and an intermediate-sized download:

log <- packageLog("cholera", "2020-02-02")
log[log$time == "10:19:17", ]

date time size r_version r_arch r_os package version country
5 2020-02-02 10:19:17 4147305 <NA> <NA> <NA> cholera 0.7.0 US
6 2020-02-02 10:19:17 34821 <NA> <NA> <NA> cholera 0.7.0 US
7 2020-02-02 10:19:17 539 <NA> <NA> <NA> cholera 0.7.0 US
ip_id
5 1047
6 1047
7 1047

For packages < 1 MB, it seems you'll see a ~500 byte download plus two "full" downloads:

log <- packageLog("cranlogs", "2020-05-06")
log[log$time == "09:03:46", ]

date time size r_version r_arch r_os package version country
19 2020-05-06 09:03:46 529 <NA> <NA> <NA> cranlogs 2.1.1 US
20 2020-05-06 09:03:46 24273 <NA> <NA> <NA> cranlogs 2.1.1 US
21 2020-05-06 09:03:46 24280 <NA> <NA> <NA> cranlogs 2.1.1 US
ip_id
19 3678
20 3678
21 3678

While we can easily correct for the first two cases by simply filtering out entries smaller than say 1000 bytes, the variability in size of the intermediate download makes triplets harder to deal with. That said, I'm wondering if you'd consider implementing a fix for them.

Here are three reasons why. First, we're essentially counting one download as three. Second, their frequency in the logs seems to be increasing. Third, because it is a somewhat computationally intensive task, it makes more sense to deal with it on the back-end than as part of a function on the user end.

For what it's worth, to get a ballpark empirical sense of the problem I wrote some prototype R code and done some preliminary analysis. If interested, I can post it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants