Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose not to override standard argument while defining a non-NULL first_date #87

Closed
caijun opened this issue Dec 9, 2018 · 6 comments · Fixed by #108
Closed

Propose not to override standard argument while defining a non-NULL first_date #87

caijun opened this issue Dec 9, 2018 · 6 comments · Fixed by #108
Assignees

Comments

@caijun
Copy link
Collaborator

caijun commented Dec 9, 2018

These days I am updating my analysis code for 2009 influenza A(H1N1) pandemic by using incidence package. When calculating ISO week based epidemic curve, I specified the starting date and ending date to extend the timespan. My command is as follows

> range(pdm$diagnose)
[1] "2009-05-10" "2010-04-30"
> pdm.wec <- incidence(pdm$diagnose, interval = "week", standard = TRUE, 
+                       first_date = as.Date("2009-05-01"), 
+                       last_date = as.Date("2010-04-30")) %>% 
+   as.data.frame()
> head(pdm.wec)
       dates isoweeks counts
1 2009-05-01 2009-W18      0
2 2009-05-08 2009-W19      2
3 2009-05-15 2009-W20      3
4 2009-05-22 2009-W21     13
5 2009-05-29 2009-W22     46
6 2009-06-05 2009-W23     66

I am confused by the returned incidence data.frame. Taking the second row for example, according to the dates column, 2 cases were observed during a week (7 days) from 2009-05-08 to 2009-05-14. However, according to the isoweeks column, 2 cases were counted during the ISO week 2009-W19, which was from 2009-05-04 to 2009-05-10. I looked at my data, only 1 case occurred from 2009-05-04 to 2009-05-10.

Without specifying the starting date and ending date, the returned incidence data.frame is what I expected.

> pdm.wec3 <- incidence(pdm$diagnose, interval = 7) %>% 
+   as.data.frame()
> head(pdm.wec3)
       dates isoweeks counts
1 2009-05-04 2009-W19      1
2 2009-05-11 2009-W20      2
3 2009-05-18 2009-W21      8
4 2009-05-25 2009-W22     22
5 2009-06-01 2009-W23     54
6 2009-06-08 2009-W24    127

I read the help file of the standard argument and found this is because this argument is overridden by defining a non-NULL first_date. Therefore, I suggest that when first_date and/or last_date are non-NULL, the standard argument is not overridden, and the dates column still stores the beginning of standard weeks. In this case, first_date and last_date are only used for extending or reducing the timespan. Of course, the range of standard weeks usually is not the same as the range specified by first_date and/or last_date, but we can echo a message to remind the users.

@zkamvar
Copy link
Member

zkamvar commented Dec 9, 2018

Hi Jun,

Can you please post a reproducible example? I can't really diagnose anything until this is done. Also, what happens if you specify first_date = "2009-04-27", the first date of 2019-W18?

Therefore, I suggest that when first_date and/or last_date are non-NULL, the standard argument is not overridden, and the dates column still stores the the beginning of standard weeks.

My thought process when making the decision is that if the user specifies a first_date, then that means they want that specific date to be the beginning. Because there are many ways for the first date to be non-standard, it made sense that, because the user has to put in effort to include the first date, then they know exactly where they want it to start and the standard argument should be mutually exclusive to first_date. Of course, this leads to the issue you are bringing up where you get unexpected results because you thought they weren't mutually exclusive.

On the other hand, if we were to make the default that standard and first_date are not mutually exclusive, then this would create the situation where if the user specifies first_date and doesn't realise that standard = TRUE, they end up with an unexpected situation.

Unfortunately, changing behavior like this is backwards-incompatible. What if we printed a message telling the user the first date they should use to achieve the equivalent to standard specification?

Warning:

`first_date` overrides `standard`. If you wish to have intervals according to the ISO standard for weeks, set `first_date` to "2009-04-27"

incidence(pdm$diagnose, interval = "week", standard = TRUE,  first_date = as.Date("2009-04-27"), last_date = as.Date("2010-04-30"))

@zkamvar
Copy link
Member

zkamvar commented Dec 9, 2018

As a side note: I will not be able to add anything new to incidence in the next month. I will be away for a family emergency in Korea.

@caijun
Copy link
Collaborator Author

caijun commented Dec 9, 2018

The code and data for reproducing my issue is shown below.

rm(list = ls())

load("pandemic.rda")

library(tidyverse)
library(incidence)

range(pdm$diagnose)

pdm.wec <- incidence(pdm$diagnose, interval = "week", standard = TRUE, 
                     first_date = as.Date("2009-05-01"), 
                     last_date = as.Date("2010-04-30")) %>% 
  as.data.frame()
head(pdm.wec)

pdm.wec1 <- incidence(pdm$diagnose, interval = "week", standard = TRUE, 
                     first_date = as.Date("2009-04-27"), 
                     last_date = as.Date("2010-04-30")) %>% 
  as.data.frame()
head(pdm.wec1)


pdm.wec2 <- incidence(pdm$diagnose, interval = "week", standard = FALSE, 
                     first_date = as.Date("2009-05-01"), 
                     last_date = as.Date("2010-04-30")) %>% 
  as.data.frame()
head(pdm.wec2)

pandemic.rda.zip

if the user specifies a first_date, then that means they want that specific date to be the beginning. Because there are many ways for the first date to be non-standard, it made sense that, because the user has to put in effort to include the first date, then they know exactly where they want it to start and the standard argument should be mutually exclusive to first_date.

For this usage, I think it is exactly the pdm.wec2 case with setting standard to FALSE and specifying the first date of interest.

if you specify first_date = "2009-04-27", the first date of 2019-W18?

Specifying first_date = "2009-04-27" produced incidence data.frame as I expected. But this way is not user friendly as the user has to know the starting week (2019-W18) and the first date of the starting week (2009-04-27). In practice, the user usually only know the date of the first case and the approximate starting date of interest.

if we were to make the default that standard and first_date are not mutually exclusive, then this would create the situation where if the user specifies first_date and doesn't realise that standard = TRUE, they end up with an unexpected situation.

For this unexpected situation, we can print a message to tell the user that the returning timespan is not the same as the user's input, and suggest the user to set standard = FALSE to get the exact timespan as the user input. The benefit of this approach is to keep the returning dates and isoweeks consistent regardless of specifying first_date. Otherwise, the user is likely to be confused.

@zkamvar
Copy link
Member

zkamvar commented Dec 9, 2018

Otherwise, the user is likely to be confused.

The user is likely to be confused in either situation. The problem with switching the behavior here is that it will break backwards compatibility.

But this way is not user friendly as the user has to know the starting week (2019-W18) and the first date of the starting week (2009-04-27). In practice, the user usually only know the date of the first case and the approximate starting date of interest.

I agree that it's not ideal to force the user to figure out what date would be the start of any given week; but for months, quarters, and years, the user should know what the start is, so it's really the week specification that is the weak point here, which is why I'm hesitant to change the default behavior.

What I would offer is to add two things:

  1. a message warning users about the standard as shown in my previous reply
  2. a helper function that would shift the date to the first day of the week maybe it could be called get_first_wday() or something like that. This way, it could be used like so:
    incidence(pdm$diagnose, first_date = get_first_wday("2009-05-01"))

We could also allow for the specification of the first isoweek for the first date (e.g. first_date = "2009-W18") and have a wrapper function that returns the isoweek (e.g. get_isoweek("2009-05-01"))

Again, I'm not going to be able to work on this for another month.

@caijun
Copy link
Collaborator Author

caijun commented Dec 10, 2018

Last night I couldn't sleep and thought this issue again. I can't help discussing it a bit more. Here the design is against principles of software engineering that I learned.

  1. Overriding an argument with another one is not an appropriate way, although we have encountered this design in some functions. The control of an argument is interfered by others, which will make the usage of the function complicated and confusing to users.

  2. To keep the mapping between dates and isoweeks columns consistent regardless of specifying first_date is important, because this mapping may be used in the user's subsequent analysis or future functions.

  3. If the user would like to emphasize the starting date, virtually incidence() provides an alternative by setting standard = FALSE. On the other side, turning standard to TRUE means the users would like to first obtain standard weeks.

  4. The solution you offered is workable, but it is not elegant and leave too much stuff to users. Why not put the wrapper function inside incidence()? In fact, the extra wrapper function is not necessary since without specifying first_date, incidence() returns dates and isoweeks columns as expected. If because "it will break backwards compatibility", I think as developers, it is our responsibility to resolve this problem, and to make the package easy and friendly to users.

It is not urgent to fix this issue. Currently I am writing my PhD thesis, and I have no time to resolve it. After I submit my thesis, I would fix it if it is still unresolved.

@zkamvar
Copy link
Member

zkamvar commented Dec 10, 2018

Hi Jun,

I'm sorry to hear that you couldn't sleep. I hope you get some rest soon.

Thinking this over again, the arguments you lay out make sense. Moreover, the change would be conceptually consistent with the default behavior of standardizing from the first observed date if first_date is not specified.

The control structure that defines the behavior is here:

if (isTRUE(dots$standard) && null_first_date) {

Here are the foreseeable tasks:

  • remove null_first_date from that control structure
  • add function that will check if the user specified both first_date and standard, and issue a message that the date has been interpreted as its standard form, printing advice to set standard = FALSE if this was not expected
  • add global option in zzz.R called "incidence.warn.first_date" that will allow users to turn off the message at the top of their script with options(incidence.warn.first_date = FALSE)

zkamvar added a commit that referenced this issue Mar 5, 2019
@zkamvar zkamvar self-assigned this Mar 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants