Unknown bug arises from using cache = TRUE & data.table #1457

Closed
MichaelChirico opened this issue Nov 13, 2017 · 4 comments

MichaelChirico commented Nov 13, 2017

I'm going crazy from this; apologies in advance for not being able to provide a reproducible example -- my data set is quite large (hence the need for cache).

I have the following code in a chunk:

DT[DT[ , .(w = sum(V1)),
              by = .(id1, id2)
              ][ , .(m = median(w)), by = id1
                 ][ , .(id1, h =
                          cut(m, breaks = break_vals,
                              include.lowest = TRUE, right = FALSE))],
      .(id1, id3, h, V1), on = .(id1)
      # there are about 1,000,000 rows in the result here
      ][ , median(V1), by = h]

When I run it from a fresh session, there's no problem. And when I load the data into memory, it runs fine, as well.

However, when the chunk that creates DT is run with cache = TRUE, this chunk errors:

Error in gmedian(week_hrs) : negative length vectors are not allowed
Calls: ... eval -> eval -> [ -> [.data.table -> gforce -> gmedian

gmedian is an internal data.table function. The fact that [.data.table was dispatched suggests this is not the usual problem with a data.table restored from a binary file (e.g., via load), where the object needs setDT before use. In fact, if I put everything before the last median call in a chunk just before this one, it runs completely fine:

cat(capture.output({
  DT[DT[ , .(w = sum(V1)),
         by = .(id1, id2)
         ][ , .(m = median(w)), by = id1
            ][ , .(id1, h =
                     cut(m, breaks = break_vals,
                         include.lowest = TRUE, right = FALSE))],
     .(id1, id3, h, V1), on = .(id1)]
}), sep = '\n')
# to prevent it running the erroneous chunk
stop()

This confirms that DT is still being treated as a data.table and that GForce is being dispatched correctly up to that point (the m = median(w) line dispatches GForce).
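For readers unfamiliar with the load/setDT point mentioned above, here is a minimal sketch of that pattern. It is only an illustration with made-up toy data (the column values, the temporary file, and the name DT2 are all assumptions, not part of the original report), and it does not reproduce the error described in this issue:

```r
library(data.table)

# Toy stand-in for the real data (hypothetical values).
DT <- data.table(id1 = 1:3, V1 = c(1.5, 2.5, 3.5))

# Round-trip through a binary file, similar in spirit to restoring a cached object.
tmp <- tempfile(fileext = ".rds")
saveRDS(DT, tmp)
DT2 <- readRDS(tmp)

# The class survives the round trip, but data.table's internal column
# over-allocation does not; setDT() restores it by reference.
setDT(DT2)

# Adding a column by reference now behaves as for a freshly created data.table.
DT2[ , new := V1 * 2]
```

Checking is.data.table(DT2) and truelength(DT2) before and after setDT() is one way to see whether a restored object is in this state.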

Further, if I replace the real V1 with V1 = rnorm(.N) to generate anonymized data to share here, the code does not reliably error (extra surprising, since I assumed the error was related to h being a factor).

That's as far as I've gotten... it certainly seems like a bug somewhere. This is a pain because the cached chunk takes about 20 minutes to run -- a perfect use case for only re-running it occasionally when the underlying code changes. But as it stands, this isn't feasible.


yihui commented Nov 13, 2017

You don't need real data to create a reproducible example. Use simulated / random data instead.

Since the same error has occurred without knitr (e.g. Rdatatable/data.table#2046), I doubt it is really a knitr issue. In any case, without a reproducible example, I cannot do anything about it (i.e. I cannot fix an issue by guessing, especially when the error comes from another package that I don't maintain). Sorry.
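As a starting point for the simulated-data approach suggested here, the following sketch builds a random table with the same column names as the reported code and runs the same pipeline on it. The table size, the number of distinct ids, and the choice of quartiles for break_vals are all assumptions made for illustration. It is not known to trigger the error, and the reporter notes that random V1 did not reliably reproduce it.

```r
library(data.table)
set.seed(1)

# Simulated stand-in for the real table (sizes and id ranges are assumptions).
n <- 1e5
DT <- data.table(
  id1 = sample(1000L, n, replace = TRUE),
  id2 = sample(50L, n, replace = TRUE),
  id3 = sample(10L, n, replace = TRUE),
  V1  = rnorm(n)
)

# Sum V1 within (id1, id2), then take the median of those sums per id1.
m_by_id1 <- DT[ , .(w = sum(V1)), by = .(id1, id2)
                ][ , .(m = median(w)), by = id1]

# Hypothetical break points: quartiles of the per-id1 medians.
break_vals <- quantile(m_by_id1$m, probs = seq(0, 1, by = 0.25))

# Bin each id1 into a factor h, as in the reported code.
bins <- m_by_id1[ , .(id1, h = cut(m, breaks = break_vals,
                                   include.lowest = TRUE, right = FALSE))]

# Join the bins back onto DT and take the median of V1 within each bin.
res <- DT[bins, .(id1, id3, h, V1), on = .(id1)][ , median(V1), by = h]
print(res)
```

Placing the data-creation part in one chunk with cache = TRUE and the final aggregation in the next chunk would mirror the setup described in the report.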

@yihui yihui added this to the v0.18 milestone Nov 13, 2017
@MichaelChirico

I don't think the issue is the same, and I haven't been able to reliably reproduce it with mocked-up data, though I've tried a bit to force it.

Of course I understand you can't devote much time to open-ended guessing. I posted this 1) in case something obvious came to mind or you had a suggestion for probing the issue further, and 2) so I can update it if/when I nail the issue down to something reproducible. Thanks.


yihui commented Nov 13, 2017

Sounds good. I'll be happy to re-open this issue when we are sure it is knitr's fault. Thanks!

@yihui yihui closed this as completed Nov 22, 2017
@github-actions

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue by following the issue guide (https://yihui.org/issue/), and link to this old issue if necessary.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 10, 2020