knitr engine API and cache compatibility with reticulate engine #1505

tmastny · 2018-02-11T20:03:39Z

I think whether or not this is a knitr or reticulate bug depends on the knitr engine API, which I do not completely understand. I carefully searched for this bug in the knitr and reticulate issues and didn't see anything, so I apologize if this is already known.

The bug

Suppose we have the following file called python_test.Rmd:

---
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, 
                      results='show', cache=TRUE, autodep=FALSE)
knitr::opts_knit$set(progress = TRUE, verbose = TRUE)
knitr::knit_engines$set(python = reticulate::eng_python)
```

```{python chunk1}
x = 1
print(x)
```


```{python chunk2}
print(x + 9)
```

When you first press the Knit button the document compiles successfully.

Now suppose I change chunk2, so it has print(x + 10) and I save the file. If I try clicking the Knit button I get the following error:

Quitting from lines 19-20 (python_test.Rmd) 
Error in py_run_string_impl(code, local, convert) : 
  NameError: name 'x' is not defined

Detailed traceback: 
  File "<string>", line 1, in <module>
Calls: <Anonymous> ... force -> py_run_string -> py_run_string_impl -> .Call
Execution halted

My efforts to debug

Here's what I've learned about the error:

It reliably happens when I call knitr::knit('python_test.Rmd', envir = new.env()) after a session restart, so I don't think it is an rmarkdown::render error.

The error message points to py_run_string_impl, which is a reticulate function. But I believe the problem arises before knitr reaches the python engine.

When you call knitr::knit('python_test.Rmd', envir = new.env()), chunk1 eventually enters the call_block function in knitr/R/block.R. It passes if (params$cache > 0), and the hash comes up with the same value. Then cache$load tries to bring the saved data into knit_global.

At this point, if you ls(knit_global()) you'll see character(0). So it isn't clear if the python object x=1 was even saved.

However, whether it was saved doesn't even matter, because chunk2 never gets a chance to use it. When chunk2 starts down the same path, its hash has changed, so it moves on to block_exec. If this were R code, it would have access to the objects that cache$load restored from chunk1 through env = knit_global(), but non-R engines go down a separate branch.

block_exec tells the reticulate engine to execute print(x + 10), but it fails because it doesn't know x. You can verify this in eng_python_synchronize_before by checking whether main contains x, which it doesn't (assuming you called knit after a session restart). The only thing passed to eng_python is options, which as far as I can tell doesn't include any environment information such as x = 1.
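The sequence described above can be imitated in plain Python. This is a minimal sketch, not knitr's or reticulate's code: `knit`, `chunk_hash`, and the `cache` dict are hypothetical stand-ins, and the cache is assumed to store only hashes, with none of the Python state persisted.

```python
import hashlib

def chunk_hash(code):
    """Hash the chunk source (a stand-in for knitr's chunk hash)."""
    return hashlib.md5(code.encode("utf-8")).hexdigest()

def knit(chunks, cache, namespace):
    """Run chunks in order, skipping any whose hash is already cached.

    `cache` maps chunk label -> hash and survives across "sessions".
    `namespace` plays the role of Python's main module and does not.
    Returns the labels of the chunks that were actually executed.
    """
    executed = []
    for label, code in chunks:
        h = chunk_hash(code)
        if cache.get(label) == h:
            continue  # cache hit: the chunk's code is not run again
        exec(code, namespace)
        cache[label] = h
        executed.append(label)
    return executed

cache = {}

# First knit: both chunks run in the same namespace.
first = knit([("chunk1", "x = 1"), ("chunk2", "y = x + 9")], cache, {})

# Edit chunk2, then knit again in a fresh session (new, empty namespace).
try:
    knit([("chunk1", "x = 1"), ("chunk2", "y = x + 10")], cache, {})
    error = None
except NameError as e:
    error = str(e)

print(first)  # ['chunk1', 'chunk2']
print(error)  # name 'x' is not defined
```

chunk1 is skipped on the second run because its hash is unchanged, so nothing ever defines x in the new namespace before chunk2 executes.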

What is not clear to me

Despite crawling through the reticulate source code, I'm not actually sure how the state is saved between chunks when cache=FALSE. Each time a python chunk is executed by eng_python (reticulate/R/knitr-engine.R), it calls import_main, which provides an object main that has the previous chunk's variables (x = 1), but I don't see where this data is saved from chunk to chunk.

I know the main data has to be saved somewhere, because if you don't restart the session after calling knitr::knit('python_test.Rmd', envir = new.env()) the main variable will still contain x = 1, even though the knit(..) call never runs x = 1 or loads it into memory since it was cached.
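One possible explanation, offered as an assumption rather than a reading of reticulate's source: main is Python's `__main__` module, and the embedded interpreter keeps that module alive for as long as the R process runs. Plain Python behaves this way for any module registered in `sys.modules`. In the sketch below, the module name `fake_main` and the helper `run_chunk` are hypothetical.

```python
import sys
import types

# Register a stand-in for the main module. It lives as long as the process.
sys.modules["fake_main"] = types.ModuleType("fake_main")

def run_chunk(code):
    """Look the module up again and run the code in its namespace."""
    module = sys.modules["fake_main"]
    exec(code, module.__dict__)
    return module

run_chunk("x = 1")               # first "chunk"
module = run_chunk("y = x + 9")  # second "chunk" still sees x
print(module.x, module.y)        # 1 10
```

Nothing is written to disk here. The state survives only because the same module object is handed back on every lookup, which would also explain why x is gone after a session restart.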

What to do?

I know everyone's busy, so I'm happy to help by making a PR but it is not clear to me how to fix this.

Ideas:

1. Factor out the chunk hashing. If autodep=FALSE and any chunk's cache is invalidated, block_exec every chunk.
2. Rethink the engine API, so the language can provide loading/saving for its chunks. In fact, reticulate already supports pickle, which is a python package that can save and load python objects. As far as I can tell, this would still require refactoring some knitr code, since engines don't touch caching at all at the moment.
3. Suppose .RData is able to save the python object (I don't know if that is possible). Similar to number 2, you could slightly change the engine API so that the object is passed to the engine to process.
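On the Python side, the pickle idea might look roughly like the following. This is a sketch under assumptions: `save_state` and `load_state` are hypothetical helpers, not part of knitr or reticulate, and only values that pickle can serialize are kept (modules, open files, and similar objects are skipped silently).

```python
import io
import pickle

def save_state(namespace, fileobj):
    """Pickle every user variable in `namespace` that pickle accepts."""
    state = {}
    for name, value in namespace.items():
        if name.startswith("__"):
            continue  # skip __builtins__ and other interpreter internals
        try:
            pickle.dumps(value)
        except Exception:
            continue  # value cannot be pickled, leave it out
        state[name] = value
    pickle.dump(state, fileobj)
    return sorted(state)

def load_state(namespace, fileobj):
    """Restore previously saved variables into `namespace`."""
    namespace.update(pickle.load(fileobj))

# chunk1 runs and its variables are saved (a BytesIO stands in for a cache file).
ns1 = {}
exec("x = 1", ns1)
buf = io.BytesIO()
saved = save_state(ns1, buf)

# Fresh session: chunk1 is a cache hit, so its state is loaded instead of rerun.
buf.seek(0)
ns2 = {}
load_state(ns2, buf)
exec("result = x + 10", ns2)
print(saved, ns2["result"])  # ['x'] 11
```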

Session data

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_3.4.1   backports_1.1.2  magrittr_1.5.0   rsconnect_0.8.5 
 [5] rprojroot_1.3-2  htmltools_0.3.6  tools_3.4.1      yaml_2.1.16     
 [9] Rcpp_0.12.15     stringi_1.1.6    rmarkdown_1.8.10 knitr_1.19.2    
[13] stringr_1.2.0    digest_0.6.15    evaluate_0.10.1


yihui · 2018-02-12T16:22:00Z

I'm afraid I'll have to defer this issue to @kevinushey (author of the python engine in reticulate).

kevinushey · 2018-02-12T18:01:52Z

For what it's worth, I never tried wiring in cache support into the reticulate engine as I wasn't exactly sure what that would entail, but it sounds like we'd need:

Serialization of the 'main' module (probably with pickle?)
A list of Python modules to be imported (what about state within a particular module?)
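The two requirements above could be combined as in this sketch. `snapshot` and `restore` are hypothetical names, not an existing API. Modules are recorded by name and re-imported on restore, which does not capture any state held inside a module (the open question in the second item).

```python
import importlib
import pickle
import types

def snapshot(namespace):
    """Serialize picklable variables plus the names of imported modules."""
    variables, modules = {}, {}
    for name, value in namespace.items():
        if name.startswith("__"):
            continue  # interpreter internals such as __builtins__
        if isinstance(value, types.ModuleType):
            modules[name] = value.__name__  # alias -> real module name
            continue
        try:
            pickle.dumps(value)
        except Exception:
            continue  # value cannot be pickled, leave it out
        variables[name] = value
    return pickle.dumps({"variables": variables, "modules": modules})

def restore(blob, namespace):
    """Re-import recorded modules, then restore the variables."""
    state = pickle.loads(blob)
    for alias, module_name in state["modules"].items():
        namespace[alias] = importlib.import_module(module_name)
    namespace.update(state["variables"])

# Original session: a module alias and a plain variable.
ns1 = {}
exec("import math as m\nradius = 2.0", ns1)
blob = snapshot(ns1)

# Fresh session: both the alias and the variable are available again.
ns2 = {}
restore(blob, ns2)
exec("area = m.pi * radius ** 2", ns2)
print(round(ns2["area"], 4))  # 12.5664
```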

tmastny · 2018-02-12T23:11:41Z

Thanks @kevinushey. Something else to look into is dill which extends pickle. From the readme:

Hence, it would be feasable to save a interpreter session, close the interpreter, ship the pickled file to another computer, open a new interpreter, unpickle the session and thus continue from the 'saved' state of the original interpreter session.

The reason I opened the issue here is that I think knitr will also require some refactoring to allow specific engines to handle caching. Right now all caching is handled by the cache methods in R/block.R: call_block for loading and block_cache for saving.

If this is something you'd like for reticulate, I'd be interested in helping out with PRs. I'd like to be able to write Python with rmarkdown using all the knitr features.

kevinushey · 2018-02-12T23:23:03Z

I'd definitely be open to reviewing a PR, but it seems like this will be tough to get right and I unfortunately won't have that much time to help with the actual implementation in the coming months.

leogama · 2022-04-22T19:31:02Z

Hi! I'm working on it (by sheer necessity). There are some serious problems in the dill package at the moment, but I'm also updating some currently broken logic in the code proposed by @tmastny, and it is mostly working by now, with some edge cases requiring monkey patches. As soon as the bugs in dill are fixed, I'll send a new pull request based on his code.

Beyond basic usage, I find that a Python cache engine for knitr is essential. We need this, folks! 🚀

This was referenced Feb 26, 2018

cache engine API #1518

Merged

cache engine for knitr rstudio/reticulate#167

Open

Non-Contradiction mentioned this issue Mar 30, 2018

Using knitr chunks caching feature with Julia via JuliaCall knitr engine JuliaInterop/JuliaCall#38

Open

cderv added the feature Feature requests label Jan 21, 2021

leogama mentioned this issue Apr 22, 2022

Fixes some bugs when using dump_session() with byref=True uqfoundation/dill#463

Merged

cderv mentioned this issue Apr 12, 2023

Caching sql chunks throwing error when output.var not used #1842

Closed
