Eager execution does not work for R interface under Python 3 #20701

Closed
jjallaire opened this issue Jul 11, 2018 · 15 comments

@jjallaire

Hi there, I am the maintainer of the R interface to TensorFlow. We are currently in the process of porting various Eager examples to R. We haven't had trouble with Python 2 versions of TensorFlow, but with Python 3 versions we get some strange errors.

I realize that this is within the R interface, so it technically falls outside the scope of TF for Python. However, in order for us to address this we need some insight into what might be different for Eager under Python 3. I'll provide a detailed repro and an explanation of its under-the-hood behavior below.

cc @martinwicke @random-forests

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): TensorFlow v1.10.0-dev20180710
  • Python version: 3.6.5
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A
  • Exact command to reproduce: See below

Describe the problem

Using the R interface to TensorFlow:

library(tensorflow)
tf$enable_eager_execution()
x <- tf$constant(1)
tf$add(x, x)

Results in this error:

SystemError: <built-in function TFE_Py_FastPathExecute> returned a result with an error set 

This error occurs within the definition of add() within gen_math_ops.py:

  _result = _pywrap_tensorflow.TFE_Py_FastPathExecute(
        _ctx._context_handle, _ctx._eager_context.device_name, "Add", name,
        _ctx._post_execution_callbacks, x, y)

This code works as expected under TF w/ Python 2.

Again, I realize that this is the R interface so you might not have an intuition about what could be wrong. You can think of the R interface conceptually as just using the C Python API to invoke functions. So in the above code we are essentially using:

  • PyImport_Import to import the tensorflow module
  • PyObject_CallFunctionObjArgs to call Python functions (e.g. tf.enable_eager_execution, tf.constant, etc.)

My theory is that under Python 3 there is something being done at the Python language level that we aren't emulating or capturing when calling through the Python C interface. Hopefully this provides you with some clues as to what that might be, and we will be able to make whatever changes are required to make this work within R.
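
As a rough illustration of that conceptual model, the two C API calls map onto ordinary Python roughly as follows. This is only a sketch, not the R interface's actual code: `call_python` is a hypothetical helper, and the standard-library `math` module stands in for `tensorflow` so that the snippet runs without TF installed.

```python
import importlib


def call_python(module_name, func_name, *args):
    """Import a module by name and call one of its functions.

    Hypothetical helper: a Python-level analogue of PyImport_Import
    followed by PyObject_CallFunctionObjArgs.
    """
    module = importlib.import_module(module_name)  # roughly PyImport_Import
    func = getattr(module, func_name)              # attribute lookup on the module
    return func(*args)                             # roughly PyObject_CallFunctionObjArgs


# `math` is a stand-in for `tensorflow` (assumption for illustration only).
result = call_python("math", "sqrt", 9.0)
print(result)
```

The point of the sketch is that nothing here goes through Python source code: every step is an explicit import, lookup, or call, so any bookkeeping the interpreter would normally do around those steps has to be done by the caller.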

@martinwicke
Member

I have seen this error before (not in TF) when I made a mistake that, I believe, was related to reference counting in connection with exceptions. But I can't say what's wrong here. I also think that Py3 is stricter about what it lets you get away with, which is probably why you only see this with Py3.

Added Alex, who has more context on Eager specifically.

@alextp
Contributor

alextp commented Jul 11, 2018

Interesting. @akshaym can you take a look at this? Maybe FastPathExecute is doing something funny.

@jjallaire , can you also show us some more information about what error is being set? If you print the TF_Status error code it'll be really helpful.

@jjallaire
Author

Interesting. To test whether the R interface might be leaving an error set before calling, I added a pre-emptive call to PyErr_Clear() right before we call PyObject_CallFunctionObjArgs(). Unfortunately the error still occurs, so it may be that a Python error is occurring somewhere within the call to TFE_Py_FastPathExecute.

So this particular error condition could be somewhat of a red herring: i.e. there is an error occurring during execution b/c some precondition is not met when calling via the C API but we don't see it.
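
For readers unfamiliar with this message: CPython raises this SystemError when a C function returns a non-NULL result while an exception is still pending. The toy model below sketches that check in plain Python. It is not CPython code; `ToyInterpreter`, `pending_error`, and `checked_call` are hypothetical names invented for the illustration.

```python
class ToyInterpreter:
    """Toy model of a single pending-error slot (not CPython itself)."""

    def __init__(self):
        self.pending_error = None  # stands in for the error indicator

    def checked_call(self, name, func, *args):
        """Call func, then apply the consistency check to its result."""
        result = func(*args)
        if result is not None and self.pending_error is not None:
            stale = self.pending_error
            self.pending_error = None  # reporting the error consumes it
            raise SystemError(
                f"<built-in function {name}> returned a result with an error set"
            ) from stale
        return result


interp = ToyInterpreter()

# Some earlier step fails, but nobody checks or clears the error.
interp.pending_error = AttributeError("left over from an earlier lookup")

try:
    interp.checked_call("add", lambda a, b: a + b, 1, 1)
except SystemError as exc:
    first_message = str(exc)

# The stale error has been consumed, so the same call now succeeds.
second_result = interp.checked_call("add", lambda a, b: a + b, 1, 1)
print(first_message)
print(second_result)
```

In this model the function that reports the failure is not the one that caused it, which is why the message alone says little about where the original error was set.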

Looking at the code for TFE_Py_FastPathExecute() (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/eager/pywrap_tfe_src.cc#L2232) there are lots of reasons an error might occur but I don't have any intuition about which of these might only be tickled when calling a Python op via the C API as opposed to the Python interpreter.

@jjallaire
Author

Is there a way to get the TF_Status error code from the Python interface?

@martinwicke
Member

(sorry)

@alextp
Contributor

alextp commented Jul 11, 2018 via email

@akshaym
Contributor

akshaym commented Jul 11, 2018

Hi @jjallaire, Do you have steps to replicate your environment?

I'm pretty unfamiliar with how to set up R (I tried the r-base docker, but wasn't able to install TF using https://tensorflow.rstudio.com/tensorflow/).

Some questions I have:
Since the TF py3 docker has 3.5.2 (and your version is 3.6.5), can you check that this error doesn't occur from the python interpreter directly? If so, it might be easier for me to debug that.
Does it happen with all ops (perhaps try a matmul)?
Does tf.constant(1) return a reasonable looking Tensor?

(I'm trying to answer the first one myself, but I'm asking in case it's easy enough for you to run.)

@jjallaire
Author

The error doesn't occur when I execute from the Python interpreter directly.

It does appear to happen with other ops (I tried tf.subtract and tf.matmul and it occurred for both of those). For tf.matmul the error is slightly different, it occurs on this line of code:

with ops.name_scope(name, "MatMul", [a, b]) as name:

The specific error is:

SystemError: <class 'tensorflow.python.framework.ops.name_scope'> returned a result with an error set 

tf$constant(1) does in fact return a reasonable looking tensor.

Here's how I would suggest replicating:

  1. Start from a system that already has TF for Python installed and working.

  2. Install R

  3. Install the R tensorflow package from the R console:

    install.packages("tensorflow", repos = "https://cran.rstudio.com")

  4. Execute this R script:

    library(tensorflow)
    tf$enable_eager_execution()
    x <- tf$constant(1)
    tf$add(x,x)

R should be able to find your installation of TensorFlow. If it's in a virtualenv you may need to add this to give it a hint:

library(tensorflow)
use_virtualenv("/path/to/virtualenv")

@jjallaire
Author

To install R on Debian just do this:

sudo apt-get install r-base

@jjallaire
Author

Then to run R:

R

@jjallaire
Author

So to summarize:

$ sudo apt-get install r-base
$ R

Then from within R:

> install.packages("tensorflow", repos = "https://cran.rstudio.com")
> library(tensorflow)
> use_virtualenv("/path/to/virtualenv") # if necessary
> tf$enable_eager_execution()
> x <- tf$constant(1)
> tf$add(x,x)

Or, after installing the R tensorflow package w/ install.packages(), just put the following in a text file, e.g. "eager.R":

library(tensorflow)
use_virtualenv("/path/to/virtualenv") # if necessary
tf$enable_eager_execution()
x <- tf$constant(1)
tf$add(x,x)

And then execute:

$ Rscript eager.R

If you need to keep the process alive for debugging then you can go into R and do this:

source("eager.R")

@akshaym
Contributor

akshaym commented Jul 11, 2018

Thanks @jjallaire!

I'm able to reproduce with your steps.

The following fails for me though:

library(tensorflow)
tf$enable_eager_execution()
x <- tf$constant(1.0)
print(x) # fails with "returned a result with an error set" in EagerTensor_datatype_enum
print(x) # the second call succeeds on the same tensor.

So it seems as if the constant call is actually returning with an error set (regardless, the tf$add calls also fail after this). So something isn't working right there.

I'll try to spend some more time on this soon.

@jjallaire
Author

Okay, great!

In terms of a conceptual model, think of the R interface as just using the C API to do everything. So perhaps there is some side-effect or state associated with using Eager via the Python interpreter that is not being replicated? Just a hunch about one angle to consider. Hopefully once you can see the actual failure on the C side everything will become clear!

@akshaym self-assigned this Jul 11, 2018

@akshaym
Contributor

akshaym commented Jul 12, 2018

@jjallaire,

This seems to happen since the EagerTensor python type doesn't have a __module__, which is accessed here: https://github.com/rstudio/reticulate/blob/c9e222fc709a6dcbdda586bbb49d93757fc86086/src/python.cpp#L287

I'm looking into how to include a __module__ attribute for the EagerTensor class (looks like we might just need a fully qualified name for the EagerTensor).

I also sent a pull request to fix reticulate to not generate the python error in any case: rstudio/reticulate#312 (I'm not entirely sure how to actually test this).

I'll leave this open till I get the __module__ to work for EagerTensor.
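
To make the failure mode concrete, here is a minimal Python sketch of a lookup that tolerates a type without `__module__`. It is not the reticulate fix itself (that code is C++). `FakeEagerTensor` and `_NoModuleMeta` are hypothetical stand-ins that simulate the missing attribute, and the `"unknown"` fallback is an arbitrary placeholder.

```python
class _NoModuleMeta(type):
    """Hypothetical metaclass that hides `__module__` to mimic a type without one."""

    def __getattribute__(cls, name):
        if name == "__module__":
            raise AttributeError(name)
        return super().__getattribute__(name)


class FakeEagerTensor(metaclass=_NoModuleMeta):
    """Stand-in for a type whose `__module__` lookup fails."""


def qualified_name(cls, default_module="unknown"):
    """Return 'module.Name', falling back when `__module__` is missing."""
    # getattr with a default swallows the AttributeError instead of
    # leaving a failed lookup behind for a later caller to trip over.
    module = getattr(cls, "__module__", default_module)
    return f"{module}.{cls.__name__}"


print(qualified_name(int))
print(qualified_name(FakeEagerTensor))
```

The C-level equivalent of the `getattr` default is to check the lookup result and clear the error when it fails, rather than assuming every type exposes `__module__`.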

@jjallaire
Author

Brilliant!!! So happy we figured this out.

Here is the fix I made in reticulate: rstudio/reticulate@b1728da#diff-849f590f08cba3094f269e0a4d69398d

I only know of one other case in the wild where a Python object didn't have a module, so clients are likely conditioned to expect one (whether or not it is formally required; I'm guessing it isn't).

I'll watch for this issue to be closed and sync to the new module name for EagerTensor (right now we default it to something generic, but we'll want to use whatever you end up with once it's checked in).
