Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

when R process is embedded in a Python process #98

Closed
randy3k opened this issue Sep 12, 2017 · 26 comments · Fixed by #104
Closed

when R process is embedded in a Python process #98

randy3k opened this issue Sep 12, 2017 · 26 comments · Fixed by #104

Comments

@randy3k
Copy link
Contributor

randy3k commented Sep 12, 2017

reticulate is not working if the R process is embedded in a Python process. Particularly, reticulate doesn't run on rice which is a python program embedding R. I would also imagine that it will not work with rpy2.

I think the issue is that reticulate failis to initialize a Python instance when another python instance is running. However, I am not sure how easy/difficult the fix would be.

cf: randy3k/radian#7

@jjallaire
Copy link
Member

It's in principle possible but I'm not sure of all the mechanics. There are two issues to consider:

  1. reticulate dynamically loads all of the Python C API symbols it uses (this allows a single reticulate binary to work against multiple versions of Python). So while it would be easy to detect that a Python session already exists, "attaching" to it would involve locating and loading those symbols, as well as reconciling this new code path with the rest of reticulate which assumes that it loads Python. I could consider a PR for this but it's a very non-trivial change.

  2. Having an existing Python environment will also disable an important capability of reticulate, which is to dynamically bind to the appropriate Python environment for a given R session. Inheriting the Python environment from another shell will effectively disable this capability, meaning that users of the alternate shell need to know to launch that shell from within the appropriate Python environment. Not a show stopper for more technical users, but definitely more to grok and get right for more casual users.

@randy3k
Copy link
Contributor Author

randy3k commented Sep 27, 2017

It turns out it is much easier than I thought. The only changes are https://github.com/rstudio/reticulate/pull/104/files?w=1

@randy3k
Copy link
Contributor Author

randy3k commented Sep 27, 2017

I am trying to force reticulate to use a specific python by

    os.environ["RETICULATE_PYTHON"] = sys.executable

But it seems that it would still choose a different python if the requirements do not meet. Is there any way to force using a specific python. And throw an error instead of automatically choose a version of python if the requirements do not meet.

@jjallaire
Copy link
Member

The automatic binding to versions of Python is all based on the python package requirements implied by the R packages loaded. For example:

library(keras)

Implies not only that I want to use Python but that I also want to use a version of Python that has the keras Python package. Without this behavior the user will get an error when attempting to call functions that in turn call into Keras. If we blindly take the RETICULATE_PYTHON we will subsequently have a runtime error when e.g. the keras package tries to call keras. So we can force RETICULATE_PYTHON to work but another error will soon follow, and explaining to the user how to unroll this error condition is a mess.

@lgautier
Copy link
Contributor

Several elements might be conflated together here:

  • whether the apparently embedded language is in fact also the embedding language (and therefore shouldn't be initialized).
  • which python executable reticulate wants to use / bind to.

Regarding the former, Python has a C-level function Py_IsInitialized() while R does not have an equivalent. The pull request #104 is making effective use of Py_IsInitialized() but there is a need for a way to keep the initialization status of R. Without this, in situations where we have R -> reticulate -> Python -> rpy2
rpy2 will try to initialize R while it is obviously already initialized and "not so great things" will likely happen. An environment variable R_INITIALIZATION_STATUS indicating whether R was initialized, and information about the environment that initialized R (e.g., a tag about the embedding language and the content of sys.executable if python). This would also also help reticulate pick the right Python if R is already embedded in Python (and a tag python in R_INITIALIZATION_STATUS override the content of RETICULATE_PYTHON).

(note: this is obviously hinting toward having the complementary environment variable PY_INITIALIZATION_STATUS, and may be the need for general convention adopted by other languages embedding each others).

@jjallaire
Copy link
Member

It's unfortunate that there isn't a way provided by R itself to determine if R is already initialized. There are a bunch of environment variables set by R, e.g. R_SESSION_TMPDIR, however these would technically be inherited by child processes that might not themselves be R. There is probably some lower-level platform specific thing you could do to dynamically lookup global symbols "known" to exist in R. For example, if you are were able to call a hypothetical R_IsInitialized function that means that you able to look up the address of the function, which means that you could also look up the address of other functions in the R API. Their presence in the executable is a near 100% guarantee that R is initialized as I'm not aware of any scenario were the R dynamic library is loaded and then R isn't initialized.

In terms of reticulate, no you don't have a dependency on it but reticulate does need to bind to a version of Python and if it's not advised of which one to bind to via something like an environment variable then it may bind to the wrong one.

@lgautier
Copy link
Contributor

lgautier commented Oct 1, 2017

I think that answering the first point is going to be necessary if reticulate and rpy2 are expected to work together. Without this. rpy2 will not know that R is not the embedded language but the embedding language, and will initialize R a second time (leading to serious problems and ultimately a likely segfault).

You are indeed right that setting environment variables will not work too well if child processes are created (rpy2 would fail to initialize R whenever Python's multiprocessing is used - a significant problem). However, having the PID in which R was initialized in the additional information about R's initialization status I was mentioning would solve the problem.

The alternative solution would be to create a common C library implementing a R_IsInitialized() function. The logistics for this seems much more complicated.

Note: Regarding the dynamic loading of R's shared library without initialization, I have an example of this happening right below:

import rpy2.rinterface

@jjallaire
Copy link
Member

jjallaire commented Oct 2, 2017 via email

@lgautier
Copy link
Contributor

lgautier commented Oct 7, 2017

I can speak best about what rpy2 (as a language embedding R) would need.
I am changing the name to R_SESSION_INITIALIZED for naming consistency (you pointed out the existence of R_SESSION_TEMPDIR), but do not hesitate to suggests an other name.

After thinking a bit, R_SESSION_INITIALIZED would only need PID=<pid>:NAME="<name of initializer>". The NAME is not strictly required but would be only convenient for messages that report which process has initialized R (the "command" or the absolute path of the executable could also be found from the PID, but NAME would be designed to be a short name immediately interpretable by a user). Initializing that variable, unless already defined and with the same PID as the current PID, is the responsibility of the bridge (if this is rpy2 it will set this variable when initializing R, if this is reticulate it will set this variable when loading so rpy2 does not try to initialize R).

For Python, this could then be essentially the same in a variable called PYTHON_SESSION_INITIALIZED (and rpy2, being a bridge between Python and R, would define it as needed).

@jjallaire
Copy link
Member

On the reticulate side I also need to know the location of the Python binary which hosts the session. This is because reticulate dynamically binds to the Python shared library and needs the location of Python to discover the location of the shared library (this is the RETICULATE_PYTHON environment variable I discussed previously, but if PYTHON_SESSION_INITIALIZED included the location of the binary that would work as well.

@lgautier
Copy link
Contributor

lgautier commented Oct 7, 2017

Hopefully we can make something generic enough to work for all.

I would be more than happy to have rpy2 define a couple of variables that make the interoperability with other languages "just work" rather than leave it to the user, but I would to see with you if something general for Python-R bridges using C can be worked out (today it is reticulate, later PyCall / RCall (Julia's bridges) whenever they realize that they have the same issue, etc...). I know that R_SESSION_INITIALIZED is absolutely necessary on the rpy2 side for reticulate to successfully use it, but since your are on the other side (R-calling-), you might be in a better position than me to think whether this is going to work from the point of view of reticulate or R calling any other language.

@lgautier
Copy link
Contributor

lgautier commented Oct 8, 2017

There is an candidate implementation in rpy2 (branch default):
https://bitbucket.org/rpy2/rpy2/commits/e537eaf8d01cdad3cdacda1a865dd430004e300a

As soon as reticulate defines R_SESSION_INITIALIZED the following should already work:

import os
import sys

# This won't be needed when rpy2 defines PYTHON_SESSION_INITIALIZED
# (and reticulate uses it).
os.environ['RETICULATE_PYTHON'] = sys.executable

from rpy2.robjects import r

r("""
library(reticulate)
robjects <- import('rpy2.robjects')
robjects$r('R.version')
""")

@jjallaire
Copy link
Member

Just added the reticulate side here: b24eb08

Once you define PYTHON_SESSION_INITIALIZED let me know the format and I'll incorporate that into reticulate as well.

@lgautier
Copy link
Contributor

I finally found a bit of time to do it.

The environment variable PYTHON_SESSION_INITIALIZED is now set when rpy2 is initilalizing R (no need before) to a value like current_pid=<PID>:sys.executable=<absolute path to python>.

This is in the branch default (https://bitbucket.org/rpy2/rpy2/commits/314db036fa8142ac0757405ef54ccafae058531a). Since this is no breaking the API, I can backport to the currently release (2.9.x) once confirmed to be working as intended.

@jjallaire
Copy link
Member

Okay, I've implemented support for detecting and using the Python defined in PYTHON_SESSION_INITIALIZED (so long as the PIDs match): b4ddddf

@randy3k I think you should switch to defining PYTHON_SESSION_INITIALIZED rather than RETICULATE_PYTHON as it's both a more generic name and also includes the PID so won't mis-detect Python in child processes that inherit the parent environment.

@lgautier
Copy link
Contributor

As soon as I get a confirmation that this is all good I will backport the change to the branch 2.9.x in the rpy2 repository in order to include it in the upcoming release 2.9.1.

@jjallaire
Copy link
Member

Excellent!!

@lgautier
Copy link
Contributor

FWITW I just tried with rpy2's HEAD in branch default and reticulate installed with devtools::install_github("rstudio/reticulate") and I get a segfault as soon as I do

rpy2.robjects.r("""
reticulate::import(sys)
""")

The backtrace is hinting as this happening during reticulate's own initialization:

/home/jupyteruser/R/x86_64-pc-linux-gnu-library/3.4/reticulate/libs/reticulate.so(_Z13py_initializeRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_S6_S6_bbS6_+0x370)[0x7f88d1b10270]
/home/jupyteruser/R/x86_64-pc-linux-gnu-library/3.4/reticulate/libs/reticulate.so(_reticulate_py_initialize+0x1d1)[0x7f88d1b00781]

@russellpierce
Copy link
Contributor

@lgautier with randy3k/rpy2@440322c were you able to get things working?

@lgautier
Copy link
Contributor

I have not tried again since my last comment.

@lgautier
Copy link
Contributor

I am still getting a segfault:

>>> import rpy2.robjects
>>> rpy2.robjects.r('library("reticulate")')
>>> rpy2.robjects.r('reticulate::import("sys")')
>>> rpy2.robjects.r('reticulate::import("sys")')
*** Error in `python': free(): invalid pointer: 0x00007f484af29078 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7908b)[0x7f4853b1108b]
/lib/x86_64-linux-gnu/libc.so.6(+0x82c3a)[0x7f4853b1ac3a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f4853b1ed2c]
/usr/lib/python3.5/config-3.5m-x86_64-linux-gnu/libpython3.5.so(PyUnicode_InternFromString+0x2a)[0x7f4845a6419a]
/usr/lib/python3.5/config-3.5m-x86_64-linux-gnu/libpython3.5.so(PyType_Ready+0x18c5)[0x7f48458ac4c5]
/usr/lib/python3.5/config-3.5m-x86_64-linux-gnu/libpython3.5.so(PyStructSequence_InitType2+0x181)[0x7f48458b13e1]
/usr/lib/python3.5/config-3.5m-x86_64-linux-gnu/libpython3.5.so(_PyLong_Init+0xb7)[0x7f48458c0867]
/usr/lib/python3.5/config-3.5m-x86_64-linux-gnu/libpython3.5.so(_Py_InitializeEx_Private+0xb4)[0x7f4845a1e5d4]
/usr/local/packages/R/3.4/lib/R/library/reticulate/libs/reticulate.so(_Z13py_initializeRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_S6_S6_bbS6_+0x400)[0x7f484a70bed0]
/usr/local/packages/R/3.4/lib/R/library/reticulate/libs/reticulate.so(_reticulate_py_initialize+0x1d1)[0x7f484a6fb131]
/usr/local/packages/R/3.4/lib/R/lib/libR.so(+0xd782a)[0x7f4851cd482a]
(...)

@russellpierce
Copy link
Contributor

@lgautier Is that with the companion commit in RPy2 you mentioned here: randy3k/rpy2@440322c#commitcomment-25703908 ?

@lgautier
Copy link
Contributor

Yes.

@SemanticBeeng
Copy link

SemanticBeeng commented Mar 20, 2018

@lgautier , @jjallaire : please advise if this integration is supposed to work with latest versions of reticulate and rpy2.

Can see the issue is closed but last report from Laurent was about segfault.

With this

from rpy2 import robjects
import os
import sys

# This won't be needed when rpy2 defines PYTHON_SESSION_INITIALIZED
# (and reticulate uses it).
#os.environ['RETICULATE_PYTHON'] = sys.executable

robjects.r('library("reticulate")')
robjects.r('reticulate::import("sys")')

nested = robjects.r('''
nested <- function() {

    library(reticulate)

    reticulate::import("sys")
    robjects <- import('rpy2.robjects')
    robjects$r('R.version')  # error triggered here
 
    return (1)
}
    
nested       
''')

nested()

I get

/opt/conda/lib/python3.6/site-packages/rpy2/rinterface/__init__.py:145: RRuntimeWarning: Error in py_call_impl(callable, dots$args, dots$keywords) : 
  RuntimeError: Concurrent access to R is not allowed.

Detailed traceback: 
  File "/opt/conda/lib/python3.6/site-packages/rpy2/robjects/__init__.py", line 351, in __call__
    p = _rparse(text=StrSexpVector((string,)))

  warnings.warn(x, RRuntimeWarning)

Found "RuntimeError: Concurrent access to R is not allowed." here https://bitbucket.org/rpy2/rpy2/issues/182/process_revents-keeps-on-raising but cannot tell what causes this.

The error suggests that the same R instance is used which is good, right?

Error still happens if I comment calls to robjects.r from Python, with rpy2 version 2.9.1 and 2.9.2.

Am I not doing the right thing making the joint initialization sequence work?

@SemanticBeeng
Copy link

SemanticBeeng commented Mar 21, 2018

context = above

My end goal is to load a large time series in python / pandas and then call an R function with it (doing rpy2 conversion for pandas or not?) and then have the R code access it through the reticulate API as per this correspondence: https://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html.

Ideally there would not be any data conversion going on (I know reticulate overlays over Python memory for arrays
https://rstudio.github.io/reticulate/articles/arrays.html#reticulate-with-care

Sorry for posting here but this might be material for a blog post documenting joint use of rpy2 and reticulate for a non trivial use case.

I will write that post if you are kind to help out!

@lgautier
Copy link
Contributor

Just a note to point out to #208 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants