
Global Python functions no longer work? #49

Closed
gaow opened this issue Mar 25, 2016 · 21 comments

gaow commented Mar 25, 2016

After updating to the current SoS, my previous code no longer works:

def get_md5(value):
    import sys, hashlib
    base, ext = value.rsplit('.', 1)
    res = hashlib.md5(base.encode('utf-8')).hexdigest() if sys.version_info[0] == 3 else hashlib.md5(base).hexdigest()
    return '{}.{}'.format(res, ext)

[simulate_1: alias = 'simulate']
n = [1000]
true_mean = [0, 1]
output_suffix = 'rds'
input: for_each = ['n', 'true_mean']
output: pattern = 'exec=rnorm.R::n=${_n}::true_mean=${_true_mean}.${output_suffix}'
_output = [get_md5(x) for x in _output]
run: ...

The error message is:

ERROR: Failed to execute subworkflow: Failed to assign [get_md5(x) for x in _output]
 to variable _output: name 'get_md5' is not defined

What is the proper way to use customized Python functions with current SoS?


BoPeng commented Mar 25, 2016

This is a namespace issue that I am trying to address. We now have SoS script locals (for SoS variables), SoS script globals, SoS command locals and globals, and we should also handle the namespaces of spawned step processes. The first two belong to the SoS script and the last two to the SoS runtime. Previously these namespaces were mixed, which meant malicious user-defined code could interfere with how SoS works (e.g. replace the run action). The updated code tries to separate them, but get_md5 is now defined in the SoS script locals while being called in the SoS runtime locals.
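
A minimal plain-Python sketch of the situation described above (illustrative only, not the actual SoS internals):

script_globals = {}
script_locals = {}
# the global definition ends up in the script's namespaces
exec("def get_md5(v):\n    return v", script_globals, script_locals)

runtime_locals = {'_output': ['a.rds']}
# the step assignment is evaluated with the runtime's locals, where
# get_md5 was never defined:
try:
    eval("[get_md5(x) for x in _output]", script_globals, runtime_locals)
except NameError as e:
    print(e)   # name 'get_md5' is not defined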


gaow commented Mar 25, 2016

I see. Will this syntax remain the same so that I should wait for an update of SoS, or is there an alternative syntax I can use to achieve my purpose? I assume moving the function inside an SoS step is not a good idea, because it will be used in multiple steps.


BoPeng commented Mar 25, 2016

I also disabled writing to the _step object directly, made variables defined in global definitions read-only, and I might have disallowed changing _input, _output and _depends from the SoS script. As useful as it might be, allowing SoS scripts to change these variables can be really troublesome. For example,

[parameters]
var = a

uses a default value a defined in the global section. If a can be changed, the default value of this parameter changes with it. This is problematic when the workflow is used as a nested workflow, where another workflow might change a accidentally and cause unpredictable results.

The same logic applies to _step objects and their aliases. If we allowed changing alias.output, the same step might behave differently depending on the execution of another step (one that changes an existing alias), so theoretically speaking we could not execute any steps in parallel.
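
A plain-Python analogy of the mutable-default problem (illustrative only, not SoS code):

a = 1   # set in the "global section"

def step(var=None):
    # the default for var is read from the global a,
    # mimicking "[parameters] var = a"
    return a if var is None else var

print(step())   # 1
a = 42          # another (nested) workflow accidentally changes a
print(step())   # 42 -- the same step now behaves differently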


BoPeng commented Mar 25, 2016

I am working on this. I left it unfixed because I need to set up sensible rules for what can and cannot be accessed, and in what way, from within SoS steps... Your code should work, though.

More specifically, SoS currently executes these functions (e.g. run) in its own namespace because they are not available in the SoS namespace. What we need to do is inject only the ones needed by the SoS script into its namespace before the script is executed, and evaluate all function calls in the SoS script's namespaces. In this way SoS itself will not be affected by malicious scripts.
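
A sketch of that injection idea (run_action and SOS_ACTIONS are hypothetical names, not the actual SoS internals):

import subprocess

def run_action(cmd):
    # stand-in for the SoS 'run' action
    subprocess.check_call(cmd, shell=True)

SOS_ACTIONS = {'run': run_action}

script_env = dict(SOS_ACTIONS)          # inject only the needed actions
exec("run('echo hello')", script_env)
# Even if the script rebinds 'run' inside script_env, the copy in
# SOS_ACTIONS that SoS itself uses is untouched.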


gaow commented Mar 25, 2016

The updates are reasonable when -j comes into the picture. Currently my code does have a line that changes _output directly. What if we need to adjust the names of an output within a step?


BoPeng commented Mar 25, 2016

I believe updating _output yields a warning now. The idea is that we should not need to manipulate _output directly if we have a strong enough syntax for step output. We will see.


gaow commented Mar 25, 2016

Indeed. Are you suggesting designing options that allow flexible post-processing of these file names at the end of the input: and output: specifications?


BoPeng commented Mar 25, 2016

Yes. The easiest and most general solution is to allow the specification of a user-defined function that takes _output and returns massaged filenames. Any suggestion on a parameter name, or another solution?


BoPeng commented Mar 25, 2016

I have added a test case (TestExecute.testUserDefinedFunc) but it works. Could you add a test case that demonstrates the problem? Might it involve a subworkflow?


gaow commented Mar 25, 2016

Using another named parameter might be straightforward, and a rename = might do if all we do is renaming. But that would restrict the renaming to a single line, so perhaps users need to provide customized functions; e.g., my script could be written as:

output: pattern = 'exec=rnorm.R::n=${_n}::true_mean=${_true_mean}.${output_suffix}', rename = [get_md5(x) for x in _output]

and is it cool to have _output here?


BoPeng commented Mar 25, 2016

By "general" I meant the function should be able to remove some output files. No, I do not think it is good to use _output here. The usage should be something like

rename=get_md5

or

rename=lambda x, **kwargs: get_md5(x[0])

if get_md5 cannot handle multiple names.
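
For illustration, a hypothetical sketch of how such a rename= option could be applied to a step's output list (apply_rename is not actual SoS code):

import hashlib

def get_md5(value):
    base, ext = value.rsplit('.', 1)
    return '{}.{}'.format(hashlib.md5(base.encode('utf-8')).hexdigest(), ext)

def apply_rename(rename, output_files):
    # call the user function on the whole list so that it can rename,
    # drop, or merge output files
    return list(rename(output_files))

print(apply_rename(lambda x, **kwargs: [get_md5(f) for f in x],
                   ['exec=rnorm.R::n=1000::true_mean=0.rds']))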


gaow commented Mar 25, 2016

I just updated the test. It turns out to be an issue with list comprehensions: if I put [myfunc() for x in range(5)][0] instead of just myfunc() as in the test, it fails.


BoPeng commented Mar 25, 2016

Oh, this is a nested namespace issue. myfunc needs to be in the globals namespace for it to be seen by the list comprehension, but it is currently imported into the locals namespace.
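
The failure can be reproduced outside SoS (illustrative sketch): in Python 3 a list comprehension runs in its own implicit function scope, so names inside it are resolved against the enclosing globals, not the locals dictionary passed to exec:

gdict = {}
ldict = {'myfunc': lambda: 1}

exec('print(myfunc())', gdict, ldict)      # works: plain name lookup sees locals
try:
    exec('[myfunc() for x in range(5)]', gdict, ldict)
except NameError as e:
    print(e)                               # name 'myfunc' is not defined

gdict['myfunc'] = ldict['myfunc']          # promote the function to globals
exec('print([myfunc() for x in range(5)][0])', gdict, ldict)   # now works: 1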


BoPeng commented Mar 25, 2016

Just to use this thread to summarize the namespace rules:

  1. SoS will be executed in its own namespace. Everything will be eval'ed or exec'ed with its own locals and globals. This makes sure SoS itself will not be affected by the SoS script.
  2. The SoS script will have two namespaces: a global namespace and a local namespace that will be used to execute statements.
  3. The SoS global namespace consists of
    • SoS actions and functions injected by SoS when the script is executed
    • All global definitions
    • Variables seen in other steps (alias etc.)
  4. The SoS local namespace consists of step-specific variables (_step, _input, ...). The local namespace will be cleared after the execution of each step.
  5. Spawned processes will get a copy of the local namespace (with _input, _index etc.) because these differ from process to process.

Pending problems:

  1. The most difficult part is the global namespace, because it contains modules and function objects that cannot be transferred. My suggestion (sketched after this list) is that
    • we re-run the global definitions in the spawned process to re-import modules and re-define functions, and
    • we transfer pickleable objects to the spawned process. Because we require global definitions to be read-only, there should be no conflict between these two steps. We could also transfer ONLY StepInfo objects (aliased step information), because right now a step only leaves such information in the global namespace.
  2. A potential problem is with functions defined in an SoS step. These should be kept in the step's own local namespace. However, when we evaluate a nested function in a step, it only sees the function's namespace (locals), its parent function's namespace (nested namespace), and the global namespace. This means the locally defined function will not be seen. I am not sure how to handle this problem right now.
  3. Nested namespaces: there will be multiple global sections, and in theory we should either
    • merge global sections during parsing and let the result become the new global section of the new workflow, or
    • assign a global section to each section so that sections have their own global definitions. To make sure global sections from different sources are executed, these definitions are currently executed for each step (which is a bit wasteful).
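
A sketch of the "transfer pickleable objects" idea from problem 1 (illustrative only, not the actual SoS implementation):

import math
import pickle

def transferable(env):
    # keep only objects that survive pickling; modules and functions are
    # dropped and re-created by re-running the global definitions instead
    out = {}
    for k, v in env.items():
        try:
            pickle.dumps(v)
            out[k] = v
        except Exception:
            pass
    return out

env = {'n': [1000], 'math': math, 'f': lambda x: x}
print(transferable(env))   # {'n': [1000]}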


gaow commented Mar 25, 2016

Great summary! I can change my implementation to not use list comprehensions. But how do I make the function global instead? From reading the summary I still have not figured out how it would work in practice if I stick to my current syntax (again, I can change it; I am just curious).


BoPeng commented Mar 25, 2016

The problems are not resolved. We could use a single global namespace for everything, but the problem is that if step_15 can use a function or variable defined in step_10, it cannot be executed before it. Right now the DAG will be set up purely from file dependencies, so we have to make functions and variables defined in steps local; then function calls within a step face the triple-namespace problem and might have trouble accessing functions defined locally.

gdict = {'a': 1}
ldict = {'b': 2}
# this works
eval('a+b', gdict, ldict)
# function only sees 'a' and its own local namespace
exec('''
def func():
    print(locals().keys())
    print(globals().keys())
func()
''', gdict, ldict)
# so this fails
exec('''
def func():
    print(b)

func()
''', gdict, ldict)

A seemingly feasible way to address this problem is to insert ldict into a nested namespace of func so that it can be accessed, something like:

def add_names(extra_dict, func):
    def temp(*args, **kwargs):
        # This does not work
        locals().update(extra_dict)
        return func(*args, **kwargs)
    return temp

add_names(ldict, func)()

However, updating locals() does not work. Even the exec trick does not work here, because the dictionary might contain function objects which cannot be exec-ed. There are many discussions about this online, but I do not see a good solution.

In the end, we might have to manually merge locals into globals and remove locals, but this will not work because we allow arbitrary statements in steps.
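
For reference, one workaround that is sometimes suggested (it only helps for function objects you can wrap, not for arbitrary exec'ed statements) is to rebuild the function with the extra names merged into its __globals__:

import types

def with_names(extra_dict, func):
    # clone func, but with extra_dict merged into its globals
    new_globals = dict(func.__globals__)
    new_globals.update(extra_dict)
    return types.FunctionType(func.__code__, new_globals,
                              func.__name__, func.__defaults__,
                              func.__closure__)

ldict = {'b': 2}

def func():
    return b   # NameError if func is called directly

print(with_names(ldict, func)())   # 2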


BoPeng commented Mar 25, 2016

I found a solution, which is as ugly as hell. Note that Python resolves names in this order:

  1. local <- the function's locals, which we cannot change
  2. nested <- the enclosing scope; the decorator-style trick above does not work
  3. global <- the SoS globals, difficult to change
  4. builtins <- ???

So

gdict = {'a': 1}
ldict = {'b': 2}

for k,v in ldict.items():
    setattr(__builtins__, k, v)
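# NB: this relies on __builtins__ being the builtins module, which is the
# case in the __main__ module; in an imported module __builtins__ is a
# plain dict, and setattr() on it would fail.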

exec('''
def func():
    print(b)

func()
''', gdict, ldict)

works. As long as we do not change anything in __builtins__, this should be safe...

Finally, a less ugly but resource-intensive solution would be to use a single globals dictionary but execute each step in a separate process, so that any change to globals would not affect the main process. This might be necessary anyway because of the potential to run steps in parallel.


gaow commented Mar 25, 2016

Oh... I do not think I understand the example. So here, b is set to 2 and is accessible even inside the exec? What does gdict do here? And the exec defines a function -- in which namespace?


BoPeng commented Mar 25, 2016

Python's exec function accepts a global and a local dictionary in which to execute the passed statement. I am demonstrating that b, passed in the local dictionary, is accessible from an expression but not from within a function.


BoPeng commented Mar 25, 2016

OK, I have submitted a patch to merge the global and local namespaces of the SoS script. This is not ideal, but at least the SoS namespace is now separated from the SoS runtime. The separation of the local and global namespaces could still be done by either of the two methods proposed, but let us wait and see: the implementation of the DAG will require steps to be executed in parallel, so the problem resolves naturally if steps are executed in separate processes.


BoPeng commented Mar 26, 2016

I have fixed this problem by running each step in its own process; only the aliased object is sent back to the master process (see TestExecute.testLocalNamespace). Now we are sure that variables in one step will not affect others.
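
A minimal sketch of this process-per-step isolation (run_step and the payloads are illustrative, not the actual SoS implementation):

import multiprocessing as mp

def run_step(global_defs, step_stmts, queue):
    env = {}
    exec(global_defs, env)        # re-run global definitions (re-import, re-define)
    exec(step_stmts, env)         # step variables live and die in this process
    queue.put(env.get('_step'))   # only the pickleable aliased object goes back

if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=run_step,
                   args=("import hashlib", "_step = {'output': ['a.rds']}", q))
    p.start()
    print(q.get())                # {'output': ['a.rds']}
    p.join()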

BoPeng closed this as completed on Mar 26, 2016.