How to execute single file packaged python binaries #45

Closed
drunksaint opened this issue Jul 27, 2020 · 20 comments

@drunksaint

I have a custom python file argtest.py.txt that counts the number of lines from an input file and writes it to an output file:

$ python argtest.py input.txt output.txt
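For readers without the attachment, a script with the behavior described above could look roughly like this. It is a minimal sketch, not the attached argtest.py; the function name count_lines is hypothetical.

```python
import sys


def count_lines(in_path, out_path):
    """Count the lines in in_path and write the count to out_path."""
    with open(in_path) as src:
        n = sum(1 for _ in src)
    with open(out_path, "w") as dst:
        dst.write(f"{n}\n")
    return n


# Only run when invoked as: python argtest.py input.txt output.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    count_lines(sys.argv[1], sys.argv[2])
```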

I am trying to run this using gg. I tried packaging it to a single binary using pyinstaller (pip install pyinstaller)

$ pyinstaller argtest.py --onefile --distpath .

This creates a single binary argtest which gives the expected output.

$ ./argtest input.txt output.txt

But after adding the correct wrapper

#!/bin/bash
model-generic "/path/to/argtest @infile @outfile" "$@"

and correctly generating the thunk in output.txt

$ gg infer argtest input.txt output.txt

running gg force output.txt results in the following error:

$ gg force output.txt 
→ Loading the thunks...  done (0 ms).
[104] Cannot open self /tmp/thunk-execute.FANRSb/argtest or archive /tmp/thunk-execute.FANRSb/argtest.pkg
std::exception
 `Tmrvv.MZ1JLsEcE3l9Jyz4bxjJ0kXnhl_ewuzxSceamw00000107': process exited with failure status 255
gg-force: `Tmrvv.MZ1JLsEcE3l9Jyz4bxjJ0kXnhl_ewuzxSceamw00000107': process exited with failure status 5

the binary is of type:

$ file argtest
argtest: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=294d1f19a085a730da19a6c55788ec08c2187039, stripped

is there something i am doing wrong here? any help will be appreciated!

@sadjad
Member

sadjad commented Jul 27, 2020

Hi @drunksaint,

Let me try this out first, and will get back to you in a couple hours.

Best,
Sadjad

@drunksaint
Author

@sadjad let me know if i can help with anything

@sadjad
Member

sadjad commented Jul 28, 2020

Okay, I managed to reproduce the error with a simple script, and I'm getting the exact same message.

The problem

The PyInstaller bootstrapping function tries to open and read the binary itself. From the looks of it, it takes argv[0] as the path to the binary, but no file exists at that path (the actual binary is located in .gg/blobs/BINARY_HASH).

That's the path in the error message you're getting: /tmp/thunk-execute.FANRSb/argtest. It tries to find argtest in the current directory (the thunk's execution directory), where it doesn't exist.
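The failure mode can be reproduced in isolation. The sketch below is a simplified model of the lookup described above, not PyInstaller's actual bootloader code: it assumes the program resolves argv[0] against its working directory and checks for a file there. The names self_path_exists and demo, and the directory names, are hypothetical.

```python
import os
import tempfile


def self_path_exists(argv0, cwd):
    """Resolve argv0 against cwd and report whether a file exists there."""
    candidate = argv0 if os.path.isabs(argv0) else os.path.join(cwd, argv0)
    return os.path.isfile(candidate)


def demo():
    """Return the lookup result before and after linking the blob by name."""
    with tempfile.TemporaryDirectory() as root:
        blobs = os.path.join(root, "blobs")
        exec_dir = os.path.join(root, "thunk-execute")
        os.makedirs(blobs)
        os.makedirs(exec_dir)

        # The binary is stored under its hash, not under the name "argtest".
        blob = os.path.join(blobs, "BINARY_HASH")
        with open(blob, "wb") as f:
            f.write(b"\x7fELF")

        before = self_path_exists("argtest", exec_dir)
        # Linking the blob into the execution directory makes argv[0] resolvable.
        os.symlink(blob, os.path.join(exec_dir, "argtest"))
        after = self_path_exists("argtest", exec_dir)
        return before, after
```

Calling demo() shows the lookup failing until a link named argtest exists in the execution directory, which is what the solutions below work around.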

I can think of a few solutions:

Solution 1

You can instruct gg to create a link to the binary in the execution directory. However, it's a new feature and currently only available through gg create-thunk, using the --link option. For example, in your case, after creating your binary you can create your thunk like this:

gg create-thunk \
    --value $(gg hash input.txt) \
    --output output.txt \
    --executable $(gg hash argtest) \
    --placeholder output.txt \
    --link input.txt=$(gg hash input.txt) \
    --link argtest=$(gg hash argtest) \
    $(gg hash argtest) \
    argtest input.txt output.txt

Two links are created: one to input.txt and one to argtest. This simplifies the application, since it can refer to the files by those names, and PyInstaller would be happy...

Solution 2

Of course, doing all that is not the most convenient way to create thunks. You can also change modes/model-generic.cc and add an option to tell it to include the link to the executable... (I can help with this, if you wanna go down this road).

Solution 3

Take a look at Nuitka. It's a Python compiler that's faster and makes smaller binaries than PyInstaller. I tried it with my simple script, and it works out of the box with gg (I haven't used this in real life before, it just looked promising!).

Please let me know if any of these helps!

Best,
Sadjad

@drunksaint
Author

drunksaint commented Jul 28, 2020

Thanks for your help looking at this @sadjad. Both solution 1 and 3 worked for positional arguments! Solution 2 may not be required yet. I'll go down this road if I need to later. I had tried cython earlier but this was causing problems compiling larger libraries. That's why I started looking at pyinstaller. Nuitka seems to work great for larger libraries as well though. Thanks for this suggestion!

I expanded the test python file argtest.py to use a more complex combination of positional and optional arguments and this caused failure using both solution 1 & 3. In solution 1, gg was trying to read the optional arguments itself (gg-create-thunk: unrecognized option '--inputfile=i.txt') and in solution 3, i think my wrapper function is incorrect.

the command i used (i.txt and j.txt are input files whose number of lines are read):

python argtest.py 34 i.txt o.txt --arg 45 --inputfile j.txt --outputfile p.txt

my wrapper file:

#!/bin/bash
model-generic "/path/to/argtest @ @infile @outfile --inputfile=@infile --outputfile=@outfile" "$@"

  • is there something wrong with my wrapper file?
  • does gg create-thunk accept commands that use optional arguments?
  • is there some documentation on how to create wrappers?

@sadjad
Member

sadjad commented Jul 28, 2020

Hello there,

Glad it worked!

is there something wrong with my wrapper file?

I think the only thing it's missing is the --arg option. You need to tell model-generic about the non-file options as well, so it can parse the whole command correctly. For example, in this case, you need to add --arg=@ to the description.

does gg create-thunk accept commands that use optional arguments?

Yes, it does, but you need to tell it explicitly where the create-thunk options end and your arguments begin, by passing -- right before the positional arguments:

gg create-thunk \
    --value $(gg hash input.txt) \
    ...
    --output output.txt \
    -- $(gg hash binary) argtest input.txt output.txt --any-option-you-like test
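The -- convention is not specific to gg. As an illustration only, Python's argparse follows the same rule, so the effect can be seen in a small sketch. This is not how gg create-thunk is implemented; the parser below is hypothetical and models just two of its options.

```python
import argparse


def split_thunk_args(argv):
    """Parse tool options up to `--`; everything after it is the command."""
    parser = argparse.ArgumentParser(prog="create-thunk-sketch")
    parser.add_argument("--value", action="append", default=[])
    parser.add_argument("--output", action="append", default=[])
    # Everything after `--` lands here, even tokens that look like options.
    parser.add_argument("command", nargs="*")
    ns = parser.parse_args(argv)
    return ns.value, ns.output, ns.command
```

Given ["--value", "H1", "--output", "output.txt", "--", "BINHASH", "argtest", "--any-option-you-like", "test"], the trailing --any-option-you-like ends up in the command list instead of being rejected as an unrecognized option.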

is there some documentation on how to create wrappers?

Sadly no. There are a few examples in here, and frankly, that's really all that's supported by model-generic.

@drunksaint
Author

drunksaint commented Jul 28, 2020

I tried adding --arg=@. But the gg model creation gives an error.
My wrapper file:

#!/bin/bash
model-generic "/path/to/argtest @ @infile @outfile --arg=@ --inputfile=@infile --outputfile=@outfile" "$@"

The error I get:

$ gg infer argtest 34 i.txt o.txt --arg 45 --inputfile j.txt --outputfile p.txt
terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpected token in description
/path/to/argtest: line 2:  3516 Aborted                 (core dumped) model-generic "/path/to/argtest @ @infile @outfile --arg=@ --inputfile=@infile --outputfile=@outfile" "$@"
$ gg infer argtest 34 i.txt o.txt --arg=45 --inputfile=j.txt --outputfile=p.txt
terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpected token in description
/path/to/argtest: line 2:  3531 Aborted                 (core dumped) model-generic "/path/to/argtest @ @infile @outfile --arg=@ --inputfile=@infile --outputfile=@outfile" "$@"

I can add a PR for simple documentation to use a custom binary with gg (create wrapper file, python to binary) if that helps.

@sadjad
Member

sadjad commented Jul 28, 2020

The issue was that we didn't have support for non-file positional arguments (the first @ in your arguments). I just pushed a commit that should fix that problem.

I can add a PR for simple documentation to use a custom binary with gg (create wrapper file, python to binary) if that helps.

That would be amazing. Thank you!

@drunksaint
Author

Nice! now the thunk creation goes through. But gg force fails:

$ gg infer argtest 65 i.txt o.txt --arg=23 --inputfile=j.txt --outputfile=p.txt
$ gg force o.txt 
→ Loading the thunks...  done (0 ms).
usage: argtest [-h] [--arg ARG] [--inputfile INPUTFILE]
               [--outputfile OUTPUTFILE]
               posarg posinfile posoutfile
argtest: error: argument posinfile: can't open 'i.txt': [Errno 2] No such file or directory: 'i.txt'
std::exception
 `TsCTUcZt2lNy.X4aqO5fZapqL5rQt219d.TVJVfDlcaA0000016d': process exited with failure status 2
gg-force: `TsCTUcZt2lNy.X4aqO5fZapqL5rQt219d.TVJVfDlcaA0000016d': process exited with failure status 5

the python file if it helps. I created the binary using

python -m nuitka --follow-imports argtest.py -o argtest

@sadjad
Member

sadjad commented Jul 28, 2020

Could you please run gg describe TsCTUcZt2lNy.X4aqO5fZapqL5rQt219d.TVJVfDlcaA0000016d and post the output here?

@drunksaint
Author

drunksaint commented Jul 28, 2020

$ gg describe TsCTUcZt2lNy.X4aqO5fZapqL5rQt219d.TVJVfDlcaA0000016d
{
 "function": {
  "hash": "VwndxS_gxNE2mcwtSLrx9tEXMvi75zyQaPl5DAZEY8PA00068380",
  "args": [
   "argtest",
   "65",
   "i.txt",
   "o.txt",
   "--arg=23",
   "@{GGHASH:VziaXgzsiNzeCIBBjDSZ9oHqywVIPYTBP5.ksiphPP_000000016}",
   "--outputfile=p.txt"
  ],
  "envars": []
 },
 "values": [
  "VQJaFeszdSCpcqZ.8IO313LxfNQhtfxIAF7wcf7U2nZc0000001c",
  "VziaXgzsiNzeCIBBjDSZ9oHqywVIPYTBP5.ksiphPP_000000016"
 ],
 "thunks": [],
 "executables": [
  "VwndxS_gxNE2mcwtSLrx9tEXMvi75zyQaPl5DAZEY8PA00068380"
 ],
 "outputs": [
  "p.txt",
  "o.txt"
 ],
 "links": [],
 "timeout": 0
}

$ cat o.txt 
#!/usr/bin/env gg-force-and-run
TsCTUcZt2lNy.X4aqO5fZapqL5rQt219d.TVJVfDlcaA0000016d#o.txt
$ cat p.txt 
#!/usr/bin/env gg-force-and-run
TsCTUcZt2lNy.X4aqO5fZapqL5rQt219d.TVJVfDlcaA0000016d

@sadjad
Member

sadjad commented Jul 29, 2020

It looks like there's a bug in model-generic. Look at the function.args above; it totally omitted --inputfile and also didn't convert i.txt to @{GGHASH:}. I'm gonna take a look at it and fix the issue.
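To make the expected result concrete, here is a rough sketch of the substitution that should have happened. It is an assumption based on the @{GGHASH:...} entries visible in the gg describe output, not the model-generic code; substitute_inputs and the hash strings are hypothetical.

```python
def substitute_inputs(args, input_hashes):
    """Replace each input file name in args with its @{GGHASH:...} placeholder.

    input_hashes maps an input file name to its gg hash (as a string).
    """
    result = []
    for arg in args:
        if arg in input_hashes:
            result.append("@{GGHASH:" + input_hashes[arg] + "}")
        else:
            # Non-file arguments, options, and output names pass through.
            result.append(arg)
    return result
```

With i.txt and j.txt registered as inputs, both names would be replaced while --inputfile, 65, and o.txt stay as they are.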

@sadjad
Member

sadjad commented Jul 29, 2020

I just pushed a commit that hopefully fixes the issue!

When I was looking at the model-generic implementation, I remembered how narrow it is. It should be fine for now, but, for example, if you pass --inputfile=A instead of --inputfile A, it would not work. I'm motivated to redo the implementation to support all POSIX-style options, but that'll take some time :)

Please let me know if this fixes your problem.

Thank you!

@drunksaint
Author

drunksaint commented Jul 29, 2020

@sadjad that would really help using gg with custom commands! :).

Your changes + replacing --inputfile=A with --inputfile A works perfectly!! Thanks for the fixes!

I tried 2 other things:

  • boolean flags don't seem to work right now (command --flag). I can see the problem here. Maybe something needs to be added in generic.cc as well, but I'm not sure.
  • i tried seeing if gg could be made to use the redirection operator > by substituting it with @ in the wrapper file. Seems like that doesn't work as expected.

$ helloworld > o.txt

associated wrapper file:

#!/bin/bash
model-generic "/path/to/helloworld @ @outfile" "$@"

model inference error:

$ gg infer helloworld > o.txt 
terminate called after throwing an instance of 'std::runtime_error'
  what():  missing positional argument
/path/to/helloworld: line 2: 19498 Aborted                 (core dumped) model-generic "/path/to/helloworld @ @outfile" "$@"

The error message seems to be related to your latest commit, so i thought it might be relevant.

I really appreciate your help with everything here. Thanks!

UPDATE: just realized that the shell is removing everything from the redirection operator. not sure what the best way to do this is.

@sadjad
Member

sadjad commented Jul 29, 2020

Awesome!

boolean flags don't seem to work right now (command --flag). I can see the problem here. Maybe something needs to be added in generic.cc as well, but I'm not sure.

You should not include boolean flags in the description; only options that take a required argument need to be listed.

i tried seeing if gg could be made to use the redirection operator > by substituting it with @ in the wrapper file. Seems like that doesn't work as expected.

The redirection operator is handled by the shell itself and is never passed to the program. So, in the case of helloworld > o.txt, the shell runs helloworld and writes its stdout to o.txt. The contract in a gg thunk is that it writes its output to a file, and then that file is grabbed by gg. Currently, there's no mechanism to directly tell gg to grab the stdout.

However, there's a trick you can play. You can wrap the command you wanna run in another script. For example:

#!/bin/sh

helloworld >o.txt

Then, create a thunk for this script, which writes its output to o.txt!
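The same wrapping idea can be written in Python, which may be handy if the rest of the workload is already Python. This is a minimal sketch that only mirrors what the shell script does; whether gg can run such a wrapper inside a thunk is not covered here, and run_to_file is a hypothetical name.

```python
import subprocess
import sys


def run_to_file(command, out_path):
    """Run command and write its stdout to out_path, like `command > out_path`."""
    with open(out_path, "wb") as out:
        completed = subprocess.run(command, stdout=out, check=False)
    return completed.returncode


# Only run when invoked as: python wrapper.py o.txt helloworld [args...]
if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(run_to_file(sys.argv[2:], sys.argv[1]))
```

The output file name is passed explicitly, so the wrapper has a real output file that a thunk can declare.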

A year ago, I was trying to make gg work for simple command line programs like cat and grep that write their output to stdout, by creating a generic wrapper (iowrap). It was abandoned since, but feel free to take a look: https://github.com/sadjad/ggsh

@drunksaint
Author

drunksaint commented Jul 29, 2020

You should not include boolean flags in the description; only options that take a required argument need to be listed.

Nice, this works! I've added this with our whole discussion to the documentation in this pull request

You can wrap the command you wanna run in another script.

Sounds good. I'll try this.

A year ago, I was trying to make gg work for simple command line programs like cat and grep that write their output to stdout, by creating a generic wrapper (iowrap). It was abandoned since, but feel free to take a look: https://github.com/sadjad/ggsh

This is neat! much better than having to write wrapper commands for all scripts. I tried running it but wasn't sure how to add iowrap as a thunk. I added the files from ggsh/models to gg/src/models/wrappers and kept ggsh/iowrap in the current directory that i ran the commands from. looks like gg didn't detect the iowrap thunk or something.

$ gg infer cat i.txt
TJcHES0HLwnIqnEgUVbrAvgj2r6aKDPoh9IcGzfM9fbs00000117

$ gg describe TJcHES0HLwnIqnEgUVbrAvgj2r6aKDPoh9IcGzfM9fbs00000117
{
 "function": {
  "hash": "VkIXLi2AvcdLUIbAIYdr4IfjH5c.ikp.MZ4QNEELTWPY00000133",
  "args": [
   "iowrap",
   "-",
   "out",
   "cat",
   "@{GGHASH:VziaXgzsiNzeCIBBjDSZ9oHqywVIPYTBP5.ksiphPP_000000016}"
  ],
  "envars": []
 },
 "values": [
  "VziaXgzsiNzeCIBBjDSZ9oHqywVIPYTBP5.ksiphPP_000000016=i.txt"
 ],
 "thunks": [],
 "executables": [
  "VkIXLi2AvcdLUIbAIYdr4IfjH5c.ikp.MZ4QNEELTWPY00000133=iowrap"
 ],
 "outputs": [
  "out"
 ],
 "links": [],
 "timeout": 0
}

$ gg force out 
→ Loading the thunks...  done (0 ms).
TJcHES0HLwnIqnEgUVbrAvgj2r6aKDPoh9IcGzfM9fbs00000117: execvpe failed
std::exception
 `TJcHES0HLwnIqnEgUVbrAvgj2r6aKDPoh9IcGzfM9fbs00000117': process exited with failure status 1
gg-force: `TJcHES0HLwnIqnEgUVbrAvgj2r6aKDPoh9IcGzfM9fbs00000117': process exited with failure status 5

$ gg create-thunk --value $(gg hash iowrap) --executable $(gg hash iowrap) $(gg hash iowrap) iowrap
gg-create-thunk: a thunk needs at least one output

cat especially helps with the linking step for custom commands.
Some help with how to set this up will be great. Thanks!

@sadjad
Member

sadjad commented Jul 29, 2020

Nice, this works! I've added this with our whole discussion to the documentation in this pull request

Thank you for the pull request! I just had a peek and it looks great. Will merge it as soon as possible.

I tried running it but wasn't sure how to add iowrap as a thunk.

You're almost there! You need to collect the iowrap file. From the directory of your program, run gg collect /path/to/iowrap to make a copy in .gg/blobs directory. Also you may need to collect your input file (i.txt) manually as well (these should be easy to fix).

The nice part is that you can pipe these commands together. For example, you can run:

gg infer sh -c 'cat i.txt | grep hello'

And it will work. (as far as I remember!)

(Unfortunately, gg infer cat i.txt | grep hello would not work. But imagine if instead of bash, there's a gg shell that understands these commands and takes care of things without having to explicitly type gg infer. That was the ultimate idea behind this ggsh thing...)

@drunksaint
Author

drunksaint commented Jul 30, 2020

You're almost there! You need to collect the iowrap file. From the directory of your program, run gg collect /path/to/iowrap to make a copy in .gg/blobs directory. Also you may need to collect your input file (i.txt) manually as well (these should be easy to fix).

Nice, it works with this fix! Piping works too! Thanks! But I'm not sure I'll be able to use it, since the modeled cat looks like it works with only one input file. I'm not sure it is possible to send an unknown number of input files to a command. If cat is only needed for the final linking step, that step can be done locally.

I'm trying to parallelize a simple script. To do this, I'm splitting a file into small pieces and trying to create an output for each piece in an output directory. Outputs to the current directory work fine, but outputs to a subdirectory give an error:

$ mkdir outputdir
$ gg infer fileoutputtest outputdir/out.txt
$ gg force outputdir/out.txt 
→ Loading the thunks...  done (0 ms).
Issue in opening the Output file
std::exception
 `TmcJUtXkVfu6qE5vMqOPpVmKnO3RWSTvoHp66MaHCvPU0000009f': process died on signal 11
gg-force: `TmcJUtXkVfu6qE5vMqOPpVmKnO3RWSTvoHp66MaHCvPU0000009f': process exited with failure status 5

$ gg describe TmcJUtXkVfu6qE5vMqOPpVmKnO3RWSTvoHp66MaHCvPU0000009f
{
 "function": {
  "hash": "VGbXEAZKy6aaAzFPLtIR0m1JOnTchAJ2vw_7UJLiVe1s000020f8",
  "args": [
   "fileoutputtest",
   "o/out.txt"
  ],
  "envars": []
 },
 "values": [],
 "thunks": [],
 "executables": [
  "VGbXEAZKy6aaAzFPLtIR0m1JOnTchAJ2vw_7UJLiVe1s000020f8"
 ],
 "outputs": [
  "o/out.txt"
 ],
 "links": [],
 "timeout": 0
}

Seems like the issue is that the directory outputdir doesn't exist in the execution context. Looks like inputs can have directories since they are referred to by their hash but outputs cannot since there is no implicit directory creation in the execution context. Am I thinking about this the right way? Or is there some other way to create output files in a subdirectory?

@sadjad
Member

sadjad commented Jul 30, 2020

You're right about this. Currently the system doesn't create the output directory automatically. Although, I think you can try creating the o/ directory in your script, and then put the output file there.
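If the program is a Python script, creating the directory takes a couple of lines before opening the output file. This is a minimal sketch; open_output is a hypothetical helper name.

```python
import os


def open_output(path):
    """Open path for writing, creating its parent directory first if needed."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok avoids an error when the directory is already there.
        os.makedirs(parent, exist_ok=True)
    return open(path, "w")
```

With this, writing to o/out.txt works even when the execution directory starts out empty.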

@sadjad
Member

sadjad commented Jul 30, 2020

I'm not sure it is possible to send an unknown number of input files to a command.

This should be possible, because at the time of thunk generation, I think we know how many files we have. But I'm not sure if the current implementation of iowrap supports multiple inputs.

@drunksaint
Author

Ah i see, the gg create-thunk command can be generated dynamically. I've added multiple file support for cat to this pull request.
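For anyone landing here later, generating the command for a variable number of inputs could look roughly like the sketch below. It only assembles an argument list using the options shown earlier in this thread (--value, --link, --executable, --output, --placeholder, and the -- separator); it does not call gg, the hashes are assumed to come from gg hash, and the function name and argument layout are hypothetical.

```python
def build_create_thunk(executable_hash, program, input_hashes, output):
    """Assemble a `gg create-thunk` argument list for any number of inputs.

    input_hashes maps each input file name to its hash (from `gg hash`).
    """
    cmd = ["gg", "create-thunk"]
    for name, file_hash in input_hashes.items():
        # One --value and one --link per input file.
        cmd += ["--value", file_hash, "--link", f"{name}={file_hash}"]
    cmd += ["--executable", executable_hash]
    cmd += ["--output", output, "--placeholder", output]
    # `--` ends the create-thunk options; the program's own argv follows.
    cmd += ["--", executable_hash, program, *input_hashes, output]
    return cmd
```

The resulting list can be handed to subprocess.run once the hashes are known, so the number of input pieces does not have to be fixed in advance.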

I think I have a much better understanding of how gg can be used to parallelize a custom workload now. Thanks for your help with everything here! I'll close this issue.
