Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using Regex to split shell script #717

Merged
merged 3 commits into from Jun 19, 2020
Merged

Conversation

ForgetMe17
Copy link
Contributor

Add function split_shell_script() in tern/utils/general.py.
This function receives a shell script, split it into statements:

  • command
  • variable
  • loop
  • branch

In this commit, use Regex to split the shell script by seperators
which are &&, ||, |, ;, :;(this is a special case).
To skip quotes while spliting, use (*SKIP)(*F). This is a feature
in Regex not in re(python module).

Works towards #682.

Signed-off-by: WangJL hazard15020@gmail.com

@ForgetMe17
Copy link
Contributor Author

I am using a new module Regex.
Regex is backwards-compatible with the standard ‘re’ module, but offers additional functionality.

Copy link
Contributor

@nishakm nishakm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! The results are quite accurate. Well done!! I have a few minor changes and considerations for you. Other than that, you can use this method in your implementation.

@@ -6,6 +6,7 @@
import os
import random
import re
import regex as mrab
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regex is not part of the standard library, so you will need to add it to requirements.in and requirements.txt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit: I don't think it's necessary to alias regex. re is a very commonly used python module and included in the standard library.

tern/utils/general.py Show resolved Hide resolved
Add function split_shell_script() in tern/utils/general.py.
This function receives a shell script, split it into statements:
- command
- variable
- loop
- branch

In this commit, Regex is used to split the shell script by seperators
which are '&&', '||', '|', ';', ':;'(this is a special case).
To skip quotes while spliting, use (*SKIP)(*F). This is a feature
in Regex not in re(python module).

Add function parse_shell_variables_and_command().
Add function parse_shell_loop_and_branch().

Works towards tern-tools#682.

Signed-off-by: WangJL <hazard15020@gmail.com>
@ForgetMe17 ForgetMe17 marked this pull request as ready for review June 9, 2020 13:10
@ForgetMe17
Copy link
Contributor Author

I have implemented the rest of split function. Here are the local test output of the dockerfiles under tests\dockerfiles\split_shell_script.
buildpack_deps_buster.txt
buildpack_deps_buster_scm.txt
debian_buster_slim.txt

@ForgetMe17
Copy link
Contributor Author

ForgetMe17 commented Jun 9, 2020

I find another issue.
For command like this:

find /usr/local -depth \
		\( \
			\( -type d -a \( -name test -o -name tests -o -name idle_test \) \) \
			-o \
			\( -type f -a \( -name '*.pyc' -o -name '*.pyo' \) \) \
		\) -exec rm -rf '{}' + \

Currently we can not remove \ and multi whitespace in a concatenated command.
the result is:

{'command': {'name': 'find',
             'options': [('-depth', '\\('),
                         ('-type', 'd'),
                         ('-a', '\\('),
                         ('-name', 'test'),
                         ('-o', ''),
                         ('-name', 'tests'),
                         ('-o', ''),
                         ('-name', 'idle_test'),
                         ('-type', 'f'),
                         ('-a', '\\('),
                         ('-name', "'*.pyc'"),
                         ('-o', ''),
                         ('-name', "'*.pyo'"),
                         ('-exec', 'rm'),
                         ('-rf', "'{}'")],
             'words': ['/usr/local',
                       '\\(',
                       '\\(',
                       'd',
                       '\\(',
                       'test',
                       'tests',
                       'idle_test',
                       '\\)',
                       '\\)',
                       '-o',
                       '\\(',
                       'f',
                       '\\(',
                       "'*.pyc'",
                       "'*.pyo'",
                       '\\)',
                       '\\)',
                       '\\)',
                       'rm',
                       "'{}'",
                       '+']},
 'content': 'find /usr/local -depth \t\t\\( \t\t\t\\( -type d -a \\( -name '
            'test -o -name tests -o -name idle_test \\) \\) \t\t\t-o \t\t\t\\( '
            "-type f -a \\( -name '*.pyc' -o -name '*.pyo' \\) \\) \t\t\\) "
            "-exec rm -rf '{}' +"}

Maybe we can use str.replace() to replace \, \n, \t into (whitespace), and the use
' '.join(shlex.split()) to replace multi whitespace into one whitespace

Copy link
Contributor

@nishakm nishakm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it looks good! I would only ask for you to complete the two TODOs you have recorded in this commit. If you think you need to add a new commit on top of this one, you can go ahead and do that.

Comment on lines 258 to 261
# TODO remove \t \r \n \ in the command
statement['command'] = parse_command(concatenated_command)
return statement

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can fix this issue in this PR as well. If you need to, add another commit on top of this one.

tern/utils/general.py Outdated Show resolved Hide resolved
This commit enables parse_shell_loop_and_branch() to extract
statements for a loop, and use clean_command() to clean tab
and line indentations.

Modificate the statement dictionary, change the dict for command:
{'content':..., 'command':...} to {'conmmand':...}.
Since we will do parse_command() in the futher part of finding the
installed software, we will only record the concatenated command
string here.

Works towards tern-tools#682 and tern-tools#706.

Signed-off-by: WangJL <hazard15020@gmail.com>
@ForgetMe17
Copy link
Contributor Author

I added a new commit, and here are the changes:

  1. use clean_command() in general.py to remove tab and line indentations.
  2. add code to extractstatements for a loop in parse_shell_loop_and_branch().
  3. remove parse_command on the command, since we will do this in the futher step for finding installed software. Accordingly, i remove the key content for command since it is uncessary and record the concatenated command string in the key command.

Attached are the output from my local test.
buildpack_deps_buster.txt
buildpack_deps_buster_scm.txt
debian_buster_slim.txt

@nishakm
Copy link
Contributor

nishakm commented Jun 18, 2020

@ForgetMe17 I tried your changes on the golang RUN statement:

set -eux;               dpkgArch="$(dpkg --print-architecture)";        case "${dpkgArch##*-}" in               amd64) goRelArch='linux-amd64'; goRelSha256='11814b7475680a09720f3de32c66bca135289c8d528b2e1132b0ce56b3d9d6d7' ;;                armhf) goRelArch='linux-armv6l'; goRelSha256='d4da5c06097be8d14aeeb45bf8440a05c82e93e6de26063a147a31ed1d901ebc' ;;              arm64) goRelArch='linux-arm64'; goRelSha256='2648b7d08fe74d0486ec82b3b539d15f3dd63bb34d79e7e57bebc3e5d06b5a38' ;;                i386) goRelArch='linux-386'; goRelSha256='83d732a3961006e058f44c9672fde93dbea3d1c3d69e8807d135eeaf21fb80c8' ;;          ppc64el) goRelArch='linux-ppc64le'; goRelSha256='33f7bed5ee9d4a0343dc90a5aa4ec7a1db755d0749b624618c15178fd8df4420' ;;            s390x) goRelArch='linux-s390x'; goRelSha256='493b4449e68d0deba559e3f23f611310467e4c70d30b3605ff06852f14477457' ;;               *) goRelArch='src'; goRelSha256='78cda84d4217ae0fdb8f87848474be28644bdc1aa16579055f852999a4793ac0';                      echo >&2; echo >&2 "warning: current architecture ($dpkgArch) does not have a corresponding Go binary release; will be building from source"; echo >&2 ;;        esac;           url="https://golang.org/dl/go${GOLANG_VERSION}.${goRelArch}.tar.gz";    wget -O go.tgz "$url";  echo "${goRelSha256} *go.tgz" | sha256sum -c -;         tar -C /usr/local -xzf go.tgz;   rm go.tgz;              if [ "$goRelArch" = 'src' ]; then               echo >&2;               echo >&2 'error: UNIMPLEMENTED';                echo >&2 'TODO install golang-any from jessie-backports for GOROOT_BOOTSTRAP (and uninstall after build)';               echo >&2;               exit 1;         fi;             export PATH="/usr/local/go/bin:$PATH";  go version

I get this output:

{'command': 'set -eux'}
{'variable': {'name': 'dpkgArch', 'value': '"$(dpkg --print-architecture)"'}, 'content': 'dpkgArch="$(dpkg --print-architecture)"'}
{'content': ['case "${dpkgArch##*-}" in \t\tamd64) goRelArch=\'linux-amd64\'', "goRelSha256='11814b7475680a09720f3de32c66bca135289c8d528b2e1132b0ce56b3d9d6d7'", "armhf) goRelArch='linux-armv6l'", "goRelSha256='d4da5c06097be8d14aeeb45bf8440a05c82e93e6de26063a147a31ed1d901ebc'", "arm64) goRelArch='linux-arm64'", "goRelSha256='2648b7d08fe74d0486ec82b3b539d15f3dd63bb34d79e7e57bebc3e5d06b5a38'", "i386) goRelArch='linux-386'", "goRelSha256='83d732a3961006e058f44c9672fde93dbea3d1c3d69e8807d135eeaf21fb80c8'", "ppc64el) goRelArch='linux-ppc64le'", "goRelSha256='33f7bed5ee9d4a0343dc90a5aa4ec7a1db755d0749b624618c15178fd8df4420'", "s390x) goRelArch='linux-s390x'", "goRelSha256='493b4449e68d0deba559e3f23f611310467e4c70d30b3605ff06852f14477457'", "*) goRelArch='src'", "goRelSha256='78cda84d4217ae0fdb8f87848474be28644bdc1aa16579055f852999a4793ac0'", 'echo >&2', 'echo >&2 "warning: current architecture
($dpkgArch) does not have a corresponding Go binary release; will be building from source"', 'echo >&2', 'esac'], 'branch': {'type': 'case'}}
{'variable': {'name': 'url', 'value': '"https://golang.org/dl/go${GOLANG_VERSION}.${goRelArch}.tar.gz"'}, 'content': 'url="https://golang.org/dl/go${GOLANG_VERSION}.${goRelArch}.tar.gz"'}
{'command': 'wget -O go.tgz "$url"'}
{'command': 'echo "${goRelSha256} *go.tgz"'}
{'command': 'sha256sum -c -'}
{'command': 'tar -C /usr/local -xzf go.tgz'}
{'command': 'rm go.tgz'}
{'content': ['if [ "$goRelArch" = \'src\' ]', 'then \t\techo >&2', "echo >&2 'error: UNIMPLEMENTED'", "echo >&2 'TODO install golang-any from jessie-backports for GOROOT_BOOTSTRAP (and uninstall after build)'", 'echo >&2', 'exit 1', 'fi'], 'branch': {'type': 'if'}}
{'command': 'export PATH="/usr/local/go/bin:$PATH"'}
{'command': 'go version'}

I think this looks pretty good! I wonder if you can take care of a few small things here:

For command echo "${goRelSha256} *go.tgz" | sha256sum -c - it looks like the parser considers echo "${goRelSha256} *go.tgz" and sha256sum -c - as two separate commands. Pipes are one of those things that we really can't determine. So I think it is fine to not split commands based on |.

If we are monitoring variable assignments, we may want to also take into account export PATH="/usr/local/go/bin:$PATH" which is basically a variable assignment in a command. It's not going to help much but it is a corner case to consider.

@ForgetMe17
Copy link
Contributor Author

Thanks for your adivce! I have made a quick fix by the latest commit.

Pipe symbol '|' should not be seperated.
Command 'export' can be treated as variable assignment.

Works towards tern-tools#682.

Signed-off-by: WangJL <hazard15020@gmail.com>
@ForgetMe17
Copy link
Contributor Author

Here are the test results.
buildpack_deps_buster.txt
buildpack_deps_buster_scm.txt
debian_buster_slim.txt
In buildpack_deps_buster_scm.txt:
The output for

echo "${goRelSha256} *go.tgz" | sha256sum -c -; \
	tar -C /usr/local -xzf go.tgz; \
	rm go.tgz; \
	\
	if [ "$goRelArch" = 'src' ]; then \
		echo >&2; \
		echo >&2 'error: UNIMPLEMENTED'; \
		echo >&2 'TODO install golang-any from jessie-backports for GOROOT_BOOTSTRAP (and uninstall after build)'; \
		echo >&2; \
		exit 1; \
	fi; \
	\
	export PATH="/usr/local/go/bin:$PATH"; \

is now

5
-----------------------
{'command': 'echo "${goRelSha256} *go.tgz" | sha256sum -c -'}
-----------------------
6
-----------------------
{'command': 'tar -C /usr/local -xzf go.tgz'}
-----------------------
7
-----------------------
{'command': 'rm go.tgz'}
-----------------------
8
-----------------------
{'branch': {'type': 'if'},
 'content': ['if [ "$goRelArch" = \'src\' ]',
             'then \t\techo >&2',
             "echo >&2 'error: UNIMPLEMENTED'",
             "echo >&2 'TODO install golang-any from jessie-backports for "
             "GOROOT_BOOTSTRAP (and uninstall after build)'",
             'echo >&2',
             'exit 1',
             'fi']}
-----------------------
9
-----------------------
{'content': 'export PATH="/usr/local/go/bin:$PATH"',
 'variable': {'name': 'PATH', 'value': '"/usr/local/go/bin:$PATH"'}}
-----------------------
10
-----------------------
{'command': 'go version'}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants