Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batched & parallel support for cfn-lint, eslint, gitleaks #4088

Merged
merged 16 commits into from
Oct 31, 2023

Conversation

kftsehk
Copy link
Contributor

@kftsehk kftsehk commented Apr 16, 2023

Proposal to fix

Currently planned to provide implementation for

  • cfn-lint
  • eslint
  • gitleak

Use cases includes

  • Faster whole repo linting on cloud (changed files can affect unchanged files)
  • Local linting on commit / push without having developers to wait too much

Proposed Changes

Batched and parallel linting for use case involving large number of files.

  • Batched: Use $LINTER_COMMAND FILE1 FILE2 FILE3 .... instead of $LINTER_COMMAND FILE looping over all files
  • Parallel: Files are divided into not-so-small chunks and multiple linter instances up to nprocs are run in parallel
    As for most parallelization, this patch does not reproduce exact output of serial version

To enable this experimental worker, set env EXPERIMENTAL_BATCH_WORKER to true

In order to mimic serial behavior as closely as possible, the implementation

  • show linter output as is,
  • does not break or interleave linter output in any way,
  • provide the same error file count (lines outputting text such as [ERROR] ERRORS FOUND in JAVASCRIPT_ES:[123]) as serial version when super-linter is completed

Current ERROR output on serial version cannot be recovered, since

  • it is not possible retrieve error code of linter for individual file, batching implies only 1 return code is given for a batch of files.

Performance comparison

A figure is provided only for eslint on a repo that I have on hand

  • 8 core laptop
  • approx. 450 .js file
  • timing only on JAVASCRIPT_ES block

Test 1: 429 files contains linter error (not warning) artificially introduced (mainly react/* rules that was turned off for the project)

  • original: 7 min 21 sec
  • new impl: 14 sec (it is not necessary to parallel, still under 30s without parallel)

Test 2: No artificially introduced error (0 error, several warnings)

  • original: 7 min 4 sec
  • new impl: 5 sec

For the experimental implementation, majority of the performance issue is in bash-script implemented output parser to calculate number of error files, it is not too bad anyway.

Readiness Checklist

Author/Contributor

  • If documentation is needed for this change, has that been included in this pull request
  • Test case count is not yet recovered

Reviewing Maintainer

  • Label as breaking if this is a large fundamental change
  • Label as either automation, bug, documentation, enhancement, infrastructure, or performance

@kftsehk
Copy link
Contributor Author

kftsehk commented Apr 17, 2023

Just wonder why this gets a linter error, was assigning the internal counter back to variable ERRORS_FOUND_${FILE_TYPE}

  (("ERRORS_FOUND_${FILE_TYPE}"=CFNLINT_ERRORS_FOUND));

while

(("ERRORS_FOUND_${FILE_TYPE}"++))

has no linter error.

Both evaluated as expected in bash.

If cannot be solved maybe I will hardcode the FILE_TYPE there

@kftsehk
Copy link
Contributor Author

kftsehk commented Apr 17, 2023

I think I will leave the python toolchain later, but adding gitleaks parallelized (gitleaks does not support batch, only parallel is possible)

One small note for gitleaks

parallel version only output the information related to the detected leaks, i. e.

Finding:     -----BEGIN PRIVATE KEY-----
put your private key here  
-----END PRIVATE KEY----   
Secret:      -----BEGIN PRIVATE KEY-----
put your private key here  
-----END PRIVATE KEY----   
RuleID:      private-key
Entropy:     4.186297
File:        /tmp/lint/private-key.pem
Line:        1
Fingerprint: /tmp/lint/private-key.pem:private-key:1

parallel version does not output

  • (number|"no") leaks found
  • gitleaks took (number) ms

because when parallel, it is not possible to relate the above message to the file which generates it. However this can be justified by the fact that these messages are kind of not informative even when run in serial

The number returned at the end of super-linter is still correct, equals to serial run.

@kftsehk
Copy link
Contributor Author

kftsehk commented Apr 17, 2023

Sorry for messing with linter errors, have checked locally against this for latest commit

docker run --rm  -e IGNORE_GITIGNORED_FILES=true -e LINTER_RULES_PATH=/ -e LOG_LEVEL=NOTICE \
-e RUN_LOCAL=true -e USE_FIND_ALGORITHM=true \
-e VALIDATE_BASH=true \
-e ERROR_ON_MISSING_EXEC_BIT=true \
-e VALIDATE_BASH_EXEC=true  \
-e VALIDATE_EDITORCONFIG=true \
-e VALIDATE_SHELL_SHFMT=true  \
-w /tmp/lint -v "$(git rev-parse --show-toplevel)":/tmp/lint github/super-linter:slim-latest

@kftsehk kftsehk changed the title faster linter for cfn-lint and eslint Batched & parallel support for cfn-lint, eslint, gitleaks Apr 19, 2023
@kftsehk
Copy link
Contributor Author

kftsehk commented Apr 28, 2023

Hi, would like to follow up on how to proceed with testing and moving towards merging the PR.

This issue features a new code path, that is slightly incompatible (as in exact char-by-char output diff sense), enabled by setting EXPERIMENTAL_BATCH_WORKER to true, it seem current CI won’t cover this path unless also test with this env set.

@github-actions
Copy link
Contributor

This pull request has been automatically marked as stale because it has not had recent activity.
It will be closed in 14 days if no further activity occurs.
Thank you for your contributions.

If you think this pull request should stay open, please remove the O: stale 🤖 label or comment on the pull request.

If you're a maintainer, you can stop the bot to mark this issue as stale in the future by adding the O: backlog 🤖 label`.

@github-actions github-actions bot added the O: stale 🤖 Stale issue/pr label Jun 22, 2023
@kftsehk
Copy link
Contributor Author

kftsehk commented Oct 26, 2023

Please block a while for refactoring some code to improve tracability

@kftsehk kftsehk force-pushed the faster-linter branch 3 times, most recently from b4cb9a2 to 55882bc Compare October 26, 2023 13:21
@kftsehk
Copy link
Contributor Author

kftsehk commented Oct 26, 2023

is good now, refactor using named pipe to get rid of redirection hell. all 3 linter's file is just ~60 line now.

@kftsehk
Copy link
Contributor Author

kftsehk commented Oct 27, 2023

As an aside, I guess someone might use non-default linter output format, e.g. json/xml if the linter supports it.

Serial implementation relies on return value of linter running one file, and is format agnostic.

Parallel is format agnostic with no batching, fully compatible with previous implementation, parallel --joblog can be used to retrieve each return value of single file linting run in parallel

Batching on the other hand, combines linter errors from multiple files, linter return value has no clear meaning for most linter.
In order to sum total file failed, way to read linter output depends on user's preferred format.

@zkoppert
Copy link
Contributor

approved ci workflows, awaiting results.

@zkoppert
Copy link
Contributor

Can we update permissions on the file to be executable?

Error: File:[/github/workspace/lib/functions/experimental-batch-workers/base.sh] is not executable

@kftsehk
Copy link
Contributor Author

kftsehk commented Oct 28, 2023 via email

@zkoppert zkoppert added this pull request to the merge queue Oct 31, 2023
@zkoppert zkoppert linked an issue Oct 31, 2023 that may be closed by this pull request
Merged via the queue into super-linter:main with commit c3ac3aa Oct 31, 2023
4 checks passed
@kftsehk kftsehk deleted the faster-linter branch October 31, 2023 01:48
@kftsehk kftsehk mentioned this pull request Nov 12, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request O: backlog 🤖 Backlog, stale ignores this label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invoking eslint per file cause serious performance issue
3 participants