Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhancement: split out query packs #1367

Open
mitchell-liatrio opened this issue Nov 14, 2022 · 11 comments
Open

enhancement: split out query packs #1367

mitchell-liatrio opened this issue Nov 14, 2022 · 11 comments
Labels
question Further information is requested

Comments

@mitchell-liatrio
Copy link

One challenge we've seen with customers running CodeQL against large applications is the time to execute CodeQL scans. One potential solution is to split out the query packs to run separately.

ex.

  • Default (Security): Run all of the default queries provided by CodeQL
  • Security-Extended: Run all of the extended queries but do not run the default queries
  • Security-And-Quality: Run only the quality queries

This could be combined with a matrix strategy to run parallel jobs and optimize the total execution time of the workflow. I understand one limitation of this is new CodeQL packs would need to be published that support this pattern. However, it would be very beneficial for customers where PRs running over 10 minutes in their large legacy apps are considered a nonstarter for CodeQL as a long-term solution.

@aeisenberg
Copy link
Contributor

Thanks for raising this issue. Do you have any examples of code scanning runs that are taking too long? We can investigate and see if you might benefit from larger runners? I would like to understand more about your issue.

Some scans on larger projects can be resource intensive. Sometimes it's beneficial to run fewer queries at a time, but you may wind up spending time re-building the database on each run. Before trying to split the query suites, I would recommend first running using larger runners.

Alternatively, you can use query filters to selectively run subsets of queries in a query suite.

Please let me know if these suggestions help you.

@aeisenberg aeisenberg added the question Further information is requested label Nov 15, 2022
@mitchell-liatrio
Copy link
Author

We can use this repo for example but the specific use case is in a private repo.

https://github.com/OneGHEOrg/angular/actions

We've given bigger runners and have only been able to reduce the time to 11 minutes (16 core 64 Gb ram). We're open to other ideas but we'd really like to find a solution that doesn't require tossing a 64-core machine at this.

I'd also looked at using query filters to run a subset but this particular case requires security-extended which is due to needing more results that originally surfaced by the default query pack. I've thought about running a matrix job and having several different codeql-configs but that doesn't feel like a very maintainable solution since those query packs would require additional maintenance going forward as GitHub iterates on CodeQL.

In the private repo, we've been able to reduce the runtime from 17 minutes to 11 but we'd ideally get it under 10.

@aeisenberg
Copy link
Contributor

From your example repository, I have a few comments and suggestions. Not sure how applicable this is to your private repository:

  1. Since you're using javascript, there's no need for an autobuild step.
  2. Make sure that each category specified in the analyze step is unique, otherwise they will overwrite each other
  3. I see that the main job that consumes time is packages. On the one hand, it's easy to scan slices of javascript repositories so that you can get your analysis times down. On the other hand, you may lose some data flow precision if you are only scanning partial code. This is a difficult tradeoff to make and depends on how tightly coupled your code is.
  4. Another thing you can do is take a look at individual queries and find ones that are taking too long. You can then decide if they are worth running. You can use query filters to remove them. For instance, here is a list of queries that each took more than 15s to run (there are probably more, but I didn't look too hard):
https://github.com/OneGHEOrg/angular/actions/runs/3348634558/jobs/5548463308#step:5:7949
https://github.com/OneGHEOrg/angular/actions/runs/3348634558/jobs/5548463308#step:5:8017
https://github.com/OneGHEOrg/angular/actions/runs/3348634558/jobs/5548463308#step:5:8103
https://github.com/OneGHEOrg/angular/actions/runs/3348634558/jobs/5548463308#step:5:8183

The last query took almost 30s. There is a possibility that one or more of our queries are just performing poorly on your repo. We do a lot of internal benchmarking, but it's hard to do this on private repos (since we can't see the code). If you see a query that you think is taking longer than it should, let us know.

@mitchell-liatrio
Copy link
Author

  1. I removed the autobuild step. I typically just leave it in there since it doesn't hurt much but every second counts in this case.
  2. I updated the category to use the path since that will be unique in this case. Good catch 🎣
  3. In the private repo, we're using Turbo to pull in packages consumed by each app in the mono-repo since precision was a key factor.
  4. I'll work with the team to see if there are any additional queries that can be removed. I noticed the ATM Queries are showing a long evaluation time. Looking at these in the CodeQL Repo it says they are only publishing results for internal users. I can exclude these queries but it seems odd that they run if results won't be returned.

I noticed one other thing after modifying the categories. The preview and show paths are unavailable when providing an alternate path in source-root during the init step. Is this the expected behavior or something that needs to be addressed?

I'm linking an example but I believe you will need to accept the invite to be an outside collaborator in order to see it.
https://github.com/OneGHEOrg/angular/security/code-scanning/1444

@aeisenberg
Copy link
Contributor

When you specify a custom source-root, code-scanning (the server-side imlpementation that processes the results) does not know how to map the locations of the alerts to places in the code since the root is different. You should instead be using paths and paths-ignore. And you will need to specify this in a user config file. See https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#specifying-directories-to-scan

Please give me more time to think about your other questions.

@aeisenberg
Copy link
Contributor

And thanks for pointing out the documentation for the ATM queries. We've just updated the README. They are being run for javascript codescanning runs.

@aeisenberg
Copy link
Contributor

The ATM queries are still somewhat experimental. You may want to filter them from running if they are taking too much time.

@mitchell-liatrio
Copy link
Author

Sounds good and we actually found an alternative solution for source-root using mono-repos by using turbo which can pull a project and its dependent packages while maintaining the locations for visibility.

We originally looked at using codeql-config but needed to deviate from this pattern since we wanted to run each app as part of a matrix job.

As for the codeql pack split out. What are your initial thoughts on having some additional packs such as security-core, extended-only, and quality-only so we could run those as part of a matrix further reducing runtime? We can likely omit some of the long-running queries (ATM) but want to be careful about omitting too much.

@aeisenberg
Copy link
Contributor

We originally looked at using codeql-config but needed to deviate from this pattern since we wanted to run each app as part of a matrix job.

When you specify a matrix job, you can include some matrix parameters including which codeql-config.yml files to run. For example, your matrix currently looks like this:

      matrix:
        language: [ 'javascript' ]
        paths: ${{ fromJson(needs.list-paths.outputs.paths) }}

It happens to be that each of your directories is just a single word, so you can use this to generate unique codeql-config files for each job. Eg-

      - name: Initialize CodeQL
        uses: github/codeql-action/init@v2
        with:
          languages: ${{ matrix.language }}
          queries: security-extended,security-and-quality
          source-root: ${{ matrix.paths }}
          config: './configs/codeql-config-${{ matrix.paths }}.yml'

Or something like that.

As for the codeql pack split out. What are your initial thoughts on having some additional packs such as security-core, extended-only, and quality-only so we could run those as part of a matrix further reducing runtime? We can likely omit some of the long-running queries (ATM) but want to be careful about omitting too much.

I will have to communicate with our product team, since this is not a technical decision and can potentially affect many customers.

@mitchell-liatrio
Copy link
Author

We originally looked at using codeql-config but needed to deviate from this pattern since we wanted to run each app as part of a matrix job.

When you specify a matrix job, you can include some matrix parameters including which codeql-config.yml files to run. For example, your matrix currently looks like this:

      matrix:
        language: [ 'javascript' ]
        paths: ${{ fromJson(needs.list-paths.outputs.paths) }}

It happens to be that each of your directories is just a single word, so you can use this to generate unique codeql-config files for each job. Eg-

      - name: Initialize CodeQL
        uses: github/codeql-action/init@v2
        with:
          languages: ${{ matrix.language }}
          queries: security-extended,security-and-quality
          source-root: ${{ matrix.paths }}
          config: './configs/codeql-config-${{ matrix.paths }}.yml'

Or something like that.

I'll take a look into this and see if it will work for the actual use case.

As for the codeql pack split out. What are your initial thoughts on having some additional packs such as security-core, extended-only, and quality-only so we could run those as part of a matrix further reducing runtime? We can likely omit some of the long-running queries (ATM) but want to be careful about omitting too much.

I will have to communicate with our product team, since this is not a technical decision and can potentially affect many customers.

Sounds great. I understand there's a lift here but I can definitely see this strategy helping with some of these larger repos/apps that are concerned about performance. Especially when it's embedded into the PR workflow.

@aeisenberg
Copy link
Contributor

We're having a conversation about expanding our set of query suites or query suite selectors (ie- building blocks for custom query suites). If we decide this is the right thing to do, we won't be able to implement in the short term, so I would recommend finding a solution here.

If you would like to permanently opt out of the ATM queries, we can do so on our end. Just please let us know which organizations and repositories we should do this for. And if you ever change your mind, we can easily undo it. This is not much different from filtering them out using query-filters thought. So, the choice is yours.

Another suggestion is that you avoid running security-and-quality queries and only run security-extended or simply security queries. The quality queries are more appropriate for a linting tool. There are more comprehensive linters available for you to use if that is your goal. CodeQL focuses on security. Also, if you are seeing many false positives with the security-extended queries, you can consider running the default queries only, which will save you even more time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants