Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(php): Support case insensitive function calls and classes #8356

Merged

Conversation

akuhlens
Copy link
Contributor

@akuhlens akuhlens commented Jul 24, 2023

What

Adds matching support for languages that have case insensitive identifiers and demonstrates their usage for Php.

closes #7231

How

Adds a boolean field to id_info fields and updates Generic_vs_generic.ml and Matching_generic.ml to respect these fields. I originally thought it would be easier to add a special case for Php in matching, but extending Matching_generic.ml to be language specific becomes troublesome because equal_ast_bound_code is called from outside this submodules in contexts that could possibly be addressing variables for multiple languages (or at least that was my take on the situation and types at play).

Remaining Work To Do

This was shared as a draft to communicate my work and get feedback about how to add the bitfield to id_info and update the submodule.

  • Modify id_info to contain a id_flags field that contains a bitfield instead of having two separate boolean fields (id_case_insensitive, and id_hidden).
  • Clean up code and document purpose.
  • Add change log entry.
  • Submit pull request for semgrep-rules submodule and update submodule to point to main branch again. (Not actually sure the correct order to do this without breaking things). (ongoing here)

Testing

Adds a few test cases matching against identifiers and metavariables in a case insensitive fashion and updates tests/semgrep-rules that were disabled due lack of support for this. Note, tests/semgrep-rules is currently pointing to a branch that I need to open a pull request for.

PR checklist:

  • Purpose of the code is evident to future readers
  • Tests included or PR comment includes a reproducible test plan
  • Documentation is up-to-date
  • A changelog entry was added to changelog.d for any user-facing change
  • Change has no security implications (otherwise, ping security team)

If you're unsure about any of this, please see:

@akuhlens akuhlens force-pushed the andre/7231-php_function_call_case_insensitive-field_added branch from 5c63607 to dbcf2ff Compare July 26, 2023 23:50
@github-actions
Copy link
Contributor

📸 The pytest shapshots changed in your PR.
Please carefully review these changes and make sure they are intended:

  1. Review the changes at f67618b

  2. Accept the new snapshots with

    git fetch origin && git cherry-pick f67618b && git push
    

@github-actions
Copy link
Contributor

🚫 The whole benchmark suite is too slow: +10.5% (+1.105 s)

14 benchmarks, 10.5% slower on average.

Individual deviations greater than 20% from the baseline are reported. An individual performance degradation of over 30% or a global degradation of over 7% is an error and will block the pull request. See run output for full results ('Show all checks' > 'Tests / semgrep benchmark tests' 'Details').

@akuhlens akuhlens marked this pull request as ready for review July 27, 2023 16:09
@akuhlens akuhlens requested a review from aryx July 27, 2023 18:24
@akuhlens
Copy link
Contributor Author

📸 The pytest shapshots changed in your PR. Please carefully review these changes and make sure they are intended:

  1. Review the changes at f67618b
  2. Accept the new snapshots with
    git fetch origin && git cherry-pick f67618b && git push
    

I think these changes to the python output are fine. I am pretty sure they are caused by the update to libs/ast_generic/Meta_ast.ml, but I would appreciate someone confirming this for me.

@akuhlens
Copy link
Contributor Author

The tests that are failing are the ones that I have an approved pull request in semgrep-rules for. I just can't land that PR until those tests pass on develop, which requires I land this diff first.

@ihji
Copy link
Contributor

ihji commented Jul 27, 2023

@akuhlens Thanks for the awesome work! Just out of curiosity, what were the major benefits of introducing bit encoding instead of using two boolean record fields? I might be missing some context, but it appears a bit over-engineered to me.

@akuhlens
Copy link
Contributor Author

@akuhlens Thanks for the awesome work! Just out of curiosity, what were the major benefits of introducing bit encoding instead of using two boolean record fields? I might be missing some context, but it appears a bit over-engineered to me.

I probably wouldn't have done it that way at first either, but there was some concern on the team that adding more boolean flags to such a prevalent node in the AST might have performance implications. I went ahead and provided the optimized representation since there were concerns.
https://semgrepinc.slack.com/archives/C01NXGX2EHZ/p1689791184708529

As a side note, the language specific approach in this discussion ended up not working which is why there is no naive diff.

@aryx
Copy link
Collaborator

aryx commented Jul 28, 2023

You can update the semgrep-rules submodule in this repo to point to your experimental semgrep-rules, so that we can be sure
it works.

@aryx aryx requested review from IagoAbal and mjambon July 28, 2023 11:14
Copy link
Collaborator

@aryx aryx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this works if the metavariable binds to an expression rather than directly
to an idenifier.
For example if you have the pattern $X == $X,

and you have 1 + bar() == 1 + Bar()

I don't think your PR will return a match.
It's because for that we call MV.Structural.equal_mvalue a b,
and I don't think we do anything with case sensitivity there.

languages/php/generic/php_to_generic.ml Outdated Show resolved Hide resolved
libs/ast_generic/Id_info_flags.mli Outdated Show resolved Hide resolved
libs/commons/Immediate_bitfield.ml Outdated Show resolved Hide resolved
src/matching/Generic_vs_generic.ml Outdated Show resolved Hide resolved
@aryx
Copy link
Collaborator

aryx commented Jul 29, 2023

To be fair, this is an important improvement about what we had before; maybe my $X == $X match of complex expression should be done in a separate PR.

@aryx
Copy link
Collaborator

aryx commented Jul 31, 2023

@akuhlens Maybe fix the failing tests, and add a giant TODO somewhere (maybe in equal_bind_ast case) that equality between metavar expression is still done in a case-sensitive way (because we use the derived eq). In theory we should use Generic_vs_generic too when we check for equality of metavars ... In any case this looks like progress to what we had before
so maybe we can merge this as a good step forward.

@akuhlens
Copy link
Contributor Author

@aryx Sorry, I am just seeing these last two comments. Maybe I need to adjust my notification settings to give github notification more visibility, or perhaps they were here when I replied to your comments and I just got caught up with the bug you noticed. Either way I will take steps to make sure it doesn't happen again. I did see your comment in slack on Monday but have been out sick with covid until today. I will fix the other things you mentioned and leave a TODO and commented out test case for the derived equality case.

@akuhlens akuhlens force-pushed the andre/7231-php_function_call_case_insensitive-field_added branch from 06a484e to 1ce06fb Compare August 11, 2023 05:18
@github-actions
Copy link
Contributor

🚫 Benchmark semgrep.bench.dropbox.std is too slow: +48.3% (+0.569 s)

🚫 Benchmark semgrep.bench.0c34.std is too slow: +45.7% (+0.563 s)

🚫 Benchmark semgrep.bench.lodash.std is too slow: +35.7% (+0.619 s)

🚫 Benchmark semgrep.bench.coolMenu.std is too slow: +47.1% (+0.551 s)

🚫 Benchmark semgrep.bench.grpc.std is too slow: +33.9% (+0.561 s)

🚫 The whole benchmark suite is too slow: +27.4% (+1.274 s)

14 benchmarks, 27.4% slower on average.

Individual deviations greater than 20% from the baseline are reported. An individual performance degradation of over 30% or a global degradation of over 7% is an error and will block the pull request. See run output for full results ('Show all checks' > 'Tests / semgrep benchmark tests' 'Details').

@akuhlens akuhlens requested a review from aryx August 13, 2023 23:32
@github-actions
Copy link
Contributor

🚫 Benchmark semgrep.bench.dropbox.std is too slow: +53.6% (+0.650 s)

🚫 Benchmark semgrep.bench.0c34.std is too slow: +50.6% (+0.640 s)

🚫 Benchmark semgrep.bench.lodash.std is too slow: +41.2% (+0.730 s)

🚫 Benchmark semgrep.bench.coolMenu.std is too slow: +53.1% (+0.646 s)

🚫 Benchmark semgrep.bench.grpc.std is too slow: +40.0% (+0.683 s)

🚫 The whole benchmark suite is too slow: +30.6% (+1.306 s)

14 benchmarks, 30.6% slower on average.

Individual deviations greater than 20% from the baseline are reported. An individual performance degradation of over 30% or a global degradation of over 7% is an error and will block the pull request. See run output for full results ('Show all checks' > 'Tests / semgrep benchmark tests' 'Details').

@aryx aryx merged commit e8da591 into develop Aug 14, 2023
38 checks passed
@aryx aryx deleted the andre/7231-php_function_call_case_insensitive-field_added branch August 14, 2023 18:59
cretoxyrhina pushed a commit to cretoxyrhina/semgrep that referenced this pull request Oct 17, 2023
…ep#8356)

## What
Adds matching support for languages that have case insensitive
identifiers and demonstrates their usage for Php.

closes semgrep#7231

## How
Adds a boolean field to `id_info` fields and updates
`Generic_vs_generic.ml` and `Matching_generic.ml` to respect these
fields. I originally thought it would be easier to add a special case
for Php in matching, but extending `Matching_generic.ml` to be language
specific becomes troublesome because `equal_ast_bound_code` is called
from outside this submodules in contexts that could possibly be
addressing variables for multiple languages (or at least that was my
take on the situation and types at play).

## Remaining Work To Do
This was shared as a draft to communicate my work and get feedback about
how to add the bitfield to id_info and update the submodule.
- [x] Modify `id_info` to contain a `id_flags` field that contains a
bitfield instead of having two separate boolean fields
(id_case_insensitive, and id_hidden).
- [x] Clean up code and document purpose.
- [x] Add change log entry.
- [x] Submit pull request for `semgrep-rules` submodule and update
submodule to point to main branch again. (Not actually sure the correct
order to do this without breaking things). [(ongoing
here)](semgrep/semgrep-rules#3013)

## Testing
Adds a few test cases matching against identifiers and metavariables in
a case insensitive fashion and updates `tests/semgrep-rules` that were
disabled due lack of support for this. Note, `tests/semgrep-rules` is
currently pointing to a branch that I need to open a pull request for.

PR checklist:

- [x] Purpose of the code is [evident to future
readers](https://semgrep.dev/docs/contributing/contributing-code/#explaining-code)
- [x] Tests included or PR comment includes a reproducible test plan
- [x] Documentation is up-to-date
- [x] A changelog entry was [added to
changelog.d](https://semgrep.dev/docs/contributing/contributing-code/#adding-a-changelog-entry)
for any user-facing change
- [x] Change has no security implications (otherwise, ping security
team)

If you're unsure about any of this, please see:

- [Contribution
guidelines](https://semgrep.dev/docs/contributing/contributing-code)!
- [One of the more specific guides located
here](https://semgrep.dev/docs/contributing/contributing/)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

PHP function call matching not case insensitive
3 participants