Improve score by supporting extra_phrase for extra words in rules #4432

Open
wants to merge 8 commits into
base: develop
from

Conversation

alok1304
Collaborator

@alok1304 alok1304 commented Jun 19, 2025

Follow up of:

Add a new kind of phrase, extra_phrase, dedicated to extra-words. It is written in the format [[n]], where n is the maximum number of extra-words allowed at that position in the rule.

If extra-words appear at the correct position and their count does not exceed the allowed limit n, then the score is increased to 100.
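
A rough sketch of the idea, for illustration only: it does not use the real matching code. The helper names (`parse_rule`, `extra_words_within_limits`) are hypothetical, and it assumes whitespace-separated tokens with each [[n]] marker standing alone as its own token.

```python
import re

# Marker syntax from the PR description: [[n]]
EXTRA_PHRASE = re.compile(r"\[\[(\d+)\]\]")


def parse_rule(rule_text):
    """Return (rule_tokens, allowed).

    ``allowed`` maps a rule-token index to the maximum number of extra
    words permitted just before that token (hypothetical representation).
    """
    tokens, allowed = [], {}
    for part in rule_text.split():
        marker = EXTRA_PHRASE.fullmatch(part)
        if marker:
            allowed[len(tokens)] = int(marker.group(1))
        else:
            tokens.append(part.lower())
    return tokens, allowed


def extra_words_within_limits(rule_text, query_text):
    """Return True if every run of extra words in the query sits at a
    [[n]] marker position and contains at most n words."""
    rule_tokens, allowed = parse_rule(rule_text)
    query_tokens = query_text.lower().split()
    qpos = 0
    for rpos, rule_token in enumerate(rule_tokens):
        extra = 0
        # Skip query words until the next rule token is found (greedy).
        while qpos < len(query_tokens) and query_tokens[qpos] != rule_token:
            extra += 1
            qpos += 1
        if qpos == len(query_tokens) or extra > allowed.get(rpos, 0):
            return False
        qpos += 1
    # Trailing extra words are allowed only by a marker at the end of the rule.
    return len(query_tokens) - qpos <= allowed.get(len(rule_tokens), 0)
```

With a rule such as `Neither the name of [[3]] nor the names of its contributors`, a query inserting up to three words after "of" would pass, while a fourth word, or an extra word anywhere else, would fail.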

Reference #4420

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in a uniquely-named feature branch and have no merge conflicts 📁
  • Updated documentation pages (if applicable)
  • Updated CHANGELOG.rst (if applicable)

Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>

Member

@AyanSinhaMahapatra AyanSinhaMahapatra left a comment

Thanks @alok1304! Looking much better

See the comments for your consideration. I've updated your PR description to mention that this is a follow-up PR: the previous PR holds important context and reviews, and we need to preserve that.

"""
Return True if any of the matches in the ``license_matches`` list of
LicenseMatch has extra words that are in the correct place.
"""
Member

We need to check both of these explicitly:

  1. All the matches that have extra words have them in the correct location
  2. All the matches that do not have extra words are correct detections
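
The two checks could be combined roughly as follows. This is only a sketch: `MatchInfo` and its fields are hypothetical stand-ins rather than the real LicenseMatch API, and requiring at least one match with extra words is an assumption about the intended behaviour.

```python
from dataclasses import dataclass


@dataclass
class MatchInfo:
    # Hypothetical stand-in for a LicenseMatch; all field names are assumptions.
    has_extra_words: bool
    extra_words_in_place: bool = False
    is_correct: bool = False


def extra_words_detection_is_correct(matches):
    """Return True only if both conditions hold over all matches."""
    with_extra = [m for m in matches if m.has_extra_words]
    without_extra = [m for m in matches if not m.has_extra_words]
    return (
        # Assumption: at least one match must actually have extra words.
        bool(with_extra)
        # 1. Every match with extra words has them in the correct location.
        and all(m.extra_words_in_place for m in with_extra)
        # 2. Every match without extra words is a correct detection.
        and all(m.is_correct for m in without_extra)
    )
```

A test for this would cover at least three cases: all conditions met, one match with misplaced extra words, and one match without extra words that is not a correct detection.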

Member

And add a test accordingly

Collaborator Author

> And add a test accordingly

Where should I add the test, and how should I implement the check for all license_matches?

@alok1304 alok1304 force-pushed the improve-score-extra-words branch from b9c7c16 to a50b3db Compare June 24, 2025 06:27
Collaborator Author

alok1304 commented Jun 24, 2025

I added a test for 3-seq where no copyright statements are detected. Ref: https://github.com/xyzzy-022/xyzzy/blob/5a16eb998470241b33ad3caa6a4946d0448a16b6/LEGAL.md?plain=1#L97
Scanning this file yields extra-words, so I added an extra-phrase marker to the corresponding matched rule so that the score can improve.

Next step:
Remove the extra-phrase markers from rules while loading, so that a marker inserted in a rule is not itself counted as extra-words.
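
One possible shape for that step, as a sketch only: `strip_extra_phrase_markers` is a hypothetical helper and not part of the existing rule-loading code. It assumes markers look exactly like [[n]] and that collapsing the whitespace around a removed marker to a single space is acceptable.

```python
import re

# Matches an [[n]] marker together with any surrounding whitespace.
EXTRA_PHRASE_WITH_SPACE = re.compile(r"\s*\[\[\d+\]\]\s*")


def strip_extra_phrase_markers(rule_text):
    """Return the rule text with every [[n]] marker removed."""
    return EXTRA_PHRASE_WITH_SPACE.sub(" ", rule_text).strip()
```

The marker positions and limits would need to be recorded before stripping, since they are still needed later to validate where extra-words may appear.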

Some test cases are failing; I will solve them in the next commit.

alok1304 added 8 commits June 24, 2025 13:14
…_log`

Add a test checking that `extra-words` are in the correct position according to the `extra-phrases` present in rules.

If the `extra-words` are in the right place, set the score to `100`.
Also record in `detection_log` why the score was increased, to keep track of this.

Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
Add a new kind of phrase, `extra_phrase`, dedicated to extra-words.
It is written in the format [[n]], where n is the maximum number of extra-words allowed at that position in the rule.

If extra-words appear at the correct position and their count does not exceed the allowed limit `n`, then the score is increased to `100`.

Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
due to `extra_phrase` in rules, this shows that rules containing `extra-words`

Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
add a new `extra-phrase` for a rule, i.e. bsd-new

Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
Signed-off-by: Alok Kumar <alokkumarjipura9973@gmail.com>
@alok1304 alok1304 force-pushed the improve-score-extra-words branch from 8a25b51 to 43c6bdb Compare June 24, 2025 07:44