Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tr: calculate complement set early #6340

Merged
merged 2 commits into from
May 4, 2024

Conversation

jalil-salame
Copy link
Contributor

Fixes #6163 and adds a test to verify that a regression is not caused.

Instead of inverting the conditions to check (e.g. delete characters not present in set1) invert set1 when passed the complement flag (-c, -C, --complement). This is done by calculating set1 then "inverting" it by subtracting from the "full" (universe) set (0..=u8::MAX).

This fixes issue 6163 because it was caused by a combination of the -c and -t flag. -c is the abovementioned complement flag and -t/--truncate-set1 truncates set1 to the length of set2. What happened in issue 6163 is that set1={b'Y'} and set2={b'Z'}, when truncated set1 stays the same and we proceed. The problem is GNU utils does not consider set1 to be {b'Y'}, but the complement of {b'Y'}, that is U \ {b'Y'}={0, 1, ..., b'X', b'Z', ...}, thus it is truncated to {0}.

We can verify this by doing: printf '\0' | tr -c -t Y Z, which prints Z to stdout as expected.

Additionally, by calculating the complement of set1 we no longer need to consider the complement flag when doing the translate operation, this allows us to delete a lot of code.

Copy link

github-actions bot commented May 4, 2024

GNU testsuite comparison:

Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

@jalil-salame
Copy link
Contributor Author

The MacOS failures look unrelated to the tr changes, and appear to be the same as the ones present in the last commit.

Copy link
Collaborator

@BenWiederhake BenWiederhake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! Just two nitpicks, and a worry about a potentially new bug: tr -dts asdf qwe, see below.

src/uu/tr/src/operation.rs Show resolved Hide resolved
src/uu/tr/src/operation.rs Outdated Show resolved Hide resolved
tests/by-util/test_tr.rs Show resolved Hide resolved
@BenWiederhake
Copy link
Collaborator

And you're right, the CI failure on Mac is a known issue: #6275 and #6333. I don't have a mac, and don't know why those fail, but something seems to be off there.
If you know how to fix it that would be fantastic, but in either case it's out-of-scope for this PR.

@jalil-salame
Copy link
Contributor Author

And you're right, the CI failure on Mac is a known issue: #6275 and #6333. I don't have a mac, and don't know why those fail, but something seems to be off there. If you know how to fix it that would be fantastic, but in either case it's out-of-scope for this PR.

I also don't have a Mac so my hands are tied. My comment was to make it easier to review c:

jalil-salame added a commit to jalil-salame/coreutils that referenced this pull request May 4, 2024
An additional issue was found while reviewing uutils#6340, check [this thread][1]. A summary is:

- `tr` ignores the `-t`/`--truncate-set1` flag when not translating
- Not translating is defined as `-d` was passed, or one set was passed.

[1]: uutils#6340 (comment)
Fixes uutils#6163 and adds a test to verify that a regression is not caused.

Instead of inverting the conditions to check (e.g. delete characters **not** present in set1) invert
set1 when passed the complement flag (`-c`, `-C`, `--complement`). This is done by calculating set1
then "inverting" it by subtracting from the "full" (universe) set (0..=u8::MAX).

This fixes issue 6163 because it was caused by a combination of the `-c` and `-t` flag. `-c` is the
abovementioned complement flag and `-t`/`--truncate-set1` truncates set1 to the length of set2. What
happened in issue 6163 is that `set1={b'Y'}` and `set2={b'Z'}`, when truncated set1 stays the same
and we proceed. The problem is GNU utils does not consider set1 to be `{b'Y'}`, but the complement
of `{b'Y'}`, that is `U \ {b'Y'}={0, 1, ..., b'X', b'Z', ...}`, thus it is truncated to `{0}`.

We can verify this by doing: `printf '\0' | tr -c -t Y Z`, which prints `Z` to stdout as expected.

Additionally, by calculating the complement of set1 we no longer need to consider the complement
flag when doing the translate operation, this allows us to delete a lot of code.
jalil-salame added a commit to jalil-salame/coreutils that referenced this pull request May 4, 2024
An additional issue was found while reviewing uutils#6340, check [this thread][1]. A summary is:

- `tr` ignores the `-t`/`--truncate-set1` flag when not translating
- Not translating is defined as `-d` was passed, or one set was passed.

[1]: uutils#6340 (comment)
Copy link
Collaborator

@BenWiederhake BenWiederhake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link

github-actions bot commented May 4, 2024

GNU testsuite comparison:

Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

@BenWiederhake
Copy link
Collaborator

Ah, I missed the "spelling mistakes". Personally I believe the spellchecker is configured too strictly. For this PR, please somehow make the spellchecker stop complaining about these:

./tests/by-util/test_tr.rs:1342:19 - Unknown word (asdfqqwweerr)
./tests/by-util/test_tr.rs:1345:23 - Unknown word (qwerr)
./tests/by-util/test_tr.rs:1352:19 - Unknown word (asdfqwer)
./tests/by-util/test_tr.rs:1355:23 - Unknown word (qwer)
./tests/by-util/test_tr.rs:1362:19 - Unknown word (aassddffqwer)
./tests/by-util/test_tr.rs:1365:23 - Unknown word (asdfqwer)

You'll find in the same file examples how to use spell-checker:disable-line and spell-checker:disable-next-line.

@jalil-salame
Copy link
Contributor Author

Ah, I missed the "spelling mistakes". Personally I believe the spellchecker is configured too strictly. For this PR, please somehow make the spellchecker stop complaining about these:

./tests/by-util/test_tr.rs:1342:19 - Unknown word (asdfqqwweerr)
./tests/by-util/test_tr.rs:1345:23 - Unknown word (qwerr)
./tests/by-util/test_tr.rs:1352:19 - Unknown word (asdfqwer)
./tests/by-util/test_tr.rs:1355:23 - Unknown word (qwer)
./tests/by-util/test_tr.rs:1362:19 - Unknown word (aassddffqwer)
./tests/by-util/test_tr.rs:1365:23 - Unknown word (asdfqwer)

You'll find in the same file examples how to use spell-checker:disable-line and spell-checker:disable-next-line.

Those are new, you didn't miss them, I did T-T

An additional issue was found while reviewing uutils#6340, check [this thread][1]. A summary is:

- `tr` ignores the `-t`/`--truncate-set1` flag when not translating
- Not translating is defined as `-d` was passed, or one set was passed.

[1]: uutils#6340 (comment)
@BenWiederhake
Copy link
Collaborator

No worries, all is good ^^

@jalil-salame
Copy link
Contributor Author

I gave it a try to improving perf by rewriting the set from a Vec<u8> to a custom BiSetU8 ([u32; 8]), but then I realized that TranslateOperation is order sensitive: "abc" != "cba", so it can't be done as early as possible. The complexity of the code increases, so I'd rather keep it as a Vec<u8> while I give it some more thought.

I think delaying the translation from args to sets might help, but it doesn't seem necessary to me.

Is there a benchmark framework used to test uutils? I don't think the current implementation of tr is much slower than GNU's.

@BenWiederhake
Copy link
Collaborator

Yes, the TranslateOperation is order-sensitive. I was mostly thinking about the process of computing the complement itself. And I totally agree, let's keep it as-is for now, and worry about performance later. (And when you do, MEASURE first! I feel like this would make a difference, but feelings aren't reliable enough.)

I'm not aware of any consistent benchmarking scheme. However, there are a few BENCHMARKING.md files for some of the tools. (But again, that's out-of-scope for this PR, please open a separate PR for that.)

Copy link

github-actions bot commented May 4, 2024

GNU testsuite comparison:

Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

Copy link
Collaborator

@BenWiederhake BenWiederhake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jup, still LGTM

@BenWiederhake BenWiederhake merged commit ff1a03c into uutils:main May 4, 2024
67 of 68 checks passed
@jalil-salame jalil-salame deleted the tr-fix-6163 branch May 4, 2024 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tr: different behavior than GNU with echo "BB%t"|tr -c -d -s Bt "B"
2 participants