Description
The keyword_token function uses unsafe { str::from_utf8_unchecked(word) } to convert a byte slice (&[u8]) into a string slice (&str) without validating whether the input is valid UTF-8. This introduces undefined behavior (UB) if the word parameter contains invalid UTF-8 bytes. The absence of validation makes the function unsound.
    pub fn keyword_token(word: &[u8]) -> Option<TokenType> {
        KEYWORDS
            .get(UncasedStr::new(unsafe { str::from_utf8_unchecked(word) }))
            .cloned()
    }
Problems:
This is a pub function, so I assume callers can control the word argument, which causes the following problems.
- Undefined Behavior on Invalid UTF-8:
  unsafe { str::from_utf8_unchecked(word) } assumes that the word slice is valid UTF-8. If this assumption is violated, undefined behavior occurs immediately. The function does not verify that word is valid UTF-8 before invoking the unsafe conversion.
- No Safety Contract:
  The function is not marked as unsafe, nor does it document the requirement that the word input must be valid UTF-8. This makes it easy for callers to misuse the function by passing invalid inputs.
- Potential Exploitation:
  If word is derived from untrusted or external input, it could contain invalid UTF-8. This could lead to crashes, memory corruption, or other unpredictable behavior.
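
To make the first point concrete, here is a minimal sketch of byte sequences that safe code could pass as word and that are not valid UTF-8. The helper name is_invalid_utf8 is hypothetical and only exists for illustration; it relies on the checked std::str::from_utf8, which reports such inputs as errors instead of assuming validity.

```rust
/// Hypothetical helper (not part of the crate): returns true when `bytes`
/// would be rejected by the checked conversion `std::str::from_utf8`.
fn is_invalid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_err()
}

/// A few inputs that any safe caller can construct.
/// 0xFF never appears in valid UTF-8, 0x80 is a continuation byte with no
/// lead byte, and [0xE2, 0x82] is a three-byte sequence cut off after two bytes.
const INVALID_INPUTS: &[&[u8]] = &[&[0xFF, 0xFE], &[0x80], &[0xE2, 0x82]];
```

Passing any of these slices to keyword_token as written reaches from_utf8_unchecked with bytes that violate its precondition.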
Suggestion
- Mark this function as unsafe and document its safety requirements.
- Add a check in the function body, e.g. use from_utf8 instead.
Additional Context:
Unsafe code should only be used when safety invariants are strictly guaranteed. The current implementation assumes that the word input is always valid UTF-8, but this is not enforced or documented, making the function unsound. By switching to std::str::from_utf8, the function can remain safe and robust while handling invalid input gracefully.
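
A rough sketch of both suggestions follows. It is not the crate's actual code: TokenType, its variants, and the KEYWORDS table here are hypothetical stand-ins so the example compiles on its own, and eq_ignore_ascii_case is used as an approximation of the UncasedStr lookup. The name keyword_token_unchecked is also an assumption.

```rust
// Hypothetical stand-in for the crate's token type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TokenType {
    Select,
    From,
    Where,
}

// Hypothetical stand-in for the real KEYWORDS map.
static KEYWORDS: &[(&str, TokenType)] = &[
    ("SELECT", TokenType::Select),
    ("FROM", TokenType::From),
    ("WHERE", TokenType::Where),
];

// ASCII case-insensitive lookup shared by both variants below.
fn lookup(word: &str) -> Option<TokenType> {
    KEYWORDS
        .iter()
        .find(|(keyword, _)| keyword.eq_ignore_ascii_case(word))
        .map(|(_, token)| *token)
}

/// Option 2 from the suggestion: validate first, so the function stays safe.
/// Invalid UTF-8 can never be a keyword, so it simply maps to `None`.
pub fn keyword_token(word: &[u8]) -> Option<TokenType> {
    let word = std::str::from_utf8(word).ok()?;
    lookup(word)
}

/// Option 1 from the suggestion: keep the unchecked conversion, but make the
/// function `unsafe` and state the contract.
///
/// # Safety
///
/// `word` must be valid UTF-8.
pub unsafe fn keyword_token_unchecked(word: &[u8]) -> Option<TokenType> {
    // SAFETY: the caller guarantees that `word` is valid UTF-8.
    let word = unsafe { std::str::from_utf8_unchecked(word) };
    lookup(word)
}
```

The checked version costs one validation pass over the input. If callers only ever pass bytes that were already validated, the unsafe variant avoids that pass while making the requirement explicit at every call site.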