Skip to content

Unsound Use of str::from_utf8_unchecked in keyword_token #1859

@lwz23

Description

@lwz23

Description
The keyword_token function uses unsafe { str::from_utf8_unchecked(word) } to convert a byte slice (&[u8]) into a string slice (&str) without validating whether the input is valid UTF-8. This introduces undefined behavior (UB) if the word parameter contains invalid UTF-8 bytes. The absence of validation makes the function unsound.

.get(UncasedStr::new(unsafe { str::from_utf8_unchecked(word) }))

pub fn keyword_token(word: &[u8]) -> Option<TokenType> {
    KEYWORDS
        .get(UncasedStr::new(unsafe { str::from_utf8_unchecked(word) }))
        .cloned()
}

Problems:
this function is a pub function, so I assume user can control the word field, it cause some problems.

  1. Undefined Behavior on Invalid UTF-8:
    unsafe { str::from_utf8_unchecked(word) } assumes that the word slice is valid UTF-8. If this assumption is violated, undefined behavior occurs immediately.
    The function does not verify that word is valid UTF-8 before invoking the unsafe conversion.
  2. No Safety Contract:
    The function is not marked as unsafe, nor does it document the requirement that the word input must be valid UTF-8. This makes it easy for callers to misuse the function by passing invalid inputs.
  3. Potential Exploitation:
    If word is derived from untrusted or external input, it could contain invalid UTF-8. This could lead to crashes, memory corruption, or other unpredictable behavior.

Suggestion

  1. mark this function as unsafe and provide safety doc.
  2. add some check in the function body eg. use from_utf8 instead.

Additional Context:
Unsafe code should only be used when safety invariants are strictly guaranteed. The current implementation assumes that the word input is always valid UTF-8, but this is not enforced or documented, making the function unsound. By switching to std::str::from_utf8, the function can remain safe and robust while handling invalid input gracefully.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions