Support UTF-16 #21

Closed
rhysd opened this issue Apr 16, 2024 · 0 comments
Labels
enhancement New feature or request

rhysd commented Apr 16, 2024

Currently hgrep supports only UTF-8 text. As a result, hgrep prints UTF-16 text as if it were encoded in UTF-8, which produces badly broken output.

ripgrep supports UTF-16 via its --encoding option, so technically hgrep can support it too. ripgrep transcodes UTF-16 to UTF-8 in memory, removing the BOM, using encoding_rs_io::DecodeReaderBytesBuilder. This means that ripgrep reports byte offsets for matched regions in the transcoded UTF-8 text, not in the original UTF-16 bytes.

https://github.com/BurntSushi/ripgrep/blob/d922b7ac114c24d6800ae5f79d2967481f380c83/crates/searcher/src/searcher/mod.rs#L720-L744
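To illustrate why the offsets matter, here is a minimal std-only sketch showing that the same match sits at different byte offsets in UTF-8 and in UTF-16LE. The helper names `utf8_offset` and `utf16le_offset` are hypothetical and exist only for this example; they are not part of ripgrep or hgrep.

```rust
/// Byte offset of `needle` within the UTF-8 encoding of `haystack`.
fn utf8_offset(haystack: &str, needle: &str) -> Option<usize> {
    haystack.find(needle)
}

/// Byte offset of `needle` within the UTF-16LE encoding of `haystack`
/// (no BOM). The search runs over 16-bit code units, and each unit
/// occupies two bytes.
fn utf16le_offset(haystack: &str, needle: &str) -> Option<usize> {
    let h: Vec<u16> = haystack.encode_utf16().collect();
    let n: Vec<u16> = needle.encode_utf16().collect();
    if n.is_empty() {
        return Some(0);
    }
    h.windows(n.len())
        .position(|w| w == n.as_slice())
        .map(|units| units * 2)
}
```

For the text "héllo wörld", the word "wörld" starts at byte 7 in UTF-8 but at byte 12 in UTF-16LE, so offsets computed on the transcoded text only line up if the reader of the file applies the same transcoding.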

hgrep can likewise transcode a matched file from UTF-16 to UTF-8 when reading it. Currently hgrep reads file contents as-is. An --encoding (-E) option can be added by reading files through the encoding_rs decoders.

hgrep/src/chunk.rs, lines 220 to 223 in 6f49cb0:

    let contents = match fs::read(&path) {
        Ok(vec) => vec,
        Err(err) => return self.error_item(err.into()),
    };

  • When a BOM exists at the beginning of the file, respect the encoding it indicates (UTF-8, UTF-16LE, UTF-16BE)
    • UTF-16: Detect the encoding with Encoding::for_bom and transcode the input to UTF-8
    • UTF-8: Remove the BOM and pass the input through
  • Add an --encoding option which accepts encoding labels
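The BOM handling in the list above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not hgrep's implementation: it does not use encoding_rs, and the names `BomEncoding`, `detect_bom`, and `to_utf8` are hypothetical. `detect_bom` returns the detected encoding together with the BOM length, loosely modeled on what Encoding::for_bom reports.

```rust
/// Encodings that a BOM can identify in this sketch (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BomEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Return the encoding indicated by a leading BOM and the BOM length in bytes.
fn detect_bom(bytes: &[u8]) -> Option<(BomEncoding, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((BomEncoding::Utf8, 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some((BomEncoding::Utf16Le, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some((BomEncoding::Utf16Be, 2))
    } else {
        None
    }
}

/// Convert file contents to UTF-8 bytes according to the BOM, if any.
fn to_utf8(bytes: Vec<u8>) -> Vec<u8> {
    match detect_bom(&bytes) {
        // UTF-8: remove the BOM and pass the rest through unchanged.
        Some((BomEncoding::Utf8, n)) => bytes[n..].to_vec(),
        // UTF-16: drop the BOM, rebuild 16-bit units, and transcode.
        // A trailing odd byte is ignored by `chunks_exact` in this sketch,
        // and unpaired surrogates become U+FFFD.
        Some((enc, n)) => {
            let units: Vec<u16> = bytes[n..]
                .chunks_exact(2)
                .map(|pair| {
                    let pair = [pair[0], pair[1]];
                    if enc == BomEncoding::Utf16Le {
                        u16::from_le_bytes(pair)
                    } else {
                        u16::from_be_bytes(pair)
                    }
                })
                .collect();
            String::from_utf16_lossy(&units).into_bytes()
        }
        // No BOM: leave the contents as they are.
        None => bytes,
    }
}
```

In this sketch, the result of `fs::read` would be passed through `to_utf8` before further processing. An explicit encoding given on the command line is out of scope here and would need a label lookup that the standard library does not provide.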
rhysd added the enhancement label on Apr 16, 2024
rhysd changed the title from "Support UTF-16 and Shift JIS" to "Support UTF-16" on Apr 17, 2024
@rhysd rhysd closed this as completed in 3b7b7d9 Apr 21, 2024