Add support for utf16 character encoding #26

Merged
merged 7 commits into from
Jul 4, 2017

Conversation

avihay-av
Contributor

@avihay-av avihay-av commented Jun 25, 2017

Pull Request Checklist

Is this in reference to an existing issue?
Yes - #25

General

  • Update Changelog following the conventions laid out on Keep A Changelog

  • Update README with any necessary configuration snippets

  • Binstubs are created if needed

  • RuboCop passes

  • Existing tests pass

New Plugins

  • Tests

  • Add the plugin to the README

  • Does it have a complete header as outlined here

Purpose

Known Compatibility Issues

@avihay-av avihay-av changed the title add utf16 line encoding add utf16 character encoding Jun 25, 2017
@avihay-av avihay-av changed the title add utf16 character encoding Add support for utf16 character encoding Jun 25, 2017
@majormoses
Member

@avoosh thanks for submitting, will take a look in a second.

Member

@majormoses majormoses left a comment

Let's make it an option with a default matching old behavior to keep backwards compatibility.

bin/check-log.rb Outdated
@@ -208,7 +208,7 @@ def search_log
       line = get_log_entry(line)
     end

-    line = line.encode('UTF-8', invalid: :replace, replace: '')
+    line = line.encode('UTF-16', invalid: :replace, replace: '').encode('UTF-8', invalid: :replace, replace: '')
Member

Let's make this an option. UTF-16 has some advantages, but it also has some disadvantages, as outlined here (quick summary): https://stackoverflow.com/questions/4655250/difference-between-utf-8-and-utf-16. To avoid a breaking change, we could have it default to UTF-8 and then override where necessary.

@avihay-av
Contributor Author

avihay-av commented Jul 3, 2017

@majormoses thanks for the reply. I've added an option to encode as UTF-16 before line matching.

Member

@majormoses majormoses left a comment

Based on what I saw, that would not work, as you would encode to UTF-16 and then re-encode to UTF-8. Anyway, there is a better path forward: let's just expose an option that accepts any encoding the user wants and pass it in. Let's keep the default of UTF-8 to avoid breaking changes.

bin/check-log.rb Outdated
@@ -128,6 +128,13 @@ class CheckLog < Sensu::Plugin::Check::CLI
proc: proc(&:to_i),
default: 250

option :encode_utf16,
Member

Why not make the option just encoding? For example:

option :encoding,
        description: 'Encode line with the following encoding before matching',
        long: '--encoding $ENCODING',
        default: 'UTF-8'

This makes it easier when someone decides hey I want utf-32 encoding when that exists.
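
A minimal sketch of how such a generic flag could behave, assuming Ruby's standard-library OptionParser as a stand-in for the plugin's own option DSL. The helper name parse_encoding is hypothetical, and the check through Encoding.find is an illustrative addition rather than something this PR is known to do.

```ruby
require 'optparse'

# Hypothetical helper: parse an --encoding flag, defaulting to UTF-8
# so that existing invocations keep their old behavior.
def parse_encoding(argv)
  options = { encoding: 'UTF-8' }

  OptionParser.new do |opts|
    opts.on('--encoding ENCODING',
            'Encode line with the following encoding before matching') do |name|
      # Encoding.find raises ArgumentError for names Ruby does not know.
      Encoding.find(name)
      options[:encoding] = name
    end
  end.parse(argv)

  options[:encoding]
end

puts parse_encoding([])                        # no flag given: falls back to the default
puts parse_encoding(['--encoding', 'UTF-16'])  # explicit override
```

Keeping the default at UTF-8 means callers who never pass the flag see no change, which is the backwards-compatibility point raised in the review.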

Member

I see your point. That being said, it's trivial to set the encoding on the regex itself. If you still feel strongly about it, I am willing to concede, as long as you add some comments in the code around it so that a year from now we understand why it was done that way.

@avihay-av
Contributor Author

avihay-av commented Jul 4, 2017

@majormoses

"A regexp can be matched against a string when they either share an encoding, or the regexp’s encoding is US-ASCII and the string’s encoding is ASCII-compatible."
[ruby-doc]

The issue we have is an invalid byte sequence when matching strings, and the preferred encoding in such cases is UTF-8 (ASCII is a compatible subset).
Since we want to remove invalid byte sequences, we use the UTF-16 encode step to do that for us.

"it is impossible to fix an invalid UTF-8 filename using a UTF-16 .... The opposite is not true: it is trivial to translate invalid UTF-16 to a unique (though technically invalid) UTF-8 string, so a UTF-8 API can control both UTF-8 and UTF-16 files and names"
[wiki]

Hence we do want to re-encode to UTF-8 before calling the match function.
And the solution does keep the default of UTF-8 to avoid breaking changes. 👍
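
The reasoning above can be sketched in a few lines of Ruby. The sample log line and the helper name scrub_via_utf16 are made up for illustration; the two encode calls mirror the line in the diff under review. The sketch shows the round trip dropping an invalid byte so the regexp can match, and a string left in UTF-16 being rejected by an ordinary regexp, which is why the second encode back to UTF-8 is needed.

```ruby
# Hypothetical helper mirroring the round trip from the diff:
# transcode to UTF-16 (dropping invalid bytes), then back to UTF-8.
def scrub_via_utf16(line)
  line.encode('UTF-16', invalid: :replace, replace: '')
      .encode('UTF-8', invalid: :replace, replace: '')
end

# Sample log line containing a byte (0xFF) that is never valid in UTF-8.
raw = "ERROR disk\xFF full".b.force_encoding('UTF-8')
puts raw.valid_encoding?   # the raw line is not valid UTF-8

clean = scrub_via_utf16(raw)
puts clean.valid_encoding? # the cleaned line is valid UTF-8
puts clean =~ /disk full/ ? 'matched' : 'no match'

# A string left in UTF-16 is not ASCII-compatible, so a plain
# US-ASCII regexp cannot be matched against it.
begin
  'disk full'.encode('UTF-16LE') =~ /disk/
rescue Encoding::CompatibilityError => e
  puts "#{e.class}: #{e.message}"
end
```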

@majormoses
Member

just add a space between the # and the link and this looks good.

@majormoses
Member

trailing whitespace.

@majormoses majormoses merged commit dc75801 into sensu-plugins:master Jul 4, 2017