Add support for utf16 character encoding #26

avihay-av · 2017-06-25T14:07:05Z

Pull Request Checklist

Is this in reference to an existing issue?
Yes - #25

General

Update Changelog following the conventions laid out on Keep A Changelog
Update README with any necessary configuration snippets
Binstubs are created if needed
RuboCop passes
Existing tests pass

New Plugins

Tests
Add the plugin to the README
Does it have a complete header as outlined here

Purpose

Known Compatibility Issues

majormoses · 2017-07-02T21:52:08Z

@avoosh thanks for submitting, will take a look in a second.

majormoses

Let's make it an option with a default matching old behavior to keep backwards compatibility.

majormoses · 2017-07-02T21:53:53Z

bin/check-log.rb

@@ -208,7 +208,7 @@ def search_log
        line = get_log_entry(line)
      end

-      line = line.encode('UTF-8', invalid: :replace, replace: '')
+      line = line.encode('UTF-16', invalid: :replace, replace: '').encode('UTF-8', invalid: :replace, replace: '')


Let's make this an option. UTF-16 does have some advantages, but it also has some disadvantages (quick summary: https://stackoverflow.com/questions/4655250/difference-between-utf-8-and-utf-16). To avoid a breaking change, we could have it default to UTF-8 and then override it where necessary.

avihay-av · 2017-07-03T09:43:06Z

@majormoses thx for the reply, I've added an option to encode utf16 before line matching.

majormoses

Based on what I saw, that would not work, as you would encode to UTF-16 and then re-encode to UTF-8. Anyway, there is a better path forward. Let's just expose an option that accepts any encoding the user wants and pass it straight through. Let's keep the default of UTF-8 to avoid breaking changes.

majormoses · 2017-07-03T16:26:26Z

bin/check-log.rb

@@ -128,6 +128,13 @@ class CheckLog < Sensu::Plugin::Check::CLI
         proc: proc(&:to_i),
         default: 250

+  option :encode_utf16,


why not make the option just encoding? for example:

option :encoding, description: 'Encode line with the following encoding before matching', long: '--encoding $ENCODING', default: 'UTF-8'

This makes it easier when someone decides, hey, I want some other encoding such as UTF-32.
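As a rough illustration of the generic option suggested above, here is a minimal sketch. It uses Ruby's standard-library OptionParser as a stand-in for the plugin's own `option` DSL so that it runs without any gems; `parse_encoding_option` and `encode_line` are hypothetical names, not code from this PR.

```ruby
require 'optparse'

# Hypothetical stand-in for the plugin's option DSL, using the stdlib
# OptionParser so the sketch runs without the sensu-plugin gem.
def parse_encoding_option(argv)
  config = { encoding: 'UTF-8' } # default preserves the old behavior
  parser = OptionParser.new do |opts|
    opts.on('--encoding ENCODING',
            'Encode line with the following encoding before matching') do |value|
      config[:encoding] = value
    end
  end
  parser.parse(argv) # non-destructive: argv itself is left untouched
  config
end

# Hypothetical helper: pass the configured encoding straight to String#encode,
# dropping any byte sequences that are invalid in the source encoding.
def encode_line(line, encoding)
  line.encode(encoding, invalid: :replace, replace: '')
end

config = parse_encoding_option(['--encoding', 'UTF-16LE'])
puts encode_line('disk error', config[:encoding]).encoding
```

Note that a line encoded this way is only useful for matching if the pattern is compatible with the chosen encoding.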

I see your point. That being said, it's trivial to set the encoding on the regex itself. If you still feel strongly about it, I am willing to concede as long as you add some comments in the code around it, so a year from now we understand why it was done that way.

avihay-av · 2017-07-04T21:12:33Z

@majormoses

"A regexp can be matched against a string when they either share an encoding, or the regexp’s encoding is US-ASCII and the string’s encoding is ASCII-compatible."
[ruby-doc]
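To make the quoted rule concrete, here is a small sketch of what happens when a plain (US-ASCII) pattern meets a UTF-16 string. `match_outcome` is a hypothetical helper, and `UTF-16LE` is used here as an assumption to keep the byte order explicit.

```ruby
# Hypothetical helper: report how a pattern fares against a line.
def match_outcome(pattern, line)
  pattern.match(line) ? :matched : :no_match
rescue Encoding::CompatibilityError
  # Raised when the regexp and the string have incompatible encodings.
  :incompatible_encoding
end

utf8_line  = 'disk error'
utf16_line = utf8_line.encode('UTF-16LE') # not ASCII-compatible

puts match_outcome(/error/, utf8_line)
puts match_outcome(/error/, utf16_line)
```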

The issue we have is an invalid byte sequence when matching strings, and the preferred encoding in such cases is UTF-8 (ASCII is a compatible subset).
Since we want to remove invalid byte sequences, we encode to UTF-16 first, which does that for us.

"it is impossible to fix an invalid UTF-8 filename using a UTF-16 .... The opposite is not true: it is trivial to translate invalid UTF-16 to a unique (though technically invalid) UTF-8 string, so a UTF-8 API can control both UTF-8 and UTF-16 files and names"
[wiki]

Hence we do want to re-encode with utf-8 before calling the match function.
And the solution does keep the default of UTF-8 to avoid breaking changes. 👍
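The round trip described above can be sketched roughly as follows. `prepare_line` and its `encode_utf16` keyword are hypothetical names chosen for illustration; only the two `encode` calls come from the diff in this PR.

```ruby
# Hypothetical helper; the name and keyword argument are illustrative only.
def prepare_line(line, encode_utf16: false)
  if encode_utf16
    # First hop: transcoding to UTF-16 drops bytes that are invalid in the source.
    line = line.encode('UTF-16', invalid: :replace, replace: '')
  end
  # Final hop: always hand an ASCII-compatible UTF-8 string to the matcher.
  line.encode('UTF-8', invalid: :replace, replace: '')
end

# "\xFF" can never appear in valid UTF-8, so this line is deliberately broken.
raw = "disk\xFF error".dup.force_encoding('UTF-8')
clean = prepare_line(raw, encode_utf16: true)

puts raw.valid_encoding?
puts clean.valid_encoding?
puts clean.encoding
```

Because the result is UTF-8 again, it can be passed to an ordinary pattern without an encoding mismatch.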

majormoses · 2017-07-04T21:34:47Z

just add a space between the # and the link and this looks good.

majormoses · 2017-07-04T21:37:47Z

trailing whitespace.

majormoses · 2017-07-04T22:08:39Z

released: https://rubygems.org/gems/sensu-plugins-logs/versions/1.2.0

avihay added 2 commits June 25, 2017 17:00

add utf16 line encoding

24e91a6

add utf16 line encoding

d85a7df

avihay-av changed the title ~~add utf16 line encoding~~ add utf16 character encoding Jun 25, 2017

avihay-av changed the title ~~add utf16 character encoding~~ Add support for utf16 character encoding Jun 25, 2017

majormoses requested changes Jul 2, 2017

majormoses added the Feedback Requested label Jul 2, 2017

avihay added 2 commits July 3, 2017 12:27

making utf16 encoding as option

73b8f33

making utf16 encoding as option

07a045a

majormoses requested changes Jul 3, 2017

adding comments

8dd8d38

adding comments

4daacf0

fixing inspections

7048aa0

majormoses approved these changes Jul 4, 2017

majormoses merged commit dc75801 into sensu-plugins:master Jul 4, 2017

avihay-av mentioned this pull request Jul 4, 2017

invalid byte sequence in UTF-8, check-log.rb:226:in match #25

Closed
