utils: fix autoDecode error for specific sequence
Since we're probably going to be seeing either ASCII or UTF-8 input anyway, bump the
detection confidence requirement from 0.5 to 0.8. For the first test input shown in
utils_test.py, chardet was reporting a confidence of 0.559 for Windows-1254 and just
0.505 for UTF-8, when in fact the input is UTF-8.

This also makes me question using chardet at all, but it probably won't hurt at the
new confidence threshold.
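
The threshold logic described above can be sketched as follows. This is a minimal illustration, not the code in jobrunner/utils.py: `auto_decode`, `min_confidence`, and `fake_detect` are hypothetical names, and the detector is passed in as a parameter so the sketch runs without chardet installed. The stub returns the Windows-1254 guess and 0.559 confidence quoted in the message; in real use one would presumably pass `chardet.detect` instead.

```python
def auto_decode(byte_array, detect, min_confidence=0.8):
    """Decode bytes, trusting the detector only above min_confidence."""
    detected = detect(byte_array)
    encoding = detected.get("encoding")
    # A missing guess or a low-confidence guess both fall back to UTF-8.
    if not encoding or detected.get("confidence", 0.0) < min_confidence:
        encoding = "utf-8"
    return byte_array.decode(encoding)


def fake_detect(_byte_array):
    # Hypothetical stand-in for chardet.detect, returning the guess quoted
    # in the commit message for the first test input.
    return {"encoding": "Windows-1254", "confidence": 0.559}


sample = (b"Waiting for '\xe2\x9d\xaf|[Pp]db' in session "
          b"routing-enabled-structure_0_64\n(Pdb++)\n")
# 0.559 is below the 0.8 threshold, so the guess is ignored and UTF-8 is used.
decoded = auto_decode(sample, fake_detect)
```

With the old threshold of 0.5, the same stub guess would be accepted and the bytes decoded as Windows-1254 rather than UTF-8.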
wwade committed May 27, 2021
1 parent 1f85d23 commit 12e5949
Showing 2 changed files with 21 additions and 1 deletion.
18 changes: 18 additions & 0 deletions jobrunner/test/utils_test.py
@@ -0,0 +1,18 @@
+#!/usr/bin/env python
+# Copyright (c) 2021 Arista Networks, Inc. All rights reserved.
+# Arista Networks, Inc. Confidential and Proprietary.
+
+from __future__ import absolute_import, division, print_function
+
+import pytest
+
+from jobrunner.utils import autoDecode
+
+
+@pytest.mark.parametrize(("value", "encoding"), [
+    (b"Waiting for '\xe2\x9d\xaf|[Pp]db' in session "
+     b"routing-enabled-structure_0_64\n(Pdb++)\n", "utf-8"),
+    (b"hi there", "ascii"),
+])
+def testAutoDecode(value, encoding):
+    assert value.decode(encoding) == autoDecode(value)
4 changes: 3 additions & 1 deletion jobrunner/utils.py
@@ -351,6 +351,8 @@ def sudoKillProcGroup(pgrp):
 def autoDecode(byteArray):
     detected = chardet.detect(byteArray)
     encoding = detected['encoding']
-    if detected['confidence'] < 0.5:  # very arbitrary
+    if detected['confidence'] < 0.8:  # very arbitrary
+        LOG.debug("char encoding below confidence level 0.8 (%r). "
+                  "Fall back to UTF-8.", detected)
         encoding = 'utf-8'
     return byteArray.decode(encoding)
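
The commit message wonders whether chardet is needed at all when the input is expected to be ASCII or UTF-8. One possible chardet-free approach, shown here only as a hypothetical sketch and not as part of this commit, is to try a strict UTF-8 decode first and fall back to a single-byte encoding on failure. The name `decode_utf8_first` and the `latin-1` fallback are assumptions made for illustration.

```python
def decode_utf8_first(byte_array, fallback="latin-1"):
    """Try strict UTF-8 first; on failure, use a fallback encoding."""
    try:
        # ASCII is a subset of UTF-8, so plain ASCII input succeeds here too.
        return byte_array.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so with the default
        # fallback this branch cannot raise.
        return byte_array.decode(fallback)


prompt = decode_utf8_first(b"Waiting for '\xe2\x9d\xaf|[Pp]db'")
plain = decode_utf8_first(b"hi there")
```

The trade-off is that non-UTF-8 input is never identified, only decoded byte-for-byte, which may or may not be acceptable depending on where the bytes come from.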
