
Fails to identify cp1252 (aka Windows-1252) #9

Closed
clach04 opened this issue Jan 23, 2013 · 4 comments

Comments

@clach04

clach04 commented Jan 23, 2013

I have a very small test file that gets incorrectly identified as ISO-8859-2 (http://en.wikipedia.org/wiki/ISO/IEC_8859-2). What makes this interesting is that the non-ASCII characters in the test file are not valid characters in ISO-8859-2, so ISO-8859-2 is not even close:

0x93, 0x94, 0x97, 0x96

I wasn't able to attach a txt file for some reason, so here is a Python repr (from Python 2.x) of the file contents.

Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open(r'charade\cp1252_test.txt', 'rb')
>>> test_str = f.read()
>>> f.close()
>>> test_str
'Then he said, \x93The names Bod, James Bond.\x94\r\nto be \x93me\x94\r\nSpam, beans, spam \x96 served every day\r\nbeans, spam, beans, \x97 served every other day\r\n'

I have a larger (real) file if this demo one is not suitable.
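
For reference, a minimal reproduction sketch (my own, assuming charade exposes the same detect() helper as chardet, returning a dict with 'encoding' and 'confidence'):

# Reproduction sketch (Python 2): feed the bytes above to charade and print its guess.
import charade

test_str = ('Then he said, \x93The names Bod, James Bond.\x94\r\n'
            'to be \x93me\x94\r\n'
            'Spam, beans, spam \x96 served every day\r\n'
            'beans, spam, beans, \x97 served every other day\r\n')

print charade.detect(test_str)
# Expected: windows-1252; the bug is that ISO-8859-2 is reported instead.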

@sigmavirus24
Member

If there's some way to post the file (even as a gist if possible), that would be awesome.

@clach04
Author

clach04 commented Jan 24, 2013

GitHub gists are not ideal for this (at least as of 2013-01-24). I created https://gist.github.com/4625701 but GitHub went ahead and converted it into UTF-8 :-( I then cloned it locally and re-wrote the file as cp1252, so if you clone it you should get cp1252 (do not rely on the web interface view).

I would recommend you use the repr format from above (NOTE: it uses Windows newlines, as cp1252 is most common under Windows). Python 2.x example:

test_str = 'Then he said, \x93The names Bod, James Bond.\x94\r\nto be \x93me\x94\r\nSpam, beans, spam \x96 served every day\r\nbeans, spam, beans, \x97 served every other day\r\n'
f = open('cp1252_test.txt', 'wb')
f.write(test_str)
f.close()
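
As a sanity check (a sketch on my part, using only Python's built-in codecs), the bytes in question decode to curly quotes and dashes under cp1252, while ISO-8859-2 only maps them to C1 control codes that no real text would contain:

# Sanity-check sketch (Python 2): compare how cp1252 and ISO-8859-2 interpret the bytes.
for byte in '\x93\x94\x96\x97':
    print repr(byte), repr(byte.decode('cp1252')), repr(byte.decode('iso-8859-2'))
# cp1252 yields u'\u201c', u'\u201d', u'\u2013', u'\u2014' (curly quotes, en/em dash);
# iso-8859-2 yields u'\x93', u'\x94', u'\x96', u'\x97' (C1 control characters).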

@sigmavirus24
Member

Alright, I just like having more than one thing to test with to be more certain. I'll start working on this later tonight.

sigmavirus24 added a commit that referenced this issue Feb 4, 2013
@MestreLion

I'm experiencing the same thing... but not with a test sample file: with all my subtitle (.srt) files!

A few of them are plain ASCII or UTF-8, but the vast majority are some form of Latin-1, so I was expecting charade to output CP-1252 or ISO-8859-{1,15}. But none of them were reported as such. Instead, it reports them as IBM855, ISO-8859-{2,5,...}, etc.: many Russian-related encodings, even though there's not a single Cyrillic character in those files.

Aside from a few exceptions, my subtitle files are:

  • Windows CRLF line terminated (does that matter?)
  • Brazilian Portuguese (pt_BR, or simply pt) language

Breakdown of charade:

    484 100.0% Total (11)
    349  72.1% IBM855
     45   9.3% windows-1251
     20   4.1% utf-8
     20   4.1% ISO-8859-2
     12   2.5% windows-1255
     11   2.3% ISO-8859-7
     11   2.3% ISO-8859-5
      5   1.0% ascii
      5   1.0% MacCyrillic
      4   0.8% IBM866
      2   0.4% UTF-16LE

Comparison with konwert all/pt-test:

    484 100.0% Total (3)
    458  94.6% cp1252
     20   4.1% utf8
      6   1.2% -

So both basically agree on which files are UTF-8 or plain ASCII. The problem is the other 94% of my files. True, konwert "cheated" since it had the language hint, but still there is no way to give such a hint to charade.
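
(For reference, the breakdowns above can be reproduced with a small tally script along these lines; the glob pattern and charade's detect() API are assumptions on my part.)

# Tally sketch (Python 2): run charade over every .srt file and count reported encodings.
import collections
import glob

import charade

counts = collections.Counter()
for path in glob.glob('*.srt'):  # hypothetical location of the subtitle files
    with open(path, 'rb') as f:
        data = f.read()
    counts[charade.detect(data)['encoding']] += 1

total = sum(counts.values())
for encoding, n in counts.most_common():
    print '%5d %5.1f%% %s' % (n, 100.0 * n / total, encoding)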

I guess the culprit is the last few lines in latin1prober:

        # lower the confidence of latin1 so that other more accurate
        # detector can take priority.
        confidence = confidence * 0.5
        return confidence

50% is a huge penalty. That, I guess, makes Latin-1 extremely unlikely to be picked, and there is no "more accurate detector" for Western European languages like French, Spanish and Portuguese. This cripples charade for all Latin American users, and a good part of Europe and Africa too.
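
Until that is tuned, a possible caller-side workaround (my own sketch, not anything charade provides) is to second-guess implausible results when the application knows the text is Western European: if the bytes also decode cleanly as cp1252, prefer that over a Cyrillic/IBM guess.

# Caller-side workaround sketch (not part of charade): prefer cp1252 over encodings
# that are implausible for text known to be Western European.
import charade

UNLIKELY_FOR_WESTERN_TEXT = set(['IBM855', 'IBM866', 'MacCyrillic',
                                 'windows-1251', 'ISO-8859-5', 'ISO-8859-7'])

def detect_western(data):
    guess = charade.detect(data)
    if guess.get('encoding') in UNLIKELY_FOR_WESTERN_TEXT:
        try:
            data.decode('cp1252')  # cp1252 leaves 0x81, 0x8D, 0x8F, 0x90, 0x9D undefined
        except UnicodeDecodeError:
            pass                   # genuinely not cp1252, keep the original guess
        else:
            guess = {'encoding': 'windows-1252', 'confidence': guess.get('confidence')}
    return guess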
