Fails to identify cp1252 (aka Windows-1252) #9
Comments
If there's some way to post the file (even as a gist if possible), that would be awesome.
GitHub gists are not ideal for this (at least as of 2013-01-24). I've created https://gist.github.com/4625701, but GitHub went ahead and converted it to UTF-8 :-( I then cloned it locally and rewrote the file as cp1252, so if you clone it you should get cp1252 (do not rely on the web interface view). I would recommend you use the repr format from above (NOTE: it uses Windows newlines, as cp1252 is most common under Windows), e.g. Python 2.x example:
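A minimal sketch of recreating such a test file from a byte-string repr, assuming Python 3. The `SAMPLE` bytes and the helper names are made up for illustration and are not the repr from the original report:

```python
# Hypothetical sample (not the reporter's data): "café" plus curly quotes,
# encoded as cp1252, ending with a Windows newline.
SAMPLE = b"caf\xe9 \x93quoted\x94\r\n"


def write_sample(path, data=SAMPLE):
    """Write the raw bytes unchanged.

    Binary mode avoids any newline or encoding translation on the way out.
    """
    with open(path, "wb") as handle:
        handle.write(data)


def read_sample(path):
    """Read the raw bytes back, again without any translation."""
    with open(path, "rb") as handle:
        return handle.read()
```

Opening the file in binary mode is what keeps the bytes identical on disk, which a web interface that re-encodes to UTF-8 does not guarantee.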
Alright, I just like having more than one thing to test with to be more certain. I'll start working on this later tonight.
I'm experiencing the same thing... but not with a test sample file: with all my subtitle ( A few of them are plain ASCII or UTF-8, but the vast majority is some form of Latin1, so I was expecting Aside from a few exceptions, my subtitle files are:
Breakdown of
Comparison with
So both basically agree on what is a UTF-8 or plain-ASCII file. The problem is with the other 94% of my files, True, I guess the culprit is the last lines in
50% is a huge penalty. That, I guess, makes Latin1 extremely unlikely to be picked. And there is no "more accurate detector" for Western European languages like French, Spanish and Portuguese. This cripples charade for all Latin American users, and a good part of Europe and Africa too.
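To illustrate why a 50% penalty matters, here is a toy sketch in Python 3. It is not charade's code: `pick_best`, the encoding labels used as keys, and the raw scores are all assumptions made up for the example.

```python
def pick_best(raw_scores, latin1_penalty=0.5):
    """Return the encoding name with the highest score after the penalty.

    Hypothetical model: only the "windows-1252" score is scaled by
    latin1_penalty; every other candidate keeps its raw score.
    """
    adjusted = dict(raw_scores)
    if "windows-1252" in adjusted:
        adjusted["windows-1252"] *= latin1_penalty
    return max(adjusted, key=adjusted.get)


# Made-up raw scores: the Latin1-style candidate starts out clearly ahead.
RAW_SCORES = {"windows-1252": 0.9, "ISO-8859-2": 0.6}
```

With these invented numbers, `pick_best(RAW_SCORES)` picks `ISO-8859-2` because 0.9 is halved to 0.45, while `pick_best(RAW_SCORES, latin1_penalty=1.0)` picks `windows-1252`.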
I have a very small test file that gets incorrectly identified as ISO-8859-2 (http://en.wikipedia.org/wiki/ISO/IEC_8859-2). What makes this interesting is that the non-ASCII characters in the test file are invalid characters in ISO-8859-2, so ISO-8859-2 is not even close:
I wasn't able to attach a txt file for some reason, so here is a Python repr (from Python 2.x) of the file contents.
I have a larger (real) file if this demo one is not suitable.
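One way to check a claim like this with only the standard library is sketched below, assuming Python 3. The `DEMO` bytes and the helper name `c1_controls` are hypothetical stand-ins, not the contents of the reported file. Note that Python's `iso8859_2` codec does not raise on bytes 0x80 to 0x9F; it decodes them to C1 control characters, which is a strong hint that the text is not really ISO-8859-2.

```python
# Hypothetical stand-in for the test file: curly quotes as cp1252 bytes.
DEMO = b"\x93quoted\x94\r\n"


def c1_controls(data, encoding):
    """Decode data and return any C1 control characters (U+0080..U+009F).

    Printable text should not normally contain these, so a non-empty
    result suggests the chosen encoding is the wrong guess.
    """
    text = data.decode(encoding)
    return [ch for ch in text if "\x80" <= ch <= "\x9f"]
```

For the demo bytes, decoding as `iso8859_2` yields two control characters, while decoding as `cp1252` yields none and produces ordinary curly quotes.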