Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix encoding issues with encodingKludge #294

Closed
wants to merge 1 commit into from
Closed

Fix encoding issues with encodingKludge #294

wants to merge 1 commit into from

Conversation

Diaoul
Copy link
Contributor

@Diaoul Diaoul commented Aug 21, 2011

Fixed all encoding issues I could encounter on both my Synology NAS and Ubuntu 10.04

Main changes

  • Add encodingKludge from Sick-Beard to deal with encoding issues
  • Removed irrelevant hardcoded 'UTF-8' strings and replaced with ek-ed ones
  • The logger is now unicode-aware

Tested with album 4 of Beyoncé

  • Scan Music Library
  • Force Post-Process Albums in Download Folder (all post-process options activated)

Any feedback is appreciated

@abuisman
Copy link

I was having these encoding issues as well, good work :).

@pabloalcantara
Copy link
Contributor

Hope this fix every problem...
@Diaoul could you see if my encode.py needs some changes?

@rembo10
Copy link
Owner

rembo10 commented Aug 21, 2011

Hey Diaoul - everything looks good - this is the only thing i'm thinking about:

temp_f = headphones.DESTINATION_DIR
for f in folder_list:
    temp_f = ek.ek(os.path.join, temp_f, f)
    ek.ek(os.chmod, temp_f, int(headphones.FOLDER_PERMISSIONS, 8))

Not that familiar with ek, so I'm not sure if this handles it but i was having trouble with this before and i made some changes along the way to where i encoded it. headphones.DESTINATION_DIR is unicode, folder_list is unicode. When you ek.ek temp_f, it becomes bytestring, so when it adds the next folder, temp_f is now bytestring and f is unicode. Does ek handle these mixed encoding arguments alright? If not I think we can take that line out and just ek the last line, and maybe os.path.normpath(temp_f). Does that make sense? Let me know what you think

@rembo10
Copy link
Owner

rembo10 commented Aug 21, 2011

I don't know what it is with ek but it just doesn't want to work on OSX. I start getting mad unicodedecodeerrors with it

@Diaoul
Copy link
Contributor Author

Diaoul commented Aug 21, 2011

@rembo10 AFAIK it works on Sick-Beard so this should work on OSX as Sick-Beard runs well on OSX. Maybe ask @midgetspy how he made it with OSX.

About the chmod thing, I tried several things to make it simplier but without success...

@rembo10
Copy link
Owner

rembo10 commented Aug 21, 2011

I'm probably doing this wrong but I think there are some things I want to keep in bytestring. For example, downloaded track paths - those don't really need to be returned as unicode, since filenames in *nix can have multiple encodings in the same filename. if i unicode it with one encoding, an error will probably be thrown. keeping it bytestring means i'm not trying to manipulate it - and since i'm not really storing those values, but using them to embed album art, rename, etc., doesn't it make sense to keep it as a bytestring?

On Aug 21, 2011, at 2:09 PM, Diaoul wrote:

@rembo10 AFAIK it works on Sick-Beard so this should work on OSX as Sick-Beard runs well on OSX. Maybe ask @midgetspy how he made it with OSX.

Reply to this email directly or view it on GitHub:
#294 (comment)

@rembo10
Copy link
Owner

rembo10 commented Aug 22, 2011

Maybe it was how I was adapting it to the changes I made? Can you update your version with my master, since I made a few changes that I had to resolve in the merge which may have messed some things up. If it works I'll definitely merge it in because it seems a lot cleaner than all my encoding and decoding all over the place.

On Aug 21, 2011, at 2:09 PM, Diaoul wrote:

@rembo10 AFAIK it works on Sick-Beard so this should work on OSX as Sick-Beard runs well on OSX. Maybe ask @midgetspy how he made it with OSX.

Reply to this email directly or view it on GitHub:
#294 (comment)

@midgetspy
Copy link

The safe way to deal with encoding is:

  • decode every bytestring "entering" the application (from disk, database, the internet, etc)
  • use exclusively unicode string objects inside your application
  • explicitly encode all strings to bytestrings any time they "exit" your application (file names/content, database entries, etc)

The last one is the most important (but it relies on the first two being done properly) because if you give some functions a unicode string they will encode it with default encoding which may be something ASCII-like which will cause it to choke on any special characters that may be present in the output strings.

Theoretically you should always know what encoding a bytestring you receive is using but unfortunately that is not always the case. Specifically, many NAS boxes have embedded linux with an ASCII-like encoding specified as the system encoding but then their filesystems have UTF-8 (for example) filenames that were written over the network by a different OS.

encodingKludge was written because some functions (like os.*) in python seem to return unicode objects on some OSes and bytestrings on others. I have never reproduced this but that is what I guessed the problem was based on the tracebacks I was seeing. If you're starting from scratch I would encourage you to deal with your strings in a better way than I did (preferably something that doesn't require "kludge" in the name ;-P).

@rembo10
Copy link
Owner

rembo10 commented Aug 22, 2011

haha kludge is a great word :-)

i'm keeping all the filenames in the post processor as bytestrings. as soon as i try to unicode them, i face the problem that i may lose the actual filepath on linux/nas systems, since filenames are stored as bytestrings anyways. the values that i pass in (from the database, etc) are all unicode, so i encode those values using their sys encoding when renaming, etc.

one thing i noticed was that config variables (i.e. headphones.CONFIG_OPTION) were bytestrings on startup, but unicode when config options were saved from the UI. this might account for some of the randomness with your os functions. i added encoding to ConfigObj to fix this:

headphones.CFG = ConfigObj(headphones.CONFIG_FILE, encoding='utf-8')

now I know all my headphones.CONFIG_OPTIONS will be utf-8 and can encode them when appropriate.

@rembo10
Copy link
Owner

rembo10 commented Aug 22, 2011

ps. hope you don't mind that i basically stole the whole startup script from sickbeard :-)

@Diaoul
Copy link
Contributor Author

Diaoul commented Aug 22, 2011

I am not sure there is "a better way" to deal with unicode than a kludge given the randomness of some os.* function results...
I'll try to build up a test stript to understand python behaviour on my NAS and my computer aswell, maybe a unittest file or something

@Diaoul
Copy link
Contributor Author

Diaoul commented Aug 22, 2011

@pabloalcantara: I can't really look at it as I don't use this function and can only correct bugs as I see them. You can try on your own to edit the code using ek until all errors are gone.

@rembo10
Copy link
Owner

rembo10 commented Aug 22, 2011

From what i sort of pieced together the os functions arent random, as long as you know what type of strings youre passing in (unicode or byte)

On Aug 22, 2011, at 2:43 AM, Diaoulreply@reply.github.com wrote:

I am not sure there is "a better way" to deal with unicode than a kludge given the randomness of some os.* function results...
I'll try to build up a test stript to understand python behaviour on my NAS and my computer aswell, maybe a unittest file or something

Reply to this email directly or view it on GitHub:
#294 (comment)

@Diaoul
Copy link
Contributor Author

Diaoul commented Aug 29, 2011

I've found the problem and it's not yours. You should just work with unicode everywhere, including os.*

@Diaoul Diaoul closed this Aug 29, 2011
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants