CJK characters are missing when using youtube-dl -e to get the title #2046

Open
niltsh opened this issue Dec 25, 2013 · 6 comments

@niltsh niltsh commented Dec 25, 2013

Version: 2013.12.09.1 ~ 2013.12.23.4
Platform: Mac OS X 10.9
Preconditions:
0. Website: youtube.com
1. Command: ./youtube-dl -e xxxx
2. Run NOT in a terminal but programmatically, using NSTask/NSPipe

I made an application that calls youtube-dl and captures the result using NSTask/NSPipe.

I found that if the title of the video contains CJK characters, those characters are simply missing from the output; only letters and digits are printed.
With the same command and options, running youtube-dl in Terminal works: no CJK characters are missing, and I confirmed that all characters come from stdout, not stderr.

So the problem only occurs when calling it through NSTask/NSPipe, which sounds like a Mac issue.

However, I did a binary search over the releases to find out which one introduced the problem.
The last GOOD version is 2013.12.09; the issue appears from 2013.12.09.1 onward.

Luckily there is only one commit between these two releases.

Add a workaround for terminals without bidi support (Fixes #1912)
0783b09

I wish I could debug further, but sorry, I am not a Python guy.
Since the commit is related to character handling, could you please confirm it?

BRs,
Zongyao Qu

Contributor

@phihag phihag commented Dec 26, 2013

Thank you very much for the extremely detailed bug report. As I don't own a Mac, I cannot confirm the issue, and it works fine on all my Linux boxes. What we changed in that commit is that we now always encode output strings ourselves instead of letting some pass through Python's default stdout (that often broke the experience, particularly for Windows users). Can you update to 2013.12.26 and post the output of youtube-dl -v? That should give us a lot of hints about what's going on there.
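
To illustrate why the encoding step matters here, below is a minimal Python sketch of encoding a title before writing it to a byte stream. It assumes unencodable characters are silently dropped (the `"ignore"` error handler), which matches the symptom reported above but is not confirmed as youtube-dl's actual code; `encode_for_output` is a hypothetical helper.

```python
def encode_for_output(text, encoding):
    """Encode text for a byte stream, silently dropping unencodable characters.

    Hypothetical helper: the "ignore" error handler is an assumption, chosen
    because it reproduces the reported symptom.
    """
    return text.encode(encoding, "ignore")


title = "\u65e5\u672c\u8a9e Title 123"  # three CJK characters, then ASCII text

# With an ASCII target, only the ASCII letters, digits and spaces survive.
ascii_bytes = encode_for_output(title, "ascii")

# With a UTF-8 target, the title round-trips unchanged.
utf8_bytes = encode_for_output(title, "utf-8")
```

Under that assumption, an ASCII target encoding would produce exactly the kind of output described in the report: the CJK characters vanish without any error.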

Author

@niltsh niltsh commented Dec 27, 2013

Thank you phi,

I found the root cause.

In Terminal the locale is UTF-8, but inside my program the locale is ASCII, because LANG is not set.

After I set LANG to a UTF-8 locale in my program, everything works fine.
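
The application in this thread uses NSTask, but the same fix can be sketched in Python for anyone launching youtube-dl from a script. This is only an analogue of the fix described above: `run_with_utf8_locale` is a hypothetical helper, and the value `en_US.UTF-8` is an assumed example of a UTF-8 locale name, since the exact value used is not given here.

```python
import os
import subprocess
import sys


def run_with_utf8_locale(cmd, lang="en_US.UTF-8"):
    """Run cmd with LANG set in the child's environment; return its stdout bytes.

    The default "en_US.UTF-8" is an assumed example of a UTF-8 locale name.
    """
    env = os.environ.copy()
    env["LANG"] = lang
    # LC_ALL and LC_CTYPE take precedence over LANG when they are set,
    # so a caller may need to adjust those as well.
    return subprocess.check_output(cmd, env=env)


# Usage: ask a child Python process which LANG value it sees.
child = [sys.executable, "-c", "import os; print(os.environ.get('LANG'))"]
seen = run_with_utf8_locale(child).decode("ascii").strip()
```

For the case in this thread, `cmd` would be the youtube-dl invocation, for example `["./youtube-dl", "-e", url]`, and the returned bytes would then be decoded as UTF-8.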

@niltsh niltsh closed this Dec 27, 2013
Contributor

@phihag phihag commented Dec 27, 2013

Reopening, we should be able to at least detect this.

@phihag phihag reopened this Dec 27, 2013
Author

@niltsh niltsh commented Dec 27, 2013

I have an idea: the charset of the incoming data is known, as long as the webpage declares it (nowadays it is usually UTF-8).

If we detect that the locale does not match the incoming data's charset, the characters may get garbled.
We could use iconv to convert the data to match the locale.

However, if the locale's charset is not a superset of the incoming data's charset, characters could still go missing after the conversion.
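
The superset concern can be checked per string. The sketch below uses Python's built-in codecs as a stand-in for iconv (an assumption made for illustration) and reports which characters a target charset cannot represent; `unencodable_chars` is a hypothetical helper.

```python
def unencodable_chars(text, encoding):
    """Return the characters of text that encoding cannot represent, in order."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


title = "\u65e5\u672c\u8a9e abc"  # three CJK characters, then ASCII text

lost_in_ascii = unencodable_chars(title, "ascii")  # the three CJK characters
lost_in_utf8 = unencodable_chars(title, "utf-8")   # nothing is lost
```

An empty result means the conversion is lossless for that particular string; a non-empty one lists exactly what would go missing.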

Contributor

@phihag phihag commented Dec 27, 2013

@niltsh Internally, we deal with characters. This allows the end-user to not have to care about which encoding the webpage happens to use. Can you post the precise output you get for youtube-dl -v? I'll add a warning (and/or a default to UTF-8) then.
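
One possible shape for such a warning with a UTF-8 default is sketched below. It is not youtube-dl's implementation: `pick_output_encoding` and its fallback policy are hypothetical, and it relies only on the standard `locale` and `codecs` modules.

```python
import codecs
import locale


def pick_output_encoding(preferred=None):
    """Return (encoding, warning); warning is None when preferred is kept.

    Hypothetical policy: fall back to UTF-8 when the preferred encoding is
    unknown to Python or is plain ASCII.
    """
    if preferred is None:
        preferred = locale.getpreferredencoding()
    try:
        # Normalizes aliases, e.g. "US-ASCII" resolves to "ascii".
        canonical = codecs.lookup(preferred).name
    except (LookupError, TypeError):
        canonical = None
    if canonical is None or canonical == "ascii":
        warning = ("locale encoding %r cannot represent all titles; "
                   "using UTF-8 instead" % (preferred,))
        return "utf-8", warning
    return preferred, None
```

Whether to only warn or to also switch the encoding is a design choice: switching helps callers that read the pipe as UTF-8, but could surprise a consumer that genuinely expects ASCII.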

Author

@niltsh niltsh commented Dec 30, 2013

The output in my app before the modification:

[debug] User config: []
[debug] Command-line args: ['--no-playlist', '-e', '-v', 'http://www.youtube.com/watch?v=TXHYTleODkc']
[debug] Encodings: locale 'US-ASCII', fs 'utf-8', out None, pref: 'US-ASCII'
[debug] youtube-dl version 2013.12.26
[debug] Python version 2.7.5 - Darwin-13.0.0-x86_64-i386-64bit
[debug] Proxy map: {}

The output in my app after the modification:

[debug] User config: []
[debug] Command-line args: ['--no-playlist', '-e', '-v', 'http://www.youtube.com/watch?v=TXHYTleODkc']
[debug] Encodings: locale 'UTF-8', fs 'utf-8', out None, pref: 'UTF-8'
[debug] youtube-dl version 2013.12.26
[debug] Python version 2.7.5 - Darwin-13.0.0-x86_64-i386-64bit
[debug] Proxy map: {}

The output in Terminal:

[debug] System config: []
[debug] User config: []
[debug] Command-line args: ['-e', '-v', 'http://www.youtube.com/watch?v=04P8hemO4SY']
[debug] Encodings: locale 'UTF-8', fs 'utf-8', out 'UTF-8', pref: 'UTF-8'
[debug] youtube-dl version 2013.12.26
[debug] Python version 2.7.5 - Darwin-13.0.0-x86_64-i386-64bit
[debug] Proxy map: {}
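
For anyone comparing environments the same way, the values in the `Encodings:` line can likely be reproduced with the standard library. The mapping of field names to calls below is an assumption based on the names alone, not on youtube-dl's source, and `encoding_report` is a hypothetical helper.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings a Python process sees, keyed like the debug line."""
    return {
        "locale": locale.getpreferredencoding(),
        "fs": sys.getfilesystemencoding(),
        # On Python 2 this is None when stdout is a pipe rather than a terminal.
        "out": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
print("Encodings: locale %(locale)r, fs %(fs)r, out %(out)r" % report)
```

The Python 2 behavior noted in the comment would account for `out None` in both app runs, where stdout is a pipe, and `out 'UTF-8'` in Terminal.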
