Character corruption #10927

Closed
yoyohoho98765 opened this issue Oct 15, 2016 · 5 comments

@yoyohoho98765 yoyohoho98765 commented Oct 15, 2016

Character corruption occurs in the JSON output of youtube-dl.
The cause is that Python does not handle non-English text well with only .encode('utf-8'), and some of the Python code still needs to be improved.
A concrete example:

# -*- coding: utf-8 -*-
import json
import codecs

def json_write_test(path):

    with codecs.open(path,'w','utf-8') as f:
        dump = json.dumps({'test':u"소녀시대"},ensure_ascii=False)
        f.write(dump)

json_write_test("/Users/foo/test.json")

That is, import codecs is needed, codecs.open( has to be used instead of open(, and ensure_ascii=False has to be added as a parameter. The last change has to be applied to every json.dumps(foo) call, not only to json.dumps(foo).encode('utf-8').
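
To make the effect of ensure_ascii visible, here is a minimal sketch that writes the same data twice and compares the raw bytes on disk. It assumes Python 3 (the example above is Python 2), and the helper name write_json and the temporary file names are made up for illustration; none of this is taken from the youtube-dl code.

```python
import io
import json
import os
import tempfile


def write_json(path, data, ensure_ascii):
    """Hypothetical helper: write data as JSON to a UTF-8 file, return the raw bytes."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(data, ensure_ascii=ensure_ascii))
    with io.open(path, "rb") as f:
        return f.read()


data = {"test": "소녀시대"}
tmp_dir = tempfile.mkdtemp()

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = write_json(os.path.join(tmp_dir, "escaped.json"), data, True)
# ensure_ascii=False: the characters are written literally, as UTF-8 bytes.
literal = write_json(os.path.join(tmp_dir, "literal.json"), data, False)

print(escaped)
print(len(escaped), len(literal))

# Both files decode to the same data when read back with a JSON parser.
assert json.loads(escaped.decode("ascii")) == json.loads(literal.decode("utf-8"))
```

Both files hold the same information; they differ only in how the Hangul characters are spelled in the file.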

I have made these changes at every occurrence in all the files found with

grep -r  -l 'json.dump' ./youtube-dl-master/youtube_dl |  sort -u  
grep -r  -l 'with open(' ./youtube-dl-master/youtube_dl | sort -u

ensure_ascii=False alone may be enough. I don't know whether making these changes everywhere in all the files is appropriate, because I haven't read all of the youtube-dl code.
But I get good results with the commands below

LF=$(printf '\\\n_'); LF=${LF%_}
youtube-dl -j https://www.youtube.com/watch?v=hJYGddE0vHc | sed -e 's/, /'"$LF"'/g' -e 's/\"//g' | grep fulltitle
youtube-dl -j https://www.youtube.com/watch?v=0Uhx6AgNylQ | sed -e 's/, /'"$LF"'/g' -e 's/\"//g' | grep fulltitle
youtube-dl -j https://www.youtube.com/watch?v=DeGkiItB9d8 | sed -e 's/, /'"$LF"'/g' -e 's/\"//g' | grep fulltitle

Also, I recommend explicitly using

# -*- coding: utf-8 -*-
import sys
import codecs

sys.stdout = codecs.getwriter('utf_8')(sys.stdout)
print u'소녀시대'

because character corruption occurs when LANG isn't set to foo_bar.UTF-8.
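
For readers on Python 3, where sys.stdout is already a text stream, the same idea can be sketched with an in-memory byte stream standing in for the terminal. This is only an illustration of what codecs.getwriter does; the variable names are assumptions, and io.BytesIO is a stand-in for a real byte-oriented output.

```python
import codecs
import io

# io.BytesIO stands in for a byte-oriented stream (an assumption for this sketch).
raw = io.BytesIO()

# codecs.getwriter returns a StreamWriter class; wrapping the stream makes every
# write() encode text as UTF-8 regardless of the LANG or LC_ALL settings.
writer = codecs.getwriter("utf-8")(raw)
writer.write("소녀시대\n")

encoded = raw.getvalue()
print(len(encoded))
```

Because the encoding is fixed by the wrapper, the bytes written do not depend on the locale of the process.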

Thank you for reading.

Collaborator

@yan12125 yan12125 commented Oct 15, 2016

Not sure what you want. For me Korean characters are correctly parsed when LC_ALL=C:

$ export LC_ALL=C
$ youtube-dl -j "https://www.youtube.com/watch?v=hJYGddE0vHc" | jq .fulltitle
WARNING: Assuming --restrict-filenames since file system encoding cannot encode all characters. Set the LC_ALL environment variable to fix this.
"Girls' Generation 소녀시대_You Think_Music Video"
@dstftw dstftw closed this Oct 18, 2016
Author

@yoyohoho98765 yoyohoho98765 commented Oct 19, 2016

The problem isn't fixed in youtube-dl (2016.10.16), which was downloaded with youtube-dl -U.
I use Python 2.7.10 and Python 2.7.12 on OS X 10.11.6.
Thank you for reading.
The results of the tests, run with both Python versions:

$ LF=$(printf '\\\n_'); LF=${LF%_}
$ export LC_ALL=C  
$ youtube-dl -j "https://www.youtube.com/watch?v=hJYGddE0vHc" | sed -e 's/, /'"$LF"'/g' -e 's/\"//g' | grep fulltitle  
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
$ unset LC_ALL 
$ for f in "C" "af_ZA.UTF-8" "am_ET.UTF-8" "be_BY.UTF-8" "bg_BG.UTF-8" "ca_ES.UTF-8" "cs_CZ.UTF-8" "da_DK.UTF-8" "de_AT.UTF-8" "de_CH.UTF-8" "de_DE.UTF-8" "el_GR.UTF-8" "en_AU.UTF-8" "en_CA.UTF-8" "en_GB.UTF-8" "en_IE.UTF-8" "en_NZ.UTF-8" "en_US.UTF-8" "es_ES.UTF-8" "et_EE.UTF-8" "eu_ES.UTF-8" "fi_FI.UTF-8" "fr_BE.UTF-8" "fr_CA.UTF-8" "fr_CH.UTF-8" "fr_FR.UTF-8" "he_IL.UTF-8" "hr_HR.UTF-8" "hu_HU.UTF-8" "hy_AM.UTF-8" "is_IS.UTF-8" "it_CH.UTF-8" "it_IT.UTF-8" "ja_JP.UTF-8" "kk_KZ.UTF-8" "ko_KR.UTF-8" "lt_LT.UTF-8" "nl_BE.UTF-8" "nl_NL.UTF-8" "no_NO.UTF-8" "pl_PL.UTF-8" "pt_BR.UTF-8" "pt_PT.UTF-8" "ro_RO.UTF-8" "ru_RU.UTF-8" "sk_SK.UTF-8" "sl_SI.UTF-8" "sr_YU.UTF-8" "sv_SE.UTF-8" "tr_TR.UTF-8" "uk_UA.UTF-8" "zh_CN.UTF-8" "zh_HK.UTF-8" "zh_TW.UTF-8" ; do export LANG="$f" ; LF=$(printf '\\\n_'); LF=${LF%_} ; youtube-dl -j https://www.youtube.com/watch?v=hJYGddE0vHc | sed -e 's/, /'"$LF"'/g' -e 's/\"//g' | grep fulltitle ; done
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
fulltitle: Girls' Generation \uc18c\ub140\uc2dc\ub300_You Think_Music Video
Collaborator

@yan12125 yan12125 commented Oct 19, 2016

Is there a reason for not using common JSON tools like json_reformat or jq, and inventing new ones instead?
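
As an illustration of what a JSON-aware tool does differently from sed and grep, here is a minimal Python sketch that extracts one field with a real parser. The sample line is a shortened, hypothetical stand-in for one line of youtube-dl -j output, and extract_field is a made-up helper name.

```python
import json

# Hypothetical, shortened stand-in for one line of `youtube-dl -j` output.
sample_line = (
    '{"id": "hJYGddE0vHc", '
    '"fulltitle": "Girls\' Generation \\uc18c\\ub140\\uc2dc\\ub300_You Think_Music Video"}'
)


def extract_field(json_line, field):
    """Parse one JSON document and return a single top-level field."""
    return json.loads(json_line)[field]


# The parser decodes the \uXXXX escapes, so the result is a normal text string.
fulltitle = extract_field(sample_line, "fulltitle")
print(len(fulltitle))
```

Unlike splitting on ", " with sed, this does not break when a title itself contains a comma or a quote.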

Author

@yoyohoho98765 yoyohoho98765 commented Oct 24, 2016

Thank you for your reply.
Simply, it's useful when you just want to see the full information of a video.
The present youtube-dl outputs information with character corruption, on screen or in saved files, when no tools are used.
There is a way to avoid character corruption without tools. The way isn't special; it is simple and normal when handling multiple languages in Python. This is because the default character encoding of Python is ASCII, which often overrides other settings.

Thank you for reading. I wish youtube-dl continued progress.

Collaborator

@yan12125 yan12125 commented Oct 24, 2016

First, I'd like to clarify that those characters are not "corrupted". Instead, \uc18c\ub140\uc2dc\ub300 is equivalent to 소녀시대, so jq can translate it correctly. On the other hand, sed, grep and shell scripts are not designed for complex use cases like JSON. The reason for using such a strange representation is to ensure compatibility. For example, UTF-8 does not work well on Windows, and not all third-party applications that embed youtube-dl can handle non-ASCII input.
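
The equivalence and the compatibility argument can both be checked with a short sketch using Python's standard json module (Python 3 assumed; the variable names are illustrative only):

```python
import json

title = "소녀시대"

escaped = json.dumps(title)                      # default: ensure_ascii=True
literal = json.dumps(title, ensure_ascii=False)  # characters kept as-is

# Any JSON parser decodes both spellings to the same string.
assert json.loads(escaped) == json.loads(literal) == title

# The escaped form survives an ASCII-only channel...
ascii_bytes = escaped.encode("ascii")

# ...while the literal form cannot be encoded as ASCII at all.
try:
    literal.encode("ascii")
    literal_is_ascii = True
except UnicodeEncodeError:
    literal_is_ascii = False

print(ascii_bytes.decode("ascii"))
print(literal_is_ascii)
```

This is why the escaped form is the safer default for consumers whose encoding handling is unknown.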
