Retrieve JSON data in unicode (Encoding UTF-8) #11696

Open
linglung opened this issue Jan 13, 2017 · 14 comments

@linglung linglung commented Jan 13, 2017

  • Bug report (encountered problems with youtube-dl)
  • Site support request (request for adding support for a new site)
  • Feature request (request for a new functionality)
  • Question
  • Other
    ===================================================

I need JSON output from youtube-dl that contains the raw Unicode (UTF-8) text, but the JSON it produces for a YouTube video does not seem to be in UTF-8 (?).

I tried printing the JSON info with -j/--dump-json, -J/--dump-single-json and --print-json, and writing it directly to a JSON file with --write-info-json. In every case the non-ASCII strings came out as \uXXXX escape sequences instead of the original characters from the video source.

The commands I used, with and without --encoding utf-8:

youtube-dl --write-info-json --encoding utf-8 -f mp4 -o "%(title)s.%(ext)s" https://www.youtube.com/watch?v=0alnhFO1B7Y -v

youtube-dl -j --encoding utf-8 -f mp4 -o "%(title)s.%(ext)s" https://www.youtube.com/watch?v=0alnhFO1B7Y -v

youtube-dl -J --encoding utf-8 -f mp4 -o "%(title)s.%(ext)s" https://www.youtube.com/watch?v=0alnhFO1B7Y -v

youtube-dl --print-json --encoding utf-8 -f mp4 -o "%(title)s.%(ext)s" https://www.youtube.com/watch?v=0alnhFO1B7Y -v

The log output:

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--write-info-json', '--encoding', 'utf-8', '-f', 'mp4', '-o', '%(title)s.%(ext)s', 'https://www.youtube.com/watch?v=0alnhFO1B7Y', '-v']
[debug] Encodings: locale cp1252, fs mbcs, out cp1252, pref utf-8
[debug] youtube-dl version 2017.01.10
[debug] Python version 3.4.4 - Windows-10-10.0.14393
[debug] exe versions: ffmpeg N-82966-g6993bb4, ffprobe N-82966-g6993bb4
[debug] Proxy map: {}
[youtube] 0alnhFO1B7Y: Downloading webpage
[youtube] 0alnhFO1B7Y: Downloading video info webpage
[youtube] 0alnhFO1B7Y: Extracting video information
[youtube] 0alnhFO1B7Y: Downloading MPD manifest
[info] Writing video description metadata as JSON to: 香港麥當勞42年歷史大盤點.info.json
[debug] Invoking downloader on 'https://r4---sn-npoeen7k.googlevideo.com/videoplayback?signature=9CF8920347BA9578C4C6C1909BF07083928118A5.C4840C9CB5EE1D0233179DF1EBB58DBF74095DD8&initcwndbps=6973750&mime=video%2Fmp4&key=yt6&ei=yiJ4WKbGGcugoQOQprzACg&upn=-mYp2oMPHqQ&expire=1484289834&dur=105.581&lmt=1484189129530128&clen=8002642&gir=yes&nh=IgpwcjAyLnNpbjExKg03NC4xMjUuNTEuMTcz&ratebypass=yes&sparams=clen%2Cdur%2Cei%2Cgir%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Cratebypass%2Crequiressl%2Csource%2Cupn%2Cexpire&requiressl=yes&itag=18&source=youtube&id=o-AOL2Ym3gKDUBTYFiGyLZ6ipSYhPAoMG_7kFGBFNI5-ti&pl=18&ms=au&mt=1484268048&mv=m&mm=31&ip=128.199.217.235&mn=sn-npoeen7k&ipbits=0'
[download] Destination: 香港麥當勞42年歷史大盤點.mp4
[download] 100% of 7.63MiB
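
For readers comparing the `Encodings:` line in the log above with their own machine, here is a minimal sketch that prints similar values from the standard library. It is not youtube-dl's code; the function name `report_encodings` and the dictionary keys are made up for illustration.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports for the current environment.

    Hypothetical helper for illustration, not part of youtube-dl.
    """
    return {
        'locale': locale.getpreferredencoding(),
        'fs': sys.getfilesystemencoding(),
        # stdout may have been replaced by an object without an encoding
        'out': getattr(sys.stdout, 'encoding', None),
    }


if __name__ == '__main__':
    print(report_encodings())
```

A value such as cp1252 for `out` means the console cannot represent CJK characters directly.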

Below is part of the JSON output. It is only an excerpt, since the full JSON contains a huge amount of string data, but it is enough to show the issue.
For example, the title, tags and description:

"title": "\u9999\u6e2f\u9ea5\u7576\u52de42\u5e74\u6b77\u53f2\u5927\u76e4\u9ede", "url": "https://r4---sn-npoeen7k.googlevideo.com/videoplayback?nh=IgpwcjAyLnNpbjExKg03NC4xMjUuNTEuMTcz&mm=31&mime=video%2Fmp4&pl=18&itag=18&mv=m&mt=1484268354&ms=au&ei=iiN4WN3qBc-XoQOAkrCgAQ&requiressl=yes&gir=yes&ratebypass=yes&mn=sn-npoeen7k&clen=8002642&initcwndbps=6792500&source=youtube&id=o-AGC0tYRBdONrnPr4dLWQi5RZD33w4-n6WvsXmWUoX6-W&lmt=1484189129530128&key=yt6&ip=128.199.217.235&expire=1484290026&dur=105.581&upn=1lnlrUcKnCg&signature=D656C1E342F3EFA9F3C8D5DE181169801D2B52F9.17D6C35911182083504F92FEDE78E6010D62E3B6&sparams=clen%2Cdur%2Cei%2Cgir%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Cratebypass%2Crequiressl%2Csource%2Cupn%2Cexpire&ipbits=0", "categories": ["News & Politics"], "duration": 106, "uploader": "\u860b\u679c\u52d5\u65b0\u805e HK Apple Daily", "uploader_id": "appleactionews", "subtitles": {}, "format": "18 - 640x360 (medium)", "abr": 96, "ext": "mp4", "upload_date": "20170110", "thumbnail": "https://i.ytimg.com/vi/0alnhFO1B7Y/hqdefault.jpg", "formats": [{"height": null, "format_note": "DASH audio", "tbr": 57, "fps": null, "vcodec": "none", "url": 
 "description": "\u3010\u672c\u5831\u8a0a\u3011\u9ea5\u7576\u52de\u9003\u4e0d\u904e\u67d3\u7d05\u547d\u904b\uff0c\u6e2f\u4eba\u559c\u6b61\u53eb\u9ea5\u7576\u52de\u505a\u300c\u8001\u9ea5\u300d\u3001\u300c\u9ea5\u8a18\u300d\uff0c\u5168\u56e0\u9ea5\u7576\u52de\u5df2\u966a\u4f34\u6e2f\u4eba\u903e42\u500b\u5e74\u982d\uff0c\u9ea5\u7576\u52de\u53d4\u53d4\u3001\u958b\u751f\u65e5\u6703\u7b49\u96c6\u9ad4\u56de\u61b6\u6df1\u5165\u6c11\u5fc3\uff0c\u9ea5\u7576\u52de\u66fe\u63a8\u63db\u8cfc\u53f2\u8afe\u6bd4\u516c\u4ed4\u6380\u5168\u57ce\u6392\u968a\u71b1\u6f6e\uff0c\u4ea6\u5920\u7d93\u5178\u3002\n\n\u860b\u679c\u65e5\u5831\uff1ahttp://hk.apple.nextmedia.com\n\u5373like\u860b\u679cfb\uff1ahttp://www.facebook.com/hk.nextmedia\niPhone App\uff1ahttp://bit.ly/AppleDailyApp-iPhone\nAndroid App\uff1ahttp://bit.ly/AppleDailyApp-Android", "http_headers": {"Accept-Language": "en-us,en;q=0.5", "Accept-Encoding": "gzip, deflate", "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20150101 Firefox/47.0 (Chrome)"}, "start_time": null, "player_url": null, "playlist_index": null, "like_count": 490, "protocol": "https", "format_id": "18"}
{"tags": ["\u860b\u679c\u52d5\u65b0\u805e", "\u860b\u679c\u65e5\u5831", "appledaily", "Apple Daily(Newspaper)", "Hong Kong", "news", "\u52d5\u65b0\u805e", "\u65b0\u805e", "hk", "\u9999\u6e2f"], "uploader_url": "http://www.youtube.com/user/appleactionews", "license": "Standard YouTube License", "age_limit": 0, "resolution": "640x360", "id": "0alnhFO1B7Y",
Collaborator

@yan12125 yan12125 commented Jan 14, 2017

Well, this is the second time people have asked for unescaped strings (#10927). It might be worth an option.

Here's a quick hack:

diff --git a/youtube_dl/YoutubeDL.py b/youtube_dl/YoutubeDL.py
index 5d654f55f..d7374e820 100755
--- a/youtube_dl/YoutubeDL.py
+++ b/youtube_dl/YoutubeDL.py
@@ -1535,7 +1535,7 @@ class YoutubeDL(object):
         if self.params.get('forceformat', False):
             self.to_stdout(info_dict['format'])
         if self.params.get('forcejson', False):
-            self.to_stdout(json.dumps(info_dict))
+            self.to_stdout(json.dumps(info_dict, ensure_ascii=False))
 
         # Do nothing else if in simulate mode
         if self.params.get('simulate', False):
@yan12125 yan12125 added the request label Jan 14, 2017
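
To illustrate what the `ensure_ascii=False` change in the diff above does, here is a small standalone sketch with a made-up `info` dictionary (not youtube-dl's real `info_dict`):

```python
import json

# Made-up stand-in for youtube-dl's info_dict.
info = {'title': '香港', 'duration': 106}

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(info)

# With ensure_ascii=False the characters are emitted as-is.
unescaped = json.dumps(info, ensure_ascii=False)

print(escaped)
print(unescaped)

# Both strings decode to the same data.
assert json.loads(escaped) == json.loads(unescaped) == info
```

The unescaped form can only be printed if the output stream's encoding is able to represent the characters.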
Author

@linglung linglung commented Jan 16, 2017

Using the Git shell, I got this:

diff --git a/youtube_dl/YoutubeDL.py b/youtube_dl/YoutubeDL.py
diff: unknown option -- git
diff: Try 'diff --help' for more information.

I tried to make the change manually. I edited the YoutubeDL.py file from the master zip and put your change, self.to_stdout(json.dumps(info_dict, ensure_ascii=False)), at line 1540. Then I ran it in developer mode to test it: python -m youtube_dl --write-info-json https://www.youtube.com/watch?v=of0B-ZvxYI4.

Same result. 😢

Collaborator

@yan12125 yan12125 commented Jan 16, 2017

Well, --write-info-json uses a different function.

diff --git a/youtube_dl/utils.py b/youtube_dl/utils.py
index 12863e74a..6ded34832 100644
--- a/youtube_dl/utils.py
+++ b/youtube_dl/utils.py
@@ -231,7 +231,7 @@ def write_json_file(obj, fn):
 
     try:
         with tf:
-            json.dump(obj, tf)
+            json.dump(obj, tf, ensure_ascii=False)
         if sys.platform == 'win32':
             # Need to remove existing file on Windows, else os.rename raises
             # WindowsError or FileExistsError.

On Linux/Mac/... you can use patch to apply the change. On Windows, I'm afraid you'll need to change those files by hand.
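
As a standalone illustration of the same idea outside youtube-dl, the sketch below writes unescaped JSON to a file on Python 3. The helper name `write_json_utf8` is hypothetical and this is not youtube-dl's `write_json_file`. It assumes the file should be UTF-8, which has to be requested explicitly, because a locale default such as cp1252 cannot encode CJK text.

```python
import json
import os
import tempfile


def write_json_utf8(obj, path):
    """Write obj as JSON, leaving non-ASCII characters unescaped.

    Hypothetical helper for illustration, not youtube-dl's write_json_file.
    """
    # The explicit encoding avoids relying on the locale default.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(obj, f, ensure_ascii=False)


if __name__ == '__main__':
    target = os.path.join(tempfile.mkdtemp(), 'example.info.json')
    write_json_utf8({'title': '香港'}, target)
    with open(target, 'rb') as f:
        print(f.read())
```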

Author

@linglung linglung commented Jan 16, 2017

Great! It works as expected.

"title": "【激震】松本伊代(51)が逮捕の可能性…(画像あり)", "alt_title": null, "thumbnail": "https://i.ytimg.com/vi/of0B-ZvxYI4/hqdefault.jpg", 
"description": "これはいかんやろ\n\n【おすすめサイト】\nびっくり映像まとめ\nhttp://lifestylemovie305.club/\n癒し系感動画像まとめ\nhttp://lifestyle305.link/\n\n引用元\nまとめもりー\n\n関連動画\n【警察がガラスを割って逃走車を逮捕の大暴れの瞬間\nhttps://youtu.be/FRc_PDxdaKk\n\n【親友】草なぎ剛の逮捕後あいつだけが連絡をくれたんだ【芸能ゴシップch】\nhttps://youtu.be/F7u-eeVqvNo\n\n【逮捕】ヤマト運輸チェーンソー襲撃事件\nhttps://youtu.be/Kr4k1RXmBXk", "categories": ["Entertainment"], "tags": ["松本伊代", "逮捕", "鉄ヲタ", "侵入", "芸能ゴシップチャンネル"], "subtitles": {}, "automatic_captions": {}, "duration": 44, "age_limit": 0, "annotations": null, 
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--write-info-json', 'https://www.youtube.com/watch?v=of0B-ZvxYI4', '-v']
[debug] Encodings: locale cp1252, fs utf-8, out cp1252, pref cp1252
[debug] youtube-dl version 2017.01.10
[debug] Git HEAD: 250a6a6
[debug] Python version 3.6.0 - Windows-10-10.0.14393-SP0
[debug] exe versions: ffmpeg 2.8.4, ffprobe N-82966-g6993bb4
[debug] Proxy map: {}
[youtube] of0B-ZvxYI4: Downloading webpage
[youtube] of0B-ZvxYI4: Downloading video info webpage
[youtube] of0B-ZvxYI4: Extracting video information
[youtube] of0B-ZvxYI4: Downloading MPD manifest
[info] Writing video description metadata as JSON to: 51▒-of0B-ZvxYI4.info.json
WARNING: Requested formats are incompatible for merge and will be merged into mkv.
[debug] Invoking downloader on 'https://r1---sn-npoeene7.googlevideo.com/videoplayback/id/a1fd01f99bf1608e/itag/137/source/youtube/requiressl/yes/pl/20/ms/au/mv/m/mm/31/mn/sn-npoeene7/nh/IgpwcjAyLnNpbjExKgkxMjcuMC4wLjE/initcwndbps/5181250/ratebypass/yes/mime/video%2Fmp4/otfp/1/gir/yes/clen/15804514/lmt/1484537241771041/dur/44.010/mt/1484587873/signature/51F5F5775AFC186891468FEA3189DE2C4363AEC0.73349929948C64E626C984C19B4450A69ADFBC48/key/dg_yt0/upn/TvBQw5qcbLw/ip/128.199.120.49/ipbits/0/expire/1484609801/sparams/ip,ipbits,expire,id,itag,source,requiressl,pl,ms,mv,mm,mn,nh,initcwndbps,ratebypass,mime,otfp,gir,clen,lmt,dur/'
[dashsegments] Total fragments: 10
[download] Destination: 51▒-of0B-ZvxYI4.f137.mp4
[download] 100% of 15.07MiB in 00:10
[debug] Invoking downloader on 'https://r1---sn-npoeene7.googlevideo.com/videoplayback?keepalive=yes&ei=qAR9WLyQCqWWoAOU2LGQAg&lmt=1484537811953574&sparams=clen%2Cdur%2Cei%2Cgir%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Ckeepalive%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Crequiressl%2Csource%2Cupn%2Cexpire&gir=yes&nh=IgpwcjAyLnNpbjExKgkxMjcuMC4wLjE&signature=E0FCAFA6A26E36BBAF079871A1245E52D44F38BA.39EE2958290D2587D1F9133C72A6DD80542DCE0F&dur=44.021&initcwndbps=5181250&itag=251&clen=721390&ipbits=0&key=yt6&upn=XK0SksNiZ_k&expire=1484609800&mv=m&mt=1484587873&ms=au&id=o-AAGPZjeL-9r4CcQxIfhSH50qx54cLzbhhisXP7f74bbJ&mn=sn-npoeene7&pl=20&source=youtube&mm=31&ip=128.199.120.49&mime=audio%2Fwebm&requiressl=yes&ratebypass=yes'
[download] Destination: 51▒-of0B-ZvxYI4.f251.webm
[download] 100% of 704.48KiB in 00:01
[ffmpeg] Merging formats into "51▒-of0B-ZvxYI4.mkv"
[debug] ffmpeg command line: ffmpeg -y -i 'file:51▒-of0B-ZvxYI4.f137.mp4' -i 'file:51▒-of0B-ZvxYI4.f251.webm' -c copy -map 0:v:0 -map 1:a:0 'file:51▒-of0B-ZvxYI4.temp.mkv'
Deleting original file 51▒-of0B-ZvxYI4.f137.mp4 (pass -k to keep)
Deleting original file 51▒-of0B-ZvxYI4.f251.webm (pass -k to keep)
Author

@linglung linglung commented Jan 16, 2017

Sadly, when I used your first approach with -j or -J (dumping JSON without writing a file), it didn't work.
FYI, I first restored the original utils.py file, then changed the lines in YoutubeDL.py as in your first approach.

Here are the logs:

python -m youtube_dl -j https://www.youtube.com/watch?v=of0B-ZvxYI4 -v
Traceback (most recent call last):
  File "C:\Users\Google\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "C:\Users\Google\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 142, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "C:\Users\Google\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "C:\Users\Google\Documents\GitHub\ytdl\youtube_dl\__init__.py", line 45, in <module>
    from .YoutubeDL import YoutubeDL
  File "C:\Users\Google\Documents\GitHub\ytdl\youtube_dl\YoutubeDL.py", line 1540
    self.to_stdout(json.dumps(info_dict, ensure_ascii=False))
                                                            ^
TabError: inconsistent use of tabs and spaces in indentation

Collaborator

@yan12125 yan12125 commented Jan 16, 2017

Most likely there are tabs - replace them all with spaces.
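
For anyone hitting the same TabError after a manual edit, here is a rough sketch for locating the offending lines and converting tabs in the indentation to spaces. The function names and the default tab size of 4 are assumptions for illustration; check the result against the surrounding code, which uses spaces.

```python
def lines_with_tab_indent(source):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        indent = line[:len(line) - len(line.lstrip(' \t'))]
        if '\t' in indent:
            hits.append(number)
    return hits


def expand_indent_tabs(source, tabsize=4):
    """Replace tabs in leading whitespace with spaces; the rest of each line is untouched."""
    fixed = []
    for line in source.splitlines(keepends=True):
        stripped = line.lstrip(' \t')
        indent = line[:len(line) - len(stripped)]
        fixed.append(indent.expandtabs(tabsize) + stripped)
    return ''.join(fixed)


if __name__ == '__main__':
    sample = "if x:\n\tprint(1)\n    print(2)\n"
    print(lines_with_tab_indent(sample))
    print(expand_indent_tabs(sample))
```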

Author

@linglung linglung commented Jan 16, 2017

@yan12125 Perfect. Fixed now. Thank you so much 😄

python -m youtube_dl -j --encoding utf-8 https://www.youtube.com/watch?v=of0B-ZvxYI4 -v
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-j', '--encoding', 'utf-8', 'https://www.youtube.com/watch?v=of0B-ZvxYI4', '-v']
[debug] Encodings: locale cp1252, fs utf-8, out cp1252, pref utf-8
[debug] youtube-dl version 2017.01.10
[debug] Git HEAD: 250a6a6
[debug] Python version 3.6.0 - Windows-10-10.0.14393-SP0
[debug] exe versions: ffmpeg 2.8.4, ffprobe N-82966-g6993bb4
[debug] Proxy map: {}
{"id": "of0B-ZvxYI4", "uploader": "芸能 ゴシップ チャンネル", "uploader_id": "UC0OUfSvMHCpn2sukdhH-5kw", "uploader_url": "http://www.youtube.com/channel/UC0OUfSvMHCpn2sukdhH-5kw", "upload_date": "20170115", "license": "Standard YouTube License", "creator": null, "title": "【激震】松本伊代(51)が逮捕の可能性…(画像あり)", "alt_title": null, "thumbnail": "https://i.ytimg.com/vi/of0B-ZvxYI4/hqdefault.jpg", "description": "これはいかんやろ\n\n【おすすめサイト】\nびっくり映像まとめ\nhttp://lifestylemovie305.club/\n癒し系感動画像まとめ\nhttp://lifestyle305.link/\n\n引用元\nまとめもりー\n\n関連動画\n【警察がガラスを割って逃走車を逮捕の大暴れの瞬間\nhttps://youtu.be/FRc_PDxdaKk\n\n【親友】草なぎ剛の逮捕後あいつだけが連絡をくれたんだ【芸能ゴシップch】\nhttps://youtu.be/F7u-eeVqvNo\n\n【逮捕】ヤマト運輸チェーンソー襲撃事件\nhttps://youtu.be/Kr4k1RXmBXk", "categories": ["Entertainment"], "tags": ["松本伊代", "逮捕", "鉄ヲタ", "侵入", "芸能ゴシップチャンネル"], "subtitles": {}, "automatic_captions": {}, "duration": 44, "age_limit": 0, "annotations": null, "webpage_url": "https://www.youtube.com/watch?v=of0B-ZvxYI4", "view_count": 285206, "like_count": 62, "dislike_count": 523, "average_rating": 1.42393159866, "formats":

@one2gov one2gov commented Mar 11, 2017

self.to_stdout(json.dumps(info_dict, ensure_ascii=False)) makes -j work, but json.dump(obj, tf, ensure_ascii=False) doesn't make a difference for --write-info-json

youtube-dl --encoding utf-8 --write-info-json https://www.youtube.com/watch?v=VA0rAN0GRY4

Author

@linglung linglung commented Mar 20, 2017

Why isn't this applied as the default setting in every released version of youtube-dl?


@AraHaan AraHaan commented Mar 20, 2017

Actually, @yan12125, you can apply the patch on Windows if you use Git for Windows (Git Bash). Well, at least I can. To do it on Windows, I'm afraid you have to write the diff to a file named [filename].patch and then run git apply [filename].patch

Collaborator

@yan12125 yan12125 commented Mar 20, 2017

To @linglung: It may sound silly, but not all environments support raw (unescaped) UTF-8. youtube-dl aims to stay compatible with most systems, so it can't be the default.


@AraHaan AraHaan commented Mar 20, 2017

Hmm, in this case you could use sys.platform and decide based on its value, @yan12125. That is how I decide to use the system opus / ffmpeg on Linux but not on Windows in one of my projects.

Collaborator

@yan12125 yan12125 commented Mar 20, 2017

Running on Linux does not imply full UTF-8 support. If one uses LC_ALL=C or LC_ALL=POSIX, UTF-8 strings can break the console. Such a setting is common in containers like Docker. (http://bugs.python.org/issue28180) On the other hand, since Python 3.6 UTF-8 support seems quite fine on Windows. (PEP 528, PEP 529) The logic for determining UTF-8 support can be rather complicated.
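
One way to sidestep platform detection, sketched here only as an idea and not as anything youtube-dl implements, is to test whether the target encoding can actually represent the text and fall back to escapes otherwise. The function name `dumps_for_encoding` is hypothetical.

```python
import json


def dumps_for_encoding(obj, encoding):
    """Return JSON text that can be encoded with the given encoding.

    Hypothetical helper: unescaped output when the encoding can represent
    it, \\uXXXX escapes otherwise.
    """
    text = json.dumps(obj, ensure_ascii=False)
    try:
        # A missing encoding (None) is treated as plain ASCII.
        text.encode(encoding or 'ascii')
    except (UnicodeEncodeError, LookupError):
        # The encoding is unknown or cannot represent the text.
        return json.dumps(obj)
    return text


if __name__ == '__main__':
    info = {'t': '香港'}
    print(dumps_for_encoding(info, 'utf-8'))
    print(dumps_for_encoding(info, 'cp1252'))
```

In practice the encoding argument could come from something like `sys.stdout.encoding`, though a reported encoding does not guarantee that the console renders the characters correctly.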


@AraHaan AraHaan commented Mar 20, 2017

Which is why you could do it like this in both diffs:

diff --git a/youtube_dl/YoutubeDL.py b/youtube_dl/YoutubeDL.py
index 5d654f55f..d7374e820 100755
--- a/youtube_dl/YoutubeDL.py
+++ b/youtube_dl/YoutubeDL.py
@@ -1535,7 +1535,10 @@ class YoutubeDL(object):
         if self.params.get('forceformat', False):
             self.to_stdout(info_dict['format'])
         if self.params.get('forcejson', False):
-            self.to_stdout(json.dumps(info_dict))
+            if sys.platform == 'win32':
+                self.to_stdout(json.dumps(info_dict, ensure_ascii=False))
+            else:
+                self.to_stdout(json.dumps(info_dict))
 
         # Do nothing else if in simulate mode
         if self.params.get('simulate', False):
diff --git a/youtube_dl/utils.py b/youtube_dl/utils.py
index 12863e74a..6ded34832 100644
--- a/youtube_dl/utils.py
+++ b/youtube_dl/utils.py
@@ -231,7 +231,10 @@ def write_json_file(obj, fn):
 
     try:
         with tf:
-            json.dump(obj, tf)
+            if sys.platform == 'win32':
+                json.dump(obj, tf, ensure_ascii=False)
+            else:
+                json.dump(obj, tf)
         if sys.platform == 'win32':
             # Need to remove existing file on Windows, else os.rename raises
             # WindowsError or FileExistsError.