[youtube] Upload date being wrong by one day #9829

Closed
9 tasks done
znaczki654 opened this issue Apr 30, 2024 · 15 comments · Fixed by #9856
Labels: patch-available (There is patch available that should fix this issue. Someone needs to make a PR with it), site-enhancement (Feature request for some website)

Comments

@znaczki654


As I understand it, I uploaded the video on 25-10-2021, but YouTube processed it on 26-10-2021, and that is why 26-10-2021 shows up on the page instead of 25-10-2021? Reference: #7802

Command:
yt-dlp --print upload_date "https://www.youtube.com/watch?v=OqjTtnmGv8s"

As far as I can see, there is no way to print the date that YouTube shows near the description after processing the video?
Or am I misunderstanding something?

Complete Verbose Output

[debug] Command-line config: ['--print', 'upload_date', 'https://www.youtube.com/watch?v=OqjTtnmGv8s', '-vU']
[debug] Encodings: locale cp1250, fs utf-8, pref cp1250, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2024.04.09 from yt-dlp/yt-dlp [ff0779267] (win_exe)
[debug] Python 3.8.10 (CPython AMD64 64bit) - Windows-10-10.0.22631-SP0 (OpenSSL 1.1.1k  25 Mar 2021)
[debug] exe versions: ffmpeg 7.0-essentials_build-www.gyan.dev (setts), ffprobe 7.0-essentials_build-www.gyan.dev
[debug] Optional libraries: Cryptodome-3.20.0, brotli-1.1.0, certifi-2024.02.02, curl_cffi-0.5.10, mutagen-1.47.0, requests-2.31.0, sqlite3-3.35.5, urllib3-2.2.1, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests, websockets, curl_cffi
[debug] Loaded 1810 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: stable@2024.04.09 from yt-dlp/yt-dlp
yt-dlp is up to date (stable@2024.04.09 from yt-dlp/yt-dlp)
[youtube] Extracting URL: https://www.youtube.com/watch?v=OqjTtnmGv8s
[youtube] OqjTtnmGv8s: Downloading webpage
[youtube] OqjTtnmGv8s: Downloading ios player API JSON
[youtube] OqjTtnmGv8s: Downloading android player API JSON
WARNING: [youtube] Skipping player responses from android clients (got player responses for video "aQvGIIdgFDM" instead of "OqjTtnmGv8s")
[debug] Loading youtube-nsig.7ee5b648 from cache
[debug] [youtube] Decrypted nsig qBP4tfEIQJ6d9G23LXO => j7u7NJ33_M_BvQ
[debug] Loading youtube-nsig.7ee5b648 from cache
[debug] [youtube] Decrypted nsig iSggGPn2WbkdYE2hmMa => UclxAJpmlpI-iA
[youtube] OqjTtnmGv8s: Downloading m3u8 information
[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, vcodec:vp9.2, channels, acodec, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), channels, acodec, lang, proto, size, br, asr, vext, aext, hasaud, id
[debug] Default format spec: bestvideo*+bestaudio/best
[info] OqjTtnmGv8s: Downloading 1 format(s): 137+251
20211025
znaczki654 added the question label on Apr 30, 2024
@pukkandan
Member

The upload_date is returned in UTC

@znaczki654
Author

OK, that makes sense, but is there an option to change the time zone to CEST? Or maybe some way to also get the time, so I could work around it on my own?

@pukkandan
Member

Use --compat-option no-youtube-prefer-utc-upload-date to get the date shown on the webpage, which is in Pacific Time as far as I recall. It is not possible to get the time with yt-dlp, since it's not available in the InnerTube API.

@znaczki654
Author

I tried yt-dlp -no-download --compat-option no-youtube-prefer-utc-upload-date https://www.youtube.com/watch?v=OqjTtnmGv8s --print upload_date

But I still get 20211025, not 20211026, which is weird, since on the YouTube webpage I see the same date, 26.10.2021, regardless of region. I don't know why there is a difference.

@pukkandan
Member

cc @coletdjnz

@dirkf
Contributor

dirkf commented May 1, 2024

Upstream gets the same result:

$ youtube-dl 'https://www.youtube.com/watch?v=OqjTtnmGv8s' --get-filename -o '%(upload_date)s'
20211025
$

That value is the 'uploadDate' in the page's ytInitialPlayerResponse.

The page displayed by Chromium (Qt WebEngine) and by Firefox ESR (shows 2 years ago before expanding the description field) has:

73 views  25 Oct 2021
My discord recruiter, playing akali on bots in League of Legends game.
Mój rekruter discordowy, grający akali na boty w grze League of Legends.

From OP's account, in CEST the page shows 26 instead of 25 (2021's CEST ended on 31 October, the last Sunday of the month).

Why do we get 20211025? The yt-dl extractor is looking at the webpage:

(Pdb) pp re.findall(r'[^,>]{,45}2021-?10-?2\d[^\s>,]*', webpage)
['<meta itemprop="datePublished" content="2021-10-25T15:29:23-07:00"',
 '<meta itemprop="uploadDate" content="2021-10-25T15:29:23-07:00"',
 '"publishDate":"2021-10-25T15:29:23-07:00"',
 '"uploadDate":"2021-10-25T15:29:23-07:00"}}']
(Pdb) 

There is plenty of full resolution timestamp data there. The yt-dlp extractor looks at all the player responses that were both requested and available, but similar data also seemed to be present in the cases that I checked.
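To illustrate that the page metadata carries a full timestamp, here is a minimal Python sketch that pulls the uploadDate value out of a sample of the HTML quoted above and parses it with the standard library. The extract_upload_datetime helper and the SAMPLE_WEBPAGE string are illustrative assumptions, not the extractor's actual code.

```python
import re
from datetime import datetime, timezone

# Sample built from the <meta> tags quoted above; a real page has far more markup.
SAMPLE_WEBPAGE = (
    '<meta itemprop="datePublished" content="2021-10-25T15:29:23-07:00">'
    '<meta itemprop="uploadDate" content="2021-10-25T15:29:23-07:00">'
)


def extract_upload_datetime(webpage):
    """Hypothetical helper: return the uploadDate meta value as an aware datetime, or None."""
    match = re.search(r'<meta itemprop="uploadDate" content="([^"]+)"', webpage)
    if not match:
        return None
    # fromisoformat() keeps the -07:00 offset, so the result is timezone-aware.
    return datetime.fromisoformat(match.group(1))


upload_dt = extract_upload_datetime(SAMPLE_WEBPAGE)
print(upload_dt.isoformat())                                   # 2021-10-25T15:29:23-07:00
print(int(upload_dt.timestamp()))                              # 1635200963
print(upload_dt.astimezone(timezone.utc).strftime('%Y%m%d'))   # 20211025
```

Because the offset is preserved, the same value can be rendered later in any time zone.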

The root of the problem seems to be this:

  • the actual date-time is 2021-10-25T15:29:23-07:00(PDT) == 2021-10-26T00:29:23+02:00(CEST).
  • both extractors use the unified_strdate() function, essentially identical here, that has this special feature:
...
    # Remove AM/PM + timezone
    date_str = re.sub(r'(?i)\s*(?:AM|PM)(?:\s+[A-Z]+)?', '', date_str)
    _, date_str = extract_timezone(date_str)
...

Is it just me, or is this complete nonsense?

The explanation might be, from the first commit of this function:

# %z (UTC offset) is only supported in python>=3.2

So this was a first pass that hasn't been fixed since 2014. The extract_timezone() function was added that returns a datetime.timedelta for the TZ, but it was still discarded, while the TZ was used in parse_iso8601() and in unified_timestamp() (where it seems to be stripped twice?).

Solutions:

  • conservative but potentially confusing: leave the upload_date as-is (ie, the YYYYMMDD as measured at YT HQ) while adding a timestamp that is accurately based on UTC
  • more disruptively: scrap the explicit upload_date extraction in favour of a timestamp (that will result in an altered upload_date in cases like this), unless the page only offers YYYYMMDD.

Presumably, before JS got fancy date/time processing, YT used to send the YYYYMMDD resolution shown on the page, rather than sending the ISO 8601 format and degrading it.
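The time zone arithmetic in this comment can be checked with a short Python sketch. It uses only the standard library rather than the unified_strdate() or extract_timezone() helpers. The fixed CEST offset and the slicing used to mimic "discard the offset" are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

raw = '2021-10-25T15:29:23-07:00'  # value quoted from the page
aware = datetime.fromisoformat(raw)

# Fixed +02:00 offset; an assumption that only holds while CEST is in effect.
CEST = timezone(timedelta(hours=2), 'CEST')

# The same instant viewed in three zones.
in_pdt = aware.strftime('%Y%m%d')
in_utc = aware.astimezone(timezone.utc).strftime('%Y%m%d')
in_cest = aware.astimezone(CEST).strftime('%Y%m%d')

# Dropping the offset before formatting (the effect described above)
# always yields the date as written, i.e. the Pacific date.
stripped = datetime.fromisoformat(raw[:19]).strftime('%Y%m%d')

print(in_pdt, in_utc, in_cest, stripped)
# 20211025 20211025 20211026 20211025
```

Only the CEST view crosses midnight, which is why the page shows 26 for the OP while both extractors report 25.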

@dirkf
Contributor

dirkf commented May 1, 2024

Also, the no-youtube-prefer-utc-upload-date option apparently has no effect, because extraction from microformats via (..., 'uploadDate', any) has already succeeded and discarded the timestamp resolution.

In fact, unified_strdate() as-is is already enforcing no-youtube-prefer-utc-upload-date!

@pukkandan
Copy link
Member

I'm confident the actual timestamp was not available the last time I checked. If we can now extract the full timestamp instead of just upload_date, that's awesome! With a timestamp, we don't need a non-UTC date, since users can just add their timezone offset as desired: %(timestamp+xx>%Y%m%d)s

pukkandan added the site-enhancement label (Feature request for some website) on May 1, 2024
pukkandan changed the title from "Upload date being wrong by one day" to "[youtube] Upload date being wrong by one day" on May 1, 2024
@znaczki654
Author

I've read dirkf's comments and yours, pukkandan. As I understand it, this is heading towards making it possible to apply a timezone to a timestamp? I ask because when I used timestamp instead of upload_date in my command, it returned NA.

@dirkf
Contributor

dirkf commented May 1, 2024

Historically, YT only sent day-resolution values (I'm guessing this did not change until after the generic/offset values for the timeZoneName option of Intl.DateTimeFormat() were widely supported, ca. mid-2022), so the extractor does not (yet) extract a timestamp. A missing upload_date can be generated from a timestamp, but the reverse isn't sensible. Hence you currently get NA for the timestamp.
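A small Python sketch of why the conversion only works in one direction. Both helper names are hypothetical, and the timestamp value is the one reported later in this thread.

```python
from datetime import datetime, timezone


def upload_date_from_timestamp(timestamp):
    """Hypothetical helper: format a Unix timestamp as a UTC YYYYMMDD string."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime('%Y%m%d')


def timestamp_from_upload_date(upload_date):
    """Hypothetical reverse: can only guess midnight UTC, the time of day is gone."""
    day = datetime.strptime(upload_date, '%Y%m%d').replace(tzinfo=timezone.utc)
    return int(day.timestamp())


ts = 1635200963
date = upload_date_from_timestamp(ts)      # exact: the date is fully determined
guess = timestamp_from_upload_date(date)   # lossy: up to a day of error
print(date, guess, ts - guess)
# 20211025 1635120000 80963
```

The guessed value is off by 80963 seconds (22:29:23), which is enough to land on the wrong calendar day once a time zone offset is applied.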

@gamer191
Collaborator

gamer191 commented May 2, 2024

> youtube processed it on 26-10-2021 therefore this date shows up instead of 25-10-2021?

> But still 20211025 not 20211026, that is weird

I’m really confused. Do you want 20211025 or 20211026?

@znaczki654
Author

> > youtube processed it on 26-10-2021 therefore this date shows up instead of 25-10-2021?
>
> > But still 20211025 not 20211026, that is weird
>
> I’m really confused. Do you want 20211025 or 20211026?

I want 20211026.
image

@coletdjnz
Member

coletdjnz commented May 2, 2024

The upload date should always be in UTC unless it is a stream or premiere. We tell YouTube to return everything in UTC.

The compat opt reverts this behaviour to what yt-dl does - use the date in PT(?) tz from microformats (if available)

upload_date = strftime_or_none(

We cannot get the timestamp. If you need precise datetime then yt-dlp is not the right tool, you need to use the data API for that.

@dirkf
Contributor

dirkf commented May 2, 2024

But as described above, it seems that timestamp is now available by using parse_iso8601() (or unified_timestamp()) instead of unified_strdate().

If a timestamp is available, OP will be able to format it into the desired TZ (CEST) as also described above.

Example:

         # The upload date for scheduled, live and past live streams / premieres in microformats
         # may be different from the stream date. Although not in UTC, we will prefer it in this case.
         # See: https://github.com/yt-dlp/yt-dlp/pull/2223#issuecomment-1008485139
-        upload_date = (
-            unified_strdate(get_first(microformats, 'uploadDate'))
-            or unified_strdate(search_meta('uploadDate')))
-        if not upload_date or (
-            live_status in ('not_live', None)
-            and 'no-youtube-prefer-utc-upload-date' not in self.get_param('compat_opts', [])
-        ):
-            upload_date = strftime_or_none(
-                self._parse_time_text(self._get_text(vpir, 'dateText'))) or upload_date
-        info['upload_date'] = upload_date
+        timestamp = (
+            unified_timestamp(get_first(microformats, 'uploadDate'))
+            or unified_timestamp(search_meta('uploadDate')))
+        own_upload_date = (
+            live_status not in ('not_live', None)
+            or 'no-youtube-prefer-utc-upload-date' in self.get_param('compat_opts', []))
+        if not timestamp or own_upload_date:
+            upload_date = (
+                unified_strdate(get_first(microformats, 'uploadDate'))
+                or unified_strdate(search_meta('uploadDate')))
+        else:
+            upload_date = None
+        if not (upload_date and own_upload_date):
+            if not upload_date and timestamp:
+                # TODO: complicated TZ processing to render timestamp as YYYYMMDD in Pacific time
+                pass
+            if not upload_date:
+                upload_date = strftime_or_none(
+                    self._parse_time_text(self._get_text(vpir, 'dateText'))) or upload_date
+        if timestamp:
+            info['timestamp'] = timestamp
+        if upload_date:
+            info['upload_date'] = upload_date
+
+        if (timestamp or upload_date) and live_status not in ('is_live', 'post_live', 'is_upcoming'):
+            # Newly uploaded videos' HLS formats are potentially problematic and need to be checked
+            if timestamp:
+                upload_datetime = datetime.datetime.utcfromtimestamp(timestamp)
+            else:
+                upload_datetime = datetime_from_str(upload_date) # .replace(tzinfo=datetime.timezone.utc)
+            if upload_datetime >= datetime_from_str('today-2days'):
+                for fmt in info['formats']:
+                    if fmt.get('protocol') == 'm3u8_native':
+                        fmt['__needs_testing'] = True
 
         for s_k, d_k in [('artist', 'creator'), ('track', 'alt_title')]:

Output (CEST = UTC+2 = +7200s):

$ yt-dlp --print upload_date --print timestamp --print '%(timestamp+7200>%Y%m%d)s' 'https://www.youtube.com/watch?v=OqjTtnmGv8s'
20211025
1635200963
20211026
$
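As a rough Python equivalent of what the '%(timestamp+7200>%Y%m%d)s' template computes, assuming the shifted value is formatted as UTC (which matches the output shown above): the shifted_date helper below is hypothetical and is not part of yt-dlp.

```python
from datetime import datetime, timezone


def shifted_date(timestamp, offset_seconds, fmt='%Y%m%d'):
    """Hypothetical helper: add an offset in seconds, then format the result as UTC."""
    shifted = datetime.fromtimestamp(timestamp + offset_seconds, tz=timezone.utc)
    return shifted.strftime(fmt)


ts = 1635200963
print(shifted_date(ts, 0))                                  # 20211025 (UTC)
print(shifted_date(ts, 7200))                               # 20211026 (CEST, UTC+2)
print(shifted_date(ts, -7 * 3600, '%Y-%m-%d %H:%M:%S'))     # 2021-10-25 15:29:23 (PDT)
```

A fixed offset does not follow daylight saving changes: +7200 is right for CEST, while dates after the switch to CET would need +3600.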

pukkandan added the patch-available label (There is patch available that should fix this issue. Someone needs to make a PR with it) on May 2, 2024
@coletdjnz
Member

If you can get the full timestamp, then that must be a recent change by YouTube. In that case the extractor should be updated to extract it as a (UTC) Unix timestamp.
