[ie/twitter] Add fallback, improve error handling #7621

Merged
bashonly merged 16 commits into yt-dlp:master from fix/twitter-gt on Jul 29, 2023

Member

@bashonly bashonly commented Jul 17, 2023

Adds syndication fallback for tweet extraction and improves error handling all over

Closes #7579, Closes #7625

Outdated description

 
(EDIT: The guest token extraction was actually broken by a core regression which has been fixed)

Twitter has apparently decommissioned the guest token API endpoint, and the browser now gets the token from the webpage html instead. _perform_login was already doing this, so this patch updates _fetch_guest_token to always do this. No more guest token endpoint means that "legacy API" tweet extraction is now broken, so the dead code has been removed.
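To illustrate the idea of reading the guest token from the page HTML rather than from a dedicated endpoint, here is a minimal sketch. It assumes the page sets the token through an inline `document.cookie = "gt=<digits>; ..."` assignment; the regular expression, the helper name `find_guest_token`, and the sample HTML are all hypothetical and are not taken from the patch.

```python
import re

# Hypothetical pattern: assumes the token appears in an inline
# `document.cookie = "gt=<digits>; ..."` assignment in the page source.
GUEST_TOKEN_RE = re.compile(r'\.cookie\s*=\s*["\']gt=(\d+);')


def find_guest_token(webpage):
    """Return the guest token found in the page HTML, or None if absent."""
    match = GUEST_TOKEN_RE.search(webpage)
    return match.group(1) if match else None


# Made-up HTML fragment, for illustration only
sample = '<script>document.cookie="gt=1681234567890123456; Max-Age=10800; Path=/";</script>'
token = find_guest_token(sample)
missing = find_guest_token('<html></html>')
```

Returning `None` when nothing matches lets the caller decide whether a missing token is fatal or whether another source should be tried.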

with master:

$ yt-dlp -F "https://twitter.com/starwars/status/665052190608723968"
[twitter] Extracting URL: https://twitter.com/starwars/status/665052190608723968
[twitter] 665052190608723968: Downloading guest token
ERROR: [twitter] 665052190608723968: Unable to download JSON metadata: HTTP Error 404: Not Found (caused by <HTTPError 404: Not Found>); please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U

with patch:

$ yt-dlp -F "https://twitter.com/starwars/status/665052190608723968"
[twitter] Extracting URL: https://twitter.com/starwars/status/665052190608723968
[twitter] 665052190608723968: Downloading guest token
[twitter] 665052190608723968: Downloading GraphQL JSON
[twitter] a1fe3ba7-fad3-403a-be50-02752bb88d0d: Downloading XML
[twitter] a1fe3ba7-fad3-403a-be50-02752bb88d0d: Downloading m3u8 information
[info] Available formats for 665052190608723968:
ID        EXT RESOLUTION │   TBR PROTO │ VCODEC      ACODEC
──────────────────────────────────────────────────────────────
http-320  mp4 320x180    │  320k https │ unknown     unknown
hls-320   mp4 320x180    │  320k m3u8  │ avc1.420015 mp4a.40.2
http-832  mp4 640x360    │  832k https │ unknown     unknown
hls-832   mp4 640x360    │  832k m3u8  │ avc1.42001f mp4a.40.2
http-2176 mp4 1280x720   │ 2176k https │ unknown     unknown
hls-2176  mp4 1280x720   │ 2176k m3u8  │ avc1.4d0020 mp4a.40.2
Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Copilot Summary

🤖 Generated by Copilot at b13f71b

Summary

🔄🚀🔢

Improve the performance and reliability of the twitter.py extractor by streamlining the authentication process. Use the webpage to get the guest token and reduce redundant requests.

Oh we're the crew of the Twitter ship, and we sail the data sea
We fetch the tweets with our guest tokens, and we do it smart and free
We don't need the JSON endpoint, nor the login page to load
We get our tokens from the webpage, and we save ourselves some code

Walkthrough

  • Simplify guest token fetching by extracting it from webpage instead of JSON request
  • Remove redundant login page download and guest token search

@bashonly bashonly added the site-bug label (Issue with a specific website) Jul 17, 2023
@bashonly bashonly marked this pull request as draft July 17, 2023 15:02
@bashonly bashonly marked this pull request as ready for review July 17, 2023 15:55
yt_dlp/extractor/twitter.py
Comment on lines 1194 to 1208
                'cards_platform': 'Web-12',
                'include_cards': 1,
                'include_reply_count': 1,
                'include_user_entities': 0,
                'tweet_mode': 'extended',
            }), 'retweeted_status', None)
        elif not self.is_logged_in:
            status = self._graphql_to_legacy(
                self._call_graphql_api('2ICDjqPd81tulZcYrtpTuQ/TweetResultByRestId', twid), twid)
        else:
            if self.is_logged_in:
                status = self._graphql_to_legacy(
                    self._call_graphql_api('zZXycP0V6H7m-2r0mOnFcA/TweetDetail', twid), twid)
            else:
                try:
                    status = self._graphql_to_legacy(
                        self._call_graphql_api('2ICDjqPd81tulZcYrtpTuQ/TweetResultByRestId', twid), twid)
                except ExtractorError as e:
                    if self._login_hint() in e.msg or bug_reports_message() not in e.msg:
                        raise  # Do not try fallback when tweet is expected to be unavailable
                    self.report_warning(e.msg, video_id=twid)
                    self.report_warning('Falling back to syndication endpoint; some metadata may be missing')
                    status = self._download_json(
                        'https://cdn.syndication.twimg.com/tweet-result', twid, 'Downloading syndication JSON',
                        headers={'User-Agent': 'Googlebot'}, query={'id': twid})
Member


might be worth splitting this into a function, but not mandatory
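
One way the suggested split could look, as a standalone sketch rather than the code that was merged: the primary call and the fallback are passed in as callables, and the fallback runs only when the failure is not an expected one. The `ExtractorError` class below is a simplified stand-in for yt-dlp's, and the `expected` flag replaces the message checks (`_login_hint()` and `bug_reports_message()`) used in the diff; `extract_status`, `broken_graphql` and `syndication` are hypothetical names.

```python
class ExtractorError(Exception):
    """Simplified stand-in for yt-dlp's ExtractorError (assumption)."""

    def __init__(self, msg, expected=False):
        super().__init__(msg)
        self.msg = msg
        self.expected = expected


def extract_status(twid, fetch_primary, fetch_fallback, warn=print):
    """Try the primary API first; fall back only on unexpected failures."""
    try:
        return fetch_primary(twid)
    except ExtractorError as e:
        if e.expected:
            raise  # Tweet is expected to be unavailable; a fallback would not help
        warn(f'{twid}: {e.msg}')
        warn('Falling back to syndication endpoint; some metadata may be missing')
        return fetch_fallback(twid)


# Hypothetical fetchers standing in for the GraphQL and syndication requests
def broken_graphql(twid):
    raise ExtractorError('Unable to download JSON metadata')


def syndication(twid):
    return {'id_str': twid}


warnings = []
status = extract_status('665052190608723968', broken_graphql, syndication,
                        warn=warnings.append)
```

Passing the fetchers as arguments keeps the decision logic testable without any network access.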

@bashonly
Member Author

bashonly commented Jul 20, 2023

The issue with the guest token was caused by a core regression (fixed by #7648), not by any change that Twitter made. I only realized this last night after I discovered the regression. I do think some of the changes in this PR are still valuable, like checking for protected tweets and adding the syndication fallback. Should I revert the unnecessary changes and continue working with this PR as a site-enhancement?

@pukkandan
Member

Sure. Go with whichever impl you think is better

@bashonly bashonly marked this pull request as draft July 20, 2023 17:42
@bashonly bashonly changed the title [ie/twitter] Fix guest token extraction [ie/twitter] Add fallback, improve error handling Jul 20, 2023
@bashonly bashonly marked this pull request as ready for review July 22, 2023 02:29
@bashonly bashonly added the pending-review label (PR needs a review) Jul 22, 2023
@pukkandan pukkandan removed the pending-review label (PR needs a review) Jul 28, 2023
@bashonly bashonly merged commit 6014355 into yt-dlp:master Jul 29, 2023
13 checks passed
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
@bashonly bashonly deleted the fix/twitter-gt branch May 10, 2024 19:18