[extractor/twitter] Fix unauthenticated extraction #7476

Merged
bashonly merged 10 commits into yt-dlp:master from fix/twitter-noauth on Jul 5, 2023

Conversation

bashonly
Member

@bashonly bashonly commented Jun 30, 2023

In response to Twitter authwalling all tweets. The owner claims this is temporary, so maybe we should wait a bit to see if he is telling the truth before merging this.

TODO:

  • Update tests (done)
  • Include fix for Spaces? (separate PR)
  • Downgrade 'not all metadata' message severity: report_warning => to_screen (done)


Closes #7473

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Copilot Summary

馃 Generated by Copilot at 7674248

Summary

🐦🔄🚀

Update Twitter extractors to use new APIs and simplify code. This improves the reliability and consistency of downloading videos and audio from Twitter and Twitter Spaces, and reduces code complexity and duplication.

We're sailing on the Twitter sea, with JSON and GraphQL
We've left behind the legacy API, it was a bloody hell
We've simplified the code, we've used the traverse_obj tool
So heave away, me hearties, heave away and sing so cool

Walkthrough

  • Simplify the _call_api method of the TwitterBaseIE class and remove the guest token and retry logic (link)
  • Use the syndication JSON endpoint instead of the legacy API endpoint for extracting tweet information without authentication in the _real_extract method of the TwitterIE class (link)
  • Handle the possible 404 error from the syndication JSON endpoint and raise a login required error if the tweet is only available when logged in (link)
  • Import the urllib.error module to handle HTTP errors from the syndication JSON endpoint (link)
  • Handle the case where the media ID is not present in the syndication JSON response and try to extract it from the video URL in the extract_from_video_info function (link)
  • Use the traverse_obj helper function to access nested fields in the JSON responses more safely and concisely in the _real_extract method of the TwitterIE class and the extract_from_video_info function (link, link, link)
  • Use the mediaDetails field instead of the extended_entities field in the syndication JSON response, as the latter may not be complete or present (link)
  • Remove the check for 'adult content' in the GraphQL errors, as it is no longer relevant for the syndication JSON method (link)
  • Add a check for login status in the _real_extract method of the TwitterSpacesIE class and raise a login required error if not logged in, as the GraphQL API for Twitter Spaces requires authentication (link)
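As an illustration of the error-handling and traversal steps in the walkthrough, here is a minimal sketch. It is not the code from this PR: `LoginRequiredError`, `get_path`, `media_id_from_url`, `fetch_tweet` and the injected `download_json` callable are hypothetical names, `get_path` is only a simplified stand-in for yt-dlp's `traverse_obj`, and the field names under `mediaDetails` as well as the video URL shape are assumptions.

```python
import re
import urllib.error


class LoginRequiredError(Exception):
    """Hypothetical stand-in for the extractor's login-required error."""


def get_path(obj, *keys, default=None):
    """Walk nested dicts/lists, returning `default` if any step is missing.

    A simplified stand-in for traverse_obj, not its real implementation.
    """
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


def media_id_from_url(url):
    """Fall back to the numeric ID embedded in a video URL.

    Assumes URLs shaped like .../ext_tw_video/<digits>/... (an assumption).
    """
    match = re.search(r'_video/(\d+)/', url or '')
    return match.group(1) if match else None


def fetch_tweet(tweet_id, download_json):
    """Call the injected downloader and map a 404 to a login-required error."""
    try:
        return download_json(tweet_id)
    except urllib.error.HTTPError as e:
        if e.code == 404:
            raise LoginRequiredError(
                f'Tweet {tweet_id} may only be available when logged in') from e
        raise  # any other HTTP error is passed through unchanged
```

Passing the downloader in as a parameter keeps the sketch free of network access, so the 404 path can be exercised with a fake callable.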

@bashonly bashonly added the site-bug (Issue with a specific website) label Jun 30, 2023
Member

@pukkandan pukkandan left a comment


Include fix for Spaces?

It's fine to be in separate PR too

@bashonly bashonly marked this pull request as ready for review July 2, 2023 21:51
@kapitaali

Perfect fix, thank you

@gamer191
Collaborator

gamer191 commented Jul 5, 2023

Slightly off-topic, but I just noticed twitter videos can be downloaded by substituting the video id into https://twitter.com/i/videos/tweet/1675748123468267521 (discovered by reverse engineering discord)

@bashonly
Member Author

bashonly commented Jul 5, 2023

https://twitter.com/i/videos/tweet/1675748123468267521

I get a 403 ("bad guest token")

@gamer191
Collaborator

gamer191 commented Jul 5, 2023

I just realised that twitter's legacy API does work, if you change the authorisation token to
authorization: Bearer AAAAAAAAAAAAAAAAAAAAAIK1zgAAAAAA2tUWuhGZ2JceoId5GwYWU5GspY4%3DUq7gzFoCZs1QfwGoVdvSac3IniczZEYXIcDyumCauIXpcAPorE

@gamer191
Collaborator

gamer191 commented Jul 5, 2023

I get a 403 ("bad guest token")

Can you access that link in a browser?

@bashonly
Member Author

bashonly commented Jul 5, 2023

Can you access that link in a browser?

in browser is where I am getting the 403

@bashonly
Member Author

bashonly commented Jul 5, 2023

authorization: Bearer AAAAAAAAAAAAAAAAAAAAAIK1zgAAAAAA2tUWuhGZ2JceoId5GwYWU5GspY4%3DUq7gzFoCZs1QfwGoVdvSac3IniczZEYXIcDyumCauIXpcAPorE

This does work with the legacy API though. (& does not work with graphql.) Hmm
It has a rate-limit of 180 (per ??), but the limit is based on guest token, not IP
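To illustrate what a limit keyed on the guest token (rather than on IP) could mean for a client, here is a rough sketch. It only counts requests: the length of the rate-limit window is unknown in this thread, so time is not modelled at all. `GuestTokenBudget` and the `new_token` callable are hypothetical names, and how a guest token is actually obtained is outside the sketch.

```python
class GuestTokenBudget:
    """Hand out a guest token and rotate it once its request budget is spent.

    Assumes a fresh guest token comes with a fresh budget, which follows from
    the limit being per token rather than per IP. The window length is unknown,
    so this class never resets a budget on its own.
    """

    def __init__(self, new_token, limit=180):
        self._new_token = new_token  # hypothetical callable returning a fresh token
        self._limit = limit          # 180 is the figure mentioned in the comment above
        self._token = None
        self._used = 0

    def acquire(self):
        """Return a token with remaining budget and count one request against it."""
        if self._token is None or self._used >= self._limit:
            self._token = self._new_token()
            self._used = 0
        self._used += 1
        return self._token
```

A caller would request a token via `acquire()` before each legacy API call, so the 181st request transparently goes out under a new guest token.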

@gamer191
Collaborator

gamer191 commented Jul 5, 2023

This does work with legacy API though. Hmm

I found it in the network tab of https://twitter.com/i/videos/tweet/1675748123468267521

in browser is where I am getting the 403

What happens if you try to play https://discord.com/channels/807245652072857610/934024231098925078/1126126462630645820?

@gamer191
Collaborator

gamer191 commented Jul 5, 2023

This does work with the legacy API though. (& does not work with graphql.)

Is it useful if it doesn't work with graphql?

@bashonly
Member Author

bashonly commented Jul 5, 2023

Is it useful if it doesn't work with graphql?

GraphQL may be a lost cause for unauthenticated users, and the legacy API gives more metadata than the syndication endpoint, so yeah it is. It's just a question of which solution will be more stable going forward.

@pukkandan
Member

imo, let's merge this for now and wait for twitter to become stable

@bashonly
Member Author

bashonly commented Jul 5, 2023

imo, let's merge this for now and wait for twitter to become stable

Yeah I think so too. Moving back to legacy API for tweet extraction basically means we have to revert to before 147e62f, and then add additional rate-limit handling. So it will be a whole other PR anyways

edit: it is probably worth doing though, since currently twitter broadcasts are still authwalled (and I think they are still a thing)

@bashonly bashonly merged commit 4929643 into yt-dlp:master Jul 5, 2023
11 checks passed
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
@bashonly bashonly deleted the fix/twitter-noauth branch May 10, 2024 19:19
Labels
site-bug Issue with a specific website
Development

Successfully merging this pull request may close these issues.

[Twitter] Account is now required to download (Failed to parse JSON)
4 participants