Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[extractor/facebook] Improve extraction #30700

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

dirkf
Copy link
Contributor

@dirkf dirkf commented Feb 28, 2022

Please follow the guide below


Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • Except for the commit by @kikuyan who separately agreed to either this or the below, I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

This PR replaces #29796 which was orphaned when @kikuyan's account was deleted.

The PR makes the following fixes and improvements to the Facebook extractor:

  • improves title, add description, preferring metadata in ld+json if present
  • adds another getter for attachments (from PR [extractor/facebook] fix extraction #30496, issue)
  • as ld+json VideoObject type seems to be gone since PR was first made, adds SocialMediaPosting type as a fallback
  • when web pages are analyzed as playlists but consist of single video, treats the pages as regular video pages to extract metadata well (examples: tests 7, 9, 10)
  • makes test skip for now: TestDownload.test_Facebook_8: the page cannot be parsed (get ERROR: This video is only available for registered users.) while it can be opened by a browser.

The PR makes the following improvement to the extractor/common.py:

Resolves #29421, resolves #23627, resolves #23180, resolves #14156.
Resolves #30472, resolves #30474, resolves #30650, resolves #30681.

Closes #29796 (superseded)
Closes #30496 (superseded)
Closes #30513 (superseded).

kikuyan and others added 3 commits February 27, 2022 13:41
* add another data structure for video extraction
* modify metadata extraction due to site change
* avoid crashing in parse_attachment() on invalid attachment
* ignore empty results in <meta> search
@dirkf
Copy link
Contributor Author

dirkf commented Apr 27, 2022

The extractor does this

            thumbnail = html_search_meta_non_empty(
                ['og:image', 'twitter:image'], webpage, 'thumbnail', default=None)

So it's looking for <meta (itemprop|name|property|id|http-equiv)=... content=...>

@ranelpadon
Copy link

Facebook video downloading has issue now which was already filed before in several related/duplicated issues:

ERROR: Cannot parse data

But it's working fine few days ago. Hopefully, this PR could be merged soon, since I badly needed the fix also. Thanks :)

@dirkf
Copy link
Contributor Author

dirkf commented Mar 17, 2023

Are you saying that this happens with the PR code? Or (as I hope) that the PR code is still valid and fixes "Cannot parse data"?

That message basically means that the extractor tried all the tactics it knows to extract from the page and none worked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment