[extractor/facebook] Improve extraction #30700

dirkf · 2022-02-28T14:41:13Z

Please follow the guide below

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
Covered the code with tests (note that PRs without tests will be REJECTED)
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

Except for the commit by @kikuyan who separately agreed to either this or the below, I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

This PR replaces #29796 which was orphaned when @kikuyan's account was deleted.

The PR makes the following fixes and improvements to the Facebook extractor:

improves the title and adds a description, preferring metadata in ld+json if present
adds another getter for attachments (from PR [extractor/facebook] fix extraction #30496, issue)
adds the SocialMediaPosting type as a fallback, since the ld+json VideoObject type seems to have disappeared since the original PR was made
treats pages that are analyzed as playlists but contain a single video as regular video pages, so that metadata is extracted properly (examples: tests 7, 9, 10)
skips TestDownload.test_Facebook_8 for now: the page cannot be parsed (the extractor reports ERROR: This video is only available for registered users.) although it can be opened in a browser.

The PR makes the following improvement to extractor/common.py:

adds SocialMediaPosting for author extraction in InfoExtractor._json_ld() (replaces PR [extractor/common] add SocialMediaPosting type to _json_ld() #30513).
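
As a rough illustration of the ld+json fallback described above, here is a minimal standalone sketch. It is not the code from this PR or from InfoExtractor._json_ld(): the function name extract_ld_metadata, the regex, and the mapping from schema.org properties (headline, name, articleBody, description, author) to output keys are all assumptions made for the example.

```python
import json
import re

# Hypothetical pattern for <script type="application/ld+json"> blocks.
LD_JSON_RE = r'(?is)<script[^>]+type=(["\'])application/ld\+json\1[^>]*>(?P<json>.+?)</script>'


def extract_ld_metadata(webpage, preferred_types=('VideoObject', 'SocialMediaPosting')):
    """Return metadata from the first ld+json item of a preferred @type.

    Types are tried in order, so SocialMediaPosting is only used when no
    VideoObject is present on the page.
    """
    candidates = {}
    for m in re.finditer(LD_JSON_RE, webpage):
        try:
            data = json.loads(m.group('json'))
        except ValueError:
            continue  # ignore malformed JSON blocks
        for item in (data if isinstance(data, list) else [data]):
            if isinstance(item, dict) and isinstance(item.get('@type'), str):
                # keep the first item seen for each type
                candidates.setdefault(item['@type'], item)

    for type_ in preferred_types:
        item = candidates.get(type_)
        if item is None:
            continue
        author = item.get('author')
        if isinstance(author, dict):
            author = author.get('name')
        return {
            # assumed mapping of schema.org properties to metadata fields
            'title': item.get('headline') or item.get('name'),
            'description': item.get('articleBody') or item.get('description'),
            'uploader': author,
        }
    return {}
```

With a page that only carries a SocialMediaPosting item, the sketch still yields the author name as the uploader; when both types are present, the VideoObject item wins.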

Resolves #29421, resolves #23627, resolves #23180, resolves #14156.
Resolves #30472, resolves #30474, resolves #30650, resolves #30681.

Closes #29796 (superseded)
Closes #30496 (superseded)
Closes #30513 (superseded).


dirkf · 2022-04-27T13:12:51Z

The extractor does this:

            thumbnail = html_search_meta_non_empty(
                ['og:image', 'twitter:image'], webpage, 'thumbnail', default=None)

So it's looking for <meta (itemprop|name|property|id|http-equiv)=... content=...>
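
To make the lookup concrete, here is a simplified standalone sketch of a "first non-empty <meta> content" search. It is not the PR's html_search_meta_non_empty nor youtube-dl's _html_search_meta: the function name search_meta_non_empty is hypothetical, and the regex assumes the key attribute appears before content within the tag, which a real implementation would not require.

```python
import re


def search_meta_non_empty(names, webpage, default=None):
    """Return the first non-empty content value among the named <meta> tags."""
    for name in names:
        # Simplification: key attribute must precede the content attribute.
        pattern = (
            r'<meta\b[^>]*?\b(?:itemprop|name|property|id|http-equiv)=(["\'])%s\1'
            r'[^>]*?\bcontent=(["\'])(?P<content>.*?)\2' % re.escape(name))
        for m in re.finditer(pattern, webpage, re.S):
            content = m.group('content').strip()
            if content:  # skip tags whose content is empty
                return content
    return default
```

Names are tried in the order given, so an empty og:image tag falls through to twitter:image rather than producing an empty thumbnail URL.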

ranelpadon · 2023-02-02T06:58:01Z

Facebook video downloading is failing now with an error that has already been reported in several related/duplicate issues:

ERROR: Cannot parse data

But it was working fine a few days ago. Hopefully this PR can be merged soon, since I badly need the fix too. Thanks :)

dirkf · 2023-03-17T22:39:47Z

Are you saying that this happens with the PR code? Or (as I hope) that the PR code is still valid and fixes "Cannot parse data"?

That message basically means that the extractor tried all the tactics it knows to extract from the page and none worked.
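
The behaviour described here can be pictured with a small sketch, under the assumption that each tactic is a function returning extracted data or nothing. The names extract_video_data and ExtractorError are stand-ins defined for this example and do not reproduce the extractor's actual code.

```python
class ExtractorError(Exception):
    """Stand-in for youtube-dl's error type, defined here so the sketch runs alone."""


def extract_video_data(webpage, tactics):
    """Try each extraction tactic in turn and return the first usable result."""
    for tactic in tactics:
        data = tactic(webpage)
        if data:
            return data
    # every known tactic came back empty
    raise ExtractorError('Cannot parse data')
```

So the error says nothing about which page structure was expected, only that none of the known ones matched.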

kikuyan and others added 3 commits February 27, 2022 13:41

[extractor/facebook] Improve metadata extraction

ab2df90

* add another data structure for video extraction
* modify metadata extraction due to site change

[common] Extract author from SocialMediaPosting in ld+json

078f268

[extractor/facebook] Update extraction improvements and fix tests

ecd14c8

* avoid crashing in parse_attachment() on invalid attachment
* ignore empty results in <meta> search

dirkf mentioned this pull request Apr 19, 2022

[Facebook] Cannot parse data: unable to download #30872

Open

6 tasks

[Facebook] Check for redirection to URL with login, or not a permalink, or not including the ID

c555e5e

dirkf mentioned this pull request Apr 22, 2022

Unable to download Facebook videos #30473

Closed

dirkf force-pushed the master branch from 01bf89e to 4c6fba3 Compare August 26, 2022 07:51

dirkf mentioned this pull request Aug 29, 2022

Facebook video download problem #31205

Closed

5 tasks

dirkf mentioned this pull request Oct 9, 2022

Facebook download port #31284

Closed

dirkf mentioned this pull request Mar 17, 2023

Error with Facebook video #31870

Closed
