new API based on extraction steps #8757

Open
remitamine opened this issue Mar 4, 2016 · 0 comments

@remitamine remitamine commented Mar 4, 2016

The idea came after a discussion with @dstftw in PR #8439.
I find that a new API based on extraction steps can be used for more than the initial need (skipping protocols).
It will help:
For users:

  • skip extraction steps in general (suggested by @dstftw in the discussion #8439 (comment)).
  • extract more information by combining multiple sources.
  • speed up the --download-archive option for some extractors (the only information needed is the ie_key and the id, which the metadata step extracts; for some extractors they can even be extracted directly from the URL, with no other requests needed).
  • for playlists, it will be possible to start downloading as soon as the video list has been extracted from the page (useful for playlists with a lot of videos).
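
The --download-archive point could look roughly like this. It is only a minimal sketch: the `Example` extractor key, the URL pattern and the `id_step` / `in_archive` helpers are hypothetical names made up for illustration, and the archive is assumed to hold entries of the form "<lowercased ie_key> <id>".

```python
import re

# Hypothetical URL pattern for an imaginary site; not a real extractor.
_VALID_URL = re.compile(
    r'https?://(?:www\.)?example\.com/watch/(?P<id>[0-9a-z]+)')
IE_KEY = 'Example'  # hypothetical extractor key


def id_step(url):
    """'id' step: recover the video id from the URL alone, no request."""
    mobj = _VALID_URL.match(url)
    return mobj.group('id') if mobj else None


def in_archive(url, archive):
    """Return True if the video is already recorded in the archive set."""
    video_id = id_step(url)
    if video_id is None:
        # The id is not in the URL; a real implementation would fall
        # back to the (more expensive) metadata step here.
        return False
    return '%s %s' % (IE_KEY.lower(), video_id) in archive


archive = {'example abc123'}
print(in_archive('https://example.com/watch/abc123', archive))
print(in_archive('https://example.com/watch/zzz999', archive))
```

With this split, a video that is already in the archive is skipped before any page is downloaded.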

For development:

  • ability to combine info from multiple extractors.
    examples of this case:
    cbs, cbs news, cnet, fox, aenetworks: formats are extracted from multiple ThePlatform SMIL files.
    dcn, vevo: formats can be extracted from the site itself and from youtube.
    it would be simpler to do something like this (can't be done with the url type):
return {
    # info extracted from the current extractor
    'type': 'external',
    'sources': [{
        'ie': 'ThePlatform',
        'steps': [
            'metadata',
            'formats',
            'subtitles',
        ],
        'url': '',
        'priority': 5,
    }, {
        'ie': 'ThePlatform',
        'steps': [
            'formats',
            'subtitles',
        ],
        'url': '',
        'priority': 3,
    }, ...]
}

The priority can be useful for metadata, to declare the order in which attributes are overridden.
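
One way the priority could drive the merge is sketched below. This is an assumption, not a settled design: `merge_sources` is a hypothetical helper, a higher priority is assumed to win for scalar attributes, and 'formats' lists are assumed to be concatenated rather than overridden.

```python
def merge_sources(results):
    """Merge info dicts coming from several sources.

    `results` is a list of (priority, info_dict) pairs. Assumption: a
    higher priority wins when two sources provide the same attribute,
    while the list-valued 'formats' are concatenated.
    """
    merged = {}
    # Apply the lowest priority first so higher priorities override it.
    for _, info in sorted(results, key=lambda pair: pair[0]):
        for key, value in info.items():
            if key == 'formats':
                merged.setdefault('formats', []).extend(value)
            elif value is not None:
                merged[key] = value
    return merged


results = [
    (5, {'title': 'From source A', 'formats': [{'format_id': 'a'}]}),
    (3, {'title': 'From source B', 'description': 'only B has this',
         'formats': [{'format_id': 'b'}]}),
]
merged = merge_sources(results)
print(merged['title'])        # From source A
print(merged['description'])  # only B has this
```

Attributes that only a lower-priority source provides are kept, so the sources complement each other instead of one replacing the other.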

Possible steps that I can think of (for some sites, a separate request is needed to extract each of them):
id, metadata, formats, subtitles, comments, smil, m3u8, mpd, f4m.
A possible way to implement this:

  • register the steps to extract (the list can be modified by CLI options, by a previous extractor as in the example above, or by the extractor itself).
  • create a common method for every step and optionally override it in the extractors.
  • every method will receive two dicts: one contains information extracted by other steps and the other is the info_dict. The method extracts its information using the first dict, can add or update attributes in it, updates the info_dict with the related info, and returns both of them.
  • every step can declare its dependencies, which should be resolved first (most of the steps will need the metadata).
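
The four points above could fit together roughly as follows. This is a sketch under stated assumptions, not a proposal for the final API: `StepExtractor`, `STEP_DEPENDENCIES`, `resolve` and the `_step_*` method names are all hypothetical, and the step bodies fake their results instead of making requests.

```python
class StepExtractor(object):
    # Hypothetical declaration: step name -> steps it depends on.
    STEP_DEPENDENCIES = {
        'id': [],
        'metadata': ['id'],
        'formats': ['metadata'],
        'subtitles': ['metadata'],
    }

    def resolve(self, requested):
        """Return the requested steps plus dependencies, dependencies first."""
        ordered = []

        def visit(step, stack=()):
            if step in stack:
                raise ValueError('circular dependency: %s' % step)
            if step in ordered:
                return
            for dep in self.STEP_DEPENDENCIES[step]:
                visit(dep, stack + (step,))
            ordered.append(step)

        for step in requested:
            visit(step)
        return ordered

    def extract(self, url, requested):
        shared = {'url': url}  # information passed between steps
        info_dict = {}         # the result being built
        for step in self.resolve(requested):
            method = getattr(self, '_step_' + step)
            shared, info_dict = method(shared, info_dict)
        return info_dict

    # Common step methods; a real extractor would override these.
    def _step_id(self, shared, info_dict):
        info_dict['id'] = shared['url'].rstrip('/').split('/')[-1]
        return shared, info_dict

    def _step_metadata(self, shared, info_dict):
        shared['media_url'] = 'https://example.com/media/%s' % info_dict['id']
        info_dict['title'] = 'Video %s' % info_dict['id']
        return shared, info_dict

    def _step_formats(self, shared, info_dict):
        info_dict['formats'] = [{'url': shared['media_url'] + '.mp4'}]
        return shared, info_dict

    def _step_subtitles(self, shared, info_dict):
        info_dict['subtitles'] = {}
        return shared, info_dict


ie = StepExtractor()
print(ie.resolve(['formats']))  # ['id', 'metadata', 'formats']
print(ie.extract('https://example.com/watch/abc123', ['formats']))
```

Requesting only 'formats' pulls in 'id' and 'metadata' through the declared dependencies, while the 'subtitles' step is never run.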

I don't know if this is the best way to do it, so I opened this issue mainly for discussion.
