Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A model, well-commented extractor #11968

Open
johnhawkinson opened this issue Feb 4, 2017 · 4 comments
Open

A model, well-commented extractor #11968

johnhawkinson opened this issue Feb 4, 2017 · 4 comments
Labels

Comments

@johnhawkinson
Copy link
Contributor

@johnhawkinson johnhawkinson commented Feb 4, 2017

Since I'm not intimately familiar with the youtube-dl codebase, when I make a change to an extractor I often want to review other extractors to see how things have been handled previously. But oftentimes, it's really not clear which of several choices is better, what is favored, and what is more elegant.

If I had time to read through every single extractor I could make those determinations, but with 600+ of them, that's kind of daunting. Oftentimes I'll grep for a function and start looking at how abc.py and abcnews.py do it, since they are first alphabetically. But that's perhaps not a great plan.

It would be nice if there were a few well-written sample or "model" extractors that were highlighted as Best Practices that we could look at and review. And if those were clearly identified somehow, so I know where to look (maybe such model extractors exist already, I'm just not sure what to be looking at?). All things being equal, it'd be great if they were at the head of the alphabet, too :)

Thanks.

Also, the youtube-dl coding style seems to discourage comments. Maybe when you're very familiar with the code and the functions in common.py they don't seem so important, but for someone new to the codebase, they can help a lot. So it would be nice if any such model extractors could also be well-commented, or at least more verbosely commented than is typical in the codebase.

@yan12125
Copy link
Collaborator

@yan12125 yan12125 commented Feb 4, 2017

That's a great suggestion. However, in youtube-dl there's no uniform coding convention. Coding styles change fast with time, so new or recent changed extractors looks quite different than earlier extractors. (I would rather say my early contributions are like a shit :D) To make things worse, coding styles between major developers are different as they may be involved in other Python projects. I don't think such a model extractor can really exist. If you really need well-written extractors, consider those who are written mostly by major developers (dstftw, phihag, jaimeMF, remitamine and me).

If you really need concrete examples, you may consider to have a look into litv.py, bilibili.py and niconico.py - those are my recent changes.

About comments:

It's also a great idea. I prefer to document functions where they are defined. At least InfoExtractor and YoutubeDL should be documented. The former is the base class for all extractors and the latter is the public API of the youtube_dl module. I imagine there's something like Java's javadoc, and I've heard pydoc, too, but I'm not sure is it suitable. I may try to add some comments to functions in aforementioned classes and see what happened.

@yan12125 yan12125 added the request label Feb 4, 2017
@johnhawkinson
Copy link
Contributor Author

@johnhawkinson johnhawkinson commented Feb 4, 2017

Model extractors

I'm not terribly worried about uniform coding conventions — several choices would be fine, even if they're conflicting. Multiple model extractors would be fine. And the point here isn't to use to critique submissions and fail them because of style; it is intended as an aid to developers to figure out recommendations that might make sense for them, and to pick and choose them.

The particular boring and pedestrian example this morning was finding an example where match_regex was used with a list, and whether there were any particular gotchas I shouold worry about, or even how best to handle the line breaks for long patterns. I followed the style from audimedia.py rather than your bilibili.py (again, alphabetical). But this was really a boring whitespace issue, not something substantive.

Still, it would be helpful if those concrete examples were listed somewhere, maybe CONTRIBUTING.md?

Comments

Comments and documentation strings are different, and they are both useful and compliment each other. Indeed, pydoc does assemble the documentation strings into HTML and the results have been useful to me.

But that's not the same as showing the narrative for how an extractor should be written and the reasonable choices, and maybe why you should call _html_search_regex() instead of _search_regex(); those kinds of things are hard to get from the docstrings.

Really, the difference between comments in example code and docstrings is the same as he difference between a User's Guide and a Reference Manual. Both are useful, but sometimes reading function definitions in alphabetical order isn't the best way to learn.

@yan12125
Copy link
Collaborator

@yan12125 yan12125 commented Feb 4, 2017

The particular boring and pedestrian example this morning was finding an example where match_regex was used with a list

A list of available APIs should solve the problem. I guess pydoc or alternatives can generate HTML documents just like docs.python.org or readthedocs.org. New comers can read them first.

how best to handle the line breaks for long patterns

For non-functionality styles, there's much freedom!

Still, it would be helpful if those concrete examples were listed somewhere, maybe CONTRIBUTING.md?

I guess problems will be gone when API documents are (almost) complete.

those kinds of things are hard to get from the docstrings.

It may look like:

_search_regex(): search for a pattern or a list of patterns from a given webpage
_html_search_regex(): like _search_regex(), but used for searching patterns involving HTML codes

Documents are enought, aren't they?

reading function definitions in alphabetical order isn't the best way to learn

I guess some tools can put functions into groups? docs.python.org do that very well

Of course a user guide is also helpful. It can contain primary stages for writing an extractor. Common functions like _html_search_regex and _search_regex will be mentioned in descriptions.

@johnhawkinson
Copy link
Contributor Author

@johnhawkinson johnhawkinson commented Feb 4, 2017

model extractors

Still, it would be helpful if those concrete examples were listed somewhere, maybe CONTRIBUTING.md?

I guess problems will be gone when API documents are (almost) complete.

I'm pretty skeptical! But it's easy to remove them later, so let's not let the "best be the enemy of the good."

docstrings

It may look like:

_search_regex(): search for a pattern or a list of patterns from a given webpage
_html_search_regex(): like _search_regex(), but used for searching patterns involving HTML codes

Documents are enough, aren't they?

I think they are not! Again, no matter how good the documentation of the function is, that is not the same as seeing an actual example of how it is used. And it's not typical to have examples of usage within docstrings (is it?).

I guess some tools can put functions into groups? docs.python.org do that very well

Of course a user guide is also helpful. It can contain primary stages for writing an extractor. Common functions like _html_search_regex and _search_regex will be mentioned in descriptions.

Sure. But again, reading the purpose of a function and knowing how it may be invoked is not the same as seeing good examples of how to use it. All of these things are useful. But where we already have good examples among the 600+ extractors, it would be really helpful to have a way to find the good ones.

function naming

So, there is already a good docstring for _search_regex:

    def _search_regex(self, pattern, string, name, default=NO_DEFAULT, fatal=True, flags=0, group=None):
        """
        Perform a regex search on the given string, using a single or a list of
        patterns returning the first matching group.
        In case of failure return a default value or raise a WARNING or a
        RegexNotFoundError, depending on fatal, specifying the field name.
        """

But strangely, it doesn't show up in the documentation that pydoc generates (youtube_dl.extractor.common.html). I think this is because of the leading underscore.

Which leaves me to wonder, why do these functions have leading underscores at all?

PEP 8 says:

 Use one leading underscore only for non-public methods and instance variables. 

And sure enough, pydoc excludes them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.