Cover https://michael-shub.github.io/curl2scrapy/ in the documentation #4206
Comments
@Gallaecio something something as simple as #4221?
Do you really want to start covering 3rd-party tools in the official Scrapy docs? I'm not opposed to this here, but just a warning about the potential minefield if others request inclusion of their Scrapy tools into the docs, following this precedent. Then you might want to have some rules about what is acceptable and what is not (3rd-party code repos of middlewares, etc.? Links to helper code like scrapy-inline-callbacks? shub competitor websites with Scrapy tools like the one above?) and who gets to decide what is allowed and what isn't.

Seeing how the original request includes a link to https://github.com/croqaz/awesome-scrapy, that might be a better location for it? (Or perhaps even have a similar link-list project under the github.com/scrapy/ namespace?)
Yes, I would love to have 3rd-party open source tools and libraries that are commonly used along with Scrapy covered in the parts of the documentation where they make sense. In my vision of the Scrapy documentation, it would not explain how to “use Scrapy”, but rather how to “do web scraping using Scrapy”, and that involves using 3rd-party libraries and tools. I think users should discover things like Dateparser, Price Parser, Spidermon, etc. from the Scrapy documentation.

I am aware it is a potential minefield, but I hope that if it comes to that in specific scenarios, we can find ways to handle them, and in worst-case scenarios have the documentation link to a resource like Stack Overflow or Software Recommendations.

In this case, I would really like Scrapy users reading the documentation to find out that there is a free web service out there to transform cURL command lines into Scrapy code. And if a new, similar service later pops up that is arguably better than the current one, we can replace it in the documentation. And if that is not clear-cut, we can open a discussion in Software Recommendations and point to that as a source for services that convert cURL command lines into Scrapy code.
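As a minimal sketch of that “do web scraping using Scrapy” vision, assuming the `dateparser` and `price-parser` packages from PyPI (the `clean_item` helper and its field names are hypothetical, not something Scrapy ships), post-processing scraped values could look like this:

```python
# Sketch: normalizing raw scraped strings with 3rd-party libraries.
# Assumes `pip install dateparser price-parser`; field names are hypothetical.
import dateparser
from price_parser import Price


def clean_item(item):
    """Turn raw strings produced by a spider into structured values."""
    published = dateparser.parse(item["published"])  # e.g. "3 days ago" -> datetime
    price = Price.fromstring(item["price"])          # e.g. "22,90 €" -> amount + currency
    return {
        **item,
        "published": published.isoformat() if published else None,
        "price_amount": float(price.amount) if price.amount is not None else None,
        "price_currency": price.currency,
    }


print(clean_item({"published": "3 days ago", "price": "22,90 €"}))
```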
Oh, cool, I think a write-up like this would be nice. Let's discuss this, but that's off-topic and not related to this issue, so don't block a PR on my thoughts on the matter. Just to add that first.

I think documentation in the style of "do web scraping using Scrapy", covering the whole ecosystem, would be great. But that's a high-level view which people who know the basics of scraping, and just want the gist of Scrapy, wouldn't appreciate having to wade through. (Like me, when I just want to refresh my memory, maybe.) So I'd love it if it could be a different project: perhaps a GitHub Pages repo in the style of a book or tutorial, for the whole ecosystem as you say, but decoupled from the Scrapy API docs. They could then reference each other wherever sensible ("for more in-depth examples, see here" / "for the class API, see over there"). Perhaps Scrapy has become big enough for that split to make sense now.

I also think Scrapy's docs are starting to get a bit overloaded with all that extra info, and confuse people who are not already familiar with the architecture/structure of Scrapy. Of course there can be any number of reasons people might have trouble understanding the current docs, from language barrier to attention span, but I think keeping the core "API" docs succinct would ease the burden for all of them, and help to get an overall picture before diving into the details.
Personally, I would prefer to keep such content within the Scrapy documentation, so that it is updated in the same pull request as the affected Scrapy parts when they change. I think a new documentation section between First steps and Basic concepts would be a good place, and people familiar with web scraping could just skip that section.

As for the documentation confusing people, I think the core issue is not that there is too much information, but rather that the documentation needs to improve how it introduces complexity. I have on my personal to-do list the goal of going through the documentation pages from the beginning with the perspective of someone who knows nothing about Scrapy or asynchronous programming and only has a basic knowledge of Python, and making sure pages do not assume knowledge that users may not have yet, including things that are only covered in later pages of our documentation. I think a section like the one I'm proposing, about web scraping in general, could help a lot in that regard; topics like https://docs.scrapy.org/en/latest/topics/dynamic-content.html can really help people starting with both Scrapy and web scraping. But I have too many documentation-related pull requests open at the moment (#3688, #3706, #4039, #4090, #4192, #4310, #4399), so I have stopped working on improving how the documentation introduces complexity for the time being.

A section about web scraping in general could also help keep the API docs succinct, by allowing us to move documentation that is now in the API docs into those new web scraping topics, which people looking for reference documentation can simply ignore.
See #3991 (comment)
In the parts of the documentation where we currently cover `Request.from_curl`, we may want to mention this online tool as well.
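For reference, a minimal sketch of the `Request.from_curl` usage this documentation would sit next to, assuming Scrapy 1.8+ (the spider name, URL, and header below are illustrative placeholders for a cURL command copied from the browser's "Copy as cURL"):

```python
import json

import scrapy


class CurlExampleSpider(scrapy.Spider):
    # Illustrative spider: builds the initial request directly from a
    # cURL command line instead of a plain URL.
    name = "curl_example"

    def start_requests(self):
        yield scrapy.Request.from_curl(
            "curl 'https://quotes.toscrape.com/api/quotes?page=1' "
            "-H 'Accept: application/json'"
        )

    def parse(self, response):
        # The endpoint above returns JSON; yield one item per quote.
        for quote in json.loads(response.text)["quotes"]:
            yield {"author": quote["author"]["name"], "text": quote["text"]}
```

A web tool like curl2scrapy covers the complementary case: pasting a cURL command and getting back equivalent spider code, rather than parsing the command at runtime.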