[MRG] Add FAQ entry on using BeautifulSoup in spider callbacks #2048

Merged
merged 3 commits into scrapy:master from redapple:bs4-faq on Jun 14, 2016

@redapple redapple commented Jun 10, 2016

@codecov-io codecov-io commented Jun 10, 2016

Current coverage is 83.32%

Merging #2048 into master will decrease coverage by 0.01%

Powered by Codecov. Last updated by b7925e4...1ff9a48

You just have to feed the response's body into a ``BeautifulSoup`` object
and extract whatever data you need from it.

Here's an example spider using ``lxml`` parser with BeautifulSoup API::

kmike Jun 10, 2016
Member

Why is 'lxml' needed for the example? Does it work without 'lxml'?

redapple Jun 10, 2016
Author Contributor

It does work without, yes

kmike Jun 10, 2016
Member

I think it is better to keep the example minimal; it may be worth adding a note about 'lxml' though. If I'm not mistaken, it is not the default (or is it?) because lxml is hard to install, but if Scrapy is installed then lxml is already installed.

redapple Jun 10, 2016
Author Contributor

Just tested, what you get without "lxml" is:

2016-06-10 19:17:34 [scrapy] DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
2016-06-10 19:17:34 [py.warnings] WARNING: /home/paul/.virtualenvs/scrapybs4.py3/lib/python3.5/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))

2016-06-10 19:17:35 [scrapy] DEBUG: Scraped from <200 http://www.example.com/>
{'url': 'http://www.example.com/', 'title': 'Example Domain'}

So it works, but it's not the cleanest output.
But I agree we can keep the example minimal, and maybe add an in-code comment about adding "lxml".
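A small sketch of the point above: naming the parser explicitly avoids the "No parser was explicitly specified" warning. The helper name `parse_title` and the sample markup are hypothetical, and `"html.parser"` is used here only so the sketch needs nothing beyond `beautifulsoup4`; the thread itself suggests `"lxml"`.

```python
import warnings

from bs4 import BeautifulSoup

# Sample markup is an assumption for illustration.
MARKUP = "<html><head><title>Example Domain</title></head><body></body></html>"


def parse_title(markup, parser="html.parser"):
    """Return the <title> text and any 'no parser specified' warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Passing the parser name explicitly is what silences the warning.
        soup = BeautifulSoup(markup, parser)
    guessed = [
        w for w in caught
        if "No parser was explicitly specified" in str(w.message)
    ]
    return soup.title.string, guessed


title, guessed = parse_title(MARKUP)
```

Swapping in `parser="lxml"` should behave the same way as long as lxml is installed, which is normally the case wherever Scrapy is.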

kmike Jun 10, 2016
Member

That's not a good API on BS's part. Maybe we should keep 'lxml' then, but add a comment about it, and maybe a link to https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use

@redapple redapple commented Jun 14, 2016

@kmike, I've updated the entry. Does it read better?

@redapple redapple changed the title from "Add FAQ entry on using BeautifulSoup in spider callbacks" to "[MRG] Add FAQ entry on using BeautifulSoup in spider callbacks" Jun 14, 2016

and extract whatever data you need from it.

Here's an example spider using BeautifulSoup API, with ``lxml`` as the HTML parser
(so you get the same parsing speed as with scrapy/parsel selectors)::

kmike Jun 14, 2016
Member

Are you sure about that? I recall some old benchmark where BS+lxml was much slower than just lxml.

redapple Jun 14, 2016
Author Contributor

I have no idea. I just extrapolated from this line in https://www.crummy.com/software/BeautifulSoup/bs4/doc/:

"If you can, I recommend you install and use lxml for speed."

redapple Jun 14, 2016
Author Contributor

I'll remove that line

@kmike kmike merged commit 80c296e into scrapy:master Jun 14, 2016
2 checks passed
codecov/patch 100% of diff hit (target 100%)
continuous-integration/travis-ci/pr The Travis CI build passed
@kmike kmike commented Jun 14, 2016

Looks good, thanks @redapple!

@redapple redapple deleted the redapple:bs4-faq branch Jul 8, 2016

3 participants