[MRG] Add FAQ entry on using BeautifulSoup in spider callbacks #2048
Conversation
You just have to feed the response's body into a ``BeautifulSoup`` object
and extract whatever data you need from it.

Here's an example spider using ``lxml`` parser with BeautifulSoup API::
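A minimal sketch of what such a spider might look like (the spider name, start URL and item fields are assumptions reconstructed from the crawl log further down this thread, not necessarily the exact code proposed in the PR)::

    import scrapy
    from bs4 import BeautifulSoup

    class ExampleSpider(scrapy.Spider):
        # Hypothetical name and start URL, inferred from the log output quoted below.
        name = "bs4_example"
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            # Feed the response body into BeautifulSoup; "lxml" explicitly selects
            # the lxml HTML parser (the point debated in this thread).
            soup = BeautifulSoup(response.text, "lxml")
            yield {
                "url": response.url,
                "title": soup.title.string,
            }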
kmike
Jun 10, 2016
Member
Why is 'lxml' needed for the example? Does it work without 'lxml'?
redapple
Jun 10, 2016
Author
Contributor
It does work without it, yes.
kmike
Jun 10, 2016
Member
I think it is better to keep the example minimal; it may be worth adding a note about 'lxml' though - if I'm not mistaken, it is not the default (or is it?) because lxml is hard to install, but if Scrapy is installed then lxml is already installed.
redapple
Jun 10, 2016
Author
Contributor
Just tested, what you get without "lxml" is:
2016-06-10 19:17:34 [scrapy] DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
2016-06-10 19:17:34 [py.warnings] WARNING: /home/paul/.virtualenvs/scrapybs4.py3/lib/python3.5/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")
markup_type=markup_type))
2016-06-10 19:17:35 [scrapy] DEBUG: Scraped from <200 http://www.example.com/>
{'url': 'http://www.example.com/', 'title': 'Example Domain'}
so it works but it's not the cleanest output.
but I agree we can keep the example minimal, and maybe add an in-code comment about adding "lxml".
kmike
Jun 10, 2016
Member
That's not a good API by BS... Maybe we should keep 'lxml' then, but add a comment about it, and maybe a link to https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use
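One way that suggestion might look in the FAQ example (a hypothetical sketch of the callback, not the wording that ended up in the PR)::

    from bs4 import BeautifulSoup

    def parse(self, response):
        # Name the parser explicitly: Scrapy already depends on lxml, so "lxml"
        # is always available, and passing it avoids BeautifulSoup's "No parser
        # was explicitly specified" warning.
        # See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use
        soup = BeautifulSoup(response.text, "lxml")
        return {"url": response.url, "title": soup.title.string}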
@kmike, I've updated the entry. Reads better?
and extract whatever data you need from it.

Here's an example spider using BeautifulSoup API, with ``lxml`` as the HTML parser
(so you get the same parsing speed as with scrapy/parsel selectors)::
kmike
Jun 14, 2016
Member
Are you sure about that? I recall some old benchmark where BS+lxml was much slower than just lxml.
redapple
Jun 14, 2016
Author
Contributor
I have no idea. I just extrapolated from:
"If you can, I recommend you install and use lxml for speed."
redapple
Jun 14, 2016
Author
Contributor
I'll remove that line
Looks good, thanks @redapple!
See https://twitter.com/PackOsiris/status/741114216833241089 for motivation