.. _topics-javascript:

=====================================
Scraping JavaScript-rendered webpages
=====================================

Some webpages show the desired data when you load them in a web browser;
however, when you download them using Scrapy, the desired data :ref:`is not in
the expected location <topics-livedom>`.

These webpages use JavaScript_ to place the desired data in its final location
at run time.

To extract the desired data, you must first find its source location.

.. _topics-parsing-javascript:

Parsing JavaScript code
=======================

First, you should inspect the HTML contents of the webpage
(:attr:`response.text <scrapy.http.TextResponse.text>`). The desired data may
be within a ``<script/>`` element, hardcoded in JavaScript.

If that is the case, you first need to extract the JavaScript code within that
``<script/>`` element using :ref:`selectors <topics-selectors>`.

Then you can extract the data from the JavaScript code. How you do that depends
on how the data is defined in the JavaScript code.

You might be able to use a `regular expression`_ to extract the desired data in
JSON format, which you can then parse with Python's json_ module.

For example, if the JavaScript code contains a separate line like
``var data = {"field": "value"};``, you can extract that data as follows::

    >>> import json
    >>> pattern = r'\bvar\s+data\s*=\s*(\{.*?\})\s*;\s*\n'
    >>> json_data = response.css('script::text').re_first(pattern)
    >>> json.loads(json_data)
    {'field': 'value'}
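The same extraction can be exercised outside a Scrapy shell with only the standard library; a minimal sketch, where the ``javascript`` string below stands in for the result of ``response.css('script::text').get()``:

```python
import json
import re

# Stands in for response.css('script::text').get() in a real spider.
javascript = 'var data = {"field": "value"};\n'

pattern = r'\bvar\s+data\s*=\s*(\{.*?\})\s*;\s*\n'
json_data = re.search(pattern, javascript).group(1)
data = json.loads(json_data)  # {'field': 'value'}
```

The non-greedy ``\{.*?\}`` group keeps the match from running past the first closing brace, which is why this pattern assumes the object sits on a single line.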

Otherwise, you may use js2xml_ to convert the JavaScript code into an XML
document that you can parse using :ref:`selectors <topics-selectors>`.

For example::

    >>> import js2xml
    >>> import lxml.etree
    >>> from parsel import Selector
    >>> javascript = response.css('script::text').get()
    >>> xml = lxml.etree.tostring(js2xml.parse(javascript), encoding='unicode')
    >>> selector = Selector(text=xml)
    >>> selector.css('var[name="data"]').get()
    '<var name="data"><object><property name="field"><string>value</string></property></object></var>'
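Because js2xml emits plain XML, its output can also be processed without parsel; for example, with the standard library's ``xml.etree.ElementTree`` applied to the output string shown above:

```python
import xml.etree.ElementTree as ET

# The XML that js2xml produces for ``var data = {"field": "value"};``,
# as shown in the example above.
xml = ('<var name="data"><object><property name="field">'
       '<string>value</string></property></object></var>')

root = ET.fromstring(xml)
# Drill down to the <string> node of the "field" property.
value = root.find('./object/property[@name="field"]/string').text  # 'value'
```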

.. _topics-reproducing-ajax:

Reproducing AJAX requests
=========================

If you cannot find the desired data in the HTML contents of the webpage, then
the webpage is probably using JavaScript to perform one or more additional HTTP
requests to fetch the desired data, a technique commonly known as AJAX_.

Use the :ref:`network tool <topics-network-tool>` of your web browser to find
out which requests receive the desired data, and reproduce them in Scrapy.

It might be enough to yield a :class:`~scrapy.http.Request` with the same HTTP
method and URL. However, you may also need to reproduce the body, headers and
form parameters (see :class:`~scrapy.http.FormRequest`) of those requests.

Responses to these requests are often JSON data. When they are, use Python's
json_ module to load this data from
:attr:`response.text <scrapy.http.TextResponse.text>`::

    data = json.loads(response.text)

.. _topics-javascript-rendering:

Pre-rendering JavaScript
========================

On webpages using AJAX, reproducing their requests is the preferred way to get
the desired data. The extra effort is often worth the result: structured,
complete data with minimum parsing time and network transfer.

However, sometimes it can be really hard to reproduce certain AJAX requests. Or
you may need something that no request can give you, such as a screenshot of a
webpage as seen in a web browser.

In these cases, use the Splash_ JavaScript-rendering service, along with
`scrapy-splash`_ for seamless integration.
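Enabling scrapy-splash is mostly a matter of project settings. The sketch below follows the scrapy-splash README, which remains the authoritative reference; the Splash URL assumes a locally running Splash instance:

```python
# settings.py (sketch; setting names per the scrapy-splash README)

SPLASH_URL = 'http://localhost:8050'  # assumes a local Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

Spiders then yield ``scrapy_splash.SplashRequest`` objects, instead of plain ``scrapy.Request``, for pages that need rendering.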

.. _topics-headless-browser:

Using a headless browser
========================

If you need something beyond what Splash offers, you might need to use a
`headless browser`_ instead.

The easiest way to use a headless browser with Scrapy is to use Selenium_,
along with `scrapy-selenium`_ for seamless integration.
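As with Splash, the integration is driven by project settings. A sketch following the scrapy-selenium README; the driver name and executable path are placeholders for your own setup:

```python
# settings.py (sketch; setting names per the scrapy-selenium README)

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/geckodriver'  # placeholder path
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # run the browser headlessly

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
```

Spiders then yield ``scrapy_selenium.SeleniumRequest`` instead of a plain ``scrapy.Request`` for pages that must be loaded in the browser.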

.. _AJAX: https://en.wikipedia.org/wiki/Ajax_%28programming%29
.. _headless browser: https://en.wikipedia.org/wiki/Headless_browser
.. _JavaScript: https://en.wikipedia.org/wiki/JavaScript
.. _js2xml: https://github.com/scrapinghub/js2xml
.. _json: https://docs.python.org/library/json.html
.. _regular expression: https://docs.python.org/library/re.html
.. _scrapy-selenium: https://github.com/clemfromspace/scrapy-selenium
.. _scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash
.. _Selenium: https://www.seleniumhq.org/
.. _Splash: https://github.com/scrapinghub/splash