Document how to scrape JavaScript-rendered webpages
Gallaecio committed Mar 29, 2019
1 parent 2fd8b7c commit d2143dc
Showing 2 changed files with 119 additions and 0 deletions.
4 changes: 4 additions & 0 deletions docs/index.rst
@@ -158,6 +158,7 @@ Solving specific problems
topics/practices
topics/broad-crawls
topics/developer-tools
topics/javascript
topics/leaks
topics/media-pipeline
topics/deploy
@@ -183,6 +184,9 @@ Solving specific problems
:doc:`topics/developer-tools`
Learn how to scrape with your browser's developer tools.

:doc:`topics/javascript`
Access page data that is loaded dynamically using JavaScript.

:doc:`topics/leaks`
Learn how to find and get rid of memory leaks in your crawler.

115 changes: 115 additions & 0 deletions docs/topics/javascript.rst
@@ -0,0 +1,115 @@
.. _topics-javascript:

=====================================
Scraping JavaScript-rendered webpages
=====================================

Some webpages show the desired data when you load them in a web browser;
however, when you download them using Scrapy, the desired data :ref:`is not in
the expected location <topics-livedom>`.

These webpages use JavaScript_ to place the desired data in its final location
at run time.

To extract the desired data, you must first find its source location.

.. _topics-parsing-javascript:

Parsing JavaScript code
=======================

First, you should inspect the HTML contents of the webpage
(:attr:`response.text <scrapy.http.TextResponse.text>`). The desired data may
be within a ``<script/>`` element, hardcoded in JavaScript.

If that is the case, you first need to extract the JavaScript code within that
``<script/>`` element using :ref:`selectors <topics-selectors>`.

Then you can extract the data from that code. How you do so depends on how the
data is defined.

You might be able to use a `regular expression`_ to extract the desired data in
JSON format, which you can then parse with Python’s json_ module.

For example, if the JavaScript code contains a separate line like
``var data = {"field": "value"};``, you can extract that data as follows::

>>> import json
>>> pattern = r'\bvar\s+data\s*=\s*(\{.*?\})\s*;\s*\n'
>>> json_data = response.css('script::text').re_first(pattern)
>>> json.loads(json_data)
{'field': 'value'}

Otherwise, you may use js2xml_ to convert the JavaScript code into an XML
document that you can parse using :ref:`selectors <topics-selectors>`.

For example::

>>> import js2xml
>>> import lxml.etree
>>> from parsel import Selector
>>> javascript = response.css('script::text').get()
>>> xml = lxml.etree.tostring(js2xml.parse(javascript), encoding='unicode')
>>> selector = Selector(text=xml)
>>> selector.css('var[name="data"]').get()
'<var name="data"><object><property name="field"><string>value</string></property></object></var>'
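
The converted document can then be queried further with the same selector
machinery. For example, continuing from the output above, the value of the
``field`` property can be pulled out with a nested query::

>>> selector.css('var[name="data"] string::text').get()
'value'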

.. _topics-reproducing-ajax:

Reproducing AJAX requests
=========================

If you cannot find the desired data in the HTML contents of the webpage, then
the webpage is probably using JavaScript to perform one or more additional HTTP
requests to fetch the desired data, a technique commonly known as AJAX_.

Use the :ref:`network tool <topics-network-tool>` of your web browser to find
out which requests receive the desired data, and reproduce them in Scrapy.

It might be enough to yield a :class:`~scrapy.http.Request` with the same HTTP
method and URL. However, you may also need to reproduce the body, headers and
form parameters (see :class:`~scrapy.http.FormRequest`) of those requests.
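
For example, the following sketch reproduces a hypothetical search request
found with the network tool; the URL, form fields and headers below are
placeholders, not a real API::

import scrapy

class ApiSpider(scrapy.Spider):
    name = 'api_example'
    start_urls = ['https://example.com/search']

    def parse(self, response):
        # Reproduce the AJAX request that the page performs at run time,
        # as found with the network tool (hypothetical parameters).
        yield scrapy.FormRequest(
            'https://example.com/api/search',
            formdata={'query': 'foo', 'page': '1'},
            headers={'X-Requested-With': 'XMLHttpRequest'},
            callback=self.parse_api,
        )

    def parse_api(self, response):
        # Parse the response of the reproduced request; see below for
        # handling JSON responses.
        pass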

Responses to these requests often contain JSON data. When they do, use
Python’s json_ module to load that data from
:attr:`response.text <scrapy.http.TextResponse.text>`::

data = json.loads(response.text)
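
For example, a minimal callback along these lines could turn such a response
into items; the ``results``, ``name`` and ``price`` field names are
hypothetical and depend on the actual API::

import json

def parse_api(self, response):
    # The response body is a JSON document; adjust the field names
    # below to match the actual API response.
    data = json.loads(response.text)
    for result in data.get('results', []):
        yield {
            'name': result.get('name'),
            'price': result.get('price'),
        }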

.. _topics-javascript-rendering:

Pre-rendering JavaScript
========================

For webpages that use AJAX, reproducing their requests is the preferred way to
get the desired data. The extra effort is often worth the result: structured,
complete data with minimal parsing time and network transfer.

However, some AJAX requests can be quite hard to reproduce. Or you may need
something that no request can give you, such as a screenshot of a webpage as
seen in a web browser.

In these cases, use the Splash_ JavaScript-rendering service, along with
`scrapy-splash`_ for seamless integration.
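
For example, here is a minimal sketch of a spider that renders pages through
Splash; it assumes a Splash instance running on its default port and the
scrapy-splash downloader middlewares enabled in the project settings, as
described in the scrapy-splash README::

import scrapy
from scrapy_splash import SplashRequest

class QuotesJsSpider(scrapy.Spider):
    name = 'quotes_js'

    def start_requests(self):
        # Render the page in Splash before parsing it; 'wait' gives the
        # JavaScript on the page time to fill in the DOM.
        yield SplashRequest(
            'http://quotes.toscrape.com/js/',
            callback=self.parse,
            args={'wait': 0.5},
        )

    def parse(self, response):
        # The response now contains the JavaScript-rendered HTML.
        for text in response.css('div.quote span.text::text').getall():
            yield {'text': text}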

.. _topics-headless-browser:

Using a headless browser
========================

If you need something beyond what Splash offers, consider using a `headless
browser`_ instead.

The easiest way to use a headless browser with Scrapy is Selenium_, along with
`scrapy-selenium`_ for seamless integration.
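
For example, here is a minimal sketch using scrapy-selenium; it assumes the
Selenium driver is configured in the project settings (e.g.
``SELENIUM_DRIVER_NAME``), as described in the scrapy-selenium README::

import scrapy
from scrapy_selenium import SeleniumRequest

class BrowserSpider(scrapy.Spider):
    name = 'browser_example'

    def start_requests(self):
        # The request is performed in a real, headless browser, so the
        # response contains the JavaScript-rendered DOM.
        yield SeleniumRequest(
            url='http://quotes.toscrape.com/js/',
            callback=self.parse,
        )

    def parse(self, response):
        for text in response.css('div.quote span.text::text').getall():
            yield {'text': text}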


.. _AJAX: https://en.wikipedia.org/wiki/Ajax_%28programming%29
.. _headless browser: https://en.wikipedia.org/wiki/Headless_browser
.. _JavaScript: https://en.wikipedia.org/wiki/JavaScript
.. _js2xml: https://github.com/scrapinghub/js2xml
.. _json: https://docs.python.org/library/json.html
.. _regular expression: https://docs.python.org/library/re.html
.. _scrapy-selenium: https://github.com/clemfromspace/scrapy-selenium
.. _scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash
.. _Selenium: https://www.seleniumhq.org/
.. _Splash: https://github.com/scrapinghub/splash
