.. _topics-javascript:

=====================================
Scraping JavaScript-rendered webpages
=====================================

Some webpages show the desired data when you load them in a web browser;
however, when you download them using Scrapy, the desired data :ref:`is not in
the expected location <topics-livedom>`.

These webpages use JavaScript_ to place the desired data in its final location
at run time.

To extract the desired data, you must first find its source location.

.. _topics-parsing-javascript:

Parsing JavaScript code
=======================

First, you should inspect the HTML contents of the webpage
(:attr:`response.text <scrapy.http.TextResponse.text>`). The desired data may
be within a ``<script/>`` element, hardcoded in JavaScript.

If that is the case, you first need to extract the JavaScript code within that
``<script/>`` element using :ref:`selectors <topics-selectors>`.

Then you can extract the data from the JavaScript code. How you do that depends
on how the data is defined in the JavaScript code.

You might be able to use a `regular expression`_ to extract the desired data in
JSON format, which you can then parse with Python's json_ module.

For example, if the JavaScript code contains a separate line like
``var data = {"field": "value"};``, you can extract that data as follows::

    >>> import json
    >>> pattern = r'\bvar\s+data\s*=\s*(\{.*?\})\s*;\s*\n'
    >>> json_data = response.css('script::text').re_first(pattern)
    >>> json.loads(json_data)
    {'field': 'value'}
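The same extraction can be exercised outside a Scrapy shell with only the standard library; a minimal sketch, where the ``javascript`` string below stands in for the result of ``response.css('script::text').get()``:

```python
import json
import re

# Stands in for response.css('script::text').get() in a real spider.
javascript = 'var data = {"field": "value"};\n'

pattern = r'\bvar\s+data\s*=\s*(\{.*?\})\s*;\s*\n'
json_data = re.search(pattern, javascript).group(1)
data = json.loads(json_data)  # {'field': 'value'}
```

The non-greedy ``\{.*?\}`` group keeps the match from running past the first closing brace, which is why this pattern assumes the object sits on a single line.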

Otherwise, you may use js2xml_ to convert the JavaScript code into an XML
document that you can parse using :ref:`selectors <topics-selectors>`.

For example::

    >>> import js2xml
    >>> import lxml.etree
    >>> from parsel import Selector
    >>> javascript = response.css('script::text').get()
    >>> xml = lxml.etree.tostring(js2xml.parse(javascript), encoding='unicode')
    >>> selector = Selector(text=xml)
    >>> selector.css('var[name="data"]').get()
    '<var name="data"><object><property name="field"><string>value</string></property></object></var>'
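Because js2xml emits plain XML, its output can also be processed without parsel; for example, with the standard library's ``xml.etree.ElementTree`` applied to the output string shown above:

```python
import xml.etree.ElementTree as ET

# The XML that js2xml produces for ``var data = {"field": "value"};``,
# as shown in the example above.
xml = ('<var name="data"><object><property name="field">'
       '<string>value</string></property></object></var>')

root = ET.fromstring(xml)
# Drill down to the <string> node of the "field" property.
value = root.find('./object/property[@name="field"]/string').text  # 'value'
```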

.. _topics-reproducing-ajax:

Reproducing AJAX requests
=========================

If you cannot find the desired data in the HTML contents of the webpage, then
the webpage is probably using JavaScript to perform one or more additional HTTP
requests to fetch the desired data, a technique commonly known as AJAX_.

Use the :ref:`network tool <topics-network-tool>` of your web browser to find
out which requests receive the desired data, and reproduce them in Scrapy.

It might be enough to yield a :class:`~scrapy.http.Request` with the same HTTP
method and URL. However, you may also need to reproduce the body, headers and
form parameters (see :class:`~scrapy.http.FormRequest`) of those requests.

Responses to these requests are often JSON data. When they are, use Python's
json_ module to load this data from
:attr:`response.text <scrapy.http.TextResponse.text>`::

    data = json.loads(response.text)

.. _topics-javascript-rendering:

Pre-rendering JavaScript
========================

On webpages using AJAX, reproducing their requests is the preferred way to get
the desired data. The extra effort is often worth the result: structured,
complete data with minimum parsing time and network transfer.

However, sometimes it can be really hard to reproduce certain AJAX requests. Or
you may need something that no request can give you, such as a screenshot of a
webpage as seen in a web browser.

In these cases, use the Splash_ JavaScript-rendering service, along with
`scrapy-splash`_ for seamless integration.
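Enabling scrapy-splash is mostly a matter of project settings. The sketch below follows the scrapy-splash README, which remains the authoritative reference; the Splash URL assumes a locally running Splash instance:

```python
# settings.py (sketch; setting names per the scrapy-splash README)

SPLASH_URL = 'http://localhost:8050'  # assumes a local Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

Spiders then yield ``scrapy_splash.SplashRequest`` objects, instead of plain ``scrapy.Request``, for pages that need rendering.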

.. _topics-headless-browser:

Using a headless browser
========================

If you need something beyond what Splash offers, you might need to use a
`headless browser`_ instead.

The easiest way to use a headless browser with Scrapy is to use Selenium_,
along with `scrapy-selenium`_ for seamless integration.
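As with Splash, the integration is driven by project settings. A sketch following the scrapy-selenium README; the driver name and executable path are placeholders for your own setup:

```python
# settings.py (sketch; setting names per the scrapy-selenium README)

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/geckodriver'  # placeholder path
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # run the browser headlessly

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
```

Spiders then yield ``scrapy_selenium.SeleniumRequest`` instead of a plain ``scrapy.Request`` for pages that must be loaded in the browser.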

.. _AJAX: https://en.wikipedia.org/wiki/Ajax_%28programming%29
.. _headless browser: https://en.wikipedia.org/wiki/Headless_browser
.. _JavaScript: https://en.wikipedia.org/wiki/JavaScript
.. _js2xml: https://github.com/scrapinghub/js2xml
.. _json: https://docs.python.org/library/json.html
.. _regular expression: https://docs.python.org/library/re.html
.. _scrapy-selenium: https://github.com/clemfromspace/scrapy-selenium
.. _scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash
.. _Selenium: https://www.seleniumhq.org/
.. _Splash: https://github.com/scrapinghub/splash