From 8a3b15eb91169ab262e4dca60105f56467ecd1ff Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Adri=C3=A1n=20Chaves?=
Date: Wed, 27 Mar 2019 08:50:33 +0100
Subject: [PATCH] Document how to select dynamically-loaded content

---
 docs/index.rst                  |   4 +
 docs/topics/dynamic-content.rst | 246 ++++++++++++++++++++++++++++++++
 2 files changed, 250 insertions(+)
 create mode 100644 docs/topics/dynamic-content.rst

diff --git a/docs/index.rst b/docs/index.rst
index cedde8f380e..6d5f9e77dae 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -158,6 +158,7 @@ Solving specific problems
    topics/practices
    topics/broad-crawls
    topics/developer-tools
+   topics/dynamic-content
    topics/leaks
    topics/media-pipeline
    topics/deploy
@@ -183,6 +184,9 @@ Solving specific problems
 :doc:`topics/developer-tools`
     Learn how to scrape with your browser's developer tools.
 
+:doc:`topics/dynamic-content`
+    Read webpage data that is loaded dynamically.
+
 :doc:`topics/leaks`
     Learn how to find and get rid of memory leaks in your crawler.
 
diff --git a/docs/topics/dynamic-content.rst b/docs/topics/dynamic-content.rst
new file mode 100644
index 00000000000..8b5dacf5607
--- /dev/null
+++ b/docs/topics/dynamic-content.rst
@@ -0,0 +1,246 @@
+.. _topics-dynamic-content:
+
+====================================
+Selecting dynamically-loaded content
+====================================
+
+Some webpages show the desired data when you load them in a web browser.
+However, when you download them using Scrapy, you cannot reach the desired data
+using :ref:`selectors `.
+
+When this happens, the recommended approach is to
+:ref:`find the data source ` and extract the data
+from it.
+
+If you fail to do that, and you can nonetheless access the desired data through
+the :ref:`DOM ` from your web browser, see
+:ref:`topics-javascript-rendering`.
+
+.. _topics-finding-data-source:
+
+Finding the data source
+=======================
+
+To extract the desired data, you must first find its source location.
+
+If the data is in a non-text-based format, such as an image or a PDF document,
+use the :ref:`network tool ` of your web browser to find
+the corresponding request, and :ref:`reproduce it
+`.
+
+If your web browser lets you select the desired data as text, the data may be
+defined in embedded JavaScript code, or loaded from an external resource in a
+text-based format.
+
+In that case, you can use a tool like wgrep_ to find the URL of that resource.
+
+If the data turns out to come from the original URL itself, you must
+:ref:`inspect the source code of the webpage ` to
+determine where the data is located.
+
+If the data comes from a different URL, you will need to :ref:`reproduce the
+corresponding request `.
+
+.. _topics-inspecting-source:
+
+Inspecting the source code of a webpage
+=======================================
+
+Sometimes you need to inspect the source code of a webpage (not the
+:ref:`DOM `) to determine where some desired data is located.
+
+Use Scrapy’s :command:`fetch` command to download the webpage contents as seen
+by Scrapy::
+
+    scrapy fetch --nolog https://example.com > response.html
+
+If the desired data is in embedded JavaScript code within a ``