Extraction strategy for deep crawling #936

franjefriten · 2025-04-03T18:50:30Z

The documentation specifically says to use the following structure with JsonXPathExtractionStrategy or JsonCssExtractionStrategy:

    # 1. Minimal dummy HTML with some repeating rows
    dummy_html = """
    <html>
      <body>
        <div class='crypto-row'>
          <h2 class='coin-name'>Bitcoin</h2>
          <span class='coin-price'>$28,000</span>
        </div>
        <div class='crypto-row'>
          <h2 class='coin-name'>Ethereum</h2>
          <span class='coin-price'>$1,800</span>
        </div>
      </body>
    </html>
    """

    # 2. Define the JSON schema (XPath version)
    schema = {
        "name": "Crypto Prices via XPath",
        "baseSelector": "//div[@class='crypto-row']",
        "fields": [
            {
                "name": "coin_name",
                "selector": ".//h2[@class='coin-name']",
                "type": "text"
            },
            {
                "name": "price",
                "selector": ".//span[@class='coin-price']",
                "type": "text"
            }
        ]
    }

However, this leaves us with only shallow crawling, not deep crawling: both strategies build an lxml tree from the HTML of a single page and require a base selector. As a result, you cannot extract data from a nested page whose URL is found in the HTML of the base page.
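
To make the gap concrete, here is a rough sketch of the two-pass behaviour I am after: apply one schema to the listing page to collect the detail URLs, then apply a second schema to each linked page and merge the rows. This is not the library's API. `PAGES`, `extract`, `deep_extract`, `LIST_SCHEMA`, `DETAIL_SCHEMA` and the `/list` and `/coins/...` URLs are all made up for illustration, the pages are held in memory instead of being fetched, and the standard library's `xml.etree.ElementTree` stands in for lxml (so it only handles well-formed markup and relative `.//` paths). The `"attribute"` field type is assumed to behave like the one in the documented schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory "site" (URL -> HTML); a real crawler would fetch these.
PAGES = {
    "/list": """
    <html><body>
      <div class='crypto-row'>
        <h2 class='coin-name'>Bitcoin</h2>
        <a class='detail' href='/coins/btc'>details</a>
      </div>
      <div class='crypto-row'>
        <h2 class='coin-name'>Ethereum</h2>
        <a class='detail' href='/coins/eth'>details</a>
      </div>
    </body></html>
    """,
    "/coins/btc": "<html><body><div class='coin'>"
                  "<span class='coin-price'>$28,000</span></div></body></html>",
    "/coins/eth": "<html><body><div class='coin'>"
                  "<span class='coin-price'>$1,800</span></div></body></html>",
}

# Schema for the base (listing) page: name plus the link to the nested page.
LIST_SCHEMA = {
    "baseSelector": ".//div[@class='crypto-row']",
    "fields": [
        {"name": "coin_name", "selector": ".//h2[@class='coin-name']", "type": "text"},
        {"name": "detail_url", "selector": ".//a[@class='detail']",
         "type": "attribute", "attribute": "href"},
    ],
}

# Schema for each nested (detail) page.
DETAIL_SCHEMA = {
    "baseSelector": ".//div[@class='coin']",
    "fields": [
        {"name": "price", "selector": ".//span[@class='coin-price']", "type": "text"},
    ],
}


def extract(html, schema):
    """Apply one schema to one page and return a list of row dicts."""
    root = ET.fromstring(html.strip())
    rows = []
    for base in root.findall(schema["baseSelector"]):
        row = {}
        for field in schema["fields"]:
            node = base.find(field["selector"])
            if node is None:
                row[field["name"]] = None
            elif field["type"] == "attribute":
                row[field["name"]] = node.get(field["attribute"])
            else:
                row[field["name"]] = (node.text or "").strip()
        rows.append(row)
    return rows


def deep_extract(start_url):
    """Extract listing rows, then follow each detail_url and merge its fields."""
    results = []
    for row in extract(PAGES[start_url], LIST_SCHEMA):
        detail_rows = extract(PAGES[row["detail_url"]], DETAIL_SCHEMA)
        merged = {"coin_name": row["coin_name"]}
        if detail_rows:
            merged.update(detail_rows[0])
        results.append(merged)
    return results


if __name__ == "__main__":
    for item in deep_extract("/list"):
        print(item)
```

The point of the sketch is that the second schema has to be bound to a different page than the first, which is exactly what a single `baseSelector` over one page's tree cannot express.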

Replies: 0 comments
