Scrape Web URL (Dynamic)

Extract data from web pages by scraping their text content. This node works well for dynamic and complex sites that rely on JavaScript to load.

Using the Scrape Web URL Node (Dynamic)

The Scrape Web URL (Dynamic) node fetches the content of the web page at the provided URL. Under the hood, it uses Puppeteer, so it can interact with the page before scraping it and handle content that is loaded by JavaScript. The node accepts these parameters:

  • URL (required): The URL of the page you want to scrape. Example: https://docs.buildship.com.

  • Selector (optional): The selector of the HTML element whose text content you want to extract. Defaults to body.

  • Steps (optional): A list of steps to perform after the page at the given URL has loaded.

    Usage example: suppose you want to scrape the information shown below:

    [Image: BuildShip Google search]

To begin, set up the node: set the URL to https://www.google.com/ and the selector to #result-stats. This extracts the text content of the search result stats. Use the steps parameter to interact with the page before the text content is extracted.

[Image: BuildShip Google search]

The steps input value would then look something like this:

[
  // type "buildship" in google search input box
  {
    "action": "type",
    "params": ["#APjFqb", "buildship"]
  },

  // click on google search button
  {
    "action": "click",
    "params": [".gNO89b"]
  },

  // wait for searched query to load and if page doesn't load within 3 seconds, move to next step
  {
    "action": "waitForNavigation",
    "params": [
      {
        "timeout": 3000,
        "waitUntil": "load"
      }
    ]
  }
]

After execution, we get the information we were looking for:

[Image: BuildShip Google search]

The root Selector value is the selector whose text content is extracted after all steps have been executed.
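The extraction step can be pictured with a minimal sketch. It assumes the node reads `textContent` from the first element matching the selector through Puppeteer's `page.$eval`; the function name `extractText` is hypothetical, and this is not BuildShip's actual implementation.

```javascript
// Minimal sketch under stated assumptions, NOT BuildShip's actual code.
// Reads the text content of the first element matching the selector,
// falling back to "body" when no selector is given.
async function extractText(page, selector = "body") {
  // page.$eval runs the callback in the browser against the matched element
  // and rejects if no element matches the selector.
  return page.$eval(selector, (element) => element.textContent);
}
```

With a real Puppeteer page, `await extractText(page, "#result-stats")` would be called only after every step has finished.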

Each step object in the steps list consists of an action and params.

The action is the name of any method from the list of Puppeteer page methods, and params is the list of arguments required by that method.
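One way to picture the relationship between action and params is a small runner that looks up each action on the page object and spreads params into the call. This is only a sketch: the name `runSteps` and the stand-in `fakePage` are hypothetical, and the node's real implementation may differ.

```javascript
// Minimal sketch, NOT BuildShip's actual implementation: applies each step
// to a Puppeteer Page (or any object exposing the same method names) in order.
async function runSteps(page, steps) {
  for (const { action, params = [] } of steps) {
    if (typeof page[action] !== "function") {
      throw new Error(`Unknown page method: ${action}`);
    }
    // ["#APjFqb", "buildship"] is spread into page.type("#APjFqb", "buildship").
    await page[action](...params);
  }
}

// Stand-in for a real Puppeteer Page so the sketch runs without a browser;
// it only records which calls were made.
const calls = [];
const fakePage = {
  type: async (selector, text) => { calls.push(`type ${selector} ${text}`); },
  click: async (selector) => { calls.push(`click ${selector}`); },
};

runSteps(fakePage, [
  { action: "type", params: ["#APjFqb", "buildship"] },
  { action: "click", params: [".gNO89b"] },
]).then(() => console.log(calls));
// prints [ 'type #APjFqb buildship', 'click .gNO89b' ]
```

Steps run strictly one after another, so a later step only starts once the previous Puppeteer call has resolved.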

For example, for the type action, the parameters of the Puppeteer type method are:

[Image: Puppeteer type method parameters]

Hence, the step object for the type action would look like this:

{
  // puppeteer method name
  "action": "type",

  // "#APjFqb" is the "selector" (the selector to find <input>)
  // "buildship" is the "text" (the value to be typed in <input>)
  // As per parameters list of "type" method, the third parameter is optional,
  // hence we can either include or exclude it from "params" list
  "params": ["#APjFqb", "buildship"]
}
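To illustrate the optional third parameter: Puppeteer's `page.type` accepts an options object with a `delay` field, the number of milliseconds to wait between key presses. The helper below, `typeStep`, is a hypothetical convenience for building such step objects. It assumes the node passes the options object through to Puppeteer unchanged.

```javascript
// Hypothetical helper: builds a step object for the "type" action.
// The options argument is appended to params only when it is provided.
function typeStep(selector, text, options) {
  const params = [selector, text];
  if (options !== undefined) {
    params.push(options);
  }
  return { action: "type", params };
}

// Without the optional third parameter.
console.log(JSON.stringify(typeStep("#APjFqb", "buildship")));
// → {"action":"type","params":["#APjFqb","buildship"]}

// With a 100 ms delay between key presses (Puppeteer's `delay` option).
console.log(JSON.stringify(typeStep("#APjFqb", "buildship", { delay: 100 })));
// → {"action":"type","params":["#APjFqb","buildship",{"delay":100}]}
```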