import { Callout } from 'nextra/components';
import Image from 'next/image';
import dynamicScrape from '/public/tutorial/scrape/dynamic-scrape.png';
import buildshipGoogleSearch from '/public/tutorial/scrape/buildship-google-search.png';
import dynamicScrapeResult from '/public/tutorial/scrape/dynamic-scrape-result.png';
import puppeteerTypeMethodParams from '/public/tutorial/puppeteer-type-method-params.png';

Extract data from web pages by scraping their text content. This node works well for dynamic and complex sites that rely on JavaScript to load.

The Scrape Web URL node fetches the content of the web page at the provided URL. Under the hood, it uses Puppeteer, which lets it scrape pages that render their content dynamically. The node accepts these parameters:
- URL (required): The URL of the page you want to scrape. Example: https://docs.buildship.com.
- Selector (optional): The HTML selector whose text content you want to extract. Defaults to `body`.
- Steps (optional): A list of steps to run after the page at the given URL has loaded.

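
As a rough illustration of how these parameters relate, the sketch below normalizes a set of inputs the way the list above describes: the URL must be present, the selector falls back to `body`, and the steps fall back to an empty list. The function name `normalizeScrapeInputs` and the lowercase key names are hypothetical and not part of BuildShip; the node's real input handling may differ.

```javascript
// Hypothetical helper, not a BuildShip API. It only mirrors the parameter
// list above: url is required, selector defaults to "body", steps is optional.
function normalizeScrapeInputs({ url, selector, steps } = {}) {
  if (typeof url !== "string" || url.trim() === "") {
    throw new Error("url is required");
  }
  // The URL constructor throws a TypeError when the string is not a valid URL.
  new URL(url);
  if (steps !== undefined && !Array.isArray(steps)) {
    throw new Error("steps must be a list");
  }
  return { url, selector: selector || "body", steps: steps || [] };
}

const inputs = normalizeScrapeInputs({ url: "https://docs.buildship.com" });
console.log(inputs);
```

Only the URL is supplied here, so the returned object carries the default selector and an empty list of steps.
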
Usage example: suppose you want to scrape the result stats that Google displays for the search query "buildship".
To begin, set up the node: set the URL to `https://www.google.com/` and the selector to `#result-stats`. This extracts the text content of the search result stats. The `steps` parameter can be used to interact with the page before the text content is extracted.

So the `steps` input value would look something like this:
```json
[
  // type "buildship" in google search input box
  {
    "action": "type",
    "params": ["#APjFqb", "buildship"]
  },
  // click on google search button
  {
    "action": "click",
    "params": [".gNO89b"]
  },
  // wait for searched query to load and if page doesn't load within 3 seconds, move to next step
  {
    "action": "waitForNavigation",
    "params": [
      {
        "timeout": 3000,
        "waitUntil": "load"
      }
    ]
  }
]
```
After execution, we get the information we're looking for: the text content of the `#result-stats` element.
The root `selector` value is the selector whose text content is extracted after all steps have been executed.

Each step object in the `steps` list consists of an `action` and `params`.

The `action` parameter is the name of any method from Puppeteer's list of page methods, and `params` is the list of parameters that the selected `action` (a Puppeteer method name) requires.
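
The order of operations described above can be sketched as follows. This is a minimal illustration, assuming the node loads the URL, dispatches each step as `page[action](...params)`, and then reads the selector's text content; the document does not show the node's actual implementation. `scrapeText` and `createFakePage` are hypothetical names, and the fake page stands in for a real Puppeteer `Page` so the sketch runs without a browser.

```javascript
// Sketch only. Assumption: load the URL, run every step as
// page[action](...params), then read the selector's text content.
async function scrapeText(page, { url, selector = "body", steps = [] }) {
  await page.goto(url);
  for (const { action, params = [] } of steps) {
    if (typeof page[action] !== "function") {
      throw new Error(`Unknown page method: ${action}`);
    }
    await page[action](...params);
  }
  // page.$eval runs the callback against the first element matching selector.
  return page.$eval(selector, (el) => el.textContent);
}

// Stand-in for a Puppeteer Page: records each call and returns canned text.
function createFakePage(calls, text) {
  const record = (name) => async (...args) => {
    calls.push([name, ...args]);
  };
  return {
    goto: record("goto"),
    type: record("type"),
    click: record("click"),
    $eval: async (selector, fn) => {
      calls.push(["$eval", selector]);
      return fn({ textContent: text });
    },
  };
}

const calls = [];
const result = scrapeText(createFakePage(calls, "sample stats text"), {
  url: "https://www.google.com/",
  selector: "#result-stats",
  steps: [
    { action: "type", params: ["#APjFqb", "buildship"] },
    { action: "click", params: [".gNO89b"] },
  ],
});
result.then((text) => console.log(text, calls.map((call) => call[0])));
```

With a real Puppeteer page in place of the fake one, the same loop would call `page.type`, `page.click`, and so on with the listed parameters, which is why each `params` list has to match the signature of the method named in `action`.
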
For example, for the `type` action, Puppeteer's `page.type` method takes a `selector`, the `text` to type, and an optional `options` object.

Hence, the step object for the `type` action would look like:
```json
{
  // puppeteer method name
  "action": "type",
  // "#APjFqb" is the "selector" (the selector to find <input>)
  // "buildship" is the "text" (the value to be typed in <input>)
  // As per parameters list of "type" method, the third parameter is optional,
  // hence we can either include or exclude it from "params" list
  "params": ["#APjFqb", "buildship"]
}
```
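
To illustrate the optional third parameter, here is a minimal sketch of a helper that builds a `type` step with or without it. `typeStep` is a hypothetical name, not a BuildShip or Puppeteer function. The second call uses the `delay` option of Puppeteer's `page.type`, which sets the wait in milliseconds between key presses.

```javascript
// Hypothetical helper, not a BuildShip or Puppeteer API.
function typeStep(selector, text, options) {
  const params = [selector, text];
  // Append the optional third argument only when the caller provides it.
  if (options !== undefined) {
    params.push(options);
  }
  return { action: "type", params };
}

// Two-parameter form, matching the step object above.
console.log(JSON.stringify(typeStep("#APjFqb", "buildship")));

// Three-parameter form: type with a 100 ms pause between key presses.
console.log(JSON.stringify(typeStep("#APjFqb", "buildship", { delay: 100 })));
```
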