XMLCrawler

A web crawler takes the URL of an HTML document as input.
It parses the document and finds the resources referenced by <a href="..."> elements in it.
It then repeats the same process on the referenced HTML resources, and so on.
While doing this, it saves the resources to the local filesystem.

Similarly, jlibs.xml.sax.crawl.XMLCrawler does the same for XML files.

However, an XML document can refer to another XML document in many ways. For example:

XMLSchema uses <xsd:import> and <xsd:include>
WSDL uses <wsdl:import>, <wsdl:include>, <xsd:import> and <xsd:include>

That is, each type of XML document has its own way of referring to other XML documents.
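
To make the idea of type-specific links concrete, here is a minimal sketch that collects the schemaLocation values of xsd:import and xsd:include elements using only the JDK's SAX parser. It is not how XMLCrawler is implemented; the class name SchemaLinkFinder and the method findLinks are hypothetical names chosen for illustration, and the sketch matches these elements anywhere in the document rather than only directly under xsd:schema.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of jlibs.
public class SchemaLinkFinder {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    /** Returns the schemaLocation of every xsd:import and xsd:include, in document order. */
    public static List<String> findLinks(String xml) {
        List<String> links = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so that uri/localName are reported
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    boolean isLink = XSD_NS.equals(uri)
                            && ("import".equals(localName) || "include".equals(localName));
                    if (isLink) {
                        // schemaLocation is an unqualified attribute, so its namespace is ""
                        String location = attrs.getValue("", "schemaLocation");
                        if (location != null) {
                            links.add(location);
                        }
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return links;
    }
}
```

A WSDL-aware variant would need one more check for the WSDL namespace and its location attribute, which is the kind of per-type knowledge a crawler has to be told about.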

XMLCrawler is preconfigured to crawl XMLSchema, WSDL and XSL documents.

Usage:

import jlibs.xml.sax.crawl.XMLCrawler;
import org.xml.sax.InputSource;
import java.io.File;

String dir = "d:\\crawl"; // directory in which crawled documents are saved
String wsdl = "https://fps.amazonaws.com/doc/2007-01-08/AmazonFPS.wsdl"; // wsdl to be crawled

new XMLCrawler().crawlInto(new InputSource(wsdl), new File(dir));

All XML documents are saved into the specified directory. After running the above code, you will find the following files in the d:\crawl directory:

AmazonFPS.wsdl
AmazonFPS.xsd

It never overwrites an existing file in that directory. So if you run the above code twice, you will see the following files in the d:\crawl directory:

AmazonFPS1.wsdl
AmazonFPS1.xsd
AmazonFPS.wsdl
AmazonFPS.xsd
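
The naming pattern above (AmazonFPS.wsdl, then AmazonFPS1.wsdl) can be sketched as a small "first free name" search. This is only an illustration of the observed behavior, not the library's code: the class UniqueNames and its methods are hypothetical, and continuing with 2, 3, ... on later runs is an assumption that the text above does not confirm.

```java
import java.io.File;
import java.util.function.Predicate;

// Hypothetical helper, not part of jlibs.
public class UniqueNames {
    /**
     * Returns fileName if it is not taken; otherwise inserts the smallest
     * counter (1, 2, ...) before the extension that yields a free name.
     */
    public static String firstFree(String fileName, Predicate<String> taken) {
        if (!taken.test(fileName)) {
            return fileName;
        }
        int dot = fileName.lastIndexOf('.');
        String base = dot < 0 ? fileName : fileName.substring(0, dot);
        String ext = dot < 0 ? "" : fileName.substring(dot); // includes the dot
        for (int i = 1; ; i++) {
            String candidate = base + i + ext;
            if (!taken.test(candidate)) {
                return candidate;
            }
        }
    }

    /** Same search, with "taken" meaning "already exists in dir". */
    public static File firstFreeFile(File dir, String fileName) {
        return new File(dir, firstFree(fileName, name -> new File(dir, name).exists()));
    }
}
```

Passing the "is it taken" check as a predicate keeps the naming logic separate from the filesystem, so it can be exercised without creating files.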

You can also do:

new XMLCrawler().crawl(new InputSource(wsdl), new File("d:\\crawl\\target.wsdl"));

The second argument of crawl(...) is the file in which to save the document specified by the first argument.
All referenced documents are saved in the directory containing that file.
For example, the above creates the following files in d:\crawl

target.wsdl
AmazonFPS.xsd

NOTE: All files are saved directly in the given directory, i.e., no subdirectories are created.

XMLCrawler can be configured to crawl any type of XML document using CrawlingRules.

The no-arg constructor uses CrawlingRules configured for WSDL, XSD and XSL documents.

Let us see how to configure XMLCrawler for XMLSchema documents using CrawlingRules.

import jlibs.xml.sax.crawl.XMLCrawler;
import jlibs.xml.sax.crawl.CrawlingRules;
import jlibs.xml.Namespaces;
import javax.xml.namespace.QName;
import org.xml.sax.InputSource;
import java.io.File;

QName xsd_schema = new QName(Namespaces.URI_XSD, "schema");
QName xsd_import = new QName(Namespaces.URI_XSD, "import");
QName attr_schemaLocation = new QName("schemaLocation");
QName xsd_include = new QName(Namespaces.URI_XSD, "include");

CrawlingRules rules = new CrawlingRules();
rules.addExtension("xsd", xsd_schema);
rules.addAttributeLink(xsd_schema, xsd_import, attr_schemaLocation);
rules.addAttributeLink(xsd_schema, xsd_include, attr_schemaLocation);

XMLCrawler crawler = new XMLCrawler(rules);

// now crawler is ready for use
String xsd = "http://somesite.com/xsds/complex.xsd";
String dir = "d:\\crawl";
crawler.crawlInto(new InputSource(xsd), new File(dir));

First, we need to tell the crawler how to determine the file extension of an XML file.

rules.addExtension("xsd", xsd_schema);

Here we are saying that an XML file with root element {http://www.w3.org/2001/XMLSchema}schema should be saved with the file extension xsd.
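
The idea of choosing an extension from the root element can be sketched with the JDK's StAX reader. This is a standalone illustration under stated assumptions, not the CrawlingRules implementation: the class ExtensionByRoot, the methods register and extensionFor, and the fallback extension "xml" for unknown roots are all hypothetical.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of jlibs.
public class ExtensionByRoot {
    private final Map<QName, String> extensions = new HashMap<>();

    /** Associates a root element name with a file extension (without the dot). */
    public void register(String extension, QName root) {
        extensions.put(root, extension);
    }

    /** Reads up to the root element and returns its registered extension, or "xml" (assumed fallback). */
    public String extensionFor(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // the first START_ELEMENT event is always the root element
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // QName equality ignores the prefix, so xs:schema and xsd:schema both match
                    return extensions.getOrDefault(reader.getName(), "xml");
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return "xml";
    }
}
```

Keying on the qualified root name rather than on the URL's file suffix means a schema served from a URL without an .xsd suffix would still get the right extension.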

rules.addAttributeLink(xsd_schema, xsd_import, attr_schemaLocation);
rules.addAttributeLink(xsd_schema, xsd_include, attr_schemaLocation);

The above lines say that the schemaLocation attribute of xsd:schema/xsd:import and xsd:schema/xsd:include elements is used to refer to other XML files.

Customization:

The CrawlerListener interface can be used to customize crawling behavior. It has two methods:

public boolean doCrawl(URL url);
public File toFile(URL url, String extension);

The default implementation used is DefaultCrawlerListener.

doCrawl(url) determines whether the given URL should be crawled. The DefaultCrawlerListener implementation always returns true.

toFile(...) determines the file into which the specified URL should be saved.

To use your own implementation of CrawlerListener, call the following method of XMLCrawler:

public File crawl(InputSource document, CrawlerListener listener, File file) throws IOException

The last argument, file, can be null if you don't want to specify a target file.
In that case, listener.toFile(...) is used to determine the target file.
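
As an illustration, a listener that crawls only documents on one host might look like the sketch below. To keep it compilable without jlibs, it declares a local stand-in interface with the same two methods listed above; with jlibs on the classpath you would implement its CrawlerListener instead. The names SameHostListenerSketch, Listener, SameHostListener and url are hypothetical, and passing the extension without a leading dot is an assumption based on the addExtension("xsd", ...) call shown earlier.

```java
import java.io.File;
import java.net.URI;
import java.net.URL;

// Hypothetical sketch, not part of jlibs.
public class SameHostListenerSketch {
    /** Local stand-in mirroring the two CrawlerListener methods listed above. */
    public interface Listener {
        boolean doCrawl(URL url);
        File toFile(URL url, String extension);
    }

    /** Crawls only URLs on a given host and saves every document directly into one directory. */
    public static class SameHostListener implements Listener {
        private final String host;
        private final File dir;

        public SameHostListener(String host, File dir) {
            this.host = host;
            this.dir = dir;
        }

        @Override
        public boolean doCrawl(URL url) {
            return host.equalsIgnoreCase(url.getHost());
        }

        @Override
        public File toFile(URL url, String extension) {
            String path = url.getPath();
            String name = path.substring(path.lastIndexOf('/') + 1); // last path segment
            int dot = name.lastIndexOf('.');
            if (dot > 0) {
                name = name.substring(0, dot); // drop the suffix from the URL
            }
            if (name.isEmpty()) {
                name = "document"; // assumed fallback for URLs ending in '/'
            }
            return new File(dir, name + "." + extension);
        }
    }

    /** Convenience for examples: builds a URL without a checked exception. */
    public static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (Exception e) {
            throw new IllegalArgumentException(spec, e);
        }
    }
}
```

Note that this toFile does not check for existing files, so a real listener that must never overwrite would need an additional uniqueness step.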
