XMLCrawler
A web crawler takes the URL of an HTML document as input.
It parses the HTML document and finds the resources referenced by <a href="..."> in that document.
It then repeats the same process on the referenced HTML resources, and so on.
While doing this, it saves the resources to the local filesystem.
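The loop described above can be sketched in a few lines. This is only an illustration of the general idea, not how any particular crawler is implemented: the class name HtmlCrawlSketch is hypothetical, the in-memory map stands in for fetching pages over HTTP, and the regular expression is a rough substitute for a real HTML parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlCrawlSketch {
    // Matches href="..." inside an <a ...> tag; enough for a sketch, not a real HTML parser.
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the href values found in the given HTML, in document order.
    static List<String> links(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    // Visits every known page reachable from start exactly once, breadth-first.
    // The map (url -> html) is an assumption standing in for network access.
    static Set<String> crawl(String start, Map<String, String> pages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = pages.get(url);
            if (html == null || !visited.add(url)) {
                continue; // unknown page, or already crawled
            }
            queue.addAll(links(html));
        }
        return visited;
    }
}
```

The visited set is what keeps the crawl from looping forever when two pages link to each other.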
Similarly, jlibs.xml.sax.crawl.XMLCrawler is for XML files.
However, an XML document can refer to another XML document in many ways. For example:
XMLSchema uses <xsd:import> and <xsd:include>
WSDL uses <wsdl:import>, <wsdl:include>, <xsd:import> and <xsd:include>
i.e., each type of XML document has its own way of referring to other XML documents.
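To make the XML Schema case concrete, here is a minimal sketch that collects the schemaLocation values of <xsd:import> and <xsd:include> using only the JDK's SAX parser. It does not use jlibs and is not how XMLCrawler is implemented; the class name XsdLinkSketch and its method are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XsdLinkSketch {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    // Returns the schemaLocation values of xsd:import and xsd:include, in document order.
    static List<String> referencedSchemas(String xml) {
        List<String> locations = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so that uri and localName are populated
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    boolean isLink = XSD_NS.equals(uri)
                            && ("import".equals(localName) || "include".equals(localName));
                    // schemaLocation is an unqualified attribute, hence the empty namespace URI
                    String location = isLink ? attrs.getValue("", "schemaLocation") : null;
                    if (location != null) {
                        locations.add(location);
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return locations;
    }
}
```

A WSDL document would need a different set of element and attribute names, which is exactly why a generic crawler has to be told what to look for in each document type.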
XMLCrawler is preconfigured to crawl XML Schema, WSDL and XSL documents.
Usage:
import jlibs.xml.sax.crawl.XMLCrawler;
import org.xml.sax.InputSource;
import java.io.File;

String dir = "d:\\crawl"; // directory where crawled documents are saved
String wsdl = "https://fps.amazonaws.com/doc/2007-01-08/AmazonFPS.wsdl"; // WSDL to be crawled
new XMLCrawler().crawlInto(new InputSource(wsdl), new File(dir));
All XML documents are saved into the specified directory. After running the above code, you will find the following files in the d:\crawl directory:
AmazonFPS.wsdl
AmazonFPS.xsd
It never overwrites an existing file in that directory. So if you run the above code twice, you will see the following files in the d:\crawl directory:
AmazonFPS1.wsdl
AmazonFPS1.xsd
AmazonFPS.wsdl
AmazonFPS.xsd
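One way to get this kind of non-overwriting behavior is to append a counter to the file name until an unused name is found. The sketch below is an assumption based only on the file names shown above (AmazonFPS.wsdl, then AmazonFPS1.wsdl); it is not taken from the XMLCrawler source, and the class and method names are hypothetical.

```java
import java.io.File;

class UniqueFileSketch {
    // Returns dir/name.ext if that file does not exist yet,
    // otherwise dir/name1.ext, dir/name2.ext, ... (first one that is free).
    static File firstFreeFile(File dir, String name, String extension) {
        File candidate = new File(dir, name + "." + extension);
        for (int i = 1; candidate.exists(); i++) {
            candidate = new File(dir, name + i + "." + extension);
        }
        return candidate;
    }
}
```

Note that the check and the later file creation are separate steps, so this simple version is not safe if several processes write into the same directory at once.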
You could also do:
new XMLCrawler().crawl(new InputSource(wsdl), new File("d:\\crawl\\target.wsdl"));
The second argument of crawl(...) is the file in which to save the document specified by the first argument.
All referenced documents are saved in the directory containing that file.
For example, the above creates the following files in d:\crawl:
target.wsdl
AmazonFPS.xsd
NOTE: All files are saved directly in the given directory, i.e., no subdirectories are created.
XMLCrawler can be configured to crawl any type of XML using CrawlingRules.
The no-arg constructor uses CrawlingRules configured for WSDL, XSD and XSL documents.
Let us see how to configure XMLCrawler for XML Schema documents using CrawlingRules.
import jlibs.xml.sax.crawl.XMLCrawler;
import jlibs.xml.sax.crawl.CrawlingRules;
import jlibs.xml.Namespaces;
import javax.xml.namespace.QName;
import org.xml.sax.InputSource;
import java.io.File;
QName xsd_schema = new QName(Namespaces.URI_XSD, "schema");
QName xsd_import = new QName(Namespaces.URI_XSD, "import");
QName attr_schemaLocation = new QName("schemaLocation");
QName xsd_include = new QName(Namespaces.URI_XSD, "include");
CrawlingRules rules = new CrawlingRules();
rules.addExtension("xsd", xsd_schema);
rules.addAttributeLink(xsd_schema, xsd_import, attr_schemaLocation);
rules.addAttributeLink(xsd_schema, xsd_include, attr_schemaLocation);
XMLCrawler crawler = new XMLCrawler(rules);
// now crawler is ready for use
String xsd = "http://somesite.com/xsds/complex.xsd";
String dir = "d:\\crawl";
crawler.crawlInto(new InputSource(xsd), new File(dir));
First, we need to tell the crawler how to determine the file extension for an XML file.
rules.addExtension("xsd", xsd_schema);
Here we are saying that an XML file with the root element {http://www.w3.org/2001/XMLSchema}schema should be saved with the file extension xsd.
rules.addAttributeLink(xsd_schema, xsd_import, attr_schemaLocation);
rules.addAttributeLink(xsd_schema, xsd_include, attr_schemaLocation);
The above lines say that the schemaLocation attribute of xsd:schema/xsd:import and xsd:schema/xsd:include is used to refer to other XML files.
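Conceptually, these two kinds of rule amount to two lookup tables: root element to file extension, and (root element, child element) to link attribute. The sketch below models that idea with plain JDK maps. It is a stand-in for illustration only and is not the jlibs CrawlingRules class; RuleTableSketch, extensionFor and linkAttribute are hypothetical names, and the two add methods merely mirror the names used above.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;

class RuleTableSketch {
    // root element -> file extension
    private final Map<QName, String> extensions = new HashMap<>();
    // "root/element" -> attribute that holds the link
    private final Map<String, QName> links = new HashMap<>();

    void addExtension(String extension, QName root) {
        extensions.put(root, extension);
    }

    void addAttributeLink(QName root, QName element, QName attribute) {
        links.put(root + "/" + element, attribute);
    }

    // Extension to use for a document with the given root element, or null if unknown.
    String extensionFor(QName root) {
        return extensions.get(root);
    }

    // Attribute of root/element that refers to another document, or null if none is registered.
    QName linkAttribute(QName root, QName element) {
        return links.get(root + "/" + element);
    }
}
```

With the XSD rules registered, looking up xsd:schema/xsd:import would yield schemaLocation, while an element such as xsd:element would yield null and be ignored.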
Customization:
The CrawlerListener interface can be used to customize crawling behavior. It has two methods:
public boolean doCrawl(URL url);
public File toFile(URL url, String extension);
The default implementation is DefaultCrawlerListener.
doCrawl(url) determines whether the given URL should be crawled. The DefaultCrawlerListener implementation always returns true.
toFile(...) determines the file into which the specified URL is saved.
To use your own implementation of CrawlerListener, use the following method of XMLCrawler:
public File crawl(InputSource document, CrawlerListener listener, File file) throws IOException
The last argument, file, can be null if you don't want to specify a target file.
In that case, listener.toFile(...) is used to determine the target file.
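As an illustration, a listener might restrict crawling to a single host and derive file names from the last path segment of each URL. In the sketch below, the CrawlerListener interface is a local stand-in that copies the two method signatures listed above so that the example compiles without jlibs; in a real project you would implement the jlibs interface instead. SameHostListener and its naming policy are hypothetical.

```java
import java.io.File;
import java.net.URL;

// Local stand-in with the two methods listed above (assumption: same signatures as jlibs).
interface CrawlerListener {
    boolean doCrawl(URL url);
    File toFile(URL url, String extension);
}

class SameHostListener implements CrawlerListener {
    private final String host;
    private final File dir;

    SameHostListener(String host, File dir) {
        this.host = host;
        this.dir = dir;
    }

    // Crawl only documents served from the configured host.
    @Override
    public boolean doCrawl(URL url) {
        return host.equalsIgnoreCase(url.getHost());
    }

    // Name the file after the last path segment, replacing its extension with the given one.
    @Override
    public File toFile(URL url, String extension) {
        String path = url.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot > 0) {
            name = name.substring(0, dot);
        }
        if (name.isEmpty()) {
            name = "document"; // fallback for URLs that end with a slash
        }
        return new File(dir, name + "." + extension);
    }
}
```

Such a listener would then be passed as the second argument of the crawl method shown above, with null as the file argument so that toFile(...) decides where each document goes.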