Test performance of crawler-commons' SiteMapParser. Sitemaps are read from WARC files to make results reproducible.
- compile and install crawler-commons and this project:

  ```sh
  git clone https://github.com/crawler-commons/crawler-commons.git
  cd crawler-commons/
  mvn install
  cd -
  mvn package
  ```
To test another crawler-commons version, change the version of the crawler-commons dependency in this project's pom.xml.
- prepare a list of URLs pointing to sitemap files, e.g.

  ```sh
  echo "https://www.sitemaps.org/sitemap.xml" >sitemaps.txt
  ```
- fetch the sitemaps and wrap them into a WARC file, e.g., using wget:

  ```sh
  wget --no-warc-keep-log \
       --warc-file sitemaps \
       -O /dev/null \
       -i sitemaps.txt
  ```
- test the sitemap parser, e.g.,
./run.sh -Dwarc.index=false \
-Dsitemap.strict=false \
-Dsitemap.partial=true \
-Dsitemap.lazyNamespace=true \
-Dsitemap.extensions=true \
sitemaps.warc.gz
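  Under the hood, the test drives crawler-commons' SiteMapParser over the payload of each WARC response record. The following is a minimal sketch of that parsing step, assuming the sitemap bytes and their URL have already been extracted from a WARC record (the class and method names outside the crawler-commons API are illustrative, not the tool's actual code):

  ```java
  import java.io.IOException;
  import java.net.URL;

  import crawlercommons.sitemaps.AbstractSiteMap;
  import crawlercommons.sitemaps.SiteMap;
  import crawlercommons.sitemaps.SiteMapIndex;
  import crawlercommons.sitemaps.SiteMapParser;
  import crawlercommons.sitemaps.SiteMapURL;
  import crawlercommons.sitemaps.UnknownFormatException;

  public class ParseSketch {

      /** Parse one sitemap payload and print what was found. */
      public static void parseOne(SiteMapParser parser, byte[] content, URL url)
              throws UnknownFormatException, IOException {
          AbstractSiteMap sitemap = parser.parseSiteMap(content, url);
          if (sitemap.isIndex()) {
              // a sitemap index: the referenced sitemaps can only be parsed
              // recursively if the WARC records are indexed (warc.index=true)
              for (AbstractSiteMap sub : ((SiteMapIndex) sitemap).getSitemaps()) {
                  System.out.println("sub-sitemap: " + sub.getUrl());
              }
          } else {
              // a plain sitemap: list the contained URLs
              for (SiteMapURL u : ((SiteMap) sitemap).getSiteMapUrls()) {
                  System.out.println(u.getUrl());
              }
          }
      }
  }
  ```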
- set properties to modify the tests (the sketch after this list shows roughly how they map onto the parser configuration):
  - sitemap.strict: (if true) check whether URLs share the prefix (protocol, host, port, path without filename) with the sitemap URL
  - sitemap.partial: (if true) relax the parser to also return a parsed sitemap if the document is truncated or broken and was only partially parsed
  - sitemap.strictNamespace: (if true) check sitemap namespaces and ignore XML elements which are not in the namespace http://www.sitemaps.org/schemas/sitemap/0.9
  - sitemap.lazyNamespace: (if true) check namespaces but also allow legacy namespaces
  - sitemap.extensions: (if true) enable support for sitemap extensions (news, image, video, etc.)
  - warc.index: (if true) read the WARC file(s) ahead and index the records in a map <url, record>. This adds some overhead in CPU time and memory but allows sitemap indexes to be parsed recursively.
  - warc.parse.url: parse only a single sitemap identified by the given URL
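  For orientation, the sitemap.* properties correspond roughly to the following SiteMapParser setup (a sketch only; the helper method and flag wiring are illustrative, and the actual tool may configure the parser differently):

  ```java
  import crawlercommons.sitemaps.SiteMapParser;

  public class ParserConfigSketch {

      /** Build a parser from the sitemap.* flags described above. */
      public static SiteMapParser configure(boolean strict, boolean partial,
              boolean strictNamespace, boolean extensions) {
          // sitemap.strict  -> enforce the URL prefix check
          // sitemap.partial -> also accept truncated/broken, partially parsed documents
          SiteMapParser parser = new SiteMapParser(strict, partial);
          // sitemap.strictNamespace -> only accept elements in the
          // http://www.sitemaps.org/schemas/sitemap/0.9 namespace;
          // with sitemap.lazyNamespace the tool additionally accepts legacy namespaces
          parser.setStrictNamespace(strictNamespace);
          if (extensions) {
              // sitemap.extensions -> parse news, image, video, ... extension elements
              parser.enableExtensions();
          }
          return parser;
      }
  }
  ```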