This tutorial shows how to combine Jsoup, Neo4j, Spring Data, and several other technologies.
I developed a very similar system in the past. I needed a map of how all the pages on a 20,000-page site linked to one another. More importantly, I wanted to find the boilerplate HTML and exclude it from content extraction. To do that, the system normalized and stored all the DOM trees. The details of how it works will unfold as the tutorial progresses.
Please follow our ScrumBucket tutorial as we build out our project.