This is a Python framework I developed for my bachelor thesis. Its main purpose was to research methods for extracting content from large collections of HTML documents stored in web archives.
This repository contains content that has been crawled for research purposes.