Systems such as Paracrawl/Bitextor process each step of the pipeline in batch, i.e., a crawler downloads many pages for each website, a document aligner searches for parallel documents among those pages, and so forth.
We propose using a reinforcement learning approach to decide which pages to crawl and when to stop crawling in order to optimize time and resources.
Incremental Mining uses a simulated environment in place of an actual web crawler. A handful of domains have been pre-crawled, and the extracted data is used to interact with the decision-making process of Incremental Mining.
To interact with this environment, we create the SimulatedEnvironment() class.
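As a rough illustration, the simulated environment can replay pre-crawled pages so that "downloading" a link is a lookup rather than a network call. This is a minimal sketch; the constructor argument and the `fetch` method name are assumptions for illustration, not the project's actual API.

```python
class SimulatedEnvironment:
    """Replays pre-crawled pages in place of a live web crawler."""

    def __init__(self, precrawled_pages):
        # Map from URL to pre-crawled page content; "downloading" a
        # link becomes a dictionary lookup instead of an HTTP request.
        self.pages = dict(precrawled_pages)

    def fetch(self, url):
        # Simulate downloading a page: return its stored content,
        # or None if the URL was not part of the pre-crawl.
        return self.pages.get(url)


# Example usage with a single pre-crawled page.
env = SimulatedEnvironment({"https://example.com/a": "page text"})
page = env.fetch("https://example.com/a")
```

Because the environment is deterministic and offline, decision-making policies can be trained and evaluated repeatedly on the same domains without re-crawling.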
The crawler expands a domain's link tree incrementally, guided by a reinforcement learning based decision-making process.
The state of the crawler is defined by:

- Queue of Links: the queue is limited by size k. At each decision-making step, the crawler can decide to download one of these links. When a new link is pushed (after appearing on a crawled page), old links are popped to maintain the queue size.
- List of unprocessed MonolingualDocuments: the text from all crawled pages that has not yet been aligned.
- List of ParallelDocuments: a list of all documents (and sentences) that have been extracted by the crawler. This can be bitext or n-way parallel. In practice, this will probably just be metadata about them (length, languages, etc.).
- RewardHistory: a list of length h of the return/reward of previous decisions. This will inform the stopping mechanism.
- DecisionHistory: a list of previous decisions.
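The state components above map naturally onto bounded and unbounded containers. The sketch below is one possible representation, assuming concrete values for k and h, and field names chosen for illustration only.

```python
from collections import deque
from dataclasses import dataclass, field

K = 100  # queue size k (assumed value)
H = 10   # reward-history length h (assumed value)


@dataclass
class CrawlerState:
    # Bounded queue of candidate links; pushing beyond size k
    # automatically pops the oldest link.
    queue: deque = field(default_factory=lambda: deque(maxlen=K))
    # Text of crawled pages that has not yet been aligned.
    unprocessed_docs: list = field(default_factory=list)
    # Metadata (length, languages, ...) of extracted parallel documents.
    parallel_docs: list = field(default_factory=list)
    # Return/reward of the last h decisions; informs stopping.
    reward_history: deque = field(default_factory=lambda: deque(maxlen=H))
    # Full record of previous decisions.
    decision_history: list = field(default_factory=list)
```

Using `deque(maxlen=...)` gives the "old links are popped to maintain queue size" behaviour for free: appending to a full deque silently discards the item at the opposite end.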
At each step, the crawler can make one of three decisions:

- Crawl a child link from the queue (it does not have to be the first item in the queue).
- Increment (the loop element of the diagram) by processing the unprocessed documents via document alignment, sentence alignment, and filtering.
- Stop processing.
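The three decisions above can be sketched as a small action space with a dispatch function. Everything here is a hypothetical sketch: the enum names, the dictionary-based state, and the stand-in for the alignment/filtering pipeline are assumptions, not the project's actual code.

```python
from enum import Enum, auto


class Action(Enum):
    CRAWL = auto()      # download a chosen child link from the queue
    INCREMENT = auto()  # run document/sentence alignment and filtering
    STOP = auto()       # stop processing this domain


def apply(action, state):
    """Dispatch one decision; returns False once the crawler stops."""
    if action is Action.CRAWL:
        if state["queue"]:
            # Any queue item may be chosen, not only the head; here we
            # arbitrarily take the last one.
            state["unprocessed"].append(state["queue"].pop())
    elif action is Action.INCREMENT:
        # Stand-in for document alignment, sentence alignment, filtering.
        state["parallel"].extend(state["unprocessed"])
        state["unprocessed"].clear()
    return action is not Action.STOP
```

A learned policy would choose the next `Action` (and, for `CRAWL`, which queue item) from the crawler state, with the reward history driving the `STOP` decision.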