Thorben Nissen edited this page Oct 27, 2017 · 2 revisions

What is it?

The extension versatile_crawler is not just another crawler extension for TYPO3 CMS. The goal was to create an extension that is easy to use and understand, so that an integrator can easily set up the crawling of pages and records (e.g. news, events or products). At the same time, it should be flexible and extendable in how queueing and indexing are handled.

How does it work?

The extension consists mainly of two scheduler tasks: the Queue task and the Process task. The Queue task adds items to the queue, based on the selected configurations. The Process task works through the queue and triggers the crawling/indexing for each item.
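The interplay of the two tasks can be pictured with a minimal sketch. It is not the extension's actual code: the function names, the dictionary layout of a configuration and the `batch_size` parameter are assumptions made purely for illustration.

```python
from collections import deque


def run_queue_task(configurations, queue):
    """Add one queue item per target of each selected configuration."""
    for configuration in configurations:
        for target in configuration["targets"]:
            queue.append((configuration["name"], target))


def run_process_task(queue, crawl, batch_size):
    """Crawl up to batch_size items and return how many are still waiting.

    batch_size is a hypothetical limit; the source does not say how many
    items the real Process task handles per run.
    """
    for _ in range(min(batch_size, len(queue))):
        crawl(queue.popleft())
    return len(queue)


queue = deque()
run_queue_task([{"name": "pages", "targets": [1, 2, 3]}], queue)

crawled = []
remaining = run_process_task(queue, crawled.append, batch_size=2)
```

Splitting the work this way lets the expensive part (the frontend requests) run in small batches, while filling the queue stays a cheap, infrequent job.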

The provided crawlers initiate a frontend request that triggers the actual indexing, based on the information generated while rendering the page. Currently, the indexer from indexed_search is used. The choice of indexer will be extendable in a future version.

Provided crawlers

By default, the extension provides one crawler for pages and one for records. Both crawlers use the domain and base URL settings from the configuration record to determine the URL for the frontend request. One of these two settings must be provided.

The PageTree crawler adds pages, starting at the page that contains the configuration record. It adds as many levels as set in the configuration record (0 means infinite). With the languages setting, you can control which languages are queued. Only publicly visible standard pages are queued, and only in the languages in which they exist. The PageTree crawler can automatically exclude pages that contain other configuration records. This is quite useful for single view pages of records such as news.
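The level setting can be illustrated with a short sketch of a depth-limited tree walk. The function name, the dictionary-based page tree and the decision to include the start page itself are assumptions of this example, not details taken from the extension.

```python
from collections import deque


def collect_pages(tree, start, levels):
    """Collect page ids breadth-first, beginning with `start`.

    tree maps a page id to the ids of its direct subpages.
    levels == 0 means no depth limit, as described above.
    """
    queued = []
    pending = deque([(start, 0)])
    while pending:
        page, depth = pending.popleft()
        queued.append(page)
        # Descend further only while the level limit allows it.
        if levels == 0 or depth < levels:
            for child in tree.get(page, []):
                pending.append((child, depth + 1))
    return queued


# Hypothetical page tree: 1 has subpages 2 and 3, 2 has 4, 4 has 5.
tree = {1: [2, 3], 2: [4], 4: [5]}
one_level = collect_pages(tree, 1, 1)
all_levels = collect_pages(tree, 1, 0)
```

Language handling, visibility checks and the exclusion of pages with other configuration records are left out here to keep the traversal itself readable.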

The Records crawler adds records of the chosen table that are stored on the selected record storage pages. The recursion level for the storage folders can be set in the configuration (the default is 1: only the current level). The languages can also be set. Records are not queued in languages in which they do not exist. The query string is used to build the frontend request ({field:uid} is replaced with the record's uid).
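The placeholder replacement in the query string might look roughly like the sketch below. Only {field:uid} is documented above; handling any {field:<name>} placeholder the same way is an assumption of this example, as are the function name and the sample query string.

```python
import re

# Matches placeholders such as {field:uid}. Supporting field names other
# than uid is an assumption of this sketch.
PLACEHOLDER = re.compile(r"\{field:([A-Za-z0-9_]+)\}")


def build_query_string(template, record):
    """Replace every {field:<name>} with the matching value of the record."""

    def substitute(match):
        return str(record[match.group(1)])

    return PLACEHOLDER.sub(substitute, template)


# Illustrative query string for a news single view.
query = build_query_string("&tx_news_pi1[news]={field:uid}", {"uid": 42})
```

A field that is missing from the record raises a KeyError in this sketch, which makes a misconfigured query string visible early instead of producing a broken URL.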

Quickstart

  1. Install the extension in whichever way you prefer (TER, git submodule, composer) and activate it in the extension manager.
  2. Create one or more configuration records, depending on what you want to have indexed.
  3. Create a Queue task in the scheduler module. The Queue task should be run once a day.
  4. Create a Process task. The Process task can be run every x minutes, so that it processes the queue step by step. The Process task also shows the current status of the queue.

Contact & contribution

This wiki is not meant to be complete. If you are missing something, feel free to write me an email. Open an issue if you think you have found a bug.

If you would like to contribute, you are welcome to fork the repository and create a pull request.
