![Build Status](http://gitlab.ttree.ch:8080/buildStatus/icon?job=OSS ContentInsight Master Commit)
This TYPO3 Flow package provider a CLI tools to extract Content Inventory CSV from existing website.
This package is under development and considered beta. This package require Flow 2.3.
- Extract website structure and basic meta data
- Support crawling presets
- Flexible report building (include a CSV report builder, but you can register your own report builder)
- Skip URI with regular expression
- Sort inventory based on document tree structure
- Generate human readable page ID (like, 1, 1.1, 1.2, 2, 2.1, 2.2, ...)
- Update report / multiple index support
- Get analytics data from Google Analytics
Check the Configuration/Settings.yaml
for detailed configurations.
By default, this package cache all Raw HTTP request for one day. You can change this settings in you own
Settings.yaml
and Caches.yaml
.
The base preset is automatically merged with all preset. You can enabled or disabled any property with the settings
presets.[preset_name].properties.[property_name].enabled
.
Ttree:
ContentInsight:
presets:
'*':
properties:
'pageTitle':
enabled: TRUE
'navigationTitle':
enabled: TRUE
You can define custom preset to crawle different kind of informations. With the class
setting you can use your
own processor implementation to get information from the current URI. Your processor must implement
Ttree\ContentInsight\CrawlerProcessor\ProcessorInterface
:
Ttree:
ContentInsight:
presets:
'custom':
properties:
'pageTitle':
class: 'Your\Package\CrawlerProcessor\PageTitleProcessor'
'metaDescription':
enabled: TRUE
'metaKeywords':
enabled: TRUE
'firstLevelHeader':
enabled: TRUE
The package support CSV reporting, but you can register your own Report builder. Check the Settings.yaml
:
Ttree:
ContentInsight:
presets:
'custom':
reportConfigurations:
'csv':
enabled: TRUE
renderType: 'Csv'
renderTypeOptions:
displayColumnHeaders: TRUE
reportPath: '%FLOW_PATH_DATA%Reports/Ttree.ContentInsight'
reportPrefix: 'content-inventory-report'
properties:
'id':
label: 'ID'
'pageTitle':
label: 'Page Title'
'navigationTitle':
label: 'Navigation Title'
'externalLink':
label: 'External Link'
postProcessor: 'Boolean'
'currentUri':
label: 'URL'
'metaDescription':
label: 'Meta Description'
'metaKeywords':
label: 'Meta Keywords'
'firstLevelHeaderCount':
label: 'Main Header Count (H1)'
'firstLevelHeaderContent':
label: 'Main Header Content (H1)'
'remark':
label: 'Crawling Remark'
The keys in the properties
section must match the key produced by the CrawlerProcessor
object.
The position of each column could be specified with the following syntax : position: '<position-string>'
The <position-string>
supports one of the following syntax:
start (<weight>)
end (<weight>)
before <key> (<weight>)
after <key> (<weight>)
<numerical-order>
Ttree:
ContentInsight:
presets:
'custom':
reportConfigurations:
'csv':
enabled: TRUE
renderType: 'Csv'
renderTypeOptions:
displayColumnHeaders: TRUE
reportPath: '%FLOW_PATH_DATA%Reports/Ttree.ContentInsight'
reportPrefix: 'content-inventory-report'
properties:
'id':
label: 'ID'
position: '<position-string>',
'pageTitle':
label: 'Page Title'
position:'<position-string>'
For a single crawling preset you can register multiple reports if required. Foreach property you can register a post
processor if you need to manipulate the property in the report, see BooleanPostProcessor
for a basic example.
You can define invalid URIs patterns in your crawling presets:
Ttree:
ContentInsight:
presets:
'custom':
invalidUriPatterns:
'javascript':
pattern: '@^javascript\:void\(0\)$@'
'mailto':
pattern: '@^mailto\:.*@'
'anchor':
pattern: '@^#.*@'
message: 'Link to anchor'
If the pattern has a message
all URL matching the pattern will be logged. By default the crawler skip
those URLs silently.
To get the complete website inventory:
# flow contentinventor:extract --base-url http://www.domain.com
Or to limit the crawler to a part of the website
# flow contentinventor:extract --base-url http://www.domain.com/products
You can select a crawling presets
# flow contentinventor:extract --base-url http://www.domain.com/products --preset default