A common use case in Plone is content extraction, either to another Plone site or to another framework. Existing Plone export methods are either incomplete or rely heavily on complex XML structures. This project will focus on keeping things simple and easy. Some quick notes:
- Rather than traversing the content of the system, a simple portal_catalog query against all content will be performed.
- Complex data will be stored in dictionaries that reside in lists.
- Content types will be stored in one file, users in another, and so forth. Rendering multiple types of data in one file adds too much complexity.
- The export format is JSON.
- We need tests to ensure content is accurately exported.
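The notes above can be sketched roughly as follows: each kind of data is a list of dictionaries, and each list is written to its own JSON file. The `write_export` helper, the file names, and the sample records are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile


def write_export(datasets, directory):
    """Write each dataset (a list of dicts) to its own JSON file.

    `datasets` maps a hypothetical base name such as "content" or
    "users" to the list of records for that kind of data.
    """
    paths = {}
    for name, records in datasets.items():
        path = os.path.join(directory, name + ".json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(records, handle, indent=2, sort_keys=True)
        paths[name] = path
    return paths


# Illustrative data only; a real export would fill these from the site.
datasets = {
    "content": [{"id": "root/news/my-content", "title": "My content"}],
    "users": [{"id": "jdoe", "fullname": "Jane Doe"}],
}
export_dir = tempfile.mkdtemp()
paths = write_export(datasets, export_dir)
```

Keeping one file per kind of data means a consumer can load only the part it needs, for example the users file, without parsing the rest.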
The list of content types will be generated by a simple portal_catalog() query. Each content type object will be represented by a Python dict, and these content type representations will be stored in a Python list.
Since Plone content types have many of the metadata fields defined by Dublin Core as their standard fields, the extract will include those fields (title, description, creator, etc.). It will add these as well:
|id:||More than just a standard Zope id, this displays the location of the content in the Plone hierarchy. So instead of just my-content you would get root/major-content-section/sub-content-section/my-content. From this you can infer the location of the content within the architecture of the site.|
|content_type:||This field provides the type of the content. In Plone terms this would be the portal_type.|
|workflow_state:||This field displays the current workflow state of the content. In Plone terms this would be the review_state.|
|custom_fields:||This field provides a dictionary that lists the names of all custom fields and the content within.|
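As a minimal sketch of one such record, the function below turns a catalog-result-like object into a dict with the fields described above. It assumes the result exposes `getPath()` plus `Title`, `Description`, `Creator`, `portal_type`, and `review_state` as catalog metadata, which is the case in a default Plone catalog as far as I know. `FakeBrain` and `brain_to_record` are hypothetical names, and the custom field values are assumed to be collected separately from the full object.

```python
import json


class FakeBrain:
    """Stand-in for a catalog result so the sketch runs without Plone."""

    def __init__(self, path, **metadata):
        self._path = path
        self.__dict__.update(metadata)

    def getPath(self):
        return self._path


def brain_to_record(brain, custom_fields=None):
    """Build one export dict from a catalog-result-like object."""
    return {
        # Full path in the site instead of the short Zope id.
        "id": brain.getPath().strip("/"),
        "title": brain.Title,
        "description": brain.Description,
        "creator": brain.Creator,
        "content_type": brain.portal_type,
        "workflow_state": brain.review_state,
        # Custom field values are not catalog metadata; the caller is
        # assumed to gather them from the full object.
        "custom_fields": dict(custom_fields or {}),
    }


brains = [
    FakeBrain(
        "/root/major-content-section/sub-content-section/my-content",
        Title="My content",
        Description="An example page",
        Creator="jdoe",
        portal_type="Document",
        review_state="published",
    )
]
records = [brain_to_record(b, {"subtitle": "Example"}) for b in brains]
print(json.dumps(records, indent=2))
```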
Each user will be represented by a Python dict in a Python list.
References and Relations between objects will be stored in a list of tuples. Tuples will have three elements (source_id, target_id, type):
|source_id:||The source_id is the object that is the source of the relationship.|
|target_id:||The target_id is the object that is the target of the relationship.|
|type:||The type tells you if this is a reference, a relationship, or some other custom method of connecting two objects to each other in a non-hierarchical manner.|
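A small sketch of this structure, with made-up ids and a hypothetical `targets_of` helper. One detail worth noting: JSON has no tuple type, so Python's `json` module writes each tuple as an array, and a consumer that wants tuples back has to convert them after loading.

```python
import json

# Hypothetical relations: (source_id, target_id, type).
relations = [
    ("root/news/my-content", "root/events/launch", "reference"),
    ("root/news/my-content", "root/people/jdoe", "relationship"),
]

# Each tuple is serialized as a JSON array.
serialized = json.dumps(relations)

# Convert the arrays back to tuples when loading.
restored = [tuple(item) for item in json.loads(serialized)]


def targets_of(source_id, relations):
    """Return (target_id, type) pairs for one source object."""
    return [
        (target, kind)
        for source, target, kind in relations
        if source == source_id
    ]
```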
The problem with testing the validity of the extract is that an alternative method of fetching the data is needed in order to make valid assertions about the exported data. The most obvious method is to use HTTP GET to fetch page content, which can then be scraped with a library such as html5lib.
Note that this will only work against a functioning Plone site. Since that is the scope of my current effort, that is acceptable. However, this approach might not be valid if your site is non-functioning and you want to confirm that your extract was done accurately.
This approach loops through the extract and checks each content object against the associated content in Plone.
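A rough sketch of such a loop, under several assumptions: only the title is compared, the exported title is expected to appear inside the page's `<title>` element, and the standard library's `html.parser` is used in place of html5lib so the example runs without extra dependencies. `check_extract`, `page_title`, and the `fetch_html` callable are hypothetical; in a real run `fetch_html` would issue an HTTP GET against the live site.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def check_extract(records, fetch_html):
    """Return ids whose exported title is not in the page title."""
    mismatches = []
    for record in records:
        html = fetch_html(record["id"])
        if record["title"] not in page_title(html):
            mismatches.append(record["id"])
    return mismatches


# Canned pages stand in for HTTP GET responses from the live site.
pages = {
    "root/news/my-content":
        "<html><head><title>My content - Site</title></head></html>",
    "root/news/old":
        "<html><head><title>Renamed page - Site</title></head></html>",
}
records = [
    {"id": "root/news/my-content", "title": "My content"},
    {"id": "root/news/old", "title": "Old title"},
]
mismatches = check_extract(records, lambda content_id: pages[content_id])
```

Passing the fetch function in as an argument keeps the comparison logic testable without a running site; the same loop could be extended to compare description or body text.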
This approach involves using a spider to compare content and links in Plone against the extract. Webcheck seems like an ideal tool for this and I'll be examining how much work is involved in migrating out the data.