Plone Extract

Contents

Introduction
Extraction Formats
Testing Exports
- Extract versus Plone
- Plone versus Extract

Introduction

A common use case in Plone is content extraction, either to another Plone site or to another framework. Existing Plone export methods are either incomplete or rely heavily on complex XML structures. This project focuses on keeping things simple. Some quick notes:

- Rather than traversing the content of the system, a simple portal_catalog query against all content will be performed.
- Complex data will be stored in dictionaries that reside in lists.
- Content types will be stored in one file, users in another, and so forth. Rendering multiple types of data in one file adds too much complexity.
- The export format is JSON.
- We need tests to ensure content is accurately exported.
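
As a rough illustration of the "one file per kind of data" note, the sketch below writes each dataset to its own JSON file. The function name `write_extract` and the file naming scheme (`content.json`, `users.json`) are assumptions made for this example, not part of the project.

```python
import json
from pathlib import Path


def write_extract(directory, datasets):
    """Write each named dataset to its own JSON file and return the paths.

    `datasets` maps a name (hypothetical, e.g. "content" or "users")
    to a list of dictionaries.
    """
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, rows in datasets.items():
        path = out_dir / f"{name}.json"
        # One kind of data per file keeps each file's structure trivial.
        path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `write_extract("export", {"content": [...], "users": [...]})` would produce `export/content.json` and `export/users.json`.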

Extraction Formats

Content Type Extract Format

The list of content objects will be generated by a simple portal_catalog query. Each content object will be represented by a Python dict, and these representations will be stored in a Python list.

Since Plone content types have many of the metadata fields defined by Dublin Core as their standard fields, the extract will include those fields (title, description, creator, etc.). It will add these as well:

id:	More than just a standard Zope id, this displays the location of the content in the Plone hierarchy. So instead of just `my-content` you would get `root/major-content-section/sub-content-section/my-content`. From this you can infer the location of the content within the architecture of the site.
content_type:	This field provides the type of the content object. In Plone terms this would be the `portal_type`.
workflow_state:	This field displays the current workflow state of the content object. In Plone terms this would be the `review_state`.
custom_fields:	This field provides a dictionary that lists the names of all custom fields and the content within.
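
A minimal sketch of how one record in this format might be assembled. It assumes a catalog-brain-like object exposing `getPath()`, `Title`, `Description`, `Creator`, `portal_type` and `review_state`; the helper name `content_record` is hypothetical, and a `SimpleNamespace` stands in for a real catalog result so the example runs without Plone. Custom fields are passed in separately here, since how they are collected is not specified above.

```python
import json
from types import SimpleNamespace


def content_record(brain, custom_fields=None):
    """Build one extract record from a catalog-brain-like object."""
    return {
        # Full path rather than the bare Zope id, without the leading slash.
        "id": brain.getPath().lstrip("/"),
        "title": brain.Title,
        "description": brain.Description,
        "creator": brain.Creator,
        "content_type": brain.portal_type,
        "workflow_state": brain.review_state,
        "custom_fields": dict(custom_fields or {}),
    }


# Stand-in for a catalog result (assumption: illustrative values only).
fake_brain = SimpleNamespace(
    getPath=lambda: "/root/major-content-section/sub-content-section/my-content",
    Title="My Content",
    Description="An example page",
    Creator="admin",
    portal_type="Document",
    review_state="published",
)

# Records are plain dicts collected in a list, then dumped as JSON.
records = [content_record(fake_brain, {"subtitle": "An example"})]
extract_json = json.dumps(records, indent=2)
```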

User Extract Format

Each user will be represented by a Python dict in a Python list.

References/Relations Extract Format

References and relations between objects will be stored in a list of tuples. Each tuple has three elements (source_id, target_id, type):

source_id:	The source_id identifies the object that is the source of the relationship.
target_id:	The target_id identifies the object that is the target of the relationship.
type:	The type tells you whether this is a `reference`, a `relationship`, or some other custom method of connecting two objects in a non-hierarchical manner.
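
One detail worth keeping in mind when these tuples go through JSON: JSON has no tuple type, so each tuple is written as an array and comes back as a list. The sketch below, using made-up ids, shows the round trip and one way to restore tuples on load.

```python
import json

# Illustrative data only; the ids are hypothetical.
relations = [
    ("root/news/item-a", "root/docs/page-b", "reference"),
    ("root/docs/page-b", "root/images/logo", "relationship"),
]

# Tuples are serialized as JSON arrays.
serialized = json.dumps(relations)

# Loading yields lists, so convert back if tuples are expected.
restored = [tuple(item) for item in json.loads(serialized)]
```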

Testing Exports

The problem with testing the validity of the extract is that assertions need a second, independent way to fetch the same data. The most obvious method is to use HTTP GET to fetch page content, which can then be scraped with a library such as html5lib.

Note that this will only work against a functioning Plone site. Since that is the scope of my current effort, that is acceptable. However, this approach might not be valid if your site is non-functioning and you want to confirm that your extract was done accurately.

Extract versus Plone

This approach loops through the extract and checks each content object against the associated content in Plone.
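
A minimal sketch of such a loop, under several assumptions: each record carries `id` and `title` keys, the page URL is the site's base URL plus the record id, and the page's `<title>` element contains the content title. It uses the standard library's `html.parser` to stay self-contained; html5lib could be swapped in for more robust parsing. The names `TitleParser`, `fetch_html` and `find_mismatches` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collect the text inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url):
    """Fetch a page over HTTP GET and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def find_mismatches(records, base_url, fetch=fetch_html):
    """Return ids of records whose title is not found in the live page title."""
    mismatches = []
    for record in records:
        parser = TitleParser()
        parser.feed(fetch(base_url.rstrip("/") + "/" + record["id"]))
        if record["title"] not in parser.title:
            mismatches.append(record["id"])
    return mismatches
```

The `fetch` parameter is injectable so the comparison logic can be exercised against canned HTML without a running site.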

Plone versus Extract

This approach involves using a spider to compare content and links in Plone against the extract. Webcheck seems like an ideal tool for this and I'll be examining how much work is involved in migrating out the data.
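
Whatever spider ends up being used, the comparison step itself can be quite small. The sketch below assumes the crawl has already produced a collection of site-relative paths and that each extract record has an `id` key; `compare_paths` is a hypothetical helper, and the crawling itself is not shown.

```python
def compare_paths(crawled_paths, records):
    """Report paths seen by the spider but absent from the extract, and vice versa."""
    extracted = {record["id"] for record in records}
    # Normalize crawled paths so "/root/a/" and "root/a" compare equal.
    crawled = {path.strip("/") for path in crawled_paths}
    return {
        "missing_from_extract": sorted(crawled - extracted),
        "missing_from_site": sorted(extracted - crawled),
    }
```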
