A project is a collection of docrepos and configuration that is used to make a useful web site. The first step in creating a project is running ferenda-setup <projectname>.
A project is primarily defined by its configuration file at <projectname>/ferenda.ini, which specifies which docrepos are used, and holds settings for each of them as well as settings for the entire project.
A project is managed using the ferenda-build.py tool.
If you use the API instead of these command line tools, there is no concept of a project except for what your code provides. Your client code is responsible for creating the docrepo classes and providing them with proper settings. These can be loaded from a ferenda.ini-style file, be hard-coded, or handled in any other way you see fit.
Note
Ferenda uses the layeredconfig module internally to handle all settings.
A Ferenda docrepo object can be configured in two ways. The first is when creating the object, e.g.:
d = DocumentSource(datadir="mydata", loglevel="DEBUG", force=True)
Note
Parameters that are not provided when creating the object default to the built-in configuration values (see below).
Or it can be configured using the ferenda.LayeredConfig class, which takes configuration data from three places:
- built-in configuration values (provided by ferenda.DocumentRepository.get_default_options)
- values from a configuration file (normally ferenda.ini, placed alongside ferenda-build.py)
- command-line parameters, e.g. --force --datadir=mydata
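import sys
from ferenda import LayeredConfig

d = DocumentSource()
# Layer the three sources: built-in defaults, config file, command line
d.config = LayeredConfig(defaults=d.get_default_options(),
                         inifile="ferenda.ini",
                         commandline=sys.argv)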
(This is what ferenda-build.py does behind the scenes.)
Configuration values from the configuration file override built-in configuration values, and command-line parameters override configuration file values.
By setting the config property, you override any parameters provided when creating the object.
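The precedence between the three sources can be illustrated with the standard library's collections.ChainMap. This is only a sketch of the lookup order, not how layeredconfig is implemented, and the values below are made up for illustration rather than taken from Ferenda's real defaults.

```python
from collections import ChainMap

# Hypothetical values for illustration only.
builtin_defaults = {"datadir": "data", "loglevel": "INFO", "force": False}
inifile_values = {"datadir": "mydata", "loglevel": "DEBUG"}  # as if read from ferenda.ini
commandline_values = {"force": True}                         # as if parsed from --force

# ChainMap searches its mappings from left to right, so the source with the
# highest priority goes first: command line, then config file, then defaults.
config = ChainMap(commandline_values, inifile_values, builtin_defaults)

print(config["datadir"])   # found in the config file layer
print(config["loglevel"])  # found in the config file layer
print(config["force"])     # found in the command-line layer
```

A key that is missing from the higher layers simply falls through to the built-in defaults.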
These are the normal configuration options:
option | description | default |
---|---|---|
datadir | Directory for all downloaded/parsed etc files | 'data' |
patchdir | Directory containing patch files used by patch_if_needed | 'patches' |
parseforce | Whether to re-parse downloaded files, even if resulting XHTML1.1 files exist and are newer than downloaded files | False |
compress | Whether to compress intermediate files. Can be either an empty string (don't compress) or 'bz2' (compress using bz2). | |
serializejson | Whether to serialize document data as a JSON document in the parse step. | False |
generateforce | Whether to re-generate browser-ready HTML5 files, even if they exist and are newer than all dependencies | False |
force | If True, overrides both parseforce and generateforce. | False |
fsmdebug | Whether to display debugging information from FSMParser | False |
refresh | Whether to re-download all files even if previously downloaded. | False |
lastdownload | The datetime when this repo was last downloaded (stored in conf file) | None |
downloadmax | Maximum number of documents to download (None means download all of them). | None |
conditionalget | Whether to use Conditional GET (through the If-Modified-Since and/or If-None-Match headers) | True |
url | The basic URL for the created site, used as template for all managed resources in o (see | |
fulltextindex | Whether to index all text in a fulltext search engine. Note: This can take a lot | |
useragent | The user-agent used with any external HTTP requests. Please change this into something containing your contact info. | 'ferenda-bot' |
storetype | Any of the supported types: 'SQLITE', 'SLEEPYCAT', 'SESAME' or 'FUSEKI'. See external-triplestore. | 'SQLITE' |
storelocation | The file path or URL to the triple store, dependent on the storetype | 'data/ferenda.sqlite' |
storerepository | The repository/database to use within the given triple store (if applicable) | 'ferenda' |
indextype | Any of the supported types: 'WHOOSH' or 'ELASTICSEARCH'. See external-fulltext. | 'WHOOSH' |
indexlocation | The location of the fulltext index | 'data/whooshindex' |
republishsource | Whether the Atom files should contain links to the original, unparsed, source s | False |
combineresources | Whether to combine and minify all css and js files into a single file each | False |
cssfiles | A list of all required css files | ['http://fonts.googleapis.com/css?family=Raleway:200,100', 'css/normalize.css', 'css/main.css', |
jsfiles | A list of all required js files | ['js/jquery-1.9.0.js', 'js/modernizr-2.6.2-respond-1.1.0.min.js', 'js/ferenda.js'] |
staticsite | Whether to generate static HTML files suitable for offline usage (removes nd uses relative file paths of canonical URIs) | False |
legacyapi | Whether the REST API should provide a simpler API for legacy clients. See wsgi. | False |
A document repository (docrepo for short) is a class that handles all aspects of a document collection: downloading the documents (or acquiring them in some other way), parsing them into structured documents, and then re-generating HTML documents with added niceties, for example references from documents in other docrepos.
You add support for a new collection of documents by subclassing ferenda.DocumentRepository. For more details, see createdocrepos.
A ferenda.Document is the main unit of information in Ferenda. A document is primarily represented in serialized form as an XHTML 1.1 file with embedded metadata in RDFa format, and in code by the ferenda.Document class. The class has five properties:
- meta (an RDFLib rdflib.graph.Graph)
- body (a tree of building blocks, normally instances of ferenda.elements classes, representing the structure and content of the document)
- lang (an IETF language tag, e.g. sv or en-GB)
- uri (a string representing the canonical URI for this document)
- basefile (a short internal id)
The method ferenda.DocumentRepository.render_xhtml (which is called automatically, as long as your parse method uses the ferenda.decorators.managedparsing decorator) renders a ferenda.Document object into an XHTML 1.1+RDFa document.
Documents, and parts of documents, in ferenda have a couple of different identifiers, and it's useful to understand the difference and relation between them.
- basefile: The internal id for a document. This is internal to the document repository and is used as the base for the filenames of the stored files. The basefile isn't totally random and is expected to have some relationship with a human-readable identifier for the document. As an example from the RFC docrepo, the basefile for RFC 1147 would simply be "1147". By the rules encoded in ferenda.DocumentStore, this results in the downloaded file rfc/downloads/1147.txt, the parsed file rfc/parsed/1147.xhtml and the generated file rfc/generated/1147.html. Only documents themselves, not parts of documents, have basefile identifiers.
- uri: The canonical URI for a document or a part of a document (generally speaking, a resource). This identifier is used when storing data related to the resource in a triple store and a fulltext search engine, and is also used as the external URL for the document when republishing (see wsgi and also parsing-uri). URIs for documents can be set by setting the uri property of the Document object. URIs for parts of documents are set by setting the uri property on any ferenda.elements based object in the body tree. When rendering the document into XHTML, render_xhtml creates RDFa statements based on this property and the meta property.
- dcterms:identifier: The human-readable identifier for a document or a part of a document. If the document has an established human-readable identifier, such as "RFC 1147" or "2003/98/EC" (the EU directive on the re-use of public sector information), the dcterms:identifier is used for this. Unlike basefile and uri, this identifier isn't set directly as a property on an object. Instead, you add a triple with dcterms:identifier as the predicate to the object's meta property, see docmetadata and also DCMI Terms.
Apart from information about what a document contains, there is also information about how it has been handled, such as when a document was first downloaded or updated from a remote source, the URL from where it came, and when it was made available through Ferenda. This information is encapsulated in the ferenda.DocumentEntry class. Such objects are created and updated by various methods in ferenda.DocumentRepository. The objects are persisted to JSON files, stored alongside the documents themselves, and are used by the ferenda.DocumentRepository.news method in order to create valid Atom feeds.
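As a rough picture of what persisting such handling information to JSON involves, the sketch below round-trips a plain dictionary through the standard json module. The field names are hypothetical and chosen only to mirror the kinds of information listed above; the real DocumentEntry attributes and file layout may differ.

```python
import json
from datetime import datetime

# Hypothetical field names; not the actual DocumentEntry attributes.
entry = {
    "basefile": "bar",
    "source_url": "http://example.org/docs/bar.html",
    "first_downloaded": datetime(2013, 1, 15, 12, 0, 0).isoformat(),
    "published": None,  # not yet made available
}

# Datetimes are stored as ISO 8601 strings, since JSON has no date type.
serialized = json.dumps(entry, indent=2, sort_keys=True)
restored = json.loads(serialized)
print(serialized)
```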
During the course of processing, data about each individual document is stored in many different files in various formats. The ferenda.DocumentStore class handles most aspects of this file handling. A configured DocumentStore object is available as the store property on any DocumentRepository object.
Example: If a created docrepo object d has the alias foo, and handles a document with the basefile identifier bar, data about the document is then stored:
- When downloaded, the original data as retrieved from the remote server is stored as data/foo/downloaded/bar.html, as determined by d.store.downloaded_path.
- At the same time, a DocumentEntry object is serialized as data/foo/entries/bar.json, as determined by d.store.documententry_path.
- If the downloaded source needs to be transformed into some intermediate format before parsing (which is the case for e.g. PDF or Word documents), the intermediate data is stored as data/foo/intermediate/bar.xml, as determined by d.store.intermediate_path.
- When the downloaded data has been parsed, the parsed XHTML+RDFa document is stored as data/foo/parsed/bar.xhtml, as determined by d.store.parsed_path.
- From the parsed document, an RDF/XML file containing all RDFa statements from the parsed file is automatically distilled and stored as data/foo/distilled/bar.rdf, as determined by d.store.distilled_path.
- During the relate step, all documents which are referred to by any other document are marked as dependencies of that document. If the bar document is dependent on another document, then this dependency is recorded in a dependency file stored at data/foo/deps/bar.txt, as determined by d.store.dependencies_path.
- Just prior to the generation of browser-ready HTML5 files, all metadata in the system as a whole which is relevant to bar is serialized in an annotation file in GRIT/XML format at data/foo/annotations/bar.grit.txt, as determined by d.store.annotation_path.
- Finally, the generated HTML5 file is created at data/foo/generated/bar.html, as determined by d.store.generated_path. (This step also updates the serialized DocumentEntry object described above.)
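The layout in the list above follows one pattern: data directory, docrepo alias, processing step, then basefile plus a suffix. The sketch below reproduces that pattern with a hypothetical helper, sketch_path. It is not the DocumentStore implementation, and the suffixes are only those used in this example (a real docrepo may download .txt or .pdf files, for instance).

```python
import posixpath

# Step directory -> file suffix, as used in the example above.
LAYOUT = {
    "downloaded": ".html",
    "entries": ".json",
    "intermediate": ".xml",
    "parsed": ".xhtml",
    "distilled": ".rdf",
    "deps": ".txt",
    "annotations": ".grit.txt",
    "generated": ".html",
}


def sketch_path(datadir, alias, step, basefile):
    """Hypothetical helper: build <datadir>/<alias>/<step>/<basefile><suffix>."""
    return posixpath.join(datadir, alias, step, basefile + LAYOUT[step])


for step in LAYOUT:
    print(sketch_path("data", "foo", step, "bar"))
```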
Whenever a new version of an existing document is downloaded, an archiving process takes place when ferenda.DocumentStore.archive is called (by ferenda.DocumentRepository.download_if_needed). This method requires a version id, which can be any string that uniquely identifies a certain revision of the document. When called, all of the above files (if they have been generated) are moved into the subdirectory archive in the following way (assuming that the version id is "42"):
data/foo/downloaded/bar.html -> data/foo/archive/downloaded/bar/42.html
The method ferenda.DocumentRepository.get_archive_version is used to calculate the version id. The default implementation just provides a simple incrementing integer, but if the documents in your docrepo have a more suitable version identifier already, you should override ferenda.DocumentRepository.get_archive_version to return this.
The archive path is calculated by providing the optional version parameter to any of the *_path methods above.
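The effect of the optional version parameter can be pictured with a small stand-in function. sketch_versioned_path is hypothetical and only mirrors the downloaded-file mapping shown in this section; it does not call or reproduce the actual DocumentStore code.

```python
import posixpath


def sketch_versioned_path(basefile, version=None, datadir="data", alias="foo"):
    """Hypothetical stand-in for a *_path method with an optional version."""
    if version is None:
        # Current version: data/foo/downloaded/bar.html
        return posixpath.join(datadir, alias, "downloaded", basefile + ".html")
    # Archived version: data/foo/archive/downloaded/bar/42.html
    return posixpath.join(datadir, alias, "archive", "downloaded",
                          basefile, version + ".html")


print(sketch_versioned_path("bar"))
print(sketch_versioned_path("bar", version="42"))
```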
To list all archived versions for a given basefile, use the ferenda.DocumentStore.list_versions method.
In many cases, you don't really need to know the filename that the *_path methods return, because you only want to read from or write to it. For these cases, you can use the open_* methods instead. These work as context managers just as the builtin open function does, and can be used in the same way:
Instead of:
examples/keyconcepts-file.py
use:
examples/keyconcepts-file.py
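As a self-contained sketch of that contrast, the class below is a hypothetical stand-in (SketchStore, not ferenda.DocumentStore) with one *_path style method and one open_* style method built on contextlib. The method names follow the naming pattern described in this section; the signatures are assumptions.

```python
import contextlib
import os
import tempfile


class SketchStore:
    """Hypothetical stand-in for a store object; not ferenda.DocumentStore."""

    def __init__(self, datadir):
        self.datadir = datadir

    def downloaded_path(self, basefile):
        return os.path.join(self.datadir, "downloaded", basefile + ".html")

    @contextlib.contextmanager
    def open_downloaded(self, basefile, mode="r"):
        path = self.downloaded_path(basefile)
        # Create the parent directory on demand so callers need not care.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, mode) as fp:
            yield fp


store = SketchStore(tempfile.mkdtemp())

# Instead of computing the path and opening the file yourself ...
path = store.downloaded_path("bar")
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as fp:
    fp.write("<html>first</html>")

# ... the open_* style method hides the filename entirely.
with store.open_downloaded("bar", "w") as fp:
    fp.write("<html>second</html>")

with store.open_downloaded("bar") as fp:
    content = fp.read()
print(content)
```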
In many cases, a single file cannot represent the entirety of a document. For example, a downloaded HTML file may need a series of inline images. These can be handled as attachments by the download method. Just use the optional attachment parameter to the appropriate *_path / open_* methods:
examples/keyconcepts-attachments.py
Note
The DocumentStore object must be configured to handle attachments by setting the storage_policy property to dir. This alters the behaviour of all *_path methods, so that e.g. the main downloaded path becomes data/foo/downloaded/bar/index.html instead of data/foo/downloaded/bar.html.
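The change in layout can be sketched with a hypothetical function, sketch_attachment_path. The "dir" value and the index.html name come from the note above; the "file" value used for the default policy, the attachment name logo.png and the error raised are assumptions made for illustration.

```python
import posixpath


def sketch_attachment_path(basefile, attachment=None, storage_policy="file",
                           datadir="data", alias="foo"):
    """Hypothetical stand-in for downloaded_path with an attachment parameter."""
    base = posixpath.join(datadir, alias, "downloaded", basefile)
    if storage_policy == "dir":
        # Each document gets its own directory; the main file is index.html.
        return posixpath.join(base, attachment or "index.html")
    if attachment is not None:
        raise ValueError("attachments require storage_policy='dir'")
    return base + ".html"


print(sketch_attachment_path("bar"))
print(sketch_attachment_path("bar", storage_policy="dir"))
print(sketch_attachment_path("bar", attachment="logo.png", storage_policy="dir"))
```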
To list all attachments for a document, use the ferenda.DocumentStore.list_attachments method.
Note that only some of the *_path / open_* methods support the attachment parameter (it doesn't make sense to have attachments for DocumentEntry files or distilled RDF/XML files).
Whenever Ferenda needs any resource file, e.g. an XSLT stylesheet, a SPARQL query template or some RDF triples in a Turtle (.ttl) file, it uses a ferenda.ResourceLoader instance to look in a series of different "system" directories.
By placing files in the correct directories, and optionally configuring the loadpath config option, you can substitute your own resource file if the system versions aren't to your liking.
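The general idea of a search path, where the first directory containing the requested file wins, can be sketched as follows. sketch_find_resource is a hypothetical function, not the ResourceLoader API, and both the directory order (project directory before system directory) and the file name xsl/generic.xsl are assumptions made for the example.

```python
import os
import tempfile


def sketch_find_resource(name, loadpath):
    """Hypothetical lookup: return the first match along loadpath."""
    for directory in loadpath:
        candidate = os.path.join(directory, name)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(name)


# Two temporary directories stand in for a system and a project location.
system_dir = tempfile.mkdtemp()
project_dir = tempfile.mkdtemp()
for directory, text in ((system_dir, "system"), (project_dir, "project")):
    os.makedirs(os.path.join(directory, "xsl"))
    with open(os.path.join(directory, "xsl", "generic.xsl"), "w") as fp:
        fp.write(text)

resource = os.path.join("xsl", "generic.xsl")
found = sketch_find_resource(resource, [project_dir, system_dir])
with open(found) as fp:
    winner = fp.read()
print(winner)
```

Listing the project directory first is what lets a local file shadow the system version of the same name.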