Matt Senate edited this page Jun 13, 2014 · 16 revisions

## Summary

This is a simple specification (spec) for the design and implementation of our new OA-signal bot. The spec is intended to remain more or less human-readable; key terms should be defined here or be easily checked via Wikipedia, etc.

## Flow

  • A DOI is cited somewhere on Wikipedia, or a manual request is made to update a DOI citation (see Producers below).
  • We fetch the full text and media from PubMed Central (PMC).
  • We use the JATS-to-MediaWiki conversion library to convert article XML to wikitext.
  • We upload full wikitext to Wikisource.
  • We upload images and other media files (limited to accepted file types) to Wikimedia Commons.
  • We start a Wikidata item with article metadata and suitable statements.
  • Lastly, we signal availability of Wikisource, Wikimedia Commons, Wikidata materials in references cited elsewhere on Wikimedia projects, starting with the English Wikipedia.
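
As a rough illustration, the flow above could be sketched as a pipeline of steps. Every function below is a hypothetical stand-in (not an actual module API) for the real PMC, converter, and PyWikiBot calls:

```python
# Minimal, runnable sketch of the end-to-end flow. All step functions
# are placeholders; real implementations would call the PMC API, the
# JATS-to-MediaWiki XSLT converter, and PyWikiBot.

def fetch_from_pmc(ctx):
    ctx["xml"] = "<article/>"            # stand-in for PMC full text + media
    return ctx

def convert_jats(ctx):
    ctx["wikitext"] = "== Article =="    # stand-in for XSLT output
    return ctx

def upload_to_wikisource(ctx):
    ctx["wikisource_page"] = "Page:" + ctx["doi"]  # stand-in page title
    return ctx

def signal_availability(ctx):
    ctx["signalled"] = True              # stand-in for on-wiki signalling
    return ctx

PIPELINE = [fetch_from_pmc, convert_jats, upload_to_wikisource, signal_availability]

def process_doi(doi):
    """Walk one cited DOI through every stage of the pipeline."""
    ctx = {"doi": doi}
    for step in PIPELINE:
        ctx = step(ctx)
    return ctx
```

Keeping each stage as a separate function should make it easy to retry or log a single failing stage without rerunning the whole flow.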

## Infrastructure

We're making some assumptions here about what tools we're going to use:

  • PubMed Central (PMC) API - archive of article source files and metadata (reusing existing OAMI code as appropriate).
  • CrossRef API - article license data by DOI.
  • Linux (Unix-like) server
  • Python programming language
    • Virtual Environment (virtualenv) for managing packages with pip and a requirements.txt file
    • Modular development (python style)
    • Object-oriented development
    • Multi-threading inside a single Python process.
    • Core Python persistence and data modules (shelve, pickle, etc.)
      • If we have any trouble with shelve, the alternative is to use MongoDB instead.
    • The deque class from the collections core module for the queue system (enables working on both ends of the queue).
    • Publisher/Subscriber (Pub/Sub) paradigm (internally referred to as Producer/Consumer for clarity)
    • PyWikiBot for Mediawiki (Wikimedia project) interface
    • Other various appropriate libraries (python modules), specified in requirements.txt file
  • JATS-to-MediaWiki XSLT converter
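
The double-ended behaviour that motivates deque can be sketched with Python's `collections.deque`; the DOIs here are placeholders:

```python
from collections import deque

# Sketch of the queue layer: ordinary citations join at the right,
# while "jump the queue" requests are pushed to the left, and the
# consumer always takes from the left.
queue = deque()

queue.append("doi:10.1000/a")        # normal producer: enqueue at the right
queue.append("doi:10.1000/b")
queue.appendleft("doi:10.1000/c")    # priority request: jumps to the front

first = queue.popleft()              # consumer takes from the left
```

Here `first` is the priority item, while the two normal items remain queued in their original order.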

## Structure

For the application itself, we'll use the following layers of abstraction:

  • Data
    • Store as plain text, on disk (shelve, pickle), or in memory (deque)
  • Logging and Error-handling
    • Log useful, fully specified messages.
    • Handle errors gracefully with try/except, timeouts, max attempts, etc.
  • Queue
    • Use "Double-ended Queue" (Deque)
    • The queue manages a deque of articles (referenced merely by some ID) to be handled by the application.
  • Producers
    • Run multiple threads to handle input streams and feed them into the Queue
      • Primary stream - "Listen for New Citations", probably by making a regular, narrow WMFLabs SQL-replica query.
      • Secondary stream - "Jump the Queue" via a user-submitted POST request (or similar), e.g. through an on-wiki web form requesting a pass over a particular citation.
        • Best practice at this point may be to run a simple REST API (such as with flask-shelve), host a simple HTML form for submission, and include the form in a wiki page using an `<iframe>` or similar.
        • Alternatively, if we move to MongoDB instead of shelve, we may want to use the Eve REST API framework.
  • Consumers
    • Run multiple threads to source from the Queue and handle output streams (only one planned for now).
      • Primary stream - "Publish Article Reference, Source Content, and Metadata" to MediaWiki instances (namely Wikisource, Wikipedia, etc.). Requires the following distinct functions:
        • Download
          • Use the JATS-to-MediaWiki handler script, ported to an internal class.
        • Convert
          • Use the JATS-to-MediaWiki converter, porting the handler script to an internal class.
        • Upload
          • Use the OAMI bot (or a custom fork) to upload media to Commons.
          • Upload to Wikisource, extending the upload script by @notconfusing.
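
The Producer/Consumer split above can be sketched with Python threads around a shared deque. Everything here is a stub: a real producer would poll the SQL replica or REST endpoint, and a real consumer would run the download/convert/upload functions:

```python
import threading
from collections import deque

# Minimal producer/consumer sketch for the structure described above.
# The work itself is stubbed out so the threading pattern is visible.

queue = deque()
cond = threading.Condition()   # guards the queue and wakes the consumer
results = []
DONE = object()                # sentinel telling the consumer to stop

def producer(article_ids):
    """Stand-in for the 'Listen for New Citations' stream."""
    for article_id in article_ids:
        with cond:
            queue.append(article_id)
            cond.notify()
    with cond:
        queue.append(DONE)
        cond.notify()

def consumer():
    """Stand-in for the download/convert/upload worker."""
    while True:
        with cond:
            while not queue:
                cond.wait()
            item = queue.popleft()
        if item is DONE:
            break
        results.append("processed:" + item)  # placeholder for real work

c = threading.Thread(target=consumer)
p = threading.Thread(target=producer, args=(["PMC1", "PMC2", "PMC3"],))
c.start()
p.start()
p.join()
c.join()
```

With a single consumer thread, articles are processed in arrival order; multiple consumers would trade that ordering for throughput.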

## Maintenance

  • The WikiProject Open Access Signalling team is funded to maintain the bot through September 2014. Thereafter, our commitment to Open Access will drive us to maintain it in an unofficial capacity. We will also do our best to document the bot and make it easier for other volunteer developers to maintain.