This repository has been archived by the owner on Sep 26, 2019. It is now read-only.

Data Model & Tools Config

nonword edited this page May 20, 2015 · 14 revisions

Data Model

A first-pass distillation of the Scribe data model as shared by the Ruby and JS code.

Project

The project is the top-level element. It defines project properties, site pages, and workflows. There SHOULD be only one project. It contains the following fields:

  • description: Text
  • producer: String
  • title: String
  • workflows: Array of WORKFLOWs

Workflow

Supported workflows are Mark, Transcribe, and Verify. Their common properties are:

  • key: String - (Established as key of the task in the tasks hash in the workflow json.) Unique alphanumeric key for this workflow. Must be one of 'mark','transcribe','verify' for now.
  • label: String - Friendly name for workflow (e.g. "Mark Stuff!")
  • first_task: TASK key of first task to invoke. If the subject_type of the subject loaded matches a task key in the current workflow, that task will be loaded first instead.
  • tasks: Hash mapping task keys to TASKs
  • delete_limit: Int - Threshold for removal; Number of people that must say "This is invalid" about a given subject before its status is set to "invalid". This status is similar to "retired" and "complete" in that the subject will be hidden from many views. See also Subject#deleting_users. Default null, indicating subjects can not be deleted.
  • retire_limit: Int - Threshold for retiring the subject operated on in a given workflow. Really only relevant to the Mark workflow, where retire_limit is the number of times we require someone to say "There is nothing left to mark on this doc". In the Transcribe and Verify workflows, retire_limit is ignored in favor of generates_subjects_after.
  • generates_subjects: Bool - Indicates that some submitted classifications may generate secondary/tertiary subjects. Default true.
  • generates_subjects_for: String - Name of next workflow (if any) to associate with generated subjects, e.g. 'transcribe','verify', or (default) null, which indicates there is no next workflow.
  • generates_subjects_after: Int - Number of classifications a generated subject must represent before it's activated for the next workflow. In Transcribe and Verify workflow, upon activating a generated subject, the parent subject acquires status "complete". Default 1.
  • generates_subjects_max: Int - Max number of distinct annotations that a generated subject may represent before we mark its status 'contentious'. Default 10.
  • generates_subjects_method: String - Available options are:
    • one-per-classification: Indicates that the submitted classification's annotation should be used to generate a single subject without considering any other classifications submitted for that subject. Used in Mark when generating subjects for Transcribe.
    • collect-unique: Indicates that the generated subject should assemble all distinct classifications for the given subject as a list. Used in Transcribe when generating subjects for Verify.
    • select-most-popular: Indicates that the generated subject should consider all classifications for the given subject and select the annotation value that is most popular. Used in Verify for generating
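
A minimal sketch of how the three methods might reduce the annotation values submitted for one subject. It assumes the annotations have already been flattened to plain comparable values; the function names are hypothetical and are not taken from the Scribe codebase.

```python
from collections import Counter

def one_per_classification(values):
    """Each classification yields its own generated subject (one value each)."""
    return [[value] for value in values]

def collect_unique(values):
    """One generated subject listing the distinct values in first-seen order."""
    return list(dict.fromkeys(values))

def select_most_popular(values):
    """One generated subject carrying the single most frequent value."""
    return Counter(values).most_common(1)[0][0]

transcriptions = ["Bond St", "Bond Street", "Bond St"]
print(one_per_classification(transcriptions))
print(collect_unique(transcriptions))
print(select_most_popular(transcriptions))
```

How ties are broken in the most-popular case is not specified by this page; the sketch simply keeps the value seen first.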

Example generates_subjects_* configurations for various workflows:

Mark:

  ...
  "generates_subjects_for": "transcribe",
  "generates_subjects_after": 1,
  "generates_subjects_method": "one-per-classification",
  "generates_subjects_max": null, # meaningless in this context because method 'one-per-classification' implies there will be at most 1 classification per generated subject
  "retire_limit": 3,
  ...

Transcribe:

  ...
  "generates_subjects_for": "verify",
  "generates_subjects_after": 3, # activate subjects for verification after 3 classifications
  "generates_subjects_max": 10, # set status of a tertiary subject to 'contentious' if 10 different transcriptions submitted
  "generates_subjects_method": "collect-unique", # assemble submitted classifications into a list of distinct choices
  "delete_limit": 3, # require 3 delete votes before deleting a mark
  ...

Verify:

  ...
  "generates_subjects_for": null, # generate subjects, but not attached to any workflow
  "generates_subjects_after": 3, # require at least 3 verifications before counting classifications ("votes" in this context)
  "generates_subjects_max": 30, # if after 30 votes, we still don't have a strong majority voting for one transcription, set status to 'contentious'
  "generates_subjects_method": "most-popular", # we select the value with the most votes
  "generates_subjects_agreement": 0.75, # Require a 3/4 majority (after 3 votes) to set subject to 'done'
  ...
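
One possible reading of the Verify settings above, sketched as a status function. This is an interpretation of the inline comments, not the actual implementation: the function name, the use of `>=` for the agreement test, and counting total votes (rather than distinct values) against generates_subjects_max are all assumptions.

```python
from collections import Counter

def verify_status(votes, after=3, max_votes=30, agreement=0.75):
    """Return 'done', 'contentious', or 'active' for a list of voted values."""
    if not votes or len(votes) < after:
        return "active"                      # too few votes to start counting
    top_count = Counter(votes).most_common(1)[0][1]
    if top_count / len(votes) >= agreement:
        return "done"                        # strong majority reached
    if len(votes) >= max_votes:
        return "contentious"                 # vote budget used up, no majority
    return "active"

print(verify_status(["Bond St", "Bond St", "Bond Street"]))
```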

Task

  • key: String - (Established as key of the task in the tasks hash in the workflow json.) Workflow-unique alphanumeric (e.g. '0','1','mark_one')
  • tool: Enum "pickOneMarkOne", "pointTool", "rectangleTool", "pickOne", "textTool", "numberTool", "dateTool", "compositeTool", "verifyTool"
  • instruction: Text - Friendly prompt given to user, which contextualizes task (e.g. "How many penguins are there?", "What color is this penguin?", "Choose the type of document")
  • help: Filename of html file in project config to load into a little slide-out/modal. [to be better defined]
  • generates_subject_type: String - Unique string identifying the type of subject generated. This must be unique across all tasks in the workflow. The value must match a task key in the destination workflow.
  • tool_config: Hash - Specify arbitrary tool options. See "Tools"
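
Because generates_subject_type must match a task key in the destination workflow, the first task for a generated subject can be resolved by a simple lookup, falling back to the workflow's first_task. A minimal sketch, assuming the workflow is a plain dict shaped like the config described here; `pick_first_task` is a hypothetical helper name.

```python
def pick_first_task(workflow, subject_type):
    """Return the key of the task to load first for a subject of this type."""
    if subject_type in workflow["tasks"]:
        return subject_type          # subject type names a task: start there
    return workflow["first_task"]    # otherwise use the configured default

workflow = {
    "first_task": "determine_has_records",
    "tasks": {"determine_has_records": {}, "em_mark_record": {}},
}
print(pick_first_task(workflow, "root"))
print(pick_first_task(workflow, "em_mark_record"))
```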

Classification

A classification is a single statement a user makes about a subject in response to a single task. Drawing tasks create classifications that represent a single polygon. (Note many drawing tools are configured with repeat=true to generate multiple classifications.) PickOne tasks produce classifications that store the option chosen (e.g. 'yes','no'). Transcription tasks create classifications with the text entered.

  • started_at: Date - Submitted by JS when created
  • finished_at: Date - Submitted by JS when created
  • user_agent: String - Submitted by JS when created
  • subject: SUBJECT
  • workflow: WORKFLOW
  • annotation: Hash - Hash of data collected by task.

A note on the annotation field: The annotation should always be a Hash, even if the tool that generates it produces a single, scalar value. In those cases, the annotation should look like {value: '...'}.

Marking tool (e.g. pickOneMarkOne) example annotation:

{x: .., y: .., width: .., height: ..}

PickOne tool example annotation:

{value: 'yes'}

Transcription tool example annotation:

{value: 'Bond St'}

Or, if a compositeTool is used, the keys of each tool in the tools configuration option should be used as the keys for the collected values:

{first_name: 'Charlie', last_name: 'Brown'}

VerificationTool annotation should store the chosen annotation value:

{value: 'Bond Street'}
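
The "always a Hash" rule can be enforced at a single point before a classification is submitted. A minimal sketch; `normalize_annotation` is a hypothetical helper, not part of Scribe.

```python
def normalize_annotation(result):
    """Wrap scalar tool output as {'value': ...}; pass hashes through as-is."""
    if isinstance(result, dict):
        return result
    return {"value": result}

print(normalize_annotation("Bond St"))
print(normalize_annotation({"first_name": "Charlie", "last_name": "Brown"}))
```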

For example, when committing a mark, the data posted to the classifications endpoint might look like this:

{
  "classifications": [
    {
      "subject_id":"5554c24770617577c0040000",
      "workflow_id":"5554c24770617577c0000000",
      "subject_type":"em_marked_record",
      "task":"identify_records",
      "toolName":"rectangleTool",
      "annotation": {
        "x":189.72857142857143,
        "y":147.28764273515156,
        "width":1807.4142857142856,
        "height":1160.8263368109401
      },
      "started_at":"2015-05-14T15:48:24.878Z",
      "finished_at":"2015-05-14T15:48:24.878Z"
    }
  ]
}

When committing some transcribed data in the Transcribe workflow, the data posted to the classifications endpoint might look like this:

{
  "classifications": [
    {
      "subject_id":"5554c24770617577c0040000",
      "workflow_id":"5554c24770617577c0000000",
      "subject_type":"em_transcribed_mortgager_name",
      "task":"em_transcribe_mortgager_name",
      "toolName":"textTool",
      "annotation": {
        "value": "Charlie Brown"
      },
      "started_at":"2015-05-14T15:48:24.878Z",
      "finished_at":"2015-05-14T15:48:24.878Z"
    }
  ]
}

When committing some transcribed data using a composite tool in the Transcribe workflow, the data posted to the classifications endpoint might look like this:

{
  "classifications": [
    {
      "subject_id":"5554c24770617577c0040000",
      "workflow_id":"5554c24770617577c0000000",
      "subject_type":"em_transcribed_valuation",
      "task":"em_transcribe_valuation",
      "toolName":"compositeTool",
      "annotation": {
        "date": "1879 November",
        "amount": "$1200"
      },
      "started_at":"2015-05-14T15:48:24.878Z",
      "finished_at":"2015-05-14T15:48:24.878Z"
    }
  ]
}
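
The payloads above share one shape, so a client could assemble them with a small helper. This is a sketch under assumptions: `build_payload` and `iso_millis` are hypothetical names, and the endpoint and transport are left out because this page does not specify them.

```python
import json
from datetime import datetime, timezone

def iso_millis(dt):
    """Format an aware datetime like '2015-05-14T15:48:24.878Z'."""
    utc = dt.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S") + f".{utc.microsecond // 1000:03d}Z"

def build_payload(subject_id, workflow_id, subject_type, task, tool_name,
                  annotation, started_at, finished_at):
    """Wrap one classification in the envelope shown in the examples above."""
    return {"classifications": [{
        "subject_id": subject_id,
        "workflow_id": workflow_id,
        "subject_type": subject_type,
        "task": task,
        "toolName": tool_name,
        "annotation": annotation,
        "started_at": iso_millis(started_at),
        "finished_at": iso_millis(finished_at),
    }]}

stamp = datetime(2015, 5, 14, 15, 48, 24, 878000, tzinfo=timezone.utc)
payload = build_payload("5554c24770617577c0040000", "5554c24770617577c0000000",
                        "em_transcribed_mortgager_name",
                        "em_transcribe_mortgager_name", "textTool",
                        {"value": "Charlie Brown"}, stamp, stamp)
print(json.dumps(payload, indent=2))
```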

Subject

A "primary" subject represents a single image. A "secondary" subject represents an annotation made on a primary subject. A "tertiary" subject annotates a secondary subject. Fields include:

  • name: String - For 'root' subjects. Optional name for subject provided by CSV import.
  • location: Hash mapping identifiers to URLs:
    • standard: URL of standard image deriv
    • thumbnail: URL of thumbnail
  • type: String - Default for primary subjects is "root". Secondary subject types are determined by the generates_subject_type configured in the task. It's thus always either "root" or a user-supplied value like "em_mark_record".
  • meta_data: Hash - Includes arbitrary data known about root subjects - imported from subject CSVs that might be useful when transcribing like subject type, date
  • region: Hash - Defines the sub-region of the root subject. Not present in root subjects. Includes following fields:
    • tool_name: String - The name of the tool that generated the region (e.g. 'rectangleTool') [Should this maybe just be type and refer to the abstract class of poly (rectangle, ellipse, point) since maybe it's not really important what tool generated it?]
    • x: Int - The pixel x-coord, if applicable
    • y: Int - The pixel y-coord, if applicable
    • width: Int - The pixel width, if applicable
    • height: Int - The pixel height, if applicable
  • data: Hash - The classification data. For generated secondary/tertiary subjects, this hash should be copied from annotation property of the classification(s) that generated it.
  • classification_count: Int - Denormalized count of classifications of this subject.
  • retire_count: Int - Default 0. Number of times a user has said "There's nothing left to mark". Only relevant in Mark.
  • status: Enum "active", "done", "contentious", "invalid" - Only 'active' subjects are served. Subjects marked done are not served to any workflow.
  • deleting_users: Array - List of user ObjectIds that marked this subject invalid in a workflow with delete_limit set.
  • width: Int - Pixel width of root subj
  • height: Int - Pixel height of root subj
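
A minimal sketch of how deleting_users and a workflow's delete_limit might interact. It assumes each user is counted at most once and that the subject is a plain dict; `record_delete_vote` is a hypothetical helper name.

```python
def record_delete_vote(subject, user_id, delete_limit):
    """Record one "This is invalid" vote and update the subject's status."""
    if delete_limit is None:
        return subject                        # deletion disabled for workflow
    if user_id not in subject["deleting_users"]:
        subject["deleting_users"].append(user_id)
    if len(subject["deleting_users"]) >= delete_limit:
        subject["status"] = "invalid"         # threshold reached
    return subject

subject = {"status": "active", "deleting_users": []}
for user in ["u1", "u2", "u2", "u3"]:         # the repeat vote is ignored
    record_delete_vote(subject, user, delete_limit=3)
print(subject["status"])
```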

Example root subject:

{
  "name": "Page 1",
  "status" : "active",
  "type" : "root",
  "location" : {
    "standard" : "http://demo.zooniverse.org/whaling-logs-test-data/KWM_51_JPGs/logbookofphillip00unse_0241.jpg",
    "thumbnail" : "http://demo.zooniverse.org/whaling-logs-test-data/KWM_51_JPGs/logbookofphillip00unse_0241.jpg"
  },
  "retire_count" : 0,
  "classification_count" : 1,
  "width" : "2048",
  "height" : "3380",
  "meta_data" : {
    "capture_location" : "n/a",
    "date" : "n/a",
    "set_key" : "n/a"
  },
  "workflow_id" : ObjectId("55477dbb7061751603000000"),
  "subject_set_id" : ObjectId("55477dbf70617516036f0200"),
  "random_no" : 0.6645554681836298,
  "updated_at" : ISODate("2015-05-04T14:10:08.327Z"),
  "created_at" : ISODate("2015-05-04T14:10:08.325Z")
}

Example secondary subject (a mark created on the root subject above):

{
  "name": null,
  "status" : "active",
  "type" : "em_mark_record",
  "location" : {
    "standard" : "http://demo.zooniverse.org/whaling-logs-test-data/KWM_51_JPGs/logbookofphillip00unse_0241.jpg",
    "thumbnail" : "http://demo.zooniverse.org/whaling-logs-test-data/KWM_51_JPGs/logbookofphillip00unse_0241.jpg"
  },
  "retire_count" : null,
  "classification_count" : 0,
  "width" : "2048",
  "height" : "3380",
  "meta_data" : {
    "capture_location" : "n/a",
    "date" : "n/a",
    "set_key" : "n/a"
  },
  "data" : {
    "x": 123,
    "y": 567,
    "width": 1356,
    "height": 987
  },
  "workflow_id" : ObjectId("5554c24770617577c0010000"),
  "subject_set_id" : ObjectId("55477dbf70617516036f0200"),
  "random_no" : 0.6645554681836298,
  "updated_at" : ISODate("2015-05-04T14:10:08.327Z"),
  "created_at" : ISODate("2015-05-04T14:10:08.325Z")
}

SubjectSet

A Subject always belongs to a single SubjectSet. Multi-page documents are represented by multiple subjects associated by a single subject-set. Fields include:

  • name: String
  • subjects: Many SUBJECTs

Group

Groups organize subject-sets into related collections.

Note that although group membership and metadata may be helpful to the project maintainer and transcriber (if group metadata is displayed with member subjects), annotations are applied to subjects exclusively. It's not currently possible to annotate a group.

Group fields include:

  • name: String - As given by groups CSV
  • description: Text - As given by groups CSV
  • cover_image_url: String - URL of group representative image
  • external_url: String - URL of another representation of this object (e.g. wikipedia) if avail
  • meta_data: Hash - Includes arbitrary known data imported from group CSVs that might be useful to display in transcription interface
  • selection_method: Enum "linear", "random" - Indicates method for selecting subject-sets from this group for marking, whether linearly or randomly.
  • subject_sets: Many SUBJECTSETs
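
A sketch of the two selection methods, assuming "linear" means walking the subject-sets in stored order and wrapping around at the end. That reading, the `position` argument, and the function name are assumptions for illustration only.

```python
import random

def select_subject_set(subject_sets, selection_method, position=0, rng=random):
    """Pick the next subject-set to serve from a group."""
    if not subject_sets:
        return None
    if selection_method == "random":
        return rng.choice(subject_sets)
    if selection_method == "linear":
        return subject_sets[position % len(subject_sets)]
    raise ValueError(f"unknown selection_method: {selection_method!r}")

sets = ["logbook-1", "logbook-2", "logbook-3"]
print(select_subject_set(sets, "linear", position=1))
```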

Tools

Tools are pluggable, configurable widgets that perform a single, simple task related to identifying an area of the subject ("marking"), adding data to a subject ("transcribing"), or moving the user from one tool to the next ("core").

Tools are specified in a task config via tool, and tool-specific configuration is specified via tool_config. For example:

  ...
  tasks: {
    "determine_has_records": {

      "tool": "pickOne",

      "tool_config": {
        "options": {
          "yes": {
            "label": "Yes",
            "next_task": "identify_records"
          },
          "no": {
            "label": "No"
          }
        }
      }

    }
  }

Core Tools

Certain tools (e.g. 'pickOne') are "core tools", meaning they can appear in any workflow. Core tools are defined in components/core-tools. (If a tool isn't a core tool, it can be found in either components/mark or components/transcribe depending on the workflow in which it appears.)

Pick One (pickOne)

Pick One is a simple tool that presents two or more options, each of which may lead to a different task. Supported configuration options:

  • options: Hash mapping keys to hashes with following properties:
    • label: String - Friendly label of option (e.g. "This looks like a Casualty Form...", "This looks like an attestation..")
    • next_task: String - Key of TASK to jump to if user clicks this option.

The classification generated by a pickOne task contains the following data fields:

  • value: String - The key of the option chosen.
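
A minimal sketch of what happens when the user clicks an option: the option key becomes the annotation value and the option's next_task, if any, names the task to jump to. `answer_pick_one` is a hypothetical helper; the config mirrors the determine_has_records example above.

```python
def answer_pick_one(tool_config, chosen_key):
    """Return (annotation, next_task_key_or_None) for the chosen option."""
    option = tool_config["options"][chosen_key]   # KeyError on unknown key
    annotation = {"value": chosen_key}
    return annotation, option.get("next_task")

tool_config = {
    "options": {
        "yes": {"label": "Yes", "next_task": "identify_records"},
        "no": {"label": "No"},
    }
}
print(answer_pick_one(tool_config, "yes"))
print(answer_pick_one(tool_config, "no"))
```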

Pick Many (pickMany)

Pick Many is similar to Pick One, but allows user to select multiple options before continuing, all of which are stored in a single generated classification. Supported configuration options:

  • options: Hash mapping keys to hashes with following properties:
    • label: String - Friendly label of option (e.g. "This looks like a Casualty Form...", "This looks like an attestation..")

Marking Tools

Marking tools include various methods for identifying specific points and areas of images. They're defined in components/mark/tools.

All marking tools accept the following config params (in addition to tool specific params noted below):

  • fill_color: String - CSS color. (Default "rgba(0,0,0,0.30)")
  • stroke_color: String - CSS color. (Default "#fff")
  • stroke_width: Integer - Pixel stroke width (Default 3)

All marking tools generate the following classification data (in addition to tool-specific data noted below):

  • x: Integer - Pixel coordinate within parent subject
  • y: Integer - Pixel coordinate within parent subject

Pick One Mark One (pickOneMarkOne)

PickOneMarkOne is the sole marking tool. It produces a menu of "marking types" in the right column, which are associated with user-supplied labels.

Tool-specific config options include:

  • options: Array of hashes defining the kind of marking types that can be made.

Each hash passed to options should define a marking type using the following properties:

  • type: The marking type. Must be one of "pointTool", "rectangleTool", "textRowTool"
  • label: The label to display, which the user clicks on to activate the marking type.
  • color: The color of the displayed mark (?)

The supported marking types and their optional (proposed) additional config params are described below:

i. Point Tool (pointTool)

A simple point on the document. Optional config:

  • radius: Integer - Pixel radius (Default 40)

ii. Rectangle Tool (rectangleTool):

Rectangular selector for identifying arbitrary rectangular regions of a document.

Tool-specific config options include:

  • min_height: Integer (or float percentage of subject)
  • max_height: ditto

Tool-specific classification data generated by rectangleTool:

  • width: Integer - Width of region
  • height: Integer - Height of region

iii. Text Row Tool (textRowTool)

Document-wide rectangular selector suited to identifying rows of horizontal text that span the width of the document.

Tool-specific config options include:

  • min_height: Integer (or float percentage of subject)
  • max_height: ditto

Tool-specific classification data generated by textRow tools:

  • yUpper: Integer
  • yLower: Integer

Example pickOneMarkOne config:

"identify_records": {
  "tool": "pickOneMarkOne",
  "instruction": "Pick a field and mark it with the corresponding marking tool.",
  "tool_config": {
    "options": [
      { "type": "rectangleTool",
        "label": "Blocky region of the doc",
        "color": "green",
        "max_height": 0.6
      },
      { "type": "textRowTool",
        "label": "Row of text",
        "color": "blue"
      }
    ]
  }
}
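
Since min_height and max_height accept either a pixel integer or a float fraction of the subject, a tool has to resolve them against the subject's height before applying them. A sketch, assuming floats are fractions of the subject height (0.6 meaning 60%); both function names are hypothetical.

```python
def resolve_height_limit(limit, subject_height):
    """Convert a min_height/max_height setting to whole pixels."""
    if limit is None:
        return None
    if isinstance(limit, float):
        return round(limit * subject_height)   # fraction of subject height
    return int(limit)                          # already a pixel count

def clamp_height(height, option, subject_height):
    """Clamp a drawn height to the option's min_height/max_height, if set."""
    lowest = resolve_height_limit(option.get("min_height"), subject_height)
    highest = resolve_height_limit(option.get("max_height"), subject_height)
    if lowest is not None:
        height = max(height, lowest)
    if highest is not None:
        height = min(height, highest)
    return height

option = {"type": "rectangleTool", "color": "green", "max_height": 0.6}
print(clamp_height(3000, option, subject_height=3380))
```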

Transcribe Tools

Transcribe tools are widgets suitable for gathering typed data with configurable constraints.

Text Tool (textTool)

Probably the simplest transcription tool, the text tool presents a single text input. The tool can be augmented with options below.

Options:

  • limit: Integer - Character limit.
  • suggest: Enables autocomplete. Suggest supports the following possible values:
    • An array of literal strings (e.g. ["cat","dog","other"])
    • A URL returning auto-complete suggestions for current entry (e.g. "http://example.com/terms/suggest?term=%%TERM%%" )
    • The phrase "common", which indicates the most commonly typed values for the current input will be suggested.
  • multiline: Boolean - Indicates whether or not value is expected to have line-breaks. Note that sufficiently large values of limit imply use of a textarea regardless.
  • match: String - Regex defining valid strings (e.g. "^[a-z]+$")

Configuration example:

  ...
  tasks: {
    "transcribe_mortgager_name": {
      "tool": "textTool",
      "tool_config": {
        "limit": 100
      }
    }
  }
  ...
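
A sketch of how the limit and match options might be checked against an entered value, and how the %%TERM%% placeholder in a suggest URL might be filled in. Both helpers are hypothetical; how Scribe actually validates input or encodes the term is not described on this page.

```python
import re
from urllib.parse import quote

def validate_text(value, tool_config):
    """Return a list of problems with value; an empty list means it passes."""
    errors = []
    limit = tool_config.get("limit")
    if limit is not None and len(value) > limit:
        errors.append(f"longer than {limit} characters")
    pattern = tool_config.get("match")
    if pattern is not None and re.search(pattern, value) is None:
        errors.append(f"does not match {pattern}")
    return errors

def suggestion_url(template, term):
    """Substitute the URL-encoded term for the %%TERM%% placeholder."""
    return template.replace("%%TERM%%", quote(term, safe=""))

print(validate_text("Charlie Brown", {"limit": 100}))
print(suggestion_url("http://example.com/terms/suggest?term=%%TERM%%", "Bond St"))
```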

Number Tool (numberTool) [proposed]

An extension of the Text Tool (perhaps using match option to restrict characters like "^-?\d+([,.]\d+)?$"). Supported config options:

  • minimum: Integer/Float
  • maximum: Integer/Float
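
Since this tool is only proposed, the following is purely illustrative: it applies the regex suggested above and then the minimum/maximum bounds. It assumes a comma is a decimal separator (so "12,5" means 12.5, and thousands separators are not supported); `parse_number` is a hypothetical name.

```python
import re

NUMBER_PATTERN = re.compile(r"^-?\d+([,.]\d+)?$")

def parse_number(text, minimum=None, maximum=None):
    """Return the parsed number, or None if text is invalid or out of range."""
    if NUMBER_PATTERN.match(text) is None:
        return None
    number = float(text.replace(",", "."))
    if minimum is not None and number < minimum:
        return None
    if maximum is not None and number > maximum:
        return None
    return number

print(parse_number("12,5"))
print(parse_number("1200", maximum=1000))
```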

Date Tool (dateTool)

A date (and date range) picker that supports approximate dates and pre-1970 dates. Supported config options:

  • minimum: String - ISO 8601 date string establishing oldest allowed date (e.g. "-30000101" for 3000 BCE)
  • maximum: String - ISO 8601 date string establishing maximum allowed date (e.g. "20150227")
  • range: Boolean - If true, a date range may be selected
  • allow_approximate: Boolean - If true, user may check a box to indicate date is approximate.

Composite Tool (compositeTool)

A composite tool is a tool composed of two or more basic tools. A composite tool presents multiple tools side by side for cases where the mark being considered contains multiple distinct data that are confusing to consider in isolation. Config options include:

  • tools: Hash mapping a key for each constituent tool to a hash defining that tool. Each hash should include:
    • tool: Key of tool
    • tool_config: Hash of tool specific config options (refer to tool specific config options above)

Note that composite tool classifications are special in that they are a hash of the classifications generated by each of their constituent tools. For example, if a composite tool is configured like this:

"em_transcribe_valuation": {
  "tool": "compositeTool",
  "tool_config": {
    "tools": {
      "em_valuation_date": {
        "tool": "dateTool",
        "tool_config": {},
        "label": "Record Date"
      },  
      "em_valuation_amount": {
        "tool": "textTool",
        "tool_config": {},
        "label": "Amount"
      }
    }
  },  
  "instruction": "Enter any dated property valuations that were recorded"
}
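
Given a config like the one above, the resulting annotation would plausibly be keyed by the keys under tools, following the earlier note on compositeTool annotations. A minimal sketch; `composite_annotation` and the `entered_values` argument are hypothetical, and the sample values are borrowed from the composite payload example earlier on this page.

```python
def composite_annotation(tool_config, entered_values):
    """Collect one value per constituent tool, keyed by that tool's key."""
    return {key: entered_values.get(key) for key in tool_config["tools"]}

tool_config = {
    "tools": {
        "em_valuation_date": {"tool": "dateTool", "tool_config": {}, "label": "Record Date"},
        "em_valuation_amount": {"tool": "textTool", "tool_config": {}, "label": "Amount"},
    }
}
entered = {"em_valuation_date": "1879 November", "em_valuation_amount": "$1200"}
print(composite_annotation(tool_config, entered))
```

A constituent tool the user left blank shows up with a value of None rather than being dropped, which keeps the set of keys predictable.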

Frontend Components

Mark

Props

  • workflow

Members

  • subject_set_viewer: SubjectSetViewer
  • subject_sets: Array of SubjectSets

SubjectSetViewer

Props

  • subject_set: SubjectSet

Members

  • subject_viewer: SubjectViewer

Transcribe

Props

  • workflow

Members

  • subject_viewer: SubjectViewer
  • subjects: Array of Subjects

SubjectViewer

Props

  • subject: A primary/secondary subject
  • tool: A (transcription) tool to render overlaid on the viewer
  • classification
  • annotation

Members