World Modelers Document Management

This repository aims to establish a framework for managing documents collected and curated prior to these documents being used by machine reading software. There are 3 primary goals for this effort:

Ensure that documents available to readers are consistently structured.
Ensure traceability of documents across the World Modelers pipeline.
Ensure that World Modeler teams understand the document schema as well as how to access the documents.

Schema Overview

The document-schema.json file contains a JSON Schema for World Modeler documents prior them being processed by machine reading software. This schema file can be loaded to the JSON Schema tool for for review through a graphical interface.

This schema was collaboratively designed by World Modeler teams at the UAZ hackathon on 6/11/2019. The goal was to create a flexible schema that is extensible to a wide range of document types and use cases, but maintains key information about document provenance.

Schema Validation

There are numerous schema validation libraries for a variety of programming languages, many of which can be found here. Schema-Validation.ipynb is a Jupyter Notebook demonstrating validation against this schema using the Python jsonschema library.

Document Processing

Documents can be processed using the Document-Processor.ipynb Jupyter Notebook. Ultimately, this processing step should be software, not a notebook, but since currently data curation is manual, not automated, this step requires an engineer-in-the loop to adjust the code to handle new documents. As such, this Notebook provides reference/stub code, not the ultimate document processing solution.

Document Retrieval

Documents can be retrieved using the Document-Retrieval.ipynb Jupyter Notebook. This provides an example of querying for or bulk downloading documents using Elasticsearch's Python API. Additionally, Elasticseach has clients in several other programming languages.

One thing to note about Elasticsearch is that the JSON document it returns contains top-level Elasticsearch specific metadata, while the actual document of interest must be accessed via the _source key. However the _source does not contain the document _id, which is contained at the top-level _id. Therefore, for a document to be compliant with document-schema.json we must extract _source and add the _id to it.

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
cdr-schema		cdr-schema
.gitignore		.gitignore
Document-Processor.ipynb		Document-Processor.ipynb
Document-Retrieval.ipynb		Document-Retrieval.ipynb
README.md		README.md
Schema-Validation.ipynb		Schema-Validation.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cdr-schema

cdr-schema

.gitignore

.gitignore

Document-Processor.ipynb

Document-Processor.ipynb

Document-Retrieval.ipynb

Document-Retrieval.ipynb

README.md

README.md

Schema-Validation.ipynb

Schema-Validation.ipynb

Repository files navigation

World Modelers Document Management

Schema Overview

Schema Validation

Document Processing

Document Retrieval

About

Releases

Packages

Languages

yanzv/Document-Schema

Folders and files

Latest commit

History

Repository files navigation

World Modelers Document Management

Schema Overview

Schema Validation

Document Processing

Document Retrieval

About

Resources

Stars

Watchers

Forks

Languages