## YAML Schema Validation with jsonschema

YAML is popular for use cases such as configuration files because it is easy to read and write and is less susceptible to missing commas and curly braces. However, Python tools for validating YAML documents against a schema are not great.  There is an alternative:  because the Python *yaml* library loads documents directly into Python *dict*s, it is easy to use *jsonschema*, a well-respected and actively maintained implementation of *JSON Schema* for Python, to validate both a schema written in YAML and a YAML document, all without the commas and curly braces.  The one condition is that the document could be equivalently expressed as JSON, which covers the simpler, practical documents such as config files.

For instance, the documentation for the standard Python library *logging* uses YAML for the example of the new *dict-config* method for configuring logging externally. Here is a slightly enhanced version to show multiple formatters, handlers and loggers.



```

---
# based on the YAML configuration file in the Python 3 logging documentation
# https://docs.python.org/3/howto/logging.html#logging-advanced-tutorial

version: 1

formatters:
  simple:
    format: '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
  brief:
    format: '%(asctime)s - %(message)s'
  
handlers:
  console:
    class: logging.StreamHandler
    level: DEBUG
    formatter: simple
    stream: ext://sys.stdout
  file_app:
    class: logging.FileHandler
    level: INFO
    formatter: brief
    filename: app.log 
    encoding: utf8
    
loggers:
  simpleExample:
    level: DEBUG
    handlers: [console]
    propagate: no
  simplerExample:
    level: DEBUG
    handlers: [file_app]
    propagate: no
    
root:
  level: DEBUG
  handlers: [console]
  
```

If the above was a file with path *logging.cfg*, the following is all the yaml coding that is required to read in the config:


>        import logging.config
>        import yaml
>        
>        with open('logging.cfg', 'rt') as f:
>            log_cfg = yaml.safe_load(f.read())
>        logging.config.dictConfig(log_cfg)


But no schema either in JSON or YAML is provided to perform validation.  So the following discussion walks through the creation of a *JSON Schema* schema to match the above logging config file.

Writing the schema is mostly a matter of identifying the properties of each object.  Objects are often nested, and the standard YAML indentation rules apply.  Given the logging config example, it starts off easily with some header information and the first property of the main object, called *version*:

```
$schema: http://json-schema.org/draft-04/schema#
title: PythonLoggingConfig
type: object
properties: 
    version: 
        type: integer
```
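A quick check of this first fragment might look like the following sketch. It assumes the PyYAML and *jsonschema* packages are installed; the names `SCHEMA_YAML`, `validator`, `good`, `bad` and `empty` are made up for illustration.

```python
import jsonschema
import yaml

# The schema fragment from above, kept as YAML text.
SCHEMA_YAML = """
$schema: http://json-schema.org/draft-04/schema#
title: PythonLoggingConfig
type: object
properties:
    version:
        type: integer
"""

schema = yaml.safe_load(SCHEMA_YAML)        # YAML text becomes a plain dict
validator = jsonschema.Draft4Validator(schema)

good = yaml.safe_load("version: 1")         # loads as an integer
bad = yaml.safe_load("version: one")        # loads as a string
empty = {}                                  # no version property at all

print(validator.is_valid(good))
print(validator.is_valid(bad))
print(validator.is_valid(empty))
```

The empty document is accepted at this stage because nothing has been listed as *required* yet.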

The principal part of the config file, and therefore of the schema, consists of four sections:  Formatters, Handlers, Loggers and Root (the parent Logger).  The first three each contain an associative array of logging objects, which easily converts to a Python dict.  Each of these objects is in key-value format, but the trick is that the key, the *name* of the object, is made up by the writer of the config file and is not known to the schema.  The format of the file could have used a *name* property but did not.  JSON Schema solves this with an alternate list of properties called *patternProperties* that uses regular expressions, in this case a very generic regex for a word without spaces.  In the case of the formatters, the value associated with this name is an object with a single property, *format*.  Notice that, consistent with YAML, the indentation of the objects and properties indicates the hierarchy being described.

```
    formatters:
        type: object
        patternProperties:
            '^\w+$':
                type: object
                properties:
                    format: 
                        type: string 
                required: [format]                       
        additionalProperties: no      # forces match of pattern 
```

The *required* field is used to list the required properties for the given object in the hierarchy.  The *additionalProperties: no* is important to ensure that the regex is being matched: all properties in the schema are optional by default, so without it a key that fails to match the regex would be silently accepted rather than rejected.  When writing the schema, it is important to test it constantly by deliberately provoking validation exceptions, to make sure that what is intended to match is being matched and not set aside as optional elements.  Validation of a correctly matching document is silent: no exception is raised.
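The effect of *additionalProperties* can be seen by validating a deliberately broken document against the fragment with and without that line. This is a minimal sketch, assuming PyYAML and *jsonschema* are installed; `failed_keywords` is a hypothetical helper, not part of either library.

```python
import jsonschema
import yaml

# The formatters fragment, written as a standalone schema (raw string keeps \w intact).
STRICT_YAML = r"""
type: object
patternProperties:
    '^\w+$':
        type: object
        properties:
            format:
                type: string
        required: [format]
additionalProperties: no
"""

strict = yaml.safe_load(STRICT_YAML)
# The same schema with the additionalProperties line removed.
lax = {key: value for key, value in strict.items() if key != "additionalProperties"}

# The key contains a space, so it does not match '^\w+$'.
broken = yaml.safe_load("""
my formatter:
    fmt: '%(message)s'
""")


def failed_keywords(schema, instance):
    """Return the schema keywords that rejected the instance (hypothetical helper)."""
    validator = jsonschema.Draft4Validator(schema)
    return [error.validator for error in validator.iter_errors(instance)]


print(failed_keywords(strict, broken))   # the strict schema reports a failure
print(failed_keywords(lax, broken))      # the lax schema reports nothing
```

Without *additionalProperties: no*, the unmatched key is ignored, so the misspelled *fmt* inside it is never examined.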

Another issue with the formatters, etc., is that the config file lists multiple formatter objects and the schema has no designation of multiples.  For arrays with square brackets in YAML and JSON, there is a schema designation for *array* and the *items* within, but not for associative arrays.  Despite a lack of explanation and examples in the documentation and relevant tutorials, the multiplicity is *implicit*: the pattern is applied to every key that matches it, so the above section of schema works for multiple formatters.
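The implicit multiplicity can be checked with a short sketch. Here the formatters fragment is written directly as the Python dict that the YAML would load into, and the formatter names are invented for the example; it assumes *jsonschema* is installed.

```python
import jsonschema

# Dict equivalent of the YAML formatters fragment.
formatters_schema = {
    "type": "object",
    "patternProperties": {
        r"^\w+$": {
            "type": "object",
            "properties": {"format": {"type": "string"}},
            "required": ["format"],
        }
    },
    "additionalProperties": False,
}

formatters = {
    "simple": {"format": "%(asctime)s - %(name)s - %(message)s"},
    "brief": {"format": "%(message)s"},
    "broken": {"fmt": "%(message)s"},   # wrong property name on purpose
}

validator = jsonschema.Draft4Validator(formatters_schema)

# Each matching key is validated on its own; collect the paths that failed.
failing = [list(error.absolute_path) for error in validator.iter_errors(formatters)]
print(failing)
```

Only the entry that lacks a *format* property is reported, which suggests the pattern schema was applied to each of the three entries separately.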

The rest of the schema follows the same pattern, with the additions described next.

The *handlers*, *loggers* and *root* sections each have a *level* property for each of the objects.  *JSON Schema* provides for reusable definitions and the level enum can be defined at the top of the schema as:

```
definitions:
    level_enum :
        enum: [NOTSET, DEBUG, INFO, WARNING, ERROR, CRITICAL ]
```

and the corresponding lines in the main section would reference it as:

```
        level:
            $ref: '#/definitions/level_enum'

```

where the referenced schema object replaces the second line.  The '#' refers to the root of the schema document, and the rest of the string is the path from there to the referenced definition.  The quotes are needed because an unquoted '#' preceded by a space starts a comment in YAML.
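A small sketch of the reference in action, assuming PyYAML and *jsonschema* are installed. The surrounding schema is cut down to a single *level* property for brevity, and the last lines show what PyYAML makes of the reference when the quotes are left off.

```python
import jsonschema
import yaml

SCHEMA_YAML = """
definitions:
    level_enum:
        enum: [NOTSET, DEBUG, INFO, WARNING, ERROR, CRITICAL]
type: object
properties:
    level:
        $ref: '#/definitions/level_enum'
"""

validator = jsonschema.Draft4Validator(yaml.safe_load(SCHEMA_YAML))

print(validator.is_valid({"level": "DEBUG"}))     # a listed level
print(validator.is_valid({"level": "VERBOSE"}))   # not in the enum

# Without quotes, everything from '#' onward is treated as a YAML comment,
# so the reference is lost and the value loads as None.
unquoted = yaml.safe_load("$ref: #/definitions/level_enum")
print(unquoted)
```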

The *handlers* section presents the additional challenge that some properties belong only to specific handler classes.  In particular, a handler with class *logging.StreamHandler* will have a *stream* property and one with class *logging.FileHandler* will have *filename* and *encoding* properties.  *JSON Schema* has the combining keywords *allOf*, *anyOf* and *oneOf*, which solve some of this.  In the logging case, the nature of the object hierarchy is such that we can't just wrap a *oneOf* around two groups of properties; rather, the entire list of properties needs to have multiple versions, like so:

```
    handlers:
        type: object
        patternProperties:
            '^\w+$':
                type: object
                oneOf:
                  - properties:
                        class: 
                            type: string 
                        level:
                            $ref: '#/definitions/level_enum'
                        formatter:
                            type: string   # one of keys in formatters
                        stream:
                            type: string 
                    required: [class, level, formatter] 
                    additionalProperties: no                                        
                  - properties:
                        class: 
                            type: string 
                        level:
                            $ref: '#/definitions/level_enum'
                        formatter:
                            type: string   # one of keys in formatters
                        filename:            
                            type: string                                   
                        encoding:          # enum : [ utf8, ?]
                            type: string                                                              
                    required: [class, level, formatter]                       
                    additionalProperties: no 
        additionalProperties: no   
```

It is also essential to add the *additionalProperties: no* for each group of properties to get the correct match.
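The behaviour of the *oneOf* can be probed with a few hand-made handlers. The sketch below assumes PyYAML and *jsonschema* are installed; it restates the handlers fragment as a standalone schema in a more compact YAML flow style, and the case names are invented for the example.

```python
import jsonschema
import yaml

HANDLERS_YAML = r"""
definitions:
    level_enum:
        enum: [NOTSET, DEBUG, INFO, WARNING, ERROR, CRITICAL]
type: object
patternProperties:
    '^\w+$':
        type: object
        oneOf:
          - properties:
                class: {type: string}
                level: {$ref: '#/definitions/level_enum'}
                formatter: {type: string}
                stream: {type: string}
            required: [class, level, formatter]
            additionalProperties: no
          - properties:
                class: {type: string}
                level: {$ref: '#/definitions/level_enum'}
                formatter: {type: string}
                filename: {type: string}
                encoding: {type: string}
            required: [class, level, formatter]
            additionalProperties: no
additionalProperties: no
"""

validator = jsonschema.Draft4Validator(yaml.safe_load(HANDLERS_YAML))

base = {"class": "logging.StreamHandler", "level": "DEBUG", "formatter": "simple"}
cases = {
    "stream_only": dict(base, stream="ext://sys.stdout"),
    "file_only": dict(base, filename="app.log", encoding="utf8"),
    "mixed": dict(base, stream="ext://sys.stdout", filename="app.log"),
    "bare": dict(base),   # only the three required properties
}

# Validate each handler on its own so the results can be compared.
results = {name: validator.is_valid({name: handler}) for name, handler in cases.items()}
for name, ok in results.items():
    print(name, ok)
```

The mixed handler is rejected because neither branch allows both *stream* and *filename*. The bare handler is also rejected, because it satisfies both branches and *oneOf* requires exactly one match. If such handlers should be accepted, *anyOf* may be the better fit.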

The *loggers* section follows the simpler pattern of *formatters*, and *root* is a single logger object with no name to match.  The remainder of the schema is:

```
    loggers:
        type: object
        patternProperties:
            '^\w+$':
                type: object
                properties:
                    level:
                        $ref: '#/definitions/level_enum'
                    handlers:
                        type: array
                        items: 
                            type: string 
                    propagate: 
                        type: boolean
                required: [ level, handlers ]              
        additionalProperties: no      # forces match of pattern  
     
    root:
        type: object
        properties:
            level: 
                $ref: '#/definitions/level_enum'
            handlers: 
                type: array
                items: 
                    type: string                 
               
required: [version, formatters, handlers, loggers, root]

```

The final line is indented to the far left and refers to the root level of the schema.


Since the different sections of the schema make reference to the other sections, e.g. a *logger* will have a list of *handlers*, it would be desirable to constrain these references according to what is listed.  However, since these are the regex names, it is unclear how they might be extracted as a separate list without going beyond *JSON Schema*.    
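One way to go beyond *JSON Schema* here is a plain Python check run after schema validation. The following is only a sketch of that idea, using the standard library alone; `dangling_references` is a hypothetical helper and the sample config contains deliberate mistakes.

```python
def dangling_references(config):
    """List names used by handlers and loggers that are not defined in the config."""
    formatter_names = set(config.get("formatters", {}))
    handler_names = set(config.get("handlers", {}))
    problems = []

    # Every handler's formatter should be a key of the formatters section.
    for name, handler in config.get("handlers", {}).items():
        formatter = handler.get("formatter")
        if formatter is not None and formatter not in formatter_names:
            problems.append(f"handler {name!r} uses unknown formatter {formatter!r}")

    # Treat root like any other logger for the purpose of this check.
    loggers = list(config.get("loggers", {}).items())
    loggers.append(("root", config.get("root", {})))
    for name, logger in loggers:
        for handler in logger.get("handlers", []):
            if handler not in handler_names:
                problems.append(f"logger {name!r} uses unknown handler {handler!r}")

    return problems


config = {
    "formatters": {"simple": {"format": "%(message)s"}},
    "handlers": {
        # 'simpel' is a deliberate misspelling of 'simple'
        "console": {"class": "logging.StreamHandler", "level": "DEBUG", "formatter": "simpel"},
    },
    # 'file_app' is referenced but never defined
    "loggers": {"simpleExample": {"level": "DEBUG", "handlers": ["console", "file_app"]}},
    "root": {"level": "DEBUG", "handlers": ["console"]},
}

for problem in dangling_references(config):
    print(problem)
```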

The Python code using the *jsonschema* library to perform both the validation of the schema and the validation of the YAML or JSON config file using the schema is short and straightforward.  Both the yaml and jsonschema modules need to be installed in the environment, but both are easily found via pip, etc.  At the time of writing, the most recent version of *JSON Schema* supported by *jsonschema* was *Draft 4*, with *Draft 6* in the works; later releases of *jsonschema* support newer drafts. (Note:  The following code is not given here in dynamic form because the libraries cannot be assumed to be available.) 

```

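import yaml                      # for loading the schema and config files beforehand
import jsonschema
import jsonschema.exceptions

def is_valid_config(schema, config):
    """Return True if schema is a valid Draft 4 schema and config conforms to it."""

    # First make sure the schema itself is well formed.
    try:
        jsonschema.Draft4Validator.check_schema(schema)
    except jsonschema.exceptions.SchemaError as e:
        print(e)
        return False
    print('schema is valid')

    # Then validate the config (a dict loaded from YAML or JSON) against the schema.
    try:
        jsonschema.validate(config, schema)
    except jsonschema.exceptions.ValidationError as e:
        print(e)
        return False

    return True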

```
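The function above stops at the first problem it is told about. As an alternative sketch, not the approach used above, the validator's `iter_errors` method can be used to list every problem in one pass. The cut-down schema, the sample config and the helper name `all_errors` are assumptions made for the example, which requires *jsonschema* to be installed.

```python
import jsonschema

# A cut-down schema: just version and the level of the root logger.
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "version": {"type": "integer"},
        "root": {
            "type": "object",
            "properties": {
                "level": {"enum": ["NOTSET", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]},
            },
        },
    },
    "required": ["version", "root"],
}

# Two deliberate mistakes: version is a string and the level is not in the enum.
config = {"version": "one", "root": {"level": "LOUD"}}


def all_errors(schema, config):
    """Return one 'location: message' line per validation error (hypothetical helper)."""
    validator = jsonschema.Draft4Validator(schema)
    report = []
    for error in validator.iter_errors(config):
        location = "/".join(str(part) for part in error.absolute_path) or "<root>"
        report.append(f"{location}: {error.message}")
    return sorted(report)


for line in all_errors(schema, config):
    print(line)
```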