# Working with YAML

"[YAML]( https://en.wikipedia.org/wiki/YAML ) is a human-readable data-serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted."

"Originally YAML was said to mean *Yet Another Markup Language* ... it was then repurposed as *YAML Ain't Markup Language*, a recursive acronym, to distinguish its purpose as data-oriented, rather than document markup."


References:

- [Reading and parsing a YAML file with Python](https://python.land/data-processing/python-yaml)

In [1]:
%%capture
%%bash
apt-get update
apt-get install -y jq tree

## What is YAML?

YAML is a simplified data format for serializing ( i.e. converting to a string ) data structures. It is a superset of JSON ( as of YAML 1.2 ) and therefore supports the six main data types/structures of JSON:

- numbers
- strings
- booleans
- nulls
- arrays
- objects/hashes/dictionaries

YAML uses indentation to represent nested objects. A YAML document may start with three dashes `---` on their own line; the marker is optional for a single document and separates the documents when a file holds several.

Examples:
- number ( unquoted, otherwise it will be interpreted as a string ):
```
1234
```
as octal ( becomes 10 decimal in YAML 1.1 parsers such as PyYAML; YAML 1.2 writes octal as `0o12` )
```
012
```
as hex ( becomes 18 decimal )
```
0x12
```

- string (text can be quoted or unquoted):
```
  Apple
  "Banana"
```
- boolean ( lower case by convention ):
```
true
false
```
- nulls ( lower case ):
```
null
```
- arrays ( surrounded with square brackets ):
```
[ 1 ,2 , 3 ]
```
or each element on its own line prefixed with a dash:

  ```
  - 1
  - 2
  - 3
  ```


- objects:
```
key: value
```
or with curly braces
```
{ key: value }
```
or multi-line with indents
```
key:
    value
```

- nested objects:

  ```
  key:
    value:
    - 1
    - 2
    - 3
  ```
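
The examples above can be checked directly with PyYAML. The following is a minimal sketch, assuming PyYAML is installed; the `samples` dictionary and its labels are made up for illustration. Note that PyYAML follows YAML 1.1, which is why `012` is read as octal.

```python
import yaml  # PyYAML

# Each entry is a small YAML document taken from the examples above.
samples = {
    "decimal": "1234",
    "octal": "012",
    "hex": "0x12",
    "quoted number": '"1234"',
    "string": "Apple",
    "boolean": "true",
    "null": "null",
    "flow list": "[ 1 ,2 , 3 ]",
    "block list": "- 1\n- 2\n- 3",
    "flow mapping": "{ key: value }",
    "nested": "key:\n  value:\n  - 1\n  - 2\n  - 3",
}

# Parse every sample and show the Python type each one becomes.
parsed = {name: yaml.safe_load(text) for name, text in samples.items()}
for name, value in parsed.items():
    print(f"{name:14} {type(value).__name__:9} {value!r}")
```

Quoting a number turns it into a string, and both list and mapping notations load into the same Python `list` and `dict` types.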


## Why use YAML?


YAML is often easier to read than JSON because it does not need all the curly braces and quotes. Its indentation-based nesting also goes along with the programming style of Python. Furthermore, it is widely used for configuration, for example in Docker Compose files ( see the [ELK Stack example]( https://github.com/docker/awesome-compose/blob/master/elasticsearch-logstash-kibana/compose.yaml ) ). Lastly, it is usually easy to convert into JSON and back.



In [2]:
!curl -s https://raw.githubusercontent.com/docker/awesome-compose/master/elasticsearch-logstash-kibana/compose.yaml


services:
  elasticsearch:
    image: elasticsearch:7.16.1
    container_name: es
    environment:
      discovery.type: single-node
      ES_JAVA_OPTS: "-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
      - "9300:9300"
    healthcheck:
      test: ["CMD-SHELL", "curl --silent --fail localhost:9200/_cluster/health || exit 1"]
      interval: 10s
      timeout: 10s
      retries: 3
    networks:
      - elastic
  logstash:
    image: logstash:7.16.1
    container_name: log
    environment:
      discovery.seed_hosts: logstash
      LS_JAVA_OPTS: "-Xms512m -Xmx512m"
    volumes:
      - ./logstash/pipeline/logstash-nginx.config:/usr/share/logstash/pipeline/logstash-nginx.config
      - ./logstash/nginx.log:/home/nginx.log
    ports:
      - "5000:5000/tcp"
      - "5000:5000/udp"
      - "5044:5044"
      - "9600:9600"
    depends_on:
      - elasticsearch
    networks:
      - elastic
    command: logstash -f /usr/share/logstash/pipeline/logstash-nginx.config
  kibana:
    image: ki
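
Once loaded, a compose file like the one above is nothing more than nested dictionaries and lists. The sketch below uses a shortened, hand-written excerpt in the same style rather than the downloaded file, so the exact contents of `compose_text` are an assumption for illustration.

```python
import yaml

# A shortened, hand-written excerpt in the style of the compose file above.
compose_text = """
services:
  elasticsearch:
    image: elasticsearch:7.16.1
    ports:
      - "9200:9200"
      - "9300:9300"
  logstash:
    image: logstash:7.16.1
    ports:
      - "5044:5044"
    depends_on:
      - elasticsearch
"""

compose = yaml.safe_load(compose_text)

# Walk the services mapping and summarize each entry.
for name, service in compose["services"].items():
    ports = ", ".join(service.get("ports", []))
    print(f"{name}: image={service['image']} ports={ports}")
```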

## Setup

In [3]:
import yaml
import json


In [4]:
%%bash
<<'eof' cat > config.yaml
---
rest:
  url: "https://example.org/primenumbers/v1"
  port: 8443
prime_numbers: [2, 3, 5, 7, 11, 13, 17, 19]
eof

cat -n config.yaml

     1	---
     2	rest:
     3	  url: "https://example.org/primenumbers/v1"
     4	  port: 8443
     5	prime_numbers: [2, 3, 5, 7, 11, 13, 17, 19]


In [5]:
ls -l

total 8
-rw-r--r-- 1 root root  112 Jun 17 19:56 config.yaml
drwxr-xr-x 1 root root 4096 Jun 14 17:39 sample_data/


## Reading and parsing (loading) a YAML file


Read YAML into a dictionary.

In [6]:
with open('config.yaml', 'r') as file:
  prime_service = yaml.safe_load(file)
prime_service


{'rest': {'url': 'https://example.org/primenumbers/v1', 'port': 8443},
 'prime_numbers': [2, 3, 5, 7, 11, 13, 17, 19]}

In [7]:
prime_service['rest']['url']


'https://example.org/primenumbers/v1'
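
Configuration files are edited by hand, so it can help to guard against malformed YAML and missing keys. The following is a minimal sketch; `load_config` is a hypothetical helper and `timeout` is a hypothetical optional key, neither of which is part of the example file.

```python
import yaml

def load_config(text):
    """Parse YAML text and return a dict, or raise ValueError with context."""
    try:
        data = yaml.safe_load(text)
    except yaml.YAMLError as err:
        raise ValueError(f"invalid YAML: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("expected a mapping at the top level")
    return data

config = load_config("""
rest:
  url: "https://example.org/primenumbers/v1"
  port: 8443
""")

# .get() with a default avoids a KeyError when an optional key is absent.
port = config.get("rest", {}).get("port", 443)
timeout = config.get("rest", {}).get("timeout", 30)  # hypothetical key, not in the file
print(port, timeout)
```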

## Reading and parsing (loading) YAML strings with Python


In [8]:
names_yaml = """
- 'eric'
- 'justin'
- 'mary-kate'
"""
names_yaml


"\n- 'eric'\n- 'justin'\n- 'mary-kate'\n"

In [9]:
names = yaml.safe_load(names_yaml)
names


['eric', 'justin', 'mary-kate']

In [10]:
type(names)

list
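
The cells above use `yaml.safe_load` rather than `yaml.load`. A short sketch of why: `safe_load` only builds plain data types and refuses YAML tags that would construct arbitrary Python objects. The `untrusted` string is a made-up example.

```python
import yaml

# A YAML document that asks the loader to call a Python function.
untrusted = "!!python/object/apply:os.getcwd []"

try:
    yaml.safe_load(untrusted)
    rejected = False
except yaml.YAMLError as err:
    rejected = True
    print("safe_load refused the document:", type(err).__name__)

# Plain data is unaffected.
plain = yaml.safe_load("- 'eric'\n- 'justin'")
print(plain)
```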

### Example

Create a string of YAML text, assign it to a variable, and load the string into a dictionary. Notice the use of triple quotes for the multi-line string.

In [11]:
omlet_recipe = '''
Ingredients:
- eggs: 2
- salt: 1 tsp
- water: 1 tbsp
Directions:
- break eggs into a bowl
- wisk in salt and water
- put pan on stove at high heat
- add egg mixture to pan and cook
- put on plat when done
- eat
'''
print(omlet_recipe)



Ingredients:
- eggs: 2
- salt: 1 tsp
- water: 1 tbsp
Directions:
- break eggs into a bowl
- wisk in salt and water
- put pan on stove at high heat
- add egg mixture to pan and cook
- put on plat when done
- eat



In [12]:
omlet_dict = yaml.safe_load(omlet_recipe)
omlet_dict['Ingredients'][0]

{'eggs': 2}
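
Each item under `Ingredients` is a one-key dictionary, which is why the lookup above needs an index. If a single mapping is more convenient, the items can be merged. A minimal sketch, using a shortened copy of the recipe:

```python
import yaml

recipe = yaml.safe_load("""
Ingredients:
- eggs: 2
- salt: 1 tsp
- water: 1 tbsp
""")

# Each list item is a one-key dictionary; merge them into a single mapping.
ingredients = {}
for item in recipe["Ingredients"]:
    ingredients.update(item)

print(ingredients)
```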

## Writing (dumping) YAML to a file



In [13]:
with open('names.yaml', 'w') as file:
  yaml.dump(omlet_dict, file)


In [14]:
!cat -n names.yaml

     1	Directions:
     2	- break eggs into a bowl
     3	- wisk in salt and water
     4	- put pan on stove at high heat
     5	- add egg mixture to pan and cook
     6	- put on plat when done
     7	- eat
     8	Ingredients:
     9	- eggs: 2
    10	- salt: 1 tsp
    11	- water: 1 tbsp
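
In the file above `Directions` comes before `Ingredients` because `yaml.dump` sorts keys alphabetically by default. The sketch below shows the `sort_keys=False` option, which keeps insertion order; it assumes PyYAML 5.1 or newer, and the shortened `recipe` dictionary is only for illustration.

```python
import yaml

recipe = {
    "Ingredients": [{"eggs": 2}, {"salt": "1 tsp"}],
    "Directions": ["break eggs into a bowl", "eat"],
}

sorted_text = yaml.dump(recipe)                     # default: keys sorted alphabetically
ordered_text = yaml.dump(recipe, sort_keys=False)   # keep the dictionary's own order

print(ordered_text)
```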


## Convert YAML to JSON

In [15]:
with open('config.yaml', 'r') as file:
  configuration = yaml.safe_load(file)
configuration


{'rest': {'url': 'https://example.org/primenumbers/v1', 'port': 8443},
 'prime_numbers': [2, 3, 5, 7, 11, 13, 17, 19]}

Convert the dictionary to a JSON string.

In [16]:
config_js = json.dumps(configuration)
config_js


'{"rest": {"url": "https://example.org/primenumbers/v1", "port": 8443}, "prime_numbers": [2, 3, 5, 7, 11, 13, 17, 19]}'

Write the dictionary to a JSON file.

In [17]:
with open('config.json', 'w') as json_file:
    json.dump(configuration, json_file)


In [18]:
!cat -n config.json

     1	{"rest": {"url": "https://example.org/primenumbers/v1", "port": 8443}, "prime_numbers": [2, 3, 5, 7, 11, 13, 17, 19]}

In [19]:
!jq . config.json

{
  "rest": {
    "url": "https://example.org/primenumbers/v1",
    "port": 8443
  },
  "prime_numbers": [
    2,
    3,
    5,
    7,
    11,
    13,
    17,
    19
  ]
}


Read the JSON file into a dictionary and pretty-print it as an indented JSON string.

In [20]:
with open('config.json', 'r') as json_file:
    dict_js = json.dumps(json.load(json_file), indent=2)
print(dict_js)


{
  "rest": {
    "url": "https://example.org/primenumbers/v1",
    "port": 8443
  },
  "prime_numbers": [
    2,
    3,
    5,
    7,
    11,
    13,
    17,
    19
  ]
}
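
The conversion is not always this smooth. PyYAML turns some unquoted values, such as dates, into Python objects that the `json` module cannot encode. The sketch below uses a made-up two-line document to show the failure and one possible workaround, `default=str`.

```python
import json
import yaml

# An unquoted date is loaded as a datetime.date, not a string.
doc = yaml.safe_load("released: 2019-04-22\nname: air quality")
print(type(doc["released"]))

try:
    json.dumps(doc)
    failed = False
except TypeError as err:
    failed = True
    print("json.dumps failed:", err)

# default=str converts anything json cannot encode by calling str() on it.
as_json = json.dumps(doc, default=str)
print(as_json)
```

Quoting the value in the YAML source ( `released: "2019-04-22"` ) is another way to keep it a string.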


## Convert JSON to YAML

In [21]:
with open('config.json', 'r') as file:
    configuration = json.load(file)
configuration

{'rest': {'url': 'https://example.org/primenumbers/v1', 'port': 8443},
 'prime_numbers': [2, 3, 5, 7, 11, 13, 17, 19]}

In [22]:
with open('config.v02.yaml', 'w') as yaml_file:
    yaml.dump(configuration, yaml_file)


In [23]:
!cat -n config.v02.yaml

     1	prime_numbers:
     2	- 2
     3	- 3
     4	- 5
     5	- 7
     6	- 11
     7	- 13
     8	- 17
     9	- 19
    10	rest:
    11	  port: 8443
    12	  url: https://example.org/primenumbers/v1
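
Two quick checks tie this back to the claim that YAML is a superset of JSON. In this sketch the JSON text is inlined rather than read from `config.json`, so it runs on its own.

```python
import json
import yaml

json_text = (
    '{"rest": {"url": "https://example.org/primenumbers/v1", "port": 8443},'
    ' "prime_numbers": [2, 3, 5, 7, 11, 13, 17, 19]}'
)

from_json = json.loads(json_text)
from_yaml = yaml.safe_load(json_text)   # the JSON text is parsed as YAML flow syntax

# Dump to YAML and load it back to confirm nothing was lost.
round_tripped = yaml.safe_load(yaml.dump(from_json))

print(from_yaml == from_json, round_tripped == from_json)
```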


# Converting free text to YAML to JSON

This example uses the ABQ ( Albuquerque ) air quality data.

In [24]:
!curl -s -O http://data.cabq.gov/airquality/aqindex/history/042222.0017


In [25]:
!ls -l 042222.0017


-rw-r--r-- 1 root root 8508 Jun 17 20:22 042222.0017


Look at the first and last few lines of the air quality data, then convert the whole file into YAML using `sed` and save it to `aq.yaml`.

In [26]:
%%bash
head -6 042222.0017

BEGIN_FILE
FORMAT_VERSION,2
AGENCY,0017
FILENAME,042222.0017
DATA_VERSION,201904222215
TZONE,MST,7


In [27]:
!tail -6 042222.0017

BEGIN_DATA
Del Norte HS 2      ,350010023,4.1,0.7,1.9,3,1.8,2.8,1.6,1.3,4,2.3,15.5,14.7,15.1,13.6,14.5,16.9,17.9,7.9,1,10,12.6,12.9
Del Norte HS 2      ,350010023,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G
END_DATA
END_GROUP
END_FILE


In [28]:
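%%bash
cat 042222.0017 |
tr -s '\r\n' '\n' |               # normalize line endings and squeeze blank lines
sed -re '/,/ { s/^/  / }' |       # indent lines that contain a comma ( key,value pairs )
sed -re '/,/! { s/$/:/ }' |       # append ':' to lines without a comma ( section markers )
sed -re '{ s/,/: / }' |           # turn the first comma into the 'key: value' separator
sed -re '{ s/: (.*)/: "\1"/ }' |  # quote every value so YAML loads it as a string
tee aq.yaml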


BEGIN_FILE:
  FORMAT_VERSION: "2"
  AGENCY: "0017"
  FILENAME: "042222.0017"
  DATA_VERSION: "201904222215"
  TZONE: "MST,7"
BEGIN_GROUP:
  VARIABLE: "CO"
  DATA_TYPE: "POINT"
  MEASUREMENT_TYPE: "SAMPLE"
  CHARACTERISTIC: "OBSERVED"
  START_DTG: "201904220000"
  END_DTG: "201904222159"
  INTERVAL: "60"
  START_REF: "0"
  NUMSTEPS: "22"
  AVG_TIME: "60"
  UNITS: "PPM"
  STATIONS: "2"
BEGIN_DATA:
  Del Norte HS 1      : "350010023,0.138,0.171,0.196,0.132,0.174,0.272,-999,-999,0.243,0.184,0.12,0.12,0.118,0.125,0.12,0.116,0.118,0.123,0.139,0.123,0.118,0.108"
  Del Norte HS 1      : "350010023,G,G,G,G,G,G,B,B,G,G,G,G,G,G,G,G,G,G,G,G,G,G"
  South Valley        : "350010029,0.106,0.059,0.069,0.155,0.209,0.304,0.351,0.307,0.051,0.043,0.045,0.069,0.058,0.127,0.054,0.062,0.048,0.048,0.033,0.023,0.025,0.033"
  South Valley        : "350010029,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G"
END_DATA:
END_GROUP:
BEGIN_GROUP:
  VARIABLE: "NO2"
  DATA_TYPE: "POINT"
  MEASUREMENT_TYPE: "SAMPLE"
  CHARAC
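
The same transformation can be sketched in Python, which may be easier to follow than the chain of `sed` commands. `to_yaml_lines` is a hypothetical helper, and the `sample` text is a short excerpt that assumes the raw file uses `\r\n` line endings.

```python
def to_yaml_lines(raw_text):
    """Mimic the sed pipeline: 'KEY,VALUE' -> '  KEY: "VALUE"', 'MARKER' -> 'MARKER:'."""
    out = []
    for line in raw_text.splitlines():      # splitlines() handles \r\n and \n
        if not line:
            continue                        # skip blank lines, like tr -s
        if "," in line:
            key, value = line.split(",", 1)  # split on the first comma only
            out.append(f'  {key}: "{value}"')
        else:
            out.append(f"{line}:")
    return "\n".join(out) + "\n"

sample = "BEGIN_FILE\r\nFORMAT_VERSION,2\r\nAGENCY,0017\r\nTZONE,MST,7\r\n"
converted = to_yaml_lines(sample)
print(converted, end="")
```

Like the `sed` version, this sketch does not escape double quotes inside values.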

In [29]:
!ls -la aq.yaml
!wc aq.yaml

-rw-r--r-- 1 root root 9304 Jun 17 20:26 aq.yaml
 249  584 9304 aq.yaml


Read the YAML file into a string variable.

In [31]:
with open('aq.yaml', 'r') as file:
  aq_yaml = file.read()
print(aq_yaml)


BEGIN_FILE:
  FORMAT_VERSION: "2"
  AGENCY: "0017"
  FILENAME: "042222.0017"
  DATA_VERSION: "201904222215"
  TZONE: "MST,7"
BEGIN_GROUP:
  VARIABLE: "CO"
  DATA_TYPE: "POINT"
  MEASUREMENT_TYPE: "SAMPLE"
  CHARACTERISTIC: "OBSERVED"
  START_DTG: "201904220000"
  END_DTG: "201904222159"
  INTERVAL: "60"
  START_REF: "0"
  NUMSTEPS: "22"
  AVG_TIME: "60"
  UNITS: "PPM"
  STATIONS: "2"
BEGIN_DATA:
  Del Norte HS 1      : "350010023,0.138,0.171,0.196,0.132,0.174,0.272,-999,-999,0.243,0.184,0.12,0.12,0.118,0.125,0.12,0.116,0.118,0.123,0.139,0.123,0.118,0.108"
  Del Norte HS 1      : "350010023,G,G,G,G,G,G,B,B,G,G,G,G,G,G,G,G,G,G,G,G,G,G"
  South Valley        : "350010029,0.106,0.059,0.069,0.155,0.209,0.304,0.351,0.307,0.051,0.043,0.045,0.069,0.058,0.127,0.054,0.062,0.048,0.048,0.033,0.023,0.025,0.033"
  South Valley        : "350010029,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G"
END_DATA:
END_GROUP:
BEGIN_GROUP:
  VARIABLE: "NO2"
  DATA_TYPE: "POINT"
  MEASUREMENT_TYPE: "SAMPLE"
  CHARAC

In [32]:
# Look for unquoted values that start with 0; nothing is printed because every value was quoted.
for line in aq_yaml.split("\n"):
  if ": 0" in line:
    print(line)

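Convert the string variable into a dictionary. Because `aq.yaml` repeats keys such as `BEGIN_GROUP` and the station names, and PyYAML keeps only the last value for a repeated key, the resulting dictionary holds only the final group ( `WSV` ).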

In [33]:
aq_dict = yaml.safe_load(aq_yaml)
aq_dict


{'BEGIN_FILE': {'FORMAT_VERSION': '2',
  'AGENCY': '0017',
  'FILENAME': '042222.0017',
  'DATA_VERSION': '201904222215',
  'TZONE': 'MST,7'},
 'BEGIN_GROUP': {'VARIABLE': 'WSV',
  'DATA_TYPE': 'POINT',
  'MEASUREMENT_TYPE': 'SAMPLE',
  'CHARACTERISTIC': 'OBSERVED',
  'START_DTG': '201904220000',
  'END_DTG': '201904222159',
  'INTERVAL': '60',
  'START_REF': '0',
  'NUMSTEPS': '22',
  'AVG_TIME': '60',
  'UNITS': 'MPH',
  'STATIONS': '1'},
 'BEGIN_DATA': {'Del Norte HS 2': '350010023,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G'},
 'END_DATA': None,
 'END_GROUP': None,
 'END_FILE': None}
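
The first part of the sketch below reproduces the repeated-key behaviour on a tiny made-up document. The second part shows one possible way to keep every group by parsing the raw file format directly; `parse_groups` is a hypothetical helper and `raw` is a shortened, invented excerpt of the raw format.

```python
import yaml

# Two groups with the same key, as in aq.yaml.
text = """
BEGIN_GROUP:
  VARIABLE: "CO"
BEGIN_GROUP:
  VARIABLE: "NO2"
"""
last_only = yaml.safe_load(text)
print(last_only)   # only the last BEGIN_GROUP remains

def parse_groups(raw_text):
    """Collect each BEGIN_GROUP ... END_GROUP section of the raw file into its own dict."""
    groups, current = [], None
    for line in raw_text.splitlines():
        if line == "BEGIN_GROUP":
            current = {}
        elif line == "END_GROUP":
            groups.append(current)
            current = None
        elif current is not None and "," in line:
            key, value = line.split(",", 1)
            # Station names repeat (values row, flags row), so store a list per key.
            current.setdefault(key.strip(), []).append(value)
    return groups

raw = "BEGIN_GROUP\nVARIABLE,CO\nUNITS,PPM\nEND_GROUP\nBEGIN_GROUP\nVARIABLE,NO2\nEND_GROUP\n"
groups = parse_groups(raw)
print([g["VARIABLE"] for g in groups])
```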

In [34]:
(aq_dict['BEGIN_FILE']['TZONE']).split(",")[0]


'MST'

In [35]:
(aq_dict['BEGIN_GROUP']['INTERVAL']).split(",")[0]


'60'
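
The same `split(",")` approach extends to the station rows, whose values are a station id followed by the readings. A minimal sketch: `parse_readings` is a hypothetical helper, the `row` is shortened from the data above, and treating `-999` as a missing reading is an assumption based on the `B` flags that accompany it.

```python
def parse_readings(row, missing="-999"):
    """Split 'station_id,v1,v2,...' into (station_id, [floats]); missing values become None."""
    station_id, *values = row.split(",")
    readings = [None if v == missing else float(v) for v in values]
    return station_id, readings

row = "350010023,0.138,0.171,-999,0.243"
station_id, readings = parse_readings(row)
print(station_id, readings)
```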

Convert the dictionary to JSON.

In [36]:
aq_json = json.dumps( aq_dict, indent = 2 )
print(aq_json)


{
  "BEGIN_FILE": {
    "FORMAT_VERSION": "2",
    "AGENCY": "0017",
    "FILENAME": "042222.0017",
    "DATA_VERSION": "201904222215",
    "TZONE": "MST,7"
  },
  "BEGIN_GROUP": {
    "VARIABLE": "WSV",
    "DATA_TYPE": "POINT",
    "MEASUREMENT_TYPE": "SAMPLE",
    "CHARACTERISTIC": "OBSERVED",
    "START_DTG": "201904220000",
    "END_DTG": "201904222159",
    "INTERVAL": "60",
    "START_REF": "0",
    "NUMSTEPS": "22",
    "AVG_TIME": "60",
    "UNITS": "MPH",
    "STATIONS": "1"
  },
  "BEGIN_DATA": {
    "Del Norte HS 2": "350010023,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G"
  },
  "END_DATA": null,
  "END_GROUP": null,
  "END_FILE": null
}
