# Background


## What is an Avro File?
A binary file that contains both a schema and the data that conforms to it. The data can optionally be compressed.


## Avro Basics

### Avro File
The writer takes in a schema (a JSON file) and the data that conforms to that schema,
and writes both to an Avro file. The schema gets written first, then the data.

### Write Avro
The Avro schema is pretty simple. It's a JSON file that lists entities,
their attributes, and the types of those attributes. Both primitive types
and complex types are available, so a schema can describe complicated
nested JSON structures. Read more TBD.
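
To make the primitive versus complex distinction concrete, here is a small hypothetical schema that combines primitive types with a union, an enum, and an array of nested records. The `Participant`, `Gender`, and `Diagnosis` names and all of their fields are invented for illustration.

```python
import json

# Hypothetical schema: names and fields are illustrative only
schema = {
    "type": "record",
    "name": "Participant",
    "namespace": "example",
    "fields": [
        # Primitive type
        {"name": "external_id", "type": "string"},
        # Union: the value may be null or a long
        {"name": "age_at_enrollment", "type": ["null", "long"], "default": None},
        # Enum: one of a fixed set of symbols
        {
            "name": "gender",
            "type": {
                "type": "enum",
                "name": "Gender",
                "symbols": ["female", "male", "unknown"],
            },
        },
        # Array of nested records
        {
            "name": "diagnoses",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "Diagnosis",
                    "fields": [
                        {"name": "code", "type": "string"},
                        {"name": "description", "type": ["null", "string"], "default": None},
                    ],
                },
            },
        },
    ],
}

# A nested JSON document of the shape this schema is meant to describe
participant = {
    "external_id": "P1",
    "age_at_enrollment": 12,
    "gender": "female",
    "diagnoses": [{"code": "D1", "description": None}],
}

# An Avro schema is plain JSON, so the standard library can serialize it
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```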

### Read Avro
The reader doesn't need to be given the schema separately, since it's embedded
in the file. The reader reads in the Avro file and parses it into JSON-like records.


## Vanilla Avro vs PFB
Let's say a client receives an Avro file and reads in the data.
The client now has the Avro schema and all of the data that conforms to that
schema as one big JSON blob, and it can do what it wants with them. Maybe it
wants to construct some data input forms. It has everything it needs to do
this, since the schema defines all of the entities, their attributes, and the
types of those attributes.
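
As a sketch of the input form idea, the hypothetical helper below (`form_fields` is not part of any library) walks a record schema and derives one form input per field. It assumes a flat record whose field types are either a primitive name or a union, and treats a union containing `"null"` as optional. The schema mirrors the vanilla `Participant` schema used later in this notebook.

```python
# Mirrors the vanilla Participant schema used later in this notebook
schema = {
    "type": "record",
    "name": "Participant",
    "fields": [
        {"name": "external_id", "type": "string"},
        {"name": "gender", "type": ["null", "string"]},
    ],
}


def form_fields(record_schema):
    """Describe one form input per schema field (hypothetical helper)."""
    inputs = []
    for field in record_schema["fields"]:
        field_type = field["type"]
        if isinstance(field_type, list):
            # A union is a list of types; including "null" makes it optional
            required = "null" not in field_type
            value_types = [t for t in field_type if t != "null"]
        else:
            required = True
            value_types = [field_type]
        inputs.append(
            {"label": field["name"], "types": value_types, "required": required}
        )
    return inputs


fields = form_fields(schema)
for form_input in fields:
    print(form_input)
```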

Now what happens if the client wants to reconstruct a relational database from
the data? How does it know what tables to create, and what the relationships
are between those tables? Which relationships are required vs not?


This is where PFB comes in. It defines a specific Avro schema that is
suitable for packaging up relational data so that it can be exchanged among
clients and then used to reconstruct a relational database.
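
To give a feel for what reconstruction involves, here is a sketch that uses a simplified, hypothetical record shape: each record names its table (`name`), carries its attributes (`object`), and lists links to other records (`relations`). This shape is an assumption made for illustration and is not copied from the PFB specification. The standard library's `sqlite3` stands in for the target database.

```python
import sqlite3

# Hypothetical, simplified records (not the actual PFB schema)
records = [
    {"id": "F1", "name": "family", "object": {"external_id": "FAM-1"}, "relations": []},
    {
        "id": "P1",
        "name": "participant",
        "object": {"external_id": "P1", "gender": "female"},
        "relations": [{"dst_id": "F1", "dst_name": "family"}],
    },
]

conn = sqlite3.connect(":memory:")

# One table per entity name; columns come from the attributes seen for it
columns_by_table = {}
for record in records:
    columns_by_table.setdefault(record["name"], set()).update(record["object"])
for table, columns in columns_by_table.items():
    column_sql = ", ".join(f'"{c}" TEXT' for c in sorted(columns))
    conn.execute(f'CREATE TABLE "{table}" (id TEXT PRIMARY KEY, {column_sql})')

# A single link table holds the relationships between rows
conn.execute(
    "CREATE TABLE relation (src_id TEXT, src_name TEXT, dst_id TEXT, dst_name TEXT)"
)

for record in records:
    table = record["name"]
    names = sorted(record["object"])
    quoted = ", ".join(f'"{n}"' for n in names)
    placeholders = ", ".join("?" for _ in range(len(names) + 1))
    conn.execute(
        f'INSERT INTO "{table}" (id, {quoted}) VALUES ({placeholders})',
        [record["id"]] + [record["object"][n] for n in names],
    )
    for rel in record["relations"]:
        conn.execute(
            "INSERT INTO relation VALUES (?, ?, ?, ?)",
            (record["id"], table, rel["dst_id"], rel["dst_name"]),
        )

# Follow the link from a participant to its family
rows = conn.execute(
    "SELECT p.external_id, f.external_id FROM participant p "
    "JOIN relation r ON r.src_id = p.id "
    "JOIN family f ON f.id = r.dst_id"
).fetchall()
print(rows)
```

A plain Avro file gives a client no agreed place to find the table names or the links, which is the gap the section above describes.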


## Apache Avro
Apache's Python avro package is super slow because it's written in pure Python.

### Other Issues
- The `setup.py` is missing the `pycodestyle` dependency
- The avro package from PyPI can't find the `StringIO` module. I think you need the snappy codec package for it to work.

## Fast Avro
- Built with Cython (compiled C extensions), so it's way faster than Apache's Python package
- This is used by the pypfb Python package
- Doesn't support schema hashing or parsing into canonical form (needed for
  schema hashing and diffing two schemas)
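
When hashing is needed, canonicalization can be approximated outside the library. The sketch below is a simplified take on the Parsing Canonical Form described in the Avro specification: it keeps only the structural attributes, writes them in a fixed order, and expands record names with their namespace. It does not resolve references to previously defined named types, so treat it as an illustration rather than a spec-complete implementation. The `canonicalize` and `fingerprint` helpers are hypothetical, and SHA-256 is chosen here only for convenience.

```python
import hashlib
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
# Attributes kept in canonical form, in the order they are written out
KEPT_KEYS = ["name", "type", "fields", "symbols", "items", "values", "size"]


def canonicalize(schema, namespace=""):
    """Return a simplified canonical copy of an Avro schema."""
    if isinstance(schema, str):
        return schema
    if isinstance(schema, list):
        # A union: canonicalize each branch
        return [canonicalize(branch, namespace) for branch in schema]
    if isinstance(schema["type"], str) and schema["type"] in PRIMITIVES:
        # {"type": "string"} collapses to "string"
        return schema["type"]
    namespace = schema.get("namespace", namespace)
    out = {}
    for key in KEPT_KEYS:
        if key not in schema:
            continue
        value = schema[key]
        if key == "name" and namespace and "." not in value:
            # Replace the short name with the full name
            value = f"{namespace}.{value}"
        elif key == "fields":
            # Keep only each field's name and type; doc, default, etc. are dropped
            value = [
                {"name": f["name"], "type": canonicalize(f["type"], namespace)}
                for f in value
            ]
        elif key in ("items", "values"):
            value = canonicalize(value, namespace)
        out[key] = value
    return out


def fingerprint(schema):
    """SHA-256 hex digest of the whitespace-free canonical JSON."""
    canonical = json.dumps(canonicalize(schema), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two spellings of the same schema, plus one that really differs
schema_a = {
    "namespace": "example",
    "type": "record",
    "name": "Participant",
    "doc": "A study participant",
    "fields": [
        {"name": "external_id", "type": "string", "doc": "Source identifier"},
        {"name": "gender", "type": ["null", "string"], "default": None},
    ],
}
schema_b = {
    "name": "example.Participant",
    "type": "record",
    "fields": [
        {"type": {"type": "string"}, "name": "external_id"},
        {"name": "gender", "type": ["null", "string"]},
    ],
}
schema_c = {
    "name": "example.Participant",
    "type": "record",
    "fields": [
        {"name": "external_id", "type": "string"},
        {"name": "gender", "type": "string"},
    ],
}

print(fingerprint(schema_a) == fingerprint(schema_b))
print(fingerprint(schema_a) == fingerprint(schema_c))
```

Comparing fingerprints answers "are these the same schema?", and comparing the canonical dicts field by field gives a starting point for diffing.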


# Demonstration

## Setup

If you haven't already set up your environment, do so before you run this notebook:

```shell
$ git clone <>
$ cd <>
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
```

Then restart the Jupyter notebook.

In [68]:
# Helper functions

import json
import os

from click.testing import CliRunner
import yaml
from fastavro import writer, reader, parse_schema
from pprint import pprint

from pfb.cli import main

def read_yaml(filepath):
    with open(filepath, "r") as yaml_file:
        return yaml.load(yaml_file, Loader=yaml.FullLoader)

def read_json(filepath, default=None):
    if (default is not None) and (not os.path.isfile(filepath)):
        return default

    with open(filepath, 'r') as data_file:
        return json.load(data_file)

def write_json(data, filepath, **kwargs):
    # Default to readable output, but let callers override via kwargs
    options = {
        'indent': 4,
        'sort_keys': True
    }
    options.update(kwargs)
    with open(filepath, 'w') as json_file:
        json.dump(data, json_file, **options)

def minify(input_json_file, output_file):
    data = read_json(input_json_file)
    with open(output_file, 'w') as minified_file:
        s = json.dumps(data, separators=(',', ':'))
        minified_file.write(s)
        
def pfb_invoke(*args, **kwargs):
    # Use CliRunner to call the Click CLI from Python
    runner = CliRunner()
    result = runner.invoke(main, args, **kwargs)
    # Report failures without raising so the notebook keeps running
    if result.exit_code != 0:
        print(result.output)
        print(str(result.exc_info))

    return result

## Vanilla Avro

In [47]:
# Output avro filepath
data_file = 'data/kf-vanilla.avro'

# Avro schema describing data that will go into avro file
schema = {
    "namespace": "kf-vanilla.avro",
    "type": "record",
    "name": "Participant",
    "fields": [
        {"name": "external_id", "type": "string"},
        {"name": "gender", "type": ["null", "string"]}
    ]
}
write_json(schema, 'data/kf-vanilla-avro-schema.json')

# Parse the schema into memory so that subsequent ops are faster
parsed_schema = parse_schema(schema)

# Create some data that conforms to the schema
records = [
    {"external_id": "P1", "gender": "female"},
    {"external_id": "P2", "gender": "male"}
]

# Write the schema and data into the avro file
with open(data_file, 'wb') as out:
    writer(out, parsed_schema, records)

# Read the binary avro data back as Python dicts
# (the codec is 'null' here, so the data blocks are not compressed)
with open(data_file, 'rb') as fo:
    print(f'Avro file {data_file} schema:')
    pprint(reader(fo).metadata)
    fo.seek(0)
    print(f'\nAvro file {data_file} data:')
    for record in reader(fo):
        pprint(record)
    

Avro file data/kf-vanilla.avro schema:
{'avro.codec': 'null',
 'avro.schema': '{"type": "record", "name": "kf-vanilla.avro.Participant", '
                '"fields": [{"name": "external_id", "type": "string"}, '
                '{"name": "gender", "type": ["null", "string"]}]}'}

Avro file data/kf-vanilla.avro data:
{'external_id': 'P1', 'gender': 'female'}
{'external_id': 'P2', 'gender': 'male'}


## PFB Avro - Suitable for relational data

In [76]:
# Create test data using gen3 data simulator
# Requires the gen3 data dictionary to be stored on s3
data_dir = 'data/simulated/'
gen3_dd = 'data/kf-gen3-datadict.json'
schema_avro = 'data/kf-pfb-schema.avro'
output_avro = 'data/kf-pfb.avro'
program = 'kidsfirst'
project = 'drc'

# Execute if you don't have any test data yet
# !data-simulator simulate --url https://s3.amazonaws.com/singhn4-data-dict-bucket/kf-gen3-datadict.json --path data/simulated --program kidsfirst --project drc
# !ls -l data/simulated   

[2020-02-21 17:43:30,643][data-simulator][   INFO] Data simulator initialization...
[2020-02-21 17:43:30,644][data-simulator][   INFO] Loading dictionary from url https://s3.amazonaws.com/singhn4-data-dict-bucket/kf-gen3-datadict.json
[2020-02-21 17:43:37,509][data-simulator][   INFO] Initializing graph...
[2020-02-21 17:43:37,513][data-simulator][   INFO] Generating data...
[2020-02-21 17:43:37,514][data-simulator simulate][   INFO] Simulating data for node project
[2020-02-21 17:43:37,577][data-simulator simulate][   INFO] Simulating data for node family
[2020-02-21 17:43:37,578][data-simulator simulate][   INFO] Simulating data for node participant
[2020-02-21 17:43:37,579][data-simulator][   INFO] Done!
total 32
-rw-r--r--  1 singhn4  CHOP-EDU\Domain Users   27 Feb 21 17:43 DataImportOrder.txt
-rw-r--r--  1 singhn4  CHOP-EDU\Domain Users  191 Feb 21 17:43 family.json
-rw-r--r--  1 singhn4  CHOP-EDU\Domain Users  533 Feb 21 17:43 participant.json
-rw-r--r--  1 singhn4  CHOP-EDU\Doma

In [98]:
# Create schema avro file from gen3 data dict
kf_gen3_dd = read_yaml(gen3_dd)

print('******************* Writing PFB Schema *******************')
pfb_invoke('from', '-o', schema_avro, 'dict', gen3_dd)

# Show the avro schema in pfb file
print('******************* PFB Schema *******************')
result = pfb_invoke('show', '-i', schema_avro, 'schema')
pprint(json.loads(result.output))

# Write the test data to the output avro file
print('******************* Writing data to PFB file *******************')
result = pfb_invoke('from', '-o', output_avro, 'json',
                    '-s', schema_avro,
                    '--program', program,
                    '--project', project,
                    data_dir)
print(result.output)

# Read the data back out from the pfb file
print('******************* PFB Data *******************')
print('PFB nodes')
result = pfb_invoke('show', '-i', output_avro, 'nodes')
print(result.output)

# Read the binary avro records back as Python dicts
print('Avro data records')
with open(output_avro, 'rb') as fo:
    for record in reader(fo):
        pprint(record)

******************* Writing PFB Schema *******************
******************* PFB Schema *******************
[{'fields': [{'default': None,
              'name': 'created_datetime',
              'type': ['null', 'string']},
             {'default': None,
              'name': 'updated_datetime',
              'type': ['null', 'string']},
             {'default': None,
              'doc': 'String representing release name.\n',
              'name': 'name',
              'type': ['null', 'string']},
             {'default': None,
              'doc': 'The number identifying the major version.\n',
              'name': 'major_version',
              'type': ['null', 'long']},
             {'default': None,
              'doc': 'The number identifying the minor version.\n',
              'name': 'minor_version',
              'type': ['null', 'long']},
             {'default': None,
              'name': 'release_date',
              'type': ['null', 'string']},
             {'default':

 'relations': []}
