# Notes, 2/8/18
* netCDF-4 and HDF5 share the same underlying file structure
    * That file structure is HDF5
* Python Netcdf4
    * Using many unlimited-size dimensions needs careful attention
        * With more than one unlimited dimension, the default chunk size is 1024 along each unlimited dimension
            * Even if such a dimension ends up small, a size of 1024 is still allocated for it in the file
            * Lots of wasted file space
        * Unlimited dimensions appear to be detrimental to file size
            * Large reduction in file size can be obtained by minimizing the number of unlimited dimensions
        * Variable compression (zlib=True) slows way down with multiple unlimited dimensions
        * These effects are likely coming from the need to keep rebuilding the variable as the dimension sizes keep changing
            * The data has to be uncompressed and then re-compressed
            * The gross mismatch of chunk sizes likely exacerbates this
                * Both for performance and memory usage
    * Example from jedi_bufr2nc
        * jedi_bufr2nc.py Aircraft ../bufr2nc/test/data/gdas.t00z.prepbufr.nr aircraft.test.nc
        * The input prepbufr file is 49MB
        * When compression (level = 6) was used in jedi_bufr2nc.py
            * Default chunk sizing
            * Process took about 20 minutes to run!
            * Output file was about 100MB!
            * A variable with size (1867,1) was using chunk size (1024,1024)
        * Shut off compression (zlib=False)
            * Process much faster --> 2 minutes
            * But output file huge, 1GB!
                * Ridiculous waste, the output data uncompressed should be around 350KB
        * Specified chunksize using a size of 1 for all unlimited dimensions
            * Runtime about the same
            * File size reduced to 86MB
                * Way better, but still excessive waste
        * Used nccopy for two more improvements
            * Change unlimited dims to fixed dims
                * nccopy -u infile outfile
            * Shuffle and compress file (level 6)
                * nccopy -d 6 -s infile outfile
            * Recommended to do these in two distinct steps
                * Compression works much more effectively when dims are fixed
        * Change unlimited dims to fixed dims
            * Long runtime: 5 - 10 minutes
            * File reduced to 6MB
            * Note: no compression has been applied at this point
        * Shuffle and compress file
            * Very fast, 1 second
            * File reduced to 211KB
                * Much more reasonable
        * Summary of impacts, shown in the sequence the actions were tried
        
| Action | File Size | Var Size | Chunk Size |
|:-------|:---------:|:--------:|:----------:|
|Default chunking (zlib=False)|1GB|(1867,1)|(1024,1024)|
|Chunking with size 1 for unlim dims|86MB|(1867,1)|(1,1)|
|Change unlim dims to fixed dims|6MB|(1867,1)|(1867,1)|
|Shuffle and compress|221KB|(1867,1)|(1867,1)|
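
The waste in the default-chunking row can be estimated with a little arithmetic. The sketch below assumes that an uncompressed chunked variable stores every chunk at full size, and that the variable holds 4-byte values (the notes do not record the data type). `allocated_bytes` is a hypothetical helper written for this illustration, not part of netCDF4 or HDF5.

```python
import math

def allocated_bytes(shape, chunks, itemsize=4):
    """Bytes needed to hold `shape` when storage is allocated in whole chunks.

    Assumes uncompressed chunked storage, where a partly filled chunk
    still occupies its full size. itemsize=4 (e.g. float32) is an assumption.
    """
    # Number of chunks needed along each dimension, rounded up.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    return n_chunks * math.prod(chunks) * itemsize

shape = (1867, 1)  # variable size from the table above
data_bytes = math.prod(shape) * 4

for chunks in [(1024, 1024), (1, 1), (1867, 1)]:
    size = allocated_bytes(shape, chunks)
    print(f"chunks={chunks}: {size} bytes allocated for {data_bytes} bytes of data")
```

Under these assumptions a single (1867,1) variable allocates 8388608 bytes (8 MiB) with (1024,1024) chunks, roughly 1,100 times its 7468 bytes of data. The (1,1) case allocates no padding, but this estimate ignores the per-chunk index entries for its 1867 chunks, which may account for part of the overhead still seen in the 86MB file.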


        

# Notes, 2/9/18

## jedi_bufr2nc.py

* Cannot directly query the BUFR file (nor BUFR table) to find dimension sizes
* Options for efficiently writing the netCDF file
    * Two passes through the BUFR file
        * Pass1: read all obs and determine dimension sizes
            * Reading a single representative obs won't work, since the number of levels can differ between variables (T vs U example)
        * Pass2: read obs and transfer to the netcdf file
        
        * Pros:
            * Will get dimension sizes set to the minimum necessary per file
            * Can make all netcdf dimensions fixed size
            
        * Cons:
            * Slow to read the BUFR file twice (but not that bad)
            
    * One pass through the BUFR file and post process the netCDF file
        * Create the file uncompressed, with unlimited dimensions, while reading the BUFR file
        * Convert dims to fixed
        * Compress
        
        * Pros: 
            * Save time by only reading BUFR file once
            * Compression may be optimized since the variables are complete before compression starts
                * May not make much difference
        
        * Cons:
            * Slow
                * The "convert dims to fixed" is an especially slow process

    * One pass through the BUFR file and assume max dimensions for the netCDF file
        * Pros:
            * Fastest execution
        
        * Cons:
            * Wastes space
                * Compression may mitigate this since there will be a lot of repeated values


# Notes, 2/12/18

## jedi_bufr2nc.py

* Created srherbener/bufr2nc repository on GitHub
    * jedi_bufr2nc.py script
    * Two Fortran utilities
        * pb_decode.f90: dump out bufr table and obs (ufbint() calls)
        * pb_decode_events.f90: dump out bufr table and events (ufbevn() calls)
* Test case: gdas.t00z.prepbufr.nr
    * Runs in about 30 seconds
    * Input 49MB
    * Output (AIRCFT, AIRCAR) 200KB
        * 18 messages selected
        * 1867 obs recorded
* Xin's file: prepbufr.gdas.20160304.t06z.nr.48h
    * Runs in 1.5 hours
    * Input 62MB
    * Output (AIRCFT, AIRCAR) 4.3MB
        * 2268 messages selected
        * 222316 obs recorded
* Performance is not good; the Fortran programs run only a little faster, considering that the Python script makes two passes through the BUFR file.
    * Reading obs from a subset is slow
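
To compare the two test cases, the figures above can be turned into rough rates. The runtimes are the approximate values quoted in the notes, so the results are only indicative; the dictionary layout and key names are made up for this sketch.

```python
# Figures copied from the two test cases above; runtimes are approximate.
cases = {
    "gdas.t00z.prepbufr.nr": {
        "seconds": 30, "input_mb": 49, "messages": 18, "obs": 1867,
    },
    "prepbufr.gdas.20160304.t06z.nr.48h": {
        "seconds": 1.5 * 3600, "input_mb": 62, "messages": 2268, "obs": 222316,
    },
}

# Observations written per second of runtime for each case.
rates = {name: c["obs"] / c["seconds"] for name, c in cases.items()}

for name, c in cases.items():
    per_message = c["seconds"] / c["messages"]
    print(f"{name}: {rates[name]:.1f} obs/s, {per_message:.2f} s per selected message")
```

The input file grows by only about 1.3 times between the two cases, while the runtime grows by about 180 times, in the same range as the growth in selected messages (126 times) and recorded obs (about 119 times). That suggests the cost is dominated by per-message and per-obs work rather than by file size.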