# How to Parse CIC Output

The output of the program CIC is a large JSON file. This guide shows how to parse that output using Python.
First, we load the output into a Python dictionary:

In [2]:
import json
with open("inconsistencies.json", "r") as fp:
    inconsistencies = json.load(fp)
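
If the file is not in the working directory, `open` raises `FileNotFoundError`. The following is a minimal sketch of a loader that reports the missing path clearly. The helper name `load_inconsistencies` is hypothetical and not part of CIC.

```python
import json
from pathlib import Path


def load_inconsistencies(path="inconsistencies.json"):
    """Load CIC output into a dictionary (hypothetical helper, not part of CIC)."""
    path = Path(path)
    if not path.is_file():
        # Fail early with the absolute path so a wrong directory is easy to spot.
        raise FileNotFoundError(f"CIC output not found: {path.resolve()}")
    with path.open("r") as fp:
        return json.load(fp)
```

Calling `load_inconsistencies("path/to/inconsistencies.json")` returns the same dictionary as the cell above.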

The format of the dictionary is the following:
```json
{
    "error" : { "data_node" : [  [ group of records (with facets) with one instance id ],
                                 [ group of records (with facets) with another instance id ],
                                 ... 
                              ],
                ...
              },
    ...

}
```
So the following call retrieves the instance id of the first dataset record that has no original record, as reported for the data node aims3.llnl.gov:

In [None]:
first_instance_id = inconsistencies["No original record:"]["aims3.llnl.gov"][0][0]["instance_id"]
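
The direct lookup above raises `KeyError` when the error category or data node is absent from the output, and `IndexError` when a list is empty. Below is a sketch of a more tolerant lookup. The helper `get_first_instance_id` is hypothetical, and the sample dictionary is made up to match the format described above.

```python
def get_first_instance_id(inconsistencies, error, data_node):
    """Return the instance id of the first record in the first group, or None."""
    groups = inconsistencies.get(error, {}).get(data_node, [])
    if not groups or not groups[0]:
        return None
    return groups[0][0].get("instance_id")


# Made-up sample in the documented format (not real CIC output).
sample = {
    "No original record:": {
        "aims3.llnl.gov": [[{"instance_id": "example.dataset.v1"}]]
    }
}
print(get_first_instance_id(sample, "No original record:", "aims3.llnl.gov"))
print(get_first_instance_id(sample, "Duplicate records:", "aims3.llnl.gov"))
```

The second call returns `None` rather than raising, because the sample has no `"Duplicate records:"` entry.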

The first index 0 selects the first group, and the second index 0 selects the first record within that group. The full list of facets for each record is as follows:  
`instance_id`: the instance_id of the record  
`number_of_files`: how many files are part of this record  
`_timestamp`: the time this record was created  
`data_node`: the data node corresponding to the record  
`replica`: either true or false, indicates whether record is an original or replica  
`institution_id`: the institution corresponding to the record  
`latest`: either true or false, indicates whether or not record is latest version of the dataset  
`version`: the version of the dataset record  
`retracted`: either true or false, indicates whether or not the record has been retracted  
`id`: the id of the dataset, different format from instance_id  
`activity_drs`: the first (or only) activity id of the dataset  
`activity_id`: the list of activity ids of the dataset  
`source_id`: the source id of the dataset  
`experiment_id`: the experiment id of the dataset  
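
Since every record carries these facets, it can be convenient to walk the nested dictionary one record at a time. The sketch below assumes the format described earlier. The generator `iter_records` is a hypothetical helper, and the sample values are made up.

```python
def iter_records(inconsistencies):
    """Yield (error, data_node, record) for every record in the CIC output."""
    for error, nodes in inconsistencies.items():
        for data_node, groups in nodes.items():
            for group in groups:
                for record in group:
                    yield error, data_node, record


# Made-up sample with a subset of the facets listed above.
sample = {
    "Original record retracted:": {
        "aims3.llnl.gov": [
            [
                {"instance_id": "example.dataset.v1", "number_of_files": 3},
                {"instance_id": "example.dataset.v1", "number_of_files": 3},
            ]
        ]
    }
}

# Collect the distinct instance ids that appear anywhere in the output.
instance_ids = {record["instance_id"] for _, _, record in iter_records(sample)}
print(instance_ids)
```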

The full list of errors is as follows:  
`"No original record:"` indicates a batch of replicas with no original record, a sign that the original was deleted improperly  
`"Inconsistent number of files (esgf replica issue):"` the facet `number_of_files` differs between records of the same dataset, and the timestamp and replica status indicate that the record was changed by ESGF  
`"Inconsistent number of files (client issue):"` same as above, except that the timestamp and replica status indicate that the record was changed by the client  
`"Original record not latest version:"` the original record in a group was marked not latest  
`"Original record retracted:"` the original record was marked retracted  
`"Duplicate records:"` flags all records which are exact duplicates of one another  
`"Replica retracted, original not retracted:"` indicates a group where the replicas were retracted but the original record was not  
`"Failed activity check:"` record has an invalid activity id according to CMIP6 CMOR table  
`"Failed experiment_id check:"` record's experiment id does not correspond to its activity id according to CMIP6 CMOR table  
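
To see which of these errors dominate a given output file, the groups can be tallied per error category. This is a sketch under the format described earlier. The helper `count_groups` is hypothetical, and the sample data is made up.

```python
from collections import Counter


def count_groups(inconsistencies):
    """Count flagged groups per error category, summed over all data nodes."""
    counts = Counter()
    for error, nodes in inconsistencies.items():
        counts[error] += sum(len(groups) for groups in nodes.values())
    return counts


# Made-up sample: two groups under one error, one group under another.
sample = {
    "No original record:": {
        "aims3.llnl.gov": [[{"instance_id": "a.v1"}], [{"instance_id": "b.v1"}]],
    },
    "Duplicate records:": {
        "aims3.llnl.gov": [[{"instance_id": "c.v1"}, {"instance_id": "c.v1"}]],
    },
}

for error, n_groups in count_groups(sample).most_common():
    print(error, n_groups)
```

The counts are per group, not per record, so a group of several replicas counts once.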

As a final example, we loop through the dictionary and write all the groups of records flagged for a given data node to a file:

In [None]:
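DATA_NODE = "aims3.llnl.gov"
# Write each flagged group for DATA_NODE as one JSON document per line,
# so that the file can be read back line by line.
with open("file.txt", "w") as fp:
    for nodes in inconsistencies.values():
        # Error categories that do not mention DATA_NODE contribute nothing.
        for recs in nodes.get(DATA_NODE, []):
            fp.write(json.dumps(recs) + "\n")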