# NWB HACKATHON 2020: Injecting Metadata into IPFX converted NWB(s)


---



First, we have to install some dependencies.

In [None]:
from IPython.display import clear_output
!git clone https://github.com/smestern/ipfx.git
!git clone https://github.com/smestern/example-abf-files.git


fatal: destination path 'ipfx' already exists and is not an empty directory.
fatal: destination path 'example-abf-files' already exists and is not an empty directory.


In [None]:
!apt-get install -qq /content/ipfx
!pip uninstall statsmodels -y
!pip uninstall tables -y
!pip install statsmodels==0.9.0
!pip install tables==3.5.1
!pip install /content/ipfx --log /content/log.txt
!pip install pynwb
!pip install nwbwidgets
clear_output()

In [None]:
import shutil
import os
import h5py
import argparse
import logging
import pynwb
import json
log = logging.getLogger(__name__)
import pyabf
from ipfx.x_to_nwb.ABFConverter import ABFConverter
from ipfx.x_to_nwb.DatConverter import DatConverter
import numpy as np
import pandas as pd
import collections
from hdmf.utils import docval, popargs
from pynwb import NWBFile, register_class, load_namespaces, NWBHDF5IO, CORE_NAMESPACE, get_class
from pynwb.spec import NWBNamespaceBuilder, NWBGroupSpec, NWBAttributeSpec
from pynwb.file import LabMetaData

# Adding metadata

The commands above download a few example ABF files for this tutorial and install my (slightly modified) version of the Allen Institute's IPFX. From here we will organize and convert the ABFs into NWB files using metadata stored in json files.

Adding metadata to the NWB presents a number of challenges. Notably, ABFConverter currently does not use additional metadata included in the json. For example, a 'session description' entry in the json is not written as the session description in the NWB, although [pull request 400](https://github.com/AllenInstitute/ipfx/pull/400) attempts to fix this.

## Step 1 - Define your Metadata in a json

Much like the MCC settings pulled earlier, the script is designed to use json files containing metadata that may not be available at the time of conversion.

If you want to overwrite a specific value, the json dictionary should be organized exactly like the NWB file. For example, suppose you want to overwrite the gain for sweep 1. In the NWB this is located at

-|Acquisition (group)  
---|index_000 (group)  
-----> gain: value  

So we include the following field in the json file:


```
"acquisition": {
    "index_000": {
      "gain": 555
    }
  }
```
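
As a rough sketch, the same fragment can be built in Python and written to disk with the standard `json` module. The file name `my-metadata.json` is a placeholder for illustration and is not one of the files shipped with the example repository.

```python
import json
import os
import tempfile

# Nested dict mirroring the NWB layout: group -> subgroup -> dataset.
metadata = {
    "session_description": "TEST DESC",
    "acquisition": {
        "index_000": {
            "gain": 555,
        },
    },
}

# Write to a temporary folder; "my-metadata.json" is a hypothetical name.
out_path = os.path.join(tempfile.mkdtemp(), "my-metadata.json")
with open(out_path, "w") as fh:
    json.dump(metadata, fh, indent=2)

# Read it back to confirm the nesting survived the round trip.
with open(out_path) as fh:
    reloaded = json.load(fh)
print(reloaded["acquisition"]["index_000"]["gain"])
```

Top-level entries such as `session_description` sit next to `acquisition`, because that is where they live in the NWB file.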



I have prebuilt some metadata into the included json file:

In [None]:
json_path = "//content//example-abf-files//mcc-settings.json"

def loadJSON(filename):
    """Load a json file into a dict."""
    with open(filename) as fh:
        return json.load(fh)

meta_data = loadJSON(json_path)


## Step 2 - Inject the metadata

Here we will use the previously built NWB as an example. Let's load the NWB and inspect the gain and session description:



In [None]:
file = "//content//example-abf-files//M10_SA_A1_C07.nwb"
with h5py.File(file,  "r") as nwb:
  print(F"gain: {nwb['acquisition']['index_000']['gain'][()]}")
  print(F"sess desc: {nwb['session_description'][()]}")

gain: 555.0
sess desc: TEST DESC


The following function takes the path to an NWB file and the path to a json file of metadata (or a list of such paths), and injects the metadata into the file. Note that pyNWB and the NWB team seem to oppose overwriting and/or deleting data, so this script has to go about it in a hacky way.

In [None]:
def confirm_metadata(file, mjson, meta_field=True):
    """
    Takes an input NWB and input json file(s). For keys present in both the NWB and the json, overwrites the NWB value
    with the json value using h5py, since metadata added to the json file is sometimes ignored by the ABFConverter.
    Novel metadata keys are meant to be added as NWB extensions using pyNWB; that step is currently disabled.
    Takes:
    file: path to NWB (hdf5) file,
    mjson: path to the json file (or list of paths) to be injected into file.
    meta_field: currently unused. Intended behaviour: if True, all novel data is filed under a new group in the NWB file titled 'metadata', otherwise it is placed in the base group.
    returns:
    None; the NWB file is modified in place.
    """
    def loadJSON(filename):
        if isinstance(filename, (list, np.ndarray)):
            full_dict = {}
            for js in filename:
                with open(js) as fh:
                     full_dict.update(json.load(fh))
            return full_dict
        else:
           with open(filename) as fh:
                return json.load(fh)
    def _h5_merge(dict1, dict2):
        ''' Recursively merges the input'''

        result = dict1

        for key, value in dict2.items():
            if isinstance(value, dict):  # json.load returns plain dicts; collections.Mapping was removed in Python 3.10
                _h5_merge(result.get(key, {}), value)
            else:
                result[key][...] = dict2[key]

        return result
    def dict_to_list(dict1):
        items = []  # avoid shadowing the built-in `list`
        for key, value in dict1.items():
            if isinstance(value, dict):
                items.append(dict_to_list(value))
            else:
                items.append((str(key), str(value)))
        return items

    metadata = loadJSON(mjson)
    with h5py.File(file,  "r+") as f: ##Has to be opened with h5py as pyNWB does not support overwrite
        NWB_f = f
        nwb_keys = list(NWB_f.keys())
        meta_keys = list(metadata.keys())
        overlap_keys = np.intersect1d(nwb_keys,meta_keys) ##Look for overlapping and overwrite
        for key in overlap_keys:
            if isinstance(metadata[key], dict):
                d = _h5_merge(f[key], metadata[key]) ## if its a dict instance, we begin the process of merging
                ##recursively 
            else:
                f[key][...] = metadata[key] ## Otherwise just overwrite. 
        novel_keys = np.setdiff1d(meta_keys, overlap_keys) ##Grab the novel keys for later
    n_metadata = {key: metadata[key] for key in novel_keys}
    ##Now close the nwb and open with pynwb
    if False:
      ##This section is not working in colab so avoid
      with pynwb.NWBHDF5IO(file,  mode="r+") as f_io:
          ### Now add novel data using pynwb in a way thats a lot less brute force, and way more nwb friendly
          f = f_io.read()
          NWB_f = f
          if False:
              ##Currently not working 
              ##Class is compiled but attributes are not written to files
              meta_class = build_settings(n_metadata)
              test = get_class('MetaData', 'NHP')

              nwb_meta = test(name='meta', experiment_id=int(12), test='test')
              NWB_f.add_lab_meta_data(nwb_meta)
          else:
              ## For now just dump into scratch ##Goes against NWB Conventions however
              for key, value in n_metadata.items():
                  if isinstance(value, dict):
                      cont = dict_to_list(value)
                      for x in cont:
                        NWB_f.add_scratch([x[1]], name=str(x[0]), notes=str(key))
                  else:
                      NWB_f.add_scratch([value], name=str(key), notes="null")
          f_io.write(NWB_f)
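
The overlap-then-merge logic is easier to follow on plain dictionaries. The sketch below is a simplified stand-in, not the function above: it assumes the NWB contents are represented as a nested `dict`, and `split_keys` and `merge_nested` are hypothetical helpers introduced only for illustration. Unlike `_h5_merge`, it skips keys that are missing from the target.

```python
def split_keys(existing, metadata):
    """Return (overlapping, novel) top-level keys of metadata."""
    overlapping = sorted(k for k in metadata if k in existing)
    novel = sorted(k for k in metadata if k not in existing)
    return overlapping, novel


def merge_nested(target, updates):
    """Recursively overwrite values in target that also appear in updates.

    Keys missing from target are skipped, so nothing new is created.
    """
    for key, value in updates.items():
        if key not in target:
            continue
        if isinstance(value, dict) and isinstance(target[key], dict):
            merge_nested(target[key], value)
        else:
            target[key] = value
    return target


# Stand-in for the file contents before injection.
# "other" is a made-up leaf that the json does not mention.
nwb_like = {
    "session_description": "PLACEHOLDER",
    "acquisition": {"index_000": {"gain": 1.0, "other": "untouched"}},
}
metadata = {
    "session_description": "TEST DESC",
    "acquisition": {"index_000": {"gain": 555}},
    "experimenter_notes": "not present in the file",  # hypothetical novel key
}

overlapping, novel = split_keys(nwb_like, metadata)
merge_nested(nwb_like, {k: metadata[k] for k in overlapping})
print(overlapping, novel)
print(nwb_like)
```

Only the overlapping keys are merged; the novel keys are set aside, which is the part the function above intends to hand to pyNWB.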


Now we try it and check whether it works.

In [None]:
#try it
confirm_metadata(file, json_path)

In [None]:
file = "//content//example-abf-files//M10_SA_A1_C07.nwb"
with h5py.File(file,  "r") as nwb:
  print(F"gain: {nwb['acquisition']['index_000']['gain'][()]}")
  print(F"sess desc: {nwb['session_description'][()]}")

gain: 555.0
sess desc: TEST DESC



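You can see that the values match the json file. (In this run the values printed before injection already matched, most likely because the file had been modified by an earlier run of the notebook.)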
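
Checking two values by eye does not scale to a larger json. As a rough sketch, a recursive comparison can list every entry that did not end up as expected. `find_mismatches` is a hypothetical helper, demonstrated here on plain dictionaries. It should also work on an open h5py file if `read` is set to something like `lambda ds: ds[()]`, though string datasets may come back as bytes and would then need decoding first.

```python
def find_mismatches(node, expected, read=lambda leaf: leaf, path=""):
    """Compare nested expected values against node and list the differences.

    node: nested mapping holding the actual values.
    expected: nested dict of the values we hoped to inject.
    read: converts a leaf of node into a comparable Python value.
    """
    problems = []
    for key, want in expected.items():
        here = f"{path}/{key}"
        if key not in node:
            problems.append((here, "missing", want))
        elif isinstance(want, dict):
            problems.extend(find_mismatches(node[key], want, read, here))
        else:
            got = read(node[key])
            if got != want:
                problems.append((here, got, want))
    return problems


# Plain-dict stand-in for the file after injection.
after = {
    "session_description": "TEST DESC",
    "acquisition": {"index_000": {"gain": 555.0}},
}
expected = {
    "session_description": "TEST DESC",
    "acquisition": {"index_000": {"gain": 555}},
}
print(find_mismatches(after, expected))  # an empty list means everything matched
```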


## Step 3 - Putting it all together

Ideally we would use injection like this at the time of conversion. For our purposes we can call it immediately after converting the ABF(s). Here we use the unmodified Allen convert function.

In [None]:
def convert(inFileOrFolder, overwrite=False, fileType=None, outputMetadata=False, outputFeedbackChannel=False, multipleGroupsPerFile=False, compression=True):
    """
    Convert the given file to a NeuroDataWithoutBorders file using pynwb

    Supported fileformats:
        - ABF v2 files created by Clampex
        - DAT files created by Patchmaster v2x90

    :param inFileOrFolder: path to a file or folder
    :param overwrite: overwrite output file, defaults to `False`
    :param fileType: file type to be converted, must be passed iff `inFileOrFolder` refers to a folder
    :param outputMetadata: output metadata of the file, helpful for debugging
    :param outputFeedbackChannel: Output ADC data which stems from stimulus feedback channels (ignored for DAT files)
    :param multipleGroupsPerFile: Write all Groups in the DAT file into one NWB
                                  file. By default we create one NWB per Group (ignored for ABF files).
    :param compression: Toggle compression for HDF5 datasets

    :return: path of the created NWB file
    """

    if not os.path.exists(inFileOrFolder):
        raise ValueError(f"The file {inFileOrFolder} does not exist.")

    if os.path.isfile(inFileOrFolder):
        root, ext = os.path.splitext(inFileOrFolder)
    if os.path.isdir(inFileOrFolder):
        if not fileType:
            raise ValueError("Missing fileType when passing a folder")

        inFileOrFolder = os.path.normpath(inFileOrFolder)
        inFileOrFolder = os.path.realpath(inFileOrFolder)

        ext = fileType
        root = os.path.join(inFileOrFolder, "..",
                            os.path.basename(inFileOrFolder))

    outFile = root + ".nwb"

    if not outputMetadata and os.path.exists(outFile):
        if overwrite:
            os.remove(outFile)
        else:
            raise ValueError(f"The output file {outFile} does already exist.")

    if ext == ".abf":
        if outputMetadata:
            ABFConverter.outputMetadata(inFileOrFolder)
        else:
            ABFConverter(inFileOrFolder, outFile, outputFeedbackChannel=outputFeedbackChannel, compression=compression)
    elif ext == ".dat":
        if outputMetadata:
            DatConverter.outputMetadata(inFileOrFolder)
        else:
            DatConverter(inFileOrFolder, outFile, multipleGroupsPerFile=multipleGroupsPerFile, compression=compression)

    else:
        raise ValueError(f"The extension {ext} is currently not supported.")

    return outFile

Now we bulk convert again. Note that this comes from [this](https://github.com/smestern/ipfx/blob/master/ipfx/bin/run_bulk_to_nwb_conversion.py) script, which allows you to call bulk conversion from the command line.

In [None]:

bmeta=True
meta = "//content//example-abf-files//mcc-settings_con.json"
root_path = ["//content//example-abf-files//Example Files"]

for path in root_path:
        print(path)
        for r, celldir, f in os.walk(path):
              
              for c in celldir: ##Walks through each folder (cell folder) in the root folder

                  c = os.path.join(r, c) ##loads the subdirectory path
                  ls = os.listdir(c) ##Lists the files in the subdir
                  
                  abf_pres = np.any(['.abf' in x for x in ls]) #Looks for the presence of at least one abf file in the folder (does not check subfolders)
                  if abf_pres:
                        if bmeta: ##If the user provided an additional json file, we copy that into the subfolder
                            shutil.copy(meta,c) 
                            
                        print(f"Converting {c}")
                        nwb_r = convert(c,
                                overwrite=True,
                                fileType='.abf',
                                outputMetadata=False,
                                outputFeedbackChannel=False,
                                multipleGroupsPerFile=True,
                                compression=True)
                        file = nwb_r
                        with h5py.File(file,  "r") as nwb:
                          print(F"gain: {nwb['acquisition']['index_00']['gain'][()]}")
                          print(F"sess desc: {nwb['session_description'][()]}")
                        confirm_metadata(nwb_r,meta) ##Call confirm metadata to overwrite the data
                        with h5py.File(file,  "r") as nwb:
                          print(F"gain: {nwb['acquisition']['index_00']['gain'][()]}")
                          print(F"sess desc: {nwb['session_description'][()]}")
                        os.remove(os.path.join(c,os.path.basename(meta)))
     

//content//example-abf-files//Example Files
Converting //content//example-abf-files//Example Files/Cell 1


  warn("Date is missing timezone information. Updating to local timezone.")


gain: 1.0
sess desc: PLACEHOLDER
gain: 555.0
sess desc: TEST DESC
Converting //content//example-abf-files//Example Files/Cell 2




gain: 1.0
sess desc: PLACEHOLDER
gain: 555.0
sess desc: TEST DESC


In principle, we convert and overwrite in one go.
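
The folder discovery step can be pulled out into a small generator, which makes the loop easier to test on its own. This is a sketch rather than the script's actual code: `find_abf_folders` is a hypothetical helper, and the folder tree is built in a temporary directory with made-up file names. It matches on the file extension, whereas the loop above checks whether `'.abf'` appears anywhere in the name.

```python
import os
import tempfile


def find_abf_folders(root):
    """Yield every directory under root that directly contains a .abf file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(name.lower().endswith(".abf") for name in filenames):
            yield dirpath


# Build a throwaway folder tree to demonstrate (names are made up).
root = tempfile.mkdtemp()
layout = {"Cell 1": ["a.abf", "b.abf"], "Cell 2": ["c.abf"], "notes": ["readme.txt"]}
for cell, files in layout.items():
    os.makedirs(os.path.join(root, cell))
    for name in files:
        open(os.path.join(root, cell, name), "w").close()

folders = sorted(os.path.basename(p) for p in find_abf_folders(root))
print(folders)
```

Each folder it yields could then be passed to `convert(...)` with `fileType='.abf'`, followed by `confirm_metadata(...)` on the returned path, as in the loop above.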

## (optional) Removing Sweeps that fail QC

Ideally, we could remove sweeps from the ABF files. However, removing sweeps from an ABF also removes ALL stim info from the ABF file.  
To get around this, we can delete the sweeps using h5py, so that the remaining sweeps retain their stim info.  
NOTE: Deleting data from NWBs seems to be highly discouraged by the pyNWB team, so doing this is not best practice. pyNWB has no built-in way to remove data.

In [None]:
def remove_sweeps(nwb_file, qcsweeps):
  qcsweeps = np.asarray(qcsweeps) #If given a list convert to array
  file_path = nwb_file
  print(f"QC'ing {file_path}")
  with h5py.File(file_path,  "r") as f:
            item = f['acquisition']
            sweeps = item.keys()
            print(sweeps)
  qc_names = []
  for x in qcsweeps:
          if x < 10:
              qc_names.append(f"index_0{x}") ##This is for if the file has under 100 sweeps, otherwise the names will be something like, index_00X
          else:
              qc_names.append(f"index_{x}")
  print(qc_names)
  with h5py.File(file_path,  "a") as f:
        item = f['acquisition'] ##Delete the response recording
        for p in qc_names:
              try:
                  del item[p] #For whatever reason these deletes try to do it twice ignore the second error message
                  print(f'deleted {p}')
              except:
                  print(f'{p} delete fail')
        item = f['stimulus'] #next delete the stimset
        for p in qc_names:
              try:
                  del item[p]
                  print(f'deleted {p}')
              except:
                  print(f'{p} delete fail')
        print(item.keys())
        item = f['general']['intracellular_ephys']['sweep_table'] #next delete the references in the sweep table, or else the nwbs may break analysis
        ## Since IPFX may go looking for sweeps that are absent
        for key, value in item.items():
              array = value[()]
              ind = np.arange(0, len(array))
              
              bool_mask = np.isin(ind, qcsweeps, invert=True)  # np.isin replaces the deprecated np.in1d
              new_data = array[bool_mask]
              try:
                del item[key]
                item[key] = new_data
                print(f'deleted and rewrote {key}')
              except: 
                print(f'{key} delete fail')
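
The function above hardcodes two-digit names such as `index_02`, while files with more sweeps use three digits (`index_000`). As a sketch, the padding width can be inferred from the keys already in the file. `sweep_names` is a hypothetical helper, and it assumes every key in the group shares one prefix and one padding width.

```python
def sweep_names(sweep_numbers, existing_keys, prefix="index_"):
    """Build dataset names for sweep numbers, matching the file's zero padding."""
    # Infer the padding width from the existing keys, e.g. "index_07" -> 2.
    widths = {len(k) - len(prefix) for k in existing_keys if k.startswith(prefix)}
    if len(widths) != 1:
        raise ValueError(f"cannot infer a single padding width from {sorted(widths)}")
    width = widths.pop()
    return [f"{prefix}{int(n):0{width}d}" for n in sweep_numbers]


print(sweep_names([2, 3], ["index_00", "index_01", "index_70"]))
print(sweep_names([2, 13], ["index_000", "index_150"]))
```

Inside `remove_sweeps`, the existing keys would come from `f['acquisition'].keys()`.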


The function tries to delete the sweeps that you pass in. However, I may have missed something.  
The cell below shows how to use the function: it calls it and confirms the sweeps are removed.

In [None]:
sweeps = [2,3]
remove_sweeps(nwb_r, sweeps)
with h5py.File(nwb_r,  "r") as f:
            item = f['acquisition']
            sweeps = item.keys()

            print(sweeps) ##Confirm deletion
            print(f['general']['intracellular_ephys']['sweep_table']['id'][()])

QC'ing /content/example-abf-files/Example Files/Cell 2/../Cell 2.nwb
<KeysViewHDF5 ['index_00', 'index_01', 'index_02', 'index_03', 'index_04', 'index_05', 'index_06', 'index_07', 'index_08', 'index_09', 'index_10', 'index_11', 'index_12', 'index_13', 'index_14', 'index_15', 'index_16', 'index_17', 'index_18', 'index_19', 'index_20', 'index_21', 'index_22', 'index_23', 'index_24', 'index_25', 'index_26', 'index_27', 'index_28', 'index_29', 'index_30', 'index_31', 'index_32', 'index_33', 'index_34', 'index_35', 'index_36', 'index_37', 'index_38', 'index_39', 'index_40', 'index_41', 'index_42', 'index_43', 'index_44', 'index_45', 'index_46', 'index_47', 'index_48', 'index_49', 'index_50', 'index_51', 'index_52', 'index_53', 'index_54', 'index_55', 'index_56', 'index_57', 'index_58', 'index_59', 'index_60', 'index_61', 'index_62', 'index_63', 'index_64', 'index_65', 'index_66', 'index_67', 'index_68', 'index_69', 'index_70']>
['index_02', 'index_03']
deleted index_02
deleted index_03
inde

Ideally, this is used in bulk with a dataframe that pairs cell files with sweeps that fail QC. A rough implementation of that is [here](https://github.com/smestern/ipfx/blob/master/ipfx/bin/run_sweep_QC.py).
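
As a rough idea of what that pairing could look like, independent of the linked script, the sketch below groups a QC table by file. The rows, the column names `cell_file` and `sweep`, and the helper `group_failed_sweeps` are all hypothetical. With pandas, `df.to_dict("records")` produces rows of the same shape.

```python
from collections import defaultdict

# Hypothetical QC table: one row per failed sweep (file names are made up).
qc_rows = [
    {"cell_file": "Cell 1.nwb", "sweep": 2},
    {"cell_file": "Cell 1.nwb", "sweep": 3},
    {"cell_file": "Cell 2.nwb", "sweep": 10},
]


def group_failed_sweeps(rows):
    """Collect failed sweep numbers per file, sorted and without duplicates."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["cell_file"]].add(int(row["sweep"]))
    return {path: sorted(sweeps) for path, sweeps in grouped.items()}


failed = group_failed_sweeps(qc_rows)
print(failed)

# With the notebook's function in scope, the bulk step would be roughly:
# for path, sweeps in failed.items():
#     remove_sweeps(path, sweeps)
```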