## I/O

This notebook shows how to select and configure backends for PLINK dataset import, and how to register custom I/O extensions.

In [1]:
import sys
sys.path.append(".")
from lib import api
import pandas as pd
import xarray as xr
%run nb/paths.py
xr.set_options(display_style='html');

In [2]:
# Path to PLINK dataset for demonstration
path = HAPMAP_PLINK_PATH_01
path

'/home/eczech/data/gwas/tutorial/1_QC_GWAS/HapMap_3_r3_1'

### Choosing Backends

As in pandas and xarray, settings for I/O backends are managed through a global options system (here, it is the pandas options machinery itself).

An advantage of this approach is that the options are exposed as callable Python objects with docstrings, which makes interactive exploration easier. It also gives a convenient way to scope settings globally, locally with a context manager, or per call.
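The three scoping levels can be pictured with a small standalone sketch. Everything in it (`_options`, `set_option`, `option_context`, `read`) is a hypothetical stand-in written for illustration, not this library's implementation; pandas exposes the same ideas through `pd.set_option` and `pd.option_context`.

```python
from contextlib import contextmanager

# Hypothetical options store; a stand-in, not the library's implementation.
_options = {"io.plink.backend": "auto"}


def get_option(key):
    return _options[key]


def set_option(key, value):
    """Global scope: the change persists until the option is set again."""
    _options[key] = value


@contextmanager
def option_context(key, value):
    """Local scope: the previous value is restored when the block exits."""
    previous = _options[key]
    _options[key] = value
    try:
        yield
    finally:
        _options[key] = previous


def read(path, backend=None):
    """Per-call scope: an explicit argument wins over the stored option."""
    return backend if backend is not None else get_option("io.plink.backend")


observed = [read("data")]                         # stored default
with option_context("io.plink.backend", "pysnptools"):
    observed.append(read("data"))                 # scoped override
observed.append(read("data"))                     # restored on exit
observed.append(read("data", backend="custom"))   # per-call override
```

The `try`/`finally` in the context manager is what guarantees the previous value comes back even if the body raises.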

In [3]:
# Config properties are live attributes (autocomplete makes them easy to find)
api.config.io.plink.backend

'auto'

In [4]:
# Most backends will default to 'auto' so that they can be chosen based on what's installed
api.config.describe('io.plink.backend')

io.plink.backend Options: ['pysnptools']; default is auto
    [default: auto] [currently: auto]
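One plausible way to resolve `'auto'` is to probe for installed modules and take the first backend whose dependency is importable. The sketch below assumes that strategy; `resolve_backend` and the `candidates` mapping are hypothetical names, and the library may choose differently.

```python
import importlib.util


def resolve_backend(requested, candidates):
    """Return the backend name to use.

    `candidates` maps a backend name to the module it depends on.
    An explicit request is returned unchanged; 'auto' picks the first
    candidate whose module can be found without importing it.
    """
    if requested != "auto":
        return requested
    for name, module in candidates.items():
        if importlib.util.find_spec(module) is not None:
            return name
    raise ImportError(
        f"No backend available; install one of: {sorted(candidates.values())}"
    )


# 'json' ships with Python, so it stands in for an installed dependency.
candidates = {"missing": "surely_not_an_installed_module_xyz", "builtin": "json"}
chosen = resolve_backend("auto", candidates)
explicit = resolve_backend("pysnptools", candidates)
```

Using `importlib.util.find_spec` avoids importing a heavy dependency just to learn whether it exists.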


In [5]:
# Set an option globally to use pysnptools for PLINK I/O (https://github.com/MicrosoftGenomics/PySnpTools)
api.config.set('io.plink.backend', 'pysnptools')
api.config.describe('io.plink.backend')

io.plink.backend Options: ['pysnptools']; default is auto
    [default: auto] [currently: pysnptools]


In [6]:
# Use global options set above to read a file
ds = api.read_plink(path, chunks='auto', fam_sep=' ', bim_sep='\t')

# OR Define context manager for more local scoping
with api.config.context('io.plink.backend', 'pysnptools'):
    ds = api.io.read_plink(path, chunks='auto', fam_sep=' ', bim_sep='\t')
    
# OR Set backend for each call
ds = api.io.read_plink(path, chunks='auto', backend='pysnptools', fam_sep=' ', bim_sep='\t')

ds

Dask array summaries from the dataset's HTML repr (variable names were not preserved):

Bytes,Shape,Chunk bytes,Chunk shape,Count,Type
240.55 MB,"(1457897, 165)",134.22 MB,"(813440, 165)","5 Tasks, 2 Chunks","int8, numpy.ndarray"
240.55 MB,"(1457897, 165)",134.22 MB,"(813440, 165)","5 Tasks, 2 Chunks","int8, numpy.ndarray"
240.55 MB,"(1457897, 165)",134.22 MB,"(813440, 165)","5 Tasks, 2 Chunks","bool, numpy.ndarray"
240.55 MB,"(1457897, 165)",134.22 MB,"(813440, 165)","5 Tasks, 2 Chunks","bool, numpy.ndarray"

### Using Custom Backends

Custom backends can be registered at runtime through an unmanaged extension mechanism, similar to those used in pandas or exposed by the [Astropy I/O Registry](https://docs.astropy.org/en/stable/io/registry.html#io-registry). These are "unmanaged" in the sense that it is entirely up to the user to install dependencies and make them compatible.

In [7]:
# Create a custom backend that just reads PLINK pedigree (.fam) data
class CustomPLINKBackend(api.io.PLINKBackend):

    id = 'custom-backend'

    def read_plink(self, path):
        # .fam column order: family ID, within-family (sample) ID, paternal ID,
        # maternal ID, sex code (1 = male, 2 = female), phenotype
        cols = ['fam_id', 'sample_id', 'pat_id', 'mat_id', 'sex', 'phenotype']
        return pd.read_csv(path + '.fam', sep=' ', names=cols)

In [8]:
# Register the backend and use it
api.io.register_backend(CustomPLINKBackend())

df = api.read_plink(path, backend='custom-backend')
df

,fam_id,sample_id,pat_id,mat_id,sex,phenotype
0,1328,NA06989,0,0,2,2
1,1377,NA11891,0,0,1,2
2,1349,NA11843,0,0,1,1
3,1330,NA12341,0,0,2,2
4,1444,NA12739,NA12748,NA12749,1,-9
...,...,...,...,...,...,...
160,1447,NA12752,NA12760,NA12761,1,-9
161,1346,NA12043,0,0,1,2
162,1375,NA12264,0,0,1,1
163,1349,NA10854,NA11839,NA11840,2,-9
