## Indexing

Xarray indexes are crucial for efficient filtering and for aligning multiple datasets.  Xarray has no true SQL-style join operator, but methods such as `merge`, `align`, and `assign` all implicitly use indexes to perform "joins".  A notable caveat is that these functions assume the indexes being aligned are unique, whereas a traditional join operator repeats rows for non-unique values of the fields in a join clause.
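
As a minimal sketch of this implicit join behavior, the toy arrays below are aligned on a shared `sample` index with `xr.align`. The sample labels are made up for illustration and are not taken from the dataset used later in this notebook.

```python
import numpy as np
import xarray as xr

# Two toy arrays that share only some labels on the `sample` index
a = xr.DataArray([1, 2, 3], dims="sample", coords={"sample": ["s1", "s2", "s3"]})
b = xr.DataArray([10, 20, 30], dims="sample", coords={"sample": ["s2", "s3", "s4"]})

# Inner join: keep only the labels present in both indexes
inner_a, inner_b = xr.align(a, b, join="inner")

# Outer join: keep the union of labels, filling missing entries with NaN
outer_a, outer_b = xr.align(a, b, join="outer")

print(inner_a.sample.values.tolist())
print(outer_a.sample.values.tolist())
```

Both joins here rely on every label appearing at most once in each index.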

An important outstanding question is whether data structure indexes should be managed internally.  A good example of this is the Hail [Locus](https://hail.is/docs/0.2/genetics/hail.genetics.Locus.html?highlight=locus) struct, which represents a combination of `contig`, `position`, and `reference_genome`.  It is a core part of Hail's data model, and many internal method implementations assume these fields exist or set them, along with an `alleles` array, as the primary key.  We could assume that these fields are always required, or we could leave it to users to build an index of some kind and trust that whatever index is set is sufficient.  Potential reasons for the latter include:

- Allowing different naming conventions
- Data may not be specific to genomic coordinates (e.g. amino acids in HLA association studies)
- There may be no coordinates at all (e.g. alignment-free GWAS)
- Operations for variant data may also be applicable to phenotypes (e.g. PheWAS)
- `contig`, `position`, `reference_genome`, and `alleles` do not always make a variant unique -- a user may also wish to incorporate rsID or some other identifier.
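
Whichever fields are chosen, it is cheap to check whether they actually form a unique key before relying on them. The sketch below uses made-up variants and assumes the fields are stored as `contig`, `pos`, and `alleles` along a `variant` dimension; these names and values are illustrative only.

```python
import xarray as xr

# Made-up variants: the first two share contig and position but differ by alleles
ds_toy = xr.Dataset({
    "contig": ("variant", [1, 1, 1, 2]),
    "pos": ("variant", [100, 100, 200, 100]),
    "alleles": ("variant", ["A/G", "A/T", "C/T", "A/G"]),
})

# Index by coordinates alone, then by coordinates plus alleles
by_pos = ds_toy.set_index(variant=["contig", "pos"])
by_pos_alleles = ds_toy.set_index(variant=["contig", "pos", "alleles"])

# The underlying pandas indexes report whether each candidate key is unique
pos_is_unique = by_pos.indexes["variant"].is_unique
full_is_unique = by_pos_alleles.indexes["variant"].is_unique
print(pos_is_unique, full_is_unique)
```

In this toy case the coordinates alone are not unique, while adding `alleles` makes the key unique.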

This notebook will show a few examples of how indexes can be set as well as how they can be used to filter and merge data.

In [1]:
import sys
sys.path.append(".")
from lib import api
import pandas as pd
import xarray as xr
import numpy as np
%run nb/paths.py
xr.set_options(display_style='html');

In [2]:
# Path to PLINK dataset for demonstration
path = HAPMAP_PLINK_PATH_01
path

'/home/eczech/data/gwas/tutorial/1_QC_GWAS/HapMap_3_r3_1'

In [3]:
# Load a dataset to work with
ds = api.read_plink(path, chunks='auto', fam_sep=' ', bim_sep='\t')
ds

| | Array | Chunk |
|---|---|---|
| Bytes | 240.55 MB | 134.22 MB |
| Shape | (1457897, 165) | (813440, 165) |
| Count | 5 Tasks | 2 Chunks |
| Type | int8 | numpy.ndarray |

| | Array | Chunk |
|---|---|---|
| Bytes | 240.55 MB | 134.22 MB |
| Shape | (1457897, 165) | (813440, 165) |
| Count | 5 Tasks | 2 Chunks |
| Type | bool | numpy.ndarray |


### Indexing and Selection

To perform a range query on contig and bp position, the appropriate data variables must first be moved into an index before submitting slice ranges:

In [49]:
# Filter to chromosome 1 where bp < 1M
contig_range = slice(1, 1)
bp_pos_range = slice(0, 1000000)
ds.set_index(variant=('contig', 'pos')).sel(variant=(contig_range, bp_pos_range))

| | Array | Chunk |
|---|---|---|
| Bytes | 13.04 kB | 13.04 kB |
| Shape | (79, 165) | (79, 165) |
| Count | 6 Tasks | 1 Chunks |
| Type | int8 | numpy.ndarray |

| | Array | Chunk |
|---|---|---|
| Bytes | 13.04 kB | 13.04 kB |
| Shape | (79, 165) | (79, 165) |
| Count | 6 Tasks | 1 Chunks |
| Type | bool | numpy.ndarray |
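
For intuition, the same kind of range query can be reproduced on a small made-up dataset. This sketch works directly on the underlying pandas `MultiIndex` and uses `MultiIndex.get_locs` to resolve the two slices into integer positions. The variable names and values are hypothetical, and the index is assumed to be sorted. Note that label-based slices are inclusive on both ends.

```python
import xarray as xr

# Toy variants sorted by (contig, pos), with one made-up data variable
ds_toy = xr.Dataset({
    "contig": ("variant", [1, 1, 1, 2, 2]),
    "pos": ("variant", [100, 200, 300, 100, 200]),
    "call": ("variant", [0, 1, 2, 1, 0]),
})

# Move contig and pos into a MultiIndex and grab the pandas object behind it
indexed = ds_toy.set_index(variant=["contig", "pos"])
multi_index = indexed.indexes["variant"]

# Resolve "contig == 1 and 0 <= pos <= 250" into integer positions
locs = multi_index.get_locs((slice(1, 1), slice(0, 250)))

# Apply the positions to the original (unindexed) dataset
subset = ds_toy.isel(variant=locs)
print(subset.pos.values.tolist())
```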


Note that index values do not have to be unique, but virtually anything useful you might try to do downstream with a non-unique index will fail (just as in Pandas).  This example shows such a failure:

In [51]:
# Index by bp pos, which is not unique 
try:
    ds.set_index(variant='pos').sel(variant=[slice(0, 1000000)])
except Exception as e:
    print(e)

Reindexing only valid with uniquely valued Index objects


Selections for non-unique values in an index should be done with boolean masks, though this is less efficient for repeated reads:

In [52]:
# Select data for only chromosomes 6 and 7 (drop variables with `sample` dimension for brevity)
target_chromosomes = [6, 7]
ds.sel(variant=ds.contig.isin(target_chromosomes)).drop_dims('sample')
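
A boolean mask needs no index at all, so duplicate values are not a problem. The sketch below repeats the idea on a made-up dataset and passes the mask as a plain NumPy array to `isel`; the names and values are hypothetical.

```python
import xarray as xr

# Toy variants in which contig values repeat
ds_toy = xr.Dataset({
    "contig": ("variant", [1, 6, 6, 7, 8]),
    "pos": ("variant", [10, 10, 20, 10, 10]),
})

# Building the mask scans every element, which is why repeating this
# is slower than repeated lookups against a prebuilt index
mask = ds_toy.contig.isin([6, 7])

# Positional selection with a boolean NumPy array
subset = ds_toy.isel(variant=mask.values)
print(subset.contig.values.tolist())
```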

### Index Alignment (Joins)

In [118]:
df_pop = pd.concat([
    # Create population values (chosen randomly) for all existing samples
    pd.DataFrame(dict(
        sample_id=ds.sample_id.data,
        fam_id=ds.fam_id.data,
        population=np.random.choice(['EUR', 'AFR', 'EAS'], replace=True, size=ds.dims['sample'])
    )),
    # Add a few sample/family combinations that don't exist
    pd.DataFrame(dict(
        sample_id=['fake-SID1', 'fake-SID2'],
        fam_id=['fake-FID1', 'fake-FID1'],
        population=['fake-POP1', 'fake-POP2']
    ))
])
df_pop.head()

| | sample_id | fam_id | population |
|---|---|---|---|
| 0 | 1328 | NA06989 | AFR |
| 1 | 1377 | NA11891 | AFR |
| 2 | 1349 | NA11843 | AFR |
| 3 | 1330 | NA12341 | EUR |
| 4 | 1444 | NA12739 | EAS |


In [119]:
# Convert the DataFrame to a Dataset and index appropriately for alignment
# * Note that the fake sample/family combinations (absent from `ds`) appear at the end of the array previews
ds_pop = (
    xr.Dataset.from_dataframe(df_pop)
    # The coordinate and dimension name for this dataset is based on the DataFrame index name, 
    # and this can either be renamed before creating a Dataset or after by renaming both
    .rename_dims(index='sample')
    .rename_vars(index='sample')
    .set_index(sample=['fam_id', 'sample_id'])
)
ds_pop

There are several options for adding new data variables.  The simplest is to assign them, but assignment is implicitly a left join.  Note that the "new" (fake) values are simply dropped here and that the size of the `sample` dimension stays the same:

In [120]:
# Set the index on the original dataset first (it will run without this, but all population values will be NA)
(
    ds.set_index(sample=['fam_id', 'sample_id'])
    .assign(population=ds_pop.population)
    .drop_dims('variant')
)
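
The left-join behavior of assignment can be checked on a toy dataset. In this sketch (made-up labels and values), the assigned array has one label that matches the dataset and one that does not.

```python
import xarray as xr

# Base dataset indexed by three sample labels
base = xr.Dataset(
    {"x": ("sample", [1, 2, 3])},
    coords={"sample": ["s1", "s2", "s3"]},
)

# New variable: "s3" exists in the base dataset, "s9" does not
population = xr.DataArray(
    ["EUR", "AFR"], dims="sample", coords={"sample": ["s3", "s9"]}
)

# Assignment reindexes the new variable onto the base dataset's index
out = base.assign(population=population)
print(out.sample.values.tolist())
print(out.population.isnull().values.tolist())
```

The unmatched label `s9` is dropped, and samples without a match are filled with missing values.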

To preserve values on either side of aligned arrays, `merge` can be used (`join='left'` will produce the same result as above):

In [121]:
(
    # Use an outer join to show that number of samples has increased for non-intersecting index values
    xr.merge([ds.set_index(sample=['fam_id', 'sample_id']), ds_pop], join='outer')
    # xr.merge can drop attributes (depending on the xarray version and `combine_attrs`), so reassign them
    .assign_attrs(**ds.attrs)
    .drop_dims('variant')
)
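
The effect of the `join` argument is easiest to see by comparing dimension sizes. The sketch below uses two made-up datasets that share a single `sample` label.

```python
import xarray as xr

left = xr.Dataset(
    {"x": ("sample", [1, 2, 3])},
    coords={"sample": ["s1", "s2", "s3"]},
)
right = xr.Dataset(
    {"population": ("sample", ["EUR", "AFR"])},
    coords={"sample": ["s3", "s9"]},
)

# Merge the same pair of datasets under each join type
sizes = {
    how: xr.merge([left, right], join=how).sizes["sample"]
    for how in ["outer", "left", "inner"]
}
print(sizes)
```

The outer join keeps the union of labels, the left join keeps only the labels in `left`, and the inner join keeps only the shared label.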

Again, none of this will work if the index for alignment is not unique.  This variable assignment will fail, for example, when datasets are indexed only by `fam_id` and not both `(fam_id, sample_id)`:

In [124]:
# Index both datasets by `fam_id` alone, which is not unique, so the alignment fails
try:
    (
        ds.set_index(sample=['fam_id'])
        .assign(
            population=ds_pop.reset_index('sample')
            .set_index(sample='fam_id')
            .reset_coords('sample_id').population
        )
        .drop_dims('variant')
    )
except Exception as e: 
    print(e)

cannot reindex or align along dimension 'sample' because the index has duplicate values
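
One way to surface this problem earlier is to check index uniqueness before aligning. The helper below, `assert_unique_index`, is hypothetical (it is not part of Xarray or of this project's API) and is only a sketch of such a guard, shown on made-up arrays.

```python
import xarray as xr

def assert_unique_index(obj, dim):
    """Hypothetical guard: raise early if the index along `dim` has duplicates."""
    index = obj.indexes[dim]
    if not index.is_unique:
        duplicates = index[index.duplicated()].unique().tolist()
        raise ValueError(f"index along {dim!r} has duplicate values: {duplicates}")

# Indexed by family only (duplicated) versus by sample (unique)
by_family = xr.DataArray(
    [1, 2, 3], dims="sample", coords={"sample": ["fam1", "fam1", "fam2"]}
)
by_sample = xr.DataArray(
    [1, 2, 3], dims="sample", coords={"sample": ["s1", "s2", "s3"]}
)

assert_unique_index(by_sample, "sample")  # unique, so nothing is raised

try:
    assert_unique_index(by_family, "sample")
    message = None
except ValueError as e:
    message = str(e)
print(message)
```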
