# Ranges

A [bioframe](https://bioframe.readthedocs.io/en/latest/)-based implementation of genomic ranges integration for MuData.

`RangeAnnData` is an extension of the `AnnData` class.

## Prepare data

In [2]:
import numpy as np
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix
from mudata import MuData
import mudata as md
from anndata import AnnData
import pyranges as pr
import bioframe as bf

from RangeAnnData import RangeAnnData, RangeMuData

## Generation of toy dataset

- Count matrix: sparse data stored as a `scipy.sparse.csr_matrix`
- Genomic ranges: provided by `PyRanges`
- Make sure the column names follow the BED file format (`chrom`, `start`, `end`)
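As a minimal sketch of the renaming step, assuming the input table uses PyRanges-style column names (`Chromosome`, `Start`, `End`), the coordinate columns can be mapped to BED-style names in plain pandas. The helper `to_bed_columns` and the toy rows are hypothetical and made up for illustration; they are not part of `RangeAnnData`.

```python
import pandas as pd

# Hypothetical helper (not part of RangeAnnData): rename PyRanges-style
# coordinate columns to BED-style names; other columns are left untouched.
def to_bed_columns(df: pd.DataFrame) -> pd.DataFrame:
    mapping = {"Chromosome": "chrom", "Start": "start", "End": "end"}
    return df.rename(columns=mapping)

# Toy intervals, made up for illustration.
toy = pd.DataFrame(
    {
        "Chromosome": ["chrX", "chrY"],
        "Start": [100, 500],
        "End": [200, 900],
        "Strand": ["+", "-"],
    }
)
bed = to_bed_columns(toy)
print(list(bed.columns))
```

Renaming by name rather than by position avoids mislabeling columns if the input table is ever reordered.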

In [3]:
counts = csr_matrix(np.random.poisson(1, size=(100, 1000)), dtype=np.float32)
exons, gr = pr.data.exons().df, pr.data.cpg().df
exons = exons.sample(frac=1, replace=False)

exons.columns = ['chrom', 'start', 'end'] + list(exons.columns[3:])
gr.columns = ['chrom', 'start', 'end']+ list(gr.columns[3:])

## Build an AnnData object from the dataset

In [4]:
adata = RangeAnnData(counts)
adata.obs_names = [f"Cell_{i:d}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i:d}" for i in range(adata.n_vars)]

Coordinates are stored in the `varm` slot of the `RangeAnnData` object by the `set_coord` method:

In [5]:
adata.set_coord(exons)
adata.varm['coord']

| | chrom | start | end | Name | Score | Strand |
|---|---|---|---|---|---|---|
| Gene_0 | chrY | 15471646 | 15471866 | NR_047637_exon_17_0_chrY_15471647_r | 0 | - |
| Gene_1 | chrX | 67417027 | 67417106 | NM_002547_exon_13_0_chrX_67417028_r | 0 | - |
| Gene_2 | chrX | 117718697 | 117718820 | NM_144658_exon_14_0_chrX_117718698_f | 0 | + |
| Gene_3 | chrX | 119004494 | 119005791 | NM_006978_exon_0_0_chrX_119004495_r | 0 | - |
| Gene_4 | chrX | 99854504 | 99854882 | NM_022144_exon_6_0_chrX_99854505_f | 0 | + |
| ... | ... | ... | ... | ... | ... | ... |
| Gene_995 | chrX | 54577395 | 54577483 | NM_001184819_exon_9_0_chrX_54577396_f | 0 | + |
| Gene_996 | chrY | 16834996 | 16835149 | NR_028319_exon_2_0_chrY_16834997_f | 0 | + |
| Gene_997 | chrX | 31747747 | 31747865 | NM_004013_exon_27_0_chrX_31747748_r | 0 | - |
| Gene_998 | chrX | 77294333 | 77294480 | NM_000052_exon_17_0_chrX_77294334_f | 0 | + |
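
The output above suggests that rows of the coordinate table are matched to features by position and relabeled with `var_names`. A minimal pandas sketch of that alignment follows, assuming positional matching; the real `set_coord` may behave differently, and `align_coords` is a hypothetical name.

```python
import pandas as pd

# Hypothetical helper: relabel the rows of a coordinate table with feature
# names, matching by position. This is an assumption about how coordinates
# end up indexed by var_names, not the actual set_coord implementation.
def align_coords(coords: pd.DataFrame, var_names) -> pd.DataFrame:
    var_names = list(var_names)
    if len(coords) != len(var_names):
        raise ValueError("need exactly one coordinate row per feature")
    out = coords.reset_index(drop=True).copy()
    out.index = pd.Index(var_names)
    return out

# Toy table with a shuffled index, as left behind by DataFrame.sample.
coords = pd.DataFrame(
    {"chrom": ["chrX", "chrY"], "start": [10, 50], "end": [20, 90]},
    index=[7, 3],
)
aligned = align_coords(coords, ["Gene_0", "Gene_1"])
print(aligned)
```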


## Filter by genomic ranges: `slice_granges`

Original data size:

In [6]:
adata.shape

(100, 1000)

Sometimes features are defined in a coordinate system such as a linear _genome sequence_, for instance in assays that measure chromatin accessibility, transcription factor occupancy, or DNA methylation. Genes also have a defined location in the DNA, even though this is frequently ignored in transcriptomics analysis pipelines.

AnnData/MuData objects that carry such coordinates can be subsetted (or _sliced_) along the feature dimension by genomic range using the `slice_granges` method:


In [8]:
adata.slice_granges('chrX', 1, 1500000).shape

(100, 11)
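
To make the filtering step concrete, here is a minimal sketch of range-based slicing in plain pandas and NumPy. It assumes that a feature is kept when its interval overlaps the half-open query range; whether `slice_granges` uses overlap or full containment is not stated here, and `overlap_mask` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical helper: boolean mask of features whose interval overlaps the
# half-open query [start, end) on one chromosome. The overlap semantics are
# an assumption, not the documented behavior of slice_granges.
def overlap_mask(coords: pd.DataFrame, chrom: str, start: int, end: int) -> np.ndarray:
    hit = (
        (coords["chrom"] == chrom)
        & (coords["start"] < end)
        & (coords["end"] > start)
    )
    return hit.to_numpy()

# Toy coordinates and counts, made up for illustration.
coords = pd.DataFrame(
    {
        "chrom": ["chrX", "chrX", "chrY"],
        "start": [100, 2_000_000, 100],
        "end": [200, 2_000_100, 200],
    },
    index=["Gene_0", "Gene_1", "Gene_2"],
)
counts = np.arange(6).reshape(2, 3)  # 2 cells x 3 features

mask = overlap_mask(coords, "chrX", 1, 1_500_000)
sliced = counts[:, mask]  # keep only the feature columns selected by the mask
print(sliced.shape)
```

Only `Gene_0` lies on `chrX` within the query range, so the sliced matrix keeps all cells and a single feature column.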

## Intersection with a list of genomic ranges

Given a list of genomic ranges provided as a pandas DataFrame,