# Ranges

A [bioframe](https://bioframe.readthedocs.io/en/latest/)-based implementation of genomic ranges integration for MuData.

`RangeAnnData` extends the `AnnData` class.

## Prepare data

In [1]:
import numpy as np
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix
from mudata import MuData
import mudata as md
from anndata import AnnData
import pyranges as pr
import bioframe as bf

from RangeAnnData import RangeAnnData, RangeMuData


## Generation of toy dataset

- Count matrix: sparse data stored as a `scipy.sparse.csr_matrix`
- Genomic ranges: provided by `PyRanges`
- Column names must follow the BED file format (`chrom`, `start`, `end`)
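The column-naming requirement can be illustrated with a small pandas sketch. The toy values below are made up for illustration; only the PyRanges-style names (`Chromosome`, `Start`, `End`) and the BED-style names (`chrom`, `start`, `end`) come from the libraries used in this notebook. Renaming by name rather than by position is one possible approach, and it does not depend on column order.

```python
import pandas as pd

# Toy frame using PyRanges-style column names (values are illustrative only).
pyranges_style = pd.DataFrame(
    {
        "Chromosome": ["chrX", "chrX"],
        "Start": [100, 500],
        "End": [200, 650],
        "Name": ["exon_a", "exon_b"],
    }
)

# Map the three coordinate columns to BED-style names; other columns are untouched.
bed_names = {"Chromosome": "chrom", "Start": "start", "End": "end"}
bed_style = pyranges_style.rename(columns=bed_names)

print(list(bed_style.columns))
```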

In [2]:
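# Toy count matrix: 100 cells x 1000 features with Poisson-distributed counts
counts = csr_matrix(np.random.poisson(1, size=(100, 1000)), dtype=np.float32)
# Example interval tables shipped with PyRanges, converted to pandas DataFrames
exons, gr = pr.data.exons().df, pr.data.cpg().df
# Shuffle the exon rows
exons = exons.sample(frac=1, replace=False)

# Rename the first three columns to the BED-style names
exons.columns = ['chrom', 'start', 'end'] + list(exons.columns[3:])
gr.columns = ['chrom', 'start', 'end'] + list(gr.columns[3:])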

## Build AnnData object out of the dataset

In [3]:
adata = RangeAnnData(counts)
adata.obs_names = [f"Cell_{i:d}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i:d}" for i in range(adata.n_vars)]

Coordinates are stored in the `varm` slot of the `RangeAnnData` object by the `set_coord` method.

In [4]:
adata.set_coord(exons)
adata.varm['coord']

| | chrom | start | end | Name | Score | Strand |
|---|---|---|---|---|---|---|
| Gene_0 | chrX | 147733519 | 147733640 | NM_001169125_exon_1_0_chrX_147733520_f | 0 | + |
| Gene_1 | chrX | 30718954 | 30719022 | NM_001128127_exon_9_0_chrX_30718955_f | 0 | + |
| Gene_2 | chrX | 16875716 | 16875832 | NM_002893_exon_7_0_chrX_16875717_r | 0 | - |
| Gene_3 | chrX | 128674251 | 128674455 | NM_000276_exon_0_0_chrX_128674252_f | 0 | + |
| Gene_4 | chrX | 83588766 | 83588850 | NM_001177479_exon_2_0_chrX_83588767_r | 0 | - |
| ... | ... | ... | ... | ... | ... | ... |
| Gene_995 | chrX | 15838329 | 15838439 | NM_005089_exon_9_0_chrX_15838330_f | 0 | + |
| Gene_996 | chrX | 1407411 | 1407535 | NM_172249_exon_4_0_chrX_1407412_f | 0 | + |
| Gene_997 | chrX | 31893307 | 31893490 | NM_004010_exon_31_0_chrX_31893308_r | 0 | - |
| Gene_998 | chrX | 73744193 | 73744644 | NM_006517_exon_2_0_chrX_73744194_f | 0 | + |
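The output above shows the coordinate table indexed by the feature names. A minimal sketch of how such an alignment could work, assuming rows are matched to features by position, is given below. `attach_coords` is a hypothetical helper written for illustration and is not the actual `set_coord` implementation, whose internals are not shown here.

```python
import pandas as pd

# Feature names as they would appear in `var_names`.
var_names = [f"Gene_{i:d}" for i in range(3)]

# Toy coordinate table with an arbitrary (shuffled) integer index.
coords = pd.DataFrame(
    {
        "chrom": ["chrX", "chrX", "chrX"],
        "start": [100, 500, 900],
        "end": [200, 650, 1100],
    },
    index=[7, 2, 5],
)


def attach_coords(var_names, coords):
    """Hypothetical helper: relabel coordinate rows with feature names by position."""
    if len(coords) != len(var_names):
        raise ValueError("need exactly one coordinate row per feature")
    aligned = coords.copy()
    aligned.index = pd.Index(var_names)
    return aligned


aligned = attach_coords(var_names, coords)
print(aligned)
```

With plain AnnData, a frame indexed like this could then be assigned to a `varm` key, since `varm` entries must have one row per feature.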


## Filter by genomic ranges: `slice_granges`

Original data size:

In [5]:
adata.shape

(100, 1000)

Sometimes features are defined in a coordinate system such as a linear _genome sequence_, for instance in assays that measure chromatin accessibility, transcription factor occupancy, or DNA methylation. Genes also have a location in the DNA, even though this is frequently ignored in transcriptomics analysis pipelines.

AnnData/MuData objects with such coordinates can be subsetted (or _sliced_) along the feature dimension by genomic interval with the `slice_granges` method.


In [6]:
adata.slice_granges('chrX', 1, 1500000).shape

(100, 11)
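The idea behind range-based slicing can be sketched with NumPy and pandas alone. The helper `slice_by_range` below is hypothetical, and it assumes that a feature is kept when its half-open interval overlaps the query on the same chromosome. Whether `slice_granges` uses overlap or full containment is not stated here, so treat this as an illustration of the technique rather than the method's actual behavior.

```python
import numpy as np
import pandas as pd

# Toy data: 2 cells x 4 features, plus one coordinate row per feature.
X = np.arange(8).reshape(2, 4)
coord = pd.DataFrame(
    {
        "chrom": ["chrX", "chrX", "chrY", "chrX"],
        "start": [100, 900, 150, 2000],
        "end": [200, 1100, 250, 2100],
    },
    index=[f"Gene_{i:d}" for i in range(4)],
)


def slice_by_range(X, coord, chrom, start, end):
    """Hypothetical helper: keep features overlapping [start, end) on `chrom`."""
    mask = (
        (coord["chrom"] == chrom)
        & (coord["start"] < end)  # feature begins before the query ends
        & (coord["end"] > start)  # feature ends after the query begins
    ).to_numpy()
    # Subset the feature (column) axis of the matrix and the matching coordinate rows.
    return X[:, mask], coord.loc[mask]


sub_X, sub_coord = slice_by_range(X, coord, "chrX", 1, 1000)
print(sub_X.shape, list(sub_coord.index))
```

The observation axis is untouched; only columns whose coordinates pass the mask survive, which mirrors the drop from 1000 to 11 features in the cell above.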

## Intersection with a list of genomic ranges

- Input: a list of genomic ranges as a pandas DataFrame
- Output: Subsetted AnnData object with inner overlapping with 