# Tutorial: Reading SOMA Objects

In this notebook, we'll learn how to read from various SOMA objects. We assume familiarity with SOMA objects, so we recommend going through the [Tutorial: SOMA Objects](https://github.com/single-cell-data/TileDB-SOMA/blob/main/apis/python/notebooks/tutorial_soma_objects.ipynb) first.

This implementation of SOMA relies on [TileDB](https://tiledb.com/), a storage format that allows working with large files without loading them fully into memory. Files can be read either from disk or from a remote source, like an S3 bucket.

The core feature of SOMA is reading _subsets_ of the data using slices: only the required portion of the data is read from disk or the network.
SOMA uses [Apache Arrow](https://arrow.apache.org/) as an intermediate in-memory format. From there, the slices can be converted into more familiar formats, like a `scipy.sparse` matrix or a NumPy `ndarray`. Consult the [Python bindings for Apache Arrow documentation](https://arrow.apache.org/docs/python/index.html) for more information.

In this notebook, we will use the Peripheral Blood Mononuclear Cells (PBMC) dataset. We will focus on reading from its `obs` `DataFrame` and from the `X` `SparseNDArray`. This is a small dataset that fits in memory, but we'll focus on operations that read subsets of the data, which work on larger datasets as well.

## Reading a DataFrame

### Introduction

In [36]:
import tiledbsoma

In [37]:
experiment = tiledbsoma.open("data/sparse/pbmc3k")
obs = experiment.obs
obs

<DataFrame 'file:///opt/TileDB-SOMA/apis/python/notebooks/data/sparse/pbmc3k/obs' (open for 'r')>

All read operations are performed with the `.read()` method. For a `DataFrame`, we then call `.concat()` to obtain a [PyArrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html):

In [38]:
table = obs.read().concat()
table

pyarrow.Table
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: large_string
----
soma_joinid: [[0,1,2,3,4,...,2633,2634,2635,2636,2637]]
obs_id: [["AAACATACAACCAC-1","AAACATTGAGCTAC-1","AAACATTGATCAGC-1","AAACCGTGCTTCCG-1","AAACCGTGTATGCG-1",...,"TTTCGAACTCTCAT-1","TTTCTACTGAGGCA-1","TTTCTACTTCCTCG-1","TTTGCATGAGAGGC-1","TTTGCATGCCTCAC-1"]]
n_genes: [[781,1352,1131,960,522,...,1155,1227,622,454,724]]
percent_mito: [[0.030177759,0.037935957,0.008897362,0.017430846,0.012244898,...,0.021104366,0.00929422,0.021971496,0.020547945,0.008064516]]
n_counts: [[2419,4903,3147,2639,980,...,3459,3443,1684,1022,1984]]
louvain: [["CD4 T cells","B cells","CD4 T cells","CD14+ Monocytes","NK cells",...,"CD14+ Monocytes","B cells","B cells","B cells","CD4 T cells"]]
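Because `.read()` returns an iterator of Arrow tables that `.concat()` merges into one, a larger dataset could instead be processed one chunk at a time. The following is a minimal sketch of that pattern using only the standard library. Plain lists stand in for the chunks, and `count_labels` and `example_chunks` are hypothetical names, not part of SOMA.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_labels(chunks: Iterable[Sequence[str]]) -> Counter:
    """Tally label frequencies across chunks without joining them first."""
    totals: Counter = Counter()
    for chunk in chunks:
        # Each chunk is tallied on its own and can be discarded afterwards.
        totals.update(chunk)
    return totals


# Hypothetical stand-in for two chunks of the `louvain` column.
example_chunks = [
    ["CD4 T cells", "B cells", "CD4 T cells"],
    ["NK cells", "B cells"],
]
label_counts = count_labels(example_chunks)
print(label_counts.most_common())
```

With SOMA, each chunk might be obtained as `batch["louvain"].to_pylist()` for every `batch` yielded by `obs.read(column_names=["louvain"])`.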

From here, we can directly use any of the PyArrow Table methods, for instance:

In [39]:
table.sort_by([("n_genes", "descending")])

pyarrow.Table
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: large_string
----
soma_joinid: [[270,1163,1891,926,277,...,2186,1522,662,1288,1840]]
obs_id: [["ACGAACTGGCTATG-1","CGATACGACAGGAG-1","GGGCCAACCTTGGA-1","CAGGTTGAGGATCT-1","ACGAGGGACAGGAG-1",...,"TAGTCTTGGCTGTA-1","GACGCTCTCTCTCG-1","ATCTCAACCTCGAA-1","CTAATAGAGCTATG-1","GGCATATGGGGAGT-1"]]
n_genes: [[2455,2033,2020,2000,1997,...,270,267,246,239,212]]
percent_mito: [[0.015774649,0.022166021,0.010576352,0.026962927,0.014631685,...,0,0.032258064,0,0.0016666667,0.012173913]]
n_counts: [[8875,6722,8415,8011,7928,...,652,682,609,600,575]]
louvain: [["Megakaryocytes","CD4 T cells","Dendritic cells","B cells","Dendritic cells",...,"CD4 T cells","Megakaryocytes","CD4 T cells","CD4 T cells","Megakaryocytes"]]

Alternatively, we can convert the `DataFrame` to a different format, like a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html):

In [40]:
table.to_pandas()

| | soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain |
|---|---|---|---|---|---|---|
| 0 | 0 | AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells |
| 1 | 1 | AAACATTGAGCTAC-1 | 1352 | 0.037936 | 4903.0 | B cells |
| 2 | 2 | AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
| 3 | 3 | AAACCGTGCTTCCG-1 | 960 | 0.017431 | 2639.0 | CD14+ Monocytes |
| 4 | 4 | AAACCGTGTATGCG-1 | 522 | 0.012245 | 980.0 | NK cells |
| ... | ... | ... | ... | ... | ... | ... |
| 2633 | 2633 | TTTCGAACTCTCAT-1 | 1155 | 0.021104 | 3459.0 | CD14+ Monocytes |
| 2634 | 2634 | TTTCTACTGAGGCA-1 | 1227 | 0.009294 | 3443.0 | B cells |
| 2635 | 2635 | TTTCTACTTCCTCG-1 | 622 | 0.021971 | 1684.0 | B cells |
| 2636 | 2636 | TTTGCATGAGAGGC-1 | 454 | 0.020548 | 1022.0 | B cells |


### Reading slices of data

As previously mentioned, the core feature of SOMA is reading slices of the data without loading the whole dataset into memory. To do that, the `.read()` method supports a `coords` parameter that allows data slicing.

Before we do that, let's take a look at the schema of the `obs` dataframe:

In [41]:
obs.schema

soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: large_string

With a SOMA DataFrame, you can only slice across an indexed column, so let's look at the indexed columns:

In [42]:
obs.index_column_names

('soma_joinid',)

In this case, our index consists of just `soma_joinid`, an integer column that can be used to join with other SOMA objects in the same experiment.
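To illustrate what joining on `soma_joinid` means, here is a minimal standard-library sketch. It assumes that the row coordinate of a sparse matrix such as `X` holds the `soma_joinid` of the matching `obs` row. All values are hypothetical toy data, not taken from the PBMC dataset, and the variable names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy values; not taken from the PBMC dataset.
obs_rows = [
    {"soma_joinid": 0, "louvain": "CD4 T cells"},
    {"soma_joinid": 1, "louvain": "B cells"},
    {"soma_joinid": 2, "louvain": "CD4 T cells"},
]
# (row id, column id, value) triples, as a sparse matrix would store them.
x_entries = [(0, 10, 3.0), (0, 11, 1.0), (1, 10, 2.0), (2, 12, 5.0)]

# Look up each cell's label by its soma_joinid.
label_by_joinid = {row["soma_joinid"]: row["louvain"] for row in obs_rows}

# Sum the matrix values per label by joining on the row id.
total_by_label = defaultdict(float)
for row_id, _col_id, value in x_entries:
    total_by_label[label_by_joinid[row_id]] += value

print(dict(total_by_label))
```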


Let's look at a few ways to slice the dataframe.

#### Select a single row

In [43]:
obs.read([[0]]).concat().to_pandas()

| | soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain |
|---|---|---|---|---|---|---|
| 0 | 0 | AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells |


#### Select multiple, non-contiguous rows

In [44]:
obs.read([[2, 5]]).concat().to_pandas()

| | soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain |
|---|---|---|---|---|---|---|
| 0 | 2 | AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
| 1 | 5 | AAACGCACTGGTAC-1 | 782 | 0.016644 | 2163.0 | CD8 T cells |


#### Select a slice of rows

In [45]:
obs.read([slice(0, 5)]).concat().to_pandas()

| | soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain |
|---|---|---|---|---|---|---|
| 0 | 0 | AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells |
| 1 | 1 | AAACATTGAGCTAC-1 | 1352 | 0.037936 | 4903.0 | B cells |
| 2 | 2 | AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
| 3 | 3 | AAACCGTGCTTCCG-1 | 960 | 0.017431 | 2639.0 | CD14+ Monocytes |
| 4 | 4 | AAACCGTGTATGCG-1 | 522 | 0.012245 | 980.0 | NK cells |
| 5 | 5 | AAACGCACTGGTAC-1 | 782 | 0.016644 | 2163.0 | CD8 T cells |
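Note that the result has six rows, not five: SOMA slices include both endpoints, unlike Python's half-open slices. The sketch below shows one way to translate between the two conventions. `closed_slice` and `rows_selected` are hypothetical helpers written for this illustration, not part of `tiledbsoma`.

```python
from typing import List


def closed_slice(start: int, stop_exclusive: int) -> slice:
    """Turn the half-open range [start, stop_exclusive) into a slice whose
    upper bound is meant to be read inclusively, as SOMA does."""
    if stop_exclusive <= start:
        raise ValueError("the range must contain at least one row")
    return slice(start, stop_exclusive - 1)


def rows_selected(s: slice) -> List[int]:
    """List the row ids covered by a doubly inclusive slice (illustration only)."""
    return list(range(s.start, s.stop + 1))


first_five = closed_slice(0, 5)
print(first_five)                  # slice(0, 4, None)
print(rows_selected(first_five))   # [0, 1, 2, 3, 4]
print(rows_selected(slice(0, 5)))  # [0, 1, 2, 3, 4, 5]
```

Passing `closed_slice(0, 5)` in place of `slice(0, 5)` in the read call above would then request rows 0 through 4 only.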


#### Select a subset of columns only

In [46]:
obs.read([slice(0, 5)], column_names=["obs_id", "louvain"]).concat().to_pandas()

| | obs_id | louvain |
|---|---|---|
| 0 | AAACATACAACCAC-1 | CD4 T cells |
| 1 | AAACATTGAGCTAC-1 | B cells |
| 2 | AAACATTGATCAGC-1 | CD4 T cells |
| 3 | AAACCGTGCTTCCG-1 | CD14+ Monocytes |
| 4 | AAACCGTGTATGCG-1 | NK cells |
| 5 | AAACGCACTGGTAC-1 | CD8 T cells |


### Filter data using complex queries

SOMA also allows filtering data with more complex queries. For a more detailed reference, take a look at the [query condition](https://github.com/single-cell-data/TileDB-SOMA/blob/main/apis/python/src/tiledbsoma/_query_condition.py) source code.

Here are a few examples:

#### Filter all cells with a Louvain categorization of "B cells"

In [47]:
obs.read(value_filter="louvain == 'B cells'").concat().to_pandas()

| | soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain |
|---|---|---|---|---|---|---|
| 0 | 1 | AAACATTGAGCTAC-1 | 1352 | 0.037936 | 4903.0 | B cells |
| 1 | 10 | AAACTTGAAAAACG-1 | 1116 | 0.026316 | 3914.0 | B cells |
| 2 | 18 | AAAGGCCTGTCTAG-1 | 1446 | 0.015283 | 4973.0 | B cells |
| 3 | 19 | AAAGTTTGATCACG-1 | 446 | 0.034700 | 1268.0 | B cells |
| 4 | 20 | AAAGTTTGGGGTGA-1 | 1020 | 0.025907 | 3281.0 | B cells |
| ... | ... | ... | ... | ... | ... | ... |
| 337 | 2628 | TTTCAGTGTCACGA-1 | 700 | 0.034314 | 1632.0 | B cells |
| 338 | 2630 | TTTCAGTGTGCAGT-1 | 637 | 0.018925 | 1321.0 | B cells |
| 339 | 2634 | TTTCTACTGAGGCA-1 | 1227 | 0.009294 | 3443.0 | B cells |
| 340 | 2635 | TTTCTACTTCCTCG-1 | 622 | 0.021971 | 1684.0 | B cells |


#### Filter all cells with a Louvain categorization of either "CD4 T cells" or "CD8 T cells"

In [48]:
obs.read(value_filter="(louvain == 'CD4 T cells') or (louvain == 'CD8 T cells')").concat().to_pandas()

| | soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain |
|---|---|---|---|---|---|---|
| 0 | 0 | AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells |
| 1 | 2 | AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
| 2 | 5 | AAACGCACTGGTAC-1 | 782 | 0.016644 | 2163.0 | CD8 T cells |
| 3 | 6 | AAACGCTGACCAGT-1 | 783 | 0.038161 | 2175.0 | CD8 T cells |
| 4 | 7 | AAACGCTGGTTCTT-1 | 790 | 0.030973 | 2260.0 | CD8 T cells |
| ... | ... | ... | ... | ... | ... | ... |
| 1455 | 2621 | TTTAGCTGATACCG-1 | 887 | 0.022876 | 2754.0 | CD4 T cells |
| 1456 | 2626 | TTTCACGAGGTTCA-1 | 721 | 0.013261 | 2036.0 | CD4 T cells |
| 1457 | 2627 | TTTCAGTGGAAGGC-1 | 692 | 0.015169 | 1780.0 | CD8 T cells |
| 1458 | 2631 | TTTCCAGAGGTGAG-1 | 873 | 0.006859 | 2187.0 | CD4 T cells |
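When the list of categories is longer or comes from a variable, the `or` chain used above can be assembled programmatically. The following is a minimal sketch. `any_of` is a hypothetical helper, not part of `tiledbsoma`, and it deliberately rejects values containing quotes because the filter's escaping rules are not covered here.

```python
from typing import Iterable


def any_of(column: str, values: Iterable[str]) -> str:
    """Build "(col == 'a') or (col == 'b') ..." for a list of string values."""
    clauses = []
    for value in values:
        if "'" in value:
            # Assumption: quoting rules are out of scope, so refuse such values.
            raise ValueError(f"unsupported quote in value: {value!r}")
        clauses.append(f"({column} == '{value}')")
    if not clauses:
        raise ValueError("at least one value is required")
    return " or ".join(clauses)


t_cell_filter = any_of("louvain", ["CD4 T cells", "CD8 T cells"])
print(t_cell_filter)
```

The resulting string is identical to the one written by hand above, so it could be passed as `value_filter=t_cell_filter`.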


#### Filter all cells with a Louvain categorization of "CD4 T cells" and more than 1500 genes

In [49]:
obs.read(value_filter="(louvain == 'CD4 T cells') and (n_genes > 1500)").concat().to_pandas()

| | soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain |
|---|---|---|---|---|---|---|
| 0 | 26 | AAATCAACCCTATT-1 | 1545 | 0.024313 | 5676.0 | CD4 T cells |
| 1 | 357 | ACTCTCCTGCATAC-1 | 1750 | 0.017436 | 5850.0 | CD4 T cells |
| 2 | 473 | AGCTGCCTTTCATC-1 | 1703 | 0.029547 | 5212.0 | CD4 T cells |
| 3 | 945 | CATACTTGGGTTAC-1 | 1938 | 0.02358 | 7167.0 | CD4 T cells |
| 4 | 1163 | CGATACGACAGGAG-1 | 2033 | 0.022166 | 6722.0 | CD4 T cells |
| 5 | 1320 | CTATACTGTTCGTT-1 | 1543 | 0.012395 | 4760.0 | CD4 T cells |
| 6 | 1548 | GAGCATACTTTGCT-1 | 1753 | 0.016739 | 6691.0 | CD4 T cells |
| 7 | 1993 | GTGATGACAAGTGA-1 | 1819 | 0.021172 | 6329.0 | CD4 T cells |
| 8 | 2313 | TCGGACCTGTACAC-1 | 1567 | 0.014288 | 5599.0 | CD4 T cells |
| 9 | 2365 | TGAGACACAAGGTA-1 | 1549 | 0.013242 | 5135.0 | CD4 T cells |
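Thresholds such as the gene count are often parameters rather than constants. As a rough sketch, conditions can be combined with `and` from Python values. `all_of` and `min_genes` are hypothetical names introduced for illustration and are not part of `tiledbsoma`.

```python
def all_of(*conditions: str) -> str:
    """Join already-formed conditions with `and`, parenthesizing each one."""
    if not conditions:
        raise ValueError("at least one condition is required")
    return " and ".join(f"({condition})" for condition in conditions)


min_genes = 1500  # hypothetical parameter; matches the example above
cd4_high_gene_filter = all_of(
    "louvain == 'CD4 T cells'",
    f"n_genes > {min_genes}",
)
print(cd4_high_gene_filter)
```

This reproduces the filter string used above, which keeps the query text in one place when the threshold changes.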
