# Feature Selection Tutorial

In this Jupyter notebook, we'll walk through the information-theoretic feature selection algorithms in PicturedRocks. 

In [1]:
import numpy as np
import scanpy.api as sc
import picturedrocks as pr

In [2]:
adata = sc.datasets.paul15()

... storing 'paul15_clusters' as categorical


In [3]:
adata

AnnData object with n_obs × n_vars = 2730 × 3451 
    obs: 'paul15_clusters'
    uns: 'iroot'

The `process_clusts` function copies the cluster column and precomputes various indices. If you have multiple columns that could serve as target labels (e.g., different treatments, clusters from different clustering algorithms or parameters, or demographics), it sets and processes the given column as the one we're currently examining.

This is necessary for supervised analysis and visualization tools in PicturedRocks that use cluster labels.

In [4]:
pr.read.process_clusts(adata, "paul15_clusters")

AnnData object with n_obs × n_vars = 2730 × 3451 
    obs: 'paul15_clusters', 'clust', 'y'
    uns: 'iroot', 'num_clusts', 'clusterindices'

Next, we normalize per cell and log-transform the data.

In [5]:
sc.pp.normalize_per_cell(adata)

In [6]:
sc.pp.log1p(adata)
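
To make the two preprocessing steps concrete, here is a minimal NumPy sketch on a toy count matrix. It assumes that per-cell normalization rescales every cell to the median total count (our understanding of the default behaviour); the helper name `normalize_per_cell_sketch` is hypothetical and is not part of scanpy or PicturedRocks.

```python
import numpy as np

def normalize_per_cell_sketch(counts):
    """Scale each row (cell) so its total equals the median of the original totals."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)
    target = np.median(totals)  # assumption: the target is the median total count
    # Guard against empty cells to avoid division by zero.
    safe_totals = np.where(totals > 0, totals, 1.0)
    return counts * (target / safe_totals)[:, None]

# Toy data: 3 cells x 3 genes with very different sequencing depths.
toy_counts = np.array([[1, 3, 0],
                       [2, 2, 4],
                       [0, 10, 10]])

normalized = normalize_per_cell_sketch(toy_counts)
logged = np.log1p(normalized)  # log(1 + x), so zeros stay at zero
```

After normalization every toy cell has the same total count, so differences between cells reflect relative expression rather than depth.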

The `makeinfoset` function creates a `SparseInformationSet` object with a discretized version of the data matrix. It is useful to restrict each gene to a small number of discrete states so that entropy is a meaningful measurement. By default, `makeinfoset` performs an adaptive transform that we call a recursive quantile transform. This is implemented in `pr.markers.mutualinformation.infoset.quantile_discretize`. If you have a different discretization transformation, you can pass a transformed matrix directly to `SparseInformationSet`.

In [7]:
infoset = pr.markers.makeinfoset(adata, True)
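
To see why discretization matters for entropy, the sketch below bins a continuous toy vector into three states and computes its plug-in Shannon entropy. Note that this is plain equal-frequency binning, not the recursive quantile transform used by PicturedRocks; the helper names `quantile_bins` and `entropy_bits` are hypothetical.

```python
import numpy as np

def quantile_bins(values, n_bins=3):
    """Assign each value to one of n_bins bins with roughly equal occupancy."""
    values = np.asarray(values, dtype=float)
    # Interior cut points, e.g. the 1/3 and 2/3 quantiles for 3 bins.
    cuts = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(cuts, values, side="right")

def entropy_bits(states):
    """Plug-in Shannon entropy (in bits) of a vector of discrete states."""
    _, counts = np.unique(states, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
expression = rng.exponential(size=300)  # stand-in for one gene's expression values
states = quantile_bins(expression, n_bins=3)
gene_entropy = entropy_bits(states)
```

With three states, the entropy is bounded above by log2(3) bits, which keeps scores comparable across genes.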

Because this dataset only has 3451 features, it is computationally easy to do feature selection without restricting the number of features. If we wanted to, we could do either supervised or unsupervised univariate feature selection (i.e., without considering any interactions between features).

In [8]:
# supervised
mim = pr.markers.mutualinformation.iterative.MIM(infoset)
most_relevant_genes = mim.autoselect(1000)

In [9]:
# unsupervised
ue = pr.markers.mutualinformation.iterative.UniEntropy(infoset)
most_variable_genes = ue.autoselect(1000)
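
The difference between the two univariate rankings can be illustrated on a toy discretized matrix. This sketch assumes the supervised score is the mutual information between a feature and the labels, and the unsupervised score is the feature's own entropy (as the class names `MIM` and `UniEntropy` suggest); it uses only NumPy and hypothetical helper names, not the PicturedRocks implementation.

```python
import numpy as np

def entropy_bits(*columns):
    """Plug-in joint Shannon entropy (bits) of one or more discrete vectors."""
    joint = np.stack(columns, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """I(x; y) = H(x) + H(y) - H(x, y)."""
    return entropy_bits(x) + entropy_bits(y) - entropy_bits(x, y)

# Toy data: 8 cells x 3 discretized features, plus two balanced clusters.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.array([
    [0, 0, 0],
    [0, 1, 0],
    [0, 2, 0],
    [0, 3, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 2, 0],
    [1, 3, 1],
])

n_features = X.shape[1]
relevance = np.array([mutual_information(X[:, j], labels) for j in range(n_features)])
variability = np.array([entropy_bits(X[:, j]) for j in range(n_features)])

supervised_rank = np.argsort(relevance)[::-1]      # most label-informative first
unsupervised_rank = np.argsort(variability)[::-1]  # highest entropy first
```

Feature 1 has the highest entropy but is independent of the labels, so it ranks first in the unsupervised ordering and last in the supervised one.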

At this stage, if we wanted to, we could slice our `adata` object as `adata[:,most_relevant_genes]` or `adata[:,most_variable_genes]` and create a new `InformationSet` object for this sliced object. We don't need to do that here.

## Supervised Feature Selection

Let's jump straight into supervised feature selection. Here we will use the `CIFE` objective.

In [10]:
cife = pr.markers.mutualinformation.iterative.CIFE(infoset)

In [11]:
cife.score

array([0.15708602, 0.14179713, 0.18207442, ..., 0.25585655, 0.25485263,
       0.05838606])

In [12]:
top_genes = np.argsort(cife.score)[::-1]
print(adata.var_names[top_genes[:10]])

Index(['Prtn3', 'Mpo', 'Ctsg', 'Elane', 'Car2', 'Car1', 'H2afy', 'Calr',
       'Blvrb', 'Fam132a'],
      dtype='object')


Let's select 'Mpo'.

In [13]:
ind = adata.var_names.get_loc('Mpo')

In [14]:
cife.add(ind)

Now the top genes are:

In [15]:
top_genes = np.argsort(cife.score)[::-1]
print(adata.var_names[top_genes[:10]])

Index(['Car1', 'Apoe', 'H2afy', 'Fam132a', 'Car2', 'Mt1', 'Blvrb', 'Srgn',
       'Mt2', 'Prtn3'],
      dtype='object')


Observe that the order has changed based on redundancy (or lack thereof) with 'Mpo'. Let's add 'Car1'.

In [16]:
ind = adata.var_names.get_loc('Car1')
cife.add(ind)

In [17]:
top_genes = np.argsort(cife.score)[::-1]
print(adata.var_names[top_genes[:10]])

Index(['Apoe', 'Ptprcap', 'Gpr56', 'Myb', 'Mcm5', 'Uqcrq', 'Lyar', 'Cox5a',
       'S100a10', 'Snrpd1'],
      dtype='object')
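
The reordering after each `add` call is what one would expect from a redundancy-aware objective. As a rough illustration, the sketch below uses the CIFE objective as it is commonly stated in the feature selection literature, J(x) = I(x; y) - Σ_{s ∈ S} [I(x; s) - I(x; s | y)], with plug-in estimates on toy data. We are assuming this form here; the PicturedRocks implementation may differ in its details, and all helper names are hypothetical.

```python
import numpy as np

def H(*cols):
    """Plug-in joint Shannon entropy (bits) of one or more discrete vectors."""
    joint = np.stack(cols, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mi(x, y):
    """I(x; y) = H(x) + H(y) - H(x, y)."""
    return H(x) + H(y) - H(x, y)

def cmi(x, s, y):
    """I(x; s | y) = H(x, y) + H(s, y) - H(x, s, y) - H(y)."""
    return H(x, y) + H(s, y) - H(x, s, y) - H(y)

def cife_scores(X, y, selected):
    """Relevance to y minus a redundancy penalty for each selected feature."""
    scores = []
    for j in range(X.shape[1]):
        x = X[:, j]
        penalty = sum(mi(x, X[:, s]) - cmi(x, X[:, s], y) for s in selected)
        scores.append(mi(x, y) - penalty)
    return np.array(scores)

# Toy data: 16 cells in 4 balanced clusters. Features 0 and 1 are identical
# (the "high bit" of the cluster id); feature 2 is the "low bit".
y = np.tile(np.arange(4), 4)
high_bit = y // 2
low_bit = y % 2
X = np.column_stack([high_bit, high_bit.copy(), low_bit])

before = cife_scores(X, y, selected=[])   # all three features look equally good
after = cife_scores(X, y, selected=[0])   # the duplicate of feature 0 is penalized
```

Before anything is selected, all three toy features score 1 bit. Once feature 0 is selected, its duplicate drops to 0 while the complementary feature 2 keeps its full score.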


If we want to select the top gene repeatedly, we can use `autoselect`.

In [18]:
cife.autoselect(5)
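
Conceptually, repeatedly selecting the top gene is a greedy loop: score, pick the best unselected feature, update, repeat. The sketch below shows that loop with a deliberately simple toy score (relevance minus pairwise redundancy with already selected features). The toy numbers and the names `greedy_select` and `toy_score` are hypothetical, and we are assuming, not asserting, that `autoselect` behaves this way.

```python
import numpy as np

def greedy_select(score_fn, k):
    """Pick k features, each time taking the best-scoring unselected one."""
    selected = []
    for _ in range(k):
        scores = np.asarray(score_fn(selected), dtype=float).copy()
        if selected:
            scores[selected] = -np.inf  # never pick the same feature twice
        selected.append(int(np.argmax(scores)))
    return selected

# Toy inputs: features 0 and 1 are individually strong but redundant with each other.
relevance = np.array([0.9, 0.8, 0.5, 0.05])
redundancy = np.zeros((4, 4))
redundancy[0, 1] = redundancy[1, 0] = 0.7

def toy_score(selected):
    """Relevance minus summed redundancy with the features chosen so far."""
    penalty = sum(redundancy[:, s] for s in selected)
    return relevance - penalty

chosen = greedy_select(toy_score, k=3)
```

In this toy run, feature 2 is picked before feature 1 even though feature 1 is more relevant on its own, because feature 1 is largely redundant with the first pick.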

To look at the markers we've selected, we can examine `cife.S`.

In [19]:
cife.S

[1913, 552, 305, 2932, 769, 3002, 2025]

In [20]:
adata.var_names[cife.S]

Index(['Mpo', 'Car1', 'Apoe', 'Srm', 'Cox6a1', 'Taldo1', 'Ncl'], dtype='object')

This process can also be done manually with a user interface.

In [21]:
im = pr.markers.interactive.InteractiveMarkerSelection(adata, cife, dim_red="umap", show_genes=False)

Running umap on cells...



invalid value encountered in sqrt



In [22]:
im.show()

Output()

Note that because we passed the same `cife` object, any genes added or removed in the interface will affect the `cife` object.

In [23]:
adata.var_names[cife.S]

Index(['Mpo', 'Car1', 'Apoe', 'Srm', 'Cox6a1', 'Taldo1', 'Ncl'], dtype='object')

## Unsupervised Feature Selection

This works very similarly. In the example below, we'll autoselect 5 genes and then run the interface. Note that although the previous section would not work without cluster labels, the following code will.

In [24]:
cife_unsup = pr.markers.mutualinformation.iterative.CIFEUnsup(infoset)

In [25]:
cife_unsup.autoselect(5)

If you ran the example above, this will load faster because the dimensionality-reduction coordinates for genes and cells have already been computed. You can also customize which plots are displayed with keyword arguments (e.g., `InteractiveMarkerSelection(..., show_genes=False)`). Future versions may allow arbitrary plots.

In [26]:
im_unsup = pr.markers.interactive.InteractiveMarkerSelection(adata, cife_unsup, show_genes=False, show_cells=False, dim_red="umap")

In [27]:
im_unsup.show()

Output()