# Setup
This IPython notebook will walk through the steps of characterizing iModulons through the semi-automated tools in PyModulon. You will need:

* M and A matrices
* Expression data (e.g. `log_tpm_norm.csv`)
* Gene table and KEGG/GO annotations (Generated in `gene_annotation.ipynb`)
* Sample table, with a column for `project` and `condition`
* TRN file

Optional:
* iModulon table (if you already have some characterized iModulons)

In [2]:
from pymodulon.core import IcaData
from pymodulon.plotting import *
import os
from os import path
import pandas as pd
import re
from Bio.KEGG import REST
from tqdm.notebook import tqdm

In [2]:
# Enter the location of your data here
data_dir = path.join('..','data')

# External annotations (GO, KEGG, SubtiWiki) are read from the same folder in this example
external_data = path.join('..','data')

## Check your sample table (i.e. metadata file)
Your metadata file will probably have a lot of columns, most of which you may not care about. Feel free to save a secondary copy of your metadata file with only columns that seem relevant to you. The two most important columns are:
1. `project`
2. `condition`

Make sure that these columns exist in your metadata file
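
A quick way to confirm this is to compare the table's columns against the required names. The sketch below runs on a small stand-in table; `REQUIRED_COLUMNS`, `missing_columns`, and the toy values are illustrative names, not part of PyModulon.

```python
import pandas as pd

# Columns that the rest of this notebook relies on
REQUIRED_COLUMNS = ['project', 'condition']

def missing_columns(table, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the table."""
    return [col for col in required if col not in table.columns]

# Toy stand-in for a metadata file (hypothetical values)
toy_metadata = pd.DataFrame(
    {'project': ['sigD', 'sigD'], 'strain': ['WT', 'DEL-sigD']},
    index=['sample_1', 'sample_2'],
)

print(missing_columns(toy_metadata))
```

An empty list means both columns are present.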

In [3]:
# index_col and sep are arguments of read_csv, not of path.join
df_metadata = pd.read_csv(path.join(data_dir,'metadata.tsv'),index_col=0,sep='\t')
df_metadata[['project','condition']].head()

Experiment,project,condition
ERX1862757,sigD,WT
ERX1862758,sigD,DEL-sigD
ERX1862759,sigD,pVWEx1-sigD
ERX1862760,sigD,pVWEx1-sigD-IPTG
ERX2442263,altering-oxygen,JL-3h-2-Stuttgart


In [4]:
print(df_metadata.project.notnull().all())
print(df_metadata.condition.notnull().all())

False
False
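
A `False` here means that at least one sample lacks a value in that column. One way to locate those samples, sketched on toy data with made-up sample names, is to filter for rows where either column is null:

```python
import pandas as pd

# Toy metadata with two incomplete rows (hypothetical values)
toy_metadata = pd.DataFrame(
    {'project': ['sigD', None, 'altering-oxygen'],
     'condition': ['WT', 'DEL-sigD', None]},
    index=['sample_1', 'sample_2', 'sample_3'],
)

# Keep only the rows where project or condition is missing
incomplete = toy_metadata[toy_metadata[['project', 'condition']].isnull().any(axis=1)]
print(incomplete.index.tolist())
```

Applied to `df_metadata`, the same filter lists the samples that still need a `project` or `condition` entry.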


## Check your TRN

Each row of the TRN file represents a regulatory interaction.  
**Your TRN file must have the following columns:**
1. `regulator` - Name of regulator (`/` or `+` characters will be converted to `;`)
1. `gene_id` - Locus tag of gene being regulated

The following columns are optional, but are helpful to have:
1. `regulator_id` - Locus tag of regulator
1. `gene_name` - Name of gene (can automatically update this using `name2num`)
1. `direction` - Direction of regulation ('+' for activation, '-' for repression, '?' or NaN for unknown)
1. `evidence` - Evidence of regulation (e.g. ChIP-exo, qRT-PCR, SELEX, Motif search)
1. `PMID` - Reference for regulation

You may add any other columns that could help you. TRNs may be saved as either CSV or TSV files. See below for an example:

In [7]:
df_trn = pd.read_csv(os.path.join('..','data','TRN.csv'))
df_trn.head()

Unnamed: 0,regulator,gene_id
0,WA5_RS00045,WA5_RS06090
1,WA5_RS00045,WA5_RS06085
2,WA5_RS00045,WA5_RS06080
3,WA5_RS00045,WA5_RS06075
4,WA5_RS00045,WA5_RS05065


The `regulator` and `gene_id` must be filled in for each row

In [8]:
print(df_trn.regulator.notnull().all())
print(df_trn.gene_id.notnull().all())

True
True
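
The TRN requirements listed above can be illustrated on a toy table. The sketch below mimics the separator conversion described for the `regulator` column and flags rows with a missing required value. The regulator and gene names are made up, and the snippet only illustrates the effect; it is not PyModulon's own loading code.

```python
import pandas as pd

# Toy TRN with mixed separators and one missing gene_id (hypothetical values)
toy_trn = pd.DataFrame({
    'regulator': ['regA/regB', 'regC+regD', 'regE'],
    'gene_id': ['gene_1', 'gene_2', None],
})

# Replace '/' and '+' separators with ';' using a regex character class
toy_trn['regulator'] = toy_trn['regulator'].str.replace(r'[/+]', ';', regex=True)

# Rows that violate the "must be filled in" requirement
incomplete_rows = toy_trn[toy_trn[['regulator', 'gene_id']].isnull().any(axis=1)]

print(toy_trn['regulator'].tolist())
print(incomplete_rows.index.tolist())
```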


## Load the data
You're now ready to load your IcaData object!

In [134]:
df_mapping = pd.read_excel(path.join(data_dir,'MM_2ldfeeem.emapper.annotations.xlsx'))

# Map the identifier after the first '.' in each seed ortholog to the query gene
# (named ortholog_to_query to avoid shadowing the built-in dict)
ortholog_to_query = {}
for row in df_mapping.index:
    ortholog_id = df_mapping.loc[row,'seed_ortholog'].split(".")[1]
    ortholog_to_query[ortholog_id] = df_mapping.loc[row,'query']

# Chained indexing (df.loc[row][col] = ...) assigns to a temporary copy, so use map instead
df_trn['regulator_locus_tag'] = df_trn['regulator'].map(ortholog_to_query)
df_trn['gene_id_locus_tag'] = df_trn['gene_id'].map(ortholog_to_query)
df_trn.to_excel(r"E:\test\trn_test.xlsx")

In [74]:
# Convert the Excel TRN to CSV; index_col and encoding belong to the pandas calls, not to path.join
data = pd.read_excel(path.join(data_dir,'TRN.xlsx'),index_col=0)
data.to_csv(path.join(data_dir,'TRN.csv'),encoding='utf-8')

In [122]:
ica_data = IcaData(M = path.join(data_dir,'M.csv'),
                   A = path.join(data_dir,'A.csv'),
                   X = path.join(data_dir,'log_tpm_test.csv'),
                   gene_table = path.join(data_dir,'gene_info.csv'),
                   sample_table = path.join(data_dir,'metadata_test.tsv'),
                   trn = path.join(data_dir,'trn_atlas.csv'),
                   optimize_cutoff=True)



If you don't have a TRN (or have a very minimal TRN), use `threshold_method = 'kmeans'`

In [None]:
ica_data = IcaData(M = path.join(data_dir,'M.csv'),
                   A = path.join(data_dir,'A.csv'),
                   X = path.join(data_dir,'log_tpm_norm.csv'),
                   gene_table = path.join(data_dir,'gene_info.csv'),
                   sample_table = path.join(data_dir,'metadata.tsv'),
                   trn = path.join(data_dir,'trn_atlas.csv'),  # omit this line if you have no TRN
                   threshold_method = 'kmeans')

# Regulatory iModulons
Use `compute_trn_enrichment` to automatically check for Regulatory iModulons. The more complete your TRN, the more regulatory iModulons you'll find.

In [95]:
ica_data.compute_trn_enrichment()

Unnamed: 0,imodulon,regulator,pvalue,qvalue,precision,recall,f1score,TP,regulon_size,imodulon_size,n_regs
0,2,WA5_RS04335,6.285513e-22,5.656962e-21,0.483871,0.384615,0.428571,15.0,39.0,31.0,1.0
1,2,WA5_RS00460,1.739544e-09,7.827947e-09,0.16129,0.714286,0.263158,5.0,7.0,31.0,1.0
2,9,WA5_RS14715,4.494476e-07,8.539504e-06,0.16,0.177778,0.168421,8.0,45.0,50.0,1.0
3,13,WA5_RS09565,1.036594e-13,1.036594e-12,0.3,0.211268,0.247934,15.0,71.0,50.0,1.0
4,15,WA5_RS11920,1.462165e-09,1.023516e-08,0.363636,0.666667,0.470588,4.0,6.0,11.0,1.0
5,15,WA5_RS11905,4.771253e-08,1.669939e-07,0.363636,0.333333,0.347826,4.0,12.0,11.0,1.0
6,15,WA5_RS11985,3.043188e-06,7.100772e-06,0.272727,0.333333,0.3,3.0,9.0,11.0,1.0
7,16,WA5_RS00825,6.207284e-18,6.207284000000001e-17,0.384615,0.588235,0.465116,10.0,17.0,26.0,1.0
8,19,WA5_RS13100,2.153334e-07,1.507334e-06,0.157895,1.0,0.272727,3.0,3.0,19.0,1.0
9,26,WA5_RS03145,1.580327e-10,5.057046e-09,0.043956,1.0,0.084211,8.0,8.0,182.0,1.0
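
The `precision`, `recall`, and `f1score` columns can be reproduced from `TP`, `regulon_size`, and `imodulon_size`. The sketch below assumes the standard definitions and uses the first row of the table above; the results agree with the reported values.

```python
# Values from the first row of the table above (iModulon 2, WA5_RS04335)
TP = 15.0             # genes shared by the iModulon and the regulon
regulon_size = 39.0
imodulon_size = 31.0

precision = TP / imodulon_size   # fraction of iModulon genes that are in the regulon
recall = TP / regulon_size       # fraction of regulon genes that are in the iModulon
f1score = 2 * precision * recall / (precision + recall)

print(round(precision, 6), round(recall, 6), round(f1score, 6))
# 0.483871 0.384615 0.428571
```

A high precision with low recall suggests the iModulon captures only part of a regulon, while the reverse suggests the iModulon contains many genes outside it.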


You can also search for AND/OR combinations of regulators using the `max_regs` argument.

Regulator enrichments can be directly saved to the `imodulon_table` using the `save` argument. This saves the enrichment with the lowest q-value to the table.

In [96]:
# First search for regulator enrichments with 2 regulators
ica_data.compute_trn_enrichment(max_regs=2,save=True)

# Next, search for regulator enrichments with just one regulator. This will supersede the 2 regulator enrichments.
ica_data.compute_trn_enrichment(max_regs=1,save=True)

Unnamed: 0,imodulon,regulator,pvalue,qvalue,precision,recall,f1score,TP,regulon_size,imodulon_size,n_regs
0,2,WA5_RS04335,6.285513e-22,5.656962e-21,0.483871,0.384615,0.428571,15.0,39.0,31.0,1.0
1,2,WA5_RS00460,1.739544e-09,7.827947e-09,0.16129,0.714286,0.263158,5.0,7.0,31.0,1.0
2,9,WA5_RS14715,4.494476e-07,8.539504e-06,0.16,0.177778,0.168421,8.0,45.0,50.0,1.0
3,13,WA5_RS09565,1.036594e-13,1.036594e-12,0.3,0.211268,0.247934,15.0,71.0,50.0,1.0
4,15,WA5_RS11920,1.462165e-09,1.023516e-08,0.363636,0.666667,0.470588,4.0,6.0,11.0,1.0
5,15,WA5_RS11905,4.771253e-08,1.669939e-07,0.363636,0.333333,0.347826,4.0,12.0,11.0,1.0
6,15,WA5_RS11985,3.043188e-06,7.100772e-06,0.272727,0.333333,0.3,3.0,9.0,11.0,1.0
7,16,WA5_RS00825,6.207284e-18,6.207284000000001e-17,0.384615,0.588235,0.465116,10.0,17.0,26.0,1.0
8,19,WA5_RS13100,2.153334e-07,1.507334e-06,0.157895,1.0,0.272727,3.0,3.0,19.0,1.0
9,26,WA5_RS03145,1.580327e-10,5.057046e-09,0.043956,1.0,0.084211,8.0,8.0,182.0,1.0
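
When several regulators are enriched for the same iModulon, saving keeps the one with the lowest q-value. The selection can be mimicked with plain pandas, as in this sketch on a few rows copied from the table above. This illustrates the rule only; it is not PyModulon's internal code.

```python
import pandas as pd

# A few rows copied from the enrichment table above
enrichments = pd.DataFrame({
    'imodulon': [2, 2, 15, 15, 15],
    'regulator': ['WA5_RS04335', 'WA5_RS00460', 'WA5_RS11920',
                  'WA5_RS11905', 'WA5_RS11985'],
    'qvalue': [5.656962e-21, 7.827947e-09, 1.023516e-08,
               1.669939e-07, 7.100772e-06],
})

# Sort by q-value, then keep the first (lowest) row for each iModulon
best = (enrichments.sort_values('qvalue')
                   .drop_duplicates('imodulon')
                   .set_index('imodulon')
                   .sort_index())

print(best.regulator.to_dict())
```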


The list of regulatory iModulons is shown below.

In [97]:
regulatory_imodulons = ica_data.imodulon_table[ica_data.imodulon_table.regulator.notnull()]
print(len(ica_data.imodulon_table),'Total iModulons')
print(len(regulatory_imodulons),'Regulatory iModulons')
regulatory_imodulons

56 Total iModulons
15 Regulatory iModulons


Unnamed: 0,regulator,pvalue,qvalue,precision,recall,f1score,TP,regulon_size,imodulon_size,n_regs
2,WA5_RS04335,6.285513e-22,5.656962e-21,0.483871,0.384615,0.428571,15.0,39.0,31.0,1.0
9,WA5_RS14715,4.494476e-07,8.539504e-06,0.16,0.177778,0.168421,8.0,45.0,50.0,1.0
13,WA5_RS09565,1.036594e-13,1.036594e-12,0.3,0.211268,0.247934,15.0,71.0,50.0,1.0
15,WA5_RS11920,1.462165e-09,1.023516e-08,0.363636,0.666667,0.470588,4.0,6.0,11.0,1.0
16,WA5_RS00825,6.207284e-18,6.207284000000001e-17,0.384615,0.588235,0.465116,10.0,17.0,26.0,1.0
19,WA5_RS13100,2.153334e-07,1.507334e-06,0.157895,1.0,0.272727,3.0,3.0,19.0,1.0
26,WA5_RS03145,1.580327e-10,5.057046e-09,0.043956,1.0,0.084211,8.0,8.0,182.0,1.0
29,WA5_RS05330,8.884442e-09,2.665333e-08,0.6,0.75,0.666667,3.0,4.0,5.0,1.0
36,WA5_RS11355,6.190132e-13,3.714079e-12,0.428571,0.6,0.5,6.0,10.0,14.0,1.0
37,WA5_RS07815,6.229139e-14,3.737484e-13,0.833333,0.714286,0.769231,5.0,7.0,6.0,1.0


You can rename iModulons in this Jupyter notebook, or you can save the iModulon table as a CSV and edit it in Excel.

If two iModulons have the same regulator (e.g. 'Reg'), they will be named 'Reg-1' and 'Reg-2'.

In [98]:
ica_data.rename_imodulons(regulatory_imodulons.regulator.to_dict())
ica_data.imodulon_table.head()

Unnamed: 0,regulator,pvalue,qvalue,precision,recall,f1score,TP,regulon_size,imodulon_size,n_regs
0,,,,,,,,,,
1,,,,,,,,,,
WA5_RS04335,WA5_RS04335,6.285513e-22,5.656962e-21,0.483871,0.384615,0.428571,15.0,39.0,31.0,1.0
3,,,,,,,,,,
4,,,,,,,,,,
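
The 'Reg-1' / 'Reg-2' naming convention for duplicated regulators can be illustrated with a short helper. `number_duplicates` is a hypothetical function written for this example and is not how PyModulon implements the renaming.

```python
from collections import Counter

def number_duplicates(names):
    """Append -1, -2, ... to names that occur more than once; leave unique names alone."""
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # how often each name has been emitted so far
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f'{name}-{seen[name]}')
        else:
            result.append(name)
    return result

print(number_duplicates(['Reg', 'Other', 'Reg']))
```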


In [100]:
regulatory_imodulons = ica_data.imodulon_table[ica_data.imodulon_table.regulator.notnull()]

# Functional iModulons

GO annotations and KEGG pathways/modules were generated in the `1_create_the_gene_table.ipynb` notebook. Enrichments will be calculated in this notebook, and further curated in the `3_manual_iModulon_curation` notebook.

## GO Enrichments

First load the Gene Ontology annotations

In [101]:
DF_GO = pd.read_csv(os.path.join('..','data','GO_annotations_curated.csv'),index_col=0)
DF_GO.head()

Unnamed: 0,gene_id,gene_name,gene_ontology
0,WA5_RS07925,gap,glyceraldehyde-3-phosphate dehydrogenase (NAD+...
1,WA5_RS07925,gap,NAD binding
2,WA5_RS09740,recA,SOS response
3,WA5_RS05070,glpX,magnesium ion binding
4,WA5_RS05070,glpX,manganese ion binding


In [102]:
DF_GO_enrich = ica_data.compute_annotation_enrichment(DF_GO,'gene_ontology')

In [103]:
DF_GO_enrich

Unnamed: 0,imodulon,gene_ontology,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size
0,21,NAD binding,0.006998,0.073476,0.047619,1.0,0.090909,1.0,1.0,21.0
1,21,glyceraldehyde-3-phosphate dehydrogenase (NAD+...,0.006998,0.073476,0.047619,1.0,0.090909,1.0,1.0,21.0
2,WA5_RS00045,"fructose 1,6-bisphosphate 1-phosphatase activity",0.004332,0.022742,0.076923,1.0,0.142857,1.0,1.0,13.0
3,WA5_RS00045,"fructose 1,6-bisphosphate metabolic process",0.004332,0.022742,0.076923,1.0,0.142857,1.0,1.0,13.0
4,WA5_RS00045,gluconeogenesis,0.004332,0.022742,0.076923,1.0,0.142857,1.0,1.0,13.0
5,WA5_RS00045,manganese ion binding,0.004332,0.022742,0.076923,1.0,0.142857,1.0,1.0,13.0
6,WA5_RS00045,magnesium ion binding,0.008646,0.036315,0.076923,0.5,0.133333,1.0,2.0,13.0


## KEGG Enrichments

### Load KEGG mapping
The `kegg_mapping.csv` file contains KEGG orthologies, pathways, modules, and reactions. Only pathways and modules are relevant to iModulon characterization.

In [104]:
DF_KEGG = pd.read_csv(os.path.join('..','data','kegg_mapping.csv'),index_col=0)
print(DF_KEGG.database.unique())
DF_KEGG.head()

['KEGG_Pathway' 'KEGG_Module' 'KEGG_Reaction']


Unnamed: 0,gene_id,database,kegg_id
1878,WA5_RS00005,KEGG_Pathway,map02020
1879,WA5_RS00005,KEGG_Pathway,map04112
1886,WA5_RS00010,KEGG_Pathway,map00230
1887,WA5_RS00010,KEGG_Pathway,map00240
1888,WA5_RS00010,KEGG_Pathway,map01100


In [105]:
kegg_pathways = DF_KEGG[DF_KEGG.database == 'KEGG_Pathway']
kegg_modules = DF_KEGG[DF_KEGG.database == 'KEGG_Module']

In [106]:
print(kegg_pathways)

          gene_id      database   kegg_id
1878  WA5_RS00005  KEGG_Pathway  map02020
1879  WA5_RS00005  KEGG_Pathway  map04112
1886  WA5_RS00010  KEGG_Pathway  map00230
1887  WA5_RS00010  KEGG_Pathway  map00240
1888  WA5_RS00010  KEGG_Pathway  map01100
...           ...           ...       ...
8725  WA5_RS15490  KEGG_Pathway  map01503
8729  WA5_RS15515  KEGG_Pathway  map02024
8730  WA5_RS15515  KEGG_Pathway  map03060
8731  WA5_RS15515  KEGG_Pathway  map03070
8733  WA5_RS15525  KEGG_Pathway  map03010

[3429 rows x 3 columns]


### Perform enrichment
Use the `compute_annotation_enrichment` function, once for pathways and once for modules.

In [107]:
DF_pathway_enrich = ica_data.compute_annotation_enrichment(kegg_pathways,'kegg_id')
DF_module_enrich = ica_data.compute_annotation_enrichment(kegg_modules,'kegg_id')

In [108]:
DF_pathway_enrich

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size
0,WA5_RS04335,map00910,0.0004177699,0.094416,0.096774,0.2,0.130435,3.0,15.0,31.0
1,4,map00791,0.0001393622,0.01574793,0.095238,0.666667,0.166667,2.0,3.0,21.0
2,4,map05120,0.0001393622,0.01574793,0.095238,0.666667,0.166667,2.0,3.0,21.0
3,WA5_RS09565,map02010,9.595278e-09,2.168533e-06,0.26,0.12037,0.164557,13.0,108.0,50.0
4,WA5_RS11920,map00362,6.627727e-13,9.974037e-11,0.636364,0.259259,0.368421,7.0,27.0,11.0
5,WA5_RS11920,map01220,8.826582e-13,9.974037e-11,0.636364,0.25,0.358974,7.0,28.0,11.0
6,WA5_RS11920,map01120,9.380982e-07,7.067006e-05,0.636364,0.037037,0.07,7.0,189.0,11.0
7,WA5_RS11920,map00364,4.338704e-06,0.0002451368,0.272727,0.3,0.285714,3.0,10.0,11.0
8,WA5_RS11920,map00624,1.221815e-05,0.0005522604,0.181818,1.0,0.307692,2.0,2.0,11.0
9,WA5_RS11920,map00361,0.0002540247,0.009568265,0.181818,0.285714,0.222222,2.0,7.0,11.0


In [109]:
DF_module_enrich

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size
0,WA5_RS04335,M00323,4.630973e-08,1.079017e-05,0.129032,0.8,0.222222,4.0,5.0,31.0
1,3,M00479,1.332889e-06,0.0003105631,0.5,1.0,0.666667,2.0,2.0,4.0
2,WA5_RS09565,M00240,1.030342e-15,2.400696e-13,0.26,0.382353,0.309524,13.0,34.0,50.0
3,WA5_RS11920,M00568,1.273089e-06,0.0002966297,0.272727,0.428571,0.333333,3.0,7.0,11.0
4,21,M00166,0.0001393622,0.0108238,0.095238,0.666667,0.166667,2.0,3.0,21.0
5,21,M00308,0.0001393622,0.0108238,0.095238,0.666667,0.166667,2.0,3.0,21.0
6,21,M00552,0.0001393622,0.0108238,0.095238,0.666667,0.166667,2.0,3.0,21.0
7,21,M00001,0.0001886749,0.01099031,0.142857,0.176471,0.157895,3.0,17.0,21.0
8,21,M00165,0.00095917,0.04469732,0.095238,0.285714,0.142857,2.0,7.0,21.0
9,21,M00002,0.001630453,0.06331591,0.095238,0.222222,0.133333,2.0,9.0,21.0


### Convert KEGG IDs to human-readable names

In [None]:
for idx,key in tqdm(DF_pathway_enrich.kegg_id.items(),total=len(DF_pathway_enrich)):
    text = REST.kegg_find('pathway',key).read()
    try:
        name = re.search('\t(.*)\n',text).group(1)
        DF_pathway_enrich.loc[idx,'pathway_name'] = name
    except AttributeError:
        DF_pathway_enrich.loc[idx,'pathway_name'] = None
    
for idx,key in tqdm(DF_module_enrich.kegg_id.items(),total=len(DF_module_enrich)):
    text = REST.kegg_find('module',key).read()
    try:
        name = re.search('\t(.*)\n',text).group(1)
        DF_module_enrich.loc[idx,'module_name'] = name
    except AttributeError:
        DF_module_enrich.loc[idx,'module_name'] = None
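
The loops above send one request per row, even when the same KEGG ID appears several times. A possible refinement, sketched here, is to look up each unique ID once and map the names back onto the table. `parse_kegg_name`, `lookup_names`, and the `fake_responses` stand-in are hypothetical names, and the response text is an assumed tab-separated format; in the notebook, `fetch` would wrap `REST.kegg_find('pathway',key).read()`.

```python
import re

def parse_kegg_name(text):
    """Extract the name from the first 'ID<TAB>name' line, or None if there is no match."""
    match = re.search(r'\t(.*)\n', text)
    return match.group(1) if match else None

# Stand-in for the REST call so the sketch runs offline (assumed response format)
fake_responses = {
    'map00910': 'path:map00910\tNitrogen metabolism\n',
    'M00323': '\n',   # an empty result, as when no name is found
}

def lookup_names(kegg_ids, fetch):
    """Query each unique ID once and return a dict of ID -> name (or None)."""
    return {key: parse_kegg_name(fetch(key)) for key in set(kegg_ids)}

names = lookup_names(['map00910', 'map00910', 'M00323'], fake_responses.get)
print(names)
```

The resulting dictionary could then be applied with `DF_pathway_enrich.kegg_id.map(names)`.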

In [111]:
DF_pathway_enrich

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size,pathway_name
0,WA5_RS04335,map00910,0.0004177699,0.094416,0.096774,0.2,0.130435,3.0,15.0,31.0,Nitrogen metabolism
1,4,map00791,0.0001393622,0.01574793,0.095238,0.666667,0.166667,2.0,3.0,21.0,Atrazine degradation
2,4,map05120,0.0001393622,0.01574793,0.095238,0.666667,0.166667,2.0,3.0,21.0,Epithelial cell signaling in Helicobacter pylo...
3,WA5_RS09565,map02010,9.595278e-09,2.168533e-06,0.26,0.12037,0.164557,13.0,108.0,50.0,ABC transporters
4,WA5_RS11920,map00362,6.627727e-13,9.974037e-11,0.636364,0.259259,0.368421,7.0,27.0,11.0,Benzoate degradation
5,WA5_RS11920,map01220,8.826582e-13,9.974037e-11,0.636364,0.25,0.358974,7.0,28.0,11.0,Degradation of aromatic compounds
6,WA5_RS11920,map01120,9.380982e-07,7.067006e-05,0.636364,0.037037,0.07,7.0,189.0,11.0,Microbial metabolism in diverse environments
7,WA5_RS11920,map00364,4.338704e-06,0.0002451368,0.272727,0.3,0.285714,3.0,10.0,11.0,Fluorobenzoate degradation
8,WA5_RS11920,map00624,1.221815e-05,0.0005522604,0.181818,1.0,0.307692,2.0,2.0,11.0,Polycyclic aromatic hydrocarbon degradation
9,WA5_RS11920,map00361,0.0002540247,0.009568265,0.181818,0.285714,0.222222,2.0,7.0,11.0,Chlorocyclohexane and chlorobenzene degradation


In [112]:
DF_module_enrich

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size,module_name
0,WA5_RS04335,M00323,4.630973e-08,1.079017e-05,0.129032,0.8,0.222222,4.0,5.0,31.0,
1,3,M00479,1.332889e-06,0.0003105631,0.5,1.0,0.666667,2.0,2.0,4.0,
2,WA5_RS09565,M00240,1.030342e-15,2.400696e-13,0.26,0.382353,0.309524,13.0,34.0,50.0,
3,WA5_RS11920,M00568,1.273089e-06,0.0002966297,0.272727,0.428571,0.333333,3.0,7.0,11.0,"Catechol ortho-cleavage, catechol => 3-oxoadipate"
4,21,M00166,0.0001393622,0.0108238,0.095238,0.666667,0.166667,2.0,3.0,21.0,"Reductive pentose phosphate cycle, ribulose-5P..."
5,21,M00308,0.0001393622,0.0108238,0.095238,0.666667,0.166667,2.0,3.0,21.0,"Semi-phosphorylative Entner-Doudoroff pathway,..."
6,21,M00552,0.0001393622,0.0108238,0.095238,0.666667,0.166667,2.0,3.0,21.0,"D-galactonate degradation, De Ley-Doudoroff pa..."
7,21,M00001,0.0001886749,0.01099031,0.142857,0.176471,0.157895,3.0,17.0,21.0,"Glycolysis (Embden-Meyerhof pathway), glucose ..."
8,21,M00165,0.00095917,0.04469732,0.095238,0.285714,0.142857,2.0,7.0,21.0,Reductive pentose phosphate cycle (Calvin cycle)
9,21,M00002,0.001630453,0.06331591,0.095238,0.222222,0.133333,2.0,9.0,21.0,"Glycolysis, core module involving three-carbon..."


## SubtiWiki categories

In [None]:
DF_subtiwiki = pd.read_csv(path.join(external_data,'subtiwiki_categories.csv'))
DF_subtiwiki.head()

In [26]:
# Change the subtiwiki annotation format into a list of genes and categories
DF_subtiwiki = DF_subtiwiki.rename({'BSU_number':'gene_id'},axis=1)
DF_subtiwiki = DF_subtiwiki.melt(id_vars='gene_id',value_vars=['FuncName1','FuncName2','FuncName3','FuncName4','FuncName5'])
DF_subtiwiki = DF_subtiwiki[DF_subtiwiki.value.notnull() & DF_subtiwiki.gene_id.isin(ica_data.gene_names)]
DF_subtiwiki.head()

Unnamed: 0,gene_id,variable,value
0,BSU_09670,FuncName1,Cellular processes
1,BSU_04560,FuncName1,Cellular processes
2,BSU_01770,FuncName1,Cellular processes
3,BSU_01780,FuncName1,Cellular processes
4,BSU_15190,FuncName1,Cellular processes


In [27]:
DF_subti_enrich = ica_data.compute_annotation_enrichment(DF_subtiwiki,'value')

## Save files

In [113]:
DF_GO_enrich['source'] = 'GO'
DF_pathway_enrich['source'] = 'KEGG pathways'
DF_module_enrich['source'] = 'KEGG modules'
# DF_subti_enrich['source'] = 'SubtiWiki'

# Give every table the same 'annotation' column so that they line up when concatenated
DF_GO_enrich.rename({'gene_ontology':'annotation'},axis=1, inplace=True)
DF_pathway_enrich.rename({'kegg_id':'annotation'},axis=1, inplace=True)
DF_module_enrich.rename({'kegg_id':'annotation'},axis=1, inplace=True)
# DF_subti_enrich.rename({'value':'annotation'},axis=1, inplace=True)

# Combine the enrichment tables (append DF_subti_enrich if you computed it)
DF_enrichments = pd.concat([DF_GO_enrich, DF_pathway_enrich, DF_module_enrich])
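
The combined table can then be written to disk. The sketch below shows the rename, concatenate, and save pattern on toy tables with made-up values. The file name `functional_enrichments.csv` and the temporary output folder are assumptions for illustration; in practice you would write into your own data directory.

```python
import os
import tempfile
import pandas as pd

# Toy enrichment tables (values are made up)
toy_go = pd.DataFrame({'imodulon': [21], 'gene_ontology': ['NAD binding'], 'qvalue': [0.07]})
toy_kegg = pd.DataFrame({'imodulon': [4], 'kegg_id': ['map00791'], 'qvalue': [0.02]})

# Give every table a shared 'annotation' column and a 'source' label
toy_go = toy_go.rename({'gene_ontology': 'annotation'}, axis=1).assign(source='GO')
toy_kegg = toy_kegg.rename({'kegg_id': 'annotation'}, axis=1).assign(source='KEGG pathways')

combined = pd.concat([toy_go, toy_kegg], ignore_index=True)

# Write to a temporary folder here; point this at your data directory in practice
out_path = os.path.join(tempfile.mkdtemp(), 'functional_enrichments.csv')
combined.to_csv(out_path)
print(combined)
```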

# Check for single gene iModulons

Some iModulons are dominated by a single, high-coefficient gene. These iModulons may result from:
1. Overdecomposition of the dataset to identify noisy genes
1. Artificial knock-out of single genes
1. Regulons with only one target gene

No matter what causes these iModulons, it is important to be aware of them. The `find_single_gene_imodulons` function identifies iModulons that are likely dominated by a single gene.

The iModulons identified by `find_single_gene_imodulons` may contain more than one gene, since a threshold-agnostic method is used to identify these iModulons.
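
To make "dominated by a single gene" concrete, here is one possible threshold-agnostic heuristic: compare the largest absolute gene weight in an iModulon with the second largest. The function name, the factor of 2, and the toy weights are assumptions chosen for illustration and may differ from the rule PyModulon applies.

```python
import pandas as pd

def dominated_by_one_gene(weights, ratio=2.0):
    """True if the largest absolute weight exceeds `ratio` times the second largest."""
    ranked = weights.abs().sort_values(ascending=False)
    return ranked.iloc[0] > ratio * ranked.iloc[1]

# Toy gene weights for two iModulons (made-up values)
toy_M = pd.DataFrame(
    {'imod_A': [0.50, 0.05, -0.04, 0.01],
     'imod_B': [0.20, -0.18, 0.15, 0.01]},
    index=['gene_1', 'gene_2', 'gene_3', 'gene_4'],
)

single_gene = [name for name in toy_M.columns if dominated_by_one_gene(toy_M[name])]
print(single_gene)
```

Because the comparison uses only the ranked weights, it does not depend on the gene-membership threshold, which is why a flagged iModulon can still contain more than one gene.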

In [114]:
sg_imods = ica_data.find_single_gene_imodulons(save=True)
len(sg_imods)

0

In [115]:
for i,mod in enumerate(sg_imods):
    ica_data.rename_imodulons({mod:'SG_'+str(i+1)})

# Save iModulon object

In [117]:
from pymodulon.util import explained_variance
from pymodulon.io import *

In [118]:
# Add iModulon sizes and explained variance
for im in ica_data.imodulon_names:
    ica_data.imodulon_table.loc[im,'imodulon_size'] = len(ica_data.view_imodulon(im))
    ica_data.imodulon_table.loc[im,'explained_variance'] = explained_variance(ica_data,imodulons=im)

This will save your iModulon table, your thresholds, and any other information stored in the `ica_data` object.

In [119]:
save_to_json(ica_data, os.path.join('..','data','cgu_raw.json.gz'))

If you prefer to view and edit your iModulon table in Excel, save it as a CSV, edit it, and then reload the `IcaData` object as before, passing the edited file through the `imodulon_table` argument.