Sparse Tensor Decomposition for Multi-species Gene Expression Data
This package provides a Python workflow for performing sparse tensor decomposition on gene expression data across multiple species and time points. It is optimized to handle:
- ~10,000 genes
- 3 species
- 4 time points
The workflow uses efficient tensor decomposition methods (PARAFAC/CP and Tucker) to identify patterns and factors in multi-dimensional gene expression data.
- Sparse tensor decomposition using PARAFAC (CP) and Tucker methods
- Optimized for gene expression data with automatic preprocessing
- Data loading utilities for CSV-based gene expression datasets
- Synthetic data generation for testing and validation
- Command-line interface for easy workflow execution
- uv package manager for fast, reliable dependency management
# Clone the repository
git clone https://github.com/sr320/workflow-stdm.git
cd workflow-stdm
# Install with uv
uv sync
# Activate the virtual environment
source .venv/bin/activate # On Unix/macOS
# or
.venv\Scripts\activate # On Windows
pip install -e .
The repository includes test datasets in the input-data/ directory. Simply run:
# Run workflow with default test data (10k genes, 3 species, 4 time points)
stdm run --rank 5 --method parafac
# Or use the smaller test dataset for quick testing
stdm run --input gene_expression_data_small.csv --rank 5 --method parafac
stdm generate --output data.csv --genes 10000 --species 3 --timepoints 4
stdm info --input data.csv
stdm decompose --input data.csv --output results/ --rank 5 --method parafac
The package provides four main commands:
Use test data from input-data/ directory:
stdm run [OPTIONS]
Options:
-i, --input PATH Input CSV file (default: gene_expression_data.csv from input-data/)
-o, --output PATH Output directory (default: results)
-r, --rank INT Decomposition rank (default: 5)
-m, --method {parafac,tucker} Decomposition method (default: parafac)
--sparsity-threshold FLOAT Sparsity threshold (default: 0.01)
--normalize Normalize tensor
--log-transform Apply log2(x+1) transformation (default: True)
--no-log-transform Skip log transformation
--standardize Standardize gene expression (default: True)
--no-standardize Skip standardization
-v, --verbose Verbose output
# Examples:
stdm run # Use default test data
stdm run --input gene_expression_data_small.csv # Use small test data
stdm run --method tucker --rank 10 # Tucker with rank 10
New Features:
- Auto-timestamping: Each run automatically creates a timestamped subdirectory (e.g., results/20251012_175729/) so results are never overwritten
- Automatic README Report: Every run generates a comprehensive README.md in the output directory that includes:
  - Run timestamp and configuration
  - Input data summary
  - Output files description with usage examples
  - Quality assessment and optimal parameter recommendations
  - Python code examples for loading and analyzing results
  - Biological interpretation guidelines
Using Your Own Data:
You can easily use your own gene expression data by providing the --input option:
# Use data from input-data directory
stdm run --input my_custom_data.csv
# Use data from any path
stdm run --input /path/to/my/data.csv --output my_results
# Use data from absolute path
stdm run --input ~/research/gene_expression.csv --output ~/results/analysis1
Data Format Requirements:
Your CSV file should be in the following format:
gene,species,timepoint,expression
gene1,species1,t0,0.5
gene1,species1,t1,0.7
gene1,species2,t0,0.3
...
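As a rough illustration of how this long format maps onto a three-way tensor, the sketch below pivots the rows into a genes × species × time points array using only the standard library and NumPy. The helper name `long_csv_to_tensor` is hypothetical and not part of the stdm API, and filling missing gene/species/time point combinations with zero is an assumption made for the example.

```python
import csv
import io

import numpy as np


def long_csv_to_tensor(handle):
    """Pivot long-format rows into a (genes, species, timepoints) array.

    Hypothetical helper for illustration; not part of the stdm package.
    """
    rows = list(csv.DictReader(handle))
    # Sorted unique labels define the index along each axis.
    genes = sorted({r["gene"] for r in rows})
    species = sorted({r["species"] for r in rows})
    timepoints = sorted({r["timepoint"] for r in rows})
    g_idx = {g: i for i, g in enumerate(genes)}
    s_idx = {s: i for i, s in enumerate(species)}
    t_idx = {t: i for i, t in enumerate(timepoints)}

    # Assumption: combinations absent from the file are treated as zero.
    tensor = np.zeros((len(genes), len(species), len(timepoints)))
    for r in rows:
        i, j, k = g_idx[r["gene"]], s_idx[r["species"]], t_idx[r["timepoint"]]
        tensor[i, j, k] = float(r["expression"])
    return tensor, genes, species, timepoints


# The three sample rows shown above, read from an in-memory file.
sample = io.StringIO(
    "gene,species,timepoint,expression\n"
    "gene1,species1,t0,0.5\n"
    "gene1,species1,t1,0.7\n"
    "gene1,species2,t0,0.3\n"
)
tensor, genes, species, timepoints = long_csv_to_tensor(sample)
print(tensor.shape)
```

Labels are ordered lexicographically here, so a label such as `t10` would sort before `t2`; zero-padded time point names avoid that surprise.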
stdm generate [OPTIONS]
Options:
-o, --output PATH Output CSV file path (required)
-g, --genes INT Number of genes (default: 10000)
-s, --species INT Number of species (default: 3)
-t, --timepoints INT Number of time points (default: 4)
--sparsity FLOAT Sparsity level 0-1 (default: 0.3)
--noise FLOAT Noise level (default: 0.1)
--seed INT Random seed (default: 42)
stdm decompose [OPTIONS]
Options:
-i, --input PATH Input CSV file path (required)
-o, --output PATH Output directory (required)
-r, --rank INT Decomposition rank (default: 5)
-m, --method {parafac,tucker} Decomposition method (default: parafac)
--sparsity-threshold FLOAT Sparsity threshold (default: 0.01)
--normalize Normalize tensor
--log-transform Apply log2(x+1) transformation
--standardize Standardize gene expression
-v, --verbose Verbose output
stdm info --input PATH
from stdm import SparseTensorDecomposer, GeneExpressionLoader
# Load or generate data
loader = GeneExpressionLoader(n_genes=10000, n_species=3, n_timepoints=4)
tensor = loader.generate_synthetic_data(sparsity=0.3, seed=42)
# Or load from CSV
# tensor = loader.load_from_csv("data.csv")
# Preprocess
tensor = loader.preprocess(tensor, log_transform=True, standardize=True)
# Decompose
decomposer = SparseTensorDecomposer(rank=5, method="parafac")
decomposer.fit(tensor, verbose=True)
# Get results
gene_factors = decomposer.get_gene_factors()
species_factors = decomposer.get_species_factors()
time_factors = decomposer.get_time_factors()
summary = decomposer.get_summary()
print(f"Reconstruction error: {summary['reconstruction_error']:.6f}")
# Reconstruct tensor
reconstructed = decomposer.reconstruct()
The expected CSV format for gene expression data:
gene,species,timepoint,expression
gene1,species1,t0,0.5
gene1,species1,t1,0.7
gene1,species2,t0,0.3
...
Each run generates a timestamped directory (e.g., results/20251012_175729/) containing:
- gene_factors.npy - Gene factor matrix (n_genes × rank)
- species_factors.npy - Species factor matrix (n_species × rank)
- time_factors.npy - Time factor matrix (n_timepoints × rank)
- reconstructed_tensor.npy - Reconstructed tensor
- summary.json - Decomposition summary with reconstruction error
- README.md - Comprehensive analysis report with:
  - Run configuration and timestamp
  - Input data statistics
  - Quality assessment and parameter recommendations
  - Detailed file descriptions
  - Python code examples for result analysis
  - Biological interpretation guidelines
Example directory structure after multiple runs:
results/
├── 20251012_143022/ # First run
│ ├── README.md
│ ├── gene_factors.npy
│ ├── species_factors.npy
│ ├── time_factors.npy
│ ├── reconstructed_tensor.npy
│ └── summary.json
└── 20251012_175729/ # Second run
    ├── README.md
    ├── gene_factors.npy
    ├── species_factors.npy
    ├── time_factors.npy
    ├── reconstructed_tensor.npy
    └── summary.json
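Because the run directories are named by timestamp, picking up the newest results can be done by sorting directory names. The following is a minimal sketch, assuming the file names listed above; `load_latest_run` is a hypothetical helper and not part of the stdm package.

```python
import json
from pathlib import Path

import numpy as np


def load_latest_run(results_dir):
    """Load the outputs of the most recent timestamped run.

    Hypothetical helper; relies on YYYYMMDD_HHMMSS names sorting chronologically.
    """
    run_dirs = sorted(p for p in Path(results_dir).iterdir() if p.is_dir())
    if not run_dirs:
        raise FileNotFoundError(f"no run directories found in {results_dir}")
    latest = run_dirs[-1]
    return {
        "run_dir": latest,
        "gene_factors": np.load(latest / "gene_factors.npy"),
        "species_factors": np.load(latest / "species_factors.npy"),
        "time_factors": np.load(latest / "time_factors.npy"),
        "summary": json.loads((latest / "summary.json").read_text()),
    }


# Only attempt to load if a results/ directory exists in the working directory.
if Path("results").is_dir():
    run = load_latest_run("results")
    print(run["run_dir"].name, run["gene_factors"].shape)
```

Sorting by name rather than by modification time keeps the choice stable even if older run directories are later copied or touched.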
Canonical Polyadic (CP) decomposition, also known as PARAFAC, decomposes the tensor into a sum of rank-1 tensors. It's efficient and provides interpretable factors.
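To make the "sum of rank-1 tensors" idea concrete, the sketch below rebuilds a small tensor from three factor matrices with plain NumPy. The sizes and random factors are illustrative only, and this is not the package's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
rank = 2
# Illustrative factor matrices: 6 genes, 3 species, 4 time points.
A = rng.normal(size=(6, rank))  # gene factors
B = rng.normal(size=(3, rank))  # species factors
C = rng.normal(size=(4, rank))  # time factors

# CP model: X[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
X = np.einsum("ir,jr,kr->ijk", A, B, C)

# The same tensor built explicitly as a sum of rank-1 outer products.
X_sum = sum(
    np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
    for r in range(rank)
)
print(X.shape, np.allclose(X, X_sum))
```

Each component r pairs one gene profile with one species profile and one temporal profile, which is what makes the factors readable one component at a time.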
Tucker decomposition is a higher-order generalization of matrix SVD. It decomposes the tensor into a core tensor multiplied by factor matrices along each mode.
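The Tucker model can be sketched the same way: a small core tensor is multiplied by one factor matrix along each mode. The shapes below are illustrative assumptions, and the second half shows that CP corresponds to the special case of a superdiagonal core.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes: 6 genes, 3 species, 4 time points; core of shape (2, 2, 3).
G = rng.normal(size=(2, 2, 3))  # core tensor
A = rng.normal(size=(6, 2))     # gene factors
B = rng.normal(size=(3, 2))     # species factors
C = rng.normal(size=(4, 3))     # time factors

# Tucker model: X[i, j, k] = sum_{a,b,c} G[a, b, c] * A[i, a] * B[j, b] * C[k, c]
X = np.einsum("abc,ia,jb,kc->ijk", G, A, B, C)

# CP is the special case where the core is superdiagonal (ones on the diagonal).
rank = 2
G_diag = np.zeros((rank, rank, rank))
G_diag[np.arange(rank), np.arange(rank), np.arange(rank)] = 1.0
C2 = rng.normal(size=(4, rank))  # time factors with matching rank
X_tucker = np.einsum("abc,ia,jb,kc->ijk", G_diag, A, B, C2)
X_cp = np.einsum("ir,jr,kr->ijk", A, B, C2)
print(X.shape, np.allclose(X_tucker, X_cp))
```

Unlike CP, Tucker allows a different number of components per mode, with the core describing how components from different modes interact.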
- Sparsity: The package automatically handles sparse tensors, setting values below the threshold to zero
- Normalization: Tensor normalization improves numerical stability
- Rank selection: Lower ranks are faster but may lose information; higher ranks capture more detail
- Memory: With 10k genes, 3 species, and 4 timepoints, the tensor requires ~960 KB (double precision)
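The notes above can be checked with a few lines of NumPy. This is a sketch under stated assumptions: thresholding is taken to compare absolute values, and normalization is taken to mean scaling to unit Frobenius norm; the package's actual behavior may differ. Both function names are hypothetical.

```python
import numpy as np


def apply_sparsity_threshold(tensor, threshold=0.01):
    """Zero out entries whose magnitude is below the threshold.

    Assumption: the comparison uses absolute values; the package may differ.
    """
    out = tensor.copy()
    out[np.abs(out) < threshold] = 0.0
    return out


def normalize_tensor(tensor):
    """Scale the tensor to unit Frobenius norm (one common normalization)."""
    norm = np.linalg.norm(tensor)
    return tensor / norm if norm > 0 else tensor


rng = np.random.default_rng(42)
# Random stand-in data with the dimensions quoted above.
tensor = rng.normal(scale=0.05, size=(10_000, 3, 4))

sparse = apply_sparsity_threshold(tensor, threshold=0.01)
normalized = normalize_tensor(sparse)

print(f"fraction zeroed: {np.mean(sparse == 0):.3f}")
# 10,000 x 3 x 4 values at 8 bytes each.
print(f"dense size: {tensor.nbytes / 1000:.0f} KB")
```

The memory figure follows directly from 10,000 × 3 × 4 = 120,000 double-precision values at 8 bytes each, which is 960,000 bytes.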
# Install development dependencies
uv sync
# Run tests (if available)
pytest
# Format code
ruff format src/
# Lint code
ruff check src/
MIT License
If you use this package in your research, please cite:
@software{workflow_stdm,
title = {Sparse Tensor Decomposition for Multi-species Gene Expression},
author = {Your Name},
year = {2025},
url = {https://github.com/sr320/workflow-stdm}
}