# CRISPR Guide RNA Efficacy Prediction Pipeline

A high-performance Rust-based pipeline for evaluating and combining multiple CRISPR guide RNA efficacy prediction tools, with Python support scripts for visualization and analysis.

## Overview

This pipeline integrates and evaluates various state-of-the-art CRISPR guide RNA prediction tools, trains consensus models, and provides comprehensive analysis of guide RNA efficacy across multiple datasets. The main pipeline is implemented in Rust for performance, with Python scripts for advanced visualizations.

## Key Features

- **Multi-tool Integration**: Supports 7+ prediction tools, including DeepCRISPR, TransCRISPR, DeepSpCas9, TKO PSSM, Moreno-Mateos, and Rule Set 2 & 3
- **Consensus Models**: Trains both linear and logistic consensus models
- **Multiple Datasets**: Compatible with TKO, Avana/DepMap, and GenomeCRISPR datasets
- **Comprehensive Evaluation**: ROC curves, confusion matrices, gene-level analysis, and statistical testing
- **High Performance**: Rust implementation for fast processing of large-scale genomic data

## Requirements

### Rust Dependencies

- Rust 1.70+ (install from rustup.rs)
- Cargo (comes with Rust)

### Python Dependencies (for visualization scripts)

- python >= 3.8
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- polars

### External Tool Dependencies

Some prediction tools require their own installations and need to be placed in the root directory:

- DeepCRISPR
- TransCRISPR
- DeepSpCas9
- CHOPCHOP
- CRISPOR

## Installation

1. Clone the repository.

2. Build the Rust project:

       cargo build --release

3. Set up the Python environment:

       python -m venv venv
       source venv/bin/activate  # On Windows: venv\Scripts\activate
       pip install -r requirements.txt

4. Download the required data files (see the Data Setup section).

## Usage

### Running the Main Pipeline

    cargo run --release

The pipeline will:

1. Load and validate datasets
2. Balance the data using stratified sampling
3. Run prediction tools (if enabled)
4. Train consensus models
5. Evaluate all tools and generate reports
6. Save results and model weights

### Configuration

Edit `main.rs` (under `validator/src/`) to configure:

- Dataset selection (TKO, Avana, or GenomeCRISPR)
- Efficacy thresholds (default: poor < 60%, good > 90%)
- Which prediction tools to run
- Output directories
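
To make the threshold setting concrete, here is a minimal sketch of how the default cut-offs could translate into guide labels. The `Thresholds` struct, the `EfficacyClass` enum, and the `classify` function are hypothetical names for illustration and are not taken from `main.rs`. The sketch also assumes efficacy is stored as a fraction between 0 and 1, and that guides between the two cut-offs form a separate intermediate class.

```rust
/// Efficacy class assigned to a guide (illustrative names, not from the pipeline).
#[derive(Debug, PartialEq, Clone, Copy)]
enum EfficacyClass {
    Poor,
    Intermediate,
    Good,
}

/// Hypothetical configuration struct; the real settings live in main.rs.
struct Thresholds {
    poor_below: f64,
    good_above: f64,
}

impl Default for Thresholds {
    fn default() -> Self {
        // Defaults quoted above: poor < 60%, good > 90%.
        Thresholds { poor_below: 0.60, good_above: 0.90 }
    }
}

/// Label a guide from its efficacy (assumed to be a fraction in [0, 1]).
fn classify(efficacy: f64, t: &Thresholds) -> EfficacyClass {
    if efficacy < t.poor_below {
        EfficacyClass::Poor
    } else if efficacy > t.good_above {
        EfficacyClass::Good
    } else {
        EfficacyClass::Intermediate
    }
}

fn main() {
    let t = Thresholds::default();
    for e in [0.35, 0.75, 0.95] {
        println!("{:.2} -> {:?}", e, classify(e, &t));
    }
}
```

Whether intermediate guides are kept or dropped before training is a design decision this sketch does not settle.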

### Python Analysis Scripts

#### Generate contribution analysis

    python scripts/contribution.py


## Project Structure

    crispr_pipeline/
    ├── validator/src/                    # Rust source code
    │   ├── main.rs                       # Main pipeline entry point
    │   ├── models.rs                     # Data models and types
    │   ├── analysis/                     # Analysis modules
    │   │   ├── roc.rs                    # ROC curve analysis
    │   │   ├── prediction_evaluation.rs
    │   │   └── within_gene.rs            # Gene-level analysis
    │   ├── data_handling/                # Dataset loaders
    │   │   ├── tko_one.rs                # TKO dataset handler
    │   │   ├── avana_depmap.rs           # Avana/DepMap handler
    │   │   └── genome_crispr.rs          # GenomeCRISPR handler
    │   ├── prediction_tools/             # Tool integrations
    │   │   ├── deepcrispr_integration.rs
    │   │   ├── transcrispr.rs
    │   │   └── ...
    │   ├── combination_model.rs          # Linear consensus model
    │   └── logistic_cv.rs                # Logistic consensus model
    ├── scripts/                          # Python analysis scripts
    ├── data/                             # Input data directory
    ├── figures/                          # Output figures
    └── prediction_evaluation_results/    # Evaluation outputs


## Data Setup

### Required Data Files

1. **Essential Gene Sets**:
   - `data/cegv2.txt` - Core essential genes

2. **Datasets** (choose one):
   - **TKO**: 
     - `data/tko/tko_one.xlsx`
     - `data/tko/tko_hg38_annotation_for_rust.txt`
   - **Avana or KY/DepMap**:
     - `data/depmap/CRISPRInferredGuideEfficacy_23Q4.csv`
     - `data/depmap/KYGuideMap.csv`
     - `data/depmap/AvanaGuideMap_23Q4.csv`
   - **GenomeCRISPR**:
     - `data/genomecrispr/GenomeCRISPR_full05112017_brackets.csv`

### Output Files

The pipeline generates:
- `weights_linear_*.json` - Linear consensus model weights
- `weights_logistic_*.json` - Logistic consensus model weights  
- `weights.csv` - Final dataset with all predictions
- `prediction_evaluation_results/` - Comprehensive evaluation metrics
- `figures/` - ROC curves, violin plots, and other visualizations

## Prediction Tools Integrated

1. **Doench Rule Set 2** - Doench et al., 2016
2. **Doench Rule Set 3** - Latest Doench lab model
3. **Moreno-Mateos** - Moreno-Mateos et al., 2015
4. **DeepCRISPR** - Chuai et al., 2018
5. **TransCRISPR** - Transfer learning approach
6. **DeepSpCas9** - Kim et al., 2019
7. **TKO PSSM** - Position-specific scoring matrix from TKO data
8. **Linear Consensus** - Weighted linear combination
9. **Logistic Consensus** - Cross-validated logistic regression
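
The two consensus entries combine the per-tool scores into one value. The sketch below illustrates the general idea only: the function names, the intercept term, and the example numbers are assumptions, and the actual models in `combination_model.rs` and `logistic_cv.rs` may differ in features, scaling, and regularization.

```rust
/// Weighted linear combination of per-tool scores plus an intercept.
fn linear_consensus(scores: &[f64], weights: &[f64], intercept: f64) -> f64 {
    assert_eq!(scores.len(), weights.len(), "one weight per tool score");
    intercept + scores.iter().zip(weights).map(|(s, w)| s * w).sum::<f64>()
}

/// Logistic consensus: the same linear term passed through a sigmoid,
/// which maps it to a value strictly between 0 and 1.
fn logistic_consensus(scores: &[f64], weights: &[f64], intercept: f64) -> f64 {
    let z = linear_consensus(scores, weights, intercept);
    1.0 / (1.0 + (-z).exp())
}

fn main() {
    // Made-up scores from three tools and made-up weights, for illustration only.
    let scores = [0.8, 0.4, 0.6];
    let weights = [0.5, 0.25, 0.25];
    println!("linear   = {:.3}", linear_consensus(&scores, &weights, 0.0));
    println!("logistic = {:.3}", logistic_consensus(&scores, &weights, 0.0));
}
```

In practice the weights would be read from the `weights_linear_*.json` and `weights_logistic_*.json` files rather than hard-coded.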

## Advanced Features

### Stratified Sampling
The pipeline uses gene-aware stratified sampling to ensure balanced representation across efficacy ranges while maintaining gene diversity.
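
One way to picture gene-aware stratified sampling is a round-robin draw over genes within each efficacy stratum. The sketch below is an assumption about how such a scheme could work, not the pipeline's implementation: the `Guide` struct and `gene_aware_sample` function are hypothetical, and the draw is deterministic (input order) where a real sampler would most likely shuffle.

```rust
use std::collections::BTreeMap;

/// Minimal guide record for illustration (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
struct Guide {
    gene: String,
    stratum: u8, // e.g. 0 = poor, 1 = good
}

/// Draw up to `per_stratum` guides from each stratum, cycling over genes so
/// that no single gene dominates a stratum.
fn gene_aware_sample(guides: &[Guide], per_stratum: usize) -> Vec<Guide> {
    // stratum -> gene -> guides of that gene in that stratum
    let mut by_stratum: BTreeMap<u8, BTreeMap<&str, Vec<&Guide>>> = BTreeMap::new();
    for g in guides {
        by_stratum
            .entry(g.stratum)
            .or_default()
            .entry(g.gene.as_str())
            .or_default()
            .push(g);
    }

    let mut sample: Vec<Guide> = Vec::new();
    for genes in by_stratum.values() {
        let mut taken = 0;
        let mut round = 0;
        // Pass n takes the n-th guide of every gene that still has one.
        while taken < per_stratum {
            let mut progressed = false;
            for list in genes.values() {
                if taken == per_stratum {
                    break;
                }
                if let Some(g) = list.get(round) {
                    sample.push((*g).clone());
                    taken += 1;
                    progressed = true;
                }
            }
            if !progressed {
                break; // stratum exhausted before reaching the target
            }
            round += 1;
        }
    }
    sample
}

fn main() {
    let mk = |gene: &str, stratum: u8| Guide { gene: gene.to_string(), stratum };
    let guides = vec![
        mk("A", 0), mk("A", 0), mk("A", 0), mk("B", 0),
        mk("A", 1), mk("C", 1),
    ];
    for g in gene_aware_sample(&guides, 2) {
        println!("{:?}", g);
    }
}
```

Here gene A has three poor guides, but only one of them is drawn, so the poor stratum contains both A and B.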

### Cross-Validation
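Logistic consensus models are trained using k-fold cross-validation with gene-based splits: all guides targeting a gene are kept in the same fold, which prevents leakage between training and validation data and limits overfitting.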
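
A gene-based split can be sketched by assigning folds per gene rather than per guide. The `gene_folds` function below is a hypothetical illustration: it maps sorted unique gene names to folds by index modulo `k`, whereas the pipeline may shuffle genes or balance fold sizes differently.

```rust
use std::collections::BTreeSet;

/// Assign each guide to one of `k` folds based on its gene, so that all
/// guides of a gene share a fold. Returns one fold index per input guide.
fn gene_folds(genes: &[&str], k: usize) -> Vec<usize> {
    assert!(k > 0, "k must be positive");
    // Sorted unique genes give a deterministic gene -> fold mapping.
    let unique: Vec<&str> = genes
        .iter()
        .copied()
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect();
    genes
        .iter()
        .map(|g| unique.binary_search(g).expect("gene is in the set") % k)
        .collect()
}

fn main() {
    // One entry per guide: the gene that guide targets.
    let genes = ["TP53", "TP53", "MYC", "KRAS", "MYC", "BRCA1"];
    let folds = gene_folds(&genes, 2);
    for fold in 0..2 {
        let rows: Vec<usize> = (0..genes.len()).filter(|&i| folds[i] == fold).collect();
        println!("fold {}: validation rows {:?}", fold, rows);
    }
}
```

Because the fold is a function of the gene alone, both TP53 guides and both MYC guides always land on the same side of every train/validation split.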

### Statistical Analysis
Includes methods for identifying statistically significant "bad guide" predictors using various statistical tests.
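
This README does not name the tests used. As one illustration of a rank-based statistic that fits this kind of question, the sketch below computes the area under the ROC curve as a normalized Mann-Whitney U statistic. The `auc` function is hypothetical and uses a simple quadratic loop, and the choice of which class counts as positive (for example, "bad guide") is an assumption left to the caller.

```rust
/// AUC as the probability that a randomly chosen positive outscores a randomly
/// chosen negative (ties count one half). This equals the Mann-Whitney U
/// statistic divided by n_pos * n_neg. Returns None if either class is empty.
fn auc(scores: &[f64], labels: &[bool]) -> Option<f64> {
    assert_eq!(scores.len(), labels.len(), "one label per score");
    let mut pos = Vec::new();
    let mut neg = Vec::new();
    for (&s, &l) in scores.iter().zip(labels) {
        if l {
            pos.push(s);
        } else {
            neg.push(s);
        }
    }
    if pos.is_empty() || neg.is_empty() {
        return None;
    }
    let mut u = 0.0;
    for &p in &pos {
        for &n in &neg {
            if p > n {
                u += 1.0;
            } else if p == n {
                u += 0.5;
            }
        }
    }
    Some(u / (pos.len() * neg.len()) as f64)
}

fn main() {
    // Made-up predictor scores and labels, for illustration only.
    let scores = [0.9, 0.8, 0.4, 0.3];
    let labels = [true, false, true, false];
    println!("AUC = {:?}", auc(&scores, &labels));
}
```

An AUC near 0.5 means the predictor ranks the two classes no better than chance. The pairwise loop is fine for small inputs, while a sort-based rank sum would scale better to large screens.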


## Contributing

Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request


## Acknowledgments

This pipeline integrates tools and methods from multiple research groups. Please cite the original tool publications when using their predictions.
