Pangolin_train

The steps below recreate the training and testing procedures from Zeng and Li, Genome Biology 2022. This repository is a work in progress: it is currently set up to produce the top-k and AUPRC metrics for Pangolin shown in Figure 1b.

`preprocessing`

Generate the training and test datasets. To start from the intermediate files (`splice_table_{species}.txt`, present in the repository for {species} = Human, Macaque, Mouse, Rat), follow Step 2 only. To run the whole pipeline, starting from RNA-seq reads, follow Step 1 and then Step 2.

Step 1

Run `snakemake -s Snakefile1 --config SPECIES={species}` and `snakemake -s Snakefile2 --config SPECIES={species}` for {species} = Human, Macaque, Mouse, Rat. You will probably need to adjust the file paths. These workflows map RNA-seq reads for each species and tissue, quantify usage of splice sites, and output tables of splice sites for each gene.
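The two commands above have to be repeated for each of the four species, so a small driver can help. The following is a minimal sketch, not part of the repository: the helper names `build_commands` and `run_all` are hypothetical, and it assumes `snakemake` is on your `PATH` and that it is run from the directory containing `Snakefile1` and `Snakefile2`.

```python
import subprocess

# Species names as listed in this README.
SPECIES = ["Human", "Macaque", "Mouse", "Rat"]
SNAKEFILES = ["Snakefile1", "Snakefile2"]


def build_commands(species_list=SPECIES, snakefiles=SNAKEFILES):
    """Return one snakemake argument list per (species, Snakefile) pair.

    For each species, Snakefile1 is listed before Snakefile2.
    """
    commands = []
    for species in species_list:
        for snakefile in snakefiles:
            commands.append(
                ["snakemake", "-s", snakefile, "--config", f"SPECIES={species}"]
            )
    return commands


def run_all(dry_run=True):
    """Print every command; execute it only when dry_run is False."""
    for argv in build_commands():
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the loop at the first failing workflow.
            subprocess.run(argv, check=True)
```

Calling `run_all()` only prints the eight commands; `run_all(dry_run=False)` actually runs them, one after another.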

Dependencies: Snakemake, Samtools, fastp, STAR, RSEM, MMR, Sambamba, RegTools, SpliSER, pybedtools

Inputs:

Reference genomes and annotations from GENCODE and Ensembl
RNA-seq reads from ArrayExpress (mouse, rat, macaque, human)

Outputs:

splice_table_{species}.txt for each species, used to generate training datasets, and splice_table_Human.test.txt, used to generate test datasets. Each line is formatted as:

gene_id paralog_marker  chromosome  strand  gene_start  gene_end  splice_site_pos:heart_usage,liver_usage,brain_usage,testis_usage,...
# Note: paralog_marker is unused and set to 0 for all genes
# Note: See utils_multi.py for how genes with usage < 0 are interpreted
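To make the line format concrete, here is a hedged sketch of a parser for one line of a splice table. It is not code from the repository (the function name `parse_splice_table_line` is hypothetical) and it rests on assumptions not confirmed above: that fields are separated by whitespace, that usage values are numeric, and that a gene with several splice sites lists them as further `pos:usage,...` entries separated by whitespace or `;`. Check `utils_multi.py` for the authoritative handling.

```python
def parse_splice_table_line(line):
    """Parse one splice table line into a dict (format details are assumed)."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError("expected six gene fields plus at least one splice site")

    gene_id, paralog_marker, chromosome, strand, gene_start, gene_end = fields[:6]

    # Map splice site position -> list of per-tissue usage values.
    sites = {}
    for entry in fields[6:]:
        # Assumption: multiple sites may be joined with ';'.
        for item in entry.split(";"):
            if not item:
                continue
            pos, usages = item.split(":")
            sites[int(pos)] = [float(u) for u in usages.split(",") if u]

    return {
        "gene_id": gene_id,
        "paralog_marker": int(paralog_marker),  # unused, 0 for all genes
        "chromosome": chromosome,
        "strand": strand,
        "gene_start": int(gene_start),
        "gene_end": int(gene_end),
        "sites": sites,
    }
```

Negative usage values are kept as-is here; how they should be interpreted is defined in `utils_multi.py`, not in this sketch.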

Step 2

Run `./create_files.sh`. This generates the `dataset*.h5` files, which are the training and test datasets (requires ~500GB of space). These are used in the train and evaluate steps below.
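Because this step needs roughly 500GB, it can be worth checking free disk space before starting. A minimal sketch using only the Python standard library follows; the function name `has_enough_space` is hypothetical, and the 500 default simply mirrors the estimate above (interpreted here as GiB).

```python
import shutil


def has_enough_space(path=".", required_gb=500):
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3
```

For example, `has_enough_space("/path/to/output")` reports whether the target filesystem can hold the generated datasets; the path shown is a placeholder.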

Dependencies:

`conda create -c bioconda -n create_files_env python=2.7 h5py bedtools`, or equivalent

Inputs:

splice_table_{species}.txt for each species and splice_table_Human.test.txt (included in the repository or generated from Step 1)
Reference genomes for each species from GENCODE and Ensembl

Outputs: dataset_train_all.h5 (all species) and dataset_test_1.h5 (human test sequences)

`training`

Run `train.sh` to train all models for the evaluations used in Figure 1b. Depending on your GPU, this may take a few weeks! For reference, I have uploaded the models produced by running only the first two lines of `train.sh` to `train/models`. (TODO: Add fine-tuning steps for models used in later figures.)

Dependencies:

`conda create -c pytorch -n train_test_env python=3.8 pytorch torchvision torchaudio cudatoolkit=11.3 h5py`, or equivalent

Inputs:

dataset_train_all.h5 from preprocessing steps

Outputs:

Model checkpoints in train/models

`evaluate`

Run test.sh to get top-k and AUPRC statistics for test datasets. (TODO: Add additional evaluation metrics.)
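For readers unfamiliar with the two metrics, here is a pure-Python sketch of how they are commonly defined. It is an illustration only and may differ from what `test.sh` computes. The assumptions: top-k accuracy takes k to be the number of true splice sites and reports the fraction of the k highest-scoring positions that are true sites, and AUPRC is estimated as average precision. The function names are hypothetical.

```python
def _rank_by_score(scores):
    """Indices of positions sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def top_k_accuracy(labels, scores):
    """Fraction of the k top-scoring positions that are true sites.

    k is the number of true sites (labels equal to 1).
    """
    k = sum(labels)
    if k == 0:
        raise ValueError("need at least one positive label")
    ranked = _rank_by_score(scores)
    hits = sum(labels[i] for i in ranked[:k])
    return hits / k


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true site."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    hits = 0
    total = 0.0
    for rank, i in enumerate(_rank_by_score(scores), start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
print(top_k_accuracy(labels, scores))     # 0.5
print(average_precision(labels, scores))  # (1/1 + 2/3) / 2
```

With scikit-learn installed, `sklearn.metrics.average_precision_score(labels, scores)` computes the same quantity, although results can differ when scores are tied.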

Dependencies:

Same as those for training, plus scikit-learn (sklearn)

Inputs:

dataset_test_1.h5 from preprocessing steps
Models, obtained either by following the training steps or by cloning https://github.com/tkzeng/Pangolin.git
