TensorSV
License: GPL v3

Moment Based SV Calling and Genotyping

Copyright (C) 2020-2021 Timothy James Becker

T. Becker and D. G. Shin, "TensorSV: structural variation inference
using tensors and variable topology neural networks,"
2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM),
Seoul, Korea (South), 2020, pp. 1356-1360

Requirements:

python 3.6+
cython 0.29+
numpy 1.18+
matplotlib 3.2.1
h5py 2.10
pysam 0.15.2
hfm 0.1.8
tensorflow 1.15.0 (tensorflow-gpu 1.15.0 also works, for GPU support)

PIP Installation:

python3 -m pip install https://github.com/timothyjamesbecker/TensorSV/releases/download/0.0.1/tensorsv-0.0.1.tar.gz

Basic Usage:

(1) Start by extracting features from the BAM files using the hfm package. The script used here, extractor.py, is a high-level, multi-BAM-aware extraction runner that ships with the hfm package. You can install that package from its git repo: https://github.com/timothyjamesbecker/hfm

extractor.py \
--ref_path ./reference_sequence.fa \
--in_path ./folder_of_bam_files/ \
--out_dir ./output_hdf5_files/ \
--seqs chr1,chr2, ... ,chr22,chrX,chrY,chrM \
--window 25 \
--branch 2 \
--cpus 12
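The features hfm extracts are moment-style window summaries (hence "Moment Based SV Calling" above). As a rough illustration of the kind of quantity a --window 25 extraction computes, here is a hypothetical pure-Python helper (window_moments is illustrative, not part of the hfm API) that takes the first four sample moments of a toy per-base coverage track:

```python
# Illustration only: hfm summarizes fixed windows (--window above) with
# moment-style features. This hypothetical helper computes the first four
# sample moments per non-overlapping window of a toy coverage track.
import math

def window_moments(track, window):
    """Return (mean, variance, skewness, kurtosis) per non-overlapping window."""
    out = []
    for i in range(0, len(track) - window + 1, window):
        w = track[i:i + window]
        n = len(w)
        mu = sum(w) / n
        var = sum((x - mu) ** 2 for x in w) / n
        sd = math.sqrt(var)
        if sd == 0.0:
            # constant window: higher standardized moments are undefined, use 0
            out.append((mu, 0.0, 0.0, 0.0))
            continue
        skew = sum(((x - mu) / sd) ** 3 for x in w) / n
        kurt = sum(((x - mu) / sd) ** 4 for x in w) / n
        out.append((mu, var, skew, kurt))
    return out

coverage = [30, 31, 29, 30, 15, 14, 16, 15, 30, 30]  # toy coverage drop
print(window_moments(coverage, 5))
```

A window that spans a deletion shows a depressed mean and shifted higher moments relative to its neighbors, which is the signal the downstream model consumes.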

(2) Next, normalize and standardize the HFM hdf5 files (and capture labeled targets if training is desired) using the TensorSV script data_prep.py shown below. This script runs each sample in parallel, so setting --cpus to the number of samples is suggested when you have enough processors and memory. This step produces one *.norm.hdf file and one *.label.hdf file per sample. For training, you can use the data_prep.merge_samples function to mix together any samples that have undergone this process.

data_prep.py \
--vcf_in_path ./hgsv_hg38_hfm_server/hgsv.illumina.hg38.all.geno.vcf.gz \
--hfm_in_path ./hgsv_hg38_hfm_server/ \
--out_path ./hgsv_hg38_hfm_server/ \
--cpus 9
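The README does not spell out the normalization scheme; a common choice, and only a guess at the idea, is per-feature z-scoring. A minimal sketch under that assumption (standardize is illustrative, not the data_prep.py API):

```python
# Sketch of the standardization idea behind data_prep.py (assumed: per-feature
# z-scores; the real script's scheme may differ).
import math

def standardize(values):
    """Map a feature vector to zero mean and unit (population) variance."""
    n = len(values)
    mu = sum(values) / n
    # `or 1.0` guards the constant-feature case (sd == 0.0 is falsy)
    sd = math.sqrt(sum((x - mu) ** 2 for x in values) / n) or 1.0
    return [(x - mu) / sd for x in values]

print(standardize([1, 2, 3]))
```

Standardizing per feature puts coverage, insert-size, and clipping signals on a comparable scale before they are stacked into training tensors.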

(3) Now you can either train a new SV model using train_sv.py or use an existing one in the next step.

train_sv.py \
--in_path ./hgsv_hg38_hfm_server/tensors/hgsv.hg38.labels.hdf5 \
--sub_sample_not 0.75 \
--out_path ./hgsv_hg38_hfm_server/cnn_75/ \
--sv_types DEL \
--filters all \
--form cnn \
--levels 2,4 \
--cmxs 2,4 \
--batches 32,64,128 \
--epochs 10,25 \
--decays 1e-5,2e-5 \
--split_seed 0 \
--gpu_num 0 \
--verbose
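The comma-separated flags above (--levels, --cmxs, --batches, --epochs, --decays) read like a hyperparameter grid. Assuming train_sv.py crosses them (the pairing rule is a guess, not taken from the script), the implied search space can be enumerated like this:

```python
# Hedged illustration: enumerating the configuration grid the comma-separated
# train_sv.py flags above appear to describe. How the script actually pairs
# these values is an assumption.
from itertools import product

levels  = [2, 4]
cmxs    = [2, 4]          # cmx: assumed model-complexity multiplier
batches = [32, 64, 128]
epochs  = [10, 25]
decays  = [1e-5, 2e-5]

grid = list(product(levels, cmxs, batches, epochs, decays))
print(len(grid))  # 2*2*3*2*2 = 48 candidate configurations
```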

(4) Now you can run predict_sv.py on the normalized hdf5 files from step (2). If you have enabled training diagnostics and your folder contains true, comparable calls, this will also produce metrics on your calls to show your model's accuracy.

predict_sv.py \
--base_dir ./hgsv_hg38_hfm_server/ \
--run_dir ./hgsv_hg38_hfm_server/cnn_75/ \
--out_dir ./hgsv_hg38_hfm_server/cnn_75_result/ \
--samples HG00096,HG00268 \
--sv_type DEL \
--seqs chr19,chr20,chr21,chr22 \
--gpu_num 0
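One common way such call metrics are computed, and only an assumption about what predict_sv.py actually does, is matching predicted intervals against true ones by 50% reciprocal overlap and reporting precision and recall:

```python
# Illustration of call-level accuracy metrics (reciprocal-overlap matching is
# a common SV benchmarking convention; whether predict_sv.py uses exactly this
# rule is an assumption).
def reciprocal_overlap(a, b):
    """Reciprocal overlap of two half-open [start, end) intervals."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    if inter <= 0:
        return 0.0
    return min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))

def precision_recall(pred, truth, r=0.5):
    """Precision/recall where a call matches at reciprocal overlap >= r."""
    tp = sum(1 for p in pred if any(reciprocal_overlap(p, t) >= r for t in truth))
    found = sum(1 for t in truth if any(reciprocal_overlap(p, t) >= r for p in pred))
    prec = tp / len(pred) if pred else 0.0
    rec = found / len(truth) if truth else 0.0
    return prec, rec

print(precision_recall([(100, 200), (500, 600)], [(110, 210), (900, 1000)]))
```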
