# SNV Simulation

Simulate DNA/RNA mutation scenarios. All data is written to `test_input/`.

Produces three single-genotype BAM/MAF combinations, plus a multiscenario set:
* **Wild-Type:** `sim_wt.sorted.bam` & `sim_wt.sorted.maf`
* **Heterozygous:** `sim_het.sorted.bam` & `sim_het.sorted.maf`
* **Homozygous:** `sim_hom.sorted.bam` & `sim_hom.sorted.maf`
* **Multiscenario:** Contains all possible mutation scenarios below.

Creates symlinks to simulate eight mutation scenarios (nine test cases, since LOH is simulated both by amplification and by deletion):
RNAed, T-RNAed, VSE, T-VSE, VSL, T-VSL, LOH (amp & del), and SOM.

## Wild-Type
Create "wild-type" reads (no SNVs) using `wgsim`, align with `bwa` (creates a SAM), convert to BAM with `samtools`.

In [25]:
%%bash

ref="test_input/e_coli/NC_008253_1K.fna"
log_dir="test_input/sim_wt/logs"
output_file="test_input/sim_wt/sim_wt.sorted.bam"

fq_file="$(mktemp)"
sam_file="$(mktemp)"
bam_file="$(mktemp)"

mkdir -p "$(dirname $output_file)"
mkdir -p "$log_dir"

set -euxo pipefail

# Generate 1,000 reads, with a 0% rate of mutations and 0% error rate.
# Set seed to a constant for reproducibility.
# wgsim outputs paired-end reads; we send second ends to /dev/null to get single-end reads.
# Note: with `2>&1 >file`, only stdout goes to the log; stderr still reaches the notebook output.
wgsim -N 1000 -r 0 -S 8 -e 0 "$ref" "$fq_file" /dev/null 2>&1 >"$log_dir"/wgsim.log
bwa mem -M -t 8 -p "$ref" "$fq_file" >>"$sam_file" 2>"$log_dir"/bwa.log
samtools view -b -S -o "$bam_file" "$sam_file"
samtools sort "$bam_file" -o "$output_file"
samtools index "$output_file"

# Clean up intermediate files
rm "$fq_file"
rm "$sam_file"
rm "$bam_file"

[wgsim] seed = 8
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 1 sequences, total length: 1000


+ wgsim -N 1000 -r 0 -S 8 -e 0 test_input/e_coli/NC_008253_1K.fna /tmp/tmp.aKC6DQZCUv /dev/null
+ bwa mem -M -t 8 -p test_input/e_coli/NC_008253_1K.fna /tmp/tmp.aKC6DQZCUv


Produce simulated mutated reads using `bamsurgeon`. Three types of BAMs are enough to construct all simulated allelic asymmetries:
* Homozygous ref (already created above).
* Heterozygous (50% ref, 50% alt)
* Homozygous alt
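
The spike-in files passed to `addsnv.py -v` in the cell below hold one record per SNV: chromosome, start, end, variant allele fraction, and alternate base. A minimal sketch of generating those records from Python follows. The helper names `spikein_line` and `write_spikeins` are hypothetical, and the tab separator is an assumption (the notebook itself separates fields with spaces).

```python
def spikein_line(chrom, pos, vaf, alt_base):
    """Format one single-base spike-in record: chrom, start, end, VAF, alt base."""
    if not 0.0 <= float(vaf) <= 1.0:
        raise ValueError("VAF must be between 0 and 1")
    if alt_base not in {"A", "C", "G", "T"}:
        raise ValueError("alt base must be one of A, C, G, T")
    # For an SNV the start and end coordinates are the same position.
    # Assumption: tab-separated fields are accepted like the space-separated ones.
    return "\t".join([chrom, str(pos), str(pos), str(vaf), alt_base])


def write_spikeins(path, records):
    """Write (chrom, pos, vaf, alt_base) tuples to `path`, one record per line."""
    with open(path, "w") as handle:
        for record in records:
            handle.write(spikein_line(*record) + "\n")


chrom = "gi|110640213|ref|NC_008253.1|"
het_line = spikein_line(chrom, 200, 0.5, "C")  # heterozygous: 50% alt
hom_line = spikein_line(chrom, 200, 1, "C")    # homozygous alt: 100% alt
```

Validating the VAF and base up front catches typos before a long `bamsurgeon` run starts.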

In [26]:
%%bash

ref="test_input/e_coli/NC_008253_1K.fna"
picard_jar="/seq/software/picard/current/bin/picard.jar"
sim_wt_bam="test_input/sim_wt/sim_wt.sorted.bam"
het_output="test_input/sim_het/sim_het.sorted.bam"
hom_output="test_input/sim_hom/sim_hom.sorted.bam"
het_logs="test_input/sim_het/logs"
hom_logs="test_input/sim_hom/logs"

chrom="gi|110640213|ref|NC_008253.1|"
snv_base="C"
snv_pos="200"
het_vaf="0.5"
hom_vaf="1"

mkdir -p "$(dirname $het_output)"
mkdir -p "$(dirname $hom_output)"
mkdir -p "$het_logs"
mkdir -p "$hom_logs"
het_bam_file="$(mktemp)"
hom_bam_file="$(mktemp)"

set -euxo pipefail

# Make heterozygous BAM
spikein_file="$(mktemp)"
echo "$chrom   $snv_pos     $snv_pos     $het_vaf     $snv_base" >"$spikein_file"

bamsurgeon addsnv.py \
    --single \
    --picardjar "$picard_jar" \
    --aligner mem \
    -v "$spikein_file" \
    -f "$sim_wt_bam" \
    -r "$ref" \
    -o "$het_bam_file" \
    --tmpdir "$het_logs" \
    2>&1 >"$het_logs/bamsurgeon.log"

samtools sort "$het_bam_file" -o "$het_output"
samtools index "$het_output"
rm -r "addsnv_logs_tmp."*

# Make homozygous BAM
echo "$chrom   $snv_pos     $snv_pos     $hom_vaf     $snv_base" >"$spikein_file"

bamsurgeon addsnv.py \
    --single \
    --picardjar "$picard_jar" \
    --aligner mem \
    -v "$spikein_file" \
    -f "$sim_wt_bam" \
    -r "$ref" \
    -o "$hom_bam_file" \
    --tmpdir "$hom_logs" \
    2>&1 >"$hom_logs/bamsurgeon.log"

samtools sort "$hom_bam_file" -o "$hom_output"
samtools index "$hom_output"
rm -r "addsnv_logs_tmp."*

rm "$het_bam_file"
rm "$hom_bam_file"
rm "$spikein_file"

[Fri Apr 14 19:12:23 EDT 2017] picard.sam.SamToFastq INPUT=test_input/sim_het/logs/haplo_gi|110640213|ref|NC_008253.1|_200_200.tmpbam.2ddc01a1-2da2-4ab7-acfd-814e7abdf287.bam FASTQ=test_input/sim_het/logs/haplo_gi|110640213|ref|NC_008253.1|_200_200.tmpbam.2ddc01a1-2da2-4ab7-acfd-814e7abdf287.fastq INCLUDE_NON_PRIMARY_ALIGNMENTS=false VALIDATION_STRINGENCY=SILENT    OUTPUT_PER_RG=false RG_TAG=PU RE_REVERSE=true INTERLEAVE=false INCLUDE_NON_PF_READS=false CLIPPING_MIN_LENGTH=0 READ1_TRIM=0 READ2_TRIM=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Apr 14 19:12:23 EDT 2017] Executing as moorena@cga02.broadinstitute.org on Linux 2.6.32-642.15.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-SNAPSHOT
[Fri Apr 14 19:12:23 EDT 2017] picard.sam.SamToFastq done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
[M::bwa_idx_load_from_disk] r

++ mktemp
+ spikein_file=/tmp/tmp.6dijihv4EW
+ echo 'gi|110640213|ref|NC_008253.1|   200     200     0.5     C'
+ bamsurgeon addsnv.py --single --picardjar /seq/software/picard/current/bin/picard.jar --aligner mem -v /tmp/tmp.6dijihv4EW -f test_input/sim_wt/sim_wt.sorted.bam -r test_input/e_coli/NC_008253_1K.fna -o /tmp/tmp.xL89jSyUqY --tmpdir test_input/sim_het/logs
+ samtools sort /tmp/tmp.xL89jSyUqY -o test_input/sim_het/sim_het.sorted.bam
+ samtools index test_input/sim_het/sim_het.sorted.bam
+ rm -r addsnv_logs_tmp.xL89jSyUqY
+ echo 'gi|110640213|ref|NC_008253.1|   200     200     1     C'
+ bamsurgeon addsnv.py --single --picardjar /seq/software/picard/current/bin/picard.jar --aligner mem -v /tmp/tmp.6dijihv4EW -f test_input/sim_wt/sim_wt.sorted.bam -r test_input/e_coli/NC_008253_1K.fna -o /tmp/tmp.HNilmEjOrV --tmpdir test_input/sim_hom/logs
+ samtools sort /tmp/tmp.HNilmEjOrV -o test_input/sim_hom/sim_hom.sorted.bam
+ samtools index test_input/sim_hom/sim_hom.sorted.bam
+ rm -r 

Call variants using `samtools mpileup` and `bcftools`.

In [27]:
%%bash

declare -A bams=(
    ["wt"]="test_input/sim_wt/sim_wt.sorted.bam" \
    ["het"]="test_input/sim_het/sim_het.sorted.bam" \
    ["hom"]="test_input/sim_hom/sim_hom.sorted.bam" \
)

declare -A outputs=(
    ["wt"]="test_input/sim_wt/sim_wt.sorted.vcf" \
    ["het"]="test_input/sim_het/sim_het.sorted.vcf" \
    ["hom"]="test_input/sim_hom/sim_hom.sorted.vcf" \
)

ref="test_input/e_coli/NC_008253_1K.fna"

set -euxo pipefail

for type in "${!bams[@]}"; do
    samtools mpileup -g -f "$ref" "${bams[$type]}" | \
        bcftools call -c -v - >"${outputs[$type]}"
done

+ for type in '"${!bams[@]}"'
+ samtools mpileup -g -f test_input/e_coli/NC_008253_1K.fna test_input/sim_het/sim_het.sorted.bam
+ bcftools call -c -v -
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
+ for type in '"${!bams[@]}"'
+ samtools mpileup -g -f test_input/e_coli/NC_008253_1K.fna test_input/sim_hom/sim_hom.sorted.bam
+ bcftools call -c -v -
[mpileup] 1 samples in 1 input files
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
<mpileup> Set max per-file depth to 8000
+ for type in '"${!bams[@]}"'
+ samtools mpileup -g -f test_input/e_coli/NC_008253_1K.fna test_input/sim_wt/sim_wt.sorted.bam
+ bcftools call -c -v -
[mpileup] 1 samples in 1 input files
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
<mpileup> Set max per-file depth to 8000
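
To confirm that each VCF carries the intended zygosity, the genotype field can be inspected directly. The sketch below classifies the first sample of a VCF data line using the standard `GT` key in the FORMAT column. The function name `classify_genotype` is hypothetical, and the example records are made up for illustration; they are not taken from the simulated VCFs.

```python
def classify_genotype(vcf_line):
    """Return 'wt', 'het', 'hom', or 'unknown' for the first sample of a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")    # FORMAT column, e.g. GT:PL
    sample_values = fields[9].split(":")  # first sample column
    gt = sample_values[format_keys.index("GT")]
    # Alleles are separated by '/' (unphased) or '|' (phased); 0 is the reference.
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "unknown"  # missing genotype call
    if all(allele == "0" for allele in alleles):
        return "wt"
    if len(set(alleles)) == 1:
        return "hom"
    return "het"


# Made-up example records (10 tab-separated VCF columns each).
het_record = "\t".join(["chr", "200", ".", "T", "C", "50", ".", "DP=100", "GT:PL", "0/1:80,0,80"])
hom_record = "\t".join(["chr", "200", ".", "T", "C", "50", ".", "DP=100", "GT:PL", "1/1:90,30,0"])

het_class = classify_genotype(het_record)
hom_class = classify_genotype(hom_record)
```

Because `bcftools call -v` emits only variant sites, a wild-type VCF would be expected to contain no data lines at all rather than `0/0` records.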


Annotate the VCFs with Oncotator to produce MAFs.

In [28]:
%%bash

eval `/broad/software/dotkit/init -b`
reuse -q .python-2.7.1-sqlite3-rtrees
use -q Oncotator

declare -A vcfs=(
    ["wt"]="test_input/sim_wt/sim_wt.sorted.vcf" \
    ["het"]="test_input/sim_het/sim_het.sorted.vcf" \
    ["hom"]="test_input/sim_hom/sim_hom.sorted.vcf" \
)

declare -A outputs=(
    ["wt"]="test_input/sim_wt/sim_wt.sorted.maf" \
    ["het"]="test_input/sim_het/sim_het.sorted.maf" \
    ["hom"]="test_input/sim_hom/sim_hom.sorted.maf" \
)

set -euxo pipefail

for type in "${!vcfs[@]}"; do
    log_loc="$(dirname ${vcfs[$type]})/logs/oncotator.log"
    oncotator --input_format VCF --log_name "$log_loc" "${vcfs[$type]}" "${outputs[$type]}" hg19
done

Verbose mode on
Path:
['/xchip/tcga/Tools/oncotator/onco_env_2.7.1/bin', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7/site-packages/Oncotator-1.9.0.0-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python27.zip', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7/plat-linux2', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7/lib-tk', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7/lib-old', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7/lib-dynload', '/broad/software/free/Linux/redhat_6_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7', '/broad/software/free/Linux/redhat_6_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7/plat-linux2', '/broad/software/free/Linux/redhat_6_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7/lib-tk', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7/site-packages']
 
Verbose mode on
Path:
['/xchip/tcga/Tools/oncotat

+ for type in '"${!vcfs[@]}"'
++ dirname test_input/sim_het/sim_het.sorted.vcf
+ log_loc=test_input/sim_het/logs/oncotator.log
+ oncotator --input_format VCF --log_name test_input/sim_het/logs/oncotator.log test_input/sim_het/sim_het.sorted.vcf test_input/sim_het/sim_het.sorted.maf hg19
2017-04-14 19:13:02,142 INFO [oncotator.Oncotator:239] Oncotator v1.9.0.0
2017-04-14 19:13:02,142 INFO [oncotator.Oncotator:240] Args: Namespace(allow_overwriting=False, cache_url=None, canonical_tx_file=None, collapse_filter_cols=False, collapse_number_annotations=False, dbDir='/xchip/cga/reference/annotation/db/oncotator_v1_ds_gencode_current/', default_cli=[], default_config=None, genome_build='hg19', infer_genotypes='false', infer_onps=False, input_file='test_input/sim_het/sim_het.sorted.vcf', input_format='VCF', log_name='test_input/sim_het/logs/oncotator.log', noMulticore=False, output_file='test_input/sim_het/sim_het.sorted.maf', output_format='TCGAMAF', override_cli=[], override_config=None, pre

Create symlinks for each possible mutation scenario.

Scenarios are stored in `test_input/scenarios`.

| Scen.\Files | dna_normal.maf     | dna_tumor.maf      | rna_normal.maf     | rna_tumor.maf      |
|-------------|--------------------|--------------------|--------------------|--------------------|
| RNAed       | sim_wt.sorted.maf  | sim_wt.sorted.maf  | sim_het.sorted.maf | sim_het.sorted.maf |
| T-RNAed     | sim_wt.sorted.maf  | sim_wt.sorted.maf  | sim_wt.sorted.maf  | sim_het.sorted.maf |
| VSE         | sim_het.sorted.maf | sim_het.sorted.maf | sim_hom.sorted.maf | sim_hom.sorted.maf |
| T-VSE       | sim_het.sorted.maf | sim_het.sorted.maf | sim_het.sorted.maf | sim_hom.sorted.maf |
| VSL         | sim_het.sorted.maf | sim_het.sorted.maf | sim_wt.sorted.maf  | sim_wt.sorted.maf  |
| T-VSL       | sim_het.sorted.maf | sim_het.sorted.maf | sim_het.sorted.maf | sim_wt.sorted.maf  |
| LOH (amp)   | sim_het.sorted.maf | sim_hom.sorted.maf | sim_het.sorted.maf | sim_hom.sorted.maf |
| LOH (del)   | sim_het.sorted.maf | sim_wt.sorted.maf  | sim_het.sorted.maf | sim_wt.sorted.maf  |
| SOM         | sim_wt.sorted.maf  | sim_het.sorted.maf | sim_wt.sorted.maf  | sim_het.sorted.maf |
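
The table can also be expressed as data, which makes it easy to check against the symlinks actually created. The following sketch only computes `(link, target)` pairs and does not touch the filesystem. The function name `symlink_plan` is hypothetical; the scenario keys, file names, and relative targets mirror the table and the bash cell.

```python
import posixpath

# Genotype per file, in the order: dna_normal, dna_tumor, rna_normal, rna_tumor.
SCENARIOS = {
    "rnaed": "wt wt het het",
    "t-rnaed": "wt wt wt het",
    "vse": "het het hom hom",
    "t-vse": "het het het hom",
    "vsl": "het het wt wt",
    "t-vsl": "het het het wt",
    "loh-amp": "het hom het hom",
    "loh-del": "het wt het wt",
    "som": "wt het wt het",
}
OUTPUT_FILES = ("dna_normal.maf", "dna_tumor.maf", "rna_normal.maf", "rna_tumor.maf")


def symlink_plan(scen_dir="test_input/scenarios"):
    """Return a list of (link_path, relative_target) pairs for every scenario."""
    plan = []
    for scen, types in SCENARIOS.items():
        for zygosity, output_file in zip(types.split(), OUTPUT_FILES):
            # Targets are relative to the scenario directory holding the link.
            target = "../../sim_{0}/sim_{0}.sorted.maf".format(zygosity)
            plan.append((posixpath.join(scen_dir, scen, output_file), target))
    return plan


plan = symlink_plan()
```

Each pair could then be compared with `os.readlink(link_path)` to verify an existing scenario tree.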

In [49]:
%%bash

declare -A scenarios=(\
    ["rnaed"]="wt wt het het" \
    ["t-rnaed"]="wt wt wt het" \
    ["vse"]="het het hom hom" \
    ["t-vse"]="het het het hom" \
    ["vsl"]="het het wt wt" \
    ["t-vsl"]="het het het wt" \
    ["loh-amp"]="het hom het hom" \
    ["loh-del"]="het wt het wt" \
    ["som"]="wt het wt het" \
)

output_files=("dna_normal.maf" "dna_tumor.maf" "rna_normal.maf" "rna_tumor.maf")

# MAFs (target paths to ln) must be relative to the directories in which the symlinks are created
wt_maf="../../sim_wt/sim_wt.sorted.maf"
het_maf="../../sim_het/sim_het.sorted.maf"
hom_maf="../../sim_hom/sim_hom.sorted.maf"
scen_dir="test_input/scenarios"

set -euxo pipefail

for scen in "${!scenarios[@]}"; do
    scen_types=(${scenarios[$scen]})
    for i in $(seq 0 $((${#output_files[@]}-1))); do
        case "${scen_types[$i]}" in
            "wt") target_bam="$wt_maf";;
            "het") target_bam="$het_maf";;
            "hom") target_bam="$hom_maf";;
        esac

        mkdir -p "$scen_dir/$scen"
        ln -sf "$target_bam" "$scen_dir/$scen/${output_files[$i]}"
    done
done

+ for scen in '"${!scenarios[@]}"'
+ scen_types=(${scenarios[$scen]})
++ seq 0 3
+ for i in '$(seq 0 $((${#output_files[@]}-1)))'
+ case "${scen_types[$i]}" in
+ target_bam=../../sim_wt/sim_wt.sorted.maf
+ mkdir -p test_input/scenarios/t-rnaed
+ ln -sf ../../sim_wt/sim_wt.sorted.maf test_input/scenarios/t-rnaed/dna_normal.maf
+ for i in '$(seq 0 $((${#output_files[@]}-1)))'
+ case "${scen_types[$i]}" in
+ target_bam=../../sim_wt/sim_wt.sorted.maf
+ mkdir -p test_input/scenarios/t-rnaed
+ ln -sf ../../sim_wt/sim_wt.sorted.maf test_input/scenarios/t-rnaed/dna_tumor.maf
+ for i in '$(seq 0 $((${#output_files[@]}-1)))'
+ case "${scen_types[$i]}" in
+ target_bam=../../sim_wt/sim_wt.sorted.maf
+ mkdir -p test_input/scenarios/t-rnaed
+ ln -sf ../../sim_wt/sim_wt.sorted.maf test_input/scenarios/t-rnaed/rna_normal.maf
+ for i in '$(seq 0 $((${#output_files[@]}-1)))'
+ case "${scen_types[$i]}" in
+ target_bam=../../sim_het/sim_het.sorted.maf
+ mkdir -p test_input/scenarios/t-rnaed
+ ln -sf ../..

Create BAMs for the multiscenario, no-normal-DNA, normal-only, and DNA-only scenario sets.

Run R2D2 on each test scenario; print called scenarios for each test file.

In [67]:
%%bash

scenarios=("rnaed" "t-rnaed" "vse" "t-vse" "vsl" "t-vsl" "loh-amp" "loh-del" "som")
base_dir="test_input/scenarios/"
output_dir="test_input/output"
mkdir -p "$output_dir"
. venv/bin/activate

set -euo pipefail

for scen in "${scenarios[@]}"; do
    output_path="test_input/output/$scen.tsv"
    python r2d2.py \
        --dna_normal "$base_dir/$scen/dna_normal.maf" \
        --dna_tumor "$base_dir/$scen/dna_tumor.maf" \
        --rna_normal "$base_dir/$scen/rna_normal.maf" \
        --rna_tumor "$base_dir/$scen/rna_tumor.maf" \
        --output "$output_path"
        
    echo "$scen: $(awk 'NR==2 {print $1}' "$output_path")"
done

rnaed: rnaed_all_inputs
t-rnaed: t_rnaed_all_inputs
vse: vse_all_inputs
t-vse: t_vse_all_inputs
vsl: vsl_all_inputs
t-vsl: t_vsl_all_inputs
loh-amp: loh_all_inputs
loh-del: loh_all_inputs
som: somatic_all_inputs
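
The report above can be turned into an automated check. This sketch parses `scenario: label` lines and compares them with the labels shown in the output. The names `EXPECTED_LABELS`, `parse_report`, and `mismatches` are hypothetical helpers and are not part of R2D2.

```python
# Expected R2D2 label for each scenario directory, as printed in the report.
EXPECTED_LABELS = {
    "rnaed": "rnaed_all_inputs",
    "t-rnaed": "t_rnaed_all_inputs",
    "vse": "vse_all_inputs",
    "t-vse": "t_vse_all_inputs",
    "vsl": "vsl_all_inputs",
    "t-vsl": "t_vsl_all_inputs",
    "loh-amp": "loh_all_inputs",  # both LOH variants share one label
    "loh-del": "loh_all_inputs",
    "som": "somatic_all_inputs",
}


def parse_report(text):
    """Parse 'scenario: label' lines into a dict, skipping other lines."""
    called = {}
    for line in text.splitlines():
        if ":" in line:
            scen, label = line.split(":", 1)
            called[scen.strip()] = label.strip()
    return called


def mismatches(called):
    """Return {scenario: (expected, called)} for every disagreeing scenario."""
    return {
        scen: (EXPECTED_LABELS.get(scen), label)
        for scen, label in called.items()
        if EXPECTED_LABELS.get(scen) != label
    }


report = "rnaed: rnaed_all_inputs\nloh-amp: loh_all_inputs\nsom: somatic_all_inputs"
problems = mismatches(parse_report(report))
```

An empty `problems` dict means every parsed scenario was called as expected.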




**Multiscenario:** Create BAMs

In [132]:
%%bash

ref="test_input/e_coli/NC_008253_1K.fna"
picard_jar="/seq/software/picard/current/bin/picard.jar"
sim_wt_bam="test_input/sim_wt/sim_wt.sorted.bam"
multiscen_logs="test_input/multiscen/logs"
chrom="gi|110640213|ref|NC_008253.1|"

declare -A multiscen_output
multiscen_output["dna_normal"]="test_input/multiscen/multiscen_dna_normal.sorted.bam"
multiscen_output["dna_tumor"]="test_input/multiscen/multiscen_dna_tumor.sorted.bam"
multiscen_output["rna_normal"]="test_input/multiscen/multiscen_rna_normal.sorted.bam"
multiscen_output["rna_tumor"]="test_input/multiscen/multiscen_rna_tumor.sorted.bam"

declare -A spikein_files
spikein_files["dna_normal"]="$(mktemp)"
cat <<EOM >"${spikein_files["dna_normal"]}"
$chrom    300    300    0.5    C
$chrom    400    400    0.5    C
$chrom    500    500    0.5    C
$chrom    600    600    0.5    G
$chrom    700    700    0.5    G
EOM

spikein_files["dna_tumor"]="$(mktemp)"
cat <<EOM >"${spikein_files["dna_tumor"]}"
$chrom    300    300    0.5    C
$chrom    400    400    0.5    C
$chrom    500    500    0.5    C
$chrom    600    600    0.5    G
$chrom    700    700    1    G
$chrom    800    800    0.5    C
EOM

spikein_files["rna_normal"]="$(mktemp)"
cat <<EOM >"${spikein_files["rna_normal"]}"
$chrom    100    100    1    C
$chrom    300    300    1    C
$chrom    400    400    0.5    C
$chrom    600    600    0.5    G
$chrom    700    700    0.5    G
EOM

spikein_files["rna_tumor"]="$(mktemp)"
cat <<EOM >"${spikein_files["rna_tumor"]}"
$chrom    100    100    1    C
$chrom    200    200    1    C
$chrom    300    300    1    C
$chrom    400    400    1    C
$chrom    800    800    0.5    C
EOM

set -euxo pipefail
mkdir -p "$multiscen_logs"
for output_type in "${!multiscen_output[@]}"; do    
    multiscen_bam_file="$(mktemp)"

    (
        bamsurgeon addsnv.py \
            --single \
            --picardjar "$picard_jar" \
            --aligner mem \
            -v "${spikein_files[$output_type]}" \
            -f "$sim_wt_bam" \
            -r "$ref" \
            -o "$multiscen_bam_file" \
            -z 100 \
            --tmpdir "$multiscen_logs" \
            --mindepth 0 \
            -p 8 \
            2>&1 >"$multiscen_logs/bamsurgeon_$output_type.log"

        samtools sort "$multiscen_bam_file" -o "${multiscen_output[$output_type]}"
        samtools index "${multiscen_output[$output_type]}"

        rm "${spikein_files[$output_type]}"
        rm "$multiscen_bam_file"
    ) &
done

wait

rm -r "addsnv_logs_tmp."*

[Tue Apr 18 22:42:34 EDT 2017] picard.sam.SamToFastq INPUT=test_input/multiscen/logs/haplo_gi|110640213|ref|NC_008253.1|_200_200.tmpbam.17f949e1-88de-4bfb-a55c-03a6a3bda4cb.bam FASTQ=test_input/multiscen/logs/haplo_gi|110640213|ref|NC_008253.1|_200_200.tmpbam.17f949e1-88de-4bfb-a55c-03a6a3bda4cb.fastq INCLUDE_NON_PRIMARY_ALIGNMENTS=false VALIDATION_STRINGENCY=SILENT    OUTPUT_PER_RG=false RG_TAG=PU RE_REVERSE=true INTERLEAVE=false INCLUDE_NON_PF_READS=false CLIPPING_MIN_LENGTH=0 READ1_TRIM=0 READ2_TRIM=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Apr 18 22:42:34 EDT 2017] Executing as moorena@cga02.broadinstitute.org on Linux 2.6.32-642.15.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Picard version: 2.9.0-SNAPSHOT
[Tue Apr 18 22:42:34 EDT 2017] picard.sam.SamToFastq INPUT=test_input/multiscen/logs/haplo_gi|110640213|ref|NC_008253.1|_500_500.tmpbam.ce3bfcd0-

+ mkdir -p test_input/multiscen/logs
+ for output_type in '"${!multiscen_output[@]}"'
++ mktemp
+ multiscen_bam_file=/tmp/tmp.JctsEjn9G5
+ for output_type in '"${!multiscen_output[@]}"'
+ bamsurgeon addsnv.py --single --picardjar /seq/software/picard/current/bin/picard.jar --aligner mem -v /tmp/tmp.OBCbR3xocd -f test_input/sim_wt/sim_wt.sorted.bam -r test_input/e_coli/NC_008253_1K.fna -o /tmp/tmp.JctsEjn9G5 -z 100 --tmpdir test_input/multiscen/logs --mindepth 0 -p 8
++ mktemp
+ multiscen_bam_file=/tmp/tmp.EsjNVYK7qL
+ for output_type in '"${!multiscen_output[@]}"'
+ bamsurgeon addsnv.py --single --picardjar /seq/software/picard/current/bin/picard.jar --aligner mem -v /tmp/tmp.MnrJMoRcxb -f test_input/sim_wt/sim_wt.sorted.bam -r test_input/e_coli/NC_008253_1K.fna -o /tmp/tmp.EsjNVYK7qL -z 100 --tmpdir test_input/multiscen/logs --mindepth 0 -p 8
++ mktemp
+ multiscen_bam_file=/tmp/tmp.0JDKspwi3q
+ for output_type in '"${!multiscen_output[@]}"'
+ bamsurgeon addsnv.py --single --picardjar 

**Multiscenario:** Mpileup and Oncotate

In [141]:
%%bash

eval `/broad/software/dotkit/init -b`
reuse -q .python-2.7.1-sqlite3-rtrees
use -q Oncotator

ref="test_input/e_coli/NC_008253_1K.fna"
multiscen_base="test_input/multiscen"
types=("dna_normal" "dna_tumor" "rna_normal" "rna_tumor")
log_dir="$multiscen_base/logs"
mkdir -p "$log_dir"

declare -A bams
declare -A outputs
for type in "${types[@]}"; do
    bams["$type"]="$multiscen_base/multiscen_$type.sorted.bam"
    outputs["$type"]="$multiscen_base/multiscen_$type.sorted.vcf"
done

set -euxo pipefail

# SNV calling (mpileup -> bcftools) and Oncotation
for type in "${!bams[@]}"; do
    vcf="${outputs[$type]}" # we need to do string substitution, which doesn't work with array dereferences
    samtools mpileup -g -f "$ref" "${bams[$type]}" | \
        bcftools call -c -v - >"$vcf"
        
    log_loc="$log_dir/oncotator_$type.log"
    output_name="${vcf%.*}.maf"
    oncotator --input_format VCF --log_name "$log_loc" "$vcf" "$output_name" hg19
done

Verbose mode on
Path:
['/xchip/tcga/Tools/oncotator/onco_env_2.7.1/bin', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7/site-packages/Oncotator-1.9.0.0-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python27.zip', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7/plat-linux2', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7/lib-tk', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7/lib-old', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7/lib-dynload', '/broad/software/free/Linux/redhat_6_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7', '/broad/software/free/Linux/redhat_6_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7/plat-linux2', '/broad/software/free/Linux/redhat_6_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7/lib-tk', '/xchip/tcga/Tools/oncotator/onco_env_2.7.1/lib/python2.7/site-packages']
 
Verbose mode on
Path:
['/xchip/tcga/Tools/oncotat

+ for type in '"${!bams[@]}"'
+ vcf=test_input/multiscen/multiscen_rna_tumor.sorted.vcf
+ samtools mpileup -g -f test_input/e_coli/NC_008253_1K.fna test_input/multiscen/multiscen_rna_tumor.sorted.bam
+ bcftools call -c -v -
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
+ log_loc=test_input/multiscen/logs/oncotator_rna_tumor.log
+ output_name=test_input/multiscen/multiscen_rna_tumor.sorted.maf
+ oncotator --input_format VCF --log_name test_input/multiscen/logs/oncotator_rna_tumor.log test_input/multiscen/multiscen_rna_tumor.sorted.vcf test_input/multiscen/multiscen_rna_tumor.sorted.maf hg19
2017-04-19 00:30:18,026 INFO [oncotator.Oncotator:239] Oncotator v1.9.0.0
2017-04-19 00:30:18,026 INFO [oncotator.Oncotator:240] Args: Namespace(allow_overwriting=False, cache_url=None, canonical_tx_file=None, collapse_filter_cols=False, collapse_number_annotations=False, dbDi

Call R2D2 on multiscenario MAFs.

In [192]:
%%bash

base_dir="test_input/multiscen"
output_path="test_input/output/multiscen.tsv"
. venv/bin/activate

set -euo pipefail

python r2d2.py \
    --dna_normal "$base_dir/multiscen_dna_normal.sorted.maf" \
    --dna_tumor "$base_dir/multiscen_dna_tumor.sorted.maf" \
    --rna_normal "$base_dir/multiscen_rna_normal.sorted.maf" \
    --rna_tumor "$base_dir/multiscen_rna_tumor.sorted.maf" \
    --output "$output_path"

In [193]:
import pandas as pd

df = pd.read_csv('test_input/output/multiscen.tsv', sep='\t')
df.sort_values('Start_position')

Unnamed: 0,scenario,Hugo_Symbol,Chromosome,Start_position,End_position,Strand,Variant_Classification,Variant_Type,Reference_Allele,DNA_Normal_Allele1,DNA_Normal_Allele2,DNA_Tumor_Allele1,DNA_Tumor_Allele2,RNA_Normal_Allele1,RNA_Normal_Allele2,RNA_Tumor_Allele1,RNA_Tumor_Allele2
6,rnaed_all_inputs,Unknown,gi|110640213|ref|NC_008253.1|,100,100,__UNKNOWN__,IGR,SNP,,,,,,T,C,T,C
7,t_rnaed_all_inputs,Unknown,gi|110640213|ref|NC_008253.1|,200,200,__UNKNOWN__,IGR,SNP,,,,,,,,T,C
0,vse_all_inputs,Unknown,gi|110640213|ref|NC_008253.1|,300,300,__UNKNOWN__,IGR,SNP,G,G,C,G,C,G,C,G,C
1,t_vse_all_inputs,Unknown,gi|110640213|ref|NC_008253.1|,400,400,__UNKNOWN__,IGR,SNP,A,A,C,A,C,A,C,A,C
2,vsl_all_inputs,Unknown,gi|110640213|ref|NC_008253.1|,500,500,__UNKNOWN__,IGR,SNP,T,T,C,T,C,,,,
3,t_vsl_all_inputs,Unknown,gi|110640213|ref|NC_008253.1|,600,600,__UNKNOWN__,IGR,SNP,C,C,G,C,G,C,G,,
4,loh_all_inputs,Unknown,gi|110640213|ref|NC_008253.1|,700,700,__UNKNOWN__,IGR,SNP,T,T,G,T,G,T,G,,
5,somatic_all_inputs,Unknown,gi|110640213|ref|NC_008253.1|,800,800,__UNKNOWN__,IGR,SNP,,,,G,C,,,G,C
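
For larger outputs it helps to tally how many sites fall into each scenario. A minimal sketch using only the standard library is shown here, assuming a tab-separated file with a `scenario` column as in the table above. The function name `scenario_counts` is hypothetical, and the sample text is an abbreviated, made-up stand-in for a real R2D2 output file.

```python
import csv
import io
from collections import Counter


def scenario_counts(tsv_text):
    """Count rows per value of the 'scenario' column in tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row["scenario"] for row in reader)


# Made-up sample: an unnamed index column, the scenario, and a start position.
sample = (
    "\tscenario\tStart_position\n"
    "0\tvse_all_inputs\t300\n"
    "4\tloh_all_inputs\t700\n"
    "9\tloh_all_inputs\t900\n"
)
counts = scenario_counts(sample)
```

For a file on disk, the same function can be fed `open(path).read()`.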


## Lung CA test

In [9]:
%%bash

base_dir="test_input/lung_ca"
output_path="test_input/output/lung_ca/lung_ca_1.tsv"
. venv/bin/activate

set -euxo pipefail

python r2d2.py \
    --dna_normal "/local/cga-fh/cga/DNA_RNA_Mutation_Lung/Pair/lung_ca_1_TN/jobs/capture/germline/SingleSampleVCF2MAF/job.86959054/lung_ca_1_germline.maf" \
    --dna_normal_column "allele_frequency" \
    --dna_tumor "/local/cga-fh/cga/DNA_RNA_Mutation_Lung/Pair/lung_ca_1_TN/jobs/capture/snv/maf_pon_filter/job.87162299/lung_ca_1_TN.pon_annotated.pass.maf" \
    --dna_tumor_column "i_tumor_f_full" \
    --rna_tumor "/local/cga-fh/cga/DNA_RNA_Mutation_Lung/Pair/lung_ca_1_TN/jobs/maf_basic_filter_rna/job.87180341/lung_ca_1_TN.basic.filter.pass.maf" \
    --rna_tumor_column "i_tumor_f_full" \
    --rna_normal "/local/cga-fh/cga/DNA_RNA_Mutation_Lung/Pair/lung_ca_1_TN/jobs/capture/germline/OncotatorVcfToMaf/job.86903155/lung_ca_1_TN.capture.germline.maf" \
    --rna_normal_column "i_allele_frequency" \
    --output "$output_path"

+ python r2d2.py --dna_normal /local/cga-fh/cga/DNA_RNA_Mutation_Lung/Pair/lung_ca_1_TN/jobs/capture/germline/SingleSampleVCF2MAF/job.86959054/lung_ca_1_germline.maf --dna_normal_column allele_frequency --dna_tumor /local/cga-fh/cga/DNA_RNA_Mutation_Lung/Pair/lung_ca_1_TN/jobs/capture/snv/maf_pon_filter/job.87162299/lung_ca_1_TN.pon_annotated.pass.maf --dna_tumor_column i_tumor_f_full --rna_tumor /local/cga-fh/cga/DNA_RNA_Mutation_Lung/Pair/lung_ca_1_TN/jobs/maf_basic_filter_rna/job.87180341/lung_ca_1_TN.basic.filter.pass.maf --rna_tumor_column i_tumor_f_full --rna_normal /local/cga-fh/cga/DNA_RNA_Mutation_Lung/Pair/lung_ca_1_TN/jobs/capture/germline/OncotatorVcfToMaf/job.86903155/lung_ca_1_TN.capture.germline.maf --rna_normal_column i_allele_frequency --output test_input/output/lung_ca/lung_ca_1.tsv
  from pkg_resources import resource_stream
INFO:root:Loaded dna_normal data from /local/cga-fh/cga/DNA_RNA_Mutation_Lung/Pair/lung_ca_1_TN/jobs/capture/germline/SingleSampleVCF2MAF/job.86

In [12]:
import pandas as pd

df = pd.read_csv('test_input/output/lung_ca/lung_ca_1.tsv', sep='\t')
df.sort_values('Start_position', inplace=True)
df.to_csv('lung_ca_1_scenarios.csv', index=False, header=True)
df

Unnamed: 0,scenario,Hugo_Symbol,Chromosome,Start_position,End_position,Strand,Variant_Classification,Variant_Type,Reference_Allele,DNA_Normal_Allele1,DNA_Normal_Allele2,DNA_Tumor_Allele1,DNA_Tumor_Allele2,RNA_Normal_Allele1,RNA_Normal_Allele2,RNA_Tumor_Allele1,RNA_Tumor_Allele2
7298,t_rnaed_all_inputs,Unknown,GL000195.1,43874.0,43874.0,+,IGR,SNP,,,,,,,,A,C
5644,loh_all_inputs,RPH3AL,17,63683.0,63683.0,+,Missense_Mutation,SNP,G,G,A,,,G,A,,
1864,loh_all_inputs,PLEKHG4B,5,143197.0,143197.0,+,Missense_Mutation,SNP,G,G,A,,,G,A,,
1865,loh_all_inputs,PLEKHG4B,5,143534.0,143534.0,+,Missense_Mutation,SNP,G,G,A,,,G,A,,
1866,loh_all_inputs,PLEKHG4B,5,156287.0,156287.0,+,Silent,SNP,G,G,A,,,G,A,,
1867,loh_all_inputs,PLEKHG4B,5,174106.0,174106.0,+,Missense_Mutation,SNP,G,G,A,,,G,A,,
1868,loh_all_inputs,PLEKHG4B,5,181660.0,181660.0,+,Silent,SNP,T,T,G,,,T,G,,
1869,loh_all_inputs,PLEKHG4B,5,181730.0,181730.0,+,Missense_Mutation,SNP,C,C,G,,,C,G,,
1870,loh_all_inputs,PLEKHG4B,5,181762.0,181762.0,+,Silent,SNP,A,A,G,,,A,G,,
1871,loh_all_inputs,CCDC127,5,205565.0,205565.0,+,Silent,SNP,G,G,A,,,G,A,,
