# Workflow for Calling Variants on Founders

## Overview:

![workflow_founders](../../figures/FOUNDERS_workflow.png)

### create g.vcfs per sample:
 - edit [config.yaml](./snakemake_version/config/config.yaml) to match your data

Then run the snakemake pipeline.

On Rackham:
 - edit [slurm.json](./snakemake_version/config/slurm.json) to set the project name and the default time per job

In [11]:
%%bash
num_jobs=150 # number of concurrently running jobs on the cluster
cd ./snakemake_version/
nohup snakemake -j "$num_jobs" \
                -s map_and_call_founders.snek \
                --cluster-config config/slurm.json \
                --cluster config/slurm_scheduler.py \
                --cluster-status config/slurm_status.py \
                --rerun-incomplete \
                --use-conda > make_gvcfs_founders.out &

### call variants on all big chromosomes:
- get list of all chromosomes, e.g.:
```bash
zgrep "^>" path/to/reference.fna.gz | sort -u > ./allchroms.txt
```
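
The lines collected in `allchroms.txt` are full FASTA headers, so they still carry the leading `>` and any description after the first whitespace. A minimal sketch of reducing them to bare sequence names, assuming the name is the first whitespace-delimited token of each header; the helper name `fasta_names` and the example headers are made up for illustration:

```bash
# Hypothetical helper: turn FASTA header lines (">name description ...")
# into bare sequence names. Reads stdin, writes stdout.
fasta_names() {
    sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Example with made-up headers; in practice, feed it ./allchroms.txt.
printf '>chrA some description\n>NW_0001.1 unplaced scaffold\n' | fasta_names
# prints:
# chrA
# NW_0001.1
```

With real data this would be called as `fasta_names < ./allchroms.txt`.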

The big chromosomes are listed in [bigchroms.txt](./call_variants_big_chromosomes/bigchroms.txt)
and split into three size classes for job-resource allocation:
 - [bigchroms_large.txt](./call_variants_big_chromosomes/bigchroms_large.txt)  
 - [bigchroms_medium.txt](./call_variants_big_chromosomes/bigchroms_medium.txt)  
 - [bigchroms_small.txt](./call_variants_big_chromosomes/bigchroms_small.txt)  

Create an sbatch file for each big chromosome:  
large chromosomes get 20 threads, medium 15 and small 10.
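
The mapping from size class to thread count can be sketched as below. This is not the content of `create_sbatch_for_chrom.sh`; it only illustrates the allocation rule stated above. The function names `threads_for_class` and `list_chrom_threads` are hypothetical, and the sketch assumes each list file holds one chromosome name per line:

```bash
# Map a size class to its thread count (large 20, medium 15, small 10).
threads_for_class() {
    case "$1" in
        large)  echo 20 ;;
        medium) echo 15 ;;
        small)  echo 10 ;;
        *)      echo "unknown size class: $1" >&2; return 1 ;;
    esac
}

# Print "<chromosome> <threads>" for every entry of the three list files
# found in directory $1. A missing list file is skipped.
list_chrom_threads() {
    local dir=$1 class chrom
    for class in large medium small; do
        if [ ! -f "$dir/bigchroms_${class}.txt" ]; then
            continue
        fi
        while IFS= read -r chrom || [ -n "$chrom" ]; do
            if [ -n "$chrom" ]; then
                echo "$chrom $(threads_for_class "$class")"
            fi
        done < "$dir/bigchroms_${class}.txt"
    done
}
```

Calling `list_chrom_threads ./call_variants_big_chromosomes` would then give a quick overview of which chromosome is assigned how many threads before any sbatch files are written.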

In [15]:
%%bash
bash ./call_variants_big_chromosomes/create_sbatch_for_chrom.sh

This runs [gatk_gt_gvcf_creator.py](./call_variants_big_chromosomes/gatk_gt_gvcf_creator.py) for each big chromosome and writes an sbatch file with resources set according to the chromosome's size class.

In [None]:
%%bash
# submit sbatch files:
for f in /home/tilman/storage/subset3/vcf/run_001*; do sbatch "$f"; done

### call variants on all small and unassociated scaffolds:
#### extract them from each .g.vcf
- in the reference that I'm using, their names start with "NW"

In [None]:
%%bash
sbatch ./extract_small_scaffolds/filter_gvcfs_per_chrom.sh

This script uses [filter_gvcfs_per_chrom.py](./extract_small_scaffolds/filter_gvcfs_per_chrom.py).
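
The extraction comes down to keeping the header plus every record on a contig whose name starts with `NW`. A minimal sketch of that idea using `awk` instead of the cyvcf2-based script, assuming an uncompressed g.vcf on stdin; the function name `keep_nw_scaffolds` is hypothetical:

```bash
# Keep all header lines (starting with "#") plus the records whose
# CHROM column (first tab-separated field) starts with "NW".
keep_nw_scaffolds() {
    awk -F'\t' '/^#/ || $1 ~ /^NW/'
}
```

It would be used as `keep_nw_scaffolds < sample.g.vcf > sample.small.g.vcf`, where both file names are placeholders. Unlike a VCF-aware library, this does not touch the `##contig` header lines of the chromosomes that were dropped.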

#### fix filtered g.vcf files:
GATK is stricter about floating-point numbers than htslib, which cyvcf2 uses. htslib writes 0.0 as 0, and GATK doesn't accept that. Apparently this is fixed in GATK 4.0, but I'm using 3.8.

I found a Perl script online that fixes this, written by Petr Danecek:  
https://github.com/samtools/bcftools/blob/develop/misc/fix-broken-GATK-Double-vs-Integer

I saved it as [fix_broken_gvcf.pl](./extract_small_scaffolds/fix_broken_gvcf.pl).
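
To get a feel for which fields can be affected, it may help to list the annotations that the g.vcf header declares as `Type=Float`, since those are the ones where a bare `0` in place of `0.0` can occur. The sketch below only inspects the header and is not what the Perl script does; the function name `list_float_fields` is hypothetical, and it assumes header lines of the usual `##INFO=<ID=...,Number=...,Type=...>` form:

```bash
# List the INFO and FORMAT IDs that a VCF header declares as Type=Float.
# Reads a VCF (or just its header) on stdin; prints "<INFO|FORMAT> <ID>".
list_float_fields() {
    grep -E '^##(INFO|FORMAT)=<' \
        | grep 'Type=Float' \
        | sed -E 's/^##(INFO|FORMAT)=<ID=([^,]+),.*/\1 \2/'
}
```

For example, `list_float_fields < sample.g.vcf` (placeholder file name) prints one line per Float-typed annotation.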


In [1]:
%%bash
#sbatch ./extract_small_scaffolds/fix_small_gvcfs.sh
cat ./extract_small_scaffolds/fix_small_gvcfs.sh


#!/bin/bash -l
#SBATCH -A snic2018-3-170
# -p selects the type of resource (whole node or core), -n the number of them
#SBATCH -p core -n 10
#SBATCH -t 2:00:00
#SBATCH -J fix_gvcf
#SBATCH --mail-type=all
#SBATCH --get-user-env
#SBATCH --mail-user=tillman.ronneburg@imbim.uu.se
#SBATCH -o /home/tilman/storage/subset3/fix_gvcf_%j.out
#SBATCH -e /home/tilman/storage/subset3/fix_gvcf_%j.error
in_dir=/home/tilman/storage/subset3/gvcf_small
out_dir=/home/tilman/storage/subset3/gvcf_small_floatingpoint_fixed
# fix each small-scaffold g.vcf in parallel and write it under the same name to $out_dir
ls "$in_dir"/*.vcf | xargs -n1 basename | xargs -P20 -I{} bash -c \
    "/home/tilman/scripts/fix_broken_gvcf.pl < $in_dir/{} > $out_dir/{}"



#### call variants on small scaffolds:

In [24]:
%%bash 
sbatch ./extract_small_scaffolds/call_variants_small_fp_fixed_sbatch.sh

### Filter variants and merge them:
 - the config file is at [./filtering/config/config.yaml](./filtering/config/config.yaml)  
 
Run lenient filtering on Rackham:

In [None]:
%%bash 
cd ./filtering/
bash ./run_filtering.sh