### __Pipeline 1: SNP annotation__

#### __Requirements__

- a Python environment (installed with conda, for example);
- SNPEff (and the organism's database: the latest _Drosophila melanogaster_ genome build);
- SIFT4G (and the database: the latest _Drosophila melanogaster_ genome build);
- the Python `subprocess` module.

Load the Python libraries and the two helper functions used in this notebook:

In [1]:
import os
import subprocess
from simplify_snpeff import simplify_snpeff
from vcf_to_tsv import vcf_to_tsv

This section documents the steps for annotating a multi-sample `.vcf` with *SNPEff* and *SIFT4G*. The pipeline also takes the annotated multi-sample VCF files and produces a `.tsv` table containing the annotation and other relevant information, such as codons, amino acids and, most importantly, the allele counts. The allele counts are needed to calculate the site-frequency spectrum (SFS) in the next pipeline.
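To make the role of the allele counts concrete, here is a minimal sketch of how an unfolded SFS could be tallied from per-site alternative allele counts. The function name and the input list are hypothetical, and the sketch assumes the alternative allele is the derived one; the actual calculation belongs to the next pipeline and may differ.

```python
from collections import Counter

def site_frequency_spectrum(alt_counts, n_chromosomes):
    """Tally how many sites carry 1, 2, ..., n-1 copies of the alternative allele."""
    tally = Counter(alt_counts)
    # Only segregating classes are kept: counts of 0 and n are monomorphic.
    return [tally.get(k, 0) for k in range(1, n_chromosomes)]

# Hypothetical alternative-allele counts at six sites, sample of 5 chromosomes
sfs = site_frequency_spectrum([1, 1, 2, 4, 1, 3], n_chromosomes=5)
print(sfs)
```

Each position `k` of the returned list holds the number of sites where the alternative allele was observed `k + 1` times.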

Run the following code to check if the interface to bash is working:

In [1]:
!echo "Hello!"

Hello!
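Since `subprocess` is listed in the requirements, the same check can be sketched from plain Python, which is useful when the notebook code is run as a script and the `!` syntax is not available. This is only an illustration of the standard-library call, not part of the original pipeline.

```python
import subprocess

# Run a trivial shell command and capture what it prints
completed = subprocess.run(
    ["echo", "Hello!"], capture_output=True, text=True, check=True
)
print(completed.stdout.strip())
```

With `check=True`, a non-zero exit status raises `CalledProcessError`, so a failing external tool is noticed immediately.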


Let's get started with the pipeline. The first step annotates the SNPs in a VCF with SNPEff (SNPEff should be available on your computer with the desired database, in this case the latest _Drosophila melanogaster_ genome build). First, type `!ls` to see which folder we are in:

In [2]:
!ls

__init__.py
[1m[36mexamples[m[m
pipeline_to_annotate_snps.ipynb
pipeline_to_annotate_snps_in_refaltcount_vcf_format.ipynb
pipeline_to_annotate_snps_in_standard_vcf_format.ipynb
simplify_snpeff.py
vcf_to_tsv.py


Move to the folder where the data is stored (in this run, `../../masked/vcfs`; for the bundled example files it would be `examples/`):

In [5]:
%cd ../../masked/vcfs
!ls

/Users/tur92196/DGN/dpgp3/masked/vcfs
README.md    ZI_Chr3L.vcf chrms        remade       snpeff
ZI_Chr2L.vcf ZI_Chr3R.vcf filtered     sfss         tables
ZI_Chr2R.vcf ZI_ChrX.vcf  liftover     sift4g




#### **SNPEff**

Create a folder to save the output files. In this run, the SNPEff output goes into `snpeff/`, inside the current working directory.

In [7]:
%%bash

# Save the annotated SNP files into snpeff folder
SNPEFF_folder="snpeff"
if [ !  -d "$SNPEFF_folder" ]; 
then
    mkdir -p $SNPEFF_folder && ls $SNPEFF_folder
else
    ls $SNPEFF_folder
fi
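The same "create the folder if it is missing, then list it" step can be sketched in Python with the standard library. The helper name `ensure_folder` is hypothetical and is not part of the pipeline modules.

```python
import os

def ensure_folder(path):
    """Create `path` if it is missing and return its (possibly empty) listing."""
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return sorted(os.listdir(path))

# Example call, equivalent to the bash cell above:
# ensure_folder("snpeff")
```

Because `exist_ok=True` makes the call idempotent, no explicit existence test is needed.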

Specify the variables for the SNPEff executable and configuration file:

In [8]:
SNPEFF="/Users/tur92196/snpEff/snpEff.jar"
SNPEFF_config="/Users/tur92196/snpEff/snpEff.config"

!echo $SNPEFF
!echo $SNPEFF_config

/Users/tur92196/snpEff/snpEff.jar
/Users/tur92196/snpEff/snpEff.config


Now, run SNPEff on one file (for Dmel, each VCF corresponds to a chromosome). For this notebook, I started with a small `.vcf` file for chromosome 2L.
Remember to use the `-classic` flag so that SNPEff writes the EFF field, which contains the reference/alternative codons (this seems to be the classic SNPEff behaviour).

In [9]:
OUTPUT="snpeff/"
!java -jar $SNPEFF ann -c $SNPEFF_config -classic -noStats -v Drosophila_melanogaster filtered/ZI_chr2L_remade_lifted_fltr.vcf > $OUTPUT/ZI_chr2L_remade_lifted_fltr_ann.vcf

00:00:00 SnpEff version SnpEff 5.1f (build 2023-07-21 13:58), by Pablo Cingolani
00:00:00 Command: 'ann'
00:00:00 Reading configuration file '/Users/tur92196/snpEff/snpEff.config'. Genome: 'Drosophila_melanogaster'
00:00:00 Reading config file: /Users/tur92196/snpEff/snpEff.config
00:00:00 done
00:00:00 Reading database for genome version 'Drosophila_melanogaster' from file '/Users/tur92196/snpEff/./data/Drosophila_melanogaster/snpEffectPredictor.bin' (this might take a while)
00:00:03 done
00:00:03 Loading Motifs and PWMs
00:00:03 Building interval forest
00:00:03 done.
00:00:03 Genome stats :
#-----------------------------------------------
# Genome name                : 'Drosophila_melanogaster'
# Genome version             : 'Drosophila_melanogaster'
# Genome ID                  : 'Drosophila_melanogaster[0]'
# Has protein coding info    : true
# Has Tr. Support Level info : true
# Genes                      : 23932
# Protein coding genes       : 14280
#----------------------------

The remaining chromosomes are run with the same command; just make sure to change the input and output file names.

In [10]:
# chr2R
!java -jar $SNPEFF ann -c $SNPEFF_config -classic -noStats -v Drosophila_melanogaster filtered/ZI_chr2R_remade_lifted_fltr.vcf > $OUTPUT/ZI_chr2R_remade_lifted_fltr_ann.vcf

# chr3L
!java -jar $SNPEFF ann -c $SNPEFF_config -classic -noStats -v Drosophila_melanogaster filtered/ZI_chr3L_remade_lifted_fltr.vcf > $OUTPUT/ZI_chr3L_remade_lifted_fltr_ann.vcf

# chr3R
!java -jar $SNPEFF ann -c $SNPEFF_config -classic -noStats -v Drosophila_melanogaster filtered/ZI_chr3R_remade_lifted_fltr.vcf > $OUTPUT/ZI_chr3R_remade_lifted_fltr_ann.vcf

00:00:00 SnpEff version SnpEff 5.1f (build 2023-07-21 13:58), by Pablo Cingolani
00:00:00 Command: 'ann'
00:00:00 Reading configuration file '/Users/tur92196/snpEff/snpEff.config'. Genome: 'Drosophila_melanogaster'
00:00:00 Reading config file: /Users/tur92196/snpEff/snpEff.config
00:00:00 done
00:00:00 Reading database for genome version 'Drosophila_melanogaster' from file '/Users/tur92196/snpEff/./data/Drosophila_melanogaster/snpEffectPredictor.bin' (this might take a while)
00:00:03 done
00:00:03 Loading Motifs and PWMs
00:00:03 Building interval forest
00:00:03 done.
00:00:03 Genome stats :
#-----------------------------------------------
# Genome name                : 'Drosophila_melanogaster'
# Genome version             : 'Drosophila_melanogaster'
# Genome ID                  : 'Drosophila_melanogaster[0]'
# Has protein coding info    : true
# Has Tr. Support Level info : true
# Genes                      : 23932
# Protein coding genes       : 14280
#----------------------------
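The per-chromosome commands above differ only in the chromosome name, so they could also be generated in a loop. The sketch below builds the same command line with `subprocess`; the helper names are hypothetical, the jar and config paths in the preview are placeholders, and the file-name pattern is copied from the cells above.

```python
import subprocess

def snpeff_command(chrom, snpeff_jar, snpeff_config, genome="Drosophila_melanogaster"):
    """Return (command, output_path) for one chromosome, mirroring the cells above."""
    in_vcf = f"filtered/ZI_chr{chrom}_remade_lifted_fltr.vcf"
    out_vcf = f"snpeff/ZI_chr{chrom}_remade_lifted_fltr_ann.vcf"
    cmd = ["java", "-jar", snpeff_jar, "ann", "-c", snpeff_config,
           "-classic", "-noStats", "-v", genome, in_vcf]
    return cmd, out_vcf

def run_snpeff(chrom, snpeff_jar, snpeff_config):
    """Run SnpEff for one chromosome, redirecting stdout to the annotated VCF."""
    cmd, out_vcf = snpeff_command(chrom, snpeff_jar, snpeff_config)
    with open(out_vcf, "w") as handle:
        subprocess.run(cmd, stdout=handle, check=True)

# Preview the commands without running them (placeholder jar/config paths)
for chrom in ["2R", "3L", "3R"]:
    cmd, out_vcf = snpeff_command(chrom, "snpEff.jar", "snpEff.config")
    print(" ".join(cmd), ">", out_vcf)
```

Keeping the command construction separate from its execution makes it easy to inspect what will be run before launching a long job.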

With the annotated VCF (annVCF), run the function `simplify_snpeff` to generate a simplified version of the SNPEff annotation (for the moment, the function retains only the first terms; the result should resemble the SNPEff `gatk` output format).

In [6]:
# Move to the folder containing the annotated files
%cd snpeff/
!ls 

/Users/tur92196/DGN/dpgp3/masked/vcfs/snpeff
ZI_chr2L_remade_lifted_fltr_ann.vcf
ZI_chr2L_remade_lifted_fltr_ann_simplified.vcf
ZI_chr2R_remade_lifted_fltr_ann.vcf
ZI_chr2R_remade_lifted_fltr_ann_simplified.vcf
ZI_chr3L_remade_lifted_fltr_ann.vcf
ZI_chr3L_remade_lifted_fltr_ann_simplified.vcf
ZI_chr3R_remade_lifted_fltr_ann.vcf
ZI_chr3R_remade_lifted_fltr_ann_simplified.vcf


In [12]:
# First run the smallest chromosome
inputfile = "ZI_chr2L_remade_lifted_fltr_ann.vcf"
result = simplify_snpeff(file=inputfile, outfile=None, keeponlyterms=False, method="First")
print(result)

file processed


In [13]:
# Then run the other chromosomes
chrm_list = ["2R", "3L", "3R"]

for c in chrm_list:
    print("Running chromosome: " + c)
    inputfile_chrm = "ZI_chr{chrm}_remade_lifted_fltr_ann.vcf".format(chrm = c)
    print(inputfile_chrm)
    simplify_snpeff(file=inputfile_chrm, outfile=None, keeponlyterms=False, method="First")

Running chromosome: 2R
ZI_chr2R_remade_lifted_fltr_ann.vcf
Running chromosome: 3L
ZI_chr3L_remade_lifted_fltr_ann.vcf
Running chromosome: 3R
ZI_chr3R_remade_lifted_fltr_ann.vcf
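To illustrate what "retaining only the first term" means, here is a minimal sketch that trims the comma-separated `EFF` entries of an INFO string down to the first one. It is not the implementation of `simplify_snpeff`; the helper name is hypothetical and the INFO value is made up and abbreviated for illustration.

```python
def keep_first_eff(info):
    """Keep only the first entry of the EFF annotation in a VCF INFO string."""
    fields = []
    for field in info.split(";"):
        if field.startswith("EFF="):
            # Entries are comma-separated; only the first one is retained
            field = "EFF=" + field[len("EFF="):].split(",")[0]
        fields.append(field)
    return ";".join(fields)

# Made-up, abbreviated INFO value (not taken from the real files)
info = "AC=3;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gtt/Att|V5I),UPSTREAM(MODIFIER|||)"
print(keep_first_eff(info))
```

Other INFO keys, such as `AC` in this example, pass through unchanged.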


#### **SIFT4G Annotations**
We can run another SNP annotator and compare the results (or select the SNPs that receive the same annotation from both). SIFT4G annotations can be attached to the SNPEff-annotated VCF by running the program on that file. Here we run SIFT4G on the simplified version of the VCF with SNPEff annotations.

Move back to the parent folder (in this run, `vcfs/`):

In [7]:
%cd ..
!ls

/Users/tur92196/DGN/dpgp3/masked/vcfs
README.md    ZI_Chr3L.vcf chrms        remade       snpeff
ZI_Chr2L.vcf ZI_Chr3R.vcf filtered     sfss         tables
ZI_Chr2R.vcf ZI_ChrX.vcf  liftover     sift4g


Then create a folder to save the SIFT4G results:

In [15]:
%%bash
SIFT4_folder="sift4g"
if [ !  -d "$SIFT4_folder" ]; 
then
    mkdir -p $SIFT4_folder && ls $SIFT4_folder
else
    ls $SIFT4_folder
fi

In [16]:
# Run SIFT4G on chromosome 2L:
SIFT4G="/Users/tur92196/local/sift4g/SIFT4G_Annotator.jar"
INPUT="snpeff/"
OUTPUT="sift4g/"

!java -jar $SIFT4G -c -i $INPUT/ZI_chr2L_remade_lifted_fltr_ann_simplified.vcf -d /Users/tur92196/local/sift4g/BDGP6.83 -r $OUTPUT

Start Time for SIFT4G code: Tue Feb 20 15:42:43 EST 2024

Started Running .......
Running in Single transcripts mode

Chromosome	WithSIFT4GAnnotations	WithoutSIFT4GAnnotations	Progress
2L			1			13			Completed : 1/1

Merging temp files....
SIFT4G Annotation completed !
Output directory:sift4g_
End Time for parallel code: Tue Feb 20 15:42:44 EST 2024



The remaining chromosomes are run with the same command; just make sure to change the input file names and the output.

In [17]:
# chr2R
!java -jar $SIFT4G -c -i $INPUT/ZI_chr2R_remade_lifted_fltr_ann_simplified.vcf -d /Users/tur92196/local/sift4g/BDGP6.83 -r $OUTPUT

# chr3L
!java -jar $SIFT4G -c -i $INPUT/ZI_chr3L_remade_lifted_fltr_ann_simplified.vcf -d /Users/tur92196/local/sift4g/BDGP6.83 -r $OUTPUT

# chr3R
!java -jar $SIFT4G -c -i $INPUT/ZI_chr3R_remade_lifted_fltr_ann_simplified.vcf -d /Users/tur92196/local/sift4g/BDGP6.83 -r $OUTPUT

Start Time for SIFT4G code: Tue Feb 20 15:43:01 EST 2024

Started Running .......
Running in Single transcripts mode

Chromosome	WithSIFT4GAnnotations	WithoutSIFT4GAnnotations	Progress
2R			0			14			Completed : 1/1

Merging temp files....
SIFT4G Annotation completed !
Output directory:sift4g_
End Time for parallel code: Tue Feb 20 15:43:01 EST 2024

Start Time for SIFT4G code: Tue Feb 20 15:43:02 EST 2024

Started Running .......
Running in Single transcripts mode

Chromosome	WithSIFT4GAnnotations	WithoutSIFT4GAnnotations	Progress
3L			0			14			Completed : 1/1

Merging temp files....
SIFT4G Annotation completed !
Output directory:sift4g_
End Time for parallel code: Tue Feb 20 15:43:03 EST 2024

Start Time for SIFT4G code: Tue Feb 20 15:43:03 EST 2024

Started Running .......
Running in Single transcripts mode

Chromosome	WithSIFT4GAnnotations	WithoutSIFT4GAnnotations	Progress
3R			0			14			Completed : 1/1

Merging temp files....
SIFT4G Annotation completed !
Output directory:sift4g_
En
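Before moving on, it can be worth checking that every chromosome produced a prediction file. The sketch below assumes the output naming used in the next step (`..._SIFTpredictions.vcf` inside `sift4g/`); the helper names are hypothetical.

```python
import os

def expected_sift_outputs(chromosomes, out_dir="sift4g"):
    """Map each chromosome to the prediction VCF expected in `out_dir`."""
    template = "ZI_chr{chrom}_remade_lifted_fltr_ann_simplified_SIFTpredictions.vcf"
    return {c: os.path.join(out_dir, template.format(chrom=c)) for c in chromosomes}

def missing_outputs(paths):
    """Return the chromosomes whose expected file is absent or empty."""
    return [c for c, p in paths.items()
            if not os.path.isfile(p) or os.path.getsize(p) == 0]

paths = expected_sift_outputs(["2L", "2R", "3L", "3R"])
print("Missing SIFT4G outputs:", missing_outputs(paths))
```

An empty list means all four prediction files are present and non-empty.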

#### __Table with SNPEff and SIFT annotation and additional annotations (mutational context, etc)__
Use `vcf_to_tsv.py` to convert the double-annotated VCF (after running SNPEff and SIFT4G) into a table in TSV format.

In [12]:
!ls

README.md    ZI_Chr3L.vcf chrms        remade       snpeff
ZI_Chr2L.vcf ZI_Chr3R.vcf filtered     sfss         tables
ZI_Chr2R.vcf ZI_ChrX.vcf  liftover     sift4g


Then create a folder called `tables` to store the TSV files converted from the VCF files.

In [19]:
%%bash
TABLES_folder="tables"
if [ !  -d "$TABLES_folder" ]; 
then
    mkdir -p $TABLES_folder && ls $TABLES_folder
else
    ls $TABLES_folder
fi

Run the following code to convert a VCF file to a table. Remember to import the function from `vcf_to_tsv.py` first.

In [21]:
inputfile = "sift4g/ZI_chr2L_remade_lifted_fltr_ann_simplified_SIFTpredictions.vcf"
outputfile = "tables/ZI_chr2L_ann_table.tsv"
reference = "/Users/tur92196/DGN/reference/dm6.fa"
samtools_path = "/Users/tur92196/local/samtools1_8/bin/samtools"

result = vcf_to_tsv(file=inputfile, outfile=outputfile, reference=reference, nflankinbps=4, samtools_path=samtools_path)
print(result)

Start processing the annotate VCF file. It might take a while...
Fixing ref and alt mutations on flanking bases. It might take a while...
Exporting the processed VCF to a .TSV file
file processed
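The mutational context relies on the bases flanking each SNP. As an illustration only, and assuming `nflankinbps` is the number of flanking bases taken on each side, the idea can be sketched on a plain string. The real function reads the reference FASTA through samtools; the helper below is hypothetical.

```python
def flanking_context(sequence, position, n_flank):
    """Return (left, base, right) around a 1-based position in `sequence`."""
    idx = position - 1  # convert to a 0-based index
    if not 0 <= idx < len(sequence):
        raise ValueError("position outside the sequence")
    left = sequence[max(0, idx - n_flank):idx]
    right = sequence[idx + 1:idx + 1 + n_flank]
    return left, sequence[idx], right

# Toy sequence; position 6 is the first 'G' after "ACGTA"
print(flanking_context("ACGTAGCTAGGA", 6, 4))
```

Near the ends of a sequence, the returned flanks are simply shorter than `n_flank`.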


Now, get the table for the rest of the chromosomes:

In [22]:
# Then run the other chromosomes
chrm_list = ["2R", "3L", "3R"]

for c in chrm_list:
    print("Running chromosome: chr" + c)
    inputfile = "sift4g/ZI_chr{chrm}_remade_lifted_fltr_ann_simplified_SIFTpredictions.vcf".format(chrm = c)
    outputfile = "tables/ZI_chr{chrm}_ann_table.tsv".format(chrm = c)

    vcf_to_tsv(file=inputfile, outfile=outputfile, reference=reference, nflankinbps=4, samtools_path=samtools_path)

## This part was actually run in CLI mode, but I put it here for reference. You can also run it in Jupyter notebook mode.

Running chromosome: chr2R
Start processing the annotate VCF file. It might take a while...
Fixing ref and alt mutations on flanking bases. It might take a while...
Exporting the processed VCF to a .TSV file
Running chromosome: chr3L
Start processing the annotate VCF file. It might take a while...
Fixing ref and alt mutations on flanking bases. It might take a while...
Exporting the processed VCF to a .TSV file
Running chromosome: chr3R
Start processing the annotate VCF file. It might take a while...
Fixing ref and alt mutations on flanking bases. It might take a while...
Exporting the processed VCF to a .TSV file
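A quick way to confirm that a table was written correctly is to read back its header and count its rows. This minimal sketch uses only the standard library and assumes the first line of the TSV is a header; the helper name is hypothetical.

```python
import csv

def summarize_table(path):
    """Return the column names and the number of data rows of a TSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return header, n_rows

# Example call (path taken from the cells above):
# header, n_rows = summarize_table("tables/ZI_chr2L_ann_table.tsv")
```

Comparing `n_rows` with the number of variant lines in the input VCF is a simple consistency check.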
