In [63]:
import numpy as np
import pandas as pd

import Bio
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
import os

In [64]:
#for seq_record in SeqIO.parse("sequences/genome/AY029768.1_MYS.fasta", "fasta"):
for seq_record in SeqIO.parse("sequences/genome/NC_00278.1_MYS.fasta", "fasta"):
    
    # just wanted to do a sanity check
    print(f"Malaysian strain genome size: {len(seq_record)}")

Malaysian strain genome size: 18246


In [65]:
for seq_record in SeqIO.parse("sequences/genome/AY988601.1_BGD.fasta", "fasta"):
    
    # just wanted to do a sanity check
    print(f"Bangladeshi strain genome size: {len(seq_record)}")

Bangladeshi strain genome size: 18252


# 1. BLAST the glycoprotein and phosphoprotein sequences to get all sequences from NCBI

Used the reference sequences `sequences/PG/ref_G_seq` and `sequences/PG/ref_P_seq` for the glycoprotein (G) and phosphoprotein (P), respectively. Queried both on BLAST using <i>Henipaviruses</i> as the search organism.

The 83 full search results for G (each hit may be an individual protein, multiple proteins, or a full genome) are stored at `sequences/PG/glyco_blast_83.fasta`, and the 80 full search results for P are stored at `sequences/PG/phospho_blast_80.fasta`. Duplicates were removed in Geneious.
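The duplicate removal was done in Geneious, so the exact criteria are not recorded here. As a rough scripted equivalent, here is a minimal sketch that assumes "duplicate" means an identical nucleotide sequence, ignoring case. The function name `drop_duplicate_seqs` is hypothetical, and the input is assumed to be `(name, sequence)` string pairs, e.g. built from `(rec.id, str(rec.seq))` while iterating over `SeqIO.parse`.

```python
def drop_duplicate_seqs(records):
    """Keep the first record for each distinct sequence.

    `records` is an iterable of (name, sequence) string pairs.
    Comparison ignores case; this is an assumption about what counts
    as a duplicate, not a description of what Geneious does.
    """
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()
        if key in seen:
            continue  # identical sequence already kept under another name
        seen.add(key)
        unique.append((name, seq))
    return unique


# Toy example with made-up names and sequences
toy = [("a", "ATGC"), ("b", "atgc"), ("c", "ATGG")]
print(drop_duplicate_seqs(toy))
```

Keeping the first occurrence means the surviving name depends on file order, which matters if the metadata lookup later relies on a particular accession.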

Descriptions are stored at `sequences/glyco_blast_descriptions.csv` and `sequences/phospho_blast_descriptions.csv`. These are needed to get the isolate locations and other metadata.

```bash
bowtie2-build sequences/genome/AY029768.1_MYS_CDS.fna MYS_cds
bowtie2 -x MYS_cds -f sequences/PG/G_seqs.fasta -S G_aln.sam
```

The alignment above showed that two glycoprotein sequences, <b>MN549407</b> and <b>MN549410</b>, did not align to either genome because they contain too many ambiguous nucleotides. They were removed from the .fasta files.
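Sequences like these could also be flagged before alignment. The sketch below is one possible check, not the method used here: it computes the fraction of characters that are not an unambiguous base and reports records above a cutoff. The names `ambiguous_fraction` and `flag_ambiguous` and the 0.05 cutoff are illustrative assumptions.

```python
def ambiguous_fraction(seq):
    """Fraction of characters that are not an unambiguous A, C, G, T or U."""
    seq = seq.upper()
    if not seq:
        return 0.0
    n_ambiguous = sum(1 for base in seq if base not in "ACGTU")
    return n_ambiguous / len(seq)


def flag_ambiguous(records, max_fraction=0.05):
    """Return names of records whose ambiguous fraction exceeds the cutoff.

    The 0.05 default is an arbitrary illustration, not a value from this analysis.
    """
    return [name for name, seq in records if ambiguous_fraction(seq) > max_fraction]


# Toy example with made-up names and sequences
toy = [("clean", "ATGCATGCAT"), ("noisy", "ATNNNNRYAT")]
print(flag_ambiguous(toy))
```

Gap characters such as `-` also count as ambiguous in this sketch, so it is meant for unaligned sequences.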

The sequence headers were updated, and the V, W, and C proteins were removed from the reference because they duplicate the phosphoprotein sequence. <b>Keeping them was found to cause problems when aligning the glycoprotein sequences</b>. It makes sense that a reference should not contain redundant sequences, but I'm not sure why they didn't cause problems for the phosphoprotein alignment.

In [66]:
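def read_fasta(fName):
    """Read a FASTA file into a name-sorted {name: sequence} dict plus an array of lengths.

    Names are the record ids truncated at the first underscore. The lengths
    array follows file order, not the sorted order of the dict.
    """
    lengths = []
    names = []
    seqs = []

    for seq_record in SeqIO.parse(fName, "fasta"):
        # str.split returns the whole id when it contains no underscore
        names.append(seq_record.id.split("_")[0])
        seqs.append(seq_record.seq)
        lengths.append(len(seq_record))

    print(f"Found {len(seqs)} sequences in {os.path.basename(fName)}")

    # Records that share a truncated name collapse to one dict entry
    res_dict = dict(zip(names, seqs))

    return dict(sorted(res_dict.items(), key=lambda x: x[0])), np.array(lengths)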

In [67]:
G = read_fasta("sequences/PG/G_seqs.fasta")
P = read_fasta("sequences/PG/P_seqs.fasta")

Found 79 sequences in G_seqs.fasta
Found 79 sequences in P_seqs.fasta


# 2. Single nucleotide polymorphism (SNP) calling

## End goal is to convert the FASTA file into a VCF (Variant Call Format) file

<b>Amazing tutorial:</b> https://www.ebi.ac.uk/sites/ebi.ac.uk/files/content.ebi.ac.uk/materials/2014/140217_AgriOmics/dan_bolser_snp_calling.pdf

The following steps are adapted from it.

<b>Index the reference with `bwa`</b>
```bash
bwa index sequences/genome/AY029768.1_MYS_CDS.fna
```

<b>Perform the alignment</b>
```bash
bwa aln sequences/genome/AY029768.1_MYS_CDS.fna sequences/PG/P_seqs.fasta > alignments/P_MYS_aln.sai
```

<b>Convert to SAM file format, which is human-readable</b>
```bash
bwa samse sequences/genome/AY029768.1_MYS_CDS.fna alignments/P_MYS_aln.sai sequences/PG/P_seqs.fasta > alignments/P_MYS_aln.sam
```

<b>Convert SAM to BAM and sort the BAM file</b>
```bash
samtools view -b alignments/G_BGD_aln.sam > alignments/G_BGD_aln.bam
samtools sort alignments/G_BGD_aln.bam -o alignments/G_BGD_aln_sorted.bam
```

<b>Index the genome file again with `samtools`</b>
```bash
samtools faidx sequences/genome/AY029768.1_MYS_CDS.fna
```

<b>Run 'mpileup' to generate VCF format</b>
```bash
bcftools mpileup -f sequences/genome/AY029768.1_MYS_CDS.fna alignments/P_MYS_aln_sorted.bam > alignments/P_MYS_aln.bcf
bcftools mpileup -f sequences/genome/AY988601.1_BGD_CDS.fna alignments/G_BGD_aln_sorted.bam > alignments/G_BGD_aln.bcf
```

<b>Call SNPs</b>
```bash
bcftools view -v snps alignments/P_MYS_aln.bcf > alignments/P_MYS_SNPs.vcf
bcftools view -v snps alignments/G_BGD_aln.bcf > alignments/G_BGD_SNPs.vcf
```
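Once a `*_SNPs.vcf` file exists, a quick tally of SNP records per reference sequence is a useful sanity check. The sketch below is an assumption-laden helper, not part of the original pipeline: it reads uncompressed VCF text, treats a record as a SNP when REF is one base and at least one ALT allele is a single A, C, G or T, and ignores the symbolic `<*>` allele that `bcftools mpileup` emits. The function name `count_snps_per_contig` and the toy records are made up.

```python
from collections import Counter


def count_snps_per_contig(vcf_lines):
    """Count SNP records per CHROM in uncompressed VCF text."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the column header
        chrom, _pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # Keep only concrete single-base ALT alleles; drops '.', '*' and '<*>'
        base_alts = [a for a in alt.split(",") if a.upper() in {"A", "C", "G", "T"}]
        if len(ref) == 1 and base_alts:
            counts[chrom] += 1
    return counts


# Toy VCF lines with made-up contig names and positions
toy_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "geneP\t10\t.\tA\tG\t50\t.\t.\n",
    "geneP\t25\t.\tC\tT,<*>\t50\t.\t.\n",
    "geneP\t40\t.\tAT\tA\t50\t.\t.\n",
    "geneG\t7\t.\tG\t<*>\t0\t.\t.\n",
]
print(count_snps_per_contig(toy_vcf))
```

An open file handle works as input too, for example `count_snps_per_contig(open("alignments/P_MYS_SNPs.vcf"))`, provided the file is plain-text VCF rather than binary BCF.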

## Exact code to run:

```bash
bash align_make_vcf.sh "sequences/genome/AY988601.1_BGD_CDS.fna" "sequences/PG/P_seqs.fasta" "alignments" "P_BGD"
bash align_make_vcf.sh "sequences/genome/NC_00278.1_MYS_CDS.fna" "sequences/PG/P_seqs.fasta" "alignments" "P_MYS"

bash align_make_vcf.sh "sequences/genome/AY988601.1_BGD_CDS.fna" "sequences/PG/G_seqs.fasta" "alignments" "G_BGD"
bash align_make_vcf.sh "sequences/genome/NC_00278.1_MYS_CDS.fna" "sequences/PG/G_seqs.fasta" "alignments" "G_MYS"
```
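The contents of `align_make_vcf.sh` are not shown in this notebook. Assuming its four arguments are the reference, the query FASTA, the output directory, and a file prefix, and that it chains the commands from the walkthrough above, the same pipeline could be sketched in Python as follows. The names `build_pipeline_commands` and `run_pipeline` are hypothetical, and the output file naming is inferred from the file names used above.

```python
import subprocess
from pathlib import Path


def build_pipeline_commands(ref, reads, out_dir, prefix):
    """Return (argv, stdout_path) pairs mirroring the steps listed above.

    stdout_path is None when the command writes its own output files.
    """
    base = Path(out_dir) / f"{prefix}_aln"
    sai, sam, bam = f"{base}.sai", f"{base}.sam", f"{base}.bam"
    sorted_bam, bcf = f"{base}_sorted.bam", f"{base}.bcf"
    vcf = str(Path(out_dir) / f"{prefix}_SNPs.vcf")
    return [
        (["bwa", "index", ref], None),
        (["bwa", "aln", ref, reads], sai),
        (["bwa", "samse", ref, sai, reads], sam),
        (["samtools", "view", "-b", sam], bam),
        (["samtools", "sort", bam, "-o", sorted_bam], None),
        (["samtools", "faidx", ref], None),
        (["bcftools", "mpileup", "-f", ref, sorted_bam], bcf),
        (["bcftools", "view", "-v", "snps", bcf], vcf),
    ]


def run_pipeline(ref, reads, out_dir, prefix):
    """Run each step in order, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for argv, stdout_path in build_pipeline_commands(ref, reads, out_dir, prefix):
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)


# Dry run: print the commands for one of the four invocations without executing them
for argv, stdout_path in build_pipeline_commands(
    "sequences/genome/AY988601.1_BGD_CDS.fna", "sequences/PG/P_seqs.fasta", "alignments", "P_BGD"
):
    print(" ".join(argv), f"> {stdout_path}" if stdout_path else "")
```

Separating command construction from execution makes it easy to check that every step uses the same reference before anything is run.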

# 3. Make trees using the G and P CDSs

```bash
fasttree -nt sequences/PG/G_seqs.fasta > trees/G_cds.nwk
fasttree -nt sequences/PG/P_seqs.fasta > trees/P_cds.nwk
```
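To confirm that every input sequence ended up in a tree, the leaf labels can be pulled out of the Newick file and compared against the FASTA ids. This is a minimal regex-based sketch rather than a full Newick parser: it assumes unquoted labels, and the function name `newick_leaf_names` and the toy tree are made up.

```python
import re


def newick_leaf_names(newick_text):
    """Extract leaf labels from a Newick string.

    Leaves are the labels that directly follow '(' or ','; labels after ')'
    are internal-node labels (FastTree writes support values there).
    Assumes unquoted labels without parentheses, commas, colons or semicolons.
    """
    compact = "".join(newick_text.split())  # drop whitespace and newlines
    return re.findall(r"[(,]([^(),:;]+)", compact)


# Toy tree with made-up leaf names and a support value on the inner node
toy_tree = "(seqA:0.01,(seqB:0.02,seqC:0.03)0.95:0.05);"
print(newick_leaf_names(toy_tree))
```

For anything beyond a label check, a proper parser such as `Bio.Phylo.read(path, "newick")` from Biopython is the safer choice.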

https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionVI/lessons/01_alignment.html