# Sequence Alignment

In this notebook, I use SARS-CoV-2 sequence data downloaded from GenBank to build a multiple sequence alignment of the full-length genomes. First, I filter out the shorter sequences. Then, I align the remaining sequences with the MUSCLE alignment software.

In [1]:
from Bio import SeqIO
from Bio import SeqRecord
import os

In [2]:
#Read in unfiltered data
unfiltered = SeqIO.parse("../../data/raw/SARS-CoV-2.gbk", "genbank")

In [3]:
#Keep only (close to) full-length sequences, i.e. longer than 29,000 nt
full_length_records = []
for record in unfiltered:
    if len(record.seq) > 29000:
        full_length_records.append(record)
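
The same filter can be written as a small reusable generator, which avoids hard-coding the cutoff. This is only a sketch: `filter_by_length` and its `min_length` parameter are names I made up for illustration, and the toy records below are stand-ins for Biopython `SeqRecord` objects (anything with a `.seq` that supports `len()` works).

```python
from types import SimpleNamespace


def filter_by_length(records, min_length=29000):
    """Yield only the records whose sequence is longer than min_length.

    Hypothetical helper: works on any object with a .seq attribute,
    so it also accepts the iterator returned by SeqIO.parse.
    """
    for record in records:
        if len(record.seq) > min_length:
            yield record


# Toy stand-ins for SeqRecord objects (not real data)
toy_records = [
    SimpleNamespace(id="short", seq="A" * 5),
    SimpleNamespace(id="long", seq="A" * 12),
]

kept_ids = [record.id for record in filter_by_length(toy_records, min_length=10)]
print(kept_ids)
```

With the real data, `list(filter_by_length(unfiltered))` would take the place of the loop above.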

In [4]:
#Write filtered data to file
SeqIO.write(full_length_records, "../../data/raw/SARS-CoV-2.fasta", "fasta")

5398

Download and install MUSCLE for multiple sequence alignment from http://www.drive5.com/muscle

MUSCLE needs to be on your $PATH for it to work with the Biopython wrapper. Once you download and unpack the tarball, I suggest renaming the executable to just `muscle`. Then, use the following command to copy it into /usr/local/bin.

`cp path_to_muscle /usr/local/bin`
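
Before running the alignment, it can help to confirm that the executable is actually visible on the `$PATH`. A minimal sketch using only the standard library; `find_executable` is a hypothetical helper name, and it assumes the binary was renamed to `muscle` as suggested above.

```python
import shutil


def find_executable(name="muscle"):
    """Return the full path to `name` if it is on the PATH, else None."""
    return shutil.which(name)


muscle_path = find_executable("muscle")
if muscle_path is None:
    print("muscle was not found on the PATH")
else:
    print(f"muscle found at {muscle_path}")
```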

In [6]:
#Align sequences with MUSCLE
from Bio.Align.Applications import MuscleCommandline

In [8]:
muscle_cline = MuscleCommandline(input="../../data/raw/SARS-CoV-2.fasta", out="../../data/processed/SARS-CoV-2_aligned.fasta")

In [None]:
muscle_cline()
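
As far as I know, recent Biopython releases have deprecated the command-line wrappers (including `MuscleCommandline`) in favor of calling tools through `subprocess`. Below is a sketch of the equivalent call, assuming MUSCLE 3.x takes `-in`/`-out` while MUSCLE 5 takes `-align`/`-output`; check `muscle -h` for your installed version. `build_muscle_command` and `run_muscle` are hypothetical helper names, not part of Biopython.

```python
import subprocess


def build_muscle_command(input_path, output_path, major_version=3, executable="muscle"):
    """Build the MUSCLE argument list for the given major version.

    Assumption: v3 uses -in/-out, v5 uses -align/-output.
    """
    if major_version >= 5:
        return [executable, "-align", input_path, "-output", output_path]
    return [executable, "-in", input_path, "-out", output_path]


def run_muscle(input_path, output_path, major_version=3):
    """Run MUSCLE and raise CalledProcessError if it exits with an error."""
    command = build_muscle_command(input_path, output_path, major_version)
    subprocess.run(command, check=True)


# Only build and show the command here; call run_muscle(...) to execute it.
command = build_muscle_command(
    "../../data/raw/SARS-CoV-2.fasta",
    "../../data/processed/SARS-CoV-2_aligned.fasta",
)
print(" ".join(command))
```

Passing the arguments as a list (rather than one shell string) sidesteps quoting problems if a path contains spaces.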