# Project 2

### Scientific Question: Compared to SARS-CoV, can mutations in the Receptor Binding Domain (RBD) of the SARS-CoV-2 spike protein affect how it interacts with the ACE2 receptor and, as a result, enable the virus to spread more easily in humans?

The Receptor Binding Domain (RBD) is the region of the spike protein responsible for binding SARS-CoV-2 to the cell receptor ACE2. ACE2, also known as angiotensin-converting enzyme 2, is a receptor protein that provides an entry point for the coronavirus to infect host cells (Lan et al., 2020).
  
Studies have shown that "the infectivity of different SARS-CoV strains in host cells is proportional to the binding free energy between the spike (S) protein receptor-binding domain (RBD) and angiotensin-converting enzyme 2 (ACE2) expressed by the host cells" (Wang et al., 2021). Mutations in the S protein, specifically the D614G mutation, can correspond to an increase in transmissibility (van Dorp et al., 2020).
  
Many studies have been conducted in the past year on the S protein and its D614G mutation. Structures of the S protein receptor binding domain can be obtained from the PDB (https://www.rcsb.org/). As stated on the database's website:
  
  "This resource is powered by the Protein Data Bank archive-information about the 3D shapes of proteins, nucleic acids, and complex assemblies that helps students and researchers understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease."

### Scientific Hypothesis: If there are mutations in the Receptor Binding Domain (RBD) of the spike protein, then they can affect how the RBD interacts with the ACE2 receptor and, in turn, overall transmissibility.

Since viruses evolve rapidly, the PDB contains structures for many variants and mutations. A few of these structures were saved as FASTA sequences to conduct a pairwise sequence alignment. The sequences were read into lists and assigned to variables. This data was then used to plot a heat plot illustrating the differences and similarities between the two sequences.

To answer this scientific question and test my hypothesis, I first had to find a FASTA file that contained the sequence for the D614G mutation. To find this file, I searched for "RDB D614G" on the Protein Data Bank (https://www.rcsb.org/structure/6GVF). I chose which sequences to use, then downloaded the FASTA sequence files and imported the data into Python.
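A minimal sketch of how such a heat plot could be built once two sequences have been aligned. The function names `identity_row` and `plot_identity` are hypothetical, and matplotlib is an assumption (it is not among the packages loaded in this notebook). The idea is to mark each aligned position with 1 if the residues are identical and 0 otherwise, then draw that vector as a wrapped grid.

```python
def identity_row(aligned_a, aligned_b):
    """Return 1 where two aligned residues match, 0 where they differ or either is a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return [1 if a == b and a != "-" else 0 for a, b in zip(aligned_a, aligned_b)]


def plot_identity(row, width=60):
    """Draw the identity vector as a heat plot, wrapped into rows of `width` columns."""
    import matplotlib.pyplot as plt  # assumption: matplotlib is installed

    # Pad the last row with NaN so the padding is left blank instead of
    # being drawn as a mismatch.
    padded = row + [float("nan")] * (-len(row) % width)
    grid = [padded[i:i + width] for i in range(0, len(padded), width)]
    plt.imshow(grid, cmap="viridis", aspect="auto")
    plt.xlabel("position within row")
    plt.ylabel("row (blocks of %d residues)" % width)
    plt.colorbar(label="1 = identical, 0 = different or gap")
    plt.show()


# Toy aligned strings (made up for illustration, not the spike sequences)
row = identity_row("MFVD-LV", "MFVGALV")
print(row)  # [1, 1, 1, 0, 0, 1, 1]
```

With Biopython's `pairwise2`, the two aligned strings should be available as the first two elements of each returned alignment, so something like `plot_identity(identity_row(alignments[0][0], alignments[0][1]))` would be the intended usage under these assumptions.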

### Part 1: Load the Packages

Packages loaded include the following:

- NumPy: provides arrays, an alternative to the regular Python list that is much more efficient for scientific computing; it can perform calculations over whole arrays (collections of values)

- SeqIO: stands for Sequence Input/Output; used to input and output assorted sequence file formats

- pairwise2: pairwise sequence alignment that provides functions to get global and local alignments between two sequences.

- format_alignment: shows the aligned and unaligned parts of both sequences.

- nglview: a Jupyter widget to interactively view molecular structures and trajectories from molecular dynamics simulations using fast and scalable molecular graphics

In [2]:
# Import all the packages needed
from Bio import SeqIO
import numpy as np
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

import nglview as nv



### Part 2: Load in the data and perform Bioinformatics Analyses (Pairwise Sequence Alignment)

FASTA is a text-based format for representing nucleotide or amino acid sequences. Each record starts with a header line beginning with ">", followed by one or more lines of sequence.
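To make the format concrete, here is a small illustrative parser written in plain Python. It is only a sketch: the notebook itself relies on `SeqIO.parse`, the function name `read_fasta` is hypothetical, and the two records are made up rather than taken from the PDB files.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines of one record are joined into a single string
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up two-record example (not real PDB entries)
example = """>demo_1|Chain A|example protein
MFVFLVLLPL
VSSQCVNLTT
>demo_2|Chain B|example protein
MFVFLV
"""

for header, seq in read_fasta(example):
    print(header, len(seq))
```

A file with several chains yields several records, which is why the notebook takes the first element of the parsed list before reading `.seq`.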

In the code below, each FASTA file was first read into a list, and the sequence of its first record was assigned to a variable. I then used these variables to perform the alignment and format the result.

In [16]:
# Assign the first sequence in the file to the variable WOMutation
FastaWoMutation = list(SeqIO.parse("rcsb_pdb_7KDJ.fasta", "fasta"))
WOMutation = FastaWoMutation[0].seq

# To check that the variable WOMutation is working
print(WOMutation)

# Assign the first sequence in the file to the variable Mutation
FastaMutation = list(SeqIO.parse("rcsb_pdb_7KRQ.fasta", "fasta"))
Mutation = FastaMutation[0].seq

# To check that the variable Mutation is working
print(Mutation)

# Defined two sequences to be aligned
X = WOMutation
Y = Mutation

# Get a list of the global alignments between WOMutation (X) and Mutation (Y)
# globalxx takes no scoring parameters: identical characters score 1, otherwise 0
# No gap penalties
alignments = pairwise2.align.globalxx(X, Y)

for a in alignments:
    print(format_alignment(*a))

MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNEVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGR
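The scoring scheme described in the comments above (match = 1, mismatch = 0, no gap penalties) can be reproduced with a short dynamic-programming sketch. This is not Biopython's implementation, and `globalxx_score` is a hypothetical name. It only computes the best score, which under this scheme equals the length of the longest common subsequence of the two sequences.

```python
def globalxx_score(a, b):
    """Best global alignment score with match = 1, mismatch = 0, gaps = 0."""
    # prev[j] holds the best score for the processed prefix of a against b[:j]
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (1 if ch_a == ch_b else 0)  # align ch_a with ch_b
            # prev[j]: gap in b, curr[j - 1]: gap in a (both cost nothing)
            curr.append(max(diagonal, prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Toy sequences for illustration
print(globalxx_score("ACCGT", "ACG"))  # 3
```

Because gaps are free, the score only counts identical residues that can be lined up in order. Dividing it by the length of the shorter sequence would give a rough identity fraction.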

### Part 3: Upload the two S proteins to visualize the macromolecular structures

Here I am using NGL Viewer to view the structures of the S protein without the D614G mutation ("7KDJ") and the S protein with the mutation ("7KRQ"). NGL Viewer can interactively display large 3D molecular complexes, which helps in explaining the relationship between protein structure and function. Here it is used to compare the structures of the S protein with and without the mutation.

In [8]:
#load "7KDJ" from RCSB PDB and display viewer widget
view = nv.show_pdbid("7KDJ")  
view

NGLWidget()

In [5]:
#load "7KRQ" from RCSB PDB and display viewer widget
view = nv.show_pdbid("7KRQ")  
view

NGLWidget()

### Part 4: Analysis of the results

The visualizations show how structurally different "7KDJ" and "7KRQ" are. "7KDJ" is the S protein without the D614G mutation, and "7KRQ" is the S protein with the mutation. Compared to the structure of "7KRQ", the structure of "7KDJ" has roughly three protruding segments of amino acids in an alpha-helix structure, which can attach to the ACE2 receptor of the cells. This suggests that the spike protein D614G mutation can increase cell entry through higher affinity for the receptor, a change that can enhance viral transmission. Therefore, the results support my hypothesis that mutations in the Receptor Binding Domain of the spike protein can affect how it interacts with the ACE2 receptor.
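As a complement to the visual comparison, the sequence differences themselves could be listed in the usual substitution notation (for example D614G). The sketch below assumes two sequences of equal length with no insertions or deletions. `list_substitutions` and its `offset` parameter are hypothetical names, and the five-residue fragments are a toy example, not the full spike sequences.

```python
def list_substitutions(reference, variant, offset=0):
    """List point substitutions between two equal-length sequences, e.g. 'D614G'.

    `offset` is added to the 1-based position, for fragments that do not
    start at residue 1 of the full-length protein.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length (no indels)")
    return [
        f"{ref}{pos + offset}{var}"
        for pos, (ref, var) in enumerate(zip(reference, variant), start=1)
        if ref != var
    ]


# Toy fragment whose first residue is treated as position 613
print(list_substitutions("QDVNC", "QGVNC", offset=612))  # ['D614G']
```

For the real data, the two sequences would first have to be brought to a common numbering, since FASTA files downloaded from the PDB describe the expressed construct and may not start at residue 1 of the native spike protein.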