![](https://www.pennmedicine.org/news/-/media/images/pr%20news/news/2021/october/dna.ashx)

# **Goals**
In this notebook, you will:
*   Brainstorm some ideas about how AI can help in the fight against COVID-19.
*   Learn the basics of virus biology.
*   Learn about genetic material, transcription, and translation.
*   Learn how genetic material can mutate.
*   Think deeply about the pros and cons of building models to predict where different SARS-CoV-2 lineages originated.

*Note: There is a bit of biology to learn before we dive into the machine learning part of this project. Be patient and please ask about anything you don't understand!*

# SARS-CoV-2

<img src="https://sanjuancounty.colorado.gov/sites/sanjuancounty/files/styles/extra_large_thumbnail_650x650_/public/04-2020/coronavirus_banner.png" alt="drawing" width="1000"/>

SARS-CoV-2 is the virus that causes COVID-19, the disease behind the worldwide pandemic.




In [None]:
#@title Exercise: Brainstorm 3 ways that we can use AI in the fight against COVID-19.

_1_ = 'detect origins' #@param {type:"string"}
_2_ = 'differentiate people who are infected' #@param {type:"string"}
_3_ = 'find the different symptoms' #@param {type:"string"}

SARS-CoV-2 is actively mutating, so by looking at the genome sequence of a specific SARS-CoV-2 sample, we can often make a good guess about which country it came from!

***In this project, we are going to build a classifier that predicts which country a SARS-CoV-2 virus came from, so that we can identify where there may be outbreaks in the future.  We're going to talk more about that later.***

Viruses? Mutating? Sequence?

<img src="https://live.staticflickr.com/8466/8393213472_24d08168b0_b.jpg" alt="drawing" width="500"/>


First, let's go over a little bit of virus biology.


# What is a virus?


A virus is a tiny, infectious particle that infects a host cell in order to reproduce.  Viruses take over the host cell and use its resources to make more viruses, basically reprogramming it to become a virus factory.

Please take 10 minutes to read this excellent [Khan Academy article](https://www.khanacademy.org/science/high-school-biology/hs-human-body-systems/hs-the-immune-system/a/intro-to-viruses) on viruses.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/Basic_Scheme_of_Virus_en.svg/1280px-Basic_Scheme_of_Virus_en.svg.png" alt="drawing" width="400"/>



### **Exercise: Answer the following questions about viruses. (Run cell for answers.)**

In [None]:
#@title 1. True or False? Viruses can reproduce entirely on their own.
Answer = "False" #@param ["True", "False"]
print("Answer: False")

Answer: False


In [None]:
#@title 2. True or False? Viruses use the same genetic code that is used in living cells (DNA/RNA).
Answer = "True" #@param ["True", "False"]
print("Answer: True")

Answer: True


In [None]:
#@title 3. What is the difference between viruses and bacteria?
Answer = "" #@param {type:"string"}
print("Although both can make us sick, viruses and bacteria are very different.")
print("Bacteria are small and single-celled, but they are living organisms ")
print("that do not depend on a host cell to reproduce. We can use antibiotics ")
print("to fight bacteria, but antibiotics do not work against viruses.")

Although both can make us sick, viruses and bacteria are very different.
Bacteria are small and single-celled, but they are living organisms 
that do not depend on a host cell to reproduce. We can use antibiotics 
to fight bacteria, but antibiotics do not work against viruses.


In [None]:
#@title 4. What are 3 key features found in viruses?
Answer = "capsid, genome, envelope" #@param {type:"string"}
print("Capsid, genome, and envelope (the envelope is found in some, but not all, viruses).")

Capsid, genome, and envelope (the envelope is found in some, but not all, viruses).


# What is a genome?

A genome is the collection of DNA or RNA in an organism that codes for its various functions and processes.  DNA and RNA are slightly different forms of genetic material. Genetic material ultimately tells an organism which proteins or enzymes to make, and proteins/enzymes are the molecular components that make all lifeforms function.  Read [this primer](https://www.khanacademy.org/science/high-school-biology/hs-molecular-genetics/hs-rna-and-protein-synthesis/a/intro-to-gene-expression-central-dogma) and [this primer](https://www.khanacademy.org/science/high-school-biology/hs-molecular-genetics/hs-rna-and-protein-synthesis/a/the-genetic-code) on genetic material and the central dogma for a review.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5b/Central_dogma_of_molecular_biology.svg/1259px-Central_dogma_of_molecular_biology.svg.png" alt="drawing" width="300"/>

Although they need a host cell to help them actually convert DNA or RNA to proteins, viruses also have genomes.  SARS-CoV-2 has a genome made of RNA! This means its genome has some slightly different characteristics than DNA genomes (like those of humans), including some nuances in the transcription/translation cycle, a faster mutation rate, and the ability to swap parts of its genome with nearby viruses (called recombination). You can read more on the +ssRNA (positive-sense single-stranded RNA) virus [Wikipedia page](https://en.wikipedia.org/wiki/Positive-sense_single-stranded_RNA_virus) if you are interested.
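To make transcription and translation more concrete, here is a minimal Python sketch of the central dogma. It is a simplification, not how a cell (or any bioinformatics library) actually does it: the function names `transcribe` and `translate`, the short example sequence, and the partial codon table are all made up for illustration. The real codon table has 64 entries.

```python
# Partial codon table (RNA codon -> amino acid). Illustrative only:
# the full standard genetic code has 64 codons.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "UUU": "Phe", "UUC": "Phe",
    "GGA": "Gly", "GGC": "Gly",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def transcribe(dna_coding_strand):
    """Simplified transcription: the RNA matches the coding strand, with U instead of T."""
    return dna_coding_strand.upper().replace("T", "U")

def translate(rna):
    """Start at the first AUG and read three bases at a time until a stop codon."""
    start = rna.find("AUG")
    if start == -1:
        return []  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "???")  # "???" = not in our small table
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein

dna = "CCATGTTTGGATAGCC"  # made-up example sequence
rna = transcribe(dna)
print(rna)
print(translate(rna))
```

Try changing one letter of `dna` and re-running the cell to see whether the protein changes.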


Please answer the following questions:

### **Exercise: Answer the following questions about genomes.**

In [None]:
#@title 1. What is a start codon, and what is its specific sequence?
Answer = "AUG" #@param {type:"string"}
print("A start codon is the site at which translation begins.")
print("The most common start codon is AUG.")


A start codon is the site at which translation begins.
The most common start codon is AUG.


In [None]:
#@title 2. How many nucleotides code for a single amino acid?
Answer = "3"  #@param [1,2,3,5,20]
print(3)

3


In [None]:
#@title 3. What are the 4 nucleobases (bases) in DNA? How about in RNA?
Answer = "ATCG , AUCG" #@param {type:"string"}
print("Adenine (A), cytosine (C), guanine (G), thymine (T) for DNA.")
print("Adenine (A), cytosine (C), guanine (G), uracil (U) for RNA.")

Adenine (A), cytosine (C), guanine (G), thymine (T) for DNA.
Adenine (A), cytosine (C), guanine (G), uracil (U) for RNA.


In [None]:
#@title 4. (BONUS) Why do you think there are multiple possible codons for single amino acids? (i.e., CCA, CCU, CCG, CCC all code for Proline)
Answer = "AAA" #@param {type:"string"}
print("Redundancy, so that many single mutations in a genetic sequence do not change the amino acid it ultimately gets translated to.")


Redundancy, so that many single mutations in a genetic sequence do not change the amino acid it ultimately gets translated to.
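
The redundancy idea can be checked with a few lines of Python. This is a small sketch using only a slice of the standard genetic code (the proline and leucine codons); the helper names `mutate` and `effect` are made up for this example.

```python
# A small slice of the standard genetic code (RNA codons).
CODONS = {
    "CCU": "Pro", "CCC": "Pro", "CCA": "Pro", "CCG": "Pro",
    "CUU": "Leu", "CUC": "Leu", "CUA": "Leu", "CUG": "Leu",
}

def mutate(codon, position, new_base):
    """Return a copy of the codon with one base replaced (position is 0, 1, or 2)."""
    return codon[:position] + new_base + codon[position + 1:]

def effect(original, mutated):
    """Say whether a single-base change alters the amino acid."""
    if CODONS[original] == CODONS[mutated]:
        return "silent"
    return "changes amino acid"

third_base_change = mutate("CCU", 2, "A")   # CCU -> CCA
second_base_change = mutate("CCU", 1, "U")  # CCU -> CUU

print(third_base_change, effect("CCU", third_base_change))
print(second_base_change, effect("CCU", second_base_change))
```

Changing the third base of a proline codon still gives proline, while changing the second base gives leucine instead.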


# How do viruses mutate?

Over time, virus genomes can mutate. Mutation can occur if, during replication, a mistake is made in copying the RNA or DNA of a viral particle. Some types of viruses can also swap genetic material with nearby viral particles, a process called recombination.  Sometimes a mutation "breaks" a virus and makes it unable to reproduce, in which case the lineage of virus with that mutation will die off. Sometimes a mutation "improves" a virus and makes it better at spreading or reproducing, in which case the mutation will probably stick around. Sometimes, mutations neither improve nor break a virus and are just the result of random events.

Here is an example of how the influenza virus has mutated/evolved in the last 100 years!
![](https://live.staticflickr.com/7155/6830073073_9f847b8273_z.jpg)


Over time, what started as a virus with one unique genome may mutate into several different strains of virus. This happens all the time with the influenza virus (the flu), and is the reason we need to make a new flu vaccine every year. Different regions of the world may also end up with different strains of a virus. You can read a little bit more about viral evolution [here](https://www.khanacademy.org/science/biology/biology-of-viruses/virus-biology/a/evolution-of-viruses).
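
Here is a rough Python sketch of how copying mistakes can pile up over generations. Everything in it is an assumption made for illustration: the `replicate` function, the short made-up genome, and especially the `error_rate`, which is far higher than a real virus's so that changes show up in just a few generations.

```python
import random

BASES = "ACGU"

def replicate(genome, error_rate, rng):
    """Copy an RNA genome, occasionally substituting a wrong base."""
    copy = []
    for base in genome:
        if rng.random() < error_rate:
            # Copying mistake: pick one of the three other bases.
            copy.append(rng.choice([b for b in BASES if b != base]))
        else:
            copy.append(base)
    return "".join(copy)

rng = random.Random(0)             # fixed seed so the run is repeatable
ancestor = "AUGGCUUACGGAUCCGAUAA"  # made-up sequence, not a real viral genome
lineage = ancestor
for generation in range(1, 6):
    lineage = replicate(lineage, error_rate=0.05, rng=rng)
    differences = sum(a != b for a, b in zip(ancestor, lineage))
    print(generation, lineage, "differences from ancestor:", differences)
```

Try a different seed or error rate: each run is like a separate lineage drifting away from the same ancestor.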


# SARS-CoV-2 Evolution and Ethics

Like most RNA viruses, SARS-CoV-2 is actively mutating, and there are thousands of different SARS-CoV-2 genomes (which we will refer to as *lineages* from here on) around the world. Researchers are constantly investigating whether any of these lineages are more or less contagious or dangerous. Several notable variants have been identified so far, including the B.1.1.7 (Alpha), B.1.351 (Beta), B.1.617.2 (Delta), and P.1 (Gamma) variants. These variants might swap certain nucleobases (A, U, C, or G) for others, referred to as a *substitution*, which can change the behavior of their spike proteins, or drop nucleobases in certain spots, referred to as a *deletion*. Mutations such as these may cause a variant to be more infectious or dangerous. You can read more information on [the CDC's website](https://www.cdc.gov/coronavirus/2019-ncov/variants/variant-classifications.html) about these "Variants of Concern".
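
To see what swapped bases and deletions look like in sequence data, here is a small Python sketch that compares a reference sequence to a variant. It assumes the two sequences have already been aligned to the same length, with `-` marking a dropped base in the variant. The function name `describe_mutations` and both sequences are made up for illustration and are not real SARS-CoV-2 mutations.

```python
def describe_mutations(reference, variant):
    """List the differences between two aligned sequences of equal length.

    A '-' in the variant marks a base that was deleted.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    mutations = []
    for position, (ref_base, var_base) in enumerate(zip(reference, variant), start=1):
        if var_base == "-":
            mutations.append(f"deletion of {ref_base} at position {position}")
        elif var_base != ref_base:
            mutations.append(f"substitution {ref_base}->{var_base} at position {position}")
    return mutations

reference = "AUGUUUGUUC"  # made-up reference
variant = "AUGCUU---C"    # one swapped base and a three-base deletion

for mutation in describe_mutations(reference, variant):
    print(mutation)
```

Real genomes are tens of thousands of bases long and need a proper alignment step first, but the comparison idea is the same.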

We can gain important knowledge by looking at the different SARS-CoV-2 genomes. Primarily, we can make a good guess as to what region of the world a specific SARS-CoV-2 lineage came from. In this project, **we will be building a model that will predict which region of the world a SARS-CoV-2 lineage comes from based on its genome sequence**. Before we get started, we should talk about the ethical implications of this.

### **Exercise: Discuss the following questions.**
*These will be important points to address in your presentation later on!*

In [None]:
#@title 1. How could a model that predicts the origin of a SARS-CoV-2 lineage help us with contact tracing and suppression?
Response = "" #@param {type:"string"}
print("Such a model could help us a) detect high-risk areas, b) identify the source of initial cases so it")
print("can be contained to prevent future infections, and c) better understand the transmission chain.")

In [None]:
#@title 2. What could be some negative consequences of having such a model? What happens if the model is not 100% correct? (Most models are not perfect!)
Response = "" #@param {type:"string"}
print("The model could be inaccurate, which would lead to a false sense of security and a misallocation")
print("of already limited resources. Genomic data is also sensitive, so having such a model")
print("raises some ethical concerns about how our data could potentially be mishandled.")
print("There are lots of other potential negative consequences too!")

In [None]:
#@title 3. What are some ways we could mitigate some of these negative consequences?
_1_ = "" #@param {type:"string"}
_2_ = "" #@param {type:"string"}
_3_ = "" #@param {type:"string"}
print("1. Ensure a stringent approval process to access genomic data and limit access to researchers.")
print("2. Double-check all findings from the model against other intuitions and past genomic trends.")
print("3. Train your model on more data to improve its accuracy and limit overfitting.")
print("There are lots of other ways too!")

# Wrapping up

***Great job!*** That was a lot of biology that got thrown at you!

![](https://i.chzbgr.com/full/1865001216/h896D6369/phew)



In [None]:
#@title ####**Exercise: To wrap up, what are two questions you still have about virology/genomics? Ask your group/instructor!**
Question_1 = "" #@param {type:"string"}
Question_2 = "" #@param {type:"string"}
