# What do we know about virus genetics, origin, and evolution

![virusgenetics.png](attachment:virusgenetics.png)

# Findings
What do we know about virus genetics, origin, and evolution? What do we know about the virus origin and management measures at the human-animal interface?

<a href="#1.-Real-time-tracking-of-whole-genomes-and-a-mechanism-for-coordinating-the-rapid-dissemination-of-that-information-to-inform-the-development-of-diagnostics-and-therapeutics-and-to-track-variations-of-the-virus-over-time">1. Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time</a>

<a href="#2.-Access-to-geographic-and-temporal-diverse-sample-sets-to-understand-geographic-distribution-and-genomic-differences,-and-determine-whether-there-is-more-than-one-strain-in-circulation">2. Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation</a>

<a href="#3.-Evidence-that-livestock-could-be-infected-(e.g.,-field-surveillance,-genetic-sequencing,-receptor-binding)-and-serve-as-a-reservoir-after-the-epidemic-appears-to-be-over">3. Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over</a>

<a href="#4.-Animal-host(s)-and-any-evidence-of-continued-spill-over-to-humans">4. Animal host(s) and any evidence of continued spill-over to humans</a>

<a href="#5.-Socioeconomic-and-behavioral-risk-factors-for-this-spill-over">5. Socioeconomic and behavioral risk factors for this spill-over</a>

<a href="#6.-Sustainable-risk-reduction-strategies">6. Sustainable risk reduction strategies</a>

# Technical Overview

In [None]:
!pip install git+https://github.com/dgunning/cord19.git

In [None]:
from cord import ResearchPapers
from cord.tasks import Tasks
from cord.core import image, get_docs
import pandas as pd
pd.options.display.max_colwidth = 300  # show long titles and abstracts without truncation

### Load the ResearchPapers

The research can be loaded with index="**text**" or index="**abstract**". With index="**abstract**", the **BM25** index is built from the metadata abstracts. The option index="text" takes a little longer (around 4 minutes on Kaggle) but is more accurate. For now we use index="abstract" for three reasons: much of the research and paper identification was curated offline with the more powerful text index, the abstract index is still decent, and some of the functionality uses a second, very powerful index built from the **Specter embeddings**.

In [None]:
papers = ResearchPapers.load()
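
To make the BM25 ranking mentioned above a bit more concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the code the `cord` library uses: the function name `bm25_scores`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the toy abstracts are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative sketch)."""
    # Naive tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: how many documents contain each term.
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length with b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy abstracts, invented for illustration only.
abstracts = [
    "genomic tracking of coronavirus variants over time",
    "economic effects of school closures",
    "coronavirus receptor binding in livestock",
]
scores = bm25_scores("coronavirus genomic tracking", abstracts)
best = max(range(len(abstracts)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The same scoring works on abstracts or on full text. The full-text index simply has more terms per document to match against, which is one reason it can be more accurate and slower to build.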

# Search Strategy

For each of the topic questions in the task we try to identify what we call a **seed question**. This could be a reframing of the original question to remove ambiguity, expand synonyms, or use our understanding of what the Task author is trying to get at. If the question is adequate we use it as the SeedQuestion. Each Task then has a Question and a SeedQuestion, and the SeedQuestion is what we use to start a search.
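
The Question and SeedQuestion pairing can be pictured as a small data structure. The sketch below is hypothetical: `TopicQuestion`, its fields, and `search_text` are names invented here, not the `cord` library's classes, and the seed wording is an illustrative rewording rather than the one used in this notebook.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TopicQuestion:
    """Hypothetical stand-in for one topic in a task (not the cord library's class)."""
    question: str
    seed_question: Optional[str] = None

    def search_text(self) -> str:
        # Use the curated seed question when there is one, otherwise the original question.
        return self.seed_question or self.question


topic = TopicQuestion(
    question="Animal host(s) and any evidence of continued spill-over to humans",
    # Illustrative rewording only; the real seed questions live in Tasks.GeneticOrigin.
    seed_question="animal host reservoir of SARS-CoV-2 and spillover to humans",
)
print(topic.search_text())
```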

For some topics neither the Question nor the SeedQuestion seems to get good results. In this case we use some other source for the search: for example, we might take a paragraph from another document or web page related to the topic and search with that. That returns research papers in the vector space we are interested in. Once we have located that vector space, we use the similarity index (using functions like **papers.similar_to**) to find papers in it that seem to fit the topic.
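
As a rough picture of what finding papers in the same vector space means, the sketch below ranks papers by cosine similarity between embedding vectors. It is only an assumption about the general technique, not the implementation behind **papers.similar_to**: the helper names, the paper ids, and the 3-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(target_id, embeddings, top_n=2):
    """Return the ids of the top_n papers closest to target_id, best first."""
    target = embeddings[target_id]
    ranked = sorted(
        (pid for pid in embeddings if pid != target_id),
        key=lambda pid: cosine_similarity(target, embeddings[pid]),
        reverse=True,
    )
    return ranked[:top_n]


# Made-up 3-dimensional vectors; real document embeddings have hundreds of dimensions.
toy_embeddings = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.8, 0.2, 0.1],
    "paper_c": [0.0, 0.1, 0.9],
    "paper_d": [0.1, 0.9, 0.2],
}
neighbours = most_similar("paper_a", toy_embeddings)
print(neighbours)
```

Starting from one good paper and walking to its nearest neighbours is the step that lets a human steer the search once a promising region has been found.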

In other cases, if we know of some authorities on a given topic, say an author or an organization, we may search by author name or organization name, then find similar papers.

The point of this search strategy is that it is a **human-guided, AI-powered document discovery** process, fronted by a really easy-to-use user interface and functions inside a Jupyter notebook. It would be easy to automate this, but the whole point of these CORD-19 notebooks is to find solutions to the most pressing health emergency in 100 years, so AI gimmicks take a back seat.

### Task Overview

In [None]:
Tasks.GeneticOrigin

#### 100 µs simulation of the main protease of SARS-CoV-2, a critical enzyme for the maturation of viral particles and a potential target for antiviral drugs, started from the apo enzyme structure determined by X-ray crystallography (PDB entry 6Y84). 

Courtesy of @DEShawResearch

In [None]:
image('../input/cord-images/covidprotease.GIF')

# Findings

## 1. Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time


In [None]:
papers.search_2d(Tasks.GeneticOrigin[1].SeedQuestion)

**MicroReact** and **Nextstrain** are genomic visualization platforms that "seamlessly connect the different steps of increasingly complex analyses and visualization of genomic epidemiology and phylodynamics". This quote comes from **Advances in Visualization Tools for Phylogenomic and Phylodynamic Studies of Viral Diseases (2019)**. The paper goes on to say that "*enhancing collaborations and dissemination of visualizations is increasingly achieved through sharing of online resources for hosting annotated tree reconstructions (17), online workspaces (18) and continuously updated pipelines that accommodate increasing data flow during infectious disease outbreaks (19)*".

- [Nextstrain Dashboard](https://nextstrain.org/ncov/global/2020-04-09)
- [MicroReact Dashboard](https://microreact.org/project/COVID-19)

IMHO, the best place for COVID genetic tracking is [Nextstrain](https://nextstrain.org). For real-time updates follow [@nextstrain](https://twitter.com/nextstrain?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor).

MicroReact is developed by the [Centre for Genomic Pathogen Surveillance](https://www.pathogensurveillance.net/).

### Papers for this topic

In [None]:
papers.display('kfbrar54', 'nm4dx5pq', 'qed8hayx', 'yb6if23t', 'abqrh2aw', 'm2iiswan', 'th1da1bb', 'afcgqjwq', 'r5te5xob', 'dao10kx9', 'ry9wpcxo', 'xetzg7gp', 'szg12wfa', '1qkwsh6a', 'jmrg4oeb', '20zr7mtt')

Since the paper **Advances in Visualization Tools for Phylogenomic and Phylodynamic Studies of Viral Diseases** seems to be a good resource, we can look at similar papers.

In [None]:
papers.similar_to('nm4dx5pq')

### References

[PhyloGeoTool: interactively exploring large phylogenies in an epidemiological context](https://www.ncbi.nlm.nih.gov/pubmed/28961923)

[PhyloGeoTool on github](https://github.com/rega-cev/phylogeotool)

[MicroReact Genetic Visualization](https://microreact.org/showcase)

[Nextstrain situation report](https://nextstrain.org/narratives/ncov/sit-rep/2020-04-10)

[How to interpret phylogenetic trees](https://nextstrain.org/narratives/trees-background/)

[@PathogenWatch](https://twitter.com/Pathogenwatch)

[@SangerInstitute](https://twitter.com/sangerinstitute)

[Sanger Institute](https://www.sanger.ac.uk/)

### A transmission tree (courtesy of Nextstrain)

In [None]:
image('/kaggle/input/cord-images/transmissiontree.png')

## 2. Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation

In [None]:
papers.search_2d(Tasks.GeneticOrigin[2].SeedQuestion)

This topic overlaps with the previous one, except that it focuses on access to and the sharing of the genetic samples, without which the genomic platforms would not work. 

From @nextstrain: "*Nextstrain only uses sequenced **SARS-CoV-2** data - samples where the genetic material has been taken. Not all samples are sequenced. A country may have many #hCoV19 cases, but may not have sequenced any of them*."

The data goes to Nextstrain courtesy of [GISAID](https://www.gisaid.org/), so country and regional researchers have to sequence their samples and then supply the sequences to **GISAID**.
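
Sequencing matters for the "more than one strain" question because strains are told apart by differences between aligned genomes. As a toy sketch of that idea, the snippet below lists the positions where two aligned sequences differ. The function name and the two short sequences are invented for illustration. Real genomes are about 30,000 bases long and need a proper alignment step first.

```python
def differing_sites(seq_a, seq_b):
    """Return the 0-based positions where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]


# Invented toy sequences, not real SARS-CoV-2 data.
reference = "ATGGCTAGCT"
sample = "ATGACTAGTT"

sites = differing_sites(reference, sample)
print(sites, len(sites))
```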

### The National Center for Biotechnology Information
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals. The Entrez system provides search and retrieval operations for most of these data from 39 distinct databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets.
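
Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds an ESearch URL and does not send the request. The helper name `esearch_url`, the choice of the `nucleotide` database, and the search term (taxonomy id 2697049, which we take to be SARS-CoV-2) are assumptions for illustration, so check them against the NCBI documentation before relying on them.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL; fetching it returns the ids of matching records."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Assumed query: nucleotide records for taxonomy id 2697049 and its descendants.
url = esearch_url("nucleotide", "txid2697049[Organism:exp]")
print(url)
```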

### European Nucleotide Archive
The European Nucleotide Archive (ENA) provides a comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.

### Papers related to this topic

In [None]:
papers.display('vd35a2eq', '0mobdg2p', 'dblrxlt1', 'hyrzder6', 'aeogp8c7', 'w67z5qof', 'cuyrw4nc', 't93xjcvm', '9t73wadp', 'kz5udher', '60vrlrim', 'ca6pff0p')

### The Nagoya Protocol

The sub-question of this topic was whether the Nagoya Protocol could be leveraged. Not quite, according to **GISAID**, who say:

"*Needless to say, ‘access and benefit sharing’ as developed under the Convention on Biodiversity (CBD) is oriented towards **extremely different goals** than the **exchange of pathogens that the world shares for scientific purposes and emergency response**, but would probably rather see eradicated. **The exchange of pathogens involves time pressure**, and global cooperation, neither of which is typical to the exchange of genetic resources envisioned under the CBD*" 

Further

"*While this Report strongly suggests the Nagoya Protocol provides ‘an opportunity to advance public health’, it does not consider the significant risks and that indeed there have already been significant delays in the sharing of seasonal influenza virus samples associated with implementation of the Nagoya Protocol. In late 2018 alone, several cases involving delays in sharing influenza viruses emerged, comprising national influenza centers in Southeast Asia and South America with a long-standing record of timely sharing as required under the terms of reference in the Global Influenza Surveillance and Response System (GISRS). Those national influenza centers found themselves having to **delay the sharing of influenza** viruses **due to conflict with national legislation** on ABS arising from the recent implementation of the NP and consequently missed the timing for the seasonal vaccine composition meeting*"

[GISAID's Comments on the WHO Report of The Public Health Implications of Implementation of the Nagoya Protocol](https://www.gisaid.org/references/statements-clarifications/who-report-on-the-public-health-implications-of-nagoya-protocol-13-may-2019/)

### References

[GISAID](https://www.gisaid.org/)

[Submit Data to GISAID](https://www.gisaid.org/epiflu-applications/submitting-data-to-epiflutm/)

[European Nucleotide Archive](https://www.ebi.ac.uk/ena)

[National Center for Biotechnology Information](https://www.ncbi.nlm.nih.gov/)

## 3. Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over
- Evidence of whether farmers are infected, and whether farmers could have played a role in the origin.
- Surveillance of mixed wildlife-livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia.
- Experimental infections to test host range for this pathogen.

In [None]:
papers.search_2d(Tasks.GeneticOrigin[3].SeedQuestion)

In [None]:
papers.display('c31amc2q', 'k9rjvtcy', 'xwx9w9fi', 'xuczplaf', 'dblrxlt1', 'p1jbb1wa', 'w296pll9', 'mugq630z', 'tjdxn29l', 'ljllvlrd', '4ko557n1', 'srq1bo2v', '2inlyd0t', 'jjbez46k', 'h5sox8bq', 'lfndq85x')

## 4. Animal host(s) and any evidence of continued spill-over to humans

This topic overlaps with the previous topic, but I will interpret this one to be related to animal hosts generally, and not specific to livestock or domesticated animals with which humans will have continual close contact. Nevertheless the papers will overlap.



In [None]:
papers.search_2d(Tasks.GeneticOrigin[4].SeedQuestion)

In [None]:
papers.display('k9rjvtcy', '1qkwsh6a', 'rxrlbw60', 'q8im1agz', 'ba8zx73b', 'he853mwa', '49oco16h', 'dblrxlt1', 'njundv6l')

## 5. Socioeconomic and behavioral risk factors for this spill-over


In [None]:
papers.search_2d(Tasks.GeneticOrigin[5].SeedQuestion)

In [None]:
papers.display('k9rjvtcy', 'c31amc2q', '65b267ic', 'k9ygdhqg', 'xz0385np', '6h5393o9', 'brhvfsgy', '8a1cia8s', 'he853mwa')

In [None]:
papers.similar_to('c31amc2q')

## 6. Sustainable risk reduction strategies

In [None]:
papers.search_2d(Tasks.GeneticOrigin[6].SeedQuestion)

In [None]:
papers.display('ljllvlrd', 'd5l60cgc', 'b6kx9nnb', '7gs54990', 'p1jbb1wa', '971d0sir', '89fol3pq', 'oxs4o9xe', 'xuczplaf')

## Related work

This notebook is a derivative of the popular [CORD Research Engine with BM25 and Specter Embeddings](https://www.kaggle.com/dgunning/cord-research-engine-bm25-specter-embeddings). The code is maintained on [GitHub](https://github.com/dgunning/cord19).

In [None]:
get_docs('DesignNotes')

In [None]:
get_docs('SearchStrategy')

In [None]:
get_docs('Roadmap')

In [None]:
get_docs('References')