## Goals
This notebook aims to clarify which animal models are being used to test which therapeutics. The question is relevant to three subtasks under the task "What do we know about vaccines and therapeutics?":
* Exploration of use of best animal models and their predictive value for a human vaccine.
* Efforts to develop animal models and standardize challenge studies
* Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models [in conjunction with therapeutics] 

## Contribution
Most Kaggle contributions on therapeutics have focused narrowly on effects in humans. We believe there is value in looking at potential animal models as the body of research grows and more treatments reach the clinical trial stage. To examine the distribution of animal models, you first need a list of animals and a list of therapeutics. This notebook uses not only data from CORD-19 but also a high-quality, publicly available knowledge graph extracted from CORD-19 and the WHO's published appendix of therapeutics. We believe these two data sources generalize well to tasks beyond the scope of this notebook, and we hope they are adopted more widely.

## Acknowledgements

1. Species named entities extracted from the CORD-19 dataset, part of a larger knowledge graph built by Qingyun Wang, Heng Ji, Jiawei Han, Shih-Fu Chang, and Kyunghyun Cho as a collaboration between UIUC, Columbia, and NYU. The full knowledge graph was originally published on the [Illinois CS website](http://blender.cs.illinois.edu/covid19/) and is now available on Kaggle as a [dataset](https://www.kaggle.com/yitongtseo/cord19-named-entities) with permission of the authors.
2. Therapeutics appendix from the WHO. Publicly available as a [PDF](https://www.who.int/blueprint/priority-diseases/key-action/Table_of_therapeutics_Appendix_17022020.pdf), and made available on Kaggle as a well-formatted [JSON dataset](https://www.kaggle.com/yitongtseo/whotherapeutics).
3. Code for CORD-19 JSON loading and data pre-formatting written by [Maria and Gtteixeira](https://www.kaggle.com/maria17/cord-19-explore-drugs-being-developed).




## 1. Import [Extracted Species Entities](http://blender.cs.illinois.edu/covid19/) from CORD-19

In [None]:
#! /bin/env python3

import json
import os
import pprint
from dataclasses import dataclass
from typing import List
from enum import Enum
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.lancaster import LancasterStemmer


class EntityType(Enum):
    CELL_LINE = 'CellLine'
    MUTATION = 'Mutation'
    SPECIES = 'Species'
    GENUS = 'Genus'
    STRAIN = 'Strain'
    GENE = 'Gene'
    DOMAIN_MOTIF = 'DomainMotif'
    CHEMICAL = 'Chemical'
    DISEASE = 'Disease'

class SectionType(Enum):
    TITLE = 'title'
    ABSTRACT = 'abstract'
    INTRODUCTION = 'introduction'
    METHODS = 'methods'
    CONCLUSION = 'conclusion'
    RESULTS = 'results'
    DISCUSSION = 'discussion'
    REFERENCES = 'references'
    OTHER = 'other'

@dataclass
class Location:
    offset: int
    length: int

@dataclass
class Entity:
    type: EntityType
    text: str
    location: List[Location]
    source: SectionType

@dataclass
class Section:
    text: str
    section_type: SectionType

@dataclass
class Paper:
    id: str
    _id: str
    entities: List[Entity]
    sections: List[Section]
    

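def resolveSectionType(section_str: str) -> SectionType:
    # Return the first SectionType whose value occurs in the section name.
    # A substring test is used because str.find() returns -1 (truthy) when
    # there is no match and 0 (falsy) when the match is at the start.
    section_lower = section_str.lower()
    for st in SectionType:
        if st.value in section_lower:
            return st
    return SectionType.OTHER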

        
paper_limit = float("inf")
current_index = 0

processed_papers:List[Paper] = []
for dirname, _, filenames in os.walk('/kaggle/input/cord19-named-entities/entities/pmcid'):
    for filename in filenames:
        current_index += 1
        if current_index > paper_limit:
            break
        with open(os.path.join(dirname, filename), 'r') as f:
            data = json.load(f)
            title = abstract = None
            entities = []
            sections = []
            for passages in data['passages']:
                section_type = resolveSectionType(passages['infons']['section'])
                sections.append(
                    Section(
                        passages['text'],
                        section_type,
                    )
                )
                for extracted_entity in passages['annotations']:
                    entities.append(
                        Entity(
                            EntityType(extracted_entity['infons']['type']),
                            extracted_entity['text'],
                            [Location(l['offset'], l['length']) for l in extracted_entity['locations']],
                            section_type,
                        )
                    )
            processed_papers.append(
                Paper(
                    data['id'],
                    data['_id'],
                    entities,
                    sections,
                )
            )

print('--data loaded--')
print('# of papers: ', len(processed_papers))
entity_number = sum([
    len(paper.entities)
    for paper in processed_papers
])
print('# of entities: ', entity_number)

In [None]:
# Filter for papers that specifically mention model organisms
model_organism_keywords = ['model organism', 'animal model']
model_organism_papers = [
    paper
    for paper in processed_papers
    if any(
        any(
            phrase in section.text
            for phrase in model_organism_keywords
        )
        for section in paper.sections
    )
]
print(len(model_organism_papers), 'papers mention model organisms')


In [None]:
# Stem and tokenize words
# Stemming is a bit over aggressive... (e.g. 'human' -> 'hum')
# But it's alright because we maintain the same stemming to
# generate the word2vec model later, and we create
# species_lookup to map back later.
def normalize_animal(ls, animal):
    return ' '.join([ls.stem(word.lower()) for word in word_tokenize(animal)])
    

entities_by_paper = [
    set([(e.text, e.type) for e in entities])
    for entities in [
        paper.entities
        for paper in model_organism_papers
    ]
]
species = [
    entity[0]
    for entity_set in entities_by_paper 
    for entity in entity_set
    if entity[1] == EntityType.SPECIES
]


ls = LancasterStemmer()
normalized_species = [normalize_animal(ls, a) for a in species]
species_lookup = {
    normalize_animal(ls, a): a
    for a in species
}
count_lookup = {
    animal: normalized_species.count(animal)
    for animal in set(normalized_species)
}
counts = [
    (a, species_lookup[a], count_lookup[a])
    for a in set(normalized_species)
]
most_common_species = sorted(
    counts, 
    key=lambda s: s[2], 
    reverse=True
)

print('Total of', len(species), 'species mentions tallied')
most_common_species[:20]

In [None]:
import re

# This is somewhat clumsy ... :^/, unfortunately it's the best I've got
def filter_non_model_animals(species):
    filtered_species = []
    for animal_tuple in species:
        normalized_name, name, count = animal_tuple
        # Filter out acronyms (e.g. "HIV", "MERS")
        if re.search('[A-Z]{2}', name):
            continue
        # Filter out viruses
        if 'virus' in name.lower():
            continue
        # Filter out names containing a digit (e.g. "H5N1", "50s")
        if re.search(r'\d', name):
            continue
        # Filter out names containing a "-" (e.g. "Mers-Cov", "HCoV-HKU1")
        if '-' in normalized_name:
            continue
        if name in ['NiV', 'SeV', 'CoV', 'HeV', 'Mtb']:
            continue
        filtered_species.append(animal_tuple)
    return filtered_species
  
# Take the top 1K
most_common_species = filter_non_model_animals(most_common_species)[0:1000]
most_common_species[0:20]

## 2. Import Therapeutics from WHO's [Public Appendix](https://www.who.int/blueprint/priority-diseases/key-action/Table_of_therapeutics_Appendix_17022020.pdf)

In [None]:
with open('../input/whotherapeutics/therapeutics.json') as json_data:
    therapeutic_data = json.load(json_data)
therapeutics = []
for dictionary in therapeutic_data:
    therapeutics.extend(dictionary['product_type'])
therapeutics[:10]


## 3. Load & pre-format CORD-19 JSON ([source](https://www.kaggle.com/maria17/cord-19-explore-drugs-being-developed))

Import the necessary libraries.  

In [None]:
# !pip install ipywidgets matplotlib pandas spacy tqdm
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bc5cdr_md-0.2.4.tar.gz

In [None]:
%%bash -e
if ! [[ -f ./xyz2mol.py ]]; then
  wget https://raw.githubusercontent.com/jensengroup/xyz2mol/master/xyz2mol.py
fi

In [None]:
!pip install py3Dmol
!pip install -U chembl_webresource_client
import sys
!conda install --yes --prefix {sys.prefix} -c rdkit rdkit

In [None]:
import glob
import json
import pandas as pd
import pickle
import spacy
from spacy import displacy
from spacy.matcher import Matcher
from tqdm import tqdm
import en_ner_bc5cdr_md
import os
from collections import Counter
import matplotlib.pyplot as plt
from chembl_webresource_client.new_client import new_client
import rdkit
from rdkit import Chem
from rdkit.Chem import Draw
import py3Dmol # Amazing library for 3D visualization
from rdkit import Chem
from rdkit.Chem import AllChem
from ipywidgets import interact, interactive, fixed
from IPython.display import Image
import cv2
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets
from sklearn.cluster import KMeans
import sys



Load and Clean Data

In [None]:
def doi_to_url(doi):
    if isinstance(doi, float):
        return None
    elif doi.startswith('http'):
        return str(doi)
    elif doi.startswith('doi'):
        return 'https://' + str(doi)
    else:
        return 'https://doi.org/' + str(doi)

df_meta = pd.read_csv('../input/CORD-19-research-challenge/metadata.csv')
df_meta['url'] = df_meta.doi.apply(doi_to_url)

Load JSON Data

In [None]:
data_path = '../input/CORD-19-research-challenge'
json_files = glob.glob(f'{data_path}/**/**/*.json', recursive=True)
# Limit the JSON files, 208388 is too many...
json_files = json_files[:50000]

def to_covid_json(json_files):
    jsonl = []
    for file_name in tqdm(json_files):
        row = {"doc_id": None, "title": None, "abstract": '', "body": None}

        with open(file_name) as json_data:
            data = json.load(json_data)

            row['doc_id'] = data['paper_id']
            row['title'] = data['metadata']['title']
            
            abstract = ''
            if 'abstract' in data:
                abstract_list = [
                    abst['text']
                    for abst in data['abstract']
                ]
                abstract = "\n".join(abstract_list)
                row['abstract'] = abstract

            # And lastly the body of the text. 
            body_list = [bt['text'] for bt in data['body_text']]
            body = "\n".join(body_list)
            row['body'] = body
            
        jsonl.append(row)
    
    return jsonl
    

def get_data():
    try:
        with open('df_cache.pickle', 'rb') as f:
            unpickler = pickle.Unpickler(f)
            # if file is not empty scores will be equal
            # to the value unpickled
            df = unpickler.load()
    except (FileNotFoundError, EOFError):
        df = pd.DataFrame(to_covid_json(json_files))
        with open('df_cache.pickle', 'wb') as f:
            pickle.dump(df, f)
    return df

df = get_data()
print(df.shape)
df.head(3)

In [None]:
df.describe()

All document IDs are unique, nothing to tidy up. But there seem to be missing titles, abstracts and possibly missing bodies.

In [None]:
def no_title(row):
    return not row.title.strip()

def no_abstract(row):
    return not row.abstract or not row.abstract.strip()

def no_body(row):
    return not row.body.strip()

def no_title_abstract_body(row):
    return no_title(row) and no_abstract(row) and no_body(row)

mask = df.apply(no_title_abstract_body, axis=1)
print('Number of articles that have no text data at all:', df.loc[mask].shape[0])

In [None]:
# insert missing values for empty strings
df.loc[df.apply(no_title, axis=1), 'title'] = None
df.loc[df.apply(no_abstract, axis=1), 'abstract'] = None
df.loc[df.apply(no_body, axis=1), 'body'] = None
df.head(3)

In [None]:
print('Missing value counts by column')
len(df) - df.count()

We will be working with abstracts. They provide an appropriate level of detail for the question at hand. Thus, we will drop all documents that do not have an abstract.

In [None]:
import gc
gc.collect()

df = df.dropna(subset=['abstract'])


In [None]:
covid19_names = {
    'COVID19',
    'COVID-19',
    '2019-nCoV',
    '2019-nCoV.',
#     'novel coronavirus',  # too ambiguous, may mean SARS-CoV
    'coronavirus disease 2019',
    'Corona Virus Disease 2019',
    '2019-novel Coronavirus',
    'SARS-CoV-2',
    'covid',
    'coronavirus',
}

def has_covid19(text):
    for name in covid19_names:
        if text and name.lower() in text.lower():
            return True
    return False

df['title_has_covid19'] = df.title.apply(has_covid19)
df['abstract_has_covid19'] = df.abstract.apply(has_covid19)
# df['body_has_covid19'] = df.body.apply(has_covid19)
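# Copy the filtered slice so that adding the 'all' column below does not
# trigger pandas' SettingWithCopyWarning.
df_covid19 = df[df.title_has_covid19 | df.abstract_has_covid19].copy()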


df_covid19['all'] = df_covid19[['title','abstract']].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)


## 4. Train Word2Vec model from CORD-19 Titles & Abstracts

In [None]:

from __future__ import unicode_literals

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
%matplotlib inline

import gensim


from gensim.models import word2vec
from gensim import corpora, models, similarities


from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
import re

import nltk
from nltk.corpus import brown
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer


In [None]:
data = df_covid19['all'] #13344 total

texts_tokenized = [[word.lower() for word in word_tokenize(document)] for document in data ]

# remove stopwords
english_stopwords = stopwords.words('english')
texts_filtered_stopwords = [[word for word in document if not word in english_stopwords] for document in texts_tokenized]

# remove punctuations
punctuations = [',','.',':',';','?','(',')','[',']','&','!','*','@','#','$','%']
texts_filtered = [[word for word in document if not word in punctuations] for document in texts_filtered_stopwords]

# stem
st = LancasterStemmer()
texts_stemmed = [[st.stem(word) for word in document] for document in texts_filtered]

# size: dimensionality of the feature vectors.
# window: maximum distance between the current and predicted word within a sentence.
# min_count: ignore all words with total frequency lower than this.
# workers: number of worker threads used for training (faster on multicore machines).
# iter: number of iterations (epochs) over the corpus. Default is 5.
# Note: `size` and `iter` are the gensim 3.x parameter names; gensim 4
# renamed them to `vector_size` and `epochs`.
# Note: the model is trained on texts_filtered (unstemmed tokens);
# texts_stemmed is computed above but is not passed to Word2Vec here.
model = gensim.models.Word2Vec(texts_filtered, size=1000, window=10, min_count=5, workers=5, iter=20)

# RESULTS!

## 5. Therapeutics by Effectiveness
By calculating each therapeutic's similarity with positive words ("effective", "preventive", "successful", "controlled") in our word2vec embedding, we can score their perceived effectiveness.
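
The scoring idea can be illustrated without a trained model. The sketch below is a minimal, standard-library illustration, assuming that comparing two single words reduces to the plain cosine similarity of their vectors. The vectors and the names `drug_a` and `drug_b` are invented for illustration and do not come from CORD-19 or the WHO list.

```python
import math

# Hypothetical 3-dimensional embeddings, invented purely for illustration.
toy_vectors = {
    "effective": [0.9, 0.1, 0.0],
    "successful": [0.8, 0.2, 0.1],
    "drug_a": [0.7, 0.3, 0.0],
    "drug_b": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_table(terms, columns, vectors):
    """One row per term, one similarity score per positive word."""
    return {
        term: {col: cosine(vectors[term], vectors[col]) for col in columns}
        for term in terms
    }

table = similarity_table(["drug_a", "drug_b"], ["effective", "successful"], toy_vectors)

# Rank terms by their similarity to "effective", highest first.
ranked = sorted(table, key=lambda term: table[term]["effective"], reverse=True)
print(ranked)
```

A high score only means the term appears in contexts similar to those of the positive word, so it is a proxy for perceived effectiveness rather than evidence of efficacy.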

In [None]:
def clean_input_tensor(model, tensor):
    tensor = {
        ' '.join([r.lower() for r in row])
        for row in tensor
        if all(r.lower() in model.wv.vocab for r in row)
    }
    return [row.split(' ') for row in tensor]

def get_similarity_matrix(model, rows, columns, row_name):
    rows = clean_input_tensor(model, rows)
    columns = clean_input_tensor(model, columns)
    matrix = [
        [' '.join(row)] + [
            model.wv.n_similarity(row, col)
            for col in columns
        ]
        for row in rows
    ]
    return pd.DataFrame(matrix, columns = [row_name] + [' '.join(c) for c in columns])  

columns = [['effective'], ['preventive'],['successful'],['controlled']]
rows = [[t] for t in therapeutics]
therapeutic_similarties_df = get_similarity_matrix(model, rows, columns, 'therapeutic')
therapeutic_similarties_df.sort_values(by='effective', ascending=False)

# RESULTS!
## 6. Model Organisms by Therapeutic
### Which therapeutics have been tested with which model organisms?

By calculating the similarity between each therapeutic and each animal model in our word2vec embedding, we can determine which drugs are associated with which animal models, and hopefully establish which drugs have been tested with which animals :)
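
As a rough sketch of the ranking step, the example below takes a table of similarity scores, keeps the top-scoring animals for each therapeutic, and blanks out any entry at or below a proximity threshold. The scores, the names `drug_a` and `drug_b`, and the helper `top_animals` are all hypothetical and exist only to show the mechanics.

```python
# Hypothetical similarity scores (therapeutic -> animal -> score),
# invented for illustration only.
toy_similarity = {
    "drug_a": {"mouse": 0.62, "ferret": 0.35, "camel": 0.05},
    "drug_b": {"mouse": 0.10, "ferret": 0.48, "camel": 0.27},
}

def top_animals(similarity, top_k=2, proximity_threshold=0.2):
    """For each therapeutic, list the top_k animals by score.

    Animals whose score does not exceed proximity_threshold are replaced
    by an empty string so that weak associations are not reported.
    """
    result = {}
    for therapeutic, animal_scores in similarity.items():
        ranked = sorted(
            animal_scores.items(),
            key=lambda pair: pair[1],
            reverse=True,
        )[:top_k]
        result[therapeutic] = [
            animal if score > proximity_threshold else ""
            for animal, score in ranked
        ]
    return result

associations = top_animals(toy_similarity)
print(associations)
```

Raising `proximity_threshold` trades recall for precision: fewer spurious pairings are shown, at the cost of hiding weaker but genuine ones.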

In [None]:
# Keep the organisms most similar to ['animal', 'model', 'organism'] to filter out irrelevant organisms
processed_species = [species[0].lower().split(' ') for species in most_common_species]
model_organism_measure = [
    (animal_tokens, model.wv.n_similarity(animal_tokens, ['animal', 'model', 'organism']))
    for animal_tokens in processed_species
    if all(token in model.wv.vocab for token in animal_tokens)
]
model_organisms = sorted(
    model_organism_measure, 
    key=lambda s: s[1], 
    reverse=True
)[0:200]
# I'm not sure  how well this filter works...
model_organisms[0:5]

In [None]:

columns = [[t] for t in therapeutics]
rows = [animal[0] for animal in model_organisms]
organism_matrix_df = get_similarity_matrix(model, rows, columns, 'animal')


def create_sorted_matrix(organism_matrix_df, species_lookup, proximity_threshold=0.2):
    animal_lookup = None
    sorted_animal_models = {}
    # DataFrame.items() replaces iteritems(), which newer pandas versions removed.
    for column, rows in organism_matrix_df.items():
        if column == 'animal':
            animal_lookup = rows
            continue
        sorted_model_organisms = sorted(
            [
                (r[1], animal_lookup[r[0]])
                for r in enumerate(rows)
            ], 
            key=lambda s: s[0], 
            reverse=True
        )[0:20]
        sorted_animal_models[column] = [
            species_lookup[o[1]] if o[0] > proximity_threshold else ''
            for o in sorted_model_organisms 
        ]
    return pd.DataFrame.from_dict(sorted_animal_models)

create_sorted_matrix(organism_matrix_df, species_lookup)


## Verification of Results

A quick literature search suggests that most results are meaningful and can be read directly as "this drug has been tested with this animal" :)

|Drug|Model Organism|Source|
|---|---|---|
|Ritonavir|Marmosets|[source](https://www.ncbi.nlm.nih.gov/pubmed/26198719)|
|ifn-α|Ascaris|[source](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3080889/)|
|ifn-β|Ascaris|[source](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3080889/)|
|Hydroxychloroquine|Mice|[source](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6417644/)| 
|Corticosteroids|Humans|[source](https://www.ncbi.nlm.nih.gov/pubmed/29607554)|
|Kaletra|Mice|[source](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4524660/)|

## Further Improvement:

* Drugs which have yet to be tested in animals (such as [CR3022](https://science.sciencemag.org/content/early/2020/04/02/science.abb7269?rss=1)) seem to be matched too greedily with animals (such as goats, masked palm civets, bank voles, and red foxes).
* Drugs which share names with certain diseases (such as MERS-4) have similarity scores close to those of the animals from which said diseases originated ([camels in the case of MERS](https://www.ncbi.nlm.nih.gov/pubmed/24896817)).



## Continuation/TODOs:
* Return the sets of papers that study each animal-therapeutic combination. This could be an important tool to aid researchers.
* Find a better way to filter model organisms from the full list of all animals. Currently we rely on clumsy filters: we only include articles that mention "animal model" or "model organism", and we use regexes to filter out acronyms, virus names, names with digits, etc. We also filter out organism names that are not close to ["animal", "model", "organism"] in our word2vec embedding, but this doesn't seem very effective...
* Cluster species names (e.g. "patient" and "human" should be clustered together). However, we have found that the embedding trained on CORD-19 does not produce sensible clusters with AffinityPropagation or t-SNE. A different, more general embedding may be necessary.
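
One possible starting point for the first TODO is a simple co-occurrence index that maps each (animal, therapeutic) pair to the set of papers whose abstract mentions both. The sketch below is only an illustration under that assumption: the abstracts, the names `drug_a` and `drug_b`, and the helper `cooccurrence_index` are hypothetical, and matching is done by plain lowercase substring search.

```python
from collections import defaultdict
from itertools import product

# Hypothetical abstracts keyed by document id, invented for illustration.
toy_abstracts = {
    "paper_1": "drug_a was evaluated in rhesus macaques.",
    "paper_2": "Mice received drug_a before challenge.",
    "paper_3": "Ferrets were used without any treatment.",
}

def cooccurrence_index(abstracts, animals, therapeutics):
    """Map (animal, therapeutic) pairs to the ids of papers mentioning both."""
    index = defaultdict(set)
    for doc_id, text in abstracts.items():
        lowered = text.lower()
        found_animals = [a for a in animals if a in lowered]
        found_drugs = [t for t in therapeutics if t in lowered]
        # Record every animal/therapeutic pair found in the same abstract.
        for animal, drug in product(found_animals, found_drugs):
            index[(animal, drug)].add(doc_id)
    return dict(index)

index = cooccurrence_index(
    toy_abstracts,
    animals=["macaque", "mice", "ferret"],
    therapeutics=["drug_a", "drug_b"],
)
print(index)
```

Substring matching is crude (for example, "mice" would also match inside "pumice"), so a real version would likely rely on the extracted species entities and token-level matching instead.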
