**This notebook utilizes data from the COVID-19 Open Research Dataset Challenge (CORD-19) to answer the question: What do we know about vaccines and therapeutics?**

**Methods:** NLP, text mining ([scispaCy](https://allenai.github.io/scispacy/)), dataframe processing and visualization resources.

In [None]:
%%bash -e
if ! [[ -f ./xyz2mol.py ]]; then
  wget https://raw.githubusercontent.com/jensengroup/xyz2mol/master/xyz2mol.py
fi

In [None]:
!pip install py3Dmol
!pip install -U chembl_webresource_client
import sys
!conda install --yes --prefix {sys.prefix} -c rdkit rdkit

In [None]:
import glob
import json
import pandas as pd
import pickle
import spacy
from spacy import displacy
from spacy.matcher import Matcher
from tqdm import tqdm
import en_ner_bc5cdr_md
import os
from collections import Counter
import matplotlib.pyplot as plt
from chembl_webresource_client.new_client import new_client
import rdkit
from rdkit import Chem
from rdkit.Chem import Draw
import py3Dmol  # 3D molecule visualization
from rdkit.Chem import AllChem
from ipywidgets import interact, interactive, fixed
from IPython.display import Image
import cv2
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets
from sklearn.cluster import KMeans
import gc

# Load and Preprocess Data

**Load Metadata**

In [None]:
def doi_to_df(doi):
    if isinstance(doi,float):
        return None
    elif doi.startswith('http'):
        return str(doi)
    elif doi.startswith('doi'):
        return 'https://'+str(doi)
    else:
        return 'https://doi.org/'+str(doi)

In [None]:
df_meta = pd.read_csv('../input/CORD-19-research-challenge/metadata.csv')
df_meta['url'] = df_meta.doi.apply(doi_to_df)
df_meta.head(3)

In [None]:
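# Work on a copy so the metadata frame keeps its original abstracts.
df = df_meta.copy()
# Treat the placeholder string "Unknown" as a missing abstract.
df['abstract'] = df['abstract'].replace('Unknown', np.nan)
df = df.dropna(subset=['abstract'])
df = df[['title', 'abstract']].copy()
df.reset_index(drop=True, inplace=True)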
df.describe()

In [None]:
covid19_terms = {
    '2019-nCoV',
    '2019-nCoV.',
    '2019 ncov',
    '2019 n cov',
    '2019n cov',
    '2019ncov',
    '2019-novel Coronavirus',
    'coronavirus 2',
    'coronavirus 2019',
    'covid',
    'COVID-19',
    'COVID19',
    'n cov 2019',
    'ncov 2019',
    'wuhan pneumonia',
    'wuhan virus',
    'wuhan coronavirus'
    }

In [None]:
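def has_covid19(text):
    # Missing titles arrive as NaN (a float), so only search real strings.
    if not isinstance(text, str):
        return False
    lowered = text.lower()
    return any(name.lower() in lowered for name in covid19_terms)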

In [None]:
df['title_has_covid19']=df.title.apply(has_covid19)
dfTitle=df['title_has_covid19']

df['abstract_has_covid19']=df.abstract.apply(has_covid19)
dfAbstract=df['abstract_has_covid19']

In [None]:
# Copy the selection so columns can be added to it later without pandas warnings.
df_covid19 = df[dfTitle | dfAbstract].copy()
print(df_covid19.shape)

**Apply scispaCy Model**

We use scispaCy's `en_ner_bc5cdr_md` model. It provides only two NER classes, DISEASE and CHEMICAL; the latter likely covers drugs and therapeutics.
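
As a rough illustration of how the two labels are separated downstream, the sketch below groups entity strings by label. The `(text, label)` pairs are hand-written stand-ins for what a spaCy entity exposes through `.text` and `.label_`; they are not actual model output, and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written stand-ins for (ent.text, ent.label_) pairs; not real model output.
toy_entities = [
    ("lopinavir", "CHEMICAL"),
    ("pneumonia", "DISEASE"),
    ("ritonavir", "CHEMICAL"),
]


def group_by_label(entities):
    """Group entity strings by their NER label, preserving input order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(toy_entities)
print(grouped["CHEMICAL"])
```

Only the CHEMICAL group is carried forward in the rest of the notebook.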

In [None]:
nlp=en_ner_bc5cdr_md.load()

In [None]:
example_text="""
Abstract Coronavirus 229E was grown to high titers in diploid fibroblast 
cells under medium containing twice the normal concentrations of amino acids 
and vitamins. Growth curves showed maximum virus production at multiplicities 
of infection of 0.1 and 1; maximum titers of intracellular virus occurred at 
22–24 hr and of extracellular virus at 26 hr postadsorption. Tube infectivity 
titers ranged from 109.0–109.5 TCID50/ml and plaque titers from 1010.2–1010.9 
y PFU/ml at the time of peak virus production, when no cytopathology was 
evident. Virus titer dropped rapidly between 26 and 56 hr, coincident with 
increasing cytopathology. A single precipitin band was observed in 
immunodiffusion and immunoelectrophoresis between concentrated virus 
preparations and antiserum to purified 229E. Neuraminidase and hemagglutinin 
assays were negative. Virus was purified by two procedures: adsorption to and 
elution from human “0” erythrocytes and CaHPO4 gel followed by equilibrium 
sucrose gradient centrifugation, and PEG precipitation followed by equilibrium 
glycerol/tartrate gradients and rate zonal sucrose or glycerol/tartrate 
gradients. Final lots of purified virus containing <0.02% of the crude tissue 
culture proteins had absorption maxima at 256 nm and minima at 241.2 nm and a 
mean extinction coefficient of E 1cm 1% = 54.3 at 256 nm. The fully corrected 
sedimentation coefficient for the intact virion was S 20,v 0 = 381 S. PAGE by 
different techniques revealed seven polypeptides of mean apparent molecular 
weights between 16,900 and 196,100. Six contained carbohydrate and one 
contained lipid. Electropherograms of 3H- and 14C-labeled virus were identical 
to those of stained gels. Two glycoproteins constituting 25% of the virion 
protein were identified by bromelin digestion as the spike proteins. The 
density in sucrose and in potassium tartrate was 1.18 g/ml for the virion and 
1.15 g/ml for the “despiked” particle.
"""

In [None]:
doc=nlp(example_text)

In [None]:
colors={
        'CHEMICAL': 'lightpink',
        'DISEASE': 'orange',  # must be a valid CSS color name
}
displacy.render(doc, style='ent', options={
    'colors': colors
})

In [None]:
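def apply_spacy(texts, nlp):
    # Run the pipeline on each string; keep None for missing values (e.g. NaN titles).
    docs = []
    for t in texts:
        if isinstance(t, str) and t:
            docs.append(nlp(t))
        else:
            docs.append(None)
    return docs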

In [None]:
def annotate_with_spacy(df):
    df['title_doc'] = apply_spacy(df.title, nlp)
    df['abstract_doc'] = apply_spacy(df.abstract, nlp)
    return df

def get_spacy_df(df):
    try:
        with open('df_spacy_cache.pickle', 'rb') as f:
            df_spacy = pickle.load(f)
    except FileNotFoundError:
        df_spacy = annotate_with_spacy(df)
        with open('df_spacy_cache.pickle', 'wb') as f:
            pickle.dump(df_spacy, f)
    return df_spacy

In [None]:
df_spacy = get_spacy_df(df_covid19)
df_spacy.iloc[0].abstract_doc.ents

**Match relevant tokens, e.g. COVID-19, trial and usage indicators**


In [None]:
trial_indicators = {
    'trial',
    'study',
    'experiment',
    'evaluate',
    'evaluation',
    're-evaluate',
    'report',
    'test',
    'testing',
    'target',
    'data',
    'show',
    'outcome',
    'find',
    'agent',
}

usage_indicators = {
    'approve',
    'approval',
    'therapeutic',
    'therapy',
    'inhibitory',
    'effect',
    'administer',
    'achieve',
    'improve',
    'alleviate',
    'reduce',
    'antiviral',
    'against',
    'suppress',
    'beneficial',
    'evidence',
    'take',
    'prescribe',
    'treatment',
    'receive',
    'treat',
    'regimen',
    'use',
    'efficacy',
    'course',
    'drug',
}

idea_indicators = {
    'promising',
    'promise',
    'speculate',
    'believe',
    'would',
    'could',
    'may',
    'possibly',
    'might',
    'should',
    'hypothesize',
    'appear',
    'lack',
    'unclear',
    'need',
} 

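# Register one lemma-based pattern per indicator word under its category.
# The add() calls below use the spaCy 2.x signature; spaCy 3.x expects
# matcher.add(key, [pattern]) instead.
matcher = Matcher(nlp.vocab)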
for n in trial_indicators:
    matcher.add("trial", None, [{'LEMMA': w.lemma_} for w in nlp(n)])
for n in usage_indicators:
    matcher.add("usage", None, [{'LEMMA': w.lemma_} for w in nlp(n)])
for n in idea_indicators:
    matcher.add("idea", None, [{'LEMMA': w.lemma_} for w in nlp(n)])

example_sent = "Clinical trials (for example, ChiCTR2000029539) have been initiated to test HIV protease inhibitors such as lopinavir and ritonavir in patients infected with 2019-nCoV."
doc = nlp(example_sent)
matches = matcher(doc)
for match_id, start, end in matches:
    # match_id is a hash; look up the category name it stands for.
    print(nlp.vocab.strings[match_id], ':', doc[start:end])

In [None]:
def doc_to_matches(doc):
    match_results = {
        'trial': [],
        'usage': [],
        'idea': []
    }
    if not doc:
        return match_results

    matches = matcher(doc)
    for match_id, start, end in matches:
        match_name = nlp.vocab.strings[match_id]
        match_results[match_name].append((start, end))
    return match_results

def get_matches_df(docs):
    matches = []
    for doc in docs:
        matches.append(doc_to_matches(doc))
    df = pd.DataFrame(matches)
    return df
        
df_matches = get_matches_df(df_spacy.abstract_doc)
df_matches.columns = ['abstract_trial_matches', 'abstract_usage_matches', 'abstract_idea_matches']
df_with_matches = pd.concat([df_spacy.reset_index(drop=True), df_matches], axis=1)
df_with_matches.head(3)

In [None]:
# df_covid19 = df_with_matches[df_with_matches.abstract_has_covid19]
print('Example abstracts', df_with_matches.shape)
for i, row in df_with_matches.head(5).iterrows():
    print('TITLE:', row.title)
    print('\n')
    print(row.abstract)
    print('\n', '-' * 50, '\n')

# Extract all drugs and therapeutics from abstracts
Drop all chemicals that appear fewer than N times in the whole dataset (the code below keeps strings with more than 8 occurrences). From the remainder, blacklist the false positives found by manual inspection, then plot the remaining chemicals by occurrence frequency.
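
The filtering idea can be sketched on toy data before applying it to the real abstracts. The mention list, the threshold of 2, and the helper name `filter_mentions` are all made up for illustration; the notebook's own cell uses a larger threshold and the hand-curated blacklist.

```python
from collections import Counter


def filter_mentions(mentions, min_count, blacklist):
    """Keep strings seen at least `min_count` times that are not blacklisted.

    The result is ordered from most to least frequent.
    """
    counts = Counter(mentions)
    return {
        text: n
        for text, n in counts.most_common()
        if n >= min_count and text not in blacklist
    }


# Toy mentions (assumed data): "ATP" is frequent but a known false positive.
toy_mentions = ["remdesivir"] * 3 + ["ATP"] * 4 + ["chloroquine"] * 2 + ["zinc"]
kept = filter_mentions(toy_mentions, min_count=2, blacklist={"ATP"})
print(kept)
```

The frequency cut removes one-off noise, while the blacklist removes frequent strings that the model mislabels as chemicals.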

In [None]:
BLACKLIST = {
 'ACE2s',
 '2019-nCoV',
 '95%CI',
 'ACE2-Fc',
 'AMB',
 'AMI',
 'AMK',
 'AOM',
 'AST-045',
 'AST-N041',
 'ATP',
 'BPO3-P',
 'Betacoronavirus',
 'CAP',
 'CAZ',
 'CC',
 'CIP',
 'CP',
 'CLAVE',
 'COVID-2019',
 'CR3022',
 'creatinine', 
 'CTX',
 'CTX-M',
 'CoV-2',
 'DES',
 'DHPG',
 'DIP',
 'E2',
 'ESBL',
 'Enterobacteriaceae',
 'FASTA',
 'FCA',
 'FCS',
 'FOS',
 'GEN',
 'GM',
 'HK',
 'HPDI',
 'IFR',
 'IM',
 'IVA',
 'JA',
 'KLK13',
 'LA',
 'LPV/r',
 'LYM%',
 'La',
 'LcS',
 'Li',
 'MERS-CoV.',
 'MICs',
 'Metapneumovirus',
 'Médecine',
 'NAL',
 'NCP',
 'NG',
 'NLR',
 'NO',
 'NOR',
 'NP',
 'NS7b',
 'OC',
 'OFL',
 'OP',
 'Prefixes',
 'Résumé',
 'S.',
 'SARS-CoV-2',
 'SARS-COV-2',
 'SARS-Cov2',
 'SARS-CoV2',
 'SARS-CoV-2 infection',
 'SARS-CoV-2 infections',
 'SARS-CoV-2 pneumonia',
 'SARS-CoV.',
 'SARS-Cov-2',
 'SARS-related',
 'SGC7901',
 'SHV',
 'SP',
 'Sarbecovirus',
 'Se',
 'TCM',
 'TCR',
 'TCB',
 'TGEV',
 'TOB',
 'TSL-EO',
 'Texte',
 'VME',
 'VP',
 'WeChat',
 'ZJ01',
 '[ST]A',
 'alcohol',
 'amino acid',
 'amino acids',
 'aminoglycosides',
 'bat-SL-CoVZXC21',
 'betacoronavirus',
 'cholesterol',
 'coronavirus',
 'des cas',
 'https://doi.org/10',
 'infector-infectee',
 "l'origine",
 'lactate',
 'lockdowns',
 'na',
 'nucleic acid',
 'nucleic acids',
 'nucleotide',
 'NBCZone',
 'oxygen',
 'quinolones',
 'rinitis',
 'self-imposed',
 'sodium',
 'smoking',
 'β-coronavirus',
 '℃'}


def count_chemical_ents(df):
    ent_str = []
    for i, row in df.iterrows():
        if row.abstract_doc:            
            for ent in row.abstract_doc.ents:
                if ent.label_ == 'CHEMICAL':
                    ent_str.append(row.abstract_doc[ent.start:ent.end].text)
            
    filtered = [e for e in Counter(ent_str).most_common() if e[1] > 8 and e[0] not in BLACKLIST]
    return dict(filtered)

counts = count_chemical_ents(df_with_matches)
print('Count Frequencies\n')
print(counts)

plt.figure(figsize=(20,20))
plt.rc('xtick', labelsize=20) 
plt.rc('ytick', labelsize=20) 
plt.xticks(rotation=90)
plt.title('Frequency of CHEMICAL-type Strings in Abstracts', fontsize=20)
plt.bar(counts.keys(), counts.values())

# Organise matches by Drugs/Therapeutics

Above, we compiled a list of drugs/therapeutics that are relevant in the context of COVID-19. Now we can dive deeper into the contexts in which these drugs appear.

To this end, we match words that indicate the context of the drug mention:

- the drug is in an idea stage (e.g. 'darunavir could be useful against COVID-19')
- the drug is in a trial stage (e.g. 'lopinavir is currently being trialled')
- the drug is in a usage stage (e.g. 'patients are being treated with ritonavir')

These 'indicator' words are marked as additional entities in context.
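
A minimal sketch of how an indicator can be tied to a drug mention: treat it as context when it lies within a fixed token window of the mention. The 15-token window mirrors the value used in the code below; the helper name `indicators_near`, the toy offsets, and the choice to compare span starts are assumptions for illustration.

```python
def indicators_near(chem_start, indicator_spans, window=15):
    """Return indicator (start, end) spans starting within `window` tokens
    of the chemical mention's start offset."""
    return [
        (start, end)
        for start, end in indicator_spans
        if abs(start - chem_start) < window
    ]


# Toy token offsets (assumed data): a chemical at token 20, four indicator words.
chem_start = 20
usage_spans = [(3, 4), (12, 13), (30, 31), (40, 41)]
near = indicators_near(chem_start, usage_spans)
print(near)
```

A window this wide can cross sentence boundaries, so a hit signals co-occurrence rather than a confirmed relation between the indicator and the drug.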

In [None]:
def doc_id_to_link(doc_id, df_meta, df_data):
    rows = df_meta[df_meta.sha == doc_id]
    if rows.empty:
        return 'UNKNOWN URL AND TITLE'
    url = rows.iloc[0].url
    title = rows.iloc[0].title
    if url and title:
        return '<a href="{}">'.format(url) + title + '</a>'
    elif title:
        return title
    elif url:
        return '<a href="{}">'.format(url) + 'UNKNOWN TITLE' + '</a>'
    else:
        return 'UNKNOWN URL AND TITLE'

def chemical_df(chemicals, df_data, df_meta):
    # Token distance within which an indicator counts as context for a chemical.
    window = 15
    rows = []
    for chem in chemicals:
        chem_row = {
            'chemical_name': chem,
            'chemical': [],
            'trials': [],
            'usages': [],
            'ideas': []
        }
        # Lemma pattern for this chemical name (spaCy 2.x call signature).
        matcher = Matcher(nlp.vocab)
        matcher.add("query", None, [{'LEMMA': w.lemma_} for w in nlp(chem)])
        for i, row in df_data.iterrows():
            if row.abstract_doc is None:
                continue
            # Fall back to the row index when the frame carries no doc_id column.
            doc_id = row.get('doc_id', i)
            for _, chem_start, chem_end in matcher(row.abstract_doc):
                chem_row['chemical'].append((doc_id, chem_start, chem_end))
                # Indicator matches are stored as (start, end) token offsets;
                # compare span starts to decide whether they are close enough.
                for key, spans in (('trials', row.abstract_trial_matches),
                                   ('usages', row.abstract_usage_matches),
                                   ('ideas', row.abstract_idea_matches)):
                    for start, end in spans:
                        if abs(start - chem_start) < window:
                            chem_row[key].append((doc_id, start, end))
        rows.append(chem_row)
    return pd.DataFrame(rows)
        
    
df_chemical = chemical_df(list(counts.keys()), df_with_matches, df_meta)
df_chemical.head(3)