
## Summary
This notebook is an entry in the COVID-19 Open Research Dataset Challenge task "Create summary tables that address therapeutics, interventions, and clinical studies".

Specifically, the organisers want to know what the literature reports about:

* What is the best method to combat the hypercoagulable state seen in COVID-19?
* What is the efficacy of novel therapeutics being tested currently?

Two CSV files are produced:

* What is the best method to combat the hypercoagulable state seen in COVID-19.csv
* What is the efficacy of novel therapeutics being tested currently.csv

In addition, a third CSV file is produced that includes all papers mentioning the hypercoagulable state or its synonyms:

* Hypercoagulable state seen in COVID-19 papers.csv


If you found this notebook useful, please give it a vote or, better still, leave a comment at the [end of the notebook](#vote). Any comments would be much appreciated, and ideas for improvement would be especially welcome.

## Data Analysis

On the 10th June, the literature set provided consisted of 138,794 papers.

As the requirement is to find novel responses to recent problems, only papers with a 2020 date have been analysed. This cut was also necessary because of memory constraints.

2020 papers: 35,675

The following further cuts were made:

23,718 after removing duplicates
 
16,788 after removing papers that do not mention COVID-19 or its synonyms. This is achieved using the method supplied in covid19-tools provided by Andy White. Thank you!
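
The three cuts can be pictured on a toy table. This is a minimal sketch rather than the notebook's exact code: the column names follow the CORD-19 metadata file, while the `toy` rows and the shortened `synonyms` list are assumptions made up for the illustration.

```python
import pandas as pd

# Toy stand-in for metadata.csv (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    "title": ["Old SARS study", "COVID-19 outcomes",
              "COVID-19 outcomes (preprint)", "Flu in 2020"],
    "abstract": ["sars in 2003", "covid patients improved",
                 "covid patients improved", "influenza season"],
    "publish_time": ["2003-05-01", "2020-04-02", "2020-03-20", "2020-02-11"],
})

# 1. Keep 2020 papers (ISO date strings compare correctly as text)
recent = toy[toy["publish_time"] > "2020"]

# 2. Drop papers whose abstract duplicates an earlier one
unique = recent.drop_duplicates(subset="abstract", keep="first")

# 3. Keep papers whose title or abstract mentions a COVID-19 synonym
synonyms = ["covid", "sars cov 2", r"coronavirus 2\b"]  # shortened, assumed list
pattern = "|".join(synonyms)
text = unique["title"].fillna("") + " " + unique["abstract"].fillna("")
text = text.str.lower().str.replace("-", " ", regex=False)
covid_only = unique[text.str.contains(pattern, regex=True)]

print(len(toy), len(recent), len(unique), len(covid_only))
```

Each step only narrows the previous result, so the printed sizes shrink from left to right in the same way as the paper counts listed above.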
 

## Finding Relevant Papers 

For round 1 I found that the Universal Sentence Encoder produced the most useful results.

For round 2 the requirement is much more focussed and so a keyword approach using regular expressions has been used to provide the required data.

The results presented here are produced completely automatically, without any hand editing. To improve the chance of delivering useful papers, more data is provided than the minimum set required. Selection has also leaned towards "false positives", so that borderline relevant papers are presented rather than excluded.

Given the requirements and the limited data available from the early stage trials reported, it has not been possible to provide useful data on two requirements:

endpoints

illness severity

The csv files are ordered by the sentiment score of the conclusion column, in descending order from +1 to -1.
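
As a rough sketch of that ordering: the notebook imports TextBlob, which presumably supplies the score. To keep this example dependency-free, the `toy_polarity` function below is a made-up stand-in, not TextBlob's algorithm, and the rows and word lists are invented.

```python
import pandas as pd

# Hypothetical word lists; a real run would use a proper sentiment library
POSITIVE = {"effective", "improved", "benefit"}
NEGATIVE = {"ineffective", "worse", "harm"}

def toy_polarity(text: str) -> float:
    """Return a score in [-1, 1]: (positives - negatives) / matched words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

papers = pd.DataFrame({"Conclusion": [
    "treatment was ineffective and outcomes were worse",
    "no clear difference was seen",
    "the drug was effective and survival improved",
]})
papers["Sentiment"] = papers["Conclusion"].apply(toy_polarity)

# Most positive conclusions first
ranked = papers.sort_values("Sentiment", ascending=False).reset_index(drop=True)
print(ranked)
```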


## Therapy identification

Four methods are used to identify potential therapies:

* Using spaCy, as described in CORD-19: Explore Drugs Being Developed by Maria and Gtteixeira (https://www.kaggle.com/maria17/cord-19-explore-drugs-being-developed). Thank you.
* By identifying drugs by their ending as defined in Wikipedia (https://en.wikipedia.org/wiki/Drug_nomenclature).
* By referring to the FDA's drug directory (https://www.fda.gov/drugs/drug-approvals-and-databases/national-drug-code-directory).
* A fourth method has been extended for round 2 and allows others to add potential therapies to the list of therapies examined.


The spaCy method produces many false matches, so an extensive stopword list has been compiled and only those therapies mentioned at least 4 times are included here.
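
The ending-based method and the noise filter can be sketched together. This is a simplified illustration under stated assumptions: `SUFFIXES` holds only three of the endings, `STOPWORDS` is a two-word stand-in for the long list used later, and `MIN_MENTIONS` mirrors the `count > 3` filter applied in the code cells; `find_therapies` is a name invented for this sketch.

```python
from collections import Counter

SUFFIXES = ("vir", "mab", "tinib")   # a few endings from the nomenclature list
STOPWORDS = {"vir", "mab"}           # tiny stand-in for the real stopword list
MIN_MENTIONS = 4                     # same effect as the notebook's count > 3

def find_therapies(words, min_mentions=MIN_MENTIONS):
    """Return {candidate: count} for suffix matches seen at least min_mentions times."""
    counts = Counter(
        w for w in words if w.endswith(SUFFIXES) and w not in STOPWORDS
    )
    return {w: n for w, n in counts.items() if n >= min_mentions}

tokens = (["remdesivir"] * 5 + ["tocilizumab"] * 4
          + ["favipiravir"] * 2 + ["mab", "patient"] * 6)
print(find_therapies(tokens))
```

Here `favipiravir` matches an ending but is dropped by the frequency threshold, and `mab` is frequent but is dropped by the stopword list.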


### Results

217 therapies have been identified and the 2020 papers have been examined to find matches.

Of the 16k papers examined, a therapy was mentioned in the conclusion of 453 papers. These papers are saved in the 'What is the efficacy of novel therapeutics being tested currently' csv file.

Of the 16k papers examined, the hypercoagulable state or one of its synonyms was mentioned in 522 papers. These papers are saved in the "hypercoagulable state seen in COVID-19 papers" csv file.

Of the 522 papers mentioned above, 39 papers include a mention of a therapy either in the conclusion or treatment cells. These papers are saved in the "What is the best method to combat the hypercoagulable state seen in COVID-19" csv file.


## Pros and Cons on Therapy Discovery Method

Identifying potential therapies by the ending of their name is very quick but does not produce a complete list. The spaCy method can fill some of the gaps, but I did identify a few problems. When I set it up to identify word bigrams, it sometimes produced long chains of unrelated words. Perhaps I was doing something wrong; in any case, I decided to use the method on a single-word basis. Bigrams are handled by exception, which is not ideal.

Each method contributes to the list of drugs found in the literature. Although the FDA's directory is comprehensive, it does not always name drugs in the same way as the literature does. Also, some anti-virals are not currently listed.
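
One way to picture "bigrams handled by exception" is a single-word pass plus a hand-maintained phrase list. The sketch below is an assumption about how that could look, not the notebook's own code; `EXTRA_PHRASES` and `count_mentions` are names invented here, and the sample sentence is made up.

```python
import re

EXTRA_PHRASES = ["stem cell", "nitric oxide"]  # multi-word therapies added by hand

def count_mentions(text: str, terms) -> dict:
    """Count whole-word (or whole-phrase) occurrences of each term in text."""
    counts = {}
    for term in terms:
        # \b on both sides stops "stem cell" from matching inside "stem cells"
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text))
    return counts

corpus = ("nitric oxide helped; inhaled nitric oxide and "
          "stem cell therapy, not stem cells")
print(count_mentions(corpus, ["remdesivir"] + EXTRA_PHRASES))
```

Word-boundary matching also counts a term at the very start or end of the text, which padding the term with spaces on both sides would miss.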



In [None]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#This is an updated list aimed at responding to the round 2 challenge
#Not used in the end

df_queries = pd.DataFrame({'question': [\
'Effectiveness of drugs being developed and tried to treat hypercoagulable states',\
'Clinical and bench trials to investigate treatment of hypercoagulable states',\
'What is the best method to combat the hypercoagulable state seen in COVID-19',\
]})

In [None]:

import numpy as np # linear algebra
import pdb
import os
import nltk, string
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer

import covid19_tools as cvt

from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

stop_words = set(stopwords.words('english'))

'''remove punctuation, lowercase, stem'''
remove_punctuation_map = dict((ord(char), ' ') for char in string.punctuation)    
def normalize(text):
    return nltk.word_tokenize(text.lower().translate(remove_punctuation_map))

def clean_text(text):
    text = text.lower().translate(remove_punctuation_map)
    
    return ' '.join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))

#From Andy White utility script. Thank you!

def abstract_title_filter(df, search_string):
    return (df.Abstract.str.lower().str.replace('-', ' ')
            .str.contains(search_string, na=False) |
            df.Title.str.lower().str.replace('-', ' ')
            .str.contains(search_string, na=False))


covid19_synonyms = ['covid',
                    'coronavirus disease 19',
                    'sars cov 2', # Note that search function replaces '-' with ' '
                    '2019 ncov',
                    '2019ncov',
                    r'2019 n cov\b',
                    r'2019n cov\b',
                    'ncov 2019',
                    r'\bn cov 2019',
                    'coronavirus 2019',
                    'wuhan pneumonia',
                    'wuhan virus',
                    'wuhan coronavirus',
                    r'coronavirus 2\b']

def count_and_tag(df: pd.DataFrame,
                  synonym_list: list,
                  tag_suffix: str) -> (pd.DataFrame, pd.Series):
    counts = {}
    df[f'tag_{tag_suffix}'] = False
    for s in synonym_list:
        synonym_filter = abstract_title_filter(df, s)
        counts[s] = sum(synonym_filter)
        df.loc[synonym_filter, f'tag_{tag_suffix}'] = True
    print(f'Added tag_{tag_suffix} to DataFrame')
    return df, pd.Series(counts)
#ends

df=pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv')
print ('Size of literature Set on 10th June 138,794,19')
print ('Size of literature Set', df.shape)
#'Size of literature Set on 10th April 51078,18'


In [None]:
import random
random.seed(42)


In [None]:

df = df.sort_values(by='publish_time',ascending=True)
#to include only the most recent papers
df = df.loc[df['publish_time'] > '2020']
df.shape


In [None]:
df = df.rename(columns={'source_x': 'Source', 'title': 'Title', 'abstract': 'Abstract', 'publish_time': 'Publish_Date', 'authors': 'Authors', 'journal': 'Journal', 'url': 'Ref URL'})
#drop duplicate abstracts
df = df.drop_duplicates(subset='Abstract', keep="first")

print ('Size of literature Set after removing duplicates on 10th June 23718,19')
print ('Size after removing duplicates', df.shape)
#4/3/20 38667,18
#Size of literature Set after removing duplicates on 10th April 41952,18

In [None]:
#Clean the text
df_queries['query_bow'] = df_queries.question.apply(clean_text)
df_queries['query_bow'] = df_queries['query_bow'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

#Only include papers that reference covid-19
df_a, covid19_counts = count_and_tag(df, covid19_synonyms, 'disease_covid19')
df_covid19 = df[df['tag_disease_covid19'] == True ]
df_covid19 = df_covid19.reset_index()
df_covid19 = df_covid19.drop(['index'], axis=1)

#dont limit abstracts
#limit Abstract to 3500 words
#df_covid19["Abstract"] = df_covid19["Abstract"].str[:3500]

#Split the abstract into sentences
df_covid19['org_abstract'] = df_covid19['Abstract']
df_covid19_by_sentence = df_covid19.set_index(df_covid19.columns.drop('org_abstract').tolist())\
.org_abstract.str.split(r'\. ', expand=True).stack().reset_index()\
.rename(columns={0:'Sent Abstract'})


In [None]:
df_covid19.shape

In [None]:
print(covid19_counts)

In [None]:
df_by_sentence = df_covid19_by_sentence.copy()
#df_covid19_bow_full ['bow_raw'] = df_covid19_bow_full ['title'] + " " + df_covid19_bow_full ['abstract']
df_by_sentence ['bow_raw'] = df_by_sentence ['Sent Abstract']

In [None]:
df_by_sentence['bow'] = df_by_sentence.bow_raw.apply(clean_text)
df_by_sentence['bow'] = df_by_sentence['bow'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
#df_by_sentence.head(5)


#Only consider sentences > 20 chars
#(filter on the sentence column, not on the whole abstract)

df_covid19_bow_f = df_by_sentence[df_by_sentence['Sent Abstract'].str.len() > 20]
df_covid19_bow = df_covid19_bow_f.reset_index()
#Subset for testing
# df_covid19_bow_fs = df_covid19_bow_f.loc[1218:1230].copy()
# df_covid19_bow = df_covid19_bow_fs.reset_index()



In [None]:
#df_covid19_bow = df_covid19_bow[['Title','Sent Abstract','Abstract','bow', 'cord_uid', 'Journal', 'Authors','Publish_Date', 'Source', 'Ref URL']]

#produce a bag of words for queries and sentence of papers
total_bow = ["".join(x) for x in (df_queries['query_bow'])]
total_bow += ["".join(x) for x in (df_covid19_bow['bow'])]


In [None]:
print (len(total_bow))

## Discover chemicals in the papers 

In [None]:
#Use the Spacy model
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bc5cdr_md-0.2.4.tar.gz

import spacy
import en_ner_bc5cdr_md
from spacy import displacy

nlp = en_ner_bc5cdr_md.load()



In [None]:
#'full_text_file', 'has_pdf_parse', 'has_pmc_xml_parse', 'WHO #Covidence', 'Microsoft Academic Paper ID'

In [None]:
# abstract split into sentence
df_covid19_bow = df_covid19_bow[['Title','Sent Abstract','Abstract','bow', 'cord_uid', 'Journal', 'Authors','Publish_Date', 'Source', 'Ref URL']]

#abstract complete
df_covid19['therapies'] = np.nan
df_covid19 = df_covid19[['therapies','cord_uid', 'sha', 'Source', 'Title', 'doi', 'pmcid', 'pubmed_id',
       'license', 'Abstract', 'Publish_Date', 'Authors', 'Journal',
         'Ref URL', 'tag_disease_covid19','org_abstract']]

In [None]:
#get a list of strings. each string is an abstract
abstract_sent_list = ["".join(x) for x in (df_covid19_bow['bow'])]

#convert the list of strings to one long string
abstract_bow_full = ' '.join(abstract_sent_list)




#split the string into a list of single word strings
abstract_bow_full_list = abstract_bow_full.split()
#removes repeats because of the spaCy memory limitation
abstract_bow = ' '.join(set(abstract_bow_full.split()))
#abstract_bow = abstract_bow_full

In [None]:
import re

#Prep before running Spacy Model and search to find therapies


#remove any number
abstract_bow = re.sub(r'\d+', '',abstract_bow )

#remove single char words
abstract_bow = re.sub(r'(?:^| ).(?:$| )', ' ', abstract_bow)
#remove 2 char words
abstract_bow = re.sub(r'(?:^| )..(?:$| )', ' ', abstract_bow)
#remove 3 char words
abstract_bow = re.sub(r'(?:^| )...(?:$| )', ' ', abstract_bow)

#Then produce a list of unique words

abstract_bow_list = abstract_bow.split() 

In [None]:
#abstract_bow_list = abstract_bow.split() 
all_therapies_list = []
all_chemicals_str = ''
for word in abstract_bow_list:
    if len(word) > 5 and len(word) <30:
        wordspace = word+' '
        doc = nlp(wordspace)
        entry = doc.ents
        if entry:
            #print ('word, entry[0].label_',word,entry[0].label_)
            if entry[0].label_ == 'CHEMICAL':
                #print ('word, nlp(word)',word, nlp(word).ents)            
                all_chemicals_str += ' ' + word
                all_therapies_list.append(word)
                #pdb.set_trace()
 

In [None]:
print (len(all_therapies_list))

## Load and Prep Drug List

In [None]:
#df_druglist=pd.read_csv('/kaggle/input/fdadrugs/druglist.csv')
df_druglist=pd.read_csv('/kaggle/input/fdasubsubstance/substancename.csv')
df_druglist = df_druglist.drop_duplicates(subset='SUBSTANCENAME', keep="first")
df_druglist = df_druglist[df_druglist['SUBSTANCENAME'].notna()]

#Split the names: one name per line
df_druglist['org_SUBSTANCENAME'] = df_druglist['SUBSTANCENAME']
df_druglist = df_druglist.set_index(df_druglist.columns.drop('SUBSTANCENAME').tolist())\
.SUBSTANCENAME.str.split('; ', expand=True).stack().reset_index()\
.rename(columns={0:'SUBSTANCENAME'})
df_druglist = df_druglist.drop_duplicates(subset='SUBSTANCENAME', keep="first")

#convert to list and go to lower case
drug_list = df_druglist['SUBSTANCENAME'].tolist()
drug_list = [element.lower() for element in drug_list]
#drug_list = drug_list.tolower()
#print (drug_list)
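
The same one-name-per-row reshaping can also be written with `Series.explode` (available from pandas 0.25). This is an alternative sketch on made-up rows, not the code used above; the `substances` table is hypothetical and only imitates the style of the FDA `SUBSTANCENAME` column.

```python
import pandas as pd

# Hypothetical rows in the style of the FDA SUBSTANCENAME column
substances = pd.DataFrame({"SUBSTANCENAME": [
    "LOPINAVIR; RITONAVIR", "ENOXAPARIN SODIUM", None, "RITONAVIR",
]})

drug_names = (
    substances["SUBSTANCENAME"]
    .dropna()              # ignore rows without a substance name
    .str.split("; ")       # one list of names per row
    .explode()             # one name per row
    .str.strip()
    .str.lower()
    .drop_duplicates()     # keep the first occurrence of each name
    .tolist()
)
print(drug_names)
```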

## Add drugs from FDA list

In [None]:
#Add FDA drug list


drugs_str = ''

for drug in drug_list:
    #print (drug)
    drugspace = ' '+drug+' '
    if abstract_bow_full.find(drugspace) > -1:
        #print (drug)
        if drugs_str == '':
            drugs_str = drug
        else:
            drugs_str = drugs_str + ", " + drug
        all_therapies_list.append(drug)
        

#Add therapies, interventions and issues mentioned in the literature        
all_therapies_list.append("stem cell")
all_therapies_list.append("nitric oxide")
all_therapies_list.append("interferon")
all_therapies_list.append("cas13")
all_therapies_list.append("cepharanthine")
all_therapies_list.append("selamectin")
all_therapies_list.append("camostat")
all_therapies_list.append("nafamostat")
all_therapies_list.append("fusan")
all_therapies_list.append("hyperbaric oxygen therapy")
all_therapies_list.append("plasma exchange")
all_therapies_list.append("Mesenchymal Stromal Cells")
all_therapies_list.append("Enoxaparin")
#hypercoagulable relate

all_therapies_list.append("anticoagulant")
all_therapies_list.append("antithrombosis")
all_therapies_list.append("tocovid")
all_therapies_list.append("thrombectomy")




In [None]:
#Add drugs from Wikipedia list
#Add drugs mentioned in literature or task

# Name endings taken from the Wikipedia drug nomenclature page
drug_endings = ('vir', 'mab', 'dol', 'axine', 'oxacin', 'tinib', 'lisib')
# Drug names searched for directly, wherever they occur in a word
drug_names = ('interferon', 'naproxen', 'clarithromycin', 'minocycline', 'homoharringtonine')

for word in abstract_bow_list:
    # str.endswith accepts a tuple, so endings of any length are matched
    # (the earlier 3-character regex capture could never equal 'axine', 'oxacin', 'tinib' or 'lisib')
    if word.endswith(drug_endings) or any(name in word for name in drug_names):
        all_therapies_list.append(word)


In [None]:
stopwords = ['air','lead','oxygen','hydrogen','urea',\
'diamond', 'creatinine','gold','nitrogen',\
'alanine','water',\
'creatine','alcohol',\
'silicon', 'lactic acid','sodium','phosphorus','egg',\
'zinc', 'influenza b virus','hydrogen peroxide',\
'hepatitis c virus', 'garlic','glycine',\
'thyroid', 'leucine','sodium hypochlorite',\
'potassium','squash', 'efavirenz',\
'sulbactam sodium','nifedipine',\
'uric acid', 'legionella pneumophila',\
'hama','chine','acid','mechan','extract','iran','icacy','guan','isse','ester',\
'betacoronavirus', 'hydrogen','methyl', 'provincia','greg','virol', 'urea',\
'lombardy','phosphate','pakistan','sarbecovirus', 'hcovs', 'crrt', 'acei',\
'hunan', 'ecmo', 'statin','henan', 'daegu', 'biomed','jinyintan','aspartate','hama',\
'virus','hopkins','ncovs','tryptanthrin','oseltamivir', 'hpdi','macau','ethanol',\
'youtube','fcov','xiao','aedt','enac','philippine', 'york', 'repatriate',\
'phosphorus','alcohol','hebei', 'chloride', '‐ncov', 'thiol', 'yokohama', 'paul','stein', \
'saharan', 'zinc','indigodole','ncapp','glucose','methanol','phys',\
'oxygen','angiotensin','amino','ncip','nucleotide','lactate','tongji',\
'separ', 'tianjin','creatinine', 'nitrogen', 'cochrane','qpcr','dpcr','alanine','trypsin','purine','sichuan',\
'triphosphate','virus’','ptsd', 'ddpcr','melatonin', 'sodium','emergencia',\
'creatine', 'tibetan', 'ciclesonide','sulfate','brote', 'cuidado', 'jama', 'tmax','proline', 'sirolimus',
'youan','quench','ccov','chicago', 'sudan','abstractan','espii','ncov”', 'intensivos',\
'gompertz', 'iraq', 'berlin','nucleoside', 'virales','taiyuan','quencher','coronavirus’',\
'cxcr','cpet','nucleal','brigham', 'niclosamide', 'nettree','冠状病毒（', 'hcov–host',\
'covd','pbmcs','gyeongsangbuk','hydrochloride', 'iata', 'belgium', 'scovs', 'gyeongbuk',\
'abstractin','abstractan','espii','bscs','mercado','youan','fubar','leucine','bilirubin','mab','tcr',\
'hypochlorite','contexte', 'τtrans',"we’ve",'dado','cardiovasculaire','particulière',\
'mographic','fema','primari','indonesia', 'hepg','infectado','denominada','ganglioside','fatty', 'fetp',\
'carbon', 'palo', 'mhla','zeit','qiaamp','tacrolimus','sanitarium', 'begg','contingencia','bolivia','calmette','vitamin','mape','ncov’',\
'glycine', 'comorbidités', 'asif','ishr','existencia','dioxide','sanitizer', 'tracker','línea','™','unsaturated',\
'silver', 'chlorine', 'cholesterol', 'cobalt', 'formaldehyde', 'escherichia coli',\
'morphine','benzalkonium chloride','serine','papain',\
'catecholamine','confirmado','metformin','torovirus','minipcr', 'propone','℃','nab','oxide','—','\u200b','enc','∞',\
 'prednisolone','angiotensin ii','nitric', 'ibuprofen','abidol', 'hek',\
 'monoxide','formoterol','contiene','carbohydrate',\
 'bast','dlco','volatile','instrucciones','isopropyl',\
 'ferroprotein','avis', 'oral–fecal',\
 'calcium','shankar', 'ccaa', 'lactobacillus','hydroxychloroquine sulfate', 'jingmen','pescado', 'speared',\
 'profesionales', 'wikipedia','bioline','tamil','anthony','bifidobacterium',\
 '−recovered','lebanon','legionella pneumophila','territorio', 'yeast',\
 'appetite','simplot','nfκb', 'francisco', 'hower', 'dcps','toremifene', 'liées', 'carol',\
 'ziff', 'l’origine', 'urgencia', 'viele', 'adecuadas','spectacle', 'ámbito', 'revu',\
 'grossesse','ethylene', 'mayoría', 'colorado','tehran','identificado','τend',\
 'aldehyde','engen','spine', 'scct',\
 'veroe','huoshenshan', 'uric acid','infectées', 'evag', 'cipomo','soporte', 'leishenshan','ukraine',\
 'xiaobo', 'allyl', 'choloroquine','infecciones','hurst', 'equipos',\
 'tcga','normokalemia', 'number—the','guerin', 'canadian',\
 'podría','penta','新型冠状病毒肺炎（covid', 'horsham',\
'≈','chloroquine phosphate','humboldt', 'eyedrop',\
'adenosine','adenosyl','thymosin','herramienta','coordenado','benzalkonium','estuvieron','disulfiram',\
'digluconate','collectrin','guanosine','provocan','methionine','guanine','garantice','thiazide',\
'sospechosos','chlorhexidine','rituximab','cefoperazone','sulbactam','mefloquine','cobicistat',\
'tin', 'irat','trat','treatmen', 'char','includi', 'pea','trib','contin', 'ques','inflamm', 'xpress', 'asymp', '1309',\
'ether','chloro', 'pregnan','quip','iron','amine','perte','clare','ather','uncer',\
'pathol', 'sage', 'fran', 'studie','virtu','mace','elm','maine', 'ipal','toch','merci','taco',\
'chas', 'breas', 'trus','huan','itus', 'viol', 'tibio', 'tenu','lora', 'coco',\
'sanita', 'xact','covide', 'decontaminat', 'mansfield','thea','proteo', 'gamm', 'nonin',\
'foca', 'laryngology','cooper', 'turkey', 'quot','spri', 'hemos','spice','lama', 'haled','quid',\
'scad','irvine', 'ilam','biotec', 'disinfectan', 'telle', 'ozone', 'transporte', 'rabbit',\
'incrementa', 'progrip', 'dice', 'apple','brescia', 'erythemato', 'louis', 'cas13','nicotine',\
'hfabp','eppi', 'ð½ðµ','lamb', '新型冠状病毒（','capitulate', 'kinin','bean', 'accru',\
'testosterone','angii', 'frontlines', 'huangshi','ephrology','tambi', 'germicidal', 'sace',\
'chicken', 'carbon dioxide', 'silc', 'carmen', 'calu', 'btcov', 'essai', 'cannabis',\
'–angiotensin', 'disembark', 'david', 'trevo', 'human milk', 'puesta', 'bata','conjugate',\
'paco', 'barré','qiao','parietex™', 'iodine', 'harte','zenker', 'conjugated', 'incheon', 'kegg','briggs',\
'ipom', 'imid', 'acetyl','contagios', 'ltcfs', 'crso', 'chla', 'titanium', 'dmts', 'utah','georgia',\
'nama', 'nitrous','kurdistan','hads','infodemiology', 'dornase alfa','covids','ð¸ñ…', 'hyde', 'methylene',\
'sbrt','urethane','gypsum','glycol', 'utla', 'staphylococcus aureus', 'sanitarios', 'nord',\
'orange', 'benefice', 'bis®','devra', 'wisconsin', 'nont', 'zhao','ltch', 'tyrosine',\
'sras','cortisone','codogno','seiqrd', 'aaga', 'liquorice', 'scandinavia', 'guérin', '焦虑症状', 'gia™','arabic', 'polyphenol', 'russian','quebec', 'habe', 'ldrt','psychotropes',\
'hrsace', 'fast™', 'hepatitis b virus','chws', 'assistdem', 'précautions', 'conoce','mcgrath', \
'cuadro', 'microsoft', 'mycoplasma pneumoniae', 'butyrate', 'sitr', 'cvir', 'methadone', 'arksey', 'sprayshield™', \
'alteplase','hydrocortisone','langone','aspirin', 'hdrt', 'austrian', 'philadelphia','naproxen', 'doacs','nicht', 'istanbul',\
'sysf','chlorogenic', 'saho','assut','prendre', 'yemen', 'ffpe', 'vnrs', 'ceap', 'qfpdt','polyethylene', 'digestifs','ifnl', 'panama', 'escenarios', 'ammonium',\
'honey', 'gbind', 'puis', 'hpsc', 'suramin','rhine', 'ccbs', 'según', 'chad', 'sniffin','nagpaul', 'yolo','polysorb', 'mead','connaissances','zahl', 'alovudine',\
'médecin', 'bridgepoint', 'dane', 'opium', 'connu', 'considerar', 'zinc sulfate','gelpoint', 'verlauf',\
'gps™','chuv', 'mvaova', 'hslam', 'certaines', 'dinucleotides', 'saccharin', 'civid', 'geben', 'phenolic', \
'picovacc', 'stella', 'flavonoid',\
'folic acid', 'bengal','chup', 'ð°ð¼','multidisciplinario', 'ölüm', 'slnb', 'cscs', 'twinsuk', 'enfin',\
'antihistamine', 'biologiques',  'oestrogen','methotrexate', 'sinophobic', 'â\x80¯±â\x80¯','sulphate',\
'ñ€ðµñ\x81ð¿ð¸ñ€ð°ñ‚ð¾ñ€ð½ð¾ð³ð¾', 'â\x80¯ms', 'tropospheric','scfv', 'violencia','qualité','amies',\
'tiotropium', 'xpert®','butyl', 'folic','courte','wssci', 'shcs', 'carbonate', 'cardiff',\
'caffeic','wifi','hoffmann','ciaad','hospitalarios', 'ionized', 'glucoside', 'hospitalière',\
'auteur', 'copper', 'resnetv', 'folgen', 'higgins', 'calhr','jensenone', 'florian',\
'cov及', 'poliovirus','paraffin', 'novid', 'dynamesh®', 'dlnms', 'pyruvate', 'sdel', 'qifen',\
'cin','ether',\
'новой', 'hydroxybutyrate', 'erences', 'chelators', 'klebsiella pneumoniae','tdd‐ncp','trendstm',\
'indium', 'propio', 'macaflavanone', 'ppab', 'evicel', 'hypothèse', 'tiantan', 'niran', 'progrip™',\
'glucan', 'vodan', 'chiffon', 'kaplan–meier', 'instituciones', 'alveolo', 'atlanta','antidepressant',\
'monophosphate', 'mycophenolic', 'hazelwood', 'amcs', 'flavone', 'eswabs', 'autacoid', 'ascorbic acid',\
'aichi', 'chemai', 'moroccan', 'grecco','permitan','cuimc', 'benzene', 'supone', 'ambarl', 'desarrolladas',\
'effectivene','steroid','nemen', 'taco','chas','immunol', 'nterleukin', 'hejia','к', 'sterol',\
'luminal','新型冠状病毒（', 'tcid', 'capitulate','resveratrol',\
'biliar','poblaci','tonavir','calcitonin','aldosterone',\
'cli', 'tri', 'gly','pandemonium','auteur',\
'lithuania', 'swissadme','constituye','preperiod','santiago', 'parsimony','magnesium',\
'cirugías', 'diesem', 'rtv–ifn','servir', 'verteporfin', 'inflamatorio', 'hungarian', 'sildenafil', 'prácticas',\
'quirúrgicas', 'cell·μl', 'catharina', 'miglustat', 'gargle', 'manhattan','équipes','human‐to‐human',\
'condado', 'atovaquone', 'dihydrotestosterone','conformité','stopcovid','lysosomotropic','matthew',\
'vir','ð¸ð½ñ„ðµðºñ†ð¸ð¸','ñ‡ðµð»ð¾ð²ðµðºð°','ð¼ðµñ€ð¾ð¿ñ€ð¸ñ\x8fñ‚ð¸ð¹', 'individuelle','dorries','face‐to‐face','sars–cov','imágenes',\
'revisiones','palomo','nivolumab','persian','im√°genes','remifentanil','velosorb‚Ñ¢','hyaluronan',\
'neurolog√≠a','homocysteine','esfuerzos','clungene','sonicision‚Ñ¢','diasorin','biblioteca',\
'adecuado','pleuraseal','adrenaline','lockdown','pontryagin',\
'médecine','charmm','selenium','wheat',\
'diphosphate','calmette–guérin','bromhexine','alabama',\
'infectadas','flavonoid','telehospice','distrés','citrate','high‐risk','minnesotan',\
'joseph','beneficios','metronomic','neurología','escherichia','elecsys®',\
'hepatitis b virus','human‐to‐human','pork','higiene',\
'alabama','ascorbic','egg yolk','allium','low‐risk','‘lockdown’','infectadas','mild‐to‐moderate',\
'verpleegkundigen','verpleegkundige','sucrose','copenhagen','syndrome‐related','life‐saving','harvey','human papillomavirus',\
'implicaciones','demand‐side','covideo']


In [None]:
all_therapies_list  = [word for word in all_therapies_list if word not in stopwords]



In [None]:

#Count occurrences of therapies in the full bag of words
potentials =[[x,abstract_bow_full.count(" "+x+" ") ]for x in set(all_therapies_list)]


In [None]:
#Count occurrences of therapies in the full bag of words
#potentials =[[x,abstract_bow_full.count(x) ]for x in set(all_therapies_list)]

#sort list
from operator import itemgetter
all_drugs = sorted (potentials, key=itemgetter(1), reverse  = True)
#therapies = all_drugs
#remove noise
therapies = []
for sublist in all_drugs:
    if sublist[1] >3:
        #print (sublist)
        therapies.append(sublist)

# #there is still noise and so exclude items mentioned twice or less
# therapies_f = []
# for sublist in all_drugs:
#     if sublist[1] >2:
#         print (sublist)
#         therapies_f.append(sublist)

# #Remove subwords found
# #eg prednisolone and chloroquine
# #as a by product ritonavir is also removed which is unfortuate but not a problem
        
# therapies = []
# for therapy in therapies_f:
#     subword = False
#     for therapy_t in therapies_f:
#         #print ('therapy, therapy_t',therapy, therapy_t)
#         if (therapy_t[0].find(therapy[0])) >0:
#             subword = True
#     if subword == False:
#         therapies.append(therapy)

    

In [None]:
print (all_drugs)

In [None]:
print (therapies)

In [None]:
print(len(therapies))

In [None]:
import json
with open('therapyfile.txt','w') as f:
    json.dump(therapies, f)

In [None]:
df_covid19.to_csv('df_covid19.csv')

# Find Therapies

In [None]:
df_covid19 = pd.read_csv('df_covid19.csv')

In [None]:
therapies = [['tizoxanide', 9], ['dupilumab', 9]]

In [None]:
import json
with open('therapyfile.txt') as f:
    therapies = json.load(f)

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
import re

import nltk
from textblob import TextBlob


In [None]:
def abstract_title_filter(df, search_string):
    return (df.Abstract.str.lower().str.replace('-', ' ')
            .str.contains(search_string, na=False) |
            df.Title.str.lower().str.replace('-', ' ')
            .str.contains(search_string, na=False))

# def abstract_title_filter(df, search_string):
#     return (df.Abstract.str.lower()
#             .str.contains(search_string, na=False) |
#             df.Title.str.lower()
#             .str.contains(search_string, na=False))

def tag_therapy(df: pd.DataFrame, synonym_list: list,
                tag_suffix: str) -> pd.DataFrame:
    # Flag rows whose title or abstract matches any synonym
    df[f'tag_{tag_suffix}'] = False
    for s in synonym_list:
        synonym_filter = abstract_title_filter(df, s)
        df.loc[synonym_filter, f'tag_{tag_suffix}'] = True
    print(f'Added tag_{tag_suffix} to DataFrame')
    return df



In [None]:
pd.options.mode.chained_assignment = None  # default='warn'

#set a flag for occurrence of a therapy in abstract
for therapy in therapies:
    s = therapy[0]
    df_covid19[f'tag_{s}'] = False
    therapy_filter = abstract_title_filter(df_covid19, s.replace('-', ' '))
    df_covid19.loc[therapy_filter, f'tag_{s}'] = True

    


    


In [None]:
#set therapy based upon flags set
df_covid19['Therapy'] = 'none'
for therapy in therapies:
    s = therapy[0]
    df_covid19['Therapy'] = np.where(df_covid19[f'tag_{s}'],s,df_covid19['Therapy'])



In [None]:
#correct spello?
df_covid19['Abstract'] = df_covid19['Abstract'].replace({'ivermetin' : 'ivermectin'}, regex=True)


In [None]:
#set a flag for occurrence of "hypercoagulable" or similar
hypercoagulable_synonyms = ['hypercoagul','coagula','platelet recovery','fibrin','plasma exchange',
                    'thrombo','ccb','revascularization']

#for test
# hypercoagulable_synonyms = ['coagulant']

df_covid19, covid19_counts = count_and_tag(df_covid19, hypercoagulable_synonyms, 'hypercoagulable')

    
    


In [None]:
#Determine the Study Type
Randomized_Control_Trial = ['were randomized', 'randomized.*trial was']
Model   = ['modelling','model','molecular docking','modeling','immunoinformatics']
Systematic_Review = ['systematic review', 'meta-analysis', 'data sources.*searched', 'search.*published']
Literature_Review = ['literature','search in']
Retrospective_Observational = ['record review','retrospective observational', 'observational cohort','retrospective clinical','retrospective cohort','simulated.*study','scoping review']
Non_randomized_Trial = ['non-randomized trial','Clinical trial','single arm protocol']
d = {'Systematic Review' : Systematic_Review,
     'Literature Review' : Literature_Review,
     'Retrospective Observational' : Retrospective_Observational,
     'Non-randomized_Trial' : Non_randomized_Trial,
     'Randomized Control Trial' : Randomized_Control_Trial,
     'Model' : Model}

#Map each search pattern to its study type; a later match overwrites an earlier one
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
df_covid19['Study Type'] = 'Other'
for k, v in d1.items():
    #na=False treats missing abstracts as non-matching
    df_covid19.loc[df_covid19['Abstract'].str.contains(k, case=False, na=False), 'Study Type'] = v
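
Because each matching pattern overwrites the label set by an earlier one, the order of the dictionary decides which study type wins when an abstract matches several patterns. The stand-alone sketch below shows that precedence on plain strings; `STUDY_PATTERNS` is a shortened, assumed subset of the patterns and `study_type` is a name invented for the illustration.

```python
import re

# Later entries win when several patterns match (shortened, assumed subset)
STUDY_PATTERNS = [
    ("Literature Review", r"literature|search in"),
    ("Randomized Control Trial", r"were randomized|randomized.*trial was"),
    ("Model", r"modell?ing|model|molecular docking"),
]

def study_type(abstract: str) -> str:
    """Return the label of the last pattern that matches, or 'Other'."""
    label = "Other"
    for name, pattern in STUDY_PATTERNS:
        if re.search(pattern, abstract, flags=re.IGNORECASE):
            label = name
    return label

print(study_type("Patients were randomized after a literature search"))
print(study_type("Molecular docking of candidate drugs"))
print(study_type("A single case report"))
```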


In [None]:
# dftest = pd.DataFrame({'Yes': ['heart. Methods and liver.  functions (p<0.05). conclusion The degrees of lymphopenia and proinflammatory','blank. Method verse is here. Result So that in . Conclusion There is no word Primary endpoint Systematic review here. but  not here', 'AIM To evaluate the clinical value of abnormal laboratory results of multiple organs. RESULTS Elevated neutrophil-to-LYM ratio (NLR), D-dimer(D-D), interleukin (IL)-6, IL-10, IL-2, interferon-Y, and age were significantly associated with the severity of illness. D-D increased from 0.5 to 8, and the risk ratio increased from 2.75 to 55 heart and liver functions (p<0.05). CONCLUSIONS The degrees of lymphopenia and proinflammatory cytokine storm were higher in severe COVID-19 patients than in mild cases. The degree was associated with the disease severity. Advanced age, NLR, D-D, and cytokine levels may serve as useful prognostic factors for the early identification of severe COVID-19 cases. ','here we have endpoint. method molecular. conclusion docking', 'beginning here. result Here primary outcome. conclusion we have model'], 'No': [131, 10,6,2,4]})

# endstr = '(?=(?i)\..result)'

# dftest['Method'] = dftest.Yes.str.extract('((?<=(?i)\..method).*(?=(?i)\..result))')
# dftest.fillna('None', inplace=True)

# dftest['Method1'] = dftest.Yes.str.extract('((?<=(?i)\..method).*(?=(?i)\..conclusion))')
# dftest['Results'] = dftest.Yes.str.extract('((?<=(?i)\..result).*(?=(?i)\..conclusion))')
# dftest['Method'] = np.where(dftest['Method']  == 'None', dftest['Method1'], dftest['Method'])

# dftest


#(?=(?i)\..result)|

In [None]:
# Default values for columns that are filled in later (or left as placeholders)
df_covid19['Conclusion'] = 'none'
df_covid19['Sample Severity of Symptoms'] = 'Not Available'
df_covid19['Primary Endpoint(s) of Study'] = 'Not Available'

# Find the conclusion: split each abstract at the first conclusion-like marker
# and keep everything from there to the end of the abstract.
searched_word_list = [r'\. conclusion', r'in conclusion', r'\. discussion', r'in summary', r'preliminary findings', r'our findings', r'\. recommendation', r'learning points', r'\. interpretation', r'we conclude', r'we suggest']
searched_words = '(?i)' + '|'.join(searched_word_list)
new = df_covid19['Abstract'].str.split(searched_words, n=1, expand=True)
new['len'] = new[0].str.len()

# Three columns (text before, text after, len) means at least one abstract contained a marker
if len(new.columns) == 3:
    new[1] = new[1].fillna('None')
    df_covid19['Ab Len'] = new['len']
    df_covid19['Conclusion'] = "Conclusion" + new[1]
    # If the text before the marker (the whole abstract when no marker is found)
    # is shorter than 350 characters, give the full abstract instead
    df_covid19['Conclusion'] = np.where(df_covid19['Ab Len'] < 350, df_covid19['Abstract'], df_covid19['Conclusion'])
else:
    df_covid19['Conclusion'] = df_covid19['Abstract']

                                              

In [None]:
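# Fill the Method and Results columns where the abstract has labelled sections.
# Case-insensitivity is passed via flags: an inline (?i) in the middle of a
# pattern is deprecated and raises an error from Python 3.11.

df_covid19['Method'] = df_covid19['Abstract'].str.extract(r'(?<=\..method)(.*)(?=\..result)', flags=re.IGNORECASE, expand=False)
df_covid19.fillna('None', inplace=True)

# Fall back to the text between "method" and "conclusion" when there is no "result" section
df_covid19['Method1'] = df_covid19['Abstract'].str.extract(r'(?<=\..method)(.*)(?=\..conclusion)', flags=re.IGNORECASE, expand=False)
df_covid19['Method'] = np.where(df_covid19['Method'] == 'None', df_covid19['Method1'], df_covid19['Method'])

df_covid19['Results'] = df_covid19['Abstract'].str.extract(r'(?<=\..result)(.*)(?=\..conclusion)', flags=re.IGNORECASE, expand=False)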


In [None]:
numberset = 'one|two|three|four|five|six|seven|eight|nine|ten'

# Search for sentences that state a sample size.
# The number words are grouped so that the alternation applies only to them,
# and \d+ requires at least one digit before "patients".
searched_word_list = [r'\d+ eligible participants', r'\d+ patients', r'(?:' + numberset + r') patients?']
searched_words = '|'.join(r"\b{}\b".format(x) for x in searched_word_list)
searched_words = '(?i)' + searched_words
df_covid19['Sample Size'] = df_covid19['Abstract'].apply(lambda texta: [sent for sent in sent_tokenize(texta)
                                       if re.search(searched_words, sent)])
 

In [None]:
# Search for sentences that describe treatment
# removed 'all consecutive patients',
searched_word_list = [r'were.*treated', 'treated patients', 'patients treated', 'participants treated', 'participants receiving', 'we enrolled', 'patients in the study', 'patients enrolled', r'patients.*identified', r'patients.*analyzed']
searched_words = '|'.join(r"\b{}\b".format(x) for x in searched_word_list)
searched_words = '(?i)' + searched_words
df_covid19['Treatment'] = df_covid19['Abstract'].apply(lambda texta: [sent for sent in sent_tokenize(texta)
                                   if re.search(searched_words, sent)])
# Store the matched sentences as a string so that .str methods can be used on the column later
df_covid19['Treatment'] = df_covid19['Treatment'].apply(str)


In [None]:
# pd.set_option('display.max_columns', 500)
# df_covid19_bow

In [None]:
# w = 'molecular docking'
# print (stemmer.stem(w.lower()))

In [None]:

def therapy_prep_abstract_enhanced (df,s):

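    # Keep only the papers tagged for this search term; copy so later column assignments do not touch df
    df_covid19_t = df[df[f'tag_{s}'] == True].copy()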
    
    
#    df_covid19_t = df_covid19_t[df_covid19_t['therapyinconc'== True ] ]
#    df_covid19_t = df_covid19_t[df_covid19_t['conc'] == True ]
    
#     df_covid19_t = df_covid19_t[df_covid19_t['rev'] == False ]
#     df_covid19_t = df_covid19_t[df_covid19_t['mmod'] == False ]
#     df_covid19_t = df_covid19_t[df_covid19_t['hmod'] == False ]
    
    df_covid19_t = df_covid19_t.reset_index(drop=True)
    
#     df_covid19_t['Conclusion']= 'none'
#     #print ("s",s)
    


        
    # For the therapy question, drop the paper if the therapy is not mentioned in the conclusion
    if s != "hypercoagulable":
        df_covid19_t = df_covid19_t[df_covid19_t.Conclusion.str.lower().str.replace('-', ' ').str.contains(s, na=False)].copy()
        df_covid19_t['Therapy'] = s

    # Determine the sentiment of the conclusion (TextBlob polarity, from -1 to 1)
    df_covid19_t['Conclusion'] = df_covid19_t['Conclusion'].astype(str)
    df_covid19_t['Conclusion Sentiment'] = df_covid19_t['Conclusion'].apply(lambda conclusion: TextBlob(conclusion).sentiment.polarity)

   
    
    
#Search for endpoints
#     searched_word_list   = ['primary*endpoint', 'primary*end point', 'primary*outcome','secondary*endpoint', 'secondary*end point','secondary*outcome']
#     searched_words = '|'.join(r"\b{}\b".format(x) for x in searched_word_list)
#     searched_words = '(?i)' + searched_words
#     df_covid19_t['Endpoint(s) of Study'] = df_covid19_t['Abstract'].apply(lambda texta: [sent for sent in sent_tokenize(texta)
#                                        if re.search(searched_words,sent)])
    
#Search for symptoms
#     searched_word_list  =['pneumonia']
#     searched_words = '|'.join(r"\b{}\b".format(x) for x in searched_word_list)
#     searched_words = '(?i)' + searched_words
#     df_covid19_t['Sample Severity of Symptoms'] = df_covid19_t['Abstract'].apply(lambda texta: [sent for sent in sent_tokenize(texta)
#                                        if re.search(searched_words,sent)])


  
    



    
    df_covid19_display = df_covid19_t[['Study Type','Therapy','Conclusion Sentiment','Conclusion','Method','Results','Treatment','Sample Size','Publish_Date','Title','Ref URL','Journal','Abstract',  'Authors', 'Source', 'doi','cord_uid','Sample Severity of Symptoms','Primary Endpoint(s) of Study']]
#    df_covid19_display = df_covid19_t[['Study Type','Sample Severity of Symptoms','Conclusion Sentiment','Sentiment','Treatment Prep','Publish_Date','Title','Ref URL','Journal','Therapy','Conclusion','Endpoint(s) of Study','Authors', 'Source', 'doi','cord_uid']]
    df_covid19_display = df_covid19_display.rename(columns={'Publish_Date': 'Date', 'Title':'Study', 'doi':'DOI', 'cord_uid':'CORD_UID','Ref URL':'Study Link',\
                                                           'Therapy':'Therapeutic method(s) utilized/assessed'\
                                                           })

    #    df_covid19_display = df_covid19_t[['Therapy','Endpoint','Publish_Date','Title','Abstract', 'Journal', 'Authors', 'Source', 'Ref URL','endp']]
    # Sort by date, most recent first
    df_covid19_display['Date'] = pd.to_datetime(df_covid19_display.Date)
    df_covid19_display['Date'] = df_covid19_display['Date'].dt.date
    df_covid19_display.sort_values(by=['Date'], inplace=True, ascending=False)
    df_covid19_display = df_covid19_display.reset_index(drop=True)
    
    
#    df_covid19_display = df_covid19_display.rename(columns={'Abstract': '___________________Abstract___________________'})
    df_covid19_display = df_covid19_display.rename(columns={'Conclusion Sentiment':'Clinical Sentiment (range 1/-1'})

    
    return (df_covid19_display)


# Literature mentioning therapies 

mention of endpoint
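
The conclusion column used in the tables below comes from splitting each abstract at the first conclusion-like marker. The following is a minimal sketch of that idea on toy data: the abstracts are invented for illustration, the marker list is only a subset of the one used in this notebook, and falling back to the full abstract when no marker is found is a simplification of the notebook's length-based rule.

```python
import pandas as pd

# Toy abstracts, invented for illustration (not taken from the dataset)
abstracts = pd.Series([
    "Background text. Methods were simple. Conclusion: drug A reduced clot risk.",
    "We studied 12 patients. In summary, heparin appeared safe.",
    "A short note with no marker at all.",
])

# A subset of the markers; (?i) at the start makes the whole pattern case-insensitive
markers = [r"\. conclusion", r"in conclusion", r"in summary", r"we conclude"]
pattern = "(?i)" + "|".join(markers)

# Split once at the first marker: column 0 is the text before it, column 1 the text after it
parts = abstracts.str.split(pattern, n=1, expand=True)

# Where no marker was found column 1 is missing, so fall back to the full abstract
conclusion = parts[1].fillna(abstracts)

print(conclusion.tolist())
```

Because the split consumes the marker itself, the remaining text starts straight after it (for example with a colon or comma), which is why the notebook prepends the word "Conclusion" to the extracted text.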

In [None]:
#This method finds sentences with the word conclusion. Some conclusions are multisentence though.

# searched_words=['conclus', 'conclude']
# df_covid19 = df_covid19.replace(np.nan, '', regex=True)

# #    df_covid19_t['abstract']=df_covid19_t['Abstract'].apply(str)

# df_covid19['Abstract'].apply(str)
# #    df_covid19_t = df_covid19_t.assign(Abstract=lambda df_covid19_t: df_covid19_t.Abstract +" a conclusion and result Ends.")

# df_covid19['ConclusionList'] = df_covid19['Abstract'].apply(lambda texta: [sent for sent in sent_tokenize(texta)
#                                        if any(True for w in word_tokenize(sent) 
#                                                if stemmer.stem(w.lower()) in searched_words)])
# df_covid19['ConclusionList'].apply(str)
# df_covid19['Conclusion']=df_covid19.ConclusionList.apply(lambda x: ' '.join(map(str, x)))

In [None]:
# Replace any remaining missing values with the placeholder 'NA'
df_covid19 = df_covid19.fillna('NA')


In [None]:
# searched_word_list  =['pneumonia']
# searched_words = '|'.join(r"\b{}\b".format(x) for x in searched_word_list)
# searched_words = '(?i)' + searched_words

# print (searched_words)

# df_covid19['Sample Severity of Symptoms'] = df_covid19['Abstract'].apply(lambda texta: [sent for sent in sent_tokenize(texta)
#                                        if re.search(searched_words,sent)])


# df_covid19['Sample Severity of Symptoms'] = df_covid19['Abstract'].apply(lambda texta: [sent for sent in sent_tokenize(texta)
#                                        if any(True for w in word_tokenize(sent) 
#                                                if stemmer.stem(w.lower()) in searched_words)])

# .str.contains(k, case=False), 'Study Type'] 

In [None]:
df_covid19.to_csv('df_covid19a.csv')

In [None]:
df_covid19 = pd.read_csv('df_covid19a.csv')

In [None]:
# Therapies by abstract: build one table per therapy, then stack them into a single table
display_frames = [therapy_prep_abstract_enhanced(df_covid19, therapy[0]) for therapy in therapies]
df_covid19_display_full = pd.concat(display_frames, ignore_index=True)



In [None]:
df_covid19_display_full = df_covid19_display_full.sort_values('Clinical Sentiment (range 1/-1',ascending=False)
dfStyler = df_covid19_display_full.style.set_properties(**{'text-align': 'left',"font-size": "120%"})
dfStyler.set_table_styles([dict(selector='th', props=[("font-size", "150%"),('text-align', 'center')])])
   

In [None]:
df_covid19_display_full.shape

In [None]:

df_covid19_display_full.to_csv('What is the efficacy of novel therapeutics being tested currently.csv')


In [None]:
# dfStyler = df_covid19_display_hypercoagulable.style.set_properties(**{'text-align': 'left',"font-size": "120%"})
# dfStyler.set_table_styles([dict(selector='th', props=[("font-size", "150%"),('text-align', 'center')])])


In [None]:
# searched_words = '|'.join(r"\b{}\b".format(therapy[0]) for therapy in therapies)
# searched_words = '(?i)' + searched_words
 

# Find papers that mention Hypercoagulable or a synonym

In [None]:
s = "hypercoagulable"
df_covid19_display_hypercoagulable = therapy_prep_abstract_enhanced (df_covid19,s)


In [None]:
df_covid19_display_hypercoagulable = df_covid19_display_hypercoagulable.sort_values('Clinical Sentiment (range 1/-1',ascending=False)

dfStyler = df_covid19_display_hypercoagulable.style.set_properties(**{'text-align': 'left',"font-size": "120%"})
dfStyler.set_table_styles([dict(selector='th', props=[("font-size", "150%"),('text-align', 'center')])])


In [None]:
df_covid19_display_hypercoagulable.shape

In [None]:
df_covid19_display_hypercoagulable.to_csv('Hypercoagulable state seen in COVID-19 papers.csv')


# Only present papers that mention Hypercoagulable in the conclusion or treatment 
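
The selection in this section can be sketched on a toy table: keep a paper if a search term appears in its conclusion or in its treatment sentences, then drop duplicates. The data, the two search terms, and the helper `mentions` are hypothetical and chosen only for illustration; the notebook itself builds the pattern from its `therapies` list.

```python
import pandas as pd

# Toy table, invented for illustration
papers = pd.DataFrame({
    "Abstract": ["a1", "a2", "a3", "a4"],
    "Conclusion": [
        "Heparin reduced mortality.",
        "No clear benefit.",
        "Low-molecular-weight heparin helped.",
        "Unrelated finding.",
    ],
    "Treatment": [
        "['Patients were treated with heparin.']",
        "['Patients treated with tocilizumab.']",
        "[]",
        "[]",
    ],
})

# Hypothetical search terms, joined into one case-insensitive whole-word pattern
terms = ["heparin", "tocilizumab"]
pattern = "(?i)" + "|".join(rf"\b{t}\b" for t in terms)


def mentions(column):
    """Boolean mask: True where the text mentions any search term."""
    text = column.str.lower().str.replace("-", " ", regex=False)
    return text.str.contains(pattern, na=False)


in_conclusion = papers[mentions(papers["Conclusion"])]
in_treatment = papers[mentions(papers["Treatment"])]

# Union of both selections; a paper matched by both filters is kept once
selected = pd.concat([in_conclusion, in_treatment]).drop_duplicates(subset="Abstract", keep="first")
print(selected["Abstract"].tolist())
```

Concatenating the two filtered tables and de-duplicating on the abstract keeps the conclusion matches first, which mirrors the order used in the cells below.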

In [None]:
searched_words = '|'.join(r"\b{}\b".format(therapy[0]) for therapy in therapies)
searched_words = '(?i)' + searched_words
   

In [None]:
df_covid19_display_conc =  df_covid19_display_hypercoagulable[df_covid19_display_hypercoagulable.Conclusion.str.lower().str.replace('-', ' ').str.contains(searched_words) ==True]

#add method and results
#df_covid19_display_conc['method'] = df_covid19_display_conc.Abstract.str.extract(r'(?<=METHOD)*r'(?<=RESULT))')
# df_covid19_display_conc['Method'] = df_covid19_display_conc.Abstract.str.extract('((?<=method).*(?=(?i)result))')
# df_covid19_display_conc['Results'] = df_covid19_display_conc.Abstract.str.extract('((?<=result).*(?=(?i)conclusion))')



In [None]:
df_covid19_display_conc.shape

In [None]:
df_covid19_display_treat =  df_covid19_display_hypercoagulable[df_covid19_display_hypercoagulable.Treatment.str.lower().str.replace('-', ' ').str.contains(searched_words) ==True]


In [None]:
df_covid19_display_treat.shape

In [None]:
df_covid19_display = pd.concat([df_covid19_display_conc, df_covid19_display_treat], ignore_index=False)
df_covid19_display.shape

In [None]:
df_covid19_display = df_covid19_display.drop_duplicates(subset='Abstract', keep="first")


In [None]:
df_covid19_display.shape

# Papers that mention Hypercoagulable or a synonym in the conclusion or as a treatment

In [None]:
df_covid19_display = df_covid19_display.sort_values('Clinical Sentiment (range 1/-1',ascending=False)

dfStyler = df_covid19_display.style.set_properties(**{'text-align': 'left',"font-size": "120%"})
dfStyler.set_table_styles([dict(selector='th', props=[("font-size", "150%"),('text-align', 'center')])])



In [None]:
df_covid19_display.to_csv('What is the best method to combat the hypercoagulable state seen in COVID-19.csv')


#### EndOfFile<a id='vote'></a>