# Abstract


***Background:*** Some drugs that might treat COVID-19 have been studied, and dozens of publications reporting new chemicals that interact with COVID-19 are published every day. Since scientific publications have no pre-defined format or organization, these chemicals can only be retrieved on the basis of their co-occurrence with coronavirus diseases within a publication. Unfortunately, a chemical and a coronavirus disease can be mentioned together without any relation between them. To clarify this concept, Example 1 below reports an explicitly stated relation between chemical〈type I interferon〉and MERS-CoV. In Example 2, on the other hand, chemical〈cysteine〉and SARS-CoV co-occur without any relation between them. 

* Example 1: In vitro, <font color='orange'>MERS-CoV</font> is highly sensitive to <font color='green'>type I interferon</font>.
* Example 2: Next, we replaced the corresponding <font color='green'>cysteine</font> in <font color='orange'>SARS-CoV</font> Nsp9 by alanine and serine.

This forces researchers to analyze a large number of documents to find the actual relations of interest.

<br>

***Results:*** To provide researchers with formats that can be more easily queried and analyzed, bio-entity relations in the publications must first be annotated. We used state-of-the-art text-mining tools to automatically extract thousands of Chemical-COVID-19 relations from 29,322 publications in the COVID-19 Open Research Dataset (CORD-19).

Firstly, the bio-entities in the publications were recognized by [scispaCy][1]. Then, we removed the entities that are neither Chemicals nor coronavirus Diseases. For Chemicals, we filtered out the entities that do not appear in the [CTD][2] chemical vocabulary. For Diseases, we discarded the entities that do not belong to a predefined list of coronavirus Diseases. The entity annotation includes both the mention text spans (e.g., clarithromycin) and the normalized concept identifiers in MESH (e.g., D017291). At the end of this step, 406,787 Chemical mentions (8,886 concepts) and 88,943 coronavirus Disease mentions (1 concept) had been recognized.

After that, the relations between the extracted entities were annotated by [BioBERT][3] trained on the BioCreative V Chemical Disease Relation ([CDR][4]) dataset. This dataset consists of 1,500 PubMed articles with 1,279 annotated chemicals, 1,188 diseases and 3,116 chemical-disease interactions. For the BioBERT model development, we fine-tuned BioBERT on the BioCreative training set. Then, to check the model's ability to extract correct relations, we evaluated it on the BioCreative test set, obtaining an F1-score of 59.25 (precision: 57.20; recall: 61.44). Given these promising results, the model was finally used to automatically annotate 287 potential Chemical-COVID-19 relations from CORD-19.
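
As a quick sanity check on the figures above, the F1-score is the harmonic mean of precision and recall. The snippet below is a minimal sketch that recomputes it from the rounded values reported here; the function name `f1_score` is ours and is not part of the BioBERT distribution. The result agrees with the reported 59.25 up to a small difference, presumably because the reported precision and recall are themselves rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) reported for the BioCreative V CDR test set.
print(round(f1_score(57.20, 61.44), 2))
```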

Given the dependency of relations on concepts, our test collection contains the full-text publications retrieved from CORD-19, the Chemical/Disease entity mention and entity concept annotations, and the relation annotations among the recognized entity concepts in the same set of articles. Figure 1 shows an example of an annotated publication in our dataset. **We made the dataset and all the intermediate data produced in this work available to the research community at this** [link][7].

<br>

![Example of annotated publication](http://hlt-services4.fbk.eu/kaggle/example.PubTator.png)

Figure 1: *In article〈ID:426da8c3fb9c6792b5d26214d55471099877e337〉there are 72 Chemical mentions (10 concepts) and 19 Disease mentions (1 concept). For example, Chemical〈Ribavirin〉 has 52 mentions formed by the set {Ribavirin, ribavirin} but one concept (ID: D012254). Disease 〈coronavirus〉 has 19 mentions formed by the set {coronavirus, SARS-CoV} but one concept (ID: C000657245). Then, the article contains one relation between Chemical〈ID: D012254〉and Disease〈ID: C000657245〉.* 

<br>

***Discussion and Conclusions:*** We presented an annotated dataset of relations between Chemicals and COVID-19 entities. The thousands of relations extracted from the literature and included in our annotated dataset will make it easier for researchers to harvest data from publications.

The application of machine learning to Chemical-COVID-19 relation extraction from text is severely limited by the lack of annotated resources. To overcome this lack of resources for COVID-19, we exploited the annotations of the BioCreative V Chemical-Disease Relation dataset to train an effective model for relation extraction. A manual assessment of the quality of the extracted relations would be too expensive to perform. For this reason, we used the results obtained by our model on the BioCreative dataset as an estimate of the quality of the relations extracted from CORD-19. This approximation is possible because both the relations in the BioCreative dataset and the relations extracted from CORD-19 are of the same kind, namely chemicals treating diseases. To obtain a more accurate assessment, we intend to form a committee of domain experts to assess the quality of a sample of the relations extracted from CORD-19. This will also help us gain a better understanding of the errors affecting the current annotation and improve the implemented annotation pipeline. 

Another point we are considering for a future release of our annotated dataset is the annotation of relations between entity mentions. The current version of the annotated dataset only contains document-level relations among entity concepts. However, we think it might be of interest for researchers to have text evidence that explicitly describes the relation between two related entity mentions. Figure 2 highlights how the current concept-level annotation of the dataset might be enriched with mention-level annotation.

Finally, to let researchers browse publications quickly and easily by searching for chemicals related to COVID-19, we are developing a web application for publication retrieval (Figure 2). **A very preliminary version of this application is available at this** [address][6].

<br>

![Example of annotated publication](http://hlt-services4.fbk.eu/kaggle/covid19demoonline_2.png)

Figure 2: *At document-level there is a relation between chloroquine and SARS-CoV concepts. At mention-level the text:〈This suggests that previous similar favourable results obtained in cellulo only with chloroquine for highly pathogenic viruses, such as the SARS-CoV〉supports a relation between two specific mentions of chloroquine and SARS-CoV concepts.*

<br>

[1]: https://allenai.github.io/scispacy/
[2]: http://ctdbase.org/reports/CTD_chemicals.tsv.gz
[3]: https://github.com/dmis-lab/biobert
[4]: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/
[5]: https://academic.oup.com/database/article/doi/10.1093/database/baw071/2630422
[6]: http://hlt-services4.fbk.eu/covid19demo/
[7]: https://www.kaggle.com/dataset/02a02946d7730a0519924d5717f05edaf3cd29fea311e43c7a8d6498676a63bf

In the rest of this document we report the main steps to produce our annotated dataset starting from the publications in CORD-19. Since the whole processing could take several hours, we annotate a single publication so that researchers can run our code and immediately check the expected output. However, removing a single filter in the code is enough to annotate the whole dataset, as described in Section *Bio-entity Recognition*. All the intermediate files produced by our code are available at this [address][7]. To run our code, the Accelerator option in the Kaggle interface has to be set to GPU.

Section **Installing scispaCy** describes how to install scispaCy and how we customized it. Section **Bio-entity Recognition** reports the code that can be executed to annotate the Chemical and COVID-19 entities. Section **Bio-entity Relation Extraction** shows how to install BioBERT and how to run the model trained on the BioCreative CDR dataset to annotate the relations among the extracted entities. Finally, Section **Post-Processing** reports the code to transform the predictions produced by BioBERT into the relation annotations in our dataset. 


# Installing scispaCy

![installation](http://hlt-services4.fbk.eu/kaggle/installation.png)

scispaCy is used to tokenize the COVID-19 dataset and to recognize the Chemicals and coronavirus Diseases in the publications. The Chemical entities were annotated by matching the entities recognized by scispaCy against the entries of the CTD chemical vocabulary. The Disease entities were annotated by matching the entities against a predefined list of coronavirus Diseases. In the following we describe how to install scispaCy and how we used it to implement our entity recognizer. 
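
The matching step described above can be sketched without scispaCy. The snippet below is only an illustration under stated assumptions: the tokens, their offsets and their IOB tags are written by hand (in the notebook they come from scispaCy), the two dictionaries are toy excerpts, and the names `lookup` and `match_entities` are hypothetical helpers that do not appear in our code. Consecutive `B`/`I` tokens are merged into one mention, and the lowercased mention is then looked up first in the chemical dictionary and then in the coronavirus dictionary.

```python
def lookup(span, chemical_dict, disease_dict):
    """Return span + [type, concept id] if the mention is in a dictionary, else None."""
    key = span[2].lower()
    if key in chemical_dict:
        return span + ["Chemical", chemical_dict[key]]
    if key in disease_dict:
        return span + ["Disease", disease_dict[key]]
    return None


def match_entities(tokens, chemical_dict, disease_dict):
    """tokens: list of (text, start_offset, iob_tag) tuples with tags B, I or O."""
    spans, current = [], None
    for text, start, iob in tokens:
        if iob == "I" and current is not None:
            # extend the mention under construction with this token
            current = [current[0], start + len(text), current[2] + " " + text]
            continue
        if current is not None:
            spans.append(current)
            current = None
        if iob == "B":
            current = [start, start + len(text), text]
    if current is not None:
        spans.append(current)  # a mention that ends exactly at the end of the text
    hits = (lookup(span, chemical_dict, disease_dict) for span in spans)
    return [hit for hit in hits if hit is not None]


# Hand-written tokens for "ribavirin inhibits MERS coronavirus replication".
tokens = [("ribavirin", 0, "B"), ("inhibits", 10, "O"), ("MERS", 19, "B"),
          ("coronavirus", 24, "I"), ("replication", 36, "O")]
chemicals = {"ribavirin": "D012254"}           # toy excerpt of the chemical vocabulary
diseases = {"mers coronavirus": "C000657245"}  # toy excerpt of the coronavirus list
print(match_entities(tokens, chemicals, diseases))
```

Entities recognized by the tagger but absent from both dictionaries are simply dropped, which is how the pipeline restricts the output to Chemicals and coronavirus Diseases.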

In [None]:
!# Install scispaCy for tokenization, sentence splitting and bio-entity recognition
!pip install scispacy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz

In [None]:
import spacy
from spacy.tokenizer import Tokenizer

class SCISPACY:
    """This class wraps scispaCy. Tokenization, sentence splitting and bio-entity recognition are performed. The CTD chemicals vocabulary is used to recognize the Chemical entities. The list
    of predefined coronavirus Diseases is used to recognize the Disease entities.
    """

    #list of chemicals with their MESH ID in the CTD vocabulary
    ctd_chemical_dictionary = {}
    
    #list of Coronavirus synonyms with their MESH ID
    coronavirus_dictionary = {'covid-19': 'C000657245',
                              'covid 19': 'C000657245',
                              'coronavirus': 'C000657245',
                              'corona virus': 'C000657245',
                              'corona-virus': 'C000657245',
                              '2019-ncov': 'C000657245',
                              '2019 ncov': 'C000657245',
                              '2019ncov': 'C000657245',
                              'sars-cov': 'C000657245',
                              'sars cov': 'C000657245',
                              'mers-cov': 'C000657245',
                              'mers': 'C000657245',
                              'mers-coronavirus': 'C000657245',
                              'mers coronavirus': 'C000657245',
                              'mers cov': 'C000657245',
                              'severe acute respiratory syndrome coronavirus': 'C000657245',
                              'severe acute respiratory syndrome': 'C000657245',
                              'severe acute respiratory syndrome-associated coronavirus': 'C000657245',
                              'severe acute respiratory syndrome associated coronavirus': 'C000657245',
                              'middle east respiratory syndrome coronavirus': 'C000657245',
                              'middle east respiratory syndrome': 'C000657245',
                              'middle east respiratory coronavirus': 'C000657245',
                              '2019-ncov disease': 'C000657245',
                              '2019 ncov disease': 'C000657245',
                              '2019ncov disease': 'C000657245',
                              '2019-ncov infection': 'C000657245',
                              '2019 ncov infection': 'C000657245',
                              '2019ncov infection': 'C000657245',
                              '2019 novel coronavirus disease': 'C000657245',
                              '2019 novel coronavirus infection': 'C000657245',
                              'coronavirus disease-19': 'C000657245',
                              'coronavirus disease 2019': 'C000657245',
                              'hcov': 'C000657245',
                              'cov': 'C000657245'}

    
    def __init__(self, custom_tokenizer):

        print("Init scispaCy")

        print(spacy.__version__)

        # load the scispaCy model
        self.nlp = spacy.load("en_core_sci_lg")

        # create a custom tokenizer that does not split multi-words entities like mers-cov
        if (custom_tokenizer == True):
            self.nlp.tokenizer = self.create_custom_tokenizer(self.nlp)
            self.nlp.add_pipe(self.prevent_sentence_boundary_detection, name='prevent-sbd', before='parser')

        # load the CTD vocabulary: column 0 is the chemical name, column 1 the
        # MESH id (prefixed with "MESH:") and column 7 the "|"-separated synonyms
        with open("/kaggle/input/ctd-chemicals/CTD/CTD_chemicals.tsv", encoding='utf-8') as input_file:
            for line in input_file:
                if not line.startswith("#"):
                    fields = line.rstrip("\n").split("\t")
                    chemical_name = fields[0]
                    # strip the "MESH:" prefix from the identifier
                    chemical_id = fields[1][5:]
                    # names and synonyms of 4 characters or fewer are skipped
                    if len(chemical_name) > 4:
                        self.ctd_chemical_dictionary[chemical_name.lower()] = chemical_id
                    synonyms = fields[7].split("|")
                    for synonym in synonyms:
                        if len(synonym) > 4:
                            self.ctd_chemical_dictionary[synonym.lower()] = chemical_id

        #print("Loaded " + str(len(self.ctd_chemical_dictionary)) + " chemicals from CTD.")


    def create_custom_tokenizer(self, nlp):
        """This tokenizer does not split multi-words entities like mers-cov"""

        my_prefix = r'[\-]'

        all_prefixes_re = spacy.util.compile_prefix_regex(tuple(list(nlp.Defaults.prefixes) + [my_prefix]))

        # Handle ( that doesn't have proper spacing around it
        custom_infixes = [r'\.\.\.+', r'(?<=[0-9])-(?=[0-9])', r'[!&:,()]']
        infix_re = spacy.util.compile_infix_regex(tuple(list(nlp.Defaults.infixes) + custom_infixes))

        suffix_re = spacy.util.compile_suffix_regex(nlp.Defaults.suffixes)

        return Tokenizer(nlp.vocab, nlp.Defaults.tokenizer_exceptions,
                         prefix_search=all_prefixes_re.search,
                         infix_finditer=infix_re.finditer, suffix_search=suffix_re.search,
                         token_match=None)


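    def custom_boundary(self, docx):
        """Start a new sentence after ";" or after a token ending in "." that is
        followed by a capitalized, non-alphabetic or mixed-case token."""
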
        for token in docx[:-2]:
            if (token.text[-1:] == "." and (docx[token.i + 1].text[0:1].isupper() or not docx[token.i + 1].is_alpha or (not docx[token.i + 1].is_lower and not docx[token.i + 1].is_upper))) or \
                    token.text == ";":
                docx[token.i + 1].is_sent_start = True

        return docx


    def prevent_sentence_boundary_detection(self, docx):

        for token in docx:
            # This will entirely disable spaCy's sentence detection
            token.is_sent_start = False

        return self.custom_boundary(docx)

    
    def tokenize(self, text):

        annotatedDocument = [] 
        doc = self.nlp(text)
        token_index = 0
        sentence = []
        for token in doc:
            if token.is_sent_start == True and len(sentence) != 0:
                annotatedDocument.append(sentence)
                sentence = []

            row = [token_index, token.idx, token.idx + len(token.text), token.text]
            sentence.append(row)
            token_index = token_index + 1

        annotatedDocument.append(sentence)

        return annotatedDocument


    def analyze_covid19(self, text):
        """Tokenize the COVID-19 dataset and recognize the Chemical and Disease entities.
        scispaCy is not able to distinguish between Chemicals and coronavirus Diseases. To do that we match 
        the entities recognized by scispaCy with the chemicals in the CTD chemical vocabulary and with the diseases in the predefined list of coronavirus diseases.
        """

        recognized_entities = []
        doc = self.nlp(text)
        # entity under construction: [start offset, end offset, text, type, MESH id]
        entity = []

        def flush(entity):
            """Store the entity if its text is a known Chemical or coronavirus Disease."""
            if entity == []:
                return
            key = entity[2].lower()
            if key in self.ctd_chemical_dictionary:
                entity[3] = "Chemical"
                entity[4] = self.ctd_chemical_dictionary[key]
            elif key in self.coronavirus_dictionary:
                entity[3] = "Disease"
                entity[4] = self.coronavirus_dictionary[key]
            else:
                return
            recognized_entities.append(entity.copy())

        for token in doc:

            entity_iob = token.ent_iob_

            if entity_iob == "B":
                # a new entity starts: store the previous one, if any
                flush(entity)
                entity = [token.idx, token.idx + len(token.text), token.text, None, None]
            elif entity_iob == "I":
                # extend the current entity with this token
                entity = [entity[0], token.idx + len(token.text), entity[2] + " " + token.text, None, None]
            else:
                # outside any entity: store the pending one, if any
                flush(entity)
                entity = []

        # store an entity that ends exactly at the end of the text
        flush(entity)

        return recognized_entities

if __name__ == '__main__':
    custom_tokenizer = False
    obj = SCISPACY(custom_tokenizer)
    print("Check if scispaCy has been installed correctly!")
    text = "Abatacept dose-dependently reduces T-cell proliferation, serum concentrations of acute-phase reactants, and other markers of inflammation, including the production of rheumatoid factor by B cells."
    print(obj.tokenize(text))


# Bio-entity Recognition

![entity recognition](http://hlt-services4.fbk.eu/kaggle/entity_recognition.png)

Before relation extraction, Chemicals and coronavirus Diseases have to be recognized. We used the scispaCy-based pipeline implemented above to recognize the bio-entities in the COVID-19 dataset. Since processing the whole dataset could take several hours, we demonstrate the code on a single publication taken from the COVID-19 dataset. To process the whole COVID-19 dataset, it is sufficient to remove the filter in the annotate_covid_19_dataset function defined below.

* Input: The COVID-19 dataset directory containing the JSON files of its publications.
* Output: a single file in PubTator format containing the full text publications and the recognized Chemical and Disease entities.
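
To make the output format concrete, the snippet below is a minimal sketch of a reader for the PubTator-style file written by the cell that follows: one `id|t|title` line, one `id|a|text` line, then one tab-separated line per entity (document id, start offset, end offset, mention, type, concept id), with a blank line closing each document. The function `parse_pubtator` and the two-line sample document are hypothetical and only serve as an illustration; offsets are counted over the title, one separator character, and the text, as in our code.

```python
def parse_pubtator(text):
    """Parse PubTator-style text into {doc_id: {"title", "abstract", "entities"}}."""
    documents = {}

    def get(doc_id):
        return documents.setdefault(doc_id, {"title": "", "abstract": "", "entities": []})

    for line in text.split("\n"):
        if not line.strip():
            continue  # a blank line separates documents
        head = line.split("|", 2)
        if len(head) == 3 and head[1] in ("t", "a") and "\t" not in head[0]:
            # title ("t") or text ("a") line
            get(head[0])["title" if head[1] == "t" else "abstract"] = head[2]
            continue
        fields = line.split("\t")
        if len(fields) == 6:
            doc_id, start, end, mention, entity_type, concept_id = fields
            get(doc_id)["entities"].append((int(start), int(end), mention, entity_type, concept_id))
    return documents


# Hypothetical sample document (not taken from CORD-19).
sample = (
    "doc1|t|Ribavirin and coronavirus.\n"
    "doc1|a|Ribavirin was tested against SARS-CoV.\n"
    "doc1\t0\t9\tRibavirin\tChemical\tD012254\n"
    "doc1\t14\t25\tcoronavirus\tDisease\tC000657245\n"
    "\n"
)
parsed = parse_pubtator(sample)
print(parsed["doc1"]["entities"])
```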

In [None]:
!# Create a running directory to store the processed publications 
!mkdir -p ./CORD-19-Running-Dataset/

In [None]:
import json
import os

def annotate_covid_19_dataset():
    """Annotates the COVID-19 dataset with scispaCy. 
    """
    
    custom_tokenizer = False
    annotator = SCISPACY(custom_tokenizer)
    # the output file
    output_file = open("./CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.txt", 'w', encoding='utf-8')
    # the input dataset
    covid_19_dataset = '/kaggle/input/CORD-19-research-challenge/'
    for root, dirs, files in os.walk(covid_19_dataset):

        for file_name in files:
            if file_name.endswith('.json'):

                # Remove this filter if you want to analyze the whole dataset. Analyzing COVID-19 with scispaCy might take several hours.
                if file_name != '426da8c3fb9c6792b5d26214d55471099877e337.json':
                    continue

                chemical_entity_found = False
                disease_entity_found = False
                abstract = ""
                with open(os.path.join(root, file_name)) as json_file:
                    data = json.load(json_file)
                    title = data['metadata']['title'].strip()
                    if not title.endswith("."):
                        title = title + "."
                        #print("Title", title)
                    title_len = len(title.split(" "))

                    abstract = ""
                    if len(data['abstract']) > 0:
                        abstract_object = data['abstract']
                        for text_i in abstract_object:
                            if abstract == "":
                                abstract = text_i['text'].strip()
                            else:
                                abstract = abstract + " " + text_i['text'].strip()
                        if not abstract.endswith("."):
                            abstract = abstract + "."

                    for text_i in data['body_text']:
                        text = text_i['text'].strip()
                        if not abstract.endswith(".") and text[0:1].isupper():
                            abstract = abstract + ". " + text
                        else:
                            abstract = abstract + " " + text
                            
                    #maximum document length that we can analyze with our pipeline
                    abstract = abstract[:900000]

                    abstract_len = len(abstract.split(" "))

                    #remove really short document
                    if title_len < 5 and abstract_len < 50:
                        continue

                    document = title + "\n" + abstract

                    entities = annotator.analyze_covid19(document)

                    for entity in entities:
                        if entity[3] == 'Chemical':
                            chemical_entity_found = True
                        if entity[3] == 'Disease':
                            disease_entity_found = True

                    #write documents that contain at least one entity of type Chemical and one entity of type Disease
                    #these are the documents that can contain a Chemical-Disease relation.
                    if chemical_entity_found == True and disease_entity_found == True:

                        # print(entities)
                        output_file.write(file_name[:-5] + "|t|" + title + "\n")
                        output_file.write(file_name[:-5] + "|a|" + abstract + "\n")
                        for entity in entities:
                            output_file.write(file_name[:-5] + "\t" + str(entity[0]) + "\t" + str(entity[1]) + "\t" + entity[2] + "\t" +
                                entity[3] + "\t" + entity[4] + "\n")
                        output_file.write("\n")

    output_file.close()
    
annotate_covid_19_dataset()

print("Annotated publication saved in: ./CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.txt")





In [None]:
!# See the annotated publication in PubTator format. It contains the full text publication and the Chemical and coronavirus Diseases recognized by scispaCy
!cat ./CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.txt

# Bio-entity Relation Extraction

![relation extraction](http://hlt-services4.fbk.eu/kaggle/relation_extraction.png)

We formulate the relation extraction task as a binary classification problem, in which examples are generated from sentences as follows. We generate examples for all the sentences containing at least one entity of type Chemical and one entity of type Disease, producing one example for each candidate Chemical-Disease entity pair in the sentence. Then, we use BioBERT to extract the relations among the candidate entity pairs. BioBERT requires the input dataset to be in a particular format. First, we transform the PubTator file annotated in the step above into the Semeval 2010 task 8 format. This file contains the candidate entity pairs and other useful information that allows us to trace each example back to the corresponding document in PubTator once BioBERT's annotations are available. Then, we use this file to produce the file in the format required by BioBERT. Finally, we run the script provided with the BioBERT distribution to extract the relations among the candidate entity pairs.

* Input: The PubTator file containing the full-text publications and the recognized Chemical and Disease entities.
* Output:
    * The candidate entity pairs in Semeval 2010 task 8 format.
    * The candidate entity pairs in the format required by BioBERT.
    * The file containing the predictions produced by BioBERT.
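
The candidate generation step can be illustrated with a short sketch. It assumes that the entities of one sentence are already available as `(mention, type)` tuples, and `candidate_pairs` is a hypothetical helper rather than a function of our pipeline, which works on character offsets instead. Every Chemical in the sentence is paired with every Disease in the same sentence, and each pair becomes one example for the classifier.

```python
from itertools import product


def candidate_pairs(sentence_entities):
    """Pair every Chemical with every Disease found in the same sentence.

    sentence_entities: list of (mention, entity_type) tuples.
    """
    chemicals = [e for e in sentence_entities if e[1] == "Chemical"]
    diseases = [e for e in sentence_entities if e[1] == "Disease"]
    return list(product(chemicals, diseases))


# Entities of a sentence mentioning two chemicals and one coronavirus disease.
entities = [("chloroquine", "Chemical"), ("loperamide", "Chemical"), ("MERS-CoV", "Disease")]
for chemical, disease in candidate_pairs(entities):
    print(chemical[0], "-", disease[0])
```

A sentence without at least one Chemical and one Disease yields no pairs and therefore no examples.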

In [None]:
import re

class Pubtator2Semeval:
    """This class converts the PubTator file in input into an output file in a format similar to the Semeval 2010 task 8 format.
     
    e.g.,
    
    input: PubTator file
    
    output:
    .......................
    4e9acd476a5b8224baf48582bd4f4c095feafe27        CID:0   D008139 C000657245      loperamide      MERS-CoV        12      22      Other medications , such as mycophenolic acid , chloroquine , chlorpromazine , loperamide and lopinavir , have shown an inhibitory effect on MERS-CoV replication in vitro ;   4983    4993  5044    5052
    .......................
     
    Fields:
    1. the document id
    2. a placeholder for the label predicted by the classifier
    3. chemical entity MESH id
    4. disease entity MESH id
    5. chemical name
    6. disease name
    7. chemical token position in the sentence
    8. disease token position in the sentence
    9. sentence containing the candidate pairs
    10-11. first-end character offset of the chemical entity in the whole document
    12-13. first-end character offset of the disease entity in the whole document
     
    """

    annotator = None

    def __init__(self, custom_tokenizer):
        print("Init")
        self.annotator = SCISPACY(custom_tokenizer)


    def run(self, input_file_name, output_file_name):

        positive_examples = 0
        negative_examples = 0
        title_counter = 0
        abstract_counter = 0
        relation_counter = 0

        # file in and file out
        input_file = open(input_file_name, encoding='utf-8')
        output_file = open(output_file_name, 'w', encoding='utf-8')

        entities = []
        relations = {}
        title = None
        abstract = None

        # perform file operations
        for line in input_file:

            line = line.rstrip("\n")
            #print(line)

            # get the title
            if re.match(r"^[0-9a-z]+\|t\|", line):
                title_counter = title_counter + 1
                entities = []
                relations = {}
                title = line

            # get the abstract
            elif re.match(r"^[0-9a-z]+\|a\|", line):
                abstract_counter = abstract_counter + 1
                abstract = line.replace("(ABSTRACT TRUNCATED AT 400 WORDS)", "").replace(
                    "(ABSTRACT TRUNCATED AT 250 WORDS)", "").replace(" ", "_").replace(" ", "_").replace(" ",
                                                                                                         "_").replace(
                    " ", "_")

            # get the list of entities
            elif re.match(r"^[0-9a-z]+\t[0-9]+\t[0-9]*\t.+\t.+\t[DC0-9|\-]+", line):
                match = re.match(r"^([0-9a-z]+)\t([0-9]+)\t([0-9]*)\t(.+)\t(.+)\t([DC0-9|\-]+)", line)
                e_id = match.group(6)
                if e_id != '-1':
                    entity = line
                    entities.append(entity)

                #print(line)

            # get the list of relations among entities
            elif re.match(r"^[0-9a-z]+\tCID\t[DC0-9|\-]+\t[DC0-9|]+", line):
                match = re.match(r"^([0-9a-z]+)\t(CID)\t([DC0-9|\-]+)\t([DC0-9|]+)",
                                     line)

                group = match.group(2)
                e1Id = match.group(3)
                e2Id = match.group(4)

                if e1Id + " " + e2Id in relations:
                    labels = relations[e1Id + " " + e2Id]
                    labels.append(group)
                else:
                    labels = []
                    labels.append(group)
                    relations[e1Id + " " + e2Id] = labels
                relation_counter = relation_counter + 1

            else:

                document = re.match(r"^[0-9a-z]+\|t\|(.*)($)", title).group(1) + \
                           " " + \
                           re.match(r"^[0-9a-z]+\|a\|(.*)($)", abstract).group(1)

                tokenized_document = self.annotator.tokenize(document)
               
                #generating the candidate Chemical-Disease pairs
                for e1 in entities:
                    e1_match = re.match("^([0-9a-z]+)\t([0-9]+)\t([0-9]+)\t(.+)\t(.+)\t([DC0-9|]+)", e1)
                    e1_document_id = e1_match.group(1)

                    e1_start = int(e1_match.group(2))
                    e1_end = int(e1_match.group(3))
                    e1_name = re.sub('[ ]', '#', e1_match.group(4))
                    e1_type = e1_match.group(5)
                    e1_id = e1_match.group(6)

                    for e2 in entities:
                        e2_match = re.match("^([0-9a-z]+)\t([0-9]+)\t([0-9]+)\t(.+)\t(.+)\t([DC0-9|]+)", e2)
                        e2_start = int(e2_match.group(2))
                        e2_end = int(e2_match.group(3))
                        e2_name = re.sub('[ ]', '#', e2_match.group(4))
                        e2_type = e2_match.group(5)
                        e2_ids = e2_match.group(6)

                        for e2_id in e2_ids.split("|"):

                            if "Chemical" in e1_type and "Disease" in e2_type:

                                sentence, e1_token_position, e2_token_position = self.get_sentence(e1_document_id,
                                    tokenized_document, e1_start, e1_end, e2_start, e2_end, e1_name,
                                    e2_name)

                                #remove really long sentences (i.e., sentences longer than 218 words, which is the maximum 
                                #number of words for a sentence in the dataset used for system training)
                                if len(sentence) != 0 and len(sentence.split(" ")) <= 218:

                                    if (e1_id + " " + e2_id) in relations:

                                        labels = relations[e1_id + " " + e2_id]
                                        if len(labels) == 1: #no multilabel relations
                                            for label in labels:
                                                output_file.write(e1_document_id +
                                                                "\t" + "CID:1" +
                                                                "\t" + e1_id + "\t" + e2_id +
                                                                "\t" + e1_name + "\t" + e2_name +
                                                                "\t" + str(e1_token_position) +
                                                                "\t" + str(e2_token_position) +
                                                                "\t" + sentence + "\n")
                                                positive_examples = positive_examples + 1
                                    else:
                                        output_file.write(e1_document_id +
                                                        "\t" + "CID:0" +
                                                        "\t" + e1_id + "\t" + e2_id +
                                                        "\t" + e1_name + "\t" + e2_name +
                                                        "\t" + str(e1_token_position) +
                                                        "\t" + str(e2_token_position) +
                                                        "\t" + sentence + "\n")
                                        negative_examples = negative_examples + 1

        input_file.close()
        output_file.close()

        
    def get_sentence(self, document_id, text, e1_start, e1_end, e2_start, e2_end, e1_name, e2_name):
        """Get the sentence of the document that contains the candidate entity pair.

        """

        #print(text)
        out_sentence = ""

        e1_position_id = None
        e2_position_id = None

        e1_in_sentence = False
        e2_in_sentence = False

        if e1_start == e2_start:
            return out_sentence, e1_position_id, e2_position_id

        for in_sentence in text:

            out_sentence = ""

            id = 0
            for token in in_sentence:

                start = token[1]
                end = token[2]
                form = token[3]

                if e1_name in form and e1_start >= start and e1_end <= end:
                    e1_in_sentence = True
                    e1_position_id = id
                    id = id + 1
                    out_sentence = out_sentence + " " + form
                elif form in e1_name and start >= e1_start and end <= e1_end:
                    e1_in_sentence = True
                    if not out_sentence.endswith(e1_name):
                        e1_position_id = id
                        id = id + 1
                        out_sentence = out_sentence + " " + e1_name

                elif e2_name in form and e2_start >= start and e2_end <= end:
                    e2_in_sentence = True
                    e2_position_id = id
                    id = id + 1
                    out_sentence = out_sentence + " " + form
                elif form in e2_name and start >= e2_start and end <= e2_end:
                    e2_in_sentence = True
                    if not out_sentence.endswith(e2_name):
                        e2_position_id = id
                        id = id + 1
                        out_sentence = out_sentence + " " + e2_name

                else:
                    out_sentence = out_sentence + " " + form
                    id = id + 1

            if e1_in_sentence == True and e2_in_sentence == True:
                break

            out_sentence = ""
            e1_position_id = None
            e2_position_id = None
            e1_in_sentence = False
            e2_in_sentence = False


        return out_sentence.lstrip(), e1_position_id, e2_position_id


def contains_any(str, set):
    """Check whether 'str' contains ANY of the chars in 'set'"""
    return any(c in str for c in set)


if __name__ == '__main__':
    print("This only executes when the script is run directly rather than imported")
    pubtator2semeval = Pubtator2Semeval(True)
    pubtator2semeval.run("./CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.txt",
                         "./CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.semeval.txt")
    print("Semeval dataset saved in: ./CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.semeval.txt")

In [None]:
!# See the candidate entity pairs to be annotated by BioBERT
!cat ./CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.semeval.txt

In [None]:
"""
This script converts the file in input that is in Semeval 2010 task 8 format into a file compatible with BioBERT.

e.g.,

Input:
4e9acd476a5b8224baf48582bd4f4c095feafe27        CID:0   D008139 C000657245      loperamide      MERS-CoV        12      22      Other medications , such as mycophenolic acid , 
chloroquine , chlorpromazine , loperamide and lopinavir , have shown an inhibitory effect on MERS-CoV replication in vitro ;   4983    4993  5044    5052

Output:
Other medications , such as mycophenolic acid , chloroquine , chlorpromazine , @GENE$ and lopinavir , have shown an inhibitory effect on @DISEASE$ replication in vitro ;

Note: we use @GENE$ instead of @CHEMICAL$ to mark chemicals for convenience, since our scripts work on multiple datasets whose entities are of different types.

"""
from __future__ import print_function
import numpy as np
import re

import gzip
import os
import sys

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


if (sys.version_info > (3, 0)):
    import pickle as pkl
else: #Python 2.7 imports
    import cPickle as pkl

import networkx as nx
import random


np.set_printoptions(suppress=True)

# = [folder_training+'./CDR_TrainingDevelopmentSet.PubTator.semeval', folder_training+'./CDR_TestSet.PubTator.semeval']
files = ['', './CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.semeval.txt']


def createTrainingDataSet(fileIn, fileOut):

    output_file = open(fileOut, 'w', encoding='utf-8')

    for line in open(fileIn, encoding='utf-8'):

        splits = line.strip().split('\t')

        label = splits[1]
        new_label = None
        if label == "CID:0":
            new_label = "0"
        else:
            new_label = "1"

        entity1_pos = int(splits[6])
        entity2_pos = int(splits[7])

        example = ""
        if entity1_pos == entity2_pos:
            example = "@GENE$ @DISEASE$\t0"

        else:
            sentence = splits[8].split(" ")
            sentence[entity1_pos] = "@GENE$"
            sentence[entity2_pos] = "@DISEASE$"

            example = ' '.join(sentence) + "\t" + new_label

        output_file.write(example + "\n")

    output_file.close()



def createTestDataSet(fileIn, fileOut):
    """
    Creates the dataset to be annotated in the BioBERT format

    :param fileIn: the dataset in Semeval format
    :param fileOut: the dataset in BioBERT format

    """

    output_file = open(fileOut, 'w', encoding='utf-8')
    output_file.write("index	sentence	label\n")

    counter = 0
    for line in open(fileIn, encoding='utf-8'):

        splits = line.strip().split('\t')

        label = splits[1]
        new_label = None
        if label == "CID:0":
            new_label = "0"
        else:
            new_label = "1"

        docId = splits[0]
        entity1_pos = int(splits[6])
        entity2_pos = int(splits[7])

        example = ""
        if entity1_pos == entity2_pos:
            example = str(counter) + "\t" + "@GENE$ @DISEASE$\t0"

        else:

            sentence = splits[8].split(" ")
            sentence[entity1_pos] = "@GENE$"
            sentence[entity2_pos] = "@DISEASE$"

            example = str(counter) + "\t" +  ' '.join(sentence) + "\t" + new_label

        output_file.write(example + "\n")

        counter = counter + 1

    output_file.close()

#createTrainingDataSet(files[0], files[0] + ".BioBERT.txt")
createTestDataSet(files[1], files[1] + ".BioBERT.txt")
print("BioBERT dataset saved in: " + files[1] + ".BioBERT.txt")



In [None]:
!# See the candidate entity pairs to be annotated in the format required by BioBERT 
!cat ./CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.semeval.txt.BioBERT.txt

The relations among the recognized bio-entities are annotated by BioBERT. The BioBERT models used here were created by training BioBERT on the BioCreative V Chemical Disease Relation (CDR) dataset.
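
As a rough illustration of what BioBERT receives, the sketch below builds one row of the tab-separated test file: the two candidate mentions are masked with `@GENE$` and `@DISEASE$`, and the row follows the `index`, `sentence`, `label` header written by `createTestDataSet`. The helper name `to_biobert_row` is hypothetical and not part of the notebook. The sentence and token positions are taken from the example in the conversion script's docstring.

```python
def to_biobert_row(index, sentence, chem_pos, dis_pos, label="0"):
    """Mask the two candidate mentions and build one tab-separated row.

    Hypothetical helper for illustration; it mirrors the masking step of
    createTestDataSet but is not the notebook's own function.
    """
    tokens = sentence.split(" ")
    tokens[chem_pos] = "@GENE$"      # chemical mention (the notebook reuses @GENE$)
    tokens[dis_pos] = "@DISEASE$"    # coronavirus disease mention
    return "\t".join([str(index), " ".join(tokens), label])


HEADER = "index\tsentence\tlabel"

sentence = ("Other medications , such as mycophenolic acid , chloroquine , "
            "chlorpromazine , loperamide and lopinavir , have shown an "
            "inhibitory effect on MERS-CoV replication in vitro ;")

# Token positions 12 (loperamide) and 22 (MERS-CoV) come from the Semeval line.
row = to_biobert_row(0, sentence, 12, 22)
print(HEADER)
print(row)
```

The label column is only a placeholder at prediction time, since BioBERT assigns the actual class.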

In [None]:
!# Download BioBERT
!git clone https://github.com/dmis-lab/biobert.git
!# Install BioBERT
!cd ./biobert; pip install -r requirements.txt
!# Install CUDA required by BioBERT
!conda install -y cudatoolkit=9.0
!# Install the BioBERT models created by training BioBERT on the BioCreative V Chemical Disease Relation (CDR) dataset
!cp -r /kaggle/input/biocreativev-cdr-biobert-model/BiocreativeV_CDR_BioBERT_Model/* ./biobert/
!cp -r /kaggle/input/biobert-v10-pubmed-pmc-pretrained-model/biobert_v1.0_pubmed_pmc_pretrained_model/WEIGHTS ./biobert/
!# Copy the dataset to be annotated into the test.tsv file required by BioBERT
!#cp /kaggle/input/covid19annotateddataset/COVID-19-Annotated-Dataset/COVID-19-Dataset.PubTator.semeval.BioBERT.txt ./CORD-19-Running-Dataset/test.tsv
!cp ./CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.semeval.txt.BioBERT.txt ./CORD-19-Running-Dataset/test.tsv
!# Run BioBERT relation extraction on the dataset to be annotated
!chmod 750 ./biobert/run_re.sh
!./biobert/run_re.sh

In [None]:
!#See the BioBERT annotation
!cat /kaggle/working/biobert/RE_output/test_results.tsv

# Post-Processing

![post processing](http://hlt-services4.fbk.eu/kaggle/post_processing.png)


After the relations have been extracted, the relation annotations produced by BioBERT have to be merged into the PubTator file that contains the full-text publications and the recognized entities.
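
The post-processing logic can be summarized in a minimal sketch: each BioBERT output row holds two scores, the larger one decides the label (ties go to `CID:1`, as in the notebook's rule), and the positive pairs are collected once per document before being appended to the PubTator records. The function names, the `doc1` identifier and the score values below are illustrative assumptions, not output from the actual run.

```python
from collections import defaultdict


def label_from_probs(row):
    """Map a 'p0<TAB>p1' row to CID:0 or CID:1 using the notebook's rule."""
    p0, p1 = (float(x) for x in row.split("\t")[:2])
    return "CID:0" if p0 > p1 else "CID:1"


def group_relations(candidates, prob_rows):
    """Collect unique positive relation lines per document.

    candidates: list of (doc_id, chemical_id, disease_id), one per Semeval line
    prob_rows:  BioBERT output rows, aligned line by line with the candidates
    """
    relations = defaultdict(list)
    for (doc_id, chem_id, dis_id), row in zip(candidates, prob_rows):
        if label_from_probs(row) == "CID:1":
            line = "\t".join([doc_id, "CID", chem_id, dis_id])
            if line not in relations[doc_id]:  # keep each relation only once
                relations[doc_id].append(line)
    return dict(relations)


# Illustrative data only: "doc1" and the scores are made up for this sketch.
candidates = [("doc1", "D012254", "C000657245"),
              ("doc1", "D012254", "C000657245"),
              ("doc1", "D008139", "C000657245")]
prob_rows = ["0.10\t0.90", "0.20\t0.80", "0.95\t0.05"]

relations = group_relations(candidates, prob_rows)
print(relations)
```

The alignment between candidates and predictions is purely positional, so the Semeval file and the BioBERT output must have the same number of lines in the same order.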

In [None]:
def get_predicted_labels(data_file_in, file_out):
    """
    Reads the file containing the predictions made by BioBERT and writes the predicted class labels, i.e., CID:0 (negative example) or CID:1 (positive example), one per line.

    :param data_file_in: the predictions made by BioBERT
    :param file_out: the labels produced by BioBERT

    """

    lines = []

    file_out = open(file_out, 'w', encoding="utf8")

    for line in open(data_file_in, encoding="utf8"):
        # lines.append(line.strip())

        line = line.strip()

        if line != "":

            predCPR0 = float(line.split("\t")[0])
            predCPR1 = float(line.split("\t")[1])
            if predCPR0 > predCPR1:
                label = "CID:0"
            else:
                label = "CID:1"
            file_out.write(label + "\n")

    file_out.close()

get_predicted_labels("./biobert/RE_output/test_results.tsv", './CORD-19-Running-Dataset/test_results.tsv.biobert.txt')
print("File saved in: ./CORD-19-Running-Dataset/test_results.tsv.biobert.txt")



In [None]:
!#See the labels assigned by BioBERT to the candidate entity pairs
!cat ./CORD-19-Running-Dataset/test_results.tsv.biobert.txt

In [None]:
def read_predictions(data_file_in):
    """
    Reads the labels assigned by BioBERT to the candidate entity pairs
    
    :param data_file_in: the predicted labels 

    :return: the list of predicted labels 
    """

    classes = []
    prob = []

    for line in open(data_file_in, encoding="utf8"):
        line = line.strip()
        classes.append(line)
        #prob.append(line.split("\t")[1])

    return classes, prob


def predictions2Pubtator(pubtator_data_file_in, semeval_data_file_in, predictions_file_in, annotated_data_file_out):
    """
    Reads the Chemical-Disease relations produced by BioBERT and creates a PubTator file containing the predicted relations
    
    :param pubtator_data_file_in: the data set to be annotated
    :param semeval_data_file_in: the file containing the examples in input to BioBERT
    :param predictions_file_in: the file containing the predicted relations produced by BioBERT
    :param annotated_data_file_out: the input dataset, including the relations produced by BioBERT

    :return:
    """

    predictions, probabilities = read_predictions(predictions_file_in)
    file_out = open(annotated_data_file_out, 'w', encoding="utf8")
    line_counter = 0
    docid_label_map = {}
    print(semeval_data_file_in)
    visited = {}
    
    for line in open(semeval_data_file_in, encoding="utf8"):
        splits = line.strip().split('\t')
        doc_id = splits[0]

        label = predictions[line_counter]
        e1_id = splits[2]
        e2_id = splits[3]

        if label != "CID:0":
            label = 'CID'
            value = doc_id + "\t" + label + "\t" + e1_id + "\t" + e2_id

            if value not in visited:
                visited[value] = 1

                if doc_id in docid_label_map:
                    entry = docid_label_map[doc_id]
                    entry.append(value)
                else:
                    entry = []
                    entry.append(value)
                    docid_label_map[doc_id] = entry

        line_counter = line_counter + 1

    doc_id = ""
    prev_doc_id = ""
    print(pubtator_data_file_in)
    string_buffer = ""
    for line in open(pubtator_data_file_in, encoding="utf8"):
        line = line.strip()
        if line != "":
            splits = line.split('\t')
            if '|t|' in line:
                if string_buffer != "":
                    if doc_id in docid_label_map:
                        entry = docid_label_map[doc_id]
                        for rel in entry:
                            string_buffer = string_buffer + rel + "\n"
                    file_out.write("%s\n" % string_buffer)
                string_buffer = line + "\n"

            elif not(len(line.split("\t")) == 4 and 'CID' in line):
                doc_id = splits[0]
                string_buffer = string_buffer + line + "\n"
                
    if string_buffer != "":
        if doc_id in docid_label_map:
            entry = docid_label_map[doc_id]
            for rel in entry:
                string_buffer = string_buffer + rel + "\n"
        file_out.write("%s\n" % string_buffer)

    file_out.close()


#the dataset in Semeval format
development_semeval_data_in = './CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.semeval.txt'
#the dataset in PubTator format
development_pubtator_data_in = './CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.txt'
# the predictions made by BioBERT
prediction_data_in = './CORD-19-Running-Dataset/test_results.tsv.biobert.txt'
# An intermediate format that can be used for debugging.
annotated_development_semeval_data_out = development_semeval_data_in + ".annotated.txt.index.txt"
# Our final annotated dataset in PubTator format.
annotated_development_pubtator_data_out = development_pubtator_data_in + '.annotated.txt'

predictions2Pubtator(development_pubtator_data_in, development_semeval_data_in, prediction_data_in, annotated_development_pubtator_data_out)

print("File saved in: " + annotated_development_pubtator_data_out)

In [None]:
!# See the annotated publication: the Chemical/Disease entity mentions and concept annotations, the full text retrieved from CORD-19,
!# and the relation annotations among the recognized entity concepts, all in the same file.
!# In the current publication there is a relation (marked by the label CID) between ribavirin (D012254) and SarsCov (C000657245)
!cat ./CORD-19-Running-Dataset/CORD-19-Dataset.PubTator.txt.annotated.txt