### LitCoin NLP Challenge: Part 1

*This 2-phase competition is part of the NASA Tournament Lab and hosted by NCBI (The National Center for Biotechnology Information), NCATS (The National Center for Advancing Translational Sciences) and NIH (National Institutes of Health). These institutions, in collaboration with bitgrit and CrowdPlat, have come together to bring this challenge where one can deploy their data-driven technology solutions towards accelerating scientific research in medicine and ensure that data from biomedical publications can be maximally leveraged and reach a wide range of biomedical researchers.*

*Each phase of the competition is designed to spur innovation in the field of natural language processing, asking competitors to design systems that can accurately recognize scientific concepts from the text of scientific articles, connect those concepts into knowledge assertions, and determine if that claim is a novel finding or background information.*

>Part 1: Given only an abstract text, the goal is to find all the nodes or biomedical entities (position in text and BioLink Model Category).    
>Part 2: Given the abstract and the nodes annotated from it, the goal is to find all the relationships between them (pair of nodes, BioLink Model Predicate and novelty). 

**The goal of the first part of the LitCoin NLP Challenge is to identify the position and type of each mention of a biomedical concept (entity) within a research paper’s title and abstract.**

In [1]:
# loading libraries
import pandas as pd
import numpy as np
import spacy as sp

In [28]:
# loading data
folder = 'C:/Users/saaye/OneDrive/Documents/Machine Learning Projects/LitCoin NLP Challenge/data/'

abstract_test = pd.read_csv(folder + 'abstracts_test.csv', sep='\t')
abstract_train = pd.read_csv(folder + 'abstracts_train.csv', sep='\t')
entities_train = pd.read_csv(folder + 'entities_train.csv', sep='\t')
relations_train = pd.read_csv(folder + 'relations_train.csv', sep='\t')
submission_example = pd.read_csv(folder + 'submission_example.csv', sep='\t')

In [3]:
# printing dimensions of all datasets (DataFrame.shape is (rows, columns))
print(f'Abstract training dataset has {abstract_train.shape[0]} rows and {abstract_train.shape[1]} columns.')
print(f'Relations training dataset has {relations_train.shape[0]} rows and {relations_train.shape[1]} columns.')
print(f'Entities training dataset has {entities_train.shape[0]} rows and {entities_train.shape[1]} columns.')

Abstract training dataset has 400 rows and 3 columns.
Relations training dataset has 4280 rows and 6 columns.
Entities training dataset has 13636 rows and 7 columns.


In [4]:
# printing the first row of the abstract training dataset as an example
print(f'An example of an observation is as follows: \n # abstract_id: PubMed ID of the research paper. : {abstract_train["abstract_id"][0]}')
print(f' # title: title of the research paper. : {abstract_train["title"][0]}')
print(f' # abstract: abstract or summary of the research paper. : {abstract_train["abstract"][0]}')

An example of an observation is as follows: 
 # abstract_id: PubMed ID of the research paper. : 1353340
 # title: title of the research paper. : Late-onset metachromatic leukodystrophy: molecular pathology in two siblings.
 # abstract: abstract or summary of the research paper. : We report on a new allele at the arylsulfatase A (ARSA) locus causing late-onset metachromatic leukodystrophy (MLD). In that allele arginine84, a residue that is highly conserved in the arylsulfatase gene family, is replaced by glutamine. In contrast to alleles that cause early-onset MLD, the arginine84 to glutamine substitution is associated with some residual ARSA activity. A comparison of genotypes, ARSA activities, and clinical data on 4 individuals carrying the allele of 81 patients with MLD examined, further validates the concept that different degrees of residual ARSA activity are the basis of phenotypical variation in MLD.. 


In [5]:
# input string considered for these indices is the concatenation of the title and abstract strings
abstract_test["full"] = abstract_test["title"] + ' ' + abstract_test["abstract"]
abstract_test = abstract_test.reset_index().rename(columns={"index": "id"})

The position of a biomedical entity's mention in the text is given by two offsets, ‘offset_start’ and ‘offset_finish’, which are the character indices where the mention substring begins and ends, respectively. These indices refer to the concatenation of the title and abstract strings: string = title + ' ' + abstract. The single space between the two therefore counts as one character when computing offsets.
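As a minimal sketch of how these offsets behave, the snippet below builds the concatenated string from a toy title and abstract (both invented for illustration) and locates a mention in it. The helper `locate` is hypothetical, and it assumes that ‘offset_finish’ is exclusive, like the end of a Python slice; that convention should be checked against `entities_train` before relying on it.

```python
# Toy strings, invented for illustration only.
title = "Late-onset disease in two siblings."
abstract = "We report a new allele at the ARSA locus."

# Offsets index into title + ' ' + abstract.
full = title + " " + abstract


def locate(mention, text):
    """Return (offset_start, offset_finish) of the first occurrence of mention.

    Assumption: offset_finish is exclusive, like a Python slice end.
    """
    start = text.find(mention)
    if start == -1:
        raise ValueError(f"{mention!r} not found in text")
    return start, start + len(mention)


start, finish = locate("ARSA", full)

# Slicing the concatenated string with the offsets recovers the mention.
assert full[start:finish] == "ARSA"

# A mention in the abstract is shifted by len(title) + 1 (the extra space).
assert start == len(title) + 1 + abstract.find("ARSA")
```

Under the same assumption, slicing a `full` string with a pair of offsets from the training data should reproduce the corresponding `mention` value, which gives a quick sanity check of the convention.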

In [90]:
# rule-based entity recognition using spaCy's entity ruler
nlp = sp.blank("en") # creates a blank English pipeline (tokenizer only)
ruler = nlp.add_pipe("entity_ruler") # adds the entity ruler component to the pipeline
# a list of {"label": ..., "pattern": ...} dicts, one per distinct (type, mention) pair;
# string patterns are matched as exact phrases
patterns = (
    entities_train[["type", "mention"]]
    .drop_duplicates()
    .rename(columns={"type": "label", "mention": "pattern"})
    .to_dict("records")
)
ruler.add_patterns(patterns) # registers the patterns with the ruler
texts = abstract_test["full"].tolist()
answers = [] # one list of (label, start_char, end_char) tuples per abstract
for doc in nlp.pipe(texts, n_process=4, batch_size=2000):
    answers.append([(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents])

In [39]:
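# keep only the identifier columns of the test set
submission = abstract_test.drop(["title", "abstract", "full"], axis=1)
# one row per abstract; each column holds one (label, start, end) tuple, padded with missing values
answers = pd.DataFrame(answers).reset_index().rename(columns={"index": "id"})
# attach the predictions to the identifiers via the positional id
answers = submission.merge(answers, how="inner", on="id")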
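Building a DataFrame directly from a list of per-abstract lists gives a wide table with one column per entity position. If the submission instead expects one row per predicted entity, the predictions can be flattened first. The sketch below uses plain Python with made-up IDs and offsets. The column names `id`, `abstract_id` and `type` are assumptions (only ‘offset_start’ and ‘offset_finish’ are named in the task description), and the layout should be checked against `submission_example.csv`.

```python
# Hypothetical stand-ins for abstract_test["abstract_id"] and the per-abstract
# prediction lists; the IDs and offsets below are made up.
abstract_ids = [1001, 1002]
answers = [
    [("DiseaseOrPhenotypicFeature", 23, 35), ("OrganismTaxon", 338, 342)],
    [("ChemicalEntity", 20, 45)],
]

rows = []
for abstract_id, ents in zip(abstract_ids, answers):
    for label, start, end in ents:
        rows.append(
            {
                "id": len(rows),  # running row id (assumed column name)
                "abstract_id": abstract_id,
                "offset_start": start,
                "offset_finish": end,
                "type": label,
            }
        )

for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would then give a long-format table that can be written out with `to_csv`.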

In [96]:
answers

[[('DiseaseOrPhenotypicFeature', 23, 35),
  ('DiseaseOrPhenotypicFeature', 58, 66),
  ('DiseaseOrPhenotypicFeature', 78, 89),
  ('DiseaseOrPhenotypicFeature', 113, 125),
  ('DiseaseOrPhenotypicFeature', 208, 219),
  ('DiseaseOrPhenotypicFeature', 325, 337),
  ('OrganismTaxon', 338, 342),
  ('DiseaseOrPhenotypicFeature', 448, 460),
  ('OrganismTaxon', 565, 569),
  ('DiseaseOrPhenotypicFeature', 684, 692),
  ('DiseaseOrPhenotypicFeature', 823, 831),
  ('DiseaseOrPhenotypicFeature', 936, 944),
  ('DiseaseOrPhenotypicFeature', 1052, 1060),
  ('DiseaseOrPhenotypicFeature', 1134, 1146),
  ('ChemicalEntity', 1190, 1191),
  ('DiseaseOrPhenotypicFeature', 1319, 1331),
  ('ChemicalEntity', 1363, 1364),
  ('ChemicalEntity', 1405, 1418),
  ('DiseaseOrPhenotypicFeature', 1427, 1439),
  ('DiseaseOrPhenotypicFeature', 1498, 1506),
  ('DiseaseOrPhenotypicFeature', 1560, 1568)],
 [('DiseaseOrPhenotypicFeature', 20, 45),
  ('DiseaseOrPhenotypicFeature', 146, 171),
  ('DiseaseOrPhenotypicFeature', 173, 1

In [7]:
# create a named entity visualizer
entity_types = entities_train["type"].unique().tolist() # keys for color values (avoids shadowing the built-in `type`)
palette = ["#5dd8d2", "#9d34f1", "#444c63", "#ec0639", "#57d921", "#fe2a9f"] # color values
colors = dict(zip(entity_types, palette)) # maps each entity type to a color
options = {"ents": entity_types, "colors": colors} # restricts rendering to these types and assigns their colors
sp.displacy.render(doc, style="ent", jupyter=True, options=options) # renders the entities of `doc`, the last abstract processed in the loop above