### LitCoin NLP Challenge: Part 1

*This 2-phase competition is part of the NASA Tournament Lab and hosted by NCBI (The National Center for Biotechnology Information), NCATS (The National Center for Advancing Translational Sciences) and NIH (National Institutes of Health). These institutions, in collaboration with bitgrit and CrowdPlat, have come together to bring this challenge where one can deploy their data-driven technology solutions towards accelerating scientific research in medicine and ensure that data from biomedical publications can be maximally leveraged and reach a wide range of biomedical researchers.*

*Each phase of the competition is designed to spur innovation in the field of natural language processing, asking competitors to design systems that can accurately recognize scientific concepts from the text of scientific articles, connect those concepts into knowledge assertions, and determine if that claim is a novel finding or background information.*

>Part 1: Given only an abstract text, the goal is to find all the nodes or biomedical entities (position in text and BioLink Model Category).    
>Part 2: Given the abstract and the nodes annotated from it, the goal is to find all the relationships between them (pair of nodes, BioLink Model Predicate and novelty). 

**The goal of the first part of the LitCoin NLP Challenge is to identify the position and type of biomedical concepts (entities) mentions within a research paper’s title and abstract.**

In [20]:
import pandas as pd
import numpy as np
import spacy as sp

In [2]:
abstract_test = pd.read_csv('LitCoin Dataset/abstracts_test.csv', sep='\t')
abstract_train = pd.read_csv('LitCoin Dataset/abstracts_train.csv', sep='\t')
entities_train = pd.read_csv('LitCoin Dataset/entities_train.csv', sep='\t')
relations_train = pd.read_csv('LitCoin Dataset/relations_train.csv', sep='\t')

In [3]:
print(f'Abstract training dataset has {abstract_train.shape[0]} columns and {abstract_train.shape[1]} rows.')
print(f'Relations training dataset has {relations_train.shape[0]} columns and {relations_train.shape[1]} rows.')
print(f'Entities training dataset has {entities_train.shape[0]} columns and {entities_train.shape[1]} rows.')

Abstract training dataset has 400 columns and 3 rows.
Relations training dataset has 4280 columns and 6 rows.
Entities training dataset has 13636 columns and 7 rows.


In [4]:
print(f'An example of an observation is as follows: \n # abstract_id: PubMed ID of the research paper. : {abstract_train["abstract_id"][0]}')
print(f' # title: title of the research paper. : {abstract_train["title"][0]}')
print(f' # abstract: abstract or summary of the research paper. : {abstract_train["abstract"][0]}')

An example of an observation is as follows: 
 # abstract_id: PubMed ID of the research paper. : 1353340
 # title: title of the research paper. : Late-onset metachromatic leukodystrophy: molecular pathology in two siblings.
 # abstract: abstract or summary of the research paper. : We report on a new allele at the arylsulfatase A (ARSA) locus causing late-onset metachromatic leukodystrophy (MLD). In that allele arginine84, a residue that is highly conserved in the arylsulfatase gene family, is replaced by glutamine. In contrast to alleles that cause early-onset MLD, the arginine84 to glutamine substitution is associated with some residual ARSA activity. A comparison of genotypes, ARSA activities, and clinical data on 4 individuals carrying the allele of 81 patients with MLD examined, further validates the concept that different degrees of residual ARSA activity are the basis of phenotypical variation in MLD.. 


In [5]:
abstract_train['full'] = abstract_train['title'] + ' ' + abstract_train['abstract']

The position of a biomedical entity's mention in the text is determined by two ‘offset’ numbers: ‘offset_start’ and ‘offset_finish’, which indicate the index of the character where a given mention substring begins and ends, respectively. The input string considered for these indices is the concatenation of the title and abstract strings in the following manner: string = title + ' ' + abstract (i.e., there is one extra character between the 2 when accounting for the offset). 

In [6]:
df = pd.merge(entities_train, abstract_train, on='abstract_id')
df.head()

Unnamed: 0,id,abstract_id,offset_start,offset_finish,type,mention,entity_ids,title,abstract,full
0,0,1353340,11,39,DiseaseOrPhenotypicFeature,metachromatic leukodystrophy,D007966,Late-onset metachromatic leukodystrophy: molec...,We report on a new allele at the arylsulfatase...,Late-onset metachromatic leukodystrophy: molec...
1,1,1353340,111,126,GeneOrGeneProduct,arylsulfatase A,410,Late-onset metachromatic leukodystrophy: molec...,We report on a new allele at the arylsulfatase...,Late-onset metachromatic leukodystrophy: molec...
2,2,1353340,128,132,GeneOrGeneProduct,ARSA,410,Late-onset metachromatic leukodystrophy: molec...,We report on a new allele at the arylsulfatase...,Late-onset metachromatic leukodystrophy: molec...
3,3,1353340,159,187,DiseaseOrPhenotypicFeature,metachromatic leukodystrophy,D007966,Late-onset metachromatic leukodystrophy: molec...,We report on a new allele at the arylsulfatase...,Late-onset metachromatic leukodystrophy: molec...
4,4,1353340,189,192,DiseaseOrPhenotypicFeature,MLD,D007966,Late-onset metachromatic leukodystrophy: molec...,We report on a new allele at the arylsulfatase...,Late-onset metachromatic leukodystrophy: molec...


In [33]:
nlp = sp.load('en_core_web_lg')

In [8]:
text = df['full'][0]

In [34]:
ruler = nlp.add_pipe('entity_ruler', before='ner')

In [35]:
pattern1 = (set(df[df['type'] == 'DiseaseOrPhenotypicFeature']['mention']))
pattern2 = (set(df[df['type'] == 'ChemicalEntity']['mention']))
pattern3 = (set(df[df['type'] == 'OrganismTaxon']['mention']))
pattern4 = (set(df[df['type'] == 'GeneOrGeneProduct']['mention']))
pattern5 = (set(df[df['type'] == 'SequenceVariant']['mention']))
pattern6 = (set(df[df['type'] == 'CellLine']['mention']))

patterns = [{'label': 'DiseaseOrPhenotypicFeature', 'pattern': pattern1},
           {'label': 'ChemicalEntity', 'pattern': pattern2},
           {'label': 'OrganismTaxon', 'pattern': pattern3},
           {'label': 'GeneOrGeneProduct', 'pattern': pattern4},
           {'label': 'SequenceVariant', 'pattern': pattern5},
           {'label': 'CellLine', 'pattern': pattern6}]

In [36]:
ruler.add_patterns(pattern)

In [210]:
%%timeit
doc = nlp(' '.join(df.full), disable = ['ner', 'parser', 'tagger'])

MemoryError: Unable to allocate 7.83 GiB for an array with shape (4380340, 480) and data type float32

In [277]:
abstract_train['token'] = abstract_train['full'].apply(lambda x: nlp(x))

In [278]:
df = pd.merge(df, abstract_train, how='outer', on='abstract_id')

In [44]:
[(ent.text, ent.label_) for ent in doc.ents]

[('two', 'CARDINAL'), ('MLD', 'WORK_OF_ART')]

In [43]:
with nlp.select_pipes(enable="tagger"):
    ruler.add_patterns(patterns)

In [296]:
def find_entity(abstract):
    entity = []
    doc = nlp(''.join(abstract))
    entity.append([(ent.text, ent.label_) for ent in doc.ents])
    return entity

In [302]:
[(ent.text, ent.label_) for ent in doc.ents]

[]

In [37]:
doc = nlp(text)



In [32]:
[(ent.text, ent.label_) for ent in doc.ents]

[('Apple', 'ORG'), ('San Francisco', 'GPE')]

In [31]:
from spacy.lang.en import English

nlp = English()
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
ruler.add_patterns(patterns)

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Apple', 'ORG'), ('San Francisco', 'GPE')]
