<a href="https://colab.research.google.com/github/vj1494/PipelineIE/blob/master/PipelineIE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**INFORMATION EXTRACTION PIPELINE (PipelineIE) - Triple Extraction**

This notebook demonstrates [PipelineIE](https://github.com/vj1494/PipelineIE), an open-source project that extracts information from free English text and from domain-specific text such as biomedical literature. It also supports custom spaCy models. Triplets are extracted with a subject-verb-object rule. The pipeline resolves coreferences with neuralcoref or Stanford CoreNLP and performs entity linking with a spaCy model, a scispaCy model, or any custom spaCy model, which helps map each subject and object back to its original entity.

The 'default' pipeline uses spaCy's 'en_core_web_md' model. An end user can customize their pipeline and use their choice of coreference resolver and entity linker.

This notebook uses the default biomedical pipeline, which relies on scispaCy's en_core_sci_lg model together with neuralcoref for coreference resolution, for entity linking, and for the dependency parsing behind triple extraction. It also demonstrates a custom pipeline that uses UMLS as the entity linker, the default pipeline, and several usage scenarios (described in comments as cases).
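
To make the subject-verb-object rule more concrete, here is a minimal sketch that applies it to a hand-annotated dependency parse. It is not PipelineIE's or textaCy's implementation: the `Token` class, `phrase` and `svo_triples` are hypothetical helpers, the parse is written by hand rather than produced by a model, and only the root verb with `nsubj`/`dobj` dependents is considered.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    dep: str   # dependency label, e.g. "nsubj" or "dobj"
    head: int  # index of the governing token (the root points to itself)


def phrase(tokens, i):
    """Return token i with its compound/adjectival modifiers, in sentence order."""
    idxs = [j for j, t in enumerate(tokens)
            if j == i or (t.head == i and t.dep in {"compound", "amod"})]
    return " ".join(tokens[j].text for j in idxs)


def svo_triples(tokens):
    """Pair every subject of the root verb with every direct object of that verb."""
    triples = []
    for v, verb in enumerate(tokens):
        if verb.dep != "ROOT":
            continue
        subjects = [i for i, t in enumerate(tokens) if t.head == v and t.dep == "nsubj"]
        objects = [i for i, t in enumerate(tokens) if t.head == v and t.dep == "dobj"]
        for s in subjects:
            for o in objects:
                triples.append((phrase(tokens, s), verb.text, phrase(tokens, o)))
    return triples


# Hand-annotated parse of "Ajay leads the Support team." (an assumption, not model output)
sentence = [
    Token("Ajay", "nsubj", 1),
    Token("leads", "ROOT", 1),
    Token("the", "det", 4),
    Token("Support", "compound", 4),
    Token("team", "dobj", 1),
]
print(svo_triples(sentence))
```

A real parser also has to handle conjunctions, passives and nested clauses, which this sketch deliberately ignores.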

Installing neuralcoref from source

In [None]:
!git clone https://github.com/huggingface/neuralcoref.git
%cd neuralcoref
!pip install -r requirements.txt
!pip install -e .

Installing and setting up PipelineIE

In [None]:
%cd ../
!git clone https://github.com/vj1494/PipelineIE.git
%cd PipelineIE
!pip install -r requirements.txt
!pip install -e .

Restart Runtime (Jupyter requires this after installation of new packages) and Import PipelineIE

In [None]:
from pipeline_ie.pipeline_ie import PipelineIE
import pandas as pd
# None disables column truncation; passing -1 is deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)

Biomedical Pipeline (uses scispaCy's 'en_core_sci_lg' with neuralcoref for coreference resolution, and the same model for entity linking and dependency parsing; triples are extracted with textaCy's subject-verb-object rule)

In [7]:
text = "Co-culture of NK cells with transfected EC enhanced E-selectin, IL-8, and NF-kappaB-dependent promoter activity."

# Biomedical PipelineIE
# The default biomedical pipeline uses the scispaCy en_core_sci_lg model.
# The same model is used for neuralcoref, entity linking and triple extraction.
# pipeline="default" would use the spaCy English model instead.
pie = PipelineIE(text, pipeline="biomedical")

#Returns a dataframe
pie.pipeline_triplet()

Unnamed: 0,Sentences,Triplet
1,"Co-culture of NK cells with transfected EC enhanced E-selectin, IL-8, and NF-kappaB-dependent promoter activity.","[[(Co-culture), enhanced, (E-selectin)], [(Co-culture), enhanced, (IL-8)], [(Co-culture), enhanced, (NF-kappaB-dependent, promoter, activity)]]"
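
Each row of the returned dataframe holds one sentence and a list of triplets, so downstream code often wants one record per triplet. The sketch below flattens that structure. It assumes the `Triplet` column contains lists of `[subject, verb, object]` as printed above; plain tuples of strings stand in for whatever span objects the pipeline actually returns, and `flatten_triplets` is a hypothetical helper, not part of PipelineIE.

```python
def flatten_triplets(rows):
    """Turn (sentence, [[subject, verb, object], ...]) rows into one record per triplet."""
    records = []
    for sentence, triplets in rows:
        for subject, verb, obj in triplets:
            records.append({
                "sentence": sentence,
                "subject": " ".join(subject),
                "verb": verb,
                "object": " ".join(obj),
            })
    return records


# Stand-in for the dataframe rows shown above (tuples of strings replace spans)
rows = [
    (
        "Co-culture of NK cells with transfected EC enhanced E-selectin, IL-8, "
        "and NF-kappaB-dependent promoter activity.",
        [
            [("Co-culture",), "enhanced", ("E-selectin",)],
            [("Co-culture",), "enhanced", ("IL-8",)],
            [("Co-culture",), "enhanced", ("NF-kappaB-dependent", "promoter", "activity")],
        ],
    )
]

records = flatten_triplets(rows)
for record in records:
    print(record["subject"], "|", record["verb"], "|", record["object"])
```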


Biomedical Pipeline using UMLS as entity linker (Requires high RAM)

In [3]:
text = "Co-culture of NK cells with transfected EC enhanced E-selectin, IL-8, and NF-kappaB-dependent promoter activity."

# Biomedical PipelineIE
# The default biomedical pipeline uses the scispaCy en_core_sci_lg model.
# The same model is used for neuralcoref, entity linking and triple extraction.
# pipeline="default" would use the spaCy English model instead.
# Switching the entity linker from spaCy to UMLS in the biomedical pipeline
# requires spelling out the entire pipeline in `properties`.
pie = PipelineIE(text, pipeline="biomedical", properties={'coref':'neuralcoref', 'entity_link': 'umls', 'ie': 'triplet'})

#Returns a dataframe
pie.pipeline_triplet()



https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/umls_semantic_type_tree.tsv not found in cache, downloading to /tmp/tmpuo_ow9zo
Finished download, copying /tmp/tmpuo_ow9zo to cache at /root/.scispacy/datasets/21a1012c532c3a431d60895c509f5b4d45b0f8966c4178b892190a302b21836f.330707f4efe774134872b9f77f0e3208c1d30f50800b3b39a6b8ec21d9adf1b7.umls_semantic_type_tree.tsv




Unnamed: 0,Sentences,Triplet
1,"Co-culture of NK cells with transfected EC enhanced E-selectin, IL-8, and NF-kappaB-dependent promoter activity.","[[(Co-culture), enhanced, (E-selectin)], [(Co-culture), enhanced, (IL-8)], [(Co-culture), enhanced, (NF-kappaB-dependent, promoter, activity)]]"


Default Pipeline (Suitable for generalized tasks or tasks not specific to a domain)


In [27]:
text = "Ajay leads the Support team. He requested a long leave. SBI raised several tickets and he leads its product maintainance."

# Default PipelineIE
# The default pipeline uses spaCy's en_core_web_md model.
# The same model is used for neuralcoref, entity linking and triple extraction.
# Omitting the pipeline argument is equivalent to pipeline="default".
pie = PipelineIE(text)

#Returns a dataframe
pie.pipeline_triplet()

Unnamed: 0,Sentences,Triplet
1,Ajay leads the Support team.,"[[(Ajay), leads, (Support, team)]]"
2,Ajay requested a long leave.,"[[(Ajay), requested, (long, leave)]]"
3,SBI raised several tickets and Ajay leads SBI product maintainance.,"[[(SBI), raised, (several, tickets)], [(Ajay), leads, (SBI, product, maintainance)]]"
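
The `Sentences` column above shows the text after coreference resolution: "He" and "he" became "Ajay", and "its" became "SBI". The sketch below reproduces only the final substitution step on this one example. The mention-to-antecedent mapping is written by hand and `substitute_mentions` is a hypothetical helper; a resolver such as neuralcoref or CoreNLP infers the clusters itself and works on token spans rather than regular expressions.

```python
import re


def substitute_mentions(text, antecedents):
    """Replace each whole-word mention with its antecedent.

    Spans are collected first and applied right to left so that earlier
    character offsets stay valid while the text changes length.
    """
    spans = []
    for mention, antecedent in antecedents.items():
        for match in re.finditer(rf"\b{re.escape(mention)}\b", text):
            spans.append((match.start(), match.end(), antecedent))
    for start, end, antecedent in sorted(spans, reverse=True):
        text = text[:start] + antecedent + text[end:]
    return text


text = ("Ajay leads the Support team. He requested a long leave. "
        "SBI raised several tickets and he leads its product maintainance.")

# Hand-written mapping standing in for a resolver's clusters (an assumption)
antecedents = {"He": "Ajay", "he": "Ajay", "its": "SBI"}

resolved = substitute_mentions(text, antecedents)
print(resolved)
```

Resolving pronouns before triple extraction is what lets every triplet name its real subject and object instead of "he" or "its".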


Loading a custom spaCy model, using CoreNLP for coreference resolution and passing input through a file

In [None]:
#Download spaCy model. For this example we download en_core_web_lg.
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.3.1/en_core_web_lg-2.3.1.tar.gz

In [None]:
!wget http://nlp.stanford.edu/software/stanford-corenlp-latest.zip
!unzip stanford-corenlp-latest.zip
!ls

In [6]:
df_demo = pd.DataFrame({'Text':["Ajay leads the Support team. He requested a long leave. SBI raised several tickets and he leads its product maintainance."]})
df_demo.to_csv("Demo.csv")

In [7]:
!ls
!pwd

Demo.csv     sample_data	     stanford-corenlp-latest.zip
neuralcoref  stanford-corenlp-4.2.0
/content


**CASE 1**

In [None]:
'''
Consider 
1. if you have an input file named Demo.csv having input in column named Text
2. if you want to use a custom spacy model such as en_core_web_lg
3. if you want to use Stanford CoreNLP for coreference resolution (assuming it has been downloaded and its location is /content/stanford-corenlp-4.2.0)

then below is how you will set required parameters, create and run the information extraction pipeline

'''
#pie = PipelineIE(file_name='/content/Demo.csv',col_name='Text', spacy_model='en_core_web_lg',corenlp_home="/content/stanford-corenlp-4.2.0", properties={'coref': 'corenlp', 'entity_link': 'spacy', 'ie': 'triplet'})

#pie.pipeline_triplet()

**CASE 2**

In [None]:
'''
Consider 
1. if you have a folder named Input_Files with each file having input in column named Text and all files are in csv format
2. if you want to use a custom spacy model such as en_core_web_lg
3. if you want to use Stanford CoreNLP for coreference resolution (assuming it has been downloaded and its location is /content/stanford-corenlp-4.2.0)

then below is how you will set required parameters, create and run the information extraction pipeline

'''
#pie = PipelineIE(input="csv", folder_dir='/content/Input_Files', col_name='Text', spacy_model='en_core_web_lg', corenlp_home="/content/stanford-corenlp-4.2.0", properties={'coref': 'corenlp', 'entity_link': 'spacy', 'ie': 'triplet'})

#pie.pipeline_triplet()

**CASE 3**

In [None]:
'''
Consider 
1. if you have a folder named Input_Files with each file having input in column named Text and all files are in csv format
2. if you want to use a custom spacy model such as en_core_web_lg
3. if you want to use Stanford CoreNLP for coreference resolution (assuming it has been downloaded and its location is /content/stanford-corenlp-4.2.0)
4. if you want to set Input_Files folder location, CoreNLP Home, CoreNLP properties like memory, timeout etc and column name in config.ini file

then below is how you will set required parameters, create and run the information extraction pipeline

'''
#pie = PipelineIE(input="csv", spacy_model='en_core_web_lg', properties={'coref': 'corenlp', 'entity_link': 'spacy', 'ie': 'triplet'})

#pie.pipeline_triplet()
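
Case 3 moves the folder location, CoreNLP home, CoreNLP properties and column name into config.ini. As a rough illustration of what such a file could hold and how an INI file is read in Python, here is a sketch using the standard `configparser` module. The section names, key names, and the memory and timeout values are hypothetical placeholders; check the config.ini shipped in the PipelineIE repository for the names it actually expects.

```python
import configparser

# Hypothetical layout: section/key names and values are placeholders for illustration
SAMPLE_CONFIG = """
[INPUT]
folder_dir = /content/Input_Files
col_name = Text

[CORENLP]
corenlp_home = /content/stanford-corenlp-4.2.0
memory = 4G
timeout = 60000
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

# Collect the values the Case 3 description says are moved out of the constructor call
settings = {
    "folder_dir": config["INPUT"]["folder_dir"],
    "col_name": config["INPUT"]["col_name"],
    "corenlp_home": config["CORENLP"]["corenlp_home"],
    "memory": config["CORENLP"]["memory"],
    "timeout": config["CORENLP"].getint("timeout"),
}
print(settings)
```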