Named Entity Recognition and Dependency Parsing for Information Extraction using spaCy.

1. Consider any text file (research article, technical blog or any unstructured corpus used before) <br>
2. Perform NER to extract entities from individual sentences using spaCy. <br>
3. Use dependency parsing and POS tagging to extract relationships between the entities. <br>
4. Create a tuple for Information Extraction <br>
T1( Entity1, Entity2, Relation label) <br>
5. Display the number of such tuples extracted from the considered corpus as the extracted information

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
doc = nlp('Life is a gift from god.Do your Best.everything is a blessing.')

In [None]:
# "Best.everything" has no space after the period, so the last two sentences are not split
for sent in doc.sents:
    print(sent)

Life is a gift from god.
Do your Best.everything is a blessing.


In [None]:
for word in doc:
    print(word.text, word.tag_, word.pos_, word.dep_)
    
# Explain the tags (a notebook cell displays only the value of its last expression)
spacy.explain('pobj')
spacy.explain('nsubj')

Life NN NOUN nsubj
is VBZ AUX ROOT
a DT DET det
gift NN NOUN attr
from IN ADP prep
god NNP PROPN pobj
. . PUNCT punct
Do VBP AUX csubj
your PRP$ PRON dobj
Best.everything NFP PUNCT nsubj
is VBZ AUX ROOT
a DT DET det
blessing NN NOUN attr
. . PUNCT punct


'nominal subject'

In [None]:
doc.text.split()

['Life',
 'is',
 'a',
 'gift',
 'from',
 'god.Do',
 'your',
 'Best.everything',
 'is',
 'a',
 'blessing.']
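
The glued words in the output above (`god.Do`, `Best.everything`) come from periods with no space after them. One possible workaround, sketched here as an assumption rather than as part of the original notebook, is to insert the missing space before the text reaches `nlp`. The helper name `add_missing_sentence_spaces` and the regular expression are illustrative. The heuristic only fires when a period sits between a lowercase letter and another letter, so it would also split abbreviations such as "e.g." and should be checked against the corpus before use.

```python
import re

# Heuristic (an assumption, not from the notebook): a period directly between
# a lowercase letter and another letter is a sentence boundary that lost its space.
_MISSING_SPACE = re.compile(r"(?<=[a-z])\.(?=[A-Za-z])")


def add_missing_sentence_spaces(text: str) -> str:
    """Insert a space after periods that are glued to the following word."""
    return _MISSING_SPACE.sub(". ", text)


raw = "Life is a gift from god.Do your Best.everything is a blessing."
cleaned = add_missing_sentence_spaces(raw)
print(cleaned)
# The cleaned string can then be parsed as before: doc = nlp(cleaned)
```

Numbers such as `3.5` are left untouched because the pattern requires letters on both sides of the period.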

## Named Entity Recognition

In [None]:
kalam= "A. P. J. Abdul Kalam was an Indian aerospace scientist and politician who served as the 11th President of India from 2002 to 2007. He was born and raised in Rameswaram, Tamil Nadu and studied physics and aerospace engineering. He spent the next four decades as a scientist and science administrator, mainly at the Defence Research and Development Organisation (DRDO) and Indian Space Research Organisation (ISRO) and was intimately involved in India's civilian space programme and military missile development efforts."

In [None]:
nlp_kalam = nlp(kalam)
[(i, i.label_, i.label) for i in nlp_kalam.ents]

[(A. P. J. Abdul Kalam, 'PERSON', 380),
 (Indian, 'NORP', 381),
 (11th, 'ORDINAL', 396),
 (India, 'GPE', 384),
 (2002, 'DATE', 391),
 (2007, 'DATE', 391),
 (Rameswaram, 'GPE', 384),
 (Tamil Nadu, 'PERSON', 380),
 (the next four decades, 'DATE', 391),
 (the Defence Research and Development Organisation, 'ORG', 383),
 (Indian Space Research Organisation, 'ORG', 383),
 (India, 'GPE', 384)]
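
A quick way to get an overview of such a list is to count how often each label occurs. The sketch below is a minimal example using `collections.Counter`; the helper name `count_entity_labels` is hypothetical, and the sample pairs are copied from the output above so the block runs without loading a model. With spaCy the input would be `[(ent.text, ent.label_) for ent in nlp_kalam.ents]`.

```python
from collections import Counter


def count_entity_labels(entities):
    """Count how often each label occurs in an iterable of (text, label) pairs."""
    return Counter(label for _text, label in entities)


# (text, label) pairs copied from the entity output shown above.
kalam_entities = [
    ("A. P. J. Abdul Kalam", "PERSON"),
    ("Indian", "NORP"),
    ("11th", "ORDINAL"),
    ("India", "GPE"),
    ("2002", "DATE"),
    ("2007", "DATE"),
    ("Rameswaram", "GPE"),
    ("Tamil Nadu", "PERSON"),
    ("the next four decades", "DATE"),
    ("the Defence Research and Development Organisation", "ORG"),
    ("Indian Space Research Organisation", "ORG"),
    ("India", "GPE"),
]

label_counts = count_entity_labels(kalam_entities)
for label, n in label_counts.most_common():
    print(label, n)
```

The counts inherit the model's mistakes: "Tamil Nadu" is a state, yet it is counted under `PERSON` because that is the label the model assigned.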

In [None]:
from spacy import displacy
displacy.render(nlp_kalam, style='ent', jupyter=True)

### 1. Consider any text file (research article, technical blog or any unstructured corpus used before) <br>
### 2. Perform NER to extract entities from individual sentences using spacy. <br>

#### Corpus - ABC

In [None]:
# Print the first 3000 characters of the corpus file
import nltk
data = nltk.corpus.abc.raw('science.txt')[:3000]
print(data)

Cystic fibrosis affects 30,000 children and young adults in the US alone
Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. 
That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine.
They found that inhaling a mist with a salt content of 7 or 9% improved lung function and, in some cases, produced less absenteeism from school or work. 
Cystic fibrosis, a progressive and frequently fatal genetic disease that affects about 30,000 young adults and children in the US alone, is marked by a thickening of the mucus which makes it harder to clear the lungs of debris and bacteria. 
The salt water solution "really opens up a new avenue for approaching patients with cystic fibrosis and how to treat them," says Dr Gail Weinmann, of the US National Heart, Lung, and Blood Institute, which sponsored one of the s

In [None]:
nlp_data = nlp(data)
[(i, i.label_, i.label) for i in nlp_data.ents]

[(Cystic, 'NORP', 381),
 (30,000, 'CARDINAL', 397),
 (US, 'GPE', 384),
 (two, 'CARDINAL', 397),
 (week, 'DATE', 391),
 (The New England Journal of Medicine, 'ORG', 383),
 (7 or 9%, 'PERCENT', 393),
 (Cystic, 'NORP', 381),
 (about 30,000, 'CARDINAL', 397),
 (US, 'GPE', 384),
 (Gail Weinmann, 'PERSON', 380),
 (the US National Heart, 'ORG', 383),
 (Lung, 'PERSON', 380),
 (Blood Institute, 'ORG', 383),
 (one, 'CARDINAL', 397),
 (Mark Elkins, 'PERSON', 380),
 (the Royal Prince Alfred Hospital, 'ORG', 383),
 (Sydney, 'GPE', 384),
 (Australia, 'GPE', 384),
 (83, 'CARDINAL', 397),
 (7%, 'PERCENT', 393),
 (under 1%, 'PERCENT', 393),
 (first, 'ORDINAL', 396),
 (second, 'ORDINAL', 396),
 (US, 'GPE', 384),
 (Profsesor Scott Donaldson, 'PERSON', 380),
 (the University of North Carolina, 'ORG', 383),
 (Chapel Hill, 'FAC', 9191306739292312949),
 (7%, 'PERCENT', 393),
 (Dr Felix Ratjen, 'ORG', 383),
 (the Hospital for Sick Children, 'ORG', 383),
 (Toronto, 'GPE', 384),
 (Canada, 'GPE', 384),
 (30 minu

In [None]:
from spacy import displacy
displacy.render(nlp_data, style='ent', jupyter=True)

## Dependency Parsing

Applying dependency parsing to a short two-sentence example.

In [None]:
from spacy import displacy
doc = nlp('India is my country. It is beautiful')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

Apply dependency parsing to the corpus sample, sentence by sentence.

In [None]:
from spacy import displacy
from nltk.tokenize import sent_tokenize
token_text_eng = sent_tokenize(data, language = 'english')
for i in token_text_eng:
    doc = nlp(i)
    displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

### 3. Use Dependency Parsing, POS tagging to extract relationships between the entities.
### 4. Create a tuple for Information Extraction <br>
 T1( Entity1, Entity2, Relation label, pos1, pos2)<br>
 T1( Entity1, Entity2, Relation label)
### 5. Display the number of such tuples extracted from the considered corpus as the extracted information
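
The two tuple shapes listed in step 4 can be given named fields so that each position is self-describing. This is only a sketch: the class name `RelationTuple` and the method `short` are hypothetical and not part of the notebook, and the example values are taken from the extraction output shown in this notebook.

```python
from typing import NamedTuple


class RelationTuple(NamedTuple):
    """One extracted record: T1(Entity1, Entity2, Relation label, pos1, pos2)."""

    entity1: tuple  # (text, entity label), e.g. ("Sydney", "GPE")
    entity2: tuple  # (text, entity label)
    relation: str   # dependency label used as the relation label
    pos1: str       # coarse POS tag of the first entity token
    pos2: str       # coarse POS tag of the second entity token

    def short(self):
        """Return the three-field form T1(Entity1, Entity2, Relation label)."""
        return (self.entity1, self.entity2, self.relation)


example = RelationTuple(
    ("Sydney", "GPE"), ("Australia", "GPE"), "pobj", "PROPN", "PROPN"
)
print(example)
print(example.short())
```

Because a `NamedTuple` is still a tuple, slicing with `[:3]` gives the same result as `short()`, so it works with code that expects plain tuples.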

In [None]:
# Use the first 100,000 characters of the corpus file
text = nltk.corpus.abc.raw('science.txt')[:100000]

doc = nlp(text)

tuples = []

# Loop through the sentences in the document
for sent in doc.sents:
    # Extract named entities from the sentence
    entities = [(ent.text, ent.label_) for ent in sent.ents]
    # Look for single-token entities that have another single-token entity as a syntactic child
    for token in sent:
        if token.text in [ent[0] for ent in entities]:
            for child in token.children:
                if child.text in [ent[0] for ent in entities]:
                    # Use the head token's dependency label as the relation label,
                    # and record the POS tags of both tokens
                    label = token.dep_
                    pos1 = token.pos_
                    pos2 = child.pos_
                    # Get the entity texts and labels
                    entity1 = [(ent[0], ent[1]) for ent in entities if ent[0] == token.text][0]
                    entity2 = [(ent[0], ent[1]) for ent in entities if ent[0] == child.text][0]
                    # Create a record and add it to the list
                    # (named `record` to avoid shadowing the built-in `tuple`)
                    record = (entity1, entity2, label, pos1, pos2)
                    tuples.append(record)

print("Number of tuples extracted:", len(tuples))
print()

for record in tuples:
    print(record)

Number of tuples extracted: 34

(('Sydney', 'GPE'), ('Australia', 'GPE'), 'pobj', 'PROPN', 'PROPN')
(('Toronto', 'GPE'), ('Canada', 'GPE'), 'pobj', 'PROPN', 'PROPN')
(('Australian', 'NORP'), ('German', 'NORP'), 'amod', 'ADJ', 'ADJ')
(('Jews', 'NORP'), ('Ashkenazi', 'NORP'), 'pobj', 'PROPN', 'PROPN')
(('Europe', 'LOC'), ('US', 'GPE'), 'pobj', 'PROPN', 'PROPN')
(('Jews', 'NORP'), ('European', 'NORP'), 'pobj', 'PROPN', 'ADJ')
(('Jews', 'NORP'), ('Ashkenazi', 'NORP'), 'attr', 'PROPN', 'PROPN')
(('CSIRO', 'ORG'), ('NASA', 'ORG'), 'conj', 'PROPN', 'PROPN')
(('Australia', 'GPE'), ('US', 'GPE'), 'pobj', 'PROPN', 'PROPN')
(('US', 'GPE'), ('Europe', 'LOC'), 'appos', 'PROPN', 'PROPN')
(('Europe', 'LOC'), ('Canada', 'GPE'), 'conj', 'PROPN', 'PROPN')
(('Canada', 'GPE'), ('Japan', 'GPE'), 'conj', 'PROPN', 'PROPN')
(('Boulder', 'GPE'), ('Colorado', 'GPE'), 'pobj', 'PROPN', 'PROPN')
(('Valley', 'LOC'), ('Egypt', 'GPE'), 'pobj', 'PROPN', 'PROPN')
(('Antartica', 'GPE'), ('Canada', 'GPE'), 'pobj', 'PROPN

In [None]:
# Display the extracted tuples in the three-field form (Entity1, Entity2, Relation label)
for record in tuples:
    print(record[:3])

(('Sydney', 'GPE'), ('Australia', 'GPE'), 'pobj')
(('Toronto', 'GPE'), ('Canada', 'GPE'), 'pobj')
(('Australian', 'NORP'), ('German', 'NORP'), 'amod')
(('Jews', 'NORP'), ('Ashkenazi', 'NORP'), 'pobj')
(('Europe', 'LOC'), ('US', 'GPE'), 'pobj')
(('Jews', 'NORP'), ('European', 'NORP'), 'pobj')
(('Jews', 'NORP'), ('Ashkenazi', 'NORP'), 'attr')
(('CSIRO', 'ORG'), ('NASA', 'ORG'), 'conj')
(('Australia', 'GPE'), ('US', 'GPE'), 'pobj')
(('US', 'GPE'), ('Europe', 'LOC'), 'appos')
(('Europe', 'LOC'), ('Canada', 'GPE'), 'conj')
(('Canada', 'GPE'), ('Japan', 'GPE'), 'conj')
(('Boulder', 'GPE'), ('Colorado', 'GPE'), 'pobj')
(('Valley', 'LOC'), ('Egypt', 'GPE'), 'pobj')
(('Antartica', 'GPE'), ('Canada', 'GPE'), 'pobj')
(('seven', 'CARDINAL'), ('eight', 'CARDINAL'), 'nummod')
(('Ketek', 'PERSON'), ('Aventis', 'GPE'), 'pobj')
(('US', 'GPE'), ('French', 'NORP'), 'nmod')
(('10,000', 'MONEY'), ('7500', 'MONEY'), 'pobj')
(('year', 'DATE'), ('two', 'CARDINAL'), 'npadvmod')
(('MatScape', 'ORG'), ('Joachim'
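
The same entity pair can appear more than once in a list like this, so it may help to de-duplicate the triples and count them per relation label. The sketch below is one way to do that; the helper name `summarize_relations` is hypothetical. The sample triples are copied from the output above, with one entry repeated on purpose to show the de-duplication. With the notebook's variables the input would be `[t[:3] for t in tuples]`.

```python
from collections import Counter


def summarize_relations(triples):
    """Return (unique_triples, label_counts) for (entity1, entity2, label) triples."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    unique = list(dict.fromkeys(triples))
    counts = Counter(label for _e1, _e2, label in unique)
    return unique, counts


sample = [
    (("Sydney", "GPE"), ("Australia", "GPE"), "pobj"),
    (("Toronto", "GPE"), ("Canada", "GPE"), "pobj"),
    (("Jews", "NORP"), ("Ashkenazi", "NORP"), "pobj"),
    (("Jews", "NORP"), ("Ashkenazi", "NORP"), "attr"),
    (("CSIRO", "ORG"), ("NASA", "ORG"), "conj"),
    (("Sydney", "GPE"), ("Australia", "GPE"), "pobj"),  # repeated on purpose
]

unique, counts = summarize_relations(sample)
print("Unique tuples:", len(unique))
for label, n in counts.most_common():
    print(label, n)
```

Triples that share an entity pair but differ in label, such as the two `Jews`/`Ashkenazi` entries, are kept as separate records.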