<a href="https://colab.research.google.com/github/sjut/DPO_Materials/blob/master/scacy_ner_train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Information Extraction
https://medium.com/analytics-vidhya/introduction-to-information-extraction-using-python-and-spacy-858f5d6416ca

In [None]:
import re
import string
import nltk
import spacy
import pandas as pd
import numpy as np
import math
from tqdm import tqdm

from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

pd.set_option('display.max_colwidth', 200)

In [None]:
# load spaCy model
nlp = spacy.load("en_core_web_sm")

In [None]:
# sample text
text = "GDP in developing countries such as Vietnam will continue growing at a high rate."

# create a spaCy object
doc = nlp(text)

To pull the desired information out of the sentence above, we first need to understand its syntactic structure: the subject, object, modifiers, and parts of speech (POS) in the sentence.

In [None]:
# print token, dependency, POS tag
for tok in doc:
  print(tok.text, "-->",tok.dep_,"-->", tok.pos_)

GDP --> nsubj --> NOUN
in --> prep --> ADP
developing --> amod --> VERB
countries --> pobj --> NOUN
such --> amod --> ADJ
as --> prep --> ADP
Vietnam --> pobj --> PROPN
will --> aux --> AUX
continue --> ROOT --> VERB
growing --> xcomp --> VERB
at --> prep --> ADP
a --> det --> DET
high --> amod --> ADJ
rate --> pobj --> NOUN
. --> punct --> PUNCT
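The dependency and POS labels printed above are terse. As a small sketch, `spacy.explain` can look up a short description for each label; the list of labels below is just a hand-picked selection from the output above, and no trained model is needed for the lookup.

```python
import spacy

# A few labels copied from the output above (selection is arbitrary)
labels = ["nsubj", "amod", "pobj", "xcomp", "PROPN", "ADP"]

# spacy.explain returns a short description, or None for unknown labels
glossary = {label: spacy.explain(label) for label in labels}

for label, meaning in glossary.items():
    print(f"{label:>6} --> {meaning}")
```

This makes it easier to decide which `DEP` and `POS` values to put into a pattern.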


In [None]:
#define the pattern
pattern = [{'POS':'NOUN'},
           {'LOWER': 'such'},
           {'LOWER': 'as'},
           {'POS': 'PROPN'} #proper noun
           ]

In [None]:
# Matcher class object
matcher = Matcher(nlp.vocab)
matcher.add("matching_1", [pattern])

matches = matcher(doc)
span = doc[matches[0][1]:matches[0][2]]

print(span.text)

countries such as Vietnam


In [None]:
# Matcher class object
matcher = Matcher(nlp.vocab)

#define the pattern
pattern = [{'DEP':'amod', 'OP':"?"}, # adjectival modifier
           {'POS':'NOUN'},
           {'LOWER': 'such'},
           {'LOWER': 'as'},
           {'POS': 'PROPN'}]

matcher.add("matching_1", [pattern])
matches = matcher(doc)

span = doc[matches[0][1]:matches[0][2]]
print(span.text)

developing countries such as Vietnam


Note: The key ‘OP’: ‘?’ in the pattern above means that the modifier (‘amod’) can occur once or not at all.
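To see the effect of 'OP': '?' in isolation, here is a minimal sketch that does not need a trained model. It assumes a blank English pipeline (tokenizer only), so the optional token is matched by its text (`LOWER`) instead of by `DEP`, and the proper noun is approximated with `IS_TITLE`; these substitutions are illustrative, not the pattern used above. Because an optional token can produce overlapping matches (with and without it), the sketch also passes `greedy="LONGEST"` to keep only the longest one.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline only tokenizes, so no trained model is required.
nlp = spacy.blank("en")

# Optional first token ("OP": "?"), followed by four required tokens.
pattern = [{"LOWER": "developing", "OP": "?"},
           {"LOWER": "countries"},
           {"LOWER": "such"},
           {"LOWER": "as"},
           {"IS_TITLE": True}]  # stand-in for a proper noun (assumption)

matcher = Matcher(nlp.vocab)
# greedy="LONGEST" keeps only the longest of any overlapping matches.
matcher.add("such_as", [pattern], greedy="LONGEST")

def matched_texts(text):
    """Return the text of every match found in `text`."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]

with_modifier = matched_texts("developing countries such as Vietnam")
without_modifier = matched_texts("countries such as Vietnam")
print(with_modifier)
print(without_modifier)
```

The same pattern matches both inputs: the optional token is included when present and skipped when absent.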

In [None]:
doc = nlp("Here is how you can keep your car and other vehicles clean.")

# print dependency tags and POS tags
for tok in doc:
  print(tok.text, "-->",tok.dep_, "-->",tok.pos_)

Here --> advmod --> ADV
is --> ROOT --> AUX
how --> advmod --> SCONJ
you --> nsubj --> PRON
can --> aux --> AUX
keep --> ccomp --> VERB
your --> poss --> PRON
car --> dobj --> NOUN
and --> cc --> CCONJ
other --> amod --> ADJ
vehicles --> conj --> NOUN
clean --> oprd --> ADJ
. --> punct --> PUNCT


In [None]:
# Matcher class object
matcher = Matcher(nlp.vocab)

#define the pattern
pattern = [{'DEP':'amod', 'OP':"?"},
           {'POS':'NOUN'},
           {'LOWER': 'and', 'OP':"?"},
           {'LOWER': 'or', 'OP':"?"},
           {'LOWER': 'other'},
           {'POS': 'NOUN'}]

matcher.add("matching_1", [pattern])

matches = matcher(doc)
span = doc[matches[0][1]:matches[0][2]]
print(span.text)

car and other vehicles


In [None]:
# sample text
doc = nlp("Eight people, including two children")

# Matcher class object
matcher = Matcher(nlp.vocab)

#define the pattern
pattern = [{'DEP':'nummod','OP':"?"}, # numeric modifier
           {'DEP':'amod','OP':"?"}, # adjectival modifier
           {'POS':'NOUN'},
           {'IS_PUNCT': True},
           {'LOWER': 'including'},
           {'DEP':'nummod','OP':"?"},
           {'DEP':'amod','OP':"?"},
           {'POS':'NOUN'}]

matcher.add("matching_1", [pattern])

matches = matcher(doc)
span = doc[matches[0][1]:matches[0][2]]
print(span.text)


Eight people, including two children


In [None]:
doc = nlp("A healthy eating pattern includes fruits, especially whole fruits.")

for tok in doc:
  print(tok.text, tok.dep_, tok.pos_)

A det DET
healthy amod ADJ
eating compound NOUN
pattern nsubj NOUN
includes ROOT VERB
fruits dobj NOUN
, punct PUNCT
especially advmod ADV
whole amod ADJ
fruits appos NOUN
. punct PUNCT


In [None]:
# Matcher class object
matcher = Matcher(nlp.vocab)

#define the pattern
pattern = [{'DEP':'nummod','OP':"?"},
           {'DEP':'amod','OP':"?"},
           {'POS':'NOUN'},
           {'IS_PUNCT':True},
           {'LOWER': 'especially'},
           {'DEP':'nummod','OP':"?"},
           {'DEP':'amod','OP':"?"},
           {'POS':'NOUN'}]

matcher.add("matching_1", [pattern])

matches = matcher(doc)
span = doc[matches[0][1]:matches[0][2]]
print(span.text)

fruits, especially whole fruits


In [None]:
text = "Tableau was recently acquired by Salesforce."

# Plot the dependency graph
doc = nlp(text)
displacy.render(doc, style='dep',jupyter=True)

In [None]:
text = "Tableau was recently acquired by Salesforce."
doc = nlp(text)

for tok in doc:
  print(tok.text,"-->",tok.dep_,"-->",tok.pos_)

Tableau --> nsubjpass --> PROPN
was --> auxpass --> AUX
recently --> advmod --> ADV
acquired --> ROOT --> VERB
by --> agent --> ADP
Salesforce --> pobj --> PROPN
. --> punct --> PUNCT


In [None]:
def subtree_matcher(doc):
  x = ''
  y = ''

  # iterate through all the tokens in the input sentence
  for tok in doc:
    # extract the passive subject (e.g. nsubjpass)
    if "subjpass" in tok.dep_:
      y = tok.text

    # extract the object (e.g. dobj, pobj)
    if tok.dep_.endswith("obj"):
      x = tok.text

  return x,y

In this case, we just have to find all the sentences that:
- have two entities, and
- have the term “acquired” as the only ROOT in the sentence.
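The two criteria can be sketched as a small filter. This is only an illustration: `is_acquisition_sentence` is a hypothetical helper, "two entities" is read as exactly two, and the documents are annotated by hand (using the parse of the Tableau sentence shown earlier) so that the example does not depend on a trained model's predictions.

```python
import spacy
from spacy.tokens import Doc

# Blank pipeline: supplies a vocab only; annotations below are set by hand.
nlp = spacy.blank("en")

def is_acquisition_sentence(doc):
    """True if doc has exactly two entities and 'acquired' as its only ROOT."""
    roots = [tok for tok in doc if tok.dep_ == "ROOT"]
    return (len(doc.ents) == 2
            and len(roots) == 1
            and roots[0].lower_ == "acquired")

# Hand-annotated version of "Tableau was recently acquired by Salesforce."
acquisition = Doc(
    nlp.vocab,
    words=["Tableau", "was", "recently", "acquired", "by", "Salesforce", "."],
    heads=[3, 3, 3, 3, 3, 4, 3],  # absolute index of each token's head
    deps=["nsubjpass", "auxpass", "advmod", "ROOT", "agent", "pobj", "punct"],
    ents=["B-ORG", "O", "O", "O", "O", "B-ORG", "O"],
)

# A made-up counter-example with two entities but a different ROOT verb.
other = Doc(
    nlp.vocab,
    words=["Salesforce", "praised", "Tableau", "."],
    heads=[1, 1, 1, 1],
    deps=["nsubj", "ROOT", "dobj", "punct"],
    ents=["B-ORG", "O", "B-ORG", "O"],
)

print(is_acquisition_sentence(acquisition))
print(is_acquisition_sentence(other))
```

With a trained pipeline, the same check could be applied directly to `nlp(text)`, since the parser and entity recognizer fill in `dep_` and `doc.ents`.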
We can then capture the subject and the object from those sentences. Let’s call subtree_matcher() on our example sentence:

In [None]:
subtree_matcher(doc)

('Salesforce', 'Tableau')

In the returned tuple, the first element is the acquirer (Salesforce, the object of “by”) and the second is the entity being acquired (Tableau, the passive subject). The same function, subtree_matcher(), can extract entity pairs from other sentences that express this relation (“acquired”).
But what if we change the sentence from passive to active voice? Will our logic still work?

In [None]:
text_3 = "Salesforce recently acquired Tableau."
doc_3 = nlp(text_3)
subtree_matcher(doc_3)

('Tableau', '')

In [None]:
for tok in doc_3:
  print(tok.text, "-->",tok.dep_, "-->",tok.pos_)

Salesforce --> nsubj --> NOUN
recently --> advmod --> ADV
acquired --> ROOT --> VERB
Tableau --> dobj --> PROPN
. --> punct --> PUNCT


In [None]:
def subtree_matcher(doc):
  # the sentence is passive if any dependency tag contains "subjpass"
  is_passive = any("subjpass" in tok.dep_ for tok in doc)

  x = ''
  y = ''

  for tok in doc:
    if is_passive:
      # passive voice: the object of "by" is x, the passive subject is y
      if "subjpass" in tok.dep_:
        y = tok.text

      if tok.dep_.endswith("obj"):
        x = tok.text

    else:
      # active voice: the subject is x, the object is y
      if tok.dep_.endswith("subj"):
        x = tok.text

      if tok.dep_.endswith("obj"):
        y = tok.text

  return x,y

In [None]:
subtree_matcher(doc_3)

('Salesforce', 'Tableau')