<a href="https://colab.research.google.com/github/sunc-dev/spaCY-ner-sustain/blob/main/ner-train-note.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font size="10">Custom entity recognition </font>
## Training data environment setup

## Introduction:

This notebook is test code for setting up new entities and programmatically creating training data to train the NER model. The goal of the project is to tag environmental/sustainability technologies contained within the systems data stored in our capital planning systems.

To skip the creation of training data, go directly to [model training and model implementation](./ner-model-note.ipynb).

## SETUP
Module downloads (necessary only on the first run):

In [52]:
!pip install spacy
!pip install pdfminer.six
!pip install nltk 
!python -m spacy download en_core_web_md

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


Import modules:

In [53]:
import os
import pandas as pd
from pathlib import Path
import glob
from io import StringIO

#pdf miner 
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from pdfminer.high_level import extract_text

#SpaCy 
import en_core_web_md
import spacy
from spacy import displacy
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher
from spacy.pipeline import Sentencizer
from spacy.lemmatizer import Lemmatizer, ADJ, NOUN, VERB
from spacy.lookups import Lookups
#nltk
from nltk import tokenize


In [54]:
#mount drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Train data setup
The code section below sets up the training data for the NER model.
This is done in the following steps:
* Convert documents into text
* Set up the SpaCy nlp object
* Load, parse, and tokenize the text into a list of sentences
* Apply a utility function to the tokenized sentences so that they take the form: [(Sentence, {entities: [(start, end, label)]}), ...].

SpaCy object declaration and relative path setup:

SpaCy requires training data to be in the format: [(Sentence, {entities: [(start, end, label)]}), ...].
We first need to create a list of the sentences in a doc object that contain the new entity.
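
As a concrete illustration of this format, the sketch below builds one annotated example by hand and checks that each `(start, end, label)` offset slices back to the intended phrase. It uses plain Python only; the helper names `make_example` and `check_offsets` are hypothetical and are not part of SpaCy.

```python
def make_example(sentence, phrase, label):
    """Build one (sentence, {"entities": [...]}) tuple for the first occurrence of phrase."""
    start = sentence.find(phrase)
    if start == -1:
        # Phrase not present: return the sentence with no annotations.
        return (sentence, {"entities": []})
    end = start + len(phrase)  # the end offset is exclusive
    return (sentence, {"entities": [(start, end, label)]})


def check_offsets(example):
    """Return the substrings that the character offsets point at."""
    sentence, annotations = example
    return [sentence[start:end] for start, end, _ in annotations["entities"]]


example = make_example("I have a broken green roof.", "green roof", "SUSTECH")
print(example)
print(check_offsets(example))
```

Slicing the sentence with the stored offsets is a quick sanity check before any annotation is handed to training.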

In [55]:
#Load the English medium model; its vocab and pipeline define English sentence boundaries
nlp = en_core_web_md.load()
#Raise the character limit so the concatenated text of all documents fits in one doc
nlp.max_length = 6130000 # or higher

nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f304a8ada90>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f304770afa8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f3035f50528>)]

Declare rules for sentence boundary detection logic and add to the nlp pipe object.


In [56]:
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
nlp.add_pipe(sentencizer, before="tagger")
nlp.pipeline

[('sentencizer', <spacy.pipeline.pipes.Sentencizer at 0x7f3049cae048>),
 ('tagger', <spacy.pipeline.pipes.Tagger at 0x7f304a8ada90>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f304770afa8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f3035f50528>)]
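
To make the rule concrete: a punctuation-based sentencizer closes a sentence after each of the configured characters. The plain-Python sketch below mimics that idea on a raw string. It is an illustration only, not SpaCy's implementation, and `split_on_punct` is a hypothetical name.

```python
def split_on_punct(text, punct_chars=(".", "?", "!", "。")):
    """Split text into sentences, ending a sentence after each punctuation character."""
    sentences, current = [], []
    for char in text:
        current.append(char)
        if char in punct_chars:
            sentence = "".join(current).strip()
            if sentence:
                sentences.append(sentence)
            current = []
    # Keep any trailing text that has no closing punctuation.
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_on_punct("The sump pump failed. Is the green roof intact? Yes!"))
```

One difference worth keeping in mind: this sketch works character by character, whereas the Sentencizer operates on tokens, so the two can disagree on text such as decimal numbers.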

Set up the document import (tests were written against a single doc object for now):

## Current test case is for the word 'green roof'

Document path setup (can change to a higher level directory to capture more/different docs for other terms)

In [57]:
'''doc_path = Path(r'./doc/stormw') #change this for broader directory to capture
pdf_objects = list(doc_path.glob('stormw4.pdf'))  # convert result to a list
print(pdf_objects)'''

#Setup higher-level directory function to import more docs
#dir_path = r'./doc/' #change this for broader directory to capture

#For Colab, change the set for different training examples
dir_path = r'/content/drive/My Drive/Colab Notebooks/nlp-ner-sustain-notebook/doc/set2'
pdf_objects = glob.glob(dir_path + '/**/*.pdf', recursive=True)
print(pdf_objects)
len(pdf_objects)
#test on single pdf object
#pdf_object =  os.path.join(doc_path, 'example2.pdf')

['/content/drive/My Drive/Colab Notebooks/nlp-ner-sustain-notebook/doc/set2/gutter/gutter1.pdf', '/content/drive/My Drive/Colab Notebooks/nlp-ner-sustain-notebook/doc/set2/gutter/gutter2.pdf', '/content/drive/My Drive/Colab Notebooks/nlp-ner-sustain-notebook/doc/set2/gutter/gutter3.pdf', '/content/drive/My Drive/Colab Notebooks/nlp-ner-sustain-notebook/doc/set2/rainw/rainw1.pdf', '/content/drive/My Drive/Colab Notebooks/nlp-ner-sustain-notebook/doc/set2/rainw/rainw2.pdf', '/content/drive/My Drive/Colab Notebooks/nlp-ner-sustain-notebook/doc/set2/rainw/rainw3.pdf', '/content/drive/My Drive/Colab Notebooks/nlp-ner-sustain-notebook/doc/set2/siteg/siteg2.pdf', '/content/drive/My Drive/Colab Notebooks/nlp-ner-sustain-notebook/doc/set2/sumpp/sumpp1.pdf', '/content/drive/My Drive/Colab Notebooks/nlp-ner-sustain-notebook/doc/set2/sumpp/sumpp2.pdf', '/content/drive/My Drive/Colab Notebooks/nlp-ner-sustain-notebook/doc/set2/sumpp/sumpp3.pdf', '/content/drive/My Drive/Colab Notebooks/nlp-ner-sust

21
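
The recursive `**/*.pdf` pattern is what lets a higher-level directory pick up every term folder at once. A minimal sketch of the same idea with `pathlib`, run against a throwaway directory; the folder layout and file names below are made up for illustration and only imitate the `doc/set2/<term>/<file>.pdf` structure.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: one sub-folder per term, plus one non-PDF file.
    for rel in ["gutter/gutter1.pdf", "rainw/rainw1.pdf", "rainw/notes.txt"]:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    # rglob("*.pdf") walks all sub-folders, like glob.glob(path + "/**/*.pdf", recursive=True).
    found = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.pdf"))

print(found)
```

Pointing `root` at a broader directory widens the document set without any other change to the loading code.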

Convert pdf object into text format to parse:

In [58]:
texts = []
for file in pdf_objects:
    with open(file,'rb') as pdf:
        print('loading pdf...')
        text = extract_text(pdf)
        texts.append(text)
        print('pdf appended..')

print('pdf read complete...')
#check object
#texts

loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
loading pdf...
pdf appended..
pdf read complete...


### Sentence tokenization using SpaCy Sentencizer

In [59]:
# alternative code for sentence tokenization using nltk
# tokens = tokenize.sent_tokenize(text)

#sentence tokenization using english vocab and SpaCy Sentencizer
#declare doc object to pass into our nlp pipeline
all_text = " ".join(str(x) for x in texts)    
doc = nlp(all_text)

#check number of sentences detected by Sentencizer
print("Number of lines :", (len(list(doc.sents))))

# create a list of tokenized sentences
sents = [sent.text.strip() for sent in doc.sents]

#lemmatization

#check sents list object
#sents

Number of lines : 11296


## DECLARE NEW CUSTOM ENTITY
The code below sets up and defines the new custom entity and prepares the training data for the NER model.

## New entity creation

### Rule-based matching using PhraseMatcher - edit to add additional rules to capture more strings

In [60]:
'''#Entity creation -- works okay...
#Label setup, we provide the label 'SUSTECH' for sustainability and resilience technologies.
LABEL = 'SUSTECH'

Define matcher object using english vocab we defined earlier. Test using PhraseMatcher
to define rules based on the exact patterns the strings will take the form of

#Character patterns to add into our matcher object
rule_matcher = PhraseMatcher(nlp.vocab)

# create rule patterns - ADD more rule patterns here!
rule_patterns = ['Green roof', 
            'green roof', 
            'green Roof',
            'Green Roof',
            'stormwater pond',
            'Stormwater Pond',
            'Stormwater pond',
            'Stormwater ponds',
            'stormwater ponds',
            'Stormwater Ponds',
            'stormwater Ponds',
            'Sump Pump',
            'sump Pump',
            'sump pump',
            'Sump Pumps',
            'sump Pumps',
            'sump pumps',
            'Backup Generator',
            'backup Generator',
            'Backup generator',
            'backup generator',
            'Backup Generators',
            'backup Generators',
            'Backup generators',
            'backup generators']

for i in rule_patterns: 
    rule_matcher.add(LABEL, None, nlp(i))'''

"#Entity creation -- works okay...\n#Label setup, we provide the label 'SUSTECH' for sustainability and resilience technologies.\nLABEL = 'SUSTECH'\n\nDefine matcher object using english vocab we defined earlier. Test using PhraseMatcher\nto define rules based on the exact patterns the strings will take the form of\n\n#Character patterns to add into our matcher object\nrule_matcher = PhraseMatcher(nlp.vocab)\n\n# create rule patterns - ADD more rule patterns here!\nrule_patterns = ['Green roof', \n            'green roof', \n            'green Roof',\n            'Green Roof',\n            'stormwater pond',\n            'Stormwater Pond',\n            'Stormwater pond',\n            'Stormwater ponds',\n            'stormwater ponds',\n            'Stormwater Ponds',\n            'stormwater Ponds',\n            'Sump Pump',\n            'sump Pump',\n            'sump pump',\n            'Sump Pumps',\n            'sump Pumps',\n            'sump pumps',\n            'Backup Generato

Test for pattern detection defined using PhraseMatcher

In [61]:
'''# A simple test to check our matcher object
test_doc = nlp("I have a broken green roof in my Green Roof hood.")
print(test_doc)

for idx, start, end in rule_matcher(test_doc):
    print(test_doc[start:end],)'''

'# A simple test to check our matcher object\ntest_doc = nlp("I have a broken green roof in my Green Roof hood.")\nprint(test_doc)\n\nfor idx, start, end in rule_matcher(test_doc):\n    print(test_doc[start:end],)'
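
The commented-out pattern list above enumerates every capitalisation by hand. One way to avoid that is to match case-insensitively. The sketch below uses a plain regular expression as a stand-in to show the idea; it is not the PhraseMatcher API, and `find_phrases` is a hypothetical helper. In SpaCy itself, `PhraseMatcher` accepts an `attr` argument (for example `attr="LOWER"`) that serves a similar purpose.

```python
import re


def find_phrases(text, phrases, label):
    """Return (start, end, label) character spans for case-insensitive phrase hits."""
    # Word boundaries stop "stormwater pond" from matching inside "stormwater ponds".
    alternatives = "|".join(re.escape(p) for p in phrases)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


spans = find_phrases(
    "I have a broken green roof in my Green Roof hood.",
    ["green roof", "stormwater pond", "stormwater ponds", "sump pump"],
    "SUSTECH",
)
print(spans)
```

A single lowercase entry per phrase then covers all of its case variants.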

### Token based matching using Matcher - edit to add additional token patterns to capture more strings based on defined token attributes

In [62]:
### Using Matcher - Need to improve
'''
Define matcher object using english vocab we defined earlier. 
Test using Matcher to define rules based on the token attributes of the string.
'''

LABEL = 'SUSTECH'

token_matcher = Matcher(nlp.vocab, validate=True)

# create token patterns - ADD more token patterns here!
'''
token_patterns = [
    [{"LOWER": "green", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}}, {"LOWER": "roof", 'POS': {'IN' : ['NOUN','PROPN']}}],
    [{"LOWER": "stormwater", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}},{"LOWER": "pond", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}}],
    [{"LOWER": "stormwater", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}},{"LOWER": "ponds", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}}],
    [{"LOWER": "sump", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}},{"LOWER": "pump", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}}],
    [{"LOWER": "sump", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}},{"LOWER": "pumps", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}}],
    [{"LOWER": "backup", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}},{"LOWER": "generator", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}}],
]
'''


token_patterns = [
    [{"LOWER": "green"}, {"LOWER": "roof"}],
    [{"LOWER": "sump"}, {"LOWER": "pump"}],
    [{"LOWER": "backwater"},{"LOWER": "valve"}],
    [{"LOWER": "backup"},{"LOWER": "power"}],
    [{"LOWER": "standby"},{"LOWER": "generator"}],
    [{"LOWER": "backup"},{"LOWER": "generator"}],
    [{"LOWER": "emergency"},{"LOWER": "power"}],
    [{"LOWER": "roof"},{"LOWER": "drain"}, {"LOWER": "cover"}],
    [{"LOWER": "roof"},{"LOWER": "drain"}],
    [{"LOWER": "fuel"},{"LOWER": "storage"}, {"LOWER": "tank"}],
    [{"LOWER": "fuel"},{"LOWER": "storage"}],
    [{"LOWER": "downspout"}],
    [{"LOWER": "weeping"},{"LOWER": "tile"}],
    [{"LOWER": "site"},{"LOWER": "grading"}],
    [{"LOWER": "flood"},{"LOWER": "barrier"}],
    [{"LOWER": "below"},{"LOWER": "ground"}, {"LOWER": "floor"}],
    [{"LOWER": "support"},{"LOWER": "ground"}, {"LOWER": "room"}],
    [{"LOWER": "rainwater"}, {"LOWER": "harvest", "LEMMA": "harvest", "POS": {"IN": ["NOUN","VERB"]}}],
    [{"LOWER": "rainwater"}, {"LOWER": "harvesting", "POS": {"IN": ["NOUN"]}}],
    [{"LOWER": "gutter"}],
    [{"LOWER": "rainwater"}, {"LOWER": "gutter"}],

    #plural patterns
    [{"LOWER": "green"}, {"LOWER": "roofs"}],
    [{"LOWER": "sump"}, {"LOWER": "pumps"}],
    [{"LOWER": "backwater"},{"LOWER": "valves"}],
    [{"LOWER": "backup"},{"LOWER": "powers"}],
    [{"LOWER": "backup"},{"LOWER": "generators"}],
    [{"LOWER": "emergency"},{"LOWER": "powers"}],
    [{"LOWER": "roof"},{"LOWER": "drain"}, {"LOWER": "covers"}],
    [{"LOWER": "roof"},{"LOWER": "drains"}],
    [{"LOWER": "fuel"},{"LOWER": "storage"}, {"LOWER": "tanks"}],
    [{"LOWER": "fuel"},{"LOWER": "storages"}],
    [{"LOWER": "downspouts"}],
    [{"LOWER": "weeping"},{"LOWER": "tiles"}],
    [{"LOWER": "site"},{"LOWER": "gradings"}],
    [{"LOWER": "flood"},{"LOWER": "barriers"}],
    [{"LOWER": "below"},{"LOWER": "ground"}, {"LOWER": "floors"}],
    [{"LOWER": "support"},{"LOWER": "ground"}, {"LOWER": "rooms"}],
    #no LEMMA constraint here: the lemma of a plural form is the singular, so it would never match
    [{"LOWER": "rainwater"}, {"LOWER": "harvests", "POS": {"IN": ["NOUN", "VERB"]}}],
    [{"LOWER": "rainwater"}, {"LOWER": "harvestings", "POS": {"IN": ["NOUN"]}}],
    [{"LOWER": "gutters"}],
    [{"LOWER": "rainwater"}, {"LOWER": "gutters"}],

]



token_matcher.add(LABEL, None, *token_patterns)
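
The singular and plural entries above follow a regular shape, so they could also be generated from a list of phrases. A minimal sketch, assuming a naive plural rule (append `s` to the last word) that will not suit every term; `phrase_to_pattern` and `with_plural` are hypothetical helpers, not SpaCy functions.

```python
def phrase_to_pattern(phrase):
    """Turn 'roof drain cover' into one {"LOWER": word} dict per word."""
    return [{"LOWER": word} for word in phrase.lower().split()]


def with_plural(phrase):
    """Return patterns for the phrase and for a naive plural (last word + 's')."""
    return [phrase_to_pattern(phrase), phrase_to_pattern(phrase + "s")]


generated = []
for phrase in ["green roof", "sump pump", "roof drain cover"]:
    generated.extend(with_plural(phrase))

for pattern in generated:
    print(pattern)
```

The resulting list has the same structure as `token_patterns`, so adding a new term becomes a one-word change to the phrase list.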


In [63]:
nlp.pipeline

[('sentencizer', <spacy.pipeline.pipes.Sentencizer at 0x7f3049cae048>),
 ('tagger', <spacy.pipeline.pipes.Tagger at 0x7f304a8ada90>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f304770afa8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f3035f50528>)]

In [64]:
text = (
    '''I have a broken green roof in my Green Roof hood. There is no Stormwater Pond on this roof. I like stormwater ponds. \n
    The backup generator located on the 8th floor is not working. The Sump Pump is so broken. \n
    The rainwater harvest is not working.
    ''')
test_doc = nlp(text)
words_lemmas_list = [token.lemma_ for token in test_doc]

print(test_doc)
print(words_lemmas_list)
#tokenizer = nlp.Defaults.create_tokenizer(nlp)

for idx, start, end in token_matcher(test_doc):
    print(test_doc[start:end],)
    
tokens = [token.text for token in test_doc]
tags = [token.tag_ for token in test_doc]  #tag_ is the readable string; tag is its integer hash
displacy.render(test_doc, style='dep', jupyter=True)

I have a broken green roof in my Green Roof hood. There is no Stormwater Pond on this roof. I like stormwater ponds. 

    The backup generator located on the 8th floor is not working. The Sump Pump is so broken. 

    The rainwater harvest is not working.
    
['-PRON-', 'have', 'a', 'break', 'green', 'roof', 'in', '-PRON-', 'Green', 'Roof', 'hood', '.', 'there', 'be', 'no', 'Stormwater', 'Pond', 'on', 'this', 'roof', '.', '-PRON-', 'like', 'stormwater', 'pond', '.', '\n\n    ', 'the', 'backup', 'generator', 'locate', 'on', 'the', '8th', 'floor', 'be', 'not', 'work', '.', 'the', 'Sump', 'Pump', 'be', 'so', 'broken', '.', '\n\n    ', 'the', 'rainwater', 'harvest', 'be', 'not', 'work', '.', '\n    ']
green roof
Green Roof
backup generator
Sump Pump
rainwater harvest


## Utility function
Recall that to train the NER model, we need annotated text of the form: [(Sentence, {entities: [(start, end, label)]}), ...].

In [65]:
#define our utility function
def train_parser(doc):
    #convert each token-level match into character offsets and attach the label
    position = [(doc[start:end].start_char, doc[start:end].end_char, LABEL) for 
                  idx, start, end in token_matcher(doc)]
    return (doc.text,  {'entities': position})

Tests for our utility function on our test doc object:

In [66]:
train_parser(test_doc)

('I have a broken green roof in my Green Roof hood. There is no Stormwater Pond on this roof. I like stormwater ponds. \n\n    The backup generator located on the 8th floor is not working. The Sump Pump is so broken. \n\n    The rainwater harvest is not working.\n    ',
 {'entities': [(16, 26, 'SUSTECH'),
   (33, 43, 'SUSTECH'),
   (127, 143, 'SUSTECH'),
   (189, 198, 'SUSTECH'),
   (223, 240, 'SUSTECH')]})
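
Several of the token patterns overlap (for example `roof drain` and `roof drain cover`), so the matcher can return nested spans for the same stretch of text, while SpaCy's NER expects entity spans that do not overlap. A minimal sketch of one possible resolution, keeping the longest span and dropping anything that overlaps it; `keep_longest` is a hypothetical helper working on plain offset tuples. SpaCy also ships `spacy.util.filter_spans`, which applies a similar rule to `Span` objects.

```python
def keep_longest(spans):
    """Keep the longest spans and drop any span that overlaps one already kept."""
    # Sort by length (longest first), then by start offset for a stable result.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in ordered:
        overlaps = any(start < k_end and k_start < end for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, label))
    return sorted(kept)


# In "The roof drain cover is loose.", "roof drain" (4-14) sits inside "roof drain cover" (4-20).
spans = [(4, 14, "SUSTECH"), (4, 20, "SUSTECH")]
print(keep_longest(spans))
```

Filtering this way is an alternative to discarding every sentence that produces more than one match.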

## Applying our utility function to our doc object

In [67]:
TRAIN_DATA = [train_parser(d) for d in nlp.pipe(sents) if len(token_matcher(d))==1]
TRAIN_DATA[0:10]
#len(TRAIN_DATA)

[('ANSI/SPRI GD-1  \nStructural Design Standard  \nfor Gutter Systems Used with  \nLow-Slope Roofs\n\nApproved October 7, 2010 \n\nTable of Contents\n\n1.',
  {'entities': [(50, 56, 'SUSTECH')]}),
 ('1.0 \n\nPurpose (See Commentary C1.0)\n\nThis standard provides designers, contractors, and building code  \nofficials information for proper structural design of Gutters used  \nwith low-slope roofing.',
  {'entities': [(158, 165, 'SUSTECH')]}),
 ('2.0 \n\nScope (See Commentary C2.0)\n\nThis standard specifies structural design for external Gutters used  \nwith low-slope, (2 in 12 or less) roofing on buildings less than or equal  \nto 60 ft (18 m) in height.',
  {'entities': [(90, 97, 'SUSTECH')]}),
 ('This standard does not address water removal or the water-carrying \ncapability of the Gutter as other building codes already address this issue .',
  {'entities': [(86, 92, 'SUSTECH')]}),
 ('This Standard does not consider downspouts or leaders.',
  {'entities': [(32, 42, 'SUSTECH')]}),
 (

## Storing annotations

In [68]:
train_path = r'/content/drive/My Drive/Colab Notebooks/nlp-ner-sustain-notebook/train'
TRAIN_DATA_OUTPUT = pd.DataFrame(TRAIN_DATA, columns=['text',
                                                      'position'])
TRAIN_DATA_OUTPUT.to_csv(os.path.join(train_path,'trainset2.csv'), index=False)
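
When a column holding dicts is written to CSV, each dict is saved as its string representation, so the annotations need to be parsed again when the file is loaded. The sketch below shows that round trip with the standard-library `csv` module as a stand-in for pandas, using an in-memory buffer instead of a file on Drive; the single training row is taken from the output shown earlier.

```python
import ast
import csv
import io

train_data = [
    ("This Standard does not consider downspouts or leaders.",
     {"entities": [(32, 42, "SUSTECH")]}),
]

# Write: the csv module stores the annotation dict as its string representation.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "position"])
writer.writerows(train_data)

# Read back: parse the stringified dict into a real dict with ast.literal_eval.
buffer.seek(0)
reader = csv.DictReader(buffer)
restored = [(row["text"], ast.literal_eval(row["position"])) for row in reader]

print(restored == train_data)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.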