# Custom entity recognition
## Training data environment setup

## Introduction:

This notebook is test code for setting up new entities and programmatically creating the training data used to train the NER model. The goal of the project is to tag environmental/sustainability technologies mentioned in the systems data stored within our capital planning systems.

To skip the creation of training data, go directly to [model training and model implementation](./ner-model-note.ipynb).

## SETUP
Module downloads (needed only on the first run):

In [1]:
#!pip install spacy
#!pip install pdfminer.six
# !python -m spacy download en_core_web_md
#general imports 
#!pip install nltk 







[+] Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


Import modules:

In [2]:
import os
import pandas as pd
from pathlib import Path
import glob
from io import StringIO

#pdf miner 
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from pdfminer.high_level import extract_text

#SpaCy 
import en_core_web_md
import spacy
from spacy import displacy
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher
from spacy.pipeline import Sentencizer
from spacy.lemmatizer import Lemmatizer, ADJ, NOUN, VERB
from spacy.lookups import Lookups
#nltk
from nltk import tokenize


## Testing data setup
The code below sets up the test data for the NER model.

Relative path setup

In [3]:
import os
root = r'./test/'
file = 'systems.csv'
print('Filepath is :',(os.path.join( root, file)))

Filepath is : ./test/systems.csv


Load and store the testing data:

In [4]:
data = pd.read_csv(os.path.join(root, file))

Append comments together:

In [5]:
#combine the two free-text columns into one list
comments = list(data['System - Comments']) + list(data['System - Description'])

#clean text: drop missing values (NaN)
comments = [x for x in comments if pd.notna(x)]
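
The combine-and-clean step can be illustrated without pandas. The sketch below is a minimal plain-Python version; the function name `combine_text_columns` and the sample values are hypothetical and are not taken from `systems.csv`. It assumes that missing cells appear as `None` or float NaN, which is how pandas usually represents empty cells in a text column.

```python
import math

def combine_text_columns(*columns):
    """Concatenate several columns of free text and drop missing values.

    Assumption: a missing value is either None or a float NaN.
    """
    combined = []
    for column in columns:
        for value in column:
            if value is None:
                continue
            if isinstance(value, float) and math.isnan(value):
                continue
            combined.append(str(value))
    return combined

# Hypothetical sample values standing in for the two CSV columns
sample_comments = ["Green roof installed in 2015", float("nan")]
sample_descriptions = [None, "Sump pump in basement"]
cleaned = combine_text_columns(sample_comments, sample_descriptions)
print(cleaned)
```

Checking the type before calling `math.isnan` avoids the edge case in a string comparison such as `str(x) != 'nan'`, which would also discard a cell whose actual text is "nan".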

## Train data setup
The code below sets up the training data for the NER model.
The process consists of the following steps:
* Convert documents into text
* Set up the SpaCy nlp object
* Load, parse and tokenize the text into a list of sentences
* Use a utility function to convert the tokenized sentences into the form: [(Sentence, {entities: [(start, end, label)]}), ...].

SpaCy object declaration and relative path setup:

SpaCy requires training data in the format: [(Sentence, {entities: [(start, end, label)]}), ...].
We first need to build a list of the sentences in the doc object that contain the new entity.
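
To make the target format concrete, here is a small hand-written example together with a sanity check. This is only a sketch: the two sentences and the helper `check_annotations` are hypothetical, and `SUSTECH` is the label defined later in this notebook. The offsets are character positions, with `end` exclusive.

```python
def check_annotations(train_data):
    """Return (span_text, label) for every annotated span.

    Raises ValueError if an offset pair falls outside its sentence.
    """
    found = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            if not (0 <= start < end <= len(text)):
                raise ValueError(f"bad span {(start, end)} for text {text!r}")
            found.append((text[start:end], label))
    return found

# Hypothetical sentences annotated by hand
example = [
    ("The green roof is leaking.", {"entities": [(4, 14, "SUSTECH")]}),
    ("The sump pump failed.", {"entities": [(4, 13, "SUSTECH")]}),
]
spans = check_annotations(example)
print(spans)
```

Slicing the sentence with the stored offsets is a quick way to confirm that each annotation covers the intended phrase before training.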

In [11]:
#Load the pretrained English model; its vocab and pipeline are used for sentence boundary detection
nlp = en_core_web_md.load()
nlp.max_length = 6130000 # raise the character limit for long texts (or higher)

nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x263b7425460>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x263c93c5460>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x263c93c5ee0>)]

Declare rules for sentence boundary detection logic and add to the nlp pipe object.


In [12]:
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
nlp.add_pipe(sentencizer, before="tagger")
nlp.pipeline

[('sentencizer', <spacy.pipeline.pipes.Sentencizer at 0x263cc3a9cd0>),
 ('tagger', <spacy.pipeline.pipes.Tagger at 0x263b7425460>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x263c93c5460>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x263c93c5ee0>)]

Set up the document import; tests are written for a single doc object for now:

## Current test case is for the word 'green roof'

Document path setup (can change to a higher level directory to capture more/different docs for other terms)

In [8]:
#Earlier single-directory version, kept for reference:
#doc_path = Path(r'./doc/stormw') #change this for broader directory to capture
#pdf_objects = list(doc_path.glob('stormw4.pdf'))  # convert result to a list
#print(pdf_objects)

#Set up a higher-level directory search to import more docs
dir_path = r'./doc/' #change this for broader directory to capture
pdf_objects = glob.glob(dir_path + '/**/*.pdf', recursive=True)
print(pdf_objects)
#test on single pdf object
#pdf_object =  os.path.join(doc_path, 'example2.pdf')

['./doc\\backg\\backg1.pdf', './doc\\backg\\backg2.pdf', './doc\\backp\\backp1.pdf', './doc\\backw\\backw1.pdf', './doc\\backw\\backw2.pdf', './doc\\backw\\backw3.pdf', './doc\\downs\\downs1.pdf', './doc\\downs\\downs2.pdf', './doc\\downs\\downs3.pdf', './doc\\downs\\downs4.pdf', './doc\\floodb\\floodb1.pdf', './doc\\floodb\\floodb2.pdf', './doc\\fuels\\fuels1.pdf', './doc\\fuels\\fuels2.pdf', './doc\\fuels\\fuels3.pdf', './doc\\greenr\\greenr1.pdf', './doc\\greenr\\greenr2.pdf', './doc\\greenr\\greenr3.PDF', './doc\\greenr\\greenr4.pdf', './doc\\gutter\\gutter1.pdf', './doc\\gutter\\gutter2.pdf', './doc\\gutter\\gutter3.pdf', './doc\\rainw\\rainw1.pdf', './doc\\rainw\\rainw2.pdf', './doc\\rainw\\rainw3.pdf', './doc\\roofd\\roofd1.pdf', './doc\\roofd\\roofd2.pdf', './doc\\siteg\\siteg2.pdf', './doc\\stormw\\stormw1.pdf', './doc\\stormw\\stormw2.pdf', './doc\\stormw\\stormw3.pdf', './doc\\stormw\\stormw4.pdf', './doc\\stormw\\stormw5.pdf', './doc\\stormw\\stormw6.pdf', './doc\\sumpp\\su
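
The listing above includes `greenr3.PDF`, presumably because glob matching is case-insensitive on Windows. On a case-sensitive filesystem the pattern `*.pdf` would skip that file. A minimal sketch of a platform-independent search follows; the helper name `find_pdfs` and the throwaway directory are hypothetical.

```python
from pathlib import Path
import tempfile

def find_pdfs(root):
    """Recursively list PDF files under root, ignoring extension case."""
    return sorted(
        str(p) for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

# Demonstration on a temporary directory with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "greenr").mkdir()
    for name in ("greenr1.pdf", "greenr3.PDF", "notes.txt"):
        (Path(tmp) / "greenr" / name).write_bytes(b"")
    found_names = [Path(p).name for p in find_pdfs(tmp)]
print(found_names)
```

Comparing the lowercased suffix instead of relying on the glob pattern keeps the document set the same across operating systems.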

Convert pdf object into text format to parse:

In [9]:
texts = []
for file in pdf_objects:
    with open(file,'rb') as pdf:
        print('loading pdf...')
        text = extract_text(pdf)
        texts.append(text)
        print('pdf appended..')

print('pdf read complete...')
#check object
#texts

loading pdf...
pdf appended..

### Sentence tokenization using SpaCy Sentencizer

In [13]:
# alternative code for sentence tokenization using nltk
# tokens = tokenize.sent_tokenize(text)

#sentence tokenization using english vocab and SpaCy Sentencizer
#declare doc object to pass into our nlp pipeline
all_text = " ".join(str(x) for x in texts)    
doc = nlp(all_text)

#check number of sentences detected by Sentencizer
print("Number of lines :", (len(list(doc.sents))))

# create a list of tokenized sentences
sents = [sent.text.strip() for sent in doc.sents]

#lemmatization

#check sents list object
#sents

MemoryError: Unable to allocate 333. MiB for an array with shape (908373, 96) and data type int32
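
The `MemoryError` comes from parsing every PDF as one very large doc. One possible workaround, sketched here under assumptions, is to split the text into smaller pieces and feed them through the pipeline one at a time, for example with `nlp.pipe`. The helper `chunk_text` and its `max_chars` default are hypothetical and would need tuning to the available memory.

```python
def chunk_text(text, max_chars=100_000):
    """Split text into pieces of at most max_chars characters,
    breaking at a space where possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so a word is not cut in half
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks

# Small demonstration with a hypothetical text and a tiny limit
sample = "green roof " * 5
pieces = chunk_text(sample, max_chars=25)
print(pieces)
```

A chunk boundary can fall in the middle of a sentence, so a sentence near a boundary may be split in two. Passing the per-document strings in `texts` to the pipeline separately, instead of joining them first, is a simpler alternative when no single PDF is too large.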

## DECLARE NEW CUSTOM ENTITY
The code below sets up and defines the new custom entity and prepares the training data for the NER model.

## New entity creation

### Rule-based matching using PhraseMatcher - edit to add additional rules to capture more strings

In [None]:
'''#Entity creation -- works okay...
#Label setup, we provide the label 'SUSTECH' for sustainability and resilience technologies.
LABEL = 'SUSTECH'

Define matcher object using english vocab we defined earlier. Test using PhraseMatcher
to define rules based on the exact patterns the strings will take the form of

#Character patterns to add into our matcher object
rule_matcher = PhraseMatcher(nlp.vocab)

# create rule patterns - ADD more rule patterns here!
rule_patterns = ['Green roof', 
            'green roof', 
            'green Roof',
            'Green Roof',
            'stormwater pond',
            'Stormwater Pond',
            'Stormwater pond',
            'Stormwater ponds',
            'stormwater ponds',
            'Stormwater Ponds',
            'stormwater Ponds',
            'Sump Pump',
            'sump Pump',
            'sump pump',
            'Sump Pumps',
            'sump Pumps',
            'sump pumps',
            'Backup Generator',
            'backup Generator',
            'Backup generator',
            'backup generator',
            'Backup Generators',
            'backup Generators',
            'Backup generators',
            'backup generators']

for i in rule_patterns: 
    rule_matcher.add(LABEL, None, nlp(i))'''

Test the pattern detection defined using PhraseMatcher:

In [None]:
'''# A simple test to check our matcher object
test_doc = nlp("I have a broken green roof in my Green Roof hood.")
print(test_doc)

for idx, start, end in rule_matcher(test_doc):
    print(test_doc[start:end],)'''

### Token based matching using Matcher - edit to add additional token patterns to capture more strings based on defined token attributes

In [None]:
### Using Matcher - Need to improve
'''
Define matcher object using english vocab we defined earlier. 
Test using Matcher to define rules based on the token attributes of the string.
'''

LABEL = 'SUSTECH'

token_matcher = Matcher(nlp.vocab, validate=True)

# create token patterns - ADD more token patterns here!
'''
token_patterns = [
    [{"LOWER": "green", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}}, {"LOWER": "roof", 'POS': {'IN' : ['NOUN','PROPN']}}],
    [{"LOWER": "stormwater", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}},{"LOWER": "pond", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}}],
    [{"LOWER": "stormwater", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}},{"LOWER": "ponds", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}}],
    [{"LOWER": "sump", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}},{"LOWER": "pump", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}}],
    [{"LOWER": "sump", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}},{"LOWER": "pumps", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}}],
    [{"LOWER": "backup", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}},{"LOWER": "generator", 'POS': {'IN' : ['ADJ','NOUN', 'PROPN']}}],
]
'''


token_patterns = [
    [{"LOWER": "green"}, {"LOWER": "roof"}],
    [{"LOWER": "sump"}, {"LOWER": "pump"}],
    [{"LOWER": "backwater"},{"LOWER": "valve"}],
    [{"LOWER": "backup"},{"LOWER": "power"}],
    [{"LOWER": "roof"},{"LOWER": "drain"}, {"LOWER": "cover"}],
    [{"LOWER": "fuel"},{"LOWER": "storage"}, {"LOWER": "tank"}],
    [{"LOWER": "downspout"}],
    [{"LOWER": "weeping"},{"LOWER": "tiles"}],
    [{"LOWER": "site"},{"LOWER": "grading"}],
    [{"LOWER": "flood"},{"LOWER": "barrier"}],
    [{"LOWER": "below"},{"LOWER": "ground"}, {"LOWER": "floor"}],
    [{"LOWER": "support"},{"LOWER": "ground"}, {"LOWER": "room"}],
    [{"LOWER": "rainwater"}, {"LOWER": "harvest", "LEMMA": "harvest", "POS": {"IN": ["NOUN","VERB"]}}],
    [{"LOWER": "rainwater"}, {"LOWER": "harvesting", "POS": {"IN": ["NOUN"]}}],
    [{"LOWER": "gutter"}],
    #plural patterns
    [{"LOWER": "green"}, {"LOWER": "roofs"}],
    [{"LOWER": "sump"}, {"LOWER": "pumps"}],
    [{"LOWER": "backwater"},{"LOWER": "valves"}],
    [{"LOWER": "backup"},{"LOWER": "powers"}],
    [{"LOWER": "roof"},{"LOWER": "drain"}, {"LOWER": "covers"}],
    [{"LOWER": "fuel"},{"LOWER": "storage"}, {"LOWER": "tanks"}],
    [{"LOWER": "downspouts"}],
    #"weeping tiles" is already plural and is covered by the pattern above
    [{"LOWER": "site"},{"LOWER": "gradings"}],
    [{"LOWER": "flood"},{"LOWER": "barriers"}],
    [{"LOWER": "below"},{"LOWER": "ground"}, {"LOWER": "floors"}],
    [{"LOWER": "support"},{"LOWER": "ground"}, {"LOWER": "rooms"}],
    #LEMMA constraints dropped here: the lemma of a plural is normally its singular form, so "LEMMA": "harvests" would be unlikely to match
    [{"LOWER": "rainwater"}, {"LOWER": "harvests", "POS": {"IN": ["NOUN", "VERB"]}}],
    [{"LOWER": "rainwater"}, {"LOWER": "harvestings", "POS": {"IN": ["NOUN"]}}],
    [{"LOWER": "gutters"}]
]



token_matcher.add(LABEL, None, *token_patterns)
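
Because the singular and plural patterns above differ only in the last word, they could be generated from a list of phrases instead of being typed by hand. The sketch below builds the same kind of `{"LOWER": ...}` pattern lists in plain Python. The helper `lower_patterns` is hypothetical, its plural rule is a naive "add s", and it assumes each whitespace-separated word corresponds to one token.

```python
def lower_patterns(phrases, add_plural=True):
    """Build Matcher-style token patterns from space-separated phrases.

    Each phrase becomes a list of {"LOWER": word} dicts. When add_plural
    is True, a second pattern with "s" appended to the last word is added,
    unless that word already ends in "s".
    """
    patterns = []
    for phrase in phrases:
        words = phrase.lower().split()
        patterns.append([{"LOWER": w} for w in words])
        if add_plural and not words[-1].endswith("s"):
            plural = words[:-1] + [words[-1] + "s"]
            patterns.append([{"LOWER": w} for w in plural])
    return patterns

# Hypothetical subset of the phrases used in this notebook
generated = lower_patterns(["green roof", "sump pump", "weeping tiles"])
for pattern in generated:
    print(pattern)
```

The generated list has the same shape as `token_patterns`, so it could be passed to the matcher in the same way. Patterns that need extra attributes such as `POS` would still have to be written by hand.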


In [None]:
nlp.pipeline

In [None]:
text = (
    '''I have a broken green roof in my Green Roof hood. There is no Stormwater Pond on this roof. I like stormwater ponds. \n
    The backup generator located on the 8th floor is not working. The Sump Pump is so broken. \n
    The rainwater harvest is not working.
    ''')
test_doc = nlp(text)
words_lemmas_list = [token.lemma_ for token in test_doc]

print(test_doc)
print(words_lemmas_list)
#tokenizer = nlp.Defaults.create_tokenizer(nlp)

for idx, start, end in token_matcher(test_doc):
    print(test_doc[start:end],)
    
tokens = [token.text for token in test_doc]
tags = [token.tag_ for token in test_doc] #tag_ is the readable tag string; tag is its integer hash
displacy.render(test_doc, style='dep', jupyter=True)

## Utility function
Recall that to train the NER model, we need annotated text in the form: [(Sentence, {entities: [(start, end, label)]}), ...], where start and end are character offsets into the sentence.

In [None]:
#define our utility function: converts each token-level match into (start_char, end_char, LABEL) character offsets
def train_parser(doc):
    position = [(doc[start:end].start_char, doc[start:end].end_char, LABEL) for 
                  idx, start, end in token_matcher(doc)]
    return (doc.text,  {'entities': position})

Tests for our utility function on our test doc object:

In [None]:
train_parser(test_doc)

## Applying our utility function to our doc object

In [None]:
#keep only the sentences that contain exactly one match
TRAIN_DATA = [train_parser(d) for d in nlp.pipe(sents) if len(token_matcher(d))==1]
TRAIN_DATA[0:10]
#len(TRAIN_DATA)

## Storing annotations

In [None]:
train_path = r'./train/'
TRAIN_DATA_OUTPUT = pd.DataFrame(TRAIN_DATA, columns=['text',
                                                      'position'])
TRAIN_DATA_OUTPUT.to_csv(os.path.join(train_path,'train.csv'), index=False)
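
When the annotations are written to CSV, the `position` dictionaries should end up stored as text, so they need to be parsed back into Python objects before training. A minimal sketch of that round trip follows, using only the standard library. The helper `rows_to_train_data` and the single sample row are hypothetical, and the file is simulated in memory.

```python
import ast
import csv
import io

def rows_to_train_data(rows):
    """Rebuild (text, {"entities": [...]}) tuples from CSV rows whose
    'position' column holds the string form of the annotation dict."""
    return [(row["text"], ast.literal_eval(row["position"])) for row in rows]

# Simulate the saved file in memory with one hypothetical row
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "position"])
writer.writeheader()
writer.writerow({
    "text": "The green roof is leaking.",
    "position": str({"entities": [(4, 14, "SUSTECH")]}),
})
buffer.seek(0)
restored = rows_to_train_data(csv.DictReader(buffer))
print(restored)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for reading the stored dictionaries. Saving the data as JSON would be another option, at the cost of the tuples coming back as lists.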