# <font size="10">Custom entity recognition </font>
## Training data environment setup

## Introduction:

This notebook is test code for setting up new entities and programmatically creating training data to train the NER model. The goal of the project is to tag environmental/sustainability technologies mentioned in the systems data stored within our capital planning systems.

To skip the creation of training data, go directly to [model training and model implementation](./ner-model-note.ipynb).

## SETUP
Module downloads (necessary only on the first run):

In [1]:
#!pip install spacy
#!pip install pdfminer.six
#!pip install nltk
#!python -m spacy download en_core_web_lg
#!python -m spacy download en_core_web_sm  #needed for the en_core_web_sm import below


Import modules:

In [2]:
import os
import pandas as pd
from pathlib import Path
from io import StringIO

#pdf miner 
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from pdfminer.high_level import extract_text

#SpaCy (the calls in this notebook follow the spaCy v2 API)
import en_core_web_sm
import spacy
from spacy import displacy
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher
from spacy.pipeline import Sentencizer
from spacy.lemmatizer import Lemmatizer, ADJ, NOUN, VERB

#nltk
from nltk import tokenize


## Testing data setup
The code section below sets up the test data for the NER model.

Relative path setup

In [3]:
import os
root = r'./test/'
file = 'systems.csv'
print('Filepath is :',(os.path.join( root, file)))

Filepath is : ./test/systems.csv


Load and store the testing data:

In [4]:
data = pd.read_csv(os.path.join(root, file))

Combine the comment and description columns into one list:

In [5]:
comments = data['System - Comments'].tolist()
description = data['System - Description'].tolist()

#combine both lists of text data
comments.extend(description)

#clean text: drop missing values, which pandas reads as NaN
comments = [x for x in comments if str(x) != 'nan']
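
The filter above works because a missing value is a float NaN, whose string form is `'nan'`. A minimal standard-library sketch of the same idea follows. The helper name `drop_missing` and the sample values are made up for illustration. It tests for NaN explicitly, so a genuine text value of `"nan"` would be kept:

```python
import math

def drop_missing(values):
    """Return values with None and float NaN entries removed (hypothetical helper)."""
    kept = []
    for value in values:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        kept.append(value)
    return kept

# Made-up sample values standing in for the two CSV columns
sample = ["Green roof installed in 2015", float("nan"), "Boiler replaced", None]
print(drop_missing(sample))
```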

## Train data setup
The code section below sets up the training data for the NER model.
The setup consists of the following steps:
* Convert the documents into text
* Set up the SpaCy nlp object
* Load, parse and tokenize the text into a list of sentences
* Apply a utility function that parses the tokenized sentences into the form: [(Sentence, {entities: [(start, end, label)]}), ...].
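
The target format in the last step can be sketched in plain Python. The helper `check_annotations` below is hypothetical (it is not part of SpaCy); it only checks that each `(start, end, label)` span lies inside its sentence and that spans do not overlap. The example sentence, offsets and the `SUSTECH` label are the ones used later in this notebook:

```python
def check_annotations(train_data):
    """Raise ValueError if an entity span is out of range, empty, or overlapping.

    Hypothetical helper; expects [(text, {"entities": [(start, end, label)]}), ...].
    """
    for text, annotations in train_data:
        previous_end = 0
        for start, end, label in sorted(annotations["entities"]):
            if not (0 <= start < end <= len(text)):
                raise ValueError(f"span ({start}, {end}) is outside the text: {text!r}")
            if start < previous_end:
                raise ValueError(f"span ({start}, {end}) overlaps an earlier span: {text!r}")
            previous_end = end
    return True

example = [
    ("I have a broken green roof in my Green Roof hood.",
     {"entities": [(16, 26, "SUSTECH"), (33, 43, "SUSTECH")]}),
]
check_annotations(example)

# Character offsets slice straight back to the entity text
text, annotations = example[0]
for start, end, label in annotations["entities"]:
    print(label, repr(text[start:end]))
```

The offsets are character positions in the sentence string, not token indices, which is why slicing the text with them returns the entity itself.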

SpaCy object declaration and relative path setup:

SpaCy requires training data to be in the format: [(Sentence, {entities: [(start, end, label)]}), ...].
We first need to create a list of the sentences in a doc object that contain the new entity.

In [6]:
#Declare a blank English nlp object (tokenizer and vocab only); sentence boundary rules are added in the next cell
#English() is equivalent to spacy.blank("en"), so one call is enough
nlp = English()
nlp.pipe

<bound method Language.pipe of <spacy.lang.en.English object at 0x000001DEA6F70880>>

Declare the rules for sentence boundary detection and add them to the nlp pipeline:


In [7]:
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
nlp.add_pipe(sentencizer)

Set up the document import. For now, the tests are written for a single doc object:

## Current test case is for the phrase 'green roof'

Document path setup (change this to a higher-level directory to capture more or different docs for other terms):

In [8]:
doc_path = Path(r'./doc/greenr/') #point this at a broader directory to capture more docs
pdf_files = list(doc_path.glob('*.pdf'))  # convert result to a list
print(pdf_files)

#test on single pdf object
#pdf_object =  os.path.join(doc_path, 'example2.pdf')

[WindowsPath('doc/greenr/example1.pdf'), WindowsPath('doc/greenr/example2.pdf'), WindowsPath('doc/greenr/example3.PDF'), WindowsPath('doc/greenr/example4.pdf')]


Convert pdf object into text format to parse:

In [9]:

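texts = []
#use pdf_path rather than file, so the CSV filename stored in `file` earlier is not overwritten
for pdf_path in pdf_files:
    with open(pdf_path, 'rb') as pdf:
        text = extract_text(pdf)
        texts.append(text)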
#check object
#texts

### Sentence tokenization using SpaCy Sentencizer

In [10]:
# alternative code for sentence tokenization using nltk
# tokens = tokenize.sent_tokenize(text)

#sentence tokenization using english vocab and SpaCy Sentencizer
#declare doc object to pass into our nlp pipeline
all_text = " ".join(str(x) for x in texts)    
doc = nlp(all_text)

#check number of sentences detected by Sentencizer
print("Number of lines :", (len(list(doc.sents))))

# create a list of tokenized sentences
sents = [sent.text.strip() for sent in doc.sents]
#check sents list object
#sents

Number of lines : 2830
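
Text extracted from PDFs usually keeps layout whitespace; the sample training sentences further down contain `\n` line breaks and a `\x0c` form feed. The sketch below shows one possible cleanup step using only the standard library. The helper name `normalize_whitespace` is hypothetical, the sample string is adapted from that output, and whether this cleanup is wanted here is an assumption:

```python
import re

def normalize_whitespace(text):
    """Collapse runs of whitespace (newlines, tabs, form feeds) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

# Sample adapted from the extracted PDF text shown later in this notebook
raw = "A green roof during winter \n\n \n\nSource: National Research Council Canada, 2006 \n\n\x0c"
print(normalize_whitespace(raw))
# → A green roof during winter Source: National Research Council Canada, 2006
```

If a step like this is used, it would need to run on `all_text` before `nlp(all_text)`, so that the character offsets computed later refer to the cleaned text.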


## DECLARE NEW CUSTOM ENTITY
The code below sets up and defines the new custom entity and prepares the training data for the NER model.

## New entity creation

### Rule-based matching using PhraseMatcher - edit to add additional phrase patterns to capture more strings

In [11]:
#Entity creation
#Label setup: 'SUSTECH' stands for sustainability and resilience technologies.
LABEL = 'SUSTECH'

#Define the matcher object using the English vocab defined earlier.
#PhraseMatcher rules are based on the exact form the strings take.
rule_matcher = PhraseMatcher(nlp.vocab)

#Phrase patterns to add to the matcher object - ADD more rule patterns here!
rule_patterns = ['Green roof',
                 'green roof',
                 'green Roof',
                 'Green Roof']

for pattern in rule_patterns:
    rule_matcher.add(LABEL, None, nlp(pattern))

Test the patterns defined with PhraseMatcher:

In [12]:
# A simple test to check our matcher object
test_doc = nlp("I have a broken green roof in my Green Roof hood.")
print(test_doc)

#each match is a (match_id, start_token, end_token) tuple
for match_id, start, end in rule_matcher(test_doc):
    print(test_doc[start:end])

I have a broken green roof in my Green Roof hood.
green roof
Green Roof


### Token-based matching using Matcher - edit to add additional token patterns to capture more strings

In [13]:
#Using Matcher

#Define the matcher object using the English vocab defined earlier.
#Matcher rules are based on the token attributes of the string.
token_matcher = Matcher(nlp.vocab, validate=True)

#create token patterns - ADD more token patterns here!
#LOWER compares each token's lowercase form, so any capitalization matches
token_patterns = [[{"LOWER": "green"},
                   {"LOWER": "roof"}],
                 ]

token_matcher.add(LABEL, None, *token_patterns)


In [14]:
test_doc = nlp("I have a broken green roof in my Green Roof hood.")
print(test_doc)

#each match is a (match_id, start_token, end_token) tuple
for match_id, start, end in token_matcher(test_doc):
    print(test_doc[start:end])

I have a broken green roof in my Green Roof hood.
green roof
Green Roof


## Utility function
Recall that to train the NER model, we need annotated, tokenized text in the form: [(Sentence, {entities: [(start, end, label)]}), ...].

In [15]:
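#define our utility function
def train_parser(doc):
    """Return (text, {'entities': [(start_char, end_char, LABEL), ...]}) for each token_matcher hit in doc."""
    #convert token-level matches into spans to read their character offsets
    spans = [doc[start:end] for match_id, start, end in token_matcher(doc)]
    position = [(span.start_char, span.end_char, LABEL) for span in spans]
    return (doc.text, {'entities': position})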


Test the utility function on the test doc object:

In [16]:
train_parser(test_doc)

('I have a broken green roof in my Green Roof hood.',
 {'entities': [(16, 26, 'SUSTECH'), (33, 43, 'SUSTECH')]})
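
As an independent cross-check of these offsets, the same tuple can be produced without SpaCy. The sketch below swaps in a case-insensitive regular expression for the token pattern, so it only approximates the Matcher (it works on characters, not tokens). The function name `regex_train_parser` is hypothetical:

```python
import re

LABEL = "SUSTECH"
# Case-insensitive stand-in for the token pattern [{"LOWER": "green"}, {"LOWER": "roof"}]
PATTERN = re.compile(r"\bgreen\s+roof\b", re.IGNORECASE)

def regex_train_parser(text):
    """Return (text, {'entities': [(start, end, LABEL), ...]}) built from regex matches."""
    entities = [(m.start(), m.end(), LABEL) for m in PATTERN.finditer(text)]
    return (text, {"entities": entities})

result = regex_train_parser("I have a broken green roof in my Green Roof hood.")
print(result)
# → ('I have a broken green roof in my Green Roof hood.', {'entities': [(16, 26, 'SUSTECH'), (33, 43, 'SUSTECH')]})
```

Agreement between the two outputs suggests the character offsets are right; a mismatch on real data would more likely point to tokenization differences than to an error in either approach.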

## Applying our utility function to our doc object

In [21]:
#keep only the sentences that contain exactly one match
TRAIN_DATA = [train_parser(d) for d in nlp.pipe(sents) if len(token_matcher(d))==1]
TRAIN_DATA[0:10]
#len(TRAIN_DATA)


[('From a hydrologic \nperspective, the green roof acts like a lawn or meadow by storing rainwater in the \ngrowing medium and ponding areas.',
  {'entities': [(36, 46, 'SUSTECH')]}),
 ('Guidance in this guide \nfocuses on extensive green roof design.',
  {'entities': [(45, 55, 'SUSTECH')]}),
 ('Some \nmunicipalities, such as the City of Toronto, offer green roof incentive programs \nthat should be considered in the cost assessment.',
  {'entities': [(57, 67, 'SUSTECH')]}),
 ('A study of the life cycle costs \nand savings of building and owning a green roof in the Greater Toronto Area was \nundertaken by TRCA (2007a).',
  {'entities': [(70, 80, 'SUSTECH')]}),
 ('4-24 \n\nVersion 1.0 \n\n\x0c \n\n \n\nLow Impact Development Stormwater Management Planning and Design Guide \n\nFigure 4.2.2  A green roof during winter \n\n \n\nSource: National Research Council Canada, 2006 \n\n \nPhysical Suitability and Constraints \nGreen roofs are physically feasible in most development situations, but 

## Storing annotations

In [24]:
train_path = r'./train/'
TRAIN_DATA_OUTPUT = pd.DataFrame(TRAIN_DATA, columns=['text',
                                                      'position'])
TRAIN_DATA_OUTPUT.to_csv(os.path.join(train_path,'train.csv'), index=False)
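
When the file is read back, the `position` column arrives as text rather than as a dict. The sketch below illustrates the round trip with the standard-library `csv` module standing in for pandas and an in-memory buffer standing in for `train.csv`. It assumes each annotation dict is stored as its string representation, which is how a dict in an object column would normally be written. The sample row is taken from the output above:

```python
import ast
import csv
import io

train_data = [
    ("Guidance in this guide \nfocuses on extensive green roof design.",
     {"entities": [(45, 55, "SUSTECH")]}),
]

# Write the rows; the annotation dict is stored as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "position"])
for text, annotations in train_data:
    writer.writerow([text, str(annotations)])

# Read the rows back and parse the string form into a dict again
buffer.seek(0)
reader = csv.DictReader(buffer)
restored = [(row["text"], ast.literal_eval(row["position"])) for row in reader]
print(restored == train_data)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing the stored annotations.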