# <font size="10">Custom entity recognition </font>
## Training environment setup

This notebook is test code that sets up new entities and training data for the NER model. The goal of the project is to tag environmental/sustainability technologies mentioned in the systems data stored in our capital planning systems.

Module downloads (only needed the first time the notebook is run):

In [None]:
#!pip install spacy
#!pip install pdfminer.six
#!python -m spacy download en_core_web_lg
#general imports 
#!pip install nltk 


Import modules:

In [72]:
import os
import pandas as pd

#pdf miner 
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

#SpaCy
import en_core_web_lg  # matches the model downloaded in the setup cell
import spacy
from spacy import displacy
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher
from spacy.pipeline import Sentencizer
from spacy.lemmatizer import Lemmatizer, ADJ, NOUN, VERB

#nltk
from nltk import tokenize


## Testing data setup
The code section below sets up the test data for the NER model.

Relative path setup

In [15]:
import os
root = r'./test/'
file = 'systems.csv'
print('Filepath is :',(os.path.join( root, file)))

Filepath is : ./test/systems.csv


Load and store the testing data:

In [9]:
data = pd.read_csv(os.path.join(root, file))

Append comments together:

In [None]:
comments = list(data['System - Comments'])
description = list(data['System - Description'])

# combine both text columns into a single list
comments.extend(description)

# drop missing values (NaN entries read from empty cells)
comments = [x for x in comments if pd.notna(x)]
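
The combine-and-clean step can also be written with pandas alone. The sketch below is illustrative only: the `sample` DataFrame is a hypothetical stand-in for `systems.csv`, and only the two column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for systems.csv; only the column names come from the notebook.
sample = pd.DataFrame({
    "System - Comments": ["Roof membrane replaced", None],
    "System - Description": [float("nan"), "Green roof over gymnasium"],
})

# Stack the two text columns (comments first), drop missing values, convert to a list.
combined = (
    pd.concat([sample["System - Comments"], sample["System - Description"]])
    .dropna()
    .tolist()
)
print(combined)
```

`dropna()` removes both `None` and `NaN`, so no string comparison against `'nan'` is needed.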

## Train data setup
The code section below sets up the training data for the NER model.
The process consists of the following steps:
* Convert documents into text
* Set up the SpaCy nlp object
* Load, parse and tokenize the text into a list of sentences
* Use a utility function to parse the tokenized sentences into the form: [(sentence, {'entities': [(start, end, label)]}), ...].
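
To make the target format concrete, here is a minimal sketch of one annotated example built by hand. The sentence is hypothetical and `str.find` is used only for illustration; it is not how the notebook locates entities.

```python
# Hypothetical sentence, for illustration only.
sentence = "The school installed a green roof last year."
phrase = "green roof"
label = "SUSTECH"

start = sentence.find(phrase)  # character offset where the entity starts
end = start + len(phrase)      # exclusive character offset where it ends

# One training example: (text, {"entities": [(start, end, label)]})
example = (sentence, {"entities": [(start, end, label)]})
train_data = [example]

print(example)
```

The offsets are character positions in the sentence, not token indices, so `sentence[start:end]` recovers the entity text.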

SpaCy object declaration and relative path setup:

SpaCy requires training data to be in the format: [(sentence, {'entities': [(start, end, label)]}), ...].
We first need to create a list of the sentences in a doc object that contain the new entity.

In [63]:
# Create a blank English pipeline; its vocab and tokenizer are used for sentence boundary detection
nlp = English()
nlp.pipe

<bound method Language.pipe of <spacy.lang.en.English object at 0x000002A76FE08580>>

Declare the rules for sentence boundary detection and add them to the nlp pipeline.


In [64]:
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
nlp.add_pipe(sentencizer)

Set up the document import. For now, tests are written against a single document:

In [53]:
doc_path = r'./doc/greenr/'
pdf_object =  os.path.join(doc_path, 'example2.pdf')

Convert pdf object into text format to parse:

In [54]:
from pdfminer.high_level import extract_text
with open(pdf_object,'rb') as pdf:
    text = extract_text(pdf)
    
#check object
#text

### Sentence tokenization using SpaCy Sentencizer

In [61]:
# alternative code for sentence tokenization using nltk
# tokens = tokenize.sent_tokenize(text)

#sentence tokenization using english vocab and SpaCy Sentencizer
#declare doc object to pass into our nlp pipeline
doc = nlp(text)

#check number of sentences detected by Sentencizer
len(list(doc.sents))

# create a list of tokenized sentences
sents = [sent.text for sent in doc.sents]
#check sents list object
# sents

## Declare new custom entity
The code below defines the new custom entity and prepares the training data for the NER model.

## New entity creation

### Pattern detection using PhraseMatcher

In [77]:
#Entity creation
#Label setup, we provide the label 'SUSTECH' for sustainability and resilience technologies.
label = 'SUSTECH'

# Define the matcher using the English vocab declared earlier. PhraseMatcher
# matches the exact token sequences supplied as patterns.

# Phrase patterns to add to our matcher object
matcher = PhraseMatcher(nlp.vocab)
patterns = ['Green roof', 
            'green roof', 
            'green Roof',
            'Green Roof']

for i in patterns: 
    matcher.add(label, None, nlp(i))

Test the pattern detection defined using PhraseMatcher:

In [78]:
# A simple test to check our matcher object
doc = nlp("I have a broken green roof in my Green Roof hood.")
print(doc)

for match_id, start, end in matcher(doc):
    print(doc[start:end])


I have a broken green roof in my Green Roof hood.
green roof
Green Roof
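
Listing every capitalisation variant by hand does not scale. As a hedged alternative, `PhraseMatcher` can compare tokens on their lowercase form via `attr="LOWER"`. The sketch assumes a spaCy version (2.2.2 or later) where `add` accepts a list of patterns. The names `blank_nlp` and `ci_matcher` are illustrative and not part of the notebook.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Separate blank pipeline so the sketch does not touch the notebook's nlp object.
blank_nlp = spacy.blank("en")

# attr="LOWER" compares lowercase token text, so one pattern covers all casings.
ci_matcher = PhraseMatcher(blank_nlp.vocab, attr="LOWER")
ci_matcher.add("SUSTECH", [blank_nlp.make_doc("green roof")])

test_doc = blank_nlp("I have a broken green roof in my Green Roof hood.")

# Sort by start token so the order of results does not depend on the matcher.
matches = sorted(ci_matcher(test_doc), key=lambda m: m[1])
found = [test_doc[start:end].text for _match_id, start, end in matches]
print(found)
```

With this setting a single pattern also catches variants such as "GREEN ROOF" that the four explicit patterns would miss.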


### Pattern detection using Matcher - (to be defined)

In [76]:
# Using Matcher

# Define a token-based matcher using the English vocab declared earlier.
# Matcher rules are based on token attributes rather than exact phrases.
# It gets its own name so the PhraseMatcher defined earlier is not overwritten.
token_matcher = Matcher(nlp.vocab)


## Utility function
Recall that to train the NER model, we need annotated text in the form: [(sentence, {'entities': [(start, end, label)]}), ...].

In [79]:
#define our utility function
def train_parser(doc):
    """Return (text, {'entities': [(start_char, end_char, label), ...]}) for a doc."""
    detections = [(doc[start:end].start_char, doc[start:end].end_char, label)
                  for _match_id, start, end in matcher(doc)]
    return (doc.text, {'entities': detections})


Tests for our utility function:

In [80]:
train_parser(doc)

('I have a broken green roof in my Green Roof hood.',
 {'entities': [(16, 26, 'SUSTECH'), (33, 43, 'SUSTECH')]})

## Applying our utility function to our doc object

In [81]:
# keep only the sentences that contain exactly one match
TRAIN_DATA = [train_parser(d) for d in nlp.pipe(sents) if len(matcher(d)) == 1]
TRAIN_DATA[5:8]

[('As green infrastructure, green roof benefits \nare  plenty.',
  {'entities': [(25, 35, 'SUSTECH')]}),
 (' \n \nSection 4 explores the need for a green roof policy for Malta.',
  {'entities': [(38, 48, 'SUSTECH')]}),
 ('  \n \nSection 9, looks at local planning and construction policies to identify whether they support \ngreen roof technology.',
  {'entities': [(100, 110, 'SUSTECH')]})]
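
Because the annotations are character offsets, a quick sanity check is to slice each sentence with its offsets and confirm the expected phrase comes back. The helper `entity_texts` below is a hypothetical name introduced for this sketch. The example tuple is copied from the output above.

```python
def entity_texts(example):
    """Return (label, text) pairs obtained by slicing the sentence with each span."""
    text, annotations = example
    return [(label, text[start:end]) for start, end, label in annotations["entities"]]

# Example copied from the TRAIN_DATA output.
example = (
    "As green infrastructure, green roof benefits \nare  plenty.",
    {"entities": [(25, 35, "SUSTECH")]},
)

print(entity_texts(example))
```

Any span that does not slice to the intended phrase points to misaligned offsets, which would otherwise only show up as poor training results.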

## Storing results

In [86]:
train_path = r'./train/'
TRAIN_DATA_OUTPUT = pd.DataFrame(TRAIN_DATA)
TRAIN_DATA_OUTPUT.to_csv(os.path.join(train_path,'train.csv'))
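
Note that a CSV stores the annotation dictionary as its string representation, so it has to be parsed again when the file is read back. One possible alternative, shown here only as a sketch, is to write one JSON object per line with the standard `json` module. The function names, the `train.jsonl` file name and the temporary directory are assumptions made for illustration.

```python
import json
import os
import tempfile

# Small sample in the same shape as TRAIN_DATA (example copied from the output above).
train_data = [
    ("As green infrastructure, green roof benefits \nare  plenty.",
     {"entities": [(25, 35, "SUSTECH")]}),
]

def save_jsonl(examples, path):
    """Write one JSON object per line: {"text": ..., "entities": [[start, end, label], ...]}."""
    with open(path, "w", encoding="utf-8") as fh:
        for text, annotations in examples:
            record = {"text": text, "entities": annotations["entities"]}
            fh.write(json.dumps(record) + "\n")

def load_jsonl(path):
    """Read the file back and restore the (text, {"entities": [tuples]}) structure."""
    examples = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            entities = [tuple(span) for span in record["entities"]]
            examples.append((record["text"], {"entities": entities}))
    return examples

# Hypothetical output location; the notebook itself writes to ./train/.
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
save_jsonl(train_data, path)
restored = load_jsonl(path)
print(restored == train_data)
```

Newlines inside sentences are escaped by `json.dumps`, so each record stays on a single line.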