# Identifying Actions 

This notebook explores approaches to identifying sentences that describe 'actions a citizen can do', based on their parse trees.


## Setup

### Connect to Google Drive

In [0]:
#authorize Colab to access Drive
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


### Imports

In [0]:
# all imports collected here
import gensim
import csv
import os
import pickle
import tabulate
import numpy as np
import re
import nltk
nltk.download('punkt')
from nltk import sent_tokenize, word_tokenize
from collections import defaultdict
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import TruncatedSVD
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.matutils import sparse2full


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Data Assembly and Exploration

### PDF to Text conversion

For these experiments we used the text files produced with pdfminer.

This process is covered separately in the notebook PDF to Text pdfminer.ipynb.



### Read actions from csv file

The column at index 10 of the csv file (counting from zero) indicates who can do the action. Since we are looking for actions a citizen can do, we extract only the actions whose entry in that column includes 'citizen'.
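
The filtering step can be tried on a small in-memory sample before touching the real file. The sketch below is illustrative only: the sample rows, the helper name `citizen_actions`, the row-length guard and the lower-casing are assumptions that are not part of the original notebook.

```python
import csv
import io

def make_row(action, actor):
    # Hypothetical row layout: action text in column 0, actor in column index 10.
    return "|".join([action] + [""] * 9 + [actor])

sample = "\n".join([
    "|".join(f"col{i}" for i in range(11)),          # header row
    make_row("Clean your gutters every autumn.", "citizen"),
    make_row("Update the municipal flood bylaw.", "municipality"),
    make_row("Install a backflow valve.", "citizen, homeowner"),
    "short|row",                                     # malformed row, skipped by the guard
])

def citizen_actions(handle, actor_col=10, keyword="citizen"):
    """Return column 0 of every row whose actor column mentions the keyword."""
    reader = csv.reader(handle, delimiter="|")
    next(reader, None)                               # skip the title row
    found = []
    for row in reader:
        if len(row) > actor_col and keyword in row[actor_col].lower():
            found.append(row[0].strip())
    return found

demo_actions = citizen_actions(io.StringIO(sample))
print(demo_actions)
```

The length guard avoids an `IndexError` on short rows, which may or may not occur in the real csv file.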



In [0]:
#complete path of csv file
csv_path = '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/ClimateChangeDocs-master/Actions/Climate Change Docs - Actions Information Architecture.csv'

In [0]:
import csv

# accumulator for the citizen actions
actions = []

# open the csv file
with open(csv_path, 'r', encoding="utf8", errors='ignore') as f:

    # define a reader (generator) for the pipe-delimited file
    reader = csv.reader(f, delimiter='|')

    # first row is titles
    titles = next(reader)

    # keep the action text (column 0) of every row whose column 10 mentions 'citizen'
    for row in reader:
        if 'citizen' in row[10]:
            actions.append(row[0])




In [0]:
len(actions)

820

In [0]:
actions

['Have a drainage contractor visit your home to inspect your lateral with a Closed Circuit TV (CCTV).',
 'Never pour kitchen grease, fats or oils into your house drains because they may solidify in your plumbing system. Also, do not put objects down the toilet or drains that your plumbing system was never intended to handle.',
 'Consider a sanitary wastewater backflow preventer valve to reduce the risk of sewage backup into your basement.',
 'Disconnect roof downspouts, if connected to wastewater lateral, to reduce flows to the sanitary lateral and the wastewater sewer.',
 'Improve lot grading, making sure that the ground slopes away from your exterior walls.',
 'Maintain all original property swales to divert water away from your home.',
 'Where possible, disconnect your roof downspouts and divert the stormwater at least 2 meters (6ft) away from your home to a vegetated, safe discharge point away from adjacent property lines, sidewalks, or building foundations.',
 'Check for and reduc

### Discussion

There are 820 citizen actions listed.

Notice that many, but not all, of the actions are imperative sentences. Perhaps imperative sentences can be used to identify actions.

Experiments with various linguistic tools suggested that spaCy dependency parses might be able to help us.

### Obtaining Examples of Non-Actions

Although we have examples of the kinds of sentences we are looking for, we have nothing to represent what we are **not** looking for. To train a binary classifier we need both positive and negative examples.

We extracted about 1500 random sentences from the corpus, keeping only those that did not appear in the Actions list.

We then manually edited the extracted sentences to weed out any that were not well-formed sentences or that could be considered Actions even though they were not labelled as such.

This provided 388 samples of non-action sentences.
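
As a rough illustration of the sampling step, the sketch below draws a seeded random subset of sentences and skips any that are already known actions. The function name `sample_non_actions` and the toy sentences are hypothetical. Unlike the cell that follows, this sketch checks the limit after every sentence rather than after every file.

```python
import random

def sample_non_actions(sentences, known_actions, rate=0.05, limit=1500, seed=42):
    """Keep each sentence with probability `rate`, skipping known actions."""
    rng = random.Random(seed)      # private generator, so the result is reproducible
    known = set(known_actions)     # set membership is faster than scanning a list
    picked = []
    for sentence in sentences:
        if len(picked) >= limit:
            break
        if rng.random() < rate and sentence not in known:
            picked.append(sentence)
    return picked

# toy data standing in for the sentences of the corpus
demo_sentences = [f"Sentence number {i}." for i in range(1000)]
demo_known = {"Sentence number 3.", "Sentence number 7."}
demo_sample = sample_non_actions(demo_sentences, demo_known, rate=0.05, limit=20)
print(len(demo_sample))
```

Using a dedicated `random.Random(seed)` instance keeps the sample reproducible even if other code also draws random numbers.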

In [0]:
import os
import re
from random import random, seed

# manually construct a list of eligible files to look at
files = [
  'Guide-Building-Sustainable-and-Resilient-Communities-with-Asset-Management-EN.txt',
  'ccp_impactonpeople.txt',
  'health_facilit-instal_sante-eng.txt',
  'climate_data_discussion_primer.txt',
  'FloodRecovery-e.txt',
  'municipal-climate-change-action-plan-guidebook-en.txt',
  'public_guideline__principles_of_climate_adaptation_and_mitigation_for_engineers.txt',
  'Spring_Flood_Fact_Sheet.txt',
  'Ahead-of-the-Storm-1.txt',
  'builders_guide_2010_final.txt',
  'ClimatRisk-E-ACCESSIBLE.txt'
  ]

# define variables
non_actions = []
counter = 0
seed(42)

in_path = '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/ClimateChangeDocs_pdfminer'
out_path = '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/'

# NOTE: `nlp` is the spaCy pipeline loaded under "Setting up spaCy" below;
# that cell must be run before this one.
with open(os.path.join(out_path, "non-actions-raw.txt"), 'w', encoding='utf8') as outfile:

    # read each file in the list
    for file in files:
        path_to_txt = os.path.join(in_path, file)
        with open(path_to_txt, encoding='utf8') as f:
            text = f.read()

        # apply sentence segmentation
        doc = nlp(text)

        for sent in doc.sents:
            # keep roughly 5% of the sentences at random, skipping known actions
            if random() < 0.05 and sent.text not in actions:
                outfile.write(sent.text + '\n')
                non_actions.append(sent.text)
                counter += 1

        # stop after the file in which the 1500th sentence was written
        if counter > 1500:
            break

In [0]:
non_actions

['An introduction for \n \n',
 '•  City of Vancou',
 'ON\n\n',
 'ON\n\n',
 '�',
 '�',
 '�',
 '�',
 '�',
 '�',
 'A sustainable community is one that meets the needs \n \nof the present without compromising the needs of \n \nfuture generations (source: Environment and Climate \n \nChange Canada).',
 'costs, risks and services \n\n',
 'Integrating natural and built environments \n\n \n\n',
 'Asset management (AM) is an integrated approach, \n \ninvolving all municipal departments, to choosing \n \nand managing existing and new assets.',
 'PROSPERITY \n\n',
 'ti es be built into asset \nmanagement planning? \n\n',
 'Compact, mixed-use de elopment is \ngenerally less costly to ser ice',
 'Re y ling or repurposing assets: As assets come to \n \n',
 'Of course, these considerations are not only  \nimportant at the maintenance stage; climate risks need to  \nbe considered at all stages of the life cycle of assets as well  \nas when decisions are made about building something new. \n\n',
 'CLIM

At this point the extracted sentences were copied into a text editor and manually edited.

In [0]:
# read the edited Non-Action file, one sentence per line
in_path = '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/non-actions-edited.txt'

with open(in_path, 'r', encoding="utf8", errors='ignore') as f:
    non_actions = [line.rstrip('\n') for line in f]

# the parses are computed and written out later, under "Parse and pickle the Non-Actions"

In [0]:
len(non_actions)

388

### Held out Test Data

In [0]:
# manually construct a list of eligible files that have not been examined yet, 
# and may contain both actions and non-actions that have never been seen before

test_files = [
  'coastal_flooded_land_guidelines.txt',
  'En56-226-2008-eng.txt',
  'env-yukon-state-play-analysis-climate-change-impacts-adaptation.txt',
  'FBC_WaterGuide_FINAL.txt',
  'final_climate_change_and_health_backgrounder_overview.txt',
  'Guidebook-2016.txt',
  'HP5-122-2017-eng.txt',
  'landuse-e.txt',
  'preparedbc_flood_information_for_homeowners_and_home_buyers_2018.txt',
  'protect-your-home-from-basement-flooding.txt',
  'Protect_Your_Home_From_Flooding_Brochure.txt',
  'sea_dike_guidelines.txt',
  'slr-primer.txt',
  'Synthesis_Eng.txt',
  'Urban_Forests_Guide.txt',
  'Vancouver-Climate-Change-Adaptation-Strategy-2012-11-07.txt',
  'WCEL_climate_change_FINAL.txt'
  ]

test_path = '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/ClimateChangeDocs_pdfminer'

## Data Wrangling

### Setting up spaCy

Code adapted from https://heartbeat.fritz.ai/nlp-chronicles-intro-to-spacy-34949f1bc118#08c8

spaCy documentation: https://spacy.io/usage

spaCy POS tags: https://spacy.io/api/annotation#pos-tagging

In [0]:
pip install -U spaCy

Collecting spaCy
  Downloading https://files.pythonhosted.org/packages/47/13/80ad28ef7a16e2a86d16d73e28588be5f1085afd3e85e4b9b912bd700e8a/spacy-2.2.3-cp36-cp36m-manylinux1_x86_64.whl (10.4MB)
Collecting thinc<7.4.0,>=7.3.0
  Downloading https://files.pythonhosted.org/packages/07/59/6bb553bc9a5f072d3cd479fc939fea0f6f682892f1f5cff98de5c9b615bb/thinc-7.3.1-cp36-cp36m-manylinux1_x86_64.whl (2.2MB)
Collecting blis<0.5.0,>=0.4.0
  Downloading https://files.pythonhosted.org/packages/41/19/f95c75562d18eb27219df3a3590b911e78d131b68466ad79fdf5847eaac4/blis-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (3.7MB)
Collecting catalogue<1.1.0,>=0.0.7
  Downloading https://files.pythonhosted.org/packages/4f/d5/46ff975f0d7d055cf95557b944fd5d29d9dfb37a4341038e070f212b24fe/catalogue-0.0.8-py2.py3-none-any.whl
Collecting p

In [0]:
!python -m spacy download en

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0MB)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.2.5-cp36-none-any.whl size=12011741 sha256=658ba3e7ccbb1d35120369fa403fd1fa4066a28fa005add628b30d6e155c22d7
  Stored in directory: /tmp/pip-ephem-wheel-cache-whowe7mg/wheels/6a/47/fb/6b5a0b8906d8e8779246c67d4658fd8a544d4a03a75520197a
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
  Found existing installation: en-core-web-sm 2.1.0
    Uninstalling en-core-web-sm-2.1.0:
      Successfully uninstalled en-core-web-sm-2.1.0
Successfully installed en-core-web-sm-2.2.5
✔ Download and installation successful
You can now load the model via s

In [0]:
import spacy
nlp=spacy.load('en')

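### An example of a spaCy dependency parse

In [0]:
doc = nlp('Make sure everyone living in the home knows where to find the Go-Kit.')
doc

Make sure everyone living in the home knows where to find the Go-Kit.

In [0]:
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})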

### Define a function to represent the high-level syntax of a sentence

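The function extracts the top two levels of the dependency parse: the root and its immediate children. Each is represented by its dependency label and fine-grained tag rather than by the token text, so the leaves (the words themselves) are excluded.

This syntactic abstraction of the sentence is what will be used for classification.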
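
To make the shape of the abstraction concrete without loading a parser, the sketch below applies the same idea to a hand-built tree. The `Node` class is a hypothetical stand-in for a parsed token (it is not a spaCy type), and the tags and labels were typed in by hand for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Hypothetical stand-in for a parsed token: tag, dependency label, children."""
    tag: str
    dep: str = "ROOT"
    lefts: List["Node"] = field(default_factory=list)
    rights: List["Node"] = field(default_factory=list)

def abstract(root):
    """Describe the root and its immediate children only; deeper nodes are ignored."""
    features = ["ROOT_self_" + root.tag]
    features += ["LEFT_" + child.dep + "_" + child.tag for child in root.lefts]
    features += ["RIGHT_" + child.dep + "_" + child.tag for child in root.rights]
    return features

# Hand-built tree for an imperative such as "Clean the gutters."
# The determiner under the object is a grandchild of the root, so it is dropped.
gutters = Node("NNS", "dobj", lefts=[Node("DT", "det")])
imperative = Node("VB", rights=[gutters, Node(".", "punct")])
print(abstract(imperative))
```

Because an imperative has no subject, nothing appears on the left of the root, which is what makes these patterns so short.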

In [0]:
def parse(text):
    """Return the syntactic abstraction of `text`: the root's tag plus the
    dependency label and tag of each immediate child of the root."""
    # create a spaCy document instance from the text
    doc = nlp(text)
    # find the root of the parse tree, i.e. the token that is its own head
    # (if spaCy splits the text into several sentences, only the first root is used)
    root = [token for token in doc if token.head == token][0]

    # starting from the root, collect the syntactic tags of the first two levels of the parse tree,
    # using fine-grained tags (tag_) instead of coarse POS (pos_) for greater differentiation of verb types
    parsed = ['ROOT_self_' + root.tag_]
    for s in root.lefts:
        parsed.append('LEFT_' + s.dep_ + '_' + s.tag_)

    for s in root.rights:
        parsed.append('RIGHT_' + s.dep_ + '_' + s.tag_)

    return parsed

In [0]:
#demonstrate what the function does
parse('I want to be a clone.')

['ROOT_self_VBP', 'LEFT_nsubj_PRP', 'RIGHT_xcomp_VB', 'RIGHT_punct_.']

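### Parse and pickle the Actions

Despite the .pkl extension, the output file is plain text: one sentence per line, with the parse tokens separated by spaces.
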

In [0]:
import csv

#complete path of csv file
csv_path = '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/ClimateChangeDocs-master/Actions/Climate Change Docs - Actions Information Architecture.csv'
outpath = '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/Climate Change Docs - Actions.pkl'

# define accumulators
actions = []
parsed_actions = []

# open the csv file and the output file
with open(csv_path, 'r', encoding="utf8", errors='ignore') as f, open(outpath, 'w') as outfile:

    # define a reader (generator) for the file
    reader = csv.reader(f, delimiter='|')

    # first row is titles
    titles = next(reader)

    # read all the actions
    for row in reader:
        if 'citizen' in row[10]:
            # save the original sentences in the actions list
            actions.append(row[0])
            # compute the syntactic abstraction and write it out as one space-separated line
            parsed = parse(row[0])
            outfile.write(' '.join(parsed) + '\n')
            # save the parse for further processing
            parsed_actions.append(parsed)

 

In [0]:
parsed_actions

[['ROOT_self_VB',
  'LEFT_aux_VB',
  'LEFT_nsubj_NN',
  'RIGHT_dobj_NN',
  'RIGHT_advcl_VB',
  'RIGHT_punct_.'],
 ['ROOT_self_VB',
  'LEFT_neg_RB',
  'RIGHT_dobj_NN',
  'RIGHT_prep_IN',
  'RIGHT_advcl_VB',
  'RIGHT_punct_.'],
 ['ROOT_self_VB', 'RIGHT_dobj_NN', 'RIGHT_advcl_VB', 'RIGHT_punct_.'],
 ['ROOT_self_VB',
  'LEFT_nsubj_NNS',
  'LEFT_punct_,',
  'LEFT_advcl_VBN',
  'LEFT_punct_,',
  'LEFT_aux_TO',
  'RIGHT_dobj_NNS',
  'RIGHT_prep_IN',
  'RIGHT_punct_.'],
 ['ROOT_self_VB',
  'RIGHT_dobj_NN',
  'RIGHT_punct_,',
  'RIGHT_advcl_VBG',
  'RIGHT_punct_.'],
 ['ROOT_self_VB', 'RIGHT_dobj_NNS', 'RIGHT_advcl_VB', 'RIGHT_punct_.'],
 ['ROOT_self_VB',
  'LEFT_advcl_JJ',
  'LEFT_punct_,',
  'RIGHT_dobj_NNS',
  'RIGHT_cc_CC',
  'RIGHT_conj_VB',
  'RIGHT_punct_.'],
 ['ROOT_self_VB',
  'RIGHT_prep_IN',
  'RIGHT_cc_CC',
  'RIGHT_conj_VB',
  'RIGHT_punct_.'],
 ['ROOT_self_VB',
  'LEFT_csubj_VB',
  'LEFT_aux_VBP',
  'LEFT_neg_RB',
  'RIGHT_dobj_NNS',
  'RIGHT_punct_.'],
 ['ROOT_self_NNS',
  'LEFT_a

### Parse and pickle the Non-Actions

In [0]:
import csv

# complete paths of the edited text file and the output file
inpath = '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/non-actions-edited.txt'
outpath = '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/Climate Change Docs - Non-Actions.pkl'

# define accumulators
non_actions = []
parsed_non_actions = []

# open the text file and the output file
with open(inpath, 'r', encoding="utf8", errors='ignore') as f, open(outpath, 'w') as outfile:
    # read the non-actions, one sentence per line
    for row in f:
        # save the original sentences in the non_actions list
        non_actions.append(row)
        # compute the syntactic abstraction and write it out as one space-separated line
        parsed = parse(row)
        outfile.write(' '.join(parsed) + '\n')
        # save the parse for further processing
        parsed_non_actions.append(parsed)
 

In [0]:
parsed_non_actions

[['ROOT_self_VBZ', 'LEFT_nsubj_NN', 'RIGHT_attr_NN'],
 ['ROOT_self_VBZ',
  'LEFT_nsubj_NN',
  'RIGHT_attr_NN',
  'RIGHT_punct_,',
  'RIGHT_advcl_VBG',
  'RIGHT_punct_.'],
 ['ROOT_self_VBP',
  'LEFT_ccomp_VBP',
  'LEFT_punct_,',
  'LEFT_nsubj_NNS',
  'RIGHT_xcomp_VBN',
  'RIGHT_punct_.'],
 ['ROOT_self_VBZ', 'LEFT_nsubj_NN', 'RIGHT_prep_IN', 'RIGHT_punct_.'],
 ['ROOT_self_VBN',
  'LEFT_nsubjpass_PRP',
  'LEFT_auxpass_VBP',
  'RIGHT_xcomp_VB',
  'RIGHT_punct_.'],
 ['ROOT_self_VBZ',
  'LEFT_nsubj_PRP',
  'RIGHT_acomp_JJ',
  'RIGHT_xcomp_VB',
  'RIGHT_punct_.'],
 ['ROOT_self_VB',
  'LEFT_advmod_RB',
  'LEFT_punct_,',
  'LEFT_nsubj_NN',
  'LEFT_aux_MD',
  'RIGHT_prep_IN',
  'RIGHT_punct_,',
  'RIGHT_cc_CC',
  'RIGHT_conj_VB'],
 ['ROOT_self_VB',
  'LEFT_advmod_RB',
  'LEFT_nsubj_NNS',
  'LEFT_aux_MD',
  'RIGHT_dobj_NN',
  'RIGHT_advmod_RB',
  'RIGHT_prep_IN',
  'RIGHT_punct_.'],
 ['ROOT_self_VBZ',
  'LEFT_nsubj_PRP',
  'RIGHT_neg_RB',
  'RIGHT_acomp_JJ',
  'RIGHT_advcl_VB',
  'RIGHT_punct_.']

### Inspect the frequency counts

If there is an obvious rule for identifying actions, hard-code it as a rule.

Otherwise train a classification model on the pickled actions and non-actions.
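
One way to look for such a rule is to list the patterns that are frequent in one class and absent from the other. The sketch below uses `collections.Counter` on made-up toy parses; the helper name `exclusive_patterns` and the toy data are assumptions for illustration, not results from this corpus.

```python
from collections import Counter

def exclusive_patterns(class_a, class_b, top_n=3):
    """Most frequent patterns of class_a that never occur in class_b."""
    counts_a = Counter(" ".join(parse) for parse in class_a)
    counts_b = Counter(" ".join(parse) for parse in class_b)
    only_a = [(pattern, n) for pattern, n in counts_a.most_common()
              if counts_b[pattern] == 0]
    return only_a[:top_n]

# toy parses, typed in by hand
shared = ["ROOT_self_VB", "LEFT_nsubj_NN", "LEFT_aux_MD", "RIGHT_dobj_NN", "RIGHT_punct_."]
toy_actions = [["ROOT_self_VB", "RIGHT_dobj_NN", "RIGHT_punct_."]] * 3 + [shared]
toy_non_actions = [["ROOT_self_VBZ", "LEFT_nsubj_NN", "RIGHT_attr_NN", "RIGHT_punct_."]] * 2 + [shared]

print(exclusive_patterns(toy_actions, toy_non_actions))
print(exclusive_patterns(toy_non_actions, toy_actions))
```

A pattern that appears in both classes, like `shared` here, is filtered out in both directions, which is the kind of case a hand-written rule would struggle with.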

In [0]:
from nltk import FreqDist

#  count the frequency of the syntactic patterns
action_freqs = FreqDist([' '.join(sent) for sent in parsed_actions])  
non_action_freqs = FreqDist([' '.join(sent) for sent in parsed_non_actions])                    

#### Most Common Patterns for Actions

In [0]:
action_freqs.most_common(20)

[('ROOT_self_VB RIGHT_dobj_NN RIGHT_punct_.', 47),
 ('ROOT_self_VB RIGHT_dobj_NNS RIGHT_punct_.', 28),
 ('ROOT_self_VB RIGHT_dobj_NN RIGHT_prep_IN RIGHT_punct_.', 18),
 ('ROOT_self_VB RIGHT_acomp_JJ RIGHT_punct_.', 17),
 ('ROOT_self_VB RIGHT_prep_IN RIGHT_punct_.', 17),
 ('ROOT_self_VB RIGHT_dobj_NN RIGHT_cc_CC RIGHT_conj_VB RIGHT_punct_.', 11),
 ('ROOT_self_VB RIGHT_dobj_NN RIGHT_advcl_VB RIGHT_punct_.', 9),
 ('ROOT_self_VB RIGHT_cc_CC RIGHT_conj_VB RIGHT_punct_.', 9),
 ('ROOT_self_VB RIGHT_xcomp_VBG RIGHT_punct_.', 8),
 ('ROOT_self_VB RIGHT_dobj_NNS RIGHT_prep_IN RIGHT_punct_.', 7),
 ('ROOT_self_VB RIGHT_dobj_NN', 6),
 ('ROOT_self_VB RIGHT_ccomp_VBN RIGHT_punct_.', 6),
 ('ROOT_self_VB LEFT_nsubj_NN LEFT_aux_MD RIGHT_xcomp_VB RIGHT_punct_.', 6),
 ('ROOT_self_VB LEFT_nsubj_NNS LEFT_aux_MD RIGHT_dobj_NN RIGHT_punct_.', 6),
 ('ROOT_self_VBN LEFT_nsubjpass_NNS LEFT_aux_MD LEFT_auxpass_VB RIGHT_prep_IN RIGHT_punct_.',
  6),
 ('ROOT_self_VB RIGHT_dobj_NNS RIGHT_advcl_VB RIGHT_punct_.', 5),


#### Most Common Patterns for Non-Actions

In [0]:
non_action_freqs.most_common(20)

[('ROOT_self_VBP LEFT_expl_EX RIGHT_advmod_RB RIGHT_attr_NNS RIGHT_punct_.',
  4),
 ('ROOT_self_VBN LEFT_nsubjpass_NN LEFT_auxpass_VBZ RIGHT_xcomp_VB RIGHT_punct_.',
  4),
 ('ROOT_self_VBP LEFT_nsubj_NNS RIGHT_dobj_NN RIGHT_punct_.', 3),
 ('ROOT_self_VBZ LEFT_nsubj_NN RIGHT_attr_NN RIGHT_punct_.', 3),
 ('ROOT_self_VBP LEFT_nsubj_NNS RIGHT_dobj_NNS RIGHT_punct_.', 3),
 ('ROOT_self_NNS RIGHT_prep_IN RIGHT_punct_.', 3),
 ('ROOT_self_VBP LEFT_nsubj_NNS LEFT_advmod_RB RIGHT_prep_IN RIGHT_punct_.',
  3),
 ('ROOT_self_VBZ LEFT_nsubj_NN RIGHT_prep_IN RIGHT_punct_.', 2),
 ('ROOT_self_VBZ LEFT_nsubj_PRP RIGHT_acomp_JJ RIGHT_xcomp_VB RIGHT_punct_.',
  2),
 ('ROOT_self_VB LEFT_nsubj_NN LEFT_aux_MD RIGHT_dobj_NN RIGHT_punct_.', 2),
 ('ROOT_self_VB LEFT_nsubj_NN LEFT_aux_MD RIGHT_dobj_NNS RIGHT_punct_.', 2),
 ('ROOT_self_VB LEFT_nsubj_NNS LEFT_aux_MD RIGHT_ccomp_VB RIGHT_punct_.', 2),
 ('ROOT_self_VBZ LEFT_nsubj_NNP RIGHT_dobj_NN RIGHT_punct_.', 2),
 ('ROOT_self_VBP LEFT_nsubj_NNS RIGHT_attr_NN RIGH

#### Compare Distribution of Most Common Patterns between Actions and Non-Actions

In [0]:
action_freqs['ROOT_self_VB RIGHT_dobj_NN RIGHT_punct_.'] 

47

In [0]:
non_action_freqs['ROOT_self_VB RIGHT_dobj_NN RIGHT_punct_.'] 

0

In [0]:
action_freqs['ROOT_self_VB RIGHT_dobj_NNS RIGHT_punct_.'] 

28

In [0]:
non_action_freqs['ROOT_self_VB RIGHT_dobj_NNS RIGHT_punct_.'] 

0

In [0]:
non_action_freqs['ROOT_self_VBP LEFT_expl_EX RIGHT_advmod_RB RIGHT_attr_NNS RIGHT_punct_.']

4

In [0]:
action_freqs['ROOT_self_VBP LEFT_expl_EX RIGHT_advmod_RB RIGHT_attr_NNS RIGHT_punct_.']

0

In [0]:
non_action_freqs['ROOT_self_VBN LEFT_nsubjpass_NN LEFT_auxpass_VBZ RIGHT_xcomp_VB RIGHT_punct_.']

4

In [0]:
action_freqs['ROOT_self_VBN LEFT_nsubjpass_NN LEFT_auxpass_VBZ RIGHT_xcomp_VB RIGHT_punct_.']

0

### Discussion

There is a clear difference between the grammatical structures that are common for actions and those that are common for non-actions.

The two most common action patterns do not occur at all among the non-actions, and the two most common non-action patterns do not occur among the actions.

The patterns are too numerous and varied to be easily hard-coded by hand, but the fact that they differ between the two classes is evident.

It therefore seems promising to train a classification algorithm on the parses.

It may be necessary to balance the datasets, as there are approximately twice as many actions (820) as non-actions (388).

### Parse Reserved Test Data

The sentences of the reserved test data were also parsed in the same way as the training data.
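
Before parsing, each test sentence is cleaned up and very short fragments are dropped. The clean-up can be sketched on its own as below. The two regular expressions are the ones used in the cell that follows; the helper names are hypothetical, and the whitespace-based word count is a simplification of the `word_tokenize` count used in the notebook.

```python
import re

def clean_sentence(sentence):
    """Collapse runs of whitespace and replace non-ASCII characters with '_'."""
    sentence = re.sub(r"\s+", " ", sentence)
    return re.sub(r"[^\x00-\x7F]", "_", sentence)

def is_candidate(sentence, min_words=4):
    # Assumption: whitespace-separated words approximate the tokenizer's count.
    return len(sentence.split()) >= min_words

raw = "Keep  storm\n drains \u2022 clear of   leaves."
cleaned = clean_sentence(raw)
print(cleaned)
print(is_candidate(cleaned), is_candidate("ON"))
```

Replacing non-ASCII characters with an underscore, rather than deleting them, keeps the word boundaries of the original text intact.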

In [0]:
# manually construct a list of eligible files that have not been examined yet, 
# and may contain both actions and non-actions that have never been seen before

test_files = [
  'coastal_flooded_land_guidelines.txt',
  'En56-226-2008-eng.txt',
  'env-yukon-state-play-analysis-climate-change-impacts-adaptation.txt',
  'FBC_WaterGuide_FINAL.txt',
  'final_climate_change_and_health_backgrounder_overview.txt',
  'Guidebook-2016.txt',
  'HP5-122-2017-eng.txt',
  'landuse-e.txt',
  'preparedbc_flood_information_for_homeowners_and_home_buyers_2018.txt',
  'protect-your-home-from-basement-flooding.txt',
  'Protect_Your_Home_From_Flooding_Brochure.txt',
  'sea_dike_guidelines.txt',
  'slr-primer.txt',
  'Synthesis_Eng.txt',
  'Urban_Forests_Guide.txt',
  'Vancouver-Climate-Change-Adaptation-Strategy-2012-11-07.txt',
  'WCEL_climate_change_FINAL.txt'
  ]

test_path = '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/ClimateChangeDocs_pdfminer'

# initialize variables
candidates = [] 
clean_sents = [] 
sent_files = []  
counter = 0

# read each file in the list and clean up the raw text
for file in test_files:
    path_to_txt = os.path.join(test_path, file)
    with open(path_to_txt, encoding='utf8') as f:
        raw_text = f.read()

    # nltk.sent_tokenize is better at recognizing sentences across multiple lines than spaCy is
    for sent in sent_tokenize(raw_text):
        sent = re.sub(r'\s+', ' ', sent)           # replace runs of whitespace with a single space
        sent = re.sub(r'[^\x00-\x7F]', '_', sent)  # replace non-ascii characters with underscore

        # skip sentences with 3 or fewer tokens, as these are not likely to be 'actions a citizen can do'
        if len(word_tokenize(sent)) > 3:
            clean_sents.append(sent)
            sent_files.append(file)
            candidates.append(parse(sent))  # parse() applies the spaCy pipeline
            counter += 1

    # stop after the file in which the 50,000th sentence is found, because that is enough for a human to look at
    if counter > 50000:
        break

In [0]:
# show the parsed candidates
print("Found {} candidate sentences   \n".format(counter))
for i in range(0, 3):
    print("Clean: ", clean_sents[i])
    print("Parsed: ", candidates[i])
    print("File: ", sent_files[i], '\n')

print("...")    

for i in range(-3, 0):
    print("Clean: ", clean_sents[i])
    print("Parsed: ", candidates[i])
    print("File: ", sent_files[i], '\n')


Found 14278 candidate sentences   

Clean:   Process Infrastructure Ports, Marine & Offshore Project No.
Parsed:  ['ROOT_self_NNPS', 'LEFT_compound_NN', 'LEFT_compound_NNP', 'RIGHT_punct_,', 'RIGHT_conj_NNP']
File:  coastal_flooded_land_guidelines.txt 

Clean:  143111 Revision Number 0 BC Ministry of Environment Climate Change Adaption Guidelines for Sea Dikes and Coastal Flood Hazard Land Use Guidelines for Management of Coastal Flood Hazard Land Use 27 January 2011 DISCLAIMER: This document is for the private information and benefit only of the client for whom it was prepared and for the particular purpose previously advised to Ausenco Sandwell.
Parsed:  ['ROOT_self_NNP', 'LEFT_nsubj_NNPS', 'RIGHT_dobj_NNPS']
File:  coastal_flooded_land_guidelines.txt 

Clean:  The contents of this document are not to be relied upon or used, in whole or in part, by or for the benefit of others without prior adaptation and specific written verification by Ausenco Sandwell.
Parsed:  ['ROOT_self_VBP', '

## Logistic Regression Model

### Define Classes

#### Define Vectorizer Class

Based on an example distributed in the course.

In [0]:
class GensimTfidfVectorizer(BaseEstimator, TransformerMixin):

    def __init__(self, dirpath=".", tofull=False):
        """
        Pass in a directory that holds the lexicon in corpus.dict and the
        TFIDF model in tfidf.model (for now).

        Set tofull = True if the next thing is a Scikit-Learn estimator
        otherwise keep False if the next thing is a Gensim model.
        """
        self._lexicon_path = os.path.join(dirpath, "corpus.dict")
        self._tfidf_path = os.path.join(dirpath, "tfidf.model")

        self.lexicon = None
        self.tfidf = None
        self.tofull = tofull

        self.load()

    def load(self):

        if os.path.exists(self._lexicon_path):
            self.lexicon = Dictionary.load(self._lexicon_path)

        if os.path.exists(self._tfidf_path):
            self.tfidf = TfidfModel.load(self._tfidf_path)

    def save(self):
        self.lexicon.save(self._lexicon_path)
        self.tfidf.save(self._tfidf_path)

    def fit(self, documents, labels=None):
        self.lexicon = Dictionary(documents)
        self.tfidf = TfidfModel([self.lexicon.doc2bow(doc) for doc in documents], id2word=self.lexicon)
        self.save()
        return self

    def transform(self, documents):
        def generator():
            for document in documents:
                vec = self.tfidf[self.lexicon.doc2bow(document)]
                if self.tofull:
                    yield sparse2full(vec, len(self.lexicon))
                else:
                    yield vec
        return list(generator())


#### Define CorpusLoader Class to manage the folds for cross-validation

In [0]:
import numpy as np
from sklearn.model_selection import KFold

class CorpusLoader(object):
    """
    Splits a list of vectors and their labels
    """
    def __init__(self, vectors, labels, splits=12):
        self.folds = KFold(n_splits=splits, shuffle=True)
        self.X = np.asarray(vectors)
        self.y = np.asarray(labels)

    def documents(self, idx=None):
        #temp = [doc for doc in self.X[idx]]
        #print('docs: ', temp)
        #return(temp)
        return [doc for doc in self.X[idx]]

    def labels(self, idx):
        return self.y[idx]

    def __iter__(self):
        for train_index, test_index in self.folds.split(self.X):
            X_train = self.documents(train_index)
            y_train = self.labels(train_index)

            X_test = self.documents(test_index)
            y_test = self.labels(test_index)

            yield X_train, X_test, y_train, y_test
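
The splitting that `CorpusLoader` delegates to `KFold` can be pictured with a small pure-Python sketch. This is not scikit-learn's implementation: the function name `kfold_indices` and the way indices are dealt into folds are assumptions chosen for brevity. The corpus size of 1596 is taken from the counts in this notebook (820 actions plus 2 x 388 non-actions).

```python
import random

def kfold_indices(n_items, n_splits=10, seed=0):
    """Yield (train, test) index lists; every item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # deal the shuffled indices into n_splits folds of nearly equal size
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

demo_folds = list(kfold_indices(1596, n_splits=10))
fold_sizes = sorted(len(test) for _, test in demo_folds)
print(fold_sizes)
```

Passing a fixed seed makes the folds repeatable between runs, which the `KFold(shuffle=True)` call above does not guarantee unless a `random_state` is given.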


### Build the Model

#### Read in the Pickled Training Data

Each non-action is added twice (2 x 388 = 776) to approximate the number of actions (820), so the training data is roughly balanced.
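
The duplication factor does not have to be hard-coded. As a minimal sketch, it can be derived from the two class sizes; the helper name `oversample` is hypothetical, and the integers below merely stand in for the 388 non-action parses.

```python
def oversample(minority, majority_size):
    """Repeat every minority item a whole number of times to approach majority_size."""
    factor = max(1, round(majority_size / len(minority)))
    repeated = [item for item in minority for _ in range(factor)]
    return repeated, factor

placeholder_non_actions = list(range(388))   # stand-ins for the 388 non-action parses
balanced, factor = oversample(placeholder_non_actions, majority_size=820)
print(factor, len(balanced))
```

With 820 actions and 388 non-actions the factor comes out as 2, which matches the doubling done by hand in the cell below.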

In [0]:
# location of pickle files
actions_file = '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/Climate Change Docs - Actions.pkl'
non_actions_file = '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/Climate Change Docs - Non-Actions.pkl'

#initialize accumulators
corpus = []
labels = []

with open(actions_file, 'r', encoding="utf8", errors='ignore') as f:
    reader = csv.reader(f, delimiter=' ')
    for row in reader:
        corpus.append(row)
        labels.append('action')

with open(non_actions_file, 'r', encoding="utf8", errors='ignore') as f:
    reader = csv.reader(f, delimiter=' ')
    for row in reader:
        corpus.append(row)
        corpus.append(row)
        labels.append('non_action')
        labels.append('non_action')


In [0]:
# show the training data
for i in range(3):
    print(labels[i], corpus[i])

print("...")    

for i in range(-3, 0, 1): 
    print(labels[i], corpus[i])   
    

action ['ROOT_self_VB', 'LEFT_aux_VB', 'LEFT_nsubj_NN', 'RIGHT_dobj_NN', 'RIGHT_advcl_VB', 'RIGHT_punct_.']
action ['ROOT_self_VB', 'LEFT_neg_RB', 'RIGHT_dobj_NN', 'RIGHT_prep_IN', 'RIGHT_advcl_VB', 'RIGHT_punct_.']
action ['ROOT_self_VB', 'RIGHT_dobj_NN', 'RIGHT_advcl_VB', 'RIGHT_punct_.']
...
non_action ['ROOT_self_VB', 'RIGHT_ccomp_VBZ', 'RIGHT_punct_.']
non_action ['ROOT_self_VBN', 'LEFT_nsubjpass_NNS', 'LEFT_auxpass_VBD', 'LEFT_neg_RB', 'RIGHT_xcomp_VB', 'RIGHT_punct_,', 'RIGHT_cc_CC', 'RIGHT_conj_VBD']
non_action ['ROOT_self_VBN', 'LEFT_nsubjpass_NNS', 'LEFT_auxpass_VBD', 'LEFT_neg_RB', 'RIGHT_xcomp_VB', 'RIGHT_punct_,', 'RIGHT_cc_CC', 'RIGHT_conj_VBD']


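#### Try out Logistic Regression with Cross-Validation

Scores are consistent across folds: accuracy is between 0.78 and 0.81 in the folds shown below. Because each non-action appears twice in the corpus, copies of the same sentence can fall into both the training and the test fold, so these scores are probably optimistic.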


In [0]:
# where to save the models
%cd '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project'

# tiny corpus for testing
#docs=corpus[0:20]+corpus[-20:-1]
#labs=labels[0:20]+labels[-20:-1]

# whole corpus
docs=corpus
labs=labels

# Vectorizer
v=GensimTfidfVectorizer(".", True) 
vecs=v.fit_transform(docs)

# K-fold splitter for cross-validation
loader = CorpusLoader(vecs, labs, 10) 

# Storage for all our model metrics
#fields = ['precision', 'recall', 'accuracy', 'f1']
#scores = defaultdict(list)
#for f in fields:
#    scores[f]=[]

from sklearn.metrics import classification_report

for X_train, X_test, y_train, y_test in loader:
    # fit a fresh classifier on each training fold and report on the held-out fold
    m = LogisticRegression()
    m.fit(X_train, y_train)
    y_pred = m.predict(X_test)

    print(classification_report(y_test, y_pred))

    # Add scores to our scores
    #scores['precision'].append(precision_score(y_test, y_pred))
    #scores['recall'].append(recall_score(y_test, y_pred))
    #scores['accuracy'].append(accuracy_score(y_test, y_pred))
    #scores['f1'].append(f1_score(y_test, y_pred))

#print("Results for model {}".format(m))
#print("  Precision: {:0.3f}".format(np.mean(scores['precision'])))
#print("  Recall:    {:0.3f}".format(np.mean(scores['recall'])))
#print("  Accuracy:  {:0.3f}".format(np.mean(scores['accuracy'])))
#print("  F1:        {:0.3f}".format(np.mean(scores['f1'])))    

/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project




              precision    recall  f1-score   support

      action       0.78      0.83      0.80        76
  non_action       0.84      0.79      0.81        84

    accuracy                           0.81       160
   macro avg       0.81      0.81      0.81       160
weighted avg       0.81      0.81      0.81       160

              precision    recall  f1-score   support

      action       0.81      0.83      0.82        83
  non_action       0.81      0.79      0.80        77

    accuracy                           0.81       160
   macro avg       0.81      0.81      0.81       160
weighted avg       0.81      0.81      0.81       160

              precision    recall  f1-score   support

      action       0.73      0.80      0.76        70
  non_action       0.83      0.77      0.80        90

    accuracy                           0.78       160
   macro avg       0.78      0.78      0.78       160
weighted avg       0.79      0.78      0.78       160

              preci
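
Because `classification_report` prints one table per fold, an averaged summary can be easier to compare between models. The commented-out lines in the cell above point in that direction. The following is a minimal sketch of that idea in plain Python; the helper `binary_scores` and the toy `folds` data are hypothetical and are not taken from the notebook's corpus.

```python
from collections import defaultdict
from statistics import mean

def binary_scores(y_true, y_pred, positive="action"):
    """Precision, recall, F1 and accuracy for one fold, with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": correct / len(pairs)}

# Hypothetical (y_true, y_pred) pairs standing in for two cross-validation folds
folds = [
    (["action", "action", "non_action", "non_action"],
     ["action", "non_action", "non_action", "non_action"]),
    (["action", "non_action", "action", "non_action"],
     ["action", "action", "action", "non_action"]),
]

# Collect each metric per fold, then report the mean over folds
scores = defaultdict(list)
for y_true, y_pred in folds:
    for name, value in binary_scores(y_true, y_pred).items():
        scores[name].append(value)

for name, values in scores.items():
    print("{:<10} {:0.3f}".format(name, mean(values)))
```

In the real loop, `y_test` and `y_pred` from each fold would take the place of the toy pairs.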



####Train the Logistic Regression model on the whole training corpus

Training data consists of about 1600 sentences that were manually extracted from the PDF corpus and labelled as actions or non-actions.

Each sentence is represented by its parse tokens, which are treated as words.

Parses are vectorized using TF-IDF, and then a Logistic Regression model is trained on them.
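
To make the "parse tokens treated as words" idea concrete, here is a small pure-Python sketch of TF-IDF weighting over parse-token lists. It is not the notebook's `GensimTfidfVectorizer`: the function `tfidf_vectors` is hypothetical, and the weighting (raw term frequency times log2(N/df), followed by L2 normalisation) is an assumption about what gensim's `TfidfModel` does by default.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one L2-normalised TF-IDF vector per tokenised document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # document frequency: number of documents that contain each token
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log2(n_docs / df[tok]) for tok in vocab}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocab, vectors

# Parse-token "sentences" in the format used in this notebook
docs = [
    ["ROOT_self_VB", "RIGHT_dobj_NN", "RIGHT_punct_."],
    ["ROOT_self_VBP", "LEFT_nsubj_PRP", "RIGHT_acomp_JJ"],
    ["ROOT_self_VB", "RIGHT_dobj_NNS", "RIGHT_prep_IN", "RIGHT_punct_."],
]
vocab, vecs = tfidf_vectors(docs)
for tok, weight in zip(vocab, vecs[0]):
    print("{:<16} {:0.3f}".format(tok, weight))
```

Tokens that occur in fewer documents receive a larger weight, so a rare parse token such as `RIGHT_dobj_NN` counts for more than a common one such as `ROOT_self_VB`.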


In [0]:
# where to save the models
%cd '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/FinalModel'

# use the whole training corpus
docs=corpus
labs=labels

# fit the Vectorizer to the training data and save it
v=GensimTfidfVectorizer('.', True) 
v.fit(docs)

# use the Vectorizer to transform the training data
vecs=v.transform(docs)

# fit the Classifier to the vectorized training data and save it
m=LogisticRegression()
m.fit(vecs, labs)
with open('LRClassifier.model', 'wb') as f:
    pickle.dump(m, f)


/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/FinalModel




In [0]:
# show the vectors
print(vecs[0])
print("...")    
print(vecs[-1])

[0.7705148  0.2970811  0.4859179  0.25309247 0.02639841 0.13107584
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0. 

###Test the Model on Held-Out Test Data

####Predict whether the sentences are actions using the Logistic Regression Model



In [0]:
# where to get the models from if they were not already in memory
#%cd '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/FinalModel'
#v = GensimTfidfVectorizer('.', True)
#m = pickle.load(open('LRClassifier.model', 'rb'))

# test data has been previously read in and parsed
docs=candidates

# Vectorizer
v=GensimTfidfVectorizer('.', True) 
vecs=v.transform(docs)

# Predict
result = m.predict(vecs)



In [0]:
# Print only the sentences classified as actions
if 'action' in result:
    print("Found actions ...")

    for i in range(len(result)):
        if result[i] == 'action':
            print("{}:  {}".format(i, clean_sents[i]))

else:            
    print("No actions found.")

Found actions ...
0:   Process Infrastructure Ports, Marine & Offshore Project No.
7:  Revision Status Revision Date Description Contributors Reviewer FirstName LastName Position Title FirstName LastName Position Title Approver FirstName LastName For Internal Information/Discussion For Internal Information/Discussion For Stakeholder Meeting HR/RA/JSR DR/HR/JSR DR/HR/JSR For Client Use DR/JSR Final Issue DR/JSR/Client Client internal Client JM JM A 31 March 2010 A2-A7 various 20 June 2020 31 October 2010 27 January 2011 A8 B 0 Signature Position Title Rev: 0 Date: 27 January 2011 Project No: 143111: BC Ministry of Environment/ Climate Change Adaption Guidelines for Sea Dikes and Coastal Flood Hazard Land Use 1 1 1 2 2 2 2 2 5 5 6 7 9 11 11 13 14 16 18 20 20 20 21 21 21 21 21 22 22 Contents Introduction and Application of This Document General Acknowledgment Background Scope Reference Documents Definitions Updated Definitions Climate Change Impacts on Coastal Land Use Management Incremen

In [0]:
# Print a sample of sentences with their labels
for i in range(0, min(5000, len(result)), 100):
    print("{}:  {}".format(result[i], clean_sents[i]))


action:   Process Infrastructure Ports, Marine & Offshore Project No.
action:  This approach will minimize the initial costs of considering SLR, and the future costs of adaptation.
action:  Such infrastructure should be designed and constructed to remain operational during floods.
non_action:  1.16 Sea Dike System A system of: dikes, dunes, berms or natural shorelines that provide a similar function; and associated engineering works (e.g., tidal gates, outfalls, outlet structures, seawalls, quay walls, ramps, adjacent building features, etc.)
non_action:  We are especially grateful to the Government of Canada_s Climate Change Impacts and Adaptation Program (CCIAP) (NRCan) for financial support for this Guidebook and the preparatory workshop for this publication (CCIAP project A-1439).
non_action:  Over the years since the United Nations Framework Convention on Climate Change was first signed in Rio de Janeiro in 1992, there have been a number of efforts led by national governments atte

###Discussion

The Logistic Regression model's assignment of action and non_action labels to unseen test data seems little better than random. 

This is disappointing because accuracy was around 80% when cross-validating the model on the training data.

It may be that the model is overfitted to the small training set, since there are 225 features and only 1208 unique training examples. 

Also, since the training data were hand-picked to represent examples of actions and non-actions, they may be unusually clear-cut examples of each class. When trained only on the extreme ends of a spectrum, the classifier might have more difficulty with data that falls closer to the middle.

It might be worthwhile to continue experiments with an ML approach.
In that case, it would be useful to:
*   use more training data
*   extract the training data with a random picker, and have the human only label it
*   try different classifier models
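
The random-picker idea in the list above could look something like the sketch below. The function name `sample_for_labelling` and the toy `sentences` list are hypothetical. The point is only that the sample is drawn uniformly, so that the human labels whatever comes up rather than choosing the examples.

```python
import random

def sample_for_labelling(sentences, k, seed=0):
    """Pick k sentences uniformly at random, keeping their original indices."""
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    k = min(k, len(sentences))
    indices = sorted(rng.sample(range(len(sentences)), k))
    return [(i, sentences[i]) for i in indices]

# Hypothetical stand-in for the sentences extracted from the corpus
sentences = ["Sentence number {}".format(i) for i in range(50)]
for idx, sent in sample_for_labelling(sentences, k=5):
    print(idx, sent)
```

Keeping the original index next to each sentence makes it possible to join the human labels back to the parsed data afterwards.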


## Hard-Coded Rule to identify actions from their parses

Based on our human ability to see a pattern in the syntax of actions, I wrote a simple hard-coded rule to identify action sentences.

This performed much better than the Logistic Regression model.
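
As a compact restatement of the rule, the check can be wrapped in a small predicate. The function name `is_action` is hypothetical; the parse tokens in the example are copied from sentences that appear in this notebook's output.

```python
def is_action(tokens):
    """True if the parse starts with a base-form root verb followed by a noun direct object."""
    return (
        len(tokens) >= 2
        and tokens[0] == "ROOT_self_VB"
        and tokens[1].startswith("RIGHT_dobj_NN")
    )

# Parse tokens copied from sentences that appear in this notebook's output
examples = {
    "Assess present status and trends.":
        ["ROOT_self_VB", "RIGHT_dobj_NN", "RIGHT_punct_."],
    "Such infrastructure should be designed and constructed to remain operational during floods.":
        ["ROOT_self_VBN", "LEFT_nsubjpass_NN", "LEFT_aux_MD", "LEFT_auxpass_VB",
         "RIGHT_cc_CC", "RIGHT_conj_VBN", "RIGHT_punct_."],
}
for sentence, tokens in examples.items():
    print(is_action(tokens), sentence)
```

The length check guards against sentences whose parse produced fewer than two tokens.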


In [0]:
# Look for sentences that start with 'ROOT_self_VB RIGHT_dobj_NN*'
for i, tokens in enumerate(candidates):
    # need at least two parse tokens before checking the second one
    if len(tokens) > 1 and tokens[0] == 'ROOT_self_VB' and tokens[1].startswith('RIGHT_dobj_NN'):
        print(i, clean_sents[i], tokens, sent_files[i])


197 Use water-resistant materials and construction as appropriate. ['ROOT_self_VB', 'RIGHT_dobj_NNS', 'RIGHT_prep_IN', 'RIGHT_punct_.'] coastal_flooded_land_guidelines.txt
407 Identify the focus and objectives of a SAM initiative Step 2. ['ROOT_self_VB', 'RIGHT_dobj_NN', 'RIGHT_punct_.'] En56-226-2008-eng.txt
408 Assess present status and trends. ['ROOT_self_VB', 'RIGHT_dobj_NN', 'RIGHT_punct_.'] En56-226-2008-eng.txt
410 Develop a vision of the future. ['ROOT_self_VB', 'RIGHT_dobj_NN', 'RIGHT_punct_.'] En56-226-2008-eng.txt
456 Examine current development challenges, planning principles and capacities b. ['ROOT_self_VB', 'RIGHT_dobj_NNS', 'RIGHT_punct_,', 'RIGHT_conj_VBG'] En56-226-2008-eng.txt
459 Identify future development priorities based on the principles of local sustainability and community planning b. Assess impacts of climate change and the potential for adaptation and mitigation within community goals STEP 4 Set trajectories to meet priorities. ['ROOT_self_VB', 'RIGHT_dobj_N

In [0]:
# Look at a sample of sentences that do NOT start with 'ROOT_self_VB RIGHT_dobj_NN*'
for i in range(0, len(candidates), 200):
    tokens = candidates[i]
    matches_rule = len(tokens) > 1 and tokens[0] == 'ROOT_self_VB' and tokens[1].startswith('RIGHT_dobj_NN')
    if not matches_rule:
        print(i, clean_sents[i], tokens, sent_files[i])


0  Process Infrastructure Ports, Marine & Offshore Project No. ['ROOT_self_NNPS', 'LEFT_compound_NN', 'LEFT_compound_NNP', 'RIGHT_punct_,', 'RIGHT_conj_NNP'] coastal_flooded_land_guidelines.txt
200 Such infrastructure should be designed and constructed to remain operational during floods. ['ROOT_self_VBN', 'LEFT_nsubjpass_NN', 'LEFT_aux_MD', 'LEFT_auxpass_VB', 'RIGHT_cc_CC', 'RIGHT_conj_VBN', 'RIGHT_punct_.'] coastal_flooded_land_guidelines.txt
400 We are especially grateful to the Government of Canada_s Climate Change Impacts and Adaptation Program (CCIAP) (NRCan) for financial support for this Guidebook and the preparatory workshop for this publication (CCIAP project A-1439). ['ROOT_self_VBP', 'LEFT_nsubj_PRP', 'RIGHT_acomp_JJ'] En56-226-2008-eng.txt
600 The voices in my head that don_t want to be seen to always be a bother, that want to be liked, that are also tired and panicked at the enormity of the global warming issue, convinced me, against my true better judgment, to be silent.

###Discussion

Almost all the sentences found by the first cell can be considered actions of some kind, although some are actions that only trusted authorities could do.

The sentences found by the second cell do contain some actions (for example #1000), but they seem to be only a small fraction of the sample.

The hard-coded rule therefore seems to be a viable way of screening for actions in text documents.





##Logistic Regression with Fewer Features



I trained another Logistic Regression model, this time applying Truncated SVD to reduce the number of features. Starting from the original 225 features, I reduced to 100, 50, 20, and 5 components in turn. As the number of features decreased, the model classified more and more of the sentences as actions.
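
One detail matters when combining a dimensionality reduction step with a classifier: the projection learned on the training data has to be reused unchanged at prediction time, otherwise the reduced features no longer mean the same thing to the classifier. A minimal sketch of one way to guarantee this is a scikit-learn `Pipeline`. The random matrices below are hypothetical stand-ins for TF-IDF vectors, and the sizes (6 features, 2 components) are chosen only to keep the example small.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Hypothetical stand-ins for TF-IDF vectors: 40 training and 4 test rows, 6 features
X_train = rng.random((40, 6))
y_train = np.array(["action", "non_action"] * 20)
X_test = rng.random((4, 6))

model = Pipeline([
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)     # fits the SVD and the classifier on training data only
pred = model.predict(X_test)    # applies the already-fitted SVD, then classifies
print(pred)
```

Pickling the whole pipeline, rather than only the classifier, would also keep the fitted SVD together with the model it belongs to.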

####Train the model

In [0]:
# where to save the models
%cd '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/FinalModel'

# use the whole training corpus
docs=corpus
labs=labels

# fit the Vectorizer to the training data and save it
v=GensimTfidfVectorizer('.', True) 
v.fit(docs)

# use the Vectorizer to transform the training data
vecs=v.transform(docs)

# Truncated SVD
t=TruncatedSVD(n_components=5)
vecs_reduced=t.fit_transform(vecs)

# fit the Classifier to the reduced training vectors and save it
m=LogisticRegression()
m.fit(vecs_reduced, labs)
with open('LRClassifier.model', 'wb') as f:
    pickle.dump(m, f)


/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/FinalModel




In [0]:
# show the vectors
print(vecs_reduced[0])
print("...")    
print(vecs_reduced[-1])

[ 0.3161857   0.13685243 -0.03990754  0.14029232 -0.115687  ]
...
[ 0.1383982  -0.18315855 -0.13795115 -0.10512668  0.02034208]


####Predict whether the sentences are actions using the model



In [0]:
# where to get the models from if they were not already in memory
#%cd '/gdrive/My Drive/Colab Notebooks/3666 ANLP/Project/FinalModel'
#v = GensimTfidfVectorizer('.', True)
#m = pickle.load(open('LRClassifier.model', 'rb'))

# test data has been previously read in and parsed
docs=candidates

# Vectorizer
v=GensimTfidfVectorizer('.', True) 
vecs=v.transform(docs)

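# Truncated SVD
# NOTE: this fits a new SVD on the test vectors, so the 5 components are not
# the ones the classifier was trained on; t.transform(vecs) with the SVD
# fitted in the training cell would keep the two feature spaces aligned.
t=TruncatedSVD(n_components=5)
vecs_reduced=t.fit_transform(vecs)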

# Predict
result = m.predict(vecs_reduced)



In [0]:
# Print only the sentences classified as actions
if 'action' in result:
    print("Found actions ...")

    for i in range(len(result)):
        if result[i] == 'action':
            print("{}:  {}".format(i, clean_sents[i]))

else:            
    print("No actions found.")

Found actions ...
0:   Process Infrastructure Ports, Marine & Offshore Project No.
1:  143111 Revision Number 0 BC Ministry of Environment Climate Change Adaption Guidelines for Sea Dikes and Coastal Flood Hazard Land Use Guidelines for Management of Coastal Flood Hazard Land Use 27 January 2011 DISCLAIMER: This document is for the private information and benefit only of the client for whom it was prepared and for the particular purpose previously advised to Ausenco Sandwell.
3:  Particular financial and other projections and analysis contained herein, to the extent they are based upon assumptions concerning future events and circumstances over which Ausenco Sandwell has no control, are by their nature uncertain and are to be treated accordingly.
4:  Ausenco Sandwell makes no warranties regarding such projections and analysis.
6:  Copyright to this document is wholly reserved to Ausenco Sandwell.
7:  Revision Status Revision Date Description Contributors Reviewer FirstName LastName Pos

In [0]:
# Print a sample of sentences with their labels
for i in range(0, min(5000, len(result)), 100):
    print("{}:  {}".format(result[i], clean_sents[i]))


action:   Process Infrastructure Ports, Marine & Offshore Project No.
action:  This approach will minimize the initial costs of considering SLR, and the future costs of adaptation.
action:  Such infrastructure should be designed and constructed to remain operational during floods.
action:  1.16 Sea Dike System A system of: dikes, dunes, berms or natural shorelines that provide a similar function; and associated engineering works (e.g., tidal gates, outfalls, outlet structures, seawalls, quay walls, ramps, adjacent building features, etc.)
non_action:  We are especially grateful to the Government of Canada_s Climate Change Impacts and Adaptation Program (CCIAP) (NRCan) for financial support for this Guidebook and the preparatory workshop for this publication (CCIAP project A-1439).
action:  Over the years since the United Nations Framework Convention on Climate Change was first signed in Rio de Janeiro in 1992, there have been a number of efforts led by national governments attempting t

##Conclusion

The best approach found so far for identifying actions in PDF documents is:


1.   Convert the PDF to a text file using pdfminer or tika
2.   Tokenize into sentences using nltk sent_tokenize
3.   Parse each sentence using spaCy 
4.   Convert the spaCy parse to tokens using the parse function provided
5.   Look for sentences that start with 'ROOT_self_VB RIGHT_dobj_NN*' 
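
The way steps 2 to 5 fit together can be sketched as a single generator that takes the sentence splitter and the parser as arguments. Everything named here is hypothetical: `find_actions`, `naive_split` and `KNOWN_PARSES` are stand-ins so that the example runs without nltk or spaCy. In practice `sent_tokenize` and the notebook's spaCy-based parse function would be passed in instead.

```python
def find_actions(text, split_sentences, parse_to_tokens):
    """Yield (sentence, tokens) for every sentence whose parse matches the rule."""
    for sentence in split_sentences(text):
        tokens = parse_to_tokens(sentence)
        if (len(tokens) >= 2
                and tokens[0] == "ROOT_self_VB"
                and tokens[1].startswith("RIGHT_dobj_NN")):
            yield sentence, tokens

# Stand-in for nltk's sent_tokenize: splits on full stops only
def naive_split(text):
    return [s.strip() + "." for s in text.split(".") if s.strip()]

# Stand-in for the spaCy-based parser: parse tokens copied from this notebook's output
KNOWN_PARSES = {
    "Develop a vision of the future.":
        ["ROOT_self_VB", "RIGHT_dobj_NN", "RIGHT_punct_."],
    "Such infrastructure should be designed and constructed to remain operational during floods.":
        ["ROOT_self_VBN", "LEFT_nsubjpass_NN", "LEFT_aux_MD", "LEFT_auxpass_VB",
         "RIGHT_cc_CC", "RIGHT_conj_VBN", "RIGHT_punct_."],
}

def lookup_parse(sentence):
    return KNOWN_PARSES.get(sentence, [])

text = ("Develop a vision of the future. Such infrastructure should be designed "
        "and constructed to remain operational during floods.")
actions = list(find_actions(text, naive_split, lookup_parse))
for sentence, tokens in actions:
    print(sentence, tokens)
```

Passing the stages in as callables keeps the rule itself independent of which PDF converter, tokenizer, or parser produced the input.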

