**********************************************
*                                            *
*             ------------------             *
*             PLTAG v.2.5.0-rel              *
*             ------------------             *
*                                            *
*       Psycholinguistically Motivated       *
*       Tree-Adjoining Grammar Parser        *
*                                            *
*                  Authors                   *
*                  -------                   *
*  Vera Demberg, vera@coli.uni-saarland.de   *
*  Ioannis Konstas, ikonstas@inf.ed.ac.uk    *
*  Frank Keller, keller@inf.ed.ac.uk         *
*                                            *
**********************************************
Thanks for downloading the Psycholinguistically Motivated Tree-Adjoining Grammar (PLTAG) Parser.
This package contains a fully incremental PLTAG parser, with incremental semantic role labelling (iSRL) and discriminative reranking
extensions, as described in the publications below. If you use or extend this package in your work (and you are most
welcome to do so), please cite:
V. Demberg, F. Keller, A. Koller, 2013. Incremental, Predictive Parsing with Psycholinguistically Motivated Tree-Adjoining Grammar, CL 2013.
I. Konstas, V. Demberg, F. Keller, and M. Lapata, 2014. Incremental Semantic Role Labeling with Tree Adjoining Grammar, EMNLP 2014.
I. Konstas and F. Keller, 2015. Semantic Role Labeling Improves Incremental Parsing, ACL 2015.
We have included a set of scripts that cover all the functionality described in these papers. An extensive list of
features can be found in FEATURES.txt.
1. INSTALLATION
The PLTAG parser can be installed by simply extracting the contents of the zipped file into an empty folder.
You should now see the following structure:
bin/          contains the parser executable (PLTAG.jar),
data/         contains the trained lexicon and parameters (trained on WSJ sections 02-21),
input/        contains sample input data,
lib/          contains all the libraries PLTAG needs to run,
output/       empty folder with sample output,
resources/    contains scripts and text resources for lexicon extraction,
scripts/      contains a series of scripts that cover the core functionalities of the parser,
README.txt    this file,
FEATURES.txt  contains an extensive list of features implemented in the code,
LICENSE.txt   GNU General Public License; the license that applies to this package.
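If you want to confirm that the archive was extracted completely, a small check along the following lines may help. This is only a sketch and not part of the distribution: the function name missing_entries is made up for illustration, and the expected entries are simply those listed above.

```python
from pathlib import Path

# Entries taken from the directory listing in this README.
EXPECTED_DIRS = ["bin", "data", "input", "lib", "output", "resources", "scripts"]
EXPECTED_FILES = ["README.txt", "FEATURES.txt", "LICENSE.txt", "bin/PLTAG.jar"]


def missing_entries(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    # Directories are reported with a trailing slash to tell them apart from files.
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

Calling missing_entries(".") from the folder you extracted into returns an empty list when nothing from the listing is missing.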
2. GETTING STARTED
Most likely, you will just need to run the trained parser (scripts/pltag_parser.sh) for PLTAG parsing and iSRL with the trained
argument identifier/labeller classifiers (scripts/pltag_iSRL.sh).
For the discriminative reranker model you can run scripts/pltag_parser_discriminative.sh. NOTE: this is a memory-hungry process,
so it is intended mainly for experiments and extensions!
For a full explanation of what each parameter in the script does, read below in the USAGE sections 3.2.2, 3.2.3 and 3.3.3.
3. USAGE
3.1 Extract PLTAG Lexicon and Gold standard PLTAG trees
The PLTAG parser is a lexicalised parser that uses two types of lexica, namely a standard TAG lexicon with elementary trees,
and a -unique to PLTAG- prediction lexicon that contains unlexicalised trees only. In order to extract both lexica from a standard corpus
such as the Penn Treebank Wall Street Journal (WSJ) corpus you need to run the script:
> scripts/create_pltag_corpus.sh
Note that you need several resources that are not included in this distribution; they are usually distributed by the LDC.
More specifically, in order to extract lexica for WSJ you need the corpus itself (v.3) and the NP-bracketing patch by Vadas and Curran, 2007.
You need to have the original structure of the dataset that consists of .mrg files containing documents in common LISP format, split into 24 folders or tracks.
The application of the patch should leave the directory structure intact.
You also need the Proposition Bank annotation (PropBank; Palmer et al., 2005) and NomBank (Meyers et al., 2007).
The current release contains samples for each of these resources. Change the script to point to the correct paths.
More information on the parameters used is included in the script itself.
The script also creates gold standard PLTAG trees for each sentence found in the .mrg files. These can be used either for testing or training the parser.
We have extracted the PLTAG lexica from sections 02-21 of WSJ, with and without semantic (PropBank) annotation. They can be found under data/lexicon/.
3.2 Generative - PLTAG model
3.2.1 Training
In order to train the generative PLTAG model you need to either use the extracted lexicon found under data/lexicon or the one extracted using the process described above.
You also need to extract the gold standard trees and the corresponding lexicon entries for each example of the training set in a single file, using the script described above and setting the parameter:
examplesInSingleFile=true
To train the model, run the script:
> pltag_generative_train.sh
3.2.2 Parsing
In order to parse with the generative PLTAG model you need to either use the extracted lexicon found under data/lexicon or the one extracted using the process described above.
To parse you simply have to run the script:
> pltag_parser.sh
There are various options you need to change in the script to accommodate your needs, including the input, output directories, etc.
More specifically:
- memory, corresponds to the maximum heap size allocated to the JVM
- numThreads, is the number of threads to use. When set higher than 1, multiple threads are spawned to parse each sentence.
- execDir, refers to the output execution directory
- inputPath, is the path to the file containing sentences to be parsed
- inputType, takes one of four values: plain, posTagged, pltag and dundee.
* plain reads tokenised sentences in the input file
* posTagged reads tab-delimited tokenised sentences made up of POS-tag/word pairs, e.g., DT The NN player ...
* pltag reads sentences in the following format:
Example_id
tokenised sentence
posTagged sentence
bracket Lisp-style gold standard parse
* dundee reads tokenised sentences, that consist of word-id pairs
See the accompanying examples in the input folder.
- goldPosTags, takes POS tags from the input (posTagged and pltag inputType) or from the Stanford POS tagger run on the input sentence
(plain and dundee inputType), instead of predicting them
- beamSize, refers to the beam size of the parser. Higher values generally increase performance, but require more memory and slow down the parser. 400 is the default value.
- nBest, is the size of the n-best list stored at each chart cell. 250 is the default value.
- paramsPath, is the path to the probability model parameters, trained using the script 'pltag_generative_train.sh'.
- lexiconPath, is the path to the lexicalised and prediction trees, extracted as explained above in the 'Extract PLTAG Lexicon and Gold standard PLTAG trees' section.
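As an illustration of the posTagged input type described above, the following sketch converts between (word, tag) pairs and a line of the form "DT The NN player". It is not part of the package: the function names are hypothetical, and it ASSUMES that every field (tag or word) is separated by a single tab. Check the samples in the input folder before relying on that assumption.

```python
def to_pos_tagged_line(tagged_tokens):
    """Join (word, tag) pairs into one tab-delimited 'TAG word TAG word ...' line.

    Assumption: a single tab separates every field; verify against input/ samples.
    """
    fields = []
    for word, tag in tagged_tokens:
        fields.extend([tag, word])  # the tag comes first, as in "DT The NN player"
    return "\t".join(fields)


def from_pos_tagged_line(line):
    """Split a posTagged line back into (word, tag) pairs."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) % 2 != 0:
        raise ValueError("expected an even number of tab-separated fields")
    return [(fields[i + 1], fields[i]) for i in range(0, len(fields), 2)]
```

For example, to_pos_tagged_line([("The", "DT"), ("player", "NN")]) produces the four fields DT, The, NN, player joined by tabs.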
3.2.2.1 Output and Evaluation
PLTAG parser can be set to generate output in three different formats:
- difficulty scores,
- prefix trees,
- full sentence parse tree.
The output format can be controlled by the two following parameters:
- estimateProcDifficulty, generates syntactic surprisal, verification and combined scores per word
- printIncrementalDeriv, generates prefix trees. NOTE you need to set estimateProcDifficulty=true as well.
NOTE that the full sentence parse tree is always printed at the end of each example sentence.
In terms of evaluation the parser can automatically compute model log score, sentence F1, evalb F1, and incremental evalb F1 scores.
NOTE that you need:
- an input file that contains gold standard input trees (inputPath parameter),
- and to set the inputType parameter to 'pltag'.
Incremental evalb F1 scores are computed when:
- evaluateIncrementalEvalb=true
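For readers unfamiliar with bracket-based F1, the sketch below shows the basic arithmetic behind such scores. It is a generic illustration, not the parser's own evaluation code and not a replacement for evalb, which typically applies additional conventions that are ignored here. The function name and the (label, start, end) representation of a bracket are assumptions made for this example.

```python
from collections import Counter


def bracket_f1(gold_brackets, predicted_brackets):
    """Return (precision, recall, F1) over labelled brackets.

    Each bracket is a (label, start, end) tuple; duplicates are counted
    as in a multiset.
    """
    gold = Counter(gold_brackets)
    pred = Counter(predicted_brackets)
    matched = sum((gold & pred).values())  # multiset intersection
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

An incremental variant would apply the same computation to the brackets of each prefix tree rather than to the full sentence tree only.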
There are also a few more options that can be set or unset (see the bottom of the script) that enable additional output-related functionality in the parser:
- interactiveMode, sets the parser in interactive mode and accepts input from the console. Simply enter tokenised sentences. This is the DEFAULT mode of the PLTAG parser.
- outputExampleFreq, sets how often progress information is written to stdout
- outputFullPred, outputs difficulty scores (if estimateProcDifficulty=true), prefix trees (if printIncrementalDeriv=true), and parse tree output. Normally, leave it on.
The parser creates a folder as set in the 'execDir' parameter, that contains at least the following files:
info.map contains generic host information, such as hostname, number of CPU cores used, memory, etc.
log outputs various logging information (mostly what is printed on the stdout) indicating the progress of the parser
options.map contains all the options used by the parser along with a short description
results.performance outputs log scores, accuracy, evalb, etc., depending on whether the user is evaluating against a gold standard file (pltag inputType only)
results.evalb contains the incremental evalb F1 scores for each word (if evaluateIncrementalEvalb is set to true)
test.full-pred-gen contains the output of the parser per example. If 'estimateProcDifficulty' is set to true then it contains syntactic surprisal scores for each
word and the syntactic tree of the full sentence in the end in Common LISP format.
time.map outputs logging info on the amount of time it took to parse the input dataset
3.2.3 Incremental Semantic Role Labeling: iSRL
In order to extract incremental Semantic Role Labels (iSRLs) you need to run the script:
> pltag_iSRL.sh
The script essentially replicates the standard PLTAG parser and extends it with the additional functionality
of outputting SRL triples in tandem with the prefix trees/full constituency tree. Hence, most parameters remain the
same, apart from the paths to the trained argument identifier and labeller classifiers, and an optional input file in
CoNLL format, needed only if you want to evaluate iSRL performance.
Here is a list of the additional parameters that apply to iSRL only:
- inputPathConll, is the path to the test set in CoNLL format. Used for evaluating the iSRL model only.
- argumentIdentifierModel, points to the trained argument identifier model,
- argumentLabellerModel, points to the trained argument labeller model,
- lexiconType, refers to the degree to which the lexicon and the SRL annotation are used by the parser in order to predict iSRLs.
There are three different options: oracle, oracleAllRoles, parsedAllRoles.
* oracle directs the parser to use gold standard lexicon entries for the input examples and gold standard oracle semantic role labels
(use ONLY in conjunction with a single test file that contains gold PLTAG trees and lexicon entries,
and a CoNLL input file that contains gold SRL annotation)
* oracleAllRoles lets the parser use gold standard lexicon entries only and all annotated SRL roles with ambiguity
* parsedAllRoles (DEFAULT) gives the parser complete freedom to use the full lexicon and full SRL annotations. This is the default mode for the parser,
as in the case of running the 'pltag_parser.sh' script.
- applyConllHeuristics (DEFAULT=true), applies heuristics adopted in the CoNLL 2008/9 annotation: preposition is head of a PP, infinitive marker
(IM) is head of VPs, subordinating conjunction (IN/DT) is the head in SBARs.
- useFullLexicon (DEFAULT=true), should be used in conjunction with the choice in the lexiconType option: Set to true if lexiconType=parsedAllRoles, i.e.,
in normal parser mode, or false otherwise.
- printIncrementalDependencies, outputs incremental (in)complete semantic role triples (used in conjunction with gold standard/CoNLL input).
- printIncrementalCompleteDependencies, outputs only complete incremental semantic role triples (used in conjunction with gold standard/CoNLL input).
- evaluateIncrementalDependencies, outputs various metrics for evaluating incremental semantic role triples (used in conjunction with gold standard/CoNLL input).
The iSRL triples in the output file (test.full-pred-gen) or at the console (if running in interactive mode) should look like this:
<73:VP:t=19,[ARGM-TMP],79:Oct.:t=19,75:occurred:t=18>
<8:NP:t=5,[ARG1, ARG0, ARG2],8:decline:t=4,null>
Each tuple contains the following information:
<integration-point_node_id:integration-point_node:t=timestamp, [list of labels], arg_node_id:arg:t=timestamp,pred_node_id:pred:t=timestamp>
Note that incomplete tuples (second example) may contain more than one candidate label. This is normal: in an incomplete triple the parser has not yet observed
both the argument and the predicate, so there is more ambiguity as to which role should be assigned to the dependency.
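If you need to post-process these triples, the tuple format above can be read with a few lines of code. The following is a minimal sketch based only on the two example lines shown above; it is not shipped with the parser, and all names in it (Node, Triple, parse_triple, is_complete) are hypothetical. It ASSUMES that a missing element is always written as the literal null and that node ids and timestamps are integers.

```python
import re
from typing import List, NamedTuple, Optional


class Node(NamedTuple):
    node_id: int
    label: str      # node category or word, e.g. "VP" or "occurred"
    timestamp: int


class Triple(NamedTuple):
    integration_point: Optional[Node]
    roles: List[str]
    argument: Optional[Node]
    predicate: Optional[Node]


# One element is either the literal "null" or "id:label:t=timestamp".
_NODE = r"(?:null|\d+:.*?:t=\d+)"
_TRIPLE_RE = re.compile(
    rf"^<(?P<ip>{_NODE}),\s*\[(?P<labels>[^\]]*)\],\s*"
    rf"(?P<arg>{_NODE}),\s*(?P<pred>{_NODE})>$"
)


def _parse_node(text):
    if text == "null":
        return None
    node_id, rest = text.split(":", 1)
    label, timestamp = rest.rsplit(":t=", 1)
    return Node(int(node_id), label, int(timestamp))


def parse_triple(line):
    """Parse one iSRL triple line into a Triple; raise ValueError if malformed."""
    match = _TRIPLE_RE.match(line.strip())
    if match is None:
        raise ValueError("not an iSRL triple: " + line)
    roles = [r.strip() for r in match.group("labels").split(",") if r.strip()]
    return Triple(
        _parse_node(match.group("ip")),
        roles,
        _parse_node(match.group("arg")),
        _parse_node(match.group("pred")),
    )


def is_complete(triple):
    """A triple is complete once both argument and predicate have been observed."""
    return triple.argument is not None and triple.predicate is not None
```

With this sketch, the first example line above parses into a complete triple with the single role ARGM-TMP, while the second yields three candidate roles and no predicate.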
iSRL creates a few additional files in the folder as set in the 'execDir' parameter:
results.ups contains the Unlabeled Prediction Score per word. This is the unlabelled attachment score between arguments and their corresponding predicate
results.ciss contains the Combined Incremental SRL Score, which measures the identification of complete semantic role triples (i.e., correct predicate, predicate
sense, argument, and role label) per word; by the end of the sentence, CISS coincides with standard combined SRL accuracy,
as reported in the CoNLL 2009 SRL-only task.
test.full-pred-gen.conll contains the predicted iSRL output at the sentence level in CoNLL format.
gold.full-pred-gen.conll contains the gold standard CoNLL output for the sentences that were parsed, for easier reference.
test.completed-dependencies contains complete-only incremental SRL triples. The format of each line in the file is as follows:
<pred,[role],arg> id:integration-point_timestamp word: integration-point_POS integration-point_word
Example: <occurred,[ARG1],crash> id: 18 word:VB occurred
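Lines in test.completed-dependencies can be read in a similar way. The sketch below is derived only from the format line and the single example above; it is not part of the package, and the function name and dictionary keys are hypothetical. Because the spacing after "id:" and "word:" differs between the format description and the example, the pattern tolerates optional whitespace in both places.

```python
import re

# Tolerates optional whitespace after "id:" and "word:".
_COMPLETED_RE = re.compile(
    r"^<(?P<pred>.+?),\[(?P<role>[^\]]*)\],(?P<arg>.+)>\s*"
    r"id:\s*(?P<timestamp>\d+)\s+word:\s*(?P<pos>\S+)\s+(?P<word>\S+)\s*$"
)


def parse_completed_dependency(line):
    """Parse one line of test.completed-dependencies into a dict."""
    match = _COMPLETED_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised line: " + line)
    return {
        "predicate": match.group("pred"),
        "role": match.group("role"),
        "argument": match.group("arg"),
        "timestamp": int(match.group("timestamp")),  # integration-point timestamp
        "pos": match.group("pos"),                   # integration-point POS tag
        "word": match.group("word"),                 # integration-point word
    }
```

Applied to the example line above, this returns the predicate occurred, the role ARG1, the argument crash, timestamp 18, POS tag VB and word occurred.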
3.3 Discriminative Reranking - PLTAG model
If you just want to use the already trained parser (recommended), go straight to 3.3.3. For the bolder among you, follow these steps:
3.3.1 Extracting Features for Incremental Analyses (Offline Process)
- We first need to extract predicted features from each incremental analysis in the chart for each example in the training set (such as the WSJ sections 02-21),
and save them to disk. This is a lengthy process as it involves parsing the whole training set using the generative PLTAG parser. On top of this,
some features (syntactic features) require traversing most entries in the chart (see Konstas and Keller, 2015 for exact details),
and constructing analyses for each word, which is an extra processing burden.
- Since we need to train a discriminative model (we have implemented the Averaged Structured Perceptron, as described in Collins, 2002,
"Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms"), we also need labels for each training instance.
Hence we need to also extract incremental analyses using an oracle model (again see Konstas and Keller, 2015 for the exact details), i.e., parse the training
set using gold standard lexicon entries (much faster process).
So to recap, remember that you have to extract two sets of training features from incremental analyses!
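To make the role of the two feature sets more concrete, here is a generic averaged perceptron update in the style of Collins (2002). It is a simplified sketch and NOT the Java implementation in this package: the function names, the feature-dictionary representation and the number of epochs are assumptions for illustration. The oracle features act as the label, and the candidate list stands in for the predicted incremental analyses.

```python
from collections import defaultdict


def score(weights, features):
    """Dot product between a weight mapping and a sparse feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def train_averaged_perceptron(examples, epochs=5):
    """Train on (oracle_features, candidate_features_list) pairs.

    Returns the weights averaged over all update steps as a plain dict.
    """
    weights = defaultdict(float)
    totals = defaultdict(float)  # running sum of the weight vector, for averaging
    steps = 0
    for _ in range(epochs):
        for oracle, candidates in examples:
            # Pick the highest-scoring candidate under the current weights.
            best = max(candidates, key=lambda feats: score(weights, feats))
            if best != oracle:
                # Move towards the oracle features and away from the wrong prediction.
                for name, value in oracle.items():
                    weights[name] += value
                for name, value in best.items():
                    weights[name] -= value
            steps += 1
            for name, value in weights.items():
                totals[name] += value
    if steps == 0:
        return {}
    return {name: total / steps for name, total in totals.items()}
```

Averaging the weights over all steps, rather than keeping only the final vector, is the usual way to reduce the perceptron's sensitivity to the order of the training examples.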
To help you out, we have created two separate scripts (almost identical, apart from the parameters described below).
Here is the list of common parameters you need to set only once in each script:
- inputPath the path to a single file containing the PLTAG training examples,
- inputPathConll the path to the CoNLL version of the training examples,
- incrAnalysesFeaturesPath the path where the incremental analyses features will get saved (automatically split to segments),
- discriminativeFeatureIndexers the path to the dictionaries for the various features extracted.
3.3.1.1 Predicted Feature Extraction
Run the script:
> pltag_saveIncrAnalyses_predicted.sh
and make sure these parameters are set like this:
- lexiconType parsedAllRoles (see iSRL section 3.2.3 for a full description)
- parserType generative, i.e., use the baseline PLTAG parser with the full lexicon
- extendIndexers false. This directive essentially creates dictionaries for the various features extracted.
- useFullLexicon true. Use full lexica during parsing.
3.3.1.2 Oracle Feature Extraction
Run the script:
> pltag_saveIncrAnalyses_oracle.sh
and make sure these parameters are set like this:
- lexiconType oracle (see iSRL section 3.2.3 for a full description)
- parserType oracle, i.e., use the gold standard lexicon to extract near-perfect prefix trees,
- extendIndexers true. This directive loads the dictionaries stored in the process above and extends them with (potentially) new entries
coming from the oracle features. It will keep the old dictionaries and automatically create new ones.
- useFullLexicon false, i.e., use the gold standard lexicon to extract near-perfect prefix trees,
- discriminativeFeatureIndexers make sure it points to the correct path set in the previous step! Watch out for any path suffixes produced automatically by the parser.
3.3.2 Training the Perceptron Reranker (Offline Process)
3.3.3 Parsing
Copyright 2015, Ioannis Konstas, University of Edinburgh.
-------
HISTORY
-------
v.2.5.0: Added discriminative reranking support using PLTAG, Tree and SRL features.
v.2.2.0: Added web service output.
v.2.1.0: Added incremental Semantic Role Labelling support.
v.2.0.1: Added multi-thread support in conjunction with estimation of processing difficulty scores.
v.2.0.0: Major refactoring and speed-ups. Included incremental semantic role labeling, multiple input/output formats, interactive mode.
v.1.0.0: First release of code.