/*
* Copyright 1999, 2000, 2001, Brown University, Providence, RI.
*
* All Rights Reserved
*
* Permission to use, copy, modify, and distribute this software and its
* documentation for any purpose other than its incorporation into a
* commercial product is hereby granted without fee, provided that the
* above copyright notice appear in all copies and that both that
* copyright notice and this permission notice appear in supporting
* documentation, and that the name of Brown University not be used in
* advertising or publicity pertaining to distribution of the software
* without specific, written prior permission.
*
* BROWN UNIVERSITY DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
* INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY
* PARTICULAR PURPOSE. IN NO EVENT SHALL BROWN UNIVERSITY BE LIABLE FOR
* ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
* WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
* ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
* OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*/
This parser takes ASCII text with sentences delimited by <s> ... </s>,
and outputs the parsed versions in Penn treebank style. So if the input is
<s> (``He'll work at the factory.'') </s>
the output will be:
(S1 (PRN (-LRB- -LRB-)
     (S (`` ``)
      (NP (PRP He))
      (VP (MD 'll)
       (VP (VB work) (PP (IN at) (NP (DT the) (NN factory)))))
      (. .)
      ('' ''))
     (-RRB- -RRB-)))
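To prepare input in the <s> ... </s> format shown above, something like the
following sketch may help. It assumes the raw text already has one sentence
per line, which the parser itself does not require; the function name
wrap_sentences is made up for this illustration.

```shell
# Wrap each non-empty line of a plain-text file in <s> ... </s> markers.
# Assumption: the file named by $1 holds exactly one sentence per line.
wrap_sentences() {
    awk 'NF > 0 { print "<s> " $0 " </s>" }' "$1"
}
```

For example, wrap_sentences mytext.txt > mytext.raw would produce a file that
can be given to parseIt as its <text file> argument (both file names here are
placeholders).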
The parser is the one described in the paper "A maximum-entropy-inspired
parser" by Eugene Charniak, Brown TR CS99-12. One parameter was adjusted
to produce parses at an average rate of about 2 seconds per sentence
on my 450 MHz Sun (after about 45 seconds to load all of the data
files). This version has a precision/recall of about 89.8% on the
standard Penn treebank test set, about 0.3% lower than what can be
achieved with significantly more search.
The program is built from this distribution by:
    make parseIt
The program is run from this directory by:
    parseIt <path to data directory> <text file>
e.g.,
    parseIt DATA/ DATA/test.raw
By default this version ignores any sentence of more than 70 words
and punctuation marks. To change the limit to, say, 80, give it the
command-line argument -l80. Currently, various array sizes make 99
the absolute maximum sentence length.
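To get a rough idea of which sentences would be skipped under a given
length limit, one can count tokens before running the parser. The sketch
below is only an approximation: it assumes each sentence sits on one line
with <s> and </s> as separate whitespace-delimited tokens, and it counts
whitespace-separated tokens, whereas the parser counts words and punctuation
after its own tokenization (so the real count is usually higher). The
function name long_sentences is hypothetical.

```shell
# Print "<sentence number>: <token count>" for every sentence whose
# whitespace-separated token count exceeds the limit.
# $1: file with one "<s> ... </s>" sentence per line (assumption)
# $2: length limit, e.g. 70
long_sentences() {
    awk -v limit="$2" '
        /<s>/ {
            n++                 # running sentence number
            count = NF - 2      # do not count the <s> and </s> markers
            if (count > limit) print n ": " count
        }' "$1"
}
```

For example, long_sentences DATA/test.raw 70 lists the sentences that are
likely to be ignored at the limit of 70.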
To see debugging information, give it the command-line argument -d#,
where # is a number > 5. As the number gets larger, the verbosity of
the information increases.
*************************************************
Note for the 2000 Release:
This version differs in a few ways from the 1999 version.
1) Some bugs in the previous version have been fixed.
2) The statistics for guessing the preterminals of unknown words have
been slightly improved.
3) This version compiles under both Sun's C++ compiler and GNU's, although
at the moment there are some rough spots. It is currently
set to compile under Solaris. To switch this to GNU's:
a) use makefile.gnu
b) in the file ECString.h change the line
#define ECS sun
to
#define ECS gnu
*************************************************
*************************************************
Note for the version nllparser (2001)
This version differs from the 2000 release in a few ways:
1) Only GNU (g++) is supported, so there is no makefile.gnu, just
makefile.
2) I made a trivial improvement that should make the parser about three
times faster than the 2000 version.
Other than the speed, I expect this version to be identical in
performance to the previous release. If I do not hear anything
to the contrary, the 2000 version (nlparser) will go away in
a few months and this version will replace it.
*************************************************
*************************************************
Note for the 2001 Release
I have now removed the old version of the parser and am only
distributing the version that runs under gnu g++. Too many
people were having too many problems with the Solaris version.
Thus this version is the descendant of the 2001 release of nllparser.
It differs from that primarily in its ability to handle deviant
sentences. In particular:
a) The maximum sentence length has been increased to 399 words and
punctuation. Without any command line argument the parser ignores
sentences of length > 80. To increase this to, say, 250, add "-l 250"
as a command line argument to parseIt.
b) The parser now goes into a "last ditch parsing" mode if it at first
fails to parse the sentence. This will attempt to find some parse
at the expense of accuracy.
c) If the parser still fails to find a parse, it should (most of the
time, I hope) send a message to cerr saying "<Map>Parse failed on: ..."
and then just go on to the next sentence.
d) VERY IMPORTANT. MANY IMPLEMENTATIONS OF UNIX HAVE VERY LOW DEFAULT
STACK SIZE LIMITS. THIS CAN CAUSE MY PARSER TO ABORT. IT IS
GENERALLY A GOOD IDEA TO SET THE STACK SIZE TO "unlimited" BEFORE YOU
RUN THE PARSER.
e) Lastly, the memory leaks in the previous versions have been eliminated.
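Points (a), (c) and (d) above can be combined in a small wrapper along the
following lines. This is a sketch under several assumptions, not part of the
distribution: it assumes parseIt writes parses to standard output, that the
"-l 250" option may be placed before the data directory, and that the shell's
ulimit builtin accepts "-s unlimited". The function name run_parser and the
PARSEIT variable are invented for this example.

```shell
# Run the parser with a raised stack limit and a longer sentence limit,
# keeping parses and error messages in separate files.
# $1: data directory   $2: input text file
# $3: file for parses  $4: file for messages written to cerr
# PARSEIT (hypothetical) overrides the path to the parser binary.
run_parser() {
    (
        # Raise the stack limit in a subshell only, so the caller's
        # limits are untouched; warn if the system refuses.
        ulimit -s unlimited 2>/dev/null ||
            echo "warning: could not raise the stack size limit" >&2
        "${PARSEIT:-./parseIt}" -l 250 "$1" "$2" > "$3" 2> "$4"
    )
}
```

After a run such as run_parser DATA/ DATA/test.raw parses.out parse.err, the
command grep -c "Parse failed on" parse.err gives the number of sentences
for which even last ditch parsing found nothing.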
*************************************************