Disclaimer: while this glue code is provided under a BSD license, SENNA is not. Please refer to the SENNA license.
This interface supports Part-of-speech tagging, Chunking, Named Entity Recognition and Semantic Role Labeling.
Because SENNA is shipped under its own license, we do not include it in this repository. You thus need to follow these steps to install the SENNA LuaJIT interface:
- Clone the SENNA LuaJIT interface:
git clone https://github.com/torch/senna.git
- Go into the created directory:
cd senna
- Get SENNA. You must accept the license to proceed further.
- Unpack the SENNA archive into the git directory.
- Run luarocks:
luarocks make rocks/senna-scm-1.rockspec
We provide an example usage called senna.run. It writes tags to stdout for anything coming in on stdin.
Typical usage:
luajit -lsenna.run < file_to_tag.txt > tags.txt
Typical output:
echo "The Dow Jones industrials closed at 2569.26 ." | luajit -lsenna.run
The DT (NP* * - (A1*
Dow NNP * (MISC* - *
Jones NNP * *) - *
industrials NNS *) * - *)
closed VBD (VP*) * closed (V*)
at IN (PP*) * - (AM-EXT*
2569.26 CD (NP*) * - *)
. . * * - *
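The column output above can be post-processed with standard tools. The following is a minimal sketch, not part of the interface: the helper names word_pos and verbs are hypothetical, and the column layout (word, POS tag, chunk tag, NER tag, verb marker, SRL tags) is an assumption inferred from the sample output rather than a documented format.

```shell
#!/bin/sh
# Hypothetical helpers for post-processing senna.run output.
# Assumed layout (inferred from the sample output, not documented):
#   column 1: word, 2: POS tag, 3: chunk tag, 4: NER tag,
#   column 5: verb marker ("-" when the word is not a verb), 6: SRL tag.

# Print "word<TAB>POS tag" for every line with at least two columns.
word_pos() {
    awk 'NF >= 2 { print $1 "\t" $2 }'
}

# Print the words whose verb marker (fifth column) is not "-".
verbs() {
    awk 'NF >= 5 && $5 != "-" { print $1 }'
}
```

Assuming the helpers are defined in the current shell, `luajit -lsenna.run < file_to_tag.txt | word_pos` would keep only the word and POS columns, and piping the same output into `verbs` would list the words marked as verbs.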
Please look into the example usage file (run.lua) if you want to use the interface on your own in LuaJIT. It provides a good overview of how things work.
The LuaJIT interface provides several objects encapsulating SENNA's tools.
SENNA's Hash.
Load the hash stored at filename, in the given path. If admissible_keys_filename is present, this will create a hash with admissible keys (needed for NER).
Returns the index of the given string key.
Returns the string at the given index idx (a number).
Returns the number of (key, value) pairs stored in the hash.
Transform IOBES hash values (strings) into IOB format.
Transform IOBES hash values (strings) into bracket format.
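To illustrate what the two transformations above mean for individual tags, here is a minimal sketch that works on a stream of tags, one per line. It is not the interface's implementation: the function names iobes_to_iob and iobes_to_brk are hypothetical, and the mapping follows the usual IOBES convention (B = begin, I = inside, E = end, S = single, O = outside), which is assumed to match what SENNA does.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SENNA): convert IOBES tags, one per line.

# IOBES -> IOB: single-word spans (S-X) become B-X, span ends (E-X) become I-X.
iobes_to_iob() {
    awk '
        /^S-/ { print "B-" substr($0, 3); next }
        /^E-/ { print "I-" substr($0, 3); next }
        { print }                      # B-X, I-X and O are unchanged
    '
}

# IOBES -> bracket format: "(X*" opens a span, "*)" closes it, "*" elsewhere.
iobes_to_brk() {
    awk '
        /^B-/ { print "(" substr($0, 3) "*"; next }
        /^S-/ { print "(" substr($0, 3) "*)"; next }
        /^E-/ { print "*)"; next }
        { print "*" }                  # I-X and O carry no bracket
    '
}
```

For example, `printf 'B-NP\nE-NP\nS-VP\nO\n' | iobes_to_brk` prints `(NP*`, `*)`, `(VP*)` and `*` on separate lines, which is the bracket style visible in the sample output of senna.run.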
Encapsulate tokens returned by the Tokenizer. Only created by the tokenizer.
Return a table containing tokenized word strings.
Encapsulate SENNA's tokenizer.
Create a new tokenizer. The tokenizer will be able to tokenize and create any features required by SENNA subroutines. If is_tokenized is true, the tokenizer assumes the words are already tokenized and separated by spaces.
Tokenize the given string. Returns Tokens.
Important note: because of internal state retained in the Tokenizer, it is not possible to tokenize and process several sentences at the same time. Keep this in mind when calling the analyzing tools.
SENNA's Part-of-speech (POS) module.
Creates a POS analyzer.
Returns a table containing POS tags computed on the given tokens (which must come from the Tokenizer module).
SENNA's chunking (shallow parsing) module.
Creates a chunking analyzer. The optional hashtype argument indicates the format of the generated tags. By default it will be IOBES. Other options are IOB or BRK (for bracketing tags).
Returns a table containing chunking tags, computed on the given tokens (which must come from the Tokenizer module) and POS tags (which must come from the POS module).
SENNA's named entity recognition (NER) module.
Creates an NER analyzer. The optional hashtype argument indicates the format of the generated tags. By default it will be IOBES. Other options are IOB or BRK (for bracketing tags).
Returns a table containing NER tags, computed on the given tokens (which must come from the Tokenizer module).
SENNA's semantic role labeling (SRL) module.
Creates an SRL analyzer. The optional hashtype argument indicates the format of the generated tags. By default it will be IOBES. Other options are IOB or BRK (for bracketing tags).
The optional verbtype argument indicates how verbs should be found. The default is VBS, SENNA's custom way of finding verbs. One can also use verbs from the POS tags with POS, or user-provided verbs with USR.
Returns a table of tables of SRL tags, computed on the given tokens (which must come from the Tokenizer module) and POS tags (which must come from the POS module).
Each table in the table corresponds to a particular detected/provided verb and contains tags for each word in the sentence.
The returned table also contains a verb field, which is a table of booleans. A value of true means the word was considered a verb.
If USR was passed as verbtype when the module was created, the user must also provide a list of words considered as verbs in usr_verb_labels. The list must be a list of booleans whose length equals the number of tokens in the sentence. A value of true means the corresponding word will be considered a verb.
Set SENNA's verbose mode to flag (true or false).