
ltagextract

Extract a lexicalized tree-adjoining grammar (LTAG) from a treebank.

Introduction

This project extracts Tree Adjoining Grammars (TAG) with aligned semantics from the KBGen corpus.

Software dependencies

The commands in this README rely on:

  • Java, to run the standalone jars in ./bin
  • the Stanford parser, used in step 2
  • Python 2 with NLTK 2.0.4 (the extraction command adds utilities/nltk-2.0.4/ to PYTHONPATH)

Howto

To reproduce our current result, you can either simply run bin/run.sh or follow the pipeline described below:

  1. Handle the conjunctions that occur in the syntactic trees (coordination aggregation).
  2. Parse the sentences using the Stanford parser. We use the unlexicalized parser with head information in the output.
  3. Normalize the syntactic trees obtained in step 2.
  4. Extract the TAG from the output of step 3.
  5. Assign semantics to the output of step 4.

Step 1

To perform the coordination aggregation, run

java -jar bin/aggregation-0.1.1-SNAPSHOT-standalone.jar \
  input/triples/ output/aggregated/

Step 2

To parse the corpus using the Stanford parser, run

bin/parse.sh input/sentences/ output/parsed/

Step 3

To normalize the syntactic trees, run

java -jar bin/grook-0.1.0-SNAPSHOT-standalone.jar \
  output/parsed/ output/fixed/

Steps 4 and 5

To extract the TAG with aligned semantics, run

PYTHONPATH="utilities/nltk-2.0.4/:$PYTHONPATH" python2 bin/extract/extractor.py \
  output/fixed/ input/alignments/ output/final.gram \
  --verbose output/grammar-verbose/

For more details, check the extractor's help message:

python2 extractor.py -h
usage: extractor.py [-h] [--verbose VERBOSE] corpus alignment [outfile]

positional arguments:
  corpus             corpus path which should be a directroy
  alignment          alignment path which should be a directory
  outfile            outputfile for extracted grammar

optional arguments:
  -h, --help         show this help message and exit
  --verbose VERBOSE  output raw gammar extracted for each sentence. This
                     parameter should be a directory
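
Taken together, the commands for steps 1 to 5 can be chained from a small driver script. The following is a minimal sketch, not necessarily what bin/run.sh does: the commands and paths are copied from the step sections above, while the function names `build_pipeline` and `run_pipeline` and the dry-run switch are hypothetical and exist only for illustration. It assumes the script is started from the repository root.

```python
import os
import subprocess


def build_pipeline():
    """Return the steps as (label, argv, extra_env) tuples.

    The commands are the ones listed in this README; the tuple layout
    is an assumption made for this sketch.
    """
    return [
        ("step 1: coordination aggregation",
         ["java", "-jar", "bin/aggregation-0.1.1-SNAPSHOT-standalone.jar",
          "input/triples/", "output/aggregated/"], {}),
        ("step 2: parsing",
         ["bin/parse.sh", "input/sentences/", "output/parsed/"], {}),
        ("step 3: normalization",
         ["java", "-jar", "bin/grook-0.1.0-SNAPSHOT-standalone.jar",
          "output/parsed/", "output/fixed/"], {}),
        ("steps 4 and 5: extraction with semantics",
         ["python2", "bin/extract/extractor.py",
          "output/fixed/", "input/alignments/", "output/final.gram",
          "--verbose", "output/grammar-verbose/"],
         {"PYTHONPATH": "utilities/nltk-2.0.4/"}),
    ]


def run_pipeline(steps, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for label, argv, extra_env in steps:
        print(label + ": " + " ".join(argv))
        if dry_run:
            continue
        env = dict(os.environ)
        for key, value in extra_env.items():
            old = env.get(key)
            # Prepend the new entry to any existing value of the variable.
            env[key] = (value + os.pathsep + old) if old else value
        # check_call raises CalledProcessError, so a failing step stops the run.
        subprocess.check_call(argv, env=env)
```

Calling `run_pipeline(build_pipeline())` only prints the commands; passing `dry_run=False` executes them and stops at the first step that exits with a non-zero status.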

Other

We also provide a small tool to help you visualize the TAG extracted in step 4 or step 5. To see its options, run

python2 grammarviewer.py -h
usage: grammarviewer.py [-h] [filename]

Draw the tree according to grammar file

positional arguments:
  filename    The name of grammar file, stdin will be used if left open

optional arguments:
  -h, --help  show this help message and exit

As a by-product, our package provides an s-expression parser for Python. You may want to use it to reconstruct an NLTK ParentedTree from the plain-text representation of a TAG.
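
To illustrate what parsing an s-expression involves, here is a minimal, self-contained sketch. It is not the parser shipped in this package, whose module and function names are not documented here; `tokenize` and `parse_sexp` are hypothetical names, and the bracketed sample string in the usage note is a made-up tree rather than actual extractor output. The sketch turns one parenthesized expression into nested Python lists, which could then be converted into a tree object.

```python
def tokenize(text):
    """Split an s-expression into '(', ')' and atom tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse_sexp(text):
    """Parse exactly one s-expression into nested lists of strings."""
    stack = [[]]  # stack[0] collects the top-level expressions
    for token in tokenize(text):
        if token == "(":
            stack.append([])  # open a new nested list
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unexpected closing parenthesis")
            finished = stack.pop()
            stack[-1].append(finished)  # attach it to its parent
        else:
            stack[-1].append(token)  # plain atom
    if len(stack) != 1:
        raise ValueError("missing closing parenthesis")
    if len(stack[0]) != 1:
        raise ValueError("expected exactly one top-level expression")
    return stack[0][0]
```

For example, `parse_sexp("(S (NP John) (VP runs))")` returns `['S', ['NP', 'John'], ['VP', 'runs']]`. This simple tokenizer splits on whitespace and parentheses only, so it does not handle quoted atoms that contain spaces or parentheses.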

Description of the files

  • ./bin contains all runnable programs and scripts
  • ./src contains all the source code
  • ./output contains the intermediate results generated by the programs
  • ./input contains the original corpus and the annotated data
    • ./input/alignment contains our annotation results
    • ./input/heads-fixed
    • ./input/aggregation
  • ./report contains our report