ash-parser

An attempt to clone SyntaxNet using only Python, with GPU support.

This was originally written for a class project. It implements a Chen and Manning (2014)-style neural network dependency parser in Python and TensorFlow; many elements mimic SyntaxNet.
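
For reference, here is a minimal sketch of that style of network in TensorFlow 1.x. Every name and size below (feature counts, vocabulary sizes, layer widths, number of actions) is illustrative only, not necessarily what this repository uses.

```python
# Minimal sketch of a Chen & Manning (2014)-style feedforward parser
# network in TensorFlow 1.x. All sizes here are placeholders.
import tensorflow as tf

NUM_WORD_FEATS = 20    # words from the stack, buffer, and their children
NUM_TAG_FEATS = 20     # POS tags for the same positions
NUM_LABEL_FEATS = 12   # dependency labels of arcs built so far
EMBED_DIM = 64
HIDDEN_DIM = 200
NUM_ACTIONS = 93       # shift / left-arc / right-arc crossed with labels

def embed_group(ids, num_feats, vocab_size, name):
    """Embed one feature group (words, tags, or labels) and flatten it."""
    with tf.variable_scope(name):
        table = tf.get_variable('embedding_matrix', [vocab_size, EMBED_DIM])
        vecs = tf.nn.embedding_lookup(table, ids)        # [batch, n, dim]
        return tf.reshape(vecs, [-1, num_feats * EMBED_DIM])

word_ids = tf.placeholder(tf.int32, [None, NUM_WORD_FEATS])
tag_ids = tf.placeholder(tf.int32, [None, NUM_TAG_FEATS])
label_ids = tf.placeholder(tf.int32, [None, NUM_LABEL_FEATS])

# Concatenate the per-group embeddings; one hidden layer (relu, as in
# SyntaxNet) feeds a softmax over parser actions.
inputs = tf.concat([
    embed_group(word_ids, NUM_WORD_FEATS, 50000, 'words'),
    embed_group(tag_ids, NUM_TAG_FEATS, 64, 'tags'),
    embed_group(label_ids, NUM_LABEL_FEATS, 64, 'labels'),
], axis=1)
hidden = tf.layers.dense(inputs, HIDDEN_DIM, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, NUM_ACTIONS)
```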

I analyze SyntaxNet's architecture here.

A parser-config file must be created in the model directory before execution; see parser-config.sample in the repository root for a starting point.

Run training_test.sh for an example of how to train a model; a sketch of the overall workflow appears below. Evaluation during training is supported as well, but there is no API yet for tagging new input or for serving a model.
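
A plausible end-to-end workflow, assuming the config file is seeded from the sample (the exact paths, config location, and any arguments to training_test.sh depend on your setup and are assumptions here):

```sh
# Illustrative workflow only; directory names and paths are assumptions.
mkdir -p model
cp parser-config.sample model/parser-config
# ...edit model/parser-config to point at your corpus and set hyperparameters...
./training_test.sh
```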

External dependencies

  • NumPy
  • TensorFlow 1.0

Similarities to SyntaxNet

  • Same embedding system (configurable per-feature group deep embedding)
  • Same optimizer (Momentum with an exponential moving average of the weights; see the first sketch after this list)
  • Lexicon builder is identical for words, tags, and labels
  • Map files output by SyntaxNet and AshParser should be identical
  • Evaluation metric is identical (SyntaxNet's metric corresponds to AshParser's UAS; see the second sketch after this list)
  • Feature system is almost identical (except perhaps for some very rare corner cases)
  • Because the architecture is the same, accuracy should be very close to that of greedy SyntaxNet
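
Continuing the network sketch above, the optimizer combination looks roughly like this in TensorFlow 1.x; the learning rate, momentum, and decay values are placeholders, not this repository's settings.

```python
# Momentum plus an exponential moving average of the trainable weights.
# `logits` comes from the earlier network sketch; all hyperparameter
# values here are placeholders.
gold_actions = tf.placeholder(tf.int32, [None])
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=gold_actions, logits=logits))

optimizer = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
train_step = optimizer.minimize(loss)

# Maintain shadow copies of the weights after every optimizer step;
# the averaged copies are what you would typically use at eval time.
ema = tf.train.ExponentialMovingAverage(decay=0.9999)
with tf.control_dependencies([train_step]):
    train_op = tf.group(ema.apply(tf.trainable_variables()))
```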
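
And the two attachment metrics in plain Python; attachment_scores is a hypothetical helper written for illustration, not part of this repository's API.

```python
def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels):
    """Return (UAS, LAS) over a list of tokens.

    UAS counts tokens whose predicted head index is correct; LAS
    additionally requires the dependency label to match. Hypothetical
    helper, not part of this repository's API.
    """
    total = len(gold_heads)
    uas = sum(gh == ph for gh, ph in zip(gold_heads, pred_heads))
    las = sum(gh == ph and gl == pl
              for gh, gl, ph, pl in zip(gold_heads, gold_labels,
                                        pred_heads, pred_labels))
    return uas / total, las / total

# Example: token 2's head is wrong, so both scores are 2/3.
print(attachment_scores([2, 0, 2], ['amod', 'root', 'dobj'],
                        [2, 0, 1], ['amod', 'root', 'dobj']))
```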

Differences from SyntaxNet

  • Arc-eager transition system is also supported (a minimal sketch of its transitions appears after this list)
  • No context file with redundant or boilerplate information is needed
  • GPU support: the training phase can complete in minutes
  • Pure Python 3 implementation; no need for Bazel
  • LAS (labeled attachment score) is printed during evaluation
  • Feature bags are precalculated and cached, which makes it easier to train multiple models with the same token features but different hyperparameters
  • No support for structured (beam) parsing; an LSTM or something simpler and faster is being considered for the future. The accuracy loss from greedy-only parsing should be in the ballpark of 1-2%
  • Feature groups are created automatically per feature type (word, tag, and label) rather than by joining features with semicolons in a context file
  • Only the transition-based parser is supported, not the POS tagger, morphological analyzer, or tokenizer
  • ngrams, punctuation_amount, morph tags, and other features are not yet implemented
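
For reference, here is a minimal sketch of the arc-eager transitions over a (stack, buffer, arcs) configuration. This is a teaching sketch of the general algorithm, not this repository's actual parser_state or transition-system code.

```python
# Minimal arc-eager transition system. Stack and buffer hold token
# indices; arcs collects (head, label, dependent) triples.
def shift(stack, buffer, arcs):
    # Move the front of the buffer onto the stack.
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs, label):
    # Attach the stack top as a dependent of the buffer front, then
    # pop it; legal only if the stack top has no head yet.
    dependent = stack.pop()
    arcs.append((buffer[0], label, dependent))

def right_arc(stack, buffer, arcs, label):
    # Attach the buffer front as a dependent of the stack top, then
    # shift it onto the stack.
    dependent = buffer.pop(0)
    arcs.append((stack[-1], label, dependent))
    stack.append(dependent)

def reduce_(stack, buffer, arcs):
    # Pop the stack top; legal only once it already has a head.
    stack.pop()
```

Compared to arc-standard, arc-eager attaches right dependents as early as possible, which is why it needs the extra REDUCE action to later pop tokens that already have heads.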