1 change: 1 addition & 0 deletions CODEOWNERS
@@ -17,6 +17,7 @@
/research/inception/ @shlens @vincentvanhoucke
/research/learned_optimizer/ @olganw @nirum
/research/learning_to_remember_rare_events/ @lukaszkaiser @ofirnachum
/research/lexnet_nc/ @vered1986 @waterson
/research/lfads/ @jazcollins @susillo
/research/lm_1b/ @oriolvinyals @panyx0718
/research/maskgan/ @a-dai
2 changes: 2 additions & 0 deletions research/README.md
@@ -36,6 +36,8 @@ installation](https://www.tensorflow.org/install).
- [inception](inception): deep convolutional networks for computer vision.
- [learning_to_remember_rare_events](learning_to_remember_rare_events): a
large-scale life-long memory module for use in deep learning.
- [lexnet_nc](lexnet_nc): a distributed model for noun compound relationship
classification.
- [lfads](lfads): sequential variational autoencoder for analyzing
neuroscience data.
- [lm_1b](lm_1b): language modeling on the one billion word benchmark.
132 changes: 132 additions & 0 deletions research/lexnet_nc/README.md
@@ -0,0 +1,132 @@
# LexNET for Noun Compound Relation Classification

This is a [TensorFlow](http://www.tensorflow.org/) implementation of the LexNET
algorithm for relation classification, applied here to classifying the
relationships that hold between the constituents of noun compounds:

* *olive oil* is oil that is *made from* olives
* *cooking oil* is oil that is *used for* cooking
* *motor oil* is oil that is *contained in* a motor

The model is a supervised classifier that predicts the relationship that holds
between the constituents of a two-word noun compound using:

1. A neural "paraphrase" of each syntactic dependency path that connects the
constituents in a large corpus. For example, given a sentence like *This fine
oil is made from first-press olives*, the dependency path is something like
`oil <NSUBJPASS made PREP> from POBJ> olive`.
2. The distributional information provided by the individual words; i.e., the
word embeddings of the two constituents.
3. The distributional signal provided by the compound itself; i.e., the
embedding of the noun compound in context.

The model comes in several variants: the *path-based model* uses (1) alone, the
*distributional model* uses (2) alone, and the *integrated model* uses both (1)
and (2). The *distributional-nc* and *integrated-nc* models each add (3).

Training a model requires the following:

1. A collection of noun compounds that have been labeled using a *relation
inventory*. The inventory describes the specific relationships that you'd
like the model to differentiate (e.g. *part of* versus *composed of* versus
*purpose*), and typically consists of tens of classes (a sketch of the
expected `classes.txt` layout appears after this list).
2. A collection of word embeddings: the path-based model uses the word
embeddings as part of the path representation, and the distributional models
use them directly as prediction features.
3. The path-based model requires a collection of syntactic dependency parses
that connect the constituents for each noun compound.
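
For reference, the scripts in this directory expect the relation inventory as a
`classes.txt` file under the dataset directory, with one relation label per line
(this layout is taken from `get_indicative_paths.py`; the label names shown here
are placeholders, not the actual inventory):

    $ cat ~/lexnet/datasets/tratz/fine_grained/classes.txt
    purpose
    containment
    topic
    ...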

At the moment, this repository does not contain the tools for generating this
data; we will provide references to existing datasets and plan to add
data-generation tools in the future.

# Contents

The following source code is included here:

* `learn_path_embeddings.py` is a script that trains and evaluates a path-based
model to predict a noun-compound relationship given labeled noun compounds and
dependency parse paths.
* `learn_classifier.py` is a script that trains and evaluates a classifier based
on any combination of paths, word embeddings, and noun-compound embeddings.
* `get_indicative_paths.py` is a script that extracts the most indicative
syntactic dependency paths for a particular relation (see *Extract Indicative
Paths* below).

# Dependencies

* [TensorFlow](http://www.tensorflow.org/): see detailed installation
instructions at that site.
* [scikit-learn](http://scikit-learn.org/): you can probably just install this
with `pip install scikit-learn`.
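
A quick sanity check is to confirm that both packages import cleanly:

    python -c "import tensorflow as tf, sklearn; print(tf.__version__, sklearn.__version__)"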

# Creating the Model

This section describes the steps necessary to reproduce the results in the
paper.

## Generate/Download Path Data

TBD! Our plan is to make available the aggregate path data that was used to
train the path embeddings and classifiers; however, it will be released
separately.

## Generate/Download Embedding Data

TBD! While we used the standard GloVe vectors for the relata embeddings, the NC
embeddings were generated separately. Our plan is to make that data available,
but it will be released separately.

## Create Path Embeddings

Create the path embeddings using `learn_path_embeddings.py`. The following
shell fragment will iterate through each dataset, split, and corpus to generate
path embeddings for each combination:

for DATASET in tratz/fine_grained tratz/coarse_grained ; do
for SPLIT in random lexical_head lexical_mod lexical_full ; do
for CORPUS in wiki_gigiawords ; do
python learn_path_embeddings.py \
--dataset_dir ~/lexnet/datasets \
--dataset "${DATASET}" \
--corpus "${SPLIT}/${CORPUS}" \
--embeddings_base_path ~/lexnet/embeddings \
--logdir /tmp/learn_path_embeddings
done
done
done

The path embeddings will be placed in the directory specified by
`--embeddings_base_path`.
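
Judging from where `get_indicative_paths.py` later loads them, the path
embeddings for a given dataset and corpus should land in a `path_embeddings/`
subtree of that directory, for example:

    ls ~/lexnet/embeddings/path_embeddings/tratz/fine_grained/random/wiki_gigiawords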

## Train Classifiers

Train classifiers and evaluate them on the validation and test data using the
`learn_classifier.py` script. The following shell fragment will iterate through
each dataset, split, corpus, and model type to train and evaluate classifiers.
Make sure the directory named by `LOGDIR` exists before running, since each
classifier's results are redirected to a log file inside it:

LOGDIR=/tmp/learn_classifier
for DATASET in tratz/fine_grained tratz/coarse_grained ; do
for SPLIT in random lexical_head lexical_mod lexical_full ; do
for CORPUS in wiki_gigiawords ; do
for MODEL in dist dist-nc path integrated integrated-nc ; do
# Filename for the log that will contain the classifier results.
LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
python learn_classifier.py \
--dataset_dir ~/lexnet/datasets \
--dataset "${DATASET}" \
--corpus "${SPLIT}/${CORPUS}" \
--embeddings_base_path ~/lexnet/embeddings \
--logdir ${LOGDIR} \
--input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
done
done
done
done

The log file will contain the final performance (precision, recall, F1) on the
train, dev, and test sets, and will include a confusion matrix for each.
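
## Extract Indicative Paths

Once a path-based classifier has been trained, `get_indicative_paths.py`
extracts the dependency paths that are most indicative of each relation. The
sketch below uses the flags defined in that script (its defaults are
`--top_k 20` and `--threshold 0.8`); point `--logdir` at the directory used
when training the *path* classifier, since the script restores `best.ckpt`
from there:

    python get_indicative_paths.py \
      --dataset_dir ~/lexnet/datasets \
      --dataset tratz/fine_grained \
      --corpus random/wiki_gigiawords \
      --embeddings_base_path ~/lexnet/embeddings \
      --logdir /tmp/learn_classifier \
      --top_k 20 \
      --threshold 0.8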

# Contact

If you have any questions, issues, or suggestions, feel free to contact either
@vered1986 or @waterson.
111 changes: 111 additions & 0 deletions research/lexnet_nc/get_indicative_paths.py
@@ -0,0 +1,111 @@
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Extracts paths that are indicative of each relation."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import tensorflow as tf

from . import path_model
from . import lexnet_common

tf.flags.DEFINE_string(
'dataset_dir', 'datasets',
'Dataset base directory')

tf.flags.DEFINE_string(
'dataset',
'tratz/fine_grained',
'Subdirectory containing the corpus directories: '
'subdirectory of dataset_dir')

tf.flags.DEFINE_string(
'corpus', 'random/wiki',
'Subdirectory containing the corpus and split: '
'subdirectory of dataset_dir/dataset')

tf.flags.DEFINE_string(
'embeddings_base_path', 'embeddings',
'Embeddings base directory')

tf.flags.DEFINE_string(
'logdir', 'logdir',
'Directory of model output files')

tf.flags.DEFINE_integer(
'top_k', 20, 'Number of top paths to extract')

tf.flags.DEFINE_float(
'threshold', 0.8, 'Threshold above which to consider paths as indicative')

FLAGS = tf.flags.FLAGS


def main(_):
hparams = path_model.PathBasedModel.default_hparams()

# First things first. Load the path data.
path_embeddings_file = 'path_embeddings/{dataset}/{corpus}'.format(
dataset=FLAGS.dataset,
corpus=FLAGS.corpus)

path_dim = (hparams.lemma_dim + hparams.pos_dim +
hparams.dep_dim + hparams.dir_dim)

path_embeddings, path_to_index = path_model.load_path_embeddings(
os.path.join(FLAGS.embeddings_base_path, path_embeddings_file),
path_dim)

# Load and count the classes so we can correctly instantiate the model.
classes_filename = os.path.join(
FLAGS.dataset_dir, FLAGS.dataset, 'classes.txt')

with open(classes_filename) as f_in:
classes = f_in.read().splitlines()

hparams.num_classes = len(classes)

# We need the word embeddings to instantiate the model, too.
print('Loading word embeddings...')
lemma_embeddings = lexnet_common.load_word_embeddings(
FLAGS.embeddings_base_path, hparams.lemma_embeddings_file)

# Instantiate the model.
with tf.Graph().as_default():
with tf.variable_scope('lexnet'):
instance = tf.placeholder(dtype=tf.string)
model = path_model.PathBasedModel(
hparams, lemma_embeddings, instance)

with tf.Session() as session:
model_dir = '{logdir}/results/{dataset}/path/{corpus}'.format(
logdir=FLAGS.logdir,
dataset=FLAGS.dataset,
corpus=FLAGS.corpus)

saver = tf.train.Saver()
saver.restore(session, os.path.join(model_dir, 'best.ckpt'))

path_model.get_indicative_paths(
model, session, path_to_index, path_embeddings, classes,
model_dir, FLAGS.top_k, FLAGS.threshold)

if __name__ == '__main__':
tf.app.run()