1 change: 1 addition & 0 deletions CODEOWNERS
@@ -17,6 +17,7 @@
/research/inception/ @shlens @vincentvanhoucke
/research/learned_optimizer/ @olganw @nirum
/research/learning_to_remember_rare_events/ @lukaszkaiser @ofirnachum
/research/lexnet_nc/ @vered1986 @waterson
/research/lfads/ @jazcollins @susillo
/research/lm_1b/ @oriolvinyals @panyx0718
/research/maskgan/ @a-dai
2 changes: 2 additions & 0 deletions research/README.md
@@ -36,6 +36,8 @@ installation](https://www.tensorflow.org/install).
- [inception](inception): deep convolutional networks for computer vision.
- [learning_to_remember_rare_events](learning_to_remember_rare_events): a
large-scale life-long memory module for use in deep learning.
- [lexnet_nc](lexnet_nc): a distributed model for noun compound relationship
classification.
- [lfads](lfads): sequential variational autoencoder for analyzing
neuroscience data.
- [lm_1b](lm_1b): language modeling on the one billion word benchmark.
132 changes: 132 additions & 0 deletions research/lexnet_nc/README.md
@@ -0,0 +1,132 @@
# LexNET for Noun Compound Relation Classification

This is a [TensorFlow](http://www.tensorflow.org/) implementation of the LexNET
algorithm for relation classification, applied here to classifying the
relationships that hold between the constituents of noun compounds:

* *olive oil* is oil that is *made from* olives
* *cooking oil* is oil that is *used for* cooking
* *motor oil* is oil that is *contained in* a motor

The model is a supervised classifier that predicts the relationship that holds
between the constituents of a two-word noun compound using:

1. A neural "paraphrase" of each syntactic dependency path that connects the
constituents in a large corpus. For example, given a sentence like *This fine
oil is made from first-press olives*, the dependency path is something like
`oil <NSUBJPASS made PREP> from POBJ> olive`.
2. The distributional information provided by the individual words; i.e., the
word embeddings of the two constituents.
3. The distributional signal provided by the compound itself; i.e., the
embedding of the noun compound in context.

The model comes in several variants: the *path-based model* uses (1) alone, the
*distributional model* uses (2) alone, and the *integrated model* uses both (1)
and (2). The *distributional-nc* and *integrated-nc* models each add (3).

Training a model requires the following:

1. A collection of noun compounds that have been labeled using a *relation
inventory*. The inventory describes the specific relationships that you'd
like the model to differentiate (e.g. *part of* versus *composed of* versus
*purpose*), and typically consists of tens of classes (a sketch of the
expected `classes.txt` layout appears after this list).
2. A collection of word embeddings: the path-based model uses the word
embeddings as part of the path representation, and the distributional models
use them directly as prediction features.
3. The path-based model requires a collection of syntactic dependency parses
that connect the constituents for each noun compound.
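
For reference, the scripts in this directory expect the relation inventory as a
`classes.txt` file under the dataset directory, with one relation label per line
(this layout is taken from `get_indicative_paths.py`; the label names shown here
are placeholders, not the actual inventory):

    $ cat ~/lexnet/datasets/tratz/fine_grained/classes.txt
    purpose
    containment
    topic
    ...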

At the moment, this repository does not contain the tools for generating this
data; we will provide references to existing datasets and plan to add
data-generation tools in the future.

# Contents

The following source code is included here:

* `learn_path_embeddings.py` is a script that trains and evaluates a path-based
model to predict a noun-compound relationship given labeled noun compounds and
dependency parse paths.
* `learn_classifier.py` is a script that trains and evaluates a classifier based
on any combination of paths, word embeddings, and noun-compound embeddings.
* `get_indicative_paths.py` is a script that extracts the most indicative
syntactic dependency paths for a particular relation (see *Extract Indicative
Paths* below).

# Dependencies

* [TensorFlow](http://www.tensorflow.org/): see detailed installation
instructions at that site.
* [scikit-learn](http://scikit-learn.org/): you can probably just install this
with `pip install scikit-learn`.
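
A quick sanity check is to confirm that both packages import cleanly:

    python -c "import tensorflow as tf, sklearn; print(tf.__version__, sklearn.__version__)"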

# Creating the Model

This section describes the steps necessary to reproduce the results in the
paper.

## Generate/Download Path Data

TBD! Our plan is to make available the aggregate path data that was used to
train the path embeddings and classifiers; however, it will be released
separately.

## Generate/Download Embedding Data

TBD! While we used the standard GloVe vectors for the relata embeddings, the NC
embeddings were generated separately. Our plan is to make that data available,
but it will be released separately.

## Create Path Embeddings

Create the path embeddings using `learn_path_embeddings.py`. The following
shell fragment will iterate through each dataset, split, and corpus to generate
path embeddings for each combination:

for DATASET in tratz/fine_grained tratz/coarse_grained ; do
for SPLIT in random lexical_head lexical_mod lexical_full ; do
for CORPUS in wiki_gigiawords ; do
python learn_path_embeddings.py \
--dataset_dir ~/lexnet/datasets \
--dataset "${DATASET}" \
--corpus "${SPLIT}/${CORPUS}" \
--embeddings_base_path ~/lexnet/embeddings \
--logdir /tmp/learn_path_embeddings
done
done
done

The path embeddings will be placed in the directory specified by
`--embeddings_base_path`.
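
Judging from where `get_indicative_paths.py` later loads them, the path
embeddings for a given dataset and corpus should land in a `path_embeddings/`
subtree of that directory, for example:

    ls ~/lexnet/embeddings/path_embeddings/tratz/fine_grained/random/wiki_gigiawords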

## Train Classifiers

Train classifiers and evaluate them on the validation and test data using the
`learn_classifier.py` script. The following shell fragment will iterate through
each dataset, split, corpus, and model type to train and evaluate classifiers.
Make sure the directory named by `LOGDIR` exists before running, since each
classifier's results are redirected to a log file inside it:

LOGDIR=/tmp/learn_classifier
for DATASET in tratz/fine_grained tratz/coarse_grained ; do
for SPLIT in random lexical_head lexical_mod lexical_full ; do
for CORPUS in wiki_gigiawords ; do
for MODEL in dist dist-nc path integrated integrated-nc ; do
# Filename for the log that will contain the classifier results.
LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
python learn_classifier.py \
--dataset_dir ~/lexnet/datasets \
--dataset "${DATASET}" \
--corpus "${SPLIT}/${CORPUS}" \
--embeddings_base_path ~/lexnet/embeddings \
--logdir ${LOGDIR} \
--input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
done
done
done
done

The log file will contain the final performance (precision, recall, F1) on the
train, dev, and test sets, and will include a confusion matrix for each.
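
## Extract Indicative Paths

Once a path-based classifier has been trained, `get_indicative_paths.py`
extracts the dependency paths that are most indicative of each relation. The
sketch below uses the flags defined in that script (its defaults are
`--top_k 20` and `--threshold 0.8`); point `--logdir` at the directory used
when training the *path* classifier, since the script restores `best.ckpt`
from there:

    python get_indicative_paths.py \
      --dataset_dir ~/lexnet/datasets \
      --dataset tratz/fine_grained \
      --corpus random/wiki_gigiawords \
      --embeddings_base_path ~/lexnet/embeddings \
      --logdir /tmp/learn_classifier \
      --top_k 20 \
      --threshold 0.8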

# Contact

If you have any questions, issues, or suggestions, feel free to contact either
@vered1986 or @waterson.
111 changes: 111 additions & 0 deletions research/lexnet_nc/get_indicative_paths.py
@@ -0,0 +1,111 @@
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Extracts paths that are indicative of each relation."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import tensorflow as tf

from . import path_model
from . import lexnet_common

tf.flags.DEFINE_string(
'dataset_dir', 'datasets',
'Dataset base directory')

tf.flags.DEFINE_string(
'dataset',
'tratz/fine_grained',
'Subdirectory containing the corpus directories: '
'subdirectory of dataset_dir')

tf.flags.DEFINE_string(
'corpus', 'random/wiki',
'Subdirectory containing the corpus and split: '
'subdirectory of dataset_dir/dataset')

tf.flags.DEFINE_string(
'embeddings_base_path', 'embeddings',
'Embeddings base directory')

tf.flags.DEFINE_string(
'logdir', 'logdir',
'Directory of model output files')

tf.flags.DEFINE_integer(
'top_k', 20, 'Number of top paths to extract')

tf.flags.DEFINE_float(
'threshold', 0.8, 'Threshold above which to consider paths as indicative')

FLAGS = tf.flags.FLAGS


def main(_):
hparams = path_model.PathBasedModel.default_hparams()

# First things first. Load the path data.
path_embeddings_file = 'path_embeddings/{dataset}/{corpus}'.format(
dataset=FLAGS.dataset,
corpus=FLAGS.corpus)

path_dim = (hparams.lemma_dim + hparams.pos_dim +
hparams.dep_dim + hparams.dir_dim)

path_embeddings, path_to_index = path_model.load_path_embeddings(
os.path.join(FLAGS.embeddings_base_path, path_embeddings_file),
path_dim)

# Load and count the classes so we can correctly instantiate the model.
classes_filename = os.path.join(
FLAGS.dataset_dir, FLAGS.dataset, 'classes.txt')

with open(classes_filename) as f_in:
classes = f_in.read().splitlines()

hparams.num_classes = len(classes)

# We need the word embeddings to instantiate the model, too.
print('Loading word embeddings...')
lemma_embeddings = lexnet_common.load_word_embeddings(
FLAGS.embeddings_base_path, hparams.lemma_embeddings_file)

# Instantiate the model.
with tf.Graph().as_default():
with tf.variable_scope('lexnet'):
instance = tf.placeholder(dtype=tf.string)
model = path_model.PathBasedModel(
hparams, lemma_embeddings, instance)

with tf.Session() as session:
model_dir = '{logdir}/results/{dataset}/path/{corpus}'.format(
logdir=FLAGS.logdir,
dataset=FLAGS.dataset,
corpus=FLAGS.corpus)

saver = tf.train.Saver()
saver.restore(session, os.path.join(model_dir, 'best.ckpt'))

path_model.get_indicative_paths(
model, session, path_to_index, path_embeddings, classes,
model_dir, FLAGS.top_k, FLAGS.threshold)

if __name__ == '__main__':
tf.app.run()