Cleanup for workshop camera ready.
Sam Bowman committed Jun 18, 2015
1 parent 1abcb7d commit adabdda
Showing 6 changed files with 57 additions and 54 deletions.
2 changes: 1 addition & 1 deletion writing/F'14 paper/cameraready/intro.tex
@@ -1,6 +1,6 @@
\section{Introduction}\label{sec:intro}

Tree-structured recursive neural network models (TreeRNNs; \citealt{goller1996learning}) for sentence meaning
Tree-structured recursive neural network models (TreeRNNs; \citealt{goller1996learning,socher2011semi}) for sentence meaning
have been successful in an array of sophisticated language tasks,
including sentiment analysis \cite{socher2011semi,irsoydeep},
image description \cite{sochergrounded}, and paraphrase detection
54 changes: 27 additions & 27 deletions writing/F'14 paper/cameraready/join.tex
@@ -32,7 +32,7 @@ \section{Reasoning about semantic relations}\label{sec:join}
full set of such sound inferences on pairs of premise relations is depicted in
Table~\ref{tab:jointable}. Though these basic inferences do not involve compositional
sentence representations, any successful reasoning using compositional representations
will rely on the ability to perform sound inferences of this kind, so our first experiment studies how well each model can learn to perform them in isolation.
will rely on the ability to perform sound inferences of this kind in order to use unseen relational facts within larger derivations. Our first experiment studies how well each model can learn to perform them in isolation.

% about the relations themselves that do not depend on the
% internal structure of the things being compared. For example, given
@@ -46,29 +46,6 @@ \section{Reasoning about semantic relations}\label{sec:join}
% $a \natneg b$ and $b~|~c$ then $a \sqsupset c$.


\paragraph{Experiments}
We begin by creating a world model
on which we will base the statements in the train and test sets.
This takes the form of a small Boolean structure in which terms denote
sets of entities from a small domain. Fig.~\ref{lattice-figure}a
depicts a structure of this form with three entities ($a$, $b$, and $c$) and eight proposition terms ($p_1$--$p_8$). We then generate a
relational statement for each pair of terms in the model, as shown in Fig.~\ref{lattice-figure}b.
We divide these statements evenly into train and test sets, and delete those test set
examples that cannot be proven from the training examples, since for these even an ideal system lacks the information needed to choose a correct label.
In each experimental run, we create a model with 80 terms over a domain of 7 elements, yielding a training set of 3200 examples and a test set of
2960 examples.

We trained models with both the NN and NTN comparison functions on these
data sets.\footnote{Since this task relies crucially on the learning of a pair of vectors, no simpler version of our model is a viable baseline.} %+%
In both cases, the models are implemented as
described in \S\ref{methods}, but since the items being compared
are single terms rather than full tree structures, the composition
layer is not used, and the two models are not recursive. We simply present
the models with the (randomly initialized) embedding vectors for each
of two terms, ensuring that the model has no information about the terms
being compared except for the relations between them that appear in training.


\begin{figure}[t]
\centering
\begin{subfigure}[t]{0.45\textwidth}
@@ -106,7 +83,7 @@ \section{Reasoning about semantic relations}\label{sec:join}

\labelednode{2.5}{0.5}{}{}
\end{picture}}
\caption{Example boolean structure. The terms $p_1$--$p_8$ name the sets. Not all sets have names, and some sets have multiple names, so that learning $\nateq$ is non-trivial.}
\caption{Example boolean structure, shown with edges indicating inclusion. The terms $p_1$--$p_8$ name the sets. Not all sets have names, and some sets have multiple names, so that learning $\nateq$ is non-trivial.}
\end{subfigure}
\qquad\small
\begin{subfigure}[t]{0.43\textwidth}
@@ -126,7 +103,7 @@ \section{Reasoning about semantic relations}\label{sec:join}
\end{tabular}

\caption{A few examples of atomic statements about the
model. Test statements that are not provable from the training data shown are
model depicted above. Test statements that are not provable from the training data shown are
crossed out.}
\end{subfigure}
\caption{Small example structure and data for learning relation composition.}
@@ -150,14 +127,37 @@ \section{Reasoning about semantic relations}\label{sec:join}
\label{joinresultstable}
\end{table}

\paragraph{Experiments}
We begin by creating a world model
on which we will base the statements in the train and test sets.
This takes the form of a small Boolean structure in which terms denote
sets of entities from a small domain. Fig.~\ref{lattice-figure}a
depicts a structure of this form with three entities ($a$, $b$, and $c$) and eight proposition terms ($p_1$--$p_8$). We then generate a
relational statement for each pair of terms in the model, as shown in Fig.~\ref{lattice-figure}b.
We divide these statements evenly into train and test sets, and delete those test set
examples that cannot be proven from the training examples, since for these even an ideal system lacks the information needed to choose a correct label.
In each experimental run, we create a model with 80 terms over a domain of 7 elements, yielding a training set of 3200 examples and a test set of
2960 examples.
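
As a concrete illustration of this data-generation step, the seven relations of the underlying logic can be computed directly from the sets that the terms denote. The sketch below is ours, not the paper's code: the set assignments and ASCII relation symbols (standing in for $\nateq$, $\natfor$, $\natrev$, $\natneg$, $|$, $\natcov$, $\natind$) are illustrative assumptions.

```python
from itertools import combinations

def relation(x, y, domain):
    """MacCartney-style relation between two sets over a shared domain."""
    x, y = frozenset(x), frozenset(y)
    if x == y:                            return '='   # equivalence
    if x < y:                             return '<'   # forward entailment
    if x > y:                             return '>'   # reverse entailment
    if not (x & y) and x | y == domain:   return '^'   # negation (disjoint, exhaustive)
    if not (x & y):                       return '|'   # alternation (disjoint only)
    if x | y == domain:                   return 'v'   # cover (exhaustive only)
    return '#'                                          # independence

# Toy structure: hypothetical term-to-set assignments over a small domain.
D = frozenset({'a', 'b', 'c'})
sets = {'p1': {'a'}, 'p2': {'b'}, 'p3': {'a', 'b'}, 'p4': {'b', 'c'}}

# One relational statement per pair of terms, as in the experiment.
data = [(t1, t2, relation(sets[t1], sets[t2], D))
        for t1, t2 in combinations(sets, 2)]
```

A train/test split of `data`, followed by deleting unprovable test examples, would complete the procedure described above.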

We trained models with both the NN and NTN comparison functions on these
data sets.\footnote{Since this task relies crucially on the learning of a pair of vectors, no simpler version of our model is a viable baseline.} %+%
In both cases, the models are implemented as
described in \S\ref{methods}, but since the items being compared
are single terms rather than full tree structures, the composition
layer is not used, and the two models are not recursive. We simply present
the models with the (randomly initialized) embedding vectors for each
of two terms, ensuring that the model has no information about the terms
being compared except for the relations between them that appear in training.
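
This non-recursive setup can be sketched as follows. The dimensions, the tanh nonlinearity, and the softmax classifier are illustrative assumptions on our part, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms, n, k, n_rel = 80, 16, 32, 7   # terms, embedding dim, hidden dim, relations

# Randomly initialized term embeddings: the model's only knowledge of a term
# is whatever training on observed relation labels pushes into its vector.
E = rng.normal(scale=0.1, size=(n_terms, n))
W = rng.normal(scale=0.1, size=(k, 2 * n))   # NN comparison weights
b = np.zeros(k)
U = rng.normal(scale=0.1, size=(n_rel, k))   # relation classifier

def nn_compare(i, j):
    """NN comparison layer over a pair of term vectors, then a softmax
    over the seven relation labels."""
    h = np.tanh(W @ np.concatenate([E[i], E[j]]) + b)
    z = U @ h
    p = np.exp(z - z.max())
    return p / p.sum()

probs = nn_compare(0, 1)   # predicted distribution over relations for one pair
```

The NTN variant would add a bilinear tensor term to `h`, as described in \S\ref{methods}.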


\paragraph{Results}
The results (Table \ref{joinresultstable}) show that the NTN is able to accurately encode the relations between the terms in the geometric relations between their vectors,
and can then use that information to recover relations that
are not overtly included in the training data. The NN also generalizes fairly well,
but makes enough errors that it remains an open question whether
it is capable of learning representations with these properties.
It is not possible for us to rule out the possibility that different optimization techniques or
further hyperparameter tuning could lead an NN model to succeed here.
finer-grained hyperparameter tuning could lead an NN model to succeed.

As an example from our test data, both models correctly labeled $p_1 \natfor p_3$, potentially learning from the training examples $\{p_1 \natfor p_{51},~p_3 \natrev p_{51}\}$ or $\{p_1\natfor p_{65},~p_3 \natrev p_{65} \}$. On another example involving comparably frequent relations, the NTN correctly labeled $p_6 \natrev p_{24}$, likely on the basis of the training examples $\{p_6 \natcov p_{28},~p_{28} \natneg p_{24}\}$, while the NN incorrectly assigned it $\natind$.

10 changes: 5 additions & 5 deletions writing/F'14 paper/cameraready/methods.tex
@@ -6,8 +6,8 @@ \section{Tree-structured neural networks} \label{methods}
compositionality}, which says that the meanings for complex
expressions are derived from the meanings of their parts
via specific composition functions \cite{Partee84,Janssen97}. In our
distributed setting, word meanings are embedding vectors of dimension $n$. A learned
composition function maps pairs of them to single phrase vectors of dimension $n$,
distributed setting, word meanings are embedding vectors of dimension $N$. A learned
composition function maps pairs of them to single phrase vectors of dimension $N$,
which can then be merged again to represent more complex
phrases, forming a tree structure. Once the entire sentence-level representation has been
derived at the top of the tree, it serves as a fixed-dimensional input for some subsequent layer function.
@@ -45,9 +45,9 @@ \section{Tree-structured neural networks} \label{methods}
Here, $\vec{x}^{(l)}$ and $\vec{x}^{(r)}$ are the column vector
representations for the left and right children of the node, and
$\vec{y}$ is the node's output. The TreeRNN concatenates them, multiplies
them by an $n \times 2n$ matrix of learned weights, and adds a bias $\vec{b}$.
them by an $N \times 2N$ matrix of learned weights, and adds a bias $\vec{b}$.
The TreeRNTN adds a learned full rank third-order tensor
$\mathbf{T}$, of dimension $n \times n \times n$, modeling
$\mathbf{T}$, of dimension $N \times N \times N$, modeling
multiplicative interactions between the child vectors.
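
The two composition functions just described can be sketched as follows; the tanh activation and the small value of $N$ are stand-in assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16                                       # vector dimension

W = rng.normal(scale=0.1, size=(N, 2 * N))   # N x 2N composition weights
b = np.zeros(N)                              # bias
T = rng.normal(scale=0.1, size=(N, N, N))    # N x N x N tensor (TreeRNTN only)

def tree_rnn(xl, xr):
    """Plain TreeRNN composition: y = f(W [x_l; x_r] + b)."""
    return np.tanh(W @ np.concatenate([xl, xr]) + b)

def tree_rntn(xl, xr):
    """TreeRNTN adds the bilinear term x_l^T T x_r before the nonlinearity."""
    bilinear = np.einsum('i,kij,j->k', xl, T, xr)
    return np.tanh(bilinear + W @ np.concatenate([xl, xr]) + b)

# Bottom-up composition over the tree ((w1 w2) w3):
w1, w2, w3 = (rng.normal(size=N) for _ in range(3))
phrase = tree_rntn(tree_rntn(w1, w2), w3)
```

Because the output dimension matches the input dimension, the same function applies recursively at every node of the tree.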
The comparison layer uses the same layer function as the
composition layers (either an NN layer or an NTN layer) with
@@ -82,5 +82,5 @@ \section{Tree-structured neural networks} \label{methods}
as the harmonic mean of average precision and average recall, both computed
for all classes for which there is test data, setting precision to 0
where it is not defined.}
Source code and generated data will be released after the review period.
Source code and generated data can be downloaded from \url{http://stanford.edu/~sbowman/}.

2 changes: 1 addition & 1 deletion writing/F'14 paper/cameraready/quantifiers.tex
@@ -56,7 +56,7 @@ \section{Reasoning with quantifiers and negation}\label{sec:quantifiers}
% yields 66k sentence pairs. Some examples of these data are provided
% in Table~\ref{examplesofdata}.

In each run, we randomly partition the set of valid \textit{single sentences} into train and test, and then label all of the pairs from within each set to generate a training set of 27k pairs and a test set of 7k pairs. Because the model doesn't see the test sentences at training time, it cannot directly use the kind of reasoning described in \S\ref{sec:join} (treating sentences as unanalyzed symbols), and must instead infer the word-level relations and learn a complete reasoning system over them for our logic.
In each run, we randomly partition the set of valid \textit{single sentences} into train and test, and then label all of the pairs from within each set to generate a training set of 27k pairs and a test set of 7k pairs. Because the model doesn't see the test sentences at training time, it cannot directly use the kind of reasoning described in \S\ref{sec:join} at the sentence level (by treating sentences as unanalyzed symbols), and must instead jointly learn the word-level relations and a complete reasoning system over them for our logic.

We use the same summing baseline as in \S\ref{sec:recursion}.
The highly consistent sentence structure in this experiment means that this model
2 changes: 1 addition & 1 deletion writing/F'14 paper/cameraready/recursion.tex
@@ -77,7 +77,7 @@ \section{Recursive structure}\label{sec:recursion}
$\plneg\, (\plneg p_1 \pland \plneg p_2)$ & $\nateq$ & $(p_1 \plor p_2)$ \\
\bottomrule
\end{tabular}
\caption{Examples of the type of statements used for training and testing. These are relations between
\caption{Short examples of the type of statements used for training and testing. These are relations between
well-formed formulae, computed in terms of sets of satisfying
interpretation functions $\sem{\cdot}$.}\label{tab:plexs}
\end{subtable}
