<h1><a href="https://arxiv.org/abs/1709.09130">
A Simple Neural Network Module for Relational Reasoning</a></h1>
Adam Santoro et al.


<h2>Summary</h2>

* Relation Networks (RNs) can be used as a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning.

* A deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.

<h2>Introduction</h2>

* Reasoning about the relations between entities is central to general intelligence.

* Symbolic approaches to artificial intelligence are not robust to small variations in the task or the input.

* Deep learning and other statistical learning approaches struggle in data-poor problems where the underlying structure is characterized by sparse but complex relations.

* Seemingly simple relational inferences are remarkably difficult for powerful neural network architectures such as CNNs and MLPs.
    
* The paper proposes the **Relation Network (RN)** as a general solution to relational reasoning in neural networks.
    * Focuses on flexible relational reasoning.
    * Induces upstream CNNs and LSTMs to produce implicit object-like representations for the RN to operate on.
    * Outperforms general-purpose architectures on a visual question answering (QA) dataset that demands rich relational reasoning.

<h2>Relation Networks</h2>

* An RN is a neural network module whose structure is designed for inferring relations.
* The capacity to capture relations is built into the RN architecture itself.
>Just as CNNs have a built-in capacity for spatial, translation-invariant properties and RNNs for sequential dependencies.

* The simplest form of an RN is a composite function $RN(O)=f_\phi(\sum_{i,j} g_\theta(o_i, o_j))$
>* Input is a set of "objects" $O=\{o_1, o_2,\ldots, o_n\}$, where $o_i\in\mathbb{R}^m$ is the $i^{th}$ object.<br>
 * $f_\phi$ and $g_\theta$ are MLPs with parameters $\phi$ and $\theta$.<br>
 * $g_\theta$ is applied to every $(o_i, o_j)$ pair, so the RN considers the potential relations between all object pairs.<br>
 * The output, and hence the training cost, depends on all object pairs of the entire object set rather than on individual objects.<br>
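
The composite function above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`make_mlp`, `mlp_forward`, `relation_network`), the layer sizes, and the random initialisation are all hypothetical, and no training is shown. It only demonstrates that $g_\theta$ is applied to every ordered pair, the results are summed, and $f_\phi$ reads out the sum.

```python
import random

def make_mlp(sizes, rng):
    """Create a list of (weights, biases) layers with small random weights."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def mlp_forward(layers, x):
    """Apply each layer in turn, with ReLU on all but the last layer."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def relation_network(objects, g_theta, f_phi):
    """RN(O) = f_phi(sum over all ordered pairs (i, j) of g_theta(o_i, o_j))."""
    pooled = [0.0] * len(g_theta[-1][1])
    for o_i in objects:
        for o_j in objects:
            relation = mlp_forward(g_theta, o_i + o_j)  # concatenate the pair
            pooled = [p + r for p, r in zip(pooled, relation)]
    return mlp_forward(f_phi, pooled)

rng = random.Random(0)
m = 3                                   # object dimensionality (illustrative)
g_theta = make_mlp([2 * m, 8, 8], rng)  # pairwise relation MLP
f_phi = make_mlp([8, 8, 4], rng)        # readout MLP over the summed relations
objects = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(5)]

output = relation_network(objects, g_theta, f_phi)
reordered_output = relation_network(list(reversed(objects)), g_theta, f_phi)
```

Because the pairwise outputs are summed, `output` and `reordered_output` agree up to floating-point rounding: the RN treats its input as a set, not a sequence.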

<h2>Tasks</h2>

RN-augmented networks are applied to several domains:

* **CLEVR**: visual QA dataset
    * Architectures must reason about relations over the features in the visual inputs, the language inputs, and their conjunction.
    * Contains images of 3D-rendered objects associated with questions, such as querying an object's color or comparing the materials of two objects.
    * Powerful QA architectures in <a href="https://arxiv.org/abs/1511.02274">existing work</a> cannot solve certain tasks that require handling relational aspects.
    * Two versions:
        * Pixel version: images are given in their standard 2D pixel form.
        * State description version: images are replaced by state descriptions that list each object's attributes, such as its 3D coordinates, color, shape, material, and size.
<img src="fig1.png"></img>

* **Sort-of-CLEVR**: separates relational and non-relational questions
    * Each image contains 6 objects, each a randomly chosen shape (square or circle) in one of 6 colors.
    * Questions are hard-coded as fixed-length binary strings.
    * 10 relational questions and 10 non-relational questions are asked per image.
    
* **bAbI**: pure text-based QA dataset
    * 20 tasks, each corresponding to a particular type of reasoning, e.g. deduction, induction, counting, ...
    * Each question is associated with a set of supporting facts that serve as its context.

* **Dynamic physical systems**: simulated physical mass-spring systems using the MuJoCo engine
    * 10 balls on a table, where some move independently while other randomly selected pairs are connected by invisible springs or rigid constraints.
    * The connections form a randomly connected graph over the balls.
    * Inputs are the colors of the balls and their coordinate positions across multiple sequential frames.
    * Two separate tasks, both using only those colors and positions:
        * infer the existence or absence of connections between balls
        * count the number of systems (connected groups of balls) on the table
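
The counting task above asks how many separate connected groups of balls are on the table. Given the true connection list, that target is the number of connected components of the graph. A minimal sketch, assuming an isolated ball counts as its own group (these notes do not settle that convention) and using a hypothetical `count_systems` helper with a made-up scene:

```python
def count_systems(num_balls, connections):
    """Count connected components among balls 0..num_balls-1 using union-find."""
    parent = list(range(num_balls))

    def find(ball):
        # Follow parent links to the root, shortening the path as we go.
        while parent[ball] != ball:
            parent[ball] = parent[parent[ball]]
            ball = parent[ball]
        return ball

    for a, b in connections:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    return len({find(ball) for ball in range(num_balls)})

# Illustrative scene: 6 balls, where 0-1-2 form a chain and 3-4 are paired.
example_connections = [(0, 1), (1, 2), (3, 4)]
num_systems = count_systems(6, example_connections)
```

Here the groups are {0, 1, 2}, {3, 4}, and the lone ball 5. The RN never sees `example_connections`; it has to infer this structure from motion alone, which is what makes the task relational.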

<h2>Models</h2>

RNs operate on objects, but they do not dictate what an object is. The learning process induces the upstream modules to produce useful "objects" from distributed representations.
<img src="fig2.png"></img>
* **Dealing with pixels**: use a CNN to process the pixels; the cells of the final convolutional feature map serve as the objects for the RN.

* **Dealing with natural language**:
    * first identify up to 20 sentences in the support set that immediately precede the probe question
    * tag each sentence with its relative position in the support set
    * process each sentence with an LSTM; its final state is that sentence's object for the RN.

* **Dealing with questions**: use an LSTM to process the question words and pass its final state to the RN as the question embedding.

* **Model details**:
    * **CLEVR** task
        * Training: mini-batches of size 64; distributed training with 10 workers synchronously updating a central parameter server.
        * Image processing: CNN with 4 conv layers, each with 24 kernels, ReLU non-linearities, and batch normalization.
        * Question processing: 128-unit LSTM; 32-unit word lookup embeddings.
        * RN: 4-layer MLP with 256 units per layer and ReLU non-linearities for $g_\theta$; 3-layer MLP with 256, 256 (with $50\%$ dropout), and 29 units with ReLU non-linearities for $f_\phi$; softmax output with a cross-entropy loss, optimized by Adam with a learning rate of $2.5\times10^{-4}$.
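
One way to read the pixel and question bullets above is that every cell of the final feature map becomes an object, and the question embedding is appended to every object pair before it reaches $g_\theta$. The sketch below shows only that data preparation. Tagging each cell with its coordinates, normalising them to $[0, 1]$, the helper names, and the tiny sizes are assumptions made for illustration, not details taken from these notes.

```python
def feature_map_to_objects(feature_map):
    """Turn a d x d grid of k-dim cells into d*d objects tagged with (row, col)."""
    d = len(feature_map)
    objects = []
    for row in range(d):
        for col in range(d):
            # Normalised coordinates record where in the image the cell came from.
            coords = [row / (d - 1), col / (d - 1)] if d > 1 else [0.0, 0.0]
            objects.append(list(feature_map[row][col]) + coords)
    return objects

def build_pair_inputs(objects, question):
    """Build one input vector [o_i, o_j, q] per ordered pair of objects."""
    return [o_i + o_j + list(question) for o_i in objects for o_j in objects]

# Illustrative sizes only: a 2 x 2 map with k = 3 channels and a 4-dim question.
feature_map = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
               [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]]
question = [1.0, 0.0, 0.0, 1.0]

objects = feature_map_to_objects(feature_map)
pair_inputs = build_pair_inputs(objects, question)
```

Each vector in `pair_inputs` would be fed to $g_\theta$. The number of pairs grows with the square of the number of objects, so the resolution of the final feature map directly controls the cost of the RN.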
        
    

<h2>Results</h2>

<h3>CLEVR pixel version</h3>

<img src="./fig3.png"></img>

The RN architecture improves on the previous best model mostly in the compare attribute and count categories.

<img src="./fig4.png"></img>

<h3>CLEVR state description version</h3>

The RN model achieved $96.4\%$ accuracy, which shows that the model generalises beyond visual problems. It can learn and reason about object relations while remaining agnostic to the nature of the objects themselves.

<h3>Sort-of-CLEVR from pixels</h3>

In these tasks, parsing the question is much easier because questions are hard-coded. The RN achieved above $94\%$ accuracy on both relational and non-relational questions. In comparison, a CNN with an MLP achieved only $63\%$ on the relational questions, and only $52.3\%$ on heavily relational questions of the "closest-to" or "furthest-from" kind.

<h3>bAbI</h3>

The RN model succeeded on $18/20$ tasks and has a lower error rate ($2.1\%$) on the basic induction task than Sparse DNC ($54\%$), DNC ($55.1\%$), and EntNet ($52.1\%$). On the tasks it failed, it came close to the success threshold.

<h3>Dynamic physical systems</h3>

The RN model correctly classified all the connections in $93\%$ of the sample scenes in the connection inference task, and reached $95\%$ in the counting task. In comparison, an MLP with a comparable number of parameters could not match the RN on both tasks.

<h2>Discussion</h2>

Experiments showed that including an RN module in simple CNN- and LSTM-based models raises their performance. The authors speculate that the RN module frees the CNN to focus on processing local spatial structure, so that processing and reasoning are handled by distinct components.

Future work includes applying RNs to problems that can benefit from structure learning and exploitation, such as rich scene understanding in RL agents and abstract problem solving.

RN modules can be optimized further, e.g. by exploiting prior knowledge, omitting weak or non-existent relations, or using an attention mechanism to allocate computational power to the important relations.