<h1><a href="https://arxiv.org/abs/1806.01261">
Relational inductive biases, deep learning, and graph networks</a></h1>
by Peter W. Battaglia et al.


<h2>Summary</h2>
* Graph networks
    * Carry relational inductive biases within deep neural networks
    * Support relational reasoning and combinatorial generalization

<h2>Motivation</h2>

* Deep learning faces key challenges in areas that **demand combinatorial generalization**.

* Deep learning that eschews **compositionality and explicit structure** struggles with these challenges.

* Recent examples of principled hybrids of structure-based methods and deep learning show the great promise of combining **connectionist** and **symbolic** approaches.

* A new class of models has arisen
    * reasoning about explicitly structured data, in particular graphs
    * performing computation over discrete entities and the relations between them
    * carrying strong relational inductive biases in the form of specific architectural assumptions



<h2>Relational Inductive Biases</h2>

`Relational inductive biases` are used to achieve **`relational reasoning`**.

* **`Relational reasoning`** manipulates `structured computation and representations` of `entities` and `relations`, using `rules` for how they can be composed.
    * `Structure` is the product of composing a set of known building blocks. 
    * `Structured representations` capture the composition.
    * `Structured computations` operate over the elements and their composition as a whole.
    * An `entity` is an element with attributes, e.g. mass, size, ...
    * A `relation` is a property between entities, e.g. same size, distance from, slower, ...
    * A `rule` is a function that maps entities and relations to other entities and relations, e.g. 'does entity X have property Y?', 'do entities X and Y have property Z?', ...
    
* **`Inductive biases`** allow a learning algorithm to prioritize one solution (or interpretation) over another, independent of the observed data. Relational inductive biases do so by imposing constraints on relationships and interactions between entities.
    * Examples include the prior distribution in a Bayesian model, a regularization term penalizing overfitting, ...
    * The underlying assumption is that searching for good solutions is easier when there is less ambiguity among solutions, e.g. the problem can be interpreted with a linear model; the model errors to be minimized can be approximated by a quadratic penalty; the natural and residual errors can be described as Gaussian noise, ...
     
> * The composition of layers in standard neural networks provides particular types of relational inductive bias.
    * For `MLP`, the units are the `entities`; the full (all-to-all) connectivity between layers reflects the `relations`; the `weights and biases` specify the `rules`.
    * For `CNN`, the individual pixels are `entities`; the local receptive fields over the input reflect the `relations`; reusing the same kernel across localities of the input reflects `translation invariance`.
    * For `RNN`, the hidden state at each processing step is an `entity`; the Markov dependence of the hidden state on the previous hidden state and the current input is the `relation`; the `rule` is that the hidden state is updated from the previous hidden state and the current input; reusing the rule at every step reflects `temporal invariance`.
* But none of these standard building blocks operates on arbitrary relational structure.

* What is needed is a model that has **explicit representations of entities and relations**, together with **learning algorithms** that **find rules** for computing their interactions. **Invariance** to `ordering` should be reflected. 
    * `ordering`: an arrangement of entities of the same class by some property, e.g. size, mass, age, ...
    
> * How `ordering` affects standard learning algorithms when solving problems.
     * Compute the center of mass of a solar system with $n$ planets using an MLP that takes the input vector $[x_1, x_2, ..., x_n]^T$ 
     * The learnt model does not necessarily transfer to a different permutation of the same elements, such as $[x_n, x_1, ..., x_2]^T$ 
     * Since there are $n!$ permutations, covering all of them can require an exponential number of input/output training examples  
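
The ordering problem above can be illustrated with a small sketch. It is not taken from the paper: the three 1-D planets, the fixed `weights`, and the helper `position_dependent_readout` (a stand-in for a single linear unit of an MLP) are all assumptions made for illustration. A sum-based computation gives the same answer for every ordering, while a unit whose weights are tied to input slots does not.

```python
from itertools import permutations

# Hypothetical 1-D "solar system": one (mass, position) pair per planet.
planets = [(1.0, 0.0), (2.0, 3.0), (3.0, -1.0)]

def center_of_mass(bodies):
    """Permutation-invariant: only sums over the set of bodies are used."""
    total_mass = sum(m for m, _ in bodies)
    return sum(m * x for m, x in bodies) / total_mass

def position_dependent_readout(bodies, weights):
    """Stand-in for one linear MLP unit: weight k is tied to input slot k."""
    flat = [value for body in bodies for value in body]
    return sum(w * v for w, v in zip(weights, flat))

weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # arbitrary fixed weights (assumption)

# Evaluate both functions on every ordering of the same planets.
com_values = {round(center_of_mass(list(p)), 9) for p in permutations(planets)}
mlp_values = {round(position_dependent_readout(list(p), weights), 9)
              for p in permutations(planets)}

print("distinct center-of-mass values:", len(com_values))
print("distinct slot-weighted values:", len(mlp_values))
```

The slot-weighted unit would have to learn, from data, that all orderings describe the same system; the sum-based version has that invariance built in.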



<h2>Graph Networks basics</h2>


The graph network (GN) framework defines a class of functions, including neural networks, for relational reasoning over graph-structured representations.

**`GN block`** is the main unit. It takes a graph as input and outputs a graph. The block organization emphasizes customizability and the synthesis of architectures that express the desired relational inductive biases.

* A GN block operates on a graph $G=(u, V, E)$. $V=\{v_1, v_2, \ldots, v_{|V|}\}$ is the set of vertices. $E=\{(e_1, r_1, s_1), (e_2, r_2, s_2), \ldots, (e_{|E|}, r_{|E|}, s_{|E|})\}$ is the set of edges, where for edge $k$, $e_k$ is the edge attribute, $s_k$ is the index of the 'sender' vertex, and $r_k$ is the index of the 'receiver' vertex. $u$ is a global attribute of the graph. 
   
    * Attributes can be associated with individual edges and vertices, and can be vectors, sets, or even graphs.
    * The global attribute is a graph-level attribute.
    * `Multi-graph`: there can be multiple edges between vertices, including self-edges.

* A GN block contains 3 update functions, $\phi$, and 3 aggregation functions, $\rho$. **Aggregation functions take a set as input and must therefore be invariant to permutations of their inputs.**

\begin{align}
e'_k&=\phi^e(e_k, v_{r_k}, v_{s_k}, u)\\
v'_i&=\phi^v(\bar{e}'_i, v_i, u)\\
u'&=\phi^u(\bar{e}', \bar{v}', u)\\
\bar{e}'_i&=\rho^{e\rightarrow v}(E'_i)\\
\bar{e}'&=\rho^{e\rightarrow u}(E')\\
\bar{v}'&=\rho^{v\rightarrow u}(V')
\end{align}
<center>where $E'_i=\{(e'_k, r_k, s_k) \mid r_k=i\}$, $V'=\{v'_i \mid i=1,2,\ldots,|V|\}$, and $E'=\cup_i E'_i$</center>

* The computational steps within a GN block are as follows
    <ol>
    <li>Apply the edge update function to every edge: $e'_k =\phi^e(e_k, v_{r_k}, v_{s_k},u)$. The updated set of edges is $E'=\cup_i E'_i=\{(e'_k, r_k, s_k) \mid k=1, 2, \ldots, |E|\}$</li>
    <li>For each vertex $v_i$, apply the aggregation function to the updated edges that project to it: $\bar{e}'_i=\rho^{e\rightarrow v}(E'_i)$</li>
    <li>Apply the vertex update function to every vertex: $v'_i =\phi^v(\bar{e}'_i, v_i, u)$</li>
    <li>Apply the aggregation function to the set of all updated edges: $\bar{e}'=\rho^{e\rightarrow u}(E')$</li>
    <li>Apply the aggregation function to the set of all updated vertices: $\bar{v}'=\rho^{v\rightarrow u}(V')$</li>
    <li>Update the global attribute of the graph: $u'=\phi^u(\bar{e}',\bar{v}',u)$</li>
    </ol>
    
* The relational inductive biases in GN
    
    * Graphs can express arbitrary relationships among entities, so the GN's input determines how representations interact.
    * The entities and their relations are represented as sets, which are invariant to permutations.
    * The per-edge and per-vertex functions are reused across all edges and vertices, respectively, thus supporting combinatorial generalization.
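
The six computational steps of a GN block can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is stored as a dictionary with keys `"u"`, `"V"`, `"E"`, attributes are scalars, and the update functions `phi_e`, `phi_v`, `phi_u` are simple arithmetic stand-ins where a real model would typically use neural networks. Summation is used for all three aggregations.

```python
def gn_block(graph, phi_e, phi_v, phi_u, rho_ev, rho_eu, rho_vu):
    """One pass of a GN block over graph = {"u": u, "V": [v_i], "E": [(e_k, r_k, s_k)]}."""
    u, V, E = graph["u"], graph["V"], graph["E"]

    # Step 1: update every edge from its attribute, receiver, sender, and the global.
    E_new = [(phi_e(e, V[r], V[s], u), r, s) for (e, r, s) in E]

    # Steps 2-3: per vertex, aggregate the updated incoming edges, then update the vertex.
    V_new = []
    for i, v in enumerate(V):
        incoming = [e for (e, r, _) in E_new if r == i]
        V_new.append(phi_v(rho_ev(incoming), v, u))

    # Steps 4-5: aggregate all updated edges and all updated vertices.
    e_bar = rho_eu([e for (e, _, _) in E_new])
    v_bar = rho_vu(V_new)

    # Step 6: update the global attribute.
    return {"u": phi_u(e_bar, v_bar, u), "V": V_new, "E": E_new}


# Toy update functions (assumptions for illustration only).
def phi_e(e, v_r, v_s, u):
    return e + v_s + u

def phi_v(e_bar_i, v, u):
    return v + e_bar_i + u

def phi_u(e_bar, v_bar, u):
    return u + e_bar + v_bar

# Three vertices, three directed edges stored as (attribute, receiver, sender).
graph = {"u": 0.0, "V": [1.0, 2.0, 3.0],
         "E": [(0.5, 1, 0), (1.5, 2, 1), (2.5, 2, 0)]}

out = gn_block(graph, phi_e, phi_v, phi_u, rho_ev=sum, rho_eu=sum, rho_vu=sum)
print(out["V"], out["u"])
```

Because every aggregation is a sum over a set, listing the edges in a different order leaves the updated vertices and the updated global attribute unchanged.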
    

<h2>Design Principles for GN Architectures</h2>

<h3>High flexibility in graph representations</h3>

* **Attributes** representation

    * Attributes of a GN block can take arbitrary representational formats, e.g. sequences, sets, graphs, tensors. 
    * The content depends on the requirements of the specific problem, e.g. image pixels.
    * The output of a GN block can be tailored to the demands of the task
        * Use edge attributes as output to represent interactions among entities.
        * Use vertex attributes as output to reason about physical systems.
        * Use graph attributes as output to represent global properties.
        * Output can be a mixture of attributes (check <a href="https://arxiv.org/abs/1806.01203">Relational inductive bias for physical construction in humans and machines</a>)

* **Graph** representation
    * Scenarios where the relational structure is explicitly specified by the input, e.g. knowledge graphs, social networks, optimization problems, physical systems with known interactions
    * Scenarios where the relational structure is inferred or assumed, e.g. visual scenes, text fragments, programs, and **multi-agent systems**.
        * The data may be formatted as a set of entities without relations, or even just as a vector or tensor. 
        * The existence of the entities can be assumed or inferred, e.g. from the features of input text or images in a CNN, or from the output of an auto-encoder.
        * Inference may be done by a separate mechanism.
        * Even when explicit relations between entities are not available, the relations can be assumed.
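
One simple way to assume relations when only a set of entities is given is to connect every entity to every other one. The sketch below shows that choice; the helper name `fully_connected_graph`, the dictionary layout, and the `None` placeholder for edge attributes are assumptions made for illustration, and other choices (e.g. sparse or learned connectivity) are equally possible.

```python
def fully_connected_graph(entities, u=None, self_edges=False):
    """Build a graph {"u", "V", "E"} in which every entity sends an edge to every other.

    Edges are stored as (attribute, receiver_index, sender_index); the edge
    attribute is left as None because no relation information is available.
    """
    V = list(entities)
    n = len(V)
    E = [(None, r, s)
         for r in range(n)
         for s in range(n)
         if self_edges or r != s]
    return {"u": u, "V": V, "E": E}


# Usage: three unlabeled entities, e.g. feature vectors from a CNN (hypothetical).
g = fully_connected_graph(["obj_a", "obj_b", "obj_c"])
print(len(g["V"]), "vertices,", len(g["E"]), "directed edges")
```

The number of edges grows quadratically with the number of entities, so this assumption is only practical for small sets.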
 
<h3>Configurability of the block inner structure</h3>

The structure and functions within a GN block can be configured, providing flexibility in the formats of inputs and outputs of the functions. <a href="https://arxiv.org/abs/1806.01203">Existing work</a> used neural networks as update functions and summations as aggregation functions.

<h4>Example: Message-passing neural network (MPNN)</h4>

<img src="./fig1.png"><center>An <a href="https://arxiv.org/abs/1704.01212">MPNN</a> can be translated naturally into the GN formalism.</center></img>
* The message function, $M_t$, in MPNN plays the role of the edge update function $\phi^e$, but without $u$ in the input.
* The elementwise summation plays the role of GN's aggregation function $\rho^{e\rightarrow v}$.
* The update function, $U_t$, in MPNN plays the role of the vertex update function $\phi^v$.
* The readout function, $R$, in MPNN plays the role of the global update function $\phi^u$, but takes neither $u$ nor $E'$ as input, so the aggregation function $\rho^{e\rightarrow u}$ is not needed.
* $d_{master}$ in MPNN is an extra vertex connected to all others, so it can be represented in the set of vertices, $V$.
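
The correspondence above can be sketched as one message-passing step followed by a readout. This is a hedged illustration with scalar states: the helper names `mpnn_step` and `readout`, the edge layout `(attribute, receiver, sender)`, and the toy choices for `M`, `U`, and `R` are assumptions, where a real MPNN would use learned functions.

```python
def mpnn_step(V, E, M, U):
    """One message-passing step.

    V: list of vertex states h_i.
    E: list of edges (e, r, s) with attribute e, receiver index r, sender index s.
    M: message function M(h_receiver, h_sender, e), the analogue of phi^e without u.
    U: vertex update U(h, m), the analogue of phi^v.
    """
    # Edge update: compute one message per edge.
    messages = [(M(V[r], V[s], e), r) for (e, r, s) in E]

    V_new = []
    for i, h in enumerate(V):
        # Summation over incoming messages plays the role of rho^{e -> v}.
        m_i = sum(m for (m, r) in messages if r == i)
        V_new.append(U(h, m_i))
    return V_new


def readout(V, R):
    """Graph-level output computed from vertex states only (no u, no E')."""
    return R(V)


# Toy functions (assumptions for illustration only).
M = lambda h_r, h_s, e: e * h_s
U = lambda h, m: h + m
R = sum

# One undirected edge between two vertices, stored as two directed edges.
V = [1.0, 2.0]
E = [(0.5, 0, 1), (0.5, 1, 0)]

V_next = mpnn_step(V, E, M, U)
y = readout(V_next, R)
print(V_next, y)
```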


<h3>Composable multi-block architectures</h3>

A graph network can be as complex as a composition of GN blocks, e.g. $G'=GN_2(GN_1(G))$.

* GN blocks can be connected serially, e.g. <a href="https://arxiv.org/abs/1806.01203">encode-process-decode design</a>, $G_0\overset{GN_1}{\rightarrow}G_1\overset{GN_2}{\rightarrow}G_2\overset{GN_3}{\rightarrow}\ldots\overset{GN_m}{\rightarrow}G_m$
* GN blocks can be structured as a <a href="https://arxiv.org/abs/1806.01242">recurrent architecture</a> by maintaining a hidden state.
* Operations such as concatenation can be applied to graphs, which allows GN blocks to be used in gated architectures such as LSTM.
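
Because each block maps a graph to a graph, serial composition amounts to function composition. The sketch below illustrates this with two toy blocks; `compose`, `repeat`, `shift_by_global`, and `double_vertices` are hypothetical names, and the toy blocks only touch vertex attributes, so they stand in for full GN blocks rather than implement them.

```python
from functools import reduce

def compose(*blocks):
    """Chain blocks serially: compose(gn1, gn2)(G) == gn2(gn1(G))."""
    return lambda graph: reduce(lambda g, block: block(g), blocks, graph)

def repeat(block, steps):
    """Apply the same block, with the same functions, for a number of steps."""
    return compose(*[block] * steps)

# Toy graph-to-graph blocks (assumptions for illustration only).
def shift_by_global(graph):
    """Add the global attribute u to every vertex attribute."""
    return dict(graph, V=[v + graph["u"] for v in graph["V"]])

def double_vertices(graph):
    """Multiply every vertex attribute by two."""
    return dict(graph, V=[2 * v for v in graph["V"]])

G = {"u": 1.0, "V": [1.0, 2.0], "E": []}

G_serial = compose(shift_by_global, double_vertices)(G)   # GN_2(GN_1(G))
G_shared = repeat(shift_by_global, 3)(G)                  # one block applied 3 times

print(G_serial["V"], G_shared["V"])
```

Reusing one block for several steps keeps the number of functions fixed while letting information propagate further across the graph; using distinct blocks at each step trades that sharing for more capacity.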

<h2>Limitations and Open Questions</h2>

GNs of the message-passing (MPNN) form cannot be guaranteed to solve some classes of problems, such as discriminating between certain non-isomorphic graphs.

Notions like recursion, control flow, and conditional iteration are not straightforward to represent with graphs.  **Programs and more 'computer-like' processing can offer greater representational and computational expressivity with respect to these notions, and some have argued they are an important component of human cognition.**

Q: Where do the graphs that graph networks operate over come from?

Q: How can sensory data be converted into more structured representations, such as graphs?

Q: How can graph structures be adaptively modified during the course of computation? E.g. when an object breaks into pieces, the entity representing it in the graph should also be split into a group of entities.