### Introduction

In [7]:
# We show how a sequential model can be written in functional form
# First the "classic" way
from keras.models import Sequential, Model
from keras import layers
from keras import Input

seq_model = Sequential()
seq_model.add(layers.Dense(32, activation='relu', input_shape=(64,)))
seq_model.add(layers.Dense(32, activation='relu'))
seq_model.add(layers.Dense(10, activation='softmax'))

# And now its functional equivalent
input_tensor = Input(shape=(64,))
x = layers.Dense(32, activation='relu')(input_tensor)
x = layers.Dense(32, activation='relu')(x)
output_tensor = layers.Dense(10, activation='softmax')(x)

# We now define the model, just by specifying the input and the output tensors
model = Model(input_tensor, output_tensor)

# And finally we can see its structure
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 64)                0         
_________________________________________________________________
dense_11 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_12 (Dense)             (None, 32)                1056      
_________________________________________________________________
dense_13 (Dense)             (None, 10)                330       
Total params: 3,466
Trainable params: 3,466
Non-trainable params: 0
_________________________________________________________________
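
The parameter counts in the summary can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A small sketch of that arithmetic follows; the helper name `dense_params` is made up for this illustration and is not part of Keras.

```python
def dense_params(n_in, n_out):
    """Trainable parameters of a fully connected layer with a bias term."""
    return n_in * n_out + n_out

# Input size followed by the number of units in each Dense layer
layer_sizes = [64, 32, 32, 10]

# One count per Dense layer, pairing each layer's input size with its output size
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # [2080, 1056, 330]
print(total)   # 3466
```

These values match the `Param #` column and the `Total params` line of the summary.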


We're now going to consider three ways in which the functional API can be helpful:
* Multi-input models
* Multi-output models
* Graph-like models

### Multi-input models

In [8]:
# We consider a simple example, a question-answering model. This will take two inputs:
# - The question
# - Some reference text with information on how to answer

from keras.models import Model
from keras import layers
from keras import Input

text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

# The text input is a variable-length sequence of integers
# After defining it, we embed it into a sequence of vectors of size 64
# (Embedding takes the vocabulary size first, then the embedding size)
# Then we encode it into a single vector via an LSTM
text_input = Input(shape=(None,), dtype='int32', name='text')
embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input)
encoded_text = layers.LSTM(32)(embedded_text)

# Now let's follow the same process for the question input
question_input = Input(shape=(None,), dtype='int32', name='question')
embedded_question = layers.Embedding(question_vocabulary_size, 32)(question_input)
encoded_question = layers.LSTM(16)(embedded_question)

# We then concatenate the text and the question
concatenated = layers.concatenate([encoded_text, encoded_question], axis=-1)

# And finally we add a softmax classifier on top, as output layer
answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated)

# Let's now instantiate the model, specifying the two inputs and the output
model = Model([text_input, question_input], answer)

# And finally we compile it
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

How do we train this model? There are two possible APIs:
1. Feed the model a list of Numpy arrays as inputs
2. Feed it a dictionary that maps input names to Numpy arrays (this option is only available if we give names to the inputs)

We show both below.

In [9]:
import numpy as np

num_samples = 1000
max_length = 100

# Defining some dummy input and output data
text = np.random.randint(1, text_vocabulary_size, size=(num_samples, max_length))
question = np.random.randint(1, question_vocabulary_size, size=(num_samples, max_length))
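# Targets are one-hot encoded: one random answer class per sample
# (np.random.randint(0, 1, ...) would only ever produce zeros)
answer_ids = np.random.randint(0, answer_vocabulary_size, size=num_samples)
answers = np.eye(answer_vocabulary_size)[answer_ids]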

# We can fit the model by feeding a list of Numpy arrays (not run)
# model.fit([text, question], answers, epochs=10, batch_size=128)

# Or we can fit it using a dictionary of inputs (only if inputs are named)
model.fit({'text': text, 'question': question}, answers, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fac6c31bcc0>
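
The `categorical_crossentropy` loss compares the softmax output with a target distribution over the classes, which is typically one-hot: each row has a single 1 at the index of the correct class. Below is a minimal NumPy sketch of building such targets from integer class ids. The names `class_ids` and `one_hot` are illustrative, and the sizes are kept small for readability.

```python
import numpy as np

num_samples = 5
num_classes = 4

# Random integer class ids in [0, num_classes)
rng = np.random.default_rng(0)
class_ids = rng.integers(0, num_classes, size=num_samples)

# Start from all zeros and set one entry per row to 1
one_hot = np.zeros((num_samples, num_classes), dtype="float32")
one_hot[np.arange(num_samples), class_ids] = 1.0

print(one_hot.shape)  # (5, 4)
```

Each row of `one_hot` sums to 1, and `one_hot.argmax(axis=1)` recovers the original class ids.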

### Multi-output models

A simple example is a network that simultaneously predicts several properties of the data: for instance, a network that takes as input a series of social media posts from a single anonymous person and tries to predict attributes of that person, such as age, gender, and income level.

In [11]:
from keras import layers
from keras import Input
from keras.models import Model

vocabulary_size = 50000
num_income_group = 10

# A single input, which we first embed into vectors of size 256
# (Embedding takes the vocabulary size first, then the embedding size)
# and then process with Conv1D layers
posts_input = Input(shape=(None,), dtype='int32', name='posts')
embedded_posts = layers.Embedding(vocabulary_size, 256)(posts_input)
x = layers.Conv1D(128, 5, activation='relu')(embedded_posts)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation='relu')(x)

# And now we can set up the different output layers, each with a given name
age_prediction = layers.Dense(1, name='age')(x)
income_prediction = layers.Dense(num_income_group, activation='softmax', name='income')(x)
gender_prediction = layers.Dense(1, activation='sigmoid', name='gender')(x)

# Finally we instantiate the model, specifying the single input and the various outputs
model = Model(posts_input, [age_prediction, income_prediction, gender_prediction])

Training such a model requires the ability to specify different loss functions for different heads of the network: for instance, age prediction is a scalar regression task, but gender prediction is a binary classification task, requiring a different training procedure.
But because gradient descent requires you to minimize a scalar, you must combine these losses into a single value in order to train the model. The simplest way to combine different losses is to sum them all.

In [15]:
# Let's compile the model with one loss per output, giving different weights to the various losses
model.compile(optimizer='rmsprop',
              loss={'age': 'mse', 
                    'income': 'categorical_crossentropy',
                    'gender': 'binary_crossentropy'},
             loss_weights={'age': 0.25,
                           'income': 1.,
                           'gender': 10.})
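
To make the weighting concrete: the scalar that gradient descent minimizes is the weighted sum of the per-head losses. The sketch below uses made-up loss values (they are not measurements from this model) together with the weights from the `compile` call above.

```python
# Weights taken from the compile call
loss_weights = {'age': 0.25, 'income': 1.0, 'gender': 10.0}

# Hypothetical per-head loss values, chosen only to illustrate the arithmetic
head_losses = {'age': 4.0, 'income': 2.0, 'gender': 0.5}

# Contribution of each head to the total loss
contributions = {name: loss_weights[name] * head_losses[name]
                 for name in head_losses}
total_loss = sum(contributions.values())

print(contributions)  # {'age': 1.0, 'income': 2.0, 'gender': 5.0}
print(total_loss)     # 8.0
```

Weights like these are commonly used to balance heads whose losses live on different scales, so that no single head dominates the combined value.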

In [None]:
# And finally we would fit the model by specifying a dictionary of arrays (but for the outputs this time)
# posts, age_targets, income_targets and gender_targets are assumed to be Numpy arrays defined elsewhere
model.fit(posts, {'age': age_targets,
                  'income': income_targets,
                  'gender': gender_targets},
          epochs=10, batch_size=64)

### Directed acyclic graphs of layers

With the functional API, not only can you build models with multiple inputs and multiple outputs, but you can also implement networks with a complex internal topology. Neural networks in Keras are allowed to be arbitrary directed acyclic graphs of layers.
Several common neural-network components are implemented as graphs. Two notable ones are Inception modules and residual connections. We consider them in turn.

#### Inception modules

This is a popular type of network architecture for convolutional neural networks. It consists of a stack of modules that themselves look like small independent networks, split into several parallel branches. Assuming the existence of a 4D input tensor `x`, this is how the network on page 243 can be implemented.

In [None]:
from keras import layers

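# Every layer uses padding='same' so that all branches end up with the same
# spatial size, which the concatenation at the end requires
branch_a = layers.Conv2D(128, 1, activation='relu', strides=2, padding='same')(x)

branch_b = layers.Conv2D(128, 1, activation='relu', padding='same')(x)
branch_b = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(branch_b)

branch_c = layers.AveragePooling2D(3, strides=2, padding='same')(x)
branch_c = layers.Conv2D(128, 3, activation='relu', padding='same')(branch_c)

branch_d = layers.Conv2D(128, 1, activation='relu', padding='same')(x)
branch_d = layers.Conv2D(128, 3, activation='relu', padding='same')(branch_d)
branch_d = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(branch_d)

# The branch outputs are concatenated along the channel axis
output = layers.concatenate([branch_a, branch_b, branch_c, branch_d], axis=-1)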
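
Concatenating along the channel axis only works if every branch produces the same height and width. The sketch below applies the standard output-size formulas for the two Keras padding modes, which is handy for checking a branch design by hand. The helper `conv_output_size` and the input size of 32 are assumptions made for this illustration.

```python
import math

def conv_output_size(n, kernel, stride=1, padding='valid'):
    """Output length along one spatial axis for a convolution or pooling layer."""
    if padding == 'same':
        # The input is padded so that only the stride shrinks the output
        return math.ceil(n / stride)
    if padding == 'valid':
        # No padding: the window must fit entirely inside the input
        return (n - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

n = 32  # hypothetical input height/width

same_3x3_s2 = conv_output_size(n, 3, stride=2, padding='same')
valid_3x3_s2 = conv_output_size(n, 3, stride=2, padding='valid')
valid_1x1_s2 = conv_output_size(n, 1, stride=2, padding='valid')

print(same_3x3_s2, valid_3x3_s2, valid_1x1_s2)  # 16 15 16
```

With `'valid'` padding (the Keras default), a strided 3x3 window and a strided 1x1 window give different sizes for the same input, so branches mixing them could not be concatenated. With `'same'` padding the output size depends only on the stride.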

Note that the full Inception V3 architecture is available in Keras as `keras.applications.inception_v3.InceptionV3`, including weights pretrained on the ImageNet dataset.

#### Residual connections

Residual connections are a common graph-like network component found in many post-2015 network architectures, including Xception. They tackle two common problems that plague any large-scale deep-learning model: vanishing gradients and representational bottlenecks. In general, adding residual connections to any model that has more than 10 layers is likely to be beneficial.

A residual connection consists of making the output of an earlier layer available as input to a later layer, effectively creating a shortcut in a sequential network. Rather than being concatenated to the later activation, the earlier output is summed with the later activation, which assumes that both activations are the same size. 
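
The idea can be sketched outside Keras with plain NumPy. This is a toy illustration in which a single matrix multiplication with a ReLU stands in for a stack of layers; the weight matrices are random placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # batch of 4 feature vectors of size 8

# Same-size case: the block keeps the feature size, so x can be added directly
w = rng.normal(size=(8, 8))      # placeholder weights of the block
block_out = np.maximum(x @ w, 0.0)
y = block_out + x                # the residual connection: a plain sum

# Different-size case: the block changes the feature size to 16,
# so x is first passed through a linear projection to match
w_wide = rng.normal(size=(8, 16))
wide_out = np.maximum(x @ w_wide, 0.0)
projection = rng.normal(size=(8, 16))
y_wide = wide_out + x @ projection

print(y.shape, y_wide.shape)     # (4, 8) (4, 16)
```

The linear projection plays the same role as a 1x1 convolution without activation on feature maps: it reshapes the earlier output so that the sum is well defined.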

In [None]:
# Implementation of residual connection in Keras when the feature-map sizes are the same
from keras import layers

x = ...
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
# Here we add the original x back to the output features
y = layers.add([y, x])

In [None]:
# Implementation of residual connection when the feature-map sizes are different
from keras import layers

x = ...
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.MaxPooling2D(2, strides=2)(y)

# We use a 1x1 convolution to linearly downsample the original x tensor to the same shape as y
residual = layers.Conv2D(128, 1, strides=2, padding='same')(x)

# Here we add the transformed x back to the output features
y = layers.add([y, residual])

#### Layer weight sharing

One more important feature of the functional API is the ability to reuse a layer instance several times. When you call a layer instance twice, instead of instantiating a new layer for each call, you reuse the same weights with every call. This allows you to build models that have shared branches—several branches that all share the same knowledge and perform the same operations. That is, they share the same representations and learn these representations simultaneously for different sets of inputs.

In [None]:
# This is an example of how to implement a model with two inputs that can be interchanged
from keras import layers
from keras import Input
from keras.models import Model

# We instantiate an LSTM layer only once
lstm = layers.LSTM(32)

# Now we build the left side of the network
left_input = Input(shape=(None, 128))
left_output = lstm(left_input)

# And then the right one
right_input = Input(shape=(None, 128))
right_output = lstm(right_input)

# Then we concatenate the two outputs
merged = layers.concatenate([left_output, right_output], axis=-1)

# And build the classifier on top
predictions = layers.Dense(1, activation='sigmoid')(merged)

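# Finally we instantiate, compile, and train the model. When this model is trained,
# the weights of the LSTM layer are updated based on both inputs
# (left_data, right_data and targets are assumed to be Numpy arrays defined elsewhere)
model = Model([left_input, right_input], predictions)
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
model.fit([left_data, right_data], targets)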
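
What "reusing the same weights" means can be shown with a toy stand-in for a layer. The class `TinyDense` below is hypothetical and only mimics the relevant behaviour: weights are created once, when the object is constructed, and every call uses them.

```python
import numpy as np

class TinyDense:
    """Toy layer: one weight matrix, created at construction time."""

    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.kernel = rng.normal(size=(n_in, n_out))

    def __call__(self, inputs):
        return np.tanh(inputs @ self.kernel)

shared = TinyDense(3, 2)             # instantiated once
left = np.ones((1, 3))
right = np.ones((1, 3))

# Calling the same instance twice applies the same kernel to both inputs
left_out = shared(left)
right_out = shared(right)
same_before = np.array_equal(left_out, right_out)

# Changing the kernel (as a training step would) affects both branches at once
shared.kernel += 0.1
same_after = np.array_equal(shared(left), shared(right))

# A second instance has its own, independent weights
separate = TinyDense(3, 2, seed=1)
kernels_differ = not np.allclose(shared.kernel, separate.kernel)

print(same_before, same_after, kernels_differ)  # True True True
```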

#### Models as layers

Importantly, in the functional API, models can be used as you’d use layers—effectively, you can think of a model as a “bigger layer.” This is true of both the Sequential and Model classes. This means you can call a model on an input tensor and retrieve an output tensor.
When you call a model instance, you’re reusing the weights of the model—exactly like what happens when you call a layer instance. Calling an instance, whether it’s a layer instance or a model instance, will always reuse the existing learned representations of the instance—which is intuitive.

One simple practical example of what you can build by reusing a model instance is a vision model that uses a dual camera as its input: two parallel cameras, a few centimeters (one inch) apart. Such a model can perceive depth, which can be useful in many applications. You shouldn’t need two independent models to extract visual features from the left camera and the right camera before merging the two feeds. Here’s how you’d implement a Siamese vision model (shared convolutional base) in Keras:

In [5]:
from keras import layers
from keras import applications
from keras import Input

# The base image-processing model is the Xception network (convolutional base only)
xception_base = applications.Xception(weights=None, include_top=False)

# The inputs are 250 × 250 RGB images.
left_input = Input(shape=(250, 250, 3))
right_input = Input(shape=(250, 250, 3))

# We call the same vision model twice
left_features = xception_base(left_input)
right_features = xception_base(right_input)

# The merged features contain information from the right visual feed and the left visual feed.
merged_features = layers.concatenate([left_features, right_features], axis=-1)