__Automatic differentiation (AD)__ is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the [chain rule](https://en.wikipedia.org/wiki/Chain_rule) repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program. 

Automatic differentiation is __NOT__:
- Symbolic differentiation, nor
- Numerical differentiation (the method of finite differences).
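
To make the contrast with finite differences concrete, here is a minimal sketch of forward-mode AD using dual numbers. It is only an illustration: the `Dual` class, the helper functions `sin` and `exp`, and the test function `f` are hypothetical names introduced here, and the rest of this notebook implements reverse mode rather than forward mode.

```python
import math

class Dual:
    """Value together with its derivative (illustrative helper, not a library class)."""
    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def sin(x):
    # Chain rule for an elementary function: (sin u)' = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.der)

def exp(x):
    # Chain rule: (exp u)' = exp(u) * u'
    e = math.exp(x.val)
    return Dual(e, e * x.der)

def f(x):
    # The "program" to differentiate, written with the overloaded operations
    return exp(x) * sin(x) + 3 * x

x0 = 1.5

# AD: seed the input derivative with 1 and read off the output derivative
ad_derivative = f(Dual(x0, 1.0)).der

# Closed-form derivative, used only as a reference
exact = math.exp(x0) * (math.sin(x0) + math.cos(x0)) + 3.0

# Finite differences: approximate, with a truncation error that depends on eps
def f_float(x):
    return math.exp(x) * math.sin(x) + 3 * x

eps = 1e-6
fd_derivative = (f_float(x0 + eps) - f_float(x0)) / eps

print('AD error:', abs(ad_derivative - exact))
print('FD error:', abs(fd_derivative - exact))
```

The AD result agrees with the closed form up to rounding, while the finite-difference estimate carries a truncation error that grows with the step size.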

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/AutomaticDifferentiationNutshell.png/1920px-AutomaticDifferentiationNutshell.png" width=600>

In [1]:
import numpy as np

In [2]:
import autograd.numpy as anp

In [3]:
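try:
    from autograd.scipy.special import logsumexp  # location in recent autograd releases
except ImportError:
    from autograd.scipy.misc import logsumexp     # location in older autograd releases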
from autograd.scipy.signal import convolve

def softmax(x):    return anp.exp(x - logsumexp(x))
def softplus(x):   return anp.logaddexp(0., x)
def sigmoid(x):    return anp.reciprocal(anp.exp(softplus(-x)))
def flip(x):       return x[::-1]

Elementary functions

In [4]:
def affine_transform(x, mat, bias):
    return np.dot(mat,x) + bias

def subassign(x, y, idx_dst, idx_src):
    assert(len(idx_dst) == len(idx_src))
    z = x.copy()
    z[idx_dst] = y[idx_src]
    return z

primitive_functions_dict = dict()

primitive_functions_dict['negative'] = lambda x: -1.*x
primitive_functions_dict['exponential'] = lambda x,a: np.power(a,x)
primitive_functions_dict['powexp'] = lambda x,y: np.power(x,y)
primitive_functions_dict['exp'] = lambda x: np.exp(x)
primitive_functions_dict['log'] = lambda x: np.log(x)
primitive_functions_dict['sin'] = lambda x: np.sin(x)
primitive_functions_dict['cos'] = lambda x: np.cos(x)
primitive_functions_dict['power'] = lambda x,p: np.power(x,p)
primitive_functions_dict['sub'] = lambda x,idx: x[idx]
primitive_functions_dict['subassign'] = lambda x,y,idx_dst,idx_src: subassign(x, y, idx_dst, idx_src)
primitive_functions_dict['flip'] = lambda x: np.flip(x)
primitive_functions_dict['diff'] = lambda x: np.diff(x)
primitive_functions_dict['cumsum'] = lambda x: np.cumsum(x)
primitive_functions_dict['concat'] = lambda *vectors: np.concatenate(vectors)
primitive_functions_dict['conv_full'] = lambda x, kernel: convolve(x, kernel, mode='full')
primitive_functions_dict['conv_valid'] = lambda x, kernel: convolve(x, kernel, mode='valid')
primitive_functions_dict['reciprocal'] = lambda x: 1./x
primitive_functions_dict['logsumexp'] = lambda x: logsumexp(x)
primitive_functions_dict['softmax'] = lambda x: softmax(x)
primitive_functions_dict['softplus'] = lambda x: softplus(x)
primitive_functions_dict['sigmoid'] = lambda x: sigmoid(x)
primitive_functions_dict['inner'] = lambda x,y: np.inner(x,y)
primitive_functions_dict['add'] = lambda x,y: x+y
primitive_functions_dict['subtract'] = lambda x,y: x-y
primitive_functions_dict['multiply'] = lambda x,y: x*y
primitive_functions_dict['divide'] = lambda x,y: x/y
primitive_functions_dict['affine'] = lambda x, *params: affine_transform(x, *params)
primitive_functions_dict['scalar_mult_add'] = lambda x, *params: affine_transform(x, *params)


Vector-Jacobian products (VJPs) for the elementary functions.


The function `vjp(vec, ans, *inputs, *params)` takes as input
- `vec` - the vector $v$
- `ans` - the value of the function at the given input arguments, $f(x,y,z,...)$
- `*inputs` - the input arguments, $x,y,z,...$
- `*params` - the parameters of the function

and returns the list

$$v^T \frac{\partial f}{\partial x}, v^T \frac{\partial f}{\partial y}, v^T \frac{\partial f}{\partial z}, ...$$
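
As a quick sanity check of this definition, the sketch below compares a VJP against the product of $v^T$ with an explicitly built Jacobian. The names `sin_vjp` and `multiply_vjp` are illustrative and introduced only here; `multiply_vjp` in particular is an assumed rule for the elementwise product and is not part of the notebook's `vjps_dict`.

```python
import numpy as np

def sin_vjp(vec, ans, x):
    # Elementwise sin has Jacobian diag(cos(x)), so v^T J = v * cos(x)
    return [vec * np.cos(x)]

def multiply_vjp(vec, ans, x, y):
    # Elementwise product f(x, y) = x * y:
    # df/dx = diag(y) and df/dy = diag(x), one list entry per input
    return [vec * y, vec * x]

x = np.array([0.1, 0.7, -1.3])
y = np.array([2.0, -0.5, 4.0])
v = np.array([1.0, -2.0, 0.5])

# One input: compare with v^T times the explicit Jacobian
explicit_sin = v @ np.diag(np.cos(x))
via_sin_vjp = sin_vjp(v, np.sin(x), x)[0]

# Two inputs: the VJP returns one vector per input argument
explicit_mul = [v @ np.diag(y), v @ np.diag(x)]
via_mul_vjp = multiply_vjp(v, x * y, x, y)

print(np.allclose(explicit_sin, via_sin_vjp))
print(all(np.allclose(a, b) for a, b in zip(explicit_mul, via_mul_vjp)))
```

The point of the VJP form is that the Jacobian is never materialized: for elementwise functions the product reduces to an elementwise multiplication.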

In [5]:
def vjp_affine_transform(vec, mat, bias):
    return np.dot(vec, mat)
 
vjps_dict = dict()

vjps_dict['negative'] = lambda vec, ans, x: [-1.*vec]
vjps_dict['exponential'] = lambda vec, ans, x, a: [vec*np.log(a)*ans]
vjps_dict['exp'] = lambda vec, ans, x: [vec*ans]
vjps_dict['log'] = lambda vec, ans, x: [vec/x]
vjps_dict['sin'] = lambda vec, ans, x: [vec*np.cos(x)]
vjps_dict['cos'] = lambda vec, ans, x: [-1.*vec*np.sin(x)]
vjps_dict['power'] = lambda vec, ans, x, p: [vec*p*np.power(x,p-1)]
vjps_dict['inner'] = lambda vec, ans, x,y: [vec*y, vec*x]
vjps_dict['add'] = lambda vec, ans, x,y: [vec, vec]
vjps_dict['subtract'] = lambda vec, ans, x,y: [vec, -1.*vec]
vjps_dict['affine'] = lambda vec, ans, x, *params: [vjp_affine_transform(vec, *params)]
vjps_dict['scalar_mult_add'] = lambda vec, ans, x, *params: [vjp_affine_transform(vec, *params)]


Forward pass

In [6]:
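def forward_pass(x, graph, nodes_sorted):

    # `graph.nodes` replaces the `graph.node` alias removed in NetworkX 2.4.
    # Evaluate nodes in topological order so that parents are computed first.
    for node in nodes_sorted:
        func = graph.nodes[node]['function']
        parents = graph.nodes[node]['parents']
        # Arguments: parent values first, then the node's fixed parameters.
        args = []
        for p in parents:
            args.append(graph.nodes[p]['value'])
        if 'params' in graph.nodes[node]:
            args = args + [*graph.nodes[node]['params']]

        if len(args) > 0:
            graph.nodes[node]['value'] = primitive_functions_dict[func](*args)
        else:
            # A node without arguments is the input node.
            graph.nodes[node]['value'] = x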
            

Backward pass

In [7]:
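def add_jacs(prev_jac, jac):
    # Accumulate the contributions that reach a node from several children.
    if prev_jac is None:
        return jac
    else:
        return prev_jac + jac

def backward_pass(graph, nodes_sorted):

    # Visit nodes in reverse topological order, starting from the output node.
    # `graph.nodes` replaces the `graph.node` alias removed in NetworkX 2.4.
    nodes_sorted_backward = nodes_sorted[::-1]
    end_node = nodes_sorted_backward[0]
    # Seed: derivative of the scalar output with respect to itself.
    jacs = {end_node: 1.0}

    for node in nodes_sorted_backward:
        func = graph.nodes[node]['function']
        value = graph.nodes[node]['value']
        parents = graph.nodes[node]['parents']

        # All children of `node` were visited earlier, so its VJP is complete.
        jac = jacs.pop(node)

        args = []
        for p in parents:
            args.append(graph.nodes[p]['value'])
        if 'params' in graph.nodes[node]:
            args = args + [*graph.nodes[node]['params']]

        if len(args) > 0:
            parent_jacs = vjps_dict[func](jac, value, *args)

        # Propagate one VJP to each parent.
        for i in range(len(parents)):
            p = parents[i]
            jacs.update({p: add_jacs(jacs.get(p), parent_jacs[i])})

    # With a single input node, the last node visited is the input,
    # so `jac` now holds the gradient with respect to x.
    return jac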


In [8]:
def make_function_and_gradient(computational_graph):

    # Fix the evaluation order once; the last node is the output of the graph.
    nodes_sorted = list(nx.topological_sort(computational_graph))
    end_node = nodes_sorted[-1]

    def function(x):
        # Work on a copy so the original graph keeps value=None.
        graph = computational_graph.copy()
        forward_pass(x, graph, nodes_sorted)
        f_x = graph.nodes[end_node]['value']
        return f_x

    def gradient(x):
        # The backward pass needs the node values from a forward pass.
        graph = computational_graph.copy()
        forward_pass(x, graph, nodes_sorted)
        g_x = backward_pass(graph, nodes_sorted)
        return g_x

    return function, gradient

Input dimension

In [9]:
dim = 4

Definition of the function $f$

In [10]:
np.random.seed(123)

A = np.random.randn(dim,dim)
b = np.random.randn(dim,)

def function_0(x):
    z = anp.dot(A,x) + b
    u = anp.exp(z) + 5*anp.sin(x) + 3
    return anp.dot(u.T,u)**0.5

Building the computational graph for $f$. Each node of the graph has the following attributes:
- 'function' - the elementary function
- 'value' - the result of evaluating the elementary function
- 'parents' - the list of parent nodes; its length equals the number of arguments of the elementary function
- 'params' - the parameters of the elementary function

The 'value' attribute is initialized to None; its value is computed during the forward pass.
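
The node layout described above can be tried out on a smaller function first. The sketch below is an illustration under stated assumptions: it uses plain dictionaries and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) instead of networkx, and the toy function $(\sin x + e^x)^2$, the `functions` table and the `evaluate` helper are hypothetical names introduced only here.

```python
import math
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Small table of elementary functions (illustrative subset)
functions = {
    'sin': math.sin,
    'exp': math.exp,
    'add': lambda a, b: a + b,
    'power': lambda a, p: a ** p,
}

# Same attributes as in the notebook: function, value, parents, optional params
nodes = {
    0: dict(function=None, value=None, parents=[]),            # input x
    1: dict(function='sin', value=None, parents=[0]),
    2: dict(function='exp', value=None, parents=[0]),
    3: dict(function='add', value=None, parents=[1, 2]),
    4: dict(function='power', value=None, parents=[3], params=[2]),
}

# TopologicalSorter expects {node: predecessors}; parents come before children
order = list(TopologicalSorter(
    {n: attrs['parents'] for n, attrs in nodes.items()}
).static_order())

def evaluate(x):
    # Forward pass: fill in 'value' node by node (mutates `nodes` in place)
    for n in order:
        attrs = nodes[n]
        args = [nodes[p]['value'] for p in attrs['parents']]
        args += list(attrs.get('params', []))
        attrs['value'] = functions[attrs['function']](*args) if args else x
    return nodes[order[-1]]['value']

result = evaluate(0.5)
expected = (math.sin(0.5) + math.exp(0.5)) ** 2
print(order, result, expected)
```

Keeping 'parents' as an ordered list matters for functions whose arguments are not interchangeable, such as 'subtract' or 'divide', because the list order determines the argument order.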

In [11]:
import networkx as nx

G = nx.DiGraph()

G.add_node(0, function=None, value=None, parents=[])
G.add_node(1, function='affine', value=None, parents=[0], params=[A,b])
G.add_node(2, function='exp', value=None, parents=[1])
G.add_node(3, function='sin', value=None, parents=[0])
G.add_node(4, function='scalar_mult_add', value=None, parents=[3], params=[5,3])
G.add_node(5, function='add', value=None, parents=[2,4])
G.add_node(6, function='inner', value=None, parents=[5,5])
G.add_node(7, function='power', value=None, parents=[6], params=[0.5])

G.add_edges_from([(0,1),(1,2),(0,3),(3,4),(2,5),(4,5),(5,6),(6,7)])


Obtaining functions that compute $f(x)$ and $\nabla f(x)$ for any $x$ from the computational graph

In [12]:
function, gradient = make_function_and_gradient(G)

Obtaining functions that compute $f(x)$ and $\nabla f(x)$ for any $x$ using the [autograd](https://github.com/HIPS/autograd) library

In [13]:
from autograd import grad

gradient_0 = grad(function_0)

Input vector

In [14]:
x = np.random.randn(dim,)

Checking that $f(x)$ is computed correctly

In [15]:
f_true = function_0(x)
f = function(x)

print(f_true)
print(f)
print('Difference:', np.abs(f_true - f))

407.0560137190764
407.0560137190764
Difference: 0.0


Checking that $\nabla f(x)$ is computed correctly

In [16]:
from scipy.linalg import norm

g_true = gradient_0(x)
g = gradient(x)

print(g_true)
print(g)
print('Relative difference:', norm(g_true - g)/norm(g_true + g))

[-230.57270198  658.79279437 -967.44412286 -171.04452494]
[-230.57270198  658.79279437 -967.44412286 -171.04452494]
Relative difference: 0.0


Checking that $\nabla f(x)$ is computed correctly, using a numerical approximation

$$ \frac{\partial f}{\partial x_i} \approx \frac{f(x + e_i \epsilon) - f(x)}{\epsilon}$$

In [17]:
from scipy.optimize import check_grad

difference = check_grad(function, gradient, x)
print(difference)

5.719044814246033e-05
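
The forward-difference formula used for this check can also be written out by hand. The following is a minimal sketch, not the implementation used by `check_grad`: the helper `forward_difference_gradient`, the step `eps=1e-6`, and the toy function with its analytic gradient are all assumptions made for illustration.

```python
import numpy as np

def forward_difference_gradient(f, x, eps=1e-6):
    # Approximate each partial derivative by (f(x + eps * e_i) - f(x)) / eps
    x = np.asarray(x, dtype=float)
    f_x = f(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0            # i-th standard basis vector
        grad[i] = (f(x + eps * e_i) - f_x) / eps
    return grad

def toy_function(x):
    # Scalar-valued toy function: sum of sin(x_i)^2
    return np.sum(np.sin(x) ** 2)

def toy_gradient(x):
    # Analytic gradient: 2 * sin(x_i) * cos(x_i)
    return 2.0 * np.sin(x) * np.cos(x)

x_test = np.array([0.3, -1.2, 2.0])
approx = forward_difference_gradient(toy_function, x_test)

# 2-norm of the difference, the same kind of quantity that check_grad reports
error = np.linalg.norm(approx - toy_gradient(x_test))
print(error)
```

The error is small but nonzero because the forward difference is only first-order accurate in the step size, which is why a value such as the one printed above is consistent with a correct gradient.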
