.. _native_ops:

=================
Native operations
=================

Motivation:

* **Speed up** some important common calculations,
  and potentially **reduce memory requirements**.

  Examples:

  * LSTM
  * CTC loss

* Pure TensorFlow implementations can be suboptimal

  * TF ops almost always create copies, even ``SplitOp`` etc.

    * Not a memory problem, as the input tensor will get freed if it is not used further
    * But a performance problem

  * The gradient might be suboptimal

    * It might require too much memory (see automatic gradient checkpointing for a solution)
    * There is no automatic optimization
      (this could be solved by a custom TF gradient)

  * Memory can be scattered over too many separate tensors (``tf.TensorArray``, TF ``Stack``)

    * Especially problematic in a loop: a separate tensor for every iteration
    * Much better to allocate it as one consecutive / contiguous block

  * Overhead of calling many individual TF ops, etc.
    (minor compared to the other points;
    XLA can also partially solve this)

Solution: Write native (C++/CUDA) code.

Why is native code faster?

* It operates in place on tensors (see the sketch after this list)

  * This solves all the problems mentioned above: no unnecessary copies

* It can use one consecutive / contiguous block of memory
* It enforces a custom gradient implementation
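
As an illustration of the in-place point, here is a minimal sketch of a native
TF CPU op (not taken from RETURNN; the op name ``ScaleInplace`` and the trivial
doubling computation are made up) that reuses the input buffer for its output
via TF's buffer-forwarding API, so no copy is created:

.. code-block:: cpp

    // Hypothetical example op, for illustration only.
    #include "tensorflow/core/framework/op.h"
    #include "tensorflow/core/framework/op_kernel.h"

    using namespace tensorflow;

    REGISTER_OP("ScaleInplace")
        .Input("x: float")
        .Output("y: float");

    class ScaleInplaceOp : public OpKernel {
     public:
      explicit ScaleInplaceOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

      void Compute(OpKernelContext* ctx) override {
        const Tensor& input = ctx->input(0);
        Tensor* output = nullptr;
        // Reuse the input buffer for the output if possible (in place);
        // otherwise fall back to allocating a fresh output tensor.
        OP_REQUIRES_OK(ctx, ctx->forward_input_or_allocate_output(
                                {0}, 0, input.shape(), &output));
        auto flat = output->flat<float>();
        for (int64_t i = 0; i < flat.size(); ++i)
          flat(i) *= 2.0f;  // the actual computation, done in place
      }
    };

    REGISTER_KERNEL_BUILDER(Name("ScaleInplace").Device(DEVICE_CPU),
                            ScaleInplaceOp);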

Problems with native code:

* It can be difficult to write, is memory unsafe, and needs more debugging
* It needs multiple implementations: CPU (C++) and GPU (CUDA)

Our approach in RETURNN: the **NativeOp framework**.
See :mod:`returnn.native_op`, :mod:`returnn.tf.native_op`, :mod:`returnn.theano.native_op`.

* Some wrapper / helper code simplifies writing a custom native op
* Abstractions allow a single code base for CPU & GPU

  * Write the kernel CUDA style, using ``threadIdx``, ``blockIdx``, etc.
  * The kernel code must be flexible with respect to the number of threads
  * Example: the LSTM kernel loop over the cell and batch dimensions,
    executed per time frame:

    .. code-block:: c

        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        while (idx < n_cells * n_batch) {
          int batch_idx = idx / n_cells;
          int cell_idx = idx % n_cells;
          ...  /* the per-cell LSTM update for (batch_idx, cell_idx) goes here */
          idx += gridDim.x * blockDim.x;  /* grid-stride loop over all elements */
        }

  * On CPU:

    * Custom ``gridDim``, ``blockDim`` definitions
    * Other CUDA-like wrappers, so that the same kernel code also compiles
      as plain C++ (see the sketch below)
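
To give an idea of how this can work, here is a minimal sketch (not the actual
RETURNN wrapper code) of CPU definitions for the CUDA built-ins, under which
the grid-stride loop above degenerates into an ordinary sequential loop:

.. code-block:: cpp

    // Hypothetical CPU emulation of the CUDA built-ins, for illustration only.
    struct Dim3 { int x, y, z; };

    // Trivial launch configuration: one "block" with one "thread".
    static const Dim3 gridDim  = {1, 1, 1};
    static const Dim3 blockDim = {1, 1, 1};
    static const Dim3 blockIdx = {0, 0, 0};
    static const Dim3 threadIdx = {0, 0, 0};

    // The same kernel body as above now compiles as plain C++ and visits
    // every (batch, cell) pair sequentially, since the stride is 1.
    void lstm_frame_kernel(float* cells, int n_cells, int n_batch) {
      int idx = threadIdx.x + blockDim.x * blockIdx.x;  // = 0 on CPU
      while (idx < n_cells * n_batch) {
        int batch_idx = idx / n_cells;
        int cell_idx = idx % n_cells;
        cells[idx] = 0.0f;  // placeholder for the real per-cell LSTM update
        (void)batch_idx; (void)cell_idx;  // would index into the real buffers
        idx += gridDim.x * blockDim.x;  // stride 1: a plain sequential loop
      }
    }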

History:

* Already available for the Theano backend
* Ported to TensorFlow

  * Direct support for all previously implemented ops (LSTM, Baum-Welch aligner, ...)

* Easy to port to other frameworks

Examples:

* ``NativeLstm`` (``LstmGenericBase``)
* ``NativeLstm2``
* ``TwoDLSTM``
* ``FastBaumWelch``
* ``FastViterbi``
* ``OptimalCompletionEditDistance``
* ``EditDistance``
* ``Chunking``, ``UnChunking``

See also :ref:`tf_lstm_benchmark`.