| 1 | +Maximising Python Speed |
| 2 | +======================= |
| 3 | + |
| 4 | +This tutorial describes ways of improving the performance of MicroPython code. |
| 5 | +Optimisations involving other languages are covered elsewhere, namely the use |
| 6 | +of modules written in C and the MicroPython inline ARM Thumb-2 assembler. |
| 7 | + |
| 8 | +The process of developing high performance code comprises the following stages |
| 9 | +which should be performed in the order listed. |
| 10 | + |
| 11 | +* Design for speed. |
| 12 | +* Code and debug. |
| 13 | + |
| 14 | +Optimisation steps: |
| 15 | + |
| 16 | +* Identify the slowest section of code. |
| 17 | +* Improve the efficiency of the Python code. |
| 18 | +* Use the native code emitter. |
| 19 | +* Use the viper code emitter. |
| 20 | + |
| 21 | +Designing for speed |
| 22 | +------------------- |
| 23 | + |
| 24 | +Performance issues should be considered at the outset. This involves taking a view |
| 25 | +on the sections of code which are most performance critical and devoting particular |
| 26 | +attention to their design. The process of optimisation begins when the code has |
| 27 | +been tested: if the design is correct at the outset optimisation will be |
| 28 | +straightforward and may actually be unnecessary. |
| 29 | + |
| 30 | +Algorithms |
| 31 | +~~~~~~~~~~ |
| 32 | + |
| 33 | +The most important aspect of designing any routine for performance is ensuring that |
| 34 | +the best algorithm is employed. This is a topic for textbooks rather than for a |
| 35 | +MicroPython guide but spectacular performance gains can sometimes be achieved |
| 36 | +by adopting algorithms known for their efficiency. |
| 37 | + |
| 38 | +RAM Allocation |
| 39 | +~~~~~~~~~~~~~~ |
| 40 | + |
| 41 | +To design efficient MicroPython code it is necessary to have an understanding of the |
| 42 | +way the interpreter allocates RAM. When an object is created or grows in size |
| 43 | +(for example where an item is appended to a list) the necessary RAM is allocated |
| 44 | +from a block known as the heap. This takes a significant amount of time; |
| 45 | +further it will on occasion trigger a process known as garbage collection which |
| 46 | +can take several milliseconds. |
| 47 | + |
| 48 | +Consequently the performance of a function or method can be improved if an object is created |
| 49 | +once only and not permitted to grow in size. This implies that the object persists |
| 50 | +for the duration of its use: typically it will be instantiated in a class constructor |
| 51 | +and used in various methods. |
| 52 | + |
| 53 | +This is covered in further detail in :ref:`Controlling garbage collection <gc>` below. |
| 54 | + |
| 55 | +Buffers |
| 56 | +~~~~~~~ |
| 57 | + |
| 58 | +An example of the above is the common case where a buffer is required, such as one |
| 59 | +used for communication with a device. A typical driver will create the buffer in the |
| 60 | +constructor and use it in its I/O methods which will be called repeatedly. |
| 61 | + |
| 62 | +The MicroPython libraries typically provide optional support for pre-allocated buffers. |
| 63 | +For example the ``uart.readinto()`` method allows two options for its argument, an integer |
| 64 | +or a buffer. If an integer is supplied it will read up to that number of bytes and |
| 65 | +return the outcome: this implies that a buffer is created with a corresponding |
| 66 | +memory allocation. Providing a pre-allocated buffer as the argument avoids this. See |
| 67 | +the code fragment in :ref:`Caching object references <Caching>` below. |
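
As a rough sketch of the pattern described above, the hypothetical driver class below allocates its buffer once in the constructor and reuses it on every read. The ``SensorDriver`` name and the use of ``io.BytesIO`` as a stand-in for a real device stream are assumptions made for illustration; any stream object offering ``readinto()`` could take its place.

```python
import io


class SensorDriver:
    """Hypothetical driver that reuses one pre-allocated buffer."""

    def __init__(self, stream, size=8):
        self.stream = stream
        self.buf = bytearray(size)      # allocated once, never resized

    def read_packet(self):
        # readinto() fills the existing buffer rather than creating a new
        # bytes object, and returns the number of bytes actually read.
        return self.stream.readinto(self.buf)


# io.BytesIO stands in for a UART or similar device stream (an assumption).
driver = SensorDriver(io.BytesIO(b"\x01\x02\x03\x04\x05\x06\x07\x08\x09"))
first = driver.read_packet()    # fills all 8 bytes of driver.buf
second = driver.read_packet()   # only one byte remains in the stream
```

The second call overwrites only the start of the buffer, so a caller should use the returned count to decide how much of ``buf`` is valid.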
| 68 | + |
| 69 | +Floating Point |
| 70 | +~~~~~~~~~~~~~~ |
| 71 | + |
| 72 | +For the most speed critical sections of code it is worth noting that performing |
| 73 | +any kind of floating point operation involves heap allocation. Where possible use |
| 74 | +integer operations and restrict the use of floating point to sections of the code |
| 75 | +where performance is not paramount. |
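
By way of illustration, a conversion that might naturally be written with floats can often be recast in integers by working in smaller units. The sketch below converts a hypothetical 12-bit ADC reading to millivolts; the function name, the 3300 mV reference and the 4095 full-scale count are assumptions chosen for the example.

```python
def adc_to_millivolts(raw, vref_mv=3300, full_scale=4095):
    # Multiply before dividing so that precision is not lost, and use
    # floor division (//) so that the result stays an integer.
    return raw * vref_mv // full_scale


mv_full = adc_to_millivolts(4095)   # full-scale reading
mv_half = adc_to_millivolts(2048)   # roughly mid-scale
```

Working in millivolts rather than volts keeps the whole calculation in integers; a float need only be produced, if at all, when the value is displayed.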
| 76 | + |
| 77 | +Arrays |
| 78 | +~~~~~~ |
| 79 | + |
| 80 | +Consider the use of the various types of array classes as an alternative to lists. |
| 81 | +The ``array`` module supports various element types with 8-bit elements supported |
| 82 | +by Python's built-in ``bytes`` and ``bytearray`` classes. These data structures all store |
| 83 | +elements in contiguous memory locations. Once again to avoid memory allocation in critical |
| 84 | +code these should be pre-allocated and passed as arguments or as bound objects. |
| 85 | + |
| 86 | +When passing slices of objects such as ``bytearray`` instances, Python creates |
| 87 | +a copy which involves allocation. This can be avoided using a ``memoryview`` |
| 88 | +object: |
| 89 | + |
| 90 | +.. code:: python |
| 91 | + |
| 92 | +    ba = bytearray(100) |
| 93 | +    func(ba[3:10])  # a copy is passed |
| 94 | +    mv = memoryview(ba) |
| 95 | +    func(mv[3:10])  # a pointer to memory is passed |
| 96 | + |
| 97 | +A ``memoryview`` can only be applied to objects supporting the buffer protocol - this |
| 98 | +includes arrays but not lists. |
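
The difference can be observed directly. In the sketch below a hypothetical ``fill()`` helper writes into whatever buffer it is given: writing through a ``memoryview`` slice alters the original ``bytearray``, whereas writing into an ordinary slice alters only the temporary copy.

```python
def fill(buf, value):
    # Write value into every element of the buffer passed in.
    for i in range(len(buf)):
        buf[i] = value


ba = bytearray(10)
fill(ba[3:6], 0xFF)        # operates on a temporary copy; ba is unchanged
unchanged = bytes(ba)

mv = memoryview(ba)
fill(mv[3:6], 0xFF)        # operates on ba's own memory
changed = bytes(ba)
```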
| 99 | + |
| 100 | +Identifying the slowest section of code |
| 101 | +--------------------------------------- |
| 102 | + |
| 103 | +This is a process known as profiling and is covered in textbooks and |
| 104 | +(for standard Python) supported by various software tools. For the type of |
| 105 | +smaller embedded application likely to be running on MicroPython platforms |
| 106 | +the slowest function or method can usually be established by judicious use |
| 107 | +of the timing ``ticks`` group of functions documented |
| 108 | +`here <http://docs.micropython.org/en/latest/pyboard/library/time.html>`_. |
| 109 | +Code execution time can be measured in ms, us, or CPU cycles. |
| 110 | + |
| 111 | +The following enables any function or method to be timed by adding an |
| 112 | +``@timed_function`` decorator: |
| 113 | + |
| 114 | +.. code:: python |
| 115 | + |
| 116 | +    def timed_function(f, *args, **kwargs): |
| 117 | +        myname = str(f).split(' ')[1]  # function name taken from its repr |
| 118 | +        def new_func(*args, **kwargs): |
| 119 | +            t = time.ticks_us()  # assumes ``import time`` at module level |
| 120 | +            result = f(*args, **kwargs) |
| 121 | +            delta = time.ticks_diff(time.ticks_us(), t)  # end time first, then start |
| 122 | +            print('Function {} Time = {:6.3f}ms'.format(myname, delta/1000)) |
| 123 | +            return result |
| 124 | +        return new_func |
| 125 | + |
| 126 | +MicroPython code improvements |
| 127 | +----------------------------- |
| 128 | + |
| 129 | +The const() declaration |
| 130 | +~~~~~~~~~~~~~~~~~~~~~~~ |
| 131 | + |
| 132 | +MicroPython provides a ``const()`` declaration. This works in a similar way |
| 133 | +to ``#define`` in C in that when the code is compiled to bytecode the compiler |
| 134 | +substitutes the numeric value for the identifier. This avoids a dictionary |
| 135 | +lookup at runtime. The argument to ``const()`` may be anything which, at |
| 136 | +compile time, evaluates to an integer e.g. ``0x100`` or ``1 << 8``. |
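
A minimal sketch of its use follows. The register names and values are invented for illustration. The ``try``/``except`` fallback is an addition so that the fragment also runs under standard Python, where ``const()`` is not available; there the names behave as ordinary variables and no substitution takes place.

```python
try:
    from micropython import const
except ImportError:
    # Fallback for standard Python: no compile-time substitution occurs.
    def const(value):
        return value

# Hypothetical register layout for an imaginary device.
REG_CTRL = const(0x10)
CTRL_ENABLE = const(1 << 3)
CTRL_MASK = const(CTRL_ENABLE | 0x01)   # built from an earlier constant


def enable(regs):
    # Set the enable bit, then return the masked control bits.
    regs[REG_CTRL] |= CTRL_ENABLE
    return regs[REG_CTRL] & CTRL_MASK


regs = bytearray(32)
status = enable(regs)
```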
| 137 | + |
| 138 | +.. _Caching: |
| 139 | + |
| 140 | +Caching object references |
| 141 | +~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 142 | + |
| 143 | +Where a function or method repeatedly accesses objects performance is improved |
| 144 | +by caching the object in a local variable: |
| 145 | + |
| 146 | +.. code:: python |
| 147 | + |
| 148 | +    class foo(object): |
| 149 | +        def __init__(self): |
| 150 | +            self.ba = bytearray(100)  # bound to the instance so that bar() can use it |
| 151 | +        def bar(self, obj_display): |
| 152 | +            ba_ref = self.ba |
| 153 | +            fb = obj_display.framebuffer |
| 154 | +            # iterative code using these two objects |
| 155 | + |
| 156 | +This avoids the need repeatedly to look up ``self.ba`` and ``obj_display.framebuffer`` |
| 157 | +in the body of the method ``bar()``. |
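
The effect can be sketched with a self-contained example. The class and method names below are hypothetical; the point is only that ``self.buf`` is looked up once, before the loop, rather than on every iteration.

```python
class Checksum:
    def __init__(self, size):
        self.buf = bytearray(size)   # allocated once in the constructor

    def total_uncached(self):
        total = 0
        for i in range(len(self.buf)):
            total += self.buf[i]     # attribute lookup on every pass
        return total

    def total_cached(self):
        buf = self.buf               # single attribute lookup
        total = 0
        for i in range(len(buf)):
            total += buf[i]          # local variable access only
        return total


c = Checksum(4)
c.buf[0] = 10
c.buf[3] = 5
```

Both methods return the same result; the cached version simply performs fewer lookups inside the loop.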
| 158 | + |
| 159 | +.. _gc: |
| 160 | + |
| 161 | +Controlling garbage collection |
| 162 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 163 | + |
| 164 | +When memory allocation is required, MicroPython attempts to locate an adequately |
| 165 | +sized block on the heap. This may fail, usually because the heap is cluttered |
| 166 | +with objects which are no longer referenced by code. If a failure occurs, the |
| 167 | +process known as garbage collection reclaims the memory used by these redundant |
| 168 | +objects and the allocation is then tried again - a process which can take several |
| 169 | +milliseconds. |
| 170 | + |
| 171 | +There are benefits in pre-empting this by periodically issuing ``gc.collect()``. |
| 172 | +Firstly doing a collection before it is actually required is quicker - typically on the |
| 173 | +order of 1ms if done frequently. Secondly you can determine the point in code |
| 174 | +where this time is used rather than have a longer delay occur at random points, |
| 175 | +possibly in a speed critical section. Finally performing collections regularly |
| 176 | +can reduce fragmentation in the heap. Severe fragmentation can lead to |
| 177 | +non-recoverable allocation failures. |
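
One way of applying this is to call ``gc.collect()`` at a fixed point in the main loop, between pieces of time-critical work. The sketch below is illustrative only: ``process()`` is a hypothetical placeholder for application code, and the loop runs a fixed number of times so that the fragment terminates.

```python
import gc


def process(n):
    # Placeholder for application work that allocates temporary objects.
    return sum([i * i for i in range(n)])


def main_loop(iterations):
    total = 0
    for _ in range(iterations):
        total += process(10)
        # Collect at a known, non-critical point rather than leaving the
        # timing of the collection to chance.
        gc.collect()
    return total


total = main_loop(3)
```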
| 178 | + |
| 179 | +Accessing hardware directly |
| 180 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 181 | + |
| 182 | +This comes into the category of more advanced programming and involves some knowledge |
| 183 | +of the target MCU. Consider the example of toggling an output pin on the Pyboard. The |
| 184 | +standard approach would be to write |
| 185 | + |
| 186 | +.. code:: python |
| 187 | + |
| 188 | +    mypin.value(mypin.value() ^ 1)  # mypin was instantiated as an output pin |
| 189 | + |
| 190 | +This involves the overhead of two calls to the ``Pin`` instance's ``value()`` |
| 191 | +method. This overhead can be eliminated by performing a read/write to the relevant bit |
| 192 | +of the chip's GPIO port output data register (odr). To facilitate this the ``stm`` |
| 193 | +module provides a set of constants providing the addresses of the relevant registers. |
| 194 | +A fast toggle of pin ``P4`` (CPU pin ``A14``) - corresponding to the green LED - |
| 195 | +can be performed as follows: |
| 196 | + |
| 197 | +.. code:: python |
| 198 | + |
| 199 | +    BIT14 = const(1 << 14) |
| 200 | +    stm.mem16[stm.GPIOA + stm.GPIO_ODR] ^= BIT14 |
| 201 | + |
| 202 | +The Native code emitter |
| 203 | +----------------------- |
| 204 | + |
| 205 | +This causes the MicroPython compiler to emit ARM native opcodes rather than |
| 206 | +bytecode. It covers the bulk of the Python language so most functions will require |
| 207 | +no adaptation (but see below). It is invoked by means of a function decorator: |
| 208 | + |
| 209 | +.. code:: python |
| 210 | + |
| 211 | +    @micropython.native |
| 212 | +    def foo(self, arg): |
| 213 | +        buf = self.linebuf  # Cached object |
| 214 | +        # code |
| 215 | + |
| 216 | +There are certain limitations in the current implementation of the native code emitter. |
| 217 | + |
| 218 | +* Context managers are not supported (the ``with`` statement). |
| 219 | +* Generators are not supported. |
| 220 | +* If ``raise`` is used an argument must be supplied. |
| 221 | + |
| 222 | +The trade-off for the improved performance (roughly twice as fast as bytecode) is an |
| 223 | +increase in compiled code size. |
| 224 | + |
| 225 | +The Viper code emitter |
| 226 | +---------------------- |
| 227 | + |
| 228 | +The optimisations discussed above involve standards-compliant Python code. The |
| 229 | +Viper code emitter is not fully compliant. It supports special Viper native data types |
| 230 | +in pursuit of performance. Integer processing is non-compliant because it uses machine |
| 231 | +words: arithmetic on 32 bit hardware is performed modulo 2**32. |
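
The consequence can be demonstrated in ordinary Python by emulating the wrap-around. The helper below is not Viper code and its name is hypothetical; it merely reduces a value modulo 2**32 and reinterprets it as a signed 32-bit quantity, on the assumption that this is how a two's complement machine word behaves.

```python
def wrap_int32(value):
    # Reduce modulo 2**32, then map the upper half of the range onto
    # negative numbers, as for a two's complement 32-bit word.
    value &= 0xFFFFFFFF
    if value >= 0x80000000:
        value -= 0x100000000
    return value


overflowed = wrap_int32(0x7FFFFFFF + 1)   # largest positive value plus one
all_ones = wrap_int32(0xFFFFFFFF)         # every bit set
```

Standard Python integers would grow without limit here; code that depends on that behaviour needs care before it is moved to Viper.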
| 232 | + |
| 233 | +Like the Native emitter Viper produces machine instructions but further optimisations |
| 234 | +are performed, substantially increasing performance especially for integer arithmetic and |
| 235 | +bit manipulations. It is invoked using a decorator: |
| 236 | + |
| 237 | +.. code:: python |
| 238 | + |
| 239 | +    @micropython.viper |
| 240 | +    def foo(self, arg: int) -> int: |
| 241 | +        # code |
| 242 | + |
| 243 | +As the above fragment illustrates it is beneficial to use Python type hints to assist the Viper optimiser. |
| 244 | +Type hints provide information on the data types of arguments and of the return value; these |
| 245 | +are a standard Python language feature formally defined in `PEP 484 <https://www.python.org/dev/peps/pep-0484/>`_. |
| 246 | +Viper supports its own set of types namely ``int``, ``uint`` (unsigned integer), ``ptr``, ``ptr8``, |
| 247 | +``ptr16`` and ``ptr32``. The ``ptrX`` types are discussed below. Currently the ``uint`` type serves |
| 248 | +a single purpose: as a type hint for a function return value. If such a function returns ``0xffffffff`` |
| 249 | +Python will interpret the result as 2**32 - 1 rather than as -1. |
| 250 | + |
| 251 | +In addition to the restrictions imposed by the native emitter the following constraints apply: |
| 252 | + |
| 253 | +* Functions may have up to four arguments. |
| 254 | +* Default argument values are not permitted. |
| 255 | +* Floating point may be used but is not optimised. |
| 256 | + |
| 257 | +Viper provides pointer types to assist the optimiser. These comprise |
| 258 | + |
| 259 | +* ``ptr`` Pointer to an object. |
| 260 | +* ``ptr8`` Points to a byte. |
| 261 | +* ``ptr16`` Points to a 16 bit half-word. |
| 262 | +* ``ptr32`` Points to a 32 bit machine word. |
| 263 | + |
| 264 | +The concept of a pointer may be unfamiliar to Python programmers. It has similarities |
| 265 | +to a Python ``memoryview`` object in that it provides direct access to data stored in memory. |
| 266 | +Items are accessed using subscript notation, but slices are not supported: a pointer can return |
| 267 | +a single item only. Its purpose is to provide fast random access to data stored in contiguous |
| 268 | +memory locations - such as data stored in objects which support the buffer protocol, and |
| 269 | +memory-mapped peripheral registers in a microcontroller. It should be noted that programming |
| 270 | +using pointers is hazardous: bounds checking is not performed and the compiler does nothing to |
| 271 | +prevent buffer overrun errors. |
| 272 | + |
| 273 | +Typical usage is to cache variables: |
| 274 | + |
| 275 | +.. code:: python |
| 276 | + |
| 277 | +    @micropython.viper |
| 278 | +    def foo(self, arg: int) -> int: |
| 279 | +        buf = ptr8(self.linebuf)  # self.linebuf is a bytearray or bytes object |
| 280 | +        for x in range(20, 30): |
| 281 | +            bar = buf[x]  # Access a data item through the pointer |
| 282 | +            # code omitted |
| 283 | + |
| 284 | +In this instance the compiler "knows" that ``buf`` is the address of an array of bytes; |
| 285 | +it can emit code to rapidly compute the address of ``buf[x]`` at runtime. Where casts are |
| 286 | +used to convert objects to Viper native types these should be performed at the start of |
| 287 | +the function rather than in critical timing loops as the cast operation can take several |
| 288 | +microseconds. The rules for casting are as follows: |
| 289 | + |
| 290 | +* Casting operators are currently: ``int``, ``bool``, ``uint``, ``ptr``, ``ptr8``, ``ptr16`` and ``ptr32``. |
| 291 | +* The result of a cast will be a native Viper variable. |
| 292 | +* Arguments to a cast can be a Python object or a native Viper variable. |
| 293 | +* If the argument is a native Viper variable, then the cast is a no-op (i.e. costs nothing at runtime) |
| 294 | + that just changes the type (e.g. from ``uint`` to ``ptr8``) so that you can then store/load |
| 295 | + using this pointer. |
| 296 | +* If the argument is a Python object and the cast is ``int`` or ``uint``, then the Python object |
| 297 | + must be of integral type and the value of that integral object is returned. |
| 298 | +* The argument to a ``bool`` cast must be of integral type (boolean or integer); when used as a return |
| 299 | +  type the viper function will return True or False objects. |
| 300 | +* If the argument is a Python object and the cast is ``ptr``, ``ptr8``, ``ptr16`` or ``ptr32``, |
| 301 | + then the Python object must either have the buffer protocol with read-write capabilities |
| 302 | + (in which case a pointer to the start of the buffer is returned) or it must be of integral |
| 303 | + type (in which case the value of that integral object is returned). |
| 304 | + |
| 305 | +The following example illustrates the use of a ``ptr16`` cast to toggle pin X1 ``n`` times: |
| 306 | + |
| 307 | +.. code:: python |
| 308 | + |
| 309 | +    BIT0 = const(1) |
| 310 | +    @micropython.viper |
| 311 | +    def toggle_n(n: int): |
| 312 | +        odr = ptr16(stm.GPIOA + stm.GPIO_ODR) |
| 313 | +        for _ in range(n): |
| 314 | +            odr[0] ^= BIT0 |
| 315 | + |
| 316 | +A detailed technical description of the three code emitters may be found |
| 317 | +on Kickstarter here `Note 1 <https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts/664832>`_ |
| 318 | +and here `Note 2 <https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts/665145>`_ |