Commit 21b7460

peterhinch authored and dpgeorge committed
docs: Add Python speed optimisation guide, including minimal viper ref.
1 parent 85d3b61 commit 21b7460

2 files changed: 319 additions, 0 deletions

docs/reference/index.rst (1 addition, 0 deletions)

@@ -14,6 +14,7 @@ MicroPython are described in the sections here.

 repl.rst
 isr_rules.rst
+speed_python.rst

 .. only:: port_pyboard

docs/reference/speed_python.rst (318 additions, 0 deletions)

@@ -0,0 +1,318 @@

Maximising Python Speed
=======================

This tutorial describes ways of improving the performance of MicroPython code.
Optimisations involving other languages are covered elsewhere, namely the use
of modules written in C and the MicroPython inline ARM Thumb-2 assembler.

The process of developing high performance code comprises the following stages
which should be performed in the order listed.

* Design for speed.
* Code and debug.

Optimisation steps:

* Identify the slowest section of code.
* Improve the efficiency of the Python code.
* Use the native code emitter.
* Use the viper code emitter.

Designing for speed
-------------------

Performance issues should be considered at the outset. This involves taking a view
on the sections of code which are most performance critical and devoting particular
attention to their design. The process of optimisation begins when the code has
been tested: if the design is correct at the outset, optimisation will be
straightforward and may actually be unnecessary.

Algorithms
~~~~~~~~~~

The most important aspect of designing any routine for performance is ensuring that
the best algorithm is employed. This is a topic for textbooks rather than for a
MicroPython guide, but spectacular performance gains can sometimes be achieved
by adopting algorithms known for their efficiency.

RAM Allocation
~~~~~~~~~~~~~~

To design efficient MicroPython code it is necessary to have an understanding of the
way the interpreter allocates RAM. When an object is created or grows in size
(for example where an item is appended to a list) the necessary RAM is allocated
from a block known as the heap. This takes a significant amount of time;
further, it will on occasion trigger a process known as garbage collection which
can take several milliseconds.

Consequently the performance of a function or method can be improved if an object is created
once only and not permitted to grow in size. This implies that the object persists
for the duration of its use: typically it will be instantiated in a class constructor
and used in various methods.

This is covered in further detail in :ref:`Controlling garbage collection <gc>` below.

Buffers
~~~~~~~

An example of the above is the common case where a buffer is required, such as one
used for communication with a device. A typical driver will create the buffer in the
constructor and use it in its I/O methods which will be called repeatedly.

The MicroPython libraries typically provide optional support for pre-allocated buffers.
For example the ``uart.readinto()`` method allows two options for its argument, an integer
or a buffer. If an integer is supplied it will read up to that number of bytes and
return the outcome: this implies that a buffer is created with a corresponding
memory allocation. Providing a pre-allocated buffer as the argument avoids this. See
the code fragment in :ref:`Caching object references <Caching>` below.

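The pattern described above can be sketched as follows. This is a minimal
illustration rather than a real driver: the ``Driver`` class and its ``poll()``
method are hypothetical names, and ``io.BytesIO`` stands in for a UART so that
the fragment runs without hardware. It assumes only that the stream object
provides a ``readinto()`` method accepting a pre-allocated buffer.

.. code:: python

    import io


    class Driver:
        """Hypothetical driver that reuses a single receive buffer."""

        def __init__(self, stream, size=16):
            self.stream = stream        # any object providing readinto()
            self.buf = bytearray(size)  # allocated once, in the constructor

        def poll(self):
            # Fill the existing buffer rather than allocating a new object.
            n = self.stream.readinto(self.buf)
            return n or 0               # readinto() may return None if no data


    # io.BytesIO stands in for a UART (an assumption made for this sketch).
    drv = Driver(io.BytesIO(b"hello"))
    count = drv.poll()

Only the first ``count`` bytes of ``drv.buf`` are valid after each call; the
remainder of the buffer retains whatever it held before.
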
Floating Point
~~~~~~~~~~~~~~

For the most speed critical sections of code it is worth noting that performing
any kind of floating point operation involves heap allocation. Where possible use
integer operations and restrict the use of floating point to sections of the code
where performance is not paramount.

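As an illustration of substituting integer arithmetic for floating point, the
sketch below scales a raw ADC reading to millivolts. The function name, the
12-bit full-scale value of 4095 and the 3300 mV reference are assumptions chosen
for the example, not values taken from any particular board.

.. code:: python

    def adc_to_millivolts(raw, vref_mv=3300, full_scale=4095):
        """Convert a raw ADC count to millivolts using integer maths only."""
        # Multiply before dividing so that precision is not lost to truncation.
        return raw * vref_mv // full_scale


    mv = adc_to_millivolts(2048)

Working in millivolts rather than volts keeps the result an integer; a
conversion to floating point, if needed for display, can be deferred to code
which is not time critical.
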
Arrays
~~~~~~

Consider the use of the various types of array classes as an alternative to lists.
The ``array`` module supports various element types, with 8-bit elements also supported
by Python's built-in ``bytes`` and ``bytearray`` classes. These data structures all store
elements in contiguous memory locations. Once again, to avoid memory allocation in critical
code these should be pre-allocated and passed as arguments or as bound objects.

When passing slices of objects such as ``bytearray`` instances, Python creates
a copy which involves allocation. This can be avoided using a ``memoryview``
object:

.. code:: python

    # func is any function which accepts a buffer argument
    ba = bytearray(100)
    func(ba[3:10])  # a copy is passed
    mv = memoryview(ba)
    func(mv[3:10])  # a pointer to memory is passed

A ``memoryview`` can only be applied to objects supporting the buffer protocol - this
includes arrays but not lists.

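The difference between the two kinds of slice can be demonstrated with a
pre-allocated array. In this sketch the ``fill()`` helper and the element values
are hypothetical; the point is only that writes made through a ``memoryview``
slice reach the original storage while writes to an ordinary slice do not.

.. code:: python

    from array import array

    samples = array("H", [0] * 8)  # pre-allocated unsigned 16-bit elements
    view = memoryview(samples)     # arrays support the buffer protocol


    def fill(buf, value):
        # Write into the storage supplied by the caller.
        for i in range(len(buf)):
            buf[i] = value


    fill(view[2:5], 7)     # modifies samples[2:5] in place
    fill(samples[5:8], 9)  # fills a temporary copy; samples is unchanged
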
Identifying the slowest section of code
---------------------------------------

This is a process known as profiling and is covered in textbooks and
(for standard Python) supported by various software tools. For the type of
smaller embedded application likely to be running on MicroPython platforms
the slowest function or method can usually be established by judicious use
of the timing ``ticks`` group of functions documented
`here <http://docs.micropython.org/en/latest/pyboard/library/time.html>`_.
Code execution time can be measured in ms, us, or CPU cycles.

The following enables any function or method to be timed by adding an
``@timed_function`` decorator:

.. code:: python

    import time

    def timed_function(f, *args, **kwargs):
        myname = str(f).split(' ')[1]  # name taken from '<function name at 0x...>'
        def new_func(*args, **kwargs):
            t = time.ticks_us()
            result = f(*args, **kwargs)
            delta = time.ticks_diff(time.ticks_us(), t)  # end minus start, in us
            print('Function {} Time = {:6.3f}ms'.format(myname, delta/1000))
            return result
        return new_func

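Where only part of a function is of interest, the same ``ticks`` functions can
be applied directly around the code in question. The sketch below assumes the
``time.ticks_diff(end, start)`` argument order documented for current
MicroPython releases. The ``checksum()`` function is a hypothetical workload,
and the fallback definitions exist only so that the fragment also runs under
standard Python, which lacks these functions.

.. code:: python

    import time

    if hasattr(time, "ticks_us"):
        ticks_us, ticks_diff = time.ticks_us, time.ticks_diff
    else:
        # Assumed stand-ins for interpreters without the ticks functions.
        def ticks_us():
            return time.perf_counter_ns() // 1000

        def ticks_diff(end, start):
            return end - start


    def checksum(data):
        # Hypothetical workload: 8-bit additive checksum.
        total = 0
        for b in data:
            total = (total + b) & 0xFF
        return total


    payload = bytes(range(256))
    start = ticks_us()
    result = checksum(payload)
    elapsed_us = ticks_diff(ticks_us(), start)
    print("checksum took", elapsed_us, "us")
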
MicroPython code improvements
-----------------------------

The const() declaration
~~~~~~~~~~~~~~~~~~~~~~~

MicroPython provides a ``const()`` declaration. This works in a similar way
to ``#define`` in C in that when the code is compiled to bytecode the compiler
substitutes the numeric value for the identifier. This avoids a dictionary
lookup at runtime. The argument to ``const()`` may be anything which, at
compile time, evaluates to an integer e.g. ``0x100`` or ``1 << 8``.

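A short sketch of typical usage follows. The constant names and the
``low_nibble()`` function are hypothetical, and the ``except`` branch is an
assumption added only so that the fragment also runs under standard Python,
where ``const()`` does not exist.

.. code:: python

    try:
        from micropython import const
    except ImportError:
        # Fallback for interpreters without const(): returns its argument.
        def const(value):
            return value

    BUF_SIZE = const(1 << 8)    # constant expression, evaluated by the compiler
    STATUS_MASK = const(0x0F)


    def low_nibble(status):
        # On MicroPython the compiler substitutes 0x0F for STATUS_MASK here.
        return status & STATUS_MASK
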
.. _Caching:

Caching object references
~~~~~~~~~~~~~~~~~~~~~~~~~~

Where a function or method repeatedly accesses objects, performance is improved
by caching the object in a local variable:

.. code:: python

    class foo(object):
        def __init__(self):
            self.ba = bytearray(100)  # bound to the instance so bar() can use it
        def bar(self, obj_display):
            ba_ref = self.ba
            fb = obj_display.framebuffer
            # iterative code using these two objects

This avoids the need repeatedly to look up ``self.ba`` and ``obj_display.framebuffer``
in the body of the method ``bar()``.

.. _gc:

Controlling garbage collection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When memory allocation is required, MicroPython attempts to locate an adequately
sized block on the heap. This may fail, usually because the heap is cluttered
with objects which are no longer referenced by code. If a failure occurs, the
process known as garbage collection reclaims the memory used by these redundant
objects and the allocation is then tried again - a process which can take several
milliseconds.

There are benefits in pre-empting this by periodically issuing ``gc.collect()``.
Firstly doing a collection before it is actually required is quicker - typically on the
order of 1ms if done frequently. Secondly you can determine the point in code
where this time is used rather than have a longer delay occur at random points,
possibly in a speed critical section. Finally performing collections regularly
can reduce fragmentation in the heap. Severe fragmentation can lead to
non-recoverable allocation failures.

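One possible way of applying this is to issue the collection at a fixed point in
a processing loop, outside the time critical step. In the sketch below the
``process()`` function is hypothetical and the interval of 16 iterations is
arbitrary; a suitable interval depends on how much the application allocates.

.. code:: python

    import gc


    def process(sample, out, index):
        # Hypothetical time critical step: store the sample without allocating.
        out[index] = sample & 0xFF


    results = bytearray(64)  # pre-allocated and reused
    for i in range(len(results)):
        process(i * 3, results, i)
        if i % 16 == 15:
            # Collect at a point of our choosing, not inside process().
            gc.collect()
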
Accessing hardware directly
~~~~~~~~~~~~~~~~~~~~~~~~~~~

This comes into the category of more advanced programming and involves some knowledge
of the target MCU. Consider the example of toggling an output pin on the Pyboard. The
standard approach would be to write

.. code:: python

    mypin.value(mypin.value() ^ 1)  # mypin was instantiated as an output pin

This involves the overhead of two calls to the ``Pin`` instance's ``value()``
method. This overhead can be eliminated by performing a read/write to the relevant bit
of the chip's GPIO port output data register (odr). To facilitate this the ``stm``
module provides a set of constants providing the addresses of the relevant registers.
A fast toggle of pin ``P4`` (CPU pin ``A14``) - corresponding to the green LED -
can be performed as follows:

.. code:: python

    import stm
    from micropython import const

    BIT14 = const(1 << 14)
    stm.mem16[stm.GPIOA + stm.GPIO_ODR] ^= BIT14  # read, flip bit 14, write back

The Native code emitter
-----------------------

This causes the MicroPython compiler to emit ARM native opcodes rather than
bytecode. It covers the bulk of the Python language so most functions will require
no adaptation (but see below). It is invoked by means of a function decorator:

.. code:: python

    @micropython.native
    def foo(self, arg):
        buf = self.linebuf  # Cached object
        # code

There are certain limitations in the current implementation of the native code emitter.

* Context managers are not supported (the ``with`` statement).
* Generators are not supported.
* If ``raise`` is used an argument must be supplied.

The trade-off for the improved performance (roughly twice as fast as bytecode) is an
increase in compiled code size.

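The following sketch shows a complete function which stays within the
limitations listed above. The ``count_set_bits()`` function is a hypothetical
example. The stand-in ``micropython`` class in the ``except`` branch is an
assumption added so that the decorator syntax is also accepted by standard
Python, where it simply returns the function unchanged.

.. code:: python

    try:
        import micropython
    except ImportError:
        class micropython:
            # Stand-in for interpreters without the micropython module.
            @staticmethod
            def native(func):
                return func


    @micropython.native
    def count_set_bits(buf):
        # No with statement, no generator and no bare raise is used.
        total = 0
        for byte in buf:
            while byte:
                total += byte & 1
                byte >>= 1
        return total


    bits = count_set_bits(bytearray(b"\x0f\xf0\x01"))
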
The Viper code emitter
----------------------

The optimisations discussed above involve standards-compliant Python code. The
Viper code emitter is not fully compliant. It supports special Viper native data types
in pursuit of performance. Integer processing is non-compliant because it uses machine
words: arithmetic on 32 bit hardware is performed modulo 2**32.

Like the Native emitter Viper produces machine instructions but further optimisations
are performed, substantially increasing performance especially for integer arithmetic and
bit manipulations. It is invoked using a decorator:

.. code:: python

    @micropython.viper
    def foo(self, arg: int) -> int:
        # code

As the above fragment illustrates it is beneficial to use Python type hints to assist the Viper optimiser.
Type hints provide information on the data types of arguments and of the return value; these
are a standard Python language feature formally defined in `PEP 484 <https://www.python.org/dev/peps/pep-0484/>`_.
Viper supports its own set of types namely ``int``, ``uint`` (unsigned integer), ``ptr``, ``ptr8``,
``ptr16`` and ``ptr32``. The ``ptrX`` types are discussed below. Currently the ``uint`` type serves
a single purpose: as a type hint for a function return value. If such a function returns ``0xffffffff``
Python will interpret the result as 2**32 - 1 rather than as -1.

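The effect of machine word arithmetic, and the difference between the ``int``
and ``uint`` interpretations of a result, can be illustrated in ordinary Python.
The two helper functions below are hypothetical and are not Viper code: they
merely emulate how a 32 bit word would be read as a signed or an unsigned value.

.. code:: python

    def as_int32(word):
        """Interpret the low 32 bits of word as a signed integer."""
        word &= 0xFFFFFFFF
        return word - (1 << 32) if word & 0x80000000 else word


    def as_uint32(word):
        """Interpret the low 32 bits of word as an unsigned integer."""
        return word & 0xFFFFFFFF


    wrapped = as_uint32(0xFFFFFFFF + 2)  # arithmetic modulo 2**32
    signed = as_int32(0xFFFFFFFF)        # same bit pattern read as signed
    unsigned = as_uint32(0xFFFFFFFF)     # same bit pattern read as unsigned
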
In addition to the restrictions imposed by the native emitter the following constraints apply:

* Functions may have up to four arguments.
* Default argument values are not permitted.
* Floating point may be used but is not optimised.

Viper provides pointer types to assist the optimiser. These comprise

* ``ptr`` Pointer to an object.
* ``ptr8`` Points to a byte.
* ``ptr16`` Points to a 16 bit half-word.
* ``ptr32`` Points to a 32 bit machine word.

The concept of a pointer may be unfamiliar to Python programmers. It has similarities
to a Python ``memoryview`` object in that it provides direct access to data stored in memory.
Items are accessed using subscript notation, but slices are not supported: a pointer can return
a single item only. Its purpose is to provide fast random access to data stored in contiguous
memory locations - such as data stored in objects which support the buffer protocol, and
memory-mapped peripheral registers in a microcontroller. It should be noted that programming
using pointers is hazardous: bounds checking is not performed and the compiler does nothing to
prevent buffer overrun errors.

Typical usage is to cache variables:

.. code:: python

    @micropython.viper
    def foo(self, arg: int) -> int:
        buf = ptr8(self.linebuf)  # self.linebuf is a bytearray or bytes object
        for x in range(20, 30):
            bar = buf[x]  # Access a data item through the pointer
            # code omitted

In this instance the compiler "knows" that ``buf`` is the address of an array of bytes;
it can emit code to rapidly compute the address of ``buf[x]`` at runtime. Where casts are
used to convert objects to Viper native types these should be performed at the start of
the function rather than in critical timing loops as the cast operation can take several
microseconds. The rules for casting are as follows:

* Casting operators are currently: ``int``, ``bool``, ``uint``, ``ptr``, ``ptr8``, ``ptr16`` and ``ptr32``.
* The result of a cast will be a native Viper variable.
* Arguments to a cast can be a Python object or a native Viper variable.
* If the argument is a native Viper variable, then the cast is a no-op (i.e. costs nothing at runtime)
  that just changes the type (e.g. from ``uint`` to ``ptr8``) so that you can then store/load
  using this pointer.
* If the argument is a Python object and the cast is ``int`` or ``uint``, then the Python object
  must be of integral type and the value of that integral object is returned.
* The argument to a ``bool`` cast must be of integral type (boolean or integer); when used as a return
  type the viper function will return True or False objects.
* If the argument is a Python object and the cast is ``ptr``, ``ptr8``, ``ptr16`` or ``ptr32``,
  then the Python object must either have the buffer protocol with read-write capabilities
  (in which case a pointer to the start of the buffer is returned) or it must be of integral
  type (in which case the value of that integral object is returned).

The following example illustrates the use of a ``ptr16`` cast to toggle pin X1 ``n`` times:

.. code:: python

    import micropython
    import stm
    from micropython import const

    BIT0 = const(1)

    @micropython.viper
    def toggle_n(n: int):
        odr = ptr16(stm.GPIOA + stm.GPIO_ODR)  # cast once, outside the loop
        for _ in range(n):
            odr[0] ^= BIT0

A detailed technical description of the three code emitters may be found
on Kickstarter here `Note 1 <https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts/664832>`_
and here `Note 2 <https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts/665145>`_.