docs: Add Python speed optimisation guide, including minimal viper ref.

peterhinch · dpgeorge · commit 21b74604f936 · 2016-03-11T16:33:36.000Z
diff --git a/docs/reference/index.rst b/docs/reference/index.rst
@@ -14,6 +14,7 @@ MicroPython are described in the sections here.
 
    repl.rst
    isr_rules.rst
+   speed_python.rst
 
 .. only:: port_pyboard
 
diff --git a/docs/reference/speed_python.rst b/docs/reference/speed_python.rst
@@ -0,0 +1,318 @@
+Maximising Python Speed
+=======================
+
+This tutorial describes ways of improving the performance of MicroPython code.
+Optimisations involving other languages are covered elsewhere, namely the use
+of modules written in C and the MicroPython inline ARM Thumb-2 assembler.
+
+The process of developing high performance code comprises the following stages
+which should be performed in the order listed.
+
+* Design for speed.
+* Code and debug.
+
+Optimisation steps:
+
+* Identify the slowest section of code.
+* Improve the efficiency of the Python code.
+* Use the native code emitter.
+* Use the viper code emitter.
+
+Designing for speed
+-------------------
+
+Performance issues should be considered at the outset. This involves taking a view
+on the sections of code which are most performance critical and devoting particular
+attention to their design. The process of optimisation begins when the code has
+been tested: if the design is correct at the outset optimisation will be
+straightforward and may actually be unnecessary.
+
+Algorithms
+~~~~~~~~~~
+
+The most important aspect of designing any routine for performance is ensuring that
+the best algorithm is employed. This is a topic for textbooks rather than for a 
+MicroPython guide but spectacular performance gains can sometimes be achieved
+by adopting algorithms known for their efficiency.
+
+RAM Allocation
+~~~~~~~~~~~~~~
+
+To design efficient MicroPython code it is necessary to have an understanding of the
+way the interpreter allocates RAM. When an object is created or grows in size
+(for example where an item is appended to a list) the necessary RAM is allocated
+from a block known as the heap. This takes a significant amount of time;
+further it will on occasion trigger a process known as garbage collection which
+can take several milliseconds.
+
+Consequently the performance of a function or method can be improved if an object is created
+once only and not permitted to grow in size. This implies that the object persists
+for the duration of its use: typically it will be instantiated in a class constructor
+and used in various methods.
+
+This is covered in further detail :ref:`Controlling garbage collection <gc>` below.
+
+Buffers
+~~~~~~~
+
+An example of the above is the common case where a buffer is required, such as one
+used for communication with a device. A typical driver will create the buffer in the
+constructor and use it in its I/O methods which will be called repeatedly.
+
+The MicroPython libraries typically provide optional support for pre-allocated buffers.
+For example the ``uart.readinto()`` method allows two options for its argument, an integer
+or a buffer. If an integer is supplied it will read up to that number of bytes and
+return the outcome: this implies that a buffer is created with a corresponding
+memory allocation. Providing a pre-allocated buffer as the argument avoids this. See
+the code fragment in :ref:`Caching object references <Caching>` below.
+
+Floating Point
+~~~~~~~~~~~~~~
+
+For the most speed critical sections of code it is worth noting that performing
+any kind of floating point operation involves heap allocation. Where possible use
+integer operations and restrict the use of floating point to sections of the code
+where performance is not paramount.
+
+Arrays
+~~~~~~
+
+Consider the use of the various types of array classes as an alternative to lists.
+The ``array`` module supports various element types with 8-bit elements supported
+by Python's built in ``bytes`` and ``bytearray`` classes. These data structures all store
+elements in contiguous memory locations. Once again to avoid memory allocation in critical
+code these should be pre-allocated and passed as arguments or as bound objects.
+
+When passing slices of objects such as ``bytearray`` instances, Python creates
+a copy which involves allocation. This can be avoided using a ``memoryview``
+object:
+
+.. code:: python
+
+    ba = bytearray(100)
+    func(ba[3:10]) # a copy is passed
+    mv = memoryview(ba)
+    func(mv[3:10]) # a pointer to memory is passed
+
+A ``memoryview`` can only be applied to objects supporting the buffer protocol - this
+includes arrays but not lists.
+
+Identifying the slowest section of code
+---------------------------------------
+
+This is a process known as profiling and is covered in textbooks and
+(for standard Python) supported by various software tools. For the type of
+smaller embedded application likely to be running on MicroPython platforms
+the slowest function or method can usually be established by judicious use
+of the timing ``ticks`` group of functions documented
+`here <http://docs.micropython.org/en/latest/pyboard/library/time.html>`_.
+Code execution time can be measured in ms, us, or CPU cycles.
+
+The following enables any function or method to be timed by adding an
+``@timed_function`` decorator:
+
+.. code:: python
+
+    def timed_function(f, *args, **kwargs):
+        myname = str(f).split(' ')[1]
+        def new_func(*args, **kwargs):
+            t = time.ticks_us()
+            result = f(*args, **kwargs)
+            delta = time.ticks_diff(t, time.ticks_us())
+            print('Function {} Time = {:6.3f}ms'.format(myname, delta/1000))
+            return result
+        return new_func
+
+MicroPython code improvements
+-----------------------------
+
+The const() declaration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+MicroPython provides a ``const()`` declaration. This works in a similar way
+to ``#define`` in C in that when the code is compiled to bytecode the compiler
+substitutes the numeric value for the identifier. This avoids a dictionary
+lookup at runtime. The argument to ``const()`` may be anything which, at
+compile time, evaluates to an integer e.g. ``0x100`` or ``1 << 8``.
+
+.. _Caching:
+
+Caching object references
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Where a function or method repeatedly accesses objects performance is improved
+by caching the object in a local variable:
+
+.. code:: python
+
+    class foo(object):
+        def __init__(self):
+            ba = bytearray(100)
+        def bar(self, obj_display):
+            ba_ref = self.ba
+            fb = obj_display.framebuffer
+            # iterative code using these two objects
+
+This avoids the need repeatedly to look up ``self.ba`` and ``obj_display.framebuffer``
+in the body of the method ``bar()``.
+
+.. _gc:
+
+Controlling garbage collection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When memory allocation is required, MicroPython attempts to locate an adequately
+sized block on the heap. This may fail, usually because the heap is cluttered
+with objects which are no longer referenced by code. If a failure occurs, the
+process known as garbage collection reclaims the memory used by these redundant
+objects and the allocation is then tried again - a process which can take several
+milliseconds.
+
+There are benefits in pre-empting this by periodically issuing ``gc.collect()``.
+Firstly doing a collection before it is actually required is quicker - typically on the
+order of 1ms if done frequently. Secondly you can determine the point in code
+where this time is used rather than have a longer delay occur at random points,
+possibly in a speed critical section. Finally performing collections regularly
+can reduce fragmentation in the heap. Severe fragmentation can lead to
+non-recoverable allocation failures.
+
+Accessing hardware directly
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This comes into the category of more advanced programming and involves some knowledge
+of the target MCU. Consider the example of toggling an output pin on the Pyboard. The 
+standard approach would be to write
+
+.. code:: python
+
+    mypin.value(mypin.value() ^ 1) # mypin was instantiated as an output pin
+
+This involves the overhead of two calls to the ``Pin`` instance's ``value()``
+method. This overhead can be eliminated by performing a read/write to the relevant bit
+of the chip's GPIO port output data register (odr). To facilitate this the ``stm``
+module provides a set of constants providing the addresses of the relevant registers.
+A fast toggle of pin ``P4`` (CPU pin ``A14``) - corresponding to the green LED -
+can be performed as follows:
+
+.. code:: python
+
+    BIT14 = const(1 << 14)
+    stm.mem16[stm.GPIOA + stm.GPIO_ODR] ^= BIT14
+
+The Native code emitter
+-----------------------
+
+This causes the MicroPython compiler to emit ARM native opcodes rather than
+bytecode. It covers the bulk of the Python language so most functions will require
+no adaptation (but see below). It is invoked by means of a function decorator:
+
+.. code:: python
+
+    @micropython.native
+    def foo(self, arg):
+        buf = self.linebuf # Cached object
+        # code
+
+There are certain limitations in the current implementation of the native code emitter. 
+
+* Context managers are not supported (the ``with`` statement).
+* Generators are not supported.
+* If ``raise`` is used an argument must be supplied.
+
+The trade-off for the improved performance (roughly twices as fast as bytecode) is an
+increase in compiled code size.
+
+The Viper code emitter
+----------------------
+
+The optimisations discussed above involve standards-compliant Python code. The 
+Viper code emitter is not fully compliant. It supports special Viper native data types
+in pursuit of performance. Integer processing is non-compliant because it uses machine
+words: arithmetic on 32 bit hardware is performed modulo 2**32.
+
+Like the Native emitter Viper produces machine instructions but further optimisations
+are performed, substantially increasing performance especially for integer arithmetic and
+bit manipulations. It is invoked using a decorator:
+
+.. code:: python
+
+    @micropython.viper
+    def foo(self, arg: int) -> int:
+        # code
+
+As the above fragment illustrates it is beneficial to use Python type hints to assist the Viper optimiser. 
+Type hints provide information on the data types of arguments and of the return value; these
+are a standard Python language feature formally defined here `PEP0484 <https://www.python.org/dev/peps/pep-0484/>`_.
+Viper supports its own set of types namely ``int``, ``uint`` (unsigned integer), ``ptr``, ``ptr8``,
+``ptr16`` and ``ptr32``. The ``ptrX`` types are discussed below. Currently the ``uint`` type serves
+a single purpose: as a type hint for a function return value. If such a function returns ``0xffffffff``
+Python will interpret the result as 2**32 -1 rather than as -1.
+
+In addition to the restrictions imposed by the native emitter the following constraints apply:
+
+* Functions may have up to four arguments.
+* Default argument values are not permitted.
+* Floating point may be used but is not optimised.
+
+Viper provides pointer types to assist the optimiser. These comprise
+
+* ``ptr`` Pointer to an object.
+* ``ptr8`` Points to a byte.
+* ``ptr16`` Points to a 16 bit half-word.
+* ``ptr32`` Points to a 32 bit machine word.
+
+The concept of a pointer may be unfamiliar to Python programmers. It has similarities
+to a Python ``memoryview`` object in that it provides direct access to data stored in memory.
+Items are accessed using subscript notation, but slices are not supported: a pointer can return
+a single item only. Its purpose is to provide fast random access to data stored in contiguous
+memory locations - such as data stored in objects which support the buffer protocol, and
+memory-mapped peripheral registers in a microcontroller. It should be noted that programming
+using pointers is hazardous: bounds checking is not performed and the compiler does nothing to
+prevent buffer overrun errors.
+
+Typical usage is to cache variables:
+
+.. code:: python
+
+    @micropython.viper
+    def foo(self, arg: int) -> int:
+        buf = ptr8(self.linebuf) # self.linebuf is a bytearray or bytes object
+        for x in range(20, 30):
+            bar = buf[x] # Access a data item through the pointer
+            # code omitted
+
+In this instance the compiler "knows" that ``buf`` is the address of an array of bytes;
+it can emit code to rapidly compute the address of ``buf[x]`` at runtime. Where casts are
+used to convert objects to Viper native types these should be performed at the start of
+the function rather than in critical timing loops as the cast operation can take several
+microseconds. The rules for casting are as follows:
+
+* Casting operators are currently: ``int``, ``bool``, ``uint``, ``ptr``, ``ptr8``, ``ptr16`` and ``ptr32``.
+* The result of a cast will be a native Viper variable.
+* Arguments to a cast can be a Python object or a native Viper variable.
+* If argument is a native Viper variable, then cast is a no-op (i.e. costs nothing at runtime)
+  that just changes the type (e.g. from ``uint`` to ``ptr8``) so that you can then store/load
+  using this pointer.
+* If the argument is a Python object and the cast is ``int`` or ``uint``, then the Python object
+  must be of integral type and the value of that integral object is returned.
+* The argument to a bool cast must be integral type (boolean or integer); when used as a return
+  type the viper function will return True or False objects.
+* If the argument is a Python object and the cast is ``ptr``, ``ptr``, ``ptr16`` or ``ptr32``,
+  then the Python object must either have the buffer protocol with read-write capabilities
+  (in which case a pointer to the start of the buffer is returned) or it must be of integral
+  type (in which case the value of that integral object is returned).
+ 
+The following example illustrates the use of a ``ptr16`` cast to toggle pin X1 ``n`` times:
+
+.. code:: python
+
+    BIT0 = const(1)
+    @micropython.viper
+    def toggle_n(n: int):
+        odr = ptr16(stm.GPIOA + stm.GPIO_ODR)
+        for _ in range(n):
+            odr[0] ^= BIT0
+
+A detailed technical description of the three code emitters may be found
+on Kickstarter here `Note 1 <https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts/664832>`_
+and here `Note 2 <https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts/665145>`_