
Communication: lwIP/UDP performance metrics and optimization (investigation) #47

Closed
rfairley opened this issue Aug 5, 2018 · 4 comments · Fixed by #102

Comments


@rfairley rfairley commented Aug 5, 2018

The next step in considering UDP for communications is to gather concrete values for the resources used, and the performance improvements provided, by communicating over UDP using the lwIP Raw API. For now, the performance optimizations can be applied and verified in the standalone project Development/Ethernet/lwip-rtos-config.

Metrics

  • Memory usage
  • Code size
  • Latency (round-trip)
  • Predictability (of latency)

Optimizations (potential areas)

  • Disable unnecessary lwIP compiled modules from CubeMX (code size)
  • Reduce allocated memory for pcbs (Protocol Control Blocks) and pbufs (Packet Buffers) (memory usage)
  • Investigate scheduling and network configuration on the PC side for network/IP/UDP-related processes (latency, predictability)

Testing/Verification

  • Memory usage: is there a way to report this at run time?
    • Or use CubeMX features to report stack/heap usage?
  • Code size: check how many bytes the device programmer flashes to the MCU
    • Right now in Development/Ethernet/lwip-rtos-config a message "Verified 63540B in ... seconds" is printed
  • Latency (round-trip): average of time measurements for multiple packets echoed to the MCU from the PC Testing/Ethernet/eth_echo_test.py
    • Right now, echo latencies are typically 0.75ms-0.9ms (testing on the F7 connected to a Windows 10 PC), but can vary 0.3ms to 1.1ms
  • Predictability (of latency): statistical analysis on echo time measurements: standard deviation, histogram, min/max. Max latency may be the most important statistic.
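The latency and predictability measurements above can be sketched as a UDP echo timer on the PC side. This is a minimal sketch, not the real eth_echo_test.py; the MCU address, port, and payload used in the usage note are placeholders:

```python
import socket
import statistics
import time

def measure_echo(host: str, port: int, payload: bytes,
                 trials: int = 100, timeout: float = 1.0):
    """Send `payload` to the MCU echo server `trials` times and
    return the round-trip time of each echo in seconds."""
    times = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(trials):
            start = time.perf_counter()
            sock.sendto(payload, (host, port))
            sock.recvfrom(2048)  # blocks until the MCU echoes the packet back
            times.append(time.perf_counter() - start)
    return times

def summarize(times):
    """The statistics proposed for the predictability analysis:
    mean, standard deviation, min, and max latency."""
    return {
        "mean": statistics.mean(times),
        "stdev": statistics.stdev(times),
        "min": min(times),
        "max": max(times),
    }
```

Against a real MCU this would be driven as, e.g., `summarize(measure_echo("192.168.0.10", 7, b"x" * 80))` (hypothetical address and port).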

Latency Goal

The 10/100 Mbps PHY on the F7 board should theoretically take 0.1/0.01 microseconds to transfer each bit at the physical level. In testing, we are seeing round-trip times of about 10-100x the theoretical physical speed: an average of 0.8 milliseconds when sending a UDP packet with 80 bytes of data (1.25 microseconds per bit). This is likely due to overhead in the network stack (IP processing, scheduling, kernel-to-user-mode context switching).
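The figures above can be checked with a little arithmetic, using only the numbers quoted in this issue:

```python
# 80-byte payload = 640 bits; observed mean round trip = 0.8 ms.
payload_bits = 80 * 8
observed_per_bit_us = (0.8e-3 / payload_bits) * 1e6  # microseconds per bit

# Theoretical wire time per bit at 10 Mbps and 100 Mbps.
per_bit_10mbps_us = 1 / 10e6 * 1e6    # ~0.1 us/bit
per_bit_100mbps_us = 1 / 100e6 * 1e6  # ~0.01 us/bit

print(observed_per_bit_us)                       # ~1.25 us/bit
print(observed_per_bit_us / per_bit_10mbps_us)   # ~12.5x the 10 Mbps wire time
print(observed_per_bit_us / per_bit_100mbps_us)  # ~125x the 100 Mbps wire time
```

The 12.5x-125x ratio is where the "about 10-100x" estimate comes from.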

It is difficult to find a theoretical latency for an application-to-application transmission (for PC<=>MCU), due to variation in network stacks, network interfaces, and configurations across the different OSs where prior testing has been done.

For now, we can aim to reduce latency in the echo test as much as possible. The fact that the transmission can be as fast as 0.3ms (sometimes lower) makes me think it wouldn't be unreasonable to get a latency reliably centered around 0.5ms, given the right OS-level configurations (on the PC side). Then we could expect a round-trip latency of <1ms once implemented in the Robot program with the cache feature (#36).

Results

code size:

After disabling non-required modules such as TCP and ICMP (see the CubeMX config), the code size is 40480 bytes. This includes lwIP, FreeRTOS, and board drivers such as HAL. This is down from 63540 bytes first observed after adding lwIP to a blank project.

verified 40580 bytes in 0.071225s (556.390 KiB/s)
** Verified OK **

memory usage:

lwIP now uses the minimum pbuf pool size allowed by Cube (down to 11 pbufs from the default of 16). Pbufs are of size 1524 bytes - enough for 1 MTU plus headers. Space for only 1 UDP Protocol Control Block (pcb) is allocated, as only one connection is needed (down from the default of 4).
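The pbuf pool savings implied by those numbers can be computed directly:

```python
PBUF_SIZE = 1524   # bytes per pbuf: 1 MTU + headers
DEFAULT_POOL = 16  # CubeMX default pbuf pool size
REDUCED_POOL = 11  # minimum pool size allowed by Cube

default_bytes = DEFAULT_POOL * PBUF_SIZE
reduced_bytes = REDUCED_POOL * PBUF_SIZE
print(default_bytes)                  # 24384
print(reduced_bytes)                  # 16764
print(default_bytes - reduced_bytes)  # 7620 bytes of pool RAM saved
```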

predictability:

As seen from the earlier results, setting scheduling options on the PC-OS side improves the predictability of the latency of a given transmission. The spreadsheet has been uploaded, and results can be reproduced using the scripts in the test kit with the other scheduling settings, FIFO and RR. We can control the priority of the communication task on the PC-OS side relative to other running tasks by setting the priority value passed to the scheduling system call.

The results also show a relationship between the number of bytes sent and message transmission time - so we can accurately predict an average round-trip time for the message sizes we decide on for commands sent to the robot.
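Assuming that relationship is roughly linear (a fixed per-packet overhead plus a per-byte cost), a least-squares fit over measured (size, RTT) pairs gives a simple predictor. The sample data below is made up for illustration, not taken from the spreadsheet:

```python
def fit_rtt_model(sizes, rtts):
    """Least-squares fit of rtt = overhead + per_byte * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(rtts) / n
    per_byte = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, rtts))
                / sum((x - mean_x) ** 2 for x in sizes))
    overhead = mean_y - per_byte * mean_x
    return overhead, per_byte

# Hypothetical measurements: message size in bytes -> mean RTT in ms.
sizes = [20, 40, 80, 160, 320]
rtts = [0.70, 0.72, 0.80, 0.92, 1.20]

overhead, per_byte = fit_rtt_model(sizes, rtts)
predicted = overhead + per_byte * 100  # predicted mean RTT for a 100-byte command
print(round(predicted, 3))
```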

@rfairley rfairley commented Aug 5, 2018

Opened this to keep track of any ideas for optimization - please edit the issue description, or comment, to add any ideas. We can consider the issue resolved when all the check-boxes are complete, and the metrics are collected (for the standalone application).

@rfairley rfairley changed the title Communication: lwIP/UDP Performance Metrics and Optimization Communication: lwIP/UDP performance metrics and optimization Aug 5, 2018
@rfairley rfairley moved this from To do to In progress in PC Interface Improvements Aug 27, 2018
@rfairley rfairley commented Sep 3, 2018

Attaching some experimental data, testing with a Python module called scheddl (https://github.com/dahallgren/scheddl) which interfaces with the Linux kernel to schedule the current process under the SCHED_DEADLINE policy. The scheduler will use Global Earliest Deadline First when processes under this policy are present, giving the processes a higher priority than default (http://man7.org/linux/man-pages/man7/sched.7.html).

The following results show time taken to send a UDP packet of variable size from the PC (running Ubuntu 18.04 LTS) to the MCU (STM32F767ZI running FreeRTOS+lwIP), and have the MCU send the packet back (echo). The mean, standard deviations as error bars, and max times are shown. 100 trials of each message size were carried out. The message sizes do not include any headers or metadata appended by the network stack.

Times are measured under conditions of "not busy", "busy", "no scheddl", and "with scheddl".

  • "not busy" - means no other applications started by the user are open, only the terminal running the testing script
  • "busy" - means 30 (arbitrarily decided) background processes doing calculations, memory stores, console I/O, and file I/O are running in the background
  • "no scheddl" - means the testing script was configured with scheddl_setting = ""
  • "with scheddl" - means the testing script was configured with scheddl_setting = "deadline"
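The FIFO and RR policies mentioned elsewhere in this thread can be set from Python's standard library; SCHED_DEADLINE itself is not exposed by the os module, which is why the test script goes through scheddl instead. A sketch, with an arbitrary priority value; real-time policies require root or CAP_SYS_NICE:

```python
import os

def set_policy(policy: int, priority: int) -> bool:
    """Try to put the current process under `policy`; return True on success."""
    try:
        os.sched_setscheduler(0, policy, os.sched_param(priority))
        return True
    except PermissionError:
        # SCHED_FIFO / SCHED_RR need elevated privileges.
        return False

if __name__ == "__main__":
    ok = set_policy(os.SCHED_FIFO, 10)  # priority 10 is an arbitrary example
    print("SCHED_FIFO set" if ok else "insufficient privileges for SCHED_FIFO")
```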

Testing script used: https://github.com/utra-robosoccer/soccer-embedded/blob/rfairley-lwip-rtos-config/Testing/Ethernet/eth_echo_test.py.
Background script to make PC busy: https://github.com/utra-robosoccer/soccer-embedded/blob/rfairley-lwip-rtos-config/Testing/Ethernet/fibo_finder.py

The generated .json files were parsed into .csv, and charts were made in Excel. Will try to get a more transparent and automatic data flow for analysis in future.

[Chart: scheddl_test - mean, standard deviation (error bars), and max echo times vs message size under each of the four conditions]

Notes

  • From the chart, we can see that when the PC is not busy, there is little difference between the mean echo times with and without SCHED_DEADLINE scheduling. The mean echo times without scheddl are sometimes lower, surprisingly. However, there is more fluctuation without scheddl (seen in a high max of nearly 7ms).
  • When the PC is busy, mean times without scheddl increase and are a lot more unpredictable, seen in higher standard deviations and higher mean values (note the difference in the vertical axis between the two "Mean, busy, with/without scheddl" charts). With scheddl, there is little difference from when the PC is not busy. Another surprising observation with scheddl is that when the PC is busy, mean echo times are slightly lower (compare the two "Mean... with scheddl" charts).

Overall, setting a higher scheduling priority for the PC-side communication process does not decrease UDP round-trip data transfer times by much, but it greatly improves the resiliency of the transmission speed when the PC is multitasking - and is therefore an improvement in the predictability of the latency of UDP communication.

@tygamvrelis tygamvrelis commented Sep 3, 2018

@Vuwij Take a look at these Linux settings. Some stuff you might want to consider

@tygamvrelis tygamvrelis referenced this issue Sep 16, 2018
3 of 3 tasks complete
@rfairley rfairley commented Oct 7, 2018

Tooling for Ethernet testing is better developed (https://github.com/utra-robosoccer/soccer-embedded/tree/f524d3b63fe2e03628893cad1c2c6c32ef49a570/Testing/Ethernet/eth_test_kit) and results are reproducible using the test script, with a few process-scheduling options available (deadline, FIFO, and round-robin).

From the results we now have some values for the latency and variance of the latency that we can expect. Investigation can be considered complete once the tooling is merged.

@rfairley rfairley changed the title Communication: lwIP/UDP performance metrics and optimization Communication: lwIP/UDP performance benchmarking and optimization (investigation) Oct 7, 2018
@rfairley rfairley changed the title Communication: lwIP/UDP performance benchmarking and optimization (investigation) Communication: lwIP/UDP performance metrics and optimization (investigation) Oct 7, 2018
@tygamvrelis tygamvrelis moved this from In progress to Done in PC Interface Improvements Oct 7, 2018
rfairley added a commit that referenced this issue Jun 17, 2019
Also drop the python_schedll_module-results.xlsx file, as scripts
to generate a chart are included in the eth_test_kit directory.
Findings have been noted at #47 (comment).
rfairley added a commit that referenced this issue Jun 17, 2019
Also drop the python_schedll_module-results.xlsx file, as scripts
to generate a chart are included in the eth_test_kit directory.
Findings have been noted at #47 (comment).