Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The case for unrolling the main push/pull loops #1414

Open
alexandergall opened this issue Feb 1, 2019 · 1 comment
Open

The case for unrolling the main push/pull loops #1414

alexandergall opened this issue Feb 1, 2019 · 1 comment

Comments

@alexandergall
Copy link
Contributor

alexandergall commented Feb 1, 2019

The basic loop structure of Snabb is the following (simplified)

repeat
  for i = 1, #breathe_pull_order do
    breathe_pull_order[i]:pull()
  end
  for i = 1, #breathe_push_order do
    breathe_push_order[i]:push()
  end
until done and done()

The actual workload is processed in the pull() and push() methods, which, in general, contain loops of the kind

for _ = 1, link.nreadable(link) do
  local p = link.receive(link)
  -- Process packet
end

Those are the loops that we would like to be compiled most efficiently, i.e. containing an actual loop in the trace itself (a "looping" trace). Let's assume that the compiler actually does find those traces (but note that this is by no means guaranteed in a complex Snabb program). Due to the design of the tracing compiler, the outer loops cannot form looping traces themselves and end up being implemented by connecting distinct "straight" traces. Those traces are considerably less optimized and the number and inner structure of them can have a significant impact on the overall performance of the program.

I would like to argue that it is beneficial to replace the explicite second-level loops over the breathe_pull_order and breathe_push_order arrays by implicit loops using goto:

repeat
  local i = 1
  ::PULL_LOOP::
  do
    if i > #breathe_pull_order then goto PULL_EXIT
    breathe_pull_order[i]:pull()
    i = i + 1
    goto PULL_LOOP
  end
  ::PULL_EXIT::
  i = 1
  ::PUSH_LOOP::
  do
    if i > #breathe_push_order then goto PUSH_EXIT
    breathe_push_order[i]:push()
    i = i + 1
    goto PUSH_LOOP
  end
  ::PUSH_EXIT::
until done and done()

To study the impact of this, consider the following simplified model of the breathe() loop

local C = require("ffi").C
local napps = 5
local iterations = 1e8
local apps = {}

local mode = "std"
if #main.parameters >= 1 then
   mode = main.parameters[1]
end
count = 0
threshold = 1000
if #main.parameters >= 2 then
   threshold = tonumber(main.parameters[2])
end

local function new_App ()
   local App = {}
   App.pull = loadstring([[
      return function (self)
         if count > threshold then
            for _ = 1, 100 do
               self.i = self.i+1
            end
         end
      end]])()
   return App
end

for i = 1, napps do
   apps[i] = setmetatable({ i = 0 }, { __index = new_App() })
end

local start = C.get_time_ns()
if mode == "std" then
   repeat
      for i = 1, napps do
         apps[i]:pull()
      end
      count = count + 1
   until count > iterations
else
   repeat
      local i = 1
      ::LOOP::
      do
         if i > napps then goto DONE end
         apps[i]:pull()
         i = i + 1
         goto LOOP
      end
      ::DONE::
      count = count + 1
   until count > iterations
end
local now = C.get_time_ns()
print("elapsed "..(tonumber(now-start)/1e9).." seconds")

The program instantiates 5 pseudo-apps, each of which has a trivial pull method that simply increments a counter. Those methods are called from a nested loop in the style of a breathe loop. For simplicity, we omit the push loop, which will not change our findings in a fundamental way because the pull and push loops are sequential (i.e. the overall loop structure is the same in a "topological" sense).

The program takes two arguments. The first one is a string that selects which style of loop should be used. std will use the traditional nested loops while any other value will use the "flattened" loop using goto. The main repeat loop keeps track of its iterations in the global variable counter and the program terminates after 1e8 iterations, printing the time in seconds that has elapsed while the loop was being executed.The second parameter to the program is a number that determines after how many iterations of of the main loop the loops in the pull methods should be executed. This allows us to simulate a Snabb program which is idle for a while right after starting. I believe that this is actually the case in most real-world applications.

To simulate diversity in the apps being run, the actual functions that implement the pull methods are created via loadstring(). This makes sure that they will be implemented by different prototypes.

Let's start by studying a completely idle system with the standard loops

$ sudo ./snabb snsh -jdump=+rs,/tmp/sim-empty-std.txt breathsim.lua std 1e20
elapsed 7.638000843 seconds

Using perf top we see 7 active traces

  27.66%  perf-30093.map        [.] TRACE_2::breathsim.lua:36                                                                                                                                                                      
  21.58%  perf-30093.map        [.] TRACE_4::breathsim.lua:37                                                                                                                                                                      
  13.38%  perf-30093.map        [.] TRACE_7::breathsim.lua:37                                                                                                                                                                      
  11.16%  perf-30093.map        [.] TRACE_3::breathsim.lua:37                                                                                                                                                                      
  10.11%  perf-30093.map        [.] TRACE_10::breathsim.lua:37                                                                                                                                                                     
   9.02%  perf-30093.map        [.] TRACE_9::breathsim.lua:37                                                                                                                                                                      
   5.63%  perf-30093.map        [.] TRACE_5::breathsim.lua:37                                                                                                                                                                      

sim-empty-std.txt
The only loop that gets compiled is the for i = 1, napps do loop. Let's look at trace #2

---- TRACE 2 start breathsim.lua:36
0057  TGETV   11   3  10
0058  MOV     12  11
0059  TGETS   11  11  15  ; "pull"
0060  CALL    11   1   2
0000  . FUNCF    6          ; 0x40ce9090:1
0001  . GGET     1   0      ; "count"
0002  . GGET     2   1      ; "threshold"
0003  . ISGE     2   1
0004  . JMP      1 => 0013
0013  . RET0     0   1
0061  FORL     7 => 0057
---- TRACE 2 IR
....              SNAP   #0   [ ---- ]
0001 rax   >  int SLOAD  #9    CRI
0002       >  int LE     0001  +2147483646
0003 rbp      int SLOAD  #8    CI
0004 r13   >  tab SLOAD  #4    T
0005          int FLOAD  0004  tab.asize
0006       >  p32 ABC    0005  0001
0007 rcx      p32 FLOAD  0004  tab.array
0008          p32 AREF   0007  0003
0009 rdx   >  tab ALOAD  0008
0010 rdi      p32 HREF   0009  "pull"
0011       >  p32 EQ     0010  [0x40dbc458]
0012 r12      tab FLOAD  0009  tab.meta
0013       >  tab NE     0012  NULL
0014          int FLOAD  0012  tab.hmask
0015       >  int EQ     0014  +1  
0016 r11      p32 FLOAD  0012  tab.node
0017       >  p32 HREFK  0016  "__index" @1
0018 r10   >  tab HLOAD  0017
0019          int FLOAD  0018  tab.hmask
0020       >  int EQ     0019  +1  
0021 r9       p32 FLOAD  0018  tab.node
0022       >  p32 HREFK  0021  "pull" @1
0023       >  fun HLOAD  0022
0024       >  fun EQ     0023  0x40ce9090:1
0025 r8       tab FLOAD  0x40ce9090:1  func.env
0026          int FLOAD  0025  tab.hmask
0027       >  int EQ     0026  +63 
0028 rsi      p32 FLOAD  0025  tab.node
0029       >  p32 HREFK  0028  "count" @15
0030 xmm0  >  num HLOAD  0029
0031       >  p32 HREFK  0028  "threshold" @11
0032       >  num HLOAD  0031
....              SNAP   #1   [ ---- ---- ---- ---- ---- ---- ---- ---- 0003 0001 ---- 0003 0x40ce9090:1|0009 ]
0033       >  num UGE    0032  0030
0034 rbp    + int ADD    0003  +1  
....              SNAP   #2   [ ---- ---- ---- ---- ---- ---- ---- ---- ]
0035       >  int LE     0034  0001
....              SNAP   #3   [ ---- ---- ---- ---- ---- ---- ---- ---- 0034 0001 ---- 0034 ]
0036 ------------ LOOP ------------
0037          p32 AREF   0007  0034
0038 rbx   >  tab ALOAD  0037
0039 rdi      p32 HREF   0038  "pull"
0040       >  p32 EQ     0039  [0x40dbc458]
0041 rbx      tab FLOAD  0038  tab.meta
0042       >  tab NE     0041  NULL
0043          int FLOAD  0041  tab.hmask
0044       >  int EQ     0043  +1  
0045 rbx      p32 FLOAD  0041  tab.node
0046       >  p32 HREFK  0045  "__index" @1
0047 rbx   >  tab HLOAD  0046
0048          int FLOAD  0047  tab.hmask
0049       >  int EQ     0048  +1  
0050 rbx      p32 FLOAD  0047  tab.node
0051       >  p32 HREFK  0050  "pull" @1
0052       >  fun HLOAD  0051
0053       >  fun EQ     0052  0x40ce9090:1
0054 rbp    + int ADD    0034  +1  
....              SNAP   #4   [ ---- ---- ---- ---- ---- ---- ---- ---- ]
0055       >  int LE     0054  0001
0056 rbp      int PHI    0034  0054

This looks like a good trace, but note the guard for the address of the pull function

0019          int FLOAD  0018  tab.hmask
0020       >  int EQ     0019  +1  
0021 r9       p32 FLOAD  0018  tab.node
0022       >  p32 HREFK  0021  "pull" @1
0023       >  fun HLOAD  0022
0024       >  fun EQ     0023  0x40ce9090:1

This address is different for each pseudo-app. What happens right after the compiler has finished recording this loop iteration? Since this was a successful trace, the resulting code is optimized and compiled to a proper looping root trace and the system continues by executing the next iteration of the loop from that freshly compiled trace. However, in that very next iteration the mentioned guard fails, because the pull method of the next app is different from the previous one. As a consequence, the trace is left to the interpreter via exit #0 right away and before the actual loop has even be reached.

After this has happened 10 times (the default number of passes through a side exit after which it gets hot) the compiler starts recording a side trace

---- TRACE 3 start 2/0 breathsim.lua:37
0057  TGETV   11   3  10
0058  MOV     12  11
0059  TGETS   11  11  15  ; "pull"
0060  CALL    11   1   2
0000  . FUNCF    6          ; 0x40ce8c18:1
0001  . GGET     1   0      ; "count"
0002  . GGET     2   1      ; "threshold"
0003  . ISGE     2   1
0004  . JMP      1 => 0013
0013  . RET0     0   1
0061  JFORL    7   2
---- TRACE 3 IR
....              SNAP   #0   [ ---- ]
0001 rax   >  int SLOAD  #9    CRI
0002       >  int LE     0001  +2147483646
0003 rbp      int SLOAD  #8    CI
0004 r13   >  tab SLOAD  #4    T
0005          int FLOAD  0004  tab.asize
0006       >  p32 ABC    0005  0001
0007 rcx      p32 FLOAD  0004  tab.array
0008          p32 AREF   0007  0003
0009 rdx   >  tab ALOAD  0008
0010 rdi      p32 HREF   0009  "pull"
0011       >  p32 EQ     0010  [0x40dbc458]
0012 r12      tab FLOAD  0009  tab.meta
0013       >  tab NE     0012  NULL
0014          int FLOAD  0012  tab.hmask
0015       >  int EQ     0014  +1  
0016 r11      p32 FLOAD  0012  tab.node
0017       >  p32 HREFK  0016  "__index" @1
0018 r10   >  tab HLOAD  0017
0019          int FLOAD  0018  tab.hmask
0020       >  int EQ     0019  +1  
0021 r9       p32 FLOAD  0018  tab.node
0022       >  p32 HREFK  0021  "pull" @1
0023       >  fun HLOAD  0022
0024       >  fun EQ     0023  0x40ce8c18:1
0025 r8       tab FLOAD  0x40ce8c18:1  func.env
0026          int FLOAD  0025  tab.hmask
0027       >  int EQ     0026  +63 
0028 rsi      p32 FLOAD  0025  tab.node
0029       >  p32 HREFK  0028  "count" @15
0030 xmm0  >  num HLOAD  0029
0031       >  p32 HREFK  0028  "threshold" @11
0032       >  num HLOAD  0031
....              SNAP   #1   [ ---- ---- ---- ---- ---- ---- ---- ---- 0003 0001 ---- 0003 0x40ce8c18:1|0009 ]
0033       >  num UGE    0032  0030
0034 rbp    + int ADD    0003  +1  
....              SNAP   #2   [ ---- ---- ---- ---- ---- ---- ---- ---- ]
0035       >  int LE     0034  0001
....              SNAP   #3   [ ---- ---- ---- ---- ---- ---- ---- ---- 0034 0001 ---- 0034 ]
0036 ------------ LOOP ------------
0037          p32 AREF   0007  0034
0038 rbx   >  tab ALOAD  0037
0039 rdi      p32 HREF   0038  "pull"
0040       >  p32 EQ     0039  [0x40dbc458]
0041 rbx      tab FLOAD  0038  tab.meta
0042       >  tab NE     0041  NULL
0043          int FLOAD  0041  tab.hmask
0044       >  int EQ     0043  +1  
0045 rbx      p32 FLOAD  0041  tab.node
0046       >  p32 HREFK  0045  "__index" @1
0047 rbx   >  tab HLOAD  0046
0048          int FLOAD  0047  tab.hmask
0049       >  int EQ     0048  +1  
0050 rbx      p32 FLOAD  0047  tab.node
0051       >  p32 HREFK  0050  "pull" @1
0052       >  fun HLOAD  0051
0053       >  fun EQ     0052  0x40ce8c18:1
0054 rbp    + int ADD    0034  +1  
....              SNAP   #4   [ ---- ---- ---- ---- ---- ---- ---- ---- ]
0055       >  int LE     0054  0001
0056 rbp      int PHI    0034  0054

You can see that the address of the pull method is now 0x40ce8c18 when it was 0x40ce9090 in trace #2. Normally, a side trace cannot form a loop but this case is an exception, because the exit happens so early that the code has not yet been specialized enough to prohibit the formation of a new loop. The loop that is being recorded in trace #3 looks exactly like the previous trace and, in fact, the exact same thing is happening with it, i.e. the trace is left right in the next iteration.

The same game is repeated in traces #4 and #5. Sometimes it will happen that one of those traces is entered with the expected pull method and execution proceeds into the actual loop but since the next iteration will always cause the same failure of the guard, it will never reach the PHI at the end of the loop. Eventually, such an exit will become hot as well, which is the case for trace #6, which is started from exit #3 of trace #3. It executes the pull method and returns to the loop which has been compiled in trace #2 and therefore links directly to that trace.

Trace #7 is again like #2, #3, #4 and #5 and traces #8 #9 and #10 are of the same kind as #6.

The good news is that everything is running compiled. The bad thing is that none of the optimized loops are actually executed.

Let's compare this with the "flattened" loop

$ sudo ./snabb snsh -jdump=+rs,/tmp/sim-empty-flat.txt breathsim.lua flat 1e20
elapsed 0.281710828 seconds

We can already see that the elapsed time is smaller by a factor of almost 30. The program compiles to a single trace

---- TRACE 2 start breathsim.lua:42
0070  KSHORT   7   1
0071  ISGE     1   7
0072  JMP      8 => 0074
0074  TGETV    8   3   7
0075  MOV      9   8
0076  TGETS    8   8  15  ; "pull"
0077  CALL     8   1   2
0000  . FUNCF    6          ; 0x418d2eb0:1
0001  . GGET     1   0      ; "count"
0002  . GGET     2   1      ; "threshold"
0003  . ISGE     2   1
0004  . JMP      1 => 0013
0013  . RET0     0   1
0078  ADDVN    7   7   1  ; 1
0079  JMP      8 => 0071
0071  ISGE     1   7
0072  JMP      8 => 0074
0074  TGETV    8   3   7
0075  MOV      9   8
0076  TGETS    8   8  15  ; "pull"
0077  CALL     8   1   2
0000  . FUNCF    6          ; 0x418c68a0:1
0001  . GGET     1   0      ; "count"
0002  . GGET     2   1      ; "threshold"
0003  . ISGE     2   1
0004  . JMP      1 => 0013
0013  . RET0     0   1
0078  ADDVN    7   7   1  ; 1
0079  JMP      8 => 0071
0071  ISGE     1   7
0072  JMP      8 => 0074
0074  TGETV    8   3   7
0075  MOV      9   8
0076  TGETS    8   8  15  ; "pull"
0077  CALL     8   1   2
0000  . FUNCF    6          ; 0x412b9d78:1
0001  . GGET     1   0      ; "count"
0002  . GGET     2   1      ; "threshold"
0003  . ISGE     2   1
0004  . JMP      1 => 0013
0013  . RET0     0   1
0078  ADDVN    7   7   1  ; 1
0079  JMP      8 => 0071
0071  ISGE     1   7
0072  JMP      8 => 0074
0074  TGETV    8   3   7
0075  MOV      9   8
0076  TGETS    8   8  15  ; "pull"
0077  CALL     8   1   2
0000  . FUNCF    6          ; 0x418d7cb8:1
0001  . GGET     1   0      ; "count"
0002  . GGET     2   1      ; "threshold"
0003  . ISGE     2   1
0004  . JMP      1 => 0013
0013  . RET0     0   1
0078  ADDVN    7   7   1  ; 1
0079  JMP      8 => 0071
0071  ISGE     1   7
0072  JMP      8 => 0074
0074  TGETV    8   3   7
0075  MOV      9   8
0076  TGETS    8   8  15  ; "pull"
0077  CALL     8   1   2
0000  . FUNCF    6          ; 0x41fe4698:1
0001  . GGET     1   0      ; "count"
0002  . GGET     2   1      ; "threshold"
0003  . ISGE     2   1
0004  . JMP      1 => 0013
0013  . RET0     0   1
0078  ADDVN    7   7   1  ; 1
0079  JMP      8 => 0071
0071  ISGE     1   7
0072  JMP      8 => 0074
0073  JMP      8 => 0080
0080  GGET     8   6      ; "count"
0081  ADDVN    8   8   1  ; 1
0082  GSET     8   6      ; "count"
0083  GGET     8   6      ; "count"
0084  ISGE     2   8
0085  JMP      8 => 0069
0069  LOOP     7 => 0086
---- TRACE 2 IR
....              SNAP   #0   [ ---- ---- ---- ---- ---- ---- ---- ---- ]
0001 xmm10 >  num SLOAD  #2    T
....              SNAP   #1   [ ---- ---- ---- ---- ---- ---- ---- ---- +1   ]
0002       >  num UGE    0001  +1  
....              SNAP   #2   [ ---- ---- ---- ---- ---- ---- ---- ---- +1   ]
0003 rax   >  tab SLOAD  #4    T
0004          int FLOAD  0003  tab.asize
0005       >  int ABC    0004  +5  
0006 rax      p32 FLOAD  0003  tab.array
0007          p32 AREF   0006  +1  
0008 r10   >  tab ALOAD  0007
0009 rdi      p32 HREF   0008  "pull"
0010       >  p32 EQ     0009  [0x41bb5458]
0011 rcx      tab FLOAD  0008  tab.meta
0012       >  tab NE     0011  NULL
0013          int FLOAD  0011  tab.hmask
0014       >  int EQ     0013  +1  
0015 rcx      p32 FLOAD  0011  tab.node
0016       >  p32 HREFK  0015  "__index" @1
0017 rcx   >  tab HLOAD  0016
0018          int FLOAD  0017  tab.hmask
0019       >  int EQ     0018  +1  
0020 rcx      p32 FLOAD  0017  tab.node
0021       >  p32 HREFK  0020  "pull" @1
0022       >  fun HLOAD  0021
0023       >  fun EQ     0022  0x418d2eb0:1
0024 rcx      tab FLOAD  0x418d2eb0:1  func.env
0025          int FLOAD  0024  tab.hmask
0026       >  int EQ     0025  +63 
0027 rcx      p32 FLOAD  0024  tab.node
0028 r11   >  p32 HREFK  0027  "count" @15
0029 xmm3  >  num HLOAD  0028
0030       >  p32 HREFK  0027  "threshold" @11
0031 xmm8  >  num HLOAD  0030
....              SNAP   #3   [ ---- ---- ---- ---- ---- ---- ---- ---- +1   0x418d2eb0:1|0008 ]
0032       >  num UGE    0031  0029
....              SNAP   #4   [ ---- ---- ---- ---- ---- ---- ---- ---- +2   ]
0033       >  num UGE    0001  +2  
....              SNAP   #5   [ ---- ---- ---- ---- ---- ---- ---- ---- +2   ]
0034          p32 AREF   0006  +2  
0035 r8    >  tab ALOAD  0034
0036 rdi      p32 HREF   0035  "pull"
0037       >  p32 EQ     0036  [0x41bb5458]
0038 rcx      tab FLOAD  0035  tab.meta
0039       >  tab NE     0038  NULL
0040          int FLOAD  0038  tab.hmask
0041       >  int EQ     0040  +1  
0042 rcx      p32 FLOAD  0038  tab.node
0043       >  p32 HREFK  0042  "__index" @1
0044 rcx   >  tab HLOAD  0043
0045          int FLOAD  0044  tab.hmask
0046       >  int EQ     0045  +1  
0047 rcx      p32 FLOAD  0044  tab.node
0048       >  p32 HREFK  0047  "pull" @1
0049       >  fun HLOAD  0048
0050       >  fun EQ     0049  0x418c68a0:1
0051 rcx      tab FLOAD  0x418c68a0:1  func.env
0052          int FLOAD  0051  tab.hmask
0053       >  int EQ     0052  +63 
0054 rcx      p32 FLOAD  0051  tab.node
0055 r9    >  p32 HREFK  0054  "count" @15
0056 xmm2  >  num HLOAD  0055
0057       >  p32 HREFK  0054  "threshold" @11
0058 xmm5  >  num HLOAD  0057
....              SNAP   #6   [ ---- ---- ---- ---- ---- ---- ---- ---- +2   0x418c68a0:1|0035 ]
0059       >  num UGE    0058  0056
....              SNAP   #7   [ ---- ---- ---- ---- ---- ---- ---- ---- +3   ]
0060       >  num UGE    0001  +3  
....              SNAP   #8   [ ---- ---- ---- ---- ---- ---- ---- ---- +3   ]
0061          p32 AREF   0006  +3  
0062 rsi   >  tab ALOAD  0061
0063 rdi      p32 HREF   0062  "pull"
0064       >  p32 EQ     0063  [0x41bb5458]
0065 rcx      tab FLOAD  0062  tab.meta
0066       >  tab NE     0065  NULL
0067          int FLOAD  0065  tab.hmask
0068       >  int EQ     0067  +1  
0069 rcx      p32 FLOAD  0065  tab.node
0070       >  p32 HREFK  0069  "__index" @1
0071 rcx   >  tab HLOAD  0070
0072          int FLOAD  0071  tab.hmask
0073       >  int EQ     0072  +1  
0074 rcx      p32 FLOAD  0071  tab.node
0075       >  p32 HREFK  0074  "pull" @1
0076       >  fun HLOAD  0075
0077       >  fun EQ     0076  0x412b9d78:1
0078 rcx      tab FLOAD  0x412b9d78:1  func.env
0079          int FLOAD  0078  tab.hmask
0080       >  int EQ     0079  +63 
0081 rcx      p32 FLOAD  0078  tab.node
0082 rdi   >  p32 HREFK  0081  "count" @15
0083 xmm15 >  num HLOAD  0082
0084       >  p32 HREFK  0081  "threshold" @11
0085 xmm4  >  num HLOAD  0084
....              SNAP   #9   [ ---- ---- ---- ---- ---- ---- ---- ---- +3   0x412b9d78:1|0062 ]
0086       >  num UGE    0085  0083
....              SNAP   #10  [ ---- ---- ---- ---- ---- ---- ---- ---- +4   ]
0087       >  num UGE    0001  +4  
....              SNAP   #11  [ ---- ---- ---- ---- ---- ---- ---- ---- +4   ]
0088          p32 AREF   0006  +4  
0089 rbx   >  tab ALOAD  0088
0090 rbp      p32 HREF   0089  "pull"
0091       >  p32 EQ     0090  [0x41bb5458]
0092 rcx      tab FLOAD  0089  tab.meta
0093       >  tab NE     0092  NULL
0094          int FLOAD  0092  tab.hmask
0095       >  int EQ     0094  +1  
0096 rcx      p32 FLOAD  0092  tab.node
0097       >  p32 HREFK  0096  "__index" @1
0098 rcx   >  tab HLOAD  0097
0099          int FLOAD  0098  tab.hmask
0100       >  int EQ     0099  +1  
0101 rcx      p32 FLOAD  0098  tab.node
0102       >  p32 HREFK  0101  "pull" @1
0103       >  fun HLOAD  0102
0104       >  fun EQ     0103  0x418d7cb8:1
0105 rcx      tab FLOAD  0x418d7cb8:1  func.env
0106          int FLOAD  0105  tab.hmask
0107       >  int EQ     0106  +63 
0108 rcx      p32 FLOAD  0105  tab.node
0109 rbp   >  p32 HREFK  0108  "count" @15
0110 xmm13 >  num HLOAD  0109
0111       >  p32 HREFK  0108  "threshold" @11
0112 xmm3  >  num HLOAD  0111
....              SNAP   #12  [ ---- ---- ---- ---- ---- ---- ---- ---- +4   0x418d7cb8:1|0089 ]
0113       >  num UGE    0112  0110
....              SNAP   #13  [ ---- ---- ---- ---- ---- ---- ---- ---- +5   ]
0114       >  num UGE    0001  +5  
....              SNAP   #14  [ ---- ---- ---- ---- ---- ---- ---- ---- +5   ]
0115          p32 AREF   0006  +5  
0116 rcx   >  tab ALOAD  0115
0117 rdx      p32 HREF   0116  "pull"
0118       >  p32 EQ     0117  [0x41bb5458]
0119 rax      tab FLOAD  0116  tab.meta
0120       >  tab NE     0119  NULL
0121          int FLOAD  0119  tab.hmask
0122       >  int EQ     0121  +1  
0123 rax      p32 FLOAD  0119  tab.node
0124       >  p32 HREFK  0123  "__index" @1
0125 rax   >  tab HLOAD  0124
0126          int FLOAD  0125  tab.hmask
0127       >  int EQ     0126  +1  
0128 rax      p32 FLOAD  0125  tab.node
0129       >  p32 HREFK  0128  "pull" @1
0130       >  fun HLOAD  0129
0131       >  fun EQ     0130  0x41fe4698:1
0132 rax      tab FLOAD  0x41fe4698:1  func.env
0133          int FLOAD  0132  tab.hmask
0134       >  int EQ     0133  +63 
0135 rax      p32 FLOAD  0132  tab.node
0136 rdx   >  p32 HREFK  0135  "count" @15
0137 xmm11 >  num HLOAD  0136
0138       >  p32 HREFK  0135  "threshold" @11
0139 xmm2  >  num HLOAD  0138
....              SNAP   #15  [ ---- ---- ---- ---- ---- ---- ---- ---- +5   0x41fe4698:1|0116 ]
0140       >  num UGE    0139  0137
....              SNAP   #16  [ ---- ---- ---- ---- ---- ---- ---- ---- +6   ]
0141       >  num LT     0001  +6  
....              SNAP   #17  [ ---- ---- ---- ---- ---- ---- ---- ---- +6   ]
0142 r15      fun SLOAD  #0    R
0143 r14      tab FLOAD  0142  func.env
0144          int FLOAD  0143  tab.hmask
0145       >  int EQ     0144  +63 
0146 r13      p32 FLOAD  0143  tab.node
0147 rax   >  p32 HREFK  0146  "count" @15
0148 xmm7  >  num HLOAD  0147
0149 xmm7   + num ADD    0148  +1  
0150          num HSTORE 0147  0149
....              SNAP   #18  [ ---- ---- ---- ---- ---- ---- ---- ---- +6   ---- ]
0151 xmm0  >  num SLOAD  #3    T
....              SNAP   #19  [ ---- ---- ---- ---- ---- ---- ---- ---- ---- ]
0152       >  num UGE    0151  0149
....              SNAP   #20  [ ---- ---- ---- ---- ---- ---- ---- ---- ]
0153 ------------ LOOP ------------
....              SNAP   #21  [ ---- ---- ---- ---- ---- ---- ---- ---- +1   ]
0154 xmm6  >  num HLOAD  0028
....              SNAP   #22  [ ---- ---- ---- ---- ---- ---- ---- ---- +1   0x418d2eb0:1|0008 ]
0155       >  num ULE    0154  0031
....              SNAP   #23  [ ---- ---- ---- ---- ---- ---- ---- ---- +2   ]
0156 xmm6  >  num HLOAD  0055
....              SNAP   #24  [ ---- ---- ---- ---- ---- ---- ---- ---- +2   0x418c68a0:1|0035 ]
0157       >  num ULE    0156  0058
....              SNAP   #25  [ ---- ---- ---- ---- ---- ---- ---- ---- +3   ]
0158 xmm6  >  num HLOAD  0082
....              SNAP   #26  [ ---- ---- ---- ---- ---- ---- ---- ---- +3   0x412b9d78:1|0062 ]
0159       >  num ULE    0158  0085
....              SNAP   #27  [ ---- ---- ---- ---- ---- ---- ---- ---- +4   ]
0160 xmm6  >  num HLOAD  0109
....              SNAP   #28  [ ---- ---- ---- ---- ---- ---- ---- ---- +4   0x418d7cb8:1|0089 ]
0161       >  num ULE    0160  0112
....              SNAP   #29  [ ---- ---- ---- ---- ---- ---- ---- ---- +5   ]
0162 xmm6  >  num HLOAD  0136
....              SNAP   #30  [ ---- ---- ---- ---- ---- ---- ---- ---- +5   0x41fe4698:1|0116 ]
0163       >  num ULE    0162  0139
0164 xmm7   + num ADD    0149  +1  
0165          num HSTORE 0147  0164
....              SNAP   #31  [ ---- ---- ---- ---- ---- ---- ---- ---- ---- ]
0166       >  num ULE    0164  0151
0167 xmm7     num PHI    0149  0164

Essentially, the flattened loop is like an explicit unrolling which removes the guards for the individual function prototypes. The loop is super compact and contains only the actual function calls, one after the other.

In this extreme case, where non of the apps share the same pull functions, this is actually the optimum. But any real-world Snabb program will probably also have a significant number of distinct apps in their networks, which makes this example relevant.

Now let's look what happens if the apps get busy after a short while, e.g. 1000 iterations of the main loop. During the first 1000 iterations, the compiler will find the exact same traces as just discussed. But what happens once the pull methods start executing their own loops?

$ sudo ./snabb snsh -jdump=+rs,/tmp/sim-std.txt breathsim.lua std 1000
elapsed 62.653934395 seconds

sim-std.txt
There are 20 active traces

  18.37%  perf-30395.map       [.] TRACE_14::(string):3                                                                                                                                                                            
  17.51%  perf-30395.map       [.] TRACE_11::(string):3                                                                                                                                                                            
  17.40%  perf-30395.map       [.] TRACE_15::(string):3                                                                                                                                                                            
  16.81%  perf-30395.map       [.] TRACE_12::(string):3                                                                                                                                                                            
  15.77%  perf-30395.map       [.] TRACE_13::(string):3                                                                                                                                                                            
   3.16%  perf-30395.map       [.] TRACE_2::breathsim.lua:36                                                                                                                                                                       
   2.30%  perf-30395.map       [.] TRACE_3::breathsim.lua:37                                                                                                                                                                       
   2.03%  perf-30395.map       [.] TRACE_5::breathsim.lua:37                                                                                                                                                                       
   1.79%  perf-30395.map       [.] TRACE_4::breathsim.lua:37                                                                                                                                                                       
   1.04%  perf-30395.map       [.] TRACE_7::breathsim.lua:37                                                                                                                                                                       
   0.50%  perf-30395.map       [.] TRACE_25::(string):7                                                                                                                                                                            
   0.33%  perf-30395.map       [.] TRACE_22::(string):3                                                                                                                                                                            
   0.30%  perf-30395.map       [.] TRACE_18::(string):3                                                                                                                                                                            
   0.27%  perf-30395.map       [.] TRACE_16::(string):3                                                                                                                                                                            
   0.24%  perf-30395.map       [.] TRACE_20::(string):3                                                                                                                                                                            
   0.23%  perf-30395.map       [.] TRACE_24::(string):3                                                                                                                                                                            
   0.18%  perf-30395.map       [.] TRACE_19::(string):7                                                                                                                                                                            
   0.17%  perf-30395.map       [.] TRACE_23::(string):7                                                                                                                                                                            
   0.16%  perf-30395.map       [.] TRACE_17::(string):7                                                                                                                                                                            
   0.16%  perf-30395.map       [.] TRACE_21::(string):7                                                                                                                                                                            

As expected, we see the traces #2, #3, #4, #5 and #7 from the initial phase before the innermost loops become active. The five most active traces (#11-#15) all look like this

---- TRACE 11 start 0x41779eb0:3
0009  TGETS    5   0   2  ; "i"
0010  ADDVN    5   5   0  ; 1
0011  TSETS    5   0   2  ; "i"
0012  FORL     1 => 0009
---- TRACE 11 IR
....              SNAP   #0   [ ---- ]
0001 rbp      int SLOAD  #2    CI
0002 rdx   >  tab SLOAD  #1    T
0003          int FLOAD  0002  tab.hmask
0004       >  int EQ     0003  +1  
0005 rcx      p32 FLOAD  0002  tab.node
0006 rax   >  p32 HREFK  0005  "i"  @0
0007 xmm7  >  num HLOAD  0006
0008 xmm7   + num ADD    0007  +1  
0009          num HSTORE 0006  0008
0010 rbp    + int ADD    0001  +1  
....              SNAP   #1   [ ---- ---- ]
0011       >  int LE     0010  +100
....              SNAP   #2   [ ---- ---- 0010 ---- ---- 0010 ]
0012 ------------ LOOP ------------
0013 xmm7   + num ADD    0008  +1  
0014          num HSTORE 0006  0013
0015 rbp    + int ADD    0010  +1  
....              SNAP   #3   [ ---- ---- ]
0016       >  int LE     0015  +100
0017 rbp      int PHI    0010  0015
0018 xmm7     num PHI    0008  0013

Clearly, those are the pull loops for each of the 5 apps and they are all compiled nicely, which is good. All remaining traces are simple "straight" traces that connect these loops with each other and with the initial traces.

Compare this with the flattened loop

sudo ./snabb snsh -jdump=+rs,/tmp/sim-flat.txt breathsim.lua flat 1000
elapsed 52.961463354 seconds

sim-flat.txt

  19.42%  perf-30471.map       [.] TRACE_6::(string):3                                                                                                                                                                             
  18.73%  perf-30471.map       [.] TRACE_7::(string):3                                                                                                                                                                             
  18.47%  perf-30471.map       [.] TRACE_3::(string):3                                                                                                                                                                             
  18.20%  perf-30471.map       [.] TRACE_5::(string):3                                                                                                                                                                             
  18.02%  perf-30471.map       [.] TRACE_4::(string):3                                                                                                                                                                             
   1.27%  perf-30471.map       [.] TRACE_10::(string):7                                                                                                                                                                            
   1.15%  perf-30471.map       [.] TRACE_8::(string):7                                                                                                                                                                             
   1.15%  perf-30471.map       [.] TRACE_11::(string):7                                                                                                                                                                            
   1.12%  perf-30471.map       [.] TRACE_9::(string):7                                                                                                                                                                             
   0.79%  perf-30471.map       [.] TRACE_2::breathsim.lua:42                                                                                                                                                                       
   0.48%  perf-30471.map       [.] TRACE_12::(string):7                                                                                                                                                                            
   0.27%  perf-30471.map       [.] TRACE_13::(string):3                                                                                                                                                                            

We can see trace #2 from the "emtpy" phase and traces #3, #4, #5, #6 and #7 that contain the exact same looping traces for the pull methods. Also here, there are a handful of straight traces connecting all the loops.

The difference is that in the standard scenario, there is a significantly larger part of relatively inefficient code which translates to a running time which is 20% lower than in the standard case.

As a final test, consider the case when none of the apps is idle when the program starts.

$ sudo ./snabb snsh -jdump=+rs,/tmp/sim-full-std.txt breathsim.lua std 0
elapsed 52.692324165 seconds

sim-full-std.txt

  18.77%  perf-30544.map            [.] TRACE_2::(string):3                                                                                                                                                                        
  18.70%  perf-30544.map            [.] TRACE_4::(string):3                                                                                                                                                                        
  18.50%  perf-30544.map            [.] TRACE_3::(string):3                                                                                                                                                                        
  18.36%  perf-30544.map            [.] TRACE_5::(string):3                                                                                                                                                                        
  18.29%  perf-30544.map            [.] TRACE_6::(string):3                                                                                                                                                                        
   1.47%  perf-30544.map            [.] TRACE_11::(string):7                                                                                                                                                                       
   1.19%  perf-30544.map            [.] TRACE_7::(string):7                                                                                                                                                                        
   1.18%  perf-30544.map            [.] TRACE_8::(string):7                                                                                                                                                                        
   1.15%  perf-30544.map            [.] TRACE_10::(string):7                                                                                                                                                                       
   1.14%  perf-30544.map            [.] TRACE_9::(string):7                                                                                                                                                                        

Here, the innermost loops are naturally compiled first and none of the side-traces observed in the "empty" scenario occur. In fact, the trace structure is essentially the same as the one in the "flat + empty" case.

How does it compare to the modified loop case?

$ sudo ./snabb snsh -jdump=+rs,/tmp/sim-full-flat.txt breathsim.lua flat 0
elapsed 52.60125439 seconds

sim-full-flat.txt

  20.03%  perf-30579.map             [.] TRACE_2::(string):3                                                                                                                                                                       
  19.12%  perf-30579.map             [.] TRACE_5::(string):3                                                                                                                                                                       
  19.00%  perf-30579.map             [.] TRACE_3::(string):3                                                                                                                                                                       
  17.59%  perf-30579.map             [.] TRACE_4::(string):3                                                                                                                                                                       
  16.99%  perf-30579.map             [.] TRACE_6::(string):3                                                                                                                                                                       
   1.28%  perf-30579.map             [.] TRACE_7::(string):7                                                                                                                                                                       
   1.27%  perf-30579.map             [.] TRACE_10::(string):7                                                                                                                                                                      
   1.25%  perf-30579.map             [.] TRACE_11::(string):7                                                                                                                                                                      
   1.25%  perf-30579.map             [.] TRACE_8::(string):7                                                                                                                                                                       
   1.18%  perf-30579.map             [.] TRACE_9::(string):7                                                                                                                                                                       

In the "full" scenario, both looping mechanisms are equivalent.

The conclusion is that we can significantly improve the performance in case the system is idle for a while when it starts up without any negative effects for the case when the system is loaded immediately. I believe the former case is actually more common in realistic scenarios.

Based on these findings I propose to apply the "flattening" of the breathe_pull_order and breathe_push_order loops in core.app.

@alexandergall
Copy link
Contributor Author

I have tested this modification with the l2vpn program in a production-like environment. The engine contained about 30 apps (20 different app classes) and about 60 links. In any realistic scenario, changes are very high that the system is idle after a regular start. I confirmed that it exhibits exactly the behavior described in this issue.

The "flattening" of the loops has reduced the noise of the compiler significantly, in particular when confronted with changing workloads (e.g. direction of bulk traffic from encap-heavy to decap-heavy, from ipv4 only to ipv6 only, from unfragmented to fragmented traffic). The observed effect is that there are much fewer aborted traces and much fewer implicit blacklistings of side traces. Performance has become much more stable across all workloads and "catastrophic JIT failures" have practically vanished. Those failures would result in highly suboptimal traces or even substantial interpreter fallback (>50%) when the system was exposed to a wide range of different workloads.

I am confident enough to include this change in the next production release of l2vpn. I would be very much interested to see the effect this will have on other Snabb programs.

alexandergall added a commit to alexandergall/snabbswitch that referenced this issue Feb 15, 2019
The goal of this change is to simplify the loop structure of the Snabb
engine to make it more likely that the compiler can find and optimize
the packet-processing loops. In particular, it can help in the
situation when the program starts in an idle state with no packets to
process. In that case, we want to avoid that traces compiled for this
initial workload have a negative impact on the creation of traces for
the actual (non-idle) workload.

See snabbco#1414 for an analysis with
a simplified engine.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant