Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
Fetching contributors…

Cannot retrieve contributors at this time

112 lines (91 sloc) 5.193 kb

nbody - More optimizations

In the first part I showed some problems and possibilities of the B::C compiler and B::CC optimizing compiler with an regexp example which was very bad to optimize.

In the second part I got 2 times faster run-times with the B::CC compiler with the nbody benchmark, which does a lot of arithmetic.

In the third part I got 4.5 times faster run-times with perl-level AELEMFAST optimizations, and discussed optimising array accesses via no autovifification or types.

Optimising array accesses showed the need for autovifification detection in B::CC and better stack handling for more ops and datatypes, esp. aelem and helem.

But first let's study more easier goals to accomplish. If we look at the generated C source for a simple arithmetic function, like pp_sub_offset_momentum we immediately detect more possibilities.

static
CCPP(pp_sub_offset_momentum)
{
    SV *sv, *src, *dst, *left, *right;
    NV rnv0, lnv0, d1_px, d2_py, d3_pz, d4_mass, d7_tmp, d10_tmp, d13_tmp, d15_tmp, d17_tmp, d19_tmp, d21_tmp, d23_tmp, d25_tmp, d27_tmp, d29_tmp, d31_tmp, d33_tmp, d35_tmp, d37_tmp, d40_tmp, d42_tmp, d44_tmp;
    PERL_CONTEXT *cx;
    MAGIC *mg;
    I32 oldsave, gimme;
    dSP;
  lab_2a41220:
    TAINT_NOT;                 /* only needed once */
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* only needed once */
    FREETMPS;                  /* only needed once */
    SAVECLEARSV(PL_curpad[1]); /* not needed at all */
    d1_px = 0.00;
  lab_2a41370:
    TAINT_NOT;                 /* only needed once */
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
    FREETMPS;                  /* only needed once */
    SAVECLEARSV(PL_curpad[2]); /* not needed at all */
    d2_py = 0.00;
  lab_2a50a00:
    TAINT_NOT;                 /* only needed once */
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
    FREETMPS;                  /* only needed once */
    SAVECLEARSV(PL_curpad[3]); /* not needed at all */
    d3_pz = 0.00;
  lab_2a50b30:
    TAINT_NOT;                 /* only needed once */
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
    FREETMPS;                  /* only needed once */
    SAVECLEARSV(PL_curpad[4]); /* not needed at all */
  lab_2a50cc0:
    TAINT_NOT;                 /* only needed once */
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; /* unneeded */
    FREETMPS;                  /* only needed once */
    PUSHs(AvARRAY(MUTABLE_AV(PL_curpad[5]))[0]);    /* no autovivification */
    sv = POPs;
    MAYBE_TAINT_SASSIGN_SRC(sv);    /* not needed */
    SvSetMagicSV(PL_curpad[4], sv); /* i.e. PL_curpad[4] = sv; */
    ...

We can study the expanded macros with:

cc_harness -DOPT -E -O2 -onbody.perl-2.perl-1.i nbody.perl-2.perl.c

TAINT_NOT does (PL_tainted = (0)). It is needed only once, because nobody changes PL_tainted. We can also ignore taint checks generally by setting -fomit_taint.

perl -MO=Concise,offset_momentum nbody.perl-2a.perl

main::offset_momentum:
42 <1> leavesub[1 ref] K/REFC,1 ->(end)
-     <@> lineseq KP ->42
1        <;> nextstate(main 141 (eval 5):4) v ->2
4        <2> sassign vKS/2 ->5
2           <$> const(NV 0) s ->3
3           <0> padsv[$px:141,145] sRM*/LVINTRO ->4
...

sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp; is the 2nd part of the inlined code for nextstate and resets the stack pointer. As we keep track of the stack by ourselves we can omit most of these resets in nextstate.

FREETMPS is also part of nextstate, and calling it after each basic block is optimized by -O1, and -O2 would free the temps after each loop. If FREETMPS is needed at all, i.e. if locals are used in the function at all, is not checked yet.

SAVECLEARSV(PL_curpad[1-4]) is part of padsv /LVINTRO, but here unneeded, since it is in the context of sassign. So the value of the lexical does not need to be cleared before it is set. And btw. the setter of the lexical is already optimized to a temporary.

MAYBE_TAINT_SASSIGN_SRC(sv) is part of sassign and can be omitted with -fomit_taint, and since we are at TAINT_NOT we can leave it out.

SvSetMagicSV(PL_curpad[4], sv) is also part of the optimized sassign op, just not yet optimized enough, since sv cannot have any magic. A type declaration for the padsv would have used the faster equivalent SvNV(PL_curpad[4]) = SvNV(sv); put on the stack.

We can easily test this out by NOP'ing these code sections and see the costs.

With 4m53.073s, without 4m23.265s. 30 seconds or ~10% faster. This is now in the typical range of p5p micro-optimizations and not considered high-priority for now.

Let's rather check out more stack optimizations.

TBC...

Jump to Line
Something went wrong with that request. Please try again.