[isolate-data] Move hot fields closer to isolate_root
In generated code, we access fields inside IsolateData through the
root-register. On some platforms it is significantly cheaper to access
things that are close to the root-register value than things that are
located far away. The motivation for this CL was a 5% difference in
Octane/Mandreel scores between

// Part of the stack check.
cmpq rsp,[r13+0x9ea8]

and

cmpq rsp,[r13-0x30]  // Mandreel score improved by 5%.

This moves the StackGuard up to fix Mandreel. As a drive-by, also move
two more fields up that are accessed by each CallCFunction.

Tbr: yangguo@chromium.org
Bug: v8:9534,chromium:993264
Change-Id: I5418b63d40274a138e285fa3c99b96e33a814fb1
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1751345
Reviewed-by: Jakob Gruber <jgruber@chromium.org>
Reviewed-by: Yang Guo <yangguo@chromium.org>
Auto-Submit: Jakob Gruber <jgruber@chromium.org>
Commit-Queue: Yang Guo <yangguo@chromium.org>
Cr-Commit-Position: refs/heads/master@{#63187}
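
A side note on the two `cmpq` forms quoted in the message: on x86-64, a memory operand's displacement is encoded either as a sign-extended 8-bit value or as a full 32-bit value. The sketch below only classifies the two offsets from the message. The helper names are hypothetical, and the commit does not say that encoding size is the cause of the 5% difference, so this is one plausible factor rather than an explanation.

```cpp
#include <cassert>
#include <cstdint>

namespace disp_sketch {

// Hypothetical helper: true if the displacement fits the signed 8-bit form
// (ModRM mod=01), which covers the range [-128, 127].
constexpr bool FitsInDisp8(int32_t displacement) {
  return displacement >= -128 && displacement <= 127;
}

// Number of displacement bytes needed for a [r13 + disp] operand. With r13
// as the base register there is no displacement-free form, so even a zero
// displacement takes one byte.
constexpr int DisplacementBytes(int32_t displacement) {
  return FitsInDisp8(displacement) ? 1 : 4;
}

// The two offsets from the commit message.
static_assert(DisplacementBytes(-0x30) == 1, "short form");
static_assert(DisplacementBytes(0x9ea8) == 4, "long form");

}  // namespace disp_sketch
```

Under this encoding rule, only fields within roughly 128 bytes of the root-register value can use the short form.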
schuay authored and Commit Bot committed Aug 13, 2019
1 parent a1982f0 commit fb698ce
Showing 3 changed files with 37 additions and 16 deletions.
9 changes: 8 additions & 1 deletion include/v8-internal.h
@@ -152,15 +152,22 @@ class Internals {

static const uint32_t kNumIsolateDataSlots = 4;

+// IsolateData layout guarantees.
static const int kIsolateEmbedderDataOffset = 0;
static const int kExternalMemoryOffset =
kNumIsolateDataSlots * kApiSystemPointerSize;
static const int kExternalMemoryLimitOffset =
kExternalMemoryOffset + kApiInt64Size;
static const int kExternalMemoryAtLastMarkCompactOffset =
kExternalMemoryLimitOffset + kApiInt64Size;
-static const int kIsolateRootsOffset =
+static const int kIsolateFastCCallCallerFpOffset =
kExternalMemoryAtLastMarkCompactOffset + kApiInt64Size;
+static const int kIsolateFastCCallCallerPcOffset =
+kIsolateFastCCallCallerFpOffset + kApiSystemPointerSize;
+static const int kIsolateStackGuardOffset =
+kIsolateFastCCallCallerPcOffset + kApiSystemPointerSize;
+static const int kIsolateRootsOffset =
+kIsolateStackGuardOffset + 7 * kApiSystemPointerSize;

static const int kUndefinedValueRootIndex = 4;
static const int kTheHoleValueRootIndex = 5;
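
To make the offset chain in this hunk concrete, here is a minimal sketch that evaluates the new constants. It assumes a 64-bit build where both `kApiSystemPointerSize` and `kApiInt64Size` are 8; the `sketch` namespace and the standalone constants are stand-ins for illustration, not the actual `Internals` class.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Assumed sizes for a 64-bit target (not taken from the diff itself).
constexpr int kApiSystemPointerSize = 8;
constexpr int kApiInt64Size = 8;
constexpr uint32_t kNumIsolateDataSlots = 4;

// Same chain of definitions as in the hunk above.
constexpr int kIsolateEmbedderDataOffset = 0;
constexpr int kExternalMemoryOffset =
    static_cast<int>(kNumIsolateDataSlots) * kApiSystemPointerSize;
constexpr int kExternalMemoryLimitOffset =
    kExternalMemoryOffset + kApiInt64Size;
constexpr int kExternalMemoryAtLastMarkCompactOffset =
    kExternalMemoryLimitOffset + kApiInt64Size;
constexpr int kIsolateFastCCallCallerFpOffset =
    kExternalMemoryAtLastMarkCompactOffset + kApiInt64Size;
constexpr int kIsolateFastCCallCallerPcOffset =
    kIsolateFastCCallCallerFpOffset + kApiSystemPointerSize;
constexpr int kIsolateStackGuardOffset =
    kIsolateFastCCallCallerPcOffset + kApiSystemPointerSize;
constexpr int kIsolateRootsOffset =
    kIsolateStackGuardOffset + 7 * kApiSystemPointerSize;

// Under the assumed sizes, the stack guard starts 72 bytes into IsolateData
// and the roots table starts at byte 128.
static_assert(kIsolateStackGuardOffset == 72, "stack guard offset");
static_assert(kIsolateRootsOffset == 128, "roots table offset");

}  // namespace sketch
```

Under these assumptions the three moved fields all sit within the first 128 bytes of `IsolateData`, ahead of the roots table.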
36 changes: 21 additions & 15 deletions src/execution/isolate-data.h
@@ -111,21 +111,27 @@ class IsolateData final {
Address* builtins() { return builtins_; }

private:
-// Static layout definition.
+// Static layout definition.
+//
+// Note: The location of fields within IsolateData is significant. The
+// closer they are to the value of kRootRegister (i.e.: isolate_root()), the
+// cheaper it is to access them. See also: https://crbug.com/993264.
+// The recommend guideline is to put frequently-accessed fields close to the
+// beginning of IsolateData.
#define FIELDS(V) \
V(kEmbedderDataOffset, Internals::kNumIsolateDataSlots* kSystemPointerSize) \
V(kExternalMemoryOffset, kInt64Size) \
V(kExternalMemoryLlimitOffset, kInt64Size) \
V(kExternalMemoryAtLastMarkCompactOffset, kInt64Size) \
+V(kFastCCallCallerFPOffset, kSystemPointerSize) \
+V(kFastCCallCallerPCOffset, kSystemPointerSize) \
+V(kStackGuardOffset, StackGuard::kSizeInBytes) \
V(kRootsTableOffset, RootsTable::kEntriesCount* kSystemPointerSize) \
V(kExternalReferenceTableOffset, ExternalReferenceTable::kSizeInBytes) \
V(kThreadLocalTopOffset, ThreadLocalTop::kSizeInBytes) \
V(kBuiltinEntryTableOffset, Builtins::builtin_count* kSystemPointerSize) \
V(kBuiltinsTableOffset, Builtins::builtin_count* kSystemPointerSize) \
V(kVirtualCallTargetRegisterOffset, kSystemPointerSize) \
-V(kFastCCallCallerFPOffset, kSystemPointerSize) \
-V(kFastCCallCallerPCOffset, kSystemPointerSize) \
-V(kStackGuardOffset, StackGuard::kSizeInBytes) \
V(kStackIsIterableOffset, kUInt8Size) \
/* This padding aligns IsolateData size by 8 bytes. */ \
V(kPaddingOffset, \
@@ -153,6 +159,17 @@ class IsolateData final {
// Caches the amount of external memory registered at the last MC.
int64_t external_memory_at_last_mark_compact_ = 0;

+// Stores the state of the caller for TurboAssembler::CallCFunction so that
+// the sampling CPU profiler can iterate the stack during such calls. These
+// are stored on IsolateData so that they can be stored to with only one move
+// instruction in compiled code.
+Address fast_c_call_caller_fp_ = kNullAddress;
+Address fast_c_call_caller_pc_ = kNullAddress;
+
+// Fields related to the system and JS stack. In particular, this contains the
+// stack limit used by stack checks in generated code.
+StackGuard stack_guard_;
+
RootsTable roots_;

ExternalReferenceTable external_reference_table_;
@@ -172,17 +189,6 @@ class IsolateData final {
// ia32 (otherwise the arguments adaptor call runs out of registers).
void* virtual_call_target_register_ = nullptr;

-// Stores the state of the caller for TurboAssembler::CallCFunction so that
-// the sampling CPU profiler can iterate the stack during such calls. These
-// are stored on IsolateData so that they can be stored to with only one move
-// instruction in compiled code.
-Address fast_c_call_caller_fp_ = kNullAddress;
-Address fast_c_call_caller_pc_ = kNullAddress;
-
-// Fields related to the system and JS stack. In particular, this contains the
-// stack limit used by stack checks in generated code.
-StackGuard stack_guard_;
-
// Whether the SafeStackFrameIterator can successfully iterate the current
// stack. Only valid values are 0 or 1.
uint8_t stack_is_iterable_ = 1;
8 changes: 8 additions & 0 deletions src/execution/isolate.cc
@@ -2926,6 +2926,14 @@ void Isolate::CheckIsolateLayout() {
CHECK_EQ(OFFSET_OF(Isolate, isolate_data_), 0);
CHECK_EQ(static_cast<int>(OFFSET_OF(Isolate, isolate_data_.embedder_data_)),
Internals::kIsolateEmbedderDataOffset);
+CHECK_EQ(static_cast<int>(
+OFFSET_OF(Isolate, isolate_data_.fast_c_call_caller_fp_)),
+Internals::kIsolateFastCCallCallerFpOffset);
+CHECK_EQ(static_cast<int>(
+OFFSET_OF(Isolate, isolate_data_.fast_c_call_caller_pc_)),
+Internals::kIsolateFastCCallCallerPcOffset);
+CHECK_EQ(static_cast<int>(OFFSET_OF(Isolate, isolate_data_.stack_guard_)),
+Internals::kIsolateStackGuardOffset);
CHECK_EQ(static_cast<int>(OFFSET_OF(Isolate, isolate_data_.roots_)),
Internals::kIsolateRootsOffset);
CHECK_EQ(Internals::kExternalMemoryOffset % 8, 0);
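
The pattern behind these checks, comparing a struct's real field offsets against constants published in a public header, can be sketched at compile time with `offsetof`. `MockStackGuard` and `MockIsolateData` are hypothetical stand-ins, and the 7-pointer guard size is taken from the `7 * kApiSystemPointerSize` term in `v8-internal.h`, not from the real `StackGuard` definition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace layout_sketch {

// Hypothetical stand-in: seven pointer-sized words, matching the size the
// public header assumes for the stack guard.
struct MockStackGuard {
  uintptr_t words[7];
};

// Hypothetical stand-in for the first fields of IsolateData after this CL.
struct MockIsolateData {
  void* embedder_data[4];
  int64_t external_memory;
  int64_t external_memory_limit;
  int64_t external_memory_at_last_mark_compact;
  uintptr_t fast_c_call_caller_fp;
  uintptr_t fast_c_call_caller_pc;
  MockStackGuard stack_guard;
  uintptr_t roots[1];
};

// Expected offsets, derived the same way the public header derives them.
constexpr size_t kPointerSize = sizeof(void*);
constexpr size_t kCallerFpOffset = 4 * kPointerSize + 3 * sizeof(int64_t);
constexpr size_t kCallerPcOffset = kCallerFpOffset + kPointerSize;
constexpr size_t kStackGuardOffset = kCallerPcOffset + kPointerSize;
constexpr size_t kRootsOffset = kStackGuardOffset + 7 * kPointerSize;

// Compilation fails if a field is moved without updating the constants.
static_assert(offsetof(MockIsolateData, fast_c_call_caller_fp) ==
                  kCallerFpOffset, "caller fp offset mismatch");
static_assert(offsetof(MockIsolateData, fast_c_call_caller_pc) ==
                  kCallerPcOffset, "caller pc offset mismatch");
static_assert(offsetof(MockIsolateData, stack_guard) == kStackGuardOffset,
              "stack guard offset mismatch");
static_assert(offsetof(MockIsolateData, roots) == kRootsOffset,
              "roots offset mismatch");

}  // namespace layout_sketch
```

The diff performs the equivalent comparison at runtime with `CHECK_EQ`; the sketch uses `static_assert` only to stay self-contained.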

2 comments on commit fb698ce

@addaleax
Contributor

@schuay We’re looking into a performance regression in Node.js v12.16.0 (which bumped V8 from 7.7 to 7.8) – for ABI compatibility, we had to revert this patch in Node 12, but it seems that without it there’s about a 20 % performance drop in some scenarios.

Here’s the sample reproduction:

const MAX = 100000000
function ceil () {
  console.time('Math.ceil')
  for (var i = 0; i < MAX; i++) {
    Math.ceil(0.5)
  }
  console.timeEnd('Math.ceil')
}
ceil()
ceil()
ceil()
ceil()
ceil()

with Node 12.15.0:

Math.ceil: 86.506ms
Math.ceil: 49.187ms
Math.ceil: 58.792ms
Math.ceil: 59.663ms
Math.ceil: 57.176ms

with Node 12.16.0:

Math.ceil: 85.020ms
Math.ceil: 82.864ms
Math.ceil: 54.058ms
Math.ceil: 52.339ms
Math.ceil: 59.471ms

Notice how in the first output only the first call is slower, whereas in the second output it is the first two calls. This reproduces consistently.

Do you have any idea why this commit might have such an unexpectedly large effect (especially since neither Node.js v12.15.0 nor v12.16.0 contain it), and how to investigate this further?

@schuay
Member Author

@schuay schuay commented on fb698ce Feb 27, 2020

> Do you have any idea why this commit might have such an unexpectedly large effect (especially since neither Node.js v12.15.0 nor v12.16.0 contain it), and how to investigate this further?

Are the numbers you posted for 12.15.0 and 12.16.0 with or without the patch? Are (a) the 20% perf drop without the patch and (b) the perf drop after the first/second iteration related, or are they two separate issues? Sorry, not sure I understand your post correctly :)

In any case, I don't have much more information than what is already in the commit message. One form of the stack check (with the small offset from kRootRegister) is significantly faster on some platforms. A stack check happens at function-entry, and in each loop iteration. So the effects will be particularly prominent if a function is dominated by a loop with a very short body. So I would not say the effect is "unexpectedly large".

Regarding the drop after 1/2 iterations, I can only guess it's due to a deopt or badly chosen optimization. --trace-deopt and --trace-opt will tell you more.
