[RISCV] Add SiFive X390 scheduling model #143938

Merged: mshockwave merged 1 commit into llvm:main from patch/rvv/x390-sched-model on Jun 23, 2025

Conversation

mshockwave
Member

@mshockwave mshockwave commented Jun 12, 2025

This patch adds the scheduling model for sifive-x390. X390 is a dual-issue, in-order CPU. It has two scalar and two vector pipes, with VLEN=1024 and DLEN=512.

This patch is stacked on top of the new abstraction layer in #144442.

I noticed that we probably need to update a few numbers in the model, but I prefer to do it incrementally in the future.
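
As a back-of-the-envelope illustration of what VLEN=1024 and DLEN=512 imply, the sketch below counts how many DLEN-wide passes one vector operation needs. It assumes the datapath consumes DLEN bits per cycle; the class name `ToyPassesPerOp` and the record names are hypothetical and do not appear in the patch, and the real model may compute cycles differently.

```tablegen
// Hypothetical helper, not part of the patch: number of DLEN-wide passes
// needed for one vector operation, assuming DLEN bits are processed per cycle.
class ToyPassesPerOp<int vlen, int dlen, int lmul = 1> {
  int c = !div(!mul(vlen, lmul), dlen);
}

// X390 figures from the description: VLEN=1024, DLEN=512.
def ToyX390_LMUL1 : ToyPassesPerOp<1024, 512>;     // c = 2
def ToyX390_LMUL8 : ToyPassesPerOp<1024, 512, 8>;  // c = 16
```

Under that assumption, an LMUL=1 operation occupies the datapath for two passes, which is why the VLEN/DLEN ratio matters when picking occupancy numbers for the vector sequencers.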

@llvmbot
Member

llvmbot commented Jun 12, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Min-Yih Hsu (mshockwave)

Changes

This patch adds the scheduling model for sifive-x390. X390 is a dual-issue, in-order CPU. It has two scalar and two vector pipes, with VLEN=1024 and DLEN=512.

This is a relatively big patch because it tries to reuse the existing SiFive7 scheduling model. Let me break down some of the biggest changes:

  • Processor resource definitions (i.e. pipes) are factored out into a multiclass, SiFive7ProcResources. Similarly, WriteRes entries and bypass entries (i.e. ReadAdvance) are factored out into their own multiclasses: SiFive7WriteResBase and SiFive7ReadAdvance, respectively.
  • The aforementioned three components, SiFive7ProcResources, SiFive7WriteResBase, and SiFive7ReadAdvance, are encapsulated in a bigger multiclass, SiFive7SchedResources, which configures these components with parameters passed as template arguments. An example configuration value is whether there is an extra vector ALU or not (i.e. extraVALU).
  • SiFive7's SchedMachineModel carries not only standard fields like issue width, but also the concrete config values corresponding to the processor. For instance, X390 has extraVALU set to true.
  • In the final phase, we "bind" SchedMachineModel from each processor to a SiFive7SchedResources that is instantiated from that SchedMachineModel's config values.
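
The four steps above can be sketched in a self-contained TableGen file. This is a minimal illustration under stated assumptions, not the code in the patch: every `Toy*` class, field, and record name is hypothetical and stands in for `ProcResource`, `SchedMachineModel`, and the SiFive7 multiclasses so that the file needs no LLVM includes.

```tablegen
// Toy stand-ins for ProcResource and SchedMachineModel; all "Toy*" names
// are hypothetical and exist only for this illustration.
class ToyResource<int units> {
  int NumUnits = units;
}

class ToyModel {
  int IssueWidth = 2;
  // Per-processor config values carried by the model record.
  int VLEN = 512;
  bit HasExtraVALU = false;
}

// Step 1: resources are factored into a multiclass with a config knob.
multiclass ToyProcResources<bit extraVALU = false> {
  def PipeA : ToyResource<1>;
  def PipeB : ToyResource<1>;
  if extraVALU then {
    def VA1 : ToyResource<1>;
    def VA2 : ToyResource<1>;
  } else {
    def VA : ToyResource<1>;
  }
}

// Step 2: an umbrella multiclass forwards its template arguments.
multiclass ToySchedResources<int vlen, bit extraVALU> {
  defm "" : ToyProcResources<extraVALU>;
  def VectorConfig { int VLEN = vlen; }
}

// Step 3: each model record carries its own config values.
def ToyX280Model : ToyModel;
def ToyX390Model : ToyModel {
  let VLEN = 1024;
  let HasExtraVALU = true;
}

// Step 4: "bind" each model to resources instantiated from its config.
defm ToyX280 : ToySchedResources<ToyX280Model.VLEN,
                                 ToyX280Model.HasExtraVALU>;
defm ToyX390 : ToySchedResources<ToyX390Model.VLEN,
                                 ToyX390Model.HasExtraVALU>;
```

Feeding a file like this to `llvm-tblgen` (whose default action prints the resulting records) is one way to check which resources each prefix ends up with. The design point is that the `if extraVALU` branch lives in one place, so both processors share every other definition.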

I noticed that we probably need to update a few numbers in the model, but I prefer to do it incrementally in the future.


Patch is 1.08 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/143938.diff

41 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVProcessors.td (+2-1)
  • (modified) llvm/lib/Target/RISCV/RISCVSchedSiFive7.td (+1140-1002)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFive7/div-fdiv.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFive7/gpr-bypass-c.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFive7/gpr-bypass.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFive7/instruction-tables-tests.s (+132-132)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFive7/jump.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/different-lmul-instruments.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/different-sew-instruments.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/disable-im.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/fractional-lmul-data.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/lmul-instrument-at-start.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/lmul-instrument-in-middle.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/lmul-instrument-in-region.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/lmul-instrument-straddles-region.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/multiple-same-lmul-instruments.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/multiple-same-sew-instruments.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/needs-sew-but-only-lmul.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/no-vsetvli-to-start.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/reductions.s (+9-9)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/sew-instrument-at-start.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/sew-instrument-in-middle.s (+9-9)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/sew-instrument-in-region.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/sew-instrument-straddles-region.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/strided-load-store.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/strided-load-x0.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/vector-integer-arithmetic.s (+9-9)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/vle-vse.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/vsetivli-lmul-instrument.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/vsetivli-lmul-sew-instrument.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/vsetvli-lmul-instrument.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFiveX280/vsetvli-lmul-sew-instrument.s (+8-8)
  • (added) llvm/test/tools/llvm-mca/RISCV/SiFiveX390/div-fdiv.s (+54)
  • (added) llvm/test/tools/llvm-mca/RISCV/SiFiveX390/fractional-lmul-data.s (+62)
  • (added) llvm/test/tools/llvm-mca/RISCV/SiFiveX390/reductions.s (+678)
  • (added) llvm/test/tools/llvm-mca/RISCV/SiFiveX390/strided-load-store.s (+368)
  • (added) llvm/test/tools/llvm-mca/RISCV/SiFiveX390/strided-load-x0.s (+132)
  • (added) llvm/test/tools/llvm-mca/RISCV/SiFiveX390/vector-fp.s (+4851)
  • (added) llvm/test/tools/llvm-mca/RISCV/SiFiveX390/vector-integer-arithmetic.s (+2272)
  • (added) llvm/test/tools/llvm-mca/RISCV/SiFiveX390/vgather-vcompress.s (+317)
  • (added) llvm/test/tools/llvm-mca/RISCV/SiFiveX390/vle-vse.s (+1256)
diff --git a/llvm/lib/Target/RISCV/RISCVProcessors.td b/llvm/lib/Target/RISCV/RISCVProcessors.td
index de6f0ecfce737..b2a74987de66f 100644
--- a/llvm/lib/Target/RISCV/RISCVProcessors.td
+++ b/llvm/lib/Target/RISCV/RISCVProcessors.td
@@ -292,7 +292,8 @@ def SIFIVE_X280 : RISCVProcessorModel<"sifive-x280", SiFive7Model,
                                        FeatureStdExtZbb],
                                       SiFiveIntelligenceTuneFeatures>;
 
-def SIFIVE_X390 : RISCVProcessorModel<"sifive-x390", NoSchedModel,
+def SIFIVE_X390 : RISCVProcessorModel<"sifive-x390",
+                                      SiFiveX390Model,
                                       [Feature64Bit,
                                        FeatureStdExtI,
                                        FeatureStdExtM,
diff --git a/llvm/lib/Target/RISCV/RISCVSchedSiFive7.td b/llvm/lib/Target/RISCV/RISCVSchedSiFive7.td
index c1d7cd4a716e7..78a176fcf18d9 100644
--- a/llvm/lib/Target/RISCV/RISCVSchedSiFive7.td
+++ b/llvm/lib/Target/RISCV/RISCVSchedSiFive7.td
@@ -169,6 +169,12 @@ class SiFive7GetOrderedReductionCycles<string mx, int sew, int VLEN> {
   int c = !mul(6, VLUpperBound);
 }
 
+class SiFive7FPLatencies {
+  int BasicFP16ALU;
+  int BasicFP32ALU;
+  int BasicFP64ALU;
+}
+
 class SiFive7AnyToGPRBypass<SchedRead read, int cycles = 2>
     : ReadAdvance<read, cycles, [WriteIALU, WriteIALU32,
                                  WriteShiftImm, WriteShiftImm32,
@@ -186,1121 +192,1253 @@ class SiFive7AnyToGPRBypass<SchedRead read, int cycles = 2>
                                  WriteIRem, WriteIRem32,
                                  WriteLDB, WriteLDH, WriteLDW, WriteLDD]>;
 
-// SiFive7 machine model for scheduling and other instruction cost heuristics.
-def SiFive7Model : SchedMachineModel {
-  let MicroOpBufferSize = 0; // Explicitly set to zero since SiFive7 is in-order.
-  let IssueWidth = 2;        // 2 micro-ops are dispatched per cycle.
-  let LoadLatency = 3;
-  let MispredictPenalty = 3;
-  let CompleteModel = 0;
-  let EnableIntervals = true;
-  let UnsupportedFeatures = [HasStdExtZbkb, HasStdExtZbkc, HasStdExtZbkx,
-                             HasStdExtZcmt, HasStdExtZknd, HasStdExtZkne,
-                             HasStdExtZknh, HasStdExtZksed, HasStdExtZksh,
-                             HasStdExtZkr];
-}
-
-// The SiFive7 microarchitecture has three pipelines: A, B, V.
+// The SiFive7 microarchitecture has three kinds of pipelines: A, B, V.
 // Pipe A can handle memory, integer alu and vector operations.
 // Pipe B can handle integer alu, control flow, integer multiply and divide,
 // and floating point computation.
-// The V pipeline is modeled by the VCQ, VA, VL, and VS resources.
-let SchedModel = SiFive7Model in {
-let BufferSize = 0 in {
-def SiFive7PipeA       : ProcResource<1>;
-def SiFive7PipeB       : ProcResource<1>;
-def SiFive7IDiv        : ProcResource<1>; // Int Division
-def SiFive7FDiv        : ProcResource<1>; // FP Division/Sqrt
-def SiFive7VA          : ProcResource<1>; // Arithmetic sequencer
-def SiFive7VL          : ProcResource<1>; // Load sequencer
-def SiFive7VS          : ProcResource<1>; // Store sequencer
-// The VCQ accepts instructions from the the A Pipe and holds them until the
-// vector unit is ready to dequeue them. The unit dequeues up to one instruction
-// per cycle, in order, as soon as the sequencer for that type of instruction is
-// available. This resource is meant to be used for 1 cycle by all vector
-// instructions, to model that only one vector instruction may be dequeued at a
-// time. The actual dequeueing into the sequencer is modeled by the VA, VL, and
-// VS sequencer resources below. Each of them will only accept a single
-// instruction at a time and remain busy for the number of cycles associated
-// with that instruction.
-def SiFive7VCQ         : ProcResource<1>; // Vector Command Queue
-}
-
-def SiFive7PipeAB : ProcResGroup<[SiFive7PipeA, SiFive7PipeB]>;
-
-defvar SiFive7VLEN = 512;
+// The V pipeline is modeled by the VCQ, VA, VL, and VS resources. There can
+// be one or two VA (Vector Arithmetic).
+multiclass SiFive7ProcResources<bit extraVALU = false> {
+  let BufferSize = 0 in {
+    def PipeA     : ProcResource<1>;
+    def PipeB     : ProcResource<1>;
+
+    def IDiv      : ProcResource<1>; // Int Division
+    def FDiv      : ProcResource<1>; // FP Division/Sqrt
+
+    // Arithmetic sequencer(s)
+    if extraVALU then {
+      // VA1 can handle any vector arithmetic instruction.
+      def VA1     : ProcResource<1>;
+      // VA2 generally can only handle simple vector arithmetic.
+      def VA2     : ProcResource<1>;
+    } else {
+      def VA      : ProcResource<1>;
+    }
 
-// Branching
-let Latency = 3 in {
-def : WriteRes<WriteJmp, [SiFive7PipeB]>;
-def : WriteRes<WriteJal, [SiFive7PipeB]>;
-def : WriteRes<WriteJalr, [SiFive7PipeB]>;
-}
+    def VL        : ProcResource<1>; // Load sequencer
+    def VS        : ProcResource<1>; // Store sequencer
+    // The VCQ accepts instructions from the A Pipe and holds them until the
+    // vector unit is ready to dequeue them. The unit dequeues up to one instruction
+    // per cycle, in order, as soon as the sequencer for that type of instruction is
+    // available. This resource is meant to be used for 1 cycle by all vector
+    // instructions, to model that only one vector instruction may be dequeued at a
+    // time. The actual dequeueing into the sequencer is modeled by the VA, VL, and
+    // VS sequencer resources below. Each of them will only accept a single
+    // instruction at a time and remain busy for the number of cycles associated
+    // with that instruction.
+    def VCQ       : ProcResource<1>; // Vector Command Queue
+  }
 
-//Short forward branch
-def : WriteRes<WriteSFB, [SiFive7PipeA, SiFive7PipeB]> {
-  let Latency = 3;
-  let NumMicroOps = 2;
-}
+  def PipeAB : ProcResGroup<[!cast<ProcResource>(NAME#"PipeA"),
+                             !cast<ProcResource>(NAME#"PipeB")]>;
 
-// Integer arithmetic and logic
-let Latency = 3 in {
-def : WriteRes<WriteIALU, [SiFive7PipeAB]>;
-def : WriteRes<WriteIALU32, [SiFive7PipeAB]>;
-def : WriteRes<WriteShiftImm, [SiFive7PipeAB]>;
-def : WriteRes<WriteShiftImm32, [SiFive7PipeAB]>;
-def : WriteRes<WriteShiftReg, [SiFive7PipeAB]>;
-def : WriteRes<WriteShiftReg32, [SiFive7PipeAB]>;
+  if extraVALU then
+  def VA1OrVA2 : ProcResGroup<[!cast<ProcResource>(NAME#"VA1"),
+                               !cast<ProcResource>(NAME#"VA2")]>;
 }
 
-// Integer multiplication
-let Latency = 3 in {
-def : WriteRes<WriteIMul, [SiFive7PipeB]>;
-def : WriteRes<WriteIMul32, [SiFive7PipeB]>;
-}
+multiclass SiFive7WriteResBase<int VLEN,
+    ProcResourceKind PipeA, ProcResourceKind PipeB, ProcResourceKind PipeAB,
+    ProcResourceKind IDiv, ProcResourceKind FDiv,
+    ProcResourceKind VA1, ProcResourceKind VA1OrVA2,
+    ProcResourceKind VL, ProcResourceKind VS,
+    ProcResourceKind VCQ,
+    SiFive7FPLatencies fpLatencies,
+    bit isFP64Throttled = false> {
 
-// Integer division
-def : WriteRes<WriteIDiv, [SiFive7PipeB, SiFive7IDiv]> {
-  let Latency = 66;
-  let ReleaseAtCycles = [1, 65];
-}
-def : WriteRes<WriteIDiv32,  [SiFive7PipeB, SiFive7IDiv]> {
-  let Latency = 34;
-  let ReleaseAtCycles = [1, 33];
-}
-
-// Integer remainder
-def : WriteRes<WriteIRem, [SiFive7PipeB, SiFive7IDiv]> {
-  let Latency = 66;
-  let ReleaseAtCycles = [1, 65];
-}
-def : WriteRes<WriteIRem32,  [SiFive7PipeB, SiFive7IDiv]> {
-  let Latency = 34;
-  let ReleaseAtCycles = [1, 33];
-}
+  // Branching
+  let Latency = 3 in {
+    def : WriteRes<WriteJmp, [PipeB]>;
+    def : WriteRes<WriteJal, [PipeB]>;
+    def : WriteRes<WriteJalr, [PipeB]>;
+  }
 
-// Bitmanip
-let Latency = 3 in {
-// Rotates are in the late-B ALU.
-def : WriteRes<WriteRotateImm, [SiFive7PipeB]>;
-def : WriteRes<WriteRotateImm32, [SiFive7PipeB]>;
-def : WriteRes<WriteRotateReg, [SiFive7PipeB]>;
-def : WriteRes<WriteRotateReg32, [SiFive7PipeB]>;
+  //Short forward branch
+  def : WriteRes<WriteSFB, [PipeA, PipeB]> {
+    let Latency = 3;
+    let NumMicroOps = 2;
+  }
 
-// clz[w]/ctz[w] are in the late-B ALU.
-def : WriteRes<WriteCLZ, [SiFive7PipeB]>;
-def : WriteRes<WriteCLZ32, [SiFive7PipeB]>;
-def : WriteRes<WriteCTZ, [SiFive7PipeB]>;
-def : WriteRes<WriteCTZ32, [SiFive7PipeB]>;
+  // Integer arithmetic and logic
+  let Latency = 3 in {
+    def : WriteRes<WriteIALU, [PipeAB]>;
+    def : WriteRes<WriteIALU32, [PipeAB]>;
+    def : WriteRes<WriteShiftImm, [PipeAB]>;
+    def : WriteRes<WriteShiftImm32, [PipeAB]>;
+    def : WriteRes<WriteShiftReg, [PipeAB]>;
+    def : WriteRes<WriteShiftReg32, [PipeAB]>;
+  }
 
-// cpop[w] look exactly like multiply.
-def : WriteRes<WriteCPOP, [SiFive7PipeB]>;
-def : WriteRes<WriteCPOP32, [SiFive7PipeB]>;
+  // Integer multiplication
+  let Latency = 3 in {
+    def : WriteRes<WriteIMul, [PipeB]>;
+    def : WriteRes<WriteIMul32, [PipeB]>;
+  }
 
-// orc.b is in the late-B ALU.
-def : WriteRes<WriteORCB, [SiFive7PipeB]>;
+  // Integer division
+  def : WriteRes<WriteIDiv, [PipeB, IDiv]> {
+    let Latency = 66;
+    let ReleaseAtCycles = [1, 65];
+  }
+  def : WriteRes<WriteIDiv32,  [PipeB, IDiv]> {
+    let Latency = 34;
+    let ReleaseAtCycles = [1, 33];
+  }
 
-// min/max are in the late-B ALU
-def : WriteRes<WriteIMinMax, [SiFive7PipeB]>;
+  // Integer remainder
+  def : WriteRes<WriteIRem, [PipeB, IDiv]> {
+    let Latency = 66;
+    let ReleaseAtCycles = [1, 65];
+  }
+  def : WriteRes<WriteIRem32,  [PipeB, IDiv]> {
+    let Latency = 34;
+    let ReleaseAtCycles = [1, 33];
+  }
 
-// rev8 is in the late-A and late-B ALUs.
-def : WriteRes<WriteREV8, [SiFive7PipeAB]>;
+  // Bitmanip
+  let Latency = 3 in {
+    // Rotates are in the late-B ALU.
+    def : WriteRes<WriteRotateImm, [PipeB]>;
+    def : WriteRes<WriteRotateImm32, [PipeB]>;
+    def : WriteRes<WriteRotateReg, [PipeB]>;
+    def : WriteRes<WriteRotateReg32, [PipeB]>;
 
-// shNadd[.uw] is on the early-B and late-B ALUs.
-def : WriteRes<WriteSHXADD, [SiFive7PipeB]>;
-def : WriteRes<WriteSHXADD32, [SiFive7PipeB]>;
-}
+    // clz[w]/ctz[w] are in the late-B ALU.
+    def : WriteRes<WriteCLZ, [PipeB]>;
+    def : WriteRes<WriteCLZ32, [PipeB]>;
+    def : WriteRes<WriteCTZ, [PipeB]>;
+    def : WriteRes<WriteCTZ32, [PipeB]>;
 
-// Single-bit instructions
-// BEXT[I] instruction is available on all ALUs and the other instructions
-// are only available on the SiFive7B pipe.
-let Latency = 3 in {
-def : WriteRes<WriteSingleBit, [SiFive7PipeB]>;
-def : WriteRes<WriteSingleBitImm, [SiFive7PipeB]>;
-def : WriteRes<WriteBEXT, [SiFive7PipeAB]>;
-def : WriteRes<WriteBEXTI, [SiFive7PipeAB]>;
-}
+    // cpop[w] look exactly like multiply.
+    def : WriteRes<WriteCPOP, [PipeB]>;
+    def : WriteRes<WriteCPOP32, [PipeB]>;
 
-// Memory
-def : WriteRes<WriteSTB, [SiFive7PipeA]>;
-def : WriteRes<WriteSTH, [SiFive7PipeA]>;
-def : WriteRes<WriteSTW, [SiFive7PipeA]>;
-def : WriteRes<WriteSTD, [SiFive7PipeA]>;
-def : WriteRes<WriteFST16, [SiFive7PipeA]>;
-def : WriteRes<WriteFST32, [SiFive7PipeA]>;
-def : WriteRes<WriteFST64, [SiFive7PipeA]>;
-
-let Latency = 3 in {
-def : WriteRes<WriteLDB, [SiFive7PipeA]>;
-def : WriteRes<WriteLDH, [SiFive7PipeA]>;
-def : WriteRes<WriteLDW, [SiFive7PipeA]>;
-def : WriteRes<WriteLDD, [SiFive7PipeA]>;
-}
+    // orc.b is in the late-B ALU.
+    def : WriteRes<WriteORCB, [PipeB]>;
 
-let Latency = 2 in {
-def : WriteRes<WriteFLD16, [SiFive7PipeA]>;
-def : WriteRes<WriteFLD32, [SiFive7PipeA]>;
-def : WriteRes<WriteFLD64, [SiFive7PipeA]>;
-}
+    // min/max are in the late-B ALU
+    def : WriteRes<WriteIMinMax, [PipeB]>;
 
-// Atomic memory
-def : WriteRes<WriteAtomicSTW, [SiFive7PipeA]>;
-def : WriteRes<WriteAtomicSTD, [SiFive7PipeA]>;
+    // rev8 is in the late-A and late-B ALUs.
+    def : WriteRes<WriteREV8, [PipeAB]>;
 
-let Latency = 3 in {
-def : WriteRes<WriteAtomicW, [SiFive7PipeA]>;
-def : WriteRes<WriteAtomicD, [SiFive7PipeA]>;
-def : WriteRes<WriteAtomicLDW, [SiFive7PipeA]>;
-def : WriteRes<WriteAtomicLDD, [SiFive7PipeA]>;
-}
+    // shNadd[.uw] is on the early-B and late-B ALUs.
+    def : WriteRes<WriteSHXADD, [PipeB]>;
+    def : WriteRes<WriteSHXADD32, [PipeB]>;
+  }
 
-// Half precision.
-let Latency = 5 in {
-def : WriteRes<WriteFAdd16, [SiFive7PipeB]>;
-def : WriteRes<WriteFMul16, [SiFive7PipeB]>;
-def : WriteRes<WriteFMA16, [SiFive7PipeB]>;
-}
-let Latency = 3 in {
-def : WriteRes<WriteFSGNJ16, [SiFive7PipeB]>;
-def : WriteRes<WriteFMinMax16, [SiFive7PipeB]>;
-}
+  // Single-bit instructions
+  // BEXT[I] instruction is available on all ALUs and the other instructions
+  // are only available on the B pipe.
+  let Latency = 3 in {
+    def : WriteRes<WriteSingleBit, [PipeB]>;
+    def : WriteRes<WriteSingleBitImm, [PipeB]>;
+    def : WriteRes<WriteBEXT, [PipeAB]>;
+    def : WriteRes<WriteBEXTI, [PipeAB]>;
+  }
 
-let Latency = 14, ReleaseAtCycles = [1, 13] in {
-def :  WriteRes<WriteFDiv16, [SiFive7PipeB, SiFive7FDiv]>;
-def :  WriteRes<WriteFSqrt16, [SiFive7PipeB, SiFive7FDiv]>;
-}
+  // Memory
+  def : WriteRes<WriteSTB, [PipeA]>;
+  def : WriteRes<WriteSTH, [PipeA]>;
+  def : WriteRes<WriteSTW, [PipeA]>;
+  def : WriteRes<WriteSTD, [PipeA]>;
+  def : WriteRes<WriteFST16, [PipeA]>;
+  def : WriteRes<WriteFST32, [PipeA]>;
+  def : WriteRes<WriteFST64, [PipeA]>;
+
+  let Latency = 3 in {
+  def : WriteRes<WriteLDB, [PipeA]>;
+  def : WriteRes<WriteLDH, [PipeA]>;
+  def : WriteRes<WriteLDW, [PipeA]>;
+  def : WriteRes<WriteLDD, [PipeA]>;
+  }
 
-// Single precision.
-let Latency = 5 in {
-def : WriteRes<WriteFAdd32, [SiFive7PipeB]>;
-def : WriteRes<WriteFMul32, [SiFive7PipeB]>;
-def : WriteRes<WriteFMA32, [SiFive7PipeB]>;
-}
-let Latency = 3 in {
-def : WriteRes<WriteFSGNJ32, [SiFive7PipeB]>;
-def : WriteRes<WriteFMinMax32, [SiFive7PipeB]>;
-}
+  let Latency = 2 in {
+  def : WriteRes<WriteFLD16, [PipeA]>;
+  def : WriteRes<WriteFLD32, [PipeA]>;
+  def : WriteRes<WriteFLD64, [PipeA]>;
+  }
 
-def : WriteRes<WriteFDiv32, [SiFive7PipeB, SiFive7FDiv]> { let Latency = 27;
-                                                         let ReleaseAtCycles = [1, 26]; }
-def : WriteRes<WriteFSqrt32, [SiFive7PipeB, SiFive7FDiv]> { let Latency = 27;
-                                                          let ReleaseAtCycles = [1, 26]; }
+  // Atomic memory
+  def : WriteRes<WriteAtomicSTW, [PipeA]>;
+  def : WriteRes<WriteAtomicSTD, [PipeA]>;
 
-// Double precision
-let Latency = 7 in {
-def : WriteRes<WriteFAdd64, [SiFive7PipeB]>;
-def : WriteRes<WriteFMul64, [SiFive7PipeB]>;
-def : WriteRes<WriteFMA64, [SiFive7PipeB]>;
-}
-let Latency = 3 in {
-def : WriteRes<WriteFSGNJ64, [SiFive7PipeB]>;
-def : WriteRes<WriteFMinMax64, [SiFive7PipeB]>;
-}
+  let Latency = 3 in {
+  def : WriteRes<WriteAtomicW, [PipeA]>;
+  def : WriteRes<WriteAtomicD, [PipeA]>;
+  def : WriteRes<WriteAtomicLDW, [PipeA]>;
+  def : WriteRes<WriteAtomicLDD, [PipeA]>;
+  }
 
-def : WriteRes<WriteFDiv64, [SiFive7PipeB, SiFive7FDiv]> { let Latency = 56;
-                                                         let ReleaseAtCycles = [1, 55]; }
-def : WriteRes<WriteFSqrt64, [SiFive7PipeB, SiFive7FDiv]> { let Latency = 56;
-                                                          let ReleaseAtCycles = [1, 55]; }
-
-// Conversions
-let Latency = 3 in {
-def : WriteRes<WriteFCvtI32ToF16, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtI32ToF32, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtI32ToF64, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtI64ToF16, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtI64ToF32, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtI64ToF64, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtF16ToI32, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtF16ToI64, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtF16ToF32, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtF16ToF64, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtF32ToI32, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtF32ToI64, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtF32ToF16, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtF32ToF64, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtF64ToI32, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtF64ToI64, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtF64ToF16, [SiFive7PipeB]>;
-def : WriteRes<WriteFCvtF64ToF32, [SiFive7PipeB]>;
-
-def : WriteRes<WriteFClass16, [SiFive7PipeB]>;
-def : WriteRes<WriteFClass32, [SiFive7PipeB]>;
-def : WriteRes<WriteFClass64, [SiFive7PipeB]>;
-def : WriteRes<WriteFCmp16, [SiFive7PipeB]>;
-def : WriteRes<WriteFCmp32, [SiFive7PipeB]>;
-def : WriteRes<WriteFCmp64, [SiFive7PipeB]>;
-def : WriteRes<WriteFMovI16ToF16, [SiFive7PipeB]>;
-def : WriteRes<WriteFMovF16ToI16, [SiFive7PipeB]>;
-def : WriteRes<WriteFMovI32ToF32, [SiFive7PipeB]>;
-def : WriteRes<WriteFMovF32ToI32, [SiFive7PipeB]>;
-def : WriteRes<WriteFMovI64ToF64, [SiFive7PipeB]>;
-def : WriteRes<WriteFMovF64ToI64, [SiFive7PipeB]>;
-}
+  // Half precision.
+  let Latency = fpLatencies.BasicFP16ALU in {
+  def : WriteRes<WriteFAdd16, [PipeB]>;
+  def : WriteRes<WriteFMul16, [PipeB]>;
+  def : WriteRes<WriteFMA16, [PipeB]>;
+  }
+  let Latency = 3 in {
+  def : WriteRes<WriteFSGNJ16, [PipeB]>;
+  def : WriteRes<WriteFMinMax16, [PipeB]>;
+  }
 
-// 6. Configuration-Setting Instructions
-let Latency = 3 in {
-def : WriteRes<WriteVSETVLI, [SiFive7PipeA]>;
-def : WriteRes<WriteVSETIVLI, [SiFive7PipeA]>;
-def : WriteRes<WriteVSETVL, [SiFive7PipeA]>;
-}
+  let Latency = 14, ReleaseAtCycles = [1, 13] in {
+  def :  WriteRes<WriteFDiv16, [PipeB, FDiv]>;
+  def :  WriteRes<WriteFSqrt16, [PipeB, FDiv]>;
+  }
 
-// 7. Vector Loads and Stores
-// Unit-stride loads and stores can operate at the full bandwidth of the memory
-// pipe. The memory pipe is DLEN bits wide on x280.
-foreach mx = SchedMxList in {
-  defvar Cycles = SiFive7GetCyclesDefault<mx>.c;
-  defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
-  let Latency = 4, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
-    defm "" : LMULWriteResMX<"WriteVLDE",    [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVLDFF",   [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
+  // Single precision.
+  let Latency = fpLatencies.BasicFP32ALU in {
+    def : WriteRes<WriteFAdd32, [PipeB]>;
+    def : WriteRes<WriteFMul32, [PipeB]>;
+    def : WriteRes<WriteFMA32, [PipeB]>;
+  }
+  let Latency = 3 in {
+    def : WriteRes<WriteFSGNJ32, [PipeB]>;
+    def : WriteRes<WriteFMinMax32, [PipeB]>;
   }
-  let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in
-  defm "" : LMULWriteResMX<"WriteVSTE",    [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
-}
 
-foreach mx = SchedMxList in {
-  defvar Cycles = SiFive7GetMaskLoadStoreCycles<mx>.c;
-  defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
-  let Latency = 4, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in
-  defm "" : LMULWriteResMX<"WriteVLDM",    [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
-  let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in
-  defm "" : LMULWriteResMX<"WriteVSTM",    [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
-}
+  def : WriteRes<WriteFDiv32, [PipeB, FDiv]> {
+    let Latency = 27;
+    let ReleaseAtCycles = [1, 26];
+  }
+  def : WriteRes<WriteFSqrt32, [PipeB, FDiv]> {
+    let Latency = 27;
+    let ReleaseAtCycles = [1, 26];
+  }
 
-// Strided loads and stores operate at one element per cycle and should be
-// scheduled accordingly. Indexed loads and stores operate at one element per
-// cycle, and they stall the machine until all addresses have been generated,
-// so they cannot be scheduled. Indexed and strided loads and stores have LMUL
-// specific suffixes, but since SEW is already encoded in the name of the
-// resource, we do not need to use LMULSEWXXX constructors. However, we do
-// use the SEW from the name to determine the number of Cycles.
-
-foreach mx = SchedMxList in {
-  defvar VLDSX0Cycles = SiFive7GetCyclesDefault<mx>.c;
-  defvar Cycles = SiFive7GetCyclesOnePerElement<mx, 8, SiFive7VLEN>.c;
-  defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
-  defm SiFive7 : LMULWriteResMXVariant<"WriteVLDS8",  VLDSX0Pred, [SiFive7VCQ, SiFive7VL],
-                                       4, [0, 1], [1, !add(1, VLDSX0Cycles)], !add(3, Cycles),
-                                       [0, 1], [1, !add(1, Cycles)], mx, IsWorstCase>;
-  let Latency = !add(3, Cycles), AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1,...
[truncated]
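
The `SiFive7FPLatencies` class in the diff turns the previously hard-coded FP latencies into a parameter. The sketch below shows one way such a record could be filled in and consumed. It is an assumption-laden illustration: `ToyDefaultFPLatencies`, `ToyWriteRes`, and `ToyFPWriteRes` are hypothetical names, and the values 5, 5, and 7 are simply the half, single, and double precision latencies from the removed lines.

```tablegen
// Same shape as the SiFive7FPLatencies class added in the diff.
class SiFive7FPLatencies {
  int BasicFP16ALU;
  int BasicFP32ALU;
  int BasicFP64ALU;
}

// Hypothetical record; the values are the latencies that the removed
// lines hard-coded for half, single, and double precision.
def ToyDefaultFPLatencies : SiFive7FPLatencies {
  let BasicFP16ALU = 5;
  let BasicFP32ALU = 5;
  let BasicFP64ALU = 7;
}

// Toy stand-in for WriteRes so the file needs no LLVM includes.
class ToyWriteRes<string write> {
  string Write = write;
  int Latency = 1;
}

// Mirrors the `let Latency = fpLatencies.<field> in` pattern from the diff.
multiclass ToyFPWriteRes<SiFive7FPLatencies fpLatencies> {
  let Latency = fpLatencies.BasicFP32ALU in {
    def FAdd32 : ToyWriteRes<"WriteFAdd32">;
    def FMul32 : ToyWriteRes<"WriteFMul32">;
  }
  let Latency = fpLatencies.BasicFP64ALU in {
    def FAdd64 : ToyWriteRes<"WriteFAdd64">;
    def FMul64 : ToyWriteRes<"WriteFMul64">;
  }
}

defm Toy : ToyFPWriteRes<ToyDefaultFPLatencies>;
```

A processor with different FP timing would then only need its own latencies record rather than a second copy of the `WriteRes` entries.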

Contributor

@wangpc-pp wangpc-pp left a comment

Just a general comment: would the abstraction make the code harder to maintain compared to putting the X390 model into a standalone file?

@mshockwave
Member Author

Just a general comment: would the abstraction make the code harder to maintain compared to putting the X390 model into a standalone file?

The idea behind a common abstraction is that any update to the scheduling models of X390 and the SiFive7 cores only needs to be done once, as they share many performance characteristics. So it actually lowers the maintenance cost.

Contributor

@wangpc-pp wangpc-pp left a comment

Can you split this PR into at least two PRs? It is massive and hard to review:

  1. Adding abstraction.
  2. Adding X390.

@mshockwave
Member Author

Can you split this PR into at least two PRs? It is massive and hard to review:

  1. Adding abstraction.
  2. Adding X390.

Good idea, the abstraction layer is now in #144442

Contributor

@wangpc-pp wangpc-pp left a comment

LGTM.
I still feel uncomfortable about the complexity (too many conditions, and I personally prefer straightforward definitions) but I think we can improve it progressively.

@mshockwave
Member Author

mshockwave commented Jun 17, 2025

LGTM. I still feel uncomfortable about the complexity (too many conditions, and I personally prefer straightforward definitions) but I think we can improve it progressively.

Yeah, I agree the complexity is a bit daunting. I feel that, fundamentally, the scheduling model infrastructure should be more ergonomic for creating hierarchical scheduling models. In fact, many moons ago, at the inception of the current instruction scheduling model framework, I think people were considering a way to have a "base" / parent scheduling model (there is a comment left somewhere in the TableGen code that shows this), but obviously it went nowhere.

@wangpc-pp
Contributor

LGTM. I still feel uncomfortable about the complexity (too many conditions, and I personally prefer straightforward definitions) but I think we can improve it progressively.

Yeah, I agree the complexity is a bit daunting. I feel that, fundamentally, the scheduling model infrastructure should be more ergonomic for creating hierarchical scheduling models. In fact, many moons ago, at the inception of the current instruction scheduling model framework, I think people were considering a way to have a "base" / parent scheduling model (there is a comment left somewhere in the TableGen code that shows this), but obviously it went nowhere.

I had the same plan but I didn't have much time to give it a try. Please feel free to tag me when you are trying to fix these scheduling model issues. :-)

@mshockwave
Member Author

Any comments from other reviewers? If there aren't any, I intend to merge this series of patches early next week.

Co-authored-by: Michael Maitland <michaeltmaitland@gmail.com>
@mshockwave mshockwave force-pushed the patch/rvv/x390-sched-model branch from d123b88 to dc41b77 Compare June 23, 2025 16:50
@mshockwave mshockwave merged commit f40909f into llvm:main Jun 23, 2025
5 of 7 checks passed
@mshockwave mshockwave deleted the patch/rvv/x390-sched-model branch June 23, 2025 17:06
Jaddyen pushed a commit to Jaddyen/llvm-project that referenced this pull request Jun 23, 2025
This patch adds the scheduling model for sifive-x390. X390 is a dual
issue in-order CPU. It has two scalar and two vector pipes, with
VLEN=1024 and DLEN=512.

Co-authored-by: Michael Maitland <michaeltmaitland@gmail.com>
anthonyhatran pushed a commit to anthonyhatran/llvm-project that referenced this pull request Jun 26, 2025
This patch adds the scheduling model for sifive-x390. X390 is a dual
issue in-order CPU. It has two scalar and two vector pipes, with
VLEN=1024 and DLEN=512.

Co-authored-by: Michael Maitland <michaeltmaitland@gmail.com>
3 participants