
Support fine grained module level build definitions #1552

Open
aherrmann opened this issue Jun 3, 2021 · 37 comments · Fixed by #1623
@aherrmann
Member

Goal

Add a rule haskell_module to support fine grained build definitions at the level of individual Haskell modules.

Motivation

The existing haskell_library|binary|test rules operate at the level of Haskell packages. In these rules, building consists of two main steps with corresponding Bazel build actions: all modules of a package are compiled in one build action, and all modules are linked together into a library in a separate action (one for static and one for dynamic linking). Additionally, there is an action to construct the package's package database, and several supporting actions.

This means that all modules of such a package need to be recompiled even if only a single module in a package changed compared to a previous cached build.

With a haskell_module rule we could instead issue one compile action per Haskell module. This would allow us to benefit from Bazel's caching and parallel builds on the level of individual modules.

Outline

Not all the details are worked out at this stage, however, the idea is that the haskell_module rule may be used like this:

haskell_module(
    name = "ModuleA",
    src = "ModuleA.hs",
    deps = ["@stackage//:base"],
)
haskell_module(
    name = "ModuleB",
    src = "ModuleB.hs",
    deps = [
        "@stackage//:base",
        ":ModuleA",
    ],
)
haskell_library(
    name = "my-library",
    modules = [
        ":ModuleA",
        ":ModuleB",
    ],
    deps = ["@stackage//:base"],
)

In this example the haskell_module rules perform the compilation of modules to object and interface files, and the haskell_library rule performs the linking of these compiled modules into a static archive or shared object and the generation of a package database for the package.

Alternatives considered

The goal is to provide what we called "sub-target incremental builds" in the context of the current haskell_library|binary|test rules, i.e. to avoid recompiling modules that have not changed between builds, even if another module that is part of the same library or executable target did change. An alternative approach that we considered is to use Bazel's persistent worker mechanism to provide a persistent GHC session that caches previous compilation outputs (object and interface files). Downsides of this approach are that:

  • It is a more complex solution, since it requires maintaining our own cache and maintaining state across worker invocations.
  • It introduces potential non-hermeticity, since the worker would persist state across build actions outside of Bazel's control.
  • The benefits are limited to a worker instance, i.e. modules cannot be shared across build nodes through a remote cache.

Anticipated problems

Slower uncached builds

Starting one GHC session per module may well be slower than starting a single GHC session that compiles multiple modules at once. This means there is a tradeoff between uncached builds from scratch and incremental, cached builds. For everyday development, the incremental, cached build is likely to be more common than an uncached build from scratch, so the benefit should outweigh the cost in the common case. Furthermore, we could still use a persistent worker to save on the repeated start-up of GHC sessions, similar to how it is already done for Java or Scala in Bazel. In that case we would not need to maintain a cache of compilation outputs in the persistent worker, reducing the downsides of this approach.

rules_haskell already contains an implementation of a persistent worker that could be adapted to this use-case.

More verbose build definitions

Haskell compilation has to occur in dependency order: The Haskell compiler requires interface files of modules that the current module depends on as an input in order to compile the current module. This means that haskell_module rules need to spell out the Haskell module dependency graph in the BUILD files. Manually maintaining this dependency graph in both the Haskell sources and the BUILD files is tedious. BUILD file generation, e.g. a Gazelle extension for Haskell, can solve this issue.

@aherrmann
Member Author

The implementation of this feature has progressed (#1553, #1564, #1565, #1568, #1579, #1590) and we have discovered some design questions with the API that need to be resolved.

Haskell module compilation is trickier than, say, C module compilation, because we need metadata about the final library to compile the individual module, most notably -this-unit-id, which needs to know the name of the package that the module is part of. So, there is a reverse dependency going from the haskell_library back to the haskell_module.

Extremes

It's a large design space, but I think it helps to consider two extremes between which haskell_module could lie:

  • Either haskell_module is very explicit and acts like a rule in a Makefile. I.e. it is fully explicit about all compilation flags and produces a .o and .hi file itself.
  • Or it is more abstract and only defines the inter-module dependency graph and the actual compilation is deferred until the module is used in a haskell_library|binary|test target.

Explicit

The explicit extreme could look something like this:

haskell_module(
    name = "Module.A",
    src = "Module/A.hs",
    deps = [
        "@stackage//:base",
    ],
    # Package name and version - duplicated across modules and library target
    # Relevant for flags such as -this-unit-id, -optP-DCURRENT_PACKAGE_KEY
    package_name = "mypackage",
    package_version = "0.1.0.0",
    # common flags - duplicated across modules
    # Relevant for compilation
    ghcopts = ["-XOverloadedStrings"],
)
haskell_module(
    name = "Module.B",
    src = "Module/B.hs",
    deps = [
        "@stackage//:base",
        ":anotherpackage",
        ":Module.A",
    ],
    package_name = "mypackage",
    package_version = "0.1.0.0",
    ghcopts = ["-XOverloadedStrings"],
)
haskell_library(
    name = "mypackage",
    package_name = "mypackage",  # optional: defaults to name
    version = "0.1.0.0",
    modules = [":Module.A", ":Module.B"],
    # We need to repeat package deps for the package-db entry.
    deps = [
        "@stackage//:base",
        ":anotherpackage",
    ],
    ghcopts = [...],  # only linking options go here
)

Where bazel build //:Module.A produces bazel-bin/Module/A.o (or similar).

Note that many attributes are repeated across the modules and the library, e.g. the package name or ghcopts.

Abstract

The abstract extreme could look something like this instead:

haskell_module(
    name = "Module.A",
    src = "src/Module/A.hs",
)
haskell_module(
    name = "Module.B",
    src = "src/Module/B.hs",
    deps = [
        ":Module.A",
    ],
)
haskell_library(
    name = "mypackage",
    package_name = "mypackage",  # optional: defaults to name
    version = "0.1.0.0",
    modules = [":Module.A", ":Module.B"],
    deps = [
        "@stackage//:base",
        ":anotherpackage",
    ],
    ghcopts = ["-XOverloadedStrings"],
)

Where bazel build //:Module.A produces no output. Instead the module is only built when //:mypackage is built.
(There are several ways to achieve this: transitions, aspects, compile inside haskell_library. But, let's focus on the API side first.)

In this case haskell_library collects attributes like package name or ghcopts for all modules.
(This approach could still allow for per-module ghcopts as well, if deemed necessary.)

Pros & Cons

Explicit

A clear advantage of the explicit approach is that it is very simple to understand.

A clear downside is that it requires a lot of very mechanical duplication. This is perhaps less of a concern when using a Gazelle extension to generate the haskell_module targets. But, it's still very verbose.

Another downside is that the package_name|version attributes, used to generate -this-unit-id, mean that each such target can only be used in one haskell_library. If, for whatever reason, one wants to include the module in two separate haskell_library targets, then one needs two separate haskell_module targets as well.

It's also easy to make mistakes, such as using a haskell_module in a haskell_library with a mismatching package-name (ideally haskell_library should raise an error in this case).

Abstract

The abstract approach on the other hand is much less verbose and the haskell_module targets are more high-level: They declare the module dependencies but hide details like -this-unit-id.

The downside is of course that it is less explicit: It's not the haskell_module alone that produces a .o|.hi, but only the combination of haskell_module and haskell_library.

Targets that don't themselves produce anything seem a bit strange. This is not without precedent though: proto_library targets are similar. They themselves only declare the interdependency between proto files, but don't generate any bindings. Only when paired with one of the <lang>_proto_library rules will Bazel generate any output, in that case typically through aspects.

Current Status

The current implementation lies a little in between both approaches. It largely follows the explicit approach, but the package name is implicitly forwarded from the haskell_library to the haskell_module via a transition. In its current form this has the issue that bazel build //... will build the haskell_module twice: once without -this-unit-id when building the haskell_module target, and once more with -this-unit-id when building the haskell_library.

Going Forward

I'd argue that we should commit more closely to one of these approaches. Either take the explicit approach and incur the duplication of the package name, but have haskell_module produce the .o and .hi files by itself.

Or make haskell_module more abstract, but then also avoid compiling the code twice, i.e. haskell_module alone does not produce a .o or .hi file. If we go the abstract route, then we may as well make full use of the benefits and reduce duplication by letting users declare shared attributes on the haskell_library|binary|test target.

@Radvendii @facundominguez thoughts?

@facundominguez
Member

Or make haskell_module more abstract, but then also avoid compiling the code twice.

How do you avoid compiling twice if the module needs to be built with different -this-unit-id flags?

Also, I have some recollections of cc_library not committing to build a library until seeing how it would be linked with cc_binary.

@aherrmann
Member Author

Or make haskell_module more abstract, but then also avoid compiling the code twice.

How do you avoid compiling twice if the module needs to be built with different -this-unit-id flags?

Well, if it needs to be built with different unit-ids then you can't avoid building it multiple times. But, the point is that, right now, it is being built twice even if it is only needed with one unit-id.

Also, I have some recollections of cc_library not committing to build a library until seeing how it would be linked with cc_binary.

If you build a cc_library target, either directly or bazel build //..., then it will build both the .a and .so (unless you set linkstatic=True). But, the actions that generate the .a and .so are separate actions. So, if you bazel build //:some-target that links the cc_library statically, then only the .a will be requested and built.

@Radvendii
Member

Radvendii commented Oct 12, 2021

Another option is to modify what we currently have slightly by having the haskell_module() rule detect when it is being built without a package name supplied (typically from being used by a haskell_library()) and not build anything in that case (since it can't really be used by anything).

Pros:

  • bazel build //... doesn't build twice
  • the haskell_module() rule is the one doing the building

Cons:

  • The rule inconsistently builds or doesn't build depending on how it's used (this could maybe be mitigated by emitting a message / warning when it's built on its own)

There is another option which combines all three approaches which has the advantage of enabling both approaches depending on need and the disadvantage of being more complex. This would be that haskell_module() would build if supplied a package_name attribute, and only supply the information necessary to build if not. haskell_library() would know how to deal with both types of outputs.

Now that I write it out that seems not worth the complexity

@facundominguez
Member

Another option is to modify what we currently have slightly by having the haskell_module() rule detect when it is being built without a package name supplied (typically from being used by a haskell_library()) and not build anything in that case

Do haskell_binary rules supply a package name to haskell_modules?

@Radvendii
Member

That's a good point. We could make them do that, but that might be even more confusing.

I'm not sure I quite understand the relationship between binaries and modules. Why not bundle modules up in a library and have the binary depend on that?

@facundominguez
Member

I'm not sure I quite understand the relationship between binaries and modules. Why not bundle modules up in a library and have the binary depend on that?

It looks feasible, but more bureaucratic than defining a binary with a single rule.

@aherrmann
Member Author

Another option is to modify what we currently have slightly by having the haskell_module() rule detect when it is being built without a package name supplied (typically from being used by a haskell_library()) and not build anything in that case (since it can't really be used by anything).

Pros:

  • bazel build //... doesn't build twice
  • the haskell_module() rule is the one doing the building

As I understand it, this falls into the abstract side. Building a haskell_module alone won't do anything, because no package_name is supplied. Only building a haskell_library|binary|test that depends on it will cause it to be built, because it supplies a package name. Note, the pro "the haskell_module() rule is the one doing the building" can also be achieved without this kind of detection mechanism. If haskell_module doesn't return the object in DefaultInfo, but only in HaskellModuleInfo, then building the target directly won't do anything, only building a haskell_library|binary|test that depends on it will request the object and therefore trigger its build. The build action can still be emitted by haskell_module.

Or phrased as a question: Is there an observable difference between this approach and just never returning an object in DefaultInfo so that it is only ever requested through a haskell_library|binary|test dependency via HaskellModuleInfo?
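For concreteness, the provider-only variant could look roughly like this (a hedged Starlark sketch; the HaskellModuleInfo fields and the _compile_module helper are assumptions for illustration, not the actual rules_haskell implementation):

```starlark
HaskellModuleInfo = provider(fields = ["interface", "object"])

def _haskell_module_impl(ctx):
    # The compile action is still emitted by haskell_module itself...
    interface = ctx.actions.declare_file(ctx.label.name + ".hi")
    object = ctx.actions.declare_file(ctx.label.name + ".o")
    _compile_module(ctx, ctx.file.src, interface, object)  # hypothetical helper
    return [
        # ...but DefaultInfo carries no files, so `bazel build //:Module.A`
        # requests nothing. Only a haskell_library|binary|test that reads
        # HaskellModuleInfo will request the outputs and thereby trigger
        # the compile action.
        DefaultInfo(),
        HaskellModuleInfo(interface = interface, object = object),
    ]
```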

Another option is to modify what we currently have slightly by having the haskell_module() rule detect when it is being built without a package name supplied (typically from being used by a haskell_library()) and not build anything in that case

Do haskell_binary rules supply a package name to haskell_modules?

haskell_binary does not supply a package name. In that case GHC defaults to main (Side note: The language around the name main is surprisingly strong given that it can be changed with -main-is).

I'm not sure I quite understand the relationship between binaries and modules. Why not bundle modules up in a library and have the binary depend on that?

It looks feasible, but more bureaucratic than defining a binary with a single rule.

Agreed, it seems unnecessary to enforce an intermediate haskell_library.

There is another option which combines all three approaches which has the advantage of enabling both approaches depending on need and the disadvantage of being more complex. This would be that haskell_module() would build if supplied a package_name attribute, and only supply the information necessary to build if not. haskell_library() would know how to deal with both types of outputs.

Now that I write it out that seems not worth the complexity

Yeah, that sounds like adding a lot of complexity without clear benefit.

@Radvendii
Member

If haskell_module doesn't return the object in DefaultInfo, but only in HaskellModuleInfo, then building the target directly won't do anything

Ohh I didn't realize it was this simple. This would be my vote, then, because it's the way I think I'd prefer to use rules_haskell. I tend to get very annoyed about needless duplication.

On the other hand, if the way people are expected to use haskell_module is almost always via gazelle, then I think a simpler underlying mechanism is better.

Do we know how it's going to be used in practice? I don't have a sense for how ubiquitous gazelle is.

@facundominguez
Member

facundominguez commented Oct 13, 2021

At first, I expected the abstract approach to help avoid setting ghcopts for every module of a library. But now I realize that perhaps this can be done with the explicit approach via transitions. If that is the case, then this becomes pretty much an implementation choice. I don't think the difference would be observable to the user, unless she goes peering at the actions that implement the rules.

I would expect the abstract approach to be simpler, because all the information is gathered from the haskell_module targets and the build happens later in the library or binary rules, without using transitions. Transitions, in my opinion, spread the task among more concepts than we would otherwise need.

@aherrmann
Member Author

At first, I expected the abstract approach to help avoid setting ghcopts for every module of a library. But now I realize that perhaps this can be done with the explicit approach via transitions. If that is the case, then this becomes pretty much an implementation choice. I don't think it would be obvious for the user to observe the difference, unless she goes into peering at the actions that implement the rules.

The way I intended that distinction, this would fall closer to the "abstract" side. With the "explicit" vs. "abstract" distinction I was referring to the exposed API, not to the implementation. Whether we use transitions to achieve this or not is an implementation choice. But, as far as the user is concerned, if ghcopts (and other attributes) are not fully declared on the haskell_module, then the module rule is not fully explicit; it implicitly requires parameters from the haskell_library|binary|test that depends on it in some way. So, haskell_module cannot be built in isolation.

That's what I meant above by

Where bazel build //:Module.A produces no output. Instead the module is only built when //:mypackage is built.
(There are several ways to achieve this: transitions, aspects, compile inside haskell_library. But, let's focus on the API side first.)


I would expect the abstract approach to be simpler, because all the information is gathered in haskell_module to later build in the library or binary rules without using transitions. Transitions, in my opinion, spread the task among more concepts than we would otherwise need.

As stated above, I think this now discusses the implementation more than the API design. I assume that you mean an implementation where the build actions are not emitted by haskell_module but only later by haskell_library. Yes, I agree that this is simpler. Transitions introduce a fair bit of complexity. I'm also concerned that bazelbuild/bazel#13281 might come back to bite us in this case.

@facundominguez
Member

I don't see much benefit in the explicit API. If a module needs to be compiled with specific ghcopts, why not use the OPTIONS_GHC pragma instead? Are there any other attributes that would be desirable to specify per module?

@aherrmann
Member Author

Are there any other attributes that would be desirable to specify per module?

Plugins come to mind, as discussed in #1566 (comment) in the past. That said, the "abstract" approach does not preclude that option.

@Radvendii
Member

IIUC, the advantage of the explicit approach is that it is simpler to understand and implement, meaning (potentially) fewer bugs. Is that right? Is there anything else to it?

@facundominguez
Member

IIUC, the advantage of the explicit approach is that it is simpler to understand and implement, meaning (potentially) fewer bugs. Is that right? Is there anything else to it?

At this point I'm totally confused about whether we are discussing the implementation or the interface. :)

If we are discussing the interface, then the "abstract" approach is to allow haskell_library and haskell_binary rules to influence what compilation options are used to build modules. The "explicit" approach does not allow this (i.e. all the compilation options need to be specified on the haskell_module rule).

Both interfaces can be implemented by deferring or not deferring compilation actions to the haskell_library and haskell_binary rules. Of the mix of interfaces and implementation:

  1. explicit interface - deferred actions: simple implementation
  2. explicit interface - non-deferred actions: simplest implementation
  3. abstract interface - deferred actions: simple implementation
  4. abstract interface - non-deferred actions: requires transitions

I think the easiest to use is probably (3), and the simplest to implement is (2).

@aherrmann
Member Author

@Radvendii

IIUC, the advantage of the explicit approach is that it is simpler to understand and implement, meaning (potentially) fewer bugs. Is that right? Is there anything else to it?

Yes, I agree. It is the only option that allows the haskell_module to be built in isolation in a meaningful way, i.e. bazel build //:My.Module produces a .o and .hi file, such that those are the same files that a haskell_library or similar depending on the module will use.


@facundominguez

At this point I'm totally confused about whether we are discussing the implementation or the interface. :)
If we are discussing the interface [...]

Yes, the interface.

Thanks for the good summary, I agree.
I'd say option 1 is not something we'd want, since it incurs the verbosity of the explicit interface but doesn't provide the benefit of a target that can be built in isolation as described above.

@Radvendii
Member

Yeah, good summary @facundominguez that helped clear things up in my head.

For reference, (4) is what we currently have implemented. I agree that (2) and (3) have significant advantages though.

I don't really have a good grasp on using rules_haskell, so I don't have strong opinions on which is better.

@aherrmann
Member Author

aherrmann commented Oct 20, 2021

I think the abstract interface is preferable. From a user's perspective it provides a higher-level and more declarative interface: the user defines the module dependency graph, not the mechanics behind building individual modules. The explicit approach not only introduces more tedious duplication, it also increases the potential for user errors, e.g. by trying to bundle modules with different package names into the same library target - something that rules_haskell would then have to detect to produce a meaningful error message.

Regarding implementation, i.e. the choice between (3) or (4), I think (3) is preferable. I did a test to see if we are affected by bazelbuild/bazel#13281, and we are indeed. I've pushed an example that triggers the issue in acb030b. This constructs a diamond-shaped dependency graph like this

         --> Module.A --> lib-a -->
lib-root                            test
         --> Module.B --> lib-b -->

Meaning test directly depends on lib-a and lib-b, and transitively on the modules and lib-root.

In this case we'd want lib-root to only be built once, and the same objects and interfaces to be used for all dependents. But, that is not what we observe:

$ bazel clean; bazel test //tests/haskell_module/diamond:test --execution_log_json_file=execlog.json
$ jq 'select(.mnemonic == "HaskellBuildLibrary")|{progressMessage, outputPaths: [.actualOutputs[].path]}' <execlog.json
{
  "progressMessage": "HaskellBuildLibrary //tests/haskell_module/diamond:lib-root",
  "outputPaths": [
    "bazel-out/k8-fastbuild-ST-2e826c114ca7/bin/tests/haskell_module/diamond/_obj/lib-root/Module/Root.o",
    ...
  ]
}
{
  "progressMessage": "HaskellBuildLibrary //tests/haskell_module/diamond:lib-root",
  "outputPaths": [
    "bazel-out/k8-fastbuild-ST-3ce0d30d28dc/bin/tests/haskell_module/diamond/_obj/lib-root/Module/Root.o",
    ...
  ]
}
{
  "progressMessage": "HaskellBuildLibrary //tests/haskell_module/diamond:lib-root",
  "outputPaths": [
    "bazel-out/k8-fastbuild/bin/tests/haskell_module/diamond/_obj/lib-root/Module/Root.o",
    ...
  ]
}

Meaning the library is actually built three times. That's precisely the issue described in bazelbuild/bazel#13281. So, transitions are not a good approach in this case.

EDIT
I noticed that I linked to the wrong commit. The fixed example is available at cb567e4. The corresponding commands are

$ bazel clean; bazel test //tests/haskell_module/generated:test --execution_log_json_file=execlog.json
$ jq 'select(.mnemonic == "HaskellBuildLibrary")|{progressMessage, outputPaths: [.actualOutputs[].path]}' <execlog.json

(Tested with Nix and direnv installed and direnv enabled in the repository.)

@facundominguez
Member

facundominguez commented Nov 2, 2021

I'm currently trying to implement (3), however, it feels like going against the current so far.

The task is creating the compile actions for the modules, given the descriptions/targets of the haskell_module rules that have been deferred.

My first try was a recursive function: for every haskell_module target, the function would first create the actions to compile its dependencies, and then create the action for the haskell_module target itself.

The above is very simple, but it doesn't work because Bazel's Starlark doesn't allow recursive functions!

My next attempt was to encode the recursion with an explicit stack. This seems to work, but since Starlark doesn't offer while loops, I resorted to a for loop over a very long range, which is terminated early when the stack-encoded recursion finishes.

Is this the right way to go?

Here's the function:

def _build_haskell_modules(ctx, hs, cc, posix, package_name, hidir, odir, module_outputs, module_dep):
    # This is an encoding of a recursive function, since bazel doesn't
    # allow recursion.
    # The stack holds the module target being processed and the list of
    # results collected from its dependencies so far.
    module_stack = [(module_dep, [])]
    # The loop breaks when the stack is empty
    for i in range(0, 0x7fffffff):
        dep, children_results = module_stack[len(module_stack) - 1]
        if len(children_results) == 0 and (dep.label in module_outputs or not HaskellModuleInfo in dep):
            # Don't recurse if the result has been already computed
            # or dependency is not from a haskell_module rule
            if HaskellModuleInfo in dep:
                outputs = module_outputs[dep.label]
            else:
                outputs = (depset(), depset())
            module_stack.pop()
            if len(module_stack) == 0:
                return outputs
            # Add the result to the result list of the parent
            _, children_results = module_stack[len(module_stack) - 1]
            children_results.append(outputs)
        elif len(children_results) < len(dep[HaskellModuleInfo].attr.deps):
            # Create the action for the next child dependency
            module_stack.append((dep[HaskellModuleInfo].attr.deps[len(children_results)], []))
        else:
            # All actions for all the children have been created
            # Now create the action for the current target.
            transitive_interfaces = [interface_set for (interface_set, _) in children_results]
            transitive_objs = [obj_set for (_, obj_set) in children_results]
            interfaces, objs = _build_haskell_module(
                ctx,
                dep[HaskellModuleInfo].attr,
                hs,
                cc,
                posix,
                package_name,
                hidir,
                odir,
                depset(transitive = transitive_interfaces),
                depset(transitive = transitive_objs),
            )
            outputs = (
                depset(direct = interfaces, transitive = transitive_interfaces),
                depset(direct = objs, transitive = transitive_objs),
            )
            module_outputs[dep.label] = outputs
            module_stack.pop()
            if len(module_stack) == 0:
                return outputs
            # Add the result to the result list of the parent
            _, children_results = module_stack[len(module_stack) - 1]
            children_results.append(outputs)

@facundominguez
Member

And here's another question: if a module is used in two haskell_binary rules, and each binary rule creates its own action to compile the module, we should expect the module to be compiled twice and there is little we can do to avoid this, right?

This is something that affects erasing library boundaries when defining haskell_module rules. Before, if modules originally coming from libraries ended up used in multiple binaries, they wouldn't be rebuilt. But this no longer appears to be the case with deferred actions.

@aherrmann
Member Author

My first try was a recursive function: for every haskell_module target, the function would first create the actions to compile its dependencies, and then create the action for the haskell_module target itself.
The above is very simple, but it doesn't work because Bazel's Starlark doesn't allow recursive functions!

Correct, Starlark does not allow recursion, or unbounded loops in general. That said, I don't think this is required in this case.

One difficulty seems to be calculating the set of transitive module dependencies in the haskell_library|... rule implementation. Fortunately, you don't have to do this in that place. Instead, you can let Bazel do it for you at the level of the haskell_module rule by returning a field for transitive dependencies in the HaskellModuleInfo provider.
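A hedged sketch of what that could look like in haskell_module (the field names are assumptions; `attr` mirrors the field already used in the stack-based code above):

```starlark
HaskellModuleInfo = provider(fields = ["attr", "transitive_module_deps"])

def _haskell_module_impl(ctx):
    module_deps = [d for d in ctx.attr.deps if HaskellModuleInfo in d]
    return [HaskellModuleInfo(
        attr = ctx.attr,
        # Bazel merges the dependencies' depsets here; no rule
        # implementation ever needs to walk the module graph recursively.
        transitive_module_deps = depset(
            direct = module_deps,
            transitive = [
                d[HaskellModuleInfo].transitive_module_deps
                for d in module_deps
            ],
        ),
    )]
```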

Another important feature of Bazel is that the build actions don't need to be emitted in any particular order. So, you could do something along the lines of this (just a sketch):

# Declare module outputs
module_outputs = {
    mod.label: struct(
        interface = ctx.actions.declare_file(...),
        object = ctx.actions.declare_file(...),
    )
    for mod in modules
}

# Declare build actions
for mod in modules:
    outputs = module_outputs[mod.label]
    inputs = []
    for dep in mod.transitive_module_dependencies:
        inputs.append(module_outputs[dep.label].interface)
        inputs.append(module_outputs[dep.label].object)
    _run_ghc(
        outputs = outputs,
        inputs = inputs,
        ...
    )

And here's another question: if a module is used in two haskell_binary rules, and each binary rule creates its own action to compile the module, we should expect the module to be compiled twice and there is little we can do to avoid this, right?

Yes, that is true. Given the odir, hidir constraint, i.e. that all modules of a library or binary need to have their interfaces and objects in the same directory, I don't see much that we can do about this.

@facundominguez
Member

When trying to decide what data structure to use to collect transitive dependencies of haskell_module, I'm not finding an obvious choice.

This is mod.transitive_module_dependencies in Andreas's sketch. Semantically it would be a set of Targets. Bazel recommends using depset for collecting values transitively, but I imagine that checking for Target equality is overkill, since comparing only the labels would be sufficient.

But the other alternatives also seem to impose some merging overhead, as explained in the documentation: https://docs.bazel.build/versions/main/skylark/depsets.html

Looks like the most efficient data structure would be depmap, something like a depset but dedups by looking at a key instead of looking at the values.

Any thoughts on how to proceed here?

@aherrmann

A depset seems like the best data structure, and I'm not aware of a depset of Targets being a particular performance concern. As far as I know, depset doesn't use value equality but compares hashes instead. I'm not sure off the top of my head how exactly this works for depsets of Targets, but there are cases in the Bazel code base where the hash code of a target is calculated based only on its label and configuration key, and frequent re-computation of the hash is avoided by caching it. In other words, I'd expect it to behave similarly to the depmap you describe.
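To see why hashing by label behaves like the proposed "depmap", here is a plain-Python simulation (not Bazel code; the Target class and labels are made up): objects that hash and compare only by their label dedup by key, regardless of any other payload they carry.

```python
class Target:
    """Toy stand-in for a Bazel Target: equality and hashing use only the label."""

    def __init__(self, label, payload):
        self.label = label
        self.payload = payload

    def __eq__(self, other):
        return isinstance(other, Target) and self.label == other.label

    def __hash__(self):
        return hash(self.label)

# Two Target objects with the same label but different payloads
# collapse to a single entry, just like a key-based "depmap" would.
a1 = Target("//pkg:ModuleA", payload=1)
a2 = Target("//pkg:ModuleA", payload=2)
b = Target("//pkg:ModuleB", payload=3)

deduped = {a1, a2, b}
print(len(deduped))  # 2
```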

The more concerning part, performance-wise, is that we need to flatten the depset inside haskell_library|... to generate all the build actions. However, I don't really see a way around this, and so long as there is no overlap of modules between haskell_library|... targets it also wouldn't be quadratic.

@facundominguez

facundominguez commented Nov 3, 2021

The more concerning part, performance-wise, is that we need to flatten the depset inside haskell_library|... to generate all the build actions. However, I don't really see a way around this, and so long as there is no overlap of modules between haskell_library|... targets it also wouldn't be quadratic.

We have to flatten the transitive dependencies once per compile action for a module in order to produce the list of input interface files. This does look quadratic in the number of modules in a library. Aren't we better off in this respect with the stack-encoded recursion?

@aherrmann

We have to flatten the transitive dependencies once per compile action for a module in order to produce the list of input interface files. This does look quadratic in the number of modules in a library.

Ah, yes, you're right. I was looking at it the wrong way.

This reminds me that we have a somewhat similar situation with the Haskell toolchain libraries, implemented here. In that case we exploit the fact that a depset can give us the dependencies in a well-defined order.

I think something like this could be used here as well. If we can iterate over the modules in the right order, such that A appears before B if B depends on A, then we can avoid iterating over each module's transitive dependencies. Another sketch illustrates what I have in mind:

# Declare module outputs
module_outputs = {
    mod: [
        ctx.actions.declare_file(...),  # interface
        ctx.actions.declare_file(...),  # object
    ]
    for mod in modules
}

# List of modules in post-order
# Meaning, if module B depends on module A, then A will appear before B.
ordered_modules = depset(order = "postorder", transitive = [
    mod[HaskellModuleInfo].transitive_module_deps  # Created with order = "postorder"
    for mod in modules
])

# Collect module build inputs and declare build actions
# This is the only place where we flatten the depset
module_inputs = {}
for mod in ordered_modules.to_list():
    outputs = module_outputs[mod]
    inputs = depset(
        direct = [
            out
            for dep in mod[HaskellModuleInfo].direct_module_deps
            for out in module_outputs[dep]
        ],
        transitive = [
            module_inputs[dep]  # Will be set by a previous iteration, since all deps were visited before.
            for dep in mod[HaskellModuleInfo].direct_module_deps
        ],
    )
    module_inputs[mod] = inputs
    _run_ghc(
        outputs = outputs,
        inputs = inputs,
        ...
    )

This flattens the depset of all modules only once, and for each module only iterates over its direct dependencies.
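Outside Bazel, the ordering trick can be simulated in plain Python (module names are made up): each module's input set is assembled from its direct deps only, yet comes out transitive, because every dependency has already been processed by the time its dependents are visited.

```python
# Direct dependency graph: module -> direct deps.
# Hypothetical modules: C depends on B, B depends on A.
graph = {"A": [], "B": ["A"], "C": ["B"]}

def postorder(graph):
    """Return modules so that dependencies come before dependents."""
    seen, order = set(), []

    def visit(mod):
        if mod in seen:
            return
        seen.add(mod)
        for dep in graph[mod]:
            visit(dep)
        order.append(mod)

    for mod in graph:
        visit(mod)
    return order

# Build each module's transitive input set while only touching direct
# deps; the transitive part is reused from previously visited modules.
inputs = {}
for mod in postorder(graph):
    inputs[mod] = set(graph[mod])
    for dep in graph[mod]:
        inputs[mod] |= inputs[dep]

print(sorted(inputs["C"]))  # ['A', 'B']
```

Note that these Python sets still copy their elements, whereas Bazel's depset(transitive = ...) shares structure between the nested sets, so constructing them stays cheap; the per-element cost is only paid when a depset is eventually flattened.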

@facundominguez

I'm impressed. That's a clever way to copy the structure of a depset.

However, I'm failing to see the essential difference from the previous sketch. Your last sketch still flattens the transitive inputs once per action; it is just deferred until the very moment the action is executed. Isn't it still quadratic?

This also brings to the fore the fact that once we require transitive inputs for all actions, there is no way to implement this that isn't potentially quadratic, not even running the actions in the haskell_module rules as before.

@aherrmann

However, I'm failing to see the essential difference from the previous sketch. Your last sketch still flattens the transitive inputs once per action; it is just deferred until the very moment the action is executed. Isn't it still quadratic?

The difference is that it benefits from the sharing of depset nodes and avoids unnecessary flattening in the analysis phase. The performance documentation raises the concern about calling .to_list() in rule implementations, not about using depsets in general. See also here and here.

This also brings to the fore the fact that once we require transitive inputs for all actions, there is no way to implement this that isn't potentially quadratic, not even running the actions in the haskell_module rules as before.

Yes, if we need those inputs, then we can't avoid that. But, by using depset efficiently in the rule implementation we can make best use of Bazel's mechanisms to optimize these kinds of situations.

@facundominguez

facundominguez commented Nov 4, 2021

Should dependencies be inherited from libraries and binaries?
gazelle_haskell_modules copies those into the dependencies of haskell_module, but maybe we should leave room for the user to tweak that manually?

@aherrmann

Should dependencies be inherited from libraries and binaries? gazelle_haskell_modules copies those into the dependencies of haskell_module, but maybe we should leave room for the user to tweak that manually?

Good question. I can see two ideals there:

  1. A haskell_module has no library deps but only module deps. At compilation we just forward all library deps that are defined at haskell_library|binary|test to the module compilation action.
  2. Each haskell_module has minimal library deps, i.e. only the haskell(_cabal)_library targets that the module actually depends on. At compilation we forward only the module's library deps (and I assume we also need transitive library deps from libraries and modules for Template Haskell).

Option 1 is not optimal for incremental builds and parallelism, because a module may be rebuilt when an unused library dependency changes, or its build may need to wait for an unused library to finish building. The unused_inputs_file feature could potentially help with this, but it's additional complexity to get it to work.

Option 2 is better for incremental builds and parallelism. But it's harder to get right with gazelle_haskell_modules, because we would need a module index for all library targets, including external ones (@stackage), which Gazelle doesn't have right now.

I don't see an obvious winner between the two. I would err on the side of option 2 and consider option 1 if we know that we can make unused_inputs_file work reasonably in this case.
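For concreteness, the two options might look like this in a BUILD file (module, file, and library names are all hypothetical):

```python
# Option 1: modules carry only module deps; library deps come from the
# enclosing haskell_library|binary|test and are forwarded to every
# module compilation action.
haskell_module(
    name = "Data.Foo",
    src = "Data/Foo.hs",
    deps = [":Data.Foo.Internal"],  # module deps only
)

# Option 2: each module additionally lists exactly the libraries it
# imports from, so a change in an unrelated library dep does not
# invalidate this module's compilation.
haskell_module(
    name = "Data.Bar",
    src = "Data/Bar.hs",
    deps = [
        ":Data.Foo",
        "@stackage//:aeson",  # only what Data/Bar.hs actually imports
    ],
)
```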

@facundominguez

facundominguez commented Nov 4, 2021

How about:
3. merging deps from libraries and the module, and offering a boolean attribute to disable the merging?

I think option (1) could be preferable if we can make unused_inputs_file work. So maybe we can just wait until we gain clarity on that before committing to a choice.

@aherrmann

I think option (1) could be preferable if we can make unused_inputs_file work. So maybe we can just wait until we gain clarity on that before committing to a choice.

Agreed, let's do that.

How about:
3. merging deps from libraries and the module, and offering a boolean attribute to disable the merging?

It's essentially letting the user switch between options 1 and 2. Instead of solving the problems of option 1 with unused_inputs_file, it defers to the user to fix such issues on demand. I fear that, without a good way for users to generate minimal module deps, it would boil down to option 1 without unused_inputs_file in most use cases.

@aherrmann

I had an idea regarding cross package module dependencies that came to my mind while looking at #1623. I'm not sure if this is a good idea, but I thought I'll put it out here for consideration.

In this approach, haskell_module deps always express dependencies on modules: if a haskell_module depends on a haskell_library, that means it depends on all modules of that library. For dependencies on modules of a different package, the haskell_library|... needs to depend on the corresponding haskell_library target; this is how we can provide the required package-db. Each module dependency of the current module must either be part of the same haskell_library|...'s modules attribute, or be provided by a haskell_library that is listed in the deps of the haskell_library|... containing the current module. The following snippet illustrates this:

haskell_module(
    name = "Foo.A",
    deps = [
        ":Foo.B",  # depends on module Foo.B in the same package
        ":Bar.C",  # depends on module Bar.C in package bar
        ":baz",  # depends on all modules (Baz.D, Baz.E) in package baz
        "@stackage//:lens",  # depends on all modules in package lens
    ],
    ...
)
haskell_module(name = "Foo.B", ...)
haskell_module(name = "Bar.C", ...)

haskell_library(
    name = "foo",
    modules = [":Foo.A", ":Foo.B"],
    deps = [":bar", ":baz", "@stackage//:lens"],  # direct deps of package foo - this is how we can know that Bar.C is provided by package bar
    ...
)
haskell_library(
    name = "bar",
    modules = [":Bar.C"],
    deps = ...,
    ...
)
haskell_library(
    name = "baz",
    modules = [":Baz.D", "Baz.E"],
    deps = ...,
    ...
)

Let me know what you think.

mergify bot closed this as completed in #1623 on Nov 10, 2021
@facundominguez

Could work. One aspect that I'd like us to figure out is what the story is when using Template Haskell. Right now a module that uses TH can't be built until the libraries that need to be loaded are built. This means that compiling the module must wait on entire libraries to be built and linked. Could we instead load only the object files of the imported modules (and their transitive module dependencies) rather than whole libraries in these cases?

@facundominguez

We settled the discussion of abstract vs explicit interfaces, but it looks like this issue is to stay open until we have implemented all aspects of haskell_module.

@facundominguez

A recap of what else needs to be done to complete this issue would be helpful.

@aherrmann

aherrmann commented Dec 23, 2021

A recap of what else needs to be done to complete this issue would be helpful.

Agreed, I think in terms of defining and building modules we've covered all that comes to mind:

  • define module level dependency graph
  • module level incremental builds
  • cross package module dependencies
  • dependencies on toolchain libraries, Cabal libraries, and regular Haskell libraries
  • TH support
  • profiling support

There are some things that may work already, but we should have tests for to be sure:

  • GHC plugin support
  • .hsc support
  • .hs-boot support
  • c2hs support

And there are integrations with other rules_haskell features:

@facundominguez Let me know if something else comes to mind.

@facundominguez

tests/haskell_module/plugin and tests/haskell_module/attr_merging both test plugins, so perhaps that can be ticked.

Additionally, we should recast tests/java_classpath with haskell_module.
