perform AstGen on whole files at once (AST->ZIR) #8516

Closed
andrewrk opened this issue Apr 13, 2021 · 4 comments
Labels
accepted This proposal is planned. breaking Implementing this issue could cause existing code to no longer compile or have different behavior. enhancement Solving this issue will likely involve adding new logic or components to the codebase. frontend Tokenization, parsing, AstGen, Sema, and Liveness. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.

@andrewrk
Member

This is a language proposal as well as a concrete plan for how to implement it. It solves #335 and goes a long way towards making the problematic issue #3028 unneeded. The implementation plan simplifies the compiler and yet opens up straightforward opportunities for parallelism and caching.

In stage2 we have a concept of "AstGen" which stands for Abstract Syntax Tree Generation. This is the part where we input an AST and output Zig Intermediate Representation code.

Currently, this is done lazily as-needed per Decl (top level declaration). This requires code to orchestrate per-Decl ZIR code and independently manage memory lifetimes. It also means each Decl uses independent arrays of ZIR tags, instruction lists, string tables, and auxiliary lists. When a file is modified, the compiler checks which Decl source bytes differ, and repeats AstGen for the changed Decls to generate updated ZIR code.

One key design strategy is to make ZIR code immutable, typeless, and depend only on AST. This ensures that it can be re-used for multiple generic instantiations, comptime function calls, and inlined function calls.

This proposal takes that design strategy, and observes that it is possible to generate ZIR for an entire file indiscriminately, for all Decls, depending on AST alone and not introducing any type checking. Furthermore, it observes that this allows implementing the following compile errors:

  • Unused private function
  • Unused local variable
  • Unused private global variable
  • Unreachable code
  • Local variable not mutated

All of these compile errors are possible with AstGen alone, and do not require types. In fact, trying to implement these compile errors with types is problematic because of conditional compilation. But there is no conditional compilation with AstGen. Doing entire files at once would make it possible to have compile errors for unused private functions and globals.
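To make the "no types needed" point concrete, here is a minimal sketch of an unused-local check that works on syntax alone. It is an analogy only: it uses Python's standard `ast` module as a stand-in for a Zig AST, and the function name `unused_locals` is hypothetical, not part of the Zig compiler. Because it walks every branch of the tree without evaluating conditions, there is no conditional compilation to confuse it.

```python
import ast


def unused_locals(source: str) -> list[tuple[str, str]]:
    """Return (function_name, variable_name) pairs for names that are
    assigned inside a function but never read there.

    Deliberately naive: nested functions and closures are not handled.
    """
    tree = ast.parse(source)
    findings = []
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        stored = set()  # names written somewhere in the function body
        loaded = set()  # names read somewhere in the function body
        for node in ast.walk(func):
            if isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Store):
                    stored.add(node.id)
                elif isinstance(node.ctx, ast.Load):
                    loaded.add(node.id)
        for name in sorted(stored - loaded):
            findings.append((func.name, name))
    return findings


EXAMPLE = """
def f(flag):
    used = 1
    unused = 2
    if flag:
        return used
    return 0
"""

if __name__ == "__main__":
    for func_name, var_name in unused_locals(EXAMPLE):
        print(f"{func_name}: unused local variable '{var_name}'")
```

The check never asks what type `used` or `unused` has, and it inspects the body of the `if` regardless of what `flag` would evaluate to.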

Given the way ZIR is encoded, lowering an entire file into a single piece of ZIR code has less overhead than splitting it by Decl: less list capacity is wasted, and more strings in the string table are shared.
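The string-sharing claim can be illustrated with a toy interning table. This is a sketch under stated assumptions: the null-terminated layout, the `StringTable` class, and the identifier lists are all made up for illustration and are not taken from the compiler source.

```python
class StringTable:
    """Toy string table: each distinct string is stored once, null-terminated."""

    def __init__(self) -> None:
        self.bytes = bytearray()
        self._offsets: dict[str, int] = {}

    def intern(self, s: str) -> int:
        """Return the byte offset of s, appending it only if it is new."""
        if s in self._offsets:
            return self._offsets[s]
        offset = len(self.bytes)
        self.bytes += s.encode("utf-8") + b"\x00"
        self._offsets[s] = offset
        return offset


def table_bytes(decl_identifiers: list[list[str]], shared: bool) -> int:
    """Total string-table size with one table per file (shared=True)
    or one table per Decl (shared=False)."""
    if shared:
        table = StringTable()
        for identifiers in decl_identifiers:
            for ident in identifiers:
                table.intern(ident)
        return len(table.bytes)
    total = 0
    for identifiers in decl_identifiers:
        table = StringTable()
        for ident in identifiers:
            table.intern(ident)
        total += len(table.bytes)
    return total


# Hypothetical identifiers referenced by three Decls in one file.
DECLS = [
    ["std", "mem", "Allocator"],
    ["std", "mem", "eql"],
    ["std", "debug", "print"],
]

if __name__ == "__main__":
    print("per-Decl tables:", table_bytes(DECLS, shared=False), "bytes")
    print("one file table: ", table_bytes(DECLS, shared=True), "bytes")
```

Identifiers such as `std` that recur across Decls are stored once in the whole-file table but once per Decl otherwise.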

This works great for caching. All source files independently need to be converted to ZIR, and once converted to ZIR, the original source, token list, and AST node list are all no longer needed. The relevant bytes will be stored directly in ZIR. So each .zig source file will have exactly one corresponding piece of ZIR bytecode. It's easy to imagine a caching strategy for this. Consider also that the transformation from .zig to ZIR does not depend on the target options, or on anything other than the AST. So cached ZIR for std lib files and commonly used packages can be re-used between unrelated projects.

Furthermore, thanks to #2206, the compiler can optimistically look for all .zig source files in a project, and parallelize each tokenize->parse->ZIR transformation. The caching system can notice when .zig source files are unchanged, and load the ZIR code directly instead of the source, skipping tokenization, parsing, and AstGen entirely, on a per-file basis. The AST would only need to be loaded in order to report compile errors.
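A rough sketch of the per-file caching and parallelism described above. Everything here is hypothetical: `lower_to_zir` is a placeholder for tokenize->parse->AstGen, the cache is keyed on a SHA-256 of the source bytes (the proposal does not say how unchanged files are detected), and the cache lives in memory rather than on disk.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


def lower_to_zir(source: bytes) -> bytes:
    """Placeholder for tokenize -> parse -> AstGen. Not real ZIR."""
    return b"ZIR:" + hashlib.sha256(source).digest()[:8]


class ZirCache:
    """Cache keyed only on source bytes, since the lowering depends on
    nothing else (no target options, no other files)."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}
        self._lock = threading.Lock()
        self.misses = 0

    def get(self, source: bytes) -> bytes:
        key = hashlib.sha256(source).hexdigest()
        with self._lock:
            cached = self._store.get(key)
        if cached is not None:
            return cached  # unchanged file: skip all front-end work
        zir = lower_to_zir(source)
        with self._lock:
            self._store[key] = zir
            self.misses += 1
        return zir


def build_all(files: dict[str, bytes], cache: ZirCache) -> dict[str, bytes]:
    """Lower every file independently; files do not depend on each other."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(cache.get, files.values()))
    return dict(zip(files.keys(), results))


if __name__ == "__main__":
    cache = ZirCache()
    project = {"main.zig": b"pub fn main() void {}", "util.zig": b"const x = 1;"}
    build_all(project, cache)
    build_all(project, cache)  # second build hits the cache for both files
    print("front-end runs:", cache.misses)
```

Because the key ignores everything except the file contents, the same cached entry could in principle serve unrelated projects that include an identical file.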

Serialization of ZIR in binary form is straightforward. It consists only of:

  • List of u8 tags for each instruction
  • List of u32, u32 data for each instruction
  • List of u8 string table
  • List of u32 auxiliary data

Writing/reading this to/from a file is trivial.
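
As an illustration of how little is involved, here is a sketch that round-trips the four lists listed above through a flat byte buffer. The header layout (three little-endian u32 counts), the field names, and the `Zir` class are assumptions made for this example; the proposal does not specify an on-disk format.

```python
import struct
from dataclasses import dataclass


@dataclass
class Zir:
    tags: bytes                   # one u8 tag per instruction
    data: list[tuple[int, int]]   # one (u32, u32) pair per instruction
    string_bytes: bytes           # string table
    extra: list[int]              # u32 auxiliary data


# Assumed header: instruction count, string table length, extra count.
HEADER = struct.Struct("<III")


def serialize(zir: Zir) -> bytes:
    if len(zir.tags) != len(zir.data):
        raise ValueError("tags and data must have one entry per instruction")
    out = bytearray(HEADER.pack(len(zir.tags), len(zir.string_bytes), len(zir.extra)))
    out += zir.tags
    for a, b in zir.data:
        out += struct.pack("<II", a, b)
    out += zir.string_bytes
    out += struct.pack(f"<{len(zir.extra)}I", *zir.extra)
    return bytes(out)


def deserialize(buf: bytes) -> Zir:
    n_inst, n_str, n_extra = HEADER.unpack_from(buf, 0)
    pos = HEADER.size
    tags = bytes(buf[pos:pos + n_inst])
    pos += n_inst
    flat = struct.unpack_from(f"<{2 * n_inst}I", buf, pos)
    pos += 8 * n_inst
    data = list(zip(flat[0::2], flat[1::2]))
    string_bytes = bytes(buf[pos:pos + n_str])
    pos += n_str
    extra = list(struct.unpack_from(f"<{n_extra}I", buf, pos))
    return Zir(tags, data, string_bytes, extra)


if __name__ == "__main__":
    original = Zir(bytes([7, 9]), [(1, 2), (3, 4)], b"main\x00", [10, 20, 30])
    assert deserialize(serialize(original)) == original
```

Each list is written as one contiguous run, so reading it back is a handful of slices with no per-instruction parsing logic.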
andrewrk added the enhancement, breaking, proposal, and frontend labels Apr 13, 2021
andrewrk added this to the 0.8.0 milestone Apr 13, 2021
andrewrk added a commit that referenced this issue Apr 14, 2021
See #8516.

 * AstGen is now done on whole files at once rather than per Decl.

 * Introduce a new wait group for AstGen tasks. `performAllTheWork`
   waits for all AstGen tasks to be complete before doing Sema,
   single-threaded.
   - The C object compilation tasks are moved to be spawned after
     AstGen, since they only need to complete by the end of
     the function.

With this commit, the codebase compiles, but much more reworking is
needed to get things back into a useful state.
andrewrk added a commit that referenced this issue Apr 16, 2021
andrewrk added the accepted label Apr 21, 2021
@zigazeljko
Contributor

> All source files independently need to be converted to ZIR, and once converted to ZIR, the original source, token list, and AST node list are all no longer needed.

How is debug info handled in this case? ZIR describes locations in terms of node/token indices, so AST is still needed to obtain line and column numbers for DWARF info.

@andrewrk
Member Author

andrewrk commented Apr 30, 2021

dbg_stmt ZIR instructions are emitted which indicate the beginning of statements, and contain line/column information. This was already true before whole-file-astgen. The only difference is that previously dbg_stmt used node indexes, which were resolved into line/column later; now they are resolved to line/column in AstGen. This is more efficient because AstGen is where the source bytes are loaded in memory for other reasons, such as looking at identifiers and string literals.
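
For illustration, resolving a position to line/column while the source bytes are at hand can be as simple as the sketch below. The function name `line_column` and the 0-based convention are assumptions for this example, not the compiler's actual code.

```python
def line_column(source: bytes, offset: int) -> tuple[int, int]:
    """Return the 0-based (line, column) of a byte offset in source."""
    if not 0 <= offset <= len(source):
        raise ValueError("offset out of range")
    # Line number: how many newlines precede the offset.
    line = source.count(b"\n", 0, offset)
    # Column: distance from the byte after the previous newline.
    # rfind returns -1 when there is none, so line_start becomes 0.
    line_start = source.rfind(b"\n", 0, offset) + 1
    return line, offset - line_start


SOURCE = b"const a = 1;\nfn f() void {\n    g();\n}\n"

if __name__ == "__main__":
    offset = SOURCE.index(b"g();")
    line, column = line_column(SOURCE, offset)
    # A statement marker could then carry (line, column) directly,
    # so later stages never need the source or the AST for it.
    print(("dbg_stmt", line, column))
```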

I have not finished implementing all of the above in this branch yet.

@andrewrk
Member Author

This is nearly completed by #8554. All that is remaining is to implement the new compile errors.

@andrewrk
Member Author

Actually this is done now, since the compile errors are covered by (accepted) proposals #224 and #335.

2 participants