Commit 8154984

chore: documentation

1 parent c4e9edb commit 8154984
7 files changed: +159 -88 lines changed

README.md

Lines changed: 42 additions & 1 deletion
@@ -1,2 +1,43 @@
1+
> 🚧 This is in active development and not ready for use yet.
2+
13
# postgres_lsp
2-
A Language Server for Postgres
4+
5+
A Language Server for Postgres. Not SQL with flavours, just Postgres.
6+
7+
## Rationale
8+
9+
Despite the ever-rising popularity of Postgres, support for the PostgreSQL language in the IDE or editor of your choice is still very sparse. There are a few proprietary tools (e.g. [DataGrip](https://www.jetbrains.com/datagrip/)) that work well, but they are only available within the respective IDE. Open-source attempts (e.g. [sql-language-server](https://github.com/joe-re/sql-language-server), [pgFormatter](https://github.com/darold/pgFormatter/tree/master), [sql-parser-cst](https://github.com/nene/sql-parser-cst)) mostly aim to provide a generic SQL language server and implement the Postgres syntax only as a flavor of their parser. This always falls short due to the ever-evolving and complex syntax of PostgreSQL. This project only ever wants to support PostgreSQL, and leverages parts of the PostgreSQL server source (see [libpg_query](https://github.com/pganalyze/libpg_query)) to parse the source code reliably. This is slightly crazy, but it is the only reliable way of parsing all valid PostgreSQL queries. You can find a longer rationale on why This is the way™ [here](https://pganalyze.com/blog/parse-postgresql-queries-in-ruby). Of course, libpg_query was built to execute SQL, not to build a language server, but all of the resulting shortcomings were successfully mitigated in the [`parser`](./crates/parser/src/lib.rs) crate.
10+
11+
Once the parser is stable and a robust and scalable data model is implemented, the language server will not only provide basic features such as semantic highlighting, code completion, and syntax error diagnostics, but also serve as the user interface for all the great tooling of the Postgres ecosystem.
12+
13+
## Roadmap
14+
15+
At this point, however, this is merely a proof of concept for building both a concrete syntax tree and an abstract syntax tree from potentially malformed PostgreSQL source code. The `postgres_lsp` crate was created only to prove that the approach works end-to-end, and is just a very basic language server with semantic highlighting and syntax error diagnostics. Before actual feature development can start, we have to do a bit of groundwork.
16+
17+
1. _Finish the parser_
18+
- The parser works, but the enum values for all the different syntax elements and internal conversions are manually written or copied and, in some places, only cover a few elements required for a simple select statement. To get full coverage without the possibility of copy-and-paste errors, they should be generated from the pg_query.rs source code.
19+
- There are a few cases, such as nested and named dollar-quoted strings, that cause the parser to fail due to limitations of the regex-based lexer. Nothing that is impossible to fix or that requires a change in approach, though.
20+
2. _Implement a robust and scalable data model_
21+
- TODO
22+
3. _Setup the language server properly_
23+
- TODO
24+
4. _Implement basic language server features_
25+
- Semantic Highlighting
26+
- Syntax Error Diagnostics
27+
- Show SQL comments on hover
28+
- Auto-Completion
29+
- Code Actions, such as `Execute the statement under the cursor`, or `Execute the current file`
30+
- ... anything you can think of really
31+
5. _Integrate all the existing open source tooling_
32+
- Show migration file lint errors from [squawk](https://github.com/sbdchd/squawk)
33+
- Show PL/pgSQL lint errors from [plpgsql_check](https://github.com/okbob/plpgsql_check)
34+
6. _Build missing pieces_
35+
- An opinionated code formatter (think Prettier for PostgreSQL)
36+
7. _(Maybe) Support advanced features with declarative schema management_
37+
- Jump to definition
38+
- ... anything you can think of really
39+
40+
## Acknowledgments
41+
42+
- [rust-analyzer](https://github.com/rust-lang/rust-analyzer) for implementing such a robust, well documented, and feature-rich language server. Great place to learn from.
43+
- [squawk](https://github.com/sbdchd/squawk) and [pganalyze](https://pganalyze.com) for inspiring the use of libpg_query.
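
The rationale above boils down to delegating parsing to libpg_query via pg_query.rs. A minimal sketch of what that delegation looks like from Rust, assuming the pg_query crate's `parse` entry point (the query string is just an example):

```rust
fn main() {
    // libpg_query embeds the actual PostgreSQL grammar, so Postgres-specific
    // constructs parse without any custom SQL dialect handling.
    match pg_query::parse("SELECT * FROM contacts WHERE id = $1") {
        Ok(_) => println!("valid Postgres syntax"),
        Err(e) => eprintln!("syntax error: {e}"),
    }
}
```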

crates/parser/src/lib.rs

Lines changed: 14 additions & 54 deletions
@@ -1,59 +1,19 @@
1-
//! The SQL parser.
1+
//! The Postgres parser.
22
//!
3+
//! This crate provides a parser for the Postgres SQL dialect.
4+
//! It is based on the pg_query.rs crate, which is a wrapper around the PostgreSQL query parser.
5+
//! The main `Parser` struct parses a source file and individual statements.
6+
//! The `Parse` struct contains the resulting concrete syntax tree, syntax errors, and the abstract syntax tree, which is a list of pg_query statements and their positions.
37
//!
4-
//
5-
// TODO: implement parser similarly to rust_analyzer
6-
// result is a stream of events (including errors) and a list of errors
7-
//
8-
//
9-
//
10-
// we can use Vec::new() in constructor and then set nodes in parse() if parsing was successful
11-
//
12-
//
13-
//
14-
//
15-
// differences to rust_analyzer
16-
// 1.
17-
// since we always have to parse just text, there is no need to have lexer and parser separated
18-
// input of the parser is a string and we always parse the full string
19-
// syntax crate does not know about lexers and their tokens
20-
// --> input is just a string
21-
// 2.
22-
// in rust_analyzer, the output is just a stream of 32-bit encoded events WITHOUT the text
23-
// again, this extra layer of abstraction is not necessary for us, since we always parse text
24-
// the output of the parser is pretty much the same as the input but with nodes
25-
// --> the parser takes fn that is called for every node and token to build the tree
26-
// so we skip the intermediate list of events and just build the tree directly
27-
// we can define a trait that is implemented by the GreenNodeBuilder
28-
//
29-
//
30-
// SyntaxNode in the syntax create is just the SyntaxKind from the parser
31-
// cst is build with the SyntaxKind type
32-
// in the syntax crate, the SyntaxTreeBuilder is created and the events are fed into it to build
33-
// the three
34-
//
35-
//
36-
// how does rust_analyzer know what parts of text is an error?
37-
// errors are not added to the tree in SyntaxTreeBuilder, which means the tokens must include the
38-
// erronous parts of the text
39-
// but the parser output does not include text, so how does the cst can have correct text?
40-
// easy: the tokenizer is running beforehand, so we always have the tokens, and the errors are just
41-
// added afterwards when parsing the tokens using the grammar.
42-
// so there is a never-failing tokenizer step which is followed by the parser that knows the
43-
// grammar and emits errors
44-
// --> we will do the same, but with a multi-step tokenizer and parser that fallbacks to simpler
45-
// and simpler tokens
46-
//
47-
//
48-
// api has to cover parse source file and parse statement
49-
//
50-
//
51-
// we will also have to add a cache for pg_query parsing results using fingerprinting
52-
//
53-
// all parsers can be just a function that iterates the base lexer
54-
// so we will have a `parse_statement` and a `parse_source_file` function
55-
// the tree always covers all text since we use the scantokens and, if failing, the StatementTokens
56-
// errors are added to a list, and are not part of the tree
8+
//! The idea is to offload the heavy lifting to the same parser that the PostgreSQL server uses,
9+
//! and just fill in the gaps to be able to build both cst and ast from a source file that
10+
//! potentially contains erroneous statements.
11+
//!
12+
//! The main drawbacks of the PostgreSQL query parser mitigated by this parser are:
13+
//! - it only parses the full source text, and if there is any syntax error in the file, it will not parse anything and instead return an error.
14+
//! - it does not parse whitespace and newlines, so it is not possible to build a concrete syntax tree.
15+
//!
16+
//! To see how these drawbacks are mitigated, see the `statement.rs` and `source_file.rs` modules.
5717
5818
mod ast_node;
5919
mod parser;
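
Putting the crate-level docs above into code, a minimal usage sketch built only from the names documented in this diff (`Parser::new`, `parse_source_file`, `finish`, and the `Parse` fields); the import path is an assumption:

```rust
use parser::Parser; // assumed path; the crate may expose it differently

fn main() {
    let mut parser = Parser::new();
    // The second statement has a syntax error; only that statement should
    // produce an error, while the rest of the file still parses.
    parser.parse_source_file("select 1;\nselect from;");
    let parse = parser.finish();

    // `parse.cst` is the concrete syntax tree covering the whole input.
    println!("{} syntax errors", parse.errors.len());
    println!("{} pg_query statements", parse.stmts.len());
}
```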

crates/parser/src/parser.rs

Lines changed: 30 additions & 4 deletions
@@ -7,25 +7,36 @@ use crate::syntax_error::SyntaxError;
77
use crate::syntax_kind::{SyntaxKind, SyntaxKindType};
88
use crate::syntax_node::SyntaxNode;
99

10+
/// Main parser that controls the cst building process, and collects errors and statements
1011
#[derive(Debug)]
1112
pub struct Parser {
13+
/// The cst builder
1214
inner: GreenNodeBuilder<'static, 'static, SyntaxKind>,
15+
/// A buffer for tokens that are not yet applied to the cst
1316
token_buffer: Vec<(SyntaxKind, String)>,
17+
/// The current depth of the cst
1418
curr_depth: i32,
19+
/// The syntax errors accumulated during parsing
1520
errors: Vec<SyntaxError>,
21+
/// The pg_query statements representing the abstract syntax tree
1622
stmts: Vec<RawStmt>,
23+
/// The current checkpoint depth, if any
1724
checkpoint: Option<i32>,
25+
/// Whether the parser is currently parsing a flat node
1826
is_parsing_flat_node: bool,
1927
}
2028

29+
/// Result of parsing
2130
#[derive(Debug)]
2231
pub struct Parse {
32+
/// The concrete syntax tree
2333
pub cst: ResolvedNode<SyntaxKind>,
34+
/// The syntax errors accumulated during parsing
2435
pub errors: Vec<SyntaxError>,
36+
/// The pg_query statements representing the abstract syntax tree
2537
pub stmts: Vec<RawStmt>,
2638
}
2739

28-
/// Main parser that controls the cst building process, and collects errors and statements
2940
impl Parser {
3041
pub fn new() -> Self {
3142
Self {
@@ -39,21 +50,31 @@ impl Parser {
3950
}
4051
}
4152

53+
/// close all nodes until the specified depth is reached
4254
pub fn close_until_depth(&mut self, depth: i32) {
4355
while self.curr_depth >= depth {
4456
self.finish_node();
4557
self.curr_depth -= 1;
4658
}
4759
}
4860

61+
/// set a checkpoint at current depth
62+
///
63+
/// if `is_parsing_flat_node` is true, all tokens will be applied immediately
4964
pub fn set_checkpoint(&mut self, is_parsing_flat_node: bool) {
50-
assert!(self.checkpoint.is_none());
51-
assert!(self.token_buffer.is_empty());
52-
println!("set_checkpoint at {}", self.curr_depth);
65+
assert!(
66+
self.checkpoint.is_none(),
67+
"Must close previouos checkpoint before setting new one"
68+
);
69+
assert!(
70+
self.token_buffer.is_empty(),
71+
"Token buffer must be empty before setting a checkpoint"
72+
);
5373
self.checkpoint = Some(self.curr_depth);
5474
self.is_parsing_flat_node = is_parsing_flat_node;
5575
}
5676

77+
/// close all nodes until checkpoint depth is reached
5778
pub fn close_checkpoint(&mut self) {
5879
self.consume_token_buffer();
5980
if self.checkpoint.is_some() {
@@ -63,6 +84,7 @@ impl Parser {
6384
self.is_parsing_flat_node = false;
6485
}
6586

87+
/// start a new node of `SyntaxKind`
6688
pub fn start_node(&mut self, kind: SyntaxKind) {
6789
self.inner.start_node(kind);
6890
}
@@ -83,6 +105,7 @@ impl Parser {
83105
self.start_node(kind);
84106
}
85107

108+
/// finish current node
86109
pub fn finish_node(&mut self) {
87110
self.inner.finish_node();
88111
}
@@ -123,14 +146,17 @@ impl Parser {
123146
}
124147
}
125148

149+
/// collects a SyntaxError with an `error` message at `range`
126150
pub fn error(&mut self, error: String, range: TextRange) {
127151
self.errors.push(SyntaxError::new(error, range));
128152
}
129153

154+
/// collects a pg_query `stmt` at `range`
130155
pub fn stmt(&mut self, stmt: NodeEnum, range: TextRange) {
131156
self.stmts.push(RawStmt { stmt, range });
132157
}
133158

159+
/// finish cstree and return `Parse`
134160
pub fn finish(self) -> Parse {
135161
let (tree, cache) = self.inner.finish();
136162
Parse {
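
To make the checkpoint and node lifecycle above concrete, a hypothetical grammar helper, a minimal sketch using only the methods documented in this diff (`SyntaxKind::SelectStmt` is an assumed variant name, and real grammar code would consume tokens between the calls):

```rust
// Hypothetical grammar helper; uses only methods documented above.
fn parse_select(p: &mut Parser) {
    p.set_checkpoint(false);              // remember the current depth
    p.start_node(SyntaxKind::SelectStmt); // open a node in the cst (assumed variant)
    // ... consume tokens and open/close child nodes here ...
    p.finish_node();                      // close the SelectStmt node
    p.close_checkpoint();                 // unwind back to the checkpoint depth
}
```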

crates/parser/src/pg_query_utils.rs

Lines changed: 3 additions & 2 deletions
@@ -1,8 +1,9 @@
11
use pg_query::NodeRef;
22

33
/// Gets the position value for a pg_query node
4-
/// can mostly be generated.
5-
/// there are some exceptions where the location on the node itself is not leftmost position, e.g. for AExpr.
4+
///
5+
/// This can mostly be generated by just returning `node.location` if the type has the property,
6+
/// but there are some exceptions where the location on the node itself is not leftmost position, e.g. for `AExpr`.
67
pub fn get_position_for_pg_query_node(node: &NodeRef) -> i32 {
78
match node {
89
NodeRef::ResTarget(n) => n.location,

crates/parser/src/source_file.rs

Lines changed: 11 additions & 7 deletions
@@ -1,14 +1,14 @@
1-
/// A super simple lexer for sql files to work around the main weakness of pg_query.rs.
2-
///
3-
/// pg_query.rs only parses valid statements, and also fail to parse all statements if any contain
4-
/// syntax errors. To circumvent this, we use a lexer to split the input into statements, and then
5-
/// parse each statement individually.
6-
///
7-
/// This lexer does the split.
81
use logos::Logos;
92

103
use crate::{parser::Parser, syntax_kind::SyntaxKind};
114

5+
/// A super simple lexer for sql files that splits the input into individual statements and
6+
/// comments.
7+
///
8+
/// pg_query.rs only parses valid statements, and fails to parse all statements if any of them contains a syntax error.
9+
/// To circumvent this, we use a lexer to split the input into statements, and then parse each statement individually.
10+
///
11+
/// This regex-based lexer does the split.
1212
#[derive(Logos, Debug, PartialEq)]
1313
#[logos(skip r"[ \t\f]+")] // Ignore this regex pattern between tokens
1414
pub enum SourceFileToken {
@@ -21,6 +21,10 @@ pub enum SourceFileToken {
2121
}
2222

2323
impl Parser {
24+
/// Parse a source file
25+
///
26+
/// TODO: rename to `parse_source_at(text: &str, at: Option<u32>)`, and allow parsing substatements, e.g. bodies of create
27+
/// function statements.
2428
pub fn parse_source_file(&mut self, text: &str) {
2529
let mut lexer = SourceFileToken::lexer(text);
2630
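
For readers unfamiliar with Logos, a standalone sketch of the statement-splitting idea, assuming logos 0.13 (where the lexer yields `Result`s); the token names and regexes here are illustrative, not the crate's actual `SourceFileToken` definitions:

```rust
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\f]+")] // skip horizontal whitespace between tokens
enum Token {
    #[regex(r"[a-zA-Z0-9_][^;\n]*")]
    Statement,
    #[token(";")]
    Semicolon,
    #[regex(r"\n+")]
    Newline,
}

fn main() {
    let mut lex = Token::lexer("select 1;\nselect 2;");
    while let Some(token) = lex.next() {
        // Each statement comes out as one slice; semicolons and newlines
        // are separate tokens, so positions in the file are preserved.
        println!("{:?}: {:?}", token, lex.slice());
    }
}
```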

crates/parser/src/statement.rs

Lines changed: 21 additions & 11 deletions
@@ -1,8 +1,3 @@
1-
/// A super simple lexer for sql statements to work around a weakness of pg_query.rs.
2-
///
3-
/// pg_query.rs only parses valid statements, and no whitespaces or newlines.
4-
/// To circumvent this, we use a very simple lexer that just knows what kind of characters are
5-
/// being used. all words are put into the "Word" type and will be defined in more detail by the results of pg_query.rs
61
use cstree::text::{TextRange, TextSize};
72
use logos::Logos;
83
use regex::Regex;
@@ -11,12 +6,11 @@ use crate::{
116
parser::Parser, pg_query_utils::get_position_for_pg_query_node, syntax_kind::SyntaxKind,
127
};
138

14-
#[derive(Logos, Debug, PartialEq)]
15-
pub enum Test {
16-
#[regex("'([^']+)'|\\$(\\w)?\\$.*\\$(\\w)?\\$")]
17-
Sconst,
18-
}
19-
9+
/// A super simple lexer for sql statements.
10+
///
11+
/// One weakness of pg_query.rs is that it does not parse whitespace or newlines. To circumvent
12+
/// this, we use a very simple lexer that just knows what kind of characters are being used. It
13+
/// does not know anything about postgres syntax or keywords. For example, all words such as `select` and `from` are put into the `Word` type.
2014
#[derive(Logos, Debug, PartialEq)]
2115
pub enum StatementToken {
2216
// copied from protobuf::Token. can be generated later
@@ -110,6 +104,22 @@ impl StatementToken {
110104
}
111105

112106
impl Parser {
107+
/// The main entry point for parsing a statement `text`. `at_offset` is the offset of the statement in the source file.
108+
///
109+
/// On a high level, the algorithm works as follows:
110+
/// 1. Parse the statement with pg_query.rs and order nodes by their position. If the
111+
/// statement contains syntax errors, the parser will report the error and continue to work without information
112+
/// about the nodes. The result will be a flat list of tokens under the generic `Stmt` node.
113+
/// If successful, the first node in the ordered list will be the main node of the statement,
114+
/// and serves as a root node.
115+
/// 2. Scan the statements for tokens with pg_query.rs. This will never fail, even if the statement contains syntax errors.
116+
/// 3. Parse the statement with the `StatementToken` lexer. The lexer will be the main vehicle
117+
/// while walking the statement.
118+
/// 4. Walk the statement with the `StatementToken` lexer.
119+
/// - at every token, consume all nodes that are within the token's range.
120+
/// - if there is a pg_query token within the token's range, consume it. if not, fallback to
121+
/// the StatementToken. This is the case for e.g. whitespace.
122+
/// 5. Close all open nodes for that statement.
113123
pub fn parse_statement(&mut self, text: &str, at_offset: Option<u32>) {
114124
let offset = at_offset.unwrap_or(0);
115125
let range = TextRange::new(
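
A condensed sketch of the five steps documented on `parse_statement` above, assumed to live inside this crate so that `Parser` is in scope; steps 2–5 are only outlined in comments:

```rust
use cstree::text::{TextRange, TextSize};

// Condensed sketch of the five steps; cst-building details are elided.
fn parse_statement_sketch(parser: &mut Parser, text: &str, at_offset: Option<u32>) {
    let offset = at_offset.unwrap_or(0);
    let range = TextRange::new(
        TextSize::from(offset),
        TextSize::from(offset + text.len() as u32),
    );

    // 1. Full parse with pg_query.rs; on error, record it and fall back to
    //    a flat token list under a generic `Stmt` node.
    if let Err(e) = pg_query::parse(text) {
        parser.error(e.to_string(), range);
    }

    // 2. Scan the statement for tokens (never fails, even on broken input).
    // 3. Tokenize with the `StatementToken` lexer.
    // 4. Walk the statement, preferring pg_query tokens and falling back to
    //    the StatementToken (e.g. for whitespace).
    // 5. Close all nodes that are still open for this statement.
}
```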
