Commit 8154984

chore: documentation

1 parent c4e9edb commit 8154984
7 files changed: +159 -88 lines changed

README.md

Lines changed: 42 additions & 1 deletion
@@ -1,2 +1,43 @@
1+
> 🚧 This is in active development and not ready for use yet.
2+
13
# postgres_lsp
2-
A Language Server for Postgres
4+
5+
A Language Server for Postgres. Not SQL with flavours, just Postgres.
6+
7+
## Rationale
8+
9+
Despite the ever-rising popularity of Postgres, support for the PostgreSQL language in the IDE or editor of your choice is still very sparse. There are a few proprietary tools (e.g. [DataGrip](https://www.jetbrains.com/datagrip/)) that work well, but they are only available within the respective IDE. Open-source attempts (e.g. [sql-language-server](https://github.com/joe-re/sql-language-server), [pgFormatter](https://github.com/darold/pgFormatter/tree/master), [sql-parser-cst](https://github.com/nene/sql-parser-cst)) mostly aim to provide a generic SQL language server and implement the Postgres syntax only as a flavor of their parser. This always falls short due to the ever-evolving and complex syntax of PostgreSQL. This project only ever wants to support PostgreSQL, and leverages parts of the PostgreSQL server source (see [libpg_query](https://github.com/pganalyze/libpg_query)) to parse the source code reliably. This is slightly crazy, but it is the only reliable way of parsing all valid PostgreSQL queries. You can find a longer rationale on why This is the way™ [here](https://pganalyze.com/blog/parse-postgresql-queries-in-ruby). Of course, libpg_query was built to execute SQL, not to build a language server, but all of the resulting shortcomings were successfully mitigated in the [`parser`](./crates/parser/src/lib.rs) crate.
10+
11+
Once the parser is stable and a robust and scalable data model is implemented, the language server will not only provide basic features such as semantic highlighting, code completion, and syntax error diagnostics, but also serve as the user interface for all the great tooling of the Postgres ecosystem.
12+
13+
## Roadmap
14+
15+
At this point, however, this is merely a proof of concept for building both a concrete syntax tree and an abstract syntax tree from potentially malformed PostgreSQL source code. The `postgres_lsp` crate was created only to prove that the approach works end-to-end, and is just a very basic language server with semantic highlighting and syntax error diagnostics. Before actual feature development can start, we have to do a bit of groundwork.
16+
17+
1. _Finish the parser_
18+
- The parser works, but the enum values for all the different syntax elements and internal conversions are manually written or copied and, in some places, only cover a few elements required for a simple select statement. To get full coverage without the possibility of copy-and-paste errors, they should be generated from the pg_query.rs source code.
19+
- There are a few cases, such as nested and named dollar-quoted strings, that cause the parser to fail due to limitations of the regex-based lexer. Nothing that is impossible to fix or that requires a change in approach, though.
20+
2. _Implement a robust and scalable data model_
21+
- TODO
22+
3. _Setup the language server properly_
23+
- TODO
24+
4. _Implement basic language server features_
25+
- Semantic Highlighting
26+
- Syntax Error Diagnostics
27+
- Show SQL comments on hover
28+
- Auto-Completion
29+
- Code Actions, such as `Execute the statement under the cursor`, or `Execute the current file`
30+
- ... anything you can think of really
31+
5. _Integrate all the existing open source tooling_
32+
- Show migration file lint errors from [squawk](https://github.com/sbdchd/squawk)
33+
- Show PL/pgSQL lint errors from [plpgsql_check](https://github.com/okbob/plpgsql_check)
34+
6. _Build missing pieces_
35+
- An opinionated code formatter (think Prettier for PostgreSQL)
36+
7. _(Maybe) Support advanced features with declarative schema management_
37+
- Jump to definition
38+
- ... anything you can think of really
39+
40+
## Acknowledgments
41+
42+
- [rust-analyzer](https://github.com/rust-lang/rust-analyzer) for implementing such a robust, well documented, and feature-rich language server. Great place to learn from.
43+
- [squawk](https://github.com/sbdchd/squawk) and [pganalyze](https://pganalyze.com) for inspiring the use of libpg_query.
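
The rationale above boils down to delegating parsing to libpg_query via pg_query.rs. A minimal sketch of what that delegation looks like from Rust, assuming the pg_query crate's `parse` entry point (the query string is just an example):

```rust
fn main() {
    // libpg_query embeds the actual PostgreSQL grammar, so Postgres-specific
    // constructs parse without any custom SQL dialect handling.
    match pg_query::parse("SELECT * FROM contacts WHERE id = $1") {
        Ok(_) => println!("valid Postgres syntax"),
        Err(e) => eprintln!("syntax error: {e}"),
    }
}
```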

crates/parser/src/lib.rs

Lines changed: 14 additions & 54 deletions
@@ -1,59 +1,19 @@
1-
//! The SQL parser.
1+
//! The Postgres parser.
22
//!
3+
//! This crate provides a parser for the Postgres SQL dialect.
4+
//! It is based on the pg_query.rs crate, which is a wrapper around the PostgreSQL query parser.
5+
//! The main `Parser` struct parses a source file and individual statements.
6+
//! The `Parse` struct contains the resulting concrete syntax tree, syntax errors, and the abstract syntax tree, which is a list of pg_query statements and their positions.
37
//!
4-
//
5-
// TODO: implement parser similarly to rust_analyzer
6-
// result is a stream of events (including errors) and a list of errors
7-
//
8-
//
9-
//
10-
// we can use Vec::new() in constructor and then set nodes in parse() if parsing was successful
11-
//
12-
//
13-
//
14-
//
15-
// differences to rust_analyzer
16-
// 1.
17-
// since we always have to parse just text, there is no need to have lexer and parser separated
18-
// input of the parser is a string and we always parse the full string
19-
// syntax crate does not know about lexers and their tokens
20-
// --> input is just a string
21-
// 2.
22-
// in rust_analyzer, the output is just a stream of 32-bit encoded events WITHOUT the text
23-
// again, this extra layer of abstraction is not necessary for us, since we always parse text
24-
// the output of the parser is pretty much the same as the input but with nodes
25-
// --> the parser takes fn that is called for every node and token to build the tree
26-
// so we skip the intermediate list of events and just build the tree directly
27-
// we can define a trait that is implemented by the GreenNodeBuilder
28-
//
29-
//
30-
// SyntaxNode in the syntax create is just the SyntaxKind from the parser
31-
// cst is build with the SyntaxKind type
32-
// in the syntax crate, the SyntaxTreeBuilder is created and the events are fed into it to build
33-
// the three
34-
//
35-
//
36-
// how does rust_analyzer know what parts of text is an error?
37-
// errors are not added to the tree in SyntaxTreeBuilder, which means the tokens must include the
38-
// erronous parts of the text
39-
// but the parser output does not include text, so how does the cst can have correct text?
40-
// easy: the tokenizer is running beforehand, so we always have the tokens, and the errors are just
41-
// added afterwards when parsing the tokens using the grammar.
42-
// so there is a never-failing tokenizer step which is followed by the parser that knows the
43-
// grammar and emits errors
44-
// --> we will do the same, but with a multi-step tokenizer and parser that fallbacks to simpler
45-
// and simpler tokens
46-
//
47-
//
48-
// api has to cover parse source file and parse statement
49-
//
50-
//
51-
// we will also have to add a cache for pg_query parsing results using fingerprinting
52-
//
53-
// all parsers can be just a function that iterates the base lexer
54-
// so we will have a `parse_statement` and a `parse_source_file` function
55-
// the tree always covers all text since we use the scantokens and, if failing, the StatementTokens
56-
// errors are added to a list, and are not part of the tree
8+
//! The idea is to offload the heavy lifting to the same parser that the PostgreSQL server uses,
9+
//! and just fill in the gaps to be able to build both cst and ast from a source file that
10+
//! potentially contains erroneous statements.
11+
//!
12+
//! The main drawbacks of the PostgreSQL query parser mitigated by this parser are:
13+
//! - it only parses the full source text, and if there is any syntax error in the file, it will not parse anything and instead return an error.
14+
//! - it does not parse whitespace and newlines, so it is not possible to build a concrete syntax tree.
15+
//!
16+
//! To see how these drawbacks are mitigated, see the `statement.rs` and `source_file.rs` modules.
5717
5818
mod ast_node;
5919
mod parser;
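
Putting the crate-level docs above into code, a minimal usage sketch built only from the names documented in this diff (`Parser::new`, `parse_source_file`, `finish`, and the `Parse` fields); the import path is an assumption:

```rust
use parser::Parser; // assumed path; the crate may expose it differently

fn main() {
    let mut parser = Parser::new();
    // The second statement has a syntax error; only that statement should
    // produce an error, while the rest of the file still parses.
    parser.parse_source_file("select 1;\nselect from;");
    let parse = parser.finish();

    // `parse.cst` is the concrete syntax tree covering the whole input.
    println!("{} syntax errors", parse.errors.len());
    println!("{} pg_query statements", parse.stmts.len());
}
```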

crates/parser/src/parser.rs

Lines changed: 30 additions & 4 deletions
@@ -7,25 +7,36 @@ use crate::syntax_error::SyntaxError;
77
use crate::syntax_kind::{SyntaxKind, SyntaxKindType};
88
use crate::syntax_node::SyntaxNode;
99

10+
/// Main parser that controls the cst building process, and collects errors and statements
1011
#[derive(Debug)]
1112
pub struct Parser {
13+
/// The cst builder
1214
inner: GreenNodeBuilder<'static, 'static, SyntaxKind>,
15+
/// A buffer for tokens that are not yet applied to the cst
1316
token_buffer: Vec<(SyntaxKind, String)>,
17+
/// The current depth of the cst
1418
curr_depth: i32,
19+
/// The syntax errors accumulated during parsing
1520
errors: Vec<SyntaxError>,
21+
/// The pg_query statements representing the abstract syntax tree
1622
stmts: Vec<RawStmt>,
23+
/// The current checkpoint depth, if any
1724
checkpoint: Option<i32>,
25+
/// Whether the parser is currently parsing a flat node
1826
is_parsing_flat_node: bool,
1927
}
2028

29+
/// Result of parsing
2130
#[derive(Debug)]
2231
pub struct Parse {
32+
/// The concrete syntax tree
2333
pub cst: ResolvedNode<SyntaxKind>,
34+
/// The syntax errors accumulated during parsing
2435
pub errors: Vec<SyntaxError>,
36+
/// The pg_query statements representing the abstract syntax tree
2537
pub stmts: Vec<RawStmt>,
2638
}
2739

28-
/// Main parser that controls the cst building process, and collects errors and statements
2940
impl Parser {
3041
pub fn new() -> Self {
3142
Self {
@@ -39,21 +50,31 @@ impl Parser {
3950
}
4051
}
4152

53+
/// close all nodes until the specified depth is reached
4254
pub fn close_until_depth(&mut self, depth: i32) {
4355
while self.curr_depth >= depth {
4456
self.finish_node();
4557
self.curr_depth -= 1;
4658
}
4759
}
4860

61+
/// set a checkpoint at current depth
62+
///
63+
/// if `is_parsing_flat_node` is true, all tokens will be applied immediately
4964
pub fn set_checkpoint(&mut self, is_parsing_flat_node: bool) {
50-
assert!(self.checkpoint.is_none());
51-
assert!(self.token_buffer.is_empty());
52-
println!("set_checkpoint at {}", self.curr_depth);
65+
assert!(
66+
self.checkpoint.is_none(),
67+
"Must close previouos checkpoint before setting new one"
68+
);
69+
assert!(
70+
self.token_buffer.is_empty(),
71+
"Token buffer must be empty before setting a checkpoint"
72+
);
5373
self.checkpoint = Some(self.curr_depth);
5474
self.is_parsing_flat_node = is_parsing_flat_node;
5575
}
5676

77+
/// close all nodes until checkpoint depth is reached
5778
pub fn close_checkpoint(&mut self) {
5879
self.consume_token_buffer();
5980
if self.checkpoint.is_some() {
@@ -63,6 +84,7 @@ impl Parser {
6384
self.is_parsing_flat_node = false;
6485
}
6586

87+
/// start a new node of `SyntaxKind`
6688
pub fn start_node(&mut self, kind: SyntaxKind) {
6789
self.inner.start_node(kind);
6890
}
@@ -83,6 +105,7 @@ impl Parser {
83105
self.start_node(kind);
84106
}
85107

108+
/// finish current node
86109
pub fn finish_node(&mut self) {
87110
self.inner.finish_node();
88111
}
@@ -123,14 +146,17 @@ impl Parser {
123146
}
124147
}
125148

149+
/// collects a SyntaxError with an `error` message at `range`
126150
pub fn error(&mut self, error: String, range: TextRange) {
127151
self.errors.push(SyntaxError::new(error, range));
128152
}
129153

154+
/// collects a pg_query `stmt` at `range`
130155
pub fn stmt(&mut self, stmt: NodeEnum, range: TextRange) {
131156
self.stmts.push(RawStmt { stmt, range });
132157
}
133158

159+
/// finish cstree and return `Parse`
134160
pub fn finish(self) -> Parse {
135161
let (tree, cache) = self.inner.finish();
136162
Parse {
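
To make the checkpoint and node lifecycle above concrete, a hypothetical grammar helper, a minimal sketch using only the methods documented in this diff (`SyntaxKind::SelectStmt` is an assumed variant name, and real grammar code would consume tokens between the calls):

```rust
// Hypothetical grammar helper; uses only methods documented above.
fn parse_select(p: &mut Parser) {
    p.set_checkpoint(false);              // remember the current depth
    p.start_node(SyntaxKind::SelectStmt); // open a node in the cst (assumed variant)
    // ... consume tokens and open/close child nodes here ...
    p.finish_node();                      // close the SelectStmt node
    p.close_checkpoint();                 // unwind back to the checkpoint depth
}
```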

crates/parser/src/pg_query_utils.rs

Lines changed: 3 additions & 2 deletions
@@ -1,8 +1,9 @@
11
use pg_query::NodeRef;
22

33
/// Gets the position value for a pg_query node
4-
/// can mostly be generated.
5-
/// there are some exceptions where the location on the node itself is not leftmost position, e.g. for AExpr.
4+
///
5+
/// This can mostly be generated by just returning `node.location` if the type has the property,
6+
/// but there are some exceptions where the location on the node itself is not leftmost position, e.g. for `AExpr`.
67
pub fn get_position_for_pg_query_node(node: &NodeRef) -> i32 {
78
match node {
89
NodeRef::ResTarget(n) => n.location,

crates/parser/src/source_file.rs

Lines changed: 11 additions & 7 deletions
@@ -1,14 +1,14 @@
1-
/// A super simple lexer for sql files to work around the main weakness of pg_query.rs.
2-
///
3-
/// pg_query.rs only parses valid statements, and also fail to parse all statements if any contain
4-
/// syntax errors. To circumvent this, we use a lexer to split the input into statements, and then
5-
/// parse each statement individually.
6-
///
7-
/// This lexer does the split.
81
use logos::Logos;
92

103
use crate::{parser::Parser, syntax_kind::SyntaxKind};
114

5+
/// A super simple lexer for sql files that splits the input into individual statements and
6+
/// comments.
7+
///
8+
/// pg_query.rs only parses valid statements, and fails to parse all statements if any of them contains a syntax error.
9+
/// To circumvent this, we use a lexer to split the input into statements, and then parse each statement individually.
10+
///
11+
/// This regex-based lexer does the split.
1212
#[derive(Logos, Debug, PartialEq)]
1313
#[logos(skip r"[ \t\f]+")] // Ignore this regex pattern between tokens
1414
pub enum SourceFileToken {
@@ -21,6 +21,10 @@ pub enum SourceFileToken {
2121
}
2222

2323
impl Parser {
24+
/// Parse a source file
25+
///
26+
/// TODO: rename to `parse_source_at(text: &str, at: Option<u32>)`, and allow parsing substatements, e.g. bodies of create
27+
/// function statements.
2428
pub fn parse_source_file(&mut self, text: &str) {
2529
let mut lexer = SourceFileToken::lexer(text);
2630
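
For readers unfamiliar with Logos, a standalone sketch of the statement-splitting idea, assuming logos 0.13 (where the lexer yields `Result`s); the token names and regexes here are illustrative, not the crate's actual `SourceFileToken` definitions:

```rust
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\f]+")] // skip horizontal whitespace between tokens
enum Token {
    #[regex(r"[a-zA-Z0-9_][^;\n]*")]
    Statement,
    #[token(";")]
    Semicolon,
    #[regex(r"\n+")]
    Newline,
}

fn main() {
    let mut lex = Token::lexer("select 1;\nselect 2;");
    while let Some(token) = lex.next() {
        // Each statement comes out as one slice; semicolons and newlines
        // are separate tokens, so positions in the file are preserved.
        println!("{:?}: {:?}", token, lex.slice());
    }
}
```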

crates/parser/src/statement.rs

Lines changed: 21 additions & 11 deletions
@@ -1,8 +1,3 @@
1-
/// A super simple lexer for sql statements to work around a weakness of pg_query.rs.
2-
///
3-
/// pg_query.rs only parses valid statements, and no whitespaces or newlines.
4-
/// To circumvent this, we use a very simple lexer that just knows what kind of characters are
5-
/// being used. all words are put into the "Word" type and will be defined in more detail by the results of pg_query.rs
61
use cstree::text::{TextRange, TextSize};
72
use logos::Logos;
83
use regex::Regex;
@@ -11,12 +6,11 @@ use crate::{
116
parser::Parser, pg_query_utils::get_position_for_pg_query_node, syntax_kind::SyntaxKind,
127
};
138

14-
#[derive(Logos, Debug, PartialEq)]
15-
pub enum Test {
16-
#[regex("'([^']+)'|\\$(\\w)?\\$.*\\$(\\w)?\\$")]
17-
Sconst,
18-
}
19-
9+
/// A super simple lexer for sql statements.
10+
///
11+
/// One weakness of pg_query.rs is that it does not parse whitespace or newlines. To circumvent
12+
/// this, we use a very simple lexer that just knows what kind of characters are being used. It
13+
/// does not know anything about postgres syntax or keywords. For example, all words such as `select` and `from` are put into the `Word` type.
2014
#[derive(Logos, Debug, PartialEq)]
2115
pub enum StatementToken {
2216
// copied from protobuf::Token. can be generated later
@@ -110,6 +104,22 @@ impl StatementToken {
110104
}
111105

112106
impl Parser {
107+
/// The main entry point for parsing a statement `text`. `at_offset` is the offset of the statement in the source file.
108+
///
109+
/// On a high level, the algorithm works as follows:
110+
/// 1. Parse the statement with pg_query.rs and order nodes by their position. If the
111+
/// statement contains syntax errors, the parser will report the error and continue to work without information
112+
/// about the nodes. The result will be a flat list of tokens under the generic `Stmt` node.
113+
/// If successful, the first node in the ordered list will be the main node of the statement,
114+
/// and serves as a root node.
115+
/// 2. Scan the statements for tokens with pg_query.rs. This will never fail, even if the statement contains syntax errors.
116+
/// 3. Parse the statement with the `StatementToken` lexer. The lexer will be the main vehicle
117+
/// while walking the statement.
118+
/// 4. Walk the statement with the `StatementToken` lexer.
119+
/// - at every token, consume all nodes that are within the token's range.
120+
/// - if there is a pg_query token within the token's range, consume it. if not, fallback to
121+
/// the StatementToken. This is the case for e.g. whitespace.
122+
/// 5. Close all open nodes for that statement.
113123
pub fn parse_statement(&mut self, text: &str, at_offset: Option<u32>) {
114124
let offset = at_offset.unwrap_or(0);
115125
let range = TextRange::new(
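
A condensed sketch of the five steps documented on `parse_statement` above, assumed to live inside this crate so that `Parser` is in scope; steps 2–5 are only outlined in comments:

```rust
use cstree::text::{TextRange, TextSize};

// Condensed sketch of the five steps; cst-building details are elided.
fn parse_statement_sketch(parser: &mut Parser, text: &str, at_offset: Option<u32>) {
    let offset = at_offset.unwrap_or(0);
    let range = TextRange::new(
        TextSize::from(offset),
        TextSize::from(offset + text.len() as u32),
    );

    // 1. Full parse with pg_query.rs; on error, record it and fall back to
    //    a flat token list under a generic `Stmt` node.
    if let Err(e) = pg_query::parse(text) {
        parser.error(e.to_string(), range);
    }

    // 2. Scan the statement for tokens (never fails, even on broken input).
    // 3. Tokenize with the `StatementToken` lexer.
    // 4. Walk the statement, preferring pg_query tokens and falling back to
    //    the StatementToken (e.g. for whitespace).
    // 5. Close all nodes that are still open for this statement.
}
```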
