
Efficient parsing of UTF-8 #41

Open
Plecra opened this issue Apr 15, 2021 · 3 comments
Labels
bug Incorrect behavior in the compiler feature Features that add something new to the language

@Plecra
Member

Plecra commented Apr 15, 2021

for char in self.remaining().chars() {
    // \n indicates a token, so it isn't 'whitespace'
    if !char.is_whitespace() || char == '\n' {
        break;
    }
    len += char.len_utf8();
}

Because UTF-8 encodes ASCII characters as single bytes that never appear inside a multi-byte sequence, a lexer that only needs to recognize ASCII characters like ' ' and '\t' can be implemented extra-efficiently by skipping UTF-8 decoding entirely. It's fine to simply scan directly through the bytes, and a crate like https://docs.rs/memchr can vectorize this scan.
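A minimal sketch of the byte-level approach described above (the function name is illustrative, not Passerine's actual API):

```rust
/// Counts the leading bytes that are inline whitespace (space or tab).
/// This is safe on UTF-8: ASCII byte values never occur inside a
/// multi-byte sequence, so scanning raw bytes cannot split a code point,
/// and the returned count is a valid byte offset into the &str.
fn skip_inline_whitespace(remaining: &str) -> usize {
    remaining
        .bytes()
        .take_while(|&b| b == b' ' || b == b'\t')
        .count()
}

fn main() {
    let src = "  \t  let x = 1\n";
    let len = skip_inline_whitespace(src);
    assert_eq!(len, 5);
    assert_eq!(&src[len..], "let x = 1\n");
    println!("skipped {} bytes", len);
}
```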

@slightknack
Member

slightknack commented Apr 15, 2021

Right now, the lexer is very inefficient, as it doesn't use a state machine. I think we should refactor it; this should be a viable optimization.

As for pulling in a crate like https://docs.rs/memchr: Passerine's core depends only on Rust's standard library and has no external dependencies. For this reason, if we do vectorize, I suggest we extract the library's core idea and implement it ourselves as needed.
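For what it's worth, the core idea behind memchr's portable fallback is a SWAR (SIMD-within-a-register) scan that tests eight bytes per iteration using only std. A rough sketch of that idea, not the crate's actual code, and without its alignment and platform-SIMD refinements:

```rust
/// Finds the first occurrence of `needle` in `haystack`, checking
/// 8 bytes per iteration with the classic "word has a zero byte" trick.
/// A sketch of the SWAR idea memchr builds on, not the crate's code.
fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    // Broadcast the needle into every byte of a u64.
    let broadcast = LO * u64::from(needle);

    let mut chunks = haystack.chunks_exact(8);
    let mut offset = 0;
    for chunk in &mut chunks {
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        // XOR zeroes out exactly the bytes equal to `needle`;
        // `(x - LO) & !x & HI` is nonzero iff some byte of x is zero.
        let x = word ^ broadcast;
        if (x.wrapping_sub(LO) & !x & HI) != 0 {
            // The match is somewhere in this chunk; locate it exactly.
            return chunk
                .iter()
                .position(|&b| b == needle)
                .map(|i| offset + i);
        }
        offset += 8;
    }
    // Handle the final partial chunk byte by byte.
    chunks
        .remainder()
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}

fn main() {
    assert_eq!(find_byte(b"hello world", b' '), Some(5));
    assert_eq!(find_byte(b"abc", b'z'), None);
    assert_eq!(find_byte(b"aaaaaaaab", b'b'), Some(8));
}
```

The real crate adds unaligned-load handling and dedicated SSE/AVX/NEON paths, but the word-at-a-time test above is the portable kernel worth extracting.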

Thanks for pointing this out! Much appreciated :D

@slightknack slightknack added bug Incorrect behavior in the compiler feature Features that add something new to the language labels Apr 15, 2021
@slightknack slightknack added this to the 0.10.0 milestone Jul 17, 2021
@slightknack
Member

@Plecra I recently completely redid the lexer in big-refactor. Would you mind taking a look at it and letting me know what style/performance improvements can be made? I suspect there's something with the way we're using peekable iterators that could be improved:

/// Parses the next token.
/// Expects all whitespace and comments to be stripped.
fn next_token(&mut self) -> Result<Spanned<Token>, Syntax> {
    let mut remaining = self.remaining().peekable();
    let (token, len) = match remaining.next().unwrap() {
        // separator
        c @ ('\n' | ';') => self.take_while(
            &mut once(c).chain(remaining).peekable(),
            |_| Token::Sep,
            |n| n.is_whitespace() || n == ';'
        ),
        // the unit type, `()`
        '(' if Some(')') == remaining.next() => {
            (Token::Lit(Lit::Unit), 2)
        },
        // Grouping
        '(' => (Token::Open(Delim::Paren), 1),
        '{' => (Token::Open(Delim::Curly), 1),
        '[' => (Token::Open(Delim::Square), 1),
        ')' => (Token::Close(Delim::Paren), 1),
        '}' => (Token::Close(Delim::Curly), 1),
        ']' => (Token::Close(Delim::Square), 1),
        // Label
        c if c.is_alphabetic() && c.is_uppercase() => {
            self.take_while(
                &mut once(c).chain(remaining).peekable(),
                |s| match s {
                    // TODO: In the future, booleans in prelude as ADTs
                    "True" => Token::Lit(Lit::Boolean(true)),
                    "False" => Token::Lit(Lit::Boolean(false)),
                    _ => Token::Label(s.to_string()),
                },
                |n| n.is_alphanumeric() || n == '_'
            )
        },
        // Iden
        c if c.is_alphabetic() || c == '_' => {
            self.take_while(
                &mut once(c).chain(remaining).peekable(),
                |s| Token::Iden(s.to_string()),
                |n| n.is_alphanumeric() || n == '_'
            )
        },
        // Number literal:
        // Integer: 28173908, etc.
        // Radix:   0b1011001011, 0xFF, etc.
        // Float:   420.69, 0.0, etc.
        c @ '0'..='9' => {
            if c == '0' {
                if let Some(n) = remaining.next() {
                    // Potentially integers in other radixes
                    self.radix_literal(n, remaining)?
                } else {
                    // End of source, must be just `0`
                    (Token::Lit(Lit::Integer(0)), 1)
                }
            } else {
                // parse decimal literal
                // this could be an integer
                // but also a floating point number
                self.decimal_literal(once(c).chain(remaining).peekable())?
            }
        }
        // String
        '"' => self.string(remaining)?,
        // TODO: choose characters for operator set
        // don't have both a list and `is_ascii_punctuation`
        // Op
        c if OP_CHARS.contains(c) => {
            self.take_while(
                &mut once(c).chain(remaining).peekable(),
                |s| Token::Op(s.to_string()),
                |n| OP_CHARS.contains(n),
            )
        },
        // Unrecognized char
        unknown => return Err(Syntax::error(
            &format!(
                "Hmm... The character `{}` is not recognized in this context - check for encoding issues or typos",
                unknown,
            ),
            &Span::point(&self.source, self.index),
        )),
    };
    let spanned =
        Spanned::new(token, Span::new(&self.source, self.index, len));
    self.index += len;
    Ok(spanned)
}
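For readers following along, `take_while` itself isn't shown above. A hypothetical helper with the shape it's called with might look like the following (names and signature are illustrative, not the actual big-refactor code):

```rust
use std::iter::Peekable;

/// A hypothetical sketch of a `take_while`-style lexer helper:
/// it consumes characters while `pred` holds, then hands the matched
/// text to `to_token` and returns the token with its byte length.
fn take_while<I, T>(
    chars: &mut Peekable<I>,
    to_token: impl Fn(&str) -> T,
    pred: impl Fn(char) -> bool,
) -> (T, usize)
where
    I: Iterator<Item = char>,
{
    let mut matched = String::new();
    while let Some(&c) = chars.peek() {
        if !pred(c) {
            break;
        }
        matched.push(c);
        chars.next();
    }
    // Byte length, so the caller can advance a byte index into the source.
    let len = matched.len();
    (to_token(&matched), len)
}

fn main() {
    let mut chars = "hello_world! rest".chars().peekable();
    let (iden, len) = take_while(
        &mut chars,
        |s| s.to_string(),
        |c| c.is_alphanumeric() || c == '_',
    );
    assert_eq!(iden, "hello_world");
    assert_eq!(len, 11);
}
```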

@Plecra
Member Author

Plecra commented Nov 14, 2022

It's awfully late, but I just saw this in my email, and I like it 😆. The token representation and parsing logic all flow very nicely. If parsing ever becomes a bottleneck, Passerine could revisit the once/chain/peekable pattern, since rustc will chug on it a bit.
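For context, the usual alternative to the `once(c).chain(remaining).peekable()` chains is a cursor that tracks a byte index into the source and re-slices it on demand, which avoids building a fresh iterator type per call site. A minimal sketch, with all names illustrative:

```rust
/// A minimal cursor-style lexer sketch: one byte index into the source
/// replaces per-token once/chain/peekable iterator chains.
struct Cursor<'a> {
    source: &'a str,
    index: usize,
}

impl<'a> Cursor<'a> {
    /// Looks at the next char without consuming it.
    fn peek(&self) -> Option<char> {
        self.source[self.index..].chars().next()
    }

    /// Advances while `pred` holds, returning the matched slice.
    fn eat_while(&mut self, pred: impl Fn(char) -> bool) -> &'a str {
        let start = self.index;
        while let Some(c) = self.peek() {
            if !pred(c) {
                break;
            }
            self.index += c.len_utf8();
        }
        &self.source[start..self.index]
    }
}

fn main() {
    let mut cursor = Cursor { source: "foo_bar + baz", index: 0 };
    let iden = cursor.eat_while(|c| c.is_alphanumeric() || c == '_');
    assert_eq!(iden, "foo_bar");
    assert_eq!(cursor.index, 7);
}
```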
