
Efficient parsing of UTF-8 #41

Open
Plecra opened this issue Apr 15, 2021 · 3 comments
Labels
bug Incorrect behavior in the compiler feature Features that add something new to the language

@Plecra
Member

Plecra commented Apr 15, 2021

for char in self.remaining().chars() {
    // \n indicates a token, so it isn't 'whitespace'
    if !char.is_whitespace() || char == '\n' {
        break;
    }
    len += char.len_utf8();
}

Because UTF-8 encodes ASCII characters as single bytes that never appear inside a multi-byte sequence, a lexer that only needs to recognize ASCII characters like ' ' and '\t' can be implemented extra-efficiently by skipping UTF-8 decoding entirely. It's fine to simply scan directly through the bytes, and a crate like https://docs.rs/memchr can vectorize this scan.
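A minimal sketch of the byte-level approach described above (the function name is illustrative, not Passerine's actual API):

```rust
/// Counts the leading bytes that are inline whitespace (space or tab).
/// This is safe on UTF-8: ASCII byte values never occur inside a
/// multi-byte sequence, so scanning raw bytes cannot split a code point,
/// and the returned count is a valid byte offset into the &str.
fn skip_inline_whitespace(remaining: &str) -> usize {
    remaining
        .bytes()
        .take_while(|&b| b == b' ' || b == b'\t')
        .count()
}

fn main() {
    let src = "  \t  let x = 1\n";
    let len = skip_inline_whitespace(src);
    assert_eq!(len, 5);
    assert_eq!(&src[len..], "let x = 1\n");
    println!("skipped {} bytes", len);
}
```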

@slightknack
Member

slightknack commented Apr 15, 2021

Right now, the lexer is very inefficient, as it doesn't use a state machine. I think we should refactor it; this should be a viable optimization.

As for pulling in a crate like https://docs.rs/memchr: Passerine's core depends only on Rust's standard library and has no external dependencies. For this reason, if we do vectorize, I suggest we extract the library's core idea and implement it ourselves as needed.
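For what it's worth, the core idea behind memchr's portable fallback is a SWAR (SIMD-within-a-register) scan that tests eight bytes per iteration using only std. A rough sketch of that idea, not the crate's actual code, and without its alignment and platform-SIMD refinements:

```rust
/// Finds the first occurrence of `needle` in `haystack`, checking
/// 8 bytes per iteration with the classic "word has a zero byte" trick.
/// A sketch of the SWAR idea memchr builds on, not the crate's code.
fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    // Broadcast the needle into every byte of a u64.
    let broadcast = LO * u64::from(needle);

    let mut chunks = haystack.chunks_exact(8);
    let mut offset = 0;
    for chunk in &mut chunks {
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        // XOR zeroes out exactly the bytes equal to `needle`;
        // `(x - LO) & !x & HI` is nonzero iff some byte of x is zero.
        let x = word ^ broadcast;
        if (x.wrapping_sub(LO) & !x & HI) != 0 {
            // The match is somewhere in this chunk; locate it exactly.
            return chunk
                .iter()
                .position(|&b| b == needle)
                .map(|i| offset + i);
        }
        offset += 8;
    }
    // Handle the final partial chunk byte by byte.
    chunks
        .remainder()
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}

fn main() {
    assert_eq!(find_byte(b"hello world", b' '), Some(5));
    assert_eq!(find_byte(b"abc", b'z'), None);
    assert_eq!(find_byte(b"aaaaaaaab", b'b'), Some(8));
}
```

The real crate adds unaligned-load handling and dedicated SSE/AVX/NEON paths, but the word-at-a-time test above is the portable kernel worth extracting.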

Thanks for pointing this out! Much appreciated :D

@slightknack slightknack added bug Incorrect behavior in the compiler feature Features that add something new to the language labels Apr 15, 2021
@slightknack slightknack added this to the 0.10.0 milestone Jul 17, 2021
@slightknack
Member

@Plecra I recently completely redid the lexer in big-refactor. Would you mind taking a look at it and letting me know what style/performance improvements can be made? I suspect there's something with the way we're using peekable iterators that could be improved:

/// Parses the next token.
/// Expects all whitespace and comments to be stripped.
fn next_token(&mut self) -> Result<Spanned<Token>, Syntax> {
    let mut remaining = self.remaining().peekable();
    let (token, len) = match remaining.next().unwrap() {
        // separator
        c @ ('\n' | ';') => self.take_while(
            &mut once(c).chain(remaining).peekable(),
            |_| Token::Sep,
            |n| n.is_whitespace() || n == ';'
        ),
        // the unit type, `()`
        '(' if Some(')') == remaining.next() => {
            (Token::Lit(Lit::Unit), 2)
        },
        // Grouping
        '(' => (Token::Open(Delim::Paren), 1),
        '{' => (Token::Open(Delim::Curly), 1),
        '[' => (Token::Open(Delim::Square), 1),
        ')' => (Token::Close(Delim::Paren), 1),
        '}' => (Token::Close(Delim::Curly), 1),
        ']' => (Token::Close(Delim::Square), 1),
        // Label
        c if c.is_alphabetic() && c.is_uppercase() => {
            self.take_while(
                &mut once(c).chain(remaining).peekable(),
                |s| match s {
                    // TODO: In the future, booleans in prelude as ADTs
                    "True" => Token::Lit(Lit::Boolean(true)),
                    "False" => Token::Lit(Lit::Boolean(false)),
                    _ => Token::Label(s.to_string()),
                },
                |n| n.is_alphanumeric() || n == '_'
            )
        },
        // Iden
        c if c.is_alphabetic() || c == '_' => {
            self.take_while(
                &mut once(c).chain(remaining).peekable(),
                |s| Token::Iden(s.to_string()),
                |n| n.is_alphanumeric() || n == '_'
            )
        },
        // Number literal:
        // Integer: 28173908, etc.
        // Radix:   0b1011001011, 0xFF, etc.
        // Float:   420.69, 0.0, etc.
        c @ '0'..='9' => {
            if c == '0' {
                if let Some(n) = remaining.next() {
                    // Potentially integers in other radixes
                    self.radix_literal(n, remaining)?
                } else {
                    // End of source, must be just `0`
                    (Token::Lit(Lit::Integer(0)), 1)
                }
            } else {
                // parse decimal literal
                // this could be an integer
                // but also a floating point number
                self.decimal_literal(once(c).chain(remaining).peekable())?
            }
        }
        // String
        '"' => self.string(remaining)?,
        // TODO: choose characters for operator set
        // don't have both a list and `is_ascii_punctuation`
        // Op
        c if OP_CHARS.contains(c) => {
            self.take_while(
                &mut once(c).chain(remaining).peekable(),
                |s| Token::Op(s.to_string()),
                |n| OP_CHARS.contains(n),
            )
        },
        // Unrecognized char
        unknown => return Err(Syntax::error(
            &format!(
                "Hmm... The character `{}` is not recognized in this context - check for encoding issues or typos",
                unknown,
            ),
            &Span::point(&self.source, self.index),
        )),
    };
    let spanned =
        Spanned::new(token, Span::new(&self.source, self.index, len));
    self.index += len;
    Ok(spanned)
}
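For readers following along, `take_while` itself isn't shown above. A hypothetical helper with the shape it's called with might look like the following (names and signature are illustrative, not the actual big-refactor code):

```rust
use std::iter::Peekable;

/// A hypothetical sketch of a `take_while`-style lexer helper:
/// it consumes characters while `pred` holds, then hands the matched
/// text to `to_token` and returns the token with its byte length.
fn take_while<I, T>(
    chars: &mut Peekable<I>,
    to_token: impl Fn(&str) -> T,
    pred: impl Fn(char) -> bool,
) -> (T, usize)
where
    I: Iterator<Item = char>,
{
    let mut matched = String::new();
    while let Some(&c) = chars.peek() {
        if !pred(c) {
            break;
        }
        matched.push(c);
        chars.next();
    }
    // Byte length, so the caller can advance a byte index into the source.
    let len = matched.len();
    (to_token(&matched), len)
}

fn main() {
    let mut chars = "hello_world! rest".chars().peekable();
    let (iden, len) = take_while(
        &mut chars,
        |s| s.to_string(),
        |c| c.is_alphanumeric() || c == '_',
    );
    assert_eq!(iden, "hello_world");
    assert_eq!(len, 11);
}
```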

@Plecra
Member Author

Plecra commented Nov 14, 2022

It's awfully late, but I just saw this in my email, and I like it 😆. The token representation and parsing logic all flow very nicely. If parsing ever becomes a bottleneck, Passerine could revisit the once/chain/peekable pattern, since rustc will chug on it a bit.
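For context, the usual alternative to the `once(c).chain(remaining).peekable()` chains is a cursor that tracks a byte index into the source and re-slices it on demand, which avoids building a fresh iterator type per call site. A minimal sketch, with all names illustrative:

```rust
/// A minimal cursor-style lexer sketch: one byte index into the source
/// replaces per-token once/chain/peekable iterator chains.
struct Cursor<'a> {
    source: &'a str,
    index: usize,
}

impl<'a> Cursor<'a> {
    /// Looks at the next char without consuming it.
    fn peek(&self) -> Option<char> {
        self.source[self.index..].chars().next()
    }

    /// Advances while `pred` holds, returning the matched slice.
    fn eat_while(&mut self, pred: impl Fn(char) -> bool) -> &'a str {
        let start = self.index;
        while let Some(c) = self.peek() {
            if !pred(c) {
                break;
            }
            self.index += c.len_utf8();
        }
        &self.source[start..self.index]
    }
}

fn main() {
    let mut cursor = Cursor { source: "foo_bar + baz", index: 0 };
    let iden = cursor.eat_while(|c| c.is_alphanumeric() || c == '_');
    assert_eq!(iden, "foo_bar");
    assert_eq!(cursor.index, 7);
}
```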
