Tight binding of SwiLex to SwiParse makes some jobs difficult. #1

Open
dabrahams opened this issue Mar 23, 2021 · 1 comment

For example, I want to keep track of source location (line, column, etc.) and bundle that in with my tokens. One way I can do that is to have no separators and recognize whitespace explicitly:

  case horizontal_whitespace = #"[ \t\r]+"#
  case newlines =              #"\n+"#
  case illegal =               #"."#
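
(For reference, PositionInSourceFile and MyToken are nothing fancy; roughly the following, with the exact details unimportant:)

// Rough sketch of the supporting types used in tokenize below; the exact
// definitions don't matter, only that positions are simple, comparable
// value types.
struct PositionInSourceFile: Comparable {
  var line: Int
  var column: Int

  static func < (l: Self, r: Self) -> Bool {
    (l.line, l.column) < (r.line, r.column)
  }
}

struct MyToken {
  let kind: Terminal                          // the SwiLexable terminal kind
  let text: Substring
  let location: Range<PositionInSourceFile>   // half-open source range
}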

and then filter the tokens myself:

/// Returns a `MyToken` for each lexically significant unit in `sourceText`.
func tokenize(sourceText: String) -> [MyToken] {
  var r: [MyToken] = []
  var tokenStart = PositionInSourceFile(line: 1, column: 1)
  var scanner = SwiLex<Terminal>()
  
  for t in try! scanner.lex(input: sourceText) {
    switch t.type {
    case .one_line_comment, .eof, .none:
      () // ignored
    case .horizontal_whitespace:
      tokenStart.column += t.value.count
    case .newlines:
      tokenStart.column = 1
      tokenStart.line += t.value.utf8.count
    default:
      var tokenEnd = tokenStart
      tokenEnd.column += t.value.count
      r.append(
        MyToken(kind: t.type, text: t.value, location: tokenStart..<tokenEnd))
      tokenStart = tokenEnd // advance past this token before the next one
    }
  }
  return r 
}

But there's really no way to interpose any of this into the process. Ideally the parser would be able to operate on a generic token type and could take a token stream rather than a string as input.

I can probably get out of this particular jam by mapping substring indices back into source locations, but it's quite awkward, and I'm pretty sure people will want to make their own lexers.
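
Concretely, the kind of thing I have in mind (all names made up just to illustrate the shape; none of this exists in SwiParse today) is something like:

import SwiLex

// Hypothetical: the parser is generic over any token type that can expose
// its terminal kind, so a hand-written lexer (like tokenize above) can
// feed it directly.
protocol TokenProtocol {
  associatedtype Kind: SwiLexable
  var kind: Kind { get }
  var text: Substring { get }
}

// parse would then accept any token sequence rather than a raw String:
//
//   func parse<Tokens: Sequence>(tokens: Tokens) throws -> Output?
//     where Tokens.Element: TokenProtocol, Tokens.Element.Kind == Terminal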

@yassram (Owner) commented Mar 26, 2021

This kind of information (line number, column number) will be added to the Token structure in the next versions; it is important for error localisation in the parser.

SwiLex's default Token type uses Substrings (rather than Strings) as values; the goal was to keep a relation (a view of the base string with its own indices) between the given input and the tokens (in both the Lexer and the Parser).
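
For illustration (a small standalone example, not SwiLex code), a Substring keeps a view of its base string, so offsets into the original input can always be recovered later:

let source = "let x = 42"
let token = source.dropFirst(4).prefix(1)            // the Substring "x"

print(token.base == source)                          // true: same backing String
print(source.distance(from: source.startIndex,
                      to: token.startIndex))         // 4: the offset is recoverable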

As you mentioned, this solution cannot be used with SwiParse since the lexing is done manually, and using a Lexer without separators adds some unnecessary constraints.

I suggest the following solution:

First, we need the indices of all the \n characters in the input string:

// the indices of all the newline characters in the base string
let newlineIndices = input.indices.filter{ input[$0] == "\n" }

Then, with position(of: token), we can get the line and column numbers (starting from (0, 0)) of a given token:

/// - Returns: A tuple of the line number and the column number of 
/// a token in its base string.
func position(of token: Substring) -> (Int, Int) {
    // the index of the previous newline character or -1 if no
    // newline is found before the token
    let previousNewlineIndex = previousIndex(of: token.startIndex,
                                             in: newlineIndices)
    let lineStartIndex: String.Index

    if previousNewlineIndex >= 0 {
        // start counting the column number from the character after
        // the previous newline character in the base string
        lineStartIndex = token.base.index(after: newlineIndices[previousNewlineIndex])
    } else {
        // If no previous newline is found then the column number
        // counting starts from the beginning of the base string
        lineStartIndex = token.base.startIndex
    }

    let line = previousNewlineIndex + 1
    let column = token.base.distance(from: lineStartIndex,
                                     to: token.startIndex)
    return (line, column)
}

/// Binary search.
/// - Parameters:
///   - element: a `Comparable` item.
///   - array: array of `Comparable` items.
/// - Returns: the index of the element of `array` preceding `element`
/// or -1 if no element in `array` precedes it.
func previousIndex<T: Comparable>(of element: T, in array: [T]) -> Int {
    var beg = 0
    var end = array.count - 1
    while end >= beg {
        let mid = beg + (end - beg) / 2
        if array[mid] == element { return mid }
        else if array[mid] < element { beg = mid + 1 }
        else { end = mid - 1 }
    }
    return beg - 1
}
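
For example (a quick illustration), on a plain integer array the search behaves like this:

let values = [3, 8, 15]

print(previousIndex(of: 10, in: values))   // 1  (values[1] == 8 is the last element before 10)
print(previousIndex(of: 2, in: values))    // -1 (no element precedes 2)
print(previousIndex(of: 15, in: values))   // 2  (an exact match returns its own index)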

This solution can be used directly in the reduction actions of the SwiParse parser. In fact, each token is a terminal by default (since it was generated by the Lexer), so it will be reduced into a non-terminal using a grammar rule (and a reduction action). This reduction action takes the token's substring as a parameter, so the position (line, column) can be determined at that point (during parsing).

Here is a simple example using this solution. The parser here concatenates the positions (line number, column number) of each token into a String:

enum Terminal: String, SwiLexable {
    static var separators: Set<Character> = [" ", "\n"]

    case text = "[A-Za-z]*"
    case number = "[0-9]*"

    case eof
    case none
}

enum NonTerminal: SwiParsable {
    case start  // starting token

    case concat
}

let input = """
Hello
this is
1 test
"""

let newlineIndices = input.indices.filter{ input[$0] == "\n" }

func singlePosition(s: Substring) -> String { "\(position(of: s))" }
func concatPosition(s1: String, s2: Substring) -> String { s1 + "\(position(of: s2))" }

let parser = try SwiParse<Terminal, NonTerminal>(rules: [
    .start => [Word(.concat)], // accepting rule

    .concat => [Word(.text)] >>>> singlePosition,
    .concat => [Word(.number)] >>>> singlePosition,
    .concat => [Word(.concat), Word(.text)] >>>> concatPosition,
    .concat => [Word(.concat), Word(.number)] >>>> concatPosition,
])

let result = try parser.parse(input: input)

print(result)
// Optional("(0, 0)(1, 0)(1, 5)(2, 0)(2, 2)")

As mentioned before, this information (line number, column number, size, ...) will be in the Token struct in the next versions (especially for localised parsing errors). I just need to figure out how the user could interact with it (since the user only provides reduction actions on the substring, not on the Token itself). The goal is also to keep the API as simple as possible.
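
One possible direction (just a sketch of the idea, not a committed design and not something SwiParse provides today) would be to hand reduction actions a small value carrying the substring together with its position:

// Purely hypothetical sketch; names invented for illustration only.
struct TokenContext<Terminal> {
    let kind: Terminal
    let value: Substring
    let line: Int
    let column: Int
}

// A reduction action could then look like:
//
//   .concat => [Word(.text)] >>>> { (t: TokenContext<Terminal>) in "(\(t.line), \(t.column))" }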

An abstraction of the Lexer (with a generic Token), support for different parsers, debugging tools (with graphviz), visualisation of the AST, support for files, etc. are planned. I just need some free time (and maybe contributors 😉).
