---
title: "Data Structures"
author: "Lorenz Walthert"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Structures}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
> This vignette is partly outdated since the nested structure has been
> fully implemented.

This vignette illustrates how the core of `styler` currently^[at commit `e6ddee0f510d3c9e3e22ef68586068fa5c6bc140`] works, i.e. how
rules are applied to a parse table and how limitations of this approach can be
overcome with a refined approach.

Status quo - the flat approach
------------------------------

Roughly speaking, a string containing code to be formatted is parsed
with `parse` and the output is passed to `getParseData` in order to
obtain a parse table with detailed information about every token. For a
simple example string
"`a <- function(x) { if(x > 1) { 1+1 } else {x} }`" to be formatted, the
parse table on which `styler` performs the manipulations looks similar
to the one presented below.

    library("styler")
    library("dplyr")
    code <- "a <- function(x) { if(x > 1) { 1+1 } else {x} }"
    (parse_table <- styler:::compute_parse_data_flat_enhanced(code))

    ## # A tibble: 24 x 14
    ##    line1  col1 line2  col2 token          text     terminal short newlines
    ##    <int> <int> <int> <int> <chr>          <chr>    <lgl>    <chr>    <int>
    ##  1     1     0     1     0 START                   NA       <NA>         0
    ##  2     1     1     1     1 SYMBOL         a        TRUE     a            0
    ##  3     1     3     1     4 LEFT_ASSIGN    <-       TRUE     <-           0
    ##  4     1     6     1    13 FUNCTION       function TRUE     funct        0
    ##  5     1    14     1    14 '('            (        TRUE     (            0
    ##  6     1    15     1    15 SYMBOL_FORMALS x        TRUE     x            0
    ##  7     1    16     1    16 ')'            )        TRUE     )            0
    ##  8     1    18     1    18 '{'            {        TRUE     {            0
    ##  9     1    20     1    21 IF             if       TRUE     if           0
    ## 10     1    22     1    22 '('            (        TRUE     (            0
    ## # ... with 14 more rows, and 5 more variables: lag_newlines <int>,
    ## #   spaces <int>, multi_line <lgl>, indention_ref_id <lgl>, indent <dbl>

The column `spaces` was computed from the columns `col1` and `col2`, and
`newlines` was computed from `line1` and `line2`.
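
How the position columns translate into `spaces` and `newlines` can be sketched with base R alone. The snippet below is a minimal illustration under stated assumptions, not `styler`'s implementation: it calls `utils::getParseData()` directly, keeps only the terminals, and derives both columns from the position of the following token. The variables `next_line1` and `next_col1` are names chosen for this example.

```r
code <- "a <- function(x) { if(x > 1) { 1+1 } else {x} }"
pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

# Keep only the terminals and put them in source order.
terminals <- pd[pd$terminal, ]
terminals <- terminals[order(terminals$line1, terminals$col1), ]
n <- nrow(terminals)

# Start position of the following token; the last token gets a dummy
# successor directly behind it, so that it receives zero spaces.
next_line1 <- c(terminals$line1[-1], terminals$line2[n])
next_col1 <- c(terminals$col1[-1], terminals$col2[n] + 1L)

# Line breaks and spaces *after* each token.
terminals$newlines <- next_line1 - terminals$line2
terminals$spaces <- ifelse(
  terminals$newlines == 0L,
  next_col1 - terminals$col2 - 1L,
  0L
)
```

Inspecting `terminals[, c("text", "spaces", "newlines")]` shows, for instance, zero spaces after `+` and one space after `<-` for the unstyled input.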
So far, styler can set the spaces around the operators correctly. In our
example, that involves adding spaces around `+`, so in the `spaces`
column, the entries for the first `1` and for `+` (rows 16 and 17 of the
full table) must be set to one. This means that a space is added after
`1` and after `+`. To get the spacing right and cover the various cases,
a set of functions has to be applied to the parse table successively (and
in the right order), which is essentially done via `Reduce()`. After all
modifications of the table are completed, `serialize_parse_data_flat()`
collapses the `text` column and inserts the number of spaces and line
breaks specified in `spaces` and `newlines` between the elements of
`text`. If we serialize our table without performing any modification, we
simply get back what we started with.

    styler:::serialize_parse_data_flat(parse_table)

    ## [1] "a <- function(x) { if(x > 1) { 1+1 } else {x} }"

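
The two steps described above, applying rules in order with `Reduce()` and then serializing, can be sketched on a tiny hand-built table. Everything in this snippet is hypothetical and only meant to illustrate the idea: the table `flat`, the two rule functions and `serialize_flat()` are not part of `styler`.

```r
# A hand-built flat table for "1+1" with only the columns needed here.
flat <- data.frame(
  token = c("NUM_CONST", "'+'", "NUM_CONST"),
  text = c("1", "+", "1"),
  spaces = c(0L, 0L, 0L),
  newlines = c(0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Hypothetical rules: each takes a table and returns a modified table.
add_space_after_op <- function(pd) {
  pd$spaces[pd$token == "'+'"] <- 1L
  pd
}
add_space_before_op <- function(pd) {
  before <- which(pd$token == "'+'") - 1L
  pd$spaces[before[before > 0L]] <- 1L
  pd
}
rules <- list(add_space_before_op, add_space_after_op)

# Apply the rules one after another, in list order.
styled <- Reduce(function(pd, rule) rule(pd), rules, flat)

# Collapse `text`, inserting the requested line breaks or spaces
# after each token.
serialize_flat <- function(pd) {
  sep <- ifelse(
    pd$newlines > 0L,
    strrep("\n", pd$newlines),
    strrep(" ", pd$spaces)
  )
  paste0(pd$text, sep, collapse = "")
}

serialize_flat(flat)   # "1+1"
serialize_flat(styled) # "1 + 1"
```

Keeping each rule as a function from table to table is what makes the `Reduce()` formulation work: the order of `rules` is the order of application.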

Refining the flat approach - nesting the parse table
----------------------------------------------------

Although the flat approach is a good place to start, e.g. for fixing
spaces between operators, it has its limitations. In particular, it
treats every token the same way: it does not account for the context of
the token, i.e. the sub-expression in which it appears. To set the
indention correctly, we need a hierarchical view of the parse data, since
all tokens in a sub-expression have the same indention level. Hence, a
natural approach is to create a nested parse table instead of a flat one
and then recurse over all elements in the table. For each sub(-sub
etc.)-expression, a separate parse table is created, and the
modifications are applied to this table before everything is put back
together. A function to create a nested parse table already exists in
`styler`. Let's have a look at the top level:

    (l1 <- styler:::compute_parse_data_nested(code)[-1])

    ## # A tibble: 1 x 13
    ##    col1 line2  col2    id parent token terminal text  short token_before
    ##   <int> <int> <int> <int>  <int> <chr> <lgl>    <chr> <chr> <chr>
    ## 1     1     1    47    49      0 expr  FALSE                <NA>
    ## # ... with 3 more variables: token_after <chr>, internal <lgl>,
    ## #   child <list>

The tibble contains the column `child`, which itself contains a tibble.
If we "enter" the first child, we can see that the expression was split
up further.

    l1$child[[1]] %>%
      select(text, terminal, child, token)

    ## # A tibble: 3 x 4
    ##   text  terminal child             token
    ##   <chr> <lgl>    <list>            <chr>
    ## 1       FALSE    <tibble [1 x 14]> expr
    ## 2 <-    TRUE     <NULL>            LEFT_ASSIGN
    ## 3       FALSE    <tibble [5 x 14]> expr

And further...

    l1$child[[1]]$child[[3]]$child[[5]]

    ## # A tibble: 3 x 14
    ##   line1  col1 line2  col2    id parent token terminal text  short
    ##   <int> <int> <int> <int> <int>  <int> <chr> <lgl>    <chr> <chr>
    ## 1     1    18     1    18     9     45 '{'   TRUE     {     {
    ## 2     1    20     1    45    42     45 expr  FALSE
    ## 3     1    47     1    47    40     45 '}'   TRUE     }     }
    ## # ... with 4 more variables: token_before <chr>, token_after <chr>,
    ## #   internal <lgl>, child <list>

... and so on. Every child that is not a terminal contains another
tibble where the sub-expression is split up further - until we are left
with tibbles that only contain terminals.
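
The nesting itself can be illustrated without `styler`. The following is a minimal sketch, assuming that the `parent` column returned by `utils::getParseData()` is enough to rebuild the hierarchy; the function `nest()` is a hypothetical helper written for this example and not the implementation behind `compute_parse_data_nested()`.

```r
code <- "a <- function(x) { if(x > 1) { 1+1 } else {x} }"
pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

# Hypothetical helper: return the rows whose parent is `parent_id` and
# attach, for every non-terminal row, its own nested table in `child`.
nest <- function(pd, parent_id = 0L) {
  kids <- pd[pd$parent == parent_id, ]
  kids <- kids[order(kids$line1, kids$col1), ]
  kids$child <- lapply(seq_len(nrow(kids)), function(i) {
    if (kids$terminal[i]) NULL else nest(pd, kids$id[i])
  })
  kids
}

top <- nest(pd)
second <- top$child[[1]]
second$token # "expr" "LEFT_ASSIGN" "expr"
```

As in the output shown earlier, the top level holds a single `expr`, and its child table has three rows, of which only `LEFT_ASSIGN` is a terminal without a child.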
Recall the above example,
`a <- function(x) { if(x > 1) { 1+1 } else {x} }`. In the last printed
parse table, we can see that the whole `if` statement is a
sub-expression of `code`, surrounded by two curly brackets. Hence, one
would like to set the indention level for this sub-expression before
doing anything with it in more detail. Later, when we progress deeper
into the nested table, we hit a similar pattern:

    l1$child[[1]]$child[[3]]$child[[5]]$child[[2]]$child[[5]]

    ## # A tibble: 3 x 14
    ##   line1  col1 line2  col2    id parent token terminal text  short
    ##   <int> <int> <int> <int> <int>  <int> <chr> <lgl>    <chr> <chr>
    ## 1     1    30     1    30    20     30 '{'   TRUE     {     {
    ## 2     1    32     1    34    27     30 expr  FALSE
    ## 3     1    36     1    36    26     30 '}'   TRUE     }     }
    ## # ... with 4 more variables: token_before <chr>, token_after <chr>,
    ## #   internal <lgl>, child <list>

Again, we have two curly brackets and an expression inside. We would
like to set the indention level for the expression `1+1` in the same way
as for the whole `if` statement.

The simple example above makes it evident that a recursive approach to
this problem is the most natural one.

The function below sketches the idea and illustrates such a recursion.
It takes a nested parse table as input and recurses over all children.
If a child is a terminal, it returns the text; otherwise, it "enters"
the child to find the terminals inside it and returns them.
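
    # Recursively collect the text of all terminals in a nested parse table.
    serialize <- function(x) {
      out <- Map(
        function(terminal, text, child) {
          if (terminal) {
            # a terminal carries its own text
            text
          } else {
            # a non-terminal: descend into its child table
            serialize(child)
          }
        },
        x$terminal, x$text, x$child
      )
      out
    }
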

    x <- styler:::compute_parse_data_nested(code)
    serialize(x) %>% unlist()

    ##  [1] "a"        "<-"       "function" "("        "x"        ")"
    ##  [7] "{"        "if"       "("        "x"        ">"        "1"
    ## [13] ")"        "{"        "1"        "+"        "1"        "}"
    ## [19] "else"     "{"        "x"        "}"        "}"

How exactly to implement a similar recursion that returns not each text
token separately but the styled text as one string (or one string per
line) is subject to future work, as are the functions to be applied to a
sub-expression parse table to create the correct indention. Similar to
`compute_parse_data_flat_enhanced`, `compute_parse_data_nested` would
need to compute the columns `spaces` and `newlines`, as well as a new
column `indention`.
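
One possible shape of such a recursion is sketched below. It is an illustration under simplifying assumptions, not the planned `styler` implementation: it works directly on the output of `utils::getParseData()`, treats the brace nesting depth as the indention level, and only returns a table of terminals with an `indention` column instead of a styled string. The function `walk()` is a hypothetical helper.

```r
code <- "a <- function(x) { if(x > 1) { 1+1 } else {x} }"
pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

# Hypothetical helper: visit all tokens below `id` in source order and
# record the brace depth at which every terminal sits.
walk <- function(id, depth = 0L) {
  kids <- pd[pd$parent == id, ]
  kids <- kids[order(kids$line1, kids$col1), ]
  # A sub-expression delimited by curly brackets indents its contents.
  has_braces <- any(kids$token == "'{'")
  rows <- lapply(seq_len(nrow(kids)), function(i) {
    kid <- kids[i, ]
    if (kid$terminal) {
      data.frame(text = kid$text, indention = depth, stringsAsFactors = FALSE)
    } else {
      walk(kid$id, if (has_braces) depth + 1L else depth)
    }
  })
  do.call(rbind, rows)
}

result <- walk(0L)
result[result$text %in% c("if", "+", "else"), ]
```

In this sketch the brackets themselves stay at the outer level, while everything between them is one level deeper, so `if` and `else` end up at level one and `+` at level two.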

Final Remarks
-------------

Although a flat structure would possibly also allow us to solve the
problem of indention, it is a less elegant and less flexible solution.
It would involve looking for an opening curly bracket in the parse
table, setting the indention level for all subsequent rows until the
next opening or closing curly bracket is hit, and then indenting one
level further or setting the indention back to where it was at the
beginning of the table.
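
For curly brackets only, the flat variant described above can be written compactly with running counts of opening and closing brackets. This is a minimal sketch on the output of `utils::getParseData()`, not code from `styler`; the column name `indention` is chosen to match the terminology of this vignette.

```r
code <- "a <- function(x) { if(x > 1) { 1+1 } else {x} }"
pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

# Terminals in source order.
terminals <- pd[pd$terminal, ]
terminals <- terminals[order(terminals$line1, terminals$col1), ]

opening <- terminals$token == "'{'"
closing <- terminals$token == "'}'"

# Brackets opened so far minus brackets closed so far. Subtracting
# `opening` keeps an opening bracket itself at the outer level.
terminals$indention <- cumsum(opening) - cumsum(closing) - opening

terminals[terminals$text %in% c("if", "+", "else"), c("text", "indention")]
```

The sketch shows why the flat approach becomes awkward: every further token that triggers indention needs its own counting rule, and the rules interact with one another.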
Note that the vignette just addressed the question of indention caused
by curly brackets and has not dealt with other operators that would
trigger indention, such as `(` or `+`.