Why do you need another lexical analyzer for Java? The different compilers and parsers already do lexical parsing along with syntax analysis. So what is the advantage of doing it again?
The answer is that JavaLex is not simply a lexical analyzer. It also includes tools to support the analysis of the lexical elements. At the bottom, it does the unavoidable lexical analysis, but this is a thing that is already implemented in many other tools. However, the list of the tokens, which is usually the result of the lexical analysis, does not get fed into a syntax analyzer but can be
-
queried,
-
modified, and
-
compared.
You can create patterns similar to regular expressions to match or find certain parts of the code. You can replace specified tokens or regions. Last but not least, you can compare two different source code files deciding whether they are the same. The comparison can be made
-
ignoring the spaces and
-
comments,
-
treating multi-line and single-line strings as the same if their content matches, and
-
ignoring differences sourcing from
\u
escape sequences.
which is a feature that is not available in the standard Java tools. This feature is heavily used in code generation tools, like Java::Geci, that need to know if the code generated was already there or has been changed.
The library in this project can
-
perform lexical analysis on Java source code
-
change the tokenized code
-
find code segments using pattern matching
-
convert the tokenized code back to Java source code.
The code was initially written for the project Java::Geci as an experiment to support Java code generation depending on the lexical elements instead of the plain text of the source. Later, the code was moved to a separate library for use in other projects.
In this section, we will look at a simple list matching. Through this example, we will learn
-
how to create a
JavaLexed
object, which is the tokenized Java code, -
how to match a list of lexical elements, and
-
how to get the result of the match.
The example is as follows.
final var source = "private final int z = 13;\npublic var h = \"kkk\"";
try (final var javaLexed = new JavaLexed(source)) {
final var result = javaLexed.match(match("private final int")).fromIndex(0).result();
Assertions.assertTrue(result.matches);
Assertions.assertEquals(0, result.start);
Assertions.assertEquals(5, result.end);
}
Note
|
You can find the examples in the TestMatching.java unit test file.
|
The Java source file has to be given in a string.
When the object javaLexed
is created, the source is tokenized, and the tokens are stored in the javaLexed
object.
The javaLexed
object is used to find code segments using pattern matching and to replace parts of the code if needed.
If there is any change in the structure, then the javaLexed
object can be used to convert the tokenized code back to Java source code.
To do that, the javaLexed
object must first be closed, as we will see in later examples.
In this example, closing is implicitly done with the try-with-resource statement.
In this example, we check that the source code contains a list of lexical elements.
It is done by calling the match
method on the javaLexed
object.
The match
method argument is a match expression.
When the method fromIndex()
is invoked, a match operation will start from the given index.
It will result in a matching success if the lexical elements in that index match the given match expression.
There is a pair of the method match()
, find()
.
It does the same; however, it succeeds if there is any match in the source code following the given index.
The index is the serial number of the different lexical elements we will see later. Now the index is zero, which intuitively means the beginning of the source code.
The method result()
returns the result of the match operation.
This MatchResult
object contains three fields:
-
matches
is a boolean value indicating whether the match was successful. -
start
is the index of the first lexical element of the match. -
end
is the index of the first lexical element after the match.
In this example, the match was successful, so the matches
field is true
.
The match started at index 0 and ended at index 4.
It may seem unintuitive because private final int
is only three lexical elements; how can it end at index four and not 2 (starting from zero)?
The reason is that the lexical analysis keeps the spaces in the list of lexical elements.
One or more space is treated as a single lexical element.
That way, the match is as follows:
-
0 private
-
1 SPACE
-
2 final
-
3 SPACE
-
4 int
Finally, the field end
is 5, the index of the first lexical element after the match.
There may not be a lexical element with that index if the match ends at the end of the source.
The last thing we have not discussed is the match expression.
The match expression in this example is straightforward.
It is created by calling a static method match
from the class LexpressionBuilder
.
It is a utility class that contains a lot of static methods to create match expressions.
The recommended way is to import these static methods so that the code is more readable using the
import static javax0.javalex.LexpressionBuilder.*;
import line in your class.
Though the method match()
has the same name as the one in the class JavaLexed
, it is a different method.
It takes a string that it converts to a list of lexical elements.
The match will be successful if the elements, one by one, match the lexical elements in the source code one after the other.
This example taught us the basics of matching lexical patterns on a Java source code. In the next section, we will see how we can consider spaces.
In this section, we will get accustomed to the method find()
, the pair of match()
in the class JavaLexed
, and tell the matcher to consider spaces.
The sample unit test code doing just that is the following:
final var source = "private final int z = 13;\npublic var /*comment*/h = \"kkk\"";
try (final var javaLexed = new JavaLexed(source)) {
final var result = javaLexed.sensitivity(Lexpression.SPACE_SENSITIVE).find(match("public var h")).fromIndex(0).result();
Assertions.assertTrue(result.matches);
Assertions.assertEquals(13, result.start);
Assertions.assertEquals(19, result.end);
}
The structure of the test is the same as the previous one.
We have a Java source in a string, create a JavaLexed
object, and call the find()
method on it.
The difference is that we use the method find()
instead of match()
and also that we use the method sensitivity()
to set the space sensitivity of the match.
The argument to this method is an integer value, and it can be one of the following:
-
NO_SENSITIVITY
(0) -
SPACE_SENSITIVE
(1), and -
COMMENT_SENSITIVE
(2).
These constants are defined in the class Lexpression
and can be used alone or combined with the bitwise or operator.
In the example above, we do the matching space sensitive but not comment sensitive.
The Matching is simple again; we use three tokens converted from a string.
The Matching does not care about the comment between the var
keyword and the variable name.
However, the match does care about the spaces between the tokens.
Because of that, if there were a space following the comment as var /*comment*/ h
, then the match would fail.
That is because the matching list of tokens are
-
public
-
SPACE
-
var
-
SPACE
-
h
and the source is
-
public
-
SPACE
-
var
-
SPACE
(before the comment) -
SPACE
(after the comment -
h
and the Matching is space sensitive.
If you count the tokens in the example, you can also see that the comment is a single token, and it counts in the indexing, even though the Matching does not care about it.
In this section, we have learned how to do a simple match taking spaces into account and how to ignore the comments, which is the default, so it is relatively simple. In the next section, we will see what to do when we care about the comments.
In this section, we will see how to match a list of tokens taking the comments into account. In addition, we will expand the toolset we use to build up a match expression.
The unit test code is similar to the previous one, but this time we call the method sensitivity()
with the argument COMMENT_SENSITIVE
.
final var source = "private final int z = 13;\npublic var //comment\nh = \"kkk\"";
try (final var javaLexed = new JavaLexed(source)) {
final var result = javaLexed.sensitivity(Lexpression.COMMENT_SENSITIVE).find(list(match("public var "), comment(), match("h"))).fromIndex(0).result();
Assertions.assertTrue(result.matches);
Assertions.assertEquals(13, result.start);
Assertions.assertEquals(20, result.end);
}
The match expression this time is a list of tokens, just like before, but it is not created implicitly by the match expression builder.
Instead, we create the list explicitly using the method list()
from the class LexpressionBuilder
.
The arguments for this method are matchers, and it returns a matcher that matches a list of the underlying matchers.
It is composing the list matcher intelligently recognizing that the match("public var ")
is already a list flattening the final list.
The method comment()
returns a matcher that matches a comment.
In this section, we learned how to match a list of tokens created explicitly and match comments.
In the next one, we will use a parameterized version of the comment()
method to match only specific comments.
This section will see how to match lexical elements using regular expressions. The regular expressions do not replace them but extend the matching process. When we want to match a comment, a string, or a symbol, it still has to be that: a comment a string, or a symbol. However, in addition to that, we can specify a standard Java regular expression that the lexical element has to match.
Mixing match expressions and regular expressions may be confusing because the match expressions are technically also regular expressions. The difference is that they work on lexical elements and tokens, while the standard regular expressions work on characters. If you understand that, you will see that the two are different and how they can be mixed.
The sample unit test, in this case, again, differs only a little bit from the previous one.
We provide an additional argument to the comment()
method, a regular expression.
final var source = "private final int z = 13;\npublic var //no no no\n" +
"h\npublic var //comment\nh = \"kkk\"";
try (final var javaLexed = new JavaLexed(source)) {
final var result =
javaLexed.sensitivity(Lexpression.COMMENT_SENSITIVE)
.find(list(match("public var "), comment(Pattern.compile("//c.*t")), match("h")))
.fromStart().result();
Assertions.assertTrue(result.matches);
Assertions.assertEquals(21, result.start);
Assertions.assertEquals(28, result.end);
}
The regular expression is a standard Java regular expression compiled using the Pattern
class.
If you count the tokens, you can see that the Matching does not find the
public var //no no no h
part because the comment does not match the regular expression.
//no no no
does not match ‘//c.*t’.
On the other hand, //comment
does, and thus the
public var //comment h
part is matched.
In this example, we added a regular expression to the comment matcher. If you study the API of the expression builder, you can see that most of the matchers have parameterized versions that take a regular expression wherever it makes sense.
In this section, we specified a regular expression to restrict the Matching further. The following section will see how we can retrieve the matched lexical elements.
In this section, we will see how we can retrieve the matched lexical elements when we provide a regular expression pattern to the match expression.
final var source = "private final int z = 13;\npublic var //cummant\nh = \"kkk\"";
try (final var javaLexed = new JavaLexed(source)) {
final var result =
javaLexed.sensitivity(Lexpression.COMMENT_SENSITIVE)
.find(list(match("public var "), comment("what?", Pattern.compile("//(c.*t)")), match("h")))
.fromStart().result();
Assertions.assertTrue(result.matches);
Assertions.assertEquals(13, result.start);
Assertions.assertEquals(20, result.end);
Assertions.assertTrue(javaLexed.regexGroups("what?").isPresent());
Assertions.assertEquals("cummant", javaLexed.regexGroups("what?").get().group(1));
}
In this example, we add a string parameter to the comment()
method.
It is the name of the group that will be used to retrieve the matched lexical element.
Later we will see that we can identify lexical elements with names not only when we provide regular expression patterns.
The name of the regular expression results in this example is what?
.
It is an arbitrary string that can identify the different results.
In this example, we have only one, but in more complex cases, we may have several.
The matching regular expression result contains the regular expression match groups.
These strings are matched in the regular expression between the (
and )
characters.
It is standard Java regular expression pattern matching as documented in the Java documentation.
The method to get the result for the name is regexGroups()
.
It is plural because there can be more than one character group when the regular expression has more than one (
and )
pair.
In this example, we have only one.
The returned value is an Optional
, hence the call to get()
, and then we fetch the first group.
These groups are indexed in the Java regular expression library from 1.
In the previous section, we matched a comment and restricted the match using a regular expression. Regular expressions are powerful tools but are not always the best choice. This section will see how to match a lexical element specifying a string value.
final var source = "private final int z = 13;\npublic var //comment\nh = \"kkk\"";
try (final var javaLexed = new JavaLexed(source)) {
final var result =
javaLexed.sensitivity(Lexpression.COMMENT_SENSITIVE)
.find(list(match("public var "), comment("//comment"), match("h")))
.fromIndex(0).result();
Assertions.assertTrue(result.matches);
Assertions.assertEquals(13, result.start);
Assertions.assertEquals(20, result.end);
}
This time the call to the comment()
method has a string argument.
As in the example, it will match when the comment is the same as the string argument.
In the next section, we will see how to get the lexical elements that match a more complex match expression.
In this section, we will see how to fetch the lexical elements that match a more complex match expression.
final var source = "private final int z = 13;\npublic var h = \"kkk\"";
try (final var javaLexed = new JavaLexed(source)) {
final var result = javaLexed.find(list(oneOf(group("protection"), "public", "private"), match("var h"))).fromStart().result();
Assertions.assertTrue(result.matches);
Assertions.assertEquals(13, result.start);
Assertions.assertEquals(18, result.end);
Assertions.assertEquals(1, javaLexed.group("protection").size());
Assertions.assertEquals("public", javaLexed.group("protection").get(0).getLexeme());
}
In this example, the matching expression is a list.
The first element of the list is created by calling the method oneOf()
on the expression builder.
The method returns a matcher that will match any lexical element list that matches one of the matchers listed as an argument.
However, the first argument to the method is not a lexical matcher but a group name.
It is created through the call to the method group()
.
The version of oneOf()
that accepts strings as arguments also converts these strings to match expressions to avoid excessive typing to calls to the method match()
.
The call to this method in this example will match the keyword public
or the keyword private
.
The one matched will be stored in a group named protection
.
Note that this group contains lexical elements, not strings, as in the case of regular expression matching. In this case, a single one, hence the size of the list, is 1.
The group’s name is arbitrary, but it has to be unique in the match expression.
It is encapsulated into a GroupNameWrapper
object to help method overloading and aid readability.
It is simply a class containing a final string to distinguish the name from the other arguments.
Up to now, we have seen how to match lexical elements and how to retrieve them. In the next section, we will see how to handle the case when some part of the match expression is not matched.
Until now, all the groups were matched, and we could retrieve the lexical elements that matched them. In this section, we will see how to handle when some part of the expression is not matched.
final var source = "private final int z = 13;\npublic var h = \"kkk\"";
try (final var javaLexed = new JavaLexed(source)) {
javaLexed.find(list(group("protection", oneOf(match("public"), group("private", match("private")))), match("var h")));
// skip 4 lines
final var result = javaLexed.fromStart().result();
Assertions.assertTrue(result.matches);
Assertions.assertEquals(13, result.start);
Assertions.assertEquals(18, result.end);
Assertions.assertEquals(0, javaLexed.group("private").size());
Assertions.assertEquals(1, javaLexed.group("protection").size());
}
In this example, we match the variable declaration of h
if it is public
or private
.
In the example, it is public
, the match is successful, and the group protection
contains the lexical element public
.
On the other hand, the group named private
is not matched because the variable declaration is not private
.
It can be seen through the fact that the size of the group is zero; it contains no lexical elements.
In this chapter, we list all the methods of the expression builder. To use these methods in your code, we recommend importing these static methods:
import static javax0.javalex.LexpressionBuilder.*;
To ease readability, we deleted most type declarations from the method signatures. What remained are
-
int
as they are rare and special -
Predicate
because they deliver the generic type of them.
Other parameter types can be inferred from the name:
-
text
or string` is a string. -
name
is also a string and denotes the name of a regular expression group list. -
pattern
is a regular expression pattern compiled. -
groupName
is aGroupNameWrapper
object easily created form a string calling the methodgroup()
.
Methods are overloaded and usually have the following forms:
-
method()
with no arguments will match whatever they match, and the matched lexers are not stored. Also, the Matching is not further constrained. For example, anidentifier()
will match any identifier. -
method(text)
will match whatever they match if the lexical element is exactly as the text is. For example, anidentifier("apple")
will match the identifierapple
but not other identifiers. -
method(pattern)
will match whatever they match if the lexical element matches the regular expression pattern. For example, anidentifier("[a-z]+")
will match any identifier that contains only lowercase letters. -
method(groupName)
will match whatever they match, and the matched lexers are stored in the group namedgroupName
. -
method(groupName, text)
will match whatever they match if the lexical element is exactly as the text is and the matched lexers are stored in the group namedgroupName
. -
method(name, pattern)
will match whatever they match if the lexical element matches the regular expression pattern and the regular expression matching groups are stored under thename
.
The methods are in the source file LexpressionBuilder.java
.
The file also contains the short documentation of the methods in comments.
The method signatures and the documentation are generated from the source file using Jamal ensuring that there is no typo in the method signatures and no method is forgotten.
-
modifier(int mask)
one modifier.mask
is an integer as defined in the JDK classModifier
. -
keyword(id)
match a keyword. -
oneOf(. . . matchers)
match one of the matchers. -
zeroOrMore(matcher)
match zero or more times the matcher. -
zeroOrMore(string)
match zero or more times the matcher created from the string. -
optional(matcher)
match zero or one times the matcher. -
optional(string)
match zero or one times the matcher created from the string. -
oneOrMore(matcher)
match one or more times the matcher. -
oneOrMore(string)
match one or more times the matcher created from the string. -
repeat(matcher, int times)
match the matcher exactlytimes
times. -
repeat(matcher, int min, int max)
match the matcher at leastmin
times and at mostmax
times. -
integerNumber()
match an integer number. -
integerNumber(Predicate<Long> predicate)
match an integer number that satisfies the predicate. -
number()
match a number, either integer or float. -
number(Predicate<Number> predicate)
match a number, either integer or float that satisfies the predicate. -
floatNumber()
match a float number -
floatNumber(Predicate<Double> predicate)
match a float number that satisfies the predicate. -
list(String. . . strings)
match a list of matchers created from the strings. The matchers are matched in the order they are listed. -
list(. . . matchers)
match a list of matchers. -
match(string)
create a list of lexical elements from the string and match the list. -
unordered(. . . matchers)
match a list of matchers. The actual lexical elements can be in any order. -
unordered(LexicalElement. . . elements)
match a list of lexical elements. The actual lexical elements can be in any order. -
unordered(string)
create a list of lexical elements from the string and match the list. The actual lexical elements can be in any order. -
group(name, matcher)
match the matcher and group the matched lexical elements under the name. -
oneOf(String. . . strings)
create matchers from the strings and match lexical elements that match one of them. -
not(. . . matchers)
match lexical elements that do not match any of the matchers. -
not(LexicalElement. . . elements)
match lexical elements that do not match any of the lexical elements. -
not(string)
create a matcher from the string and match lexical elements that do not match it. -
anyTill(. . . matchers)
match lexical elements until one of the matchers matches. -
anyTill(LexicalElement. . . elements)
match lexical elements until one of the lexical elements matches. -
anyTill(string)
create a matcher from the string and match lexical elements until it matches. -
modifier(groupName, int mask)
match a modifier. The lexical element matched will be stored under the group name. -
keyword(groupName, id)
match a keyword. -
oneOf(groupName, . . . matchers)
match one of the matchers. The lexical element matched will be stored under the group name. -
zeroOrMore(groupName, matcher)
match zero or more of the matchers. The lexical elements matched will be stored under the name. -
zeroOrMore(groupName, string)
match zero or more of the lexical elements. The lexical elements matched will be stored under the name. -
optional(groupName, matcher)
match zero or one of the matcher. The lexical element matched will be stored under the group name. -
optional(groupName, string)
match zero or one of the lexical elements. The lexical element matched will be stored under the group name. -
oneOrMore(groupName, matcher)
match one or more of the matcher. The lexical element matched will be stored under the group name. -
oneOrMore(groupName, string)
match one or more of the lexical elements converted from the string. The lexical element matched will be stored under the group name. -
repeat(groupName, matcher, int times)
match the matcher the specified number of times. The lexical elements matched will be stored under the name. -
repeat(groupName, matcher, int min, int max)
match the matcher the specified number of times, minimum and maximum. The lexical elements matched will be stored under the name. -
integerNumber(groupName)
match an integer number. The lexical element matched will be stored under the group name. -
integerNumber(groupName, Predicate<Long> predicate)
match an integer number and check it against the predicate. The lexical element matched will be stored under the group name. -
number(groupName)
match a number either integer or float. -
number(groupName, Predicate<Number> predicate)
match a number either integer or float and check it against the predicate. The lexical element matched will be stored under the group name. -
floatNumber(groupName)
match a float number. The lexical element matched will be stored under the group name. -
floatNumber(groupName, Predicate<Double> predicate)
match a float number and check it against the predicate. The lexical element matched will be stored under the group name. -
list(groupName, String. . . strings)
match a list of lexical elements converted from the strings. The lexical elements matched will be stored under the name. -
list(groupName, . . . matchers)
match a list of matchers. The lexical elements matched will be stored under the name. -
match(groupName, string)
match lexical elements converted from the string. The lexical elements matched will be stored under the name. -
unordered(groupName, . . . matchers)
match the matchers in any order. The lexical elements matched will be stored under the name. -
unordered(groupName, LexicalElement. . . elements)
match the lexical elements in any order. The lexical elements matched will be stored under the name. -
unordered(groupName, string)
match the lexical elements converted from the string in any order. The lexical elements matched will be stored under the name. -
oneOf(groupName, String. . . strings)
match one of the lexical elements converted from the strings. The lexical elements matched will be stored under the name. -
not(groupName, . . . matchers)
create a matcher that does not match the list of matchers. The process goes through the matchers taking the lexical elements one by one as consumed by the matchers, and if any of them matches, then this matcher does not. The lexical elements "matched" will be stored under the name. -
not(groupName, LexicalElement. . . elements)
create a matcher that does not match the list of lexical elements. The lexical elements "matched" will be stored under the name. -
not(groupName, string)
create a matcher that does not match the lexical elements converted from the string. The lexical elements "matched" will be stored under the name. -
anyTill(groupName, . . . matchers)
match any lexical elements until the list of matchers matches. The lexical elements matched will be stored under the name. -
anyTill(groupName, LexicalElement. . . elements)
match any lexical elements until the list of lexical elements matches. The lexical elements matched will be stored under the name. -
anyTill(groupName, string)
match any lexical elements until the lexical elements converted from the string matches. The lexical elements matched will be stored under the name. -
identifier(groupName)
match an identifier. The lexical element matched will be stored under the group name. -
identifier(groupName, text)
match an identifier with the given text. The lexical element matched will be stored under the group name. This is a bit superficial, since the name was already given in the parameter, but it is here for consistency. -
identifier(groupName, pattern)
match an identifier with the given pattern. The lexical element matched will be stored under the group name. Note that this is not the regular expression groups. -
identifier(groupName, name, pattern)
match an identifier with the given name and pattern. The lexical element matched will be stored under the group name. -
character(groupName)
match a character literal. This is a single character enclosed in single quotes. The lexical element matched will be stored under the group name. -
character(groupName, text)
match a character literal with the given text. The lexical element matched will be stored under the group name. -
character(groupName, pattern)
match a character literal with the given pattern. The lexical element matched will be stored under the group name. -
character(groupName, name, pattern)
match a character literal with the given name and pattern. The lexical element matched will be stored under the group name. The regular expression groups will also be stored under the name. -
string(groupName)
match a string literal. This is a sequence of characters enclosed in double quotes. The mather also matches multi-line strings. The lexical element matched will be stored under the group name. -
string(groupName, text)
match a string literal with the given text. The matcher also matches multi-line strings. The lexical element matched will be stored under the group name. -
string(groupName, pattern)
match a string literal with the given pattern. The matcher also matches multi-line strings. The lexical element matched will be stored under the group name. -
string(groupName, name, pattern)
match a string literal with the given name and pattern. The matcher also matches multi-line strings. The lexical element matched will be stored under the group name. The regular expression groups will also be stored under the name. -
type(groupName)
match a Java type declaration. This can be a single name, or generics' enhanced type declaration. The lexical element matched will be stored under the group name. -
type(groupName, text)
match a Java type declaration with the given text. This can be a single name, or generics' enhanced type declaration. The lexical element matched will be stored under the group name. -
type(groupName, pattern)
match a Java type declaration with the given pattern. This can be a single name, or generics' enhanced type declaration. The lexical element matched will be stored under the group name. -
type(groupName, name, pattern)
match a Java type declaration with the given name and pattern. This can be a single name, or generics' enhanced type declaration. The lexical element matched will be stored under the group name. The regular expression groups will also be stored under the name. -
comment(groupName)
match a Java comment. The lexical element matched will be stored under the group name. -
comment(groupName, text)
match a Java comment with the given text. The lexical element matched will be stored under the group name. -
comment(groupName, pattern)
match a Java comment with the given pattern. The lexical element matched will be stored under the group name. -
comment(groupName, name, pattern)
match a Java comment with the given name and pattern. The lexical element matched will be stored under the group name. The regular expression groups will also be stored under the name. -
identifier()
match a Java identifier. -
identifier(text)
match a Java identifier with the given text. -
identifier(pattern)
match a Java identifier with the given pattern. -
identifier(name, pattern)
match a Java identifier with the given name and pattern. The regular expression groups will also be stored under the name. -
character()
match a Java character literal. -
character(text)
match a Java character literal with the given text. -
character(pattern)
match a Java character literal with the given pattern. -
character(name, pattern)
match a Java character literal with the given pattern. The regular expression groups will also be stored under the name. -
string()
match a Java string literal. The matcher also matches multi-line strings. -
string(text)
match a Java string literal with the given text. The matcher also matches multi-line strings. -
string(pattern)
match a Java string literal with the given pattern. The matcher also matches multi-line strings. -
string(name, pattern)
match a Java string literal with the given name and pattern. The matcher also matches multi-line strings. The regular expression groups will also be stored under the name. -
type()
match a Java type declaration. This can be a single name, or generics' enhanced type declaration. -
type(text)
match a Java type declaration with the given text. This can be a single name, or generics' enhanced type declaration. -
type(pattern)
match a Java type declaration with the given pattern. This can be a single name, or generics' enhanced type declaration. -
type(name, pattern)
match a Java type declaration with the given pattern. This can be a single name, or generics' enhanced type declaration. The regular expression groups will also be stored under the name. -
comment()
match a Java comment. -
comment(text)
match a Java comment with the given text. -
comment(pattern)
match a Java comment with the given pattern. -
comment(name, pattern)
match a Java comment with the given name and pattern. The regular expression groups will also be stored under the name.
In the previous chapters, we have looked at how to match certain parts of the token list. In this chapter, we will see how to change the token list.
The token list after the lexical analysis is in the object JavaLexed
.
This token list can be changed
-
deleting tokens,
-
replacing ranges,
-
inserting new tokens, and
-
converting the token list to a string.
To do that the class JavaLexed
provides methods.
We will look at these methods through examples.
In the first example, we look at how to convert the token list back to a string. The example does not modify the token list, which would not make sense in a practical case. However, this is the first step to learn. If we cannot get the source code after the modification, the changes would be lost.
void testNullTransformation(String lines) {
final var expected = lines;
final String result;
try (final var sut = new JavaLexed(lines)) {
// do nothing, no transformation on lexical level
result = sut.toString();
}
Assertions.assertEquals(expected, result);
}
To get the string version of the token list, we call the method toString()
on the object JavaLexed
.
This conversion is the concatenation of the tokens in the list.
Remember that the token list includes all the spaces, comments and new lines as well.
The conversion back and forth will result exactly in the same string as the original source code unless the token list was modified.
The example method is called with several different Java sources in the unit test source code TestJavaLexed.java
.
You can use the JavaLexed
object to loop through the elements.
This is demonstrated by a method from the unit test file, which converts the tokens to a debug string.
private String toLexicalString(JavaLexed javaLexed) {
StringBuilder sb = new StringBuilder();
for (final var le : javaLexed.lexicalElements()) {
sb.append(le.getType().name()).append("[")
.append(le.getFullLexeme()).append("]").append("\n");
}
return sb.toString();
}
The method lexicalElements()
return an Iterable
object of the lexical elements.
This utility method, which is only a test method and not part of the API iterates through the elements and concatenates them into a string with some decoration.
The following code shows the usage and the result of this utility method:
final var source =
"private final var apple= \"appleee\"; \n" +
" private final final 13 'aaaaa' ";
final String lexed;
try (final var sut = new JavaLexed(source)) {
lexed = toLexicalString(sut);
}
Assertions.assertEquals("IDENTIFIER[private]\n" +
"SPACING[ ]\n" +
"IDENTIFIER[final]\n" +
"SPACING[ ]\n" +
"IDENTIFIER[var]\n" +
"SPACING[ ]\n" +
"IDENTIFIER[apple]\n" +
"SYMBOL[=]\n" +
"SPACING[ ]\n" +
"STRING[\"appleee\"]\n" +
"SYMBOL[;]\n" +
"SPACING[ \n" +
" ]\n" +
"IDENTIFIER[private]\n" +
"SPACING[ ]\n" +
"IDENTIFIER[final]\n" +
"SPACING[ ]\n" +
"IDENTIFIER[final]\n" +
"SPACING[ ]\n" +
"INTEGER[13]\n" +
"SPACING[ ]\n" +
"CHARACTER['aaaaa']\n" +
"SPACING[ ]\n", lexed);
The following examples heavily use this method to visualize the result of the different transformations.
We can remove a single token from the token list.
final var source = "public static final";
final String lexed;
try (final var sut = new JavaLexed(source)) {
sut.remove(2);
lexed = toLexicalString(sut);
}
Assertions.assertEquals("IDENTIFIER[public]\n" +
"SPACING[ ]\n" +
"SPACING[ ]\n" +
"IDENTIFIER[final]\n", lexed);
The example above calls the method remove()
on the JavaLexed
object.
The method removes the token at the index given as the argument.
The indexing, as usual in Java, starts with zero.
The third token in this case is the keyword static
.
The first one, indexed with zero is the keyword public.
The second one, indexed with one is a space
.
Removing the third token will result a token list as demonstrated in the assertion. Note that this token list contains two adjacent space tokens, which may not be the result of the lexical analysis. One or more neighboring spaces and new lines are always collapsed into a single space token.
In addition to removing a single token, you can remove a range of tokens.
final var source = "final static private";
final String lexed;
try (final var sut = new JavaLexed(source)) {
sut.removeRange(1, 3);
lexed = toLexicalString(sut);
}
Assertions.assertEquals("IDENTIFIER[final]\n" +
"SPACING[ ]\n" +
"IDENTIFIER[private]\n", lexed);
The method removeRange(int, int)
removes the tokens from the index given as the first argument to the index given as the second argument.
We can replace tokens, not only remove them.
final var source = "final private";
final String lexed;
try (final var sut = new JavaLexed(source)) {
sut.replace(0, 2, Arrays.asList(new Identifier("static"), new Spacing(" ")));
lexed = toLexicalString(sut);
}
Assertions.assertEquals("IDENTIFIER[static]\n" +
"SPACING[ ]\n" +
"IDENTIFIER[private]\n", lexed);
The method replace()
has three arguments.
The first two are the start and the end of the range to be replaced.
The start is inclusive, the end index is exclusive.
That way replace replaces the tokens from the start till the end-1.
The third argument is the replacement. This is a list that can be created as shown in the example, but there is a convenient method for that.
final var source = "final private";
final String lexed;
try (final var sut = new JavaLexed(source)) {
sut.replace(0, 2, Lex.of("static "));
lexed = toLexicalString(sut);
}
Assertions.assertEquals("IDENTIFIER[static]\n" +
"SPACING[ ]\n" +
"IDENTIFIER[private]\n", lexed);
The utility class Lex
has a method of()
that creates a list of lexical elements from a string.
Note that in both examples the number of spaces following the static
keyword is reserved.
If you know the position of a token, you can fetch it.
final var source = "public static final";
try (final var sut = new JavaLexed(source)) {
final var lexicalElement = sut.get(2);
Assertions.assertEquals(LexicalElement.Type.IDENTIFIER, lexicalElement.getType());
Assertions.assertEquals("static", lexicalElement.getFullLexeme());
}
The method get(int)
will return the element at the given index.
When you have the lexical element, you can get its type and the lexeme as a string.
The library contains a simple comparison tool to compare two sources.
The comparison is done on the lexical level.
The class JavaSourceDiff
implements the BiPredicate<String, String>
interface.
It returns true
if the two sources differ.
By default, the comparison ignores
-
the comments,
-
the difference in spaces and new lines, and
-
the different representations of identical numbers, characters, and strings.
The comment sensitivity can be changed by calling the method commentSensitive()
.
The other insensitivities are hard-wired.
For example, if sources differ only in that one contains the number 32
while the other 0x20
, they are identical.
The following example shows a comparison where the two sources differ only in formatting.
final var s1 = " void test(){\n" +
" final var s1 = \"\";\n" +
" final var s2 = \"\";\n" +
" Assertions.assertTrue( new Comparator().test(Arrays.asList(s1.split(\"\\n\",-1)),Arrays.asList(s2.split(\"\\n\",-1))));\n" +
" }";
final var s2 = " void test(){\n" +
" final var s1 = \"\";\n" +
" final var \n" +
" s2 = \"\";\n" +
" Assertions.assertTrue( new Comparator().test(Arrays.asList(s1.split(\"\\n\",-1)),Arrays.asList(s2.split(\"\\n\",-1))));\n" +
" }";
Assertions.assertFalse( new JavaSourceDiff().test(s1, s2));