New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xml parser #1659
xml parser #1659
Conversation
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
modules/xml/xml.c
Outdated
g_list_free_full(self->exclude_tags, g_free); | ||
self->exclude_tags = g_list_copy(exclude_tags); | ||
|
||
/* g_list_copy_deep would be better, but not supported on all glib versions :( */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to implement g_list_copy_deep under a compiler switch (do we have g_list_copy_deep or not) ?
In this case the code would document this comment :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment! I backported a g_list_copy_deep implementation to use on machines where it is not available. It is less efficient than the one provided by glib master, because this version iterates twice through the list instead of one. But it does not use any internal api from the glib list implementation (like memory allocation), so I felt safer to go with this version.
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
modules/xml/xml.c
Outdated
while (pattern != NULL) | ||
{ | ||
gchar *str = ((gchar *)pattern->data); | ||
if (strchr(str, '*') || strchr(str, '?')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could use strpbrk
instead of the two strchr
.
${CMAKE_CURRENT_BINARY_DIR}/xml-grammar.c | ||
COMPILE_FLAGS ${BISON_FLAGS}) | ||
|
||
option(ENABLE_xml "Enable xml ON") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use for the name of the option variable upper case ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided as this is non-critical issue and general in other modules, this will not be scope of this patch.
${CMAKE_CURRENT_BINARY_DIR}/xml-grammar.c | ||
COMPILE_FLAGS ${BISON_FLAGS}) | ||
|
||
option(ENABLE_xml "Enable xml ON") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description is confusing, because it says "Enable xml ON", but the default value for option
is OFF. It is not stated here otherwise. I would follow the pattern:
option(ENABLE_XML, "Enable xml module", ON)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided as this is non-critical issue and general in other modules, this will not be scope of this patch.
|
||
option(ENABLE_xml "Enable xml ON") | ||
|
||
if (ENABLE_xml) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I would exclude the whole file with the condition, not just the compiling and linking part.
option(ENABLE_BLA)
if (ENABLE_BLA)
everything here
endif()
The advantage of this that people are used to search include guards there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided as this is non-critical issue and general in other modules, this will not be scope of this patch.
modules/xml/CMakeLists.txt
Outdated
"${CMAKE_CURRENT_BINARY_DIR}/xml-grammar.h" | ||
) | ||
|
||
set(xml_SOURCES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are missing the xml.c here.
{"<tag1>part1<tag2>value2</tag2>part2</tag1>", ".xml.tag1", "part1part2"}, | ||
{"<tag1><tag11></tag11><tag12><tag121>value</tag121></tag12></tag1>", ".xml.tag1.tag12.tag121", "value"}, | ||
{"<tag1><tag11></tag11><tag12><tag121 attr1='1' attr2='2'>value</tag121></tag12></tag1>", ".xml.tag1.tag12.tag121._attr1", "1"}, | ||
{"<tag1><tag11></tag11><tag12><tag121 attr1='1' attr2='2'>value</tag121></tag12></tag1>", ".xml.tag1.tag12.tag121._attr2", "2"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about the following case:
<tag>
<tag1>value11</tag1>
<tag2>value22</tag1>
</tag>
What should be the result of tag.tag1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kokan: this is an invalid XML as tag2
is terminated by a non-matching tag.
But the question is still valid: what if an error occurs here? Do we recognize this as an error and drop the whole xml or we keep the result parsed from the valid part?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a drop_invalid() with default value: no. User can control this behaviour.
For the first part: if we write tag1 instead of tag2 as well:
<tag>
<tag1>value11</tag1>
<tag1>value22</tag1>
</tag>
the result will be concatenated. The same behaviour is visible with xmllint:
$ xmllint --xpath "/tag/tag1/text()" text.xml
value11value22%
$
It is currently a limitation of the parser users cannot access the individual tags in such vector-like structures. This limitation will be documented in the administrator's guide.
modules/xml/xml-grammar.ym
Outdated
|
||
xml_opt | ||
: KW_PREFIX '(' string ')' { xml_parser_set_prefix(last_parser, $3); free($3); } | ||
| KW_DROP_INVALID '(' yesno ')' { xml_parser_set_forward_invalid(((XMLParser *) last_parser), !$3); }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signature must use LogParser instead of XMLParser to reduce dependency from grammar towards module.
@@ -17,6 +17,7 @@ compat_sources = \ | |||
lib/compat/inet_aton.c \ | |||
lib/compat/memrchr.c \ | |||
lib/compat/pio.c \ | |||
lib/compat/glib.c \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should add glib.c
also to the lib/compat/CMakeLists.txt
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
modules/xml/xml.c
Outdated
InserterState *state = (InserterState *)user_data; | ||
const gchar *current_value = log_msg_get_value_by_name(state->msg, state->key->str, NULL); | ||
GString *value = scratch_buffers_alloc(); | ||
g_string_printf(value, "%s%s", current_value, text); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Text is not nullterminated here.
Added a strip_whitespaces() option. As xmllint also does not strip whitespaces, the default value is false in syslog-ng as well. |
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
modules/xml/xml.h
Outdated
LogParser super; | ||
gchar *prefix; | ||
gboolean forward_invalid; | ||
GList *exclude_tags; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both exclude_tags and exclude patterns are not needed. The exclude_tags set function can compile right away, there is no need to store exclude_tags.
Please try not to use sprintf() and friends (g_string_sprintf qualifies) as that ruins performance a lot. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Address the questions regarding to the tests.
- What about the error handling (GMarkupParser)?
- Mention the limitations of this xml parser (useful information to the doc team : https://developer.gnome.org/glib/stable/glib-Simple-XML-Subset-Parser.html )
should_reverse
request is optional- as well the
strrchr
lib/compat/glib.c
Outdated
{ | ||
GList *new_list = NULL; | ||
|
||
if (list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd return immediately instead of the if statement.
This (input validations & return/throwing exception) is a common technique for reducing the McCabe index.
{"<tag1>value1</tag1>", ".xml.tag1", "value1"}, | ||
{"<tag1 attr='attr_value'>value1</tag1>", ".xml.tag1._attr", "attr_value"}, | ||
{"<tag1><tag2>value2</tag2></tag1>", ".xml.tag1.tag2", "value2"}, | ||
{"<tag1>part1<tag2>value2</tag2>part2</tag1>", ".xml.tag1", "part1part2"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between
<tag1>part1<tag2>value2</tag2>part2</tag1>
and
<tag1>part1part2<tag2>value2</tag2></tag1>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are two different xml-s, but they will result in the same keys values.
This behaviour aligns with xmllint:
$ diff -s <(xmllint --xpath "/tag1/text()" <(echo "<tag1>part1<tag2>value2</tag2>part2</tag1>")) <(xmllint --xpath "/tag1/text()" <(echo "<tag1>part1part2<tag2>value2</tag2></tag1>"))
Files /proc/self/fd/11 and /proc/self/fd/13 are identical
{"<tag1>part1<tag2>value2</tag2>part2</tag1>", ".xml.tag1", "part1part2"}, | ||
{"<tag1><tag11></tag11><tag12><tag121>value</tag121></tag12></tag1>", ".xml.tag1.tag12.tag121", "value"}, | ||
{"<tag1><tag11></tag11><tag12><tag121 attr1='1' attr2='2'>value</tag121></tag12></tag1>", ".xml.tag1.tag12.tag121._attr1", "1"}, | ||
{"<tag1><tag11></tag11><tag12><tag121 attr1='1' attr2='2'>value</tag121></tag12></tag1>", ".xml.tag1.tag12.tag121._attr2", "2"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kokan: this is an invalid XML as tag2
is terminated by a non-matching tag.
But the question is still valid: what if an error occurs here? Do we recognize this as an error and drop the whole xml or we keep the result parsed from the valid part?
ParameterizedTestParameters(xmlparser, valid_inputs) | ||
{ | ||
static ValidXMLTestCase test_cases[] = | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm missing the negative cases and more complex XMLs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need some suggeston on the complex XMLs, let's talk irl.
On the other hand, I have a single negative test:
Test(xml_parser, test_drop_invalid)
which tests a single unbalanced tag. I can add a list with negative cases, but I am not sure on its benefit. Those would only test the glib's xml parser implementation, which we implicitely think (work hypothesis) it works. The integration is tested with test_drop_invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the end we decided to add some failure testcase.
modules/xml/tests/test_xml_parser.c
Outdated
{"<longtag>Text</longtag>", "lo*gtag_break", ".xml.longtag", "Text"}, | ||
{"<longtag>Text</longtag>", "break_long*ag", ".xml.longtag", "Text"}, | ||
{"<longtag>Text</longtag>", "*tag_break", ".xml.longtag", "Text"}, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to test also more complex xmls.
{ | ||
.start_element = start_element_cb, | ||
.end_element = end_element_cb, | ||
.text = text_cb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about error
handler and passthrough
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
error
g_markup_parse_context_parse and g_markup_parse_context_end_parse also reports back a gerror, which I turn into an msg_error. For the invalid xml-s I passed, I could see the error logs. So I did not override the method. Did I overlook something and should I override? -
passthrough
Comments and processing instructions should not be part of the data imo so I did not override this virtual function. By the way xmllint also strips these data from text. For example
$xmllint --xpath "/tag/text()" <(echo "<tag>value<?mso-application progid="Excel.Sheet"?></tag>")
value%
modules/xml/xml.c
Outdated
g_free(self->prefix); | ||
if (prefix) | ||
self->prefix = g_strdup(prefix); | ||
else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULL-checking of prefix
is superfluous as g_strdup
returns NULL
when its input is NULL
.
modules/xml/xml.c
Outdated
{ | ||
XMLParser *self = (XMLParser *) s; | ||
g_free(self->prefix); | ||
self->prefix = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use NULL
instead of 0
.
modules/xml/xml.c
Outdated
} | ||
|
||
static gboolean | ||
tag_matches_patterns(const gchar *element_name, const GPtrArray *patterns, gboolean should_reverse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we pass the reversed_element_name
also? In this case we don't have to propagate this flag to here (and this decision already made by the caller...). On the other hand in this case it is better to pass also the length of the element_name
(as the caller also has to compute it because of the reverse).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good comment thanks! Especially again, element_name is NOT null terminated, so I should not have called strlen() on it inside this function. I will fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: it is a little bit confusing, but only the text_cb has nonterminating gchar *, the other api calls will give a nullterminated string.
modules/xml/xml.c
Outdated
static gint | ||
scan_for_last_dot(GString *str) | ||
{ | ||
gint i = str->len; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use strrchr
instead. (pos = strchr(s, '.'); return (pos-s+1);
IIRC
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
Replaced the g_string_printf functions with a series of g_string_assign, g_string_append and g_string_append_c. |
This version will not support CDATA. We will document it as a limitation. |
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
The xml parser can process input in xml format, and adds the parsed data to the message object.
Configuration example
Output example
From the following xml:
the following output is generated: (through json-parse template, ignoring exclude-tags now)