Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xml parser #1659

Merged
merged 9 commits into from Sep 15, 2017
Merged

xml parser #1659

merged 9 commits into from Sep 15, 2017

Conversation

furiel
Copy link
Collaborator

@furiel furiel commented Sep 5, 2017

The xml parser can process input in xml format, and adds the parsed data to the message object.

Configuration example

parser xml_parser {
       xml(template("$MSG")
           prefix(".xml")
           drop_invalid(no)
           exclude-tags("tag1", "tag2", "tag3?", "tag*")
           strip-whitespaces(no)
          );
};

Output example

From the following xml:

<tags attr='attrval'>part1<tag1>Tag1 Leaf</tag1>part2<tag2>Tag2 Leaf</tag2>part3</tags>

the following output is generated: (through json-parse template, ignoring exclude-tags now)

{"_xml":{"tags":{"tag2":"Tag2 Leaf","tag1":"Tag1 Leaf","_attr":"attrval","tags":"part1part2part3"}}}

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

g_list_free_full(self->exclude_tags, g_free);
self->exclude_tags = g_list_copy(exclude_tags);

/* g_list_copy_deep would be better, but not supported on all glib versions :( */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to implement g_list_copy_deep under a compiler switch (do we have g_list_copy_deep or not) ?
In this case the code would document this comment :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment! I backported a g_list_copy_deep implementation to use on machines where it is not available. It is less efficient than the one provided by glib master, because this version iterates twice through the list instead of one. But it does not use any internal api from the glib list implementation (like memory allocation), so I felt safer to go with this version.

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

while (pattern != NULL)
{
gchar *str = ((gchar *)pattern->data);
if (strchr(str, '*') || strchr(str, '?'))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could use strpbrk instead of the two strchr.

${CMAKE_CURRENT_BINARY_DIR}/xml-grammar.c
COMPILE_FLAGS ${BISON_FLAGS})

option(ENABLE_xml "Enable xml ON")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use for the name of the option variable upper case ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided as this is non-critical issue and general in other modules, this will not be scope of this patch.

${CMAKE_CURRENT_BINARY_DIR}/xml-grammar.c
COMPILE_FLAGS ${BISON_FLAGS})

option(ENABLE_xml "Enable xml ON")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description is confusing, because it says "Enable xml ON", but the default value for option is OFF. It is not stated here otherwise. I would follow the pattern:

option(ENABLE_XML, "Enable xml module", ON)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided as this is non-critical issue and general in other modules, this will not be scope of this patch.


option(ENABLE_xml "Enable xml ON")

if (ENABLE_xml)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would exclude the whole file with the condition, not just the compiling and linking part.

option(ENABLE_BLA)

if (ENABLE_BLA)
 everything here
endif()

The advantage of this that people are used to search include guards there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided as this is non-critical issue and general in other modules, this will not be scope of this patch.

"${CMAKE_CURRENT_BINARY_DIR}/xml-grammar.h"
)

set(xml_SOURCES
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are missing the xml.c here.

{"<tag1>part1<tag2>value2</tag2>part2</tag1>", ".xml.tag1", "part1part2"},
{"<tag1><tag11></tag11><tag12><tag121>value</tag121></tag12></tag1>", ".xml.tag1.tag12.tag121", "value"},
{"<tag1><tag11></tag11><tag12><tag121 attr1='1' attr2='2'>value</tag121></tag12></tag1>", ".xml.tag1.tag12.tag121._attr1", "1"},
{"<tag1><tag11></tag11><tag12><tag121 attr1='1' attr2='2'>value</tag121></tag12></tag1>", ".xml.tag1.tag12.tag121._attr2", "2"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about the following case:

<tag>
   <tag1>value11</tag1>
   <tag2>value22</tag1>
</tag>

What should be the result of tag.tag1 ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kokan: this is an invalid XML as tag2 is terminated by a non-matching tag.
But the question is still valid: what if an error occurs here? Do we recognize this as an error and drop the whole xml or we keep the result parsed from the valid part?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a drop_invalid() with default value: no. User can control this behaviour.

For the first part: if we write tag1 instead of tag2 as well:

<tag>
   <tag1>value11</tag1>
   <tag1>value22</tag1>
</tag>

the result will be concatenated. The same behaviour is visible with xmllint:
$ xmllint --xpath "/tag/tag1/text()" text.xml
value11value22%
$
It is currently a limitation of the parser users cannot access the individual tags in such vector-like structures. This limitation will be documented in the administrator's guide.


xml_opt
: KW_PREFIX '(' string ')' { xml_parser_set_prefix(last_parser, $3); free($3); }
| KW_DROP_INVALID '(' yesno ')' { xml_parser_set_forward_invalid(((XMLParser *) last_parser), !$3); };
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signature must use LogParser instead of XMLParser to reduce dependency from grammar towards module.

@@ -17,6 +17,7 @@ compat_sources = \
lib/compat/inet_aton.c \
lib/compat/memrchr.c \
lib/compat/pio.c \
lib/compat/glib.c \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add glib.c also to the lib/compat/CMakeLists.txt.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

InserterState *state = (InserterState *)user_data;
const gchar *current_value = log_msg_get_value_by_name(state->msg, state->key->str, NULL);
GString *value = scratch_buffers_alloc();
g_string_printf(value, "%s%s", current_value, text);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Text is not nullterminated here.

@furiel
Copy link
Collaborator Author

furiel commented Sep 13, 2017

Added a strip_whitespaces() option. As xmllint also does not strip whitespaces, the default value is false in syslog-ng as well.
When a user creates an xml by hand and adds indentation for cosmetic purposes, the (significant) enters, tabs, spaces are maintained in the nvtable by default. But such whitespaces in the result might seem awkward in some cases, so this option can strip whitespaces if the user sets it.

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

LogParser super;
gchar *prefix;
gboolean forward_invalid;
GList *exclude_tags;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both exclude_tags and exclude patterns are not needed. The exclude_tags set function can compile right away, there is no need to store exclude_tags.

@bazsi
Copy link
Collaborator

bazsi commented Sep 14, 2017

Please try not to use sprintf() and friends (g_string_sprintf qualifies) as that ruins performance a lot.

Copy link
Collaborator

@lbudai lbudai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{
GList *new_list = NULL;

if (list)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd return immediately instead of the if statement.
This (input validations & return/throwing exception) is a common technique for reducing the McCabe index.

{"<tag1>value1</tag1>", ".xml.tag1", "value1"},
{"<tag1 attr='attr_value'>value1</tag1>", ".xml.tag1._attr", "attr_value"},
{"<tag1><tag2>value2</tag2></tag1>", ".xml.tag1.tag2", "value2"},
{"<tag1>part1<tag2>value2</tag2>part2</tag1>", ".xml.tag1", "part1part2"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between

<tag1>part1<tag2>value2</tag2>part2</tag1>

and

<tag1>part1part2<tag2>value2</tag2></tag1>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are two different xml-s, but they will result in the same keys values.
This behaviour aligns with xmllint:

$ diff -s <(xmllint --xpath "/tag1/text()" <(echo "<tag1>part1<tag2>value2</tag2>part2</tag1>")) <(xmllint --xpath "/tag1/text()" <(echo "<tag1>part1part2<tag2>value2</tag2></tag1>"))
Files /proc/self/fd/11 and /proc/self/fd/13 are identical

{"<tag1>part1<tag2>value2</tag2>part2</tag1>", ".xml.tag1", "part1part2"},
{"<tag1><tag11></tag11><tag12><tag121>value</tag121></tag12></tag1>", ".xml.tag1.tag12.tag121", "value"},
{"<tag1><tag11></tag11><tag12><tag121 attr1='1' attr2='2'>value</tag121></tag12></tag1>", ".xml.tag1.tag12.tag121._attr1", "1"},
{"<tag1><tag11></tag11><tag12><tag121 attr1='1' attr2='2'>value</tag121></tag12></tag1>", ".xml.tag1.tag12.tag121._attr2", "2"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kokan: this is an invalid XML as tag2 is terminated by a non-matching tag.
But the question is still valid: what if an error occurs here? Do we recognize this as an error and drop the whole xml or we keep the result parsed from the valid part?

ParameterizedTestParameters(xmlparser, valid_inputs)
{
static ValidXMLTestCase test_cases[] =
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm missing the negative cases and more complex XMLs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need some suggeston on the complex XMLs, let's talk irl.

On the other hand, I have a single negative test:
Test(xml_parser, test_drop_invalid)
which tests a single unbalanced tag. I can add a list with negative cases, but I am not sure on its benefit. Those would only test the glib's xml parser implementation, which we implicitely think (work hypothesis) it works. The integration is tested with test_drop_invalid.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end we decided to add some failure testcase.

{"<longtag>Text</longtag>", "lo*gtag_break", ".xml.longtag", "Text"},
{"<longtag>Text</longtag>", "break_long*ag", ".xml.longtag", "Text"},
{"<longtag>Text</longtag>", "*tag_break", ".xml.longtag", "Text"},
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to test also more complex xmls.

{
.start_element = start_element_cb,
.end_element = end_element_cb,
.text = text_cb
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about error handler and passthrough?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • error
    g_markup_parse_context_parse and g_markup_parse_context_end_parse also reports back a gerror, which I turn into an msg_error. For the invalid xml-s I passed, I could see the error logs. So I did not override the method. Did I overlook something and should I override?

  • passthrough
    Comments and processing instructions should not be part of the data imo so I did not override this virtual function. By the way xmllint also strips these data from text. For example

$xmllint --xpath "/tag/text()" <(echo "<tag>value<?mso-application progid="Excel.Sheet"?></tag>")
value%

g_free(self->prefix);
if (prefix)
self->prefix = g_strdup(prefix);
else
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NULL-checking of prefix is superfluous as g_strdup returns NULL when its input is NULL.

{
XMLParser *self = (XMLParser *) s;
g_free(self->prefix);
self->prefix = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use NULL instead of 0.

}

static gboolean
tag_matches_patterns(const gchar *element_name, const GPtrArray *patterns, gboolean should_reverse)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we pass the reversed_element_name also? In this case we don't have to propagate this flag to here (and this decision already made by the caller...). On the other hand in this case it is better to pass also the length of the element_name (as the caller also has to compute it because of the reverse).

Copy link
Collaborator Author

@furiel furiel Sep 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good comment thanks! Especially again, element_name is NOT null terminated, so I should not have called strlen() on it inside this function. I will fix this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: it is a little bit confusing, but only the text_cb has nonterminating gchar *, the other api calls will give a nullterminated string.

static gint
scan_for_last_dot(GString *str)
{
gint i = str->len;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use strrchr instead. (pos = strchr(s, '.'); return (pos-s+1); IIRC

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

@furiel
Copy link
Collaborator Author

furiel commented Sep 15, 2017

Replaced the g_string_printf functions with a series of g_string_assign, g_string_append and g_string_append_c.

@furiel
Copy link
Collaborator Author

furiel commented Sep 15, 2017

This version will not support CDATA. We will document it as a limitation.

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

furiel and others added 9 commits September 15, 2017 14:52
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

@lbudai lbudai merged commit 0ba062e into syslog-ng:master Sep 15, 2017
@furiel furiel deleted the xml_parser branch May 8, 2018 04:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants