GSoC 2021 Proposal: Add regexp parser() (Xiaoyu Qiu)

LittleFish edited this page Mar 30, 2021 · 3 revisions

Google Summer of Code - 2021 Project Proposal : Syslog-ng

About Me

Basic Information

  • Name: Xiaoyu Qiu
  • Email: qiuxy1233@gmail.com
  • Github: LittleFish33
  • Time Zone: UTC+08:00 (China)
  • Education: 2020-Present, Sun Yat-Sen University, Computer Science and Technology (master)
  • Knowledge Area:
    • Thorough knowledge of C/C++ programming; familiar with CMake, gtest, and other common build tools
    • Technical expertise in the analysis, design, integration, and implementation of Linux-based applications and systems
    • Follow the principles of Clean Code and TDD in daily programming; comfortable with Git
    • A general understanding of the syslog-ng codebase; a regexp-parser prototype is already implemented

Pull Requests (March 2021 - Present)

No. | Description | Link
1 | Fix issue #3598 date parser incorrect sets seconds | #3615
2 | Fix issue #3397 Improve regex related test cases | #3619
3 | Fix issue #3623 incorrect debug message for loggen | #3624
4 | Fix issue #3469: add reconnect opt in loggen | #3630

Project

Project Overview

Using regular expressions to parse messages is very common and useful. Currently, syslog-ng supports extracting fields (name-value pairs) in its filter expressions. However, filters are primarily meant for filtering and do not store the extracted groups by default (although the store-matches flag can be set manually). To make up for this deficiency, this project develops a new parser type, which I call regexp-parser(). The motivation behind regexp-parser() is to remove the need to use filters for extracting name-value pairs from message content, a task that has traditionally been done by parsers. The extracted names can then be used as user-defined macros for subsequent filtering or other types of message processing.

Why the project will be useful for syslog-ng and/or the community?

Syslog-ng traditionally uses regular expressions for filtering purposes. Although the combination of filters and rewrite rules makes it possible to process message content with regular expressions, the approach is unintuitive and has the following drawbacks:

  • the store-matches flag needs to be set explicitly, as regexp field extractions are not stored by default
  • the extracted fields are placed in the "matches" namespace ($0, $1, $2 ...), whose names are difficult to remember and get overwritten by the next regexp
  • one can set() a "normal" name-value pair using the value of $1 with a rewrite rule, but the config quickly becomes cumbersome and unreadable
  • one can also set a "normal" name-value pair with the PCRE named capture group syntax, but group names cannot contain dots, which are regularly used in the names of syslog-ng name-value pairs
  • filters should not mutate the message (which they do with flags(store-matches))

By contrast, parsers are the tools meant to segment structured messages and extract the desired fields. It is therefore useful to develop a new parser type, regexp-parser(), which uses regular expressions to segment the message content into name-value pairs. These names can then be used as user-defined macros for subsequent filtering or other types of message processing. The addition of regexp-parser() thus enhances the message processing functionality of syslog-ng and makes it easier to use for both developers and users.

Why the project is interesting for me?

GSoC is a good opportunity for me to contribute to open source projects with mentorship from great developers all over the world. My experience operating and maintaining Linux servers taught me the significance of log management: log messages contain information about the events happening on the hosts, and monitoring system events is essential for security and system health. I admire the efforts of the selfless developers around the world who have made syslog-ng a flexible and highly scalable system logging application, ideal for creating centralized and trusted logging solutions. It would be an honor to serve the developer community and enhance the functionality offered by syslog-ng.

What knowledge areas are required for the success of the project?

  • Familiarity with Syslog-ng
  • Familiarity with the C language and parallel programming/mutexes
  • Familiarity with GLib, CMake, regular expressions, and Bison

I am comfortable with

  • C/C++ language and Linux system programming
  • CMake, gtest, and other common build tools
  • Git

I need to improve on

  • Parallel programming/mutexes
  • How to compile and parse regular expressions

Implementation

Preliminary thought

Upon startup, syslog-ng first checks the syntax of the configuration file and, if there is no syntax error, parses the defined configuration objects as log expressions. Sources, destinations, filters, parsers, rewrite rules, and global log statements are all log expressions. In syslog-ng, every configuration object is essentially a configuration block that can include multiple objects. Accordingly, every log expression is represented as a tree of LogExprNode elements. Each LogExprNode is responsible for transforming the config representation into a pipeline of LogPipe elements; this transformation is called compilation. The LogPipe elements constructed this way handle the production load: each element is responsible for its own operation and for handing over its result to its "next" peer. Together, all log expressions describe a graph as dictated by the configuration. For example, this is a single log statement, in which each piece is a LogPipe instance:

source -> filter -> filter -> parser -> rewrite -> destination  
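The hand-over between peers can be illustrated with a minimal C sketch. It is only a simplified model under stated assumptions: `Msg`, `Pipe`, `pipe_queue`, and `filter_has_foo` are hypothetical names invented for this illustration and do not reflect the actual LogPipe definitions in syslog-ng.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified message: NOT the real LogMessage. */
typedef struct {
    const char *text;   /* message content */
    char trace[128];    /* names of the elements the message passed through */
} Msg;

/* Hypothetical, simplified pipeline element: NOT the real LogPipe. */
typedef struct Pipe Pipe;
struct Pipe {
    const char *name;
    int (*accept)(const Msg *msg);  /* NULL means "always accept" */
    Pipe *next;                     /* the "next" peer, or NULL at the end */
};

/* Example filter: accepts only messages that contain "foo". */
int filter_has_foo(const Msg *msg)
{
    return strstr(msg->text, "foo") != NULL;
}

/* Each element does its own work, then hands the message to its peer.
 * Returns 1 if the message reached the end of the chain, 0 if dropped. */
int pipe_queue(Pipe *self, Msg *msg)
{
    assert(self != NULL && msg != NULL);
    if (self->accept && !self->accept(msg))
        return 0;
    if (msg->trace[0] != '\0')
        strncat(msg->trace, " -> ", sizeof(msg->trace) - strlen(msg->trace) - 1);
    strncat(msg->trace, self->name, sizeof(msg->trace) - strlen(msg->trace) - 1);
    return self->next ? pipe_queue(self->next, msg) : 1;
}
```

Chaining three such elements and queueing a message into the first one records the path the message took; an element whose `accept` callback rejects the message stops the hand-over at that point.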

According to the different needs of message processing, LogPipe is further specialized into different classes, one of which is LogParser. In particular, various methods of LogPipe and of derived classes such as LogParser can be overridden by external objects. The aim of this functionality is to make it possible to attach new functions to a LogParser at runtime. Therefore, to implement regexp-parser, the essential task is to design a new class derived from LogParser, which I call REGEXParser, and attach new functions to it. Thanks to the excellent modular design of syslog-ng, a new REGEXParser can be added with minimal changes to existing code. The project outline is as follows.

1. Design on configuration and behavior level

What is it: As mentioned above, syslog-ng first checks the syntax of the configuration file and parses the defined configuration objects as log expressions. In practice, syslog-ng uses CfgParser to provide a high-level interface to a configuration file parser, which encapsulates the grammar/lexer pair. According to the Open-Closed Principle, software entities (classes, modules, functions, etc.) should be open for extension but closed for modification; that is, an entity's behaviour can be extended without modifying its source code. Therefore, the syntax of regexp-parser in the configuration file should be consistent with that of the existing built-in parsers. In addition, we should identify the common parser options needed for a well-functioning regexp-parser, such as prefix, flags, and template. The syslog-ng administrator guide and the options of existing parsers should provide a good reference. The use-cases and scenarios of regexp-parser will also be discussed with the mentors.

Expected outcome: Determine the syntax of regexp-parser's configuration

2. Interface design

What is it: Before diving into the implementation of the algorithm, it is necessary to design the interface according to the syntax determined in the previous phase. The modular design of syslog-ng provides functionally clean modules that are driven by customer needs and share a standardized interface. Therefore, we need to examine the code of the built-in parsers and the related documents to determine the standardized interfaces for modules in syslog-ng. Possible interfaces include init_instance, free, clone, process, etc. After that, we can write the bison grammar file that parses the configuration settings of regexp-parser(). A UML diagram that depicts the interface is shown as follows. Note that interfaces in syslog-ng such as init are implemented as function pointers, which are pointed at specific functions during initialization.

UML Diagram

Expected outcome: Determine the interface of regexp-parser and write the bison grammar file.
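To make the function-pointer style of interface concrete, here is a minimal C sketch. It is an assumption-laden illustration rather than the real LogParser layout: the `Parser` and `RegexpParser` structs, the `free_fn` member, and `regexp_parser_new` are hypothetical names, and a literal substring check stands in for real regular expression matching.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified base "class": NOT the real LogParser layout. */
typedef struct Parser Parser;
struct Parser {
    int     (*init)(Parser *self);
    int     (*process)(Parser *self, const char *input);
    Parser *(*clone)(const Parser *self);
    void    (*free_fn)(Parser *self);
};

/* Derived parser: the base is the first member, so pointers can be cast. */
typedef struct {
    Parser super;
    char *pattern;
    int initialized;
} RegexpParser;

Parser *regexp_parser_new(const char *pattern);

static int regexp_parser_init(Parser *s)
{
    RegexpParser *self = (RegexpParser *)s;
    self->initialized = (self->pattern != NULL && self->pattern[0] != '\0');
    return self->initialized;
}

static int regexp_parser_process(Parser *s, const char *input)
{
    RegexpParser *self = (RegexpParser *)s;
    /* Placeholder: a substring check stands in for regexp matching. */
    return self->initialized && strstr(input, self->pattern) != NULL;
}

static Parser *regexp_parser_clone(const Parser *s)
{
    const RegexpParser *self = (const RegexpParser *)s;
    return regexp_parser_new(self->pattern);
}

static void regexp_parser_free(Parser *s)
{
    RegexpParser *self = (RegexpParser *)s;
    free(self->pattern);
    free(self);
}

/* Constructor: the function pointers are bound here, at initialization. */
Parser *regexp_parser_new(const char *pattern)
{
    assert(pattern != NULL);
    RegexpParser *self = calloc(1, sizeof(*self));
    if (!self)
        return NULL;
    self->pattern = malloc(strlen(pattern) + 1);
    if (!self->pattern) {
        free(self);
        return NULL;
    }
    strcpy(self->pattern, pattern);
    self->super.init = regexp_parser_init;
    self->super.process = regexp_parser_process;
    self->super.clone = regexp_parser_clone;
    self->super.free_fn = regexp_parser_free;
    return &self->super;
}
```

A caller only ever sees a `Parser *` and invokes `p->init(p)`, `p->process(p, msg)`, and so on, which is what allows a new parser type to be plugged in without touching the calling code.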

3. Algorithm implementation

What is it: At this point, we can implement the interface according to the desired functionality. The main piece is the parser's process function, which receives a LogMessage and segments the message content into name-value pairs based on the regular expression. In this part, we need to understand how to read and modify log messages, how to compile and use regular expressions in LogParser, and how to extract and manage name-value pairs. Note that syslog-ng already uses regular expressions in filters and rewrite rules through LogMatcher. This module should be reused in the implementation of regexp-parser to avoid code redundancy. Therefore, in addition to the code of the built-in parsers, we should also examine the code of filters and rewrite rules to understand how regular expressions are used in LogMatcher and how this can contribute to our project. Following the Open-Closed Principle, we should avoid modifying the code of LogMatcher as much as possible.

Expected outcome: Complete most of the algorithm implementation.
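The core idea of the process function, turning capture groups into prefixed name-value pairs, can be sketched in plain C. This sketch deliberately swaps in the POSIX `regex.h` API instead of LogMatcher/PCRE, so it has no named capture groups: the group names are passed in by the caller. `Pair`, `MAX_PAIRS`, and `extract_pairs` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define MAX_PAIRS 8

/* Hypothetical name-value pair storage, NOT the syslog-ng NVTable. */
typedef struct {
    char name[64];
    char value[128];
} Pair;

/* Matches `input` against the POSIX extended regexp `pattern` and stores
 * capture group i (1-based) under the name prefix + names[i - 1].
 * Returns the number of pairs stored, or -1 if the pattern does not
 * compile or the input does not match. */
int extract_pairs(const char *pattern, const char *prefix,
                  const char *names[], size_t n_names,
                  const char *input, Pair *out)
{
    assert(n_names <= MAX_PAIRS);
    regex_t re;
    regmatch_t m[MAX_PAIRS + 1];
    int stored = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, input, n_names + 1, m, 0) != 0) {
        regfree(&re);
        return -1;
    }
    for (size_t i = 1; i <= n_names && i <= re.re_nsub; i++) {
        if (m[i].rm_so < 0)
            continue;  /* this group did not take part in the match */
        int len = (int)(m[i].rm_eo - m[i].rm_so);
        snprintf(out[stored].name, sizeof(out[stored].name),
                 "%s%s", prefix, names[i - 1]);
        snprintf(out[stored].value, sizeof(out[stored].value),
                 "%.*s", len, input + m[i].rm_so);
        stored++;
    }
    regfree(&re);
    return stored;
}
```

For instance, matching `user=([a-z]+) port=([0-9]+)` against `login user=alice port=22` with the prefix `.regexp.` and the names `user` and `port` stores the pairs `.regexp.user` = `alice` and `.regexp.port` = `22`. In the real parser, compiling the pattern once at init time rather than on every message would be the natural choice.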

4. Unit tests, performance evaluation, and code refactoring

What is it: After regexp-parser is implemented, we should add unit test cases to verify its correctness, and perform code reviews and bug fixing to ensure that regexp-parser works properly. In addition, a performance evaluation is required to measure how many messages a single regexp can process per second, since meeting high scalability requirements is important. The code should then be refactored until it is satisfactory.

Expected outcome: A well-functioning regexp-parser
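A rough way to estimate the "messages per second for a single regexp" figure is a timing loop around the match call. The sketch below is only an assumption-based micro-benchmark: it uses POSIX `regex.h` rather than syslog-ng's own matcher, measures CPU time with `clock()` rather than wall-clock time, and `measure_matches_per_second` is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <time.h>

/* Runs `iterations` matches of `pattern` against `input` and returns the
 * number of match attempts per second of CPU time, or -1.0 if the pattern
 * does not compile. The number of successful matches is written to
 * `matched_out` when it is not NULL. */
double measure_matches_per_second(const char *pattern, const char *input,
                                  long iterations, long *matched_out)
{
    assert(iterations > 0);
    regex_t re;
    long matched = 0;

    /* Compile once, outside the timed loop, as a parser would at init. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1.0;

    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        if (regexec(&re, input, 0, NULL, 0) == 0)
            matched++;
    }
    clock_t end = clock();
    regfree(&re);

    if (matched_out)
        *matched_out = matched;

    double secs = (double)(end - start) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        secs = 1.0 / CLOCKS_PER_SEC;  /* clamp when below timer resolution */
    return (double)iterations / secs;
}
```

Such a loop only bounds the cost of the regexp itself; an end-to-end measurement with a load generator such as loggen would also include message parsing and I/O.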

5. Update documentation

What is it: Update the documentation to cover the newly supported regexp-parser and add example cases. Also, write a summary article about the project.

Project Goals

  • To begin with, analyze the existing parsers in syslog-ng and learn how to add a new parser
  • Discuss use-cases and scenarios with mentors and refer to the existing parsers to design regexp-parser on the configuration and behavior level
  • Determine the interface of regexp-parser and implement the algorithm
  • Optimize the resulting code to work efficiently within syslog-ng
  • Perform diagnostic and regression tests and get the code production ready
  • Update documentation

Deliverables

  • Implementation and integration of the new regexp-parser into syslog-ng
  • Thorough testing and performance evaluation of regexp-parser
  • Updated documentation covering the newly supported regexp-parser, with example cases, and a summary article about the project

A prototype of regexp-parser

I have built a prototype for this proposal: the regexp-dev branch of LittleFish33/syslog-ng. It uses LogMatcher to process regular expressions and supports the prefix() option common to other parsers, so that named sub-patterns can get a prefix in their names. Code changes and additions include:

  • New files in modules/regexp:
File name | Main function
CMakeLists.txt, Makefile.am | CMake and Automake build configuration
regexp-parser-grammar.ym | Parses the configuration settings of regexp-parser
regexp-parser-plugin.c | Registers the module
regexp-parser-parser.c/h | Implements a regexp-parser_parser and declares a binding
regexp-parser.c/h | Holds the logic of the parse process, including the code for init_instance, free, clone, process, etc.
  • Changes:
    • modules/Makefile.am: include modules/regexp/Makefile.am
    • modules/CMakeLists.txt: add_subdirectory(regexp)

How to use:

The syntax to declare a regexp-parser:

parser <parser_name> {
    regexp-parser(
        prefix("prefix of parser")
        pattern("regular expression pattern")
        template("<template-expression>")
    );
};

Example:

When the parser receives foo ... as message content, a new name-value pair .regexp.DN with the value foo will be generated and attached to the current message.

parser { 
    regexp-parser(prefix(".regexp.") pattern("(?<DN>foo)") template("${MESSAGE}"));
};

Time Schedule

Prior - May 18
  • Keep diving into syslog-ng’s code
  • Improve the prototype

May 18 - June 7: Community Bonding Period
  • Get to know the mentors and the community
  • Improve knowledge of the codebase, including project structure, underlying technology, and code style
  • Set up a development environment for coding and debugging

June 8 - June 13
  • Discuss the shortcomings of the prototype with mentors
  • Based on the discussion, design a more detailed and complete scheme for regexp-parser's configuration

June 14 - June 27
  • Analyze the codebase and investigate the built-in modules that can serve as a reference
  • Determine the common interfaces of a syslog-ng parser
  • Design the bison grammar file based on the syntax and interface of regexp-parser
  • Discuss the implementation of regexp-parser with mentors

June 28 - July 13
  • Learn how to use LogMatcher for regular expressions
  • Implement regexp-parser based on the discussion and the prototype; the code completed at this stage should at least handle simple regular expression cases
  • Integrate regexp-parser into syslog-ng
  • Implement test cases to verify its correctness

July 13 - July 17: Phase 1 evaluation and buffer for unexpected delay
  • Discuss with mentors and analyze the current implementation: what has been done, whether there are bugs, which shortcomings can be improved, and which enhancements should be implemented in the next phase
  • Create a detailed report based on the outcome of the analysis

July 18 - Aug 1: Based on the evaluation in phase 1
  • Fix bugs
  • Make further enhancements and complete the code
  • Extend the test cases if needed
  • Refactor the code if needed

Aug 2 - Aug 8
  • Review the code and analyze the performance of regexp-parser; fine-tune the performance where needed
  • Immediate bug fixes (if any)
  • Update documentation and example cases

Aug 9 - Aug 17: Buffer for unexpected delay

Aug 17 - Aug 31
  • Open a pull request on syslog-ng for the new regexp-parser
  • Address review comments on the pull request
  • Write a summary article about the project
  • Submit final evaluation