
Add a filter to handle newline-delimited JSON #226

Closed
tho opened this issue Mar 18, 2025 · 4 comments · Fixed by #227

@tho
Contributor

tho commented Mar 18, 2025

I would like to propose adding a new JQnl (or JQLines) filter to the script package. The filter would process newline-delimited JSON data by applying a JQ query to each JSON object in the input. It should work with pretty-printed and compact (jq -c; one JSON object per line) JSON input.

Newline-delimited JSON is widely used for streaming data where each line represents a self-contained JSON object. The current JQ method, unlike the jq command line utility, only processes a single JSON object from the input, making it difficult to process streams of data.

Use cases

  1. Log Analysis

    Application logs are often output as JSON objects, one per line. With JQnl, users could extract and transform specific fields from log files, filtering for specific log levels, error messages, etc.

  2. Processing Paginated API Results

    When dealing with APIs that return paginated results, JQnl would allow processing paginated responses as part of a script pipeline.

  3. Data ETL Workflows

    For Extract-Transform-Load workflows where each record is a separate JSON object, the method would streamline the processing of large data sets by applying transformations to each record as it arrives in the stream, rather than loading the entire input into memory before applying the JQ filter.

Concrete example for the Log Analysis use case

Extract all warning and error messages from a JSON log file, e.g. the output of slog.

/tmp/log.json - The mixed compact-prettyprint-compact format is intentional for illustration purposes.

{"time": "2025-03-17T18:04:26.534789-07:00", "level": "INFO", "msg": "info message"}
{
        "time": "2025-03-17T18:04:26.534946-07:00",
        "level": "WARN",
        "msg": "warn message"
}
{"time": "2025-03-17T18:04:26.534953-07:00", "level": "ERROR", "msg": "error message"}

jq - command line utility for reference

$ cat /tmp/log.json | jq 'select(.level=="WARN" or .level=="ERROR") | .msg'
"warn message"
"error message"

JQ - script.Stdin().JQ(os.Args[1]).Stdout()

$ cat /tmp/log.json | ./scriptJQ 'select(.level=="WARN" or .level=="ERROR") | .msg'
$ # no output, since `JQ` only processes the first JSON object

JQnl - script.Stdin().JQnl(os.Args[1]).Stdout()

$ cat /tmp/log.json | ./scriptJQnl 'select(.level=="WARN" or .level=="ERROR") | .msg'
"warn message"
"error message"

Sample implementation of JQnl:

func (p *Pipe) JQnl(query string) *Pipe {
	return p.Filter(func(r io.Reader, w io.Writer) error {
		// Parse and compile the query once, up front.
		q, err := gojq.Parse(query)
		if err != nil {
			return err
		}
		code, err := gojq.Compile(q)
		if err != nil {
			return err
		}

		// Decode one JSON value at a time from the stream; the decoder
		// accepts both compact and pretty-printed values.
		dec := json.NewDecoder(r)
		for dec.More() {
			var input interface{}
			if err := dec.Decode(&input); err != nil {
				return err
			}

			// Run the compiled query against this value and write each
			// result on its own line.
			iter := code.Run(input)
			for {
				v, ok := iter.Next()
				if !ok {
					break
				}
				if err, ok := v.(error); ok {
					return err
				}
				result, err := gojq.Marshal(v)
				if err != nil {
					return err
				}
				if _, err := fmt.Fprintln(w, string(result)); err != nil {
					return err
				}
			}
		}

		return nil
	})
}
@bitfield
Owner

Great idea! Would it make sense to change JQ itself to behave this way, instead of adding a separate method?

@tho
Contributor Author

tho commented Mar 18, 2025

> Great idea! Would it make sense to change JQ itself to behave this way, instead of adding a separate method?

Yes! That would actually be my personal preference. I realize I forgot to mention this in the issue description, but the reason I was hesitant to suggest modifying the existing JQ filter is that it would result in behavior changes.

  1. Scripts might rely on the fact that JQ only processes the first object.
  2. Input containing a valid JSON object followed by an invalid one would result in an error.

Example illustrating the two changes in behavior:

package main

import (
	"fmt"

	"github.com/bitfield/script"
)

func main() {
	data := "[0,1,2]\n[3,4,5]\n[6,7,8]"
	fmt.Println(script.Echo(data).JQ(`.`).String())
	fmt.Println()
	fmt.Println(script.Echo(data).JQnl(`.`).String())

	fmt.Println("\n---\n")

	data = "[0,1,2]\ninvalid"
	fmt.Println(script.Echo(data).JQ(`.`).String())
	fmt.Println()
	fmt.Println(script.Echo(data).JQnl(`.`).String())
}
$ go run main.go 
[0,1,2]
 <nil>

[0,1,2]
[3,4,5]
[6,7,8]
 <nil>

---

[0,1,2]
 <nil>

[0,1,2]
 invalid character 'i' looking for beginning of value

As mentioned above, I would personally prefer to update JQ.

  • It would mimic the behavior of the jq command-line utility, which is what I, as a user, expected.
  • This is probably the behavior people want most of the time. Single-object JSON input works as before; input containing additional data sees the behavior changes outlined above.
  • It avoids introducing a new filter, coming up with a good name for it, and having to explain when to use JQ versus the new filter.

Let me know what you think! In either case, I would be happy to work on a PR later this week.

@bitfield
Owner

Yes, this appeals to me too—presumably it's easy enough to get only the first object if that's what you really want.

We are at version 0.x, so I don't mind making breaking changes if it results in a better API, which I think it will. Would you like to go ahead and make your suggested changes to JQ, and we can invite people to test against your branch and see if there are any significant issues?

@tho
Contributor Author

tho commented Mar 18, 2025

Great! I'll work on a PR this week.

re: getting only the first object, I think it will depend on the JQ query and the use case. In some situations, the First filter can probably be used. In others, where functionality similar to jq --slurp ("Instead of running the filter for each JSON object in the input, read the entire input stream into a large array and run the filter just once.") is needed, a slurp filter can be implemented. Something like:

// JQSlurp reads the entire stream of JSON values into a single array,
// analogous to jq --slurp, and writes it out as one JSON document.
func JQSlurp(r io.Reader, w io.Writer) error {
	var inputs []interface{}

	// Collect every JSON value in the input.
	dec := json.NewDecoder(r)
	for dec.More() {
		var v interface{}
		if err := dec.Decode(&v); err != nil {
			return err
		}
		inputs = append(inputs, v)
	}

	// Emit the collected values as one JSON array.
	result, err := gojq.Marshal(inputs)
	if err != nil {
		return err
	}

	_, err = fmt.Fprintln(w, string(result))
	return err
}

I am not suggesting adding a JQSlurp filter at this time, because I am not convinced it would be needed often. Also, the filter should probably accept a query. But let's not get distracted by the design of the slurp filter in this issue; I haven't fully thought it through, and the above is tailored to one of my use cases :-)

However, I am open to adding one if desired.
