Ability to insert multi-line files #490

Closed
jeqo opened this issue Sep 22, 2022 · 4 comments
Labels
question Further information is requested

Comments

@jeqo

jeqo commented Sep 22, 2022

I was looking into how to load application log files that contain multi-line entries (e.g. Java stack traces) into SQLite.
I can see that --lines helps at the moment, but it falls short when a single log entry spans multiple lines.

I wonder if this functionality would be useful for sqlite-utils. A similar approach to Elastic logstash/filebeat can be adopted: https://www.elastic.co/guide/en/beats/filebeat/current/multiline-examples.html

Potential changes:

  • add a --multiline option
  • additional properties for it:
    • multiline-pattern (regular expression)
    • multiline-negate: true/false
    • multiline-what: previous or next

Or if this is achievable in a different way, please share. Thanks!
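
As a rough illustration of the grouping these options describe, here is a minimal Python sketch. The function `group_multiline` and its parameters are hypothetical, modelled loosely on the Logstash/Filebeat settings linked above; nothing like it exists in sqlite-utils.

```python
import re

def group_multiline(lines, pattern, negate=False, what="previous"):
    """Group physical lines into logical events (hypothetical helper).

    A line "matches" when the regex finds a hit, or when it does not and
    negate is True. Matching lines are attached to the previous event
    (what="previous") or held back for the next one (what="next").
    """
    regex = re.compile(pattern)
    events, buffer = [], []
    for line in lines:
        matched = bool(regex.search(line)) != negate
        if what == "previous":
            if matched and buffer:
                buffer.append(line)  # continuation of the current event
            else:
                if buffer:
                    events.append("\n".join(buffer))
                buffer = [line]      # this line starts a new event
        else:  # what == "next"
            buffer.append(line)
            if not matched:          # a non-matching line completes the event
                events.append("\n".join(buffer))
                buffer = []
    if buffer:
        events.append("\n".join(buffer))
    return events

sample = [
    "2022-03-01T12:04:52: Here is a log message",
    "  that spans multiple lines",
    "2022-03-01T12:04:52: This is a single line",
]
# Lines that do NOT start with a timestamp belong to the previous event.
events = group_multiline(sample, r"^\d{4}-\d{2}-\d{2}T", negate=True, what="previous")
print(events)
```

With `negate=True` and `what="previous"`, any line that does not begin with a timestamp is folded into the entry before it, which is the stack-trace case described above.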

@simonw
Owner

simonw commented Sep 23, 2022

It should be possible to achieve this with the --text option: https://sqlite-utils.datasette.io/en/stable/cli.html?highlight=text#convert-with-text

Given an example like this in multiline.log:

2022-03-01T12:04:52: Here is a log message
  that spans multiple lines
2022-03-01T12:04:52: This is a single line
2022-03-01T12:04:52: Here is another message
  that spans multiple lines

You should be able to run something like this:

sqlite-utils insert /tmp/log.db log multiline.log --text --convert "
import re

r = re.compile(r'^(?P<datetime>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}):(?P<log>.*)', re.MULTILINE)

def convert(text):
    return [m.groupdict() for m in r.finditer(text)]
"

After running this I get:

sqlite-utils rows /tmp/log.db log
[{"datetime": "2022-03-01T12:04:52", "log": " Here is a log message"},
 {"datetime": "2022-03-01T12:04:52", "log": " This is a single line"},
 {"datetime": "2022-03-01T12:04:52", "log": " Here is another message"}]
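
Note that each row above holds only the first line of its message: without `re.DOTALL`, `.` does not match a newline, so `(?P<log>.*)` stops at the end of the line. A standalone sketch that reproduces this outside sqlite-utils, assuming the sample log is held in a string:

```python
import re

# Sample log from the comment above, inlined for the demonstration.
text = """2022-03-01T12:04:52: Here is a log message
  that spans multiple lines
2022-03-01T12:04:52: This is a single line
2022-03-01T12:04:52: Here is another message
  that spans multiple lines
"""

r = re.compile(
    r'^(?P<datetime>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}):(?P<log>.*)',
    re.MULTILINE,
)

def convert(text):
    # One dict per timestamped line; continuation lines never match
    # because they do not start with a timestamp.
    return [m.groupdict() for m in r.finditer(text)]

rows = convert(text)
for row in rows:
    print(row)
```

The leading space in each `log` value comes from the space after the colon, which this pattern leaves inside the captured group.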

@simonw simonw closed this as completed Sep 23, 2022
@simonw simonw added the question Further information is requested label Sep 23, 2022
@jeqo
Author

jeqo commented Sep 24, 2022

🤯 this is beautiful. Thanks @simonw !

@jeqo
Author

jeqo commented Sep 24, 2022

For completeness, the regex requires a bit more dark magic to capture the continuation lines. Here is a working expression: https://regex101.com/r/rsuEcs/1

sqlite-utils insert /tmp/log.db log multiline.log --text --convert "
import re

r = re.compile(r'^(?P<datetime>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\:\s)(?P<log>(.*\s\s.*|.*)+)', re.MULTILINE)

def convert(text):
    return [m.groupdict() for m in r.finditer(text)]
"
BEGIN TRANSACTION;
CREATE TABLE [log] (
   [datetime] TEXT,
   [log] TEXT
);
INSERT INTO "log" VALUES('2022-03-01T12:04:52','Here is a log message
  that spans multiple lines');
INSERT INTO "log" VALUES('2022-03-01T12:04:52','This is a single line');
INSERT INTO "log" VALUES('2022-03-01T12:04:52','Here is another message
  that spans multiple lines');
COMMIT;
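
An alternative sketch, not taken from the thread: the expression above appears to rely on continuation lines beginning with whitespace, whereas a negative lookahead can treat any line that does not start with a timestamp as a continuation. The names `TIMESTAMP` and `pattern` are illustrative, and the sample text is inlined so the snippet runs on its own.

```python
import re

text = """2022-03-01T12:04:52: Here is a log message
  that spans multiple lines
2022-03-01T12:04:52: This is a single line
2022-03-01T12:04:52: Here is another message
  that spans multiple lines
"""

TIMESTAMP = r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}'

# After the first line of a message, keep consuming "\n<line>" pairs
# as long as the next line does not itself begin with a timestamp.
pattern = (
    r'^(?P<datetime>' + TIMESTAMP + r'): '
    r'(?P<log>.*(?:\n(?!' + TIMESTAMP + r').*)*)'
)
r = re.compile(pattern, re.MULTILINE)

def convert(text):
    # Strip the trailing newline that the last message can pick up
    # when the file ends with a line break.
    return [
        {'datetime': m['datetime'], 'log': m['log'].rstrip('\n')}
        for m in r.finditer(text)
    ]

rows = convert(text)
for row in rows:
    print(row)
```

This version also keeps unindented continuation lines, such as the `Caused by:` lines in a Java stack trace, at the cost of attaching any stray non-timestamped line to the preceding entry.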

@simonw
Owner

simonw commented Sep 26, 2022

Just saw your great write-up on this: https://jeqo.github.io/notes/2022-09-24-ingest-logs-sqlite/
