How to replace with Regex that needs multiple lines #71

Closed
stormboomer opened this issue Dec 13, 2020 · 4 comments

@stormboomer

Hi,

I am having trouble getting the sed function to work with a regex. I am trying to remove the newline character (\n) in front of every line that does not start with 4 digits.

I have a file looking like this

0001 this is a test
0002 another test
with a new line
and maybe another
0003 another test

The following command works on the command line

cat issue.txt | sed ':a;$!{N;/\n[0-9-]\{4\}/!{s/\n/ /;ba}};P;D' | less

But when I try to use it like this

List<String> result = Unix4j.cat("issue.txt")
                .sed(":a;$!{N;/\\n[0-9-]\\{4\\}/!{s/\\n/ /;ba}};P;D").toStringList();

I get the following Exception

Exception in thread "main" java.lang.IllegalArgumentException: sed regexp pattern is not terminated, expected a second unescaped ':' character in: :a;$!{N;/\n[0-9-]\{4\}/!{s/\n/ /;ba}};P;D
	at org.unix4j.unix.sed.Command.fromScript(Command.java:192)
	at org.unix4j.unix.sed.SedCommand.execute(SedCommand.java:22)
	at org.unix4j.command.JoinedCommand.execute(JoinedCommand.java:101)
	at org.unix4j.builder.DefaultCommandBuilder.toOutput(DefaultCommandBuilder.java:134)
	at org.unix4j.builder.DefaultCommandBuilder.toStringList(DefaultCommandBuilder.java:116)
....

Am I doing something wrong?
What would be the best way to get this to work?

Thank you in advance.

@terzerm terzerm self-assigned this Dec 14, 2020
@terzerm
Member

terzerm commented Dec 14, 2020

Hi

Firstly thanks for using unix4j.

The simple and short answer to your question: the sed expression that you are trying is not supported by unix4j's sed.

In more detail:

  • unix4j is strictly line-oriented; multi-line operations are not supported
  • not all sed editing commands are supported
  • consecutive sed operations are not directly supported; use piping into the next sed command instead

If you still want to use unix4j to solve your problem, you will have to first get around the multi-line issue. This can for instance be achieved by loading the file into a string and replacing newlines with some special character sequence:

String singleLine = Unix4j.fromFile("issue.txt").toStringResult().replace("\n", "<NL>").replace("\r", "");

You can then replace each "<NL>" sequence with a proper newline if it is followed by 4 digits, and with a single space otherwise:

List<String> result = Unix4j.fromString(singleLine)
    .sed("s/<NL>(\\d\\d\\d\\d)/\n$1/g")
    .sed("s/<NL>/ /g")
    .toStringList();

This will result in the following output given the sample input from above:

0001 this is a test
0002 another test with a new line and maybe another
0003 another test
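
For comparison, here is a minimal sketch of the same in-memory joining in plain Java, without Unix4j. It swaps the "<NL>" placeholder for a negative lookahead, so it is an alternative technique and not what the unit test does. The class and method names are made up for illustration, and the whole text is still held in memory.

```java
import java.util.Arrays;
import java.util.List;

public class JoinContinuationLines {

    // Replaces every newline that is neither followed by four digits
    // nor the final newline of the text with a single space.
    static List<String> join(String text) {
        String joined = text
                .replace("\r", "")                        // normalise Windows line endings
                .replaceAll("\n(?!\\d{4}|\\z)", " ");     // join continuation lines
        // split() drops the trailing empty string left by a final newline
        return Arrays.asList(joined.split("\n"));
    }

    public static void main(String[] args) {
        String input = "0001 this is a test\n"
                + "0002 another test\n"
                + "with a new line\n"
                + "and maybe another\n"
                + "0003 another test\n";
        join(input).forEach(System.out::println);
    }
}
```

Run on the sample input, this should print the same three lines as shown above.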

Of course, processing the newline replacements in memory is far from ideal, especially for large files. You may want to replace the newlines in the file itself with a different method, or use a different tool altogether. As I said, Unix4j is not really well suited to multi-line operations.

I hope this helps anyway.

(The example has been added to the git repo as unit test SedTest.testSed_regexWithMultipleLines)

@stormboomer
Author

Hi terzerm,

thanks for your response.
Sadly, I am dealing with larger files, so it is not a good option for me to load the file into memory first.
So I will have a look at some kind of portable sed utility that I can use.

@terzerm
Member

terzerm commented Dec 14, 2020

Personally I would process the file manually by always looking 1 line ahead and then directly writing the output to a new file using BufferedReader and PrintWriter for instance. Something like this:

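// Needs java.io.BufferedReader, FileReader, FileWriter and PrintWriter; the enclosing method must handle IOException.
try (BufferedReader reader = new BufferedReader(new FileReader("issue.txt"));
     PrintWriter writer = new PrintWriter(new FileWriter("result.txt"))) {
    final StringBuilder lineBuffer = new StringBuilder(256);
    String line;
    while ((line = reader.readLine()) != null) {
        // A line starting with 4 digits begins a new record: write out the previous one first.
        if (line.matches("^\\d\\d\\d\\d.*") && lineBuffer.length() > 0) {
            writer.println(lineBuffer);
            lineBuffer.setLength(0);
        }
        // Continuation lines are joined to the current record with a single space.
        lineBuffer.append(lineBuffer.length() > 0 ? " " : "").append(line);
    }
    // Write out the last record, if any.
    if (lineBuffer.length() > 0) {
        writer.println(lineBuffer);
    }
} // try-with-resources flushes and closes both files, even on error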
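
To try the buffering loop without touching the file system, it can be wrapped in a method that accepts any Reader and Writer. The following is only a sketch under that assumption: the names RecordJoiner and joinRecords are hypothetical, and I/O errors are rethrown unchecked to keep the example short.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class RecordJoiner {

    // Streams the input line by line; only the current record is kept in memory.
    static void joinRecords(Reader in, Writer out) {
        BufferedReader reader = new BufferedReader(in);
        PrintWriter writer = new PrintWriter(out);
        StringBuilder record = new StringBuilder(256);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // A line starting with 4 digits closes the previous record.
                if (line.matches("\\d{4}.*") && record.length() > 0) {
                    writer.println(record);
                    record.setLength(0);
                }
                if (record.length() > 0) {
                    record.append(' ');
                }
                record.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Write the last record, if any, and push everything to the target.
        if (record.length() > 0) {
            writer.println(record);
        }
        writer.flush();
    }

    public static void main(String[] args) {
        String sample = "0001 this is a test\n0002 another test\n"
                + "with a new line\nand maybe another\n0003 another test\n";
        StringWriter result = new StringWriter();
        joinRecords(new StringReader(sample), result);
        System.out.print(result);
    }
}
```

For real files, pass a FileReader and a FileWriter instead of the string-based streams, and close them once the method returns.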

@terzerm
Member

terzerm commented Dec 14, 2020

See SedTest.java
