Writing multi-schema / master-detail files #66

Closed
blackfoxoh opened this issue Jan 20, 2016 · 20 comments

@blackfoxoh

I have seen the documentation about reading master-detail style files (https://github.com/uniVocity/univocity-parsers#reading-master-detail-style-files) and found #17 "creation of multiple types of Java beans".
Is there something similar for writing files that I have missed so far?

Here is the use case I'm evaluating:

# Header; line; of; master; record
# Header; line for; detailrecord1
# Header; line; for; detail; record2
MASTER; some; data; for; master1
DETAIL1; first other; data
DETAIL1; second other; data
DETAIL2; data; for; detail; record2
DETAIL2; data2; for; detail; record2
DETAIL2; data3; for; detail; record2
MASTER; some; data; for; master2
DETAIL2; data; for; detail; record2
...

This style might be too complex even for reading, I fear, as it has more than just one kind of detail record (actually I even have to write more than two kinds of detail records 😟 ).

My thoughts on solving this so far:
The headers have to be declared as comments, so that would not be a problem. The master rows could be written as usual. When writing fixed-width output, the detail rows could be written using fixedWidthWriter.writeRow(string), where string is collected from a second, third, ... fixed-width writer via fww2.processRecordToString() (see the sketch below).
In the case of CSV (the use case above) this seems a bit more difficult to me...!?
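
A rough sketch of the fixed-width variant (the settings and record objects are placeholders, assumed to be configured elsewhere):

    // Sketch of the two-writer idea above: the second writer only formats
    // detail rows to strings, which the main writer then writes verbatim.
    FixedWidthWriter masterWriter = new FixedWidthWriter(new FileWriter("out.txt"), masterSettings);
    FixedWidthWriter detailFormatter = new FixedWidthWriter(new StringWriter(), detailSettings);

    masterWriter.processRecord(masterRecord); // MASTER row, written as usual
    // format a detail row with the second writer, then write the result as-is
    masterWriter.writeRow(detailFormatter.processRecordToString(detailRecord));
    masterWriter.close();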

Did I miss something which makes this style of writing files easier?

@jbax
Member

jbax commented Jan 20, 2016

The tutorial doesn't include anything (yet) for many of the new features in the upcoming 2.0.0 version. Support for that was built some time ago. Check the test cases here
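
Condensed from those test cases, the writing side looks roughly like this (a sketch only; the row values are made up to match your example above):

    // Each switch value selects a row processor for that row type.
    OutputValueSwitch writerSwitch = new OutputValueSwitch(); // switches on column 0 by default
    writerSwitch.addSwitchForValue("MASTER", new ObjectRowWriterProcessor());
    writerSwitch.addSwitchForValue("DETAIL1", new ObjectRowWriterProcessor());
    writerSwitch.addSwitchForValue("DETAIL2", new ObjectRowWriterProcessor());

    CsvWriterSettings settings = new CsvWriterSettings();
    settings.setRowWriterProcessor(writerSwitch);

    CsvWriter writer = new CsvWriter(new StringWriter(), settings);
    writer.processRecord("MASTER", "some", "data", "for", "master1");
    writer.processRecord("DETAIL1", "first other", "data");
    writer.processRecord("DETAIL2", "data", "for", "detail", "record2");
    writer.close();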

@jbax jbax closed this as completed Jan 20, 2016
@blackfoxoh
Author

Thank you very much for the advice! This feature looks amazing!

Now I'm done evaluating uniVocity-parsers in the context where we need a solution - kudos to you, there are quite a few special cases and univocity-parsers masters them all! Thank you for developing and sharing such a great library!

@blackfoxoh
Author

I've done some tests and now have to ask again - I'm not sure whether I'm doing something wrong or whether it is a bug:
Inspired by your test case I've tried:

final ObjectRowWriterProcessor clientProcessor = new ObjectRowWriterProcessor();
final ObjectRowWriterProcessor accountProcessor = new ObjectRowWriterProcessor();

OutputValueSwitch writerSwitch = new OutputValueSwitch();
writerSwitch.addSwitchForValue("Account", accountProcessor, "type", "balance", "bank", "account", "swift");
writerSwitch.addSwitchForValue("Client", clientProcessor);

CsvWriterSettings settings = new CsvWriterSettings();
settings.getFormat().setLineSeparator("\n");
settings.setHeaderWritingEnabled(false);
settings.setRowWriterProcessor(writerSwitch);

CsvWriter writer = new CsvWriter(new File("path/to/file/filename"), settings);

Calling writer.processRecord("Account","bla","blub"); writes a line with the three given columns - OK. What I want to achieve is to get the missing columns written as empty: Account;bla;blub;;;

OK, my next thought was that I have to declare which value belongs to which column (header), so the writer can decide which columns are provided and which ones are empty.
Next try:

Map<String, Object> rowData = new HashMap<String, Object>();
rowData.put("type", "Account" );
rowData.put("balance","sp2");
rowData.put("bank","sp3");
rowData.put("acount","sp4");
rowData.put("swift","sp5");
writer.processRecord(rowData);

which resulted in an empty file, as the keys from the map were used as headers but their ordering differed from the one provided in addSwitchForValue() because of the HashMap implementation. Since the keys of the map match the headers, I expected them to be mapped accordingly.
Nonetheless, as a next try I added a header mapping, expecting that it might map the keys from rowData to the headers:

Map<String, String> headermapping = new HashMap<String, String>();
headermapping.put("type", "type" );
headermapping.put("balance","balance");
headermapping.put("bank","bank");
headermapping.put("acount","acount");
headermapping.put("swift","swift");

writer.processRecord(headermapping,rowData);

The result was the same, except that the headers were now taken from the headermapping map instead of the rowData map, but without any mapping logic (debugging into writer.processRecord(headermapping,rowData); ended up in AbstractWriter.setHeadersFromMap(Map<?, ?> map, boolean keys) to get the headers).

Am I on the totally wrong track to achieve what I described above, or is this a bug?

@jbax
Member

jbax commented Jan 23, 2016

Hello there. Regarding your first issue, trying to write

writer.processRecord("Account","bla","blub"); 

will result in Account,bla,blub being written. The writer does not assume you want to add blank columns to match the header count. I'll make that configurable.

On the second issue with maps, using a LinkedHashMap will solve your problem. No rows are written to the output because the writer is extracting the headers from the map, and your OutputValueSwitch assumes the first element of the input rows must be used to determine which format to use. As you are using a HashMap, the first element is not your type column, therefore no processor is found to handle the row. I think it would be better to have it throw an exception instead of silently ignoring the row, so I'll add that as well.
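
For example, with the row data you posted (insertion order is preserved, so the type value lands in column 0):

    // LinkedHashMap iterates in insertion order, so "type" stays at index 0,
    // which is where the OutputValueSwitch looks for the value to switch on.
    Map<String, Object> rowData = new LinkedHashMap<String, Object>();
    rowData.put("type", "Account"); // must come first: selects the processor
    rowData.put("balance", "sp2");
    rowData.put("bank", "sp3");
    rowData.put("account", "sp4");
    rowData.put("swift", "sp5");
    writer.processRecord(rowData);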

jbax added a commit that referenced this issue Jan 25, 2016
@jbax
Member

jbax commented Jan 25, 2016

@blackfoxoh
Author

Thank you for the adjustments.
For my clarification: with a normal, non-multi-format csv the method writer.processRecord(Map); uses the keys of the Map to match the headers and if the keys of the map are different from the headers they can be mapped by providing a second map headerMapping as the first argument.
In case of the multi-format csv with the OutputValueSwitch this doesn't work because of the switch first has to decide which row-style to choose and therefore needs the keys preordered.
Is this assumption correct? At least it seems feasible to me and a quick test with non-multi-format succeeded accordingly.

What do you think - when deciding which switch to choose, wouldn't it be possible to iterate over all switches and, for those provided with headersToUse, get the header name of the switch column and check the input value under that header name (if present) to decide whether it is a match for this switch?
Of course this can't work for switches without headersToUse, nor for row inputs which don't provide header names for the values. But in the case of a map, an optional headerMapping map or annotated beans, this could be possible, I think...?
I hope my description isn't too abstruse - if something isn't clear, feel free to ask.

@blackfoxoh
Author

What I forgot: your tip with the LinkedHashMap works, thanks. But in case you get the map to write from some code you don't control, this solution is ruled out: rowData = new LinkedHashMap<String, Object>(rowData); does produce a LinkedHashMap, but the ordering of the original rowData's contents obviously isn't changed by this conversion.
Hence, and because an API with a generic Map argument shouldn't depend on specific implementations of Map, my suggestion...

@jbax
Member

jbax commented Jan 25, 2016

The header definition of an output switch is used only to manage the output sequence of its written columns after a given format is identified.

The format is identified by reading what data is present at a given column (by default, column index = 0). This requires all rows to come with the relevant data (in your case, the account type) in the expected position (0) for the switch to work.

When you pass a map as the input, and with no headers defined in the writer settings, it will simply iterate through the map keys to produce a record. In the case of a HashMap, that sequence cannot be determined.

In the particular case where you want to use Map + OutputValueSwitch, you can control the sequence of headers by providing a LinkedHashMap of header mappings:

        Map<String, String> headerData = new LinkedHashMap<String, String>();
        headerData.put("type", "type"); 
        headerData.put("balance", "balance");
        headerData.put("bank", "bank");
        headerData.put("account", "account");
        headerData.put("swift", "swift");

        writer.processRecord(headerData, rowData); 

This is a bit contrived, and a slightly better approach would be to add a few overloads that allow you to provide the sequence as an array of headers. In that case you would be able to call something like writer.processRecord(rowData, "type", "balance", "etc"); instead.

Deriving the headers from each item of the output switch would be inefficient as that would have to be executed over and over each time you send in a map of values to write.

@blackfoxoh
Author

OK, good point. Providing the headerMapping map as an instance of LinkedHashMap stands to reason.
Can you please confirm if the following is correct:
When using multi-format style CSV there is no way to specify which "schema" each type of row has, so that the writer could support me in producing proper output for that type (unless all types have the same structure, in which case I could provide it as "common to all row types" on the CsvWriterSettings)? => I have to ensure this on my own by providing the proper values (or nulls) in the needed quantity?

Background: we will have some output to write with different row types. The schema is defined by an external partner, and for each type of record we only need to fill a few of the many columns. So it would be very handy to define the columns of each row type and, when writing, just pass in a map containing only the fields that aren't null/not needed.

@jbax
Member

jbax commented Jan 25, 2016

Not sure if I understood your question. Can you give me an example of expected input rows and expected output?

@blackfoxoh
Author

Let's say there can be three row types with these headers:

SUPER;he1;he2;he3;he4
SUB1;a;b;c;d;e;f;g
SUB2;p;q;r;s;t;u;v;w;x;y;z;

Input as Map instances (one line = one map; key=>value):

SUPER=>SUPER;he1=>v1;he2=>v2;he3=>v3;he4=>v4
SUB1=>SUB1;a=>v5;d=>v6;e=>v7;g=>v8
SUB2=>SUB2;q=>v9;u=>v10;w=>v11;y=>v12
SUB1=>SUB1;a=>v13;d=>v14;g=>v15
SUB1=>SUB1;a=>v16;d=>v17;f=>v18;
SUPER=>SUPER;he1=>v19;;;he4=>v20

Most of the maps contain only partial data according to the defined headers, e.g. the first SUB1-record doesn't contain the keys b,c,f.

Desired output:

SUPER;v1;v2;v3;v4
SUB1;v5;;;v6;v7;;v8
SUB2;;v9;;;;v10;;v11;;v12;
SUB1;v13;;;v14;;;v15
SUB1;v16;;;v17;;v18;
SUPER;v19;;;v20
...

I would like to define the input schemata and then write row by row using something like writer.processRecord(dataMap); or writer.processRecord(headermapping,dataMap);.

CSV with only one type of record works fine, but the combination of multi-schema style CSV and maps containing only partial data is what I couldn't find.

@jbax
Member

jbax commented Jan 25, 2016

OK, so I just committed a few changes that will make this work as expected, and also allow you to use any type of map without problems:

    @Test
    public void testMultiple() {
        OutputValueSwitch writerSwitch = new OutputValueSwitch("type"); //switch based on field name
        writerSwitch.addSwitchForValue("SUPER", new ObjectRowWriterProcessor(), "type", "h1", "h2", "h3", "h4");
        writerSwitch.addSwitchForValue("SUB1", new ObjectRowWriterProcessor(), "type", "a", "b", "c", "d", "e", "f", "g");
        writerSwitch.addSwitchForValue("SUB2", new ObjectRowWriterProcessor(), "type", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z");

        CsvWriterSettings settings = new CsvWriterSettings();
        settings.setExpandIncompleteRows(true);
        settings.getFormat().setLineSeparator("\n");
        settings.setHeaderWritingEnabled(false);
        settings.setRowWriterProcessor(writerSwitch);

        StringWriter output = new StringWriter();
        CsvWriter writer = new CsvWriter(output, settings);

        writer.writeRow(newMap("SUPER", "h1=>v1;h2=>v2;h3=>v3;h4=>v4"));
        writer.writeRow(newMap("SUB1", "a=>v5;d=>v6;e=>v7;g=>v8"));
        writer.writeRow(newMap("SUB2", "q=>v9;u=>v10;w=>v11;y=>v12"));
        writer.writeRow(newMap("SUB1", "a=>v13;d=>v14;g=>v15"));
        writer.writeRow(newMap("SUB1", "a=>v16;d=>v17;f=>v18"));
        writer.close();

        assertEquals(output.toString(), "" +
                "SUPER,v1,v2,v3,v4\n" +
                "SUB1,v5,,,v6,v7,,v8\n" +
                "SUB2,,v9,,,,v10,,v11,,v12\n" +
                "SUB1,v13,,,v14,,,v15,,,\n" +
                "SUB1,v16,,,v17,,v18,,,,\n");

    }

    private Map<String, String> newMap(String type, String data) {
        Map<String, String> out = new HashMap<String, String>();
        out.put("type", type);
        for (String pair : data.split(";")) {
            String[] kv = pair.split("=>");
            out.put(kv[0], kv[1]);
        }

        return out;
    }

@blackfoxoh
Author

Thank you for the quick implementation. Unfortunately there is still a bug.
When the code enters AbstractWriter.fillPartialLineToMatchHeaders(), the field headers is null, so the code is skipped. In the example you implemented, the rows get longer the further down you go - the resulting empty columns at the end seem to originate from largestRowLength rather than from the headers, as I expected.

Here is an additional unit test you can use:

   @Test
   public void testMultiple2() {
      OutputValueSwitch writerSwitch = new OutputValueSwitch("type"); //switch based on field name
      writerSwitch.addSwitchForValue("SUPER", new ObjectRowWriterProcessor(), "type", "h1", "h2", "h3", "h4");
      writerSwitch.addSwitchForValue("SUB1", new ObjectRowWriterProcessor(), "type", "a", "b", "c", "d", "e", "f", "g");
      writerSwitch.addSwitchForValue("SUB2", new ObjectRowWriterProcessor(), "type", "p", "q", "r", "s", "t", "u", "v",
            "w", "x", "y", "z");
      writerSwitch.addSwitchForValue("SUB3", new ObjectRowWriterProcessor(), "type", "a", "b", "c");

      CsvWriterSettings settings = new CsvWriterSettings();
      settings.setExpandIncompleteRows(true);
      settings.getFormat().setLineSeparator("\n");
      settings.setHeaderWritingEnabled(false);
      settings.setRowWriterProcessor(writerSwitch);

      StringWriter output = new StringWriter();
      CsvWriter writer = new CsvWriter(output, settings);

      writer.writeRow(newMap("SUPER", "h1=>v1;h2=>v2;h3=>v3"));
      writer.writeRow(newMap("SUB1", "a=>v5;d=>v6;e=>v7;g=>v8"));
      writer.writeRow(newMap("SUB2", "q=>v9;u=>v10;w=>v11;y=>v12"));
      writer.writeRow(newMap("SUB1", "a=>v13;d=>v14;g=>v15"));
      writer.writeRow(newMap("SUB1", "a=>v16;d=>v17;f=>v18"));
      writer.writeRow(newMap("SUB3", "a=>v16;b=>v17"));
      writer.writeRow(newMap("SUPER", "h1=>v1;h3=>v3"));
      writer.close();

      assertEquals("" + "SUPER,v1,v2,v3,\n" + "SUB1,v5,,,v6,v7,,v8\n" + "SUB2,,v9,,,,v10,,v11,,v12\n"
            + "SUB1,v13,,,v14,,,v15,,,\n" + "SUB1,v16,,,v17,,v18,,,,\n" + "SUB3,v16,v17," + "SUPER,v1,,v3,\n",
            output.toString());

   }

By the way: in your unit test the values for expected and actual are swapped - that was a little confusing when I modified the test and checked the failure message ;-)

@blackfoxoh
Author

A second little issue: OutputValueSwitch.switchRowProcessor(Object[] row) is missing the alternative handling by header name. At the moment row[columnIndex] throws an ArrayIndexOutOfBoundsException because columnIndex is -1.
The rest seems to work fine, so conversions can be applied as well.

@jbax
Member

jbax commented Jan 26, 2016

Fixed everything :)
The expected and actual values are not swapped; your IDE is probably swapping them for you. TestNG's assertEquals method signature is:

    public static void assertEquals(String actual, String expected) {

@blackfoxoh
Author

Oh, sorry - it seems TestNG and JUnit have the two parameters swapped. Innocent as I was, I didn't think about which test framework you might be using - and with the imports stripped away, the test case was a valid JUnit test case as well, which is what I'm used to :-)

I'll give it a try later on, thank you.

@blackfoxoh
Author

Works like a charm. 👍
Sorry for getting on your nerves with my issues while you work on release 2.0.0.
I really appreciate your super fast fixes and additions. I'm confident I'm done now 😃
Thank you

@jbax
Member

jbax commented Jan 27, 2016

Hey, thanks a lot for the valuable input. I feel much more confident about the features I'm adding to this library when people have something to complain about, as it means they are being used in the real world.
Feel free to open new issues any time you find something that doesn't work as expected.

@jbax jbax self-assigned this Jan 27, 2016
@jbax jbax added this to the 2.0.0 milestone Jan 27, 2016
@blackfoxoh
Author

I came across one more little issue in the context of this use case:
I tried to write a Map using writer.processRecord(row); with keys in the map which aren't defined as headers. My expectation was that they would just be ignored, but I got a TextWritingException: Header 'unknownKey' could not be found. Defined headers are: null.
I think this might be use-case dependent. In my case I would like to reuse the same map for other purposes and therefore have some more entries in it which aren't needed for writing this row with univocity. In other use cases it might be desirable to get an exception if an entry couldn't be matched to a column, so I would suggest a configuration setting for this (ignoreNotMatchingColumns?).
This feature might apply to fixed-width as well.
Another use case for this feature: I want to collect a set of data in a map and write out different kinds of rows to a file. Lots of fields have identical values; only some differ. Now I'd like to set the type column to TYPE_A, pass the map to the writer, reset the type column to TYPE_B, pass it to the writer again, and so on. The writer can then pick out just the values intended for the current row type and ignore the others.
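
In code, that pattern would look something like this (a sketch; TYPE_A/TYPE_B and the field names are made up):

    // One map reused for several row types: set the type, write, repeat.
    // The writer would pick only the columns defined for the matching switch
    // and ignore the rest (the behavior suggested above).
    Map<String, Object> row = new HashMap<String, Object>();
    row.put("sharedField", "identical in all row types");
    row.put("fieldOnlyInA", "a-specific value");
    row.put("fieldOnlyInB", "b-specific value");

    row.put("type", "TYPE_A");
    writer.processRecord(row); // only the TYPE_A columns are written

    row.put("type", "TYPE_B");
    writer.processRecord(row); // only the TYPE_B columns are written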

Besides this, there is a little bug in the exception message above: AbstractWriter line 1069, throw throwExceptionAndClose("Header '" + headerName + "' could not be found. Defined headers are: " + Arrays.toString(headers) + '.', null); - I think it should be ...Arrays.toString(headersInContext)...

@jbax
Member

jbax commented Jan 30, 2016

Thank you! This makes a lot of sense. When writing from maps, unknown keys should be ignored. The use cases you reported should all work now.
