Do not throw exception when line size mismatch #22

leroyvi · 2015-01-30T14:12:08Z

Hi,

I have a problem using supercsv. Sometimes, I have to deal with some "incorrect" CSV files like that :

"header1";"header2"
"value1";"value2";

As you can see, I have 2 headers and 3 values in the second line. This prevent me from doing a partial read. This is the same case when executing CellProcessors.

Is it ok for you ?

kbilsted · 2015-01-30T18:14:24Z

Hi @chameleooo and @vleroy

Many thanks for submitting ideas for improvements and even pull-requests! Awesome!

I think there are a few issues we need to clear before we can accept the idea and changes to the main repository.

When there is a mismatch in the number of columns, how do you know which ones are ok to read and which are ok to ignore? You seem to know its the last column. But I'm not sure that is a general assumption to make
Please submit some example code on how you envision using supercsv with these changes?

Now there are alternatives to solving you issue in a more flexible manner than the suggested pull request.

Use null as the processor for columns for ignore as explained here http://super-csv.github.io/super-csv/examples_partial_reading.html
Use the list reader to get an array of strings, then reduce the array to match the number of columns you need an manually invoke the cellprocessors using Util.executeCellProcessors()
Create a Reader class which reads lines and truncates ; and feed that reader to SuperCsv

I hope we can get a discussion going rather than you guys feel that I'm plain rejecting you.

Also please feel free to help out on some of the existing issues that need fixing. Before coding or sending pull requests, please discuss the issue with us first,

-Kasper

jamesbassett · 2015-01-30T20:54:29Z

Yeah CSV with variable columns is a difficult problem. There's a recipe for using CsvBeanReader (in parallel with CsvListReader to figure out how many columns there are) in your situation (optional fixed column) on StackOverflow. If it's not always the same column that is optional (or there's more than 1), then you're really stuck.

I'd agree with Kasper regarding the PR - it works in this particular situation but introduces weird behaviour for other situations. For example, if I use NotNull() on all my columns and one column is missing then it's going to get silently ignored!

kbilsted · 2015-01-30T22:26:33Z

James, where's the "like button" ?? :-)

kbilsted · 2015-01-30T22:33:55Z

Below text moved to #23 . Please discuss there

I was thinking that maybe we could introduce TryReadxxx() methods on the reader implementations which do not throw exceptions, rather they may return null/an empty list maybe along with a list of strings for the caller to inspect and react upon. In C# "out parameters" are typically used with TryXXX methods. I'm not sure what to use in Java. Especially if we want to keep compatibility with java 1.5

something like

public boolean TryRead(T bean, string[] nameMapping, List<String> rawLineData)

What are your thoughts? If you all think its a nice idea, I would not mind @vleroy to submit a PR for that.

leroyvi · 2015-01-30T23:29:47Z

Hi guys,

First of all, thanks for quick responses. I'm Victor (chameleooo and vleroy are my personal and professional aliases).
You ask the same questions I asked to myself.

I don't have any idea on how someone can create a CSV file forgetting some "middle" columns. But some editors (like Excel) can create empty cells at the end.
I have to parse files with unpredictable columns and I just know the header of one of them and I need to get every values in this column.

I don't want to deal with optional columns but I think that if we ask user to pass a String[], we can assume that other columns have to be ignored.
In my case, I don't know how many columns the file contains. So I read the header and create my "mask" from the header. But if one line has a more cells than the header we can ignore lasts.

I don't use CellProcessors but I assume that both methods should be consistent.

kbilsted · 2015-01-31T12:58:39Z

Hmm thinking about it, I think it may be reasonable to add a reader configuration option AllowTrailingEmptyCells boolean with the default to false. If true we trim the empty cells before doing the sizecheck as it is today. I think this is orthogonal to #23 so I think both can co-exist.

This option has to be implemented for all the readers. @vleroy What do you think of this idea?

leroyvi · 2015-02-01T14:11:30Z

I don't really understand why we have to know the line size before parsing it ?
What do you think about a strictParsing boolean with a default value to true and if the value is false, we do something like what I do in the PR ?

kbilsted · 2015-02-01T14:27:32Z

Hi @chameleooo

The problem is that if we have more columns with data in them, we cannot be sure what we are getting out of the file. In many enterprise situations, we are better of rejecting the file than importing wrong data.

I can see your situation, but Im not convinced that its a common situation. That being said, if we can get @jamesbassett to agree, I think a nice alternative is to, through configuration, to allow one or more EMPTY cells to exist without raising exceptions. It will solve your problem, and it will still keep the reading process sane.

leroyvi · 2015-02-01T14:41:31Z

I understand that in almost every case, you should reject the file if it does not have the same number of columns as expected. But, IMHO, there is no difference between removing empty or not empty trailing cells. And it's just an optional behaviour.

kbilsted · 2015-02-01T14:46:27Z

I think it conflicts with the principle of least surprise. I think the code would be more readable if you read the lines using the ListReader and then manipulated it yourself before applying the cell processors as I described above.

leroyvi · 2015-02-01T14:56:27Z

Ok, I'll implement a Reader for my own needs.

kbilsted · 2015-02-01T15:11:14Z

No problem, thanks for taking the time to discuss future improvements of the framework

kbilsted · 2015-02-01T17:42:14Z

Just found another motivation for supporting trailing empty cells - the google csv export

http://stackoverflow.com/questions/23661530/supercsvexception-cellprocessornumber-processing-google-contacts-csv-exports/28178933#28178933

kbilsted · 2015-02-01T17:45:02Z

Closing this PR - further discussion is moved to #25

Do not throw exception when line size mismatch

13f4844

kbilsted added the question label Jan 30, 2015

kbilsted added enhancement and removed question labels Feb 1, 2015

kbilsted closed this Feb 1, 2015

kbilsted added question and removed enhancement labels Feb 1, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not throw exception when line size mismatch #22

Do not throw exception when line size mismatch #22

leroyvi commented Jan 30, 2015

kbilsted commented Jan 30, 2015

jamesbassett commented Jan 30, 2015

kbilsted commented Jan 30, 2015

kbilsted commented Jan 30, 2015

leroyvi commented Jan 30, 2015

kbilsted commented Jan 31, 2015

leroyvi commented Feb 1, 2015

kbilsted commented Feb 1, 2015

leroyvi commented Feb 1, 2015

kbilsted commented Feb 1, 2015

leroyvi commented Feb 1, 2015

kbilsted commented Feb 1, 2015

kbilsted commented Feb 1, 2015

kbilsted commented Feb 1, 2015

Do not throw exception when line size mismatch #22

Do not throw exception when line size mismatch #22

Conversation

leroyvi commented Jan 30, 2015

kbilsted commented Jan 30, 2015

jamesbassett commented Jan 30, 2015

kbilsted commented Jan 30, 2015

kbilsted commented Jan 30, 2015

leroyvi commented Jan 30, 2015

kbilsted commented Jan 31, 2015

leroyvi commented Feb 1, 2015

kbilsted commented Feb 1, 2015

leroyvi commented Feb 1, 2015

kbilsted commented Feb 1, 2015

leroyvi commented Feb 1, 2015

kbilsted commented Feb 1, 2015

kbilsted commented Feb 1, 2015

kbilsted commented Feb 1, 2015