Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not throw exception when line size mismatch #22

Closed
wants to merge 1 commit into from

Conversation

leroyvi
Copy link

@leroyvi leroyvi commented Jan 30, 2015

Hi,

I have a problem using supercsv. Sometimes, I have to deal with some "incorrect" CSV files like that :

"header1";"header2"
"value1";"value2";

As you can see, I have 2 headers and 3 values in the second line. This prevent me from doing a partial read. This is the same case when executing CellProcessors.

Is it ok for you ?

@kbilsted
Copy link
Member

Hi @chameleooo and @vleroy

Many thanks for submitting ideas for improvements and even pull-requests! Awesome!

I think there are a few issues we need to clear before we can accept the idea and changes to the main repository.

  1. When there is a mismatch in the number of columns, how do you know which ones are ok to read and which are ok to ignore? You seem to know its the last column. But I'm not sure that is a general assumption to make
  2. Please submit some example code on how you envision using supercsv with these changes?

Now there are alternatives to solving you issue in a more flexible manner than the suggested pull request.

  1. Use null as the processor for columns for ignore as explained here http://super-csv.github.io/super-csv/examples_partial_reading.html
  2. Use the list reader to get an array of strings, then reduce the array to match the number of columns you need an manually invoke the cellprocessors using Util.executeCellProcessors()
  3. Create a Reader class which reads lines and truncates ; and feed that reader to SuperCsv

I hope we can get a discussion going rather than you guys feel that I'm plain rejecting you.

Also please feel free to help out on some of the existing issues that need fixing. Before coding or sending pull requests, please discuss the issue with us first,

-Kasper

@jamesbassett
Copy link
Contributor

Yeah CSV with variable columns is a difficult problem. There's a recipe for using CsvBeanReader (in parallel with CsvListReader to figure out how many columns there are) in your situation (optional fixed column) on StackOverflow. If it's not always the same column that is optional (or there's more than 1), then you're really stuck.

I'd agree with Kasper regarding the PR - it works in this particular situation but introduces weird behaviour for other situations. For example, if I use NotNull() on all my columns and one column is missing then it's going to get silently ignored!

@kbilsted
Copy link
Member

James, where's the "like button" ?? :-)

@kbilsted
Copy link
Member

Below text moved to #23 . Please discuss there

I was thinking that maybe we could introduce TryReadxxx() methods on the reader implementations which do not throw exceptions, rather they may return null/an empty list maybe along with a list of strings for the caller to inspect and react upon. In C# "out parameters" are typically used with TryXXX methods. I'm not sure what to use in Java. Especially if we want to keep compatibility with java 1.5

something like

public boolean TryRead(T bean, string[] nameMapping, List<String> rawLineData)

What are your thoughts? If you all think its a nice idea, I would not mind @vleroy to submit a PR for that.

@leroyvi
Copy link
Author

leroyvi commented Jan 30, 2015

Hi guys,

First of all, thanks for quick responses. I'm Victor (chameleooo and vleroy are my personal and professional aliases).
You ask the same questions I asked to myself.

  1. I don't have any idea on how someone can create a CSV file forgetting some "middle" columns. But some editors (like Excel) can create empty cells at the end.
  2. I have to parse files with unpredictable columns and I just know the header of one of them and I need to get every values in this column.

I don't want to deal with optional columns but I think that if we ask user to pass a String[], we can assume that other columns have to be ignored.
In my case, I don't know how many columns the file contains. So I read the header and create my "mask" from the header. But if one line has a more cells than the header we can ignore lasts.

I don't use CellProcessors but I assume that both methods should be consistent.

@kbilsted
Copy link
Member

Hmm thinking about it, I think it may be reasonable to add a reader configuration option AllowTrailingEmptyCells boolean with the default to false. If true we trim the empty cells before doing the sizecheck as it is today. I think this is orthogonal to #23 so I think both can co-exist.

This option has to be implemented for all the readers. @vleroy What do you think of this idea?

@leroyvi
Copy link
Author

leroyvi commented Feb 1, 2015

I don't really understand why we have to know the line size before parsing it ?
What do you think about a strictParsing boolean with a default value to true and if the value is false, we do something like what I do in the PR ?

@kbilsted
Copy link
Member

kbilsted commented Feb 1, 2015

Hi @chameleooo

The problem is that if we have more columns with data in them, we cannot be sure what we are getting out of the file. In many enterprise situations, we are better of rejecting the file than importing wrong data.

I can see your situation, but Im not convinced that its a common situation. That being said, if we can get @jamesbassett to agree, I think a nice alternative is to, through configuration, to allow one or more EMPTY cells to exist without raising exceptions. It will solve your problem, and it will still keep the reading process sane.

@leroyvi
Copy link
Author

leroyvi commented Feb 1, 2015

I understand that in almost every case, you should reject the file if it does not have the same number of columns as expected. But, IMHO, there is no difference between removing empty or not empty trailing cells. And it's just an optional behaviour.

@kbilsted
Copy link
Member

kbilsted commented Feb 1, 2015

I think it conflicts with the principle of least surprise. I think the code would be more readable if you read the lines using the ListReader and then manipulated it yourself before applying the cell processors as I described above.

@leroyvi
Copy link
Author

leroyvi commented Feb 1, 2015

Ok, I'll implement a Reader for my own needs.

@kbilsted
Copy link
Member

kbilsted commented Feb 1, 2015

No problem, thanks for taking the time to discuss future improvements of the framework

@kbilsted
Copy link
Member

kbilsted commented Feb 1, 2015

Just found another motivation for supporting trailing empty cells - the google csv export

http://stackoverflow.com/questions/23661530/supercsvexception-cellprocessornumber-processing-google-contacts-csv-exports/28178933#28178933

@kbilsted
Copy link
Member

kbilsted commented Feb 1, 2015

Closing this PR - further discussion is moved to #25

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants