Skip to content

Loading…

Add delimiter detection to Reader #16

Closed
MarcusDalgren opened this Issue · 4 comments

2 participants

@MarcusDalgren

Hello,

I've tried out your and many other CSV libraries in the hope of finding one that automatically detects the delimiter and sadly none of the libraries I've tried have had a go at this.

I find this odd since it seems like a really useful abstraction that I as the consumer of the library don't want to worry about. The delimiter is a ;? Bakame should have my back without me having to set this manually.

Just grabbing the first line and counting occurences of valid delimiters for a csv file should cover 95% of the cases. Open office lists tab, comma, semicolon and space as valid delimiters for csv.

I'm guessing that if space is the delimiter then you have to use an enclosure for it to work.

@nyamsprod nyamsprod added the question label
@nyamsprod
The League of Extraordinary Packages member

Hi,

As it stands the library was not design to validate CSV. So detecting CSV controls is out of the scope of the library, for now.

CSV documents come in essentially 2 forms, those you create and those you import from outside your controlled environment.

  • The first type, I imagine you do know how they are build so the current behavior is not an issue.
  • The second type is to be checked and validated against what the provider documents when being used by the library.

This means that whoever is providing you some data using CSV should tell which controls is being used and how the CSV is being construct just like with any other data transfer format. If the information is not given you should contact the provider and ask him/her to give it to you!

So you end up with the first situation where you do know the CSV constrols without processing the document. If processing fails then it's either a bug in the library or the API documentation is wrong!!

Just grabbing the first line and counting occurences of valid delimiters for a csv file should cover 95% of the cases. Open office lists tab, comma, semicolon and space as valid delimiters for csv

What processing software like Open office do is giving you hint or suggestion, the final decision is always made by you because sometimes (often I may say) they are incorrect.

@nyamsprod
The League of Extraordinary Packages member

@MarcusDalgren

I have flag your issue as enhancement and I'm working on adding your feature request but I have some questions and maybe you could help me shaping the feature with your feedback.

I've created a function called detectDelimiter and it works like this:

$nb_rows = 5; //the numbers of rows to parse to detect the delimiter by default $nb_rows = 1;
$additional_delimiters = ["\s", "|"]; //additional_delimiters (",", ";", "\t") are already considered
$csv = new Reader('/path/to/my/csv/file');
$res = $csv->detectDelimiter($nb_rows, $additional_delimiters);

Now the crucial part is what the return value $res will be. I have selected 2 options which are the following:
In the first option $res return a scalar value.

  • $res is equals to null if no delimiter is found;
  • $res is equals to false if the CSV is inconsistent meaning the CSV uses more than 1 delimiters;
  • $res is equals to a single character which is the uniquely found delimiter;

using this first option you can do something like this

if (is_null($res)) {
    echo 'no delimiter detected';
} elseif (false === $res) {
    echo 'the CSV is inconsistent it uses several different delimiters';
} else {
    $csv->setDelimiter($res); 
}

In the second option $res return an array.

This array will contained the founded delimiters.

  • If no delimiter is found then the array will be empty.
  • The more delimiters found the more value will be in the array. So an inconsistent CSV will have an array which size will be greater than 1.
  • The delimiters in the array are sorted depending on the number of rows found
$nb_delimiters = count($res);
if ($nb_delimiters == 1) {
    $csv->setDelimiter($res[0]);
} elseif ($nb_delimiters > 1) {
    echo 'the CSV uses ', implode(' ', $res). ' delimiters'; 
} else {
    echo 'no delimiter detected';
}

Can you tell me which version you prefer and why ? Of course if you have a better solution it is more than welcomed :+1:

Of note:

  • the more rows and/or additional delimiters you had, the more time consuming the method will be!
  • If the CSV contains only 1 column, this method will return null or an empty array.
  • this method result should definitely be use as a hint and the developer should be aware of it before using it, My first comment is still valid even thought this method cover your 95% cases.
  • this method still depend on the CSV enclosure and escape properties. This means that you may end up with different results if you change those properties.
@MarcusDalgren

Hi,
First of all, thank you for doing this!

I got the impression that you weren't that interested in the feature so it came as a really nice surprise to see you doing this. I'm more than willing to help out if you've got this in a branch that I can try out and send PRs for.

How about using exceptions? Either an InconsistentDelimiterException and a NoDelimiterException or a more generic CSVParseException that tells you what the problem is in a message or a method. I think that's more clear than keeping track of the different meanings for null and false in your first option. If you end up getting an exception odds are that reading the file wouldn't give you a good result anyway.

Another question is if \s really should be considered as a valid delimiter considering that you'd have to switch the parsing strategy. Using \s really only works if the text is in enclosures so you'd have to keep track of that, you can't just split the string on \s and then count the results. I'm willing to help out with these kinds of special scenarios so that you don't have to spend time on it.

Is this stuff in a branch that I can check out and help you with?
Thanks again!

@nyamsprod nyamsprod added a commit that referenced this issue
@nyamsprod nyamsprod implementing detectDelimiter #16 3bcd75c
@nyamsprod
The League of Extraordinary Packages member

Hi,
I have just push the branch with the method in it. As of this writing it uses the array syntax.

  • about the \s this is an example. I don't think someone would use that really ? :)
  • about the exception since this is a hint I thought throwing exception for a hint method is too much. But I can be wrong. This is the same tought that I had that's why I did not want to implement something like a getLastErrorCode/getLastErrorMessage
@nyamsprod nyamsprod removed the question label
@nyamsprod nyamsprod added a commit that referenced this issue
@nyamsprod nyamsprod implementing detectDelimiter #16 ff40b6d
@nyamsprod nyamsprod added a commit that referenced this issue
@nyamsprod nyamsprod implementing detectDelimiter #16 5a46f7d
@nyamsprod nyamsprod added a commit that referenced this issue
@nyamsprod nyamsprod implementing detectDelimiter #16 7968b7b
@nyamsprod nyamsprod added a commit that referenced this issue
@nyamsprod nyamsprod # This is a combination of 2 commits.
# The first commit's message is:

# This is a combination of 4 commits.
# The first commit's message is:
implementing detectDelimiter #16

# The 2nd commit message will be skipped:

#	implementing detectDelimiter with Exceptions #16

# The 3rd commit message will be skipped:

#	implementing detectDelimiter #16

# The 4th commit message will be skipped:

#	implementing detectDelimiter with Exceptions #16

# The 2nd commit message will be skipped:

#	implementing detectDelimiter with Exceptions #16
ce37e26
@nyamsprod nyamsprod closed this
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.