Handle csv files with data values containing carriage returns #31

Merged
tilo merged 3 commits into tilo:master on Oct 28, 2014


@chrismhilton
Contributor

My branch contains 3 commits, with associated tests, that handle various scenarios in which data values within a file contain carriage return characters.

The first commit handles quoted data values that contain the row separator carriage return character (which previously resulted in a CSV::MalformedCSVError). When reading the lines of data, if the current line string contains an odd number of quote characters, the content of the next line is appended to it.
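
As a rough illustration of the quote-counting idea described for the first commit, here is a minimal sketch. It is not the code from the pull request: the method name `logical_rows` and its `row_sep` and `quote_char` parameters are hypothetical, and it assumes that embedded quotes are escaped by doubling, so that a complete record always holds an even number of quote characters.

```ruby
require 'stringio'

# Hypothetical helper (not part of smarter_csv): groups physical lines into
# logical CSV records. While the buffered text holds an odd number of quote
# characters, a quoted field is still open, so the next line is appended.
def logical_rows(io, row_sep: "\n", quote_char: '"')
  rows = []
  buffer = +""
  io.each_line(row_sep) do |line|
    buffer << line
    # Odd quote count: the record continues on the next physical line.
    next if buffer.count(quote_char).odd?

    rows << buffer.chomp(row_sep)
    buffer = +""
  end
  # Keep any trailing text, even if its last quoted field was never closed.
  rows << buffer.chomp(row_sep) unless buffer.empty?
  rows
end

data = "id,notes\n1,\"first line\nsecond line\"\n2,plain\n"
logical_rows(StringIO.new(data)).each { |row| p row }
```

The four physical lines in `data` come back as three records, with the multi-line `notes` value kept intact inside the second one.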

The second commit handles carriage return characters within the data when guessing line endings, by ignoring any such characters that appear inside quote characters. I made this change to resolve an issue where a file contained more quoted carriage return characters in its data than line ending characters.
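
The line-ending guess described for the second commit could be sketched along these lines. This is an assumption-laden illustration rather than the merged implementation: `guess_row_sep` and its `quote_char` parameter are hypothetical names, and the candidate separators are assumed to be `"\r\n"`, `"\n"` and `"\r"`.

```ruby
# Hypothetical helper (not part of smarter_csv): counts candidate line endings
# in a sample, skipping any that sit inside a quoted field, and returns the
# most frequent one (or nil if none is found).
def guess_row_sep(sample, quote_char: '"')
  counts = Hash.new(0)
  in_quotes = false
  chars = sample.chars
  i = 0
  while i < chars.length
    c = chars[i]
    if c == quote_char
      # Each quote character toggles the "inside a quoted field" state.
      in_quotes = !in_quotes
    elsif !in_quotes
      if c == "\r" && chars[i + 1] == "\n"
        counts["\r\n"] += 1
        i += 1 # consume the "\n" that belongs to this "\r\n" pair
      elsif c == "\r" || c == "\n"
        counts[c] += 1
      end
    end
    i += 1
  end
  best = counts.max_by { |_sep, n| n }
  best && best.first
end

# Four quoted "\r" characters, but only three real "\n" line endings.
sample = "a,b\n1,\"v\rw\rx\ry\rz\"\n2,q\n"
p guess_row_sep(sample)
```

A plain character count over `sample` would find more `"\r"` than `"\n"`; skipping the quoted region leaves `"\n"` as the only candidate.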

The third commit handles double carriage return characters when removing empty values, so that values containing them are not deemed to be empty.
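
The thread does not show how the empty-value check was written, so the following is only a guess at one way the third problem can arise. It assumes the check was a regular expression with line anchors (`^` and `$`), which in Ruby match at every line boundary; a value with a blank line in the middle then looks "empty". The constant names and the `remove_empty_values` helper below are hypothetical, and `"\n"` stands in for whatever row separator the file uses.

```ruby
# Assumed shape of a blank check with line anchors: matches any blank LINE.
LINE_ANCHORED_BLANK = /^\s*$/
# Whole-string anchors: matches only if the ENTIRE value is whitespace.
WHOLE_STRING_BLANK = /\A\s*\z/

# Hypothetical helper: drops nil or blank values from a row hash,
# using whichever definition of "blank" is passed in.
def remove_empty_values(row, blank_pattern)
  row.reject { |_key, value| value.nil? || value.to_s.match?(blank_pattern) }
end

row = {
  name: "widget",
  notes: "first paragraph\n\nsecond paragraph", # doubled separator inside the value
  spare: "  "
}

p remove_empty_values(row, LINE_ANCHORED_BLANK).keys
p remove_empty_values(row, WHOLE_STRING_BLANK).keys
```

With the line-anchored pattern the `notes` value is discarded along with `spare`; with the whole-string pattern only `spare` is removed.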

@tilo
Owner
tilo commented Feb 17, 2014

thanks, I'll have a look

@tilo
Owner
tilo commented Feb 18, 2014

What is the source of your CSV files? I'd argue that the source program contains a bug if it writes carriage return characters anywhere other than at the end of a line.

Thank you for sharing your modifications, but this looks too much like a rare corner-case to me. I'm not sure if many people would benefit from this.

@chrismhilton
Contributor

There will always be quoted carriage return characters present within the file if a cell value consists of multi-line text. I'd argue that this is a common occurrence if you're attempting to read files containing description field data.

@chrisbranson

As per Wikipedia, under "Basic rules and examples":

"A record ends at a line terminator. However, line-terminators can be embedded as data within fields, so software must recognize quoted line-separators (see below) in order to correctly assemble an entire record from perhaps multiple lines."

http://en.wikipedia.org/wiki/Comma-separated_values

@tilo
Owner
tilo commented Feb 19, 2014

OK, thanks for the input!

@robly
robly commented Feb 21, 2014

This is a rather common occurrence when dealing with any data that contains something along the lines of 'notes' that users could enter regarding the other data fields.

I agree that this would be a nice feature, especially since FasterCSV handles this properly and losing that feature is frustrating.
Thanks

@wyaeld added a commit to wyaeld/smarter_csv that referenced this pull request May 3, 2014
Merge pull #31 from @chrismhilton to support carriage returns b156b45
@wyaeld
wyaeld commented May 3, 2014

I've temporarily forked and merged for my own use.
@tilo These features seem well worth adding.

@sunito
sunito commented May 13, 2014

I was in desperate need of this feature.
I had thought of abandoning smarter_csv until I found this patch.
I've now incorporated it in a monkey-patching way into my smarter_csv.

@tilo
Owner
tilo commented Oct 28, 2014

@chrismhilton thanks for your contribution! nice work! sorry I didn't have time to look at this project for a while.

@tilo
Owner
tilo commented Oct 28, 2014

@sunito @wyaeld I'm merging this into the project and will release a new version

@tilo merged commit ece7737 into tilo:master Oct 28, 2014

1 check passed

default The Travis CI build passed

@tilo
Owner
tilo commented Oct 28, 2014

@wyaeld @chrismhilton @sunito @robly Sorry for the delay! It's been super-busy at work :-P
