Use super CSV for writing CSV files #416

Merged
merged 2 commits into from
Feb 24, 2014

domoritz
Member

Instead of:

"foo", "bar,bar","baz"

we will write

foo,"bar,bar",baz
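The quoting rule described above (quote a field only when it needs it) can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Super CSV's implementation and not code from this pull request: `MinimalCsvQuoting`, `escapeField`, and `formatRow` are hypothetical names, and the rule assumed here is the RFC 4180 one (quote when the field contains the separator, a quote character, or a line break, and double any embedded quotes).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration only; not part of Super CSV or this PR. */
final class MinimalCsvQuoting {
    private static final char QUOTE = '"';

    /** Quotes a field only if it contains the separator, a quote, or a line break. */
    static String escapeField(final String field, final char separator) {
        final boolean needsQuotes = field.indexOf(separator) >= 0
            || field.indexOf(QUOTE) >= 0
            || field.indexOf('\n') >= 0
            || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        // Embedded quotes are doubled, following RFC 4180.
        return QUOTE + field.replace("\"", "\"\"") + QUOTE;
    }

    /** Joins the escaped fields of one row with the separator (no line ending added). */
    static String formatRow(final List<String> fields, final char separator) {
        return fields.stream()
            .map(f -> escapeField(f, separator))
            .collect(Collectors.joining(String.valueOf(separator)));
    }

    public static void main(final String[] args) {
        // prints: foo,"bar,bar",baz
        System.out.println(formatRow(Arrays.asList("foo", "bar,bar", "baz"), ','));
    }
}
```

Only the middle field is quoted because it is the only one containing the separator, which is where the file-size saving comes from.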

…essary quotes which makes the files much smaller
@dhalperi
Member

Any speed results here?

@@ -42,11 +45,14 @@ public CsvTupleWriter(final OutputStream out) {
* @param out the {@link OutputStream} to which the data will be written.
*/
public CsvTupleWriter(final char separator, final OutputStream out) {
csvWriter = new CSVWriter(new BufferedWriter(new OutputStreamWriter(out)), separator);
final CsvPreference sepratorPreference =
Member

typo in name

Member

Any reason not to expose the quote char, the EOL char, etc. as options too? I guess right now CSV really means CSV, but in the future I could imagine us customizing it.

Member Author

I think we should extend it when we need it.

@domoritz
Member Author

Running: B(x,xx,y,yy) :- TwitterK(x,xx), TwitterK(y,yy)

Super CSV:

  • 18.082097s
  • 20.295329s
  • 18.257644s

Open CSV (master):

  • 20.042965s
  • 16.579250s
  • 24.542094s
  • 16.357882s

It looks like the variation is much larger for Open CSV in my small sample, but neither library looks significantly slower or faster than the other. The file size for downloads, on the other hand, can be significantly smaller with Super CSV.
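To put numbers on the variation, the timings listed above can be summarized with a mean and a sample standard deviation. This is a rough sketch for illustration: `TimingStats` and its members are hypothetical names, and with only three and four runs the statistics are indicative at best.

```java
/** Hypothetical helper for illustration only; summarizes the timings quoted above. */
final class TimingStats {
    // Wall-clock times in seconds, copied from the comment above.
    static final double[] SUPER_CSV = {18.082097, 20.295329, 18.257644};
    static final double[] OPEN_CSV = {20.042965, 16.579250, 24.542094, 16.357882};

    /** Arithmetic mean of the samples. */
    static double mean(final double[] xs) {
        double sum = 0.0;
        for (final double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (divides by n - 1). */
    static double sampleStdDev(final double[] xs) {
        final double m = mean(xs);
        double squares = 0.0;
        for (final double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }

    public static void main(final String[] args) {
        System.out.printf("Super CSV: mean %.2fs, stddev %.2fs%n",
            mean(SUPER_CSV), sampleStdDev(SUPER_CSV));
        System.out.printf("Open CSV:  mean %.2fs, stddev %.2fs%n",
            mean(OPEN_CSV), sampleStdDev(OPEN_CSV));
    }
}
```

On these samples the means come out at roughly 18.9 s and 19.4 s, with sample standard deviations of roughly 1.2 s and 3.8 s, which is consistent with the reading that the spread differs more than the average.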

@stechu
Contributor

stechu commented Feb 22, 2014

FYI:

My experience on a bigger dataset suggests that Open CSV is faster than the
previous implementation using Java file scan, which is somewhat different from
what @dhalperi has found?

Using Java file scan to ingest Twitter data (about 20 GB) takes around 1 hr.
Using Open CSV to ingest Freebase (about 200 GB) takes 2.6 hr.

Although this is not a fair comparison, considering that Freebase is much
harder to parse, it could still show something.
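As a back-of-the-envelope check, the two ingest figures above can be turned into throughput. This is only a sketch: `IngestThroughput` is a hypothetical name, the sizes and durations are the approximate values quoted in the comment, and the datasets differ, so the ratio is indicative rather than a benchmark.

```java
/** Hypothetical helper for illustration only; converts the quoted figures to GB per hour. */
final class IngestThroughput {
    /** Throughput in gigabytes per hour. */
    static double gbPerHour(final double gigabytes, final double hours) {
        return gigabytes / hours;
    }

    public static void main(final String[] args) {
        final double fileScan = gbPerHour(20.0, 1.0);   // Twitter data, Java file scan
        final double openCsv = gbPerHour(200.0, 2.6);   // Freebase, Open CSV
        System.out.printf("Java file scan: %.1f GB/h%n", fileScan);
        System.out.printf("Open CSV:       %.1f GB/h%n", openCsv);
        System.out.printf("Ratio:          %.1fx%n", openCsv / fileScan);
    }
}
```

With the quoted numbers this works out to about 20 GB/h versus about 77 GB/h, a ratio of a little under 4x.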

dhalperi added a commit that referenced this pull request Feb 24, 2014
Use super CSV for writing CSV files
@dhalperi dhalperi merged commit f32b5f0 into master Feb 24, 2014
@dhalperi dhalperi deleted the domoritz-supercsv branch February 24, 2014 00:27

3 participants