This repository has been archived by the owner on May 19, 2021. It is now read-only.

Data 'diff' format #19

Open
rdpeng opened this issue Mar 1, 2015 · 28 comments

@rdpeng

rdpeng commented Mar 1, 2015

One thing I've always wanted is a 'diff' type output for datasets (let's say tabular datasets for now). When I use git to manage projects, changes to the datasets I use are difficult to visualize using the standard diff output, which is line based. That works when rows are changed but not when columns are added/deleted or transformations are made. Is there a way to categorize the types of changes that can be made to a dataset and then visualize them in a useful way?
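One way to start on the categorization asked about above is to separate column-level changes from row-level ones. The sketch below is only an illustration, not an existing tool: `column_changes` is a hypothetical helper that compares the header rows of two CSV files by name and reports columns that were added or removed. It assumes plain comma-separated headers with no quoted commas, and because it matches by name it reports a rename as one removal plus one addition.

```shell
# Hypothetical helper: report columns added to or removed from a CSV header.
# Assumption: simple CSV, i.e. no quoted fields that contain commas.
column_changes() {
  # $1 = old CSV, $2 = new CSV
  old_header=$(head -n 1 "$1")
  new_header=$(head -n 1 "$2")
  awk -v old="$old_header" -v new="$new_header" 'BEGIN {
    n_old = split(old, o, ",")
    n_new = split(new, n, ",")
    for (i = 1; i <= n_old; i++) in_old[o[i]] = 1
    for (i = 1; i <= n_new; i++) in_new[n[i]] = 1
    # A name present only in the old header was removed (or renamed).
    for (i = 1; i <= n_old; i++) if (!(o[i] in in_new)) print "removed column: " o[i]
    # A name present only in the new header was added (or is a rename target).
    for (i = 1; i <= n_new; i++) if (!(n[i] in in_old)) print "added column: " n[i]
  }'
}
```

For headers `a,b,c` and `a,c,d` this prints `removed column: b` and `added column: d`; telling a rename apart from a removal plus an addition would need a comparison of the column contents as well.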

@srvanderplas

I would be most interested in this as well.


@rdpeng
Author

rdpeng commented Mar 1, 2015

Awesome! I'm not 100% sure how this would work, but I think for it to be useful it would have to sit on top of git and then maybe show how the dataset changes independent of git's own output. The issue there would then be efficiency....

@ledell

ledell commented Mar 1, 2015

@rdpeng, are you familiar with dat? It's a version control system for data, which feels very similar to git. At the moment, I think it tracks modifications by row only, but I am hoping that column-based diffs will be part of a future release. I agree that something that tracks a variety of transformations/modifications would be very useful.

@jennybc
Member

jennybc commented Mar 1, 2015

👍 I was just talking to @gvwilson about this exact thing earlier this week….

@jeroen
Member

jeroen commented Mar 1, 2015 via email

@karthik
Member

karthik commented Mar 1, 2015

Thanks @ledell for mentioning Dat. I'm on my phone so this will be brief, but I'll expand later. Dat can natively do diffs, and rOpenSci has an rDat package in the works, waiting on Dat to come to beta (which is soon). I've invited the Dat project to join us, and Karissa from their team will be there.

There are some issues with rDat that I'm hoping Jeroen will help resolve. But 💯 to pursuing this idea. It should be easy to complete at the event.

@rdpeng
Author

rdpeng commented Mar 1, 2015

Daff looks quite good actually, and seems to implement most of what I was thinking about.

One thing I was hoping to implement was something a bit more "intelligent" (and likely more constraining). So for example, if I transform a column by squaring it, is there a way to show that rather than just indicating that every value in the column changed? Perhaps a diff could be expressed via R code rather than something along the lines of the usual +/- diff format.
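The squaring example could, under strong assumptions, be detected by testing a candidate transformation directly against the two versions of a column. The sketch below is an illustration only: `is_squared` is a hypothetical helper, it assumes both files list the same rows in the same order, that the column is numeric, and that a tolerance of 1e-9 is acceptable. A real tool would need a whole catalogue of candidate transformations, not just this one.

```shell
# Hypothetical check: is column COL in the new CSV the square of COL in the old CSV?
# Assumptions: same rows in the same order, numeric values, no quoted commas.
is_squared() {
  # $1 = column name, $2 = old CSV, $3 = new CSV
  awk -F, -v col="$1" '
    FNR == 1 {                        # header of each file: find the column index
      idx = 0
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (idx == 0) { bad = 1; exit }
      next
    }
    NR == FNR { old[FNR] = $idx; old_rows++; next }   # first file: store old values
    {                                                 # second file: compare to squares
      new_rows++
      d = $idx - old[FNR] * old[FNR]
      if (d < 0) d = -d
      if (d > 1e-9) bad = 1
    }
    END { if (bad || new_rows == 0 || new_rows != old_rows) exit 1 }
  ' "$2" "$3"
}

# Example use: report the change as a transformation instead of a per-cell diff.
# is_squared x old.csv new.csv && echo "x <- x^2"
```

The function exits with status 0 when every new value matches the square of the old one, so the caller can print a one-line description such as `x <- x^2` in place of a full column of changed cells.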

@gmbecker

gmbecker commented Mar 9, 2015

@rdpeng your latest comment sounds like a provenance-tracking problem. Are you thinking this will be applied in a system aware of what is done, or does it need to work like diff, i.e., given two datasets and no extra information, tell me the differences?

@jennybc
Member

jennybc commented Mar 10, 2015

Pardon if this is slightly off-topic but I want to park these links in a few relevant places, like this thread.

Re: weaning people off of Excel for data inspection and cleaning. OpenRefine comes up a lot and is generally popular with people expecting a GUI. I had always thought it was only mouse driven, but that it wrote some sort of log file. Did not realize these logs are perhaps re-executable. But a recent Twitter conversation intrigues me and also alerted me to Ruby and Python wrappers around the underlying Refine API. @ostephens says:

  • OpenRefine captures actions which can be exported as JSON and replayed against same/other data
  • the only things that aren’t captured for replay are edits at the level of individual cells
  • so OpenRefine works well where the fix can be applied across rows - all transforms exportable

@rdpeng
Author

rdpeng commented Mar 10, 2015

I'm not sure, to be honest. I think I would need a brief discussion of the
pros and cons of either one.


Roger D. Peng | @rdpeng https://twitter.com/rdpeng |
http://www.biostat.jhsph.edu/~rpeng/

@sjackman

I'm interested in this topic.

@vsbuffalo

Me too! I really like this idea. In the Unix tradition, I think the best approach to an implementation might be a C or C++ library and command line tool (e.g. like curl). Then we could maybe write a simple R wrapper.

@gmbecker

Maybe, though that requires us to write it in C/C++ instead of the much
easier R. There are benefits (though not that many to R users), but pretty
major downsides too.

I would argue that - for prototyping algorithms and features, at least -
implementing it initially in R is a more efficient use of our time.

Remember what Duncan always said: for every two lines of C you write, you
introduce 3 bugs.

~G


Gabriel Becker, PhD
Computational Biologist
Bioinformatics and Computational Biology
Genentech, Inc.

@bbest

bbest commented Mar 26, 2015

There's a slick little Chrome extension, CSVHub, that will visualize the daff-like differencing of a CSV from within GitHub:

@sjackman

@bbest Wow! That's fantastic! Oddly it doesn't work with TSV files. I've opened an issue Data-Liberation-Front/csvhub#8 to request this feature.

@sjackman

These are the git aliases that I use for diffing TSV and CSV files.

[alias]
    wdiff = diff --word-diff=plain
    wdiffc = diff --word-diff=color
    wdiffcsv = diff --word-diff=color --word-diff-regex=[^,]+

See https://github.com/sjackman/dotfiles/blob/master/.gitconfig#L3-L5
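To see what the `[^,]+` regex in the `wdiffcsv` alias does, it can help to look at the tokens it produces for a single line. As documented for `--word-diff-regex`, git treats every match as a word and ignores whatever lies between matches. The helper below, `csv_tokens`, is a hypothetical name used only for this sketch; it relies on `grep -o`, which is widely available but not required by POSIX.

```shell
# Hypothetical helper: print the tokens that the regex [^,]+ finds on one line.
# These are the units a word diff with --word-diff-regex='[^,]+' compares.
csv_tokens() {
  printf '%s\n' "$1" | grep -oE '[^,]+'
}

csv_tokens '4,5,6'     # three tokens, one per field
csv_tokens '1,,3'      # two tokens: the empty field yields no token
csv_tokens '"a,b",c'   # three tokens: the quoted comma still splits the field
```

The last two calls show the limits of the approach: empty fields are invisible to the word diff, and quoted fields containing commas are split, so the alias suits simple CSV files best.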

screenshot 2015-03-26 14 48 35

@jordansread

👍

@jules32

jules32 commented Mar 26, 2015

thanks for sharing @sjackman :)

@karthik
Member

karthik commented Mar 26, 2015

Nice, @sjackman!

@jeroen
Member

jeroen commented Mar 26, 2015

Very cool. So you mentioned git uses wdiff under the hood to diff code by word as well?

@okdistribute

I took some notes from our conversation today. Thanks for contributing to the workshop! okdistribute/knead#1

@bbest

bbest commented Mar 27, 2015

Good one @sjackman! Here's my little play session with trying out this technique...

# add alias to git's config
git config --global alias.diffcsv "diff --word-diff=color --word-diff-regex=[^,]+"

# initialize repo
git init test_csv; cd test_csv

# 1st commit of test csv
echo -e 'a,b,c\n1,2,3\n4,5,6' > x.csv; cat x.csv
git add x.csv; git commit -m 'initial csv'

# modify csv: b->c, c->d, 4->8
echo -e 'a,c,d\n1,2,3\n8,5,6' > x.csv; cat x.csv 

# compare against previous commit
git diff x.csv 
git diffcsv x.csv 

# 2nd commit on modified csv: b->c, c->d, 4->8
git commit -a -m 'modified csv'

# modify csv: +e column with 0's
echo -e 'a,c,d,e\n1,4,3,0\n8,5,6,0' > x.csv 

# compare against previous commit 
git diffcsv x.csv

# 3rd commit on modified csv: +e column with 0's
git commit -a -m 'modified csv again'

# look at history of commits
git log

# compare between specific commits of the csv (swapping from your git log output)
git diffcsv 56515ac..97bfd69 -- x.csv 

@sjackman

daff works really well!

screenshot 2015-03-27 09 54 33

@jules32

jules32 commented Mar 27, 2015

Following up with @sjackman and @bbest's examples: I moved @bbest's script into R since for us this kind of visual differencing would need to be portable (i.e., to be able to share it with colleagues outside of your own terminal window).

Unfortunately RStudio doesn't do the color differencing (would that even be possible?) and in fact the display is not useful. What would further options be? @Karissa?

Examples and full R script below.

Comparing git diffcsv x.csv from @bbest's code above run in Terminal and RStudio:

image

image

R translation of @bbest's bash script above

# add alias to git's config
system('git config --global alias.diffcsv "diff --word-diff=color --word-diff-regex=[^,]+"')

# initialize repo (setwd() is needed: a `cd` run inside system() does not persist)
system('git init test_csv')
setwd('test_csv')


# 1st commit of test csv
x = data.frame(a = c(1,4), b = c(2,5), c = c(3,6)); x
write.csv(x, 'x.csv', row.names = FALSE)

system("git add x.csv; git commit -m 'initial csv'")


# modify csv: b->c, c->d, 4->8
x = data.frame(a = c(1,8), c = c(2,5), d = c(3,6)); x
write.csv(x, 'x.csv', row.names = FALSE)

# compare against previous commit
system('git diff x.csv') 
system('git diffcsv x.csv') 

# 2nd commit on modified csv: b->c, c->d, 4->8
system("git commit -a -m 'modified csv'")

# modify csv: +e column with 0's
x = data.frame(a = c(1,8), c = c(2,5), d = c(3,6), e = c(0,0)); x
write.csv(x, 'x.csv', row.names = FALSE)


# compare against previous commit 
system('git diffcsv x.csv')

# 3rd commit on modified csv: +e column with 0's
system("git commit -a -m 'modified csv again'")

# look at history of commits
system('git log')

# compare between specific commits of the csv (swapping from your git log output)
system('git diffcsv a4c1add0..5cf47e62 -- x.csv') 

@sjackman

It's not as pretty, but you can use --word-diff=plain instead of --word-diff=color inside of RStudio. It'll use [-foo-] to indicate removed text and {+bar+} to indicate added text.

❯❯❯ git diff --word-diff=plain --word-diff-regex='[^,]+' foo.csv bar.csv
diff --git a/foo.csv b/bar.csv
index cbe2c72..46c6789 100644
--- a/foo.csv
+++ b/bar.csv
@@ -1,3 +1,3 @@
A,B,C
1,2,3
4,[-5-]{+7+},6
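Because the plain format marks changes with `[-...-]` and `{+...+}`, its output can also be post-processed as text, which may be useful where colour is unavailable. The following is a minimal sketch under the assumption that the data itself never contains those marker sequences; `word_diff_changes` is a hypothetical name, not a git feature.

```shell
# Hypothetical filter: list removed and added tokens found in the output of
# `git diff --word-diff=plain`, read from standard input.
# Assumption: the data never contains the literal markers [- -] or {+ +}.
word_diff_changes() {
  grep -oE '\[-[^]]*-\]|\{\+[^}]*\+\}' |
    sed -e 's/^\[-\(.*\)-]$/removed: \1/' \
        -e 's/^{+\(.*\)+}$/added: \1/'
}

# Example use inside a repository:
# git diff --word-diff=plain --word-diff-regex='[^,]+' x.csv | word_diff_changes
```

Fed the diff shown above, this prints `removed: 5` followed by `added: 7`.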

@sjackman

It's possible that RStudio could render the ANSI colour code escape sequences. Certainly no harm in opening an issue with a feature request.
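Until such rendering exists, one fallback is to strip the escape sequences so the console at least shows clean text. This is a sketch of that idea only, not something RStudio or git provides: `strip_ansi` is a hypothetical helper that handles just the SGR colour sequences (ESC, `[`, digits and semicolons, `m`). Stripping discards the red/green distinction, so `--word-diff=plain` remains the more informative choice.

```shell
# Hypothetical filter: remove ANSI SGR colour sequences such as ESC[31m or ESC[m.
# Other escape sequences (cursor movement etc.) are deliberately not handled.
strip_ansi() {
  esc=$(printf '\033')
  sed "s/${esc}\[[0-9;]*m//g"
}

# Example use inside a repository:
# git diff --word-diff=color --word-diff-regex='[^,]+' x.csv | strip_ansi
```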

@bbest

bbest commented Mar 30, 2015

By the way, pushing the test_csv repo created above to https://github.com/bbest/test_csv and viewing in Google Chrome with CSVHub nicely renders the differences between the following csv commits:

  1. 56515ac initial csv

    image

  2. db8644a modified csv: b->c, c->d, 4->8

    image

  3. 97bfd69 modified csv again: +e column with 0's

    image

And now for the comparisons using daff style differencing (green add, red delete, blue modify) with the CSVHub Google Chrome extension:

@jules32

jules32 commented Mar 31, 2015

Thanks @bbest!

On Mon, Mar 30, 2015 at 3:05 PM, Ben Best notifications@github.com wrote:

By the way, pushing the test_csv repo created above to https://github.com/bbest/test_csv and viewing in Google Chrome with CSVHub (https://chrome.google.com/webstore/detail/csvhub/dbemglgpbebafkibfncdpdmdikacingf) nicely renders the differences between the following csv commits:

  1. 56515ac initial csv

    [image: https://cloud.githubusercontent.com/assets/2837257/6907627/b417caec-d6ec-11e4-894b-8de6f330e278.png]

  2. db8644a modified csv: b->c, c->d, 4->8

    [image: https://cloud.githubusercontent.com/assets/2837257/6907643/e060458e-d6ec-11e4-8ce4-252d4aa41a13.png]

  3. 97bfd69 modified csv again: +e column with 0's

    [image: https://cloud.githubusercontent.com/assets/2837257/6907638/d4339040-d6ec-11e4-8353-86be4c223ff8.png]

And now for the comparisons using daff style differencing (green add, red delete, blue modify) with the CSVHub Google Chrome extension:

  • bbest/test_csv@56515ac...db8644a, 1st to 2nd: b->c, c->d, 4->8

    [image: https://cloud.githubusercontent.com/assets/2837257/6907701/465aa78a-d6ed-11e4-8655-1472b0503afe.png]

  • bbest/test_csv@db8644a...97bfd69, 2nd to 3rd: +e column with 0's

    [image: https://cloud.githubusercontent.com/assets/2837257/6907719/70ef4596-d6ed-11e4-8670-14d2d6c7df6f.png]

  • bbest/test_csv@56515ac...97bfd69, 1st to 3rd: b->c, c->d, 4->8, +e column with 0's

    [image: https://cloud.githubusercontent.com/assets/2837257/6907747/ac2ac4e6-d6ed-11e4-9f20-ded7d6355331.png]

Julia Stewart Lowndes, PhD
Project Scientist, Ocean Health Index http://www.oceanhealthindex.org
National Center for Ecological Analysis and Synthesis (NCEAS
http://www.nceas.ucsb.edu)
University of California, Santa Barbara
735 State Street, Suite 300
Santa Barbara, CA, 93101, USA
Phone: 1-805-893-7523
