Improve git-tracking for Jupyter notebooks #35

aufdenkampe · 2020-05-19T14:05:35Z

As we were working on the testing system (#31), @rheaphy, @steveskrip, @ptomasula and I discussed on a call the challenges of git-tracking Jupyter notebooks, because output cells are updated every time they are run even if the Python code or markdown doesn't change.

Let's work out a good way to save and track Jupyter notebooks.

Here's @ptomasula's email, "HSP2 Potential Solutions for Python Notebooks in GitHub", with background and options:

I did a bit of digging into potential solutions for dealing with merging Python notebooks in git. I found three, which are outlined below, but I’m personally leaning towards Option 2. Option 2 is extremely easy to implement and solves the immediate merging challenges Bob was describing. While there is a slight potential for issues around resolving merge conflicts between binary files, those can largely be avoided by coordinating our efforts (which we’ve done very well thus far). I’d be curious to hear thoughts from the rest of the group. I’d also note that whichever option we choose isn’t necessarily set in stone. If we find the approach isn’t working for us, we can change it down the road. We can also deviate from these options if anyone has a great suggestion for a solution.

Background/Problem
Python notebooks are stored as JSON, which provides for source tracking. However, when a notebook cell is run, certain cell attributes (output, execution count, etc.) are updated. This causes a number of problems, including:

Added difficulty managing merge conflicts (manual line-by-line process)

Larger and somewhat unruly commits

More difficult to review. Important code changes and less critical changes that result from running a cell (trivial for the purposes of source tracking?) both appear in a diff.

Option 1 - Strip output block out prior to commit
Python notebooks are stored as JSON. This makes it fairly easy to read and programmatically strip out the pieces of the document that are causing the issues described above. We could either write or use an existing script to accomplish this.

Pros

Fairly easy to implement. There already appear to be a number of tools that do this (https://pypi.org/project/nbstripout/ or https://github.com/toobaz/ipynb_output_filter). It would also not be a big lift to develop something ourselves if we need to.

Still allows us to source track code changes in the cell contents (‘source’ attribute).

Cons

We lose the ability to share the output directly in the notebook, which may be of some value to users.

Extra step when committing code: the conversion tool needs to be run prior to each commit. (We might be able to get around this using a gitattributes filter, but I haven’t any experience using that.)
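As a rough sketch of what Option 1 involves (not the implementation of nbstripout or ipynb_output_filter), the snippet below clears the `outputs` and `execution_count` fields of code cells using only the standard library. It assumes the nbformat v4 layout with a top-level `cells` list; the function names `strip_outputs` and `strip_file` are hypothetical.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with run-time cell state removed.

    Assumes the nbformat v4 layout (top-level "cells" list).
    """
    stripped = json.loads(json.dumps(notebook))  # simple deep copy of JSON data
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered output
            cell["execution_count"] = None  # drop the In[n] counter
    return stripped


def strip_file(path: str) -> None:
    """Rewrite a .ipynb file in place with outputs stripped (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(strip_outputs(notebook), f, indent=1, ensure_ascii=False)
        f.write("\n")
```

The `source` attribute of every cell is left untouched, so code and markdown changes still show up in a normal diff.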

Option 2 - Enable notebooks to be handled as binary
Use git attributes to flag all or select notebook files as binary. This would overwrite the entire file upon commit and reduce conflict resolution to choosing which version to use (instead of manually resolving lines).

Pros

Easy to implement

Allows for the output from the notebooks to be shared

Cons

Slight potential to overwrite code changes because of how merge conflicts are handled between binary files (i.e. must use either my file or their file)

Less explicit tracking of code changes, but could still tease them out by comparing versions
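For illustration, Option 2 amounts to a single rule in the repository's `.gitattributes` file. Git's built-in `binary` macro attribute turns off text diffing and merging for matching paths. The sketch below adds such a rule if it is missing; the exact pattern `*.ipynb binary` and the helper name `ensure_gitattributes_rule` are assumptions, not necessarily what was committed to this repo.

```python
from pathlib import Path


def ensure_gitattributes_rule(repo_dir: str, rule: str = "*.ipynb binary") -> bool:
    """Append `rule` to <repo_dir>/.gitattributes unless it is already present.

    Returns True if the file was changed. The default rule is an assumed
    example that marks every notebook as binary.
    """
    path = Path(repo_dir) / ".gitattributes"
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    if rule in (line.strip() for line in lines):
        return False  # rule already there, nothing to do
    lines.append(rule)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True
```

A narrower pattern (for example a single directory of notebooks) could be passed as `rule` to flag only select files.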

Option 3 – Use a merge management tool
Use a tool to ease the merge process. The most promising one I've come across is nbdime (https://github.com/jupyter/nbdime), which was developed by the Jupyter team.

Pros

Potential for a best-of-both-worlds approach: easier to manage conflicts while still retaining change tracking on a line-by-line basis

Cons

Appears to be console only (at least nbdime does, but there may be other tools out there)

Doesn't solve unruly commits when looking at them on GitHub

Still requires some level of manual conflict resolution (it’s just drastically reduced)
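To make the idea behind Option 3 concrete, here is a minimal standard-library sketch of a notebook-aware diff that compares only cell sources and ignores outputs and execution counts. It is not nbdime and does not use nbdime's API; the function names are hypothetical and the nbformat v4 `cells` layout is assumed.

```python
import difflib


def cell_source_lines(notebook: dict) -> list:
    """Flatten every cell's source into labeled text lines, ignoring outputs."""
    lines = []
    for index, cell in enumerate(notebook.get("cells", [])):
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a list of lines or one string
            source = "".join(source)
        lines.append(f"## cell {index} ({cell.get('cell_type')})")
        lines.extend(source.splitlines())
    return lines


def source_diff(old: dict, new: dict) -> list:
    """Unified diff of cell sources only; an empty list means no code/markdown change."""
    return list(
        difflib.unified_diff(
            cell_source_lines(old), cell_source_lines(new), "old", "new", lineterm=""
        )
    )
```

Two notebooks that differ only in their outputs produce an empty diff here, which is the behavior a merge management tool would need in order to hide re-run noise.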

aufdenkampe · 2020-05-19T14:06:50Z

Email response from @rheaphy:

Thank you for this research! I too lean toward option 2.

Even if we save notebooks as binary data, perhaps we can still use Jupyter's nbdime to perform the diffs. It was designed to handle the JSON, embedded HTML, and other junk. I haven't used it, since it assumes you are using Git, and until the last couple of weeks I was using Mercurial.

aufdenkampe · 2020-05-19T14:30:49Z

@rheaphy, my commit LimnoTech@32c93ef should now enable @ptomasula's Option 2 and therefore allow us to use this repo to exchange the Jupyter notebooks that are an essential component of #31.

aufdenkampe · 2020-11-16T21:48:01Z

With merging #43 into Master, we can close this!

aufdenkampe · 2021-05-13T18:18:29Z

Our implementation of Option 2 - Enable notebooks to be handled as binary, described above, is no longer working sufficiently well, as it completely obscures advances in our Jupyter notebooks.

Recent advances in how GitHub and GitHub Desktop visualize commit changes have made it easier to navigate the diff of a large JSON-formatted .ipynb file. For this reason, I think we should revert to our original approach. I'll do this shortly.

Meanwhile, I'm reopening this issue to remind us to explore Option 3 – Use a merge management tool. A few articles on this topic are worth reviewing:

One new option is particularly interesting: https://www.reviewnb.com (free for public repositories)

also addresses #35 (git tracking Jupyter notebooks). The new conda environment substantially improves over the previous version, with a more consolidated HDF5 version (1.10.6) and upgrading JupyterLab to v3.

aufdenkampe added this to the Release 1.0! milestone May 19, 2020

aufdenkampe added a commit to LimnoTech/HSPsquared that referenced this issue May 19, 2020

Git-track Jupyter notebooks at binary

32c93ef

Addresses respec#35. Also reverses addition of `*.ipynb` to `.gitignore`, which @rheaphy used as a temporary solution to merge conflicts.

aufdenkampe self-assigned this May 19, 2020

aufdenkampe mentioned this issue Sep 23, 2020

LimnoTech test files & better git-tracking of notebooks from develop to master #43

Merged

aufdenkampe closed this as completed Nov 16, 2020

aufdenkampe mentioned this issue Apr 28, 2021

Improve readUCI & readWDM for a broader range of valid files #40

Closed

PaulDudaRESPEC pushed a commit that referenced this issue Apr 30, 2021

Test10 no longer runs since readWDM time series updates

2a8373e

@ptomasula, try running either of these notebooks. Connects to issue #21 & PR #35

aufdenkampe reopened this May 13, 2021

aufdenkampe removed this from the Release 1.0! milestone Aug 20, 2021
