Improve git-tracking for Jupyter notebooks #35

Open
aufdenkampe opened this issue May 19, 2020 · 4 comments
@aufdenkampe
Collaborator

aufdenkampe commented May 19, 2020

As we were working on the testing system (#31), @rheaphy, @steveskrip, @ptomasula and I discussed on a call the challenges of git-tracking Jupyter notebooks, because output cells are updated every time they are run even if the Python code or markdown doesn't change.

Let's work out a good way to save and track Jupyter notebooks.

Here's @ptomasula's HSP2 Potential Solutions for Python Notebooks in GitHub email with background and options:

I did a bit of digging into potential solutions for merging Python notebooks in git. I found three, which are outlined below, but I’m personally leaning towards Option 2. Option 2 is extremely easy to implement and solves the immediate merging challenges Bob was describing. While there is a slight potential for issues around resolving merge conflicts between binary files, those can largely be avoided by coordinating our efforts (which we’ve done very well thus far). I’d be curious to hear thoughts from the rest of the group. I’d also note that whichever option we choose isn’t necessarily set in stone. If we find the approach isn’t working for us, we can change it down the road. We can also deviate from these options if anyone has a great suggestion for a solution.

Background/Problem
Python notebooks are stored as JSON, which provides for source tracking. However, when a notebook cell is run, certain cell attributes (output, execution count, etc.) are updated. This causes a number of problems, including:

  • Added difficulty managing merge conflicts (manual line-by-line process)
  • Larger and somewhat unruly commits
  • More difficult to review. Important changes in code and less critical changes resulting from running a cell (trivial for the purposes of source tracking?) both appear in a diff.

Option 1 - Strip output block out prior to commit
Python notebooks are stored as JSON. This makes it fairly easy to read and programmatically strip out the pieces of the document that are causing the issues described above. We could either write or use an existing script to accomplish this.

  • Pros
  • Cons
    • We lose the ability to share the output directly in the notebook, which may be of some value to users.
    • Extra step when committing code. We need to run a conversion tool prior to commit. (We might be able to get around this using a gitattributes filter, but I haven’t any experience using that.)
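
A minimal sketch of what such a stripping script could look like, using only the standard library. It assumes the nbformat 4 layout, in which each code cell carries `outputs` and `execution_count` keys. The names `strip_outputs` and `strip_file` are hypothetical, and an existing tool such as nbstripout would likely handle more edge cases than this does.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs and execution counts cleared."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def strip_file(path: str) -> None:
    """Rewrite the notebook at `path` in place without outputs (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(strip_outputs(notebook), f, indent=1, ensure_ascii=False)
        f.write("\n")
```

Running `strip_file` on each changed notebook before `git add` would be the "extra step" noted above; the code and markdown sources are left untouched.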

Option 2 - Enable notebooks to be handled as binary
Use git attributes to flag all or selected notebook files as binary. This would overwrite the entire file upon commit and reduce conflict resolution to choosing which version to use (instead of manually resolving lines).

  • Pros
    • Easy to implement
    • Allows for the output from the notebooks to be shared
  • Cons
    • Slight potential to overwrite code changes because of how merge conflicts are handled between binary files (i.e. must use either my file or their file)
    • Less explicit tracking of code changes, but could still tease them out by comparing versions
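
As a rough illustration of how little Option 2 involves, the sketch below adds a `*.ipynb binary` rule to a repository's `.gitattributes` file. `binary` is git's built-in attribute macro (equivalent to `-diff -merge -text`). The function name `mark_notebooks_binary` is hypothetical, and this is not necessarily the exact rule used in any commit referenced in this thread.

```python
from pathlib import Path

RULE = "*.ipynb binary"  # `binary` is git's built-in macro for -diff -merge -text


def mark_notebooks_binary(repo_root: str) -> bool:
    """Append the rule to .gitattributes if absent; return True if the file changed."""
    attrs = Path(repo_root) / ".gitattributes"
    existing = attrs.read_text(encoding="utf-8") if attrs.exists() else ""
    if RULE in (line.strip() for line in existing.splitlines()):
        return False  # rule already present, nothing to do
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    attrs.write_text(existing + separator + RULE + "\n", encoding="utf-8")
    return True
```

To flag only selected notebooks, the pattern in `RULE` could be narrowed to a specific path instead of `*.ipynb`.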

Option 3 – Use a merge management tool
Use a tool to ease the merge process. The most promising one I've come across is nbdime (https://github.com/jupyter/nbdime), which was developed by the Jupyter team.

  • Pros
    • Potential for a best-of-both-worlds approach: easier to manage conflicts while still retaining change tracking on a line-by-line basis
  • Cons
    • Appears to be console only (at least nbdime does, but there may be other tools out there)
    • Doesn't solve unruly commits when looking at them on GitHub
    • Still requires some level of manual conflict resolution (it’s just drastically reduced)
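
To illustrate the idea behind content-aware notebook diffing, here is a rough standard-library sketch that compares only the cell sources of two notebooks and ignores outputs and execution counts. This is not how nbdime works internally, and `source_lines` and `source_diff` are hypothetical names; nbdime itself handles outputs, metadata, and merging far more carefully.

```python
import difflib


def source_lines(notebook: dict) -> list:
    """Flatten a notebook dict into labelled source lines, ignoring outputs."""
    lines = []
    for index, cell in enumerate(notebook.get("cells", [])):
        lines.append(f"# cell {index} ({cell.get('cell_type')})")
        source = cell.get("source", [])
        # nbformat allows `source` to be a string or a list of strings
        text = source if isinstance(source, str) else "".join(source)
        lines.extend(text.splitlines())
    return lines


def source_diff(old: dict, new: dict) -> list:
    """Unified diff of the cell sources of two notebook dicts."""
    return list(
        difflib.unified_diff(
            source_lines(old),
            source_lines(new),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )
```

Two notebooks that differ only because cells were re-run produce an empty diff here, which is the property that makes this kind of tool easier to review with than a raw JSON diff.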
@aufdenkampe
Collaborator Author

Email response from @rheaphy:

Thank you for this research! I too lean toward option 2.

Even if we save notebooks as binary data, perhaps we can still use Jupyter's nbdime to perform the diffs. It was designed to handle the JSON, embedded HTML, and other junk. I haven't used it, since it assumes you are using Git and, until the last couple of weeks, I was using Mercurial.

@aufdenkampe aufdenkampe added this to the Release 1.0! milestone May 19, 2020
aufdenkampe added a commit to LimnoTech/HSPsquared that referenced this issue May 19, 2020
Addresses respec#35.
Also reverses addition of `*.ipynb` to `.gitignore`, which @rheaphy used as a temporary solution to merge conflicts.
@aufdenkampe aufdenkampe self-assigned this May 19, 2020
@aufdenkampe
Collaborator Author

@rheaphy, my commit LimnoTech@32c93ef should now enable @ptomasula's Option 2 and therefore allow us to use this repo to exchange the Jupyter notebooks that are an essential component of #31.

@aufdenkampe
Collaborator Author

With #43 merged into master, we can close this!

PaulDudaRESPEC pushed a commit that referenced this issue Apr 30, 2021
@ptomasula, try running either of these notebooks.
Connects to issue #21 & PR #35
@aufdenkampe
Collaborator Author

aufdenkampe commented May 13, 2021

Our implementation of Option 2 - Enable notebooks to be handled as binary, described above, is no longer working sufficiently well, as it completely obscures advances in our Jupyter notebooks.

Recent advances in how GitHub and GitHub Desktop visualize commit changes have made it easier to navigate the diff of a large JSON-formatted .ipynb file. For this reason, I think we should revert to our original approach. I'll do this shortly.

Meanwhile, I'm reopening this issue to remind us to explore Option 3 – Use a merge management tool. A few articles on this topic are worth reviewing:

One new option is particularly interesting: https://www.reviewnb.com (free for public repositories)

@aufdenkampe aufdenkampe reopened this May 13, 2021
@aufdenkampe aufdenkampe removed this from the Release 1.0! milestone Aug 20, 2021
aufdenkampe added a commit that referenced this issue Sep 20, 2021
also addresses #35 (git tracking Jupyter notebooks).
The new conda environment substantially improves over the previous version, with a more consolidated HDF5 version (1.10.6) and upgrading JupyterLab to v3.