Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple approach for concurrent opening of Notebooks #11

Closed
diocas opened this issue Sep 3, 2020 · 18 comments
Closed

Simple approach for concurrent opening of Notebooks #11

diocas opened this issue Sep 3, 2020 · 18 comments

Comments

@diocas
Copy link
Contributor

diocas commented Sep 3, 2020

I decided to move this issue from swan-cern/jupyter-extensions/issues/26 to this repo, since I believe it should be added to this extension. Some of the initial discussion can be checked in that issue.


While real-time collaborative editing of notebooks is a feature we would like to have, it would be useful to have an intermediate step, easier to accomplish, to allow users to open notebooks concurrently without causing conflicts.

This feature is of interest of JRC and CERN, and it would be useful for the project CS3MESH4EOSC. The current status of collaborative editing of notebooks is still uncertain, with a considerable effort still needed.

This ticket describes a possible approach, combining some of my ideas with the ones presented by JRC. Please let me know if you disagree with any of this.

Approach

I suggest taking a similar approach to the editing of office files. This means, we would lock a notebook if someone else is editing, which would only allow another person to open it in read-only mode. This would prevent the current Jupyter behavior of complaining that a file has changed on the disk and the possible loss off data.

But, we can still allow other's to contribute. We can ask the user if he just wants to open (ro) or if he wants to create a copy and merge it after. This copy should have a specific name, following some convention, that would allow the system to detect it and suggest a merge at a later stage.

Implementation

To make this approach possible, we need to modify the upstream Contents interface.
Opening a file should provide the following information:

  • Is the notebook locked or not and by whom?;
  • Do we have copies of the notebook?
  • Do we have a copy of the notebook that was initiated by us?

We also need API endpoints to:

  • Get the information about all the notebook copies (user that started, last save, is it ready to be merged or not)
  • Set a notebook copy as "ready to merge";

This means that we need to create abstract methods for locking, unlocking, getLockStatus, getConcurrentCopies, setReadyToMerge etc.

We also need to change the UI to understand this new concept. It needs to understand that, in case a notebook is considered locked, it needs to ask the user to open in ro or create a copy. It also needs to suggest the merge of the changes. This can be done to the current user locking the file in another copy has been marked as "ready to merge", or to anyone else opening the file on which non merged copies still exist.

For the task of comparing notebooks, there's already an official tool called nbdime: https://nbdime.readthedocs.io/en/latest/

@diocas
Copy link
Contributor Author

diocas commented Sep 3, 2020

@JarCz I propose we discuss this in the next meeting and maybe we could start working on some of the required components after you finish the authentication stuff and we check that the basic file access functionality is working.

@mateusz-garlacz
Copy link
Contributor

Hi, I would like to summarize the topic, add questions and ask for further comments to make decision about implementation and divide topic into subtasks to move on.

As I understand the possible flow would be as follows:

  • user1 creates a notebook share for user2 as editor
  • user1 locks the notebook
  • user2 can read the notebook and create a copy
  • user1 and user2 can see diffs between the documents
  • user2 issues a 'pull request' by setting his copy as 'ready to merge'
  • user1 can merge changes
  • user1 unlocks the notebook

Would the users take turns in locking and having right to merge and the user holding the lock would have the right to merge changes from all the copies marked as ready, or would the merge right be limited to the owner only? When it comes to locking and unlocking, I guess it might be implemented when opening/closing file. To provide uniqness the naming convention for copies would be something like notebook_name-username-idp-opaque_id.

When it comes to conflicts resolution, nbdime was suggested as a solution for diffing and merging. I am not sure how do we imagine using it on our side. If I am not mistaken, nbdime works as a command line tool and I saw that there is nbdime jupyter extension as well, but my concerns are that:

  • the nbdime extension includes only nbdiff without nbmerge
  • and apparently it needs checkpoints implementation
  • it seems to be used with git integration
  • and if we wanted to provide the functionality for other files, it is limited to .ipynb

@diocas please comment in a spare moment.

@diocas
Copy link
Contributor Author

diocas commented Jan 27, 2021

Hi, let me try to clarify the misunderstandings. Let me know if I succeeded :)

As soon as you share a notebook/folder in RW, the other users have as much power as you. So there is no consideration of "owners" in here. The notebook get locked by whovever opens it first and the merge can be done by anyone. And the lock is automatic: as soon as you open the notebook, the system locks the file to prevent others from changing the same file (implemented, as you said, in opening/closing the file). So, if you share a notebook with me and I open it first, you will be the one that will get a read mode + write mode on a clone.

For the merge, I'm open to suggestions. I imagine the following use cases:

  1. I open the notebook first, you second. You finish first and I get a notification saying "Mateusz changed the notebook. Click here to merge the changes". When the merge is done, the clone is deleted.
  2. I open the notebook first, you sencond. I finish first and you get the notification "Diogo has closed the original notebook. Click here to merge your changes and edit directly the original file". This will allow you to lock the original file, merge your changes and delete the clone.
  3. The third case is in case something happens and no one merged the changes that you, for example, did (otherwise, the clone would've been deleted). When I, or you, or someone else opens the notebook, the system sees that a clone still exists and sends a notification saying "There are un-merged changes from Mateusz. Please merge or delete.". Then you would be able to merge, which would delete the clone.

Ofc, in all cases, in the review mode, you should be able to merge changes (which deletes), delete without changing anything and postpone the merge (for example, you changed something, I'm the one who is asked to merge but I don't know what should be merged, so I ignore it thus allowing you to merge at a later stage).


For the merge. I suggested nbdime because I saw there was a UI which is what we want to show our userrs. I didn't see the details, nor tried the extension. But what we're looking for is a tool that checks the differences between 2 notebooks and suggests a merged version. One that also allows users revert changes or pick the changes from one of the notebooks. Basically, like any graphical diff tool that you can use for git.

AFAIU, you don't need to have git underneath. I.e, nbdiff says:

usage: nbdiff [-h] [before after]

Produce a diffed IPython Notebook from before and after notebooks. (...)

positional arguments:
  before      The notebook to diff against.
  after       The notebook to compare `before` to.

apparently it needs checkpoints implementation

What do you mean?

and if we wanted to provide the functionality for other files, it is limited to .ipynb

If we can do it for other files (the merge), fine. But for now I would just lock them and prevent the editing and focus the diff in the notebooks.

And ofc we should do this in steps. The first thing is locking and unlocking. Then, allowing/detecting the clones. And finally, the merge.

@diocas
Copy link
Contributor Author

diocas commented Jan 27, 2021

As a sidenote, I'm not sure how to do the unlocking.. JupyterLab doesn't give the information that a file has been closed, right? It would be nice if it did.

I think we can do the following:

  • We have a lab extension (the one Piotr is doing) that subscribes to the notification of opening/closing files and we create an endpoint (backend) where we constantly ping to inform about the closed notebooks (list that can be empty if nothing was closed). This service would then remove the locks.*
  • If we don't get the ping from the frontend (the user closed the window, for example), we remove all the locks
  • If something failed and we have leftover locks, we can ignore them whenever opening the notebooks at a later stage. How do we know it's a leftover? Because is older than the time the auto-save kicks in (on every autosave we should refresh the age of the token - by age, I mean, putting a datetime inside the lockfile, together with the username of the owner).

(* we need to understand how that would work if the user has more than one window open.. Does lab allow this?)

@JarCz
Copy link
Contributor

JarCz commented Feb 1, 2021

@diocas I analyzed the merge way and have some comments:

"We need to understand how that would work if the user has more than one window open... Does lab allow this?"

  • when user have only one browser can open the file in one instance
  • when the user opens jupyterlab in two tabs in the browser, can open the file at once in one tab. If open the file in two tabs then the user has two copies of the file
  • when the user open jupyterlab two different browsers, can open the file at once in one browser

Before save the JupyterLab creates the GET request to check the current timestamp of the file, at in conflict situation when the file on disk is new, the user sees the message:

File Changed
"file.txt" has changed on disk since the last time it was opened or saved. Do you want to overwrite the file on disk with the version open here, or load the version on disk (revert)?

In this situation, I think, we can add a new button with the "Merge option" and merge in manual or create an event to the backend to merge two notebooks.

"If we don't get the ping from the frontend (the user closed the window, for example), we remove all the locks"

  • it's necessary to define "time window" when we remove all lock and copies of the file
  • probably frontend can catch the close event and inform the backend on a closed document

I looked at the code of the plugin:

I propose to choose a way to write in:

  1. Simple compare and use timestamp to define with the document it's newer:

    • if the notebook cell is the same in both documents we copy it to the output document
    • if the notebook cell is different in both documents we use a cell from a newer notebook
    • We can add an option to "merge with keeping previous" - if the cell has conflict we save an old and new version of a notebook cell
      In merged notebook user see all versions of the cell and can manually remove conflict cells
    • If one time it's necessary to merge a lot of notebook copies we merge only two notebooks at one time based on the timestamp of copies

    Comments:
    * good start point to write own merge and locking mechanism (by Mateusz)
    * we don't need to write the fronted part of the plugin, only add event and button using on merge action
    * simple method can't work properly in the heavy notebooks - creates a lot of duplicates or removes too much data
    * maybe we shall inform all users when the notebooks are merged and present a message to refresh the notebook?

  2. Use an algorithm from https://jupyter.org/enhancement-proposals/08-notebook-diff/notebook-diff.html
    or use merge method from https://github.com/jupyter/nbdime/blob/c82362344e596efdc4f54c927d90338940e0fa41/nbdime/merging/notebooks.py#L168

    • In conflict user see windows to manual resolve conflict like as nbdime
    • backend uses the merge method from nbdime or own algorithms

    Comments:
    * a lot of user interactions when notebook have conflicts
    * it necessary to write the fronted, I can't define how many time we need to write the frontend

  3. Use own implementation of concurrent editing in real-time based on https://github.com/jupyterlab/rtc and https://hackmd.io/UbnBH58hS8itoWgfiWT77A

    • We need to have an RTC relay server (https://jupyter-rtc.readthedocs.io/en/latest/developer/architecture.html) -
      I think it's to be a part of IOP because if we use the RTC instance in owner jupyter notebook (who first open the notebook) we have a problem with network security and connection behind the NAT and typical P2P connection
    • We need lof changes in the frontend and backend part like as original RTC plugin
    • We need to have a strong discussion this resolve

In my opinion, in the first version, we need to choose the first way, because it's a good ground to work with a strong topic like real-time editing.

@diocas
Copy link
Contributor Author

diocas commented Feb 9, 2021

In this situation, I think, we can add a new button with the "Merge option" and merge in manual or create an event to the backend to merge two notebooks.

We could add it here as well. But I would expect that if the users are using our extension, this pop up would never appear because they are working in 2 different versions (and, as you said, in Lab you won't be able to open 2 times the same notebook).

About your options, I'm more inclined to the second one. The first one looks like you will have to implement your own algorithm, which I don't think we should do (less things to maintain and less things to break), especially if there's an official package that could do that for us (nbmerge). Or that package is not ok? And why?
I also think we should show the conflicts and let the user decide, otherwise it will be hard to know if they should pick one or the other a priori. My expectation is that we could use nbdime-web/nbmerge-web and not having to develop our own frontend extension. And I assume it works by looking in a file that, like git, contains the conflicts. If, at first, nbdime is not a viable alternative, I think that keeping the notebook with the conflicts would be a possibility and users resolve them manually. Os this isn't possible? Why is nb*-web not an option? Do we have a lab extension? If not, we should be able to open this in a separate window, so the implementation should be easy, from the lab point of view.
The third option is way harder to acomplish. I don't want to maintain that. If we decided to do that, then we would have the live concurrent editing, and that is what we're trying to avoid with this "simple approach".

Please let me know if my assumptions are wrong.
But please, let's do this in stages. The first thing is locking. Then creating the clone. Then finding the differences and finally merging them. And I would try to use modules that already use instead of developing our own stuff.

@JarCz
Copy link
Contributor

JarCz commented Feb 10, 2021

We could add it here as well. But I would expect that if the users are using our extension, this pop up would never appear because they are working in 2 different versions (and, as you said, in Lab you won't be able to open 2 times the same notebook).

True, If the user use cs3api4lab plugin with locking and copy mechanism user don't see this message.

About your options, I'm more inclined to the second one. The first one looks like you will have to implement your own algorithm, which I don't think we should do (less things to maintain and less things to break), especially if there's an official package that could do that for us (nbmerge). Or that package is not ok? And why?

It's better to use the official package, but from my experience integration, it's not easy and sometimes had problems. Probably we need to extend and wrapper 'nbdime' code to working correctly in our code.

...
But please, let's do this in stages. The first thing is locking. Then creating the clone. Then finding the differences and finally merging them. And I would try to use modules that already use instead of developing our own stuff.

I agree to use this way.

@diocas
Copy link
Contributor Author

diocas commented Feb 16, 2021

Probably we need to extend and wrapper 'nbdime' code to working correctly in our code.

Yes, but it's better to wrap than doing everything from scratch. 🤞

@mateusz-garlacz
Copy link
Contributor

TLDR - nbdime code both backend and frontend is reusable, but the front end may not be good enough

I understand that our goal is to do as little work as possible. So far, nbdime is the best candidate. When it comes to merging, the nbdime code is reusable, however we will not avoid some adjustments. The question is, how much of the code we want to incorporate.

For clarity, the nbdime package consists of several sub-apps, we are especially interested in nbdiff, nbdiff-web, nbmerge, nbmerge-web. Nbmerge performs automatic merge, nbmerge-web provides GUI. Nbmerge is alright for the simplest cases like merging two identical notebooks, where one of them has additional data appended at the end - in other case nbmerge results in unresolved conflicts. I assume that in many cases we would need a GUI for conflict resolution.

I think nbmerge-web front end part would also be reusable to great extend, but the question is, wether nbdime-nbmerge-web meet our requirements. To me, it is unintuitive and uninteractive (and unaesthetic). I suggest taking a look at nbmerge-web so we could decide if we want to adjust it to our needs or if it's better to do something new.

@diocas
Copy link
Contributor Author

diocas commented May 11, 2021

Better implementation should use locking mechanism provided by the cs3 api: cs3org/cs3apis#6

@dagl
Copy link
Contributor

dagl commented Aug 30, 2021

Hello @diocas,

I'm curently working on the backend part of this task. So far we managed to do the following:

  • Shared files are locked when opened
  • File's ArbitraryMetadata is used for storing all the necessary information about the lock (user, date, etc), in the future we can swap for storage: locking support cs3org/cs3apis#6
  • Lock's information (including date) is updated whenever the file is saved
  • The lock is removed when a different user tries to open or save the file and the lock is older than an interval between autoaves (with a margin)
  • If the lock hasn't expired and someone else tries to save the file the backend throws an exception and the frontend then shows an error:

File is locked, please use 'Save Text As' functionality

Since we can't differentiate between autosaving and normal saving/closing, the locking mechanism is based on the autosave feature. We are working on a solution that would allow us to know when the file is being closed. Assuming that's possible, the next step would be:

  • Autosave goes to a local copy instead of the main file
  • When the file is closed, the content of the local copy (along with any new changes after the last autosave) is transferred to the main file and the lock is removed
  • Locks are still cleared when they expire

Please let me know what you think about the current and future steps in development.

sidenote: in the future, thanks to the local copy, we could consider letting the user continue editing the file even if they exit Jupyter without properly closing/saving

@diocas
Copy link
Contributor Author

diocas commented Sep 10, 2021

Hi, terribly sorry it too so long to reply...

Question: what do you mean by "ArbitraryMetadata"?
How much is this interval margin? Doesn't autosave only kick in if the user changed something? Meaning that, if I'm reading a document (which was locked when I opened it), but only after some (long-ish) time I change something, will I have the file locked for me or not? Please double check.
I think we should still save the file as <filename>.conflict_<date> when the lock was lost, instead of just showing an error to the user. I sleep better if we always try to store the information vs having the chance of losing users' data. Also, the error message doesn't seem clear enough for a normal user. If you store as a conflict, then you don't even need to ask user to download anything.

Sincerely I don't know why we care if it's autosave or not. If the autosave is not constant (as mentioned above), then I would have a frontend extension pinging the backend to let it know which files were still open. Or we could subscribe to a notebook closed event and notify the backend (to avoid having to wait for the lock timeout; also to avoid having a recurrent ping to the backend). But maybe the first option is more secure. Of course you would have to keep state, but I think this is ok given that these servers only run for a specific user.
I would keep the saving/autosaving behaviour as is now and keep writing to the destination file. (do you have a reason not to?)

If you want to have a meeting to brainstorm/discuss something you/I did not understand, please let me know!

@dagl
Copy link
Contributor

dagl commented Sep 14, 2021

ArbitraryMetadata is a part of files metadata in Reva where you can story any additional information about the file. You can get it back by stating.

Yes, I think this is how it works, but in my opinion it fits our intentions, e.g. when someone opens a file, makes some changes and leaves it open, then the lock should be discarded after some time (changes would be saved by an autosave). The default autosave interval is 120s, the time after locks are ignored is set to 150s. Both are configurable.

I think .conflict files are a good idea so I've added this functionality and removed the error message. The only problem now is that the user doesn't know that the file has been saved in another location. On the other hand if we keep the message it would be shown on every autosave. Also, I think creating only 1 file with the .conflict extension should suffice, otherwise we would keep creating new file with every autosave, let me know what you think. Also, should we try to open the .conflict file instead of original when we open any shared file?

About the rest. We've found it hard to know reliably when the notebooks are closed and I think what we have right now is good enough so I suggest we test and release the locking functionality in its current state, skip the step with local files and move on to start working on merges. We have come up with a proposal as to how merges should work which we would like to present on a meeting. I will post a short summary in a separate comment.

In my opinion we shouldn't spend too much time on locks, since they are going to be working quite differently when we implement merging feature.

@dagl
Copy link
Contributor

dagl commented Sep 15, 2021

Our proposal for the merge functionality:

  • files are no longer "locked" when opened
  • Autosave and save go to a local diff file
  • There is a separate "commit" button that pushes changes from the local file to the original
  • when multiple people edit the same file, whoever does the commit first "locks" the file and others have to merge when comitting
  • "ready to merge" message shows up when a currently edited file has been modified/"locked" by someone else

@dagl
Copy link
Contributor

dagl commented Sep 15, 2021

@diocas

@dagl
Copy link
Contributor

dagl commented Sep 16, 2021

Additionally:

  • when a shared file is opened the ArbitraryMetadata field is read to know who was the last person to edit the file and what was the date of the edit
  • when editing the file an endpoint is pinged to check if the ArbitraryMetadata has changed
  • if it has changed we show the message "ready to merge"/"merge is needed"
  • we additionally check this when trying to commit the file
  • when a merge is completed and the original file is overwritten we also overwrite ArbitraryMetadata

There might be a problem when 2 or more people are merging the file at the same time, we need to test the plugin to see what would be the best solution in that situation

@diocas
Copy link
Contributor Author

diocas commented Sep 16, 2021

This has been discussed in a meeting and we will proceed with a discussion to evaluate some possible options. @dagl @piotrWichlinskiSoftwaremind if you can circulate the slides before, it would be helpful. As I said, I'm on holidays next week but will try to be present in the meeting.

@glpatcern
Copy link
Member

Discussing with @diocas, please have a look at cs3org/cs3apis#160 as that would hopefully provide native methods for this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants