
Distributed UmpleOnline design

Finn Hackett edited this page Nov 12, 2017 · 5 revisions

As it stands, we have a Docker container that runs UmpleOnline, but it does not have a proper persistent storage implementation. What makes persistent storage difficult is that UmpleOnline writes any and all data directly to the server's file system in a single directory, ump/, mixing semi-permanent user data, scratch files, documentation and example code.

A distributed UmpleOnline must replicate properly. Here are the principal concerns surrounding that:

  • All distributed copies of UmpleOnline must provide the same response to the same query
  • UmpleOnline must work under load balancing
  • UmpleOnline must, as much as possible, be resilient to spontaneous hardware and software failure, so that if an instance dies, becomes unresponsive or disappears, another can take its place

The current suggestion is to have UmpleOnline store the model source, along with enough metadata to regenerate the appropriate compilation outputs, in a centralised database of some kind. If one instance of UmpleOnline gets a request for data it doesn't have, it can then fetch that data from the database and serve the result. This fulfills the requirement that all UmpleOnline instances provide the same response to the same query, and as a result UmpleOnline should work under load balancing and be resilient to spontaneous failure.

Detailed breakdown of operations

  • Keep all the existing lookup code so old data stores still work
  • If some configuration is present (maybe an environment variable, credentials, etc), do the following for a set of containers running across a cloud network:
    • If a certain directory is requested, look it up according to the old rules (and in a folder like ump/data where we would keep "new" things)
    • If it is not found, look it up on a storage system and download it into a folder under ump/data
    • Perform the request as usual
    • Back up the resulting model and any metadata to the storage system
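The lookup-and-backup flow above can be sketched roughly as follows. This is a sketch under assumptions, not the actual UmpleOnline code: the function names are made up, and the `store` dict stands in for whatever central storage system (SQL server, S3, ...) is eventually chosen. Only the ump/data directory comes from the plan above.

```python
import os

DATA_ROOT = "ump/data"  # where "new" things would be kept, per the plan above

def load_model(name, store, data_root=DATA_ROOT):
    """Return the model source for `name`, fetching and caching it
    from the central store if it is not on the local disk."""
    local_path = os.path.join(data_root, name)
    # 1. Look it up according to the old rules (local disk first).
    if os.path.exists(local_path):
        with open(local_path) as f:
            return f.read()
    # 2. Not found locally: look it up on the storage system...
    if name in store:
        data = store[name]
        # ...and download it into a folder under ump/data.
        os.makedirs(data_root, exist_ok=True)
        with open(local_path, "w") as f:
            f.write(data)
        return data
    return None

def save_model(name, source, store, data_root=DATA_ROOT):
    """Perform the request as usual, then back the result up
    to the storage system."""
    os.makedirs(data_root, exist_ok=True)
    with open(os.path.join(data_root, name), "w") as f:
        f.write(source)
    store[name] = source  # back up the model (and any metadata) centrally
```

Note that this is exactly the pattern that creates the two-browser staleness problem discussed next: nothing here invalidates another instance's local copy when the central store changes.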

TL: So let's say someone opens two browser windows on the same data. Now what happens? Case 1: They are pointing to the same instance of the data. Browser a makes changes; it backs up to the storage system. Browser b, however, doesn't know anything has happened and doesn't know to reload from the file on disk; any attempt to generate code will generate from the data that browser a saved, but any attempt to edit will overwrite the data browser a edited. Case 2: They are pointing to two different instances of the data. Similar problem; the data gets saved where browser a is pointing, but browser b's data does not get touched. This time any attempt to generate code will not generate from what browser a saved, and any attempt to edit will overwrite the central store.

TL: The only way to overcome this would require: 1) As a minimum, some mechanism to check for changes and reload if needed whenever a browser window gets focus. But even this wouldn't be as good as Google Docs, where changes appear dynamically. I am looking at together.js to see if we could use it. 2) The check for changes would have to check the disk and also the database, if there is one.
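The server-side half of that change check could look something like the sketch below; a focus handler in the browser would call an endpoint backed by it and reload when the reported version is newer than what the page last saw. Everything here is illustrative: the function names are made up, and `db_versions` stands in for a database lookup of per-model version stamps.

```python
import os

def latest_version(path, db_versions, name):
    """Return the newest known version stamp for a model, checking both
    the local disk and the central database (if one is configured).
    `db_versions` stands in for a DB query: model name -> timestamp."""
    candidates = []
    if os.path.exists(path):
        candidates.append(os.path.getmtime(path))  # check the disk...
    if name in db_versions:
        candidates.append(db_versions[name])       # ...and the database
    return max(candidates, default=None)

def needs_reload(page_version, path, db_versions, name):
    """True if the browser's copy is stale and should reload on focus."""
    latest = latest_version(path, db_versions, name)
    return latest is not None and latest > page_version
```

As TL notes, this is only a floor: it catches staleness at focus time, not the live, Google-Docs-style propagation that something like together.js would aim for.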

Finn: Interesting, I hadn't considered that possibility. I've looked into together.js but am not hopeful - since we would then have two different systems (the Together server and UmpleOnline) synchronising the same data, I imagine it would just break the problem you described above into smaller parts, which would crop up whenever there was network lag and the browser with the older version pressed "compile" before Together.js got round to updating it.

Finn: I suppose the other issue is that this whole concurrent editing thing is getting out of scope compared to allowing horizontal scaling of UmpleOnline. It's really a whole other issue, perhaps worth considering separately. I can think of conceptually better ways to get around this but they would require large-scale refactors and some fundamental changes to how UmpleOnline handles requests.

TL: See also my other comments about schema below.

How to store distributed data

Currently, it is suggested that distributed UmpleOnline instances interface with a SQL server (maybe Postgres). This has several advantages:

  • As an interface, SQL is very mature and basic uses of it are portable across a wide variety of database servers. It is also built into PHP.
  • Any compliant SQL server supports ACID transactions, addressing the various concurrency concerns surrounding UmpleOnline's current filesystem-based storage (#1122 is a known example)
  • Several SQL servers are well-supported out of the box by Docker, and as a result an assembly of UmpleOnline and a SQL server is fairly trivial to package and distribute
  • Since SQL databases support sophisticated optimisations, using an SQL database may be faster in some of our use cases than the current filesystem store (not accounting for network overhead which can be mitigated by well-designed queries and good practice).
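To make the ACID point concrete, here is a minimal sketch of what a transactional model save could look like. The table layout and column names are assumptions for illustration, and sqlite3 (chosen only because it ships with Python and needs no server) stands in for whatever SQL server, e.g. Postgres, the instances would actually share.

```python
import sqlite3

# Illustrative schema: one row per stored model directory.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE models (
        dir_name TEXT PRIMARY KEY,   -- the directory name requested today
        source   TEXT NOT NULL,      -- the Umple model source
        metadata TEXT                -- enough to regenerate compilation outputs
    )
""")

def save_model(conn, dir_name, source, metadata):
    # The whole save is one ACID transaction: a concurrent reader sees
    # either the old row or the new one, never a half-written mix.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO models VALUES (?, ?, ?)",
            (dir_name, source, metadata),
        )

save_model(conn, "abc123", "class Student {}", '{"lang": "java"}')
row = conn.execute(
    "SELECT source FROM models WHERE dir_name = ?", ("abc123",)
).fetchone()
```

The `with conn:` block is what the current filesystem store has no equivalent of: if the process dies mid-save, the transaction rolls back instead of leaving a torn file.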

TL: SQL has the disadvantage of requiring a schema. I anticipate allowing multiple Umple files to co-exist and be editable via tabs. I want to do this soon. I also anticipate other changes where files would need to be backed up to the DB. I think we should try to experiment with S3 first (below)

Finn: You raise a valid point about making sure the system is extensible for the sake of future plans. I didn't know about the tabs idea. I would, however, note that formal SQL schema or no, any system we put in place will have an interface, and any work that breaks prior assumptions will require corresponding changes. Whether this is done formally via SQL or by juggling subfolders seems like the same kind of thing to me, give or take the affordances, advantages and disadvantages of each specific technology.

Finn: If we are OK with locking ourselves into S3, then sure, trying to use FUSE sounds interesting. I shied away from platform-specific tools like that in my initial search; the links you posted look a lot more workable than the generic tools I found previously. Regarding concurrent access from multiple instances, I can have a look at file locking in issue 1122; it should at least limit the problems to those you highlighted above rather than causing random data loss and file corruption. We will also have to move UmpleOnline's data storage into a subfolder, or the mounts will not work well with the documentation, etc...
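The kind of file locking referred to for issue 1122 could be sketched like this, assuming a POSIX host. Note that `fcntl.flock` locks are advisory: they only help if every code path that touches the file takes the lock, which is exactly why they would limit, rather than eliminate, the problems above.

```python
import fcntl
from contextlib import contextmanager

@contextmanager
def locked(path, mode="r+"):
    """Hold an exclusive advisory lock on `path` for the duration of
    the with-block. Blocks if another process already holds the lock."""
    with open(path, mode) as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            yield f
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

# Usage: read-modify-write without a concurrent writer tearing the file.
# with locked("ump/data/model.ump") as f:
#     source = f.read()
#     f.seek(0)
#     f.truncate()
#     f.write(transform(source))
```

This prevents the random corruption case (two writers interleaving within one file) but not the lost-update case TL describes, where browser b knowingly overwrites browser a's saved version.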

Some alternatives:

  • CouchDB:
    • Advantages: optimised for high data volume, very sophisticated replication and reliability
    • Disadvantages: uses a specialised REST/JSON API, may have a higher engineering cost
  • Amazon S3:
    • Advantages: deals with replication and large-scale storage automatically, vendor support
    • Disadvantages: non-standard API, risk of vendor lock-in
  • Distributed Docker Volumes
    • Advantages: in principle, may require the least in terms of modification to UmpleOnline code. TL: This is a major advantage. In fact, it might be possible to do this without special drivers by using FUSE: https://github.com/s3fs-fuse/s3fs-fuse/wiki/Fuse-Over-Amazon. See also
    • Disadvantages: requires special OS drivers, may suffer from UmpleOnline's current obliviousness to concurrency, requires Docker. TL: May be worth experimenting with it regardless, even for a single instance
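For comparison with the FUSE-mount route, here is a sketch of what talking to an S3-style object store directly would look like. The `FakeS3` class is an in-memory stand-in used purely for illustration; the assumption is that with the real AWS SDK the client would come from boto3 (`boto3.client("s3")`), whose get_object/put_object calls have this shape. The bucket name and key layout are made up.

```python
import io

class FakeS3:
    """In-memory stand-in for an S3 client, for illustration only."""
    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}

class ModelStore:
    """Backs model sources up to an object store and fetches them back."""
    def __init__(self, client, bucket):
        self.client = client
        self.bucket = bucket

    def backup(self, model_id, source):
        # S3 keys are flat: the "directory" is just part of the key.
        self.client.put_object(Bucket=self.bucket,
                               Key=f"models/{model_id}.ump",
                               Body=source.encode("utf-8"))

    def fetch(self, model_id):
        obj = self.client.get_object(Bucket=self.bucket,
                                     Key=f"models/{model_id}.ump")
        return obj["Body"].read().decode("utf-8")
```

The trade-off versus the FUSE mount is visible here: this needs explicit changes to UmpleOnline's code (every read/write goes through ModelStore), whereas s3fs-fuse leaves the filesystem code untouched but inherits its obliviousness to concurrency.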