Allow multiple instances of UmpleOnline to access a common file store #1123

Open
TimLethbridge opened this Issue Oct 24, 2017 · 7 comments

@TimLethbridge
Member

TimLethbridge commented Oct 24, 2017

Currently UmpleOnline is tied to its own local store. Using Docker it is possible to map some other store into a running image. However, it would be good to generalize this so UmpleOnline can be run on a set of servers to get lots of CPU throughput, with file saving happening in a common location.

The main issues regarding this that need to be resolved are:
a) Having UmpleOnline store its temporary files for a session not in ump/tmpxxxx directories but instead in ump/yyy/tmpxxxx directories, where yyy can vary based on various factors (perhaps the UmpleOnline instance that initially saves the file, perhaps the user, perhaps also the initial date of saving). The yyy would have to be created if it does not exist when first accessed. The yyy might be more than one level: e.g. sa/171005/ might designate the directory for models created on a server named sa on Oct 5, 2017 (see the sketch after this list).

b) Directories given a permanent URL would also need adjusting so they are not directly under ump following the same scheme.

c) Older directories should still be accessible.

d) The cleanup code that purges old tmpxxxx would need to be adapted.

e) Probably each server, on startup, could get a code that would become part of the yyy that it creates.
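
To make (a) and (e) concrete, here is a minimal PHP sketch of how such a session directory could be created; the umple_session_dir helper and the yymmdd date format are assumptions for illustration, not existing UmpleOnline code:

```php
<?php
// Hypothetical sketch only: create a session directory of the form
// ump/yyy/tmpxxxx, where yyy = <server code>/<date>, e.g. ump/sa/171005/tmp...

function umple_session_dir($umpRoot, $serverCode) {
  // yyy component: the server's code plus the date of first saving (yymmdd)
  $yyy = $serverCode . '/' . date('ymd');

  // Create ump/yyy on first access; the third argument makes mkdir recursive,
  // so a multi-level yyy works too
  $base = $umpRoot . '/' . $yyy;
  if (!is_dir($base)) {
    mkdir($base, 0775, true);
  }

  // tmpxxxx component: a unique temporary directory for this session
  $dir = $base . '/tmp' . uniqid();
  mkdir($dir, 0775);
  return $dir;  // e.g. "ump/sa/171005/tmp5a1b2c3d4e5f6"
}
```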

@fhackett
Contributor

fhackett commented Oct 28, 2017

I've done some reading and some thinking, and I have an alternate idea: don't share UmpleOnline's data store, at least not literally.

What we can do instead is a little subtle, but may be rather effective. It may also be easier from a code point of view.

First, what I perceive to be the reasons for using distributed UmpleOnline:

  • Execute different queries in parallel on multiple machines
  • Avoid directory-size-related bugs that occur when scaling UmpleOnline's current data store
  • Make the system more redundant, so that if the compiler hits an infinite loop or other error we can just kill that instance and move on (with logs and a postmortem so we can debug)

My idea can achieve the above somewhat like this:

  • Keep all the existing lookup code so old data stores still work
  • If some configuration is present (maybe an environment variable, credentials, etc.), do the following for a set of containers running across a cloud network (sketched in the code after this list):
    • If a certain directory is requested, look it up according to the old rules (and in a folder like ump/data where we would keep "new" things)
    • If it is not found, look it up on a storage system (could be Amazon S3, could be something like CouchDB, to be discussed) and download it into a folder under ump/data
    • Perform the request as usual
    • Back up the resulting directory to the storage system (this may be done asynchronously)
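
The flow above could look roughly like this; a minimal PHP sketch, assuming hypothetical remote_storage_fetch/remote_storage_store helpers standing in for whichever storage system we pick (all names are illustrative, not existing UmpleOnline code):

```php
<?php
// Hypothetical sketch of the lookup flow described above. The remote_storage_*
// helpers are placeholders for an S3/CouchDB/etc. client.

function locate_model_dir($name) {
  $cached = "ump/data/$name";   // "new" things live under ump/data (local cache)
  $legacy = "ump/$name";        // old data stores keep working via the old rules

  if (!is_dir($cached) && !is_dir($legacy)) {
    // Not found locally: fetch the directory from the shared storage system
    // into the local cache.
    remote_storage_fetch($name, $cached);
  }

  return is_dir($legacy) ? $legacy : $cached;
}

function backup_model_dir($name) {
  // After the request completes, push the (possibly modified) directory back
  // to the storage system; this could be queued and done asynchronously.
  remote_storage_store("ump/data/$name", $name);
}
```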

Why I think this is interesting/useful:

  • ump/data would basically be a local cache. It could be cleared on an LRU scheme to ensure that active users do not have to wait for network fetches (see the sketch after this list).
  • No one UmpleOnline instance would need to store massive volumes of data, which should mitigate the issue of hitting operating system limits on inode count.
  • ump/data would be a Docker volume. If, say, someone killed an UmpleOnline container (or it crashed), you could create a new container with the old volume and carry on from where you left off. Each container would have its own volume, so no sharing problems. Container managers do this automatically.
  • All persisted user data could be stored in a specialised storage system designed for this kind of thing, which should take care of synchronisation issues.
  • Load balancers could be configured to favour routing requests from the same session to the same server, so ideally the network synchronisation latency should not occur during an active session due to load balancing shifts.
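
As a rough illustration of the LRU clearing mentioned in the first bullet, assuming directory modification time is a reasonable proxy for recent use (the prune_cache name and the entry threshold are made up for this example):

```php
<?php
// Hypothetical sketch: evict the least-recently-used session directories from
// the ump/data cache when there are more than $maxEntries of them.

function prune_cache($cacheRoot, $maxEntries) {
  $dirs = glob("$cacheRoot/*", GLOB_ONLYDIR);
  if (count($dirs) <= $maxEntries) {
    return;
  }

  // Oldest modification time first ≈ least recently used first
  usort($dirs, function ($a, $b) {
    return filemtime($a) - filemtime($b);
  });

  foreach (array_slice($dirs, 0, count($dirs) - $maxEntries) as $dir) {
    // Assumes the directory has already been backed up to the storage system.
    // (A real implementation would delete recursively; rmdir only removes
    // empty directories and is used here just to keep the sketch short.)
    rmdir($dir);
  }
}
```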

If you think this is an interesting avenue, I can clean it up and move it into a wiki for further discussion/refinement.

@fhackett
Contributor

fhackett commented Nov 8, 2017

@TimLethbridge I've added a wiki entry summarising our discussions so far

@TimLethbridge
Member

TimLethbridge commented Nov 8, 2017

Great. I am going to look at it when I have some time (likely at airports tomorrow).

@fhackett
Contributor

fhackett commented Nov 22, 2017

@TimLethbridge While working on the $filename refactoring, I checked up on the S3 mounts to make sure they supported some of the tricks I was doing and I have some bad news: the S3 approach will likely not work out very well. I think we both made a couple of assumptions about how S3 mounts work.

  • s3fs, the tool you originally sent a link to, does not support multiple users in any way: https://github.com/s3fs-fuse/s3fs-fuse#limitations (see the limitation: no coordination between clients mounting the same bucket)
  • yas3fs, a tool that proposes to help fix the issue, was last touched in 2015, and it is unclear whether it is fit for production use.

So this leaves us with the question: what do we do instead? If we're OK with a system that, by design, will likely never go beyond a single server with cloud storage without a complete 180, we can do that, but that seems like a fairly large compromise to make. What are your thoughts?

Potential note of interest: the refactor is not trivial, as I already have to go through and vet most of the file operations for assumptions like "../../ will get me to the root directory". Factoring in #1122, which is not pretty either (our code is more complex than I thought; consider, for example, temp folder generation), it might not actually be much harder to use a different storage medium if we want.
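
To illustrate the kind of change involved, a hedged before/after sketch of resolving paths against a configured root instead of relying on relative traversal; UMPLE_HOME and umple_file_path are invented names for this example, not existing UmpleOnline code:

```php
<?php
// Hypothetical sketch of removing the "../../ gets me to the root" assumption.

define('UMPLE_HOME', '/var/www/umpleonline');

function umple_file_path($relative) {
  // Resolve against an explicitly configured root, wherever the code runs,
  // instead of assuming "../../" happens to reach the installation root.
  return UMPLE_HOME . '/' . ltrim($relative, '/');
}

// Before: $contents = file_get_contents("../../ump/tmp1234/model.ump");
// After:
$contents = file_get_contents(umple_file_path('ump/tmp1234/model.ump'));
```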

Also, for what it's worth, if we don't want to do our own database administration, several SQL/NoSQL database hosting services exist.

@fhackett
Contributor

fhackett commented Nov 29, 2017

@TimLethbridge here's the refactoring plan you requested: https://github.com/umple/umple/wiki/Refactoring-UmpleOnline's-storage-backend

The lines are not particularly granular, but should be generally representative. There may be a few changes to callsites of functions in compiler_config.php. Let me know if a more precise set would be useful.

I'm also interested in your thoughts on my interpretation of the interface UmpleOnline requires. Does anything seem off to you? Did I obviously miss something?
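
For reference, a purely hypothetical sketch of the shape such a storage interface might take (invented names; my guess only, not the actual proposal in the wiki):

```php
<?php
// Hypothetical sketch of a storage API that UmpleOnline code could call instead
// of touching the filesystem directly. Method names are invented for illustration.

interface UmpleStorage {
  /** Return a local working directory for the named model, fetching it if needed. */
  public function checkout($modelName);

  /** Persist the working directory back to the storage backend. */
  public function commit($modelName);

  /** Remove a model that is no longer needed (e.g. an expired temp model). */
  public function delete($modelName);
}

// Step 1 of the plan: an implementation that just keeps today's behaviour.
class LocalDirectoryStorage implements UmpleStorage {
  public function checkout($modelName) { return "ump/$modelName"; }
  public function commit($modelName)   { /* nothing to do; already on disk */ }
  public function delete($modelName)   { @rmdir("ump/$modelName"); }
}
```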

If all goes well, the structure of my proposed marathon session is:

  1. Implement the API as described with absolutely no change in functionality (leave all the directories where they are); test it still works, commit results
  2. Swap out the API implementation to solve this issue, cross fingers and test it still works
  3. Leave the rest to someone else who can just keep tweaking/replacing the API's implementation as desired

@TimLethbridge
Member

TimLethbridge commented Nov 29, 2017

OK sounds good

@fhackett
Contributor

fhackett commented Dec 9, 2017

@TimLethbridge part 1 is at the testing stage now. I haven't tried anywhere close to everything, but it works for basic usage. Posting here to show progress and to let people try it.

See PR #1178.

TimLethbridge added a commit that referenced this issue Dec 15, 2017

Merge pull request #1178 from umple/fhackett-implement-1123
Implements #1123 to improve file access in UmpleOnline