
Investigate using git for tracking sparse updates and git smudge to apply them. #80

Closed
blester125 opened this issue Nov 17, 2022 · 6 comments


@blester125
Collaborator

Instead of tracking/applying sparse updates manually (for example, storing them in a different directory), can we just check in sparse updates and then move backwards through git history to build the real value (applying the updates)?

I have written this recursive smudge filter: when you smudge a file, it is transformed to include the content at each point in the history where the file changed (along with the commit where the change happened).

#!/bin/bash
# Recursive smudge filter. $1 is the path of the file being smudged and $2 is
# the commit to start searching from (defaults to HEAD on the first call).
# Hooked up as a git filter driver, e.g. filter.<driver>.smudge = "smudge.sh %f".

COMMIT=${2:-"HEAD"}

# Debug logging; a smudge filter's stdout becomes the file content, so log elsewhere.
echo "----------------------------" >> /tmp/smudge.log
echo "${COMMIT}" >> /tmp/smudge.log

# Step back one commit so we don't keep re-finding the change we just emitted.
if [ "${COMMIT}" != "HEAD" ]; then
  PREV_COMMIT="${COMMIT}~1"
else
  PREV_COMMIT="${COMMIT}"
fi

echo "${PREV_COMMIT}" >> /tmp/smudge.log

echo "I'm running smudge"
# The most recent commit at or before ${PREV_COMMIT} that touched the file.
LAST_CHANGE=$(git rev-list -1 "${PREV_COMMIT}" -- "$1")
echo "${LAST_CHANGE}" >> /tmp/smudge.log

if [ -z "${LAST_CHANGE}" ]; then
  # No earlier change: we've reached the start of the file's history.
  exit 0
else
  echo "The last time this file changed was ${LAST_CHANGE}"
  # Emit the file's content as it was at that commit.
  git show "${LAST_CHANGE}:$1"
  # Recurse into this same script, continuing from the commit we just found.
  "$0" "$1" "${LAST_CHANGE}"
fi

Note: we can't run something like git checkout ${COMMIT} from inside a smudge filter, but we can run things like git show and git rev-list.

We can apply this same idea to parameters. Reading in a sparse update will recurse backwards through history until it hits a dense update. Once the dense update is reached (it just returns its value), each sparse update (read from git) will be applied as we move back up the stack.
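A minimal sketch of that recursion in Python; parse_update and the update objects are hypothetical stand-ins for whatever deserializes an update out of git, not actual git-theta APIs:

import subprocess

def git_show(commit, path):
    # Read a file's bytes at a given commit without checking anything out.
    return subprocess.run(
        ["git", "show", f"{commit}:{path}"],
        check=True, capture_output=True,
    ).stdout

def last_change(commit, path):
    # The most recent commit at or before `commit` that touched `path`.
    out = subprocess.run(
        ["git", "rev-list", "-1", commit, "--", path],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return out or None

def build_parameter(commit, path):
    update = parse_update(git_show(commit, path))  # hypothetical deserializer
    if update.is_dense:
        # Base case: a dense update already holds the full value.
        return update.value
    # Recurse to build the previous state, then apply this sparse update
    # on the way back up the stack.
    prev = last_change(f"{commit}~1", path)
    return update.apply(build_parameter(prev, path))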

The main open questions are:

  • Does this still work when we hit a commit with multiple parents (from a merge, for example)?
  • Can tensorstore read a tensor when the binary blob (and the metadata file) are byte sequences from git show?
    • If it can't, this solution would need to write the blobs to a temporary space, causing an extra read/write per updated parameter. This could be mitigated by only doing it for updated parameters, but could be costly otherwise.
@craffel
Contributor

craffel commented Nov 17, 2022

This is cool. A small note that it's not just dense updates that we'd stop on - one could imagine other update types (e.g. "randomly set the values by drawing from a normal distribution with seed N" or "set all the values to 1") which are not dense per se (i.e. they don't involve storing explicit parameter values) but do involve setting all the parameter values while ignoring the previous values. It's probably best to distinguish between updates that truly are updates (i.e. they rely on modifying the previous state) and those that aren't, and just look for the first instance of the latter kind of update. As an aside, I think it's informative to think about a from-scratch training run - ideally the first commit would just say "add these parameter groups and initialize them in this way".

I had something else to say but I forgot; maybe I will think of it another time.

@blester125
Collaborator Author

Yeah, we definitely want to support stopping at other update types.

I think a recursive solution would handle that. A "true update" would look up the previous update type in git and call its .get (or whatever) method. If the previous one is also a "true update", the recursion continues. A "fake update" like dense, or your random-value one, would just return its values as-is and function as a base case, without needing to enumerate which update types overwrite all params.
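A sketch of that dispatch; all class and helper names here are hypothetical stand-ins (read_param, read_init_spec, normal_sample, load_update, apply_sparse, and last_change are not actual git-theta code):

class Update:
    """Interface every update type implements."""
    def get(self, commit, path):
        raise NotImplementedError

class DenseUpdate(Update):
    # A "fake update": ignores the previous state entirely, so it is a base case.
    def get(self, commit, path):
        return read_param(commit, path)

class NormalInitUpdate(Update):
    # Also a "fake update": re-draws every value from a stored seed,
    # so it never needs the previous state.
    def get(self, commit, path):
        seed, shape = read_init_spec(commit, path)
        return normal_sample(seed, shape)

class SparseUpdate(Update):
    # A "true update": defined relative to the previous state.
    def get(self, commit, path):
        prev_commit = last_change(f"{commit}~1", path)  # as in the sketch above
        prev_update = load_update(prev_commit, path)    # hypothetical loader
        base = prev_update.get(prev_commit, path)       # recurses until a base case
        return apply_sparse(base, read_param(commit, path))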

@blester125
Collaborator Author

#84 came very close to implementing this approach; however, it seems it is not possible to do this correctly when using git to time travel (git checkout ${commit}, git checkout branch, etc.).

The gist of it is that the code that looks back through the git history to build the parameter needs to know where in the history to start looking. In something like a checkout, we only know the commit we are currently at; within the smudge filter there isn't a way to know what commit we are going to.

So the result is that whenever we time travel, we end up with the smudged model checkpoint of where we were, not where we wanted to be. Running git reset --hard fixes this, but we don't want to have to run that every time.

I talked with @nkandpa2 about this issue and neither of us found a way to fix it. So we took a lot of the ideas about how this implementation of updates worked and applied them to a file-system-based method of tracking and applying updates in #92.

I'm closing this as we don't think the git approach will work, but I'll leave the branch with the implementation on my fork as it may be useful to revisit in the future.

@nkandpa2
Collaborator

nkandpa2 commented Dec 4, 2022

Re-opening this discussion. I can't remember if we talked about this solution, but why wouldn't it work to store the hash of HEAD in the metadata file at clean time?

For example:

  1. I stage a model for the first time and the hash of HEAD is 1. When it gets staged the metadata file contains a key "previous_commit": 1. I commit this checkpoint and now the hash of HEAD is 2.
  2. I make a sparse update to the model and stage that. The staged metadata file contains "previous_commit": 2. I commit this checkpoint and now HEAD is 3.
  3. I make another sparse update to the model and stage it. The staged metadata file contains "previous_commit": 3. I commit this checkpoint and now HEAD is 4.
  4. I make as many other commits as I like.

Now say I run git checkout 4. The smudge filter reads the metadata file and loads up the files in commit 4. It sees that commit 4 was a sparse update, so it looks up the "previous_commit" key in the metadata file and recursively loads commit 3. Since 3 is also a sparse update, it looks up the key in its metadata file and recursively loads commit 2. Finally, commit 2 is a dense update, so we don't need to recurse any further.
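A sketch of that walk, assuming the metadata is JSON with a "previous_commit" key; the update-type field and the load_dense/load_sparse/apply_sparse helpers are hypothetical:

import json
import subprocess

def read_metadata(commit, path):
    # Read the metadata file for `path` as it exists at `commit`.
    blob = subprocess.run(
        ["git", "show", f"{commit}:{path}"],
        check=True, capture_output=True,
    ).stdout
    return json.loads(blob)

def build_parameter(commit, path):
    meta = read_metadata(commit, path)
    if meta["update_type"] == "dense":  # hypothetical field name
        return load_dense(commit, path)  # hypothetical loader
    # The metadata written at clean time records exactly which commit this
    # update was made against, so no searching through history is needed.
    base = build_parameter(meta["previous_commit"], path)
    return apply_sparse(base, load_sparse(commit, path))  # hypothetical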

Are there any issues with this solution?

@nkandpa2 reopened this Dec 4, 2022
@blester125
Collaborator Author

We talked about this solution and it seems like it will work. This branch has some tools for getting files from the git history, which should help in the multi-pointer PR too.

We can get this up and running once the multi-pointer branch is working with dense updates.

One of the main questions for us to explore is whether we will be able to track the last update directly or whether we will need to iterate through history to find it, but either way it will work.

In the original git-tracks-updates implementation, I occasionally hit cases where it was slow to re-build indices on something like a checkout. In the new format, the only file getting indexed is the main metadata file (not each parameter file), so it should be faster?

One question this does bring up is about our tree-processing algorithms. Currently we process the parameter tree essentially depth first, where each parameter is processed individually (which might involve moving backwards through the git history), and this could cause repeated work. It might be more efficient to collect all parameters that have changed in a batch and then go back in time once, updating each parameter as appropriate. But before a large refactor like that we should 1) test that it is actually an issue and 2) check whether memoizing our "get file from git history" function fixes any issue there is (see the sketch below).
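The memoization could be as simple as caching on (commit, path) pairs, since a blob at a given commit never changes. A sketch (the function name is made up):

import functools
import subprocess

@functools.lru_cache(maxsize=None)
def get_file_from_history(commit, path):
    # (commit, path) uniquely identifies an immutable blob, so results can
    # safely be cached for the life of the process. Assumes `commit` is a
    # resolved SHA, not a moving ref like HEAD.
    return subprocess.run(
        ["git", "show", f"{commit}:{path}"],
        check=True, capture_output=True,
    ).stdout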

@nkandpa2
Collaborator

Closed by #114
