Unable to use S3 file backend with 2016.3.1 on Ubuntu 14.04 or 16.04 #34074
Comments
I had the same issue. Replacing salt/utils/aws.py and salt/utils/s3.py with the versions from 2016.3.0 fixed it on a test master.
@fooka03 I am able to replicate this. Here is the git bisect:
Hi. This broke my salt masters using the S3 backend for states and pillars. Salt master log:
salt --versions-report
So tentatively it looks like I may have solved the issue simply by changing the logic around when to use streams in GET requests in salt/utils/s3.py. Replacing this:
with this:
This seems to do the trick as far as getting past the issue and also not breaking #33599. However, it needs more cleanup: when salt/fileserver/s3fs.py checks a file against the cache, it will spam the logs for each chunk.
@fooka03 @stanislavb @imcecil Huge apologies to all, this is totally my fault. I just discovered today that my code borked this up (I apparently didn't test it properly). I believe "elif method == 'GET' and local_file and not return_bin" is the correct fix for that line of code; it corrects the issue for me and does not produce the log spam @fooka03 describes.
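To make the shape of that change concrete, here is a minimal sketch of the corrected branch, assuming a requests-based query() along the lines of salt/utils/s3.py; the function name, signature, and surrounding logic are illustrative, not the upstream code:

```python
import requests


def query(requesturl, method='GET', local_file=None, return_bin=False,
          headers=None):
    # Sketch only: the real salt.utils.s3.query also signs the request,
    # handles S3 error responses, etc.
    if method == 'GET' and local_file and not return_bin:
        # Stream the body straight to disk so a multi-GB object never has
        # to be held in memory on the master.
        response = requests.get(requesturl, headers=headers, stream=True)
        response.raise_for_status()
        with open(local_file, 'wb') as out:
            for chunk in response.iter_content(chunk_size=16384):
                out.write(chunk)
        return dict(response.headers)
    # Everything else (HEAD, return_bin=True, no local_file) reads the body
    # into memory and returns it to the caller.
    response = requests.request(method, requesturl, headers=headers)
    response.raise_for_status()
    return response.content
```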
@lomeroe I wouldn't be so quick on the draw here as I got the OOM issue on my salt-master with this code change. The strange thing was when I had multiple files in my bucket it seemed to be fine, but when I switched to using a test bucket with only the huge file it died on startup. There's something else at play here and I've been trying to dig through the internals to figure it out. I suspect this whole chunk in salt/fileserver/s3fs.py needs some TLC:
Namely, it's making that call to S3 and downloading the file every time even though it only needs the headers. In my case this happens multiple times, creating a new connection to S3 each time. Later today, if I have some time, I'm going to start abusing the traceback module to try to get to the bottom of it.

Test Setup
AWS Ubuntu 16.04 t2.micro instance (ami-13be557e) running 2016.3.1 with the hotfix applied
Command:
@fooka03 Interesting, the memory footprint seemed okay in my big-file test (but I didn't test it in that same manner)... I'll look deeper as well.
@fooka03 here's what I'm seeing on the cache miss/multiple downloads (let me know if you see something different). The ETag cache check misses on the hash check because of a multipart upload (in my scenario at least). The ETag is apparently calculated differently for multipart uploads, so the hash has a '-' in it and the s3fs module ignores it... Strangely, the S3 console shows the "standard" ETag, but querying the metadata returns the multipart ETag... The size/date check then fails because the cached file's mtime is in local time while the S3 metadata file time is in GMT. Honestly, I'm not seeing how these weren't an issue before the pyrequests streaming was added to the s3 utils module...
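To spell out the two checks being discussed, here is a hedged sketch with illustrative helpers (not the s3fs code): multipart ETags carry a '-' suffix, and the mtime comparison only makes sense once both timestamps are normalized to UTC.

```python
import datetime
import email.utils
import os


def etag_is_multipart(etag):
    # Multipart uploads get an ETag like "9b2cf535f27731c974343645a3985328-5",
    # which is not a plain MD5 of the object body.
    return '-' in etag.strip('"')


def cached_file_is_current(local_path, s3_last_modified, s3_size):
    # s3_last_modified is the RFC 1123 date string from the S3 metadata (GMT);
    # compare it against the cached file's mtime expressed in UTC.
    remote_dt = email.utils.parsedate_to_datetime(s3_last_modified)
    local_dt = datetime.datetime.fromtimestamp(
        os.path.getmtime(local_path), tz=datetime.timezone.utc)
    return local_dt >= remote_dt and os.path.getsize(local_path) == int(s3_size)
```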
Yea that seems like what I'm getting. Here's the log output:
@lomeroe The code, for me at least, makes the proper adjustment for UTC vs CDT. To reduce the impact of the ETag misses (since they're just going to be a fact of life with multipart uploads), I think we could add the ability to do a HEAD request between s3.py and s3fs.py instead of a GET, which would return only the metadata of the object (a sketch follows below). Additionally, regarding the ETag logic, it would probably be beneficial to allow a user-supplied hash/algorithm to check against (as that mechanism exists in file.managed), which would give users a way to bypass the excessive S3 calls, though at this point we're probably getting outside the scope of this issue. So back to the task at hand, here's what I'm getting in the system log when I run the above test:
The instance at this point is dead and needs to be restarted.
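For the HEAD-request idea above, a minimal sketch of a metadata-only fetch (illustrative names, request signing omitted; not the actual salt.utils.s3 API):

```python
import requests


def get_object_metadata(url, headers=None):
    # A HEAD request returns the same Content-Length/ETag/Last-Modified
    # headers as a GET would, without transferring the object body.
    response = requests.head(url, headers=headers)
    response.raise_for_status()
    return {
        'Content-Length': response.headers.get('Content-Length'),
        'ETag': response.headers.get('ETag'),
        'Last-Modified': response.headers.get('Last-Modified'),
    }
```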
@fooka03 Out of curiosity, is your system's timezone set to UTC? Edit to add: sorry, I see above that you already noted you are getting that log spam...
@fooka03 If you are getting the log spam on the download being skipped due to size/mtime, then at least the file was successfully downloaded from S3, and the OOM issue is not being caused by utils.s3 downloading the file (the original intent of #33682 seems to be working). That said, I agree that the utils.s3 update to pyrequests streaming has definitely exposed other things inside the s3fs module.

The function containing the code you referenced above (_get_file_from_s3, which is called from find_file) is run for each chunk returned from the master to the minion when the file is served. On files that were multipart uploaded, that function compares the locally cached metadata against the cached file and also re-pulls the metadata from S3 and compares again (method="HEAD" in the s3.query). On a 1GB file, that's 1024 checks with the default file buffer size. If the 1GB file wasn't multipart uploaded, on each chunk the md5 is calculated for the cached file and compared to the cached metadata. For small files it's no big deal, but for large files it really takes a toll.

On a smaller system (t2.small for me), if the s3fs fileserver bucket is very big (mine is ~4GB, with my 1.7GB "big file" and about 1000 total files), IO is consumed by the cache update/checking: all the files get downloaded successfully into the cache, but then the cache checksum/metadata checking takes so long it can't keep up with much of anything else. I get SaltReqTimeouts when even trying to run a state. It also only has around 70MB of free memory after starting up, so I can see how your micro would easily run out of memory once much of anything started happening.

Using a t2.large, which can push a little more IO/CPU and keep up with the s3fs update() calls on my fileserver bucket: my 1.7GB test file being served off S3 via the s3fs module (i.e. salt://big_file) takes around one hour to complete the file.managed state. Changing _get_file_from_s3 to only check whether the file exists in the cache (and download it if not) knocks that down to ~15 minutes if the file has already been cached down from S3 (see the sketch at the end of this comment), but I don't know if that is appropriate or if something more robust should be happening. In comparison, using
to manage the same 1.7GB file averages about 4 minutes in my testing. A third comparison, using the standard file system based fileserver with:
Runs in about 30 seconds... It also appears to me that during the file.managed state, with the salt:// path and the s3fs backend, the file is served twice by the master. Perhaps that is what is supposed to happen, but it just exacerbates the issue on a really big file like this. This may warrant a separate issue for s3fs, as I think the initially reported issue (which was caused by the changes to utils.s3.query) is fixed with the above PRs.
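For reference, the "only check whether the file exists in the cache" short-circuit mentioned above could look roughly like this (illustrative helper names, not the actual _get_file_from_s3, and deliberately less robust than a real metadata check):

```python
import os


def get_file_from_cache_or_s3(bucket, path, cached_file_path, download_fn):
    # download_fn stands in for whatever actually pulls the object from S3.
    if os.path.isfile(cached_file_path) and os.path.getsize(cached_file_path) > 0:
        # Short-circuit: skip the per-chunk metadata/hash round trips once
        # the file is already sitting in the cache.
        return cached_file_path
    os.makedirs(os.path.dirname(cached_file_path), exist_ok=True)
    download_fn(bucket, path, cached_file_path)
    return cached_file_path
```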
Also appears to be broken on CentOS 7.
but it does appear to be fixed with the above patches.
It looks like the initial error with this issue has been solved in the 2016.3 branch, with the
If it is, can we close out this issue and open another one about the OOM issue for streaming large files? Thanks!
@gtmanfred @lomeroe I'm good with marking this one closed and opening a new issue for the other stuff. I'll work on getting the new issue created tomorrow unless someone beats me to the punch before then.
Awesome! Thanks!
Description of Issue/Question
After upgrading to 2016.3.1, running states with source files located in an S3 bucket configured as a fileserver backend would fail with the message:
Source file salt://myfile not found
Explicitly syncing the fileserver cache with salt-run fileserver.update backend=s3fs
results in no files being downloaded and a warning in the logs: [salt.loaded.int.fileserver.s3fs][WARNING ][26138] S3 Error! Do you have any files in your S3 bucket?
Downgrading to 2016.3.0 resolved the issue. The issue is present on both Ubuntu 14.04 and 16.04, and occurs with both AWS credentials and an IAM role. It appears to be caused by PR #33682 (which looks like the only related change in this release).

Setup
Master Config:
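A representative master configuration for the s3fs backend looks roughly like the following (placeholder values, not the config actually used here; keyid/key can be dropped when an IAM role provides credentials):

```yaml
fileserver_backend:
  - s3fs

# Only needed when not using an IAM role:
s3.keyid: AKIAEXAMPLEKEYID
s3.key: exampleSecretAccessKey

s3.buckets:
  - my-bucket
```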
Minion Config:
Machine Type:
The machine is a m4.large (2 cores, 8GB RAM) running in AWS East with an IAM role allowing for full S3 access. It is only running the salt-master and a salt-minion for itself.
Sample State File:
S3 Bucket Layout:
Result:
Expected (Actual output after downgrading to 2016.3.0):
Steps to Reproduce Issue
This happened with probably the most basic use case for an S3 file backend, on a brand new AWS EC2 instance (using AMI: ami-13be557e). Upload a file to s3://my-bucket/base/, configure the master to use the s3fs file backend, then apply a state file that uses the uploaded file as its source.
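A minimal state of the kind described might look like this (the target path is a placeholder; the original state file is not reproduced here):

```yaml
/tmp/myfile:
  file.managed:
    - source: salt://myfile
```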
I dug into the code a bit and added some logging to figure out what was coming back from S3. First I added logging to output the s3_meta keys and the resulting meta_response object in fileserver/s3fs.py and ended up with this:
Next I tried logging what was actually contained in the response from utils/s3.py and ended up with this:
By comparison, I get this from 2016.3.0 (truncated since it shows every file in the bucket):
Versions Report
Master
Minion