Large file transfers through gitfs are very slow #25672

Closed
bradthurber opened this issue Jul 23, 2015 · 20 comments
Labels: Bug (broken, incorrect, or confusing behavior), File Servers, P2 (Priority 2), Platform (relates to OS, containers, platform-based utilities like FS, system based apps), severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a work around), stale
Milestone: Approved

Comments

@bradthurber (Contributor)

Note: This is a follow-on to #25455

Repro steps:

  1. Set up gitfs using pygit2 (a minimal master config sketch follows this list)
  2. Add this repo, which contains a 100MB file, to the gitfs config:
    https://github.com/edubart/testrepo
  3. Copy the file out of the formula to the minion:
    salt 'minion-name' cp.get_url salt://largefile /tmp/largefile
  4. Watch the target directory. You should notice the file size increasing at about 1MB/minute.
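For anyone reproducing step 1, here is a minimal, illustrative master config sketch (assuming the stock gitfs options; the repo list is just for this repro, adjust to your environment):

    # /etc/salt/master -- illustrative gitfs setup for this repro
    fileserver_backend:
      - git
    gitfs_provider: pygit2
    gitfs_remotes:
      - https://github.com/edubart/testrepo

After restarting the master (or running salt-run fileserver.update to refresh the fileserver cache), salt://largefile should be served from the gitfs backend.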

I'm probably getting too fancy by expecting large files to move smoothly through gitfs this way, and I'm aware of git's large/binary file limitations; however, adding setup artifacts (large RPMs) to a formula could be a nice way of encapsulating what certain formulas need to accomplish (I'm looking at you, Oracle SQL client RPM!).

Perhaps this is an issue with pygit2/libgit2 rather than Salt, but the large-file issue was supposed to have been fixed in a much older libgit2 release. Ref: libgit2/libgit2#1081

@jfindlay added the Bug (broken, incorrect, or confusing behavior), severity-low (4th level, cosmetic problems, work around exists), P2 (Priority 2), File Servers, and Platform (relates to OS, containers, platform-based utilities like FS, system based apps) labels Jul 23, 2015
@jfindlay added this to the Approved milestone Jul 23, 2015
@jfindlay (Contributor)

@bradthurber, thanks for the new issue. We'll figure out whether this is an issue in gitfs or the git backend and decide from there what to do.

@mahresohl

This issue is a showstopper for us. We absolutely need GitFS.
Test setup: master on an Ubuntu VM + Windows minion (VM or non-VM) + GitFS + git host (CentOS), all on a local network.

With a non-VM Windows minion, transferring a 150 MB file takes more than an hour(!)
With a VM Windows minion, the transfer always fails with a timeout error; at most 50 MB is transferred (often much less).

A non-GitFS setup works with NO problem, i.e. if the file is copied on the master into file_roots base (cp /var/cache/salt/master/gitfs/refs/base/Distribution.7z -> /srv/salt/), the transfer is OK.
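(For context, "file_roots base" here means the standard roots fileserver path, typically something like:

    file_roots:
      base:
        - /srv/salt

so in that case the file is served by the plain file backend instead of gitfs.)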

So far we have traced the bottleneck on the minion as follows:

c:\salt\bin\lib\site-packages\zmq\sugar\poll.py(101)poll()
-> return zmq_poll(self.sockets, timeout=timeout) <-- bottleneck
(Pdb) w
c:\salt\bin\scripts\salt-call(11)<module>() -> salt_call()
c:\salt\bin\lib\site-packages\salt\scripts.py(331)salt_call() -> client.run()
c:\salt\bin\lib\site-packages\salt\cli\call.py(55)run() -> caller.run()
c:\salt\bin\lib\site-packages\salt\cli\caller.py(133)run() -> ret = self.call()
c:\salt\bin\lib\site-packages\salt\cli\caller.py(196)call() -> ret['return'] = func(*args, **kwargs)
c:\salt\bin\lib\site-packages\salt\modules\state.py(731)sls() -> ret = st_.state.call_high(high_)
c:\salt\bin\lib\site-packages\salt\state.py(2127)call_high() -> ret = dict(list(disabled.items()) + list(self.call_chunks(chunks).items()))
c:\salt\bin\lib\site-packages\salt\state.py(1659)call_chunks() -> running = self.call_chunk(low, running, chunks)
c:\salt\bin\lib\site-packages\salt\state.py(1935)call_chunk() -> running[tag] = self.call(low, chunks, running)
c:\salt\bin\lib\site-packages\salt\state.py(1595)call() -> **cdata['kwargs'])
c:\salt\bin\lib\site-packages\salt\loader.py(1491)wrapper() -> return f(*args, **kwargs)
c:\salt\bin\lib\site-packages\salt\states\file.py(1581)managed() -> follow_symlinks)
c:\salt\bin\lib\site-packages\salt\modules\file.py(4161)manage_file() -> sfn = __salt__['cp.cache_file'](source, saltenv)
c:\salt\bin\lib\site-packages\salt\modules\cp.py(365)cache_file() -> result = __context__['cp.fileclient'].cache_file(path, saltenv)
c:\salt\bin\lib\site-packages\salt\fileclient.py(159)cache_file() -> return self.get_url(path, '', True, saltenv)
c:\salt\bin\lib\site-packages\salt\fileclient.py(544)get_url() -> return self.get_file(url, dest, makedirs, saltenv)
c:\salt\bin\lib\site-packages\salt\fileclient.py(1068)get_file() -> data = self.channel.send(load)
c:\salt\bin\lib\site-packages\salt\utils\async.py(73)wrap() -> ret = self._block_future(ret)
c:\salt\bin\lib\site-packages\salt\utils\async.py(82)_block_future() -> self.io_loop.start()
c:\salt\bin\lib\site-packages\zmq\eventloop\ioloop.py(151)start() -> super(ZMQIOLoop, self).start()
c:\salt\bin\lib\site-packages\tornado\ioloop.py(841)start() -> event_pairs = self._impl.poll(poll_timeout)
c:\salt\bin\lib\site-packages\zmq\eventloop\ioloop.py(122)poll() -> z_events = self._poller.poll(1000*timeout)

c:\salt\bin\lib\site-packages\zmq\sugar\poll.py(101)poll() -> return zmq_poll(self.sockets, timeout=timeout)

The above poll.py:poll -> zmq_poll call hangs for a few seconds on every 5th call, i.e. continuing with (Pdb) c at zmq\sugar\poll.py:101 hangs on each fifth hit. It seems the data is transferred in blocks, and on every 5th block there is a huge delay.
We suspect that when GitFS is involved, the master performs some verification during the file transfer, which stalls the process.

We would appreciate any hints that would allow us to track down the root of this problem faster.

@jfindlay added the severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a work around) label and removed the severity-low (4th level, cosmetic problems, work around exists) label Jan 29, 2016
@jfindlay (Contributor)

@mahresohl, what is the output of salt --versions-report and salt <minion> test.versions_report? If that doesn't include the git utils you're using, what are their versions as well?

@mahresohl

salt --versions-report
Salt Version:
Salt: 2015.8.3

Dependency Versions:
Jinja2: 2.7.2
M2Crypto: 0.21.1
Mako: 0.9.1
PyYAML: 3.10
PyZMQ: 14.0.1
Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
RAET: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.4
cffi: 1.4.2
cherrypy: Not Installed
dateutil: 1.5
gitdb: 0.5.4
gitpython: 0.3.2 RC1
ioflo: Not Installed
libnacl: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.3.0
mysql-python: 1.2.3
pycparser: 2.14
pycrypto: 2.6.1
pygit2: 0.23.1
python-gnupg: Not Installed
smmap: 0.8.2
timelib: Not Installed

System Versions:
dist: Ubuntu 14.04 trusty
machine: x86_64
release: 3.16.0-30-generic
system: Ubuntu 14.04 trusty


salt win_ak_bv --versions-report
Salt Version:
Salt: 2015.8.3

Dependency Versions:
Jinja2: 2.7.2
M2Crypto: 0.21.1
Mako: 0.9.1
PyYAML: 3.10
PyZMQ: 14.0.1
Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
RAET: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.4
cffi: 1.4.2
cherrypy: Not Installed
dateutil: 1.5
gitdb: 0.5.4
gitpython: 0.3.2 RC1
ioflo: Not Installed
libnacl: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.3.0
mysql-python: 1.2.3
pycparser: 2.14
pycrypto: 2.6.1
pygit2: 0.23.1
python-gnupg: Not Installed
smmap: 0.8.2
timelib: Not Installed

System Versions:
dist: Ubuntu 14.04 trusty
machine: x86_64
release: 3.16.0-30-generic
system: Ubuntu 14.04 trusty

@mahresohl

BTW
salt win_ak_bv test.versions-report
win_ak_bv:
'test.versions-report' is not available.
ERROR: Minions returned with non-zero exit code

@mahresohl

salt-minion --versions-report
Salt Version:
Salt: 2015.8.3

Dependency Versions:
Jinja2: 2.7.3
M2Crypto: Not Installed
Mako: Not Installed
PyYAML: 3.11
PyZMQ: 14.7.0
Python: 2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)]
RAET: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.2
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.4.2
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
libnacl: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
python-gnupg: 0.3.7
smmap: Not Installed
timelib: Not Installed

System Versions:
dist:
machine: AMD64
release: 7
system: 7 6.1.7601 SP1 Multiprocessor Free

@mahresohl

Also on Master:
git --version
git version 1.9.1

On Minion
"C:\Program Files\Git\bin\git" --version
git version 2.6.4.windows.1

@mahresohl

After more debugging, it seems zmq/sugar/poll.py:101 is not really the culprit. If I put a breakpoint only there, it is hit a few times and then never again, while the download proceeds as usual, i.e. painfully slowly. Without knowing the overall architecture and logic of the application, it's extremely hard to debug.
Running the master in debug mode also gives no clues.

@PKiman commented Aug 30, 2016

I am running into the same issue and was hoping that git-lfs (Git Large File Storage) would help, but git-lfs apparently isn't supported yet by pygit2 or GitPython.
https://git-lfs.github.com/
git-lfs is already supported by some repo providers, such as GitHub and, more recently, Bitbucket.

Salt Version:
           Salt: 2016.3.2

Dependency Versions:
           cffi: 1.6.0
       cherrypy: 5.4.0
       dateutil: 2.1
          gitdb: 0.6.4
      gitpython: 2.0.3
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: 0.20.0
        libnacl: Not Installed
       M2Crypto: 0.21.1
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
         pygit2: 0.20.3
         Python: 2.6.9 (unknown, Jul 20 2016, 20:52:42)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.5.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist:   
        machine: x86_64
        release: 4.4.15-25.57.amzn1.x86_64
         system: Linux
        version: Not Installed

@PKiman commented Aug 30, 2016

Or perhaps the libgit2 library used by SaltStack is just outdated:
libgit2/libgit2sharp#1236

@dxiri commented Feb 22, 2017

It seems I am hitting this same thing in a much worse case, while using salt-virt to try to deploy a new VM. The base VM image is about 8 GB in size.

Using the following command: salt-run -l debug virt.init centosTEST 4 2048 salt://centos6-base.img seems to work until it gets stuck in this loop:

[DEBUG ] Checking whether jid 20170221183812866496 is still running
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'sjociara03_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG ] Passing on saltutil error. This may be an error in saltclient. 'retcode'
[DEBUG ] Checking whether jid 20170221183812866496 is still running
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'sjociara03_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG ] Passing on saltutil error. This may be an error in saltclient. 'retcode'
[DEBUG ] Checking whether jid 20170221183812866496 is still running
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'sjociara03_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG ] Passing on saltutil error. This may be an error in saltclient. 'retcode'

Looking a little bit further, I can see centos6-base.img is being copied to /var/cache/salt/minion/files/base/centos6-base.img on the minion at a VERY slow pace. It's been about 2 hours since I started the command, and right now:
root@a172-25-151-56:~# ls -lh /var/cache/salt/minion/files/base/centos6-base.img
-rw------- 1 root root 3.2G Feb 22 02:30 /var/cache/salt/minion/files/base/centos6-base.img

Is there any workaround for this?

@ninja- (Contributor) commented Feb 22, 2017

Could you guys confirm that it works at a reasonable speed without gitfs?

EDIT: Sorry, I hadn't seen this:

Non-GitFS setup works with NO problem. I.e. if on Master the file is copied into file_roots:base (cp /var/cache/salt/master/gitfs/refs/base/Distribution.7z -> /srv/salt/), the transfer is OK.

@ninja- (Contributor) commented Feb 22, 2017

@dxiri what's the ping between your master and minion? I might have an idea here.

@ninja- (Contributor) commented Feb 22, 2017

@dxiri currently the file_buffer_size default is somewhere on the order of 256 KB–1 MB.
The minion needs to ACK every file part individually and request the next part.

That's what I think could be slowing down your transfer if you have a high ping between the machines.

Could you try increasing that in both your master and minion configs? (A config sketch follows the refs below.)

Refs:
https://github.com/saltstack/salt/blob/develop/salt/config/__init__.py#L1386

fp_.seek(load['loc'])
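For example, a hedged sketch (the value is illustrative only; the right size depends on your files and network, and this is not an official recommendation):

    # /etc/salt/master and /etc/salt/minion -- same option name in both
    file_buffer_size: 4194304    # e.g. 4 MB instead of the ~256 KB default

Both the master and the minion would need to be restarted for the new value to take effect.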

@dxiri commented Feb 24, 2017

@ninja- ping is really fast:
root@sjociara03:~# ping a172-25-151-61.deploy
PING a172-25-151-61.deploy (172.25.151.61) 56(84) bytes of data.
64 bytes from 172.25.151.61: icmp_seq=1 ttl=59 time=93.3 ms
64 bytes from 172.25.151.61: icmp_seq=2 ttl=59 time=91.9 ms
64 bytes from 172.25.151.61: icmp_seq=3 ttl=59 time=91.9 ms
64 bytes from 172.25.151.61: icmp_seq=4 ttl=59 time=91.7 ms
64 bytes from 172.25.151.61: icmp_seq=5 ttl=59 time=91.8 ms

Thanks for the tip! I will try it out and come back with results!

@ninja- (Contributor) commented Feb 24, 2017

@dxiri heh, 93ms is NOWHERE NEAR fast 😄 (fast would be < 3 ms)
Personally my ping is 10–30 ms and I already had big highstate slowdowns with FileClient; I am working on making FileClient less sensitive to ping, with some changes already being merged to the develop branch.
It's simple math - if the ping is 100ms, then the client can only make 10 requests per second.
So it either needs to make fewer requests or send them in bulk.
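To make that concrete, a rough back-of-envelope sketch (assuming the ~256 KB default buffer mentioned above; purely illustrative):

    # ceiling at a 100 ms round trip with a 256 KB buffer per request:
    #   1 / 0.100 s             = 10 requests per second
    #   10 * 256 KB per second  ≈ 2.5 MB/s upper bound, before any gitfs
    #                             overhead on the master side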

@ninja- (Contributor) commented Feb 24, 2017

What happened in your case, and I've probably been right about the workaround for you, is that every 1–2 MB the minion needs to send an ACK and request the next part.
So it limits the speed to ~10 MB/s or less.
Increasing file_buffer_size should help, and I am curious about your results 😄

Your current speed seems to be around ~ 13MB/s when I do 3200/2/60. Coincidence? 😈

@dxiri commented Feb 24, 2017

@ninja- well, "fast" is a relative measure 😆

Results are still not in, servers are being moved to a new location as we speak, so it will take a while, but I will get back and post the results when I have them.

@dxiri commented Mar 20, 2017

@ninja- ok, servers are back online :) Checking what you mentioned about file_buffer_size, I currently have:
'file_buffer_size': 262144

Assuming that is in bytes, is the default buffer really 256 KB? That seems extremely low.

By looking at the links you provided, I can see you have:
'file_buffer_size': 1048576

which is 1 MB. Is 1 MB the maximum I can set this to?

Thanks for the help so far!

@stale bot commented Sep 21, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

@stale bot added the stale label Sep 21, 2018
@stale bot closed this as completed Sep 28, 2018