Slow Terminal Action #20

Closed
yebyen opened this issue Sep 26, 2013 · 83 comments

@yebyen
Contributor

yebyen commented Sep 26, 2013

From the time I type ":infinite" to the time it shows up on my screen is at least a full 10 seconds. There is nothing else eating up cycles during this time. The vere process shows in top as state D (blocking I/O) until the typing is finally echoed to the screen.

Today it tells me ~zod |Tianming| not responding still trying; it hasn't said that before, but I've always had the same slow typing results. Same result with submarines or destroyers. It does not have to be ":infinite"; typing anything does it, with or without a newline. Today I also don't get output from :ls %:

~divsem-misdef/try=> :ls %
~ /cx/~divsem-misdef/try/~2013.9.26..12.18.19..6428/bin/ls/hoon

My environment is Ubuntu Linux under Docker (CoreOS/Vagrant)

The host system is Windows 7
It's an SSD-bearing laptop with four cores, so there are no local barriers that I know of that should cause blocking for that long. Is it reaching out to the internet to act on these commands? Should it be blocking I/O?

@johncburnham
Contributor

In reverse order:

:ls % isn't giving you output because :ls is a file that you need to pull from ~zod. If you can't connect to ~zod for the initial update, very few commands work.

Your ship's probably out of sync. Did you delete and recreate your destroyers as outlined in Curtis' post to the mailing list? https://groups.google.com/d/msg/urbit-dev/FQqNyi15lu0/GB7Fgyk_wVQJ

~zod is up right now, so once you get reconnected, let's see if the terminal slowness is still there.

Keep in mind, everything's inherently slow right now. It's not atrocious for an alpha version, but things will be much, much better once we hit continuity (and thus beta) next week.

@yebyen
Contributor Author

yebyen commented Sep 26, 2013

I hadn't got the daily digest yet. I read the initial warning that this was coming before Oct 4, but I didn't think that would mean "tomorrow" :)

I should have checked the mailing list before I tweeted. Thanks!

@yebyen
Contributor Author

yebyen commented Sep 26, 2013

I am back on the network and I appear to have pulled down all the "binaries" from zod. It is still just as slow. I don't know what these ships are doing when I load them up; if I have a submarine and two destroyers, will they work together on their own, or is it necessary to run them all at the same time on different hosts? I'm familiar with ZooKeeper enough to know that you need at least 3 ZooKeepers to have a working cluster (not two, and certainly not one, but always an odd number).

Does it work like that?

EDIT: OK so maybe not "just as slow" but there is a sizeable delay

@yebyen
Contributor Author

yebyen commented Sep 26, 2013

It's probably a tolerable speed now, with just one submarine.

Thanks for making a neat thing!

@yebyen yebyen closed this as completed Sep 26, 2013
@johncburnham
Contributor

We've got a big update coming, even before continuity, that should help the network performance. Just one of the innumerable things we've got to get done.

@cgyarvin
Contributor

~zod/~doznec was down. Alas, it happens. The error on 'ls' is that the
file doesn't exist, since your destroyer didn't sync.

~ /cx/~divsem-misdef/try/~2013.9.26..12.18.19..6428/bin/ls/hoon

The "state D" thing is very interesting. The thing is, we shouldn't be
doing blocking I/O. We're using libuv, dammit! Others have reported this
bug... and I've never seen it on OS X or Linux. I wonder if the
virtualization is an issue.

Unfortunately, yebyen, this makes you a valuable resource. If you can
chase down where the process is blocked, that'd be very useful...


@johncburnham johncburnham reopened this Sep 26, 2013
@vrotaru

vrotaru commented Sep 26, 2013

I have the same issue, using Fedora 19 (64-bit, if that matters). On the other hand, playing with Arvo/Urbit in a VMware client (on Windows 7) with a freshly installed Debian, I did not have that issue.

It must be system-specific. Currently I'm running the latest from GitHub.

@yebyen
Contributor Author

yebyen commented Sep 26, 2013

Oh, ok! It doesn't have to be unfortunate if I am a valuable resource :)

I have done this before using... strace. Unfortunately the output of strace is of absolutely ridiculous proportions even in trivial programs, and this is not a trivial program...

I see a lot of fdatasync, epoll_wait, clock_gettime when things settle down.
There appear to be segmentation violations left and right (SIGSEGV).

Performance has gotten a lot better trying it right now, but I don't have to touch many keys to see it enter D state.

I can try to just capture all of the output of strace -- does that help? Is there something you'd rather get?

@yebyen
Contributor Author

yebyen commented Sep 26, 2013

I'm going to bet that it's something docker does with cgroups and namespaces, not something that VirtualBox or Vagrant is doing, based on nothing but intuition. Let me know if there's an IRC channel I should join, or another way you'd prefer to continue the debugging dialog.

@cgyarvin
Contributor

Ooh! That makes me so mad! Cruiser to anyone who can find this. It's
undoubtedly some bad interaction in the bowels of libuv...


@cgyarvin
Contributor

You know, I really don't want to create an IRC channel, because Urbit is
designed to be its own IRC channel. I haven't started using this because
the network remains a bit flaky, but we will...

The sigsegvs are normal - they're part of how we do checkpointing. Above
(I would encourage anyone doing active development, by the way, to switch
their urbit-dev defaults away from digest mode) someone is reporting the
slowbug on 64-bit Fedora with no virtualization/docker.

The question is simply what it's doing when it's in D mode. Somewhere,
somehow, an I/O operation is blocking. All I need to know is where...
and yes, any kind of trace is (potentially) useful.


@vrotaru

vrotaru commented Sep 26, 2013

So this is what I get from

    strace -c -p $vere_pid

when I'm doing a :ls %. Subjectively it is about a second between the time I press a key and when something appears on the screen.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.88    0.062000         112       555           fsync
 21.95    0.019758           7      2741           read
  4.51    0.004055           6       717           write
  1.59    0.001434           3       554           lseek
  1.16    0.001047           2       651           mprotect
  0.77    0.000695          17        40           fdatasync
  0.26    0.000234           2       152           open
  0.20    0.000177           9        20           ftruncate
  0.17    0.000154           4        38           sigaltstack
  0.13    0.000114           1       228           rt_sigprocmask
  0.12    0.000105          11        10           sendmsg
  0.09    0.000083           1       152           close
  0.09    0.000078           0       631           rt_sigreturn
  0.08    0.000072           2        32        12 epoll_ctl
  0.00    0.000000           0         8           brk
  0.00    0.000000           0       247           rt_sigaction
  0.00    0.000000           0        38           setitimer
  0.00    0.000000           0        12         4 recvmsg
  0.00    0.000000           0        23           epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00    0.090006                  6849        16 total

@cgyarvin
Contributor

Alas, this doesn't tell me anything - as you see, the number of seconds
consumed in CPU processing is much smaller than your delay. Whatever is
taking the time, it is not the CPU in userspace mode.

I wonder about fsync() and fdatasync() - it's a little surprising how often
fdatasync() is called. If you comment out the calls to both these
functions in v/*.c, does the delay go away?


@yebyen
Contributor Author

yebyen commented Sep 26, 2013

Nope, it's still holding the terminal after a certain amount of keyboard input. It seems like the amount has gotten larger since I started testing this earlier. I'm not sure that commenting those lines out was enough; I've got strace -p running in a different terminal now, and it is calling fdatasync and fsync repeatedly while the D state is hanging.

@radiofreejohn

Ditto for me with the delay. I'm also not able to recreate my destroyers, but I suspect I've done something else wrong along the way.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 84.48    0.083359          10      8114           read
  7.11    0.007012          23       300           open
  3.14    0.003102           5       689           mprotect
  2.26    0.002228           6       394           fsync
  2.25    0.002219           7       300           close
  0.36    0.000354           1       685           rt_sigreturn
  0.31    0.000305           1       392           lseek
  0.09    0.000093           0       422           write
  0.00    0.000000           0         8           gettimeofday
  0.00    0.000000           0         4           ftruncate
  0.00    0.000000           0         6           setitimer
  0.00    0.000000           0         8           fdatasync
  0.00    0.000000           0        39           rt_sigaction
  0.00    0.000000           0        36           rt_sigprocmask
  0.00    0.000000           0         6           sigaltstack
  0.00    0.000000           0         6         2 epoll_ctl
  0.00    0.000000           0         5           epoll_wait
  0.00    0.000000           0        11           clock_gettime
  0.00    0.000000           0         6         6 openat
  0.00    0.000000           0        11           sendmsg
------ ----------- ----------- --------- --------- ----------------
100.00    0.098672                 11442         8 total

Also, this is what appears (a portion of it) just before the output of :ls %:

rt_sigreturn(0x847f4e)                  = 1112387764
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
mprotect(0x42513000, 16384, PROT_READ|PROT_WRITE) = 0
rt_sigreturn(0x8f3b6a)                  = 1112387764
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
mprotect(0x42967000, 16384, PROT_READ|PROT_WRITE) = 0
rt_sigreturn(0x8b9f1a)                  = 1112387764
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
mprotect(0x426af000, 16384, PROT_READ|PROT_WRITE) = 0
rt_sigreturn(0x923eee)                  = 1112387764
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
mprotect(0x42657000, 16384, PROT_READ|PROT_WRITE) = 0
rt_sigreturn(0x875fda)                  = 1112387764
write(0, "\r\n", 2)                     = 2
write(0, "\r", 1)                       = 1
write(0, "\33[J", 3)                    = 3
write(0, "bin lib web", 11)             = 11
write(0, "\r\n", 2)                     = 2
write(0, "\r", 1)                       = 1
write(0, "\33[J", 3)                    = 3
write(0, "~bitnyl-folfet-maptyp-palwet--ma"..., 64) = 64

And many repetitions of this:

mprotect(0x60983000, 16384, PROT_READ|PROT_WRITE) = 0
rt_sigreturn(0x404db004)                = 135438337
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
open("/proc/self/maps", O_RDONLY)       = 21
read(21, "08048000-080cc000 r-xp 00000000 "..., 1024) = 1024
...
read(21, "   /lib/i386-linux-gnu/libtinfo."..., 1024) = 951
close(21)                               = 0
mprotect(0x60987000, 16384, PROT_READ|PROT_WRITE) = 0
rt_sigreturn(0x404db004)                = 135442435
rt_sigaction(SIGTERM, {SIG_IGN, [TERM], SA_RESTART}, {0x804ebe0, [TERM], SA_RESTART}, 8) = 0
rt_sigaction(SIGVTALRM, {SIG_IGN, [VTALRM], SA_RESTART}, {0x804ebb0, [VTALRM], SA_RESTART}, 8) = 0
rt_sigaction(SIGSEGV, {0xb76b0ad0, [HUP INT QUIT USR1 USR2 PIPE ALRM TERM CHLD URG XCPU XFSZ VTALRM PROF WINCH IO PWR], SA_SIGINFO}, NULL, 8) = 0
sigaltstack({ss_sp=0x80000000, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
setitimer(ITIMER_VIRTUAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0

@cgyarvin
Contributor

John, did you try recreating your destroyers from a new submarine?

The mprotect() and sigsegvs are totally normal - that's part of the
checkpointing process.

I think it's clear that there's some blocking I/O going on somewhere. I
wonder if anyone can create this problem in a boxed-up virtual machine
where I can replicate it? It doesn't happen on any bare OS I have access
to.

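For anyone puzzled by those SIGSEGV/mprotect pairs in the trace: they are the classic write-barrier trick for incremental checkpointing. Pages are kept write-protected; the first write to a page faults, the handler marks the page dirty and restores write access, and only dirty pages need saving at the next checkpoint. Below is a toy C illustration of the pattern -- a sketch, not the loom's actual code (a real handler would also check that the faulting address lies inside the managed region before touching it):

    /* Toy dirty-page tracking via mprotect + SIGSEGV; sketch only. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NPAGES 16
    static char*  base;            /* start of the managed region */
    static size_t pgsz;            /* system page size */
    static int    dirty[NPAGES];   /* pages written since the last save */

    static void on_segv(int sig, siginfo_t* si, void* uc)
    {
      size_t pg = (size_t)((char*)si->si_addr - base) / pgsz;

      dirty[pg] = 1;               /* first write to this page: remember it */
      mprotect(base + (pg * pgsz), pgsz, PROT_READ | PROT_WRITE);
      (void)sig; (void)uc;         /* the faulting store is then retried */
    }

    int main(void)
    {
      struct sigaction sa;

      pgsz = (size_t)sysconf(_SC_PAGESIZE);
      base = mmap(NULL, NPAGES * pgsz, PROT_READ,   /* read-only: writes fault */
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      memset(&sa, 0, sizeof sa);
      sa.sa_sigaction = on_segv;
      sa.sa_flags     = SA_SIGINFO;
      sigaction(SIGSEGV, &sa, NULL);

      base[3 * pgsz] = 'x';        /* faults once; handler marks page 3 */
      printf("page 3 dirty: %d\n", dirty[3]);
      return 0;
    }

After a checkpoint is written, the saver would clear the dirty flags and re-protect the pages, which is why the trace shows the mprotect/SIGSEGV chatter resume after every burst of activity.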

@yebyen
Contributor Author

yebyen commented Sep 26, 2013

Docker is all about boxed-up virtual machines. I could send you mine. Let me check to make sure I haven't left any password hashes anywhere.

@yebyen
Contributor Author

yebyen commented Sep 26, 2013

If you get Docker, you can pull yebyen/cgyarvin and you'll have my destroyer 'satnum' where the problem is happening for me.

source /.bash_profile
cd home/urbit-master
bin/vere hobnob

If by chance it's an artifact of the kernel (3.10.10) or other system bits (it's CoreOS, which has been having a bit of an update problem in the last few days, so it might not be the very latest CoreOS), you won't see the problem.

It's all pushed. Have a look-see. The image is close to 3GB (sorry) because it includes some testnet bitcoin altchain I've been hashing on and an example ChicagoBoss app from the PDF tutorial I've been following along with.

(I can also export to a tar file but docker is built for doing this...)

@cgyarvin
Contributor

Hm, I get a username/password prompt when I try to:

oxford:~/Documents/src; git pull https://github.com/yebyen/cgyarvin.git
Username:
Password:

I do suspect it might be a kernel thing, though. Well, if so, that'll
narrow it down...


@yebyen
Contributor Author

yebyen commented Sep 26, 2013

No, it's not on Git, you want this one: https://index.docker.io/u/yebyen/cgyarvin/

Unfortunately I'm not sure how you can download from the Docker index without Docker. I'll start exporting to tar.

@cgyarvin
Contributor

Well, I need Docker to run it! Point me to some stupid-person Docker doc
to show me what I should do.


@yebyen
Contributor Author

yebyen commented Sep 26, 2013

If you're in Ubuntu Linux, you're 2/3 of the way there...

http://www.docker.io/gettingstarted/#h_installation

You want:

docker pull yebyen/cgyarvin
docker run -i -t yebyen/cgyarvin /bin/bash

before doing

source /.bash_profile
cd home/urbit-master
bin/vere hobnob

I don't actually install docker though, so I start here:
http://coreos.com/blog/coreos-vagrant-images/

If you are using CoreOS (and possibly if you're not), you'll have to run the docker run command twice to get a shell, because of some weird bug involving fork and lxc-start.

@cgyarvin
Contributor

I'm in OS X - so don't expect immediate results :-)


@yebyen
Contributor Author

yebyen commented Sep 26, 2013

OK :) The coreos/vagrant path is the quickest to an identical setup. I'm using Windows 7, so ^_^

@cgyarvin
Contributor

Cool. One day maybe I'll rock as much as docker. Anyway I have something
else I need to fix first, but this is my next priority...


@radiofreejohn

@cgyarvin still no love:

generating 2048-bit RSA key...
; ~doznec not responding still trying
[waiting...]

@cgyarvin
Contributor

See latest urbit-dev reply.


@vrotaru

vrotaru commented Sep 27, 2013

My idea, when posting the strace output, was to compare it with the same output on a box without those problems; below is the same output (during :ls %) on Debian under VMware Player under Windows 7.

Well, there is quite a difference. Unfortunately, it tells me very little about what the problem could be.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.001074          12        88           fsync
  0.00    0.000000           0      1186           read
  0.00    0.000000           0       159           write
  0.00    0.000000           0       143           open
  0.00    0.000000           0       143           close
  0.00    0.000000           0        86           lseek
  0.00    0.000000           0       191           mprotect
  0.00    0.000000           0        78           rt_sigaction
  0.00    0.000000           0        72           rt_sigprocmask
  0.00    0.000000           0       183           rt_sigreturn
  0.00    0.000000           0        12           setitimer
  0.00    0.000000           0        16           fdatasync
  0.00    0.000000           0         8           ftruncate
  0.00    0.000000           0        12           sigaltstack
  0.00    0.000000           0        14           epoll_wait
  0.00    0.000000           0        24        12 epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.001074                  2415        12 total

@yebyen
Contributor Author

yebyen commented Sep 29, 2013

Ohp... I've just accidentally pushed all of my destroyers to the docker cloud.

Seems there is no way to push a repo and keep certain tags private at the same time.

@cgyarvin
Contributor

Okay I think I get it.

It's just a crazy level of difference in OS performance. On some systems it is quite tolerable, those include mine.

Why is it blocking? It's a bug, or rather a prototype in production. What it should actually do is block, but block in a forked process. Assuming decent COW (copy-on-write), this should not delay the parent, which can continue past the snapshot point.

This change might be a fun way to learn vere internals...

On Sep 29, 2013, at 11:31 AM, yebyen notifications@github.com wrote:

save_time_cb seems to be called on average about every 5 seconds and it only lags the terminal if there's been some input since the last checkpoint.

I don't think it's being called too often! Every 5 seconds seems reasonable. But why is it blocking?


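As a rough illustration of the fork-and-checkpoint idea described above -- a sketch under stated assumptions, not vere's actual code; write_image() is a hypothetical stand-in for dumping the memory image to disk:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* hypothetical stand-in: write the memory image to disk and sync it */
    static void write_image(void) { /* ... write pages, fsync ... */ }

    static pid_t chk_pid = 0;      /* pid of an outstanding checkpoint child */

    void checkpoint_async(void)
    {
      pid_t pid;

      if ( chk_pid != 0 ) {
        return;                    /* previous checkpoint still running; skip */
      }
      pid = fork();
      if ( pid < 0 ) {
        perror("fork");
      }
      else if ( pid == 0 ) {
        /* child: holds a copy-on-write snapshot of the parent's memory as
        ** of fork(); blocking on disk here does not stall the parent */
        write_image();
        _exit(0);
      }
      else {
        chk_pid = pid;             /* parent: continue immediately */
      }
    }

The parent still has to reap the finished child, which is where the SIGCHLD handling discussed further down comes in.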

@cgyarvin
Contributor

Doh! We'll send you some new ones.


@yebyen
Contributor Author

yebyen commented Sep 29, 2013

Thanks :)

I don't think anyone got them, but who knows if Docker index will keep them around for the NSA or not.

There is a much smaller image you can pull now, based on tianon/debian:jessie, which optionally has an already-generated submarine:

docker pull yebyen/urbit
docker run -i -t yebyen/urbit /bin/bash
docker run -i -t yebyen/urbit:nalder-sarted /bin/bash

Everything is in /home/urbit-master as before.
The .bash_profile in / is not automatically sourced; you could do bash --login to have it sourced.

source /.bash_profile
cd /home/urbit-master
bin/vere hobnob

This time it's only 309.3MB, 123MB of which you will have already downloaded if you have pulled tianon/debian before. I actually have to go meet someone at a bar right now...

@cgyarvin
Contributor

Nifty I'm sure we'll get way into Docker as time goes on.


@yebyen
Contributor Author

yebyen commented Sep 29, 2013

I got the extra destroyers, thanks 💃
It seems that ~zod is down right now.

@cgyarvin
Contributor

Crap, I'll get online and fix that.


@yebyen
Contributor Author

yebyen commented Sep 29, 2013

I think I've confirmed that it's not the absence or presence of a login shell with a tty that causes this problem.

In Docker, you need to run with -privileged in order to have an sshd running that allows users to log in and get a tty.
(There are now yebyen/urbit:rsyslog and yebyen/urbit:sshd.)

Unfortunately I couldn't get a tty any other way. Turns out I still don't need one, other than for screen...

@yebyen
Contributor Author

yebyen commented Sep 29, 2013

I like your explanation that it's an abysmal COW that causes this to be noticeable... you're not the first person I've heard say (not that you quite said it) that AUFS is a bad idea and should be killed. I like everything else about CoreOS, though, so I'm probably going to comment out that line in v/save.c and live with it until I understand better what needs to change.

@cgyarvin
Contributor

AUFS is a great idea but it has to actually be as fast as a non-union
(scab?) filesystem.

You can't really live in Urbit without checkpointing - every time you quit
and resume, it will have to replay all your events. Which will quickly get
pretty unmanageable.

So, it's a performance bug that someone has to fix. I'll fix it myself if
no one else gets to it first...


@yebyen
Contributor Author

yebyen commented Sep 30, 2013

Well, I run vim with set nobackup, and Docker in Vagrant, where snapshots are really a bad idea (try it for a little while -- the performance is basically fine, but just wait till a time comes when you need them in VirtualBox and see if you can actually get what you needed back out of them!).

I have the whole Dockerfile/Vagrantfile philosophy, where experiments should really be quickly repeatable with whatever input data, without a lot of resources, and images in general should be disposable, not depended on (except for basically all the rest of the time when they are actually part of a production process.)

I see the v/save.c call is the one that happens every 5 seconds regardless of everything else, but in my latest experiment, simply commenting out that one still left me with quite a lot of terminal hanging, pointing to generally abysmal performance of this function on my platform. Commenting them all out (4) makes terminal performance more acceptable. In light of what you say, it would be wiser to comment them all out except for the one that runs on clean shutdown, and then actually shut down sometimes.

Of course, on my platform even sudo has the same problem, since the host does not generally have its own name in /etc/hosts or resolvable by DNS.

I guess checkpoints are really not the same thing as snapshots...

@yebyen
Contributor Author

yebyen commented Oct 1, 2013

I had an idea, at least for CoreOS/docker users, how to test this theory of what is wrong and whether it can be fixed without architectural changes or a patch upstream.

It appears that ~zod is down, but possibly hosting urbit on a volume will alleviate some of the terminal slowness?

At first attempt it doesn't appear so. I don't think volumes are copy-on-write, so back to not knowing what causes the slowdown.

@cgyarvin
Contributor

cgyarvin commented Oct 1, 2013

I just put ~zod back up. You know, the spawning sounds scary,
but it's actually a pretty minor fix...


@cgyarvin
Contributor

cgyarvin commented Oct 3, 2013


Given the continued stability problems, I am going to start hanging out in #urbit on freenode -- but with the explicit plan to ++raze that channel as soon as IRC-in-Urbit is viable.

@akx

akx commented Oct 3, 2013

I'm also seeing the same laggy terminal input issue, and also practically constant lseek/write 16384 bytes/sync cycles in strace (on fd 4 -- mypier/chk/a.dat). These cycles do occasionally quiesce, but soon start again after I enter something (without even hitting return).

This is on an actual iron machine (my NAS box, heh), btw, no virtualization.
Linux monolith 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.32-1~bpo60+1 x86_64 GNU/Linux

@cgyarvin
Contributor

cgyarvin commented Oct 3, 2013

I want to clarify this bug once again.

The "bug" is that checkpointing is acceptably fast on some machines, but
not on others. There seems to be about an order of magnitude difference in
performance. Maybe your iron is old? Geeze, it's a NAS box, so I hope
it's not storage slowdown.

In any case, the solution is to fork() and checkpoint in a different
process. I have actually written and checked in this code (to v/save.c),
but libuv seems to be losing my SIGCHLD - a little more digging is needed.

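For context, one conventional way to reap such a forked checkpoint child under libuv is a uv_signal_t watcher on SIGCHLD -- a guess at the shape of the fix, not the code actually checked into v/save.c:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <uv.h>

    static uv_signal_t chld_sig;

    static void on_sigchld(uv_signal_t* handle, int signum)
    {
      pid_t pid;
      int   stat;

      /* reap every exited child; WNOHANG so the loop never blocks */
      while ( (pid = waitpid(-1, &stat, WNOHANG)) > 0 ) {
        fprintf(stderr, "checkpoint child %d exited (%d)\n",
                (int)pid, WEXITSTATUS(stat));
        /* a real vere would clear its checkpoint-in-progress flag here */
      }
      (void)handle; (void)signum;
    }

    int main(void)
    {
      uv_loop_t* loop = uv_default_loop();

      uv_signal_init(loop, &chld_sig);
      uv_signal_start(&chld_sig, on_sigchld, SIGCHLD);

      /* ... fork() checkpoints happen elsewhere; run the loop as usual ... */
      return uv_run(loop, UV_RUN_DEFAULT);
    }

One known wrinkle, offered as speculation only: libuv's own child-process machinery also listens for SIGCHLD, and on some versions it can swallow notifications for children it did not spawn itself, which would look exactly like a "lost" SIGCHLD.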

@akx

akx commented Oct 3, 2013

Well, heh. I'll be the first to admit my NAS box isn't exactly... well, erm, it's far from the most performant one out there -- I built it to be reliable and cheap, and those goals it fulfills admirably.

Anyway, I copied my URB_HOME to /tmp (a tmpfs, thus in memory) and ran a new pier (just to be safe) from there, and there was almost no noticeable slowdown (and I think most of what remained was just strace overhead).

@cgyarvin
Contributor

cgyarvin commented Oct 3, 2013

Aarni, that's what I'd expect, but I also do like a bit of disk under my
ship!


@akx

akx commented Oct 3, 2013

Likewise. :) But at least it proves (as if it needed to be proven any further) that it's an IO/FS performance issue.

(Aside: Luckily I'm in the process of upgrading my non-NAS system, so I can then hopefully happily run Urbit in a VM on non-old iron, backed by an SSD to boot. Yay!)

@yebyen
Contributor Author

yebyen commented Oct 5, 2013

You've both given some pretty big hints in here. I am a little stumped by what you mean by "libuv losing my SIGCHLD" -- I come from a CS major program where they teach you threads and forking in Java, and I have an appreciation for literature like Dave Thomas' "Pipelines Using Fibers in Ruby 1.9". Of course, nearly all of the programs I use from day to day are written in C, and don't follow these methods at all.

I read the thread on news where they told you how to do a proper rebase/pull request, and I was mystified. While everything sounds technically correct, and I can read and follow the solution as explained, I can say with certainty I have literally never done anything like that. I will take a stab at this one issue if you want to engage a branch where we can discuss the problem, but I think my git-fu is more on your level than theirs, and I know my C chops don't measure up to yours!

I'm hoping you have already solved this in private and you're waiting to see if someone else comes up with the solution.

I'll pull now and see if I can find your changes around u2_loom_save() calls, and do some reading on gotchas around fork and exec. I would be just as happy to read the fix as to implement it, but I have a feeling you'd be a happier guy if someone else does it.

@cgyarvin
Contributor

cgyarvin commented Oct 6, 2013

Kingdon,

I apologize for misconstruing your background! Of course there's nothing
wrong with not being a hardened Unix neckbeard - in fact, Unix is so uncool
these days that I don't know how anyone gets into it - and I would never
send an ordinary man to do the job of a neckbeard.

I will fix this bug - it's just that I have 10 million other top
priorities at the moment. Feel free to play around with it, of course, if
you like. But there's no kind of pressure, of course...


@cgyarvin
Contributor

Well, you do see that fsync() and in fact fdatasync() are being called down
at the bottom of the caller, where it says // sync this mfer.

I am inclined to regard your change as good. I think I've basically been
treating fsync() as a no-op, which it is on some machines and not on
others. Can you put in a PR?

The checkpoint logic definitely doesn't pretend to be constantly consistent - I don't think a checkpoint can do that. Behind the checkpoint stands the event log, though. Although, for long-running things it would definitely be prudent to have a backup checkpoint, so restoring from 0 doesn't have to happen.

On Mon, Oct 14, 2013 at 12:06 PM, Aaron Sokoloski notifications@github.com wrote:

Not sure if this causes other problems, but this change completely
eliminates the lag for me. Is the fsync intended to try to minimize the
amount of time that the disk is in an inconsistent state? Perhaps it could
be deferred until all the pages are written.

I don't really understand how the loom works, so apologies if this is no
help at all.

    diff --git a/f/loom.c b/f/loom.c
    index 6f13453..06518c4 100644
    --- a/f/loom.c
    +++ b/f/loom.c
    @@ -84,7 +84,6 @@ _loom_write(c3_i fid_i, void* buf_w, c3_w len_w)
         exit(1);
       }
       else {
    -    fsync(fid_i);  //  for what it's worth
         return u2_yes;
       }
       //  return ((4 * len_w) == write(fid_i, buf_w, (4 * len_w))) ? u2_yes : u2_no;
@asokoloski
Contributor

Ah, ok, will do. Sorry, I deleted my comment after re-reading some previous ones, thinking that the issue had already been covered.

@philipcmonk
Contributor

Yeah, fsync() definitely works quite differently on different platforms. I
remember that being a significant issue for ext4. This has a good
explanation:

http://www.humboldt.co.uk/2009/03/fsync-across-platforms.html

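The linked article's point, in code: on OS X, fsync() does not actually force the drive's write cache to flush, while fcntl(F_FULLFSYNC) does -- one reason the same call can be nearly free on one platform and cost real milliseconds on another. The usual portability shim looks roughly like this (a sketch, not anything in vere):

    #include <fcntl.h>
    #include <unistd.h>

    /* "really sync": strongest durability request available per platform */
    int full_sync(int fd)
    {
    #ifdef F_FULLFSYNC
      /* OS X: flush the file AND the drive's write cache */
      if ( fcntl(fd, F_FULLFSYNC) != -1 ) {
        return 0;
      }
      /* F_FULLFSYNC can fail on some filesystems; fall back to fsync() */
    #endif
      return fsync(fd);
    }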

@philipcmonk
Contributor

I can confirm that this completely resolves the issue for me. Thank you so
much, Aaron.

(Incidentally, for some reason I had to do a make clean before the make, or else it failed.)


@asokoloski
Contributor

So, I should amend my previous statement -- this mitigates the problem greatly, but if I try pasting into the terminal, there's still quite a noticeable lag, as I think u2_loom_save is being called after every character of input. But at least it's much more usable now.

@yebyen
Contributor Author

yebyen commented Oct 15, 2013

Yeah, I can confirm that after this pull request is merged, on my laptop (a different laptop than in the previous report, but practically the same software underneath), the issue is only a problem for extreme button-mashers or big copy-pasters. As fast as I can intelligibly type (and I am a fast typist), the key presses are reflected on the screen.

The last laptop had an SSD and this one has an SSD hybrid. I have not tested their comparative write performance, but with both machines running some variant of Windows (7 or 8), deploying CoreOS to VirtualBox (with Vagrant), with Debian Jessie (testing) and Urbit master underneath, performance is good enough for me that I would close this issue.

I closed it before, though, and it was reopened for good reason, so I'll just leave that as my $0.02.

If anyone wants to try on different hardware: docker pull yebyen/urbit; the image is already set up, you just need to source /.bash_profile. To Urbit devs: is it "not recommended" to publish a pier and submarine in this way? Should I take the keys out? Will the network be upset if the same submarine is popping up in multiple locations, or is that fine?

I would be more careful with a destroyer, but if I'm helping someone try out Urbit, I might want to start by giving them a submarine. I just need to know if I'm going to have problems if I gave everyone the same submarine (at a 30-minute talk in November, for instance).
