Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cpu hog 100% by sclang when it's built without QT #2144

Closed
hardiksingh-rathore opened this issue Jun 1, 2016 · 51 comments
Closed

Cpu hog 100% by sclang when it's built without QT #2144

hardiksingh-rathore opened this issue Jun 1, 2016 · 51 comments

Comments

@hardiksingh-rathore
Copy link

@hardiksingh-rathore hardiksingh-rathore commented Jun 1, 2016

sclang cpu hog issue in supercollider 3.7 when we build it without QT. the sclang process never goes to sleep state and using 100% cpu after booting server.

@miczac
Copy link
Contributor

@miczac miczac commented Jun 1, 2016

What OS, environment, compiled ... ?

@hardiksingh-rathore
Copy link
Author

@hardiksingh-rathore hardiksingh-rathore commented Jun 1, 2016

OS - Ubuntu-16.04,
Processor- Intel(R) Core(TM) i5 CPU M 460 @ 2.53GH

@hardiksingh-rathore
Copy link
Author

@hardiksingh-rathore hardiksingh-rathore commented Jun 2, 2016

OS - Ubuntu-16.04,Intel(R) Core(TM) i5 CPU M 460 @ 2.53GH
Thanks,Rathore

On Wednesday, 1 June 2016 7:22 PM, Michael Zacherl <notifications@github.com> wrote:

What OS, environment, compiled ... ?—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.

@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Aug 3, 2016

@hardiksingh-rathore can you still reproduce this?

@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Aug 12, 2016

Actually, you know what? I can repro this.

Running:

SC_JACK_DEFAULT_INPUTS="system" SC_JACK_DEFAULT_OUTPUTS="system" scsynth -u 57110

consumes (at most) 2% of CPU and 2% of memory. Running sclang itself after that uses less than 1% of both.

But when I open sclang and then type s.boot, the sclang CPU usage goes up to 99/100% (and never goes back down), regardless of if it successfully booted scsynth or not (e.g. it goes to 100% even if scsynth was already running)

Maybe there's something constantly polling?

@vivid-synth vivid-synth changed the title Cpu hog 100% by sclang when it's build without QT Cpu hog 100% by sclang when it's built without QT Aug 12, 2016
@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Aug 12, 2016

(I'm on Linux, and compiling from this week's master)

@hardiksingh-rathore
Copy link
Author

@hardiksingh-rathore hardiksingh-rathore commented Aug 15, 2016

This is not present in SC-3.6.6, So you can differentiate the code SC-3.6 and SC-3.7.
Some Profilers can help like perf,Gprof etc.Nearly one month ago i had profiled the sclang process using perf tool and cancelled optimization using -O0 compiler option.
"scalng_O0" is the name of process so you can see the output of sclang_O0 process as given below
###############################################################################
%CPU
21.20% sclang_O0 sclang_O0 [.] Interpret(VMGlobals_)
|
--- 0x117758
11.46% sclang_O0 sclang_O0 [.] PyrGC::ScanSlots(pyrslot_, long) [clone .part.22]
9.23% sclang_O0 sclang_O0 [.] PyrGC::NewFrame(unsigned int, long, long, bool)
7.88% sclang_O0 [kernel.kallsyms] [k] _raw_spin_unlock_irq
###############################################################################
(Note - Only top highest cpu using functions i have copied)

@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Aug 15, 2016

I can try to git-bisect this but it's going to be a long slog since I have to recompile each time. Anyone have thoughts about where the issue might have come from?

@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Aug 16, 2016

This is kinda rough, slow going. Most older commits don't build anymore. Today I've narrowed it from between

(Earliest known bad):

commit ce15240
Date: Thu Aug 22 15:10:51 2013 +0100

(Latest known good):

commit efe8593
Date: Tue Feb 12 11:42:42 2013 +0100

I'll be working on narrowing it down more tonight and tomorrow, but if anyone can think of an obvious code change between these points in time, that could help!

@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Aug 16, 2016

(Unless "NotificationCenter" is part of the problem -- I've had to disable it in several of the builds, so if the problem lies there I'm bisecting in the wrong direction)

@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Aug 16, 2016

Currently:

Earliest bad:

commit ce15240
Date: Thu Aug 22 15:10:51 2013 +0100

Latest good:

commit c69da82
Date: Sun May 31 17:35:48 2015 +0200

@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Aug 17, 2016

Phew! I think I found it! Pretty sure the bug is in 96e56b5, although it's possible it's in its parent ( 871aa6e ).

Can someone who is more familiar with sclang innards take a look at this?

Repro steps are:

  • Checkout the commit and build it, with cmake flags -DSC_IDE=False and -DSC_QT=False
  • Open top to observe resource usage
  • Start scsynth on the command line, e.g. with SC_JACK_DEFAULT_INPUTS="system" SC_JACK_DEFAULT_OUTPUTS="system" scsynth -u 57110
  • Check in top. scsynth should be using well below 5% of CPU and memory.
  • Create a file, foo.scd:
Server.default.waitForBoot {
   Routine {
      x = play { SinOsc.ar };
      "thing happening".postln;
      1.wait;
      x.free;
   }.play
}
  • Run sclang foo.scd. Check in top : sclang's resource usage should have skyrocketed to 100%, and won't drop back down.

(As above, you can also call sclang, observe that it uses <5% CPU, then call s.boot in sclang and -- regardless of the fact that scsynth is already booted -- watch it use 100%)

@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Aug 17, 2016

@timblechmann this code is fairly above my head, but maybe you have a better understanding of where it's starting to endlessly spin? (e.g. constant polling?)

@smiarx
Copy link
Contributor

@smiarx smiarx commented Sep 23, 2016

I ran into the same problem on a headless Raspberry Pi with Archlinux.
The issue seems to be related to the startAliveThread method in the boot section (more precisely the Routine launched in this method).

One workaround I've found is to shutdown this routine by calling s.statusWatcher.stopAliveThread just after booting. The CPU goes back to normal.
(By the way the startAliveThread=false argument of Server.boot will work too, but will throw an error and say that the server cannot be started, is this argument obsolete ?)

Since the aliveThread is down, you need to call s.status before calling s.avgCPU, s.numSynths,...
Hope that helps.

@jamshark70
Copy link
Contributor

@jamshark70 jamshark70 commented Sep 23, 2016

@smiarx @vivid-synth I think we need one more test to clarify if it's the AppClock that's broken or all clocks.

Disable the server's alive thread. Then, does this work?

{ 5.do { |i| i.postln; 1.wait } }.fork;

Or this?

{ 5.do { |i| i.postln; 1.wait } }.fork(SystemClock);

Or...?

{ 5.do { |i| i.postln; 1.wait } }.fork(AppClock);

I'm guessing 1 and 2 will be ok and 3 will die.

@smiarx
Copy link
Contributor

@smiarx smiarx commented Sep 24, 2016

That was my first guess too, but all three routines work correctly. I even try to do something similar to the alive thread in the prompt

Routine {
   loop {
      s.status;
      s.aliveThreadPeriod.wait;
      s.statusWatcher.updateRunningState(true);
   }
}.play(AppClock)

and it runs correctly too, without cpu overloading.

Maybe it only breaks in compiled classes and not in the prompt ?

@telephon
Copy link
Member

@telephon telephon commented Sep 24, 2016

Could you try also with this?

Routine {
   loop {
      s.status;
      s.aliveThreadPeriod.wait;
      s.statusWatcher.updateRunningState(false); // <--------- false instead of true
   }
}.play(AppClock)
@smiarx
Copy link
Contributor

@smiarx smiarx commented Sep 24, 2016

I tried and it makes the CPU go 100% again, but I noticed that I was wrong in my last comment, and it does the same with updateRunningState(true) too, I must have made some kind of mistake when I first tried.

But I've found that the problematic part is the s.status line. I traced back the issue to the OSCFunc for /status.reply in addStatusWatcher, and it's cause by this defered function.
If I comment it out, I can run the Routine without any problem, I can even restart the alive thread after stopping it with the CPU usage still low. However when I call s.boot, sclang continue to monopolize the CPU...

@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Oct 31, 2016

I have a feeling that the xruns I experienced during a performance last night were at least partially due to this bug. Does anyone know the way forward on this?

@telephon
Copy link
Member

@telephon telephon commented Oct 31, 2016

I'd like to have a look at it after merging #2422.

@smiarx interesting. So when you get rid of the defer of this function, it works?

Only when you run s.boot, it'll do it again?

Note (even if this is unrelated perhaps) that there are a lot of defers in the status update code (mainly for making it easier to deal with GUIs that update from the server: btw. I believe that the GUI updaters should do the deferrals, not the senders.)

@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Dec 9, 2016

@telephon I just reproduced this again after building with #2422

@telephon
Copy link
Member

@telephon telephon commented Dec 10, 2016

@vivid-synth I don't know the boost internals at all, so I cn only help to continue to trace down the problem from the sclang side. As it seems, the problem depends on the running routine aliveThread, created in the startAliveThread method, isn't it?

So maybe you can comment out lines of code to see which one causes it:


startAliveThread { | delay = 0.0 |
		this.addStatusWatcher;
		^aliveThread ?? {
			aliveThread = Routine {
				// this thread polls the server to see if it is alive
				delay.wait;
				loop {
					alive = false;
					// server.sendStatusMsg; // <----------- this one?
					aliveThreadPeriod.wait;
					this.updateRunningState(alive);
				};
			}.play(AppClock);
			aliveThread
		}
	}





startAliveThread { | delay = 0.0 |
		this.addStatusWatcher;
		^aliveThread ?? {
			aliveThread = Routine {
				// this thread polls the server to see if it is alive
				delay.wait;
				loop {
					alive = false;
					server.sendStatusMsg;
					aliveThreadPeriod.wait;
					// this.updateRunningState(alive); // <----------- ... or that one?
				};
			}.play(AppClock);
			aliveThread
		}
	}
@gogobd
Copy link

@gogobd gogobd commented Jan 14, 2017

I can confirm that "s.statusWatcher.stopAliveThread" makes CPU usage drop again on a headless Raspberry Pi 3, scsynth 3.9dev built from branch 'master' [28713cd].

@jamshark70
Copy link
Contributor

@jamshark70 jamshark70 commented May 22, 2017

Since nobody ran my test, I just did it myself. First I reproduced the issue. Then I tried my small test. Result: no problem.

So I ran other tests, and as far as I can see, the issue is calling defer from the status.reply OSCFunc.

  • The problem goes away after commenting out this.updateRunningState(true);.
  • If I uncomment that line, and comment out both defers in updateRunningState, the problem also goes away.

OK... wait... I got it.

n = NetAddr("127.0.0.1", NetAddr.langPort);
o = OSCFunc({ |msg| defer { msg.postln } }, '/hello');
r = Routine { loop { n.sendMsg('/hello'); 1.0.wait } }.play(AppClock);

If a routine is pending on AppClock (1.0.wait) and at the same time a responder function schedules something else on AppClock, CPU goes kaboom. BUT this does not happen in Qt environments AND it does not happen if the defer is occurring within the routine on AppClock (oh, duh, now it's obvious -- if you call defer before the wait, then the routine is not pending).

Something about the main application loop, then?

You can thank me in beer, later.

@telephon
Copy link
Member

@telephon telephon commented May 22, 2017

beers forever!

@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented May 22, 2017

(Side node: the Qt "primitive not bound" referred to above issue is here: #1209 . Non-IDE SC could definitely use some love!)

@muellmusik
Copy link
Contributor

@muellmusik muellmusik commented May 22, 2017

@jamshark70 I'm guessing the issue is in SC_TerminalClient::tick. The current AppClock isn't like the old SC.app one, which fired at regular intervals. The timer updates its expiry when something is scheduled.

Run it in a debugger and see where it's getting stuck. Each schedule update triggers a 'tick'. I can't quite see the problem, but I think it's getting stuck in a loop. The tick() method is not much different than the IDE one, but uses a different timer. It is possible that mTimer.cancel(); should happen later, maybe in the haveNext if.

Also, note this:

If the timer has already expired when cancel() is called, then the handlers for asynchronous wait operations will:

  • have already been invoked; or
  • have been queued for invocation in the near future.

These handlers can no longer be cancelled, and therefore are passed an error code that indicates the successful completion of the wait operation.

From http://www.boost.org/doc/libs/1_49_0/doc/html/boost_asio/reference/basic_waitable_timer/cancel/overload1.html

Can't test myself at the moment, sorry...

@jamshark70
Copy link
Contributor

@jamshark70 jamshark70 commented May 23, 2017

Well, I've got the environment set up, so it may be easier for me to check it out than for others, but... the extent of my gdb knowledge for this issue is thread apply all bt. If that doesn't reveal anything interesting, then I would have to rely on others' knowledge.

To be honest, I would rather hand this off to people who understand how the clocks work. I'm good at differential testing and diagnosis, not so good with debuggers.

@muellmusik
Copy link
Contributor

@muellmusik muellmusik commented May 23, 2017

@jamshark70 I'm not much help I'm afraid. I'm lazy and usually use it with a GUI front end so I always have to look it up if I do cl. But there're lots of good tutorials if you google. (I'm sure there was a nice front end I used last time I had to do stuff on Linux...) Or maybe there's something like spindump that would let you sample?

@miguel-negrao
Copy link
Member

@miguel-negrao miguel-negrao commented May 23, 2017

@jamshark70 I can recommend using nemiver a graphical frontend for gdb which doesn't require any setup, you use it like you would use gdb. I can also recommend QtCreator which can also open a cmake project. Just do new project -> import project -> import existing project and select the supercollider source project. Then you can use the built-in debbuger front end.

@giuliomoro
Copy link
Contributor

@giuliomoro giuliomoro commented Aug 5, 2017

I tried to run this in gdb, but I get some funny results (despite building with -g -O0):

Breakpoint 1, SC_TerminalClient::tick (this=0x374220, error=...)
    at /root/supercollider/lang/LangSource/SC_TerminalClient.cpp:402
402		mTimer.cancel();
(gdb) n
405		lock();
(gdb)
406		bool haveNext = tickLocked( &secs );
(gdb)
Warning:
Cannot insert breakpoint 0.
Cannot access memory at address 0x273aa127

longjmp (env=0x2a11c0 <gVMGlobals+96>, val=2) at ../nptl/sysdeps/pthread/pt-longjmp.c:26
26	../nptl/sysdeps/pthread/pt-longjmp.c: No such file or directory.
(gdb)
Warning:
Cannot insert breakpoint 0.
Cannot access memory at address 0x273aa127

__libc_siglongjmp (env=0x2a11c0 <gVMGlobals+96>, val=2) at longjmp.c:28
28	longjmp.c: No such file or directory.
(gdb)
30	in longjmp.c
(gdb)
32	in longjmp.c
(gdb)
38	in longjmp.c
(gdb)
secs: 57.741871
nextAbsTime 1501893809191560000

Breakpoint 1, SC_TerminalClient::tick (this=0x374220, error=...)
    at /root/supercollider/lang/LangSource/SC_TerminalClient.cpp:402
402		mTimer.cancel();
(gdb)

this said, I try to print some values

@@ -410,8 +410,10 @@ void SC_TerminalClient::tick( const boost::system::error_code& error )
        std::chrono::system_clock::time_point nextAbsTime;
        ElapsedTimeToChrono( secs, nextAbsTime );

+       printf("secs: %f\n", secs);
        if (haveNext) {
                mTimer.expires_at(nextAbsTime);
+               printf("nextAbsTime %lld\n", nextAbsTime);
                mTimer.async_wait(boost::bind(&SC_TerminalClient::tick, this, _1));
        }
 }

and this floods my console once I do s.boot(true):

sc3> s.boot(true)
-> localhost
secs: 18.269868
nextAbsTime 1501893975282793000
sc3> secs: 18.470377
nextAbsTime 1501893975483301000
booting server 'localhost' on address: 127.0.0.1:57110
sc3> secs: 20.040324
nextAbsTime 1501893977053248000
secs: 20.240131
nextAbsTime 1501893977253055000
secs: 20.240131
nextAbsTime 1501893977253055000
<<<< these same two lines repeated about 20 more times >>>>
secs: 20.601629
nextAbsTime 1501893977614553000
<<<< these same two lines repeated about 20 more times >>>>

Ultimately, this - quick and dirty - fixes the CPU hog, but I am not sure if it introduces any horrible side effects (note: I got the magic number 70000 from this old help file which specifies 0.7 seconds as the default time interval for startAliveThread.

diff --git a/lang/LangSource/SC_TerminalClient.cpp b/lang/LangSource/SC_TerminalClient.cpp
index 00ff76e..99484aa 100644
--- a/lang/LangSource/SC_TerminalClient.cpp
+++ b/lang/LangSource/SC_TerminalClient.cpp
@@ -411,6 +411,7 @@ void SC_TerminalClient::tick( const boost::system::error_code& error )
        ElapsedTimeToChrono( secs, nextAbsTime );

        if (haveNext) {
+               usleep(700000);
                mTimer.expires_at(nextAbsTime);
                mTimer.async_wait(boost::bind(&SC_TerminalClient::tick, this, _1));
        }

Note that I know basically nothing about SC, but if someone provides pointers as to what and how needs to be tested, I can spend some time on this (working on this for Bela, but a better solution than the above should benefit upstream SC as well).

@giuliomoro
Copy link
Contributor

@giuliomoro giuliomoro commented Aug 5, 2017

A few hours later:

it seems that what is happening is that SC_terminalClient::tick() is being called with error == operation_canceled. It is suggested by the boost documentation linked above that calling cancel() causes the following:

This function forces the completion of any pending asynchronous wait operations against the timer. The handler for each cancelled operation will be invoked with the boost::asio::error::operation_aborted error code.

So, it happens that SC_TerminalClient::tick is invoked with error == boost::system::errc::operation_canceled. If - when invoked as such - it calls mTimer.cancel() again (as in the original code), the latter would - from what I can observe - call the function again with the same error condition, which is what is causing the CPU hog: SC_TerminalClient::tick being called with an operation_canceled error multiple times. Occasionally (precisely when std::chrono::system_clock::now(); <= nextAbsTime), the call mTimer.async_wait(boost::bind(&SC_TerminalClient::tick, this, _1)); would eventually be successful.

I have never used boost before so I am not sure if this behaviour is expected and/or how this is related with QT.
However, the patch below seems to fix it for me.

NOTE: I am actually getting a boost::system::errc::operation_canceled instead of the boost::asio::error::operation_aborted advertised in the docs. Perhaps the version of boost in use is different from the docs?

@@ -398,7 +398,11 @@ extern void ElapsedTimeToChrono(double elapsed, std::chrono::system_clock::time_
  
  void SC_TerminalClient::tick( const boost::system::error_code& error )
  {
 -	mTimer.cancel();
 +	if (error == boost::system::errc::success) {
 +		mTimer.cancel();
 +	} else {
 +		return;
 +	}
  
  	double secs;
  	lock();
giuliomoro added a commit to giuliomoro/supercollider that referenced this issue Aug 5, 2017
…This seems to be the solution for sensestage/supercollider/BelaPlatform#34 and supercollider/supercollider/supercollider#2144 , however I have no explanation as to why this issue only manifests itself when QT is disabled
@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Aug 8, 2017

Anyone with knowledge of boost have thoughts about this one? (Maybe @brianlheim or @scztt ?) This is tempting: a nice small solution to a painful bug we've had for a long time.

@GregBakker
Copy link

@GregBakker GregBakker commented Sep 6, 2017

@GregBakker
Copy link

@GregBakker GregBakker commented Sep 10, 2017

Tried it; removing the 'cancel' alone doesn't work as an alternate fix. This implies that either it's the correct thing to do in this spot (and @giuliomoro's fix is necessary too), or that the 'tick' routine is being queued up more than is needed.

@nhthn nhthn modified the milestones: 3.9, 3.9.x Oct 28, 2017
@artfwo
Copy link
Contributor

@artfwo artfwo commented Feb 24, 2018

@snappizz @GregBakker @vivid-synth what's the status of this bug? is anyone looking into it currently?

@vivid-synth
Copy link
Member

@vivid-synth vivid-synth commented Feb 25, 2018

I don't use sclang much these days so I haven't been working on it, but I'd be very happy for it to be fixed!

@artfwo
Copy link
Contributor

@artfwo artfwo commented Mar 5, 2018

@giuliomoro would you like to create a PR for this repo with your patch? it works wonderfully :)

@patrickdupuis patrickdupuis modified the milestones: 3.9.x, 3.10 May 26, 2018
nhthn pushed a commit to nhthn/supercollider that referenced this issue Jun 9, 2018
…This seems to be the solution for sensestage/supercollider/supercollider#34 and supercollider/supercollider/supercollider#2144 , however I have no explanation as to why this issue only manifests itself when QT is disabled
@nhthn
Copy link
Contributor

@nhthn nhthn commented Jun 9, 2018

@giuliomoro's patch seems correct to me. i've filed it as a PR: #3772

i've also figured out the mystery of why sclang behaves differently with and without Qt. it's quite simple -- right at the bottom of SC_TerminalClient.cpp, in createLanguageClient:

#ifdef SC_QT
	return new QtCollider::LangClient(name);
#else
	return new SC_TerminalClient(name);
#endif

so when SC_QT is on, the tick function being called is completely different code. it's in QtCollider/LanguageClient.cpp:

void LangClient::tick()
{
  double secs;
  lock();
  bool haveNext = tickLocked( &secs );
  unlock();

  flush();

  if( haveNext ) {
    secs -= elapsedTime();
    secs *= 1000;
    int ti = qMax(0, qCeil(secs));
    qcDebugMsg(2, QStringLiteral("next at %1").arg(ti) );
    appClockTimer.start( ti, this );
  } else
	  appClockTimer.stop();
}

turns out appClockTimer has type QBasicTimer, so Qt is rather important to this code, somewhat complicating hopes of unifying the qt and non-qt implementations.

so there's a more fundamental problem of fragmentation of sclang with and without qt that probably should be addressed. (i'm guessing it's a historical artifact of the way QtCollider was originally developed.) but i do think this patch is okay nevertheless.

@brianlheim
Copy link
Member

@brianlheim brianlheim commented Jun 24, 2018

Closed in #3772

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
3.9 Top Priority Issues
Important for 3.9
Linked pull requests

Successfully merging a pull request may close this issue.

None yet