Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High CPU consumption of discovery service and gateway #282

Closed
vvvlc opened this issue Apr 5, 2019 · 25 comments

Comments

@vvvlc
Copy link
Member

commented Apr 5, 2019

Describe the bug
As per screenshot GW and DS have high CPU usage , zowe was in indle state for majorioty of its run-time.
I think last activity of UI was around 2019-04-03 13:06:41.586
However Job log contains a lot of exceptions like:

2019-04-03 13:26:27.814 ERROR 50594700 --- [broadcom.net-14] c.n.e.cluster.ReplicationTaskProcessor   : It seems to be a socket read timeout exception, it will retry later. if it continues to happen and some eureka node occupied all the cpu time, you should set property 'eureka.server.peer-node-read-timeout-ms' to a bigger value

com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Read timed out
	at com.sun.jersey.client.apache4.ApacheHttpClient4Handler.handle(ApacheHttpClient4Handler.java:187) ~[jersey-apache-client4-1.19.1.jar!/:1.19.1]
	at com.netflix.eureka.cluster.DynamicGZIPContentEncodingFilter.handle(DynamicGZIPContentEncodingFilter.java:48) ~[eureka-core-1.9.3.jar!/:1.9.3]
	at com.netflix.discovery.EurekaIdentityHeaderFilter.handle(EurekaIdentityHeaderFilter.java:27) ~[eureka-client-1.9.3.jar!/:1.9.3]
	at com.sun.jersey.api.client.Client.handle(Client.java:652) ~[jersey-client-1.19.1.jar!/:1.19.1]
	at com.sun.jersey.api.client.WebResource.handle(WebResource.java:682) ~[jersey-client-1.19.1.jar!/:1.19.1]
	at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) ~[jersey-client-1.19.1.jar!/:1.19.1]
	at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:570) ~[jersey-client-1.19.1.jar!/:1.19.1]
	at com.netflix.eureka.transport.JerseyReplicationClient.submitBatchUpdates(JerseyReplicationClient.java:116) ~[eureka-core-1.9.3.jar!/:1.9.3]
	at com.netflix.eureka.cluster.ReplicationTaskProcessor.process(ReplicationTaskProcessor.java:80) ~[eureka-core-1.9.3.jar!/:1.9.3]
	at com.netflix.eureka.util.batcher.TaskExecutors$BatchWorkerRunnable.run(TaskExecutors.java:187) [eureka-core-1.9.3.jar!/:1.9.3]
	at java.lang.Thread.run(Thread.java:812) [na:2.9 (10-29-2018)]
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.8.0]
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:127) ~[na:1.8.0]
	at java.net.SocketInputStream.read(SocketInputStream.java:182) ~[na:1.8.0]
	at java.net.SocketInputStream.read(SocketInputStream.java:152) ~[na:1.8.0]
	at com.ibm.jsse2.b.a(b.java:262) ~[na:8.0 build_20181010]
	at com.ibm.jsse2.b.a(b.java:33) ~[na:8.0 build_20181010]
	at com.ibm.jsse2.av.a(av.java:579) ~[na:8.0 build_20181010]
	at com.ibm.jsse2.av.i(av.java:574) ~[na:8.0 build_20181010]
	at com.ibm.jsse2.av.a(av.java:280) ~[na:8.0 build_20181010]
	at com.ibm.jsse2.av.startHandshake(av.java:431) ~[na:8.0 build_20181010]
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396) ~[httpclient-4.5.6.jar!/:4.5.6]
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355) ~[httpclient-4.5.6.jar!/:4.5.6]
	at com.netflix.discovery.shared.transport.jersey.SSLSocketFactoryAdapter.connectSocket(SSLSocketFactoryAdapter.java:59) ~[eureka-client-1.9.3.jar!/:1.9.3]
	at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:414) ~[httpclient-4.5.6.jar!/:4.5.6]
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180) ~[httpclient-4.5.6.jar!/:4.5.6]
	at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144) ~[httpclient-4.5.6.jar!/:4.5.6]
	at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:134) ~[httpclient-4.5.6.jar!/:4.5.6]
	at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610) ~[httpclient-4.5.6.jar!/:4.5.6]
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445) ~[httpclient-4.5.6.jar!/:4.5.6]
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835) ~[httpclient-4.5.6.jar!/:4.5.6]
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:118) ~[httpclient-4.5.6.jar!/:4.5.6]
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar!/:4.5.6]
	at com.sun.jersey.client.apache4.ApacheHttpClient4Handler.handle(ApacheHttpClient4Handler.java:173) ~[jersey-apache-client4-1.19.1.jar!/:1.19.1]
	... 10 common frames omitted

Expected behavior
Significantly lower CPU-Time of ZOWEMLG (Gateway) and ZOWEMLD (Discovery service) in range of minutes like ZOWEMLC (Catalog), but not hours.

Screenshots
Snipaste_2019-04-04_16-17-21

Logs
JOB04854.zip

Details
zowe 1.0.1

Additional context

@vvvlc vvvlc added the bug label Apr 5, 2019

@plavjanik plavjanik changed the title hi cpu consuption of discovery service a gateway High CPU consumption of discovery service and gateway Apr 8, 2019

@plavjanik

This comment has been minimized.

Copy link
Collaborator

commented Apr 8, 2019

@vvvlc Thank you Vitek for the issue report. Can you please attach the API mediation start up scripts as well? They contain the configuration parameters for gateway and discovery service. It is possible that problem is caused by invalid hostnames that are causing repetitive attempts for connection to Eureka

@vvvlc

This comment has been minimized.

Copy link
Member Author

commented Apr 8, 2019

here are the scripts files from zowe/1.0.1/api-mediation/scripts
scripts.zip
at first glance I think hostnames are fine.
V

@JakubSkrivanek

This comment has been minimized.

Copy link

commented Apr 10, 2019

Hi, I can report exactly the same error for the GA 1.0.0. RC2 version.

ERROR 16842944 ---broadcom.net-6. c.n.e.cluster.ReplicationTaskProcessor : It seems to be a socket read timeout exception, it will retry later. if it continues to
happen and some eureka node occupied all the cpu time, you should set property 'eureka.server.peer-node-read-timeout-ms' to a bigger value.
J.

@plavjanik

This comment has been minimized.

Copy link
Collaborator

commented Apr 25, 2019

A similar problem has been reported as #293.

@Joe-Winchester

This comment has been minimized.

Copy link

commented May 3, 2019

@wjs4572 - on slow Java systems suggested fix is to increase a timeout value by updating
/api-mediation/scripts/api-mediation-start.sh to include a
-Djersey.config.client.readTimeout=90000 \
-Deureka.server.peer-node-read-timeout-ms=3000 \
statement.

Also increase the value that the API uses from 30000 default to 90000 /
-Dapiml.gateway.timeoutMillis=90000 \

@vvvlc

This comment has been minimized.

Copy link
Member Author

commented May 3, 2019

When you hit HIGH CPU again

run a uss command
kill -QUIT <pid>
then go to joblog you should see something like:

10.26.21 STC15923 ---- FRIDAY,    03 MAY 2019 ----                                                  
10.26.21 STC15923  JVMDUMP039I Processing dump event "user", detail "" at 2019/05/03  983           
   983             10:26:21 - please wait.                                                          
   983                                                                                              
10.26.21 STC15923  JVMDUMP032I JVM requested System dump using 'MSMSERV.JVM.CSMTP60.D1905  984      
   984             03.T102621.X&DS' in response to an event                                         
   984                                                                                              
10.26.22 STC15923  IGD01008I ALLOCATION SET TO SCWRKD  M3600                                        
10.26.22 STC15923  IGD01008I THIS DATASET WILL BE DELETED IN 3 DAYS.                                
10.26.22 STC15923  IGD01010I ALLOCATION SET TO SGWRKD STORAGE GROUP                                 
10.26.22 STC15923  IGD101I SMS ALLOCATED TO DDNAME (SYS08838)  988                                  
   988                     DSN (MSMSERV.JVM.CSMTP60.D190503.T102621.X001    )                       
   988                     STORCLAS (SCWRKD) MGMTCLAS (MCWRKD) DATACLAS (DCWRKD)                    
   988                     VOL SER NOS= WRKD11                                                      
10.26.48 STC15923  IEA827I COMPLETE TRANSACTION DUMP WRITTEN TO 00001    DATASETS  130              
   130             STARTING FROM 'MSMSERV.JVM.CSMTP60.D190503.T102621.X001'                         
   130             PLEASE COMBINE USING COPYDUMP PRIOR TO VIEWING IN IPCS                           
10.26.48 STC15923  IGD104I MSMSERV.JVM.CSMTP60.D190503.T102621.X001     RETAINED,  DDNAME=SYS08838  

Dataset with a transaction dump is MSMSERV.JVM.CSMTP60.D190503.T102621.X001

In some situations a uss file javacore.xxxx.txt with a stack trace is created as well. If you see a similar message to

03.35.42 STC30439  BPXM023I (MSMSERV)  910                                     
   910             JVMDUMP032I JVM requested Java dump using '/z/msmserv/mfaas/i
   910             este/GW2/javacore.20180830.033530.84215348.0002.txt' in respo
   910             an event                                                    
   910

and /.../javacore.20180830.033530.84215348.0002.txt gets created provide us with javacore.xxxx.txt instead of MSMSERV.JVM.CSMTP60.D190503.T102621.X001. javacore.xxxx.txt file is significantly shorter in comparison to transaction dump dataset.

NOTE: in case no dump is created add these variables in run-zowe.sh

export IBM_JAVA_ZOS_TDUMP=ALL                                                    .
export JAVA_DUMP_TDUMP_PATTERN="YOUR.PREFIX.DUMP.X&DS"
@vvvlc

This comment has been minimized.

Copy link
Member Author

commented May 3, 2019

Instruction for support team

how to analyze javacore file

javacore.xxxx.txt is a text file in EBCDIC that contains stack trace for all the threads including a JNI native stacktrace.

how to analyze dump

  1. copy dump dataset VLCVI01.DUMP.DATASET (known as a TRANSACTION DUMP) to a uss file dump
    cp "//'VLCVI01.DUMP.DATASET'" dump
  2. extract a java dump from TRANSACTION DUMP using jextract dump -v, this creates a dump.zip
  3. run dump viewver jdmpview -J-Xmx768m -zip dump.zip
  4. run info thread in jdmpview to display info about current thread java/native stack trace

when there is no current thread use
info thread *

Example of expected outputs

native threads for address space # 0
 process id: 0xe22a7b0

  thread id: 0xe712700
   registers:
    PSW    = 0x070c1000_8389c67e   R0     = 0x00000000_00000000   R1     = 0x00000000_026a3000   R2     = 0x00000000_00000031
    R3     = 0x00000000_7ee41e70   R4     = 0x00000000_6a3f6fc0   R5     = 0x00000000_00000000   R6     = 0x00000000_0389b366
    R7     = 0x00000000_020d2480   R8     = 0x00000000_7ee42330   R9     = 0x00000000_8389f1e8   R10    = 0x00000000_7ee41e70
    R11    = 0x00000000_00000031   R12    = 0x00000000_7ee43e60   R13    = 0x00000000_7ee42730   R14    = 0x00000000_00000200
    R15    = 0x00000000_00000000
   stack sections:
    0x441f2000 to 0x441fd000 (length 0xb000)
    0x43c3b000 to 0x43c45000 (length 0xa000)
    0x430ea000 to 0x430ed000 (length 0x3000)

Native stack trace
    bp: 0x43c40a68 pc: 0x06afc624 ExtraSymbolsModule::CEEHDSP+0xd2c
    bp: 0x43c3feb8 pc: 0x06b0a49e ExtraSymbolsModule::CEEHRNUH+0x96
    bp: 0x441f9b40 pc: 0x4509dd36 /u/maint/ca/msmserv/r41/msmruntime/tomcat/lib/libMCSParserFacility_Jni.so::CMCSParser::serializeMCS()+0xa6
    bp: 0x441fa420 pc: 0x4509d952 /u/maint/ca/msmserv/r41/msmruntime/tomcat/lib/libMCSParserFacility_Jni.so::CMCSParser::dispatchCurrentMcs(unsigned int)+0x8a
    bp: 0x441fa4a0 pc: 0x4509d4dc /u/maint/ca/msmserv/r41/msmruntime/tomcat/lib/libMCSParserFacility_Jni.so::CMCSParser::parseStream(iInputStream*)+0xccc
    bp: 0x441fa6a0 pc: 0x4509f014 /u/maint/ca/msmserv/r41/msmruntime/tomcat/lib/libMCSParserFacility_Jni.so::CMCSParser::Parse(iInputStream*,iMCSParserInterface*)+0x4c4
    bp: 0x441fb0c0 pc: 0x450b8c1e /u/maint/ca/msmserv/r41/msmruntime/tomcat/lib/libMCSParserFacility_Jni.so::Java_com_ca_mf20_sc_mcsentry_scanner_MCSScannerDriver_1Jni_native_1parse+0x18e

Java stacktrace
     bp: 0x43b8176c    method: int com/ca/mf20/sc/mcsentry/scanner/MCSScannerDriver_Jni.native_parse(String, int, int, int, int)    filename:line: MCSScannerDriver_Jni.java:<data unavailable>
     bp: 0x43b817bc    method: com.ca.mf20.sc.mcsentry.MCSFile com/ca/mf20/sc/mcsentry/scanner/MCSScannerDriver_Jni.scan(String, int, int)    filename:line: MCSScannerDriver_Jni.java:106
     bp: 0x43b817d4    method: com.ca.mf20.sc.mcsentry.MCSFile com/ca/mf20/sc/mcsentry/scanner/MCSScannerDriver_Jni.scan(String, int)    filename:line: MCSScannerDriver_Jni.java:173
     bp: 0x43b817e8    method: com.ca.mf20.sc.mcsentry.MCSFile com/ca/mf20/sc/mcsentry/scanner/MCSScanner.scan(String)    filename:line: MCSScanner.java:55
     bp: 0x43b81818    method: com.ca.mf20.sc.mcsentry.MCSFile com/ca/mf20/sc/commons/SC_CommonProcedures.scanFile(String)    filename:line: SC_CommonProcedures.java:487
     bp: 0x43b81834    method: java.util.List com/ca/mf20/sc/dataprovider/SC_CachedMaintenanceInfo.getHoldData()    filename:line: SC_CachedMaintenanceInfo.java:495
     bp: 0x43b81864    method: java.util.List com/ca/mf20/sc/dataprovider/SC_DataProvider.getHoldDataForPackageIDs(java.util.List)    filename:line: SC_DataProvider.java:2362
     bp: 0x43b81878    method: java.util.List com/ca/mf20/sc/commons/SC_SecureDataProvider.getHoldDataForPackageIDs(java.util.List)    filename:line: SC_SecureDataProvider.java:2316

Note other threads are displayed via:

> info proc

        address space # 0

         Native thread IDs for current process:
          0xe6f6d00 0xe6f7e00 0xe6f8f00 0xe6fa000 0xe6fb100 0xe6fc200 0xe6fd300
          0xe6fe400 0xe6ff500 0xe700600 0xe701700 0xe702800 0xe703900 0xe704a00
          0xe705b00 0xe706c00 0xe707d00 0xe708e00 0xe709f00 0xe70b000 0xe70c100
          0xe70d200 0xe70f400 0xe710500 0xe711600 0xe712700 0xe70e300 0xe714900
          0xe715a00 0xe716b00 0xe717c00 0xe718d00 0xe713800 0xe71e200 0xe71d100
          0xe71c000 0xe71af00 0xe719e00

         Command line arguments used for current process:

info thread 0xe6f6d00

info class com/ca/mf20/sc/mcsentry/scanner/MCSScannerDriver_Jni   <-number of instances
Runtime #1:

name = com/ca/mf20/sc/mcsentry/scanner/MCSScannerDriver_Jni

        ID = 0x43a419e0    superID = 0xed75a20
        classLoader = 0x40ddb710    modifers: public synchronized

        number of instances:     3
        total size of instances: 72 bytes

Inheritance chain....

        java/lang/Object
           com/ca/mf20/sc/mcsentry/scanner/MCSScannerDriver_Jni

Fields......

          static fields for "com/ca/mf20/sc/mcsentry/scanner/MCSScannerDriver_Jni"
            private static com.ca.mf20.logging.MSMLog logger = <object> @ 0x132995f8

          non-static fields for "com/ca/mf20/sc/mcsentry/scanner/MCSScannerDriver_Jni"
            private int driverId
            private java.util.concurrent.locks.Lock driverLock
            private com.ca.mf20.sc.mcsentry.BufferedResultsReader resultsReader


Methods......

Bytecode range(s): 44b6c08c -- 44b6c0ac:  public void <init>(int)
Bytecode range(s): :  private native int native_intialize(int, int)
Bytecode range(s): :  private native int native_create_parser(int)
Bytecode range(s): :  private native int native_free_parser(int)
Bytecode range(s): :  private native java.nio.ByteBuffer native_get_shared_buffer(int)
Bytecode range(s): :  private native java.nio.ByteBuffer native_get_shared_msg_buffer(int)
Bytecode range(s): :  private native int native_clenup_parser(int, java.nio.ByteBuffer)
Bytecode range(s): :  private native int native_parse(String, int, int, int, int)
Bytecode range(s): :  private native int native_finalize()
Bytecode range(s): 44b6c184 -- 44b6c1c1:  private com.ca.mf20.messages.intf.MSMMessageLog getMessages()
Bytecode range(s): 44b6c1e0 -- 44b6c364:  public com.ca.mf20.sc.mcsentry.MCSFile scan(String, int, int)
Bytecode range(s): 44b6c3b0 -- 44b6c3b8:  public com.ca.mf20.sc.mcsentry.MCSFile scan(String, int)
Bytecode range(s): 44b6c3d4 -- 44b6c462:  public int cleanupScanner(java.nio.ByteBuffer)
Bytecode range(s): 44b6c4a0 -- 44b6c521:  public int freeParser()
Bytecode range(s): 44b6c560 -- 44b6c5ed:  public com.ca.mf20.sc.mcsentry.scanner.MCSScannerDriver createScanner(int)
Bytecode range(s): 44b6c628 -- 44b6c652:  private int determineDebugMode()
Bytecode range(s): 44b6c668 -- 44b6c6d2:  public void initializePool(int)
Bytecode range(s): 44b6c70c -- 44b6c75c:  public void cleanupPool()
Bytecode range(s): 44b6c798 -- 44b6c7a7:  static void <clinit>()

@wjs4572

This comment has been minimized.

Copy link

commented May 3, 2019

Just a note, in version 1.0.1 the start scripts for api-mediation are separated:

from run-zowe.sh

cd `dirname $0`/../../zlux-app-server/bin && ./nodeServer.sh --allowInvalidTLSProxy=true &
`dirname $0`/../../api-mediation/scripts/api-mediation-start-discovery.sh
`dirname $0`/../../api-mediation/scripts/api-mediation-start-catalog.sh
`dirname $0`/../../api-mediation/scripts/api-mediation-start-gateway.sh
`dirname $0`/../../explorer-jobs-api/scripts/jobs-api-server-start.sh
`dirname $0`/../../explorer-data-sets-api/scripts/data-sets-api-server-start.sh
`dirname $0`/../../jes_explorer/scripts/start-explorer-jes-ui-server.sh
`dirname $0`/../../mvs_explorer/scripts/start-explorer-mvs-ui-server.sh
`dirname $0`/../../uss_explorer/scripts/start-explorer-uss-ui-server.sh

My assumption is the above Java change:

    -Djersey.config.client.readTimeout=90000 \
    -Deureka.server.peer-node-read-timeout-ms=3000 \
    -Dapiml.gateway.timeoutMillis=90000 \

Will need to be done for each of these scripts individually.

api-mediation-start-catalog.sh
api-mediation-start-discovery.sh
api-mediation-start-gateway.sh

@Joe-Winchester

This comment has been minimized.

Copy link

commented May 14, 2019

@vvvlc , @plavjanik . This is the log from @wjs4572 for the occurrence when there is runaway CPU.
Is there anything in here that helps to resolve the problem, or would you like me to schedule a call with a Java team here in Hursley ? Prob want to eliminate whether it's an API ML code fix or one to Java itself before we did that though.

javacore.20190503.134348.33620468.0002.txt

@vvvlc

This comment has been minimized.

Copy link
Member Author

commented May 14, 2019

Hi @Joe-Winchester,
First two threads main and JIT Compilation Thread-0 are consuming 59.sec of CPU time.
Before meeting java team I would suggest rerun it with disabled JIT (https://www.ibm.com/support/knowledgecenter/en/SSYKE2_8.0.0/com.ibm.java.vm.80.doc/docs/jit_disable.html) ie add -Xint options on command line.

java -Xint -Xms16m -Xmx512m ...

Thread details

3XMTHREADINFO      "main" J9VMThread:0x0000000014094600, omrthread_t:0x000000500865C0C0, java/lang/Thread:0x0000000080046910, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1, isDaemon:false)
3XMTHREADINFO1            (native thread ID:0x13EA1000, native priority:0x5, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000001)
3XMCPUTIME               CPU usage total: 59.608000000 secs, current category="Application"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=8059744 (0x7AFB60)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at org/springframework/cglib/core/AbstractClassGenerator.generate(AbstractClassGenerator.java:339)
4XESTACKTRACE                at org/springframework/cglib/proxy/Enhancer.generate(Enhancer.java:492)
4XESTACKTRACE                at org/springframework/cglib/core/AbstractClassGenerator$ClassLoaderData$3.apply(AbstractClassGenerator.java:93)
...
org/springframework/boot/SpringApplication.prepareEnvironment(SpringApplication.java:361)
4XESTACKTRACE                at org/springframework/boot/SpringApplication.run(SpringApplication.java:320)
4XESTACKTRACE                at com/ca/mfaas/apicatalog/ApiCatalogApplication.main(ApiCatalogApplication.java:44)
4XESTACKTRACE                at sun/reflect/NativeMethodAccessorImpl.invoke0(Native Method)
4XESTACKTRACE                at sun/reflect/NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:90)
4XESTACKTRACE                at sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
4XESTACKTRACE                at java/lang/reflect/Method.invoke(Method.java:508)
4XESTACKTRACE                at org/springframework/boot/loader/MainMethodRunner.run(MainMethodRunner.java:48)
4XESTACKTRACE                at org/springframework/boot/loader/Launcher.launch(Launcher.java:87)
4XESTACKTRACE                at org/springframework/boot/loader/Launcher.launch(Launcher.java:50)
4XESTACKTRACE                at org/springframework/boot/loader/JarLauncher.main(JarLauncher.java:51)
3XMTHREADINFO3           No native callstack available on this platform
3XMTHREADINFO      "JIT Compilation Thread-0" J9VMThread:0x0000000014096B00, omrthread_t:0x000000500865CA10, java/lang/Thread:0x0000000080046AA8, state:R, prio=10
3XMJAVALTHREAD            (java/lang/Thread getId:0x4, isDaemon:true)
3XMTHREADINFO1            (native thread ID:0x13ED4000, native priority:0xB, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000001)
3XMCPUTIME               CPU usage total: 57.606000000 secs, current category="JIT"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3           No Java callstack associated with this thread
3XMTHREADINFO3           No native callstack available on this platform

We are trying to recreate this problem in house but no luck.
Vitek

@jandadav

This comment has been minimized.

Copy link
Collaborator

commented May 15, 2019

@Joe-Winchester
closer inspection of the dump reveals the following:
The dump was taken around 5 minutes after the catalog has started
The application was still starting, as the number of threads is low compared to running catalog.

Apart from Vitek's suggestion, i recommend the following:
Leave the services running for longer period of time (roughly hours). Our internal testing shows idle apiml uses minutes of cpu over 24 hours. There is a spike on startup but then the usage drops off.

@vvvlc

This comment has been minimized.

Copy link
Member Author

commented May 15, 2019

We were able to recreate it on our local machine, @jandadav will review logs.
java coredumps and logs are in zowezz.zip
zowezz.zip

@jandadav

This comment has been minimized.

Copy link
Collaborator

commented May 16, 2019

We have replicated the high CPU load on our system and identified the cause of the usage as Spring's Actuator trying to measure delays. We have managed to fix the cpu usage with adding a system property to the command line parameters for each of apiml services.

I would like to kindly ask if you are still facing the high CPU load:

  1. Generate the java dump as described before and share the data with us
  2. Follow the instructions below and report back if it fixes the issue for you

Possible fix for the issue we replicated:

For each service that is experiencing the high CPU time (gateway, catalog, discovery) please add the following to the launch parameters:
-DLatencyUtils.useActualTime=false \

Example for Api Catalog service:

java -Xms16m -Xmx512m -Dibm.serversocket.recover=true -Dfile.encoding=UTF-8 \
    -Djava.io.tmpdir=/tmp -Xquickstart -Denvironment.hostname=xxxxxxx -Denvironment.port=7552  \
    -Denvironment.discoveryLocations=xxxxxxxx:7553/eureka/ -Denvironment.ipAddress=xxxxxx \
    -Denvironment.preferIpAddress=true -Denvironment.gatewayHostname=xxxxxx \
    -Denvironment.eurekaUserId=eureka \
    -Denvironment.eurekaPassword=password \
    -Dapiml.security.zosmfServiceId=zosmf \
    -Dapiml.security.verifySslCertificatesOfServices=false \
    -Dspring.profiles.include= \
    -Dserver.address=0.0.0.0 \
    -Dserver.ssl.enabled=true \
    -Dserver.ssl.keyStore=$DIR/../keystore/localhost/localhost.keystore.p12 \
    -Dserver.ssl.keyStoreType=PKCS12 \
    -Dserver.ssl.keyStorePassword=password \
    -Dserver.ssl.keyAlias=localhost \
    -Dserver.ssl.keyPassword=password \
    -Dserver.ssl.trustStore=$DIR/../keystore/localhost/localhost.truststore.p12 \
    -Dserver.ssl.trustStoreType=PKCS12 \
    -Dserver.ssl.trustStorePassword=password \
    -Djava.protocol.handler.pkgs=com.ibm.crypto.provider \
    -DLatencyUtils.useActualTime=false \
    -jar $DIR/../api-catalog-services.jar &

you can find the .sh files in the following location:
<Zowe Root Dir>/api-mediation/scripts/<affected service>.sh
where expands to whichever service is suffering from the high CPU usage

@vvvlc

This comment has been minimized.

Copy link
Member Author

commented May 17, 2019

Just to expand previous update

you can find the .sh files in the following location:
<Zowe Root Dir>/api-mediation/scripts/<affected service>.sh
where expands to whichever service is suffering from the high CPU usage

We would like you to modify these files

  • ~/zowe/1.0.1/api-mediation/scripts/api-mediation-start-catalog.sh
  • ~/zowe/1.0.1/api-mediation/scripts/api-mediation-start-discovery.sh
  • ~/zowe/1.0.1/api-mediation/scripts/api-mediation-start-gateway.sh

to add -DLatencyUtils.useActualTime=false \ .

NOTE: you can use eg ISPF 3.17 option to navigate and edit uss files, they are in EBCDIC.

@petr-galik petr-galik assigned jandadav and unassigned petr-galik May 17, 2019

@wjs4572

This comment has been minimized.

Copy link

commented May 17, 2019

This does not seem to help. I will send the dump to @Joe-Winchester.

@jandadav

This comment has been minimized.

Copy link
Collaborator

commented May 21, 2019

@Joe-Winchester @wjs4572
Can you please attach the dump to this thread or share it with me? Please get in touch with me if you require assistance.

@Joe-Winchester

This comment has been minimized.

Copy link

commented May 21, 2019

@jandadav - This is the Java core from William.

javacore.20190517.131513.16843724.0002.txt

@jandadav

This comment has been minimized.

Copy link
Collaborator

commented May 21, 2019

@Joe-Winchester
Similar feedback from this dump:
The dump was taken around 5 minutes after the catalog has started

1CISTARTTIME   JVM start time: 2019/05/17 at 13:11:39:837
1TIDATETIME    Date: 2019/05/17 at 13:15:57:204

The application was still starting, as the number of threads is low compared to running catalog.
Leave the services running for longer period of time (roughly hours). Our internal testing shows idle apiml uses minutes of cpu over 24 hours. There is a spike on startup but then the usage drops off.

1XMTHDSUMMARY  Threads CPU Usage Summary
NULL           =========================
1XMTHDCATEGORY All JVM attached threads: 169.848000000 secs
1XMTHDCATEGORY |
2XMTHDCATEGORY +--System-JVM: 93.789000000 secs
2XMTHDCATEGORY |  |
3XMTHDCATEGORY |  +--GC: 0.131000000 secs
2XMTHDCATEGORY |  |
3XMTHDCATEGORY |  +--JIT: 58.978000000 secs
1XMTHDCATEGORY |
2XMTHDCATEGORY +--Application: 76.059000000 secs

JIT is enabled and eats almost half of the cpu time.

Conclusion

Since the dump was generated during service startup, high CPU usage is not visible.
Please leave the services running for longer period of time (roughly hours) and then when you see high cpu usage, generate the dump.

@wjs4572

This comment has been minimized.

Copy link

commented May 21, 2019

I am not sure this is going to be feasible, as the CPU usage is over 90% during this period and the TN3270 services are completely locked up. This prevents anyone else from working on the system.

@jandadav

This comment has been minimized.

Copy link
Collaborator

commented May 24, 2019

If you are facing high CPU during startup, i can suggest the following:

Each of the following suggestions apply for apiml service startup script, as described above

you can find the .sh files in the following location:
/api-mediation/scripts/.sh

We would like you to modify these files

~/zowe/1.0.1/api-mediation/scripts/api-mediation-start-catalog.sh
~/zowe/1.0.1/api-mediation/scripts/api-mediation-start-discovery.sh
~/zowe/1.0.1/api-mediation/scripts/api-mediation-start-gateway.sh

NOTE: you can use eg ISPF 3.17 option to navigate and edit uss files, they are in EBCDIC.

Advise to take backup of the scripts so you can restore any changes

1) Disable JIT and AOT compilers

with goal to identify the compiler's impact on the startup
for each service:

  • Remove -Xquickstart option from your command line.
  • Add the -Xint command-line option to disable the JIT and AOT compilers.
  • Start the services and compare the cpu usage and system responsiveness.

Does this configuration make the startup faster or slower? Is the 3270 during service startup usable? What is the CPU usage with this configuration?

2) Configure AOT compiler to enable cache and improve startup performance

JVM startup time improvement with a populated cache is typically between 10% and 40% faster compared with a system not using class data sharing, depending on the operating system and the number of classes loaded.
for each service:

  • Revert previous changes to the scripts
  • Add the -Xshareclasses command-line option to enable cache. The cache's default location on z/OS: This information is stored in /tmp/javasharedresources
  • Start the services and allow them to start fully. The cache should be created in /tmp/javasharedresources and start to get populated.
  • Subsequent starts should see performance improvements.

Again please validate if this configuration makes the subsequent startups faster or slower. Is the 3270 during service startup usable? What is the CPU usage with this configuration?

3) Isolate only Api Mediation Layer startup

  • Edit the .sh file that starts Zowe components
    ~/zowe/1.0.1/scripts/internal/run-zowe.sh
  • Find the code block and comment out the following lines
# cd `dirname $0`/../../zlux-app-server/bin && ./nodeServer.sh --allowInvalidTLSProxy=true &
`dirname $0`/../../api-mediation/scripts/api-mediation-start-discovery.sh
`dirname $0`/../../api-mediation/scripts/api-mediation-start-catalog.sh
`dirname $0`/../../api-mediation/scripts/api-mediation-start-gateway.sh
# `dirname $0`/../../explorer-jobs-api/scripts/jobs-api-server-start.sh
# `dirname $0`/../../explorer-data-sets-api/scripts/data-sets-api-server-start.sh
# `dirname $0`/../../jes_explorer/scripts/start-explorer-jes-ui-server.sh
# `dirname $0`/../../mvs_explorer/scripts/start-explorer-mvs-ui-server.sh
# `dirname $0`/../../uss_explorer/scripts/start-explorer-uss-ui-server.sh
  • Start the Zowe instance.

Please observe the same metrics as requested for the earlier steps. This should tell us where the slowdownd of your system is commng from. Is it Apiml that is stressing the CPU or is is combination of all Zowe services starting at the same time?

@jandadav

This comment has been minimized.

Copy link
Collaborator

commented Jun 10, 2019

Hi all,
has any progress been made on this issue? Have you been successful to diagnose the source of the CPU load or generate the diagnostic data?

ilkinabdullayev pushed a commit that referenced this issue Jun 26, 2019

Fix of high cpu usage occurence replicated in Broadcom based on GIT i…
…ssue #282

Solution is to switch LatencyUtils.useActualTime off, as described in javadoc of TimeServices
class

/**
 * Provide an API for time-related service, such as getting the current time and waiting for
 * a given period of time. By default, these services are provided by actual time services
 * in the JDK (i.e. System.nanoTime(), System.currentTimeMillis(), Thread.sleep(), and
 * java.util.concurrent.locks.LockSupport.parkNanos()). However, if the property
 * LatencyUtils.useActualTime is set to "false", TimeServers will only move the notion
 * of time in response to calls to the #setCurrentTime() method.
 *
 */

Signed-off-by: janda06 <david.janda@broadcom.com>
@jandadav

This comment has been minimized.

Copy link
Collaborator

commented Jun 27, 2019

Please let us know if any progress has been made in identifying the cause of the issue.
Otherwise we aim to close this issue within one week.

@wjs4572

This comment has been minimized.

Copy link

commented Jun 27, 2019

We were informed that Zowe is not rated to run on the zPDT platform and are not pursuing this further at this point. Please let us know when Zowe is successfully running on zPDT and we reevaluate at that time.

Thank you,

@Joe-Winchester

This comment has been minimized.

Copy link

commented Jun 28, 2019

Hi @wjs4572 . We do have Zowe successfully running on zD&T platforms, athough the Zowe components that use Java (the API mediation layer) encounter issues where because zD&T is emulation (Java) on top of emulation (z/OS instructions interpreted on x86) the performance is not production speed. We are aware of this and hoping to publish diagnostic and tuning information to help with getting optimal configuration of zD&T and also look at longer term ways to address the problem. I'll keep you informed as we make progress on this work. Tks

ilkinabdullayev pushed a commit that referenced this issue Jul 1, 2019

Fix of high cpu usage occurence replicated in Broadcom based on GIT i…
…ssue #282

Solution is to switch LatencyUtils.useActualTime off, as described in javadoc of TimeServices
class

/**
 * Provide an API for time-related service, such as getting the current time and waiting for
 * a given period of time. By default, these services are provided by actual time services
 * in the JDK (i.e. System.nanoTime(), System.currentTimeMillis(), Thread.sleep(), and
 * java.util.concurrent.locks.LockSupport.parkNanos()). However, if the property
 * LatencyUtils.useActualTime is set to "false", TimeServers will only move the notion
 * of time in response to calls to the #setCurrentTime() method.
 *
 */

Signed-off-by: janda06 <david.janda@broadcom.com>

ilkinabdullayev pushed a commit that referenced this issue Jul 8, 2019

Fix of high cpu usage occurence replicated in Broadcom based on GIT i…
…ssue #282

Solution is to switch LatencyUtils.useActualTime off, as described in javadoc of TimeServices
class

/**
 * Provide an API for time-related service, such as getting the current time and waiting for
 * a given period of time. By default, these services are provided by actual time services
 * in the JDK (i.e. System.nanoTime(), System.currentTimeMillis(), Thread.sleep(), and
 * java.util.concurrent.locks.LockSupport.parkNanos()). However, if the property
 * LatencyUtils.useActualTime is set to "false", TimeServers will only move the notion
 * of time in response to calls to the #setCurrentTime() method.
 *
 */

Signed-off-by: janda06 <david.janda@broadcom.com>

arxioly added a commit that referenced this issue Jul 8, 2019

Fix of high cpu usage occurence replicated in Broadcom based on GIT i…
…ssue #282 (#326)

Fix of high cpu usage occurrence replicated in Broadcom based on GIT issue #282
Solution is to switch LatencyUtils.useActualTime off, as described in javadoc of TimeServices
class. Moved gateway-common to apiml-common. Added check rule for method names
suppressed security module as there are some classes that triggered the check and this module is soon to be decommissioned.

Signed-off-by: janda06 <david.janda@broadcom.com>

@jandadav jandadav closed this Jul 10, 2019

@armstro

This comment has been minimized.

Copy link

commented Jul 17, 2019

Is this include in Zowe 1.4? Being "closed" I assume so but I don't see an indication for sure - I'd like to be sure the customer is aware of this if pending delivery

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants
You can’t perform that action at this time.