
CPU usage of searchnode suddenly becomes very high and memory usage keeps growing #11245

luoyetx opened this issue Nov 8, 2019 · 13 comments


@luoyetx luoyetx commented Nov 8, 2019

Some metrics retrieved from *:19092/metrics/v1/values:

[image: CPU usage of the search node]

The search service is not affected by this problem.

I can email more details if that would help solve the problem.


@jobergum jobergum commented Nov 8, 2019

Could it be caused by a change in the query pattern(s)? E.g. more traffic, recalling more documents per query, or different ranking profiles? It might also be that this is triggered by maintenance jobs which have been configured with too high a thread count; see https://docs.vespa.ai/documentation/proton.html#proton-maintenance-jobs

These jobs are pretty verbose (vespa-logfmt -l event), so you can correlate the log messages with the start of the CPU increase.
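
For example, something like this will list job-related events around the time the CPU started climbing; the grep pattern below is only illustrative, adjust it to the job names you actually see in the log:

$ vespa-logfmt -l event | grep -iE 'flush|compact|prune'   # illustrative filter, not a fixed job list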

You can check what's using CPU on a given node running searchnode:

$ vespa-sentinel-cmd list |grep searchnode
searchnode state=RUNNING mode=AUTO pid=X  exitstatus=0 id="content/search/cluster.content/0"
$ perf record --pid=X
$ perf report
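
If you only need a short sample, a call-graph recording over a fixed window also works; these are standard perf flags, nothing Vespa-specific:

$ perf record -g --pid=X -- sleep 30   # sample with call graphs for 30 seconds
$ perf report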

Having the proton configuration would also help in diagnosing the CPU usage:

$ vespa-configproxy-cmd -m cache |grep proton
vespa.config.search.core.proton,content/search/cluster.content/0,2,ac35dc8252564a136e28d48e37e35efa
$ vespa-get-config -n vespa.config.search.core.proton -i content/search/cluster.content/0 > proton.cfg 

@luoyetx luoyetx commented Nov 9, 2019

It seems the wand operation costs a lot of CPU, but this feature has only been in use for a few days. When I rewrite the query without wand, the CPU usage does not change (though the query latency goes down), and the memory usage is still growing.
[screenshot: searchnode]


@luoyetx luoyetx commented Nov 9, 2019

Proton config:

basedir "."
ptport 8003
rpcport 8004
httpport 0
clustername ""
partition 0
distributionkey -1
numsearcherthreads 64
numthreadspersearch 1
numsummarythreads 16
stoponioerrors false
flush.maxconcurrent 2
flush.idleinterval 10
flush.strategy "MEMORY"
flush.memory.maxmemory 4294967296
flush.memory.diskbloatfactor 0.2
flush.memory.maxtlssize 21474836480
flush.memory.each.maxmemory 1073741824
flush.memory.each.diskbloatfactor 0.2
flush.memory.maxage.time 86400
flush.memory.conservative.memorylimitfactor 0.5
flush.memory.conservative.disklimitfactor 0.5
flush.memory.conservative.lowwatermarkfactor 0.9
flush.preparerestart.replaycost 8
flush.preparerestart.replayoperationcost 3000
flush.preparerestart.writecost 1
indexing.write.io "DIRECTIO"
indexing.read.io "DIRECTIO"
indexing.threads 1
indexing.tasklimit 1000
indexing.semiunboundtasklimit 40000
index.warmup.time 0
index.warmup.unpack false
index.maxflushed 2
index.maxflushedretired 20
index.cache.size 0
attribute.write.io "DIRECTIO"
search.memory.limiter.maxthreads 0
search.memory.limiter.mincoverage 1
search.memory.limiter.minhits 1000000
search.mmap.advise "NORMAL"
grouping.sessionmanager.maxentries 500
grouping.sessionmanager.pruning.interval 1
distribution.redundancy 1
distribution.searchablecopies 1
summary.write.io "DIRECTIO"
summary.read.io "MMAP"
summary.read.mmap.advise "NORMAL"
summary.cache.maxbytes -5
summary.cache.allowvisitcaching true
summary.cache.initialentries 0
summary.cache.compression.type "LZ4"
summary.cache.compression.level 6
summary.cache.update_strategy "INVALIDATE"
summary.log.compact.compression.type "ZSTD"
summary.log.compact.compression.level 9
summary.log.chunk.maxbytes 65536
summary.log.chunk.compression.type "ZSTD"
summary.log.chunk.compression.level 9
summary.log.chunk.skipcrconread false
summary.log.maxfilesize 1000000000
summary.log.maxdiskbloatfactor 0.1
summary.log.maxbucketspread 2.5
summary.log.minfilesizefactor 0.2
periodic.interval 3600
tlsspec "tcp/localhost:13700"
tlsconfigid ""
slobrokconfigid ""
routingconfigid ""
filedistributor.configid ""
pruneremoveddocumentsinterval 0
pruneremoveddocumentsage 1.2096e+06
packetcompresslimit 1024
packetcompresslevel 3
packetcompresstype "LZ4"
lidspacecompaction.interval 600
lidspacecompaction.allowedlidbloat 1000
lidspacecompaction.allowedlidbloatfactor 0.01
lidspacecompaction.removebatchblockdelay 5
maxvisibilitydelay 1
visit.defaultserializedsize 1
visit.ignoremaxbytes true
initialize.threads 0
writefilter.attribute.enumstorelimit 0.9
writefilter.attribute.multivaluelimit 0.9
writefilter.memorylimit 0.8
writefilter.disklimit 0.8
writefilter.sampleinterval 60
hwinfo.memory.size 0
hwinfo.disk.size 0
hwinfo.disk.shared false
hwinfo.disk.writespeed 200
hwinfo.disk.samplewritesize 1073741824
hwinfo.disk.slowwritespeedlimit 100
hwinfo.cpu.cores 0
feeding.concurrency 0.2
maintenancejobs.resourcelimitfactor 1.05
maintenancejobs.maxoutstandingmoveops 10
bucketdb.checksumtype "LEGACY"

@jobergum jobergum commented Nov 9, 2019

Clearly the CPU usage is driven by query traffic, as seen from your perf report.
So you changed something and the CPU usage changed? Maybe you could share the changes you made (e.g. document definition changes, query changes)?

The wand operator is sensitive to other filters/terms in the query, in addition to the number of documents, the number of document features, the feature distribution, and obviously also the N parameter. There are cases where the wand operator is slower than a brute-force OR, especially when used in combination with other filter query terms.

On memory usage: it's expected that resident memory increases as the index and summary store are touched by queries and requests, since the Vespa search core uses normal IO for the file read operations involved in search and summary requests.
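
For reference, this is roughly how the N parameter appears in YQL for the weakAnd and wand operators; the field names, terms, and weights below are just placeholders, and the annotation is called targetNumHits in this version (targetHits in newer ones):

# placeholders only: "default" and "tags" stand in for your own fields
select * from sources * where ([{"targetNumHits": 100}]weakAnd(default contains "a", default contains "b"));
select * from sources * where ([{"targetNumHits": 100}]wand(tags, {"a": 1, "b": 2}));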


@luoyetx luoyetx commented Nov 10, 2019

I have removed the wand query, but still use queries like "A or B or C" (something like the should clause in ES). If I use perf to look at CPU usage, the top entry is still the same. The document definition has not changed.
The CPU usage goes down a little but is still at a very high level.
[screenshot: CPU usage]
The memory usage keeps growing.
[screenshot: memory usage]


@luoyetx luoyetx commented Nov 10, 2019

I think the memory growth is not normal; the total number of documents and the query traffic have not changed that much.


@jobergum jobergum commented Nov 10, 2019

So what did you change when the graph shows the CPU increase on 11/7 at approximately 12:00? I think you introduced wand? Maybe you went from using AND to WAND and also changed to OR later on? If you change the boolean logic from AND to OR-like search, you are going to recall more documents, which means more CPU. You also say that the search quality is not impacted, so I'm not able to fully understand what this issue is about.


@luoyetx luoyetx commented Nov 10, 2019

The CPU usage changed suddenly on 11/7 at 12:00 (I don't think I changed anything in the query or search definitions), and the memory usage started to grow; wand was already in use at that time. After profiling with perf, I removed wand but changed nothing else (query latency went down a little). The number of recalled documents and search requests is normal.


@jobergum jobergum commented Nov 10, 2019

What do your queries look like, and how are the searched fields defined in the search definition?


@luoyetx luoyetx commented Nov 11, 2019

Because of the memory growth, I restarted all services on every node, and the CPU and memory usage came back to normal.

perf report for the searchnode:
[screenshot: perf report]


@jobergum jobergum commented Nov 11, 2019

Thanks for the update. So:

"If I use perf to see CPU usage, the top1 is still the same."

So even after you stopped using wand, you could see the wand operator using CPU? If so, that could indicate a bug in the wand evaluation. What Vespa version are you on? How is the field you run wand over defined?


@luoyetx luoyetx commented Nov 11, 2019

Yes, after removing the wand query, the CPU usage goes down a little but is still at a very high level, and the memory usage keeps growing.
We build from the vespa-7.99.22-1 source code, and we have also modified the Java linguistics code to stop it from removing emoji.
[screenshot]

The fields are defined like below:

field fieldA type string {
        indexing: summary | index
        normalizing: none
        summary-to: online
}
field fieldB type string {
        indexing: summary | index
        normalizing: none
        summary-to: online
}

# query over merged_field
fieldset merged_field {
        fields: fieldA,fieldB
}
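
Queries over this fieldset are roughly of this form (the terms here are just placeholders):

# terms are placeholders
select * from sources * where merged_field contains "term1" or merged_field contains "term2";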

@jobergum jobergum commented Nov 11, 2019

OK, thanks for the clarification. There are AFAIK no known issues with wand, so this is a first, potentially related to the emoji support you have built. The memory growth and the apparent infinite loop could indicate that. Would it be possible for you to share more on the linguistics changes, the complete search definition, and example queries which were executed around the time of the CPU spike? You can send it to (bergum at verizonmedia.com).
