Tuning for large code bases

Vladimir Kotal edited this page Jun 5, 2023 · 42 revisions

If you landed on this page, chances are that you hit a problem with scaling, latency, or performance. The first step should be to set up monitoring to see related problems more clearly.

For both the indexer and the web application, check how the operating system deals with situations where allocated memory exceeds system resources. On Linux specifically, make sure to avoid or tame the OOM killer to prevent nasty surprises such as abrupt termination of the indexer process.

JVM

In general, it is recommended to run both the indexer and the web application with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/some/sensible/place/to/store/jvm/dumps in order to capture a JVM heap dump in case of an out-of-memory error, so that it is possible to analyze the dump with tools like jhat or http://www.eclipse.org/mat/

Indexer

If you run the Indexer via the opengrok-indexer script, keep in mind that by default it does not set the Java heap size, so the JVM default is used. This might not be enough, especially for large projects such as AOSP or when indexing lots of mid-sized projects.

Temporary space

Make sure to have enough space in the temporary directory. Universal ctags creates temporary files there during indexing, and the higher the parallelism, the more files need to be created. These files are sometimes hundreds of megabytes each.
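As a quick pre-flight check, the free space in the temporary directory can be inspected before starting the indexer. The following is a minimal Python sketch; the function name and the 10 GiB threshold are assumptions made for illustration, not values taken from OpenGrok, and the directory reported by tempfile.gettempdir() may differ from the one ctags actually uses on your system.

```python
import shutil
import tempfile


def free_temp_space_gib(path=None):
    """Return free space (in GiB) on the file system holding the temp dir."""
    path = path or tempfile.gettempdir()
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # Hypothetical threshold; size it to your parallelism and file sizes.
    wanted_gib = 10
    free = free_temp_space_gib()
    status = "OK" if free >= wanted_gib else "LOW"
    print(f"{tempfile.gettempdir()}: {free:.1f} GiB free [{status}]")
```

Running this from the same account and environment as the indexer gives the most representative answer, since the temporary directory is commonly selected via environment variables.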

JVM heap size

The usual JVM heap size for the indexer is 8 GB. The heap usage can be monitored, see https://github.com/oracle/opengrok/wiki/Monitoring#indexer

The heap usage depends on the level of parallelism used by the indexer. Generally, the higher the parallelism, the higher the heap usage.

Also, when the history cache is created for a repository with a large number of changesets, this can consume a significant part of the heap. Because of this, the history for some repositories (Mercurial, Git) is handled in chunks to limit the memory requirements. The size of the chunks (i.e. the number of changesets to process in one go) can be tuned. The larger the chunk size, the higher the memory usage; however, the indexer may finish more quickly.

See https://github.com/oracle/opengrok/wiki/Indexer-configuration#indexer-tunables for details.

Git Merge commits

It is possible to disable handling of merge commits in Git via global or per-project configuration. If you have a repository with rich history, this might help. Also see the history chunk size tuning above.

Lucene flush buffer size

Lucene 4.x sets indexer defaults:

DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB = 1945;
DEFAULT_MAX_THREAD_STATES = 8;
DEFAULT_RAM_BUFFER_SIZE_MB = 16.0;
  • With these defaults the indexing buffers might grow as big as 16 GB, although DEFAULT_RAM_BUFFER_SIZE_MB should not really allow that. In any case, keep the buffer size around 1-2 GB.

  • The Lucene RAM_BUFFER_SIZE_MB can be tuned using the indexer parameter -m. For example, to run the indexer on a 64-bit server JDK with an 8 GB heap and tuned document flushing (this assumes the indexer is run via the Python wrapper; otherwise pass the indexer options directly):

    $ opengrok-indexer -J=-Xmx8g -J=-server --jar opengrok.jar -- \
         -m 256 -s /source -d /data ...

On Solaris you might also want to use -J=-d64.
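The 16 GB figure mentioned above appears to follow from the per-thread hard limit multiplied by the number of thread states. A small Python sketch of that arithmetic follows; the function name is made up for illustration, the constants are the ones quoted in this section, and the interpretation itself is an assumption.

```python
# Constants as quoted from the Lucene 4.x defaults in this section.
RAM_PER_THREAD_HARD_LIMIT_MB = 1945
MAX_THREAD_STATES = 8


def worst_case_buffer_mb(per_thread_mb=RAM_PER_THREAD_HARD_LIMIT_MB,
                         thread_states=MAX_THREAD_STATES):
    """Upper bound on buffered RAM if every thread state hit its hard limit."""
    return per_thread_mb * thread_states


if __name__ == "__main__":
    total_mb = worst_case_buffer_mb()
    print(f"{total_mb} MB = {total_mb / 1024:.1f} GiB")  # 15560 MB = 15.2 GiB
```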

Open File and processes hard and soft limits

The initial index creation process is resource intensive and often the error java.io.IOException: error=24, Too many open files appears in the logs. To avoid this, increase the ulimit value to a higher number.

With Java modularization, the indexer process by itself (not performing any indexing) will easily have 250 or so open files, because all the Java jmod files are open.

A hard and soft open files limit of 10240 is known to work for mid-sized repositories, so the recommendation is to start with that value. Also, the higher the parallelism of the indexer, the higher the limit has to be. See the parallelism related tunables in https://github.com/oracle/opengrok/wiki/Indexer-configuration , in particular the indexingParallelism tunable. The parallelism can also be set using the indexer command line options.

The resource limits can also be set when using the opengrok-sync tool, in the configuration for a given command, using the:

limits: {RLIMIT_NOFILE: 2048}

directive in the configuration file. See the opengrok-sync documentation for more details and examples.
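To see what limits a process actually inherits, Python's standard resource module (Unix only) can be queried directly. The sketch below compares the current open files limits with the 10240 starting point suggested above; the helper name nofile_shortfall is hypothetical and not part of any OpenGrok tool.

```python
import resource

RECOMMENDED_NOFILE = 10240  # starting point suggested in the text above


def nofile_shortfall(soft, hard, wanted=RECOMMENDED_NOFILE):
    """Return how far the soft and hard limits fall short of `wanted`.

    RLIM_INFINITY means "no limit", so it never counts as a shortfall.
    """
    def short(limit):
        if limit == resource.RLIM_INFINITY:
            return 0
        return max(0, wanted - limit)
    return short(soft), short(hard)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    soft_short, hard_short = nofile_shortfall(soft, hard)
    print(f"soft={soft} hard={hard}")
    if soft_short or hard_short:
        print(f"below {RECOMMENDED_NOFILE}: raise the limits before indexing")
```

The limits are reported for the process running the script, so it should be started the same way as the indexer (same user, same service manager or shell) to be meaningful.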

Thread limits

If you get an error similar to the open files one, but for threads (java.lang.OutOfMemoryError: unable to create new native thread), it might be due to strict security limits, in which case you need to increase them.

Web application

The heap size limit for the web application should be derived from the size of the data generated by the indexer and should also reflect the size of the WFST structures generated by the Suggester in the web application. The former creates memory pressure especially for multi-project searches. Thus, for precise tuning it might be prudent to estimate the memory footprint of a single all-project search (using a memory profiler), determine how many requests the web application can serve simultaneously, multiply these two values, and make sure the heap limit is bigger than the result.

For Suggester data, it should be sufficient to compute the sum of lengths of all *.wfst files under the data root and bump the heap limit by that value.
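The estimate described above can be sketched in a few lines of Python. The function names, the 512 MiB footprint, the concurrency of 8, and the /data path are all placeholders chosen for illustration; the search footprint in particular has to come from your own profiling.

```python
from pathlib import Path


def wfst_bytes(data_root):
    """Sum the sizes of all *.wfst files found under the data root."""
    return sum(p.stat().st_size for p in Path(data_root).rglob("*.wfst")
               if p.is_file())


def estimate_heap_bytes(search_footprint_bytes, concurrent_requests, data_root):
    """Search footprint times concurrency, plus the Suggester WFST data."""
    return search_footprint_bytes * concurrent_requests + wfst_bytes(data_root)


if __name__ == "__main__":
    # All three inputs are placeholders: measure the footprint with a memory
    # profiler and substitute your own data root and request concurrency.
    estimate = estimate_heap_bytes(
        search_footprint_bytes=512 * 2**20,
        concurrent_requests=8,
        data_root="/data",
    )
    print(f"heap limit should exceed roughly {estimate / 2**30:.1f} GiB")
```

The result is a lower bound rather than a recommendation, since the web application needs heap for other purposes as well.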

Tomcat threads

The web application utilizes several thread pools. These are usually sized based on the number of on-line CPUs (cores) in the system. By default Tomcat allows only 200 or so threads for the basic Connector. The more CPUs (cores) you have in the system, the higher chance the limit will be reached. So, it might be necessary to bump the limit.

Also, when using the per-project workflow, there are usually many indexer processes running in parallel. Each of these makes several RESTful API calls. Combined, these can lead to many threads being created in the web application.

Configuration snippet example:

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               maxThreads="1024" />

There is also the maxConnections attribute.
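To check which thread limit is in effect on an existing installation, the Connector elements in conf/server.xml can be inspected. Below is a minimal Python sketch using the standard library XML parser; the function name is hypothetical, the sample XML is made up, and the sketch deliberately ignores Connectors that delegate to a shared Executor or use property substitution in attribute values.

```python
import xml.etree.ElementTree as ET

TOMCAT_DEFAULT_MAX_THREADS = 200  # Tomcat's documented Connector default


def connector_max_threads(server_xml_text):
    """Map each Connector port to its effective maxThreads value."""
    root = ET.fromstring(server_xml_text)
    return {
        conn.get("port"): int(conn.get("maxThreads", TOMCAT_DEFAULT_MAX_THREADS))
        for conn in root.iter("Connector")
    }


if __name__ == "__main__":
    sample = """
    <Server>
      <Service name="Catalina">
        <Connector port="8080" protocol="HTTP/1.1"
                   connectionTimeout="20000"
                   redirectPort="8443"
                   maxThreads="1024" />
        <Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />
      </Service>
    </Server>
    """
    print(connector_max_threads(sample))
```

To use it on a real installation, pass the contents of your own server.xml instead of the sample string.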

Tomcat heap

Tomcat by default supports only small deployments. For bigger ones you might need to increase its heap (assuming 64-bit Java). It will most probably be the same for other containers as well. For Tomcat you can easily get this done by creating $CATALINA_BASE/bin/setenv.sh:

# cat $CATALINA_BASE/bin/setenv.sh
JAVA_OPTS="$JAVA_OPTS -server"

# OpenGrok memory boost to cover all-project searches
# (7 MB * 247 projects + 300 MB for cache should be enough)
# 64-bit Java allows for more so let's use 8GB to be on the safe side.
# We might need to allow more for concurrent all-project searches.
JAVA_OPTS="$JAVA_OPTS -Xmx8g"

export JAVA_OPTS

Tomcat/Apache tuning for HTTP headers

With Tomcat you might also hit the limit for HTTP header size (the header is used to send the project list when requesting search results).

To raise it, increase (or add) the maxHttpHeaderSize attribute in conf/server.xml, for example:

  <Connector port="8888" protocol="HTTP/1.1"
             connectionTimeout="20000"
             maxHttpHeaderSize="65536"
             redirectPort="8443" />

Refer to docs of other containers for more info on how to achieve the same.

Failure to do so will result in HTTP 400 errors after the first query, with the error "Error parsing HTTP request header".

The same tuning for Apache (handy in case you are running Apache as a reverse proxy in front of Tomcat) can be done with the LimitRequestLine and LimitRequestFieldSize directives:

LimitRequestLine 65536
LimitRequestFieldSize 65536

and also bump the default limit of responsefieldsize:

LoadModule proxy_module libexec/mod_proxy.so
LoadModule proxy_http_module libexec/mod_proxy_http.so


<IfModule mod_proxy.c>
    # The number of seconds Apache httpd waits for data sent by / to the backend.
    # This should match the `interactiveCommandTimeout` setting in OpenGrok.
    ProxyTimeout 600

    ProxyPass /source/ http://localhost:8080/source/ responsefieldsize=16384
    ProxyPass /source http://localhost:8080/source/ responsefieldsize=16384
    ProxyPassReverse /source/ http://localhost:8080/source/
    ProxyPassReverse /source http://localhost:8080/source/
</IfModule>

Multi-project search speed tip

If multi-project search is performed frequently, it might be good to warm up file system cache after each reindex. This can be done e.g. with https://github.com/hoytech/vmtouch
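Where vmtouch is not available, a plain read-through of the index files has a broadly similar effect on most systems. The Python sketch below is that simpler substitute, not what vmtouch does internally; the function name and the /data/index path are placeholders, and whether the pages stay cached depends on how much free memory the operating system has.

```python
import os


def warm_file_cache(root, chunk_size=1 << 20):
    """Read every regular file under `root` once; return total bytes read."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while chunk := f.read(chunk_size):
                        total += len(chunk)
            except OSError:
                # Files may vanish or be unreadable; skip them.
                continue
    return total


if __name__ == "__main__":
    # "/data/index" is a placeholder; point it at your own index directory.
    print(f"read {warm_file_cache('/data/index')} bytes")
```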

Suggester

It is recommended to store the Suggester data on SSD/Flash. This benefits both the Suggester rebuild operation (which happens during reindex and also periodically) and web application startup (which performs Suggester initialization).