Per project management and workflow

Vladimir Kotal edited this page Jun 26, 2023 · 32 revisions

Motivation

OpenGrok can be run with or without projects. A project is simply a directory directly underneath the OpenGrok source root directory. A project can have zero or more Source Code Management repositories underneath. In a setup without projects, all of the data has to be indexed at once. With projects, however, each project has its own index, so projects can be indexed separately. They can also be reindexed in parallel, potentially speeding up the overall process.

When working with project data, there are 2 types of processing that can take a long time:

  • synchronization: updating project data so that it matches its origin
    • usually involves running commands like git pull in all the repositories for given project.
  • indexing: updating the index so that it matches the project data

For some projects, either or both steps can take a long time. Say a repository has its origin on an NFS share across the Atlantic, so access has high latency, and it uses a legacy VCS that operates on individual files rather than changesets; such a repository can take tens of minutes, if not hours, to synchronize. Or a repository has a large number of files, so the initial phase of indexing always takes a long time (the whole project directory tree is scanned for changed files) even though the incremental changes are small.

Or maybe there are lots of projects that exhibit some of these characteristics.

Previously, it was necessary to index the whole source root in order to discover new projects and add them to the configuration. Starting with OpenGrok 1.1, it is possible to manage and index projects separately.

As a result, indexing the complete source root is only necessary when upgrading across OpenGrok versions with incompatible Lucene indexes.

Combine these procedures with the parallel processing tools (see repository synchronization) and you have per-project management with parallel processing.

The following examples assume that OpenGrok install base is under the /opengrok directory.

Workflow

It is possible to start from scratch, or to take an OpenGrok instance that already indexes all projects in one go and convert it to index projects separately and in parallel.

There are some design choices that need to be dealt with:

  • The indexer either has to discover projects and their repositories during the indexing preparation or it has to know them in advance.
  • The configuration file has to be written once a project has been added, modified or indexed for the first time.
  • The indexer uploads the configuration to the web app at the end of the indexing.

Thus, when indexing a newly added project, it is necessary to add it to the configuration first, then index it, and lastly make the new configuration persistent.

This page lists all the pieces and how to operate them.

Also see https://github.com/oracle/opengrok/wiki/Tuning-for-large-code-bases#tomcat-threads

Building blocks

The following assumes that the opengrok-projadm, opengrok-groups and opengrok-config-merge tools are in PATH. You can install them from the opengrok-tools Python package available in the release tarball.

Using the opengrok-projadm tool (that utilizes the opengrok-config-merge tool and RESTful API) it is possible to manage the projects.

Configuration backup

The next sections start by suggesting that you back up the current configuration. This can be done e.g. by copying the configuration.xml file (written by the indexer when using the -W indexer option) aside, or by taking a file-system snapshot of the directory the configuration is stored in.

The backup is a precaution in case something goes wrong.
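
The copy-aside variant can be sketched in a few lines of Python. This is only an illustration, not part of the OpenGrok tools: the function name backup_config and the paths in the usage note are assumptions, so substitute the location where your indexer actually writes configuration.xml.

```python
import shutil
import time
from pathlib import Path


def backup_config(config_path, backup_dir):
    """Copy the configuration file aside under a timestamped name.

    Hypothetical helper: both paths are supplied by the caller.
    """
    src = Path(config_path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # The timestamp suffix keeps older backups from being overwritten.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.name}.{stamp}"
    # copy2 preserves timestamps and permission bits where possible.
    shutil.copy2(src, dest)
    return dest
```

For example, backup_config("/opengrok/etc/configuration.xml", "/opengrok/backup") would produce a file such as /opengrok/backup/configuration.xml.20230626-120000 (both paths are assumed here, not prescribed by OpenGrok).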

Adding a project

  • backup current config
  • add the project data to a directory under the source root directory
    • this usually involves running VCS command such as git clone, extracting source code from an archive, etc.
  • perform any necessary authorization adjustments
  • add the project to configuration (also refreshes the configuration on disk):
   opengrok-projadm -b /opengrok -a PROJECT

Indexing a project

The indexing part of the wiki explains how to run the indexer in general.

Running the indexer for single project has several constraints:

  • scanning for repositories/projects is not wanted - no -P or -S options
    • however the indexer has to know the project/repository information so it needs to be either retrieved from the web application or use the persistent configuration on disk
  • it is undesirable to write the configuration that is created during the indexer run to disk - no -W option

Thus, running the indexer for single project may look like this:

$ curl -s -X GET http://localhost:8080/source/api/v1/configuration -o fresh_config.xml
$ opengrok-indexer -a /opengrok/dist/lib/opengrok.jar -- \
    -c /usr/local/bin/ctags \
    -U 'http://localhost:8080/source' \
    -R fresh_config.xml \
    -H PROJECT_NAME \
    PROJECT_NAME

The -U option is important as it pokes the web app to use the most recent index (that was just created). Also it updates it with the latest repository metadata.

This does not deal with logging to a separate log file. Also, this is not robust when run in parallel due to the configuration handling (the configuration should be stored in a temporary file with a random name).
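
A minimal sketch of the temporary-file approach, assuming the configuration has already been downloaded into memory. The helper names write_temp_config and indexer_args are hypothetical; the indexer options are the ones from the example above, and the default paths simply mirror that example.

```python
import os
import tempfile


def write_temp_config(config_bytes, directory=None):
    """Store configuration data in a uniquely named temporary file."""
    # mkstemp picks a random, collision-free name and creates the file
    # atomically, so parallel runs never share a configuration file.
    fd, path = tempfile.mkstemp(prefix="opengrok-config-", suffix=".xml",
                                dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(config_bytes)
    return path


def indexer_args(project, config_path,
                 jar="/opengrok/dist/lib/opengrok.jar",
                 ctags="/usr/local/bin/ctags",
                 uri="http://localhost:8080/source"):
    """Build the single-project indexer command line from the example above."""
    return [
        "opengrok-indexer", "-a", jar, "--",
        "-c", ctags,
        "-U", uri,
        "-R", config_path,
        "-H", project,
        project,
    ]
```

The caller would pass the resulting list to something like subprocess.run and remove the temporary file once the indexer has finished.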

The recommended way is to use the opengrok-reindex-project script. It downloads a fresh configuration from the web app so that the indexer knows about the indexed project and its repositories. It can also generate the logging configuration on the fly.

Once the project reindex is done, save the configuration. This is necessary so that the indexed flag of the project is persistent; if it is not made persistent and the web app restarts, the project will not be accessible in the web app.

   opengrok-projadm -b /opengrok -r

The -R option can be passed to opengrok-projadm to supply the path to a read-only configuration, which is then merged with the current configuration.

Deleting a project

  • backup current config
  • delete the project from configuration (deletes the project's index data and refreshes the on-disk configuration):
   opengrok-projadm -b /opengrok -d PROJECT

As when adding a project, the -R option can be passed to opengrok-projadm to supply the path to a read-only configuration, which is then merged with the current configuration.

opengrok-sync

The opengrok-sync script provides a way to run a sequence of commands for a set of projects in parallel. Thus, it can be used to synchronize and reindex projects.

The script accepts the configuration either in JSON or YAML.

Use e.g. like this:

  $ opengrok-sync -c /scripts/sync.conf -d /ws-local/

where the sync.conf file contents might look like this:

commands:
- call:
    uri: http://localhost:8080/source/api/v1/messages
    method: POST
    data:
      cssClass: info
      duration: PT1H
      tags: ['%PROJECT%']
      text: resync + reindex in progress
- command:
    args: [sudo, -u, wsmirror, /opengrok/dist/bin/opengrok-mirror, -c, /opengrok/etc/mirror-config.yml, -U, 'http://localhost:8080/source']
- command: 
    args: [sudo, -u, webservd, /opengrok/dist/bin/opengrok-reindex-project, -J=-d64,
      '-J=-XX:-UseGCOverheadLimit', -J=-Xmx16g, -J=-server, --jar, /opengrok/dist/lib/opengrok.jar,
      -t, /opengrok/etc/logging.properties.template, -p, '%PROJ%', -d, /opengrok/log/%PROJECT%,
      -P, '%PROJECT%', -U, 'http://localhost:8080/source', --, --renamedHistory, 'on', -r, dirbased, -G, -m, '256', -c,
      /usr/local/bin/ctags, -U, 'http://localhost:8080/source', -o, /opengrok/etc/ctags.config,
      -H, '%PROJECT%']
    env: {LC_ALL: en_US.UTF-8}
    limits: {RLIMIT_NOFILE: 1024}
- call:
    uri: 'http://localhost:8080/source/api/v1/messages?tag=%PROJECT%'
    method: DELETE
    data: ''
- command:
    args: [/scripts/check-indexer-logs.ksh]
cleanup:
  - call:
      uri: 'http://localhost:8080/source/api/v1/messages?tag=%PROJECT%'
      method: DELETE
      data: ''

Note: the -U 'http://localhost:8080/source' option appearing twice in the opengrok-reindex-project arguments above is not a typo. It must be specified twice: once for the Python script and once for the indexer.

The above opengrok-sync command takes all directories under /ws-local and runs the sequence of commands specified in the sync.conf file for each of them. This is done in parallel, at the project level. The level of parallelism can be specified using the --workers option (by default it uses as many workers as there are CPUs in the system).

Another variant of how to specify the list of projects to be synchronized is to use the --indexed option of opengrok-sync that will query the webapp configuration for list of indexed projects and will use that list. Otherwise, the --projects option can be specified to process just specified projects.
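
The project-level parallelism can be pictured with the following sketch. It is a simplified model rather than the actual opengrok-sync implementation: process_projects and handle_project are hypothetical names, and a thread pool is used here purely for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_projects(projects, handle_project, workers=None):
    """Run handle_project once per project using a pool of workers."""
    # Default to one worker per CPU, mirroring the documented default
    # of the --workers option.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with projects.
        results = list(pool.map(handle_project, projects))
    return dict(zip(projects, results))
```

Here handle_project stands for "run the whole command sequence for one project"; each project is handled by a single worker, so the commands of one project still run one after another.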

The commands above will basically:

  • mark the project with alert (to let the users know it is being synchronized/indexed) using the RESTful API call (the %PROJECT% string is replaced with current project name)
  • pull the changes from all the upstream repositories that belong to the project using the opengrok-mirror command
  • reindex the project using opengrok-reindex-project
  • clear the alert using the second RESTful API call
  • execute the /scripts/check-indexer-logs.ksh script to perform some pattern matching in the indexer logs to see if there were any serious failures there. The script can look e.g. like this:
#!/usr/bin/ksh
#
# Check OpenGrok indexer logs in the last 24 hours for any signs of serious
# trouble.
#

if (( $# != 1 )); then
        print -u2 "usage: $0 <project_name>"
        exit 1
fi

project_name=$1

typeset -r log_dir="/opengrok/log/$project_name/"
if [[ ! -d $log_dir ]]; then
        print -u2 "cannot open log directory $log_dir"
        exit 1
fi

# Check the last log file.
if grep SEVERE "$log_dir/opengrok0.0.log"; then
        exit 1
fi

The opengrok-sync script will print any errors to the console and uses file level locking to provide exclusivity of run so it is handy to run from crontab periodically.

Each "command" can be either normal command execution (supplying the list of program arguments) or RESTful API call (supplying the HTTP verb and optional payload).

Note that the cleanup is a set of commands. If any of them fails (i.e. returns a non-zero value), the process is not interrupted, unlike the main command sequence.

URI specification

Note that if the web application is listening on a non-standard host or port (localhost and 8080 are the defaults), the URI has to be specified everywhere it matters. Given that opengrok-sync performs RESTful API queries itself, one has to specify the location using the -U option of this script and then again in the configuration file, for any RESTful API calls and for the opengrok-indexer command (which also uses the -U option).

HTTP headers

In the configuration, each command that does an API call can specify a set of HTTP headers via the headers dictionary, e.g.:

- call:
    uri: 'http://192.160.0.1:8080/source/api/v1/messages?tag=%PROJECT%'
    method: DELETE
    data: 'resync + reindex in progress'
    headers:
      'Content-type': 'text/plain'
      'Authorization': 'Bearer foobar'

Also, it is possible to specify common set of headers to be used for each RESTful API command with top level headers item:

headers:
  'Authorization': 'Bearer foobar'
  'X-another-header': 'whoohooo'
commands:
- call:
    uri: http://192.160.0.1:8080/source/api/v1/messages
    method: POST
    data:
      messageLevel: warning
      duration: PT1H
      tags: ['%PROJECT%']
      text: resync + reindex in progress
    headers:
      'X-special-header': 'foo'
      'X-just-another-header': 'ha'
...

What happens is that the headers from the headers top level configuration item are merged with the command specific headers. So, in the above example, the RESTful command will have the Authorization, X-another-header, X-special-header, X-just-another-header headers.

Further, if opengrok-sync is run with the -H command line option (that can have multiple arguments, i.e. it is possible to specify multiple headers with it), these headers will be merged with the headers from the top level headers configuration item and the result will be merged with per command specific headers.
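
The merge order described above can be sketched as follows. The function merge_headers is hypothetical, and the precedence on conflicting header names (the later source wins) is an assumption; this page does not state how conflicts are resolved.

```python
def merge_headers(cli_headers, top_level_headers, command_headers):
    """Merge the three header sources into a single dictionary.

    Assumption: when the same header name appears in several sources,
    the later source in the merge order overrides the earlier one.
    """
    merged = dict(cli_headers)        # headers given with the -H option
    merged.update(top_level_headers)  # top level "headers" item
    merged.update(command_headers)    # per command "headers" item
    return merged
```

With the configuration example above and no -H option, the result contains the Authorization, X-another-header, X-special-header and X-just-another-header headers.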

Cleanup

If any of the commands in "commands" fails, the "cleanup" commands will be executed. This is handy in this case since the first RESTful API call marks the project with an alert in the web UI, so if any of the commands that follow fails, the cleanup call will be made to clear the alert.

Normal command execution can be also performed in the cleanup section.
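
The failure handling can be modelled with a short sketch. This is an illustration of the documented behaviour, not the opengrok-sync source: run_with_cleanup is a hypothetical name, and each command is represented by a callable that takes the project name and returns an exit code.

```python
def run_with_cleanup(project, commands, cleanup):
    """Run commands in order; on the first failure run the cleanup steps."""
    for command in commands:
        if command(project) != 0:
            # The main sequence is interrupted at the first failure ...
            for step in cleanup:
                # ... but a failing cleanup step does not stop the
                # remaining cleanup steps, so its result is ignored.
                step(project)
            return False
    return True
```

In the sync.conf example above, a failing opengrok-mirror or opengrok-reindex-project run would thus still trigger the DELETE call that clears the alert.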

Ignoring projects

Sometimes it is useful to ignore some projects in an opengrok-sync run (assuming it is run e.g. with --indexed so it retrieves the list of projects to process from the web application), such as when a project needs special handling, e.g. a different schedule.

These projects can be specified either in the "ignore_projects" section in the configuration file (holds a list of project names) or using the --ignore_project command line option (can have multiple arguments).

Ignoring errors

Some projects can be notorious for producing spurious errors, so their errors can be ignored via the "ignore_errors" section.

Run

In the above example it is assumed that opengrok-sync is run as root and synchronization and reindexing are done under different users. This is done so that the web application cannot tamper with source code even if compromised.

Pattern replacement and logging

The project name is appended to the arguments of each command unless one of the arguments contains %PROJECT%, in which case the pattern is substituted with the project name and nothing is appended.
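
The rule can be expressed as a small function. This is a sketch of the behaviour described above, with the hypothetical name expand_args; it is not taken from the opengrok-tools sources.

```python
PROJECT_PATTERN = "%PROJECT%"


def expand_args(args, project):
    """Substitute %PROJECT% in args, or append the project name if absent."""
    if any(PROJECT_PATTERN in arg for arg in args):
        # At least one argument carries the pattern: substitute everywhere
        # and do not append anything.
        return [arg.replace(PROJECT_PATTERN, project) for arg in args]
    # No pattern present: the project name becomes the last argument.
    return list(args) + [project]
```

For a project named foo, ["/scripts/check-indexer-logs.ksh"] becomes ["/scripts/check-indexer-logs.ksh", "foo"], while ["-d", "/opengrok/log/%PROJECT%", "-p", "%PROJ%"] becomes ["-d", "/opengrok/log/foo", "-p", "%PROJ%"]. A differently named pattern such as %PROJ% is left untouched.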

For per-project reindexing to work properly, opengrok-reindex-project uses the logging.properties.template to make sure each project has its own log directory. The file can look e.g. like this:

handlers= java.util.logging.FileHandler

.level= FINE

java.util.logging.FileHandler.pattern = /opengrok/log/%PROJ%/opengrok%g.%u.log
# Create one file per indexer run. This makes indexer log easy to check.
java.util.logging.FileHandler.limit = 0
java.util.logging.FileHandler.append = false
java.util.logging.FileHandler.count = 30
java.util.logging.FileHandler.formatter = org.opengrok.indexer.logger.formatter.SimpleFileLogFormatter

java.util.logging.ConsoleHandler.level = WARNING
java.util.logging.ConsoleHandler.formatter = org.opengrok.indexer.logger.formatter.SimpleFileLogFormatter

The %PROJ% template is passed to the script for substitution in the logging template. This pattern must differ from the %PROJECT% pattern, otherwise opengrok-sync would substitute it in the command arguments and the substitution in the template file would not happen.

You can find a logging.properties.template file in the release tarball, under the doc directory.