Model operations performance improvements #347

Merged
merged 10 commits into vivo-project:main on Nov 25, 2022

Conversation

@litvinovg (Contributor) commented Nov 11, 2022

VIVO 3605 GitHub issue
VIVO 3792 GitHub issue

What does this pull request do?

Implemented several changes to improve the performance of model modifications.

What's new?

RDFServiceBulkUnionUpdater, which forwards remove(model), removeAll(), and add(model) calls to the enclosed updaters.
RDFServiceModelMaker calls removeAll() on the model instead of removing triples one by one.
Removed an excessive deserialization/serialization cycle in BulkUpdatingOntModel.performRemoveAll.
Created bulk versions of ModelCom, OntModelImpl, and GraphMem to avoid triple-by-triple transactions on batch add/remove.
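For orientation, the forwarding idea behind the new union updater can be sketched in plain Java. This is only an illustrative stand-in: TripleSet, BulkUpdater, InMemoryUpdater, and UnionBulkUpdater are hypothetical names, not the actual Vitro classes, and the real implementation operates on Jena models through the RDFService.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-in for a Jena Model so this sketch is self-contained; in Vitro
// the real org.apache.jena.rdf.model.Model is what gets passed around.
class TripleSet {
    final Set<String> triples = new HashSet<>();
    TripleSet(String... ts) { for (String t : ts) triples.add(t); }
}

// Illustrative forwarding pattern (interface and class names are
// assumptions, not the actual Vitro API): bulk add/remove/removeAll
// calls are forwarded whole to every enclosed updater instead of being
// broken into one transaction per triple.
interface BulkUpdater {
    void add(TripleSet model);
    void remove(TripleSet model);
    void removeAll();
}

class InMemoryUpdater implements BulkUpdater {
    final Set<String> store = new HashSet<>();
    public void add(TripleSet m) { store.addAll(m.triples); }
    public void remove(TripleSet m) { store.removeAll(m.triples); }
    public void removeAll() { store.clear(); }
}

class UnionBulkUpdater implements BulkUpdater {
    private final List<BulkUpdater> members;
    UnionBulkUpdater(List<BulkUpdater> members) { this.members = members; }
    public void add(TripleSet m) { members.forEach(u -> u.add(m)); }
    public void remove(TripleSet m) { members.forEach(u -> u.remove(m)); }
    public void removeAll() { members.forEach(BulkUpdater::removeAll); }
}
```

The point of the pattern is that each member receives the whole batch in a single call, so a backing triplestore can apply it in one transaction.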

How should this be tested?

  • Delete the tdbContentModels and tdbModels folders in the VIVO home directory, start VIVO, and check the logs to see how long it takes to load all n3 files from the home directory
  • Open ingest tools -> Manage Jena Models -> Configuration models
  • Create a new model
  • Add the attached sample data (sample-data.zip) to the model
  • Click "clear statements" and measure the time
  • Check that the model is empty
  • Add the attached sample data to the model again
  • Remove the sample data from the model using the same "add/remove RDF data" dialog and measure the time
  • Add the attached sample data to the model again
  • Click "remove" and measure the time

Interested parties

@bkampe
@brianjlowe
@chenejac

…y triple transactions on add/remove model

Load n3 files from home directory into in-memory model to use bulk
loading.
@chenejac chenejac left a comment


@litvinovg this looks like a great improvement in performance. I have tested it, and loading of models is significantly faster. I also tried to validate that everything is loaded into the triplestore using SPARQL queries: I checked for academic degrees and continents, and the number of classes is the same. There is only one comment from me regarding the code review.

@litvinovg litvinovg dismissed a stale review via 3f49039 November 21, 2022 15:47

@chenejac chenejac left a comment


@litvinovg two more comments from me

@@ -18,6 +18,8 @@
import java.util.TreeSet;

import edu.cornell.mannlib.vitro.webapp.i18n.selection.SelectedLocale;
import edu.cornell.mannlib.vitro.webapp.rdfservice.adapters.BulkUpdatingModel;
Contributor


Unused import, please remove.

Contributor Author


fixed

@@ -182,12 +184,16 @@ private static Set<Path> getPaths(String parentDir, String... strings) {
private static void readOntologyFileIntoModel(Path p, Model model) {
String format = getRdfFormat(p);
log.debug("Loading " + p);
Model memModel = ModelFactory.createDefaultModel();
Contributor


Why are you not using this here:

Suggested change
Model memModel = ModelFactory.createDefaultModel();
Model memModel = VitroModelFactory.createModel();

If we are using ModelFactory.createDefaultModel() we are not using bulk model, bulk graph, etc., right?

Contributor Author


Because it wouldn't have any effect on performance here, so there is no reason to use a more complicated model.
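The resolution of this thread can be illustrated with a plain-Java stand-in (no Jena; PersistentTarget and its methods are hypothetical names): the temporary in-memory parse buffer needs no bulk machinery, because the performance win comes from the target model receiving the whole buffer in a single bulk call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two-phase load: parse the source file
// into a cheap in-memory buffer first, then hand the whole buffer to
// the persistent target in one bulk call. Only the target's add path
// needs bulk behaviour; the buffer can be the simplest model available
// (in the PR, ModelFactory.createDefaultModel()).
class PersistentTarget {
    final List<String> triples = new ArrayList<>();
    int transactions = 0;

    // One transaction per batch, however many triples it contains.
    void addAll(List<String> batch) {
        transactions++;
        triples.addAll(batch);
    }

    // One transaction per triple: this is the slow path the PR avoids.
    void addOne(String triple) {
        transactions++;
        triples.add(triple);
    }
}
```

Under this model, loading 1000 triples through addAll costs one transaction instead of 1000, which is where the startup-time improvement comes from.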

@chenejac chenejac left a comment


Great contribution!

@brianjlowe brianjlowe left a comment


Huge change. Seems even better than pre-1.10, before the clear statements started exhibiting the extreme slowness.

I initially did main-branch timings on the content models instead of config:

Main branch, content models:
  • Load sample data: 15s
  • Clear statements: 820s
  • Remove sample data: 39s
  • Remove model: 955s

This PR: add / clear statements / remove RDF / remove model all within 13-18s depending on trial.

Main branch, config models: all operations, total craziness.

This PR: all operations 2-3 seconds.

@brianjlowe brianjlowe merged commit 020b938 into vivo-project:main Nov 25, 2022
@michel-heon (Member) commented

Performance test of the PR 347

Here is the performance test of this PR 347:

Experimental condition

The comparison is made between Bulk loading and Classical loading (implemented in version 1.13.0 and earlier)

The comparison is performed on three target configurations:

  • TDB: VIVO using the TDB triplestore local to the instance
  • Fuseki: the triplestore is remote on a Fuseki server using TDB2 (the VIVO server and the Fuseki server are instantiated on the same computer) with the following configuration:

Memory: 4.0 GiB
Java: 11.0.17
OS: Linux 5.15.0-1026-aws amd64

  • Neptune: the triplestore is remote on an AWS Neptune server of size db.t3.medium (the Neptune server and the VIVO server are both instantiated in the same VPC and the same subnet), with 1 core, 2 vCPU, and 4.0 GiB of RAM

VIVO setup on an AWS EC2 VM:

| Instance Size | vCPU | Memory (GiB) | Instance Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|
| c6i.2xlarge | 8 | 16 | EBS-Only | Up to 12.5 | Up to 10 |

Methodology

The goal of the tests is to calculate the startup time of VIVO, i.e. the time needed to load the ontologies at the initial startup of VIVO.
For the experiments, no data is loaded.
For each type of loading and each configuration, the following sequence is performed:

Initialization

  • git clone VIVO, Vitro, VIVO-languages, and Vitro-languages (branch origin/rel-1.13-maint)
  • start solr
  • Install and start Fuseki
FUSEKI=apache-jena-fuseki-3.17.0
cd ~/Download
wget https://archive.apache.org/dist/jena/binaries/$FUSEKI.tar.gz
mkdir fuseki
tar xzvf $FUSEKI.tar.gz --directory ./fuseki --strip-components=1
cd fuseki
./fuseki-server

Test for bulk loading

  1. Check out PR 347 of Vitro (gh pr checkout 347)
  2. Remove the tdb* directories in /vivo/home and /tomcat/webapps/vivo
  3. Compile VIVO
  4. Install the appropriate applicationSetup.n3 file for each configuration
    listOfApplicationSetup.zip
  5. Start VIVO
  6. In the Tomcat log file vivo.all.log, subtract the loading start date from the end date: the start date is the timestamp of the first call of RDFFilesLoader (e.g. 2022-12-12 08:51:45), which is subtracted from the timestamp of the first call of IndexHistory
  7. Compile the results
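The timestamp subtraction in step 6 can be sketched as a small helper. The class name and the sample end timestamp below are illustrative (only the start timestamp is the example given in the methodology):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch of step 6: loading time = (first IndexHistory timestamp)
// minus (first RDFFilesLoader timestamp) in vivo.all.log.
// Class name and the sample end timestamp are made up for illustration.
class LoadTime {
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static Duration loadTime(String start, String end) {
        return Duration.between(
                LocalDateTime.parse(start, FMT),
                LocalDateTime.parse(end, FMT));
    }

    public static void main(String[] args) {
        // Start value is the example from step 6; end value is invented.
        Duration d = loadTime("2022-12-12 08:51:45", "2022-12-12 08:52:13");
        System.out.printf("%02d:%02d%n", d.toMinutes(), d.toSecondsPart()); // prints 00:28
    }
}
```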

Test for classical loading:

  1. Extract Vitro (origin/rel-1.13-maint)
  2. Steps 2 to 7 are the same as in the previous test

Results

| Test number | Configuration | Loading type | Loading Time (Min:Sec) |
|---|---|---|---|
| 1 | TDB | Bulk | 00:28 |
| 2 | FUSEKI | Bulk | 05:13 |
| 3 | NEPTUNE | Bulk | 01:13 |
| 4 | TDB | Classical | 02:20 |
| 5 | FUSEKI | Classical | 16:40 |
| 6 | NEPTUNE | Classical | 03:38 |

Conclusion

In all cases (TDB/Fuseki/Neptune), 'Bulk' loading offers a real gain in loading performance and does not appear to introduce instability, although the present tests were not designed to test VIVO's stability.

ghost pushed a commit that referenced this pull request Feb 23, 2023
* created RDFServiceBulkUnionUpdater

* fix and indents

* use bulk updater for RDFServiceGraph in blankNodeFilteringGraph

* use removeAll to optimize removal of all triples from TDB models

* avoid additional serialization/deserialization cycle of removing model chunk

* fixed VitroModelFactory tests

* fixes for BulkUpdating models

* Created custom ModelCom, OntModelImpl, BulkGraphMem to avoid triple by triple transactions on add/remove model
Load n3 files from home directory into in-memory model to use bulk
loading.

* refact: simplified BulkGraphMem checks

* removed unused import