Model operations performance improvements #347

Merged
merged 10 commits into vivo-project:main on Nov 25, 2022

Conversation

@litvinovg (Contributor) commented Nov 11, 2022

VIVO 3605 GitHub issue
VIVO 3792 GitHub issue

What does this pull request do?

Implemented several changes to improve the performance of model modifications.

What's new?

RDFServiceBulkUnionUpdater, which forwards remove(model), removeAll(), and add(model) calls to the enclosed updaters.
RDFServiceModelMaker calls removeAll() on the model instead of removing triples one by one.
Removed an excessive deserialization/serialization cycle in BulkUpdatingOntModel.performRemoveAll.
Created bulk versions of ModelCom, OntModelImpl, and GraphMem to avoid triple-by-triple transactions on batch add/remove.
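For orientation, the forwarding idea behind the new union updater can be sketched in plain Java. This is only an illustrative stand-in: TripleSet, BulkUpdater, InMemoryUpdater, and UnionBulkUpdater are hypothetical names, not the actual Vitro classes, and the real implementation operates on Jena models through the RDFService.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-in for a Jena Model so this sketch is self-contained; in Vitro
// the real org.apache.jena.rdf.model.Model is what gets passed around.
class TripleSet {
    final Set<String> triples = new HashSet<>();
    TripleSet(String... ts) { for (String t : ts) triples.add(t); }
}

// Illustrative forwarding pattern (interface and class names are
// assumptions, not the actual Vitro API): bulk add/remove/removeAll
// calls are forwarded whole to every enclosed updater instead of being
// broken into one transaction per triple.
interface BulkUpdater {
    void add(TripleSet model);
    void remove(TripleSet model);
    void removeAll();
}

class InMemoryUpdater implements BulkUpdater {
    final Set<String> store = new HashSet<>();
    public void add(TripleSet m) { store.addAll(m.triples); }
    public void remove(TripleSet m) { store.removeAll(m.triples); }
    public void removeAll() { store.clear(); }
}

class UnionBulkUpdater implements BulkUpdater {
    private final List<BulkUpdater> members;
    UnionBulkUpdater(List<BulkUpdater> members) { this.members = members; }
    public void add(TripleSet m) { members.forEach(u -> u.add(m)); }
    public void remove(TripleSet m) { members.forEach(u -> u.remove(m)); }
    public void removeAll() { members.forEach(BulkUpdater::removeAll); }
}
```

The point of the pattern is that each member receives the whole batch in a single call, so a backing triplestore can apply it in one transaction.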

How should this be tested?

  • Delete the tdbContentModels and tdbModels folders in the VIVO home directory, start VIVO, and check the logs to see how long it takes to load all n3 files from the home directory
  • Open ingest tools -> Manage Jena Models -> Configuration models
  • Create a new model
  • Add the attached sample data (sample-data.zip) to the model
  • Click "clear statements" and measure the time
  • Check that the model is empty
  • Add the attached sample data to the model again
  • Remove the sample data from the model using the same "add/remove RDF data" dialog and measure the time
  • Add the attached sample data to the model again
  • Click "remove" and measure the time

Interested parties

@bkampe
@brianjlowe
@chenejac

…y triple transactions on add/remove model

Load n3 files from home directory into in-memory model to use bulk
loading.
@chenejac chenejac left a comment


@litvinovg this looks like a great improvement in performance. I have tested it, and loading of models is significantly faster. I also tried to validate that everything is loaded into the triplestore using SPARQL queries: I checked for academic degrees and continents, and the number of classes is the same. There is only one comment from me regarding the code review.

@litvinovg litvinovg dismissed a stale review via 3f49039 November 21, 2022 15:47

@chenejac chenejac left a comment


@litvinovg two more comments from me

@@ -18,6 +18,8 @@
import java.util.TreeSet;

import edu.cornell.mannlib.vitro.webapp.i18n.selection.SelectedLocale;
import edu.cornell.mannlib.vitro.webapp.rdfservice.adapters.BulkUpdatingModel;
Contributor


Unused import, please remove.

Contributor Author


fixed

@@ -182,12 +184,16 @@ private static Set<Path> getPaths(String parentDir, String... strings) {
private static void readOntologyFileIntoModel(Path p, Model model) {
String format = getRdfFormat(p);
log.debug("Loading " + p);
Model memModel = ModelFactory.createDefaultModel();
Contributor


Why are you not using this here:

Suggested change
Model memModel = ModelFactory.createDefaultModel();
Model memModel = VitroModelFactory.createModel();

If we are using ModelFactory.createDefaultModel() we are not using bulk model, bulk graph, etc., right?

Contributor Author


Because it wouldn't have any effect on performance here, so there is no reason to use a more complicated model.
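The resolution of this thread can be illustrated with a plain-Java stand-in (no Jena; PersistentTarget and its methods are hypothetical names): the temporary in-memory parse buffer needs no bulk machinery, because the performance win comes from the target model receiving the whole buffer in a single bulk call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two-phase load: parse the source file
// into a cheap in-memory buffer first, then hand the whole buffer to
// the persistent target in one bulk call. Only the target's add path
// needs bulk behaviour; the buffer can be the simplest model available
// (in the PR, ModelFactory.createDefaultModel()).
class PersistentTarget {
    final List<String> triples = new ArrayList<>();
    int transactions = 0;

    // One transaction per batch, however many triples it contains.
    void addAll(List<String> batch) {
        transactions++;
        triples.addAll(batch);
    }

    // One transaction per triple: this is the slow path the PR avoids.
    void addOne(String triple) {
        transactions++;
        triples.add(triple);
    }
}
```

Under this model, loading 1000 triples through addAll costs one transaction instead of 1000, which is where the startup-time improvement comes from.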

@chenejac chenejac left a comment


Great contribution!

@brianjlowe brianjlowe left a comment


Huge change. Seems even better than pre-1.10, before the clear statements started exhibiting the extreme slowness.

I initially did main-branch timings on the content models instead of config:

Main branch, content models:
  • Load sample data: 15s
  • Clear statements: 820s
  • Remove sample data: 39s
  • Remove model: 955s

This PR: add / clear statements / remove RDF / remove model all within 13-18s depending on trial.

Main branch, config models: all operations, total craziness.

This PR: all operations 2-3 seconds.

@brianjlowe brianjlowe merged commit 020b938 into vivo-project:main Nov 25, 2022
@michel-heon (Member) commented

Performance test of the PR 347

Here is the performance test of this PR 347:

Experimental condition

The comparison is made between Bulk loading and Classical loading (implemented in version 1.13.0 and earlier)

The comparison is performed on three target configurations:

  • TDB: VIVO using the TDB triplestore local to the instance
  • Fuseki: the triplestore is remote on a Fuseki server using TDB2 (the VIVO server and the Fuseki server are instantiated on the same computer) with the following configuration:

Memory: 4.0 GiB
Java: 11.0.17
OS: Linux 5.15.0-1026-aws amd64

  • Neptune: the triplestore is remote on an AWS Neptune server of size db.t3.medium (the Neptune server and the VIVO server are both instantiated in the same VPC and the same subnet), with 1 core, 2 vCPU, and 4.0 GiB of RAM

VIVO setup on an AWS EC2 VM:

| Instance Size | vCPU | Memory (GiB) | Instance Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|
| c6i.2xlarge | 8 | 16 | EBS-Only | Up to 12.5 | Up to 10 |

Methodology

The goal of the tests is to calculate the startup time of VIVO, i.e. the time needed to load the ontologies at the initial startup of VIVO.
For the experiments, no data is loaded.
For each type of loading and each configuration, the following sequence is performed:

Initialization

  • git clone VIVO, Vitro, VIVO-languages, and Vitro-languages (branch origin/rel-1.13-maint)
  • start solr
  • Install and start Fuseki
FUSEKI=apache-jena-fuseki-3.17.0
cd ~/Download
wget https://archive.apache.org/dist/jena/binaries/$FUSEKI.tar.gz
mkdir fuseki
tar xzvf $FUSEKI.tar.gz --directory ./fuseki --strip-components=1
cd fuseki
./fuseki-server

Test for bulk loading

  1. Check out PR 347 of Vitro (gh pr checkout 347)
  2. Remove the tdb* directories in /vivo/home and /tomcat/webapps/vivo
  3. Compile VIVO
  4. Install the appropriate applicationSetup.n3 file for each configuration
    listOfApplicationSetup.zip
  5. Start VIVO
  6. In the Tomcat log file vivo.all.log, subtract the loading start date from the end date: the start date is the timestamp of the first call of RDFFilesLoader (e.g. 2022-12-12 08:51:45), which is subtracted from the timestamp of the first call of IndexHistory
  7. Compile the results
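The timestamp subtraction in step 6 can be sketched as a small helper. The class name and the sample end timestamp below are illustrative (only the start timestamp is the example given in the methodology):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch of step 6: loading time = (first IndexHistory timestamp)
// minus (first RDFFilesLoader timestamp) in vivo.all.log.
// Class name and the sample end timestamp are made up for illustration.
class LoadTime {
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static Duration loadTime(String start, String end) {
        return Duration.between(
                LocalDateTime.parse(start, FMT),
                LocalDateTime.parse(end, FMT));
    }

    public static void main(String[] args) {
        // Start value is the example from step 6; end value is invented.
        Duration d = loadTime("2022-12-12 08:51:45", "2022-12-12 08:52:13");
        System.out.printf("%02d:%02d%n", d.toMinutes(), d.toSecondsPart()); // prints 00:28
    }
}
```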

Test for classical loading:

  1. Extract Vitro (origin/rel-1.13-maint)
  2. Steps 2 to 7 are the same as in the previous test

Results

| Test number | Configuration | Loading type | Loading Time (Min:Sec) |
|---|---|---|---|
| 1 | TDB | Bulk | 00:28 |
| 2 | FUSEKI | Bulk | 05:13 |
| 3 | NEPTUNE | Bulk | 01:13 |
| 4 | TDB | Classical | 02:20 |
| 5 | FUSEKI | Classical | 16:40 |
| 6 | NEPTUNE | Classical | 03:38 |

Conclusion

In all cases (TDB/Fuseki/Neptune), 'Bulk' loading offers a real gain in loading performance and does not appear to introduce instability, although the present tests were not designed to test VIVO's stability.

ghost pushed a commit that referenced this pull request Feb 23, 2023
* created RDFServiceBulkUnionUpdater

* fix and indents

* use bulk updater for RDFServiceGraph in blankNodeFilteringGraph

* use removeAll to optimize removal of all triples from TDB models

* avoid additional serialization/deserialization cycle of removing model chunk

* fixed VitroModelFactory tests

* fixes for BulkUpdating models

* Created custom ModelCom, OntModelImpl, BulkGraphMem to avoid triple by triple transactions on add/remove model
Load n3 files from home directory into in-memory model to use bulk
loading.

* refact: simplified BulkGraphMem checks

* removed unused import