This repository has been archived by the owner on Dec 23, 2022. It is now read-only.

stardog-union/demo-semmeddb

Overview

This prototype demonstrates several levels of integrating and exploring the freely available Semantic MEDLINE Database (SemMedDB) in Stardog. The database is maintained and annually updated by the U.S. National Library of Medicine (NLM). SemMedDB is extracted automatically by SemRep, which parses biomedical texts (PubMed citations). As such, it may support use cases such as relation discovery, hypothesis generation, and clinical decision making.

SemMedDB v3 Schema

The primary purpose of SemMedDB is to capture RDF-like ternary relationships between biomedical entities (subject-predicate-object) extracted from scientific sources. The SemMedDB database consists of multiple tables. These collect the metadata about processed PubMed citations (table CITATION), the content of the respective citations, i.e. the title or abstract (table SENTENCE), a number of formalized statements derived from this input (table PREDICATION), and auxiliary information on these predications (table PREDICATION_AUX) that allows, for example, the extraction quality to be assessed. Table COREFERENCE lists the back references (optionally) generated by SemRep with anaphora resolution. Additional information on entities involved in predications is covered by table ENTITY. The following data attributes were evaluated for this demo:

SemMedDB v3 schema

This demo focuses on showcasing the integration and querying of (relational) biomedical data in Stardog. It leverages the PREDICATION table to retrieve predications relating a biomedical entity (subject) to another entity (object) by means of a predicate such as AFFECTS or its negation (NEG_AFFECTS). The recognition confidence scores of subject and object entities are retrieved from table PREDICATION_AUX. The ENTITY table is integrated on demand to add gene-specific information, i.e. the Entrez Gene ID (column GENE_ID) and name (column GENE_NAME) of the entities described in predications.

Table           | File size | Rows
PREDICATION     | 13 GB     | 97,972,561
PREDICATION_AUX | 12 GB     | 97,972,554
ENTITY          | 123 GB    | 1,369,837,426

Data Model

The target graph model semmeddb.ttl trivially lifts the SemMedDB predications to RDF. Instances of the sdb:Entity class are related by the sdb:predicate relationship (derived from owl:ObjectProperty). The sdb:SemanticType class represents the concept behind an entity. The sdb:Predication class maintains references to elements of an individual predication, e.g. in order to evaluate and trace its validity.

SemMedDB Graph Model

The following UML object diagram depicts the sdb:associated_with relation of the Cytokine gene (CUI C1333196, semantic type gngm, Gene or Genome) and GRASP gene (CUI C1425726) to the Asthma disease (CUI C0004096, semantic type dsyn, Disease or Syndrome):

Sample of the SemMedDB Graph Model

Identifiers

Entities in PREDICATION are identified by a concept unique identifier (CUI) in the column SUBJECT_CUI or OBJECT_CUI. Most CUIs are atomic strings (C0039258), while 3.28% of subject CUIs and 3% of object CUIs comprise a set of pipe-delimited parts, for example:

"C0668084"
"C0668084|2011|2149|8856|79581|145624"
"1523|4791|4940|6490|9733|22974|27044|84164"

The entities are further described by a readable name (column SUBJECT_NAME or OBJECT_NAME) and their semantic type (column SUBJECT_SEMTYPE or OBJECT_SEMTYPE). The naming convention apparently follows the CUI format: each part of a compound name corresponds to the part at the same position in the compound CUI:

"Receptor, PAR-1"
"Receptor, PAR-1|MARK2|F2R|NR1I2|SLC52A2|PWAR1"
"CUX1|NFKB2|OAS3|PMEL|SART3|TPX2|SND1|ASCC2"

Depending on the interpretation of these composite identifiers, alternative mappings to Stardog's knowledge graph may apply:

  1. The CUI is considered an opaque identifier, and no particular handling applies (current solution)
  2. The parts correspond to aliases of the given (first) entity and should be linked to it
  3. The parts correspond to independent entities, and a new entity resource should be created for each

Approaches 2 and 3 may require a multi-pass integration.
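
The interpretations above differ mainly in how the pipe-delimited parts are split and paired with their names. The following is a minimal sketch, not part of the demo, that pairs each CUI part with the name part at the same position. It assumes that both columns always carry the same number of parts in the same order, which the sample data suggests but NLM has not confirmed; the function name `split_composite` is hypothetical.

```python
from typing import List, Tuple


def split_composite(cui: str, name: str) -> List[Tuple[str, str]]:
    """Pair each pipe-delimited CUI part with the name part at the same position.

    Assumption: both columns hold the same number of parts in the same order.
    """
    cui_parts = cui.split("|")
    name_parts = name.split("|")
    if len(cui_parts) != len(name_parts):
        raise ValueError(f"part count mismatch: {cui!r} vs {name!r}")
    return list(zip(cui_parts, name_parts))


# Sample values taken from the listings above.
pairs = split_composite(
    "C0668084|2011|2149|8856|79581|145624",
    "Receptor, PAR-1|MARK2|F2R|NR1I2|SLC52A2|PWAR1",
)
for part_id, part_name in pairs:
    print(part_id, part_name)
```

An atomic CUI yields a single pair, so the same function could serve all three interpretations: approach 1 ignores the split, approach 2 links the later pairs to the first one, and approach 3 creates one entity per pair.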

Similar ambiguity pertains to the ENTITY table. Some (TBD: ratio) of the entities indicate a list of comma-separated Entrez Gene IDs, maintained by the National Center for Biotechnology Information (NCBI), such as PIF1 (80119) and DCD (117159):

CUI        | GENE_ID
'C0016904' | '80119,117159'
'C0085828' | '2353,2354,3725,3726,3727'
'C0085828' | '2149,7012,7037,7296,10587,23671'

The SemMedDB maintainers (semmed@nlm.nih.gov) have been asked to document the (implicit) relationship between the individual ID parts.

Classes (Semantic Types)

The semantic type (category) of an entity at the subject or object position of a predication consists of a string identifier (e.g. gngm). NLM provides a delimiter-separated file to resolve their labels, for example:

...
ftcn|T169|Functional Concept
genf|T045|Genetic Function
geoa|T083|Geographic Area
gngm|T028|Gene or Genome
...

Applying the virtual graph import command to the mapping srdef.ttl turns this metadata into class definitions, with each semantic type defined as a subclass of the generic sdb:SemanticType class. For management purposes, the model is maintained separately from the data in a dedicated named graph (urn:stardog:demo:semmeddb:model). The following sample is retrieved by issuing the query retrieve_semantic_types.rq:

sdb:gngm rdfs:subClassOf sdb:SemanticType .
sdb:gngm rdfs:label "Gene or Genome"@en .
sdb:gngm sdb:tui "T028" .
...
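
To make the lifting step concrete, the sketch below reproduces the shape of the statements shown above from one record of the delimiter-separated file. It is an illustration only: the demo performs this step with the srdef.ttl mapping, not with Python, and the function name `srdef_line_to_turtle` is hypothetical. The sketch assumes exactly three pipe-separated fields per record and leaves the `sdb:` and `rdfs:` prefix declarations out.

```python
from typing import List


def srdef_line_to_turtle(line: str) -> List[str]:
    """Turn one 'abbreviation|TUI|label' record into three Turtle statements.

    Assumption: the record has exactly three pipe-separated fields.
    """
    abbrev, tui, label = (field.strip() for field in line.split("|"))
    # Escape characters that would break a Turtle string literal.
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"sdb:{abbrev} rdfs:subClassOf sdb:SemanticType .",
        f'sdb:{abbrev} rdfs:label "{safe_label}"@en .',
        f'sdb:{abbrev} sdb:tui "{tui}" .',
    ]


for statement in srdef_line_to_turtle("gngm|T028|Gene or Genome"):
    print(statement)
```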

Predicates

Kilicoglu et al. provide definitions and usage examples of some predicates referenced by the PREDICATION.PREDICATE column. 66 distinct predicates were retrieved from the database. The mapping predicates.sms consolidates their inconsistent spelling and creates appropriate owl:ObjectProperty definitions, making each predicate rdfs:subPropertyOf the generic sdb:predicate. Sample definition retrieved from urn:stardog:demo:semmeddb:model graph by issuing the query retrieve_semantic_types.rq:

sdb:associated_with rdfs:subPropertyOf sdb:predicate .
sdb:associated_with rdfs:label "associated_with" .
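
The consolidation logic itself lives in predicates.sms and is not reproduced here. As a rough sketch of what such a normalization could look like, the example below assumes that lowercasing and trimming are sufficient and that purely numeric values are extraction errors to be skipped; both the rules and the function names are assumptions made for illustration.

```python
from typing import List, Optional


def predicate_local_name(raw: str) -> Optional[str]:
    """Return a normalized property name, or None for an invalid predicate.

    Assumption: normalization means trimming whitespace and lowercasing.
    """
    cleaned = raw.strip().lower()
    # Empty or purely numeric values are treated as extraction errors.
    if not cleaned or cleaned.isdigit():
        return None
    return cleaned


def predicate_to_turtle(raw: str) -> List[str]:
    """Emit the two statements shown above for one raw PREDICATE value."""
    name = predicate_local_name(raw)
    if name is None:
        return []
    return [
        f"sdb:{name} rdfs:subPropertyOf sdb:predicate .",
        f'sdb:{name} rdfs:label "{name}" .',
    ]


for statement in predicate_to_turtle("ASSOCIATED_WITH"):
    print(statement)
```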

Demo Environment

An active MySQL instance (8.0.17) and Stardog (7.0.2) with a MySQL JDBC-driver installed in $STARDOG_HOME/server/dbms/ or at $STARDOG_EXT are assumed. Any path expressions (file names) below are relative to this project's home directory (demo-semmeddb).

Database

SemMedDB is distributed as individual MySQL import files. Please download, unpack and import the SemMedDB PREDICATION table (2.51 GB), the PREDICATION_AUX table (3.15 GB) and, optionally, the ENTITY table (38.2 GB).

  1. Set up the database semmeddb4:
mysql>
	CREATE DATABASE semmeddb4 CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
	CREATE USER 'tester'@'%' IDENTIFIED BY 'stardog';
	GRANT ALL PRIVILEGES ON semmeddb4.* TO 'tester'@'%';
	FLUSH PRIVILEGES;
  2. Load the SQL import files into semmeddb4:

(a) Import via a background process on command line (recommended):

bash>
 	nohup mysql -u tester --password=stardog semmeddb4 < semmedVER40_R_PREDICATION.sql &
 	nohup mysql -u tester --password=stardog semmeddb4 < semmedVER40_R_PREDICATION_AUX.sql &
	nohup mysql -u tester --password=stardog semmeddb4 < semmedVER40_R_ENTITY.sql &

or (b) via the mysql console:

mysql>
	USE semmeddb4 ;
	SOURCE semmedVER40_R_PREDICATION.sql;
	# etc.

Clean up an obvious error: numeric predicates are invalid (3 rows affected):

mysql> 
	USE semmeddb4 ;
	DELETE FROM predication WHERE predicate IN ('127', '1532', '241') ;

Stardog server

Once the MySQL database is ready, configure the Stardog server. It is recommended to increase the available process memory, e.g. export STARDOG_SERVER_JAVA_ARGS="-Xms30g -Xmx30g -XX:MaxDirectMemorySize=30g", and to specify the memory mode in $STARDOG_HOME/stardog.properties: memory.mode=write_optimized. Restart the server to apply the changes and verify the effective settings via stardog-admin server status.

Create the database semmeddb dedicating the named graph urn:stardog:demo:semmeddb:model for model maintenance:

stardog-admin db create --name  semmeddb --options reasoning.schema.graphs=urn:stardog:demo:semmeddb:model --

Load the ontology file and dynamic, data-generated parts of the model into the graph urn:stardog:demo:semmeddb:model:

# Main ontology
stardog data add --format TURTLE --named-graph urn:stardog:demo:semmeddb:model semmeddb model/semmeddb.ttl

# Extension for demo purposes (classes and predicates used with reasoning)
stardog data add --format TURTLE --named-graph urn:stardog:demo:semmeddb:model semmeddb model/semmeddb_ext.ttl
	
# Import generated type definitions
stardog-admin virtual import --named-graph urn:stardog:demo:semmeddb:model semmeddb mappings/srdef.properties mappings/srdef.ttl data/SemanticTypes_2018AB.txt

# Import generated predicate definitions
stardog-admin virtual import --named-graph urn:stardog:demo:semmeddb:model semmeddb --format SMS2 mappings/predicates.properties mappings/predicates.sms  

Import the individual data sets into the default graph (as background process):

# Standard and reified statements from PREDICATION
stardog-admin virtual import semmeddb --format SMS2  mappings/semmeddb.properties mappings/predication.sms

# Predication score from PREDICATION_AUX
stardog-admin virtual import semmeddb --format SMS2  mappings/semmeddb.properties mappings/predication_aux.sms

# Gene ID / NCBI gene reference from ENTITY
stardog-admin virtual import semmeddb --format SMS2 mappings/semmeddb.properties mappings/entity.sms
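
Because each import can run for a long time, it may be convenient to script the three data imports so that they run one after another and stop at the first failure. The sketch below only assembles the command lines listed above and hands them to `subprocess`; the helper names are hypothetical, and it assumes that `stardog-admin` is on the PATH and that the script is started from the project's home directory.

```python
import subprocess
from typing import List

DATABASE = "semmeddb"
PROPERTIES = "mappings/semmeddb.properties"
MAPPINGS = [
    "mappings/predication.sms",      # standard and reified statements
    "mappings/predication_aux.sms",  # predication scores
    "mappings/entity.sms",           # gene IDs / NCBI gene references
]


def build_import_command(mapping: str) -> List[str]:
    """Assemble one 'stardog-admin virtual import' call as an argument list."""
    return [
        "stardog-admin", "virtual", "import", DATABASE,
        "--format", "SMS2", PROPERTIES, mapping,
    ]


def run_imports() -> None:
    """Run the imports sequentially; check=True aborts on the first failure."""
    for mapping in MAPPINGS:
        subprocess.run(build_import_command(mapping), check=True)


# Print the commands without executing them.
for mapping in MAPPINGS:
    print(" ".join(build_import_command(mapping)))
```

Calling `run_imports()` executes the commands; the loop at the end only prints them so they can be reviewed first.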

Rules

Reasoning rules help to separate modelling concerns. By recursively correlating facts, they compose intermediate abstractions, e.g. they classify resources based on their properties. These abstractions may in turn be reused in queries, making error-prone enumerations (VALUES) or explicit definitions unnecessary.

Queries

Sample queries selecting genes that:

See the queries folder for further examples.

Constraints

The Integrity Constraint Validation (ICV) subsystem validates graph data entering and stored in a Stardog database against user-defined rules expressed in a variety of formats (e.g. SHACL, OWL, SPARQL). The queries above intentionally rely on the "positive" version of SemMedDB predicates (associated_with) because a number of predications were found to be contradicted by predications that use the negated version of the same predicate (neg_associated_with).

Sample constraints

The ICV rule contradicting_predications_count.ttl, applied to the semmeddb database, revealed 112 conflicting negated predications:

stardog icv explain semmeddb constraints/contradicting_predications_count.ttl
+-----------+
| neg_count |
+-----------+
| 112       |
+-----------+

The rule contradicting_predications.ttl provides a listing of the conflicting predication pairs.
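
The idea behind such a check can be sketched outside Stardog as well. The example below is an assumption about the logic, not a translation of the ICV rule: it treats a predication as contradicted when the same subject and object are also related by the predicate carrying the `neg_` prefix. The function name is hypothetical, and the negated statement in the sample data is made up for illustration.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject CUI, predicate, object CUI)


def contradicting_pairs(triples: Iterable[Triple]) -> List[Tuple[Triple, Triple]]:
    """Return (positive, negated) pairs that share subject and object."""
    seen: Set[Triple] = set(triples)
    pairs = []
    for subject, predicate, obj in sorted(seen):
        if predicate.startswith("neg_"):
            positive = (subject, predicate[len("neg_"):], obj)
            if positive in seen:
                pairs.append((positive, (subject, predicate, obj)))
    return pairs


sample = [
    ("C1333196", "associated_with", "C0004096"),
    ("C1333196", "neg_associated_with", "C0004096"),  # invented contradiction
    ("C1425726", "associated_with", "C0004096"),
]
conflicts = contradicting_pairs(sample)
print(len(conflicts))
```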
