This prototype demonstrates several levels of integrating and exploring the freely available Semantic MEDLINE Database (SemMedDB) in Stardog. The database is maintained and updated annually by the U.S. National Library of Medicine (NLM). Its content is extracted automatically by SemRep from biomedical texts (PubMed citations). As such, it may support use cases such as relation discovery, hypothesis generation, and clinical decision making.
The primary purpose of SemMedDB is to capture RDF-like ternary relationships (subject-predicate-object) between biomedical entities extracted from scientific sources. The SemMedDB database consists of multiple tables. These collect the metadata about processed PubMed citations (table `CITATION`), the content of the respective citations, i.e. the title or abstract (table `SENTENCE`), a number of formalized statements derived from this input (table `PREDICATION`), and auxiliary information on these predications (table `PREDICATION_AUX`) that allows, e.g., the extraction quality to be assessed. Table `COREFERENCE` lists the back references (optionally) generated by SemRep with anaphora resolution. Additional information on entities involved in predications is covered by table `ENTITY`. The following data attributes were evaluated for this demo:
This demo focuses on showcasing the integration and querying of (relational) biomedical data in Stardog. It leverages the `PREDICATION` table to retrieve predications relating a biomedical entity (subject) to another entity (object) by means of a predicate such as `AFFECTS` or its negation (`NEG_AFFECTS`). The recognition confidence scores of subject and object entities are retrieved from table `PREDICATION_AUX`. The `ENTITY` table is integrated on demand to add gene-specific information, i.e. the Entrez Gene ID (column `GENE_ID`) and name (column `GENE_NAME`) of the entities described in predications.
Table | File size | Rows |
---|---|---|
PREDICATION | 13 GB | 97,972,561 |
PREDICATION_AUX | 12 GB | 97,972,554 |
ENTITY | 123 GB | 1,369,837,426 |
The target graph model `semmeddb.ttl` lifts the SemMedDB predications to RDF in a straightforward way. Instances of the `sdb:Entity` class are related by the `sdb:predicate` relationship (derived from `owl:ObjectProperty`). The `sdb:SemanticType` class represents the concept behind an entity. The `sdb:Predication` class maintains references to the elements of an individual predication, e.g. in order to evaluate and trace its validity.
The following UML object diagram depicts the `sdb:associated_with` relation of the Cytokine gene (CUI `C1333196`, semantic type `gngm`, Gene or Genome) and the GRASP gene (CUI `C1425726`) to the Asthma disease (CUI `C0004096`, semantic type `dsyn`, Disease or Syndrome):
Entities in `PREDICATION` are identified by a concept unique identifier (CUI) in the column `SUBJECT_CUI` or `OBJECT_CUI`. Most CUIs are atomic strings (`C0039258`), while 3.28% of subject and 3% of object CUIs comprise a set of pipe-delimited parts. Samples:
"C0668084"
"C0668084|2011|2149|8856|79581|145624"
"1523|4791|4940|6490|9733|22974|27044|84164"
The entities are further described by a readable name (column `SUBJECT_NAME` or `OBJECT_NAME`) and their semantic type (column `SUBJECT_SEMTYPE` or `OBJECT_SEMTYPE`). The names apparently follow the CUI format: each pipe-delimited part of a compound name corresponds to the part at the same position in the compound CUI:
"Receptor, PAR-1"
"Receptor, PAR-1|MARK2|F2R|NR1I2|SLC52A2|PWAR1"
"CUX1|NFKB2|OAS3|PMEL|SART3|TPX2|SND1|ASCC2"
Depending on the interpretation of these composite identifiers, alternative mappings to Stardog's knowledge graph may apply:
- The CUI is considered an opaque identifier; no particular handling applies (current solution).
- The parts correspond to aliases of the given (first) entity and should be linked to it.
- The parts correspond to independent entities; a new entity resource should be created for each.

The latter two approaches may require a multi-pass integration.
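The position-wise correspondence between compound CUIs and compound names can be checked with a few lines of Python. The following is a minimal sketch, not part of the demo project: the function name `split_compound` is hypothetical, and the code only pairs the parts without deciding whether they are aliases or independent entities.

```python
from typing import List, Tuple


def split_compound(cui: str, name: str) -> List[Tuple[str, str]]:
    """Pair each pipe-delimited CUI part with the name part at the same position."""
    cui_parts = cui.split("|")
    name_parts = name.split("|")
    if len(cui_parts) != len(name_parts):
        # The positional convention is only an observation; fail loudly if it breaks.
        raise ValueError(f"part count mismatch: {cui!r} vs {name!r}")
    return list(zip(cui_parts, name_parts))


# Sample values taken from the PREDICATION examples above.
pairs = split_compound(
    "C0668084|2011|2149|8856|79581|145624",
    "Receptor, PAR-1|MARK2|F2R|NR1I2|SLC52A2|PWAR1",
)
for part_id, part_name in pairs:
    print(part_id, part_name)
```

Running such a check over all compound rows would show whether the positional convention holds throughout the table before committing to one of the mappings.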
A similar ambiguity pertains to the `ENTITY` table. Some (TBD: ratio) of the entities carry a list of comma-separated Entrez Gene IDs, which are maintained by the National Center for Biotechnology Information (NCBI), such as PIF1 (80119) and DCD (117159):
CUI | GENE_ID
'C0016904','80119,117159'
'C0085828','2353,2354,3725,3726,3727'
'C0085828','2149,7012,7037,7296,10587,23671'
The SemMedDB maintainers (semmed@nlm.nih.gov) have been asked to document the (implicit) relationship between the individual ID parts.
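Until that relationship is documented, the gene ID lists can at least be split mechanically. The sketch below parses the quoted sample rows with Python's `csv` module and collects the individual IDs per CUI. The function name `gene_ids_by_cui` is hypothetical, and treating every ID as belonging to the CUI is an assumption, not a confirmed interpretation.

```python
import csv
from typing import Dict, Iterable, List

# Sample rows as shown above: single-quoted fields, comma-separated.
sample = [
    "'C0016904','80119,117159'",
    "'C0085828','2353,2354,3725,3726,3727'",
    "'C0085828','2149,7012,7037,7296,10587,23671'",
]


def gene_ids_by_cui(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect the individual Entrez Gene IDs listed for each CUI."""
    result: Dict[str, List[str]] = {}
    # quotechar="'" keeps the commas inside the GENE_ID field intact.
    for cui, gene_ids in csv.reader(lines, quotechar="'"):
        result.setdefault(cui, []).extend(gene_ids.split(","))
    return result


genes = gene_ids_by_cui(sample)
print(genes["C0016904"])
```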
The semantic type (category) of an entity at the subject or object position of a predication consists of a string identifier (e.g. `gngm`). NLM provides a delimiter-separated file to resolve their labels. Sample:
...
ftcn|T169|Functional Concept
genf|T045|Genetic Function
geoa|T083|Geographic Area
gngm|T028|Gene or Genome
...
Applying the virtual graph import command to the mapping `srdef.ttl`, this metadata is turned into class definitions, each semantic type being defined as a subclass of the generic `sdb:SemanticType` class. For management purposes, the model is maintained separately from the data within a dedicated named graph (`urn:stardog:demo:semmeddb:model`). The following sample is retrieved by issuing the query `retrieve_semantic_types.rq`:
sdb:gngm rdfs:subClassOf sdb:SemanticType .
sdb:gngm rdfs:label "Gene or Genome"@en .
sdb:gngm sdb:tui "T028" .
...
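To illustrate what this transformation amounts to, the following Python sketch turns one line of the delimiter-separated file into the three statements shown above. It is only an illustration of the resulting triples, not the actual `srdef.ttl` mapping. The function name `semantic_type_to_turtle` is hypothetical, and prefix declarations are omitted.

```python
from typing import List


def semantic_type_to_turtle(line: str) -> List[str]:
    """Render one 'abbreviation|TUI|label' line as Turtle statements."""
    abbrev, tui, label = line.strip().split("|")
    # Escape characters that would break a Turtle string literal.
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"sdb:{abbrev} rdfs:subClassOf sdb:SemanticType .",
        f'sdb:{abbrev} rdfs:label "{safe_label}"@en .',
        f'sdb:{abbrev} sdb:tui "{tui}" .',
    ]


statements = semantic_type_to_turtle("gngm|T028|Gene or Genome")
print("\n".join(statements))
```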
Kilicoglu et al. provide definitions and usage examples of some of the predicates referenced by the `PREDICATION.PREDICATE` column. 66 distinct predicates were retrieved from the database. The mapping `predicates.sms` consolidates their inconsistent spelling and creates appropriate `owl:ObjectProperty` definitions, making each predicate an `rdfs:subPropertyOf` the generic `sdb:predicate`. The following sample definition is retrieved from the `urn:stardog:demo:semmeddb:model` graph by issuing the query `retrieve_semantic_types.rq`:
sdb:associated_with rdfs:subPropertyOf sdb:predicate .
sdb:associated_with rdfs:label "associated_with" .
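The consolidation step can be pictured with a small Python sketch. It assumes that the inconsistencies are limited to letter case and surrounding whitespace, which is not confirmed here; the actual rules live in `predicates.sms`. The function names and the spelling variants in the example are made up for illustration.

```python
from typing import List


def normalize_predicate(raw: str) -> str:
    """Assumed normalization: trim whitespace and lowercase the predicate."""
    return raw.strip().lower()


def predicate_to_turtle(raw: str) -> List[str]:
    """Render the property definition for one (normalized) predicate."""
    name = normalize_predicate(raw)
    return [
        f"sdb:{name} rdfs:subPropertyOf sdb:predicate .",
        f'sdb:{name} rdfs:label "{name}" .',
    ]


# Hypothetical spelling variants collapsing into one property.
variants = ["ASSOCIATED_WITH", "Associated_with ", "associated_with"]
distinct = sorted({normalize_predicate(v) for v in variants})
print(distinct)
print("\n".join(predicate_to_turtle("ASSOCIATED_WITH")))
```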
An active MySQL instance (8.0.17) and Stardog (7.0.2) with a MySQL JDBC driver installed in `$STARDOG_HOME/server/dbms/` or at `$STARDOG_EXT` are assumed. All paths (file names) below are relative to this project's home directory (`demo-semmeddb`).
SemMedDB is distributed as individual MySQL import files. Please download, unpack, and import the SemMedDB PREDICATION table (2.51 GB), the PREDICATION_AUX table (3.15 GB), and (optionally) the ENTITY table (38.2 GB).
- Set up the database `semmeddb4`:
mysql>
CREATE DATABASE semmeddb4 CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'tester'@'%' IDENTIFIED BY 'stardog';
GRANT ALL PRIVILEGES ON semmeddb4 . * TO 'tester'@'%';
FLUSH PRIVILEGES;
- Load the SQL import files into `semmeddb4`:
(a) Import via a background process on command line (recommended):
bash>
nohup mysql -u tester --password=stardog semmeddb4 < semmedVER40_R_PREDICATION.sql &
nohup mysql -u tester --password=stardog semmeddb4 < semmedVER40_R_PREDICATION_AUX.sql &
nohup mysql -u tester --password=stardog semmeddb4 < semmedVER40_R_ENTITY.sql &
or (b) via the mysql console:
mysql>
USE semmeddb4 ;
SOURCE semmedVER40_R_PREDICATION.sql;
# etc.
Clean up an obvious error: numeric predicates are invalid (3 rows affected):
mysql>
USE semmeddb4 ;
DELETE FROM predication WHERE predicate IN ("127", "1532", "241") ;
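Instead of hard-coding the three values, the invalid predicates could also be detected programmatically. The sketch below demonstrates the idea against an in-memory SQLite table that stands in for the MySQL `predication` table. The toy rows, the `predication_id` column, and the function name `numeric_predicates` are assumptions made for this example only.

```python
import sqlite3

# In-memory stand-in for the MySQL table, filled with toy rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predication (predication_id INTEGER, predicate TEXT)")
conn.executemany(
    "INSERT INTO predication VALUES (?, ?)",
    [(1, "AFFECTS"), (2, "127"), (3, "NEG_AFFECTS"), (4, "1532"), (5, "241")],
)


def numeric_predicates(connection: sqlite3.Connection) -> list:
    """Return all distinct predicates that consist of digits only."""
    rows = connection.execute("SELECT DISTINCT predicate FROM predication")
    return sorted(p for (p,) in rows if p.isdigit())


invalid = numeric_predicates(conn)
deleted = 0
if invalid:
    # Parameterized IN list, one placeholder per invalid predicate.
    placeholders = ", ".join("?" for _ in invalid)
    cursor = conn.execute(
        f"DELETE FROM predication WHERE predicate IN ({placeholders})", invalid
    )
    deleted = cursor.rowcount
print(invalid, deleted)
```

Against the real database the same approach would need a MySQL client library and the actual table definition; only the detection logic carries over.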
Once the MySQL database is ready, configure the Stardog server. It is recommended to increase the available process memory, e.g. `export STARDOG_SERVER_JAVA_ARGS="-Xms30g -Xmx30g -XX:MaxDirectMemorySize=30g"`, and to specify the memory mode in `$STARDOG_HOME/stardog.properties`: `memory.mode=write_optimized`. Restart the server to apply the changes and verify the effective settings via `stardog-admin server status`.
Create the database `semmeddb`, dedicating the named graph `urn:stardog:demo:semmeddb:model` to model maintenance:
stardog-admin db create --name semmeddb --options reasoning.schema.graphs=urn:stardog:demo:semmeddb:model --
Load the ontology file and the dynamic, data-generated parts of the model into the graph `urn:stardog:demo:semmeddb:model`:
# Main ontology
stardog data add --format TURTLE --named-graph urn:stardog:demo:semmeddb:model semmeddb model/semmeddb.ttl
# Extension for demo purposes (classes and predicates used with reasoning)
stardog data add --format TURTLE --named-graph urn:stardog:demo:semmeddb:model semmeddb model/semmeddb_ext.ttl
# Import generated type definitions
stardog-admin virtual import --named-graph urn:stardog:demo:semmeddb:model semmeddb mappings/srdef.properties mappings/srdef.ttl data/SemanticTypes_2018AB.txt
# Import generated predicate definitions
stardog-admin virtual import --named-graph urn:stardog:demo:semmeddb:model semmeddb --format SMS2 mappings/predicates.properties mappings/predicates.sms
Import the individual data sets into the default graph (as a background process):
# Standard and reified statements from PREDICATION
stardog-admin virtual import semmeddb --format SMS2 mappings/semmeddb.properties mappings/predication.sms
# Predication score from PREDICATION_AUX
stardog-admin virtual import semmeddb --format SMS2 mappings/semmeddb.properties mappings/predication_aux.sms
# Gene ID / NCBI gene reference from ENTITY
stardog-admin virtual import semmeddb --format SMS2 mappings/semmeddb.properties mappings/entity.sms
Reasoning rules help to separate modelling concerns. By recursively correlating facts, they compose intermediate abstractions, e.g. they classify resources based on their properties. These abstractions may in turn be reused in queries, making error-prone enumerations (`VALUES`) or explicit definitions unnecessary:
- classification based on a matching label (`classification_by_label_copd.rule`)
- classification based on confidence score (`predication_confidence_by_score.rule`)
Sample queries selecting genes that:
- are associated with Asthma or COPD, using an explicit `VALUES` enumeration (`genes_asthma_or_copd_values.rq`)
- are associated with Asthma or COPD (`genes_asthma_or_copd.rq`)
- are associated with Asthma but not COPD (`genes_asthma_not_copd.rq`)
- are associated with Asthma and COPD (`genes_asthma_and_copd.rq`)
Please see the `queries` folder for further examples.
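The "or", "but not", and "and" variants of the sample queries correspond to set union, difference, and intersection over the genes associated with each disease. The following Python sketch shows that correspondence on placeholder data. The gene names are made up and do not reflect any SemMedDB content.

```python
# Placeholder result sets; real values would come from the SPARQL queries.
asthma_genes = {"GENE_A", "GENE_B", "GENE_C"}
copd_genes = {"GENE_B", "GENE_D"}

asthma_or_copd = asthma_genes | copd_genes   # associated with either disease
asthma_not_copd = asthma_genes - copd_genes  # associated with Asthma only
asthma_and_copd = asthma_genes & copd_genes  # associated with both diseases

print(sorted(asthma_or_copd))
print(sorted(asthma_not_copd))
print(sorted(asthma_and_copd))
```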
The Integrity Constraint Validation (ICV) subsystem validates graph data entering and stored in a Stardog database against user-defined rules expressed in a variety of formats (e.g. SHACL, OWL, SPARQL). The queries above intentionally rely on the "positive" version of SemMedDB predicates (`associated_with`) because a number of predications were found to be contradicted by predications using the negated version of the same predicate (`neg_associated_with`).
The ICV rule `contradicting_predications_count.ttl`, applied to the `semmeddb` database, revealed 112 negated predications that conflict with their positive counterparts:
stardog icv explain semmeddb constraints/contradicting_predications_count.ttl
+-----------+
| neg_count |
+-----------+
| 112 |
+-----------+
The rule `contradicting_predications.ttl` provides a listing of the conflicting predication pairs.
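The kind of conflict these rules look for can be sketched in plain Python. The example below assumes that a contradiction means a predicate and its `neg_`-prefixed counterpart asserted for the same subject and object. The actual constraint definitions may differ, and the triples are placeholder data.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def contradictions(triples: Iterable[Triple]) -> List[Tuple[Triple, Triple]]:
    """Return (positive, negated) pairs that share subject and object."""
    asserted: Set[Triple] = set(triples)
    pairs = []
    for s, p, o in sorted(asserted):
        if p.startswith("neg_"):
            continue  # only start from the positive statement
        negated = (s, "neg_" + p, o)
        if negated in asserted:
            pairs.append(((s, p, o), negated))
    return pairs


# Placeholder identifiers, not real CUIs.
toy = [
    ("C_X", "associated_with", "C_Y"),
    ("C_X", "neg_associated_with", "C_Y"),
    ("C_Z", "associated_with", "C_Y"),
    ("C_Z", "neg_affects", "C_Y"),
]
conflicts = contradictions(toy)
print(len(conflicts))
for positive, negated in conflicts:
    print(positive, "<->", negated)
```

The count corresponds to what the counting rule reports, and the list of pairs corresponds to what the listing rule returns.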