# Demo: iBench GUSToBIOSQL

In the following, we will connect to a [Neo4j Community Edition](https://neo4j.com/product/neo4j-graph-database/) instance running inside a Docker container.
To pull and start Neo4j Community Edition locally, run:
```
sudo docker run --name neo4jDemo -p 7687:7687 -p 7474:7474 -v ~/research/DTGraph/output-ibench-data:/var/lib/neo4j/import --env=NEO4J_AUTH=none neo4j:5.16.0-community
```
This requires [Docker](https://docs.docker.com/engine/install/) to be installed on your system. Once the container is running, the [Neo4j Browser](http://localhost:7474/browser/) should be reachable locally on your computer.

Replace `~/research/DTGraph` with the DTGraph installation path on your computer. The volume must be mounted in the container so that the import scripts can run.
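
If the installation path varies between machines, the command can be assembled programmatically. The following is a minimal sketch, not part of DTGraph: the helper name `neo4j_docker_command` and the example path `/opt/DTGraph` are assumptions made for illustration, and a POSIX-style path is assumed.

```python
import shlex
from pathlib import Path


def neo4j_docker_command(dtgraph_root, name="neo4jDemo",
                         image="neo4j:5.16.0-community"):
    """Build the `docker run` argument list for a given DTGraph checkout.

    Hypothetical helper: it only mirrors the command shown in this guide.
    """
    # Host directory holding the iBench data, mounted as Neo4j's import directory.
    data_dir = Path(dtgraph_root).expanduser() / "output-ibench-data"
    return [
        "docker", "run", "--name", name,
        "-p", "7687:7687",  # Bolt port
        "-p", "7474:7474",  # Neo4j Browser (HTTP) port
        "-v", f"{data_dir}:/var/lib/neo4j/import",
        "--env=NEO4J_AUTH=none",
        image,
    ]


# "/opt/DTGraph" is a made-up installation path.
cmd = neo4j_docker_command("/opt/DTGraph")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell (prefixed with `sudo` if your Docker setup requires it).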

*Note:* We have specifically tested the compatibility of this framework with Neo4j Community Edition 5.16.0, which was the latest version at the time of writing this guide.

In [1]:
from dtgraph import Neo4jGraph, Rule, Transformation
hostname = "localhost"
password = ""
uri = f"bolt://{hostname}:7687"
graph = Neo4jGraph(uri, database="neo4j", username="", password=password)
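
The connection values above are hard-coded for the local demo container, which runs with authentication disabled. As a sketch under stated assumptions, the same values could be read from environment variables instead. The variable names `NEO4J_HOST`, `NEO4J_USERNAME` and `NEO4J_PASSWORD`, and the helper `connection_settings`, are hypothetical and not defined by dtgraph.

```python
import os


def connection_settings(env=os.environ):
    """Collect connection values, falling back to the demo defaults.

    The environment variable names are assumptions made for this sketch.
    """
    hostname = env.get("NEO4J_HOST", "localhost")
    return {
        "uri": f"bolt://{hostname}:7687",
        "username": env.get("NEO4J_USERNAME", ""),
        "password": env.get("NEO4J_PASSWORD", ""),
    }


# With an empty mapping, the defaults match the values used in this demo.
settings = connection_settings({})
print(settings["uri"])
```

The resulting values can then be passed to `Neo4jGraph` in the same way as the literals in the cell above.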

For this tutorial, we will use the [GUSToBIOSQL](https://github.com/yannramusat/TPG/tree/main/input-ibench-config/gtb) data integration scenario from [iBench](https://github.com/RJMillerLab/ibench), which can be loaded into the database using the following command.

In [2]:
from dtgraph.scenarios.ibench_gtb import iBenchGUSToBIOSQL
iBenchGUSToBIOSQL.load(graph, size = 1_000)

Flushed database: Deleted 136312 nodes, deleted 46311 relationships, completed after 361 ms.
CSV:    Added 1000 labels, created 1000 nodes, set 8000 properties, created 0 relationships, completed after 477 ms.
CSV:    Added 1000 labels, created 1000 nodes, set 5000 properties, created 0 relationships, completed after 253 ms.
CSV:    Added 1000 labels, created 1000 nodes, set 4000 properties, created 0 relationships, completed after 207 ms.
CSV:    Added 1000 labels, created 1000 nodes, set 5000 properties, created 0 relationships, completed after 226 ms.
CSV:    Added 1000 labels, created 1000 nodes, set 12000 properties, created 0 relationships, completed after 276 ms.
CSV:    Added 1000 labels, created 1000 nodes, set 6000 properties, created 0 relationships, completed after 256 ms.
CSV:    Added 1000 labels, created 1000 nodes, set 5000 properties, created 0 relationships, completed after 251 ms.


In [3]:
rule1 = Rule('''
MATCH (gtn:GUSTaxonName)
MATCH (gt:GUSTaxon)
WHERE gtn.taxonID = gt.taxonID
GENERATE
(x = (gtn.taxonID):BIOSQLTaxonName {
    name = gtn.name,
    nameClass = gtn.nameClass
})-[():TAXON_HAS_NAME]->(y = (gt):BIOSQLTaxon {
    taxonID = gt.taxonID,
    ncbiTaxonID = gt.ncbiTaxonID,
    parentTaxonID = gt.parentTaxonID,
    nodeRank = gt.rank,
    geneticCode = gt.geneticCodeID,
    mitoGeneticCode = gt.mitochondialGeneticCodeID,
    leftValue = "SK1(" + gt.taxonID + ")",
    rightValue = "SK2(" + gt.taxonID + ")" 
})
''')

rule2 = Rule('''
MATCH (gt:GUSTaxon)
GENERATE
(x = (gt):BIOSQLTaxon {
    taxonID = gt.taxonID,
    ncbiTaxonID = gt.ncbiTaxonID,
    parentTaxonID = gt.parentTaxonID,
    nodeRank = gt.rank,
    geneticCode = gt.geneticCodeID,
    mitoGeneticCode = gt.mitochondialGeneticCodeID,
    leftValue = "SK1(" + gt.taxonID + ")",
    rightValue = "SK2(" + gt.taxonID + ")"
})
''')

rule3 = Rule('''
MATCH (gg:GUSGene)
GENERATE
(x = (gg):BIOSQLBioEntry {
    bioEntryID = gg.geneID,
    bioDatabaseEntry = "SK3(" + gg.geneSymbol + ")",
    taxonID = "SK4(" + gg.geneID + "," + gg.geneSymbol + "," + gg.geneCategoryID + ")", 
    name = gg.name,
    accession = gg.geneSymbol,
    identifier = gg.sequenceOntologyID,
    division = gg.geneCategoryID,
    description = gg.description,
    version = "SK5(" + gg.geneID + "," + gg.reviewStatusID + ")" 
})-[():HAS_TAXON]->(y = (gg.geneID, gg.geneSymbol, gg.geneCategoryID):BIOSQLTaxon {
    taxonID = "SK4(" + gg.geneID + "," + gg.geneSymbol + "," + gg.geneCategoryID + ")", 
    ncbiTaxonID = "SK6(" + gg.geneID + ")",
    parentTaxonID = "SK7(" + gg.geneID + ")",
    nodeRank = "SK8(" + gg.geneID + ")",
    geneticCode = "SK9(" + gg.geneID + ")",
    mitoGeneticCode = "SK10(" + gg.geneID + ")",
    leftValue = "SK11(" + gg.geneID + ")",
    rightValue ="SK12(" + gg.geneID + ")"
})
''')

rule4 = Rule('''
MATCH (ggs:GUSGeneSynonym)
MATCH (gg:GUSGene)
WHERE ggs.geneID = gg.geneID
GENERATE
(x = (ggs):BIOSQLTermSynonym {
    synonym = ggs.geneSynonymID,
    termID = ggs.geneID
})-[():HAS_SYNONYM]->(y = (gg):BIOSQLTerm {
    termID = gg.geneID,
    name = gg.name,
    definition = gg.description,
    identifier = "SK13(" + gg.geneID + ")",
    isObsolete = ggs.isObsolete,
    ontologyID = "SK15(" + gg.sequenceOntologyID + ")"
})
''')

rule5 = Rule('''
MATCH (ggt:GUSGoTerm) 
GENERATE
(x = (ggt):BIOSQLTerm {
    termID = ggt.goTermID,
    name = ggt.name,
    definition = ggt.definition,
    identifier = ggt.goID,
    isObsolete = ggt.isObsolete,
    ontologyID = "SK15(" + ggt.goTermID + ")"
})
''')

rule6 = Rule('''
MATCH (ggs:GUSGoSynonym)
MATCH (ggt:GUSGoTerm)
WHERE ggs.goTermID = ggt.goTermID
GENERATE
(x = (ggs):BIOSQLTermSynonym {
    synonym = ggs.goSynonymID,
    termID = ggs.goTermID
})-[():HAS_SYNONYM]->(y = (ggt):BIOSQLTerm {
    termID = ggt.goTermID,
    name = ggt.name,
    definition = ggt.definition,
    identifier = ggt.goID,
    isObsolete = ggt.isObsolete,
    ontologyID = "SK15(" + ggt.goTermID + ")"
})
''')

rule7 = Rule('''
MATCH (ggr:GUSGoRelationship)
MATCH (ggt1:GUSGoTerm)
WHERE ggr.parentTermID = ggt1.goTermID
MATCH (ggt2:GUSGoTerm)
WHERE ggr.childTermID = ggt2.goTermID
GENERATE
(x = (ggt1):BIOSQLTerm {
    termID = ggt1.goTermID,
    name = ggt1.name,
    definition = ggt1.definition,
    identifier = ggt1.goID,
    isObsolete = ggt1.isObsolete,
    ontologyID = "SK21(" + ggt1.goTermID + ")"
})-[():TERM_RELATIONSHIP {
    termRelationshipID = ggr.goRelationshipID,
    subjectTermID = ggr.parentTermID,
    predicateTermID = ggr.goRelationshipTypeID,
    objectTermID = ggr.childTermID,
    ontologyID = "SK20(" + ggr.goRelationshipID + ")"
}]->(y = (ggt2):BIOSQLTerm {
    termID = ggt2.goTermID,
    name = ggt2.name,
    definition = ggt2.definition,
    identifier = ggt2.goID,
    isObsolete = ggt2.isObsolete,
    ontologyID = "SK22(" + ggt2.goTermID + ")"
})
''')

rule8 = Rule('''
MATCH (gg:GUSGene)
GENERATE
(x = (gg):BIoSQLTerm {
    termID = gg.geneID,
    name = gg.name,
    definition = gg.description,
    identifier = "SK13(" + gg.geneID + ")", 
    isObsolete = "SK18(" + gg.geneID + "," + gg.reviewStatusID + ")",
    ontologyID = "SK14(" + gg.sequenceOntologyID + ")"
})
''')
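
Several property values in the rules above are built by string concatenation, for example `"SK4(" + gg.geneID + "," + gg.geneSymbol + "," + gg.geneCategoryID + ")"`. These strings appear to act as Skolem-style terms: placeholder values, determined entirely by their arguments, for target attributes that have no direct source counterpart. The plain-Python sketch below mimics that construction. The helper `skolem_term` and the sample values are made up for illustration and are not part of dtgraph.

```python
def skolem_term(index, *args):
    """Return a string such as 'SK4(7,BRCA1,2)' for the given arguments."""
    return f"SK{index}(" + ",".join(str(a) for a in args) + ")"


# Made-up sample row standing in for a GUSGene node.
gene = {"geneID": 7, "geneSymbol": "BRCA1", "geneCategoryID": 2}

taxon_id = skolem_term(
    4, gene["geneID"], gene["geneSymbol"], gene["geneCategoryID"]
)
print(taxon_id)  # SK4(7,BRCA1,2)

# Equal arguments always yield the same term, which is why rule3 can write
# the same taxonID on the BIOSQLBioEntry node and on the BIOSQLTaxon node
# it links to.
assert taxon_id == skolem_term(4, 7, "BRCA1", 2)
```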

In [4]:
gtb_transform = Transformation([rule1, rule2, rule3, rule4, rule5, rule6, rule7, rule8], with_diagnose = False)
gtb_transform.apply_on(graph)

Index: Added 0 index, completed after 0 ms.
Rule: Added 2472 labels, created 1236 nodes, set 11854 properties, created 618 relationships, completed after 143 ms.
Rule: Added 764 labels, created 382 nodes, set 8382 properties, created 0 relationships, completed after 86 ms.
Rule: Added 4000 labels, created 2000 nodes, set 20000 properties, created 1000 relationships, completed after 266 ms.
Rule: Added 2629 labels, created 1000 nodes, set 10000 properties, created 1000 relationships, completed after 143 ms.
Rule: Added 2000 labels, created 1000 nodes, set 7000 properties, created 0 relationships, completed after 75 ms.
Rule: Added 2000 labels, created 1000 nodes, set 10000 properties, created 1000 relationships, completed after 148 ms.
Rule: Added 0 labels, created 0 nodes, set 18000 properties, created 1000 relationships, completed after 229 ms.
Rule: Added 1000 labels, created 0 nodes, set 6000 properties, created 0 relationships, completed after 97 ms.


1187
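
The log above can be tallied with a few lines of Python. The sketch below copies the log lines verbatim and sums the created nodes, created relationships, and reported times. The regular expressions are assumptions based only on the log format shown here.

```python
import re

# Log lines copied from the output above.
LOG = """\
Index: Added 0 index, completed after 0 ms.
Rule: Added 2472 labels, created 1236 nodes, set 11854 properties, created 618 relationships, completed after 143 ms.
Rule: Added 764 labels, created 382 nodes, set 8382 properties, created 0 relationships, completed after 86 ms.
Rule: Added 4000 labels, created 2000 nodes, set 20000 properties, created 1000 relationships, completed after 266 ms.
Rule: Added 2629 labels, created 1000 nodes, set 10000 properties, created 1000 relationships, completed after 143 ms.
Rule: Added 2000 labels, created 1000 nodes, set 7000 properties, created 0 relationships, completed after 75 ms.
Rule: Added 2000 labels, created 1000 nodes, set 10000 properties, created 1000 relationships, completed after 148 ms.
Rule: Added 0 labels, created 0 nodes, set 18000 properties, created 1000 relationships, completed after 229 ms.
Rule: Added 1000 labels, created 0 nodes, set 6000 properties, created 0 relationships, completed after 97 ms.
"""

# Per-rule counters: (nodes created, relationships created).
counts = [
    (int(n), int(r))
    for n, r in re.findall(r"created (\d+) nodes.*?created (\d+) relationships", LOG)
]
total_nodes = sum(n for n, _ in counts)
total_relationships = sum(r for _, r in counts)

# Sum of all reported times, including the index step.
total_ms = sum(int(ms) for ms in re.findall(r"completed after (\d+) ms", LOG))

print(total_nodes, total_relationships, total_ms)  # 6618 4618 1187
```

The summed times equal the final value `1187`, which suggests that `apply_on` returns the total elapsed time in milliseconds. This is an inference from the numbers, not something stated by the library. Similarly, if the log lines follow the rule order, the 618 relationships of the first rule and the 382 nodes of the second add up to the 1000 `GUSTaxon` nodes loaded earlier, which would be consistent with the second rule creating nodes only for taxa that the first rule had not already produced.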