***
***

<img width='700' src="https://user-images.githubusercontent.com/8030363/108961534-b9a66980-7634-11eb-96e2-cc46589dcb8c.png" style="vertical-align:middle">

## Pre-Knowledge Graph Build Ontology Cleaning
***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**
  

## Purpose

This notebook serves as a script to help prepare ontologies before they are ingested into the knowledge graph build algorithm. This script performs the following steps:  
1. [Clean Ontologies](#clean-ontologies)  
2. [Merge Ontologies](#merge-ontologies)  

## Assumptions and Dependencies  
  
**Assumptions:**   
- Directory of Imported Ontologies has been populated ➞ `./resources/ontologies`     

**Dependencies:**   
- <u>Scripts</u>: This notebook utilizes several helper functions from the following scripts:  
  - [utility scripts](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils)  
  - [ontology_cleaning.py](https://github.com/callahantiff/PheKnowLator/blob/master/builds/ontology_cleaning.py) 
- <u>Software</u>: [OWLTools](https://github.com/owlcollab/owltools)  
- <u>Data</u>: [`Merged_gene_rna_protein_identifiers.pkl`](https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/Merged_gene_rna_protein_identifiers.pkl), which is automatically downloaded to the `./resources/ontologies` directory     

<br>

Details on the data utilized in this script can be found on the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki. Data can be downloaded from [this](https://console.cloud.google.com/storage/browser/pheknowlator/release_v2.0.0?project=pheknowlator) dedicated Google Cloud Storage Bucket. Please note that all build data are freely available and organized by release and build date. 

<br>

***
### CLEAN ONTOLOGIES <a class="anchor" id="clean-ontologies"></a>
***

The ontology cleaning step includes the following error checks, each of which is explained below and applied to individual ontologies, the set of merged ontologies, or both: (1) Value Errors, (2) Identifier Errors, (3) Duplicate and Obsolete Entities, (4) Punning Errors, and (5) Entity Normalization Errors.

<br>

### Value Errors  
*** 
**Level:** `individual ontology`; `merged-ontology`    

**Description:** This check utilizes the [`owlready2`](https://pypi.org/project/Owlready2/) library to read in each of the ontologies. This library is strict and will catch a wide variety of value errors. 

**Solution:** Parse the error message using the provided `ErrorType` and line number, then repair the offending triple. For a `ValueError`, the incorrectly typed input is re-typed.

*Example Findings*  
The [Cell Line Ontology](http://www.clo-ontology.org/) yields the following error message:

```python
ValueError: invalid literal for int() with base 10: '永生的乳腺衍生细胞系细胞'
...
OwlReadyOntologyParsingError: RDF/XML parsing error in file clo_with_imports.owl, line 10970, column 99.
```

This tells us that we need to repair the triple containing the Literal '永生的乳腺衍生细胞系细胞' by removing it and redefining it as a `string` rather than the `int` it is currently defined as. This is currently noted as an issue in the [Cell Line Ontology's](http://www.clo-ontology.org/) GitHub repo ([issue #48](https://github.com/CLO-ontology/CLO/issues/48)). 
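
As a rough illustration of the re-typing step, the sketch below uses only the standard library and represents triples as plain tuples rather than RDFLib objects. The `xsd:integer` datatype, the `obo:CLO_example` subject, and the `ex:count` predicate are assumptions made for the example; the notebook's actual repair is performed by `fixes_ontology_parsing_errors()`.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def retype_invalid_integers(triples):
    """Re-type integer literals whose lexical value cannot be parsed as an int."""
    repaired = []
    for subj, pred, (value, datatype) in triples:
        if datatype == XSD + "integer":
            try:
                int(value)
            except ValueError:
                # the value is not a number, so treat it as a plain string
                datatype = XSD + "string"
        repaired.append((subj, pred, (value, datatype)))
    return repaired


# hypothetical triples: (subject, predicate, (lexical value, datatype))
triples = [
    ("obo:CLO_example", "rdfs:label", ("永生的乳腺衍生细胞系细胞", XSD + "integer")),
    ("obo:CLO_example", "ex:count", ("42", XSD + "integer")),
]
repaired = retype_invalid_integers(triples)
```

Only the first literal is re-typed; the second remains an integer because its lexical value parses cleanly.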

<br>

### Identifier Errors  
*** 
**Level:** `individual ontology`; `merged-ontology`  

**Description:** This check verifies consistency of identifier prefixes. For example, we want to find identifiers that are incorrectly formatted like occurrences of `PRO_XXXXXXX` which should be `PR_XXXXXXX`.

**Solution:** Incorrectly formatted class identifiers are updated. This is difficult to automate, and the check should be revisited whenever new ontologies are added to the `PheKnowLator` build. Currently, the code below checks and logs any hits, but only fixes the following known error: in the Vaccine Ontology, `PRO` which should be `PR`.

*Example Findings*  
Running this check revealed mislabeling of `2` [pROtein Ontology](https://proconsortium.org/) identifiers in the [Vaccine Ontology](http://www.violinet.org/vaccineontology/) (see [this](https://github.com/vaccineontology/VO/issues/4) GitHub issue).
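
A minimal sketch of this kind of prefix repair is shown below. It is not the code used by `fixes_identifier_errors()`; the `KNOWN_PREFIX_FIXES` table and `fix_identifier` helper are hypothetical names, and the sketch assumes identifiers are OBO PURLs ending in a numeric local ID.

```python
import re

OBO = "http://purl.obolibrary.org/obo/"

# known bad prefix -> correct prefix (only the PRO -> PR case is described above)
KNOWN_PREFIX_FIXES = {"PRO": "PR"}

# matches e.g. http://purl.obolibrary.org/obo/PRO_000000001
BAD_ID = re.compile(
    re.escape(OBO) + "(" + "|".join(KNOWN_PREFIX_FIXES) + r")_(\d+)$"
)


def fix_identifier(iri):
    """Return the IRI with a known bad prefix replaced, or unchanged otherwise."""
    match = BAD_ID.match(iri)
    if match is None:
        return iri
    prefix, local_id = match.groups()
    return "{}{}_{}".format(OBO, KNOWN_PREFIX_FIXES[prefix], local_id)


fixed = fix_identifier(OBO + "PRO_000000001")
untouched = fix_identifier(OBO + "PR_000000001")
```

Keeping the fixes in an explicit table mirrors the approach described above: unknown patterns are left alone rather than guessed at.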

<br>

### Obsolete and/or Deprecated Entities
*** 
**Level:** `individual ontology`; `merged-ontology`  

**Description:** Verify that the ontology only contains current content.

**Solution:** All obsolete classes and any triples that they participate in are removed from an ontology.
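
The removal can be sketched as follows, under the assumption that obsolete classes are flagged with `owl:deprecated "true"` and that triples are plain tuples. The class names are hypothetical, and the notebook's `removes_deprecated_obsolete_entities()` may use additional criteria.

```python
OWL_DEPRECATED = "http://www.w3.org/2002/07/owl#deprecated"


def remove_deprecated(triples):
    """Drop every triple whose subject or object is a deprecated class."""
    deprecated = {s for s, p, o in triples if p == OWL_DEPRECATED and o == "true"}
    kept = [
        (s, p, o) for s, p, o in triples
        if s not in deprecated and o not in deprecated
    ]
    return kept, deprecated


# hypothetical triples for illustration
triples = [
    ("ex:OldClass", OWL_DEPRECATED, "true"),
    ("ex:OldClass", "rdfs:label", "obsolete thing"),
    ("ex:Child", "rdfs:subClassOf", "ex:OldClass"),
    ("ex:Child", "rdfs:label", "current thing"),
]
kept, deprecated = remove_deprecated(triples)
```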

<br>

### Normalization Errors  
*** 
**Level:** `merged-ontology`

These checks are performed at the merged-ontology level. There are two types of checks:  

<u>Normalize Existing Ontology Classes</u>  
  - **Description:** Checks for inconsistencies in ontology classes that overlap with non-ontology entity identifiers (e.g. if HP includes `HGNC` identifiers, but PheKnowLator utilizes `Entrez` identifiers). 

  - **Solution:** While there are other types of identifiers, we currently focus primarily on resolving errors involving the genomic identifiers, since we have a master dictionary we can use ([`Merged_gene_rna_protein_identifiers.pkl`](https://storage.googleapis.com/pheknowlator/release_v2.0.0/current_build/data/processed_data/Merged_gene_rna_protein_identifiers.pkl)). This check can be updated in future iterations to include other types of identifiers, but given our detailed examination of the `v2.0.0` ontologies, these were the identifier types that needed repair.

<u>Normalize Duplicate Ontology Concepts</u>  
  - **Description:** Make sure that all classes that represent the same entity are connected to each other. For example, consider the following: the [Sequence Ontology](http://www.sequenceontology.org/), [ChEBI](https://www.ebi.ac.uk/chebi), and [PRotein Ontology](https://proconsortium.org/) all include terms for protein, but none of these classes are connected to each other.

  - **Solution:** Choose a primary concept for each duplicate scenario and make each duplicate concept an `rdfs:subClassOf` the primary concept. In the future, this check could be improved by leveraging [KBOOM](https://www.biorxiv.org/content/10.1101/048843v3).

*Example Findings*  
The following classes occur in the ontologies used in the current build and have to be normalized so that there are not multiple versions of the same concept:  

- Gene: [VO](http://purl.obolibrary.org/obo/OGG_0000000002)  
  - <u>Solution</u>: Make the `VO` imported `OGG` class a subclass of the `SO` gene term  

- Protein: [SO](http://purl.obolibrary.org/obo/SO_0000104), [PRO](http://purl.obolibrary.org/obo/PR_000000001), [ChEBI](http://purl.obolibrary.org/obo/CHEBI_36080) 
  - <u>Solution</u>: Make the `CHEBI` and `PRO` classes subclasses of the `SO` protein term  
  
- Disorder: [VO](http://purl.obolibrary.org/obo/OGMS_0000045)  
  - <u>Solution</u>: Make the `VO` imported `OGMS` class a subclass of the `MONDO` disease term  

- Antigen: [VO](http://purl.obolibrary.org/obo/OBI_1110034)  
  - <u>Solution</u>: Make the `VO` imported `OBI` class a subclass of the `CHEBI` antigen term  

- Gelatin: [VO](http://purl.obolibrary.org/obo/VO_0003030) 
  - <u>Solution</u>: Make the `VO` class a subclass of the `CHEBI` gelatin term 

- Hormone: [VO](http://purl.obolibrary.org/obo/FMA_12278) 
  - <u>Solution</u>: Make the `VO` imported `FMA` class a subclass of the `CHEBI` hormone term
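
To make the duplicate-concept solution concrete, the sketch below links the two protein duplicates listed above to the `SO` protein term. It uses plain tuples instead of an RDFLib graph, and `link_duplicates` is a hypothetical helper rather than the implementation behind `normalizes_duplicate_classes()`.

```python
RDFS_SUBCLASSOF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
OBO = "http://purl.obolibrary.org/obo/"

# duplicate class -> primary class, taken from the protein example above
DUPLICATE_TO_PRIMARY = {
    OBO + "PR_000000001": OBO + "SO_0000104",
    OBO + "CHEBI_36080": OBO + "SO_0000104",
}


def link_duplicates(triples, mapping):
    """Add a subClassOf triple from each duplicate to its primary concept."""
    existing = set(triples)
    added = []
    for duplicate, primary in mapping.items():
        triple = (duplicate, RDFS_SUBCLASSOF, primary)
        if triple not in existing:  # avoid inserting the same link twice
            added.append(triple)
    return list(triples) + added


graph = [(OBO + "PR_000000001", RDFS_SUBCLASSOF, OBO + "SO_0000104")]
normalized = link_duplicates(graph, DUPLICATE_TO_PRIMARY)
```

Because existing links are skipped, running the function a second time on its own output adds nothing.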

<br>

### Punning Errors 
*** 
**Level:** `individual ontology`; `merged-ontology`

**Description:** [Punning](https://www.w3.org/2007/OWL/wiki/Punning) or redeclaration errors occur for a few different reasons, but the most prevalent cause observed in the ontologies used in `PheKnowLator` is an `owl:ObjectProperty` being incorrectly redeclared as an `owl:AnnotationProperty` or an `owl:Class` also being defined as an `owl:ObjectProperty`. 

**Solution:** Consistent with the solution described [here](https://github.com/oborel/obo-relations/issues/130), for `owl:ObjectProperty` redeclarations we remove all `owl:AnnotationProperty` declarations. For all `owl:Class` redeclarations, we remove all `owl:ObjectProperty` redeclarations. 

*Example Findings* 
The [Cell Line Ontology](http://www.clo-ontology.org/) had 7 object properties that were illegally redeclared and triggered punning errors. More details regarding these errors are shown below. 

```bash
2020-12-03 20:57:15,616 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002091 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002091>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002091>))]
2020-12-03 20:57:15,619 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/BFO_0000062 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/BFO_0000062>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/BFO_0000062>))]
2020-12-03 20:57:15,620 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/BFO_0000063 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/BFO_0000063>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/BFO_0000063>))]
2020-12-03 20:57:15,620 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002222 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002222>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002222>))]
2020-12-03 20:57:15,620 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0000087 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0000087>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0000087>))]
2020-12-03 20:57:15,620 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002161 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002161>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002161>))]
```

From this message, we can see that we need to remove the `owl:AnnotationProperty` redeclarations of the following `owl:ObjectProperty` entities: `RO_0002091`, `BFO_0000062`, `BFO_0000063`, `RO_0002222`, `RO_0000087`, `RO_0002161`. There were also 2 classes (i.e. `CLO_0054407` and `CLO_0054409`) defined as being both an `owl:Class` and an `owl:ObjectProperty`. This is currently noted as an issue in the Cell Line Ontology's GitHub repo ([issue #43](https://github.com/CLO-ontology/CLO/issues/43)).
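
The two removal rules from the solution can be sketched with plain tuples as shown below. This is an illustrative simplification, not the code in `fixes_punning_errors()`; `resolve_punning` is a hypothetical helper, and only the type declarations are modeled.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL = "http://www.w3.org/2002/07/owl#"
OBO = "http://purl.obolibrary.org/obo/"


def resolve_punning(triples):
    """Drop the redundant type declaration for each illegally punned entity."""
    declared = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            declared.setdefault(s, set()).add(o)

    drop = set()
    for entity, types in declared.items():
        # object property redeclared as annotation property: keep the object property
        if OWL + "ObjectProperty" in types and OWL + "AnnotationProperty" in types:
            drop.add((entity, RDF_TYPE, OWL + "AnnotationProperty"))
        # class redeclared as object property: keep the class
        if OWL + "Class" in types and OWL + "ObjectProperty" in types:
            drop.add((entity, RDF_TYPE, OWL + "ObjectProperty"))
    return [t for t in triples if t not in drop]


triples = [
    (OBO + "RO_0002091", RDF_TYPE, OWL + "ObjectProperty"),
    (OBO + "RO_0002091", RDF_TYPE, OWL + "AnnotationProperty"),
    (OBO + "CLO_0054407", RDF_TYPE, OWL + "Class"),
    (OBO + "CLO_0054407", RDF_TYPE, OWL + "ObjectProperty"),
]
cleaned = resolve_punning(triples)
```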

<br>

***  
### Set-Up Environment
***  

In [1]:
# install any required modules from notebooks/requirements.txt (comment out if already installed)
import sys
!{sys.executable} -m pip install -r requirements.txt



In [1]:
# to ensure builds/*.py files and pkt_kg scripts can be reached from notebooks dir
import sys
sys.path.append('../')

#### Load Needed Modules

In [2]:
# import needed libraries
import datetime
import glob
import os
import pickle
import shutil

from rdflib import Graph, Namespace
from tqdm import tqdm

# import script containing helper functions
from pkt_kg.utils import * 
from builds.ontology_cleaning import *

#### Set Global Variables

In [3]:
# set up environment variables
write_location = '../resources/ontologies'
knowledge_graphs_location = '../resources/knowledge_graphs'
processed_data_location = '../resources/processed_data/'

# set global namespaces
schema = Namespace('http://www.w3.org/2001/XMLSchema#')
obo = Namespace('http://purl.obolibrary.org/obo/')
oboinowl = Namespace('http://www.geneontology.org/formats/oboInOwl#')

#### Helper Functions

In [4]:
# functions needed for processing ontologies
def logically_verifies_cleaned_ontologies(graph, temp_dir, file_location, owltools_location):
    """Logically verifies an ontology by running the ELK deductive logic reasoner. Before running the reasoner
    the instantiated RDFLib object is saved locally.

    Args:
        graph: An RDFLib Graph object containing data.
        temp_dir: A string specifying the directory to read from and write to.
        file_location: The name of the file to read and write to in the temp_dir directory.
        owltools_location: A string specifying the location of OWLTools (included in pkt_kg, no need to download).
    
    Returns:
        None.
    """

    print('Logically Verifying Ontology')

    # save graph in order to run reasoner
    filename = temp_dir + '/' + file_location
    graph.serialize(destination=filename, format='xml')
    
    # run reasoner
    command = "{} {} --reasoner {} --run-reasoner --assert-implied -o {}"
    return_code = os.system(command.format(owltools_location, filename, 'elk', filename))
    if return_code != 0: raise ValueError('Reasoner Finished with Errors.')

    return None

<br>

***
### INDIVIDUAL ONTOLOGIES <a class="anchor" id="individual-ontologies"></a>
***

**Purpose:** This section focuses on cleaning the individual ontologies, which consists of fixing: (1) Parsing Errors; (2) Identifier Errors; (3) Deprecated and Obsolete Classes; and (4) Punning Errors.


**Inputs:** A directory (`write_location`) containing ontology files (`.owl`)

**Outputs:** A directory (`write_location`) containing cleaned ontology files (`.owl`)  


***

### ⚡ Important ⚡

The `OWL API`, when running the ELK reasoner, seems to add back some of the errors that this script removes.

- <u>Example 1</u>: In the Vaccine Ontology, we fix prefix errors where `"PR"` is recorded as `"PRO"`. If you save the ontology without running the reasoner and reload it, the fix remains. If you open it after running ELK, the fix has been reversed. 


- <u>Example 2</u>: When we create the human subset of the Protein Ontology we verify that it contains only a single large connected component. If you re-calculate the number of connected components after running ELK, there will be three components.  

Luckily, the merged ontologies are not logically verified using a reasoner, thus the version used to build knowledge graphs remains free of these errors.
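
One way to re-check the component count mentioned in Example 2 is sketched below with a small union-find over subject/object pairs. This treats the graph as undirected and uses plain tuples; it is an assumption about how such a check could be done, and the statistics reported by `updates_ontology_reporter()` may be computed differently.

```python
def count_connected_components(triples):
    """Count connected components, treating each triple as an undirected edge."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for subj, _, obj in triples:
        root_s, root_o = find(subj), find(obj)
        if root_s != root_o:
            parent[root_s] = root_o  # merge the two components

    return len({find(node) for node in list(parent)})


# hypothetical triples: two separate clusters of nodes
triples = [
    ("a", "rdfs:subClassOf", "b"),
    ("b", "rdfs:subClassOf", "c"),
    ("x", "rdfs:subClassOf", "y"),
]
n_components = count_connected_components(triples)
```

Running such a count before and after the reasoner step would show whether the number of components has changed.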

In [8]:
# instantiate and set-up class
ont_data = OntologyCleaner('', '', '', write_location)

# updating ontology info dictionary
ont_data.ontology_info = {k.split('/')[-1]: {} for k, v in ont_data.ontology_info.items()}

# set owl tools location
ont_data.owltools_location = '../pkt_kg/libs/owltools'

In [11]:
#version1
#ont_data.ontology_info.keys()

dict_keys(['chebi-merged-20210624.owl', 'ro_with_imports_AD_mods.owl', 'pr_with_imports.owl', 'chebi_lite_with_imports.owl', 'po_with_imports.owl', 'mondo_with_imports.owl', 'pw_with_imports.owl', 'go_with_imports.owl'])

In [9]:
ont_data.ontology_info.keys()

dict_keys(['clo_with_imports.owl', 'chebi_lite_merged_with_imports.owl', 'oae_merged_with_imports.owl', 'ro_with_imports_AD_mods.owl', 'pr_with_imports.owl', 'chebi_lite_with_imports.owl', 'ext_with_imports.owl', 'po_with_imports.owl', 'so_with_imports.owl', 'hp_with_imports.owl', 'mondo_with_imports.owl', 'pw_with_imports.owl', 'go_with_imports.owl'])

In [7]:
ont_data.temp_dir

'../resources/ontologies'

In [10]:
import rdflib
# clean data
for ont in ont_data.ontology_info.keys():
    print('\n#### Processing Ontology: {} ####'.format(ont.upper()))
    ont_data.ont_file_location = ont
    try:
        ont_data.ont_graph = Graph().parse(ont_data.temp_dir + '/' + ont_data.ont_file_location)
    except rdflib.exceptions.ParserError as e:
        # parsing failed: rewrite '_:genid' labels as '_genid' in place, then retry
        command = "sed -i 's/_:genid/_genid/g' {}"
        return_code = os.system(command.format(ont_data.temp_dir + '/' + ont_data.ont_file_location))
        ont_data.ont_graph = Graph().parse(ont_data.temp_dir + '/' + ont_data.ont_file_location)
    # get starting statistics
    ont_data.updates_ontology_reporter()
    
    # clean ontologies
    ont_data.fixes_ontology_parsing_errors()
    ont_data.fixes_identifier_errors()
    ont_data.removes_deprecated_obsolete_entities()
    ont_data.fixes_punning_errors()
    
    # run cleaned ontology through the elk reasoner
    logically_verifies_cleaned_ontologies(ont_data.ont_graph,
                                          ont_data.temp_dir,
                                          ont_data.ont_file_location,
                                          ont_data.owltools_location)

    # verifies no errors caused during cleaning
#     ontology_file_formatter(ont_data.temp_dir, '/' + ont_data.ont_file_location, ont_data.owltools_location)
    
    # read in cleaned, verified, and updated ontology containing inference
    print('Reading in Cleaned Ontology -- Needed to Calculate Final Statistics')
    try:
        ont_data.ont_graph = Graph().parse(ont_data.temp_dir + '/' + ont_data.ont_file_location)
    except rdflib.exceptions.ParserError as e:
        # parsing failed: rewrite '_:genid' labels as '_genid' in place, then retry
        command = "sed -i 's/_:genid/_genid/g' {}"
        return_code = os.system(command.format(ont_data.temp_dir + '/' + ont_data.ont_file_location))
        ont_data.ont_graph = Graph().parse(ont_data.temp_dir + '/' + ont_data.ont_file_location)
    # get finishing statistics
    ont_data.updates_ontology_reporter()


#### Processing Ontology: CLO_WITH_IMPORTS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 1422155/1422155 [00:36<00:00, 38529.76it/s]


Calculating Connected Components
Finding Parsing Errors




Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


0it [00:00, ?it/s]


Resolving Punning Errors


100%|██████████| 1422155/1422155 [00:08<00:00, 172949.24it/s]
100%|██████████| 402213/402213 [00:06<00:00, 58962.82it/s]


Logically Verifying Ontology
Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 1422133/1422133 [00:33<00:00, 42372.12it/s]


Calculating Connected Components

#### Processing Ontology: CHEBI_LITE_MERGED_WITH_IMPORTS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 1472055/1472055 [00:34<00:00, 42552.50it/s]


Calculating Connected Components
Finding Parsing Errors
Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


100%|██████████| 18506/18506 [00:03<00:00, 6165.51it/s]


Resolving Punning Errors


100%|██████████| 1397951/1397951 [00:07<00:00, 177521.22it/s]
100%|██████████| 223121/223121 [00:04<00:00, 52807.24it/s]


Logically Verifying Ontology
Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 1397975/1397975 [00:40<00:00, 34759.83it/s]


Calculating Connected Components

#### Processing Ontology: OAE_MERGED_WITH_IMPORTS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 96018/96018 [00:01<00:00, 67096.26it/s]


Calculating Connected Components
Finding Parsing Errors




Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


100%|██████████| 9/9 [00:00<00:00, 2615.99it/s]


Resolving Punning Errors


100%|██████████| 95949/95949 [00:00<00:00, 189123.84it/s]
100%|██████████| 18251/18251 [00:00<00:00, 54771.74it/s]


Logically Verifying Ontology
Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 96011/96011 [00:01<00:00, 66776.95it/s]


Calculating Connected Components

#### Processing Ontology: RO_WITH_IMPORTS_AD_MODS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 7687/7687 [00:00<00:00, 85068.20it/s]

Calculating Connected Components





Finding Parsing Errors




Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


0it [00:00, ?it/s]


Resolving Punning Errors


100%|██████████| 7687/7687 [00:00<00:00, 221123.77it/s]
100%|██████████| 1508/1508 [00:00<00:00, 39831.80it/s]

Logically Verifying Ontology





Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 7687/7687 [00:00<00:00, 80998.70it/s]

Calculating Connected Components






#### Processing Ontology: PR_WITH_IMPORTS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 2078223/2078223 [00:54<00:00, 37966.67it/s]


Calculating Connected Components
Finding Parsing Errors
Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


0it [00:00, ?it/s]


Resolving Punning Errors


100%|██████████| 2078223/2078223 [00:12<00:00, 172422.12it/s]
100%|██████████| 379567/379567 [00:07<00:00, 53241.87it/s]


Logically Verifying Ontology
Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 2078223/2078223 [01:10<00:00, 29360.85it/s]


Calculating Connected Components

#### Processing Ontology: CHEBI_LITE_WITH_IMPORTS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 1290926/1290926 [00:30<00:00, 42039.67it/s]


Calculating Connected Components
Finding Parsing Errors
Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


0it [00:00, ?it/s]


Resolving Punning Errors


100%|██████████| 1290926/1290926 [00:07<00:00, 181598.81it/s]
100%|██████████| 209252/209252 [00:04<00:00, 50936.69it/s]


Logically Verifying Ontology
Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 1290926/1290926 [00:34<00:00, 37017.50it/s]


Calculating Connected Components

#### Processing Ontology: EXT_WITH_IMPORTS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 750999/750999 [00:20<00:00, 36312.48it/s]


Calculating Connected Components
Finding Parsing Errors
Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


0it [00:00, ?it/s]


Resolving Punning Errors


100%|██████████| 750999/750999 [00:04<00:00, 180123.01it/s]
100%|██████████| 142696/142696 [00:02<00:00, 55020.68it/s]


Logically Verifying Ontology
Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 750999/750999 [00:18<00:00, 40623.58it/s]


Calculating Connected Components

#### Processing Ontology: PO_WITH_IMPORTS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 56659/56659 [00:00<00:00, 72932.70it/s]


Calculating Connected Components
Finding Parsing Errors
Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


0it [00:00, ?it/s]


Resolving Punning Errors


100%|██████████| 56659/56659 [00:00<00:00, 197506.43it/s]
100%|██████████| 8595/8595 [00:00<00:00, 53930.40it/s]


Logically Verifying Ontology
Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 56659/56659 [00:00<00:00, 68270.12it/s]


Calculating Connected Components

#### Processing Ontology: SO_WITH_IMPORTS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 42085/42085 [00:00<00:00, 66891.87it/s]


Calculating Connected Components
Finding Parsing Errors
Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


0it [00:00, ?it/s]


Resolving Punning Errors


100%|██████████| 42085/42085 [00:00<00:00, 199622.83it/s]
100%|██████████| 6891/6891 [00:00<00:00, 56251.45it/s]


Logically Verifying Ontology
Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 42085/42085 [00:00<00:00, 63529.13it/s]


Calculating Connected Components

#### Processing Ontology: HP_WITH_IMPORTS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 949336/949336 [00:19<00:00, 48386.68it/s]


Calculating Connected Components
Finding Parsing Errors
Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


0it [00:00, ?it/s]


Resolving Punning Errors


100%|██████████| 949336/949336 [00:05<00:00, 182127.07it/s]
100%|██████████| 193509/193509 [00:03<00:00, 57100.21it/s]


Logically Verifying Ontology
Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 949336/949336 [00:21<00:00, 44051.93it/s]


Calculating Connected Components

#### Processing Ontology: MONDO_WITH_IMPORTS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 2277245/2277245 [00:58<00:00, 39205.62it/s]


Calculating Connected Components
Finding Parsing Errors
Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


0it [00:00, ?it/s]


Resolving Punning Errors


100%|██████████| 2277245/2277245 [00:12<00:00, 176508.87it/s]
100%|██████████| 388202/388202 [00:07<00:00, 52988.36it/s]


Logically Verifying Ontology
Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 2277245/2277245 [01:15<00:00, 30017.70it/s]


Calculating Connected Components

#### Processing Ontology: PW_WITH_IMPORTS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 34901/34901 [00:00<00:00, 63200.32it/s]


Calculating Connected Components
Finding Parsing Errors
Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


0it [00:00, ?it/s]


Resolving Punning Errors


100%|██████████| 34901/34901 [00:00<00:00, 216014.41it/s]
100%|██████████| 5065/5065 [00:00<00:00, 54500.69it/s]


Logically Verifying Ontology
Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 34901/34901 [00:00<00:00, 62838.87it/s]


Calculating Connected Components

#### Processing Ontology: GO_WITH_IMPORTS.OWL ####
Obtaining Ontology Statistics


100%|██████████| 1337562/1337562 [00:32<00:00, 41044.38it/s]


Calculating Connected Components
Finding Parsing Errors
Fixing Identifier Errors
Removing Deprecated and Obsolete Classes


0it [00:00, ?it/s]


Resolving Punning Errors


100%|██████████| 1337562/1337562 [00:07<00:00, 179230.04it/s]
100%|██████████| 234160/234160 [00:04<00:00, 53003.80it/s]


Logically Verifying Ontology
Reading in Cleaned Ontology -- Needed to Calculate Final Statistics
Obtaining Ontology Statistics


100%|██████████| 1337562/1337562 [00:37<00:00, 35494.72it/s]


Calculating Connected Components


<br>

***
### MERGED ONTOLOGIES <a class="anchor" id="merge-ontologies"></a>
***

**Purpose:** In this step, the [OWLTools](https://github.com/owlcollab/owltools) library is used to merge the directory of cleaned ontology files into a single ontology file. Then, the following cleaning steps are performed: (1) Identifier Errors; (2) Duplicate Classes; (3) Duplicate Class Concepts; and (4) Punning Errors.  

**Inputs:** A directory of ontology files (`.owl`)

**Outputs:** `PheKnowLator_MergedOntologies.owl`


In [15]:
ont_data.temp_dir

'../resources/ontologies'

In [11]:
print('Merge Clean Ontology Data')
ont_data.ont_file_location = ont_data.merged_ontology_filename

# reorder list of ontology files to prepare for merging
onts = [ont_data.temp_dir + '/' + x for x in list(ont_data.ontology_info.keys())
        if x != ont_data.merged_ontology_filename]

# merge ontologies
merges_ontologies(onts, ont_data.temp_dir + '/', ont_data.ont_file_location, ont_data.owltools_location)

Merge Clean Ontology Data
Merging Ontologies: go_with_imports.owl, pw_with_imports.owl
Merging Ontologies: mondo_with_imports.owl, PheKnowLator_MergedOntologies.owl
Merging Ontologies: hp_with_imports.owl, PheKnowLator_MergedOntologies.owl
Merging Ontologies: so_with_imports.owl, PheKnowLator_MergedOntologies.owl
Merging Ontologies: po_with_imports.owl, PheKnowLator_MergedOntologies.owl
Merging Ontologies: ext_with_imports.owl, PheKnowLator_MergedOntologies.owl
Merging Ontologies: chebi_lite_with_imports.owl, PheKnowLator_MergedOntologies.owl
Merging Ontologies: pr_with_imports.owl, PheKnowLator_MergedOntologies.owl
Merging Ontologies: ro_with_imports_AD_mods.owl, PheKnowLator_MergedOntologies.owl
Merging Ontologies: oae_merged_with_imports.owl, PheKnowLator_MergedOntologies.owl
Merging Ontologies: chebi_lite_merged_with_imports.owl, PheKnowLator_MergedOntologies.owl
Merging Ontologies: clo_with_imports.owl, PheKnowLator_MergedOntologies.owl


In [None]:
#sed command for MergedOntologies here

In [21]:
#Load merged ontology and add GO
import subprocess
owltools = ont_data.owltools_location
ont1 = ont_data.temp_dir + '/go_with_imports.owl'
ont2 = '../resources/knowledge_graphs/' + ont_data.merged_ontology_filename
loc = ont_data.temp_dir + '/' + ont_data.merged_ontology_filename
try:
    subprocess.check_call([owltools, str(ont1), str(ont2), '--merge-support-ontologies', '-o', loc])
except subprocess.CalledProcessError as error: print(error.output)

In [16]:
# load merged ontologies into RDF Lib Graph object
print('Loading Merged Ontology Data')
ont_data.ont_graph = Graph().parse(ont_data.temp_dir + '/' + ont_data.ont_file_location)

#command = "sed -i 's/_:genid/_genid/g' {}"

# add merged ontology to dict
ont_data.ontology_info[ont_data.ont_file_location] = {}

# get stats on merged ontologies
print(derives_graph_statistics(ont_data.ont_graph))

Loading Merged Ontology Data
Graph Stats: 9791095 triples, 4132828 nodes, 344 predicates, 549645 classes, 42 individuals, 806 object props, 626 annotation props


### Clean Merged Ontologies
🤔 *IMPORTANT* 🤔  Please note that there are a few decisions that can be made at this point that you may want to consider. For our monthly `PheKnowLator` builds, we prefer to use Entrez gene identifiers. If you have run the [`Data_Preparation.ipynb`](https://github.com/callahantiff/PheKnowLator/blob/master/notebooks/Data_Preparation.ipynb) Jupyter Notebook without making updates, you have also committed yourself to using this type of gene identifier. If you have not done this and do not want to use Entrez gene identifiers, preferring instead what the ontologies provide, please comment out `ont_data.normalizes_existing_classes()` below.

In [17]:
# get starting statistics
ont_data.updates_ontology_reporter()

# clean merged ontologies
ont_data.fixes_identifier_errors()
ont_data.normalizes_duplicate_classes()
ont_data.normalizes_existing_classes()
ont_data.fixes_punning_errors()

# get finishing statistics
print(derives_graph_statistics(ont_data.ont_graph))
ont_data.updates_ontology_reporter()

Obtaining Ontology Statistics


100%|██████████| 9791095/9791095 [05:29<00:00, 29702.49it/s] 


Calculating Connected Components
Fixing Identifier Errors
Normalizing Duplicate Concepts
Normalizing Existing Classes


100%|██████████| 23619/23619 [00:16<00:00, 1464.47it/s]


Resolving Punning Errors


100%|██████████| 9792223/9792223 [01:01<00:00, 159851.70it/s]
100%|██████████| 1852487/1852487 [00:33<00:00, 55193.36it/s]


Graph Stats: 9792157 triples, 4129131 nodes, 344 predicates, 545958 classes, 35 individuals, 801 object props, 622 annotation props
Obtaining Ontology Statistics


100%|██████████| 9792157/9792157 [04:06<00:00, 39800.78it/s] 


Calculating Connected Components


### Output and Save Results
The cleaned merged ontology file is saved to the `resources/knowledge_graphs` directory where it can be detected by the `PheKnowLator` algorithm during the build process.

In [18]:
print('Save and Format Merged Ontology Data')
ont_data.ont_graph.serialize(destination=knowledge_graphs_location + '/' + ont_data.ont_file_location, format='xml')
ontology_file_formatter(knowledge_graphs_location, '/' + ont_data.ont_file_location, ont_data.owltools_location)

Save and Format Merged Ontology Data
Applying OWL API Formatting to Knowledge Graph OWL File


#### Save Ontology Cleaning Results  
To view the results of the ontology cleaning process print the `ont_data.ontology_info` dictionary. This dictionary is keyed by ontology filename and contains a separate dictionary for each ontology with descriptions of the results for each error check that is performed at the individual- and merged-ontology level. The results are also saved to `resources/ontologies/ontology_cleaning_report.txt`.

In [19]:
# save output locally
ont_order = sorted([x for x in ont_data.ontology_info.keys() if not x.startswith('Phe')]) + [ont_data.ont_file_location]
with open(ont_data.temp_dir + '/ontology_cleaning_report.txt', 'w') as o:
    o.write('=' * 50 + '\n{}'.format('ONTOLOGY CLEANING REPORT'))
    o.write('\n{}\n'.format(str(datetime.datetime.utcnow().strftime('%a %b %d %X UTC %Y'))) + '=' * 50 + '\n\n')
    for key in ont_order:
        o.write('\n\n\nONTOLOGY: {}\n'.format(key)); o.write('*' * (len(key) + 10) + '\n\n')
        x = ont_data.ontology_info[key]
        if 'Original GCS URL' in x.keys(): o.write('\t- Original GCS URL: {}\n'.format(x['Original GCS URL']))
        if 'Processed GCS URL' in x: o.write('\t- Processed GCS URL: {}\n'.format(x['Processed GCS URL']))
        o.write('\t- Statistics Before Cleaning:\n\t\t- {}\n'.format(x['Starting Statistics']))
        o.write('\t- Statistics After Cleaning:\n\t\t- {}\n'.format(x['Final Statistics']))
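        if 'ValueErrors' in x.keys():
            if isinstance(x['ValueErrors'], str): o.write('\t- Value Errors (n=1):\n\t\t- {}\n'.format(x['ValueErrors']))
            else:
                # write a header first so the listed errors are not read as part of the statistics above
                o.write('\t- Value Errors (n={}):\n'.format(len(x['ValueErrors'])))
                for i in x['ValueErrors']: o.write('\t\t- {}\n'.format(str(i)))
        else: o.write('\t- Value Errors: 0\n')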
        if x['IdentifierErrors'] != 'None':
            o.write('\t- Identifier Errors (n={}):\n'.format(len(x['IdentifierErrors'].split(', '))))
            for i in x['IdentifierErrors'].split(', '): o.write('\t\t- {}\n'.format(str(i)))
        else: o.write('\t- Identifier Errors: 0\n')
        if 'PheKnowLator_MergedOntologies' not in key:
            if x['Deprecated'] != 'None':
                o.write('\t- Deprecated Classes (n={}):\n'.format(len(x['Deprecated'])))
                for i in x['Deprecated']: o.write('\t\t- {}\n'.format(str(i)))
            else: o.write('\t- Deprecated Classes: 0\n') 
            if x['Obsolete'] != 'None':
                o.write('\t- Obsolete Classes (n={}):\n'.format(len(x['Obsolete'])))
                for i in x['Obsolete']: o.write('\t\t- {}\n'.format(str(i)))
            else: o.write('\t- Obsolete Classes: 0\n')
        o.write('\t- Punning Errors:\n')
        if x['PunningErrors - Classes'] != 'None':
            o.write('\t\t- Classes (n={}):\n'.format(len(x['PunningErrors - Classes'].split(', '))))
            for i in x['PunningErrors - Classes'].split(', '): o.write('\t\t\t- {}\n'.format(i))
        else: o.write('\t\t- Classes: 0\n')
        if x['PunningErrors - ObjectProperty'] != 'None':
            o.write('\t\t- Object Properties (n={}):\n'.format(len(x['PunningErrors - ObjectProperty'].split(', '))))
            for i in x['PunningErrors - ObjectProperty'].split(', '): o.write('\t\t\t- {}\n'.format(i))
        else: o.write('\t\t- Object Properties: 0\n')
        if 'Normalized - Duplicates' in x.keys():
            o.write('\t- Normalization:\n')
            if x['Normalized - Duplicates'] != 'None':
                o.write('\t\t- Existing Entity Normalization (n={}):\n'.format(len(x['Normalized - Duplicates'].split(', '))))
                for i in x['Normalized - Duplicates'].split(', '): o.write('\t\t\t- {}\n'.format(i))
            else: o.write('\t\t- Existing Entity Normalization: 0\n')
            if x['Normalized - Gene IDs'] != 'None': o.write('\t\t- Normalized HGNC IDs: {}\n'.format(x['Normalized - Gene IDs']))
            if x['Normalized - NonOnt'] != 'None': o.write('\t\t- Other Classes that May Need Normalization: {}\n'.format(x['Normalized - NonOnt']))
            if x['Normalized - Dep'] != 'None':
                o.write('\t\t- Deprecated Ontology HGNC Identifiers Needing Alignment (n={}):\n'.format(len(x['Normalized - Dep'])))
                for i in x['Normalized - Dep']: o.write('\t\t\t- {}\n'.format(i))
            else: o.write('\t\t- Deprecated Ontology HGNC Identifiers Needing Alignment: 0\n')

#### Clean-Up Environment

In [20]:
# remove temp files in resources/ontologies
os.remove(write_location + '/' + ont_data.ont_file_location)
os.remove(write_location + '/Merged_gene_rna_protein_identifiers.pkl')

# # remove logs directory
# logs = glob.glob('../builds/logs/*.log')
# shutil.rmtree('/'.join(logs[0].split('/')[:-1]))
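
The `os.remove` calls above raise `FileNotFoundError` if the cell is run a second time, because the files are already gone. The following is a minimal sketch of a more forgiving clean-up step. `remove_if_exists` is a hypothetical helper, and the demonstration uses a throwaway file rather than the real build artifacts.

```python
import os
import tempfile


def remove_if_exists(path):
    """Delete path when it is an existing file; return True if something was removed."""
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False


# demonstration: the second call is a harmless no-op
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'placeholder.pkl')
    open(target, 'w').close()
    first = remove_if_exists(target)   # file is present and gets deleted
    second = remove_if_exists(target)  # file is already gone
```

Applied to this notebook, the helper would be called once for each of the two paths passed to `os.remove` above.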


<br>

***
***

```
@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}
```