# Link a Human Phenotype to an existing QID on WikiData
The goal is to understand why obographs_DO is not appending certain properties to WikiData. Is it because a Human Phenotype (HP) term that already has an item (and QID) on WikiData is not linked to its HP ID, so the script can't find the QID? <br> <br>
1.) Get a list of candidate HP terms and look them up on WikiData<br>
2.) Set up a fresh environment with a fresh pull (make edits on your branch and push to GitHub). <br>
3.) Link the ones that match on WikiData<br>
4.) Rerun the script<br>


#### Here's a list of HPs not found. Output from Notebook4.

```
None;None;None;no qids found for: http://purl.obolibrary.org/obo/HP_0003202;None
None;None;None;no qids found for: http://purl.obolibrary.org/obo/HP_0002912;None
None;None;None;no qids found for: http://purl.obolibrary.org/obo/HP_0003231;None
None;None;None;no qids found for: http://purl.obolibrary.org/obo/HP_0005832;None
```

These were chosen more or less at random. A better approach may be to screen all HP results: collect them in a set, then count occurrences with a dictionary. The most popular human phenotypes will most likely have many edges pointing to them. Sum the Subject and Object counts, since together they show how often each phenotype has specific features or is a feature of something else. <br> <br>
Human Phenotype Ontology terms only appear in Edges (not in logicalDefinitionAxioms), so use the `doid_edges.json` file for the counts.

In [1]:
import json
Edg_path = "/home/rogertu/WikiData/doidtest/doid_edges.json"

In [2]:
# Open the file in a context manager so the handle is closed after loading
with open(Edg_path) as f:
    edg = json.load(f)
print("\nedg here\n", edg[0])


edg here
 {'sub': 'http://purl.obolibrary.org/obo/DOID_820', 'pred': 'http://purl.obolibrary.org/obo/RO_0001025', 'obj': 'http://purl.obolibrary.org/obo/UBERON_0000948'}


#### Add each `sub` and `obj` to sets to isolate unique items.

In [3]:
# For each edge in edg (a list of dictionaries), add its subject to the set if characters 31:33 of the IRI are 'HP'.
edgSetSub = set()
for item in edg:
    if item["sub"][31:33] == 'HP':
        edgSetSub.add(item["sub"])
print("Items in Edge Set w/ Subject that has HP", len(edgSetSub))
    
    
# For each edge in edg (a list of dictionaries), add its object to the set if characters 31:33 of the IRI are 'HP'.
edgSetObj = set()
for item in edg:
    if item["obj"][31:33] == 'HP':
        edgSetObj.add(item["obj"])

print("Items in Edge Set w/ Object that has HP", len(edgSetObj))

Items in Edge Set w/ Subject that has HP 0
Items in Edge Set w/ Object that has HP 121


#### Convert `edgSetSub` and `edgSetObj` to Dictionaries

In [4]:
# Convert each set to a dictionary with all values 0, then print the sizes to confirm
subCounter = {x: 0 for x in edgSetSub}
objCounter = {x: 0 for x in edgSetObj}

print("Subject Counter Dictionary : \n", len(subCounter))
print("Object Counter Dictionary : \n", len(objCounter))

Subject Counter Dictionary : 
 0
Object Counter Dictionary : 
 121


#### Count number of Objects in Edges
It seems there are no subjects with HP. This makes sense, because I grabbed the Edges from under the Disease Ontology node, so the subjects are presumably all DOID terms.

In [5]:
for item in edg:
    x = item["obj"]
    if x in objCounter:
        objCounter[x] = objCounter[x]+1
        

In [6]:
# Sort the dictionary items by count, descending, and show the top 5.
listKV = sorted(objCounter.items(), key=lambda x: x[1], reverse=True)
print (*listKV[:5], sep = '\n')

('http://purl.obolibrary.org/obo/HP_0011001', 6)
('http://purl.obolibrary.org/obo/HP_0001250', 5)
('http://purl.obolibrary.org/obo/HP_0003510', 5)
('http://purl.obolibrary.org/obo/HP_0001482', 4)
('http://purl.obolibrary.org/obo/HP_0001252', 4)
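
The set, dictionary, and counting cells above can also be collapsed into one pass with `collections.Counter`. This is only a sketch: the helper name `count_hp_terms` and the sample edges are made up for illustration (they mimic the shape of `doid_edges.json`), and the prefix check replaces the `[31:33]` slice so it doesn't depend on the exact IRI length.

```python
from collections import Counter

# IRI prefix shared by all Human Phenotype terms in the edges file
HP_PREFIX = "http://purl.obolibrary.org/obo/HP_"


def count_hp_terms(edges):
    """Count how often each HP IRI appears as the subject or object of an edge."""
    counts = Counter()
    for edge in edges:
        for role in ("sub", "obj"):
            iri = edge.get(role, "")
            if iri.startswith(HP_PREFIX):
                counts[iri] += 1
    return counts


# Made-up sample edges (hypothetical DOID IDs), same shape as doid_edges.json
OBO = "http://purl.obolibrary.org/obo/"
sample_edges = [
    {"sub": OBO + "DOID_820", "pred": OBO + "RO_0001025", "obj": OBO + "UBERON_0000948"},
    {"sub": OBO + "DOID_0000001", "pred": OBO + "RO_0002200", "obj": OBO + "HP_0001250"},
    {"sub": OBO + "DOID_0000002", "pred": OBO + "RO_0002200", "obj": OBO + "HP_0001250"},
    {"sub": OBO + "DOID_0000002", "pred": OBO + "RO_0002200", "obj": OBO + "HP_0001252"},
]

hp_counts = count_hp_terms(sample_edges)
for iri, n in hp_counts.most_common(5):
    print(iri, n)
```

With the real file, passing `edg` instead of `sample_edges` should give the same ranking as the sorted dictionary above, since subjects contribute nothing when no subject is an HP term.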


#### Here's a list of HP IDs and their corresponding QIDs.
```
'http://purl.obolibrary.org/obo/HP_0011001', Increased bone mineral density, No specific QID's found.
'http://purl.obolibrary.org/obo/HP_0001250', Seizures, Q852376 (convulsion | seizures)
'http://purl.obolibrary.org/obo/HP_0003510', Severe short stature, Q7502090 (short stature) # Closest I found to it..
'http://purl.obolibrary.org/obo/HP_0001482', Subcutaneous nodule, No specific QID's found.
'http://purl.obolibrary.org/obo/HP_0001252', Muscular hypotonia, Q1753547 (hypotonia | muscular hypotonia)
```
<br>
I'll experiment with Seizures and Muscular hypotonia.
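
Instead of searching WikiData by hand, the lookup could be scripted against the WikiData query service. The sketch below only builds the SPARQL text and sends nothing. It assumes that `P3841` is the WikiData property for the Human Phenotype Ontology ID and that values are stored in the `HP:0001250` form; both should be checked on WikiData before relying on it. The function names are hypothetical.

```python
def iri_to_hpo_id(iri):
    """Turn '.../obo/HP_0001250' into 'HP:0001250'."""
    local = iri.rsplit("/", 1)[-1]
    prefix, _, number = local.partition("_")
    if prefix != "HP" or not number.isdigit():
        raise ValueError(f"not an HP IRI: {iri}")
    return f"HP:{number}"


def build_hpo_lookup_query(hpo_ids):
    """Build a SPARQL query listing WikiData items linked to the given HPO IDs.

    Assumption: P3841 is the 'Human Phenotype Ontology ID' property.
    """
    values = " ".join(f'"{hpo_id}"' for hpo_id in hpo_ids)
    return (
        "SELECT ?item ?itemLabel ?hpo WHERE {\n"
        f"  VALUES ?hpo {{ {values} }}\n"
        "  ?item wdt:P3841 ?hpo .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


top_iris = [
    "http://purl.obolibrary.org/obo/HP_0001250",
    "http://purl.obolibrary.org/obo/HP_0001252",
]
query = build_hpo_lookup_query([iri_to_hpo_id(iri) for iri in top_iris])
print(query)
```

An HPO ID that returns no row is a candidate for manual linking, and one that returns more than one row would likely trigger a "multiple qids" message in the bot.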

#### Create new virtual environment and run code.
Make sure GitHub is up to date and that you clone from the `dev` branch. <br>
Install all dependencies via pip (requirements) as well as `scheduled-bots`, then run the setup.

In [4]:
%%bash
which pip
pwd

/home/rogertu/WikiData/WD_Test2/bin/pip
/home/rogertu/WikiData


In [None]:
%%bash
python3 Test/scheduled-bots/scheduled_bots/ontology/obographs_DO.py ./doid.json

#### So I wrote to WikiData by accident.
One thing to note (fortunately or unfortunately): I cloned my master branch for this run, so it contained no edits that differ from the SuLab master. The edits on my branch would be to `__init__.py` (adding more properties) and to `obographs_DO.py` (adding to `APPEND_PROPS` and changing `try_write` to `False`). The branch also wouldn't have run, because its last two lines are commented out...  <br>
Look at the logs created for this run and compare them to Jenkins and to prior runs. Are there any differences?

Converting log to html.

In [1]:
%%bash
python3 /home/rogertu/WikiData/scheduled-bots/scheduled_bots/logger/bot_log_parser.py /home/rogertu/WikiData/logs/'Disease Ontology-20190415_23:31.log'

/home/rogertu/WikiData/logs/Disease Ontology-20190415_23:31.log


  object.__getattribute__(self, name)
  return object.__setattr__(self, name, value)


<body>

<h2>Disease Ontology Log Comparison between Jenkins and Roger</h2>

<table>
  <tr>
    <th>Count</th>
    <th>Jenkins Run</th>
    <th>Roger's Run</th>
  </tr>
  <tr>
    <td>Items Processed Successfully</td>
    <td>27,970</td>
    <td>27,388</td>
  </tr>
  <tr>
    <td>Items Skipped Due to a Warning</td>
    <td>3,755</td>
    <td>3,756</td>
  </tr>
  <tr>
    <td>Items Skipped Due to an Error</td>
    <td>4</td>
    <td>15</td>
  </tr>
</table>

<table>
    <tr>
        <th>Actions Taken</th>
        <th>Jenkins Run</th>
        <th>Roger's Run</th>
    </tr>
    <tr>
        <td>No Action</td>
        <td> - </td>
        <td>27,388</td>
    </tr>
    <tr>
        <td>Update</td>
        <td>27,966</td>
        <td>3,756</td>
    </tr>
    <tr>
        <td>Create</td>
        <td>4</td>
        <td>15</td>
    </tr>
</table>
<table>
    <tr>
        <th>Error Types</th>
        <th>Jenkins Run</th>
        <th>Roger's Run</th>
    </tr>
    <tr>
        <td>.WDApiError</td>
        <td> 11 </td>
        <td>2</td>
    </tr>
    <tr>
        <td>.NonUniqueLabelDescriptionPairError</td>
        <td>11</td>
        <td>12</td>
    </tr>
    <tr>
        <td>.ChunkedEncodingError</td>
        <td>1</td>
        <td>0</td>
    </tr> 
    <tr>
        <td>.ManualInterventionReqException</td>
        <td>1</td>
        <td>1</td>
    </tr>
</table>
</body>

<br>


It seems that most of the edits were OMIM ID edits related to PBB and KrBot. KrBot appears to delete statements that PBB creates. Take, for example, "split hand-foot malformation" `Q30989072`. In its history, you can see that PBB updated the item on March 4th and that KrBot then removed claims generated by PBB. The main difference in this example is that PBB sets the OMIM ID to PS183600, while KrBot replaces it with OMIM ID 183600 (dropping the PS). In the Disease Ontology `.json` file, it is listed as PS183600.

The issue between PBB and KrBot seems to be that removing the `PS` (phenotypic series) prefix from an OMIM ID makes the link more specific than what should actually be referenced. For example, inflammatory bowel disease (IBD) `Q917447` was linked by PBB (via Disease Ontology) to `PS266600`. On the OMIM website, searching with the PS prefix returns a list of inflammatory bowel diseases. Without the PS, however, the ID links IBD specifically to Crohn's disease (IBD-1). Removing the PS, while more specific, makes the link between IBD and OMIM incorrect.
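
To make the distinction concrete, here is a small sketch that treats the `PS` prefix as part of the identifier rather than as noise. The function names are hypothetical and are not taken from PBB or KrBot. The only assumption is the one described above: a `PS`-prefixed ID names a phenotypic series, while the bare number names a single OMIM entry.

```python
def parse_omim_id(raw):
    """Split an OMIM ID into (kind, number), keeping the PS distinction."""
    raw = raw.strip()
    if raw.upper().startswith("PS"):
        kind, number = "phenotypic_series", raw[2:]
    else:
        kind, number = "entry", raw
    if not number.isdigit():
        raise ValueError(f"unexpected OMIM ID: {raw!r}")
    return kind, number


def same_omim_target(a, b):
    """Two IDs point at the same OMIM record only if kind and number both match."""
    return parse_omim_id(a) == parse_omim_id(b)


print(parse_omim_id("PS266600"))               # series of inflammatory bowel diseases
print(parse_omim_id("266600"))                 # single entry
print(same_omim_target("PS266600", "266600"))  # the two are not interchangeable
```

A check like this could flag edits where only the prefix differs, which is the pattern seen in the `Q30989072` history.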

#### Alright. Let's Clone the right bits this time..
Let's clone the branch `dev`

In [None]:
%%bash
git clone --branch dev https://github.com/turoger/scheduled-bots.git

Let's run the code now.
* `obographs_DO.py` is missing a comma on line 30.
* Need to create a copy of `local.py` (USER/PASS file) in scheduled-bots

In [3]:
%%bash
python3 Test/scheduled-bots/scheduled_bots/ontology/obographs_DO.py ./doid.json

https://www.wikidata.org/w/api.php
Successfully logged in as Torogertu
Done running obographs_DO


* Bot runs. Perfect. Now uncomment the last two lines in `obographs_DO.py` and remove their indent. Check that `try_write = False` in `obographs.py`.

In [5]:
%%bash
python3 Test/scheduled-bots/scheduled_bots/ontology/obographs_DO.py ./doid.json

Process is terminated.


#### It works! Terminating...
I terminated the process because I just wanted to see whether it would work. Judging by the logs, it did. Now linking Seizures. Muscular hypotonia already has a Human Phenotype Ontology ID linked.

'http://purl.obolibrary.org/obo/HP_0001250', Seizures, Q852376 (convulsion | seizures) <br>
'http://purl.obolibrary.org/obo/HP_0001252', Muscular hypotonia, Q1753547 (hypotonia | muscular hypotonia)

In [6]:
%%bash
python3 Test/scheduled-bots/scheduled_bots/ontology/obographs_DO.py ./doid.json

Process is terminated.


#### Terminating.
I noticed that my account was still editing WikiData, although no changes were actually added (+0 bytes). I guess it's not only lines 199 and 530 that need to be changed to `False`, but also `try_write` itself in WikiDataIntegrator. It made fewer than roughly 1,500 edits. There seemed to be a delay between when I ran the bot (~3:05pm) and when it actually wrote to WikiData (~4:13pm). I noticed at 4:42pm. <br>
<br>
Changed the default for `try_write` in the WikiDataIntegrator wdihelper `__init__.py` to `False`... Run it again, one last time.

In [None]:
%%bash
python3 Test/scheduled-bots/scheduled_bots/ontology/obographs_DO.py ./doid.json

Convert the log to HTML and compare it to past runs. Does the addition of HP_0001250 to Q852376 decrease the count of updates?

In [1]:
%%bash
python3 /home/rogertu/WikiData/scheduled-bots/scheduled_bots/logger/bot_log_parser.py /home/rogertu/WikiData/logs/'Disease Ontology-20190418_00:08.log'

/home/rogertu/WikiData/logs/Disease Ontology-20190418_00:08.log


  object.__getattribute__(self, name)
  return object.__setattr__(self, name, value)


#### So I wasn't aware that `HP_0001250` was already linked to epileptic seizure before I linked it to convulsion (seizures).
Predictably, I got exactly 5 of these errors:
`multiple qids ({'Q852376', 'Q6279182'}) found for: http://purl.obolibrary.org/obo/HP_0001250` <br>
This suggests that the script works, and that the missing QIDs really are just unlinked or not-yet-created items. This latest run (log file 20190418) had 1 error versus 15 in the earlier accidental WikiData write run. Interesting, since this was a fresh virtual machine... Or it could mean that the items behind those errors were already written during my accidental run.<br>
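
Counting these messages by hand gets tedious, so a small parser could tally them per IRI. This is a sketch only: it assumes the full log line for a "multiple qids" message uses the same semicolon-separated layout as the "no qids found" lines shown at the top of this notebook, and the function name `summarize_qid_messages` is made up.

```python
import re
from collections import Counter

NO_QID = re.compile(r"no qids found for: (?P<iri>[^;\s]+)")
MULTI_QID = re.compile(
    r"multiple qids \(\{(?P<qids>[^}]*)\}\) found for: (?P<iri>[^;\s]+)"
)


def summarize_qid_messages(lines):
    """Return (missing, ambiguous, multi_counts) from bot log lines.

    missing:      Counter of IRIs that had no QID
    ambiguous:    dict mapping IRI -> set of competing QIDs
    multi_counts: Counter of IRIs that had multiple QIDs
    """
    missing, multi_counts, ambiguous = Counter(), Counter(), {}
    for line in lines:
        match = MULTI_QID.search(line)
        if match:
            iri = match.group("iri")
            ambiguous.setdefault(iri, set()).update(
                re.findall(r"Q\d+", match.group("qids"))
            )
            multi_counts[iri] += 1
            continue
        match = NO_QID.search(line)
        if match:
            missing[match.group("iri")] += 1
    return missing, ambiguous, multi_counts


# Sample lines: the first is from Notebook4; the layout of the second is assumed
sample_log = [
    "None;None;None;no qids found for: http://purl.obolibrary.org/obo/HP_0003202;None",
    "None;None;None;multiple qids ({'Q852376', 'Q6279182'}) found for: "
    "http://purl.obolibrary.org/obo/HP_0001250;None",
]
missing, ambiguous, multi_counts = summarize_qid_messages(sample_log)
print(dict(missing))
print(ambiguous)
```

Running this over two log files and comparing the counters would show which IRIs account for a change in the error totals between runs.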

To test whether the QID errors depend specifically on WikiData edits:
* Change the link for HP_0001250 back to only epileptic seizure `Q6279182` and remove it from convulsion `Q852376`
* Link `http://purl.obolibrary.org/obo/HP_0002912` to methylmalonic acidemia `Q742500`
* Link `http://purl.obolibrary.org/obo/HP_0003231` to hypertyrosinemia `Q39209282`
<br>

These edits should result in 11 fewer errors overall (5 from epileptic seizure, and 3 from each of the other two).


In [None]:
%%bash
python3 Test/scheduled-bots/scheduled_bots/ontology/obographs_DO.py ./doid.json

While waiting for the run... Most of the "multiple qids" or "no qids found" issues involve Human Phenotype and UBERON IDs. This makes sense, as we only added `has phenotype` and `anatomical location/located in` to the script. Here's an example.<br> 
```
multiple qids ({'Q75865', 'Q492038', 'Q1073'}) found for: http://purl.obolibrary.org/obo/UBERON_0000955
```
(This points to brain.) <br>
<table>
    <tr>
        <th>QID</th>
        <th>WD Reference</th>
    </tr>
    <tr>
        <td>Q75865</td>
        <td> human brain</td>
    </tr>
    <tr>
        <td>Q492038</td>
        <td>encephalon</td>
    </tr>
    <tr>
        <td>Q1073</td>
        <td>brain</td>
    </tr> 
</table>

#### Convert the log file

In [3]:
%%bash
python3 /home/rogertu/WikiData/scheduled-bots/scheduled_bots/logger/bot_log_parser.py /home/rogertu/WikiData/logs/'Disease Ontology-20190418_17:48.log'

/home/rogertu/WikiData/logs/Disease Ontology-20190418_17:48.log


  object.__getattribute__(self, name)
  return object.__setattr__(self, name, value)


#### Log shows a decrease in errors by 11! (From 3831 to 3820)
... Only 6 were newly updated, however, because the 5 that were broken had originally been working.  <br> 
<br> Going back to `Notebook2`, I ran the program with the two edge conditions in `doid3_2.json`, for DOID_13146 and DOID_14796. The reason only 1 QID was found for both test conditions is that a UBERON ID was loaded for DOID_13146. That doesn't explain how, though, since it isn't linked in `APPEND_PROPS` (maybe it's in `fastrun`...). No QID was found for growth delay, linked to DOID_14796, because it is also unlinked on WikiData.