Please read the [README](README.md) first to download the necessary data and complete the essential configuration. If you follow the instructions in the [README](README.md), you will be using *pipenv* to create the virtual environment. Remember to enter the virtual environment before running this script.

# Initial Probability Assignment
Before probabilities can be assigned, a probability calibrator must be trained. This step requires some training data. The command below trains a calibration model that can be used to transform scores into probabilities.

In [1]:
!python train_cal.py -i inputs/train.dat -o inputs/cal.model

INFO:root:Loading training data from inputs/train.dat ...
INFO:root:training...
INFO:root:Saving calibraiton model to inputs/cal.model...
INFO:root:Done!


Now we can assign initial probabilities:

In [2]:
!python assign.py -i inputs/prod.dat -o outputs/pkb.dat -m inputs/cal.model

INFO:root:Loading triple from inputs/prod.dat ...
INFO:root:Loading calibration model from inputs/cal.model ...
INFO:root:Assigning probabilities ...
INFO:root:Storing PKB to outputs/pkb.dat ...
INFO:root:Done!


The results are stored in `outputs/pkb.dat` (or your configured directory). Each line has the format `<head, relation, tail, probability, strength>`, where the strength indicates our confidence in the computed probability. Now we can check the first 10 lines of the result file.

In [3]:
!head -n 10 "outputs/pkb.dat"

umberto_i_of_italy	cause_of_death	tyrannicide	0.994	2.0
umberto_i_of_italy	cause_of_death	cerebral_aneurysm	0.445	2.0
john_glenn_beall_jr	nationality	united_states	0.583	2.0
john_glenn_beall_jr	nationality	ancient_greece	0.43	2.0
john_atkinson_grimshaw	gender	male	0.53	2.0
john_atkinson_grimshaw	gender	female	0.512	2.0
hardinge_giffard_1st_earl_of_halsbury	gender	male	0.575	2.0
hardinge_giffard_1st_earl_of_halsbury	gender	female	0.486	2.0
mike_von_erich	nationality	united_states	0.534	2.0
mike_von_erich	nationality	serbia	0.468	2.0


Together, the *probability* and the *strength* define a Beta distribution $Beta(a, b)$, where $a = probability \times strength$ and $b = (1 - probability) \times strength$. This gives us not only a probability value for a triple, but a **probability distribution over that probability**. At this stage, however, the *strength* is fixed at 2, which represents a weak strength (it could be any small number, but 2 is the conventional choice). It is updated in the following step.
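As a rough illustration of the mapping above, the sketch below parses one PKB row and recovers the Beta parameters along with the mean and variance of the resulting distribution. The helper names `parse_pkb_line` and `beta_params` are hypothetical and not part of this repository, and the code assumes the fields are tab-separated, as they appear in the output above.

```python
def parse_pkb_line(line):
    """Split one PKB row into (head, relation, tail, probability, strength).

    Assumes tab-separated fields; this helper is illustrative only.
    """
    head, relation, tail, probability, strength = line.rstrip("\n").split("\t")
    return head, relation, tail, float(probability), float(strength)


def beta_params(probability, strength):
    """Return (a, b) with a = p * s and b = (1 - p) * s."""
    a = probability * strength
    b = (1.0 - probability) * strength
    return a, b


row = parse_pkb_line(
    "umberto_i_of_italy\tcause_of_death\tcerebral_aneurysm\t0.445\t2.0"
)
a, b = beta_params(row[3], row[4])

# Standard Beta moments: the mean equals the stored probability, and the
# variance shrinks as the strength (a + b) grows.
mean = a / (a + b)
variance = a * b / ((a + b) ** 2 * (a + b + 1))
print(f"a={a:.3f} b={b:.3f} mean={mean:.3f} variance={variance:.4f}")
```

Because $a + b$ equals the strength, a larger strength yields a narrower distribution around the same mean, which is why a small value such as 2 expresses weak confidence.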

# Probability Updating

The updating procedure integrates the information in the evidence into the existing knowledge, either by adding new probabilistic triples or by updating the probability of existing triples. The required data format is `<head, relation, tail, probability>`, where the *probability* can be obtained with `assign.py`. An example input is:

In [4]:
!head -n 10 "inputs/synthetic_evidence.dat"

bill_owen	profession	actor	0.134
mother_cabrini	nationality	italy	0.498
bill_haley	nationality	united_states	0.107
david_fasold	profession	sailor	0.266
norbert_poehlke	gender	female	0.232
gustav_stresemann	gender	female	0.978
airey_neave	gender	male	0.349
billy_preston	profession	political_prisoner	0.39
rosemary_clooney	nationality	united_states	0.698
thomas_kettle	profession	barrister	0.74


To update probabilities with uncertain evidence, use this command:

In [5]:
!python update.py -e inputs/synthetic_evidence.dat -d outputs/pkb.dat -o outputs/pkb.dat

INFO:root:Loading PKB from outputs/pkb.dat ...
INFO:root:Performing probabilistic updating by evidence from inputs/synthetic_evidence.dat ...
INFO:root:Saving PKB to outputs/pkb.dat ...
INFO:root:Done!


Now we check the first 10 lines of the result file.

In [6]:
!head -n 10 "outputs/pkb.dat"

umberto_i_of_italy	cause_of_death	tyrannicide	0.692	5.0
umberto_i_of_italy	cause_of_death	cerebral_aneurysm	0.556	6.0
john_glenn_beall_jr	nationality	united_states	0.571	3.0
john_glenn_beall_jr	nationality	ancient_greece	0.43	2.0
john_atkinson_grimshaw	gender	male	0.321	7.0
john_atkinson_grimshaw	gender	female	0.673	4.0
hardinge_giffard_1st_earl_of_halsbury	gender	male	0.518	5.0
hardinge_giffard_1st_earl_of_halsbury	gender	female	0.559	5.0
mike_von_erich	nationality	united_states	0.372	5.0
mike_von_erich	nationality	serbia	0.444	4.0


We can see that the probabilities and strengths have been updated.

## Complexity Estimate

Initial probability assignment essentially feeds a triple's statistical scores into a probabilistic classifier, so assigning a probability takes constant time ($O(1)$) per triple.

Probability updating works either by adding a whole piece of evidence (itself a probabilistic triple) directly to the PKB, or by applying Jeffrey's conditionalisation formula, which also takes constant time ($O(1)$) per piece of evidence.
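To make the constant-time claim concrete, here is a minimal sketch of the textbook form of Jeffrey's conditionalisation, $P'(H) = P(H \mid E)\,q + P(H \mid \neg E)\,(1 - q)$, where $q$ is the new probability of the uncertain evidence $E$. The function name `jeffrey_update` and its parameters are hypothetical. How `update.py` derives the conditional probabilities and adjusts the strength is not shown in this document, so this is not a description of its implementation.

```python
def jeffrey_update(p_h_given_e, p_h_given_not_e, q_e):
    """Textbook Jeffrey's conditionalisation for a binary partition {E, not E}.

    p_h_given_e     -- P(H | E)
    p_h_given_not_e -- P(H | not E)
    q_e             -- new (uncertain) probability assigned to E
    """
    for value in (p_h_given_e, p_h_given_not_e, q_e):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # A fixed number of arithmetic operations, hence O(1) per piece of evidence.
    return p_h_given_e * q_e + p_h_given_not_e * (1.0 - q_e)


# Illustrative numbers only, not taken from the PKB.
updated = jeffrey_update(p_h_given_e=0.9, p_h_given_not_e=0.2, q_e=0.7)
print(round(updated, 3))
```

When `q_e` is 1 the formula reduces to ordinary Bayesian conditioning on $E$, and when `q_e` is 0 it reduces to conditioning on $\neg E$.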