# Introduction
This notebook implements [TruthFinder](http://hanj.cs.illinois.edu/pdf/kdd07_xyin.pdf). We have three goals:

1. Implement TruthFinder and integrate it into spectrum
2. Understand how the authors evaluate it
3. See how they prepare the dataset

### How TruthFinder works

TruthFinder takes as input a set of websites (sources), facts (claims), and objects (things of interest). Its goal is to estimate the reliability of each source and to identify the true facts.

**confidence of facts**: The confidence of a fact f is defined as the probability of f being correct.

**source reliability**: The reliability of a source s is defined as the expected confidence of facts provided by s.

We can visualize the relationship between sources, facts, and objects as shown below (Note that here the websites are the sources).

![](./gfx/truthfinder.png)

TruthFinder works as follows:
1. Initialize all source reliabilities to $t_0$
2. While $sim(t_{prev}, t_{now}) < threshold$:
    - compute fact confidences
    - compute website trustworthiness
    
Here $t$ is the vector of all source reliabilities, and the similarity function is cosine similarity.
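The stopping rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `spectrum`: the names `cosine_similarity`, `iterate_until_stable`, `update`, and `max_iter` are hypothetical, and `update` stands in for one round of "compute fact confidences, then compute website trustworthiness".

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def iterate_until_stable(update, t0, threshold=0.99999, max_iter=100):
    """Apply `update` to the trust vector until two consecutive vectors
    are nearly parallel, i.e. their cosine similarity reaches `threshold`.

    `update` is a placeholder for one TruthFinder round; `max_iter` is an
    added safety cap that is not part of the description above.
    """
    t_prev = np.asarray(t0, dtype=float)
    for n_iter in range(1, max_iter + 1):
        t_now = np.asarray(update(t_prev), dtype=float)
        if cosine_similarity(t_prev, t_now) >= threshold:
            return t_now, n_iter
        t_prev = t_now
    return t_prev, max_iter
```

Because cosine similarity only measures the angle between the two vectors, the loop stops once the relative ordering and proportions of the reliabilities stop changing, even if their overall scale still drifts slightly.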

Now, to implement TruthFinder, we need to know how to compute fact confidences and website trustworthiness. The confidence of a fact $f$ is

$$s(f) = 1 -  \prod_{w \in W(f)}(1 -t(w))$$

where $W(f)$ is the set of websites that provide $f$ and $t(w)$ is computed as

![](./gfx/truthfinder_source_reliability.png)
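As a small worked example of the two update formulas, the sketch below computes $s(f)$ from the product formula and takes $t(w)$ to be the mean confidence of the facts that $w$ provides, following the definition of source reliability given earlier. The toy dictionaries `providers` and `trust` and the helper names are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: which websites assert which fact, and their trust.
providers = {"f1": ["w1", "w2"], "f2": ["w3"]}
trust = {"w1": 0.9, "w2": 0.8, "w3": 0.6}

def fact_confidence(fact, providers, trust):
    """s(f) = 1 - prod over w in W(f) of (1 - t(w))."""
    return float(1.0 - np.prod([1.0 - trust[w] for w in providers[fact]]))

def source_trust(source, providers, confidence):
    """t(w): mean confidence of the facts provided by w."""
    provided = [f for f, ws in providers.items() if source in ws]
    return float(np.mean([confidence[f] for f in provided]))

confidence = {f: fact_confidence(f, providers, trust) for f in providers}
new_trust = {w: source_trust(w, providers, confidence) for w in trust}
```

Here `f1` is backed by two fairly reliable websites, so its confidence is $1 - 0.1 \times 0.2 = 0.98$, while `f2` inherits the confidence 0.6 of its only provider.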

However, we need to address two subtle problems with the above formula:

1. Numerical instability
2. Taking into account the influence among facts about the same object.

To address numerical instability, we operate in log space (a common trick). Define the source reliability score
$$\tau(w) = -\ln(1 - t(w))$$

and the fact confidence score

$$\sigma(f)= -\ln(1 - s(f))$$

It is easy to prove that $$\sigma(f) = \sum_{w \in W(f)}\tau(w)$$
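For completeness, the proof is a one-line substitution. Rearranging the product formula gives $1 - s(f) = \prod_{w \in W(f)}(1 - t(w))$, and the logarithm turns the product into a sum:

$$\sigma(f) = -\ln\Big(\prod_{w \in W(f)}(1 - t(w))\Big) = \sum_{w \in W(f)} -\ln(1 - t(w)) = \sum_{w \in W(f)}\tau(w)$$

Here $\tau(w) = -\ln(1 - t(w))$ is the log-space reliability score of website $w$.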

To account for the influence among facts about the same object, we introduce an adjusted confidence score 
![](./gfx/truthfinder_adjusted_score.png)


#### Implication between facts
$imp(f_1 \rightarrow f_2)$ is $f_1$'s influence on $f_{2}$'s confidence. Its value lies between -1 and 1: a positive value means the facts support each other, and a negative value means they conflict. The implication function is domain-specific and can be defined as

$imp(f_1 \rightarrow f_2) = sim(f_1, f_2) - base\_sim$, where base_sim is a threshold.



![](./gfx/truthfinder_implication.png)
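To make the implication and the adjusted score concrete, here is a hedged sketch for numeric claims such as population counts. It assumes the adjusted score has the form used in the TruthFinder paper, $\sigma^*(f) = \sigma(f) + \rho \sum_{f'} \sigma(f') \cdot imp(f' \rightarrow f)$, summed over the other facts $f'$ about the same object. The `similarity` function, the values of `rho` and `base_sim`, and the decision to exclude $f$ itself from the sum are illustrative assumptions, not taken from `spectrum`.

```python
def similarity(v1, v2):
    """Hypothetical similarity for numeric values: 1 minus the relative
    difference. Lies in [0, 1] when both values have the same sign."""
    largest = max(abs(v1), abs(v2))
    if largest == 0:
        return 1.0
    return 1.0 - abs(v1 - v2) / largest

def implication(v1, v2, base_sim=0.5):
    """imp(f1 -> f2) = sim(f1, f2) - base_sim."""
    return similarity(v1, v2) - base_sim

def adjusted_scores(values, sigma, rho=0.3, base_sim=0.5):
    """Adjusted confidence scores for all facts about ONE object.

    values[i] is the claimed value of fact i and sigma[i] its score.
    """
    adjusted = []
    for i, v in enumerate(values):
        influence = sum(
            sigma[j] * implication(u, v, base_sim)
            for j, u in enumerate(values)
            if j != i
        )
        adjusted.append(sigma[i] + rho * influence)
    return adjusted

# Two nearly identical claims and one outlier (illustrative numbers).
values = [3910, 3900, 23910]
sigma = [1.0, 1.0, 1.0]
adjusted = adjusted_scores(values, sigma)
```

With equal starting scores, the two close claims reinforce each other and end up above 1.0, while the outlier is pulled below 1.0 because both other claims conflict with it.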

# Implementation

They compare source and object results.

In [1]:
import pandas as pd
import os.path as op
import numpy as np
import seaborn as sns

In [2]:
import sys
sys.path.insert(0, '../')

In [3]:
from spectrum import utils
from spectrum.truthfinder import truthfinder

In [4]:
DATA_DIR = '../data'
DATA_SET = 'population'

In [5]:
truths = pd.read_csv(op.join(DATA_DIR, DATA_SET, 'truths.csv'))
claims = pd.read_csv(op.join(DATA_DIR, DATA_SET, 'claims.csv'))

In [6]:
truths.head()

Unnamed: 0,object,value,object_id
0,milton_newhampshire_Population2000,3910,157
1,omaha_nebraska_Population2000,390007,189
2,schaumburg_illinois_Population2000,75386,240
3,lakeoswego_oregon_Population2000,35278,127
4,culver_oregon_Population2000,802,53


In [7]:
claims.head()

Unnamed: 0,object,SourceID,value,object_id,source_id
0,milton_newhampshire_Population2000,16168: SatyrTN,3910,157,352
1,milton_newhampshire_Population2000,0 (76.19.53.22),23910,157,274
2,milton_newhampshire_Population2000,5512121: CapitalBot,3910,157,561
3,omaha_nebraska_Population2000,201610: Pentawing,390007,189,401
4,omaha_nebraska_Population2000,89326: Swid,390007,189,630


In [8]:
discovered_trusts, discovered_truths = truthfinder(claims, verbose=True, initial_trust=0.5)

trust similarity - 0.9925411015127318
trust similarity - 0.9992731992825723
trust similarity - 0.999811816640213
trust similarity - 0.9999462675746896
trust similarity - 0.9999823732249723
trust similarity - 0.9999938011012202


In [9]:
discovered_truths = discovered_truths.groupby('object_id').apply(lambda x: x.loc[x.confidence.idxmax()])[['value']]
discovered_truths.reset_index(inplace=True)

# Evaluation

In [10]:
utils.accuracy(truths, discovered_truths)

0.8205980066445183
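The definition of `utils.accuracy` is not shown in this notebook. As a rough guide to what such a metric measures, the sketch below assumes accuracy is the fraction of ground-truth objects whose discovered value equals the true value; the real function in `spectrum` may differ. The name `accuracy_sketch` and the demo frames are hypothetical.

```python
import pandas as pd

def accuracy_sketch(truths, discovered):
    """Fraction of ground-truth objects whose discovered value matches.

    Both frames are assumed to have `object_id` and `value` columns.
    """
    merged = truths.merge(discovered, on="object_id", suffixes=("_true", "_found"))
    return float((merged["value_true"] == merged["value_found"]).mean())

# Tiny made-up example: two of the three objects are recovered correctly.
truths_demo = pd.DataFrame({"object_id": [1, 2, 3], "value": [10, 20, 30]})
found_demo = pd.DataFrame({"object_id": [1, 2, 3], "value": [10, 25, 30]})
score = accuracy_sketch(truths_demo, found_demo)
```

Note that an inner merge only scores objects present in both frames, so objects without any discovered value would need separate handling.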