# Introduction 

In this notebook, we implement [*Latent Credibility Analysis*](https://research.fb.com/publications/latent-credibility-analysis/) (LCA) models. These are latent probabilistic models that use hidden (latent) variables to represent the unknown reliability of each data source and the underlying truth values.

For now we implement only simpleLCA, since extending it to the other LCA models is relatively straightforward.

# SimpleLCA

Here is the plate model of simpleLCA. 

![simpleLCA](./gfx/simpleLCA.png)
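
To make the plate model concrete, here is a minimal sketch of the simpleLCA generative story in plain Python. It rests on assumptions that do not come from this notebook: each source `s` has an honesty probability `honesty[s]`, the prior over each object's true value is uniform, and a dishonest source picks uniformly among the wrong values. The function name `sample_simple_lca` and its arguments are hypothetical, and `lca.lca_model` may parameterize the model differently.

```python
import random


def sample_simple_lca(claim_pairs, domain_sizes, honesty, rng):
    """Draw one joint sample of latent truths and source claims.

    claim_pairs  : list of (source_id, object_id) pairs that make an assertion
    domain_sizes : {object_id: number of candidate values}
    honesty      : {source_id: probability that the source asserts the truth}
    rng          : random.Random instance (passed in for reproducibility)
    """
    # Latent truth y_o for every object (assumed uniform prior).
    truths = {o: rng.randrange(k) for o, k in domain_sizes.items()}

    claims = {}
    for s, o in claim_pairs:
        k = domain_sizes[o]
        if k == 1 or rng.random() < honesty[s]:
            # Honest draw (or no alternative exists): assert the true value.
            claims[(s, o)] = truths[o]
        else:
            # Dishonest draw: assert one of the wrong values uniformly.
            wrong = [v for v in range(k) if v != truths[o]]
            claims[(s, o)] = rng.choice(wrong)
    return truths, claims


rng = random.Random(0)
pairs = [(0, 0), (0, 1), (1, 1)]
truths, claims = sample_simple_lca(pairs, {0: 2, 1: 2}, {0: 0.9, 1: 0.5}, rng)
print(truths, claims)
```

Setting every honesty value to `1.0` makes all claims agree with the sampled truths, which is a quick sanity check of the sketch.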

### Data 

In [101]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [102]:
import pandas as pd
import os.path as op
import numpy as np
import seaborn as sns
import pyro

In [103]:
import sys
sys.path.insert(0, '../')

In [104]:
from spectrum.preprocessing import encoders
from spectrum.discovers import lca
from spectrum.discovers import utils

In [105]:
DATA_DIR = '../data'
DATA_SET = 'population'

In [106]:
truths = pd.read_csv(op.join(DATA_DIR, DATA_SET, 'truths.csv'))
claims = pd.read_csv(op.join(DATA_DIR, DATA_SET, 'claims.csv'))

In [107]:
truths.head()

Unnamed: 0,object,value,object_id
0,milton_newhampshire_Population2000,3910,157
1,omaha_nebraska_Population2000,390007,189
2,schaumburg_illinois_Population2000,75386,240
3,lakeoswego_oregon_Population2000,35278,127
4,culver_oregon_Population2000,802,53


In [108]:
claims.head()

Unnamed: 0,object,SourceID,value,object_id,source_id
0,milton_newhampshire_Population2000,16168: SatyrTN,3910,157,352
1,milton_newhampshire_Population2000,0 (76.19.53.22),23910,157,274
2,milton_newhampshire_Population2000,5512121: CapitalBot,3910,157,561
3,omaha_nebraska_Population2000,201610: Pentawing,390007,189,401
4,omaha_nebraska_Population2000,89326: Swid,390007,189,630


In [109]:
truths.shape, claims.shape

((301, 3), (1046, 5))

We model city population as a discrete value and assume that the hidden true value of each object is one of the values asserted by some source. We therefore need to label encode the `value` column of the `claims` data frame.

### Data Preprocessing 

We label encode the claimed values of each object so that they can be fed to the simpleLCA model.
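
As an illustration of what per-object label encoding means, the sketch below assigns codes 0, 1, ... to the distinct values claimed for each object. It is not the implementation of `encoders.transform`: the function name `label_encode_per_object` is hypothetical, and numbering codes in order of first appearance is an assumption.

```python
from collections import defaultdict


def label_encode_per_object(records):
    """Encode (object_id, value) pairs with one codebook per object.

    Returns the encoded pairs and the codebooks {object_id: {value: code}}.
    """
    codebooks = defaultdict(dict)
    encoded = []
    for object_id, value in records:
        codebook = codebooks[object_id]
        # A value seen for the first time gets the next free code.
        code = codebook.setdefault(value, len(codebook))
        encoded.append((object_id, code))
    return encoded, dict(codebooks)


# (object_id, value) pairs taken from the first rows of the claims data.
records = [(157, 3910), (157, 23910), (157, 3910), (189, 390007), (189, 390007)]
encoded, codebooks = label_encode_per_object(records)
print(encoded)
print(codebooks)
```

Keeping a separate codebook per object means the same code can stand for different populations in different cities, and each codebook can later map a predicted code back to the raw value.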

In [110]:
claims_enc, le_dict = encoders.transform(claims)

Build the confidence matrix $[w_{s,o}]$ from the paper: $w_{s,o} = 1$ if source $s$ makes an assertion about object $o$, and $0$ otherwise.
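
The definition of $w_{s,o}$ can be sketched in a few lines. This is only an illustration of the idea, not the code of `lca.build_mask`: the name `build_mask_sketch` is hypothetical, and it assumes source and object ids are contiguous integers starting at 0.

```python
def build_mask_sketch(pairs, n_sources, n_objects):
    """Return an n_sources x n_objects 0/1 matrix from (source_id, object_id) pairs."""
    mask = [[0] * n_objects for _ in range(n_sources)]
    for source_id, object_id in pairs:
        # Source source_id makes an assertion about object object_id.
        mask[source_id][object_id] = 1
    return mask


# Source 0 asserts about objects 0 and 1; source 1 only about object 1.
mask = build_mask_sketch([(0, 0), (0, 1), (1, 1)], n_sources=2, n_objects=2)
print(mask)  # [[1, 1], [0, 1]]
```

Rows index sources and columns index objects, matching the `(n_sources, n_objects)` layout used for `W`.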

In [111]:
W = lca.build_mask(claims)

In [112]:
W.shape, claims.source_id.nunique(), claims.object_id.nunique()

((643, 301), 643, 301)

We also need to build an observation dictionary.

In [113]:
claims.head()

Unnamed: 0,object,SourceID,value,object_id,source_id
0,milton_newhampshire_Population2000,16168: SatyrTN,3910,157,352
1,milton_newhampshire_Population2000,0 (76.19.53.22),23910,157,274
2,milton_newhampshire_Population2000,5512121: CapitalBot,3910,157,561
3,omaha_nebraska_Population2000,201610: Pentawing,390007,189,401
4,omaha_nebraska_Population2000,89326: Swid,390007,189,630


In [114]:
claims_enc.head()

Unnamed: 0,object,SourceID,value,object_id,source_id
0,milton_newhampshire_Population2000,16168: SatyrTN,0,157,352
1,milton_newhampshire_Population2000,0 (76.19.53.22),1,157,274
2,milton_newhampshire_Population2000,5512121: CapitalBot,0,157,561
3,omaha_nebraska_Population2000,201610: Pentawing,0,189,401
4,omaha_nebraska_Population2000,89326: Swid,0,189,630


In [115]:
observation = lca.build_observation(claims_enc)

In [116]:
# claims.groupby(['object_id']).nunique()

### Model

Create a small toy dataset to exercise the model. Note that this overwrites the `claims` data frame loaded earlier.

In [117]:
claims = dict()
claims['source_id'] = [0, 0, 1]
claims['object_id'] = [0, 1, 1]
claims['value'] = [0, 1, 0]
claims = pd.DataFrame(data=claims)

Build the inputs for the simpleLCA model.

In [118]:
mask = lca.build_mask(claims)
observation = lca.build_observation(claims)

In [119]:
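def generate_one_simpleLCA_sample(observation, mask):
    """Run the simpleLCA model once and print every sampled site."""
    # Record all sites visited during one forward run of the model.
    tracer = pyro.poutine.trace(lca.lca_model)
    trace = tracer.get_trace(observation, mask)

    for node in trace.nodes.values():
        # Skip non-sample nodes (the model's arguments and return value).
        if node['type'] == 'sample':
            print(f'{node["name"]} - sampled value {node["value"]} ')
    return trace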

In [120]:
for i in range(3):
    generate_one_simpleLCA_sample(observation, mask)
    print('-'*10)

s_0 - sampled value 1.0 
s_1 - sampled value 0.0 
y_0 - sampled value 0 
y_1 - sampled value 1 
b_0_0 - sampled value 0 
b_0_1 - sampled value 1 
b_1_1 - sampled value 0 
----------
s_0 - sampled value 0.0 
s_1 - sampled value 0.0 
y_0 - sampled value 0 
y_1 - sampled value 0 
b_0_0 - sampled value 0 
b_0_1 - sampled value 0 
b_1_1 - sampled value 1 
----------
s_0 - sampled value 0.0 
s_1 - sampled value 0.0 
y_0 - sampled value 0 
y_1 - sampled value 1 
b_0_0 - sampled value 0 
b_0_1 - sampled value 1 
b_1_1 - sampled value 1 
----------


In [121]:
data = lca.make_observation_mapper(observation, mask)

In [122]:
conditioned_model = pyro.condition(lca.lca_model, data=data)

In [123]:
tracer = pyro.poutine.trace(conditioned_model)

In [124]:
trace = tracer.get_trace(observation, mask)

In [125]:
trace.log_prob_sum()

tensor(-3.4657, grad_fn=<AddBackward0>)

In [126]:
utils.print_trace(trace)

s_0 - sampled value 0.0
s_1 - sampled value 1.0
y_0 - sampled value 0
y_1 - sampled value 0
b_0_0 - sampled value 0
b_0_1 - sampled value 1
b_1_1 - sampled value 0


In [127]:
data

{'b_0_0': tensor(0), 'b_0_1': tensor(1), 'b_1_1': tensor(0)}
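
The dictionary above maps each observed assertion to a sample site named `b_<source_id>_<object_id>`. A minimal sketch of how such a mapping could be built from the toy claims follows. The name `make_observation_mapper_sketch` is hypothetical and this is not the code of `lca.make_observation_mapper`; it also uses plain integers where the notebook output shows tensors, to stay dependency-free.

```python
def make_observation_mapper_sketch(claims):
    """Map (source_id, object_id, value) triples to {site_name: value}."""
    return {
        f"b_{source_id}_{object_id}": value
        for source_id, object_id, value in claims
    }


# The toy claims used above: (source_id, object_id, value).
toy_claims = [(0, 0, 0), (0, 1, 1), (1, 1, 0)]
data_sketch = make_observation_mapper_sketch(toy_claims)
print(data_sketch)  # {'b_0_0': 0, 'b_0_1': 1, 'b_1_1': 0}
```

Because `pyro.condition` matches observations to sample sites by name, the keys must agree exactly with the site names that the model uses.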

# Inference