# Introduction 

In this notebook, we implement [*Latent Credibility Analysis*](https://research.fb.com/publications/latent-credibility-analysis/) (LCA) models. These are latent probabilistic models that use hidden (latent) variables to represent the unknown reliabilities of data sources and the underlying truth values.

We implement only simpleLCA for now, since extending it to the other LCA models is relatively straightforward.

# SimpleLCA

Here is the plate model of simpleLCA. 

![simpleLCA](./gfx/simpleLCA.png)

### Data 

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import os.path as op
import numpy as np
import seaborn as sns
import pyro

In [3]:
import sys
sys.path.insert(0, '../')

In [4]:
from spectrum.preprocessing import encoders
from spectrum.discovers import lca

In [5]:
DATA_DIR = '../data'
DATA_SET = 'population'

In [6]:
truths = pd.read_csv(op.join(DATA_DIR, DATA_SET, 'truths.csv'))
claims = pd.read_csv(op.join(DATA_DIR, DATA_SET, 'claims.csv'))

In [7]:
truths.head()

Unnamed: 0,object,value,object_id
0,milton_newhampshire_Population2000,3910,157
1,omaha_nebraska_Population2000,390007,189
2,schaumburg_illinois_Population2000,75386,240
3,lakeoswego_oregon_Population2000,35278,127
4,culver_oregon_Population2000,802,53


In [8]:
claims.head()

Unnamed: 0,object,SourceID,value,object_id,source_id
0,milton_newhampshire_Population2000,16168: SatyrTN,3910,157,352
1,milton_newhampshire_Population2000,0 (76.19.53.22),23910,157,274
2,milton_newhampshire_Population2000,5512121: CapitalBot,3910,157,561
3,omaha_nebraska_Population2000,201610: Pentawing,390007,189,401
4,omaha_nebraska_Population2000,89326: Swid,390007,189,630


In [9]:
truths.shape, claims.shape

((301, 3), (1046, 5))

We model city population as a discrete value. Moreover, we assume that the hidden truth value of an object comes only from the set of values claimed for it. We therefore need to label encode the `value` column of the claims data frame.

### Data Preprocessing 

We need to label encode the values of each object in order to feed them to our simpleLCA model.

In [10]:
claims_enc, le_dict = encoders.transform(claims)
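
As an illustration of what label encoding means here, the following is a minimal sketch rather than the `spectrum` implementation. It assumes that codes are assigned separately within each object, in order of first appearance. `encode_values_sketch` is a hypothetical helper, and unlike `encoders.transform` it does not return a dictionary of encoders.

```python
import pandas as pd


def encode_values_sketch(claims: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: replace `value` with a per-object integer code."""
    encoded = claims.copy()
    # Assumption: codes start at 0 within each object and follow first appearance.
    encoded["value"] = claims.groupby("object_id")["value"].transform(
        lambda s: pd.Series(pd.factorize(s)[0], index=s.index)
    )
    return encoded


# Small frame modelled on the first rows of the claims data.
toy_claims = pd.DataFrame({
    "object_id": [157, 157, 157, 189, 189],
    "source_id": [352, 274, 561, 401, 630],
    "value": [3910, 23910, 3910, 390007, 390007],
})
toy_encoded = encode_values_sketch(toy_claims)
print(toy_encoded)
```

With this scheme, the size of each object's value domain is simply the number of distinct values claimed for it.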

Build the confidence matrix, $[w_{s,o}]$ in the paper: $w_{s,o} = 1$ if source $s$ makes an assertion about object $o$, and $0$ otherwise.

In [11]:
W = lca.build_mask(claims)

In [12]:
W.shape, claims.source_id.nunique(), claims.object_id.nunique()

((643, 301), 643, 301)
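
A matrix of this shape can be built in a few lines. The sketch below is an assumption about what a mask builder might do, not the code of `lca.build_mask`. It assumes `source_id` and `object_id` are contiguous integers starting at 0, and `build_mask_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def build_mask_sketch(claims: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: mask[s, o] = 1 if source s makes a claim about object o."""
    # Assumption: ids are contiguous integers starting at 0.
    n_sources = int(claims["source_id"].max()) + 1
    n_objects = int(claims["object_id"].max()) + 1
    mask = np.zeros((n_sources, n_objects))
    mask[claims["source_id"].to_numpy(), claims["object_id"].to_numpy()] = 1.0
    return mask


toy_claims = pd.DataFrame(
    {"source_id": [0, 0, 1], "object_id": [0, 1, 1], "value": [0, 1, 0]}
)
toy_mask = build_mask_sketch(toy_claims)
print(toy_mask)
```

Rows index sources and columns index objects, which matches the `(643, 301)` shape reported for the population data.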

We also need to build an observation dictionary.

In [13]:
claims.head()

Unnamed: 0,object,SourceID,value,object_id,source_id
0,milton_newhampshire_Population2000,16168: SatyrTN,3910,157,352
1,milton_newhampshire_Population2000,0 (76.19.53.22),23910,157,274
2,milton_newhampshire_Population2000,5512121: CapitalBot,3910,157,561
3,omaha_nebraska_Population2000,201610: Pentawing,390007,189,401
4,omaha_nebraska_Population2000,89326: Swid,390007,189,630


In [14]:
claims_enc.head()

Unnamed: 0,object,SourceID,value,object_id,source_id
0,milton_newhampshire_Population2000,16168: SatyrTN,0,157,352
1,milton_newhampshire_Population2000,0 (76.19.53.22),1,157,274
2,milton_newhampshire_Population2000,5512121: CapitalBot,0,157,561
3,omaha_nebraska_Population2000,201610: Pentawing,0,189,401
4,omaha_nebraska_Population2000,89326: Swid,0,189,630


In [15]:
observation = lca.build_observation(claims_enc)

In [16]:
# claims.groupby(['object_id']).nunique()

### Model

In [17]:
# Toy example with two sources and two objects; this overwrites the `claims` frame loaded above.
claims = dict()
claims['source_id'] = [0, 0, 1]
claims['object_id'] = [0, 1, 1]
claims['value'] = [0, 1, 0]
claims = pd.DataFrame(data=claims)

In [18]:
claims

Unnamed: 0,source_id,object_id,value
0,0,0,0
1,0,1,1
2,1,1,0


In [19]:
mask = lca.build_mask(claims)

In [20]:
observation = lca.build_observation(claims)

In [21]:
observation

{0: array([[1.],
        [0.]]), 1: array([[0., 1.],
        [1., 0.]])}
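
The printed dictionary suggests a layout of one array per object, with one row per source and one column per candidate value, holding a one-hot encoding of each claim. The sketch below follows that reading. It is an assumption about the structure, not the code of `lca.build_observation`, and `build_observation_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def build_observation_sketch(claims: pd.DataFrame) -> dict:
    """Hypothetical helper: map object_id -> (n_sources, n_values) one-hot array."""
    # Assumption: source ids and encoded values are contiguous integers from 0.
    n_sources = int(claims["source_id"].max()) + 1
    observation = {}
    for object_id, group in claims.groupby("object_id"):
        n_values = int(group["value"].max()) + 1
        onehot = np.zeros((n_sources, n_values))
        # Sources that make no claim about this object keep an all-zero row.
        onehot[group["source_id"].to_numpy(), group["value"].to_numpy()] = 1.0
        observation[int(object_id)] = onehot
    return observation


toy_claims = pd.DataFrame(
    {"source_id": [0, 0, 1], "object_id": [0, 1, 1], "value": [0, 1, 0]}
)
toy_observation = build_observation_sketch(toy_claims)
print(toy_observation)
```

In this layout, source 1 has an all-zero row for object 0 because it makes no claim about that object.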

In [22]:
tracer = pyro.poutine.trace(lca.lca_model)

In [23]:
trace = tracer.get_trace(observation, mask)

In [24]:
for e in trace.edges:
    print(e)

In [25]:
trace.topological_sort()

['_RETURN',
 'b_(1, 1)',
 'alpha_sm_(1, 1)',
 'b_(0, 1)',
 'alpha_sm_(0, 1)',
 'b_(0, 0)',
 'alpha_sm_(0, 0)',
 'y_1',
 'theta_m_1',
 'y_0',
 'theta_m_0',
 'H_1',
 'theta_s_1',
 'H_0',
 'theta_s_0',
 '_INPUT']
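
The site names in the trace (`H_s`, `y_m`, `b_(s, m)`) follow the simpleLCA generative story: each source has an honesty $H_s$, each object has a hidden truth $y_m$, and a source asserts the truth with probability $H_s$ and otherwise picks one of the remaining values. The NumPy sketch below simulates that story under assumed priors (uniform honesty, uniform truth). It is not the Pyro model in `lca.lca_model`, whose exact parameterization is not shown here, and `sample_simple_lca` is a hypothetical name.

```python
import numpy as np


def sample_simple_lca(mask, n_values, rng):
    """Hypothetical sampler for a simpleLCA-style generative process."""
    n_sources, n_objects = mask.shape
    # Assumption: Beta(1, 1) prior on honesty and a uniform prior on the truth.
    honesty = rng.beta(1.0, 1.0, size=n_sources)
    truths = np.array([rng.integers(k) for k in n_values])
    claims = {}
    for s in range(n_sources):
        for m in range(n_objects):
            if mask[s, m] == 0:
                continue  # source s makes no claim about object m
            k = n_values[m]
            if k == 1:
                probs = np.array([1.0])  # only one candidate value
            else:
                # Wrong values share the remaining probability mass equally.
                probs = np.full(k, (1.0 - honesty[s]) / (k - 1))
                probs[truths[m]] = honesty[s]
            claims[(s, m)] = int(rng.choice(k, p=probs))
    return honesty, truths, claims


toy_mask = np.array([[1.0, 1.0], [0.0, 1.0]])
toy_n_values = [1, 2]  # object 0 has one candidate value, object 1 has two
honesty, truths, sampled_claims = sample_simple_lca(
    toy_mask, toy_n_values, np.random.default_rng(0)
)
print(honesty, truths, sampled_claims)
```

Claims are sampled only for the (source, object) pairs where the mask is 1, which mirrors the three `b_(s, m)` sites in the trace.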

### Inference