# Introduction 

In this notebook, we will implement [*Latent Credibility Analysis*](https://research.fb.com/publications/latent-credibility-analysis/) (LCA) models. These are latent probabilistic models that use hidden (latent) variables to represent the unknown reliabilities of data sources and the underlying truth values.

We implement only simpleLCA for now, as extensions to the other models are relatively straightforward.



# SimpleLCA

Here is the plate model of simpleLCA. 

![simpleLCA](./gfx/simpleLCA.png)
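
To make the plate model concrete, here is a minimal NumPy sketch of how the posterior over one object's true value could be computed when the source reliabilities are held fixed. It assumes the simpleLCA likelihood as we read it from the paper: a source asserts the true value with probability equal to its honesty, and otherwise picks uniformly among the remaining candidate values. The function `truth_posterior` and its arguments are illustrative names, not part of the `spectrum` package.

```python
import numpy as np


def truth_posterior(assertions, honesty, n_values):
    """Posterior over the true value of one object (illustrative sketch).

    assertions: dict source_id -> index of the value the source asserts
    honesty:    dict source_id -> probability in (0, 1) of asserting the truth
    n_values:   number of candidate values for the object (assumed >= 2)
    """
    log_post = np.zeros(n_values)  # uniform prior over candidate values
    for source, claimed in assertions.items():
        h = honesty[source]
        # Log-likelihood of this assertion under each candidate truth value:
        # wrong candidates share the remaining mass (1 - h) uniformly.
        log_lik = np.full(n_values, np.log((1.0 - h) / (n_values - 1)))
        log_lik[claimed] = np.log(h)
        log_post += log_lik
    # Normalise in a numerically stable way.
    post = np.exp(log_post - log_post.max())
    return post / post.sum()


# Three sources, three candidate values; sources 0 and 1 agree on value 2.
posterior = truth_posterior(
    assertions={0: 2, 1: 2, 2: 0},
    honesty={0: 0.9, 1: 0.6, 2: 0.5},
    n_values=3,
)
print(posterior)
```

In the full model the honesty parameters are latent as well, which is why the rest of the notebook uses variational inference instead of this closed-form update.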

### Data 

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import os.path as op
import numpy as np
import seaborn as sns
import pyro

In [None]:
import sys
sys.path.insert(0, '../')

In [None]:
from spectrum.preprocessing import encoders
from spectrum.judge import lca
from spectrum.judge import utils

In [None]:
DATA_DIR = '../data'
DATA_SET = 'population'

In [None]:
truths = pd.read_csv(op.join(DATA_DIR, DATA_SET, 'truths.csv'))
claims = pd.read_csv(op.join(DATA_DIR, DATA_SET, 'claims.csv'))

In [None]:
truths.shape, claims.shape

We model city population as a discrete value. Moreover, we assume that the hidden truth value can only come from the set of available assertions. Thus we need to label encode the `value` column of the claims data frame.

### Data Preprocessing 

We need to label encode the values of each object in order to feed them to our simpleLCA model.

In [None]:
claims_enc, le_dict = encoders.transform(claims)
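
For intuition, per-object label encoding can be sketched with plain pandas as follows. This is only an illustration of the idea and not necessarily what `encoders.transform` does internally; the column names `source_id`, `object_id`, `value` and `value_id` are assumptions made for this toy example.

```python
import pandas as pd

# Toy claims: each row is one source asserting a value for one object.
toy_claims = pd.DataFrame({
    "source_id": [0, 1, 2, 0, 1],
    "object_id": [0, 0, 0, 1, 1],
    "value": [1200, 1500, 1200, 300, 310],
})

# Encode values separately for every object. pd.factorize assigns integer
# codes in order of first appearance, so codes restart at 0 for each object.
toy_claims["value_id"] = (
    toy_claims.groupby("object_id")["value"]
    .transform(lambda s: pd.factorize(s)[0])
)

# Keep a decoder per object so encoded truths can be mapped back to values.
decoders = {
    obj: dict(zip(group["value_id"], group["value"]))
    for obj, group in toy_claims.groupby("object_id")
}
print(toy_claims)
```

Encoding per object keeps the number of categories for each hidden truth variable equal to the number of distinct values asserted about that object.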

Build the confidence matrix, $[w_{s,o}]$ in the paper. If $w_{s,o} = 1$, then source $s$ makes an assertion about object $o$.

In [None]:
mask = lca.build_mask(claims_enc)
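
As a rough illustration of what such a mask looks like, the sketch below builds a binary source-by-object matrix from a toy table of (source, object) pairs. It assumes integer ids starting at 0 and the hypothetical column names `source_id` and `object_id`; the actual `lca.build_mask` may use a different layout.

```python
import numpy as np
import pandas as pd

# Toy table of which source made an assertion about which object.
toy_pairs = pd.DataFrame({
    "source_id": [0, 1, 2, 0, 1],
    "object_id": [0, 0, 0, 1, 1],
})

n_sources = int(toy_pairs["source_id"].max()) + 1
n_objects = int(toy_pairs["object_id"].max()) + 1

# toy_mask[s, o] == 1 if source s asserted something about object o, else 0.
toy_mask = np.zeros((n_sources, n_objects), dtype=int)
toy_mask[toy_pairs["source_id"].to_numpy(), toy_pairs["object_id"].to_numpy()] = 1
print(toy_mask)
```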

We also need to build an observation dictionary.

In [None]:
observation = lca.build_observation(claims_enc)

### Model

Create some data

# Inference 

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch

import pyro
import pyro.infer
import pyro.optim
import pyro.distributions as dist

pyro.set_rng_seed(101)

In [None]:
data = lca.make_observation_mapper(observation, mask)
conditioned_lca = pyro.condition(lca.lca_model, data=data)

# guide
lca_guide = lca.lca_guide

pyro.clear_param_store()
svi = pyro.infer.SVI(model=conditioned_lca,
                     guide=lca_guide,
                     optim=pyro.optim.Adam({"lr":1e-5}),
                     loss=pyro.infer.Trace_ELBO())

In [None]:
losses = []
num_steps = 10
for t in range(num_steps):
    cur_loss = svi.step(observation, mask)
    losses.append(cur_loss)
    print(f'current loss - {cur_loss}')

In [None]:
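# sns.tsplot has been removed from recent seaborn releases; lineplot is its replacement
sns.lineplot(x=list(range(len(losses))), y=losses)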

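We can see that stochastic variational inference (SVI) does not do well with the Trace_ELBO loss. I think Trace_ELBO is not well suited to discrete distributions: for discrete latent variables it falls back on score-function gradient estimates, which tend to have high variance. Pyro's TraceEnum_ELBO, which can enumerate discrete latent variables, may be a better fit.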

# Evaluation 

In [None]:
def get_trusted_source(posteriors, reliability_threshold=0.8):
    """Compute a list of trusted sources given a threshold on their reliability.
    
    Parameters
    ----------
    posteriors: dict
        a dictionary rv_name->posterior dist
    
    reliability_threshold: float
        if a source has reliability > reliability_threshold then it will be included
        in the result
    
    Returns
    -------
    trusted_sources: list
        a list of trusted source ids
    """
    result = [
        int(k.split('_')[2]) for k, v in posteriors.items()
        if k.startswith('beta_s') and torch.exp(v) > reliability_threshold
    ]
    return result

In [None]:
def discover_truths(posteriors):
    """Map each object id to the index of its most probable (encoded) value."""
    return {
        int(k.split('_')[2]): int(torch.argmax(v))
        for k, v in posteriors.items() if k.startswith('beta_m')
    }

In [None]:
discovered_truths = discover_truths(posteriors=pyro.get_param_store())
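
One simple way to score the result is the fraction of objects whose discovered value matches the known truth. The helper below is a hypothetical sketch, not part of `spectrum`. It assumes both arguments are plain dictionaries from object id to value in the same encoding, so the ground truth would first have to be label encoded with the same mapping as the claims.

```python
def truth_accuracy(discovered, ground_truth):
    """Fraction of objects with a known truth whose discovered value matches.

    Both arguments are dicts object_id -> value, in the same encoding.
    Objects without a known truth are ignored.
    """
    common = [obj for obj in ground_truth if obj in discovered]
    if not common:
        return float("nan")  # nothing to evaluate
    hits = sum(discovered[obj] == ground_truth[obj] for obj in common)
    return hits / len(common)


# Toy example: object 0 is correct, object 1 is wrong, object 2 has no truth.
toy_discovered = {0: 1, 1: 0, 2: 3}
toy_ground_truth = {0: 1, 1: 2}
score = truth_accuracy(toy_discovered, toy_ground_truth)
print(score)
```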

# Appendix