# Introduction 

In this notebook, we implement [*Latent Credibility Analysis*](https://research.fb.com/publications/latent-credibility-analysis/) (LCA) models. These are latent probabilistic models that use hidden (latent) variables to represent the unknown reliabilities of data sources and the underlying truth values.

We implement only simpleLCA for now, since extending it to the other LCA models is relatively straightforward.



# SimpleLCA

Here is the plate model of simpleLCA. 

![simpleLCA](./gfx/simpleLCA.png)
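To make the plate model more concrete, here is a minimal sketch of how the truth of a single object could be inferred when the source honesties are known. It follows our reading of simpleLCA in the paper: a source asserts the true value with probability $H_s$, and otherwise picks uniformly among the remaining values. The function `truth_posterior` and the toy honesty values are hypothetical and are not part of `spectrum`; the actual `lca.lca_model` may be parameterized differently.

```python
from math import prod  # requires Python 3.8+


def truth_posterior(claims, honesty, n_values, prior=None):
    """Posterior over the true value of ONE object (illustrative sketch).

    claims:   dict source -> asserted value index in 0..n_values-1
    honesty:  dict source -> probability that the source asserts the truth
    n_values: number of candidate values for the object (must be >= 2)
    """
    if prior is None:
        prior = [1.0 / n_values] * n_values  # uniform prior over candidates
    unnormalized = []
    for candidate in range(n_values):
        # A source is right with prob. H_s, else spreads 1 - H_s uniformly
        likelihood = prod(
            honesty[s] if value == candidate
            else (1.0 - honesty[s]) / (n_values - 1)
            for s, value in claims.items()
        )
        unnormalized.append(prior[candidate] * likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Toy example: two moderately honest sources against one very honest source
posterior = truth_posterior(
    claims={"a": 0, "b": 0, "c": 1},
    honesty={"a": 0.8, "b": 0.6, "c": 0.9},
    n_values=2,
)
```

In this toy case the single, very honest source outweighs the two weaker ones, which is the kind of behaviour that separates this family of models from majority voting.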

### Data 

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import os.path as op
import numpy as np
import seaborn as sns
import pyro

import sys
sys.path.insert(0, '../')

from spectrum.preprocessing import encoders
from spectrum.judge import lca, utils
from spectrum import evaluator

In [None]:
DATA_DIR = '../data'
DATA_SET = 'population'

In [None]:
truths = pd.read_csv(op.join(DATA_DIR, DATA_SET, 'truths.csv'))
claims = pd.read_csv(op.join(DATA_DIR, DATA_SET, 'claims.csv'))

In [None]:
truths.shape, claims.shape

We model city population as a discrete value. Moreover, we assume that the hidden truth value of an object can only come from the set of values asserted for it. We therefore need to label encode the `value` column of the claims data frame.

### Data Preprocessing 

We need to label encode the values of each object in order to feed them to our simpleLCA model.

In [None]:
claims_enc, le_dict = encoders.transform(claims)
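As an illustration of what per-object label encoding means, here is a minimal pure-Python sketch. It is not the implementation of `encoders.transform`; the helper names `encode_per_object` and `decode` and the toy rows are hypothetical. It also assumes labels are assigned in order of first appearance, whereas a scikit-learn style encoder would sort the values first.

```python
def encode_per_object(rows):
    """rows: list of (source_id, object_id, raw_value) tuples."""
    encoders = {}  # object_id -> {raw value: integer label}
    encoded = []
    for source_id, object_id, value in rows:
        mapping = encoders.setdefault(object_id, {})
        # New values get the next free label for THIS object only
        label = mapping.setdefault(value, len(mapping))
        encoded.append((source_id, object_id, label))
    return encoded, encoders


def decode(encoders, object_id, label):
    """Map an integer label back to the raw value of the given object."""
    inverse = {lab: raw for raw, lab in encoders[object_id].items()}
    return inverse[label]


# Made-up toy claims: (source, object, claimed population)
rows = [
    (0, "city_a", 2100),
    (1, "city_a", 2200),
    (2, "city_a", 2100),
    (0, "city_b", 500),
]
encoded, encoders = encode_per_object(rows)
```

Keeping one mapping per object means the label space of each object is only as large as the number of distinct values claimed for it, and `decode` recovers the original value later on.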

Next, we build the confidence matrix, $[w_{s,o}]$ in the paper. If $w_{s,o} = 1$, then source $s$ makes an assertion about object $o$.

In [None]:
mask = lca.build_mask(claims_enc)
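The idea behind the mask can be sketched in a few lines. This is only an illustration, assuming integer source and object ids starting at 0 and a plain list-of-lists layout; `build_mask_sketch` is a hypothetical name, and `lca.build_mask` may return a different structure (for example a tensor).

```python
def build_mask_sketch(encoded_claims, n_sources, n_objects):
    """Return w where w[s][o] == 1 iff source s asserts something about object o."""
    mask = [[0] * n_objects for _ in range(n_sources)]
    for source_id, object_id, _label in encoded_claims:
        mask[source_id][object_id] = 1
    return mask


# Toy claims as (source_id, object_id, label)
toy_claims = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (2, 1, 0)]
toy_mask = build_mask_sketch(toy_claims, n_sources=3, n_objects=2)
```

Note that several claims by the same source about the same object collapse into a single 1, so the sum of the mask can be smaller than the number of claim rows.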

We also need to build an observation dictionary.

In [None]:
observation = lca.build_observation(claims_enc)
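One plausible shape for such a dictionary, shown purely as an assumption, groups the encoded claims by object. The name `build_observation_sketch` and the `(source_id, label)` pair layout are hypothetical; the real `lca.build_observation` may store tensors or use a different key.

```python
from collections import defaultdict


def build_observation_sketch(encoded_claims):
    """Group encoded claims by object: object_id -> [(source_id, label), ...]."""
    observation = defaultdict(list)
    for source_id, object_id, label in encoded_claims:
        observation[object_id].append((source_id, label))
    return dict(observation)


# Toy claims as (source_id, object_id, label)
toy_claims = [(0, 0, 0), (1, 0, 1), (2, 1, 0), (0, 1, 0)]
toy_observation = build_observation_sketch(toy_claims)
```

Grouping by object keeps together all assertions that bear on the same hidden truth value.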

### Model

# Inference 

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch

import pyro
import pyro.infer
import pyro.optim
import pyro.distributions as dist

pyro.set_rng_seed(101)

In [None]:
losses = lca.bvi(lca.lca_model, lca.lca_guide, observation, mask, epochs=30, num_samples=3, learning_rate=1e-6)

In [None]:
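# sns.tsplot is no longer available in recent seaborn releases; lineplot draws the same loss curve
sns.lineplot(x=list(range(len(losses))), y=losses)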

We can see that BVI does not do well with the `Trace_ELBO` loss. We suspect that `Trace_ELBO` is not well suited to discrete distributions.

# Evaluation 

In [None]:
discovered_truths = lca.discover_truths(posteriors=pyro.get_param_store())

We need to inverse transform the discovered truth value of each object back into its original space.

In [None]:
discovered_truths['value'] = discovered_truths.apply(lambda x: le_dict[x['object_id']].inverse_transform([x['value']])[0], axis=1)

In [None]:
evaluator.accuracy(truths, discovered_truths)
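For reference, accuracy here can be read as the fraction of objects whose discovered value matches the ground truth. The sketch below is an assumption about that metric, not the code of `evaluator.accuracy`; `accuracy_sketch` is a hypothetical helper that works on plain dictionaries and only scores objects present in both inputs.

```python
def accuracy_sketch(truths, discovered):
    """Both arguments map object_id -> value.

    Only objects that have a known truth AND a discovered value are scored.
    """
    scored = [obj for obj in truths if obj in discovered]
    if not scored:
        return 0.0
    correct = sum(truths[obj] == discovered[obj] for obj in scored)
    return correct / len(scored)


# Made-up toy values
toy_truths = {"city_a": 2100, "city_b": 500, "city_c": 42}
toy_discovered = {"city_a": 2100, "city_b": 480, "city_c": 42, "city_d": 7}
toy_accuracy = accuracy_sketch(toy_truths, toy_discovered)
```

Whether unscored objects should count as errors is a design choice; this sketch simply ignores them.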

The result is bad. This is expected given the plot of loss values during training: they did not converge! Possible reasons include:

1. SVI estimates gradients at each training step by sampling from `guide()`. The default number of samples is 1. We may improve the accuracy of the estimate by increasing the number of samples.
2. Either our `guide()` or our `model()` is simply not good enough.
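The first point can be illustrated without Pyro. The sketch below uses a simple Monte Carlo estimate of $E[x^2]$ for $x \sim N(0, 1)$ as a stand-in for a sampled gradient estimate, and compares how much the estimate fluctuates with 1 versus 30 samples per step. The function names and sample counts are hypothetical and chosen only for illustration.

```python
import random
import statistics


def mc_estimate(num_samples, rng):
    """Monte Carlo estimate of E[x^2] for x ~ N(0, 1); the exact value is 1."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(num_samples)) / num_samples


def estimator_spread(num_samples, repeats=500, seed=0):
    """Standard deviation of the estimate across repeated, independent runs."""
    rng = random.Random(seed)
    estimates = [mc_estimate(num_samples, rng) for _ in range(repeats)]
    return statistics.pstdev(estimates)


spread_1 = estimator_spread(1)    # one sample per estimate: very noisy
spread_30 = estimator_spread(30)  # thirty samples per estimate: much tighter
```

The spread shrinks roughly with the square root of the number of samples, so more samples per step give a less noisy estimate at a proportional cost in compute.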

# Appendix

In [None]:
len(observation)

In [None]:
len(observation[0])

In [None]:
len(observation[1])

In [None]:
(mask != 0).sum()

In [None]:
observation[1].shape

In [None]:
observation[5].shape

In [None]:
mask.shape

In [None]:
claims.shape

In [None]:
claims_enc[claims_enc.source_id == 2]

In [None]:
claims_enc[claims_enc.object_id == 250]

In [None]:
data = lca.make_observation_mapper(observation, mask)
conditioned_lca = pyro.condition(lca.lca_model, data=data)

# for i in range(3):
#     utils.print_trace(pyro.poutine.trace(conditioned_lca).get_trace(observation, mask))
#     print('_' * 10)

In [None]:
trace = pyro.poutine.trace(conditioned_lca).get_trace(observation, mask)

In [None]:
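def get_observed_nodes(trace):
    """Collect the observed sample sites of a Pyro trace as {site name: value}."""
    observed_rvs = dict()
    for name, node in trace.nodes.items():
        # keep only sample sites whose value was fixed by conditioning
        if node['type'] == 'sample' and node['is_observed']:
            observed_rvs[node['name']] = node['value']
            # print(f'{node["name"]} - sampled value {node["value"]}')
    return observed_rvs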

In [None]:
observed_rvs = get_observed_nodes(trace)

In [None]:
len(observed_rvs)

In [None]:
len(claims_enc)

Something is wrong here: the model contains fewer observed claims than the data.
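One hypothesis worth checking, and it is only a hypothesis, is that some sources make more than one claim about the same object. Such rows would collapse into a single entry in a source-by-object mask. The sketch below counts repeated `(source_id, object_id)` pairs on toy data; `duplicated_pairs` is a hypothetical helper, not part of `spectrum`.

```python
from collections import Counter


def duplicated_pairs(claims):
    """claims: list of (source_id, object_id, value).

    Returns {(source_id, object_id): count} for pairs claimed more than once.
    """
    counts = Counter((s, o) for s, o, _value in claims)
    return {pair: n for pair, n in counts.items() if n > 1}


# Toy claims: source 0 asserts two different values for object 1
toy_claims = [(0, 1, 10), (0, 1, 12), (1, 1, 10), (2, 3, 5)]
toy_duplicates = duplicated_pairs(toy_claims)
n_distinct_pairs = len({(s, o) for s, o, _value in toy_claims})
```

If the number of distinct pairs equals the number of observed sites while the row count is larger, duplicated pairs would explain the gap.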

In [None]:
mask.sum()

In [None]:
mask.shape

In [None]:
mask[555, 197]

In [None]:
mask.shape