### Factor Identification Test

The following set of functions are to test how well the algorithms (PMF, LS-NMF, WS-NMF) are able to capture the factor/source contributions to the input data set by generating simulated data for each factor and providing the product dataset to the algorithms. The outputs are reviewed and compared to the original 'true' factor matrices.

The simulated datasets are generated by specifying the number of factors/sources (k), the number of features (m), the number of samples (n), the min and max % of noise to be applied to the input dataset, and the min and max % uncertainty to be used to create the uncertainty dataset. The pre-processing steps include: 
1. For each feature, select at random which factors contribute.
2. Each factor that contributes determine the ratio of contribution for those factors.
3. Based upon the ratio, create a concentration amount for each factor/feature.
4. Input dataset is created by summing all factor matrices together and adding random noise between min_noise_p (%) and max_noise_p (%), for each cell in the input dataset.
5. Calcualte the uncertainty dataset from a random value between min_uncertainty_p (%) and max_uncertainty_p (%) for each cell in the input dataset.

The outputs are compared by checking factor mapping $R^2$ and the resulting time-series $R^2$.

In [80]:
# Notebook imports
import os
import sys
import copy
import logging
import time
import json
import pandas as pd
import numpy as np
from scipy import stats
import plotly.graph_objects as go
from plotly.subplots import make_subplots

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
from python.data.datahandler import DataHandler
from python.model.nmf import NMF
from python.model.model import BatchNMF

In [233]:
k = 6    # Number of factors
m = 40   # Number of features
n = 500  # Number of samples
min_noise = 0.0    # Minimum noise applied to combined input dataset cell
max_noise = 1.0     # Maximum noise applied to combined input dataset cell
min_unc = 5.0       # Minimum uncertainty of a cell value
max_unc = 7.0       # Maximum uncertainty of a cell value
seed = 42           # Random seed

min_value = 1e-3
rng = np.random.default_rng(seed)

In [234]:
factor_matrices = [np.zeros(shape=(n, m)) for _k in range(k) ]     # Set all factor matrices to zero
for feature in range(n):
    feature_factors = rng.choice(k, rng.integers(2, k), replace=False)
    for ff in feature_factors:
        ff_contributions = rng.uniform(1e-3, rng.integers(1, 10) * 1.0, size=(m))
        factor_matrices[ff][feature] = ff_contributions
base_input = np.sum(factor_matrices, axis=0)                                # Add together all the factors to form the base input dataset
noise_matrix = (rng.uniform(min_noise, max_noise, size=(n, m)) / 100) + 1   # Generate uniform sampling of noise to be applied to the base input dataset
data_matrix = np.multiply(base_input, noise_matrix)                         # Add noise to the base input dataset
uncertainty_values = (rng.uniform(min_unc, max_unc, size=(n, m)) / 100)     # Generate uniform sampling of uncertainty used to create the uncertainty matrix
uncertainty_matrix = np.multiply(data_matrix, uncertainty_values)           # Create uncertainty dataset

In [222]:
# true_H = rng.uniform(1e-3, 1.0, size=(k, m))
# true_H = np.divide(true_H, np.sum(true_H, axis=0))
# true_W = rng.uniform(min_value, 100.0, size=(n, k))
# base_input = np.matmul(true_W, true_H)
# noise_matrix = (rng.uniform(min_noise, max_noise, size=(n, m)) / 100) + 1   # Generate uniform sampling of noise to be applied to the base input dataset
# data_matrix = np.multiply(base_input, noise_matrix)                         # Add noise to the base input dataset
# uncertainty_values = (rng.uniform(min_unc, max_unc, size=(n, m)) / 100)     # Generate uniform sampling of uncertainty used to create the uncertainty matrix
# uncertainty_matrix = np.multiply(data_matrix, uncertainty_values)           # Create uncertainty dataset

In [223]:
factors = 6
models = 10

In [237]:
%%time
# Training multiple models
method = "ws-nmf"

batch_br = BatchNMF(V=data_matrix, U=uncertainty_matrix, max_iter=10000, converge_delta=0.0001, converge_n=10, factors=factors, models=models, method=method, seed=seed, verbose=True)
batch_br.train()
wh = batch_br.results[batch_br.best_epoch]['wh']
res = np.corrcoef(wh, base_input)
res_cor = res[0, 1]
r2 = res_cor**2
r2

30-Jun-23 16:05:36 - Results - Best Model: 4, Converged: True, Q: 1313849.4856006966
30-Jun-23 16:05:36 - Runtime: 8.01 min(s)


CPU times: total: 2.66 s
Wall time: 8min


0.5312108699064803

In [236]:
%%time
# Training multiple models
method = "ls-nmf"

batch_br2 = BatchNMF(V=data_matrix, U=uncertainty_matrix, max_iter=100000, converge_delta=0.0001, converge_n=100, factors=factors, models=50, method=method, seed=seed, verbose=True)
batch_br2.train()
wh2 = batch_br2.results[batch_br2.best_epoch]['wh']
res2 = np.corrcoef(wh2, base_input)
res_cor2 = res2[0, 1]
r2_2 = res_cor2**2
r2_2

30-Jun-23 15:25:20 - Results - Best Model: 5, Converged: True, Q: 1255049.4113168067
30-Jun-23 15:25:20 - Runtime: 16.17 min(s)


CPU times: total: 2.08 s
Wall time: 16min 10s


0.09936929952348014

In [11]:
from python.model.optimization import FactorSearch

In [12]:
# fs = FactorSearch(seed=seed, data=data_matrix, uncertainty=uncertainty_matrix)
# fs.search(min_factor=2, max_factor=12, max_iterations=20000)

In [13]:
q_values = [round(v['Q'],2) for k, v in fs.results.items()]
q_delta = [ q_values[i-1] - q_values[i] for i in range(1, len(q_values))]

NameError: name 'fs' is not defined

In [None]:
import plotly.graph_objects as go

x = [i for i in range(1, len(q_values)+1)]
x2 = [i for i in range(2, len(q_delta)+2)]
factor_graph = go.Figure(data=[go.Scatter(x=x, y=q_values, name="Q"), go.Scatter(x=x2, y=q_delta, name="Q Delta")])
factor_graph.update_layout(title="Factor Count vs Loss (Q)", xaxis_title="Factors", yaxis_title="Loss (Q)")
factor_graph.layout.height = 600
factor_graph.show()