## Factor Count Evaluation using CV

This workflow evaluates the factor count of a single dataset using cross-validation (CV) on synthetic data. The workflow has the following steps:
1. Generate a synthetic dataset.
2. Split the synthetic dataset into train and test datasets by randomly assigning a fraction of the samples to each, without replacement.
3. Create a SA instance (base) using the train dataset for k factors.
4. Take the base H matrix and run a new SA instance holding H constant on the test dataset (V_test).
   1. Evaluate the loss of a direct calculation of W using V_test and H_base.
5. Keep track of the RMSE of the test model.
6. Repeat steps 3-5 increasing k.
7. Evaluate/plot the change in RMSE
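
The steps above can be sketched end to end with plain NumPy. This is a minimal illustration and not the ESAT implementation: `ls_nmf`, `weighted_rmse`, and `cv_factor_scan` are hypothetical helpers, and the uncertainty-weighted multiplicative updates stand in for the `SA` model. The toy data at the bottom is also an assumption made only for this example.

```python
import numpy as np

def ls_nmf(V, U, k, rng, H=None, hold_h=False, iters=200):
    """Uncertainty-weighted multiplicative updates (a stand-in for an SA model)."""
    weights = 1.0 / U**2
    WV = weights * V
    eps = 1e-12  # guards against division by zero
    W = rng.random((V.shape[0], k)) + 0.1
    if H is None:
        H = rng.random((k, V.shape[1])) + 0.1
    for _ in range(iters):
        if not hold_h:
            # Update H only when it is not being held constant
            H = H * (W.T @ WV) / (W.T @ (weights * (W @ H)) + eps)
        W = W * (WV @ H.T) / ((weights * (W @ H)) @ H.T + eps)
    return W, H

def weighted_rmse(V, U, W, H):
    """Root mean square of the uncertainty-scaled residuals."""
    return float(np.sqrt(np.mean(((V - W @ H) / U) ** 2)))

def cv_factor_scan(V, U, k_values, p=0.5, seed=0):
    """Steps 2-6: split once, then fit on train and score on test for each k."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(V.shape[0])
    n_train = int(p * V.shape[0])
    train, test = order[:n_train], order[n_train:]
    results = {}
    for k in k_values:
        # Step 3: base model on the train rows
        W_train, H_base = ls_nmf(V[train], U[train], k, rng)
        # Step 4: new model on the test rows with H held constant
        W_test, _ = ls_nmf(V[test], U[test], k, rng, H=H_base, hold_h=True)
        # Step 5: keep track of both errors
        results[k] = (weighted_rmse(V[train], U[train], W_train, H_base),
                      weighted_rmse(V[test], U[test], W_test, H_base))
    return results

# Toy data with 3 underlying factors (assumption for this sketch only)
toy_rng = np.random.default_rng(1)
V_toy = toy_rng.random((200, 3)) @ toy_rng.random((3, 10)) + 0.01
U_toy = 0.05 * V_toy + 0.01

scan = cv_factor_scan(V_toy, U_toy, k_values=range(1, 6))
for k, (train_err, test_err) in scan.items():
    print(f"k={k}: train RMSE={train_err:.3f}, test RMSE={test_err:.3f}")
```

Holding H constant on the test rows means only W is re-estimated, so the test error reflects how well the profiles learned from the train rows describe unseen samples.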

#### Code Imports

In [1]:
from esat.data.datahandler import DataHandler
from esat.model.batch_sa import BatchSA
from esat.model.sa import SA
from esat.data.analysis import ModelAnalysis, BatchAnalysis
from esat_eval.simulator import Simulator
from esat.estimator import FactorEstimator

# from scipy.sparse import csr_matrix
# from scipy.sparse.csgraph import min_weight_full_bipartite_matching
import logging
import plotly.graph_objects as go
import numpy as np
import copy

logger = logging.getLogger(__name__)

#### Synthetic Dataset

Generate a synthetic dataset where the factor profiles and contributions are pre-determined for model output analysis.

In [2]:
# Synthetic dataset parameters
seed = 42
syn_factors = 6                # Number of factors in the synthetic dataset
syn_features = 40              # Number of features in the synthetic dataset
syn_samples = 2000             # Number of samples in the synthetic dataset
outliers = True                # Add outliers to the dataset
outlier_p = 0.10               # Decimal percent of outliers in the dataset
outlier_mag = 1.25             # Magnitude of outliers
contribution_max = 2           # Maximum value of the contribution matrix (W) (Randomly sampled from a uniform distribution)
noise_mean_min = 0.03          # Min value for the mean of noise added to the synthetic dataset, used to randomly determine the mean decimal percentage of the noise for each feature.
noise_mean_max = 0.05          # Max value for the mean of noise added to the synthetic dataset, used to randomly determine the mean decimal percentage of the noise for each feature.
noise_scale = 0.02             # Scale of the noise added to the synthetic dataset
uncertainty_mean_min = 0.04    # Min value for the mean uncertainty of a data feature, used to randomly determine the mean decimal percentage for each feature in the uncertainty dataset. 
uncertainty_mean_max = 0.06    # Max value for the mean uncertainty of a data feature, used to randomly determine the mean decimal percentage for each feature in the uncertainty dataset. 
uncertainty_scale = 0.01       # Scale of the uncertainty matrix

In [3]:
# Initialize the simulator with the above parameters
simulator = Simulator(seed=seed,
                      factors_n=syn_factors,
                      features_n=syn_features,
                      samples_n=syn_samples,
                      outliers=outliers,
                      outlier_p=outlier_p,
                      outlier_mag=outlier_mag,
                      contribution_max=contribution_max,
                      noise_mean_min=noise_mean_min,
                      noise_mean_max=noise_mean_max,
                      noise_scale=noise_scale,
                      uncertainty_mean_min=uncertainty_mean_min,
                      uncertainty_mean_max=uncertainty_mean_max,
                      uncertainty_scale=uncertainty_scale
                     )

18-Apr-25 12:05:17 - Synthetic profiles generated
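
To make the role of these parameters concrete, the following sketch builds a small synthetic dataset with plain NumPy. It does not reproduce the internals of `Simulator`: `make_synthetic` is a hypothetical helper, and the way noise and uncertainty are applied here (as per-cell fractions of the data) is an assumption made for illustration.

```python
import numpy as np

def make_synthetic(samples_n=200, features_n=8, factors_n=3,
                   contribution_max=2.0, noise_mean=0.04, noise_scale=0.02,
                   uncertainty_mean=0.05, uncertainty_scale=0.01, seed=42):
    """Build V = W @ H plus fractional noise, and a matching uncertainty matrix U."""
    rng = np.random.default_rng(seed)
    H = rng.random((factors_n, features_n))                            # factor profiles
    W = rng.uniform(0.0, contribution_max, (samples_n, factors_n))     # contributions
    clean = W @ H
    # Noise as a signed fraction of each value (assumed noise model)
    noise_frac = rng.normal(noise_mean, noise_scale, size=clean.shape)
    sign = rng.choice([-1.0, 1.0], size=clean.shape)
    V = np.clip(clean * (1.0 + sign * noise_frac), 1e-12, None)
    # Uncertainty as a positive fraction of each value (assumed uncertainty model)
    unc_frac = np.abs(rng.normal(uncertainty_mean, uncertainty_scale, size=clean.shape))
    U = np.clip(V * unc_frac, 1e-12, None)
    return V, U, W, H

V_syn, U_syn, W_true, H_true = make_synthetic()
print(V_syn.shape, U_syn.shape, W_true.shape, H_true.shape)
```

Because `W_true` and `H_true` are known, the factor count recovered by the cross-validation can be compared against the value used to generate the data.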


In [4]:
# Example command for passing in a custom factor profile matrix, instead of the randomly generated profile matrix.
# my_profile = np.ones(shape=(syn_factors, syn_features))
# simulator.generate_profiles(profiles=my_profile)

In [5]:
# Example of how to customize the factor contributions. Curve_type options: 'uniform', 'decreasing', 'increasing', 'logistic', 'periodic'
# simulator.update_contribution(factor_i=0, curve_type="logistic", scale=0.1, frequency=0.5)
# simulator.update_contribution(factor_i=1, curve_type="periodic", minimum=0.0, maximum=1.0, frequency=0.5, scale=0.1)
# simulator.update_contribution(factor_i=2, curve_type="increasing", minimum=0.0, maximum=1.0, scale=0.1)
# simulator.update_contribution(factor_i=3, curve_type="decreasing", minimum=0.0, maximum=1.0, scale=0.1)
# simulator.plot_synthetic_contributions()

#### Load Data
Assign the processed data and uncertainty datasets to the variables V and U. These steps will be simplified/streamlined in a future version of the code.

In [6]:
syn_input_df, syn_uncertainty_df = simulator.get_data()

18-Apr-25 12:05:17 - Synthetic data generated
18-Apr-25 12:05:17 - Synthetic uncertainty data generated
18-Apr-25 12:05:17 - Synthetic dataframes completed
18-Apr-25 12:05:17 - Synthetic source apportionment instance created.


In [7]:
data_handler = DataHandler.load_dataframe(input_df=syn_input_df, uncertainty_df=syn_uncertainty_df)
V, U = data_handler.get_data()

#### Input Parameters

In [8]:
index_col = "Date"                  # the index of the input/uncertainty datasets
method = "ls-nmf"                   # "ls-nmf", "ws-nmf"
models = 20                         # the number of models to train
init_method = "col_means"           # initialization method: "col_means" (default, column means), "kmeans", or "cmeans"
init_norm = True                    # if init_method=kmeans or cmeans, normalize the data prior to clustering.
seed = 42                           # random seed for initialization
max_iterations = 20000              # the maximum number of iterations for fitting a model
converge_delta = 0.1                # convergence criteria for the change in loss, Q
converge_n = 25                     # convergence criteria for the number of steps where the loss changes by less than converge_delta
verbose = True                      # adds more verbosity to the algorithm workflow on execution.
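
The two convergence parameters work together: training stops once the loss has changed by less than `converge_delta` for `converge_n` steps. The sketch below shows one way to read that rule. `has_converged` is a hypothetical helper, and the exact rule used inside ESAT may differ (for example, in whether the steps must be consecutive).

```python
def has_converged(loss_history, converge_delta=0.1, converge_n=25):
    """Return True when the last `converge_n` consecutive loss changes are all below `converge_delta`."""
    if len(loss_history) < converge_n + 1:
        return False  # not enough steps recorded yet
    recent = loss_history[-(converge_n + 1):]
    changes = [abs(b - a) for a, b in zip(recent[:-1], recent[1:])]
    return all(change < converge_delta for change in changes)

# Toy loss curve that flattens out over time (assumption for this sketch only)
losses = [100.0 / (i + 1) for i in range(200)]

history = []
for step, loss in enumerate(losses):
    history.append(loss)
    if has_converged(history, converge_delta=0.1, converge_n=25):
        print(f"stopped at step {step}")
        break
```

A larger `converge_n` makes the stopping rule more conservative, at the cost of extra iterations after the loss has effectively flattened.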

#### Utility Functions

In [12]:
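def calculate_W(V, U, H):
    # Direct (unweighted) estimate of W from V and a fixed H; U is currently unused.
    # Work on a copy so the caller's H is not modified in place.
    H = np.array(H, copy=True)
    H[H <= 0.0] = 1e-8
    # W = np.matmul(V * np.divide(1, U ** 2), H.T)
    W = np.matmul(V, H.T)
    return W

def q_loss(V, U, H, W):
    # Q: sum of the squared, uncertainty-scaled residuals
    residuals = ((V-np.matmul(W, H))/U)**2
    return np.sum(residuals)

def rmse(V, U, H, W):
    # Root mean square of the uncertainty-scaled residuals.
    # Parameter names match the keyword arguments used by the callers (V=, U=, H=, W=).
    WH = np.matmul(W, H)
    residuals = ((V-WH)/U)**2
    return np.sqrt(np.sum(residuals)/V.size)

def plot_results(train_loss, test_loss, min_k, max_k, true_k):
    # max_k is inclusive, matching range(min_factors, max_factors+1) in the training loop
    x = np.arange(min_k, max_k + 1)
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=x, y=train_loss, name="Train"))
    fig.add_trace(go.Scatter(x=x, y=test_loss, name="Test"))
    # add_vline is a method of the Figure, not a function in plotly.graph_objects
    fig.add_vline(x=true_k)
    fig.update_layout(title_text="Train and Test RMSE by Factor Count (k)", width=800, height=600)
    fig.show()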
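
The direct calculation `W = V Hᵀ` used in `calculate_W` coincides with the least-squares solution only when the rows of H are orthonormal. As a point of comparison, the sketch below solves for W with `np.linalg.lstsq`. `solve_W_lstsq` is a hypothetical helper, it ignores the uncertainties U, and clipping negatives to zero is a crude substitute for a true non-negative least-squares solve.

```python
import numpy as np

def solve_W_lstsq(V, H):
    """Least-squares W for a fixed H: minimizes ||V - W @ H||_F, then clips negatives."""
    # V ≈ W @ H is equivalent to H.T @ W.T ≈ V.T, which lstsq solves column by column
    W_T, *_ = np.linalg.lstsq(H.T, V.T, rcond=None)
    return np.clip(W_T.T, 0.0, None)

# Noise-free toy example (assumption for this sketch only)
demo_rng = np.random.default_rng(0)
W_known = demo_rng.random((50, 3))
H_known = demo_rng.random((3, 10))
V_demo = W_known @ H_known

W_lstsq = solve_W_lstsq(V_demo, H_known)
W_direct = V_demo @ H_known.T

print("lstsq recovers W: ", np.allclose(W_lstsq, W_known))
print("V @ H.T recovers W:", np.allclose(W_direct, W_known))
```

On noise-free data the least-squares solve recovers the known W, while `V @ H.T` returns `W @ (H @ H.T)`, which is a rescaled and mixed version of it.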

In [10]:
# Split the dataset
rng = np.random.default_rng(seed)
p = 0.5

train_selection = rng.random(size=syn_samples)
train_V = copy.copy(V)[train_selection < p]
train_U = copy.copy(U)[train_selection < p]
test_V = copy.copy(V)[train_selection >= p]
test_U = copy.copy(U)[train_selection >= p]
print(f"V: {V.shape}, Vtrain: {train_V.shape}, Vtest: {test_V.shape}")
print(f"Number of samples - train: {train_V.shape[0]}, test: {test_V.shape[0]}")

V: (2000, 40), Vtrain: (1008, 40), Vtest: (992, 40)
Number of samples - train: 1008, test: 992
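
The mask-based split above assigns each row independently, so the train and test sizes only approximate `p` (1008 and 992 here). If an exact split is preferred, a permutation of the row indices gives one. This is a sketch of an alternative, not a change to the workflow, and `split_rows` is a hypothetical helper.

```python
import numpy as np

def split_rows(n_rows, train_fraction=0.5, seed=42):
    """Return disjoint train/test row indices with an exact train size."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)              # sampling without replacement
    n_train = int(round(train_fraction * n_rows))
    # Sort so that each subset keeps the original row order
    return np.sort(order[:n_train]), np.sort(order[n_train:])

train_idx, test_idx = split_rows(2000, train_fraction=0.5)
print(len(train_idx), len(test_idx))
```

The returned index arrays can be used directly on NumPy arrays, for example `V[train_idx]` and `U[train_idx]`.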


In [11]:
%%capture
min_factors = 2
max_factors = 12

batch = True
test_rmse = []
train_rmse = []
calc_rmse = []

for k in range(min_factors, max_factors+1):
    if batch:
        sa_models = BatchSA(V=train_V, U=train_U, factors=k, models=models, method=method, seed=seed, max_iter=max_iterations,
                            converge_delta=converge_delta, converge_n=converge_n, verbose=False)
        _ = sa_models.train()
        error0 = np.mean([rmse(V=train_V, U=train_U, H=sa.H, W=sa.W) for sa in sa_models.results])
        batch_test = []
        w_test = []
        for sa in sa_models.results:
            sa_test = SA(factors=k, method=method, V=test_V, U=test_U, seed=seed, verbose=False)
            sa_test.initialize(H=sa.H)
            _ = sa_test.train(max_iter=max_iterations, converge_delta=converge_delta, converge_n=converge_n, hold_h=True)
            batch_test.append(rmse(V=test_V, U=test_U, H=sa_test.H, W=sa_test.W))
            w_test.append(rmse(V=test_V, U=test_U, H=sa_test.H, W=calculate_W(V=test_V, U=test_U, H=sa_test.H)))
        error = np.mean(batch_test)
        w_error = np.mean(w_test)
    else:
        sa = SA(factors=k, method=method, V=train_V, U=train_U, seed=seed, verbose=False)
        sa.initialize()
        _ = sa.train(max_iter=max_iterations, converge_delta=converge_delta, converge_n=converge_n)
        error0 = rmse(V=train_V, U=train_U, H=sa.H, W=sa.W)
    
        sa_test = SA(factors=k, method=method, V=test_V, U=test_U, seed=seed, verbose=False)
        sa_test.initialize(H=sa.H)
        _ = sa_test.train(max_iter=max_iterations, converge_delta=converge_delta, converge_n=converge_n, hold_h=True)
    
        error = rmse(V=test_V, U=test_U, H=sa_test.H, W=sa_test.W)
        w_error = rmse(V=test_V, U=test_U, H=sa_test.H, W=calculate_W(V=test_V, U=test_U, H=sa_test.H))
    test_rmse.append(error)
    train_rmse.append(error0)
    calc_rmse.append(w_error)
    logger.info(f"Factor: {k}, Train RMSE: {error0:.4f}, Test RMSE: {error:.4f}, Calc RMSE: {w_error:.4f}")

plot_results(train_rmse, test_rmse, min_factors, max_factors, syn_factors)

18-Apr-25 12:05:27 - Factor: 2, Train RMSE: 490.2499084472656, Test RMSE: 934.0377807617188, Calc RMSE: 320.7646484375
18-Apr-25 12:05:46 - Factor: 3, Train RMSE: 542.1683349609375, Test RMSE: 1041.618896484375, Calc RMSE: 551.7781982421875
18-Apr-25 12:06:16 - Factor: 4, Train RMSE: 593.281005859375, Test RMSE: 1074.587158203125, Calc RMSE: 846.4871215820312
18-Apr-25 12:06:46 - Factor: 5, Train RMSE: 648.4396362304688, Test RMSE: 1132.601806640625, Calc RMSE: 990.6328125
18-Apr-25 12:09:34 - Factor: 6, Train RMSE: 663.2329711914062, Test RMSE: 1145.047607421875, Calc RMSE: 964.9929809570312
18-Apr-25 12:14:46 - Factor: 7, Train RMSE: 669.8819580078125, Test RMSE: 1133.564697265625, Calc RMSE: 1010.6566162109375
18-Apr-25 12:21:22 - Factor: 8, Train RMSE: 691.0838623046875, Test RMSE: 1186.409423828125, Calc RMSE: 1222.4281005859375
18-Apr-25 12:29:43 - Factor: 9, Train RMSE: 702.8465576171875, Test RMSE: 1216.1820068359375, Calc RMSE: 1146.267578125
18-Apr-25 12:40:54 - Factor: 10, T

AttributeError: module 'plotly.graph_objects' has no attribute 'add_vrect'

In [16]:
# sa_models0 = BatchSA(V=train_V, U=train_U, factors=6, models=20, method=method, seed=seed, max_iter=max_iterations,
#                     converge_delta=converge_delta, converge_n=converge_n, verbose=True)
# _ = sa_models0.train()
import pandas as pd

pd.DataFrame(train_V).describe()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1008.0 | 1008.0 | 1008.0 | 1008.0 | 1008.0 | 1008.0 | 1008.0 | 1008.0 | 1008.0 | 1008.0 | ... | 1008.0 | 1008.0 | 1008.0 | 1008.0 | 1008.0 | 1008.0 | 1008.0 | 1008.0 | 1008.0 | 1008.0 |
| mean | 2.587054 | 2.323097 | 0.78724 | 1.663305 | 1.8984 | 1.788197 | 1.333199 | 1.450338 | 0.898467 | 0.721875 | ... | 2.120828 | 1.486728 | 0.443981 | 0.763109 | 2.330314 | 1.132336 | 0.318356 | 1.57655 | 2.01031 | 1.409685 |
| std | 0.770119 | 0.814386 | 0.321141 | 0.52865 | 0.68246 | 0.697436 | 0.521107 | 0.639544 | 0.367544 | 0.425958 | ... | 0.718732 | 0.5713 | 0.154324 | 0.284998 | 0.818652 | 0.490869 | 0.189125 | 0.641379 | 0.759599 | 0.608248 |
| min | 0.744725 | 0.350683 | 0.024138 | 0.346828 | 0.106687 | 0.164197 | 0.033892 | 0.034038 | 0.035175 | 0.00082 | ... | 0.225523 | 0.11994 | 0.034896 | 0.097423 | 0.234512 | 0.071627 | 0.000386 | 0.098258 | 0.167908 | 0.072518 |
| 25% | 2.022677 | 1.734606 | 0.545871 | 1.272776 | 1.398396 | 1.29586 | 0.954282 | 0.99762 | 0.623766 | 0.348021 | ... | 1.578108 | 1.0364 | 0.334998 | 0.548912 | 1.748714 | 0.76779 | 0.158452 | 1.12893 | 1.439226 | 0.960116 |
| 50% | 2.544799 | 2.262204 | 0.791199 | 1.643331 | 1.895736 | 1.823162 | 1.33682 | 1.427165 | 0.877048 | 0.709751 | ... | 2.129369 | 1.500919 | 0.441013 | 0.746416 | 2.30799 | 1.124793 | 0.316635 | 1.584202 | 1.992381 | 1.407355 |
| 75% | 3.131643 | 2.88922 | 1.002011 | 2.035403 | 2.385285 | 2.285754 | 1.707176 | 1.901816 | 1.161114 | 1.063148 | ... | 2.652114 | 1.902668 | 0.548559 | 0.962456 | 2.903333 | 1.474634 | 0.468088 | 2.032405 | 2.5582 | 1.808667 |
| max | 5.033365 | 4.63057 | 1.725452 | 3.22445 | 4.02354 | 4.445809 | 3.261866 | 3.274368 | 2.231853 | 1.738794 | ... | 4.326731 | 3.177294 | 0.883657 | 1.784791 | 5.381936 | 2.915104 | 0.869227 | 3.60507 | 4.269807 | 3.218237 |


In [17]:
pd.DataFrame(V).describe()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | ... | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 2.613851 | 2.353155 | 0.794718 | 1.670762 | 1.911451 | 1.799721 | 1.343144 | 1.467516 | 0.903949 | 0.732818 | ... | 2.125679 | 1.497046 | 0.445924 | 0.770044 | 2.348559 | 1.142763 | 0.321091 | 1.597178 | 2.019863 | 1.413956 |
| std | 0.789029 | 0.848043 | 0.328527 | 0.542966 | 0.690148 | 0.704496 | 0.526303 | 0.652096 | 0.377909 | 0.434342 | ... | 0.722725 | 0.574503 | 0.155491 | 0.287849 | 0.834327 | 0.486733 | 0.188429 | 0.65311 | 0.765154 | 0.597097 |
| min | 0.409886 | 0.162426 | 0.024138 | 0.274092 | 0.106687 | 0.125554 | 0.033892 | 0.034038 | 0.035175 | 0.00064 | ... | 0.225523 | 0.11994 | 0.034896 | 0.081839 | 0.220966 | 0.031842 | 0.000386 | 0.066746 | 0.167908 | 0.035763 |
| 25% | 2.039744 | 1.732792 | 0.549574 | 1.271369 | 1.39528 | 1.281073 | 0.955841 | 0.995211 | 0.624262 | 0.354084 | ... | 1.584719 | 1.058695 | 0.334906 | 0.54944 | 1.742806 | 0.788115 | 0.158681 | 1.131793 | 1.450418 | 0.976774 |
| 50% | 2.600091 | 2.300804 | 0.793235 | 1.650521 | 1.920395 | 1.803769 | 1.339596 | 1.435394 | 0.895936 | 0.726058 | ... | 2.141906 | 1.509355 | 0.44094 | 0.760227 | 2.329631 | 1.132257 | 0.320616 | 1.603519 | 1.995757 | 1.406387 |
| 75% | 3.150693 | 2.942218 | 1.021115 | 2.04985 | 2.385841 | 2.323434 | 1.728043 | 1.912082 | 1.184009 | 1.098434 | ... | 2.639029 | 1.90497 | 0.55569 | 0.977431 | 2.923973 | 1.483566 | 0.475436 | 2.066769 | 2.565723 | 1.81321 |
| max | 5.033365 | 5.381397 | 1.842426 | 3.820948 | 4.365594 | 4.445809 | 3.261866 | 3.710737 | 2.231853 | 1.758098 | ... | 4.441372 | 3.177294 | 0.976088 | 1.784791 | 5.381936 | 3.015301 | 0.869227 | 3.60507 | 4.77059 | 3.36941 |
