## EPA NMF-PY Algorithm Details

#### Summary
This notebook describes two algorithms that meet the constraints of PMF5, produce models that correlate highly with the outputs of PMF5, and generate models with a loss value that matches or is lower than that of PMF5. The algorithm used in PMF5, Multilinear Engine 2 (ME2), was built upon the original ME1 algorithm detailed in the publication by Pentti Paatero, [*The Multilinear Engine: A Table-Driven, Least Squares Program for Solving Multilinear Problems, Including the n-Way Parallel Factor Analysis Model*](https://www.jstor.org/stable/1390831). The algorithm uses a modified conjugate gradient method with projection to maintain non-negativity of the factor profile matrix. It falls under the category of semi-NMF algorithms, as not all the matrices are required to be non-negative. Non-negativity is not fully enforced for two reasons. First, it helps prevent a positive bias caused by uncertainty and small values in the input datasets. Second, it allows datasets that contain small negative values because of how the data was collected, where the input data represents a distribution rather than a single data point.

#### Algorithm Requirements
The constraints and requirements of the PMF5 algorithm which had to be considered in the new NMF-PY algorithms are:
 - The loss function had to remain the same as the ME2 loss function.
 - The output of NMF-PY must be able to produce results which have a high correlation with the ME2 output. Reproducing the output of ME2 exactly is highly improbable due to any differences in the update procedure and the randomness of the starting state.
   - A correlation of greater than 0.9 averaged across the factor profile, factor contributions and the concentration output.
   - A correlation of greater than 0.98 for the factor profile.
 - The algorithm needs to properly function when negative values are present in the input data, and allow for negative values to be present in the factor contribution matrix.
 - The output model needs to have a loss value that is comparable to PMF5.
 - The algorithm should be as fast as or faster than ME2.

#### Loss Function
The loss function used in ME2, as described in the PMF5 User's Manual, is defined as:
$$ Q = \sum_{i=1}^N \sum_{j=1}^M \bigg[ \frac{V_{ij} - \sum_{k=1}^K W_{ik} H_{kj}}{U_{ij}} \bigg]^2 $$
where $V$ is the input data matrix of samples (rows, $N$) by features (columns, $M$), $U$ is the uncertainty matrix of the input data matrix, $W$ is the factor contribution matrix of samples by factors ($K$), and $H$ is the factor profile matrix of factors by features.
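
As a minimal sketch of the loss above, the function below evaluates $Q$ with NumPy. The name `weighted_q` is hypothetical and is not taken from the NMF-PY code base; it assumes that `V`, `U`, `W` and `H` are arrays with the shapes described above and that `U` contains no zeros.

```python
import numpy as np

def weighted_q(V, U, W, H):
    """Sum of squared, uncertainty-scaled residuals: Q = sum(((V - WH) / U)**2)."""
    residuals = (V - W @ H) / U
    return float(np.sum(residuals ** 2))

# Tiny worked example: 2 samples, 2 features, 1 factor (illustrative values only).
V = np.array([[1.0, 2.0], [2.0, 4.0]])
U = np.array([[0.5, 0.5], [1.0, 1.0]])
W = np.array([[1.0], [2.0]])
H = np.array([[1.0, 1.0]])
q_example = weighted_q(V, U, W, H)  # residuals / U = [[0, 2], [0, 2]], so Q = 8.0
```

Because each residual is divided by its uncertainty before squaring, values with large uncertainty contribute less to $Q$ than values that are known precisely.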

#### Convergence
These algorithms all use similar convergence criteria as the stopping condition. PMF5 has a tiered approach, which is described in full in the User's Manual. The two NMF-PY algorithms offer two parameters, *converge_delta* and *converge_n*, which can be used to adjust the convergence criteria and tune the final output. Once the change in $Q$ has been less than *converge_delta* for *converge_n* steps, the model is considered converged and updates stop. In testing, these values are typically set to *converge_delta* = 0.1 and *converge_n* = 10 (chosen to speed up model convergence during testing and development).
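
The stopping rule can be sketched as a small helper that scans a history of loss values. This is an illustration only: the function name `steps_until_converged` is hypothetical, and it assumes that the *converge_n* qualifying steps must be consecutive, which the description above does not state explicitly.

```python
def steps_until_converged(q_values, converge_delta=0.1, converge_n=10):
    """Return the step (index into q_values) at which the rule fires, or None.

    q_values[0] is the loss before any update. Assumption: the rule needs
    `converge_n` consecutive steps whose absolute change in Q is below
    `converge_delta`.
    """
    streak = 0
    for step in range(1, len(q_values)):
        if abs(q_values[step] - q_values[step - 1]) < converge_delta:
            streak += 1
        else:
            streak = 0  # a large change resets the count
        if streak >= converge_n:
            return step
    return None

# Q drops quickly for four steps, then flattens out (made-up values).
q_history = [100.0, 50.0, 25.0, 12.0, 11.0, 10.95] + [10.94] * 12
stop_step = steps_until_converged(q_history, converge_delta=0.1, converge_n=10)
```

A smaller *converge_delta* or a larger *converge_n* makes the rule stricter, so the model trains for longer before it is declared converged.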

#### Initial Conditions
The performance of NMF algorithms is very sensitive to the initial conditions of the model, in this case the choices for the $W$ and $H$ matrices. There are multiple methods for initialization that are typically used for NMF. The method used by PMF5/ME2 is unknown. For NMF-PY, we have provided three different methods of initialization:
 - Random sampling from a normal distribution using the square root of the mean values of the input dataset $V$ by row $N$ for $W$ and by column $M$ for $H$. This is the default method in most NMF packages.
 - K-Means clustering, where the input dataset is normalized (can also be clustered without normalization) and $k$ clusters calculated. Allocation of a factor to a cluster is set to 1.0 and all other values are equal to $\frac{1}{k}$.
 - C-Means clustering, also known as fuzzy k-means clustering. This is similar to K-Means, but assignment to a cluster is continuous rather than binary, calculated from the distances to the cluster centroids and the ratio of those distances.

Other methods of initialization exist, but these are the ones currently provided in NMF-PY.
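
To make the K-Means option more concrete, the sketch below turns a vector of cluster labels into a starting matrix in which the assigned cluster gets 1.0 and every other entry gets $\frac{1}{k}$. The helper name `membership_from_labels` is hypothetical, the labels are assumed to come from any K-Means implementation, and using the result as the initial sample-by-factor matrix $W$ is an assumption rather than a description of the NMF-PY code.

```python
import numpy as np

def membership_from_labels(labels, k):
    """Build an (n_samples, k) matrix: 1.0 for the assigned cluster, 1/k elsewhere."""
    labels = np.asarray(labels)
    W0 = np.full((labels.size, k), 1.0 / k)
    W0[np.arange(labels.size), labels] = 1.0
    return W0

# Four samples assigned to three clusters (made-up labels).
labels = [0, 2, 1, 2]
W0 = membership_from_labels(labels, k=3)
```

Filling the non-assigned entries with $\frac{1}{k}$ instead of zero keeps every entry strictly positive, which matters for multiplicative updates because an entry that starts at zero stays at zero.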

### NMF-PY Algorithms
We have implemented two algorithms, one of which fully satisfies the algorithm requirements stated above and another which satisfies all but the condition of allowing for negative values.

#### LS-NMF
The first algorithm option we provide is a well-documented algorithm called *LS-NMF* (least-squares NMF), which is also available in the R NMF package. The ls-nmf algorithm is documented in [*LS-NMF: A modified non-negative matrix factorization algorithm utilizing uncertainty estimates*](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-175) and the R NMF package can be found at [https://cran.r-project.org/web/packages/NMF/index.html](https://cran.r-project.org/web/packages/NMF/index.html). The NMF-PY version of this algorithm converts the uncertainty $U$ into weights defined as $Uw = \frac{1}{U^2}$. The update equations then become:

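$$ H_{t+1} = H_t \circ \frac{W_t^{T} (V \circ Uw)}{W_t^{T} ((W_{t}H_{t}) \circ Uw)} $$

$$ W_{t+1} = W_t \circ \frac{(V \circ Uw) H_{t+1}^{T}}{((W_{t}H_{t+1})\circ Uw )H_{t+1}^{T}} $$

Here $\circ$ denotes element-wise multiplication, and the fractions denote element-wise division.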

The ls-nmf algorithm requires that all matrices be non-negative but is able to produce models with a lower loss value than PMF5 and is significantly faster than ME2. We include this algorithm as an option for when full non-negativity is permitted in the models because of the algorithm's performance and efficiency.
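
The multiplicative updates can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the NMF-PY implementation: the function name `ls_nmf_step`, the small `eps` guard against division by zero, and the synthetic data are all assumptions. The transposes are written out explicitly so that the shapes agree with $V$ (samples by features), $W$ (samples by factors) and $H$ (factors by features).

```python
import numpy as np

def ls_nmf_step(V, Uw, W, H, eps=1e-12):
    """One multiplicative LS-NMF update of H, then W, with weights Uw = 1 / U**2."""
    WV = V * Uw                                          # weighted data, V ∘ Uw
    H = H * (W.T @ WV) / (W.T @ ((W @ H) * Uw) + eps)    # update profiles first
    W = W * (WV @ H.T) / (((W @ H) * Uw) @ H.T + eps)    # then contributions, using new H
    return W, H

# Synthetic non-negative data (made up for illustration).
rng = np.random.default_rng(0)
V = rng.uniform(0.1, 5.0, size=(20, 6))
U = 0.1 * V + 0.05
Uw = 1.0 / U**2
W = rng.uniform(0.1, 1.0, size=(20, 3))
H = rng.uniform(0.1, 1.0, size=(3, 6))

q_start = float(np.sum(((V - W @ H) / U) ** 2))
for _ in range(200):
    W, H = ls_nmf_step(V, Uw, W, H)
q_end = float(np.sum(((V - W @ H) / U) ** 2))
```

Because every factor in the update is non-negative, $W$ and $H$ remain non-negative as long as they start that way, which is why this algorithm cannot produce negative factor contributions.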

#### Weighted Semi-NMF
The weighted semi-nmf (ws-nmf) algorithm is a more complicated algorithm that more closely resembles the ME1 algorithm. The ws-nmf algorithm satisfies all of the requirements for a replacement algorithm of PMF5. The implemented algorithm was developed utilizing elements from two different publications, neither of which provided the complete algorithm we required for NMF-PY. [*Semi-NMF and Weighted Semi-NMF Algorithms Comparison*](https://www2.dc.ufscar.br/~marcela/anaisKDMiLe2014/artigos/FULL/kdmile2014_Artigo15.pdf) provides an overview of the semi-nmf algorithm and part of the update equation for ws-nmf. [*Convex and Semi-Nonnegative Matrix Factorizations*](https://ieeexplore.ieee.org/abstract/document/4685898) provides details on a complete algorithm for the non-weighted semi-nmf algorithm. Using these details, we developed a complete update algorithm which may not yet have been published; further literature review is necessary to determine the novelty of the algorithm.

As in the ls-nmf algorithm, the uncertainties are converted to weights $Uw$. In both algorithms, the loss function remains the same and uses the uncertainty and not the weights to maintain consistency with PMF5. The update equations for ws-nmf are:

$$ W_{t+1,i} = (H^{T}Uw_{i}^{d}H)^{-1}(H^{T}Uw_{i}^{d}V_{i})$$
$$ H_{t+1,i} = H_{t, i}\sqrt{\frac{((V^{T}Uw)W_{t+1})_{i}^{+} + [H_{t}(W_{t+1}^{T}Uw W_{t+1})^{-}]_{i}}{((V^{T}Uw)W_{t+1})_{i}^{-} + [H_{t}(W_{t+1}^{T}Uw W_{t+1})^{+}]_{i}}}$$

Each matrix requires a separate calculation for each sample ($N$ in total) for $W$ and each feature ($M$ in total) for $H$, as indicated by the $i$ index, which increases the computational complexity of the algorithm. $Uw_{i}^{d}$ is the diagonal matrix created from the $i$th column or row of $Uw$, depending on which matrix is being updated. The first term of the update equation for $W_{t+1,i}$ requires a matrix inverse, which exists only if the determinant is not equal to zero; when the determinant is zero, the pseudo-inverse is used instead. The calculation of $H_{t+1}$ requires several additional steps. The positive and negative values from $W$ are separated with $W^{-} = \frac{(|W| - W)}{2.0}$ and $W^{+} = \frac{(|W| + W)}{2.0}$.
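
The two building blocks described above, the per-sample weighted least-squares update of $W$ and the split into positive and negative parts, can be sketched as follows. This is not the NMF-PY implementation: the function names are hypothetical, $H$ is assumed to be stored as factors by features (so the transposes are placed to make the shapes agree under that assumption), and `np.linalg.pinv` is used for every row because it coincides with the ordinary inverse whenever that inverse exists. The $H$ update is not included in this sketch.

```python
import numpy as np

def split_pos_neg(A):
    """Return (A_plus, A_minus), both non-negative, with A = A_plus - A_minus."""
    return (np.abs(A) + A) / 2.0, (np.abs(A) - A) / 2.0

def update_w_rows(V, Uw, H):
    """Weighted least-squares update of W, one sample (row) at a time.

    Assumes V and Uw are (N, M) and H is (K, M), i.e. factors x features.
    """
    N, K = V.shape[0], H.shape[0]
    W = np.empty((N, K))
    for i in range(N):
        D = np.diag(Uw[i])             # diagonal weight matrix for sample i
        A = H @ D @ H.T                # (K, K) weighted normal matrix
        b = H @ D @ V[i]               # (K,) weighted right-hand side
        W[i] = np.linalg.pinv(A) @ b   # pseudo-inverse also covers singular A
    return W

# Synthetic data (made up); the inputs may contain negative values.
rng = np.random.default_rng(1)
V = rng.normal(1.0, 1.0, size=(8, 5))
U = rng.uniform(0.2, 1.0, size=(8, 5))
Uw = 1.0 / U**2
H = rng.uniform(0.1, 1.0, size=(3, 5))

W = update_w_rows(V, Uw, H)
W_plus, W_minus = split_pos_neg(W)
```

The loop over samples is what makes this step more expensive than the ls-nmf update: each row of $W$ needs its own weighted normal matrix because each sample has its own row of weights.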

#### Optimization
The algorithms are intended to be used directly as a Python package, through Jupyter notebooks, and eventually through a web application. With this in mind, the performance of the code must be considered during implementation. Here are a few approaches that have been taken to optimize the code/algorithm (with performance metrics shown later):
 1. Using Python 3.11, which has performance increases of 10-60% over 3.10
 2. Use of parallelization for batch modeling, fitting multiple models at a time.
 3. The algorithms have also been written in Rust, a low-level language, which provides increased memory efficiency and decreased runtime.



### LS-NMF Example
Here is an example of how to use the code to generate either a single model or multiple models.

In [1]:
# Notebook imports
import os
import sys
import copy
import logging
import time
import json
import pandas as pd
import numpy as np
import plotly

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

##### Sample Datasets
PMF5 comes with three sample datasets which we will use in the code demo.

In [2]:
# Baton Rouge Dataset
br_input_file = os.path.join("D:\\", "projects", "nmf_py", "data", "Dataset-BatonRouge-con.csv")
br_uncertainty_file = os.path.join("D:\\", "projects", "nmf_py", "data", "Dataset-BatonRouge-unc.csv")
br_output_path = os.path.join("D:\\", "projects", "nmf_py", "output", "BatonRouge")
# Baltimore Dataset
b_input_file = os.path.join("D:\\", "projects", "nmf_py", "data", "Dataset-Baltimore_con.txt")
b_uncertainty_file = os.path.join("D:\\", "projects", "nmf_py", "data", "Dataset-Baltimore_unc.txt")
b_output_path = os.path.join("D:\\", "projects", "nmf_py", "output", "Baltimore")
# Saint Louis Dataset
sl_input_file = os.path.join("D:\\", "projects", "nmf_py", "data", "Dataset-StLouis-con.csv")
sl_uncertainty_file = os.path.join("D:\\", "projects", "nmf_py", "data", "Dataset-StLouis-unc.csv")
sl_output_path = os.path.join("D:\\", "projects", "nmf_py", "output", "StLouis")

##### Code Imports
We import the modules for the model and a datahandler.

In [3]:
from src.data.datahandler import DataHandler
from src.model.nmf import NMF
from src.model.model import BatchNMF

##### Parameters

In [4]:
index_col = "Date"                  # the index of the input/uncertainty datasets
factors = 4                         # the number of factors
method = "ws-nmf"                   # "ls-nmf", "ws-nmf"
init_method = "col_means"           # default is column means "col_means", "kmeans", "cmeans"
init_norm = True                    # if init_method is either kmeans or cmeans, whether to normalize the data prior to clustering.
seed = 42                           # seed = 26586, most comparable model to PMF5 currently found
max_iterations = 20000              # the maximum number of iterations for fitting a model
converge_delta = 0.1                # convergence criteria for the change in loss, Q
converge_n = 10                     # convergence criteria for the number of steps where the loss changes by less than converge_delta
dataset = "br"                      # "br": Baton Rouge, "b": Baltimore, "sl": St Louis
verbose = True                      # adds more verbosity to the algorithm workflow on execution.
optimized = True                    # use the Rust code if possible

##### Load the Data

In [5]:
# Loading the Baton Rouge dataset
dh_br = DataHandler(
    input_path=br_input_file,
    uncertainty_path=br_uncertainty_file,
    index_col=index_col,
    sn_threshold=2.0
)
V_br = dh_br.input_data_processed               # Cleaned input dataset (numpy array)
U_br = dh_br.uncertainty_data_processed         # Cleaned uncertainty dataset (numpy array)

08-Jun-23 15:09:09 - Input and output configured successfully


##### Initialize and Train

In [6]:
# Training a single model
nmf_br = NMF(V=V_br, U=U_br, factors=factors, method=method, seed=seed, optimized=optimized, verbose=verbose)
nmf_br.initialize(init_method=init_method, init_norm=init_norm, fuzziness=5.0)
nmf_br.train(max_iter=max_iterations, converge_delta=converge_delta, converge_n=converge_n)

08-Jun-23 15:09:15 - Model: -1, Seed: 42, Q(true): 84264.1101, Steps: 940/20000, Converged: True, Runtime: 6.25 sec


Here, a single model was created using the optimized Rust code with the ws-nmf algorithm. 940 iterations were taken before the convergence criteria were met, with a resulting loss value of $Q=84264.11$.

In [7]:
%%time
# Training multiple models
models = 10                   # number of models to create
parallel = True               # execute training in parallel

batch_br = BatchNMF(V=V_br, U=U_br, factors=factors, models=models, method=method, seed=seed, 
                    init_method=init_method, init_norm=init_norm,
                    max_iter=max_iterations, converge_delta=converge_delta,
                    converge_n=converge_n, parallel=parallel, optimized=optimized,
                    verbose=verbose
                   )
batch_br.train()

08-Jun-23 15:10:12 - Results - Best Model: 7, Converged: True, Q: 83558.41866816676
08-Jun-23 15:10:12 - Runtime: 0.94 min(s)


In [8]:
# Save results
br_full_output_path = f"br_nmf-output-f{factors}.json"
batch_br.save(output_name=br_full_output_path)

08-Jun-23 15:10:13 - Results saved to: .\br_nmf-output-f4.json


In [9]:
# Imports for comparing to PMF5 outputs
from tests.factor_comparison import FactorComp
from src.utils import calculate_Q