## PyIVA: Independent Vector Analysis for Data Fusion Prior to Molecular Property Prediction with Machine Learning

ABSTRACT
Due to its high computational speed and accuracy compared to ab-initio quantum chemistry and
forcefield modeling, the prediction of molecular properties using machine learning has received
great attention in the fields of materials design and drug discovery. A main ingredient required for
machine learning is a training dataset consisting of molecular features (for example fingerprint
bits, chemical descriptors, etc.) that adequately characterize the corresponding molecules. However,
choosing features for any application is highly non-trivial. No “universal” method for feature selection
exists. In this work, we propose a data fusion framework that uses Independent Vector Analysis
to exploit underlying complementary information contained in different molecular featurization
methods, bringing us a step closer to automated feature generation. Our approach takes an arbitrary
number of individual feature vectors and automatically generates a single, compact (low dimensional)
set of molecular features that can be used to enhance the prediction performance of regression models.
At the same time our methodology retains the possibility of interpreting the generated features to
discover relationships between molecular structures and properties. We demonstrate this on the QM7b
dataset for the prediction of several properties such as atomization energy, polarizability, frontier
orbital eigenvalues, ionization potential, electron affinity, and excitation energies. In addition, we
show how our method helps improve the prediction of experimental binding affinities for a set of
human BACE-1 inhibitors.

Link to paper: https://arxiv.org/pdf/1811.00628v1.pdf

Credit: https://github.com/zoisboukouvalas/pyiva


In [1]:
# Clone the repository and cd into directory
!git clone https://github.com/zoisboukouvalas/pyiva.git
%cd pyiva

Cloning into 'pyiva'...
remote: Enumerating objects: 59, done.
remote: Total 59 (delta 0), reused 0 (delta 0), pack-reused 59
Unpacking objects: 100% (59/59), done.
/content/pyiva


In [None]:
# to install, run the command
!python setup.py install

### Documentation on *iva_laplace()*
Required arguments:

* *X* : 3-dimensional numpy array containing data observations from K datasets. The example below indexes it as (K, N, T), so X[k] is the N x T observation matrix of the kth dataset. Here X{k}=A{k}S{k}, where A{k} is an unknown invertible N x N mixing matrix and S{k} is an N x T matrix whose nth row holds T samples of the nth source in the kth dataset. IVA assumes that each source is statistically independent of all other sources within its dataset and dependent on at most one source in each of the other datasets. Storing all datasets in a single array enforces the assumption of an equal number of samples in each dataset.
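As a rough illustration of the equal-sample requirement, the sketch below stacks K separately computed feature matrices into one array. The helper `stack_datasets` and the variable `feature_sets` are hypothetical and not part of pyiva, and the (K, N, T) axis order is an assumption taken from the example usage in this notebook.

```python
import numpy as np

def stack_datasets(feature_sets):
    """Stack K arrays of shape (N, T) into one array of shape (K, N, T).

    Hypothetical helper, not part of pyiva. Raises ValueError when the
    datasets do not all share the same shape.
    """
    arrays = [np.asarray(f, dtype=float) for f in feature_sets]
    shapes = {a.shape for a in arrays}
    if len(shapes) != 1:
        raise ValueError(f"all datasets must share one shape, got {sorted(shapes)}")
    return np.stack(arrays, axis=0)

# Stand-in data: K = 2 datasets, each with N = 3 rows and T = 100 samples.
rng = np.random.default_rng(0)
N, K, T = 3, 2, 100
feature_sets = [rng.standard_normal((N, T)) for _ in range(K)]

X = stack_datasets(feature_sets)
print(X.shape)  # (2, 3, 100)
```

Datasets with different numbers of samples would have to be truncated or resampled to a common T before stacking; which of those is appropriate depends on the application.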

Optional keyword arguments:

*   *A* : default = [] : true mixing matrices A; supplying them automatically enables verbose
*   *whiten* : Boolean, default = True
*   *verbose* : Boolean, default = False : enables print statements
*   *W_init* : default = [] : initial estimates for the demixing matrices in W
*   *maxIter* : int, default = 2*512 : maximum number of iterations
*   *terminationCriterion* : string, default = 'ChangeInCost' : criterion for terminating iterations, either 'ChangeInCost' or 'ChangeInW'
*   *termThreshold* : float, default = 1e-6 : termination threshold
*   *alpha0* : float, default = 0.1 : initial step size scaling

Output:
*  *W* : the estimated demixing matrices so that ideally W{k}A{k} = P*D{k} where P is any arbitrary permutation matrix and D{k} is any diagonal invertible (scaling) matrix.  Note P is common to all datasets; this is to indicate that the local permutation ambiguity between dependent sources across datasets should ideally be resolved by IVA.
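The permutation and scaling ambiguity can be made concrete with a small NumPy sketch. It does not call pyiva: the sources, mixing matrices, permutation `P`, and scalings `D` are all stand-ins, and `W` is constructed by hand as an "ideal" answer rather than estimated. The (K, N, N) layout for `W` and `A` is an assumption that mirrors the example usage in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, T = 3, 4, 500

# Stand-in sources and well-conditioned mixing matrices (not produced by pyiva).
S = rng.laplace(size=(K, N, T))
A = np.eye(N) + 0.1 * rng.standard_normal((K, N, N))
X = A @ S  # batched matmul: X[k] = A[k] @ S[k]

# One permutation P shared by all datasets, one diagonal scaling D[k] per dataset.
P = np.eye(N)[rng.permutation(N)]
D = np.stack([np.diag(rng.uniform(0.5, 2.0, size=N)) for _ in range(K)])

# Hand-built "ideal" demixing matrices: W[k] = P D[k] inv(A[k]).
W = P @ D @ np.linalg.inv(A)

G = W @ A      # global matrices, equal to P D[k] up to rounding
S_hat = W @ X  # estimated sources: permuted, rescaled copies of S

print(np.allclose(G, P @ D))
print(np.allclose(S_hat, P @ D @ S))
```

Because `P` is shared, row n of `S_hat[k]` corresponds to the same underlying source for every k, which is what makes the recovered components comparable across datasets.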

During runtime the following are reported:

* *cost* - the cost for each iteration
* *isi* - joint inter-symbol-interference, a performance metric that is available only if the user supplies the true mixing matrices
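For orientation, here is a minimal sketch of one common definition of the normalized ISI, applied to the global matrices G{k} = W{k}A{k}. The function names `isi` and `avg_and_joint_isi` are hypothetical, and pyiva's own implementation may differ in detail. The joint variant is assumed to apply the same formula to the sum of |G{k}| over datasets, so it also penalizes permutations that are not aligned across datasets.

```python
import numpy as np

def isi(G):
    """Normalized ISI of a square matrix: 0 for a scaled permutation, 1 for all-equal entries."""
    G = np.abs(np.asarray(G, dtype=float))
    N = G.shape[0]
    rows = (G.sum(axis=1) / G.max(axis=1) - 1).sum()
    cols = (G.sum(axis=0) / G.max(axis=0) - 1).sum()
    return (rows + cols) / (2 * N * (N - 1))

def avg_and_joint_isi(W, A):
    """W and A are assumed to have shape (K, N, N)."""
    G = np.abs(W @ A)               # batched global matrices
    avg = float(np.mean([isi(Gk) for Gk in G]))
    joint = float(isi(G.sum(axis=0)))
    return avg, joint

# Two datasets, each separated perfectly but with different permutations.
A = np.stack([np.eye(3), np.eye(3)])
W = np.stack([np.eye(3), np.eye(3)[[1, 0, 2]]])
avg, joint = avg_and_joint_isi(W, A)
print(avg, joint)
```

In this toy case the average ISI is zero because each dataset is separated on its own, while the joint ISI is positive because the first two sources are swapped in the second dataset.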

### Example usage

In [4]:
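import numpy as np
# randmv_laplace and vecnorm are helpers made available by the same module
from pyiva.iva_laplace import iva_laplace, randmv_laplace, vecnorm

N = 8      # number of sources per dataset
K = 4      # number of datasets
T = 10000  # number of samples

# Sources: for each n, draw T samples of a K-dimensional multivariate
# Laplacian, so that source n is dependent across the K datasets
S = np.zeros((K, N, T))

for n in range(N):
    Z = randmv_laplace(K, T)
    S[:, n, :] = Z

# Random mixing matrices and the mixtures X[k] = A[k] S[k]
A = np.random.rand(K, N, N)
X = np.zeros_like(S)  # separate array so that S is not overwritten

for k in range(K):
    A[k, :, :] = np.transpose(vecnorm(A[k, :, :])[0])
    X[k, :, :] = np.matmul(A[k, :, :], S[k, :, :])

# Supplying the true mixing matrices makes iva_laplace report ISI at each step
W = iva_laplace(X, A=A)

print("Done")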


 Step  0 : W changes:  1 , Cost:  0.6525359986920707 , Avg ISI:  [[0.44079416]] , Joint ISI:  [[0.64044803]]

 Step  1 : W changes:  0.6111872495213494 , Cost:  0.4050032042432844 , Avg ISI:  [[0.39992359]] , Joint ISI:  [[0.60408768]]

 Step  2 : W changes:  0.1923722713609173 , Cost:  0.3396617096613903 , Avg ISI:  [[0.39048062]] , Joint ISI:  [[0.59601549]]

 Step  3 : W changes:  0.17898669147144317 , Cost:  0.28809630517327806 , Avg ISI:  [[0.38383627]] , Joint ISI:  [[0.58911693]]

 Step  4 : W changes:  0.18415157804675827 , Cost:  0.24329343516012447 , Avg ISI:  [[0.37892956]] , Joint ISI:  [[0.58503604]]

 Step  5 : W changes:  0.19983051166680335 , Cost:  0.20277316903880155 , Avg ISI:  [[0.3750452]] , Joint ISI:  [[0.58131826]]

 Step  6 : W changes:  0.22627608341642574 , Cost:  0.16535686521249937 , Avg ISI:  [[0.37191196]] , Joint ISI:  [[0.57895022]]

 Step  7 : W changes:  0.26803861159957965 , Cost:  0.13040365151334654 , Avg ISI:  [[0.36918124]] , Joint ISI:  [[0.578