S-Driscoll/SparseProjectionPursuit

Sparse Projection Pursuit Analysis (SPPA)

Kurtosis-based projection pursuit analysis (PPA) was developed as an alternative exploratory data analysis algorithm. Instead of relying on variance- and distance-based metrics to obtain (hopefully) informative projections of high-dimensional data, as PCA, HCA, and kNN do, ordinary PPA searches for interesting projections by optimizing the kurtosis. However, if the sample-to-variable ratio is too low, ordinary PPA can "overmodel" the data by finding spurious combinations of the original variables that give a low kurtosis value. To overcome this, one can compress the data with PCA prior to applying PPA (~10:1 sample-to-variable ratio). To make PPA independent of PCA, we have developed a sparse implementation of PPA (SPPA), in which subsets of the original variables are selected using a genetic algorithm. This repository contains MATLAB code for applying SPPA to high-dimensional data, examples of SPPA in use, and the corresponding published paper on SPPA. Below is a figure from our recent paper that shows the basic approach of the algorithm.

[Figure: Sparse Projection Pursuit]
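
To illustrate why kurtosis can serve as a projection index, here is a minimal Python sketch. It is not the repository's MATLAB code and not the optimization in SPPA.m. The function names `kurtosis` and `project` and the synthetic two-variable data are assumptions made for this illustration only. It compares the kurtosis of a projection that separates two groups with the kurtosis of a projection onto pure Gaussian noise.

```python
import random


def kurtosis(values):
    """Return the (non-excess) kurtosis m4 / m2**2 of a sequence of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 ** 2)


def project(data, direction):
    """Project each row of `data` onto `direction` (a plain dot product)."""
    return [sum(x * w for x, w in zip(row, direction)) for row in data]


# Synthetic data (hypothetical): two groups separated along variable 0,
# while variable 1 is Gaussian noise with no group structure.
rng = random.Random(0)
data = [
    [rng.gauss(-3.0 if i % 2 == 0 else 3.0, 0.5), rng.gauss(0.0, 1.0)]
    for i in range(400)
]

k_clustered = kurtosis(project(data, [1.0, 0.0]))  # bimodal projection
k_noise = kurtosis(project(data, [0.0, 1.0]))      # unimodal projection

print(f"kurtosis along the clustered direction: {k_clustered:.2f}")
print(f"kurtosis along the noise direction:     {k_noise:.2f}")
```

The bimodal projection gives a kurtosis close to 1, well below the value of about 3 expected for Gaussian noise. This is why searching for low-kurtosis projections tends to reveal separated groups. With few samples and many variables, some combination of variables can reach a low kurtosis by chance, which is the overmodelling problem described above.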

MATLAB function

SPPA.m is a MATLAB function to perform sparse kurtosis-based projection pursuit using a genetic algorithm.

Citing this algorithm

Please cite Sparse Projection Pursuit Analysis: An Alternative for Exploring Multivariate Chemical Data (2020).

Structure of this repository

The master branch of this repository contains the original SPPA code (version 1.0) implemented for the work published in Sparse Projection Pursuit Analysis: An Alternative for Exploring Multivariate Chemical Data (2020). If available, enhancements to the original code can be found in additional branches named with a corresponding version number.

Current branches

  • Master - Original SPPA code (version 1.0)
  • Version 1.1 - Improved selection of the initial population to ensure maximum coverage of variables. If the population size is sufficient, each variable is selected a minimum of n times. Residual individuals are selected at random without repetition. This is more equitable than the original version, which might exclude some variables and over-represent others.
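
The coverage idea behind version 1.1 can be sketched as follows. This is a Python illustration under stated assumptions, not the code in the version 1.1 branch. The function `init_population`, its parameters, and the "least-used variables first, ties broken at random" rule are hypothetical choices made here to obtain even coverage; the actual branch may build its initial population differently.

```python
import random
from collections import Counter


def init_population(n_vars, subset_size, pop_size, seed=None):
    """Build `pop_size` variable subsets with near-uniform variable coverage.

    Each individual is a sorted list of `subset_size` distinct variable
    indices. Variables used least often so far are picked first, with ties
    broken at random (an assumption of this sketch).
    """
    if subset_size > n_vars:
        raise ValueError("subset_size cannot exceed n_vars")
    rng = random.Random(seed)
    counts = [0] * n_vars  # how many individuals contain each variable
    population = []
    for _ in range(pop_size):
        # Rank variables by usage count; the random key breaks ties.
        order = sorted(range(n_vars), key=lambda v: (counts[v], rng.random()))
        chosen = sorted(order[:subset_size])
        for v in chosen:
            counts[v] += 1
        population.append(chosen)
    return population


# 12 individuals of 5 variables each, drawn from 20 variables: 60 slots.
population = init_population(n_vars=20, subset_size=5, pop_size=12, seed=1)
usage = Counter(v for individual in population for v in individual)
print(sorted(usage.items()))
```

Because the usage counts never differ by more than one under this rule, the 60 slots in the example are spread so that each of the 20 variables appears in exactly three individuals. A purely random initialization gives no such guarantee and can leave some variables out of the starting population entirely.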

Literature related to PPA

Literature related to SPPA

Examples

To be completed. In the meantime, see demo.m for a quick demonstration of using SPPA to explore a salmon plasma nuclear magnetic resonance (NMR) spectroscopy data set.