Scalable Manifold Learning Final Report

Jake Vanderplas edited this page Aug 27, 2014 · 3 revisions

Project Overview

Manifold learning (ML), also known as non-linear dimension reduction, finds a non-linear representation of high-dimensional data using a small number of parameters. ML is data intensive: it has been shown statistically that the estimation accuracy depends asymptotically on the sample size N like N^{1/(α d + β)}, so large amounts of data are required when the intrinsic dimension d is larger than a few. On the other hand, manifold learning fully realizes its potential for scientific discovery on very large multi-dimensional data sets representing partially known physical systems (e.g. spectra of galaxies), where there is reason to believe that the data can be modeled by a small set of parameters.

Therefore, we implemented a software suite that enables scientists and methodologists alike to scale a broad class of manifold learning methods to very large data sets. In particular, the software can be used to analyze spectroscopic data from the Sloan Digital Sky Survey (SDSS), as well as data from other astronomical surveys. The software is written in Python and builds upon the existing scikit-learn library for machine learning. Our project demonstrates, contrary to commonly held belief, that with careful implementation ML can be made tractable on large data sets.
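
For readers unfamiliar with the kind of method being scaled, the following is a minimal sketch of a manifold learning run using only the stock scikit-learn API. It is not this project's software or interface. The synthetic swiss roll data set, the helper name `embed_swiss_roll`, and the parameter values are all assumptions chosen for illustration. A nearest-neighbor affinity is used so that the underlying graph stays sparse, which is one common ingredient of a scalable implementation.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import SpectralEmbedding


def embed_swiss_roll(n_samples=500, n_neighbors=10, seed=0):
    """Hypothetical helper: embed a 3-D swiss roll into 2 dimensions."""
    # Synthetic stand-in for real data: points on a 2-D sheet rolled up in 3-D.
    X, _ = make_swiss_roll(n_samples=n_samples, random_state=seed)

    # Spectral embedding on a sparse k-nearest-neighbor graph.
    embedder = SpectralEmbedding(
        n_components=2,
        affinity="nearest_neighbors",
        n_neighbors=n_neighbors,
        random_state=seed,
    )
    Y = embedder.fit_transform(X)
    return X, Y


if __name__ == "__main__":
    X, Y = embed_swiss_roll()
    print("input shape:", X.shape)
    print("embedding shape:", Y.shape)
```

Here the intrinsic dimension is d = 2 while the data live in three dimensions. Real spectroscopic data would have far more input dimensions and samples, which is where the cost of building the neighbor graph and solving the eigenproblem becomes the bottleneck.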

Project Links
