GSoC 2019 projects

michaelosthege edited this page Mar 28, 2019 · 22 revisions

Getting started

New contributors should first read the developer guide and learn the basics of Theano. They should also read through some of the examples in the PyMC3 docs.

To be considered as a GSoC student, you should make a PR to PyMC3 or PyMC4. It can be something small, like a doc fix or a simple bug fix. Some beginner-friendly issues can be found here.

Projects

Help in creating the upcoming PyMC4 based on Tensorflow Probability

PyMC3 is based on Theano, and uses it for creating and computing the graph that comprises the probabilistic model. Given the discontinuation of support for Theano, we are moving to TensorFlow Probability (TFP), a library for probabilistic reasoning and statistical analysis in TensorFlow.

TFP's focus is to provide a strong foundation upon which flexible statistical models for inference and prediction can be constructed from the ground up. However, its focus is not to provide a high-level API that makes constructing and fitting common classes of models easy for applied users. Thus, the main goal for PyMC4 is to become a high-level interface for TFP that focuses on usability, high-level model specification, and inference. Leveraging TFP will have many benefits for our developers and users alike. TensorFlow is widely supported and is a de facto industry standard, making it a viable long-term solution for supporting computation for PyMC. With TFP, we can stand on the shoulders of giants and benefit from the work being done to build a highly performant and scalable statistical package with out-of-the-box support for GPUs and TPUs, as well as state-of-the-art inference algorithms including Markov chain Monte Carlo and variational inference.

Potential mentors

  • Chris Fonnesbeck
  • Thomas Wiecki
  • Junpeng Lao
  • Colin Carroll
  • Maxim Kochurov

Bayesian Additive Regression Trees

Bayesian Additive Regression Trees (BART) is a Bayesian nonparametric approach to estimating functions using regression trees. A BART model consists of a sum of regression trees with (homoskedastic) normal additive noise. Regression trees are defined by recursively partitioning the input space and defining a local model in each resulting region in order to approximate some unknown function. BART is a useful and flexible model for capturing interactions and non-linearities, and has proved a useful tool for variable selection.
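
The sum-of-trees structure can be illustrated with a minimal pure-Python sketch. The two hand-written "trees" below, their split points, and the noise level are all hypothetical stand-ins for what BART would learn from data; they only demonstrate how piecewise-constant trees combine additively with normal noise.

```python
import random

# Two toy "regression trees" on a 1-D input: each recursively partitions
# the input space and returns a constant prediction per region.
def tree_1(x):
    return 0.5 if x < 0.3 else 2.0

def tree_2(x):
    if x < 0.6:
        return -1.0 if x < 0.1 else 0.0
    return 1.5

def sum_of_trees_sample(x, sigma=0.1, rng=random):
    """A BART-style draw: sum of tree outputs plus homoskedastic normal noise."""
    return tree_1(x) + tree_2(x) + rng.gauss(0.0, sigma)

random.seed(42)
y = sum_of_trees_sample(0.5)  # close to tree_1(0.5) + tree_2(0.5) = 2.0
```

In a real BART model, the tree structures and leaf values are themselves given priors and sampled, rather than fixed as above.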

Potential mentors:

  • Osvaldo Martin
  • Austin Rochford

ODE solvers

Some applications, such as pharmacokinetic models and other compartment models, involve specifying and solving a set of ordinary differential equations. Such models would be greatly facilitated by the addition of an ODE solver to PyMC3.

The challenge lies in the implementation of one or more custom Theano Ops that perform not only a fast forward-simulation of a user-specified ODE system, but also compute gradients.

A variety of approaches can be taken to compute gradients (see this thread) and this project seeks to implement and compare them with respect to performance and robustness.

We care deeply about PyMC3's API and want to make ODE methods easy to use in real-world scenarios. For example, they should support situations where not all variables are observed at all time points. With the famous Lotka-Volterra model, this happens when one has data for the number of predators but not for the number of prey in a particular year.
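
To make the forward-simulation part concrete, here is a minimal pure-Python sketch of the Lotka-Volterra system using a forward-Euler step. The parameter values and step size are illustrative only; a production Theano Op would use a proper ODE solver and, crucially, also provide gradients of the solution with respect to the parameters.

```python
def lotka_volterra(t_grid, z0, alpha, beta, gamma, delta):
    """Forward-Euler simulation of the Lotka-Volterra system:
        dx/dt = alpha*x - beta*x*y     (prey)
        dy/dt = -gamma*y + delta*x*y   (predators)
    """
    x, y = z0
    traj = [(x, y)]
    for i in range(1, len(t_grid)):
        dt = t_grid[i] - t_grid[i - 1]
        dx = alpha * x - beta * x * y
        dy = -gamma * y + delta * x * y
        x, y = x + dt * dx, y + dt * dy
        traj.append((x, y))
    return traj

t = [i * 0.01 for i in range(1001)]  # 10 time units at dt = 0.01
traj = lotka_volterra(t, z0=(10.0, 5.0),
                      alpha=1.0, beta=0.1, gamma=1.5, delta=0.075)
```

Fitting such a model to partially observed data (say, predator counts only) would then mean evaluating the likelihood against only the observed components of `traj` at the observed time points.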

Potential mentors:

  • Chris Fonnesbeck
  • Maxim Kochurov
  • Michael Osthege

Replace backends with xarray

We currently maintain several different storage backends. ArviZ uses xarray, so we should convert our internal backend to xarray as well.

Potential mentors:

  • Colin Carroll
  • Ravin Kumar

Add better support for time-series models

Time-series models are an important model class. We already have several time-series distributions, such as AR1, GARCH11, and Gaussian random walks. This project seeks to add other models like ARIMA and Prophet.
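
As a reference point, the generative process underlying an AR(1) distribution can be sketched in a few lines of pure Python. The parameter names below (`k` for the autoregressive coefficient, `sigma` for the innovation scale) are illustrative, not the PyMC3 API.

```python
import random

def simulate_ar1(k, sigma, n, x0=0.0, seed=0):
    """Simulate an AR(1) process: x_t = k * x_{t-1} + eps_t,
    with innovations eps_t ~ Normal(0, sigma)."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(n - 1):
        xs.append(k * xs[-1] + rng.gauss(0.0, sigma))
    return xs

series = simulate_ar1(k=0.8, sigma=1.0, n=500)
```

Models like ARIMA generalize this by adding differencing and moving-average terms, which is part of what makes their implementation as PyMC distributions a substantial project.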

Potential mentors

  • Ravin Kumar
  • Thomas Wiecki
  • Brandon Willard

Dirichlet processes

Dirichlet processes are an infinite-dimensional generalization of the Dirichlet distribution that can be used to place priors on unknown distributions. This is useful, for example, for building infinite-component mixture models. It is currently possible to implement (truncated) Dirichlet processes in PyMC3, but the process is quite manual and involved. This project would seek to add a (truncated) stick-breaking process distribution to PyMC3, along with several flavours of Dirichlet processes based on it.

Potential mentors:

  • Austin Rochford
  • Maxim Kochurov

Random variable reimplementation and symbolic computation

By converting existing PyMC3 random variables to Theano Ops, it becomes possible to perform more elaborate symbolic mathematics on models using Theano's optimization framework. These manipulations include

  • sophisticated and automatic determination of applicable samplers and optimization methods (e.g. better sampler choices and MAP methods)
  • parameter expansions that improve the condition of sampling and optimization problems
  • generalized Gibbs(-within-*) and slice-sampling reformulations of models through scale mixture decompositions
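
The scale-mixture idea in the last bullet can be made concrete with one classic decomposition: a Student-t variable is a Gamma mixture of normals, which is exactly the kind of reformulation that enables Gibbs-style conditional updates. The sketch below is a pure-Python illustration of the generative identity, not PyMC3 or symbolic-pymc code.

```python
import math
import random

def student_t_via_scale_mixture(nu, n, seed=2):
    """Draw Student-t(nu) samples via a normal scale mixture:
        tau ~ Gamma(shape=nu/2, rate=nu/2)
        x | tau ~ Normal(0, 1/sqrt(tau))
    Marginally, x follows a Student-t distribution with nu degrees of freedom."""
    rng = random.Random(seed)
    xs = []
    for _ in range(n):
        # random.gammavariate takes (shape, scale); scale = 1 / rate
        tau = rng.gammavariate(nu / 2.0, 2.0 / nu)
        xs.append(rng.gauss(0.0, 1.0 / math.sqrt(tau)))
    return xs

samples = student_t_via_scale_mixture(nu=5.0, n=1000)
```

Rewriting a t-distributed variable in a model graph into this conditional-normal form is the sort of automatic transformation that random-variable Ops and Theano's optimization framework would make possible.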

Furthermore, a number of symbolic tensor shape issues can be resolved in the context of these changes, along with a major extension and simplification of random variable and model sampling.

Symbolic computation in PyMC3—using said random variable Ops—is currently underway in symbolic-pymc. This project would involve integrating elements from symbolic-pymc into PyMC3.

Likewise, alpha-stage work is needed to help support symbolic computation in PyMC4/TensorFlow.

Potential mentors:

  • Brandon Willard