Google Summer of Code 2017

Josef Perktold edited this page Feb 7, 2017 · 5 revisions

Note: This is currently a draft and partially a copy of the 2016 page

Statsmodels has participated in GSoC for eight years under the umbrella of the Python Software Foundation. The focus in previous years has been on adding new models. There are still several areas where statsmodels is missing commonly used models. We also have several models that have been worked on but still need work to finish, add unit tests and integrate into statsmodels. Finally, there are several areas where existing models can be extended. One important consideration in the selection of the project is the background of the student: it is an advantage if the student is familiar with the topic and may also be using it in her or his research.

Introduction

Statsmodels is a library for statistics and econometrics written in Python, with some extensions written in Cython. It now contains many of the most commonly used models for estimation, hypothesis tests and statistical graphs. See our documentation for more information. The developer pages describe in more detail how to contribute to statsmodels and our workflow for pull requests. Our issues are also on GitHub; they include bug reports, wishlist items, and enhancement plans and ideas.

Guidelines & requirements

We are again planning to participate in GSoC 2017 under the umbrella of the Python Software Foundation.

The PSF getting started page http://python-gsoc.org/#gettingstarted and the student guidelines http://wiki.python.org/moin/SummerOfCode/Expectations provide detailed information about the program, its requirements and its expectations.

The most important requirement that we expect from students is a sufficient background in statistics or econometrics. Students should be comfortable with Python (intermediate level). Knowing how to use Git is also important; this can be learned before the official start of GSoC if needed.

Advice

Potential candidates should take a look at the guidelines on how to contribute to statsmodels. Making a small enhancement, bug fix, documentation fix, etc. (it does not need to be related to your proposal) to statsmodels before applying for GSoC is a requirement from the PSF; it can also give you an idea of how things would work during GSoC.

Start on your proposal early, post a draft to the mailing list and iterate based on the feedback you receive. This will not only improve the quality of your proposal, but also help you find a suitable mentor.

Ideas

We encourage students to propose their own projects, but we also have several areas that are high on our priority list. Our priority list is flexible, and it is important that the topic matches the interest and background of the student.

Note that the difficulty level depends on the student's statistics/econometrics background and familiarity with the current statsmodels code.

common to all projects:

  • domain-specific knowledge: high level of statistics or econometrics knowledge for the specific topic
  • programming language: Python, intermediate level

possible ideas for 2017

(Initial suggestions, will be revised. These are in addition to the old topics below, or have higher priority this year.)

  • Survey methods and adding weighting to GLM, Cox, etc.
  • BigGLM and related high dimensional/distributed computing approaches to big regression models
  • Basic Structural Equations Modeling (SEM)
  • various standard econometrics models that are missing or in draft version, see below
  • sparse matrix support in models

possible topics in statespace and time series models

We would like to have one project that continues the development of statespace and related models. This is still a large area, and student and mentor will need to narrow it down to a project proposal that is of interest to both the student and the mentor(s).

A couple of possible state space topics (Chad Fulton):

  • We now have the Cython smoothers, so the EM algorithm is possible in state space models. We could try to add that to various models.
  • Specific models that are missing: ARFIMA, multivariate unobserved components models, more complex cycles, now-casting type models and (non-state-space) MIDAS models
  • I want to get more postestimation results for state space models (e.g. IRF confidence intervals)
  • If someone really liked unobserved components models, a project could be to make a really comprehensive implementation in some way (e.g. Harvey's 1989 book has many more extensions, hypothesis / specification tests, etc. than we have) I think that a really well-developed model could be pretty nice.
  • If someone really liked VARIMA models, there's a bunch to be done there, as far as identification and estimation.
  • Nonlinear / non-Gaussian state space models.
  • We still don't have the framework for linear restrictions (this is pretty easy, and it's not in there because I've never used it myself or seen it used practically)
  • (This one is almost certainly too advanced and would require more time as a mentor than I have) exact initial Kalman filtering / smoothing

Non-state-space:

  • We still don't have forecasting or IRFs, etc. for any of the Markov switching models. Also the EM algorithms are not entirely "correct" (so right now they're private and only used for initial values for the usual fit methods).
  • Markov switching VARs are a good topic, and we have the Cython versions of the Hamilton Filter and Smoother, so I think we can do e.g. Krolzig's stuff (I think he develops the EM algorithm for the class he considers).
  • exponential smoothing type models and automatic forecasting.

Add Maximum Likelihood Models for other distributions

This is a relatively easy project in the sense that it can largely follow the existing patterns of current models. There is a large variety of distributions that can be added as Maximum Likelihood Models. One example is additional count models: zero-inflated models, hurdle models, generalized distributions like generalized Poisson or NegativeBinomial, Poisson-inverse Gaussian and so on. Another example would be parametric survival or failure models, especially accelerated failure time models and similar. Another area that is not yet covered is models for compositional data (shares or proportions that add up to one, or to a constant).
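As an illustration of the kind of log-likelihood such a project would implement, here is a minimal sketch of a zero-inflated Poisson model without covariates. It deliberately uses only the Python standard library and does not follow any existing statsmodels class; the function names `zip_loglike` and `poisson_loglike` and their parameters are hypothetical. A real contribution would add regressors, gradients and the usual results classes.

```python
import math


def zip_loglike(counts, mu, pi):
    """Log-likelihood of i.i.d. counts under a zero-inflated Poisson model.

    mu : Poisson mean, must be > 0
    pi : probability of a structural (excess) zero, 0 <= pi < 1
    """
    total = 0.0
    for k in counts:
        # Poisson log-pmf: -mu + k*log(mu) - log(k!)
        log_pois = -mu + k * math.log(mu) - math.lgamma(k + 1)
        if k == 0:
            # A zero is either structural or a Poisson zero.
            total += math.log(pi + (1.0 - pi) * math.exp(log_pois))
        else:
            # Positive counts can only come from the Poisson part.
            total += math.log(1.0 - pi) + log_pois
    return total


def poisson_loglike(counts, mu):
    """Plain Poisson log-likelihood, the special case pi = 0."""
    return zip_loglike(counts, mu, 0.0)
```

For a sample with many zeros, such as `[0] * 8 + [3, 4]`, the zero-inflated version with a suitable `pi` reaches a higher log-likelihood than the plain Poisson evaluated at the sample mean, which is the motivation for these models.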

difficulty: easy to intermediate

mentor: Josef Perktold

Panel Data

This is still one large category of basic models that is currently missing in statsmodels. There is a pull request for the standard econometrics model (PR #1133), but no work on extensions or on dynamic panel data models.

difficulty: intermediate

mentor: Kevin Sheppard?, Josef Perktold

Mixed Effects Models

statsmodels now has the basic linear mixed effects model. A previous GSoC project built large parts of MixedGLM, but more integration or approximation methods and efficient special case estimators are still needed.

difficulty: hard

mentor: Kerby Shedden, Josef Perktold

Extensions to State Space Models

Statsmodels now includes a general purpose Kalman filter and state space model.

See the possible topics in statespace and time series models above.

difficulty: intermediate to hard

mentor: Chad Fulton

Survival Models

statsmodels includes the Cox proportional hazards model and the Kaplan-Meier estimator. One possible extension would be to extend the Cox proportional hazards model to time varying explanatory variables, and to add a Poisson or generalized linear model representation that can be used for semi-parametric estimation, e.g. using splines for the baseline hazard.

difficulty: intermediate

mentor: Kerby Shedden, Josef Perktold

Propensity score matching, and treatment effects estimation

This is another area that is currently missing in statsmodels. There are some projects outside of statsmodels that partially implement it in Python. One possibility is to implement the equivalent of Stata's psmatch or the new teffects, or similar packages in R, or GSoC sized parts of it. PR #2288 has an implementation of the basic parts; related discussion is in issue #858.

difficulty: intermediate

mentor: Josef Perktold, Kerby Shedden

Other possible projects

bring your own

classical multivariate analysis: There are algorithms for some of this in other Python packages, but they either don't provide the full statistical model or don't have the associated statistical results for it. This area is currently work in progress and we expect to merge more pull requests before GSoC. However, this is still an area that needs expansion.

other missing standard econometric models:

Several standard econometrics models are not yet available in statsmodels, such as endogenous regressors, instrumental variables for nonlinear or nonnormal models, selection models or endogenous switching models, and more.

  • ...
  • ...

Cleanup, Refactor and integrate unfinished projects

difficulty: hard for GSOC, requires familiarity with large parts of current statsmodels code.

The general objective is to increase unit test coverage and to bring pull requests and higher priority code in the sandbox into a condition so they can be merged. Additional improvements and enhancements can also be added to the current core code. There are many improvements that will not require a large amount of time; below is a non-exhaustive list of ideas that are mostly larger in terms of the required time. The issues on GitHub will provide a starting point for most cases.

Close gaps in unit test coverage and fix bugs if necessary: Almost all core code has good functional coverage (verifying correctness) but less common code paths and unusual user inputs are insufficiently tested. Some code on the "fringes" has insufficient test coverage. Some functions need updating for the full integration with and support of pandas data structures.

system of equations, simultaneous equations: a previous GSOC project that needs to be updated to the current statsmodels code base, plus missing test coverage, and possibly additional results.

repeated measures anova: rewrite code in pull request to integrate with pandas and conform to statsmodels code structure.

Migrate pandas.stats to statsmodels: see https://github.com/pydata/pandas/issues/6077

Power and effect size: Currently the power and sample size calculations provide mainly a low level interface. We need additional effect size calculations and additional functions that make power and sample size calculations easier to use.
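To make the idea of effect size helpers concrete, the following is a minimal sketch of two convenience functions: Cohen's d for two independent samples, and the per-group sample size for a two-sided two-sample test based on the normal approximation. It uses only the standard library (Python 3.8+ for `statistics.NormalDist`); the names `cohens_d` and `n_per_group` are hypothetical and are not part of the statsmodels API.

```python
import math
from statistics import NormalDist, mean, variance


def cohens_d(x, y):
    """Cohen's d: standardized mean difference using the pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Observations per group for a two-sided two-sample test.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)**2,
    rounded up. An exact t-based calculation gives slightly larger values.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)
```

A user could then compute an effect size from pilot data with `cohens_d(treated, control)` and pass the result directly to `n_per_group`, which is the kind of chaining that a higher level interface would make convenient.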

Bootstrap, resampling methods: We have bootstrap methods incorporated in several models, and there are additional examples and scripts inside and outside of statsmodels. statsmodels is still missing a consistent framework, helper functions and integration of these with existing models.
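As a rough idea of what a generic helper function could look like, here is a minimal sketch of a percentile bootstrap confidence interval for an arbitrary statistic. It is standard library only; the function name `bootstrap_ci` and its parameters are hypothetical, and a framework inside statsmodels would also need to handle resampling of model residuals, clustered data and so on.

```python
import random
from statistics import mean


def bootstrap_ci(data, statistic, n_boot=2000, level=0.95, seed=None):
    """Percentile bootstrap confidence interval for statistic(data).

    Resamples `data` with replacement `n_boot` times and returns the
    empirical (lower, upper) quantiles of the replicated statistic.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    n = len(data)
    replicates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = replicates[int(tail * (n_boot - 1))]
    upper = replicates[int((1.0 - tail) * (n_boot - 1))]
    return lower, upper
```

For example, `bootstrap_ci(sample, mean, seed=0)` gives an interval for the mean; passing a different callable, such as `statistics.median`, reuses the same resampling loop.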