Google Summer of Code 2019

Josef Perktold edited this page Feb 5, 2019 · 2 revisions

Note: This is currently a draft page. Topics have been updated to account for priorities in 2018 but might still change.

Statsmodels has participated in GSoC for 10 years under the umbrella of the Python Software Foundation. The focus in previous years has been on adding new models. There are still several areas where statsmodels is missing commonly used models. We also have several models that have been worked on but still need work to finish, add unit tests, and integrate into statsmodels. Finally, there are several areas where existing models can be extended. One important consideration in the selection of a project is the background of the student: it is an advantage if the student is familiar with the topic and may also be using it in her or his research.


Statsmodels is a library for statistics and econometrics written in Python, with some extensions written in Cython. It now contains many of the most commonly used models for estimation, hypothesis tests, and statistical graphs. See our documentation for more information. The developer pages describe in more detail how to make contributions to statsmodels and our workflow for pull requests. Our issues are also on GitHub; they include bug reports, wishlist items, and enhancement plans and ideas.

Guidelines & requirements

We are again planning to participate in GSoC 2019 under the umbrella of the Python Software Foundation.

The PSF getting started page and the student guidelines provide detailed information about the program, its requirements, and expectations.

The most important requirement that we expect from students is a sufficient background in statistics or econometrics. Students should be comfortable with Python (intermediate level). Knowing how to use Git is also important; this can be learned before the official start of GSoC if needed.


Potential candidates should take a look at the guidelines on how to contribute to statsmodels. Making a small enhancement, bug fix, documentation fix, or similar contribution to statsmodels (it does not need to be related to your proposal) before applying for GSoC is a requirement from the PSF; it can also help you get an idea of how things would work during GSoC.

Start on your proposal early, post a draft to the mailing list and iterate based on the feedback you receive. This will not only improve the quality of your proposal, but also help you find a suitable mentor.


In contrast to previous years, in 2019 we are mainly interested in advanced statistics and econometrics projects. We have several large topics where a GSoC project can implement some part or model. Some subtopics can be adjusted to the interest of the student. Student and mentor will need to narrow the topic down to a project proposal that is of interest to both the student and the mentor(s).

Note that the difficulty level depends on the student's statistics/econometrics background and on familiarity with the current statsmodels code.

Common requirements to all projects:

  • domain-specific knowledge: high level of statistics or econometrics knowledge for the specific topic
  • programming language: Python, intermediate level, familiarity with numpy

Possible ideas for 2019

(initial suggestions, will be revised. More details are listed below)

One high priority area is to continue our good coverage of time series models, specifically extensions to state space models and work on VAR/VECM/SVAR:

  • Markov switching VARs (e.g. Krolzig)
  • Nowcasting (e.g. Banbura and Modugno)
  • Improvements to the unobserved components models (basically some selection of the extensions that Harvey has in his book)
  • Improvement and partial refactoring of the current vector autoregressive models VAR/VECM/SVAR

High priority general topics for 2019 outside time series analysis are

  • Post-estimation Inference and Diagnostics for GLM and discrete models
  • Treatment effect estimation, causality, propensity score methods (needs to be narrowed down to a feasible plan)
  • Penalized or regularized generalized method of moments, GMM

Extensions to state space models

difficulty: intermediate to hard

mentor: Chad Fulton, Josef Perktold

  • Markov-Switching VARs:
    Statsmodels has the Cython versions of the Hamilton Filter and Smoother. A project can build on this and implement, for example, Krolzig's approach (I think he develops the EM algorithm for the class he considers).
  • Improvements to the unobserved components models
    A project could make a comprehensive implementation of extensions to these models. Harvey's 1989 book has many more extensions, hypothesis / specification tests, etc.
  • Nowcasting
    Implement the approach by Banbura and Modugno, or a similar one.
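
To make the filtering step behind Markov-switching models concrete, here is a minimal NumPy sketch of the Hamilton filter for a Gaussian model whose mean switches between regimes. It is a plain illustration with known parameters, not the Cython implementation in statsmodels; the function name `hamilton_filter`, the uniform initial distribution, and the simulated data are assumptions of the example.

```python
import numpy as np


def hamilton_filter(y, means, sigma, trans):
    """Filtered regime probabilities for a Gaussian mean-switching model.

    trans[i, j] is the probability of moving from regime i to regime j.
    All parameters are treated as known (no estimation is done here).
    """
    y = np.asarray(y, dtype=float)
    means = np.asarray(means, dtype=float)
    trans = np.asarray(trans, dtype=float)
    n_regimes = len(means)
    # Assumption of this sketch: start from a uniform distribution over regimes.
    prob = np.full(n_regimes, 1.0 / n_regimes)
    filtered = np.empty((len(y), n_regimes))
    loglike = 0.0
    for t, obs in enumerate(y):
        predicted = trans.T @ prob  # Pr(s_t | y_1, ..., y_{t-1})
        dens = np.exp(-0.5 * ((obs - means) / sigma) ** 2) / (
            sigma * np.sqrt(2 * np.pi)
        )
        joint = predicted * dens    # Pr(s_t, y_t | y_1, ..., y_{t-1})
        lik = joint.sum()           # likelihood contribution of y_t
        prob = joint / lik          # Pr(s_t | y_1, ..., y_t)
        filtered[t] = prob
        loglike += np.log(lik)
    return filtered, loglike


# Simulated example: regime 0, then regime 1, then regime 0 again.
rng = np.random.default_rng(0)
states = np.repeat([0, 1, 0], 50)
y = np.where(states == 0, -2.0, 2.0) + rng.normal(scale=0.5, size=states.size)
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
filtered, loglike = hamilton_filter(y, means=[-2.0, 2.0], sigma=0.5, trans=trans)
```

Each row of `filtered` sums to one, and `loglike` is the quantity that an estimation routine (for example an EM algorithm) would maximize over the parameters.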

Post-estimation Inference and diagnostic tests especially for GLM

GLM currently has no analysis of deviance, no analogue of anova_lm, and no similarly convenient method to compare nested models. Diagnostic and specification tests, and influence and outlier measures, are fully available only for OLS and partially for GLM and WLS. The third part of diagnostics consists of plots, such as regression or residual plots, that support visual inspection of the appropriateness of the model specification. Similar functions for GLM or other nonlinear maximum likelihood models are still largely missing. Some methods are described in the documentation of SAS or other packages. The project will be, for the most part, a collection of functions similar to what is available for OLS.

The basic outlier and influence measures for GLM are now included in statsmodels.

difficulty: intermediate, but will require conceptual overview of several types of models and diagnostics

mentor: Josef Perktold, Kerby Shedden

Propensity score matching, and treatment effects estimation

This is a high priority but also a large topic, so it needs to be narrowed down depending on interest. There are three main classes of statistical methods that comprise this area and that are not yet available in statsmodels. A GSoC project needs to focus on one of those subtopics. The documentation of Stata functions (user packages) or of R packages can be used to provide a specification of the requirements and as a basis for the unit tests. (Regression discontinuity design is another method in this area, but it is not yet a priority for statsmodels.)

  • Inverse Probability weighting (IPW) and doubly robust estimators
    In this type of model, treatment effects are estimated by a weighted linear or nonlinear regression, where the weights are based on propensity scores that are estimated in a first-stage model. A prototype for IPW is available in a pull request. It needs to be rewritten and extended to a full model with supporting diagnostic functions.
  • Propensity score matching
    In this type of model, the treatment effects are estimated by matching treated and untreated individuals. This project includes implementing matching methods and supporting code for evaluating the goodness of a match, in addition to the implementation of model and results classes. There already exist some Python packages that cover this area and are license compatible.
  • Difference in Difference and causal impact in panel setting
    In this type of model, the treatment effects are estimated by comparing treated and untreated units over time, where some units change their treatment status over time. Prototype models for parts of this are available in pull requests.
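
The IPW idea can be sketched in a few lines of NumPy: estimate propensity scores with a first-stage logit, then compare inverse-probability-weighted outcome means. This is a minimal illustration on simulated data, not the prototype from the pull request; `fit_logit` and `ipw_ate` are hypothetical helper names, and the normalized (Hajek) form of the weights is one choice among several.

```python
import numpy as np


def fit_logit(X, d, n_iter=25):
    """Logistic regression by Newton-Raphson; returns the coefficient vector."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (d - p)                        # score vector
        hess = (X * (p * (1 - p))[:, None]).T @ X   # negative Hessian
        beta += np.linalg.solve(hess, grad)
    return beta


def ipw_ate(y, d, X):
    """Normalized IPW estimate of the average treatment effect."""
    pscore = 1.0 / (1.0 + np.exp(-X @ fit_logit(X, d)))
    w1 = d / pscore              # weights for treated units
    w0 = (1 - d) / (1 - pscore)  # weights for untreated units
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)


# Simulated data: x drives both treatment assignment and the outcome,
# so the raw difference in means is confounded. The true effect is 2.0.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
d = (rng.uniform(size=n) < 1 / (1 + np.exp(-0.8 * x))).astype(float)
y = 1.0 + 2.0 * d + 1.5 * x + rng.normal(size=n)

naive = y[d == 1].mean() - y[d == 0].mean()
ate = ipw_ate(y, d, X)
print(f"naive difference = {naive:.2f}, IPW estimate = {ate:.2f}")
```

A full model would add standard errors that account for the estimated propensity scores, doubly robust variants, and diagnostics such as covariate balance checks.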

difficulty: intermediate, good background for estimation and inference for treatment effects

mentor: Josef Perktold, Kerby Shedden

Penalized or regularized GMM

Statsmodels now has penalized linear and generalized linear models based on penalized least squares or penalized maximum likelihood estimation, but not yet penalized generalized method of moments estimators. The GSoC project should implement penalized or regularized estimation for a specific model or application that is based on GMM. One longer-term objective is that these applications of GMM provide the pattern for extending penalization and regularization to the general framework of GMM estimation. Generalized method of moments and instrumental variables estimators become noisy or ill-conditioned (ill-posed) if the number of moment conditions or the number of instruments is large relative to the sample size. This can happen either in small samples or in large samples if the number of moment conditions or instruments increases with the sample size. Two possible projects in this area are:

  • In the linear IV model with homoscedastic errors, implementation of GMM with many instruments.
    When the number of instruments is large, the covariance matrix of the instruments is ill-posed and its inverse is unreliable. Some authors have proposed selecting the instruments, such as Donald and Newey (Econometrica, 2001); others have proposed regularization. Three regularization methods can be implemented: Tikhonov (or ridge), principal components, and Landweber-Fridman. The regularization parameter needs to be selected so as to minimize the approximate mean squared error of the estimator, as in Carrasco (Journal of Econometrics, 2012).
  • Estimation of the dynamic panel data model using GMM as in Alvarez and Arellano (Econometrica, 2003).
    When the time dimension gets larger than 10 or the autoregressive coefficient becomes close to 1, the covariance matrix becomes ill-posed and the GMM estimator can be heavily biased. In such cases, regularization will help to reduce the bias. The same regularization methods as before should be implemented.
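
As a rough sketch of the Tikhonov (ridge) variant, the first-stage projection on the instruments can be replaced by a regularized projection before forming the IV estimator. The code below assumes a fixed, hand-picked regularization parameter `alpha` rather than a data-driven choice that minimizes the approximate mean squared error, and `tikhonov_2sls` is a hypothetical name; principal components and Landweber-Fridman regularization are not shown.

```python
import numpy as np


def tikhonov_2sls(y, X, Z, alpha):
    """2SLS with a ridge-regularized first-stage projection.

    The usual projection Z (Z'Z)^{-1} Z' is replaced by
    Z (Z'Z/n + alpha * I)^{-1} Z'/n; alpha = 0 gives standard 2SLS.
    """
    n, k = Z.shape
    A = Z.T @ Z / n + alpha * np.eye(k)
    X_hat = Z @ np.linalg.solve(A, Z.T @ X / n)  # regularized fitted regressors
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)


# Simulated data: one endogenous regressor and many weak instruments.
rng = np.random.default_rng(0)
n, k = 200, 50
Z = rng.normal(size=(n, k))
u = rng.normal(size=n)
x = Z @ np.full(k, 0.1) + 0.8 * u + rng.normal(size=n)  # correlated with u
y = 1.0 * x + u                                         # true coefficient is 1.0
X = x[:, None]

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_reg = tikhonov_2sls(y, X, Z, alpha=0.1)
print(f"OLS = {beta_ols[0]:.3f}, regularized 2SLS = {beta_reg[0]:.3f}")
```

The value of `alpha` controls how strongly directions with small eigenvalues of the instrument covariance matrix are downweighted, which is why its selection rule is a central part of the project.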

difficulty: hard, good background for estimation and inference in GMM

mentor: Josef Perktold
