Google Summer of Code 2017

Levi John Wolf edited this page Apr 18, 2017 · 18 revisions

PySAL is inviting students to join in PySAL's development by applying for Google Summer of Code 2017. This is the second year PySAL will be seeking to participate, and we hope to again work under the umbrella of the Python Software Foundation (PSF).

Introduction

PySAL is an open source library of spatial analysis functions written in Python and intended to support the development of high-level applications. See our documentation for more details. The developer guide describes in more detail how to contribute to PySAL and the workflow we follow for contributions. Our issues are also on GitHub; they include bug reports, 'wishlist' items, and enhancement plans and ideas.

If you are interested in participating in GSoC as a student, the best approach is to become an active and engaged contributor to the project right away. You should take a look at some of the existing issues on GitHub and see if there are any you think you might be able to take a crack at. Try submitting a pull request for something and start getting the hang of the process and interacting with the PySAL code base and development community.

Guidelines and Prerequisites

Students should start by reading the guidelines for participation. Google also provides guidelines to help with writing a proposal as part of their GSoC Student Guide. It is a good idea to start on your proposal early, post a draft to the pysal-dev mailing list and iterate based on the feedback you receive. This will not only improve the quality of your proposal, but also help you find a suitable mentor.

Please note that as a sub-organization of the PSF (and active members of the Python community), we ask that all mentors and students working with PySAL abide by the Python Community Code of Conduct.

Project Ideas

Below is a list of possible projects that students might consider. We also encourage students to propose their own projects, though several of the following topics are relatively high on our priority list. Our priority list is flexible, and it is important that the topic matches the interest and background of the student.

When considering the following projects, don't be put off by the knowledge prerequisites -- you don't need to be an expert, and there is some scope for research and learning within the GSoC period. However, familiarity with and interest in the subject area and involved technologies will be helpful!

Point Pattern Analysis (PPA) Module

Point pattern analysis (PPA) is the study of the spatial arrangements of points in (usually 2D) space. Currently, there are very few options for conducting comprehensive PPA in Python. A preliminary module has been developed for PySAL as a first step in this direction. Extending this module with unit tests, examples, new functions and statistical tests, and similar additions would be an excellent GSoC project. The goal is not necessarily to be as comprehensive as, say, R's spatstat* package, but to support as much of the PPA workflow as possible in Python.

Specific activities/goals include:

  • additional tests/additional test coverage
  • optimization of envelopes and simulation based inference
  • algorithmic improvements and speedups
  • additional statistical tests and generating processes
  • development of educational resources
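
To give a flavor of what "envelopes and simulation based inference" means here, the following is a minimal sketch in plain Python, not the PySAL module's API. It assumes a unit-square study window, ignores edge effects, and uses the mean nearest-neighbor distance as the test statistic; all function names are hypothetical and chosen for illustration only.

```python
import math
import random


def mean_nn_distance(points):
    """Mean distance from each point to its nearest neighbor (brute force, O(n^2))."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        total += nearest
    return total / len(points)


def csr_sample(n, rng):
    """Draw n points uniformly on the unit square (complete spatial randomness)."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def simulation_envelope(n, n_sims, rng):
    """Min and max of the statistic over n_sims simulated CSR patterns of n points."""
    sims = [mean_nn_distance(csr_sample(n, rng)) for _ in range(n_sims)]
    return min(sims), max(sims)


rng = random.Random(12345)

# Illustrative data: 50 points jittered tightly around a single center.
clustered = [(0.5 + rng.gauss(0, 0.01), 0.5 + rng.gauss(0, 0.01)) for _ in range(50)]

observed = mean_nn_distance(clustered)
low, high = simulation_envelope(n=50, n_sims=99, rng=rng)

# An observed value below the envelope suggests clustering relative to CSR.
is_clustered = observed < low
print(f"observed={observed:.4f}, envelope=({low:.4f}, {high:.4f}), clustered={is_clustered}")
```

The brute-force distance search is quadratic in the number of points, which is one reason the "algorithmic improvements and speedups" item appears in the list above.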

Difficulty level: beginner to intermediate

Knowledge of PPA theory and of the mathematical/statistical properties of 2D point processes is required. The primary goal at this stage is API development and extension of the tests; optimizations may be done as needed.

Expected outcomes: a set of production-ready tests and data generating functions for PPA to rival other languages/packages!

Mentors: Serge Rey, Levi John Wolf, Taylor Oshan

Geovisualization Module

PySAL was originally conceived as a library implementing advanced spatial statistics and econometric methods. Given that there were many different visualization toolkits in the Python ecosystem, as well as GIS packages, visualization was not a focus of our library. Over time, however, users of PySAL wanted the ability to visualize the results of the computations that the analytical components provided. In response, a contributed module, viz, was developed to explore alternative approaches to providing lightweight visualization for PySAL.

The goal of the viz module is to provide a simple-to-use and lightweight interface that connects PySAL to different popular visualization toolkits. While much progress has been made, there is more that can be done on the viz project, as the visualization space is constantly evolving.

Specific activities for the viz project include:

  • Refinement and extension of the matplotlib interface (e.g. legends, views for analytics, regression object plots)
  • Development of interactive visualizations in Jupyter
  • Exploration of potential interfaces for alternative packages (e.g., Bokeh, folium, D3)

Difficulty level: intermediate

Mentors: Dani Arribas-Bel, Levi John Wolf, Taylor Oshan

Bayesian Spatial Models

It has long been possible to estimate many of the models in pysal.spreg using Bayesian methods. However, because common Bayesian spatial analysis packages lack support for simultaneous autoregressive (SAR) specifications, many statistical users end up writing custom Gibbs samplers for new model specifications.

To help the Bayesian computation community in Python and the spatial analysis community generally, a project demonstrating implementations of the common SAR specifications in pysal.spreg, in addition to spatial gaussian process models, would provide a set of common reference implementations for Bayesian Spatial Econometrics. These implementations could target either PyMC3 or Stan, but the goal would be to provide examples that allow HMC techniques to be used to estimate common spatial econometric models.

To make these estimation techniques efficient, we anticipate that interested candidates may need familiarity with sparse matrix techniques and libraries in Python, namely theano.sparse and scipy.sparse. This module may be rolled together with the new multilevel SAR-Error model estimators in spvcm. Together, this would include any custom classes, distributions, or utilities required to state and estimate models efficiently in either PyMC3 or Stan, as well as examples demonstrating how to do so.
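
For readers less familiar with these models, the simplest SAR specification (the spatial lag model) and its Gaussian log-likelihood can be written as below. This is a standard textbook form given only for orientation. The notation is an assumption of this sketch rather than something fixed by pysal.spreg: y is the n-vector of observations, X the design matrix, W a spatial weights matrix, and rho the spatial autoregressive parameter.

```latex
% Assumed notation, not taken from the PySAL documentation:
%   y: n-vector of observations, X: design matrix, W: n-by-n spatial weights,
%   \rho: spatial autoregressive parameter, \sigma^2: error variance.
\begin{align}
  y &= \rho W y + X\beta + \varepsilon,
      \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n) \\
  \ln L(\rho, \beta, \sigma^2 \mid y)
    &= -\frac{n}{2}\ln(2\pi\sigma^2)
       + \ln\lvert I_n - \rho W \rvert
       - \frac{1}{2\sigma^2}\,\lVert (I_n - \rho W)\,y - X\beta \rVert^2
\end{align}
```

The log-determinant term depends on rho, so a sampler has to re-evaluate it for every proposed value. Since W is typically sparse, this is where the sparse matrix techniques mentioned above are likely to matter.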

Skills:

  • Familiarity with Theano, NumPy, Stan, and PyMC3
  • Background or familiarity with econometric methods and techniques
  • Basic understanding of Bayesian statistics, particularly Bayesian linear models or Gaussian process models

Related Readings:

  • Banerjee, S., B. Carlin, and A. Gelfand. 2014. Hierarchical Modeling and Analysis for Spatial Data
  • LeSage, J. and R.K. Pace. 2010. Introduction to Spatial Econometrics

Difficulty Level: intermediate

Mentors: Levi John Wolf, Serge Rey

Explicitly spatial unsupervised learning (regionalization)

Regionalization (Duque, Ramos, & Suriñach, 2007) is a subdomain that aims to bring space explicitly into the grouping of observations into consistent categories. In essence, the idea is to cluster observations based on a given set of attributes, much as in traditional unsupervised learning, but to restrict the groupings by imposing a spatial constraint (usually that the observations in a group be geographically contiguous). The result is the geographic aggregation of small areas into consistent and coherent regions.
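
The effect of the contiguity constraint can be illustrated with a toy example. The sketch below is a deliberately simple greedy agglomerative procedure in plain Python. It is not one of clusterpy's algorithms, and the function name, the six-area grid, and the attribute values are all made up for illustration.

```python
def regionalize(values, neighbors, n_regions):
    """Greedy agglomerative clustering that only ever merges contiguous regions."""
    # Start with every area in its own region, keyed by area id.
    regions = {i: {i} for i in values}

    def mean(members):
        return sum(values[m] for m in members) / len(members)

    def touching(a, b):
        # Two regions touch if any member of one neighbors a member of the other.
        return any(n in regions[b] for m in regions[a] for n in neighbors[m])

    while len(regions) > n_regions:
        # Candidate merges are restricted to pairs of regions sharing a border.
        candidates = [
            (abs(mean(regions[a]) - mean(regions[b])), a, b)
            for a in regions
            for b in regions
            if a < b and touching(a, b)
        ]
        if not candidates:
            break  # remaining regions are disconnected, so no merge is possible
        _, a, b = min(candidates)
        merged = regions.pop(b)
        regions[a] |= merged

    return sorted(sorted(members) for members in regions.values())


# Hypothetical 2x3 grid of areas with rook contiguity:
#   0 1 2
#   3 4 5
neighbors = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
values = {0: 1.0, 1: 9.0, 2: 2.0, 3: 1.5, 4: 10.0, 5: 9.5}

result = regionalize(values, neighbors, n_regions=2)
print(result)  # prints [[0, 3], [1, 2, 4, 5]]
```

Area 2 has a low value, so an unconstrained clustering would group it with areas 0 and 3. Because it is not contiguous with them, the constrained procedure assigns it to the neighboring high-valued region instead.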

Currently, there is an excellent package written purely in Python (clusterpy). However, it is compatible only with Python 2 and is not fully integrated with PySAL, so it does not work smoothly with the rest of the ecosystem (e.g. PySAL/geopandas data structures), which ultimately limits its wider adoption.

This project will focus on three specific lines of work:

  • Designing and implementing an architecture for clusterpy that allows it to be fully integrated in the pydata-geo eco-system (e.g. PySAL/geopandas).
  • Making clusterpy functionality Python 3 compatible.
  • Extending the suite of regionalization algorithms implemented.

Difficulty level: intermediate/advanced

Mentors: Dani Arribas-Bel, Serge Rey, Levi John Wolf

Other

PySAL is an open source project, and as such we invite contributions from any interested developer. If you have an idea for an enhancement to PySAL, please contact one of the developers to discuss the possibilities for the project in GSoC 2017.

Some of the above guidelines were 'borrowed' from previously successful GSoC Mentoring Organizations, such as Julia and Statsmodels.

* Note: spatstat is licensed under the GNU GPL, so its code base is not compatible with that of PySAL.

Timeline

  • January 19 - February 9: organizations apply
  • February 27: organizations announced
  • February 27 - March 20: students discuss applications with mentoring organizations
  • March 20 - April 3: student application period
  • May 4: accepted student proposals announced
  • May 5 - May 29: community bonding
  • May 30 - August 29: coding
  • September 6: results announced

Source: https://summerofcode.withgoogle.com

Student Application Template

Python Software Foundation's student application template.