GSoC 2017 Project Ideas

Ethan White edited this page Feb 9, 2017 · 2 revisions

Python and Julia Interfaces for the Data Retriever

Please ask questions here. Tag @ethanwhite and @henrysenyondo.

Rationale

The Data Retriever automates the tasks of finding, downloading, and cleaning up publicly available data, and then stores them in a local database or CSV files. It's a package manager for data. This lets data analysts spend less time cleaning up and managing data, and more time analyzing it.

The Data Retriever is written in Python. It currently has a command line interface (CLI) and can also be used through an associated R package that wraps this CLI. Adding a native Python interface and a Julia package wrapping the CLI would provide access to the Data Retriever's tools in the three major open-source languages for data-oriented computing.

Approach

This project would extend the Data Retriever using Python to provide a native Python interface. It would also develop a new Julia package that wraps the existing CLI. Proposals that build only one of these interfaces are also welcome.

Specifically this would involve:

  • Object oriented programming in Python
  • Package development in Julia
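
A thin wrapper over the existing CLI, which is how the R package works and what is proposed for Julia, might look roughly like the following Python sketch. It assumes the CLI accepts `retriever install <engine> <dataset>`; the function names `build_install_command` and `install` are hypothetical and not part of the Data Retriever. A native Python interface would call the Data Retriever's own modules instead of launching a subprocess.

```python
import subprocess


def build_install_command(dataset, engine="csv"):
    """Return the argument list for installing `dataset` with `engine`.

    Assumes the CLI syntax `retriever install <engine> <dataset>`.
    """
    if not dataset:
        raise ValueError("dataset name must not be empty")
    return ["retriever", "install", engine, dataset]


def install(dataset, engine="csv"):
    """Run the CLI; raises CalledProcessError if the command fails."""
    subprocess.run(build_install_command(dataset, engine), check=True)
```

With the CLI on the path, `install("iris", engine="sqlite")` would then run the corresponding command, where `iris` is used purely as an example dataset name.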

Involved toolkits or projects

  • The Data Retriever
  • Python
  • Julia

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of Python
  • Knowledge of Julia

Involved developer communities

The Data Retriever primarily interacts via issues and pull requests on GitHub.

Mentors

  • @henrysenyondo
  • @ethanwhite

Add Spatial Data Support to the Data Retriever

Please ask questions here. Tag @ethanwhite and @henrysenyondo.

Rationale

The Data Retriever automates the tasks of finding, downloading, and cleaning up publicly available data, and then stores them in a local database or CSV files. It's a package manager for data. This lets data analysts spend less time cleaning up and managing data, and more time analyzing it.

The Data Retriever currently focuses on tabular data. One of the most common and widely used types of non-tabular data is spatial data such as maps. Database management systems increasingly support storing and querying spatial data, which can be particularly useful for handling large quantities of it. However, many data analysts aren't familiar with how to store spatial data in databases.

Approach

This project would extend the Data Retriever using Python to support installing spatial data in PostgreSQL and SQLite.

Specifically this would involve:

  • Object oriented programming in Python
  • Working with spatial data in PostgreSQL and SQLite
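
As a rough illustration of putting spatial records into a database, the sketch below stores point locations in plain SQLite, keeping each geometry as WKT text alongside numeric coordinates, and answers a bounding-box query. It deliberately avoids SpatiaLite so that it runs with the standard library alone; with SpatiaLite or PostGIS, a real geometry column and spatial index would take the place of the `lon`/`lat` columns. The function and table names are hypothetical.

```python
import sqlite3


def store_points(conn, table, points):
    """Store (name, lon, lat) tuples, keeping the geometry as WKT text."""
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" '
        "(name TEXT, lon REAL, lat REAL, wkt TEXT)"
    )
    rows = [(n, lon, lat, f"POINT({lon} {lat})") for n, lon, lat in points]
    conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?, ?, ?)', rows)
    conn.commit()


def points_in_bbox(conn, table, min_lon, min_lat, max_lon, max_lat):
    """Return names of points inside a bounding box, ordered by name."""
    cur = conn.execute(
        f'SELECT name FROM "{table}" '
        "WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ? ORDER BY name",
        (min_lon, max_lon, min_lat, max_lat),
    )
    return [row[0] for row in cur]
```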

Involved toolkits or projects

  • The Data Retriever
  • Python
  • PostgreSQL and PostGIS
  • SQLite and SpatiaLite

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of Python
  • Knowledge of spatial data and spatial database add-ons

Involved developer communities

The Data Retriever primarily interacts via issues and pull requests on GitHub.

Mentors

  • @henrysenyondo
  • @ethanwhite

Improving reproducibility in data handling by adding provenance features to the Data Retriever

Please ask questions here.

Rationale

The Data Retriever automates the tasks of finding, downloading, and cleaning up publicly available data, and then stores them in a local database or CSV files. It's a package manager for data. This lets data analysts spend less time cleaning up and managing data, and more time analyzing it.

One of the challenges of reproducibility in data science is that most of the data-related steps (downloading, cleaning up, and restructuring) are done either manually or with one-off scripts. If the data source is updated, this can break scripts, require manual cleaning to be redone, and make it difficult to understand the relationship between analyses on different versions of the data. The Data Retriever already solves many of these problems, but it doesn't currently support rerunning analyses using older versions of data and the package management scripts associated with those older versions. This prevents older analyses from being fully reproduced and compared to those based on updated data.

Approach

This project would extend the Data Retriever using Python to automatically store downloaded versions of a data package, and to run the Data Retriever on those archived data using the same version of both the Data Retriever and the data package script with which it was originally created.
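
One possible way to store downloaded versions is a content-addressed archive: each raw file is saved under its SHA-256 hash, next to a small metadata record noting which tool and script versions handled it. The sketch below is only an illustration under that assumption; `archive_download`, the metadata fields, and the directory layout are hypothetical and not an existing Data Retriever feature.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path


def archive_download(source, archive_dir, tool_version, script_version):
    """Copy `source` into `archive_dir` under its SHA-256 and record metadata."""
    source = Path(source)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Hash in chunks so large downloads are never read into memory at once.
    hasher = hashlib.sha256()
    with source.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    digest = hasher.hexdigest()

    stored = archive_dir / f"{digest}{source.suffix}"
    if not stored.exists():  # identical content is stored only once
        shutil.copy2(source, stored)

    record = {
        "original_name": source.name,
        "sha256": digest,
        "tool_version": tool_version,
        "script_version": script_version,
        "archived_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    meta_path = archive_dir / f"{digest}.meta.json"
    meta_path.write_text(json.dumps(record, indent=2))
    return record
```

Because the file name is derived from the content, a changed upstream source produces a new archive entry, while an unchanged one is not duplicated.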

Specifically this would involve:

  • Object oriented programming in Python
  • Working with Docker to automate running different versions of the Data Retriever
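
Running a specific older version through Docker could be as simple as selecting an image tag per release. The sketch below only builds the `docker run` argument list; the image name, the one-tag-per-version convention, and the `retriever install <engine> <dataset>` syntax inside the container are all assumptions made for illustration.

```python
def build_docker_command(image, version, dataset, engine="csv"):
    """Return a `docker run` argument list for a pinned Data Retriever version.

    `image` is a hypothetical image name; one tag per released version is assumed.
    """
    return [
        "docker", "run", "--rm",
        f"{image}:{version}",
        "retriever", "install", engine, dataset,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` on a machine where Docker is installed.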

Involved toolkits or projects

  • The Data Retriever
  • Python
  • Relational database management systems (RDBMS) including MySQL, PostgreSQL, SQLite

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of Python
  • Some experience with Docker

Involved developer communities

The Data Retriever primarily interacts via issues and pull requests on GitHub.

Mentors

  • @henrysenyondo
  • @ethanwhite

Improve Data Retriever efficiency for out-of-memory scale datasets

Rationale

The Data Retriever automates the tasks of finding, downloading, and cleaning up publicly available data, and then stores them in a local database or CSV files. It's a package manager for data. This lets data analysts spend less time cleaning up and managing data, and more time analyzing it.

The Data Retriever is designed to work with out-of-memory scale data, but is still slower than desirable when doing so. This project would involve both making the Data Retriever more efficient on large datasets and making querying them from the resulting databases more efficient.

Approach

This project would extend the Data Retriever using Python to increase the speed while maintaining a low memory footprint, and to allow indexes to be added to the databases for efficient querying.
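
To make the two goals concrete, here is a minimal sketch, using SQLite from the standard library, of loading a CSV in fixed-size batches (so memory use stays bounded regardless of file size) and then adding an index on a column used for lookups. It does not reflect how the Data Retriever currently loads data; `load_csv_in_batches` and `add_index` are hypothetical names, and table and column names are assumed to come from a trusted source.

```python
import csv
import sqlite3
from itertools import islice


def load_csv_in_batches(conn, table, lines, batch_size=1000):
    """Stream CSV rows into `table` without holding the whole file in memory."""
    reader = csv.reader(lines)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    total = 0
    while True:
        # Pull at most `batch_size` rows from the reader at a time.
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', batch)
        total += len(batch)
    conn.commit()
    return total


def add_index(conn, table, column):
    """Create an index so lookups on `column` avoid a full table scan."""
    conn.execute(f'CREATE INDEX "idx_{table}_{column}" ON "{table}" ("{column}")')
```

After `add_index`, running `EXPLAIN QUERY PLAN` on an equality query against the indexed column is a quick way to confirm that SQLite uses the index.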

Specifically this would involve:

  • Object oriented programming in Python
  • Using profilers to identify slow or memory-intensive areas of the codebase
  • Working with relational database management systems
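
For the profiling step, Python's standard library already provides `cProfile` for timing and `tracemalloc` for memory. The helpers below are a small sketch of how a candidate function could be measured; `profile_call` and `peak_memory` are hypothetical names, and the function being profiled would be whichever part of the codebase is under investigation.

```python
import cProfile
import io
import pstats
import tracemalloc


def profile_call(func, *args, top=5, **kwargs):
    """Run `func` under cProfile; return (result, text report by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def peak_memory(func, *args, **kwargs):
    """Run `func`; return (result, peak bytes allocated by Python while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

Comparing the report and the peak figure before and after a change gives a simple check that a speedup has not come at the cost of a larger memory footprint.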

Involved toolkits or projects

  • The Data Retriever
  • Python
  • Relational database management systems (RDBMS)

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of Python

Involved developer communities

The Data Retriever primarily interacts via issues and pull requests on GitHub.

Mentors

  • @henrysenyondo
  • @ethanwhite