GSoC 2023 Project Ideas

henry senyondo edited this page Mar 9, 2023 · 14 revisions

Please ask questions through issues on the respective project's repo.

Mentor tags: @henrykironde, @MarconiS, @bw4sz, @juniperlsimonis, @ethanwhite

  • Preferred names (Henry, Sergio, Ben, Juniper, Ethan)
  • Preferred_greeting (Hi|Hello|Dear|Thanks|Thank you [First_name])

Join the chat at https://gitter.im/weecology/retriever

Please read the code of conduct first.

Tree health and mortality from NEON data

Rationale

The National Ecological Observatory Network (NEON) collects and provides long-term, open-access ecological data. The NEON Data API provides access to these data. neonwranglerpy is an open-source package that retrieves these data, cleans them, and provides them to researchers in a format ready for ecological analyses. Likewise, DeepForest is open-source software for object detection and classification in airborne and drone imagery. Integrating the two packages would allow field and remote sensing data to be retrieved automatically, converted into the format required by DeepForest, and used to train classification models on new data.

Approach

This project aims to develop new functions that automatically feed NEON field and remote sensing data to the DeepForest package and format those data so they are readily usable for training multi-class classification in DeepForest, and eventually to build a baseline classification of tree health status. In particular, it will require improving the automated alignment of DeepForest bounding boxes to NEON individual tree coordinates, which are provided as stem locations.
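
The box-to-stem alignment step could be sketched roughly as follows. This is an illustrative, self-contained Python sketch, not code from either package; the function name, the simple nearest-center matching rule, and the `max_dist` threshold are all assumptions. Boxes use the `(xmin, ymin, xmax, ymax)` layout that DeepForest predictions report.

```python
# Hypothetical sketch: assign each NEON stem location to a predicted box.
# A box containing the stem wins outright; otherwise the box with the
# nearest center is used, subject to a maximum-distance cutoff.

def box_center(box):
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2, (ymin + ymax) / 2)

def align_stems_to_boxes(stems, boxes, max_dist=10.0):
    """Map each stem (x, y) to the index of the best-matching box, or None."""
    matches = []
    for sx, sy in stems:
        best, best_dist = None, float("inf")
        for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
            if xmin <= sx <= xmax and ymin <= sy <= ymax:
                best, best_dist = i, 0.0  # stem falls inside this box
                break
            cx, cy = box_center((xmin, ymin, xmax, ymax))
            dist = ((sx - cx) ** 2 + (sy - cy) ** 2) ** 0.5
            if dist < best_dist:
                best, best_dist = i, dist
        matches.append(best if best_dist <= max_dist else None)
    return matches

boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
stems = [(5, 5), (21, 29), (100, 100)]
print(align_stems_to_boxes(stems, boxes))  # [0, 1, None]
```

A real implementation would additionally need to project NEON stem coordinates (UTM) into image pixel space before matching.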

Source Code: DeepForest

Associated Code: neonwranglerpy

Degree of difficulty

  • Intermediate, long (350 hours)

Skills:

  • Python and Python package deployment
  • git/GitHub
  • Machine learning
  • Software testing

Expected outcomes

  • A set of functions that integrate neonwranglerpy and DeepForest by retrieving NEON field and remote sensing data, aligning field data to tree objects in images, and formatting the data so it is readily usable for training multi-class models from NEON data.

Mentors

  • @MarconiS
  • @henrysenyondo
  • @ethanwhite

portalcasting: Implement Ecological Forecasting Initiative schema for forecast output and metadata artifacts

The portalcasting package provides a model development, deployment, and evaluation system for forecasting how ecological systems change through time, with a focus on a widely used long-term study of mammal population and community dynamics, the Portal Project. The goal of this project is to implement a new output standard for sharing and archiving the forecasting artifacts using the Ecological Forecasting Initiative community conventions for forecast file formats, forecast metadata, and forecast archiving. The EFI community conventions specification is defined in the article https://doi.org/10.32942/osf.io/9dgtq. The project will involve multiple open source code bases and an open data repository.
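
For a sense of what the target output looks like: the EFI conventions use a long ("tidy") layout with one row per forecasted value. portalcasting itself is an R package, so this Python sketch is only illustrative, and the exact column names shown here are assumptions drawn from the EFI conventions as commonly described; the linked article is the authoritative source.

```python
# Illustrative sketch of an EFI-style long-format forecast file:
# one row per (model, reference time, forecast time, site, variable) value.
# Column names and values are assumptions for illustration only.
import csv
import io

rows = [
    {"model_id": "AutoArima", "reference_datetime": "2023-03-01",
     "datetime": "2023-03-08", "site_id": "portal", "family": "ensemble",
     "parameter": "1", "variable": "abundance", "prediction": 42.0},
    {"model_id": "AutoArima", "reference_datetime": "2023-03-01",
     "datetime": "2023-03-08", "site_id": "portal", "family": "ensemble",
     "parameter": "2", "variable": "abundance", "prediction": 39.0},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

The key design point is that uncertainty is represented as multiple rows (ensemble members or distribution parameters) rather than extra columns, which keeps the schema identical across models.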

Source Code: https://github.com/weecology/portalPredictions

Associated Code: https://github.com/weecology/portalcasting, https://github.com/weecology/portalr, https://github.com/weecology/PortalData

Degree of difficulty

  • Intermediate, long (350 hours)

Skills:

  • R
  • git/GitHub

Expected outcomes

  • Formal EFI specified format for forecast output files, forecast metadata, input and forecasted variables.

Mentors

  • @henrysenyondo
  • @ethanwhite

Multi-class training and prediction in DeepForest

Approach

DeepForest is an open source Python package for detecting trees (and other organisms) in remote sensing (RGB) imagery from airplanes and drones. The underlying model structure allows for classification as well as detection, so DeepForest can be used to identify trees to species or to distinguish alive from dead trees, but support for the multi-class aspects of the package needs further development. This project would involve a combination of software engineering to improve the UI for working with multi-class models and developing pre-trained models that provide features useful for transfer learning for species classification and alive/dead classification.
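
As context, DeepForest trains from a CSV of box annotations in which the `label` column carries the class, so multi-class work mostly comes down to populating that column with more than one value. The column layout below matches DeepForest's documented annotation format; the file names and class labels are made up.

```python
# Build a minimal multi-class annotation file in DeepForest's CSV layout.
# Each row is one bounding box on an image, labeled with its class.
import csv
import io

annotations = [
    ("tile_001.png", 10, 12, 85, 90, "alive"),
    ("tile_001.png", 120, 40, 190, 110, "dead"),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["image_path", "xmin", "ymin", "xmax", "ymax", "label"])
writer.writerows(annotations)
print(buf.getvalue())
```

On the model side, DeepForest's `main.deepforest` accepts a class count and a label-to-index mapping (e.g. `{"alive": 0, "dead": 1}`) when constructing a multi-class model; improving the ergonomics of that path is part of what this project would tackle.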

Source Code: https://github.com/weecology/DeepForest

Degree of difficulty

  • Intermediate, long (350 hours)

Skills:

  • Python
  • Deep learning using Pytorch
  • git/GitHub

Expected outcomes

  • An improved UI for working with multi-class models, plus pre-trained models that provide features useful for transfer learning for species classification and alive/dead classification.

Mentors

  • @henrysenyondo
  • @MarconiS
  • @bw4sz
  • @ethanwhite

High-performance parallel computing for model fitting and prediction in Portalcasting

Approach

Portalcasting is an open source R package that supports ecological forecasting of biodiversity for a long-term ecological research program that has been studying desert biodiversity for 45 years. The package provides automated data integration and modular models to produce forecasts for a range of ecological outcomes. While the forecasting system makes large numbers of forecasts, it currently does so sequentially rather than in parallel. This project would involve parallelizing the code base so it can run on multiple cores, both on individual machines and on HPC systems.
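
The workload is embarrassingly parallel: each (model, species) fit is independent, so the loop can simply be fanned out over workers and the results collected. Since portalcasting is R, the real work would use R tooling such as the `parallel` or `future` packages; the Python sketch below only illustrates the fan-out/collect pattern, and `fit_and_forecast` is a stand-in for an actual model fit.

```python
# Illustrative fan-out/collect pattern for an embarrassingly parallel
# forecasting loop. Each task is independent, so map() over a pool suffices.
from concurrent.futures import ThreadPoolExecutor

def fit_and_forecast(task):
    model, species = task  # placeholder for a real model fit + forecast
    return (model, species, f"forecast:{model}:{species}")

tasks = [(m, s) for m in ("AutoArima", "ESSS") for s in ("DM", "PP")]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(tasks=None) if False else pool.map(fit_and_forecast, tasks))

print(len(results))
```

Because `map` preserves task order, downstream code that assembles forecasts into a single table needs no changes beyond swapping the serial loop for the pooled one.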

Source Code: https://github.com/weecology/portalcasting

Degree of difficulty

  • Intermediate, short (175 hours)

Skills:

  • R
  • Parallel programming for embarrassingly parallel problems (i.e., the simple end of parallel programming)
  • git/GitHub

Expected outcomes

  • A parallelized code base that reduces forecast runtime on both individual machines and HPC systems.

Mentors

  • @henrysenyondo
  • @juniperlsimonis
  • @ethanwhite

Add Support for Apache Arrow

Rationale

The Data Retriever automates the tasks of finding, downloading, and cleaning up publicly available data and then stores it in a variety of databases and file formats. The Apache Arrow project is an open-source data processing library that enables high-performance analytic operations on modern CPU and GPU hardware. It provides in-memory computing, a standardized columnar storage format, and an IPC and RPC framework for data exchange between processes and nodes, respectively.

Approach

The project aims to extend the capabilities of the Data Retriever by adding support for Apache Arrow, a technology that performs in-memory computation on big data and stores the data in the Arrow Columnar Format.

Source Code: Data Retriever

Degree of difficulty

  • Intermediate, long (350 hours)

Skills:

  • Python and Python package deployment
  • git/GitHub

Expected outcomes

  • Addition of Arrow support to the Data Retriever by storing data in the Arrow columnar format, and either an improved fetch function that uses in-memory computation or a new fetch utility that can use Apache Arrow.

Mentors

  • @henrysenyondo
  • @ethanwhite