gravitas: exploring probability distributions for bivariate temporal granularities, in large spatiotemporal sensor data

Dianne Cook edited this page Apr 7, 2019 · 7 revisions

Background

Sensors now accumulate large volumes of data. One example is smart meters, which measure household energy usage on a fine time scale and are now installed in many households in many countries. Providing tools to explore this type of data is an important activity. Because of the large volume of data, displaying probability distributions is a potentially useful approach. Probability distributions are induced by various aggregations of the data: by temporal components, by spatial region, or by type of household.

The idea for this package is to provide methods that operate on time in an automated way and deconstruct it in many different ways. Deconstructions that respect the linear progression of time, like days, weeks and months, are defined as linear time granularities. Those that accommodate periodicities in time, like hour of the day or day of the month, are defined as circular granularities or calendar categorizations. This package will provide methods to exhaustively construct granularities, together with automatic checks on the feasibility of a particular granularity.
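The distinction between linear and circular granularities can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the package's R interface; the origin `EPOCH` and the function names `linear_day` and `hour_of_day` are assumptions made for this example.

```python
from datetime import datetime, timedelta

# Arbitrary origin for the linear index (an assumption for this sketch).
EPOCH = datetime(2019, 1, 1)

def linear_day(ts):
    """Linear granularity: number of whole days since EPOCH; it never repeats."""
    return (ts - EPOCH).days

def hour_of_day(ts):
    """Circular granularity: hour within the day; it cycles through 0..23."""
    return ts.hour

# Five timestamps spaced 12 hours apart, starting at EPOCH.
stamps = [EPOCH + timedelta(hours=12 * i) for i in range(5)]
linear = [linear_day(ts) for ts in stamps]
circular = [hour_of_day(ts) for ts in stamps]
```

The linear index only ever increases as time moves forward, while the circular value returns to its starting category once each day is complete.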

The package will bring these techniques into the tidy workflow, so that probability distributions can be examined with the range of graphics available in the ggplot2 package.

Related work

  • _lubridate_ is an R package that makes it easier to work with time. It has functions for creating calendar categorizations such as hour of the day, day of the week and minute of the hour, but most of these are only one step up. The proposed framework will allow creating calendar categorizations that are more than one step up, for example hour of the week, as well as one-step-up categorizations that are not present in lubridate, such as week of the month.
  • Calendar-based graphics in the package _sugrrants_ help explore data across linear time granularities in a calendar format, whereas this package would help explore circular time granularities.
  • _ggplot2_ facilitates mapping different variables to a 2D frame through the grammar of graphics. But it does not tell us which variables to plot together to promote exploration of the data.
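As a rough illustration of categorizations that go beyond one step up, the sketch below computes hour of the week and week of the month from a timestamp. It is in Python for illustration only and does not represent lubridate or the proposed package. Two conventions are assumed here and are not taken from the source: weeks start on Monday, and week 1 of a month covers days 1 to 7.

```python
from datetime import datetime

def hour_of_week(ts):
    """Hour within the week, 0..167 (assumes the week starts on Monday)."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    return ts.weekday() * 24 + ts.hour

def week_of_month(ts):
    """Week within the month, 1..5 (assumes week 1 covers days 1 to 7)."""
    return (ts.day - 1) // 7 + 1

sunday_afternoon = datetime(2019, 4, 7, 15, 30)  # a Sunday
monday_midnight = datetime(2019, 4, 8, 0, 0)     # the following Monday
```

Other conventions, such as weeks starting on Sunday or week 1 being the first full calendar week, would give different values; the choice has to be stated wherever such a granularity is used.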

If we define time variables that facilitate exploration as harmonies and those that do not as clashes, the proposed framework would provide the list of harmonies given a time variable.
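One way to make the harmony and clash distinction concrete is sketched below. The criterion is an assumption for illustration, not the definition the package will necessarily adopt: a pair of granularities is treated as a harmony when every combination of their categories occurs in the data, and as a clash when some combinations are empty. The helper `is_harmony` and the granularity functions are hypothetical names, and the code is Python rather than R.

```python
from datetime import datetime, timedelta

def hour_of_day(ts):
    return ts.hour

def day_of_week(ts):
    return ts.weekday()  # 0 = Monday

def hour_of_week(ts):
    return ts.weekday() * 24 + ts.hour

def is_harmony(stamps, gran_x, gran_y):
    """Assumed criterion: all category combinations are observed."""
    xs = {gran_x(ts) for ts in stamps}
    ys = {gran_y(ts) for ts in stamps}
    seen = {(gran_x(ts), gran_y(ts)) for ts in stamps}
    # seen is a subset of the full cross product, so equal sizes
    # mean that no combination is empty.
    return len(seen) == len(xs) * len(ys)

# Four weeks of hourly timestamps starting on a Monday.
start = datetime(2019, 1, 7)
stamps = [start + timedelta(hours=h) for h in range(24 * 28)]

harmony = is_harmony(stamps, hour_of_day, day_of_week)
clash = is_harmony(stamps, day_of_week, hour_of_week)
```

Under this assumption, hour of the day against day of the week is a harmony, because every hour occurs on every weekday. Day of the week against hour of the week is a clash, because each hour of the week falls on exactly one weekday and most panels of a faceted plot would be empty.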

  • The package will take _tsibble_ objects as input. These complement the tibble and extend the tidyverse concept to temporal data.

Details of your coding project

  1. Develop an R package with functions to
    • create circular time granularities that are multiple steps up in time
    • categorize pairs of granularities as either harmonies or clashes
    • produce appropriate data structures to visualize with the grammar of graphics
  2. Develop a Shiny UI that lets users explore the construction of circular granularities
  3. Provide examples of probability visualization of smart meter data collected on Australian households
  4. Document the R package functionality in a vignette
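The item above about producing data structures for the grammar of graphics could take the form of one summary row per category of a granularity. The following is a minimal sketch under stated assumptions: it is Python rather than R, the helper `quartiles_by` and the row layout are hypothetical, and quartiles are only one of several possible distribution summaries.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import quantiles

def quartiles_by(readings, granularity):
    """Return one row per category with the quartiles of its values.

    readings: iterable of (timestamp, value) pairs
    granularity: function mapping a timestamp to a category
    """
    groups = defaultdict(list)
    for ts, value in readings:
        groups[granularity(ts)].append(value)
    rows = []
    for category in sorted(groups):
        q25, q50, q75 = quantiles(groups[category], n=4, method="inclusive")
        rows.append({"category": category, "q25": q25, "q50": q50, "q75": q75})
    return rows

# Toy readings (made up for illustration): two per day over five days.
readings = []
for d in range(5):
    day = datetime(2019, 1, 1) + timedelta(days=d)
    readings.append((day, d + 1))                               # midnight
    readings.append((day + timedelta(hours=12), 10 * (d + 1)))  # noon

summary = quartiles_by(readings, lambda ts: ts.hour)
```

Each row holds one category and its summary statistics, which is the long layout that plotting tools built on the grammar of graphics generally expect.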

Expected impact

The package would facilitate a tidy workflow for exploring probability distributions of variables across circular granularities in data collected by sensors. It would give users new tools for exploring sensor data and for comparing different types of probability distribution visualizations, and it would promote exploratory analysis of data with temporal context.

Mentors

  • Dianne Cook <dicook@monash.edu>
  • Antony Unwin <au50au@me.com>

Tests

Easy: Explore the data aus_elec (half-hourly electricity demand for five Australian states) from the package tsibbledata and create five different probability distribution plots for each month. For example, a boxplot, which shows the median, quartile boundaries, hinges, whiskers and outliers, is one way to display the distribution of data.

Medium: Write a summary which includes pros and cons of each of the different probability distribution plots that you have chosen.

Hard: Write R code to populate two additional numeric vectors in your data denoting hour-of-week and week-of-month.

Solutions of tests

Name: Sayani Gupta

Email: Sayani.Gupta@monash.edu

Link to all solutions: https://sayani.netlify.com/slides/gsoc2019_tests.html

Students, please post a link to your test results here.
