/
blog1.Rmd
216 lines (172 loc) · 8.47 KB
/
blog1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
---
title: "spatiotemporal arrays for R: stars blog one"
author: "Edzer Pebesma"
output:
html_document:
toc: true
theme: united
vignette: >
%\VignetteIndexEntry{Spatiotemporal arrays for R; stars blog one}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(collapse = TRUE)
ev = FALSE
```
This is the first of a planned series of blogs on the
[stars](https://github.com/r-spatial/stars) project, an R-Consortium
funded project for _spatiotemporal tidy arrays with R_.
The goals of the stars project are
* to handle raster data in a way that integrates well with the [sf](https://github.com/r-spatial/sf) project and with the [tidyverse](https://www.tidyverse.org/)
* to handle array data (time series, or otherwise functional data) where time and space are among the dimensions
* to do this in a scalable way, i.e. deal with cases where data are too large to fit in memory or on disk
* to think about a migration path for the large and famous [raster](https://cran.r-project.org/package=raster) in the same directions
In its current stage `stars` and (as planned) does not have
* scalability to large data sets; everything is still in memory
* writing data back to disk
The package is loaded by
```{r}
# devtools::install_github("r-spatial/stars")
library(stars)
```
Spatiotemporal arrays are stored in objects of class `stars`;
methods for class `stars` currently available are
``` {r}
methods(class = "stars")
```
Note that _everything_ in the `stars` api may still be subject to change in the next few months.
# Reading a satellite image
We can read a satellite image through GDAL, e.g. from a GeoTIFF file in the package:
```{r}
tif = system.file("tif/L7_ETMs.tif", package = "stars")
x <- tif %>% read_stars
par(mar = rep(0,4))
image(x, col = grey((4:10)/10))
```
We see that the image is geographically referenced (has coordinate values along axes), and that the object returned (`x`) has three dimensions: `x`, `y` and `band`, and one attribute.
```{r}
x
```
Each dimension has a name; the meaning of the fields of a single dimension are:
|*field* |*meaning* |
|--------|--------------------------------------|
| from | the origin index (1) |
| to | the final index (dim(x)[i]) |
| offset | the start value for this dimension |
| delta | the step size for this dimension |
| refsys | the reference system, or proj4string |
| point | logical; whether cells refer to points, or intervals |
| values | the sequence of values for this dimension (e.g., geometries) |
This means that for an index i (starting at $i=1$) along a certain dimension, the corresponding dimension value (coordinate, time) is $\mbox{offset} + (i-1) \times \mbox{delta}$. This value then refers to the start (edge) of the cell or interval; in order to get the interval middle or cell centre, one needs to add half an offset.
Dimension `band` is a simple sequence from 1 to 6. Although bands refer to colors, their wavelength values (or b in the `values` field.
For this particular dataset (and most other raster datasets), we see that offset for dimension `y` is negative: this means that consecutive array values have decreasing $y$ values: cells are ordered from top to bottom, opposite the direction of the $y$ axis.
`read_stars` reads all bands from a raster dataset, or a set of raster datasets, into a single `stars` array structure. While doing so, raster values (often UINT8 or UINT16) are converted to double (numeric) values, and scaled back to their original values if needed.
The data structure `stars` is an extension of the `tbl_cube` found in `dplyr`; we can convert to that by
```{r}
library(dplyr)
as.tbl_cube(x)
```
# Affine grids
The GDAL model can deal also with spatial rasters that are regular but not aligned with $x$ and $y$: affine grids. An example is given here:
```{r}
par(cex.axis = .7) # font size axis tic labels
geomatrix = system.file("tif/geomatrix.tif", package = "stars")
geomatrix %>% read_stars %>% st_as_sf(as_points = FALSE, merge = FALSE) %>%
plot(axes =TRUE, main = "geomatrix.tif", graticule = TRUE)
```
Looking at the dimensions
```{r}
geomatrix %>% read_stars %>% st_dimensions
```
further reveals that we now have a `geotransform` field shown in the dimension table; this is only displayed when the affine parameters are non-zero. The geotransform field has six parameters, $gt_1,...,gt_6$
that are used to transform internal raster coordinates (column pixel $i$ and row pixel $j$)
to world coordinates ($x$, $y$):
$$x = gt_1 + (i-1) gt_2 + (j-1) gt_3$$
$$y = gt_4 + (i-1) gt_5 + (j-1) gt_6$$
When $gt_3$ and $gt_5$ are zero, the $x$ and $y$ are parallel to $i$ and $j$, which makes it appear unrotated.
# Reading a raster time series: netcdf
Another example is when we read raster time series model outputs in a NetCDF file, e.g. by
```{r eval=ev}
prec = read_stars("data/full_data_daily_2013.nc")
```
(Note that this 380 Mb file is not included; data are described [here](ftp://ftp.dwd.de/pub/data/gpcc/html/fulldata-daily_v1_doi_download.html), and were downloaded from [here](ftp://ftp.dwd.de/pub/data/gpcc/full_data_daily_V1/full_data_daily_2013.nc.gz)).
We see that
```{r eval=ev}
prec
```
For this dataset we can see that
* variables have units associated
* time is now a dimension, with proper units and time steps
* missing values for the fourth variable were not taken care off correctly
## Reading datasets from multiple files
Model data are often spread across many files. An example of a 0.25 degree grid, global daily sea surface temperature product is found [here](ftp://eclipse.ncdc.noaa.gov/pub/OI-daily-v2/NetCDF/); a subset of the 1981 data was downloaded from [here](ftp://eclipse.ncdc.noaa.gov/pub/OI-daily-v2/NetCDF/1981/AVHRR/).
We read the data by giving `read_stars` a vector with character names:
```{r eval=ev}
x = c(
"avhrr/avhrr-only-v2.19810901.nc",
"avhrr/avhrr-only-v2.19810902.nc",
"avhrr/avhrr-only-v2.19810903.nc",
"avhrr/avhrr-only-v2.19810904.nc",
"avhrr/avhrr-only-v2.19810905.nc",
"avhrr/avhrr-only-v2.19810906.nc",
"avhrr/avhrr-only-v2.19810907.nc",
"avhrr/avhrr-only-v2.19810908.nc",
"avhrr/avhrr-only-v2.19810909.nc"
)
(y = read_stars(x, quiet = TRUE))
```
Next, we select sea surface temperature (`sst`), and drop the singular `zlev` (depth) dimension using `adrop`:
```{r eval=ev}
library(abind)
z <- y %>% select(sst) %>% adrop
```
We can now graph the sea surface temperature (SST) using `ggplot`, which needs data in a long table form, and without units:
```{r eval=ev}
df = as.data.frame(z)
df$sst = unclass(df$sst)
library(ggplot2)
library(viridis)
library(ggthemes)
ggplot() +
geom_tile(data=df, aes(x=x, y=y, fill=sst), alpha=0.8) +
facet_wrap("time") +
scale_fill_viridis() +
coord_equal() +
theme_map() +
theme(legend.position="bottom") +
theme(legend.key.width=unit(2, "cm"))
```
# More complex arrays
Like `tbl_cube`, `stars` arrays have no limits to the number of dimensions they handle. An example is the origin-destination (OD) matrix, by time and travel mode.
## OD: space x space x travel mode x time x time
We create a 5-dimensional matrix of traffic between regions, by day, by time of day, and by travel mode. Having day and time of day each as dimension is an advantage when we want to compute patters over the day, for a certain period.
```{r}
nc = st_read(system.file("gpkg/nc.gpkg", package="sf"))
to = from = st_geometry(nc) # 100 polygons: O and D regions
mode = c("car", "bike", "foot") # travel mode
day = 1:100 # arbitrary
library(units)
units(day) = as_units("days since 2015-01-01")
hour = set_units(0:23, h) # hour of day
dims = st_dimensions(origin = from, destination = to, mode = mode, day = day, hour = hour)
(n = dim(dims))
traffic = array(rpois(prod(n), 10), dim = n) # simulated traffic counts
(st = st_as_stars(list(traffic = traffic), dimensions = dims))
```
This array contains the simple feature geometries of origin and destination so that we can directly plot every slice without additional table joins. If we want to represent such an array as a `tbl_cube`, the simple feature geometry dimensions need to be replaced by indexes:
```{r}
st %>% as.tbl_cube
```
The following demonstrates how `dplyr` can filter bike travel, and compute mean bike traffic by hour of day:
```{r eval=ev}
b <- st %>% as.tbl_cube %>%
filter(mode == "bike") %>%
group_by(hour) %>%
summarise(traffic = mean(traffic)) %>%
as.data.frame()
require(ggforce)
ggplot() +
geom_line(data=b, aes(x=hour, y=traffic))
```