
# Creating RasterFrames

## Initialization

There are a couple of setup steps necessary any time you want to work with RasterFrames in Python. The first is to import PySpark and create a session.

In [6]:
import pyspark
from pyspark.sql import SparkSession
from pyrasterframes import *
from pyrasterframes.rasterfunctions import *

# Add other configuration options as needed

spark = SparkSession.builder. \
    master("local[*]"). \
    appName("RasterFrames"). \
    config("spark.ui.enabled", "false"). \
    getOrCreate(). \
    withRasterFrames()

The second step is the `withRasterFrames()` call that ends the builder chain above; it enables RasterFrames support on the session.

Now we are ready to create a RasterFrame. The most straightforward way to create a `RasterFrame` is to read a [GeoTIFF](https://en.wikipedia.org/wiki/GeoTIFF)
file.

In [7]:
samplePath = 'samples/L8-B4-Elkton-VA.tiff'
rf = spark.read.geotiff(samplePath)

Let's inspect the structure of what we get back:

In [8]:
rf.printSchema()

root
 |-- spatial_key: struct (nullable = false)
 |    |-- col: integer (nullable = false)
 |    |-- row: integer (nullable = false)
 |-- bounds: polygon (nullable = true)
 |-- metadata: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = false)
 |-- tile: rf_tile (nullable = false)



As reported by Spark, RasterFrames extracts four columns from the GeoTIFF we selected. Some of these columns depend
on the contents of the source data, and some are always available. Let's take a look at these in more detail.

* `spatial_key`: GeoTrellis assigns a `SpatialKey` or a `SpaceTimeKey` to each tile, mapping it to the layer grid from
which it came. If it has a `SpaceTimeKey`, RasterFrames will split it into a `SpatialKey` and a `TemporalKey` (the
latter with column name `temporal_key`).
* `bounds`: The bounding box of the tile in the tile's native CRS.
* `metadata`: The TIFF format header tags found in the file.
* `tile` or `tile_n` (where `n` is a band number): For singleband GeoTIFF files, the `tile` column contains the cell
data split into tiles. For multiband files, each column with the `tile_` prefix contains one of the source's bands,
in the order they were stored.
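
To make the `spatial_key` and `tile` columns more concrete, here is a minimal pure-Python sketch of splitting a raster into fixed-size tiles keyed by grid position. It only illustrates the idea and is not how RasterFrames or GeoTrellis implement tiling; `split_into_tiles` and the nested-list raster are hypothetical stand-ins.

```python
from typing import Dict, List, Tuple

Raster = List[List[int]]  # rows of cell values (hypothetical stand-in for one band)


def split_into_tiles(raster: Raster, tile_cols: int, tile_rows: int) -> Dict[Tuple[int, int], Raster]:
    """Split a raster into tiles keyed by (col, row) grid position.

    Edge tiles are simply left smaller rather than padded, to keep the sketch short.
    """
    n_rows = len(raster)
    n_cols = len(raster[0]) if raster else 0
    tiles = {}
    for top in range(0, n_rows, tile_rows):
        for left in range(0, n_cols, tile_cols):
            key = (left // tile_cols, top // tile_rows)
            tiles[key] = [r[left:left + tile_cols] for r in raster[top:top + tile_rows]]
    return tiles


# A 4-row by 6-column raster whose cell values encode their own position.
raster = [[row * 10 + col for col in range(6)] for row in range(4)]
tiles = split_into_tiles(raster, tile_cols=3, tile_rows=2)

for (col, row), tile in sorted(tiles.items()):
    print(f"spatial_key(col={col}, row={row}) -> {tile}")
```

Each dictionary entry plays the role of one row in the RasterFrame: the key corresponds to `spatial_key`, and the nested list corresponds to the `tile` cell data.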

See the section [Inspecting a `RasterFrame`](#inspecting-a--code-rasterframe--code-) (below) for more details on accessing the RasterFrame's metadata.

## Reading a GeoTrellis Layer

If your imagery is already ingested into a [GeoTrellis layer](https://docs.geotrellis.io/en/latest/guide/spark.html#writing-layers),
you can use the RasterFrames GeoTrellis DataSource. There are two parts to this GeoTrellis Layer support. The first
is the GeoTrellis Catalog DataSource, which lists the GeoTrellis layers available at a URI. The second part is the actual
RasterFrame reader for pulling a layer into a RasterFrame.

Before we show how all of this works, we need a GeoTrellis layer to work with. We can create one from the RasterFrame we constructed above.

In [15]:
from pathlib import Path

In [21]:
# as_uri() requires an absolute path, so resolve the relative path first
base = Path("./notebooks/rf-").resolve().as_uri()
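
`Path.as_uri()` only accepts absolute paths, so a relative path has to be resolved before it is converted to a URI. A minimal standard-library sketch of the difference, independent of RasterFrames:

```python
from pathlib import Path

relative = Path("./notebooks/rf-")

# as_uri() raises ValueError for relative paths.
try:
    relative.as_uri()
    relative_ok = True
except ValueError:
    relative_ok = False

# resolve() anchors the path at the current working directory first;
# the path does not need to exist for this to work.
base_uri = relative.resolve().as_uri()
print(relative_ok, base_uri)
```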

In [19]:
layer = Layer(base, "sample", 0)
rf.write.geotrellis.asLayer(layer).save()

NameError: name 'Layer' is not defined

Now we can point our catalog reader at the base directory and see what was saved:

In [25]:
cat = spark.read.geotrellisCatalog(base)
cat.printSchema()
cat.show()

AttributeError: 'DataFrameReader' object has no attribute 'geotrellisCatalog'

As you can see, there's a lot of information stored in each row of the catalog. Most of this is associated with how the
layer is discretized. However, there may be other application-specific metadata serialized with a layer that can be used
to filter the catalog entries or select a specific one. For now, we're just going to load a RasterFrame from the
catalog using a convenience function.

In [None]:
firstLayer = cat.select(geotrellis_layer).first()
rfAgain = spark.read.geotrellis.loadRF(firstLayer)
rfAgain.show()

If you already know the `LayerId` of the layer you want to read, you can bypass the catalog:

In [None]:
anotherRF = spark.read.geotrellis.loadRF(layer)

## Writing a GeoTrellis Layer

**TODO**

## Using GeoTrellis APIs

If you are used to working directly with the GeoTrellis APIs, there are several additional ways to create a `RasterFrame`, as enumerated in the sections below. Note that the remaining snippets use Scala syntax.

First, some more `import`s:

In [None]:
import geotrellis.raster.io.geotiff.SinglebandGeoTiff
import geotrellis.spark.io._

### From `ProjectedRaster`

The simplest mechanism for getting a RasterFrame is to use the `toRF(tileCols, tileRows)` extension method on `ProjectedRaster`.

In [None]:
val scene = SinglebandGeoTiff("../core/src/test/resources/L8-B8-Robinson-IL.tiff")
val rf = scene.projectedRaster.toRF(128, 128)
rf.show(5, false)

### From `TileLayerRDD`

Another option is to use a GeoTrellis [`LayerReader`](https://docs.geotrellis.io/en/latest/guide/tile-backends.html)
to get a `TileLayerRDD`, for which there's also a `toRF` extension method.

In [None]:
import geotrellis.spark._
val tiledLayer: TileLayerRDD[SpatialKey] = ???
val rf = tiledLayer.toRF

## Inspecting a `RasterFrame`

`RasterFrame` has a number of methods providing access to metadata about the contents of the RasterFrame.

### Tile Column Names

In [None]:
rf.tileColumns.map(_.toString)

### Spatial Key Column Name

In [None]:
rf.spatialKeyColumn.toString

### Temporal Key Column

Returns an `Option[Column]` since not all RasterFrames have an explicit temporal dimension.

In [None]:
rf.temporalKeyColumn.map(_.toString)

### Tile Layer Metadata

The Tile Layer Metadata defines how the spatial/spatiotemporal domain is discretized into tiles, and what the key bounds are.

In [None]:
import spray.json._
// NB: The `fold` is required because an `Either` is returned, depending on the key type.
rf.tileLayerMetadata.fold(_.toJson, _.toJson).prettyPrint
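
As a rough illustration of what "discretized into tiles" means, the Python sketch below maps a spatial key to a bounding box, given a layer extent and a grid size. The `Extent` and `key_to_extent` names are hypothetical, and treating row 0 as the northernmost row is an assumption of this sketch; consult the GeoTrellis documentation for its actual key transform.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def key_to_extent(layer: Extent, layout_cols: int, layout_rows: int, col: int, row: int) -> Extent:
    """Return the bounding box of the tile at (col, row) in a regular grid.

    Assumes row 0 is the top (northernmost) row of the grid.
    """
    tile_width = (layer.xmax - layer.xmin) / layout_cols
    tile_height = (layer.ymax - layer.ymin) / layout_rows
    xmin = layer.xmin + col * tile_width
    ymax = layer.ymax - row * tile_height
    return Extent(xmin, ymax - tile_height, xmin + tile_width, ymax)


# A 400 x 200 unit layer split into a 4 x 2 grid of tiles.
layer_extent = Extent(0.0, 0.0, 400.0, 200.0)
tile_extent = key_to_extent(layer_extent, layout_cols=4, layout_rows=2, col=1, row=0)
print(tile_extent)
```

Under these assumptions, the extent computed for a key corresponds to what the `bounds` column reports for that tile.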
