# Masked Arrays - Dealing with nodata

Masked arrays are similar to regular arrays but provide additional functionality for handling nodata values.  Because the AVIRIS data is _skewed_ (the bounds of the data do not follow grid lines), a lot of nodata is present, so it is often convenient to work with masked arrays.  The downsides are that not every function available for a regular ndarray is available for masked arrays, and that keeping track of both the data and the mask adds a layer of complication.

In [None]:
# All notebook imports
import rasterio
import numpy.ma as ma

## Creating a masked array

Let's start by getting a band of data from our dataset.

In [1]:
import rasterio

In [2]:
filepath_rad = '../input_data/f100520t01p00r08rdn_b/f100520t01p00r08rdn_b_sc01_ort_img'

In [3]:
with rasterio.open(filepath_rad, 'r') as src:
    # Read band 3 (rasterio band indexes start at 1) into a 2D numpy array
    band = src.read(3)

To create the masked array we need to import a new numpy module.

In [4]:
import numpy.ma as ma

We then go through two steps:
1. add a mask by indicating which value we want masked and on which array
2. set the _fill value_, the nodata value, for the array

In [5]:
# Mask the array
masked_band = ma.masked_where(band == -50, band)
# Update the nodata value
ma.set_fill_value(masked_band, -50)

In [6]:
masked_band

masked_array(
  data=[[--, --, --, ..., --, --, --],
        [--, --, --, ..., --, --, --],
        [--, --, --, ..., --, --, --],
        ...,
        [--, --, --, ..., --, --, --],
        [--, --, --, ..., --, --, --],
        [--, --, --, ..., --, --, --]],
  mask=[[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],
  fill_value=-50,
  dtype=int16)

Looking at the output of the masked band, we can see that the masked array has two parts: the data and the mask.  The data is displayed with the mask applied, so instead of the -50 values we just see `--`.  The mask is a _boolean array_, an array of True and False values, that indicates whether each location in the original array is masked.  The `fill_value` is the value substituted for masked entries when the array is converted back to a regular array; here we set it to the nodata value.
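To make the role of the mask and the fill value more concrete, here is a minimal sketch on a tiny synthetic array.  The names `toy_band`, `toy_masked` and `NODATA` and the pixel values are made up for illustration and are not part of the AVIRIS dataset; the only thing taken from this notebook is the nodata value of -50.

```python
import numpy as np
import numpy.ma as ma

NODATA = -50  # nodata value used in this notebook

# Small synthetic stand-in for `band` (made-up values, not AVIRIS data)
toy_band = np.array([[NODATA, 1300, 1400],
                     [NODATA, NODATA, 1500]], dtype=np.int16)

# Same two steps as above: mask the nodata cells, then set the fill value
toy_masked = ma.masked_where(toy_band == NODATA, toy_band)
ma.set_fill_value(toy_masked, NODATA)

print(toy_masked.mask)        # True wherever the source value was nodata
print(toy_masked.fill_value)  # the value used to fill masked cells
print(toy_masked.count())     # number of unmasked (valid) cells

# filled() returns a plain ndarray with masked cells replaced by fill_value
restored = toy_masked.filled()
print(type(restored))
print(np.array_equal(restored, toy_band))
```

Calling `filled()` is one way to get back to a regular ndarray, for example before passing the data to a function that does not understand masks.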

We can confirm that the masked array is different from the original data by comparing our source `band` to the `masked_band` object we just created.

In [7]:
# Comparing exploratory statistics
print('-- MINIMUM VALUE')
print('min numpy: ', band.min())
print('min masked: ', masked_band.min())
print('-- MAXIMUM VALUE')
print('max numpy: ', band.max())
print('max masked: ', masked_band.max())
print('-- MEAN VALUE')
print('mean numpy: ', band.mean())
print('mean masked: ', masked_band.mean())

-- MINIMUM VALUE
min numpy:  -50
min masked:  1261
-- MAXIMUM VALUE
max numpy:  3451
max masked:  3451
-- MEAN VALUE
mean numpy:  1384.6904285991725
mean masked:  1574.0858305373667
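The minimum and mean differ because statistics on a masked array skip the masked cells, while the maximum is unchanged because the largest value is valid data.  The sketch below illustrates this on a small synthetic array (the names `toy_band`, `toy_masked` and `valid` and the values are illustrative assumptions, not AVIRIS data) and shows that the masked mean matches the mean of the valid cells selected with boolean indexing.

```python
import numpy as np
import numpy.ma as ma

NODATA = -50  # nodata value used in this notebook

# Synthetic 1D stand-in for `band`: two nodata cells, three valid cells
toy_band = np.array([NODATA, NODATA, 1200, 1400, 1600], dtype=np.int16)
toy_masked = ma.masked_where(toy_band == NODATA, toy_band)

# Plain ndarray containing only the valid cells
valid = toy_band[toy_band != NODATA]

print('min numpy:  ', toy_band.min())     # nodata is the smallest value
print('min masked: ', toy_masked.min())   # smallest valid value
print('mean numpy: ', toy_band.mean())    # nodata pulls the mean down
print('mean masked:', toy_masked.mean())  # masked cells are ignored
print('mean valid: ', valid.mean())       # same as the masked mean
```

Boolean indexing flattens the result to one dimension, so the masked array is the more convenient choice when the 2D shape of the band needs to be preserved.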


## Mask vs. Data vs. Masked Array

The benefit of the masked array is that we have a richer mechanism for working with nodata.  The added layer of complication is that we have to juggle three different objects:
* the mask - the True/False array (`masked_band.mask`)
* the data - the source values (`masked_band.data`)
* the masked array - the Python object that blends the two together (`masked_band`)

Let's look at some output to clarify the distinction.

In [8]:
print('masked array ', masked_band[0,0])
print('mask ', masked_band.mask[0,0])
print('data ', masked_band.data[0,0])

masked array  --
mask  True
data  -50


In [9]:
print('masked array ', masked_band.mean())
print('mask ', masked_band.mask.mean())
print('data ', masked_band.data.mean())

masked array  1574.0858305373667
mask  0.11661662110279508
data  1384.6904285991725
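The three means are related.  The mean of the boolean mask is the fraction of cells that are masked (about 11.7% in the output above), and the mean of the raw data is a weighted average of the masked-array mean and the nodata value.  The sketch below checks that relationship on a small synthetic array; the names and values are illustrative assumptions, and it assumes every masked cell holds the nodata value, as is the case for a mask built with `band == -50`.

```python
import numpy as np
import numpy.ma as ma

NODATA = -50  # nodata value used in this notebook

# Synthetic stand-in for `band`: two nodata cells, three valid cells
toy_band = np.array([NODATA, NODATA, 1200, 1400, 1600], dtype=np.int16)
toy_masked = ma.masked_where(toy_band == NODATA, toy_band)

masked_fraction = toy_masked.mask.mean()  # share of cells that are nodata
valid_mean = toy_masked.mean()            # mean over unmasked cells only
raw_mean = toy_masked.data.mean()         # mean over every cell, nodata included

# Weighted average of the valid mean and the nodata value
reconstructed = (1 - masked_fraction) * valid_mean + masked_fraction * NODATA

print('masked fraction:', masked_fraction)
print('raw mean:       ', raw_mean)
print('reconstructed:  ', reconstructed)
print(np.isclose(raw_mean, reconstructed))
```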
