# Masked Array

## Introduction

This tutorial analyzes sample COVID-19 data and shows how to handle missing values using NumPy's masked array module, `numpy.ma`.

## Objectives

* Understand what masked arrays are and how to create them
* Learn how to access and modify data in masked arrays
* Be able to decide when masked arrays are appropriate in real applications

## What are masking and masked arrays in NumPy?

Masking provides an element-wise boolean indicator for a dataset: each boolean value records whether the element at the same index in the original data is valid. In effect, it lays a cover over the original data with holes that expose the valid entries. The invalid entries stay hidden under the mask and can optionally be replaced with some value later.

A masked array is the combination of a standard `numpy.ndarray` and a mask. A mask is either `nomask`, indicating that no value of the associated array is invalid, or an array of booleans that determines for each element of the associated array whether the value is valid or not. When an element of the mask is `False`, the corresponding element of the associated array is valid and is said to be unmasked. When an element of the mask is `True`, the corresponding element of the associated array is said to be masked (invalid).
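
The definition above can be sketched in a few lines. This is a minimal illustration, not part of the COVID-19 analysis: the sample values and the use of `-1` as an "invalid" flag are assumptions made only for this example.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical readings; -1 is assumed here to flag an invalid entry.
values = np.array([10, -1, 30, -1, 50])

# True in the mask means "masked" (invalid); False means valid.
masked = ma.masked_array(values, mask=(values == -1))

print(masked)          # masked entries are displayed as --
print(masked.mask)     # one boolean per element of the data
print(masked.count())  # number of unmasked (valid) elements

# When no mask is supplied, the mask is the special constant nomask.
unmasked = ma.masked_array([1, 2, 3])
print(ma.getmask(unmasked) is ma.nomask)
```

Using `ma.getmask` rather than comparing shapes is a convenient way to check whether an array carries any mask at all.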

In a nutshell, a NumPy `MaskedArray` has the following components:

* Data, as a regular `numpy.ndarray` of any shape or datatype
* A boolean mask with the same shape as the data
* A `fill_value`, a value that may be used to replace the invalid entries in order to return a standard `numpy.ndarray`.
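
The three components listed above map onto attributes of a masked array. The sketch below uses made-up values and an arbitrary fill value of `-999.0`, both chosen only for illustration.

```python
import numpy as np
import numpy.ma as ma

# Illustrative array: the second entry is marked invalid.
x = ma.masked_array(
    [1.0, 2.0, 3.0, 4.0],
    mask=[False, True, False, False],
    fill_value=-999.0,
)

print(x.data)        # the underlying ndarray, masked value included
print(x.mask)        # boolean mask with the same shape as the data
print(x.fill_value)  # value used when the masked entries are filled

# filled() returns a standard ndarray with masked entries replaced
# by fill_value.
plain = x.filled()
print(plain)
```

Note that `x.data` still holds the original value under the mask; masking hides it from computations rather than deleting it.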

## Further usefulness of NumPy masked arrays

There are a few situations where masked arrays can be more useful than just eliminating the invalid entries of an array:

* When you want to preserve the masked values for later processing, without copying the array
* When you have to handle many arrays, each with its own mask. If the mask is part of the array, you avoid bugs and the code may be more compact
* When there are different flags for missing or invalid values, and you want to preserve these flags without replacing them in the original dataset, while excluding them from computations
* When you cannot avoid or eliminate missing values, but do not want to deal with NaN (Not a Number) values in your operations
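
Two of the situations above, keeping flag values in the original data while excluding them from computations and sidestepping NaN in reductions, can be sketched as follows. The counts and the `-1` "not reported" flag are hypothetical.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical daily counts; -1 is an assumed "not reported" flag.
counts = np.array([12, -1, 15, 20, -1, 25])

# Mask every entry equal to the flag value.
masked_counts = ma.masked_equal(counts, -1)

# Reductions skip the masked entries.
print(masked_counts.sum())
print(masked_counts.mean())

# The original array is untouched, so the flags are preserved.
print(counts)

# A plain mean is poisoned by NaN, while masking the invalid
# entries yields a mean over the valid values only.
raw = np.array([1.0, np.nan, 3.0])
print(np.mean(raw))
print(ma.masked_invalid(raw).mean())
```

`ma.masked_equal` and `ma.masked_invalid` are only two of the constructors in `numpy.ma`; which one fits depends on how the missing values are encoded in the data.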

In [1]:
import numpy as np
import os

In [4]:
filepath = os.getcwd()
filename = os.path.join(filepath, 'data/who_covid_19_sit_rep_time_series.csv')

## Description of the sample dataset

The data file contains data of different types and is organized as follows:

* The first row is a header line that (mostly) describes the data in each column of the rows below it. Beginning in the fourth column, the header is the date of the observation.

* The second through seventh rows contain summary data of a different type from the data we are going to examine, so we need to exclude them from the data we work with.

* The numerical data we wish to work with begins at column 4, row 8, and extends from there to the rightmost column and the lowermost row.

In [5]:
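# Read just the dates from the first (header) row, column indices 4 to 17
dates = np.genfromtxt(
    filename, dtype=np.str_, delimiter=',',
    max_rows=1, usecols=range(4, 18),
    encoding='utf-8-sig',
)

# Read the location names from the first two columns,
# skipping the first six rows
locations = np.genfromtxt(
    filename, dtype=np.str_, delimiter=',',
    skip_header=6, usecols=(0, 1),
    encoding='utf-8-sig',
)

# Read the case counts from the same columns as the dates
nbcases = np.genfromtxt(
    filename, dtype=np.int_, delimiter=',',
    skip_header=6, usecols=range(4, 18),
    encoding='utf-8-sig',
)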
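
To see how `max_rows`, `skip_header`, and `usecols` select portions of a file, the sketch below applies the same kind of calls to a tiny in-memory CSV. The file contents, column names, and the `demo_*` variable names are invented for illustration and do not reflect the real WHO data file.

```python
import io
import numpy as np

# Hypothetical miniature file: one header row, one summary row to
# skip, then two data rows. Column indices 4 to 6 hold the counts.
csv_text = (
    "Province,Country,Region,Extra,1/21/20,1/22/20,1/23/20\n"
    "Summary,,,,100,200,300\n"
    ",Japan,Asia,,1,1,2\n"
    "Hubei,China,Asia,,258,270,375\n"
)

def read(**kwargs):
    """Parse the sample text with the given genfromtxt options."""
    return np.genfromtxt(
        io.StringIO(csv_text), delimiter=",", encoding="utf-8", **kwargs
    )

# max_rows=1 stops after the header line; a range selects a
# contiguous block of columns.
demo_dates = read(dtype=np.str_, max_rows=1, usecols=range(4, 7))

# skip_header=2 drops the header and the summary row.
demo_cases = read(dtype=np.int_, skip_header=2, usecols=range(4, 7))

# A tuple lists individual column indices, so (4, 6) reads only
# two columns, not the block between them.
two_cols = read(dtype=np.int_, skip_header=2, usecols=(4, 6))

print(demo_dates)
print(demo_cases)
print(two_cols.shape)
```

The difference between `range(4, 7)` and `(4, 6)` is easy to overlook: the first yields one column per date, the second only the two columns named.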