# IV. Handling missing values

Your data will often contain missing values, so you'll need to know how to handle them in Julia, both in arrays and in dataframes. A key thing to note is that in Julia `missing` is a value of its own type (__Missing__) and is not the same thing as the string "missing". The missing type is used for cases where a variable could have a value but it was not observed or recorded. Julia also has a __Nothing__ type, which should not be confused with the __Missing__ type.

In [None]:
using DataFrames

In [None]:
a = "missing"
typeof(a)

In [None]:
a = missing
typeof(a)
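
The difference between __Missing__ and __Nothing__ can be sketched with a few Base functions. This is only an illustration; the variable name `err` is introduced here and is not part of the lesson.

```julia
# `nothing` signals that there is no value at all;
# `missing` signals a value that exists but was not observed.
typeof(nothing)          # Nothing
typeof(missing)          # Missing

isnothing(nothing)       # true
ismissing(nothing)       # false

# missing propagates through arithmetic, whereas nothing raises an error.
missing + 1              # missing
err = try
    nothing + 1
catch e
    e
end
err isa MethodError      # true
```

In practice, `nothing` tends to show up as a "no result" return value, while `missing` shows up inside data.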

As you'd expect, most operations involving missing will return missing:

In [None]:
a+1

In [None]:
a*2

In [None]:
sin(a)

In addition, the standard comparison operators return missing when either operand is missing, even when both operands are missing.

In [None]:
a > 3

In [None]:
a < 3

In [None]:
a == 3

In [None]:
a == missing

You can use the `ismissing` function to determine whether a value is missing.

In [None]:
ismissing(a)

For comparisons that must return a boolean, Julia provides functions such as __isless__ and __isequal__.

In [None]:
isless(3, missing) # missing is considered greater than any number

In [None]:
isless(missing, missing)

In [None]:
isequal(missing, 3)

In [None]:
isequal(missing, missing)
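
One place where these functions matter is sorting and whole-array comparison. The following is a small sketch using a made-up vector `x`:

```julia
x = [3, missing, 1]

# sort relies on isless, so missing is placed after every number.
sorted = sort(x)
isequal(sorted, [1, 3, missing])    # true

# == propagates missing, so its result cannot be used directly in an `if`;
# isequal always yields a Bool.
x == x              # missing
isequal(x, x)       # true
```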

Logical operations with missing are a bit trickier. With the `||` (OR) operator, the result is true if the left operand is true, even when the right operand is missing; if the left operand is false and the right operand is missing, the result is missing.

In [None]:
true || missing

In [None]:
false || missing

`true || missing` returns **true** because an OR is true as soon as one operand is true, so it doesn't matter that the other is missing.

However, if one operand is false, the result depends on the other operand; if that operand is missing, the overall result cannot be determined, and consequently `false || missing` returns missing.

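Using similar reasoning, you can guess the behavior of the `&&` (AND) operator with missing values. Note that `||` and `&&` short-circuit and require a boolean on the left, which is why missing appears only as the right operand in these examples.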

In [None]:
true && missing

In [None]:
false && missing
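
When missing may appear on either side, the non-short-circuit operators `|` and `&` apply the same reasoning in both directions. A brief sketch (the variable `err` is introduced only for this illustration):

```julia
# Short-circuit operators need a Bool on the left-hand side.
err = try
    missing || true
catch e
    e
end
err isa TypeError        # true

# | and & accept missing on either side.
missing | true           # true
missing | false          # missing
missing & false          # false
missing & true           # missing
```

These are also the operators to broadcast (`.|`, `.&`) when combining boolean arrays that may contain missing values.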

Let's first work with a vector that contains missing values; then we'll see how to deal with missing values in dataframes.

In [None]:
v = [32.2, 8.9, missing, 20.1, 50.2, missing]

Note that the element type of the array is a union of the types **Missing** and **Float64**. This means each element of the array can be either **Missing** or **Float64**.

In [None]:
eltype(v)

For an array, you can broadcast the `ismissing` function to get a boolean array that tells you whether each observation is missing.

In [None]:
ismissing.(v)

If you want the actual indices of the missing values, you can pass `ismissing` to the `findall` function.

In [None]:
findall(ismissing, v)

Similar to the scalar case, functions that accept arrays as inputs will typically return missing if the array contains missing values.

In [None]:
maximum(v)

In [None]:
sum(v)

For an array, you can remove missing values and work with the actual values by applying the `skipmissing` function to the array and passing the result to the `collect` function.

In [None]:
collect(skipmissing(v))

The `skipmissing` function takes an iterable, such as an array, as input and returns an iterator with the missing values removed; the `collect` function gathers the remaining values into an array.

You can pass the result of `skipmissing` directly to functions to compute on the array with the missing values removed.


In [None]:
maximum(skipmissing(v))
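
The same pattern works for other reductions. The sketch below reuses the vector from above and assumes the Statistics standard library for `mean`; the names `n_missing`, `total`, and `avg` are illustrative.

```julia
using Statistics  # standard library; provides mean

v = [32.2, 8.9, missing, 20.1, 50.2, missing]

n_missing = count(ismissing, v)     # 2
total = sum(skipmissing(v))         # sum of the four observed values
avg = mean(skipmissing(v))          # total divided by 4, not by 6
```

Because `skipmissing` returns a lazy iterator, none of these calls creates an intermediate copy of the array.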

If you want to substitute missing values with some other value, you can do so in several ways: `coalesce`, `replace`, and `recode`.

`coalesce` returns its first argument that is not missing. Broadcasting it over the array with a fallback value as the second argument replaces each missing element with that value. Here we replace missing with 99.0 in the array.

In [None]:
coalesce.(v, 99.0)
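
To make the behavior of `coalesce` concrete, here is a short sketch on scalars and on two hypothetical arrays, `primary` and `fallback`:

```julia
# coalesce returns its first argument that is not missing.
coalesce(missing, 1)             # 1
coalesce(2, 1)                   # 2
coalesce(missing, missing, 3)    # 3

# Broadcasting applies it element-wise, so a second array can supply the fallbacks.
primary  = [1.0, missing, 3.0]
fallback = [10.0, 20.0, 30.0]
filled = coalesce.(primary, fallback)   # [1.0, 20.0, 3.0]
```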

With the `replace` function, you pass the array as the first argument and a pair as the second. The first element of the pair is the value to replace (here missing) and the second is its substitute.

In [None]:
replace(v, missing => -99.0)

You can use `replace` to make arbitrary replacements, not just to replace missing values.
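
For instance, several pairs can be passed in one call. A minimal sketch, where the replacement values are arbitrary and chosen only for illustration:

```julia
v = [32.2, 8.9, missing, 20.1, 50.2, missing]

# Several pairs can be given at once; each one reads old => new.
r = replace(v, missing => 0.0, 8.9 => 9.0)

# replace returns a new array and leaves v untouched.
count(ismissing, v)    # 2
count(ismissing, r)    # 0
```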

The `recode` function, provided by the CategoricalArrays package, works in much the same way as `replace`. In-place versions of both exist as well: `recode!` and `replace!`.

Suppose you have a dataframe with missing values. There are a handful of useful functions for dealing with them. First we need to create a dataframe and inject some missing values into it.

In [None]:
using Random

# create a dataframe
df = DataFrame(x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], 
              x2 = [randstring(10) for j in 1:10],
              x3 = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
              x4 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Note that we can't simply set values in the dataframe to missing; the following assignment throws an error.

In [None]:
df.x1[2] = missing

The reason is that <i>x1</i> has element type **Float64**, so values that cannot be converted to **Float64** cannot be assigned to elements of <i>x1</i>.

In [None]:
eltype(df.x1)

To allow for missing values, we can use the `allowmissing!` function, which converts the columns in place so that they also accept missing.

In [None]:
allowmissing!(df)
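
What widening a column's element type amounts to can be sketched on a plain vector using only Base Julia. Here `convert` is a stand-in chosen for illustration, not necessarily how DataFrames implements `allowmissing!`; the names `x`, `y`, and `err` are hypothetical.

```julia
x = [1.0, 2.0, 3.0]            # Vector{Float64}

# Assigning missing fails because missing cannot be converted to Float64.
err = try
    x[2] = missing
catch e
    e
end
err isa MethodError            # true

# Widening the element type makes the assignment legal.
y = convert(Vector{Union{Missing, Float64}}, x)
y[2] = missing
eltype(y)                      # Union{Missing, Float64}
```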

In [None]:
df.x1[[2, 4, 6, 8, 10]] .= missing;
df.x2[[3, 10]] .= missing;
df.x3[[1, 8, 9, 10]] .= missing;
df.x4[2] = missing;

In [None]:
df

If you want to drop rows with any missing values you can use __dropmissing__ or __dropmissing!__. The latter is the in-place version.

In [None]:
dropmissing(df)

To drop rows only where specific variables have a missing value, you can provide one or more column names. Here we drop a row only if <i>x3</i> or <i>x4</i> has a missing value.

In [None]:
dropmissing(df, [:x3, :x4])

The `completecases` function will return a boolean array with value **true** if the row has a complete record and **false** otherwise.

In [None]:
completecases(df)

To get row numbers with complete records:

In [None]:
findall(completecases(df))
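
The idea behind a complete-case mask can be sketched without DataFrames, using two hypothetical column vectors `x` and `y`. This illustrates the concept only and is not the library's implementation.

```julia
# Two hypothetical columns of equal length.
x = [1.0, missing, 3.0, 4.0]
y = ["a", "b", missing, "d"]

# A row is complete when no column is missing at that position.
complete = .!ismissing.(x) .& .!ismissing.(y)    # Bool[1, 0, 0, 1]

# findall on a boolean array returns the indices of the true entries.
rows = findall(complete)                          # [1, 4]
```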

Many of the functions we used above for arrays will work with the columns of a dataframe as you'd expect:

In [None]:
replace!(df.x3, missing => 2);

In [None]:
df

In this lesson we covered:
* The Missing type in Julia.
* Missing values with operators.
* Arrays containing missing values.
* Dataframes with missing values.