# Handling missing values

In [None]:
using DataFrames

A singleton type `Missing` allows us to deal with missing values.

In [None]:
missing, typeof(missing)

Array literals that contain `missing` automatically get an appropriate `Union` element type.

In [None]:
x = [1, 2, missing, 3]

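`ismissing` checks whether the passed value is `missing`. Note that `ismissing(x)` on a whole array returns `false`; broadcast with `ismissing.(x)` to check each element.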

In [None]:
ismissing(1), ismissing(missing), ismissing(x), ismissing.(x)
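
As a small sketch that goes slightly beyond the examples above, `ismissing` can also be passed as a predicate to Base functions such as `count`, `any`, and `findall` to summarize where missing values occur. The vector `v` is made up for illustration.

In [None]:
v = [1, 2, missing, 3] ## illustrative vector with one missing entry
n_missing = count(ismissing, v) ## number of missing entries
has_missing = any(ismissing, v) ## true if at least one entry is missing
where_missing = findall(ismissing, v) ## indices of the missing entries
n_missing, has_missing, where_missing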

We can extract the type combined with `Missing` from a `Union` via `nonmissingtype`. This is useful for arrays!

In [None]:
eltype(x), nonmissingtype(eltype(x))

`missing` comparisons produce `missing`.

In [None]:
missing == missing, missing != missing, missing < missing

This is also true when `missing`s are compared with values of other types.

In [None]:
1 == missing, 1 != missing, 1 < missing

`isequal`, `isless`, and `===` produce results of type `Bool`. Notice that `missing` is considered greater than any numeric value.

In [None]:
isequal(missing, missing), missing === missing, isequal(1, missing), isless(1, missing)
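
Because `isless` treats `missing` as greater than any other value, sorting places missing entries last. A minimal sketch with made-up data:

In [None]:
v = [3, missing, 1] ## illustrative vector
sorted = sort(v) ## `sort` orders by `isless`, so `missing` goes to the end
isequal(sorted, [1, 3, missing]) ## compare with `isequal`, since `==` would return `missing`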

In the next few examples, we see that many (not all) functions handle `missing`.

In [None]:
map(x -> x(missing), [sin, cos, zero, sqrt]) ## part 1

In [None]:
map(x -> x(missing, 1), [+, -, *, /, div]) ## part 2

In [None]:
using Statistics ## needed for mean
map(x -> x([1, 2, missing]), [minimum, maximum, extrema, mean, float]) ## part 3

`skipmissing` returns an iterator that skips missing values. We can use `collect` and `skipmissing` to create an array that excludes these missing values.

In [None]:
collect(skipmissing([1, missing, 2, missing]))
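
When only an aggregate is needed, the `skipmissing` iterator can be passed straight to a reduction without collecting it first. A short sketch with the same made-up values:

In [None]:
using Statistics ## needed for mean
v = [1, missing, 2, missing]
total = sum(skipmissing(v)) ## sum of the non-missing values
avg = mean(skipmissing(v)) ## mean of the non-missing values
largest = maximum(skipmissing(v)) ## maximum of the non-missing values
total, avg, largest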

Here we use `replace` to create a new array that replaces all missing values with some value (`NaN` in this case).

In [None]:
replace([1.0, missing, 2.0, missing], missing => NaN)

Another way is to use `coalesce`, broadcast over the array.

In [None]:
coalesce.([1.0, missing, 2.0, missing], NaN)
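
`coalesce` accepts any number of arguments and returns the first one that is not `missing`, so several fallbacks can be chained. The sketch below assumes two hypothetical vectors, `primary` and `backup`, plus a final default value.

In [None]:
primary = [1.0, missing, missing] ## hypothetical preferred source
backup = [10.0, 20.0, missing] ## hypothetical fallback source
filled = coalesce.(primary, backup, 0.0) ## first non-missing value per position, else 0.0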

You can also use `recode` from CategoricalArrays.jl if you have a default output value.

In [None]:
using CategoricalArrays
recode([1.0, missing, 2.0, missing], false, missing => true)

There are also `replace!` and `recode!` functions that work in place.
Here is an example of how you can process missing input in a data frame.

In [None]:
df = DataFrame(a=[1, 2, missing], b=["a", "b", missing])

We change the `df.a` vector in place.

In [None]:
replace!(df.a, missing => 100)

Now we overwrite `df.b` with a new vector, because the replacement value has a type that `eltype(df.b)` does not accept.

In [None]:
df.b = coalesce.(df.b, 100)

In [None]:
df

You can use `unique` or `levels` to get unique values with or without missings, respectively.

In [None]:
unique([1, missing, 2, missing]), levels([1, missing, 2, missing])

In this next example, we convert `x` to `y` with `allowmissing`, where `y` has a type that accepts missing values.

In [None]:
x = [1, 2, 3]
y = allowmissing(x)

Then, we convert back with `disallowmissing`. This would fail if `y` contained missing values!

In [None]:
z = disallowmissing(y)
x, y, z
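
To illustrate the failure mode mentioned above, here is a minimal sketch with a made-up vector `v` that does contain a missing value. The exact exception type is not relied upon here; the code only records that an error was thrown.

In [None]:
using DataFrames
v = [1, missing, 3] ## illustrative vector with a missing entry
threw = try
    disallowmissing(v) ## cannot convert `missing` to the non-missing element type
    false
catch
    true
end
threw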

`disallowmissing` has an `error` keyword argument that decides how it behaves when it encounters a column that actually contains a `missing` value.

In [None]:
df = allowmissing(DataFrame(ones(2, 3), :auto))

In [None]:
df[1, 1] = missing

In [None]:
df

By default, `disallowmissing` throws an error:

In [None]:
try
    disallowmissing(df)
catch e
    show(e)
end

With `error=false`, column `:x1` is left untouched because it contains a `missing` value:

In [None]:
disallowmissing(df, error=false)

In this next example, we show that the element type of each column in `x` is initially `Int64`. After using `allowmissing!` to accept missing values in columns 1 and 3, the element types of those columns become `Union{Missing, Int64}`.

In [None]:
x = DataFrame(rand(Int, 2, 3), :auto)
println("Before: ", eltype.(eachcol(x)))
allowmissing!(x, 1) ## make first column accept missings
allowmissing!(x, :x3) ## make :x3 column accept missings
println("After: ", eltype.(eachcol(x)))

In this next example, we'll use `completecases` to find all the rows of a `DataFrame` that have complete data.

In [None]:
x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])
println("Complete cases:\n", completecases(x))
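
The Boolean vector returned by `completecases` can be used directly as a row index, and a second argument restricts the check to selected columns. A brief sketch on a copy of the same data, named `d` here for illustration:

In [None]:
using DataFrames
d = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])
mask = completecases(d) ## true for rows without any missing value
complete_rows = d[mask, :] ## keep only the complete rows
mask_A = completecases(d, :A) ## check only column :A
mask, mask_A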

We can use `dropmissing` or `dropmissing!` to remove the rows with incomplete data from a `DataFrame` and either create a new `DataFrame` or mutate the original in-place.

In [None]:
y = dropmissing(x)
dropmissing!(x)

In [None]:
x

In [None]:
y

By default, `dropmissing` and `dropmissing!` also change the columns so that they no longer allow missing values, as the `eltype` column in the output of `describe` shows.

In [None]:
describe(x)

Alternatively, you can pass the `disallowmissing=false` keyword argument to `dropmissing` and `dropmissing!` to keep the columns' element types unchanged.

In [None]:
x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])

In [None]:
dropmissing!(x, disallowmissing=false)

## Making functions `missing`-aware
If we have a function that does not handle `missing` values, we can wrap it using the `passmissing` function, so that if any of its positional arguments is `missing` we get a `missing` value in return. In the example below we change how the `string` function behaves:

In [None]:
string(missing)

In [None]:
string(missing, " ", missing)

In [None]:
string(1, 2, 3)

In [None]:
lift_string = passmissing(string)

In [None]:
lift_string(missing)

In [None]:
lift_string(missing, " ", missing)

In [None]:
lift_string(1, 2, 3)
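
The same wrapper helps with functions that have no method for `missing` at all, such as `uppercase`. A small sketch, where `safe_upper` is a name chosen for illustration:

In [None]:
using DataFrames ## makes `passmissing` from Missings.jl available
safe_upper = passmissing(uppercase) ## returns `missing` instead of throwing on missing input
result = safe_upper.(["a", missing, "c"]) ## broadcast over a vector with a missing entry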

## Aggregating rows containing missing values
Create an example data frame containing missing values:

In [None]:
df = DataFrame(a=[1, missing, missing], b=[1, 2, missing])

If we just run `sum` on the rows we get two missing entries:

In [None]:
sum.(eachrow(df))

One can try applying `skipmissing` to the rows to avoid this problem:

In [None]:
try
    sum.(skipmissing.(eachrow(df)))
catch e
    show(e)
end

However, we get an error. The problem is that the last row of `df` contains only missing values, and since `eachrow` is type unstable, the `eltype` of the result of `skipmissing` is unknown (so it is marked `Any`).

In [None]:
collect(skipmissing(eachrow(df)[end]))

In such cases it is useful to switch to `Tables.namedtupleiterator`, which is type stable, as discussed in the 01_constructors.ipynb notebook.

In [None]:
sum.(skipmissing.(Tables.namedtupleiterator(df)))

Later in the tutorial you will learn that you can efficiently calculate such sums using the `select` function:

In [None]:
select(df, AsTable(:) => ByRow(sum ∘ skipmissing))

Note that it correctly handles the rows with all missing values.
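
Another option, sketched here under the assumption of Julia 1.6 or later (where `sum` accepts the `init` keyword), is to give `sum` an explicit starting value, so that the reduction over a row with no non-missing entries is well defined. The data frame `d` repeats the example data under a different name.

In [None]:
using DataFrames
d = DataFrame(a=[1, missing, missing], b=[1, 2, missing])
row_sums = sum.(skipmissing.(eachrow(d)); init=0) ## `init=0` is returned for the all-missing row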

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*