# Working with CategoricalArrays
CategoricalArrays.jl is independent of DataFrames.jl, but the two are often used together

In [None]:
using DataFrames
using CategoricalArrays

## Constructor
unordered arrays

In [None]:
x = categorical(["A", "B", "B", "C"])

ordered; by default the level order is the sort order of the values

In [None]:
y = categorical(["A", "B", "B", "C"], ordered=true)
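
When the sort order is not the intended one, the levels can be passed explicitly. A minimal sketch, assuming the `levels` keyword argument of `categorical`; the variable `s` and its values are made up for illustration:

```julia
using CategoricalArrays

# hypothetical data; plain sorting would give high < low < mid
s = categorical(["low", "high", "mid"], levels=["low", "mid", "high"], ordered=true)

levels(s)      # levels in the order they were passed
s[1] < s[2]    # "low" < "high" under the custom order
```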

unordered with missing values

In [None]:
z = categorical(["A", "B", "B", "C", missing])

ordered array produced by cutting the data into 5 groups of equal counts; it is also possible to rename the labels and to give custom breaks

In [None]:
c = cut(1:10, 5)
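
As an illustration of custom breaks and labels, here is a small sketch. It assumes the `cut(x, breaks; labels=...)` method and that intervals are closed on the left, so a value equal to a break falls into the group that starts there; `ages` and the label names are hypothetical:

```julia
using CategoricalArrays

ages = [1, 5, 10, 20]                                 # hypothetical data
g = cut(ages, [0, 10, 30], labels=["low", "high"])    # groups [0, 10) and [10, 30)

levels(g)     # the two labels, in break order
unwrap.(g)    # the label assigned to each value
```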

(we will cover grouping later, but let us use it here to analyze the results; we use Chain.jl for chaining)

In [None]:
using Chain
@chain DataFrame(x=cut(randn(100000), 10)) begin
    groupby(:x)
    combine(nrow) ## just to make sure cut works right
end

this one contains integers, not strings

In [None]:
v = categorical([1, 2, 2, 3, 3])

sometimes you need to convert back to a standard vector

In [None]:
Vector{Union{String,Missing}}(z)

## Managing levels

In [None]:
arr = [x, y, z, c, v]

check if categorical array is ordered

In [None]:
isordered.(arr)

make x ordered

In [None]:
ordered!(x, true), isordered(x)

and unordered again

In [None]:
ordered!(x, false), isordered(x)

list levels

In [None]:
levels.(arr)

unlike `levels`, `unique` will include missing

In [None]:
unique.(arr)

can compare as y is ordered

In [None]:
y[1] < y[2]

not comparable, v is unordered although it contains integers

In [None]:
try
    v[1] < v[2]
catch e
    show(e)
end

comparison against type underlying categorical value is not allowed

In [None]:
try
    y[2] < "A"
catch e
    show(e)
end

you need to explicitly convert a value to a level

In [None]:
y[2] < CategoricalValue("A", y)

but it is treated as a level, and thus only valid levels are allowed

In [None]:
try
    y[2] < CategoricalValue("Z", y)
catch e
    show(e)
end

you can reorder levels, mostly useful for ordered CategoricalArrays

In [None]:
levels!(y, ["C", "B", "A"])

observe that the order is changed

In [None]:
y[1] < y[2]
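
To see what reordering does under the hood, one can look at the integer position of each value's level. A minimal sketch, assuming the `levelcode` function from CategoricalArrays.jl; the array `w` is a fresh example so the arrays above stay untouched:

```julia
using CategoricalArrays

w = categorical(["A", "B", "C"], ordered=true)
before = levelcode.(w)           # position of each value's level: 1, 2, 3
levels!(w, ["C", "B", "A"])      # reverse the level order; the values stay the same
after = levelcode.(w)            # positions are now 3, 2, 1
w[1] < w[2]                      # false: "A" is now the highest level
```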

you have to specify all levels that are present

In [None]:
try
    levels!(z, ["A", "B"])
catch e
    show(e)
end

unless the underlying array allows missing values and you force the removal of levels with `allowmissing=true`

In [None]:
levels!(z, ["A", "B"], allowmissing=true)

after setting the first element to "B", z has only "B" entries (plus missing values)

In [None]:
z[1] = "B"
z

but it remembers the levels it had (the reason is mostly performance)

In [None]:
levels(z)

we can clean up the unused levels with `droplevels!(z)`

In [None]:
droplevels!(z)
levels(z)
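
The same behaviour in isolation, as a small self-contained sketch (the array `a` is made up for illustration):

```julia
using CategoricalArrays

a = categorical(["A", "B", "C"])
a[3] = "A"                       # "C" no longer appears in the data
kept = copy(levels(a))           # but it is still listed as a level
droplevels!(a)                   # remove levels that are not used
levels(a)                        # only "A" and "B" remain
```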

## Data manipulation

In [None]:
x, levels(x)

a new level is added at the end (this works only for unordered arrays)

In [None]:
x[2] = "0"
x, levels(x)
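
For an ordered array, one way around this restriction is to declare the new level first with `levels!`, which also fixes its position in the order. A sketch under that assumption, using a hypothetical array `o`:

```julia
using CategoricalArrays

o = categorical(["A", "B"], ordered=true)
levels!(o, ["A", "B", "C"])      # add "C" explicitly as the highest level
o[1] = "C"                       # the assignment now uses an existing level
o[2] < o[1]                      # "B" < "C"
```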

In [None]:
v, levels(v)

even though the underlying data is Int, we cannot operate on it

In [None]:
try
    v[1] + v[2]
catch e
    show(e)
end

you either have to retrieve the data by conversion (which may be expensive)

In [None]:
Vector{Int}(v)

or get a single value by `unwrap`

In [None]:
unwrap(v[1]) + unwrap(v[2])

broadcasting `unwrap` works for arrays without missing values

In [None]:
unwrap.(v)

also works on missing values

In [None]:
unwrap.(z)

or do the explicit conversion

In [None]:
Vector{Union{String,Missing}}(z)
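
Putting the two approaches together, a short sketch of doing arithmetic on categorical integers and of unwrapping an array that contains missing values (`n` and `m` are hypothetical arrays):

```julia
using CategoricalArrays

n = categorical([1, 2, 2, 3, 3])
total = sum(unwrap.(n))          # plain Int arithmetic on the unwrapped values

m = categorical(["A", missing, "B"])
plain = unwrap.(m)               # missing stays missing
```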

recode some values in an array; there is also an in-place `recode!` equivalent

In [None]:
recode([1, 2, 3, 4, 5, missing], 1 => 10)

here we provide a default value for values that are not mapped

In [None]:
recode([1, 2, 3, 4, 5, missing], "a", 1 => 10, 2 => 20)

to recode missing you have to do it explicitly

In [None]:
recode([1, 2, 3, 4, 5, missing], 1 => 10, missing => "missing")
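
The default value and the explicit `missing` mapping can be combined. A minimal sketch with made-up replacement strings:

```julia
using CategoricalArrays

# 1 is mapped explicitly, 2 falls back to the default, missing has its own pair
r = recode([1, 2, missing], "other", 1 => "one", missing => "n/a")
```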

In [None]:
t = categorical([1:5; missing])
t, levels(t)

note that the levels are dropped after recode

In [None]:
recode!(t, [1, 3] => 2)
t, levels(t)

and if you introduce new levels, they are added at the end in the order of appearance

In [None]:
t = categorical([1, 2, 3], ordered=true)
levels(recode(t, 2 => 0, 1 => -1))

when a default is used, it becomes the last level

In [None]:
t = categorical([1, 2, 3, 4, 5], ordered=true)
levels(recode(t, 300, [1, 2] => 100, 3 => 200))

## Comparisons

In [None]:
x = categorical([1, 2, 3])
xs = [x, categorical(x), categorical(x, ordered=true), categorical(x, ordered=true)]
levels!(xs[2], [3, 2, 1])
levels!(xs[4], [2, 3, 1])
[a == b for a in xs, b in xs] ## all are equal - comparison only by contents

the full signature of a CategoricalArray consists of its contents, its levels, and whether it is ordered

In [None]:
signature(x::CategoricalArray) = (x, levels(x), isordered(x))

all are different; notice that xs[1] and xs[2] are both unordered but have a different order of levels

In [None]:
[signature(a) == signature(b) for a in xs, b in xs]

you cannot compare elements of unordered CategoricalArray

In [None]:
try
    x[1] < x[2]
catch e
    show(e)
end

but you can do it for an ordered one

In [None]:
t[1] < t[2]

`isless()` works within the same CategoricalArray even if it is not ordered

In [None]:
isless(x[1], x[2])

but not across categorical arrays

In [None]:
y = deepcopy(x)
try
    isless(x[1], y[2])
catch e
    show(e)
end

you can use `unwrap` to compare the contents of CategoricalArrays

In [None]:
isless(unwrap(x[1]), unwrap(y[2]))

equality tests work fine across CategoricalArrays

In [None]:
x[1] == y[2]
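
The rules of this section can be summarized in one short sketch. The arrays `a` and `b` are hypothetical; they hold the same contents but list their levels in a different order:

```julia
using CategoricalArrays

a = categorical([1, 2, 3])
b = categorical([1, 2, 3])
levels!(b, [3, 2, 1])                        # same contents, different level order

same_contents = a == b                       # equality looks only at contents
same_levels = levels(a) == levels(b)         # the level order differs
by_value = unwrap(a[1]) < unwrap(b[2])       # compare the underlying Ints instead
```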

## Categorical columns in a DataFrame

In [None]:
df = DataFrame(x=1:3, y='a':'c', z=["a", "b", "c"])

Convert all String columns to categorical in-place

In [None]:
transform!(df, names(df, String) .=> categorical, renamecols=false) ## broadcast so each String column is converted separately

In [None]:
describe(df)
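
With more than one String column, each column needs its own `column => categorical` pair, which broadcasting with `.=>` provides. A sketch with a hypothetical data frame `df2`:

```julia
using DataFrames
using CategoricalArrays

df2 = DataFrame(id=1:3, city=["Oslo", "Rome", "Oslo"], code=["x", "y", "z"])

# one pair per String column; renamecols=false keeps the original column names
transform!(df2, names(df2, String) .=> categorical, renamecols=false)

levels(df2.city)    # the String columns are now categorical; id is untouched
```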

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*