In [None]:
options(jupyter.rich_display = FALSE)

# Factors
**by Serhat Çevikel**

## Factors in brief

Apart from numeric, integer, logical and character values, many datasets in data science contain categorical variables. These take a limited set of discrete values, which R stores internally as integer codes but prints with descriptive labels.

Such variables are called factors, and they are mostly used in data frames.
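
As a minimal sketch of the "integer codes plus labels" idea, the cell below builds a tiny factor and inspects what is stored. The vector `answers` and its values are hypothetical and used only for illustration.

```r
# Hypothetical example data: three survey answers
answers <- c("yes", "no", "yes")
answers_fct <- factor(answers)

typeof(answers_fct)    # storage type of the underlying data
unclass(answers_fct)   # the integer codes, with a "levels" attribute attached
levels(answers_fct)    # the labels that the codes point to
```

Here `typeof` reports `"integer"`, and the levels are the sorted unique values `"no"` and `"yes"`.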

First let's create a numeric variable:

In [None]:
vec_4 <- c(1,1,4,2,3,1)
vec_4

And convert it into a factor with labels:

In [None]:
fct_1 <- factor(vec_4, levels = 1:4, labels = c("a", "b", "c", "d"))

In [None]:
fct_1

We can append or modify a value with any of the defined labels:

In [None]:
fct_1[7] <- "a"
fct_1

And we still have a vector of factor type:

In [None]:
class(fct_1)

However, if we try to add a value with a new label:

In [None]:
fct_1[8] <- "e"
fct_1

It is not recognized as a level, so it is stored as NA (R also issues a warning).
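
One way to avoid a silent NA is to check the label against the levels before assigning. This is only a sketch; `sizes` and `new_value` are hypothetical names that do not appear elsewhere in this notebook.

```r
# Hypothetical factor with three known levels
sizes <- factor(c("s", "m", "l"), levels = c("s", "m", "l"))
new_value <- "xl"                  # a label that is not among the levels

new_value %in% levels(sizes)       # FALSE: assigning it would produce NA

# The assignment still runs, but R warns about an invalid factor level;
# suppressWarnings() only hides that warning for this demonstration
suppressWarnings(sizes[4] <- new_value)
is.na(sizes[4])                    # TRUE
```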

Now get the unique levels (as labels)

In [None]:
levels(fct_1)

And add a new level:

In [None]:
levels(fct_1) <- c(levels(fct_1), "e")

In [None]:
fct_1

Now you can add a value of that new level:

In [None]:
fct_1[8] <- "e"
fct_1

## Factors in detail


- An R factor might be viewed simply as a vector with a bit more information added

- That extra information consists of a record of the distinct values in that vector, called levels.

(Norman Matloff, The Art of R Programming)

First get a subset of the letters

In [None]:
let_sub <- sample(letters, 5)
let_sub
class(let_sub)

And sample repeating values out of this vector

In [None]:
let_sam <- sample(let_sub, 30, replace = TRUE)
let_sam
class(let_sam)

Now convert the sample into a factor and see what changes:

In [None]:
let_sam_fact <- factor(let_sam)
let_sam_fact
class(let_sam_fact)

See the levels:

In [None]:
levels1 <- levels(let_sam_fact)
levels1

And convert the factor to its integer codes:

In [None]:
let_sam_fact
as.integer(let_sam_fact)

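In the run that produced this text, the five sampled letters were c, f, j, n and x (yours will differ, because `sample` is not seeded). See that all:

- "c" values are coded as 1,
- "f" values are coded as 2,
- "j" values are coded as 3,
- "n" values are coded as 4 and
- "x" values are coded as 5

R assigns these codes automatically, following the sorted order of the levels.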
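
Because the sample above is random, the following sketch repeats the idea with a fixed, hypothetical vector (`letters_chr`) so that the mapping between labels and codes is reproducible.

```r
# Fixed input instead of a random sample, so the codes are predictable
letters_chr <- c("n", "c", "x", "f", "j", "c")
letters_fct <- factor(letters_chr)

levels(letters_fct)       # sorted unique values
as.integer(letters_fct)   # position of each element within the levels

# A small lookup table from label to integer code
setNames(seq_along(levels(letters_fct)), levels(letters_fct))
```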

Can we convert the original character sample to integer?

In [None]:
as.integer(let_sam)

No, we cannot: the letters are not numbers, so every element becomes NA (with a warning).

Now let's feed the "levels" manually:

In [None]:
let_sam_fact_2 <- factor(let_sam, levels = sort(levels1, decreasing = T))
let_sam_fact_2
as.integer(let_sam_fact_2)
levels(let_sam_fact_2)

Now the order of the levels is reversed, so all:

- "x"'s are coded as 1
- "n"'s are coded as 2
- "j"'s are coded as 3
- "f"'s are coded as 4 and
- "c"'s are coded as 5

Now let's say that, while feeding the levels, I forgot one of the level values!

In [None]:
levels1[1]
let_sam_fact_3 <- factor(let_sam, levels = levels1[-1])
let_sam
let_sam_fact_3
as.integer(let_sam_fact_3)
levels(let_sam_fact_3)

See that the omitted first level is no longer recognized: every element that held that value is now NA in the factor.

Now let's append a value that is a level to let_sam_fact_3

In [None]:
let_sam_fact_3b <- c(let_sam_fact_3, levels1[2])
let_sam_fact_3b
class(let_sam_fact_3b)

When a character value is appended with c(), the result is coerced to a plain character vector and is no longer a factor.
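
If `c()` has to be used, one possible workaround is to go through character values and rebuild the factor with the original levels. The sketch below uses a hypothetical factor `colors_fct`; it is one option among several, not the only way to combine values.

```r
# Hypothetical factor with one level ("green") that is not yet used
colors_fct <- factor(c("red", "blue", "red"),
                     levels = c("red", "blue", "green"))

# Convert to character, append, then rebuild the factor with the same levels
combined <- factor(c(as.character(colors_fct), "green"),
                   levels = levels(colors_fct))

class(combined)
combined
```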

Now let's append by indexing:

In [None]:
let_sam_fact_3b <- let_sam_fact_3
let_sam_fact_3b[length(let_sam_fact_3b) + 1] <- levels1[2]
let_sam_fact_3b
class(let_sam_fact_3b)

Now the "factor" class is retained, as the attributes show:

In [None]:
attributes(let_sam_fact_3b)

Now let's append a value that is not a level, such as the first one we excluded:

In [None]:
let_sam_fact_3b <- let_sam_fact_3
let_sam_fact_3b[length(let_sam_fact_3b) + 1] <- levels1[1]
let_sam_fact_3b
class(let_sam_fact_3b)

See that it is appended as NA

What if we add a new level?

In [None]:
levels(let_sam_fact_3b)

In [None]:
levels(let_sam_fact_3b) <- c(levels(let_sam_fact_3b), levels1[1])

In [None]:
levels(let_sam_fact_3b)

Let's do it again

In [None]:
levels(let_sam_fact_3b) <- c(levels(let_sam_fact_3b), levels1[1])

In [None]:
levels(let_sam_fact_3b)

See that the duplicate is not counted twice: the level appears only once.
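
The same behaviour can be seen in a smaller case: when the vector assigned to `levels()` contains a repeated label, the corresponding levels are merged into one. The factor `grades` below is hypothetical.

```r
# Hypothetical factor with levels "a", "b", "c"
grades <- factor(c("a", "b", "c", "b"))

# Assigning the same label to the first two levels merges them
levels(grades) <- c("pass", "pass", "fail")

levels(grades)       # only two distinct levels remain
as.integer(grades)   # former "a" and "b" now share one code
```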

Now let's try to append the missing value:

In [None]:
let_sam_fact_3b[length(let_sam_fact_3b) + 1] <- levels1[1]
let_sam_fact_3b
class(let_sam_fact_3b)

Now it is accepted

Now let's see how labels work

In [None]:
teams <- c("GS", "FB", "BJK", "TS", "BS")
aliases <- c("Aslan", "Kanarya", "Kartal", "Kaplan", "Timsah")
teams_vec <- sample(teams, 20, replace = T)
teams_vec

In [None]:
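# Without labels, the levels are the team codes themselves
teams_fac <- factor(teams_vec, levels = teams)
teams_fac
as.numeric(teams_fac)
# With labels, the same integer codes are displayed as the aliases
teams_fac <- factor(teams_vec, levels = teams, labels = aliases)
teams_fac
as.numeric(teams_fac)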

See that the values are stored internally as integer codes and only displayed through their labels. Changing the labels changes what is printed, but not the codes.
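
A small reproducible sketch of the same point, using hypothetical country codes (`codes_chr`) instead of a random sample: the labels can be renamed afterwards while the integer codes stay the same.

```r
# Hypothetical raw values and their display labels
codes_chr <- c("TR", "DE", "TR", "FR")
countries <- factor(codes_chr,
                    levels = c("TR", "DE", "FR"),
                    labels = c("Turkey", "Germany", "France"))

codes_before <- as.integer(countries)

# Rename a single label in place
levels(countries)[levels(countries) == "Germany"] <- "Deutschland"

countries
identical(codes_before, as.integer(countries))   # codes are unchanged
```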

Now let's append a numeric value with c():

In [None]:
let_sam_fact_3c <- c(let_sam_fact_3, 1)
let_sam_fact_3c
class(let_sam_fact_3c)

c() does not work with numeric values either: the result is a numeric vector, not a factor.

In [None]:
let_sam_fact_3d <- let_sam_fact_3
let_sam_fact_3d[length(let_sam_fact_3d) + 1] <- 4
let_sam_fact_3d
class(let_sam_fact_3d)

See, we have to feed a "level" defined in the factor. The number 4 is matched against the level labels, not the integer codes, so it is appended as NA.
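
If only the integer code is known, one option is to look up the corresponding label in `levels()` first and assign that. This is a sketch with a hypothetical factor `fruit` and a hypothetical `code` value.

```r
# Hypothetical factor whose third level is not used yet
fruit <- factor(c("apple", "pear"), levels = c("apple", "pear", "plum"))
code <- 3   # the integer code we want to append

# Translate the code into its label, then append by indexing
fruit[length(fruit) + 1] <- levels(fruit)[code]

fruit
class(fruit)
```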

See the structure of a factor vector

In [None]:
str(let_sam_fact_2)

We can combine with other vectors into a data frame:

In [None]:
some_values <- sample(1:7, length(let_sam_fact_2), replace = T)
some_values

let_df <- data.frame(factors = let_sam_fact_2, values = some_values)
let_df
summary(let_df)

Factors may also be generated automatically by functions such as "cut":

In [None]:
scores <- sample(1:100, 30, replace = T)
scores

scores_cut <- cut(scores, breaks = 5)
scores_and_cut <- data.frame(scores_cut, scores)
scores_and_cut

Let's summarize and visualize the cut values.

In [None]:
scores_table <- table(scores_cut)
scores_table
pie(scores_table)
str(scores_cut)
summary(scores_cut)
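
With `breaks = 5`, `cut` chooses five equal-width intervals on its own. As a sketch of the alternative, explicit break points and labels can be supplied; the vector `marks`, the thresholds and the label names below are hypothetical.

```r
# Hypothetical scores and hand-picked thresholds
marks <- c(12, 47, 50, 68, 91, 100)

# Intervals are (0,50], (50,75], (75,100] because right = TRUE is the default
marks_cut <- cut(marks,
                 breaks = c(0, 50, 75, 100),
                 labels = c("low", "medium", "high"))

marks_cut
table(marks_cut)
```

Note that a score of exactly 50 falls into the "low" group, since each interval is closed on the right.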

## Relevel factors

Let's convert a character vector to factor

In [None]:
set.seed(1)
vec1 <- sample(c("low", "medium", "high"), 10, replace = T)
vec1

In [None]:
fact1 <- as.factor(vec1)
fact1

See that levels are sorted in alphabetical order

In [None]:
as.numeric(fact1)

In [None]:
barplot(table(fact1))

To make one of the levels the first one:

In [None]:
relevel(fact1, "low")

To completely change the order:

In [None]:
fact2 <- factor(fact1, levels = c("low", "medium", "high"))
fact2

In [None]:
barplot(table(fact2))

Can we compare the levels and their orders?

In [None]:
fact2

In [None]:
fact2[1] > fact2[2]

In [None]:
max(fact2)

No: for an unordered factor the comparison returns NA with a warning, and max() raises an error. So make the levels ordered:

In [None]:
fact3 <- factor(fact2, ordered = TRUE)

In [None]:
fact3

We can make logical comparisons:

In [None]:
fact3

In [None]:
fact3[1] > fact3[2]

And get the largest value of an ordered factor:

In [None]:
max(fact3)
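
As a reproducible sketch (the example above depends on the seeded sample), an ordered factor can also be compared against a level label directly, and sorted by level order. The factor `risk` is hypothetical.

```r
# Hypothetical ordered factor with an explicit level order
risk <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)

risk >= "medium"   # compare each element with a level label
sort(risk)         # sorted by level order, not alphabetically
max(risk)
```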