## Factors

R uses a special data type called a *factor* to represent variables that are only allowed to take on certain discrete values. Variables of this type are called *categorical variables* in statistics. Some common examples of categorical variables are:

1. Color ("red", "blue", "green")
2. Months ("January", "February", "March"...)
3. Gender ("M", "F")

To see how factors work and why they are useful for representing categorical variables, consider the following `character` vector containing several weekday values:

In [1]:
vec.of.days <- c("Monday", "Sunday", "Friday", "Thursday")

What if we want R to put these weekdays in order? If we ask it to `sort` these values, we will get a nonsense result, as shown below:

In [2]:
sort(vec.of.days)

This is because R thinks `vec.of.days` just contains a bunch of strings, so it sorts them alphabetically rather than in the order of the week. We can avoid this by converting `vec.of.days` into a `factor` and defining an appropriate order for the entries using the `levels` argument, as shown below:

In [3]:
days.factor <- factor(vec.of.days, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

We can see that the `levels` value that we passed has been stored along with our original values inside `days.factor`:

In [4]:
print(days.factor)

[1] Monday   Sunday   Friday   Thursday
Levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday


Now when we `sort`, R will use the custom (correct) ordering that we defined using `levels`:

In [5]:
sort(days.factor)
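
One side effect of fixing the levels up front is worth knowing: any value that does not appear in `levels` is stored as `NA` without an error. The sketch below illustrates this with a deliberately misspelled day; the names `week.lvls` and `days.with.typo` are made up for this illustration.

```r
# the same weekday levels used above
week.lvls <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# "Firday" is a deliberate typo, so it does not match any level
days.with.typo <- factor(c("Monday", "Firday", "Sunday"), levels=week.lvls)

print(days.with.typo)

# count the entries that did not match any level
sum(is.na(days.with.typo))
```

Checking for unexpected `NA` values right after creating a factor is a cheap way to catch typos in the raw data.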

Factors also allow us to effortlessly relabel large amounts of data by simply relabeling the `levels` appropriately. For example, we can relabel our `days.factor` to use the German days of the week while preserving the correct order:

In [6]:
german.lvls <- c("Montag", "Dienstag", "Mittwoch", "Donnerstag", "Freitag", "Samstag", "Sonntag")
levels(days.factor) <- german.lvls
sort(days.factor)

Notice that relabeling the levels effectively relabels all entries in the factor. 
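
Assigning to `levels` relabels by position, so the replacement vector has to list the new labels in exactly the same order as the existing levels. One way to make this less fragile is to look up each new label by its old name. The following is a minimal sketch of that idea, not the only way to do it; `demo.days` and `to.german` are names invented for this example.

```r
demo.days <- factor(
    c("Monday", "Sunday", "Friday", "Thursday"),
    levels=c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")
)

# named lookup table: names are the old labels, values are the new labels
# (deliberately written out of order to show that the order does not matter here)
to.german <- c(Sunday="Sonntag", Saturday="Samstag", Friday="Freitag",
               Thursday="Donnerstag", Wednesday="Mittwoch",
               Tuesday="Dienstag", Monday="Montag")

# index the table by the current levels so each label gets its own translation
levels(demo.days) <- unname(to.german[levels(demo.days)])
sort(demo.days)
```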

<span style="color:blue;font-weight:bold">Exercise</span>: As shown above, remap the levels of the `days.factor` to use the French days of the week: 

* Lundi
* Mardi
* Mercredi
* Jeudi
* Vendredi
* Samedi
* Dimanche

In [7]:
# delete this entire line and replace it with your code

french.lvls <- c("Lundi", "Mardi", "Mercredi", "Jeudi", "Vendredi", "Samedi", "Dimanche")
levels(days.factor) <- french.lvls

In [8]:
scratch.fac <- factor(vec.of.days, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
levels(scratch.fac) <- c("Lundi", "Mardi", "Mercredi", "Jeudi", "Vendredi", "Samedi", "Dimanche")
check.variable.value("days.factor", scratch.fac)
success()

## Implicit Choice of Levels

Consider the following vector containing the colors of different cars: 

In [8]:
car.colors <- c("red",  "green", "blue", "red", "blue","blue")

Suppose that this car manufacturer only produces cars in those three colors - `"red"`, `"green"`, and `"blue"`. If that is the case, it may be advantageous for us to convert the above `character` vector into a `factor`:

In [9]:
car.colors.factor <- factor(car.colors)

Converting `car.colors` into a factor associates a vector of `levels` with our original values, even though we did not explicitly pass a `levels` argument:

In [10]:
print(car.colors.factor)

[1] red   green blue  red   blue  blue 
Levels: blue green red


We can see that these "Levels" consist of the unique entries in our original values, sorted in alphabetical order:

In [11]:
levels(car.colors.factor)
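
Put differently, when no `levels` argument is given, the default levels are what we would get by sorting the unique values ourselves. A quick sketch to check this (the vector is redefined so the block runs on its own, and the `demo.` names are only for illustration):

```r
demo.colors <- c("red", "green", "blue", "red", "blue", "blue")
demo.colors.factor <- factor(demo.colors)

# default levels versus the sorted unique values of the input
levels(demo.colors.factor)
sort(unique(demo.colors))
identical(levels(demo.colors.factor), sort(unique(demo.colors)))
```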

## Reordering Factor Levels

We need to be very careful when we attempt to reorder the levels of a factor. Suppose that we have the following data frame:

In [12]:
price <- c(1e4, 2e4, 1.5e4, 3e4, 2.5e4, 5e4)
make <- c("ford", "toyota", "mazda", "tesla", "toyota", "telsa")
color <- c("red",  "green", "blue", "red", "blue","blue")
car.df <- data.frame(price, make, color)
head(car.df)

price,make,color
<dbl>,<fct>,<fct>
10000,ford,red
20000,toyota,green
15000,mazda,blue
30000,tesla,red
25000,toyota,blue
50000,telsa,blue


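Notice that `data.frame` has automatically converted our `character` vectors to factors (this is the default in versions of R before 4.0.0; newer versions keep them as `character` unless told otherwise):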

In [13]:
levels(car.df$color)
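
To get factor columns regardless of the R version, the conversion can be requested explicitly with the `stringsAsFactors` argument of `data.frame`. A small sketch, using a shortened data frame named `demo.df` purely for illustration:

```r
demo.df <- data.frame(
    price=c(1e4, 2e4, 1.5e4),
    color=c("red", "green", "blue"),
    stringsAsFactors=TRUE  # convert character columns to factors explicitly
)

is.factor(demo.df$color)
levels(demo.df$color)
```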

Sorting our data frame by `color` sorts the entries alphabetically:

In [14]:
car.df[order(car.df$color),]

,price,make,color
,<dbl>,<fct>,<fct>
3,15000,mazda,blue
5,25000,toyota,blue
6,50000,telsa,blue
2,20000,toyota,green
1,10000,ford,red
4,30000,tesla,red


Suppose that we want our new level order to be `c("red", "green", "blue")` - we might be tempted to use the following `levels` call to accomplish this:

In [15]:
# make a copy so we don't mess up the original
car.df.bad <- car.df
levels(car.df.bad$color) <- c("red", "green", "blue")
print("before:")
head(car.df)
print("after:")
head(car.df.bad)

[1] "before:"


price,make,color
<dbl>,<fct>,<fct>
10000,ford,red
20000,toyota,green
15000,mazda,blue
30000,tesla,red
25000,toyota,blue
50000,telsa,blue


[1] "after:"


price,make,color
<dbl>,<fct>,<fct>
10000,ford,blue
20000,toyota,green
15000,mazda,red
30000,tesla,blue
25000,toyota,red
50000,telsa,red


As you can see, our attempt to reorder the levels has actually corrupted the data: it has changed all `red` cars to `blue` and all `blue` cars to `red`. Assigning to `levels` renames the existing levels position by position rather than rearranging them. To reorder the levels without corrupting the data, we need to call `factor` as follows to create a new factor with the appropriate levels:

In [16]:
car.df.good <- car.df
car.df.good$color <- factor(car.df.good$color, levels=c("red", "green", "blue"))
print("before:")
head(car.df)
print("after:")
head(car.df.good)
sorted.df <- car.df.good[order(car.df.good$color),]
print("sorted:")
head(sorted.df)

[1] "before:"


price,make,color
<dbl>,<fct>,<fct>
10000,ford,red
20000,toyota,green
15000,mazda,blue
30000,tesla,red
25000,toyota,blue
50000,telsa,blue


[1] "after:"


price,make,color
<dbl>,<fct>,<fct>
10000,ford,red
20000,toyota,green
15000,mazda,blue
30000,tesla,red
25000,toyota,blue
50000,telsa,blue


[1] "sorted:"


,price,make,color
,<dbl>,<fct>,<fct>
1,10000,ford,red
4,30000,tesla,red
2,20000,toyota,green
3,15000,mazda,blue
5,25000,toyota,blue
6,50000,telsa,blue


As you can see, the correct values of the `color` column entries are now preserved, and we gain the ability to sort in the order that we desired. 

## Factor Internals

We can understand the tricky reordering behavior discussed above by looking at how factor data is actually stored internally. We're going to use the `car.colors.factor` that we defined above as an example: 

In [17]:
car.colors.factor

Let's use the `as.numeric` function to take a look inside `car.colors.factor`:

In [18]:
as.numeric(car.colors.factor)

This shows us that our factor is just represented as a vector of integer codes on the inside. R converts these integers into the appropriate level names when needed by looking up the corresponding index in the `levels` vector. Essentially, it performs the following lookup:

In [19]:
levels(car.colors.factor)[as.numeric(car.colors.factor)]
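
A few base R functions let us confirm this picture. The sketch below rebuilds a small factor (named `demo.fac` for illustration) and inspects its storage type, its raw codes, and the result of the manual lookup:

```r
demo.fac <- factor(c("red", "green", "blue", "red", "blue", "blue"))

typeof(demo.fac)      # storage type of the underlying data
unclass(demo.fac)     # the bare integer codes plus the levels attribute
as.integer(demo.fac)  # just the codes

# the manual lookup reproduces what as.character() returns
manual.lookup <- levels(demo.fac)[as.integer(demo.fac)]
identical(manual.lookup, as.character(demo.fac))
```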

Now, let's reorder the levels the wrong way and observe how this process changes the internals of our factor:

In [20]:
print("as.character before:")
print(as.character(car.colors.factor))
print("as.numeric before:")
print(as.numeric(car.colors.factor))
# make a copy to avoid messing things up
car.colors.factor.new.order <- car.colors.factor
levels(car.colors.factor.new.order) <- c("green", "blue", "red")
print("as.character after:")
print(as.character(car.colors.factor.new.order))
print("as.numeric after:")
print(as.numeric(car.colors.factor.new.order))

[1] "as.character before:"
[1] "red"   "green" "blue"  "red"   "blue"  "blue" 
[1] "as.numeric before:"
[1] 3 2 1 3 1 1
[1] "as.character after:"
[1] "red"   "blue"  "green" "red"   "green" "green"
[1] "as.numeric after:"
[1] 3 2 1 3 1 1


Notice that the integer codes provided by `as.numeric` have not changed, but the actual values inside the factor appear to have changed. Specifically, the `blue` and `green` values have been swapped. This happens because the order of the `levels` vector has changed, so the indices `1` and `2` now resolve to the opposite values:

In [21]:
levels(car.colors.factor)[c(1,2)] 
levels(car.colors.factor.new.order)[c(1,2)] 

This mistake could cause serious problems for any analysis that we wish to run on our data - be very careful not to make this mistake. 
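
For contrast, here is a sketch of the safe approach at the same level of detail. Re-creating the factor with `factor(..., levels=...)` changes the integer codes so that every entry keeps its original label. The names `demo.fac` and `demo.fac.reordered` are made up for this example.

```r
demo.fac <- factor(c("red", "green", "blue", "red", "blue", "blue"))

# reorder safely by building a new factor with the desired level order
demo.fac.reordered <- factor(demo.fac, levels=c("green", "blue", "red"))

print(as.integer(demo.fac))            # codes under the default level order
print(as.integer(demo.fac.reordered))  # codes under the new level order

# the labels are unchanged even though the codes are different
identical(as.character(demo.fac), as.character(demo.fac.reordered))
```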

## Ordered Factors

We've seen that controlling the order of the levels in our factor allows us to sort them and display them in the order we desire. However, other order-based operations do not work as expected. For example, consider the following factor:

In [22]:
sizes <- c("low", "medium", "low", "medium", "high", "medium","high")
size.factor <- factor(
    sizes,
    levels=c("low", "medium", "high")
)
size.factor

We can sort this factor as expected:

In [23]:
sort(size.factor)

However, we cannot compare two values from this factor. R returns `NA` and issues a warning:

In [24]:
size.factor[1] < size.factor[2]

“‘<’ not meaningful for factors”

Nor can we find the `max` value:

In [25]:
max(size.factor)

ERROR: Error in Summary.factor(structure(c(1L, 2L, 1L, 2L, 3L, 2L, 3L), .Label = c("low", : ‘max’ not meaningful for factors


In order to make all of these order-based operations work, we must explicitly declare that the levels of our factor are meaningfully ordered. We can do this by passing the `ordered=TRUE` argument when we create our factor:

In [26]:
size.factor.ordered <- factor(sizes, levels=c("low", "medium", "high"), ordered=TRUE)

Now all of our order-based operations will work as expected:

In [27]:
size.factor.ordered
max(size.factor.ordered)
size.factor.ordered[1] < size.factor.ordered[5]

We can tell if a given factor is ordered using the `is.ordered` function:

In [28]:
is.ordered(size.factor)
is.ordered(size.factor.ordered)
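
Ordered factors can also be compared against a level name given as a string, which is handy for filtering. A short sketch (the factor is rebuilt as `demo.sizes` so the block runs on its own; `at.least.medium` is likewise a name chosen for this example):

```r
demo.sizes <- factor(
    c("low", "medium", "low", "medium", "high", "medium", "high"),
    levels=c("low", "medium", "high"),
    ordered=TRUE
)

# logical vector marking the entries that are at least "medium"
at.least.medium <- demo.sizes >= "medium"
at.least.medium

# keep only those entries
demo.sizes[at.least.medium]

# smallest and largest values present
range(demo.sizes)
```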

## Why do Factors Matter?

The use of factors may seem somewhat superfluous given the material that we have covered thus far. However, factors are the standard input format for loading categorical variables like the ones discussed above into the statistical models that we will learn about in subsequent lessons and courses. If you do not understand how to create factors, you will be unable to make effective use of these advanced features.
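
As a small preview, and only as an illustration, the base R function `model.matrix` shows what a modeling function typically does with a factor. With R's default settings, each level other than the first is expanded into its own indicator column. The data frame `demo.df` below is invented for this sketch, and later lessons may use different functions.

```r
demo.df <- data.frame(
    color=factor(c("red", "green", "blue", "red"))
)

# expand the factor into the numeric columns that a linear model would use
design <- model.matrix(~ color, data=demo.df)
colnames(design)
design
```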