# Introduction to Factors

In [1]:
# What's a factor and why would you use it?
# In this chapter you dive into the wonderful world of factors.

# The term factor refers to a statistical data type used to store categorical variables.
# The difference between a categorical variable and a continuous variable is that 
# a categorical variable can take only a limited number of values (its categories). 
# A continuous variable, on the other hand, can take any of an infinite number of values.

# It is important that R knows whether it is dealing with a continuous or a categorical variable, 
# as the statistical models you will develop in the future treat both types differently. 
# (You will see later why this is the case.)

# A good example of a categorical variable is sex. In many circumstances 
# you can limit the sex categories to "Male" or "Female". 
# (Sometimes you may need different categories. For example, you may need to consider chromosomal variation, 
# hermaphroditic animals, or different cultural norms, but you will always have a finite number of categories.)
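The distinction above can be made concrete with a small sketch. The vectors `heights_cm` and `blood_type` are hypothetical illustration data, not part of the original exercises; the sketch only shows that a numeric vector supports arithmetic summaries while a factor exposes a finite set of levels.

```r
# Hypothetical data: heights are continuous, blood types are categorical
heights_cm <- c(172.5, 160.2, 181.0, 168.4)
blood_type <- factor(c("A", "O", "O", "AB"))

# A numeric vector supports arithmetic summaries
mean(heights_cm)

# A factor has a fixed, finite set of categories (its levels)
levels(blood_type)   # "A" "AB" "O"
nlevels(blood_type)  # 3

# Counting observations per category is the natural summary for a factor
table(blood_type)
```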

# 3.01

In [3]:
# Sex vector
sex_vector <- c("Male", "Female", "Female", "Male", "Male")

# Convert sex_vector to a factor
factor_sex_vector <- factor(sex_vector)

# Print out factor_sex_vector
factor_sex_vector
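To see what `factor()` actually stores, the following sketch inspects the result with base R functions. It rebuilds `factor_sex_vector` so that it runs on its own; the point is that a factor is a vector of integer codes plus a set of level labels.

```r
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex_vector <- factor(sex_vector)

# The distinct categories, sorted alphabetically by default
levels(factor_sex_vector)        # "Female" "Male"

# The underlying integer codes: 1 = "Female", 2 = "Male"
as.integer(factor_sex_vector)    # 2 1 1 2 2

# Converting back recovers the original character values
as.character(factor_sex_vector)
```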

In [4]:
# What's a factor and why would you use it? (3)
# There are two types of categorical variables: a nominal categorical variable and an ordinal categorical variable.

# A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that 
# 'one is worth more than the other'. For example, think of the categorical variable animals_vector with the categories 
# "Elephant", "Giraffe", "Donkey" and "Horse". Here, it is impossible to say that one stands above or below the other. 
# (Note that some of you might disagree ;-) ).

# In contrast, ordinal variables do have a natural ordering. 
# Consider for example the categorical variable temperature_vector with the categories: "Low", "Medium" and "High". 
# Here it is obvious that "Medium" stands above "Low", and "High" stands above "Medium".

# 3.02

In [6]:
# Animals
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector

# Temperature
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
# Use the full argument name `ordered` rather than relying on partial matching of `order`
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector
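A quick way to check whether a factor is nominal or ordinal is `is.ordered()`. The sketch below rebuilds both factors from this section and shows that order-based operations such as `max()` and `sort()` follow the declared level order for the ordered factor.

```r
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)

temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature_vector <- factor(temperature_vector,
                                    ordered = TRUE,
                                    levels = c("Low", "Medium", "High"))

# Nominal factor: no ordering among the levels
is.ordered(factor_animals_vector)      # FALSE

# Ordinal factor: levels carry the order Low < Medium < High
is.ordered(factor_temperature_vector)  # TRUE

# Order-based operations use the level order, not alphabetical order
max(factor_temperature_vector)
sort(factor_temperature_vector)
```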

In [13]:
# Code to build factor_survey_vector
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)

# Specify the levels of factor_survey_vector
levels(factor_survey_vector) <- c("Female", "Male")

factor_survey_vector
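Assigning to `levels()` relies on knowing that R sorted the raw levels as `"F"`, `"M"`; listing the new names in the wrong order would silently swap the categories. One alternative, sketched here, is to pair each raw value with its label explicitly through the `levels` and `labels` arguments of `factor()`.

```r
survey_vector <- c("M", "F", "F", "M", "M")

# Pair each raw code with its label explicitly, so the mapping does not
# depend on the alphabetical order of the raw values
factor_survey_vector <- factor(survey_vector,
                               levels = c("F", "M"),
                               labels = c("Female", "Male"))

factor_survey_vector
```

Both approaches produce the same factor here; the explicit form simply makes the intended mapping visible in the code.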

# Generating the Summary Result

In [14]:
# Generate summary for survey_vector
summary(survey_vector)

# Generate summary for factor_survey_vector
summary(factor_survey_vector)

   Length     Class      Mode 
        5 character character 

Female   Male 
     2      3 

In [15]:
# Male
male <- factor_survey_vector[1]

# Female
female <- factor_survey_vector[2]

# Battle of the sexes: Male 'larger' than female?
male > female

Warning message: “‘>’ not meaningful for factors”
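The warning arises because `factor_survey_vector` is a nominal factor, so R has no order to compare with. The sketch below rebuilds the factor and shows that equality tests still work, while `>` returns `NA`. `suppressWarnings()` is used only to keep the output quiet.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

male <- factor_survey_vector[1]
female <- factor_survey_vector[2]

# Equality is well defined for nominal factors
male == female   # FALSE
male != female   # TRUE

# Ordering comparisons are not: R warns and returns NA
result <- suppressWarnings(male > female)
is.na(result)    # TRUE
```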

In [16]:
# Create speed_vector (one speed rating per data analyst)
speed_vector <- c("medium", "slow", "slow", "medium", "fast")

# Convert speed_vector to ordered factor vector
factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "medium", "fast"))

# Print factor_speed_vector
factor_speed_vector
summary(factor_speed_vector)

In [17]:
# Factor value for second data analyst
da2 <- factor_speed_vector[2]

# Factor value for fifth data analyst
da5 <- factor_speed_vector[5]

# Is data analyst 2 faster than data analyst 5?
da2 > da5
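Because `factor_speed_vector` is ordered, comparison operators are meaningful and follow the declared level order. The sketch below repeats the comparison and also compares the whole vector against a level name, which is one way to select every analyst at or above a given speed.

```r
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector,
                              ordered = TRUE,
                              levels = c("slow", "medium", "fast"))

da2 <- factor_speed_vector[2]  # slow
da5 <- factor_speed_vector[5]  # fast

da2 > da5   # FALSE: "slow" ranks below "fast"
da2 < da5   # TRUE

# Compare every element against a level name
factor_speed_vector >= "medium"          # TRUE FALSE FALSE TRUE TRUE

# Positions of the analysts rated at least "medium"
which(factor_speed_vector >= "medium")   # 1 4 5
```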