<h1>Analysing Categorical Data<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#-Introduction-to-Factor-Variables-" data-toc-modified-id="-Introduction-to-Factor-Variables--1"><span class="toc-item-num">1&nbsp;&nbsp;</span><span style="background: #90ee90"> Introduction to Factor Variables </span></a></span><ul class="toc-item"><li><span><a href="#Nominal-and-Ordinal-Data" data-toc-modified-id="Nominal-and-Ordinal-Data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Nominal and Ordinal Data</a></span></li><li><span><a href="#Characters-and-Factors" data-toc-modified-id="Characters-and-Factors-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Characters and Factors</a></span></li><li><span><a href="#Identifying-Characters-and-Factors" data-toc-modified-id="Identifying-Characters-and-Factors-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Identifying Characters and Factors</a></span></li><li><span><a href="#Converting-Characters-to-Factors" data-toc-modified-id="Converting-Characters-to-Factors-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Converting Characters to Factors</a></span></li><li><span><a href="#Summarizing-Factors" data-toc-modified-id="Summarizing-Factors-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Summarizing Factors</a></span></li></ul></li></ul></div>

As data scientists, we often find ourselves working with non-numerical data, such as job titles, survey responses, or demographic information. R has a special way of representing such data, called **factors**. This tutorial will help us master working with **factors** using the tidyverse package forcats. We will work with real-world datasets, such as the fivethirtyeight flight dataset and Kaggle’s 2017 State of Data Science and ML Survey.

Prerequisites :  
a. Basic knowledge of dplyr for data manipulation  
b. Basic knowledge of tidyr and stringr for data cleaning    
c. Basic knowledge of ggplot2 for creating visualizations  
d. Packages : tidyverse, forcats, fivethirtyeight  

##  <span style = 'background : #90ee90'> Introduction to Factor Variables </span>

In this chapter, we will learn all about factors. We will discover the difference between nominal and ordinal variables, how R represents them, and how to inspect them to find the number and names of the levels. Using the forcats package, we can improve plots by reordering the levels of a variable by their frequency.

Let us load the libraries we will need for analysing categorical data

In [12]:
# load libraries
suppressPackageStartupMessages(library(tidyverse)) # meta package for data manipulation and data visualization

suppressPackageStartupMessages(library(fivethirtyeight)) # package with curated data sets
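
As a small preview of reordering levels by frequency, here is a minimal sketch using `fct_infreq()` from forcats. The `languages` vector is hypothetical toy data, not taken from either dataset used in this tutorial.

```r
library(forcats)

# Hypothetical toy responses (an assumption for illustration only)
languages <- factor(c("R", "Python", "R", "SQL", "R", "Python"))

# By default, factor() sorts the levels alphabetically
levels(languages)

# fct_infreq() reorders the levels from most to least frequent
languages_by_freq <- fct_infreq(languages)
levels(languages_by_freq)
```

Because plotting functions such as ggplot2's bar charts draw categories in level order, a factor reordered this way should produce bars sorted by count.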

### Nominal and Ordinal Data
Categorical Data can be broadly classified into two distinct types :  
*Nominal* : If there is no apparent order or ranking among the categories of a categorical variable, we refer to it as a nominal variable. Examples include states in the U.S., brands of computers, or ethnicities. There is no intrinsic ordering that makes one category greater than or less than another. The number of possible values for a nominal variable can be quite large. A nominal variable may even take a unique value for every observation in a dataset, as with unique identifiers such as name or email_address.

*Ordinal* : When the groupings of a categorical variable have a specific order or ranking, it is an ordinal variable.  
Case 1 : Suppose there was a variable containing responses to the question “Rate your agreement with the statement: The minimum age to drive should be lowered.” The response options are “strongly disagree”, “disagree”, “neutral”, “agree”, and “strongly agree”. Because we can see an order where “strongly disagree” < “disagree” < “neutral” < “agree” < “strongly agree” in relation to agreement, we consider the variable to be ordinal.  
Case 2 : If we asked people about their annual income and offered four choices, "0-50,000", "50,000-150,000", "150,000-500,000", and "more than 500,000", this would be an ordinal variable, because these groups go from smallest to largest. Notice, however, that the width of each group is not constant: if you were asked to compute a mean salary from this data, you couldn't do it. This is what makes it qualitative instead of quantitative data.
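
The distinction can be sketched in base R as follows. The `brand` and `agreement` vectors are hypothetical toy values chosen to mirror the examples above; passing `ordered = TRUE` to `factor()` is what marks a variable as ordinal.

```r
# Nominal: categories with no inherent order (hypothetical values)
brand <- factor(c("Dell", "Apple", "Lenovo", "Apple"))
is.ordered(brand)

# Ordinal: list the levels from lowest to highest and set ordered = TRUE
agreement_levels <- c("strongly disagree", "disagree", "neutral",
                      "agree", "strongly agree")
agreement <- factor(c("agree", "neutral", "strongly disagree", "agree"),
                    levels = agreement_levels, ordered = TRUE)
is.ordered(agreement)

# Comparisons and max() are only meaningful for ordered factors
below_agree <- agreement < "agree"
highest <- max(agreement)
```

Here `below_agree` flags the responses ranked lower than "agree", and `highest` holds the highest-ranked response that was observed.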

### Characters and Factors  
R has two ways to represent qualitative variables: as *characters* and as *factors*.   
There are subtle differences between them, but generally, we use *factors for nominal and ordinal variables* and characters otherwise.   
e.g. Names would best be represented as characters, because there's no limit to the possible number of names! On the other hand, a survey question where you can select which programming languages you know among 40 possible answers can be represented as a factor  
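
To make the difference concrete, here is a minimal base R sketch with hypothetical toy values. A character vector stores the strings themselves, while a factor stores integer codes that point into a fixed set of levels.

```r
# Hypothetical survey answers stored as plain characters
answers <- c("R", "Python", "R", "SQL")
class(answers)

# The same answers as a factor: a set of levels plus integer codes
answers_fct <- factor(answers)
levels(answers_fct)
codes <- as.integer(answers_fct)

# With an explicit level set, values outside it become NA
restricted <- factor(c("R", "Julia"), levels = c("Python", "R", "SQL"))
is.na(restricted)
```

The last step illustrates why factors suit variables with a known, limited set of possible answers: anything outside that set is treated as missing.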

### Identifying Characters and Factors

There are three ways to identify if the variable is a factor or not.  

Method 1 : Look at the head of the tibble. If we have stored our dataset as a tibble, a modern data frame, each column type is automatically printed out below or next to the column name 

Method 2 : Use the function is.factor(), which returns TRUE or FALSE depending on whether the input is a factor  

Method 3 : Use the glimpse() function on the data frame, which will give us sample values for each column along with the data type of each column

Let us examine the college_all_ages dataset from the fivethirtyeight package

In [13]:
# examine the top 6 rows
head(college_all_ages)

major_code,major,major_category,total,employed,employed_fulltime_yearround,unemployed,unemployment_rate,p25th,median,p75th
<int>,<chr>,<chr>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
1100,General Agriculture,Agriculture & Natural Resources,128148,90245,74078,2423,0.02614711,34000,50000,80000
1101,Agriculture Production And Management,Agriculture & Natural Resources,95326,76865,64240,2266,0.02863606,36000,54000,80000
1102,Agricultural Economics,Agriculture & Natural Resources,33955,26321,22810,821,0.03024832,40000,63000,98000
1103,Animal Sciences,Agriculture & Natural Resources,103549,81177,64937,3619,0.0426789,30000,46000,72000
1104,Food Science,Agriculture & Natural Resources,24280,17281,12722,894,0.04918845,38500,62000,90000
1105,Plant Science And Agronomy,Agriculture & Natural Resources,79409,63043,51077,2070,0.03179089,35000,50000,75000


We can see that there aren't any factors: major and major_category are both character columns

In [14]:
# apply is.factor() to the major_category column
college_all_ages %>% # take the data frame
pull(major_category) %>% # extract the column as a vector (select() would return a one-column tibble, which is never a factor)
is.factor() # check whether the vector is a factor
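
One caveat when piping into `is.factor()`: it tests the object it receives, so a one-column data frame is never reported as a factor even when the column inside it is one. A minimal sketch with a hypothetical toy data frame:

```r
library(dplyr)

# Hypothetical toy data frame with one factor column
toy <- data.frame(category = factor(c("a", "b", "a")))

# select() keeps a data frame, so the check is FALSE regardless of the column type
from_select <- toy %>% select(category) %>% is.factor()

# pull() (or toy$category) extracts the vector, so the check reflects the column
from_pull <- toy %>% pull(category) %>% is.factor()

from_select
from_pull
```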

In [15]:
# using the glimpse method
glimpse(college_all_ages)

Rows: 173
Columns: 11
$ major_code                  <int> 1100, 1101, 1102, 1103, 1104, 1105, 1106, ~
$ major                       <chr> "General Agriculture", "Agriculture Produc~
$ major_category              <chr> "Agriculture & Natural Resources", "Agricu~
$ total                       <int> 128148, 95326, 33955, 103549, 24280, 79409~
$ employed                    <int> 90245, 76865, 26321, 81177, 17281, 63043, ~
$ employed_fulltime_yearround <int> 74078, 64240, 22810, 64937, 12722, 51077, ~
$ unemployed                  <int> 2423, 2266, 821, 3619, 894, 2070, 264, 261~
$ unemployment_rate           <dbl> 0.02614711, 0.02863606, 0.03024832, 0.0426~
$ p25th                       <dbl> 34000, 36000, 40000, 30000, 38500, 35000, ~
$ median                      <dbl> 50000, 54000, 63000, 46000, 62000, 50000, ~
$ p75th                       <dbl> 80000, 80000, 98000, 72000, 90000, 75000, ~


Let us load Kaggle’s State of Data Science and ML Survey

In [16]:
# read the kaggle_mcq_responses dataset into a dataframe kaggle_response
kaggle_response <- read_csv("kaggle_mcq_responses.csv")

Parsed with column specification:
cols(
  .default = col_character(),
  Age = col_double()
)
See spec(...) for full column specifications.


In [17]:
# top 6 rows
head(kaggle_response)

LearningPlatformUsefulnessArxiv,LearningPlatformUsefulnessBlogs,LearningPlatformUsefulnessCollege,LearningPlatformUsefulnessCompany,LearningPlatformUsefulnessConferences,LearningPlatformUsefulnessFriends,LearningPlatformUsefulnessKaggle,LearningPlatformUsefulnessNewsletters,LearningPlatformUsefulnessCommunities,LearningPlatformUsefulnessDocumentation,...,WorkChallengeFrequencyScaling,WorkChallengeFrequencyEnvironments,WorkChallengeFrequencyClarity,WorkChallengeFrequencyDataAccess,WorkChallengeFrequencyOtherSelect,WorkInternalVsExternalTools,FormalEducation,Age,DataScienceIdentitySelect,JobSatisfaction
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,...,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>
,,,,Very useful,,,,,,...,Most of the time,,,,,Do not know,Bachelor's degree,,Yes,5
,,,,,,Somewhat useful,,,,...,,,,,,,Master's degree,30.0,Yes,
Very useful,,Somewhat useful,,,,Somewhat useful,,,,...,,,,,,,Master's degree,28.0,Yes,
,Very useful,Very useful,,Very useful,Very useful,,,,Very useful,...,Often,Often,Often,Often,,Entirely internal,Master's degree,56.0,Yes,10 - Highly Satisfied
Very useful,,,,Somewhat useful,,Somewhat useful,,,,...,,Sometimes,,,,Approximately half internal and half external,Doctoral degree,38.0,No,2
,,,,,,Very useful,,,,...,,,,,,More internal than external,Doctoral degree,46.0,,8


In [18]:
# the glimpse method
glimpse(kaggle_response)

Rows: 16,716
Columns: 48
$ LearningPlatformUsefulnessArxiv             <chr> NA, NA, "Very useful", NA,~
$ LearningPlatformUsefulnessBlogs             <chr> NA, NA, NA, "Very useful",~
$ LearningPlatformUsefulnessCollege           <chr> NA, NA, "Somewhat useful",~
$ LearningPlatformUsefulnessCompany           <chr> NA, NA, NA, NA, NA, NA, NA~
$ LearningPlatformUsefulnessConferences       <chr> "Very useful", NA, NA, "Ve~
$ LearningPlatformUsefulnessFriends           <chr> NA, NA, NA, "Very useful",~
$ LearningPlatformUsefulnessKaggle            <chr> NA, "Somewhat useful", "So~
$ LearningPlatformUsefulnessNewsletters       <chr> NA, NA, NA, NA, NA, NA, NA~
$ LearningPlatformUsefulnessCommunities       <chr> NA, NA, NA, NA, NA, NA, NA~
$ LearningPlatformUsefulnessDocumentation     <chr> NA, NA, NA, "Very useful",~
$ LearningPlatformUsefulnessCourses           <chr> NA, NA, "Very useful", "Ve~
$ LearningPlatformUsefulnessProjects          <chr> NA, NA, NA, "Very useful",~
$ LearningPlatf

### Converting Characters to Factors  
As with numerical variables, our first step when looking at categorical variables is to get a high-level summary. Instead of numerical summaries, like the mean and the standard deviation, we can look at the number of categories and the name of each. What if we observe that some of the variables in our dataset are characters and not factors? How can we change this?  

Step 1 : Identify which columns are characters by using is.character().   

Step 2 : Apply the function as.factor() to change columns from characters to factors.   

Extension : If we want to do this for all character columns, we can take advantage of dplyr's mutate_if() function. This function takes two arguments. The first needs to be a function that returns TRUE or FALSE. mutate_if() will check each column to see if that condition is true and, if so, will change the column based on the second argument. For example, mutate_if(is.character, as.factor) converts all character columns to factors.    
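
The conversion can be sketched on a hypothetical toy data frame as follows. The second pipeline uses `across()` with `where()`, which is the form dplyr recommends from version 1.0.0 onward (`mutate_if()` is superseded there but still works); the column names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data: two character columns and one numeric column
toy <- data.frame(
  title = c("Analyst", "Engineer", "Analyst"),
  tool  = c("R", "Python", "R"),
  age   = c(30, 28, 56),
  stringsAsFactors = FALSE
)

# Scoped verb, as used in this tutorial
converted <- toy %>%
  mutate_if(is.character, as.factor)

# Same idea with across(), assuming dplyr >= 1.0.0
converted_across <- toy %>%
  mutate(across(where(is.character), as.factor))

# Character columns become factors; the numeric column is untouched
sapply(converted, class)
sapply(converted_across, class)
```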

**Let us validate the data type for the LearningDataScienceTime column**

In [19]:
# Validate if LearningDataScienceTime data type is a character
is.character(kaggle_response$LearningDataScienceTime)

**Let us now convert all the character columns to factors applying the mutate_if() function**

In [20]:
# Convert all character columns to factors
kaggle_response <- kaggle_response %>%
mutate_if(is.character , as.factor) 

# top 6 rows
head(kaggle_response)

LearningPlatformUsefulnessArxiv,LearningPlatformUsefulnessBlogs,LearningPlatformUsefulnessCollege,LearningPlatformUsefulnessCompany,LearningPlatformUsefulnessConferences,LearningPlatformUsefulnessFriends,LearningPlatformUsefulnessKaggle,LearningPlatformUsefulnessNewsletters,LearningPlatformUsefulnessCommunities,LearningPlatformUsefulnessDocumentation,...,WorkChallengeFrequencyScaling,WorkChallengeFrequencyEnvironments,WorkChallengeFrequencyClarity,WorkChallengeFrequencyDataAccess,WorkChallengeFrequencyOtherSelect,WorkInternalVsExternalTools,FormalEducation,Age,DataScienceIdentitySelect,JobSatisfaction
<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,...,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<dbl>,<fct>,<fct>
,,,,Very useful,,,,,,...,Most of the time,,,,,Do not know,Bachelor's degree,,Yes,5
,,,,,,Somewhat useful,,,,...,,,,,,,Master's degree,30.0,Yes,
Very useful,,Somewhat useful,,,,Somewhat useful,,,,...,,,,,,,Master's degree,28.0,Yes,
,Very useful,Very useful,,Very useful,Very useful,,,,Very useful,...,Often,Often,Often,Often,,Entirely internal,Master's degree,56.0,Yes,10 - Highly Satisfied
Very useful,,,,Somewhat useful,,Somewhat useful,,,,...,,Sometimes,,,,Approximately half internal and half external,Doctoral degree,38.0,No,2
,,,,,,Very useful,,,,...,,,,,,More internal than external,Doctoral degree,46.0,,8


### Summarizing Factors  
Once we have converted our desired columns to factors, we want to find out more about each one. We can use three functions:   
1. nlevels() gives us the number of levels (sub-categories) of a factor   
2. levels() gives us the names of the levels  
3. summarise_if() applies a summary function, such as nlevels(), to every column that meets a certain condition and returns one value per column 
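
Before applying these to the survey data, here is a minimal sketch of the three functions on a hypothetical toy data frame (the column names and values are assumptions for illustration):

```r
library(dplyr)

# Hypothetical toy data: two factor columns and one numeric column
toy <- data.frame(
  education = factor(c("Bachelor's", "Master's", "Master's", "Doctoral")),
  identity  = factor(c("Yes", "No", "Yes", "Yes")),
  age       = c(30, 28, 56, 38)
)

# Number of levels in one factor
n_education <- nlevels(toy$education)

# Names of the levels (sorted alphabetically by default)
education_levels <- levels(toy$education)

# Number of levels for every factor column; age is skipped
level_counts <- toy %>%
  summarise_if(is.factor, nlevels)

n_education
education_levels
level_counts
```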

Let us look at number of levels in LearningDataScienceTime column

In [21]:
# number of sub-categories / levels within the LearningDataScienceTime column
nlevels(kaggle_response$LearningDataScienceTime)

Let us look at names/sub-categories of levels in LearningDataScienceTime column

In [22]:
# list of sub-categories / levels within the LearningDataScienceTime column
levels(kaggle_response$LearningDataScienceTime)

Let us get number of levels for every factor variable

In [26]:
# number of levels of every factor variable
kaggle_response %>%
summarise_if(is.factor , nlevels)

LearningPlatformUsefulnessArxiv,LearningPlatformUsefulnessBlogs,LearningPlatformUsefulnessCollege,LearningPlatformUsefulnessCompany,LearningPlatformUsefulnessConferences,LearningPlatformUsefulnessFriends,LearningPlatformUsefulnessKaggle,LearningPlatformUsefulnessNewsletters,LearningPlatformUsefulnessCommunities,LearningPlatformUsefulnessDocumentation,...,WorkChallengeFrequencyPrivacy,WorkChallengeFrequencyScaling,WorkChallengeFrequencyEnvironments,WorkChallengeFrequencyClarity,WorkChallengeFrequencyDataAccess,WorkChallengeFrequencyOtherSelect,WorkInternalVsExternalTools,FormalEducation,DataScienceIdentitySelect,JobSatisfaction
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
3,3,3,3,3,3,3,3,3,3,...,4,4,4,4,4,4,6,7,3,11


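Let us convert this into a tidy (long) table by applying pivot_longer()  

1. Change all the character columns to factor columns and save the new dataset as response_factor. Since kaggle_response was already converted above, this step leaves the columns unchanged; it is repeated here so the pipeline is complete.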

In [37]:
# Change all the character columns to factors
response_factor <- kaggle_response %>% 
  mutate_if(is.character, as.factor)

2. Use summarise_all() to apply nlevels() to each column, reshape the result from wide to long with pivot_longer(), and sort by the number of levels in descending order. Non-factor columns such as Age report 0 levels.

In [39]:
# Create new dataset of variables and levels
response_factor %>%
  summarise_all(nlevels) %>%
  # change the dataset from wide to long
  pivot_longer(everything(), names_to = "Variable", values_to = "Levels") %>%
  # sort by descending order of levels
  arrange(desc(Levels))

Variable,Levels
<chr>,<int>
MLMethodNextYearSelect,25
CurrentJobTitleSelect,16
JobSatisfaction,11
FormalEducation,7
LearningDataScienceTime,6
WorkInternalVsExternalTools,6
WorkChallengeFrequencyPolitics,4
WorkChallengeFrequencyUnusedResults,4
WorkChallengeFrequencyUnusefulInstrumenting,4
WorkChallengeFrequencyDeployment,4
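
A variation worth considering is to restrict the summary to factor columns so that numeric columns do not appear with 0 levels. The sketch below does this on a hypothetical toy data frame, assuming dplyr 1.0.0 or later for `across()` and tidyr 1.0.0 or later for `pivot_longer()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two factor columns and one numeric column
toy <- data.frame(
  education = factor(c("Bachelor's", "Master's", "Master's", "Doctoral")),
  identity  = factor(c("Yes", "No", "Yes", "Yes")),
  age       = c(30, 28, 56, 38)
)

level_table <- toy %>%
  # count levels for factor columns only
  summarise(across(where(is.factor), nlevels)) %>%
  # one row per variable instead of one column per variable
  pivot_longer(everything(), names_to = "Variable", values_to = "Levels") %>%
  # variables with the most levels first
  arrange(desc(Levels))

level_table
```

The same pipeline could be pointed at `response_factor` in place of `toy`; the result would then omit Age rather than listing it with 0 levels.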
