In [1]:
# Load the Tidyverse

library('tidyverse')

“Failed to locate timezone database”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Use the `pull` function to extract the `class` column from the `mpg` data set. Capture this column into a variable named `class_col`.

In [2]:
class_col <- mpg |> 
    pull(class) # mpg$class is an equivalent way to do this

class_col |> head()

Run the code below to turn `class_col` into a factor and view the default levels.

In [3]:
# Two equivalent ways to do this: the magrittr pipe (%>%) and the base R pipe (|>)
class_col %>% factor %>% levels
class_col |> factor() |> levels()
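
As a minimal sketch of what "default levels" means, the cell below builds a factor from a small made-up vector (`sizes` is hypothetical, not taken from `mpg`). By default `factor()` uses the sorted unique values as the levels, so the result should match `sort(unique(sizes))`.

```r
# Toy vector (hypothetical data, not from mpg)
sizes <- c("suv", "compact", "suv", "2seater", "compact", "suv")

# factor() builds its levels from the sorted unique values
default_levels <- sizes |> factor() |> levels()

# For comparison: the sorted unique values of the raw vector
sorted_values <- sort(unique(sizes))

print(default_levels)
print(identical(default_levels, sorted_values))
```

This is why the default levels of `class_col` come out in alphabetical order rather than in order of appearance or frequency.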

Let's turn `class_col` into a factor but manipulate the order of the levels with `forcats`!

First, use the `table` command to count the frequency of each value in `class_col`.

In [4]:
# Frequency of each value in class_col
# this is equivalent to....
class_col |> table()

class_col
   2seater    compact    midsize    minivan     pickup subcompact        suv 
         5         47         41         11         33         35         62 

In [5]:
# ....this
class_col %>% table

.
   2seater    compact    midsize    minivan     pickup subcompact        suv 
         5         47         41         11         33         35         62 

Pipe `class_col` into the `fct_infreq` function to turn `class_col` into a factor and order the levels by ☝️frequency. Are the levels in the order you expected?

In [6]:
mpg |> 
    group_by(class) |>
    mutate(class_count = n()) |> # class_count = number of rows in each class
    arrange(desc(class_count)) |> # most frequent class first
    ungroup() |> # ungroup so the next mutate sees the whole column, not one group at a time
    # Turn class into a factor whose levels follow the order in which values first appear in the column
    mutate(class = fct_inorder(class)) |>
    # then pull the column out and inspect its levels
    pull(class) |> levels()
    # head()


# With the rows sorted from most to least frequent, this gives the same ordering as fct_infreq()

In [7]:
# Orders the levels by frequency, most frequent first
class_col |> fct_infreq()

In [8]:
# Why use fct_inorder()?
# Because the levels follow the order in which each value first appears in the dataset
class_col |> fct_inorder()
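
To make the difference between the two orderings concrete, here is a small sketch on a made-up vector (`x` is hypothetical, not from `mpg`). It assumes only that `forcats` is installed.

```r
library(forcats)

# Toy vector (hypothetical data): "b" appears first, "a" appears most often
x <- c("b", "a", "c", "a", "a", "c")

# Levels in order of first appearance: b, a, c
inorder_levels <- x |> fct_inorder() |> levels()

# Levels by frequency, most frequent first: a (3), c (2), b (1)
infreq_levels <- x |> fct_infreq() |> levels()

print(inorder_levels)
print(infreq_levels)
```

The two functions only agree when the data happen to be sorted from most to least frequent before `fct_inorder()` is applied.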

In [9]:
# Least frequent level first, most frequent last
class_col |> 
    fct_infreq() |> fct_rev() |> table()


# Most frequent level first, least frequent last
class_col |> 
    fct_infreq() |> table()


   2seater    minivan     pickup subcompact    midsize    compact        suv 
         5         11         33         35         41         47         62 


       suv    compact    midsize subcompact     pickup    minivan    2seater 
        62         47         41         35         33         11          5 

Group the low frequency levels into an "Other" category using `fct_lump`. Set the `n` argument to **five**.

In [10]:
class_col |> 
    fct_infreq() |> fct_rev() |> fct_lump(n=5) |> table()


    pickup subcompact    midsize    compact        suv      Other 
        33         35         41         47         62         16 

In [12]:
# Replace the low-frequency values with a single value, "Other"
class_col |> fct_lump(n = 5) # keeps the 5 most frequent classes; 2seater and minivan collapse into "Other"

# There are fct_lump() variants that lump in different ways, such as...
class_col |> fct_lump_n(3) # keeps the 3 most frequent classes

In [14]:
# Lump by a count threshold instead of by rank
class_col |> fct_lump_min(40) # every class that appears fewer than 40 times becomes "Other"
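
The sketch below contrasts the two lumping rules on a made-up vector (`x` and its counts are hypothetical, not from `mpg`): `fct_lump_n()` keeps a fixed number of levels, while `fct_lump_min()` keeps every level that reaches a minimum count.

```r
library(forcats)

# Toy vector (hypothetical data) with counts a = 5, b = 4, c = 3, d = 1, e = 1
x <- rep(c("a", "b", "c", "d", "e"), times = c(5, 4, 3, 1, 1))

# Keep the 2 most frequent levels; c, d and e are lumped (3 + 1 + 1 = 5)
by_rank <- x |> fct_lump_n(2) |> table()

# Keep levels that appear at least 3 times; d and e are lumped (1 + 1 = 2)
by_count <- x |> fct_lump_min(3) |> table()

print(by_rank)
print(by_count)
```

Which variant fits depends on whether the number of categories or the minimum group size matters more for the analysis.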

## Bonus

Can you group low frequency values and sort the levels by frequency?

In [22]:
# After lumping, "Other" can be the most frequent level, so fct_infreq() may put it first.
# To sort by frequency but place "Other" at a specific position, use fct_relevel().

class_col |> fct_lump(n = 3) |> fct_infreq() |> fct_relevel('Other', after = 2) |> levels() # third position (after = 2 means after the second level)

class_col |> fct_lump(n = 3) |> fct_infreq() |> fct_relevel('Other', after = Inf) |> levels() # always at the very end

class_col |> fct_lump(n = 3) |> fct_infreq() |> fct_relevel('Other', after = 0) |> levels() # always at the very front
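
One possible alternative, sketched on a made-up vector (`x` is hypothetical, not from `mpg`): sort by frequency first and lump afterwards. The lumping functions place "Other" at the end of the levels, so this order of operations should leave "Other" last without an explicit `fct_relevel()`.

```r
library(forcats)

# Toy vector (hypothetical data) with counts a = 3, b = 5, c = 3, d = 4, e = 3
x <- rep(c("a", "b", "c", "d", "e"), times = c(3, 5, 3, 4, 3))

# Lump first, then sort: "Other" (3 + 3 + 3 = 9) is the most frequent level, so it comes first
lump_then_sort <- x |> fct_lump_n(2) |> fct_infreq() |> levels()

# Sort first, then lump: kept levels stay in frequency order and "Other" goes last
sort_then_lump <- x |> fct_infreq() |> fct_lump_n(2) |> levels()

print(lump_then_sort)
print(sort_then_lump)
```

The `fct_relevel()` approach is still the more explicit choice when "Other" has to sit at a position other than the end.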