# The Grammar of Data Wrangling

Let's go over some of the most commonly used wrangling functions.

**Topics:** `filter()`, `select()`, `mutate()`, `group_by()`, `summarize()`, `arrange()`

You can find much more in the `dplyr` documentation on [wrangling](https://dplyr.tidyverse.org/articles/dplyr.html) and in the `dplyr` [cheatsheet](https://rstudio.github.io/cheatsheets/data-transformation.pdf).


In [46]:
library(dplyr)
library(palmerpenguins)   # provides the penguins dataset
library(tibble)

# Startup messages appear when a package masks (overrides) names that other
# loaded packages already define. For example, the datasets package (loaded by
# default) may already provide penguins, and palmerpenguins provides penguins, too.

# Wrap the library() calls like this to suppress those messages
# when you know the masking is intended.

# suppressPackageStartupMessages({
#   library(dplyr)
#   library(palmerpenguins)   # provides penguins and penguins_raw
#   library(tibble)
# })

penguins |> glimpse()

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, …

# `filter()`

`filter()` keeps **rows** that match conditions.


In [47]:
# Keep only Adelie penguins
penguins |> 
  filter(species == "Adelie") |>
  glimpse()

Rows: 152
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, …

Try it:
- Change `"Adelie"` to `"Chinstrap"` or `"Gentoo"`.
- Filter for a specific island (e.g., `"Biscoe"`).
- Filter out missing values with `!is.na(...)`.


In [48]:
# Your work area

In [49]:
# Examples: multiple filter conditions
penguins |>
  filter(
    island == "Biscoe",
    !is.na(sex),
    bill_length_mm >= 40
  ) |>
  glimpse()

Rows: 135
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Biscoe, Biscoe, Biscoe, Biscoe, Biscoe, Biscoe, Bisc…
$ bill_length_mm    <dbl> 40.6, 40.5, 40.5, 40.1, 42.0, 41.4, 40.6, 41.3, 41.1…
$ bill_depth_mm     <dbl> 18.6, 17.9, 18.9, 18.9, 19.5, 18.6, 18.8, 21.1, 18.2…
$ flipper_length_mm <int> 183, 187, 180, 188, 200, 191, 193, 195, 192, 192, 18…
$ body_mass_g       <int> 3550, 3200, 3950, 4300, …

## Review: `=` versus `==`

- `=` and `<-` **assign** values (create/overwrite variables).
- `==` **tests** whether two values are equal.

So `filter(species == "Adelie")` keeps only rows where the *species* value equals `"Adelie"`.
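A quick way to see the difference is to try both operators on a made-up variable. This is a minimal sketch; `favorite` and `species_example` are hypothetical names, not part of the penguins data.

```r
# Assignment: both lines store the string "Adelie" in a variable.
favorite <- "Adelie"
favorite = "Adelie"     # works, but <- is the more common R style

# Comparison: == returns TRUE or FALSE and changes nothing.
favorite == "Adelie"
favorite == "Gentoo"

# == is vectorized: it gives one TRUE/FALSE per element,
# which is what filter() uses to decide which rows to keep.
species_example <- c("Adelie", "Gentoo", "Adelie")
is_adelie <- species_example == "Adelie"
is_adelie
```

If you write `filter(species = "Adelie")` by mistake, `dplyr` stops with an error that points you toward `==`.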


## `>=` and `<=`

- `>=` means “greater than or equal to”
- `<=` means “less than or equal to”

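Example: `filter(body_mass_g >= 4000)` keeps penguins with a body mass of at least 4000 g. Rows where `body_mass_g` is `NA` are dropped too, because the comparison returns `NA` rather than `TRUE`.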


In [50]:
penguins |>
  filter(body_mass_g >= 4000) |>
  glimpse()

Rows: 177
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.2, 42.0, 34.6, 42.5, 46.0, 39.2, 39.8, 44.1, 39.6…
$ bill_depth_mm     <dbl> 19.6, 20.2, 21.1, 20.7, 21.5, 21.1, 19.1, 19.7, 18.8…
$ flipper_length_mm <int> 195, 190, 198, 197, 194, 196, 184, 196, 190, 191, 18…
$ body_mass_g       <int> 4675, 4250, 4400, 4500, 4200, …

# `select()`

`select()` keeps (or reorders) **columns**. You can also rename columns inside `select()`.


In [51]:
penguins |>
  select(
    sex,
    bill_len = bill_length_mm,
    bill_depth = bill_depth_mm,
    flipper = flipper_length_mm,
    mass_g = body_mass_g
  ) |>
  glimpse()

Rows: 344
Columns: 5
$ sex        <fct> male, female, female, NA, female, male, female, male, NA, N…
$ bill_len   <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 3…
$ bill_depth <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 1…
$ flipper    <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, …
$ mass_g     <int> 3750, 3800, 3250, NA, 345…

Try it:
- Add or remove columns.
- Rename a column by using `new_name = old_name`.
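Renaming inside `select()` also drops every column you did not list. When you want to rename but keep all columns, `rename()` is the usual choice. Here is a minimal sketch on a small made-up tibble; `toy` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical three-column tibble, just for illustration
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.1),
  body_mass_g = c(3750L, 4500L)
)

# select(): keeps only the listed columns, renaming as it goes
selected <- toy |> select(species, mass_g = body_mass_g)
names(selected)

# rename(): keeps every column, changing only the names you give
renamed <- toy |> rename(mass_g = body_mass_g)
names(renamed)

# A minus sign drops columns instead of keeping them
dropped <- toy |> select(-bill_length_mm)
names(dropped)
```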


# `mutate()`

`mutate()` creates **new columns**:

In [52]:
penguins |>
  mutate(
    body_mass_kg = body_mass_g / 1000,
    flipper_length_cm = flipper_length_mm / 10
  ) |>
  select(species, island, sex, body_mass_g, body_mass_kg, flipper_length_mm, flipper_length_cm) |>
  glimpse()

Rows: 344
Columns: 7
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ body_mass_kg      <dbl> 3.750, 3.800, 3.250, NA, 3.450, 3.650, 3.625, 4.675,…
$ flipper_length_mm <int> 181, 186, 195, NA, …

`mutate()` can also overwrite existing columns.

`penguins` is pretty clean, but a good reason to overwrite a column is to convert it to a more appropriate data type.

Let's check out some messier data:

In [53]:
penguins_raw |>
    glimpse()

Rows: 344
Columns: 17
$ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
$ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
$ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
$ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
$ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "…

What types do you think we should use?

In [54]:
penguins_raw |>
  mutate(
    # overwrite the existing column with a cleaned version
    `Body Mass (g)` = as.integer(`Body Mass (g)`)
  ) |>
  glimpse()

Rows: 344
Columns: 17
$ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
$ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
$ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
$ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
$ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "…

We'll come back to this later in the notebook.

Try it:
- Create a new variable that is **three times** the value of `bill_depth_mm`.
- Create `bill_length_sq = bill_length_mm^2`.

In [55]:
# Your work area

# `group_by()` + `summarize()` + `arrange()`

A very common pattern:
1. group by a category,
2. compute summary statistics per group,
3. sort (rank) the results.

In [56]:
names(penguins)

In [57]:
penguins |>
  group_by(species) |>
  summarize(
    mean_mass_g = mean(body_mass_g),
    median_flipper_mm = median(flipper_length_mm),
    max_bill_depth_mm = max(bill_depth_mm)
  ) |>
  arrange(desc(mean_mass_g))

| species | mean_mass_g | median_flipper_mm | max_bill_depth_mm |
|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Chinstrap | 3733.088 | 196.0 | 20.8 |
| Adelie | NA | NA | NA |
| Gentoo | NA | NA | NA |


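The Adelie and Gentoo summaries are missing because a single `NA` in a group makes `mean()`, `median()`, and `max()` return `NA`. Either filter out the rows with missing values first, or set `na.rm = TRUE` so each summary function ignores them.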
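The effect of a single `NA` on a summary function is easy to see on a made-up vector. In this minimal sketch, `masses` is hypothetical and not taken from the data.

```r
masses <- c(3750, NA, 3250)

mean(masses)                 # NA: one missing value makes the result missing
mean(masses, na.rm = TRUE)   # 3500: the NA is dropped before averaging

# Filtering out the NA first gives the same answer
mean(masses[!is.na(masses)])
```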

In [58]:
penguins |>
  group_by(species) |>
  summarize(
    mean_mass_g = mean(body_mass_g, na.rm = TRUE),
    median_flipper_mm = median(flipper_length_mm, na.rm = TRUE),
    max_bill_depth_mm = max(bill_depth_mm, na.rm = TRUE)
  ) |>
  arrange(desc(mean_mass_g))

| species | mean_mass_g | median_flipper_mm | max_bill_depth_mm |
|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Gentoo | 5076.016 | 216 | 17.3 |
| Chinstrap | 3733.088 | 196 | 20.8 |
| Adelie | 3700.662 | 190 | 21.5 |


## Using `across()` to apply the same function to multiple columns

This is handy when you want “the mean of several numeric columns” (or median, sd, etc.).


In [59]:
names(penguins)

In [60]:
penguins |>
  group_by(species) |>
  summarize(
    across(
      c(body_mass_g, flipper_length_mm, bill_depth_mm),
      \(x) mean(x, na.rm = TRUE),
      .names = "mean_{col}")
    ) |>
  arrange(desc(mean_body_mass_g))

| species | mean_body_mass_g | mean_flipper_length_mm | mean_bill_depth_mm |
|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Gentoo | 5076.016 | 217.187 | 14.98211 |
| Chinstrap | 3733.088 | 195.8235 | 18.42059 |
| Adelie | 3700.662 | 189.9536 | 18.34636 |


# Practice

1. Group by **island** (instead of species) and compute:
   - `mean()` body mass,
   - `median()` flipper length,
   - and the number of observations (use `n()` inside `summarize()`).

2. Filter to `island == "Biscoe"` and group by **sex**:
   - What is the mean mass for each sex?

3. Use `across()` to compute the mean of `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g` for each **(species, island)** pair.


In [61]:
# Your work area


# Cleaning `penguins_raw` Example

What is wrong with this data? What makes it messy?


## What makes data *clean* (practical checklist)

Clean data is **trustworthy** and **consistent**, so analysis works the way you expect.

- **Clear variable names** (consistent style, descriptive, no hidden units)
- **Correct data types** (numbers as numeric, dates as dates, categories as factors/characters)
- **Consistent coding of categories** (no `"male "` vs `"Male"` vs `"M"`)
- **Consistent units** (no mix of mm/cm or g/kg in the same column)
- **Missing values encoded consistently** (`NA`, not `""`, `"N/A"`, `-999`)
- **Valid values** (ranges make sense; no impossible measurements)
- **Duplicates handled** (no unintended repeated rows)
- **Documented meaning** (you can explain what each column is and the units/coding)
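
A few items on this checklist can be checked or fixed in a line or two. Here is a minimal sketch on a made-up tibble; `messy`, its column names, and the `-999` sentinel are all hypothetical.

```r
library(dplyr)

# Hypothetical messy data: inconsistent sex coding, a -999 sentinel, a repeated row
messy <- tibble(
  id = c(1, 2, 2, 3),
  sex = c("male ", "Male", "Male", "FEMALE"),
  mass_g = c(3750, -999, -999, 3250)
)

tidy <- messy |>
  distinct() |>                     # duplicates handled
  mutate(
    sex = tolower(trimws(sex)),     # consistent coding of categories
    mass_g = na_if(mass_g, -999)    # missing values encoded as NA
  )

tidy

# Valid values: does the range make sense?
range(tidy$mass_g, na.rm = TRUE)
```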

## Step 1: Take a look at the raw data

In [66]:
penguins_raw |> glimpse()

Rows: 344
Columns: 17
$ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
$ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
$ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
$ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
$ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "…

## Step 2: Make column names consistent

`penguins_raw` has long names and punctuation. A common first step is to standardize names to **snake_case**.


In [67]:
library(janitor) # provides clean_names()

# clean_names() converts column names to snake_case (no spaces, capitals, or
# punctuation), so we no longer need backticks around them.
raw1 <- penguins_raw |>
  clean_names()

raw1 |> glimpse()

Rows: 344
Columns: 17
$ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2"…

## Step 3: Keep the columns you actually want

`penguins_raw` includes some columns used for labeling/notes. Start by selecting the core variables.


In [69]:
# Only include the columns you want
raw2 <- raw1 |>
  select(
    species,
    region,
    island,
    stage,
    clutch_completion,
    date_egg,
    sex,
    culmen_length_mm,
    culmen_depth_mm,
    flipper_length_mm,
    body_mass_g
  )

raw2 |> glimpse()

Rows: 344
Columns: 11
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…

In [70]:
# Another way: list the columns you don't want and drop them

raw2 <- raw1 |>
  select(
    -c(comments,
       study_name,
       sample_number,
       individual_id,
       delta_15_n_o_oo,
       delta_13_c_o_oo)
  )

raw2 |> glimpse()

Rows: 344
Columns: 11
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9…

## Step 4: Decide how to handle missingness

Two common approaches:
- **Drop rows** missing key variables (simple for teaching)
- **Keep missing rows** but be explicit (`na.rm = TRUE` in summaries)
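
Before choosing between them, it helps to know how much is missing. One way is to count `NA`s per column with `summarize()` and `across()`. Here is a minimal sketch on a made-up tibble; `toy` and its values are hypothetical.

```r
library(dplyr)

toy <- tibble(
  island = c("Biscoe", "Dream", NA),
  body_mass_g = c(3750, NA, NA)
)

# One row of output, with one NA count per column
na_counts <- toy |>
  summarize(across(everything(), \(x) sum(is.na(x))))
na_counts

# Rows that survive if we require both columns to be present
complete_rows <- toy |>
  filter(!is.na(island), !is.na(body_mass_g))
nrow(complete_rows)
```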

Here we drop rows that are missing the core measurements we want to summarize or visualize.


In [71]:
raw3 <- raw2 |>
  filter(
    !is.na(species),
    !is.na(island),
    !is.na(culmen_length_mm),
    !is.na(culmen_depth_mm),
    !is.na(flipper_length_mm),
    !is.na(body_mass_g)
  )

raw3 |> glimpse()


Rows: 342
Columns: 11
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, …

## Step 5: Make values consistent


Common issues:
- extra spaces (leading/trailing)
- inconsistent capitalization
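
Both issues can be handled with base R string functions. Here is a minimal sketch on a made-up vector; `sex_raw` and its values are hypothetical.

```r
# Hypothetical values showing both problems
sex_raw <- c(" MALE", "female ", "Male", NA)

# trimws() strips leading/trailing whitespace; tolower() fixes capitalization
sex_clean <- tolower(trimws(sex_raw))
sex_clean

# Check which distinct values remain (NA stays NA)
unique(sex_clean)
```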

We'll make the `sex` column lower case.


In [72]:
raw4 <- raw3 |>
  mutate(sex = tolower(sex)) # matches the lower-case coding used in penguins

raw4 |> glimpse()

Rows: 342
Columns: 11
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, …

This step is a bit subjective. You could make other changes to this data set, such as shortening the species names.
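
Shortening the species names, for instance, could look like the sketch below. It assumes every value starts with the common name followed by a space, as in `"Adelie Penguin (Pygoscelis adeliae)"`. The vector `species_long` is hypothetical, and its second entry is written on the assumption that the other species follow the same pattern.

```r
species_long <- c(
  "Adelie Penguin (Pygoscelis adeliae)",
  "Gentoo penguin (Pygoscelis papua)"   # assumed to follow the same pattern
)

# Remove everything from the first space onward, keeping only the first word
species_short <- sub(" .*$", "", species_long)
species_short
```

In the pipeline this would be a `mutate(species = sub(" .*$", "", species))` step, applied before `species` is converted to a factor.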

## Step 6: Convert each column to its appropriate type

In [74]:
raw5 <- raw4 |>
  mutate(
    # whole-number measurements stored as integers
    body_mass_g = as.integer(body_mass_g),
    flipper_length_mm = as.integer(flipper_length_mm),

    # dates (date_egg is already a Date here, so as.Date() leaves it unchanged)
    date_egg = as.Date(date_egg),

    # categorical variables
    species = factor(species),
    region  = factor(region),
    island  = factor(island),
    stage   = factor(stage),
    sex     = factor(sex),

    # logical: the comparison already returns TRUE/FALSE (and NA stays NA)
    clutch_completion = clutch_completion == "Yes"
  )

raw5 |> glimpse()

Rows: 342
Columns: 11
$ species           <fct> Adelie Penguin (Pygoscelis adeliae), Adelie Penguin …
$ region            <fct> Anvers, Anvers, Anvers, Anvers, Anvers, Anvers, Anve…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ stage             <fct> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ clutch_completion <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TR…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9…

## Step 7: Add derived variables

- Create derived variables with clear units


In [75]:
penguins_raw_clean <- raw5 |>
  mutate(
    body_mass_kg = body_mass_g / 1000,
    bill_ratio = culmen_length_mm / culmen_depth_mm
  ) 

penguins_raw_clean |> glimpse()

Rows: 342
Columns: 13
$ species           <fct> Adelie Penguin (Pygoscelis adeliae), Adelie Penguin …
$ region            <fct> Anvers, Anvers, Anvers, Anvers, Anvers, Anvers, Anve…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ stage             <fct> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ clutch_completion <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TR…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9…