# Palmer penguins and split-apply-combine {#exr-palmer-penguins-split-apply-combine}

[Data set download](https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv)

<hr />

The [Palmer penguins data set](https://towardsdatascience.com/penguins-dataset-overview-iris-alternative-9453bb8c8d95) is a nice data set with which to practice various data science skills. For this exercise, we will use a subset of it, which you can download here: [https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv](https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv). The data set consists of measurements of three different species of penguins acquired at the [Palmer Station in Antarctica](https://en.wikipedia.org/wiki/Palmer_Station). The measurements were made between 2007 and 2009 by [Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php).

**a)** Take a look at the CSV file containing the data set. Is it in tidy format? Why or why not?

**b)** You can convert the CSV file to a "tall" format using the `bebi103.utils.unpivot_csv()` function. You can do that with the following function call, where `path_to_penguins` is a string containing the path to the `penguins_subset.csv` file.

    bebi103.utils.unpivot_csv(
        path_to_penguins,
        "penguins_tall.csv",
        n_header_rows=2,
        header_names=["species", "quantity"],
        comment_prefix="#",
        retain_row_index=True,
        row_index_name='penguin_id',
    )

After running that function, load in the data set stored in the `penguins_tall.csv` file and store it in a variable named `df_tall`. Is this a tidy data set?

**c)** Perform the following operations to make a new `DataFrame` from the one you loaded in. Explain what the operations do.

```python
df = (
    df_tall
    .pivot(
        on='quantity', index=['penguin_id', 'species'], values='value'
    )
    .select(pl.exclude('penguin_id'))
)
```

Is the resulting data frame `df` tidy? Why or why not?

**d)** Using the data frame you created in part (c), slice out all of the bill lengths for *Gentoo* penguins.

**e)** Make a new data frame, `df_means`, containing the mean measured bill depth, bill length, body mass in kg, and flipper length for each species. You can use millimeters for all length measurements.

**f)** Save the data frame you made in part (c) in a file named `penguins_subset_tidy.csv`.

<br />

## Solution

<hr>

In [1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import polars as pl
import bebi103

**a)** Let's take a look.

In [2]:
!head ../data/penguins_subset.csv

Gentoo,Gentoo,Gentoo,Gentoo,Adelie,Adelie,Adelie,Adelie,Chinstrap,Chinstrap,Chinstrap,Chinstrap
bill_depth_mm,bill_length_mm,flipper_length_mm,body_mass_g,bill_depth_mm,bill_length_mm,flipper_length_mm,body_mass_g,bill_depth_mm,bill_length_mm,flipper_length_mm,body_mass_g
16.3,48.4,220.0,5400.0,18.5,36.8,193.0,3500.0,18.3,47.6,195.0,3850.0
15.8,46.3,215.0,5050.0,16.9,37.0,185.0,3000.0,16.7,42.5,187.0,3350.0
14.2,47.5,209.0,4600.0,19.5,42.0,200.0,4050.0,16.6,40.9,187.0,3200.0
15.7,48.7,208.0,5350.0,18.3,42.7,196.0,4075.0,20.0,52.8,205.0,4550.0
14.1,48.7,210.0,4450.0,18.0,35.7,202.0,3550.0,18.7,45.4,188.0,3525.0
15.0,49.6,216.0,4750.0,19.1,39.8,184.0,4650.0,18.2,49.6,193.0,3775.0
15.7,49.3,217.0,5850.0,18.4,40.8,195.0,3900.0,17.5,48.5,191.0,3400.0
15.2,49.2,221.0,6300.0,18.4,36.6,184.0,3475.0,18.2,49.2,195.0,4400.0


This is not a tidy data set, since each row contains measurements from three different penguins. Strictly speaking, the two-level hierarchy of the column names does not make it untidy, but it is cumbersome to work with.

**b)** We can now convert this CSV file with a two-level hierarchical column heading to a tall format.

In [3]:
bebi103.utils.unpivot_csv(
    os.path.join(data_path, 'penguins_subset.csv'),
    "penguins_tall.csv",
    n_header_rows=2,
    header_names=["species", "quantity"],
    comment_prefix="#",
    retain_row_index=True,
    row_index_name='penguin_id',
    force_overwrite=True,
)

We can now load in the data frame and take a look.

In [4]:
df_tall = pl.read_csv('penguins_tall.csv')

# Take a look
df_tall

penguin_id,species,quantity,value
i64,str,str,f64
0,"""Gentoo""","""bill_depth_mm""",16.3
0,"""Gentoo""","""bill_length_mm""",48.4
0,"""Gentoo""","""flipper_length_mm""",220.0
0,"""Gentoo""","""body_mass_g""",5400.0
0,"""Adelie""","""bill_depth_mm""",18.5
…,…,…,…
67,"""Adelie""","""body_mass_g""",3050.0
67,"""Chinstrap""","""bill_depth_mm""",18.9
67,"""Chinstrap""","""bill_length_mm""",46.0
67,"""Chinstrap""","""flipper_length_mm""",195.0


This is tidy, with each row corresponding to a single observation and each column a property of that observation.

**c)** Let's do the prescribed operations and take a look.

In [5]:
df = (
    df_tall
    .pivot(
        on='quantity', index=['penguin_id', 'species'], values='value'
    )
    .select(pl.exclude('penguin_id'))
)

# Take a look
df

species,bill_depth_mm,bill_length_mm,flipper_length_mm,body_mass_g
str,f64,f64,f64,f64
"""Gentoo""",16.3,48.4,220.0,5400.0
"""Adelie""",18.5,36.8,193.0,3500.0
"""Chinstrap""",18.3,47.6,195.0,3850.0
"""Gentoo""",15.8,46.3,215.0,5050.0
"""Adelie""",16.9,37.0,185.0,3000.0
…,…,…,…,…
"""Adelie""",18.9,45.8,197.0,4150.0
"""Chinstrap""",18.2,46.1,178.0,3250.0
"""Gentoo""",14.6,45.8,210.0,4200.0
"""Adelie""",15.9,35.2,186.0,3050.0


The pivot operation moves the data frame from a tall to a wide format. By pivoting on `quantity`, the names of the quantities become the column headings and the respective values get populated. The penguin ID and species do not get pivoted, but are "along for the ride" as index columns. The select operation then drops the `penguin_id` column because it is no longer necessary.

This is also a tidy data frame, with each row being a specific penguin, and each column being an aspect of the measurements of that penguin.

**d)** We perform a filtering operation to slice out the *Gentoo* entries and then fetch the bill length column.

In [6]:
gentoo_bill_lengths = (
    df
    .filter(pl.col('species') == 'Gentoo')
    .get_column('bill_length_mm')
)

# Take a look to verify
gentoo_bill_lengths

bill_length_mm
f64
48.4
46.3
47.5
48.7
48.7
…
48.4
46.8
55.9
44.5


These values are stored as a Series. If we wanted them as a NumPy array, we would use the `.to_numpy()` method.

In [7]:
gentoo_bill_lengths.to_numpy()

array([48.4, 46.3, 47.5, 48.7, 48.7, 49.6, 49.3, 49.2, 45.5, 48.8, 46.4,
       48.6, 46.5, 50.8, 45.7, 50. , 42.7, 44.9, 49.5, 45.2, 50. , 40.9,
       53.4, 50.1, 51.1, 49. , 42.9, 50.5, 50. , 52.5, 46.2, 46.8, 55.1,
       46.1, 45.5, 49.1, 52.1, 46.5, 44.5, 44.5, 45.5, 44.4, 47.2, 45.2,
       48.1, 45.3, 48.7, 46.6, 43.3, 43.2, 43.5, 45.2, 48.2, 46.2, 44. ,
       46.1, 42.8, 42.6, 50.8, 52.2, 46.5, 50.2, 48.2, 48.4, 46.8, 55.9,
       44.5, 45.8])

**e)** To compute the mean of the features for each species, we do a groupby operation. This is classic split-apply-combine. After applying the `mean()` aggregating function, we make a new column with the mass in kilograms before finally dropping the column containing the mass in grams.

In [8]:
df_means = (
    df
    .group_by('species')
    .mean()
    .with_columns(body_mass_kg=pl.col('body_mass_g') / 1000)
    .drop('body_mass_g')
)

# Take a look
df_means

species,bill_depth_mm,bill_length_mm,flipper_length_mm,body_mass_kg
str,f64,f64,f64,f64
"""Chinstrap""",18.420588,48.833824,195.823529,3.733088
"""Adelie""",18.135294,38.476471,189.602941,3.621691
"""Gentoo""",15.013235,47.419118,216.514706,5.058456


**f)** To save the tidy data frame, we use its `write_csv()` method.

In [9]:
df.write_csv(os.path.join(data_path, 'penguins_subset_tidy.csv'))

## Computing environment

In [10]:
%load_ext watermark
%watermark -v -p polars,bebi103,jupyterlab

Python implementation: CPython
Python version       : 3.12.9
IPython version      : 9.1.0

polars    : 1.29.0
bebi103   : 0.1.26
jupyterlab: 4.3.7

