# Session 17: 

By: Viktoria Haghani

Session Date: TBD

Last Updated: 2026-01-30

Reference materials include teaching material from Dr. Nick Ulle.

## Summarizing Data in a Data Frame

In the last session, we started working with the Terns dataset. Create a new Jupyter notebook called `02_summarize_data.ipynb`, then load the appropriate modules and data:

### Import Modules

In [None]:
import polars as pl

### Load Data

In [None]:
terns = pl.read_csv("../Session_16/2000-2023_ca_least_tern.csv")

### Summarizing Data Frames

The `.glimpse` method provides a structural summary of a data frame. The method
lists the data frame's shape and column names, as well as the type of data in
each column and a few example values. Try calling `.glimpse` on the `terns`
data frame:

In [None]:
terns.glimpse()

A later section of this session explains data types in more detail. For now,
just take note that there are multiple types (`i64`, `str`, and `f64` in the
`terns` data frame).

In contrast to the `.glimpse` method, the `.describe` method provides a
statistical summary of a data frame:

In [None]:
terns.describe()

### Summarizing Columns

You can select individual columns with **bracket notation**. Put the name of
the column in quotes and place that inside of square brackets `[]`. For
example, to select the `total_nests` column:

In [None]:
terns["total_nests"]

Polars provides a variety of methods to compute on columns. For instance, you
can use the `.mean` method to compute the mean:

In [None]:
terns["total_nests"].mean()

Similarly, you can use the `.min` method to compute the smallest value in a
column:

In [None]:
terns["total_nests"].min()

For columns of categories, statistics like means and minimums aren't defined.
Instead, it's often informative to count the number of observations of each
category. You can do this with the `.value_counts` method:

In [None]:
terns["year"].value_counts()

## Working with Different Data Types

When we used the `.glimpse` method to get a structural summary of the California least tern data set, we could already start to learn something about the data being stored. Here is that summary again:

In [None]:
terns.glimpse()

The first two rows describe the shape of the data set. After that, each row
lists a column name, the type of data in that column, and that column's first
few values. For instance, the `site_name` column contains `str`, or string,
data. This is why we spent so much time learning about different data types. Data frames usually have a mix of several different data types, and depending on the type, we can do different things with the data.

When doing data analysis, statisticians conventionally categorize data as one
of four types within two larger categories:

- Numeric
    - continuous (real or complex numbers)
    - discrete (integers)
- Categorical
    - nominal (categories with no ordering)
    - ordinal (categories with some ordering)

Here is a quick reminder about some of the different data types we have learned about:

Type      | Example         | Description
--------- | --------------- | -----------
`bool`    | `True`, `False` | Boolean values
`int`     | `-8`, `0`, `42` | Integers
`float`   | `-2.1`, `0.5`   | Real numbers
`complex` | `3j`, `1-2j`    | Complex numbers
`str`     | `"hi"`, `"2.1"` | Strings

We previously learned about the `type()` function, which allows us to determine the data type of a value:

In [None]:
print(type(True))
print(type(-8.3))

We can also use another function, `isinstance()`, to test whether a value is of a particular class. The following tests if `5` is a string:

In [None]:
isinstance(5, str)
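
To connect the table of types to these two functions, here is a small sketch that checks one example value of each built-in type. The `values` list is made up for illustration.

```python
# One example value for each type in the table above.
values = [True, 42, -2.1, 3j, "2.1"]

for value in values:
    # type() reports the class; isinstance() answers a yes/no question.
    print(repr(value), type(value).__name__, isinstance(value, str))

# Note that "2.1" is a string, not a float, even though it looks numeric.
looks_numeric = isinstance("2.1", float)
print(looks_numeric)
```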

### Series

A **series** is an ordered, one-dimensional data structure. Series are a
fundamental data structure in Polars, because each column in a data frame is a
series.

For example, in the California least tern data set, the `site_name` column is a
series. Take a look at the first few elements with its `.head` method:

In [None]:
terns["site_name"].head()

Series and data frames have many attributes and methods in common; the `.head`
method is one of these.

Notice that the elements of the `site_name` series are all strings. Unlike a
list, in a series, all elements must be of the same type, so we say series are
**homogeneous**. A series can contain strings, integers, decimal numbers, or
any of several other types of data, but not a mix of these all at once.

The other columns in the least tern data are also series. For instance, the
`year` column is a series of integers:

In [None]:
terns["year"]

Series can contain any number of elements, including 0 or 1 element. You can
check the number of elements, or length, of a series with Python's built-in
`len` function:

In [None]:
len(terns["year"])

### Creating Series

Sometimes you’ll want to create series by manually inputting data, perhaps
because your data set isn't digitized or because you want a toy data set to
test out some code. You can create a series from a list (or other sequence)
with the `pl.Series` function:

In [None]:
pl.Series([1, 2, 19, -3])

In [None]:
pl.Series(["hi", "hello"])

The Polars documentation recommends setting a name for every series. To do this
with `pl.Series`, pass the name as the first argument and the elements as the
second argument:

In [None]:
pl.Series("tens", [10, 20, 30])

Polars will print the name when you print the series and will use the name as a
column name if you put the series in a data frame. You can get the name of a
series through the `.name` attribute, and the `.alias` method returns a copy of
the series with a new name:

In [None]:
x = pl.Series("tens", [10, 20, 30])
x.name

By default, Polars infers the data type of a series' elements from the first
element. This can lead to errors you might not expect:

In [None]:
pl.Series([1, 9.2, 2.3])

You can explicitly specify a data type for a series with the `pl.Series`
function's third parameter, `dtype`:

In [None]:
pl.Series([1, 9.2, 2.3], dtype=float)

Polars uses its own data types for series elements, so that:

* It can efficiently support types of data that are not built into Python, such
  as categorical data.
* Every numeric type has an explicit **bit size**: the number of bits of memory
  necessary to store an element. Bit sizes appear as suffixes in the name of
  the type. For instance, `Float32` stores a floating point number in 32 bits.
* A special value, `null`, can be present in any series to indicate missing
  data. The Missing Values section below explains `null`.

When you create or access the elements of a series, Polars silently converts
between its types and Python's built-in types. Some of the Polars types and
their Python equivalents are listed in the following table:

| Type                      | Python Equivalent | Description                        |
|---------------------------|-------------------|------------------------------------|
| Boolean                   | bool              | Boolean values                     |
| Int8, Int16, Int32, Int64 | int               | Integers                           |
| Float32, Float64          | float             | Real numbers (base-2 floating point) |
| Not yet supported         | complex           | Complex numbers                    |
| String                    | str               | Strings                            |
| Categorical, Enum         | No equivalent     | Categorical data                   |

The Polars documentation has [the complete list](https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#appendix-full-data-types-table).

If you call Python's `type` function on a data structure, it returns the type
of the data structure:

In [None]:
type(terns["site_name"])

For a series, you can get the element type with the `.dtype` attribute:

In [None]:
terns["site_name"].dtype

### Categorical Data

A feature is **categorical** if it measures a qualitative category. For
example, the genres `rock`, `blues`, `alternative`, `folk`, `pop` are
categories.

Polars uses the `Categorical` and `Enum` data types to represent categorical
data. Visualizations and statistical models sometimes treat categorical data
differently than other data types, so it's important to make sure you have the
right data type.

When it reads a data set, Polars usually can't tell which features are
categorical. That means identifying and converting the categorical features is
up to you. For beginners, it can be difficult to understand whether a feature
is categorical or not. The key is to think about whether you want to use the
feature to divide the data into groups.

For example, if you want to know how many songs are in the `rock` genre, you
first need to divide the songs by genre, and then count the number of songs in
each group (or at least the `rock` group).

As a second example, months recorded as numbers can be categorical or not,
depending on how you want to use them. You might want to treat them as
categorical (for example, to compute max rainfall in each month) or you might
want to treat them as numbers (for example, to compute the number of months
between two events).

**The bottom line is that you have to think about what you'll be doing in the
analysis**. In some cases, you might treat a feature as categorical only for part
of the analysis.

Let's think about which features are categorical in the least terns data set.
To refresh your memory of what's in the data set, take a look at the structural
summary:

In [None]:
terns.glimpse()

The `site_name`, `site_abbr`, and `event` columns are all examples of
categorical data. The `region_` columns and some of the `pred_` columns also
contain categorical data.

One way to check whether a feature is useful for grouping (and thus effectively
categorical) is to count the number of times each value appears. For a series,
you can do this with the `.value_counts` method. For instance, to count the
number of times each category of `event` appears:

In [None]:
terns["event"].value_counts()

Features with only a few unique values, repeated many times, are ideal for
grouping. Numerical features, like `total_nests`, usually aren't good for
grouping, both because of what they measure and because they tend to have many
unique values, which leads to very small groups.

The `year` column can be treated as categorical or quantitative data. It's easy
to imagine grouping observations by year, but years are also numerical: they
have an order and we might want to do math on them. The most appropriate type
for `year` depends on how we want to use it for analysis.

You can cast a column to the `Categorical` type with the `.cast` method. Try
this for the `event` column:

In [None]:
event = terns["event"].cast(pl.Categorical)
event

Polars organizes attributes and methods for categorical data under the `.cat`
attribute of series. These raise errors if the element type of the series is
not `Categorical` (or `Enum`). You can get the categories of a categorical
series with the `.cat.get_categories` method:

In [None]:
event.cat.get_categories()

A categorical series remembers all possible categories even if you take a
subset where some of the categories aren't present:

In [None]:
event[:3]

In [None]:
event[:3].cat.get_categories()

This is one way the `Categorical` type differs from the `String` type. It
ensures that missing categories are still represented when you, for example,
plot a categorical series.

### Broadcasting

If you use an arithmetic operator on a series, Polars **broadcasts** the
operation to each element:

In [None]:
x = pl.Series([1, 3, 0])
x - 3

The result is the same as if you had applied the operation element-by-element.
That is:

In [None]:
pl.Series([1 - 3, 3 - 3, 0 - 3])

Most NumPy (and SciPy) functions also broadcast. For instance:

In [None]:
import numpy as np

x = pl.Series([1.0, 3.0, 0.0, np.pi])
np.sin(x)

Some examples of functions that broadcast are `np.sin`, `np.cos`, `np.tan`,
`np.log`, `np.exp`, and `np.sqrt`.

NumPy functions that combine or aggregate values usually don't broadcast. For
example, `np.sum`, `np.mean`, and `np.median` don't broadcast.

Generally, you should:

* Use broadcasting with data structures that support it, such as series and
  NumPy arrays (explained in {ref}`sec-numpy-arrays`).
* Use comprehensions with lists and Python's other built-in data structures
  (explained in {ref}`sec-built-in-data-structures`).


A function can broadcast across multiple arguments. To demonstrate this,
suppose we want to estimate the number of nests per breeding pair for the least
terns data. The `total_nests` column contains the total number of nests at each
site, and the `bp_max` column contains the maximum reported number of breeding
pairs. So to compute nests per breeding pair:

In [None]:
terns["total_nests"] / terns["bp_max"]

The elements are paired up and divided according to their positions. Notice
that the result is a `Float64` series. The `total_nests` column is an `Int64`
series, so besides broadcasting, the example also demonstrates that series are
subject to implicit coercion (introduced in {ref}`sec-coercion-casting`).

If you try to broadcast a function across two series of different lengths,
Polars raises an error:

In [None]:
x = pl.Series([1, 2])
y = pl.Series([9, 8, 7])
x - y

### Missing Values

Most data sets are not perfect: some values are unusable, and some are missing for one reason or another. In the least terns data set, notice that some of the entries are `null`. For
instance, look at the second element of the `nonpred_eggs` column:

In [None]:
terns.head()

Polars uses `null`, called the **missing value**, to represent missing entries
in a data set. It’s implied that the entries are missing due to how the data
was collected, although there are exceptions. As an example, imagine the data
came from a survey, and respondents chose not to answer some questions. In the
data set, their answers for those questions can be recorded as `null`.

The missing value `null` is a chameleon: it can be an element of a series of
any type. Polars implicitly converts `null` to and from `None` when you get or
set an element in a series. This means you can use `None` to create a series
with `null` elements:

In [None]:
x = pl.Series([1, 2, None])
x

And you get back `None` if you access a `null` element:

In [None]:
terns["nonpred_eggs"][1]

The missing value `null` is also contagious: it represents an unknown quantity,
so computing on it usually produces another missing value. The idea is that if
the inputs to a computation are unknown, generally so is the output:

In [None]:
x - 3

Polars makes an exception for aggregation functions, which automatically filter
out missing values:

In [None]:
x.mean()

You can use the `.is_null` method to test if elements of a series are `null`:

In [None]:
x.is_null()

Polars also provides an `.is_not_null` method and a `.fill_null` method to fill
missing values with a different value.