# Structs

In [1]:
import polars as pl

## Intro to Structs
- The `Struct` is a composite data type that is similar to a Python dictionary. It holds key-value pairs.
- The struct's keys are called **fields** (e.g., "the struct's `name` field has a value of `Boris`").
- Each field is a unique identifier for a corresponding value.
- A `Struct` enables a row entry to store multiple pieces of information.

### Create a Column of Structs
- Let's instantiate a `DataFrame` with a single column of structs.
- We'll pass the constructor a dictionary with the column names as keys and the column values as values.
- For our single column, we'll pass a list of dictionaries.

In [2]:
pl.DataFrame(
    {
        "Candies": [
            {"name": "Snickers", "calories": 200, "price": 1.99},
            {"name": "3 Musketeers", "calories": 250, "price": 2.49},
        ]
    }
)

Candies
struct[3]
"{""Snickers"",200,1.99}"
"{""3 Musketeers"",250,2.49}"


- The visual output will not include the struct's keys/fields (`name`, `calories`, and `price`). Polars still stores them internally.
- The shape of the `DataFrame` is 2 rows by 1 column.
- Each `DataFrame` row holds 1 value, the struct.
- The `struct[3]` type indicates that each struct holds 3 key-value pairs.

### The First Entry is the Source of Truth
- Polars uses the first dictionary in the list as the source of truth for the struct's fields.
- The next example's list has a dictionary with 4 key-value pairs.
- But Polars uses the first dictionary (3 key-value pairs) to infer the schema (fields and types).

In [3]:
candies = pl.DataFrame(
    {
        "Candies": [
            {"name": "Snickers", "calories": 200, "price": 1.99},
            {"name": "3 Musketeers", "calories": 250, "price": 2.49, "delicious": True},
        ]
    }
)
candies

Candies
struct[3]
"{""Snickers"",200,1.99}"
"{""3 Musketeers"",250,2.49}"


- The `schema` attribute maps each column name to its data type.
- The `Candies` column stores `Struct` values.
- Each `Struct` holds `name`, `calories`, and `price` fields.

In [4]:
candies.schema

Schema([('Candies',
         Struct({'name': String, 'calories': Int64, 'price': Float64}))])

### Further Reading
- https://docs.pola.rs/user-guide/expressions/structs/
- https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Struct.html#polars.datatypes.Struct

## The struct.field and unnest Methods
- Polars nests struct methods within a `struct` attribute/namespace.
- The `field` method extracts the value for a given key in each struct.
- Polars will keep the struct key as the column name. Use `alias` to rename.

In [5]:
candies = pl.DataFrame(
    {
        "Candies": [
            {"name": "Snickers", "calories": 200, "price": 1.99},
            {"name": "3 Musketeers", "calories": 250, "price": 2.49},
        ]
    }
)
candies

Candies
struct[3]
"{""Snickers"",200,1.99}"
"{""3 Musketeers"",250,2.49}"


In [6]:
candies.with_columns(pl.col("Candies").struct.field("name"))

candies.with_columns(pl.col("Candies").struct.field("calories"))

Candies,calories
struct[3],i64
"{""Snickers"",200,1.99}",200
"{""3 Musketeers"",250,2.49}",250


- The `struct.unnest` method extracts the struct's key-value pairs into separate columns.

In [7]:
candies.with_columns(pl.col("Candies").struct.unnest())

Candies,name,calories,price
struct[3],str,i64,f64
"{""Snickers"",200,1.99}","""Snickers""",200,1.99
"{""3 Musketeers"",250,2.49}","""3 Musketeers""",250,2.49


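- The `DataFrame` has a complementary `unnest` method. It takes the column to unnest (here a `pl.col` expression; a plain column name also works) and replaces the struct column with its fields.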

In [8]:
candies.unnest(pl.col("Candies"))

name,calories,price
str,i64,f64
"""Snickers""",200,1.99
"""3 Musketeers""",250,2.49


### Further Reading
- https://docs.pola.rs/user-guide/expressions/structs/#extracting-individual-values-of-a-struct
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.struct.field.html
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.struct.unnest.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unnest.html

## The value_counts Method
- A core principle of Polars is that an operation on a column produces a single new column.
- The one-to-one design creates consistency and predictability in method output.
- Say we wanted to count the number of occurrences of each distinct value in a column.
- There are two pieces of data to track: each unique value and the number of times it occurs.
- The `value_counts` method returns a column of structs. The struct data type supports multiple fields.

In [9]:
colors = pl.read_csv("colors.csv")
colors

color
str
"""Red"""
"""Green"""
"""Orange"""
"""Green"""
"""Blue"""
"""Blue"""
"""Red"""
"""Red"""


In [10]:
colors.select(pl.col("color").value_counts().alias("data"))

data
struct[2]
"{""Red"",3}"
"{""Green"",2}"
"{""Blue"",2}"
"{""Orange"",1}"


- Use `schema` to see the names and data types of the struct fields.
- Polars used the column name (`color`) for the struct field.
- Polars chose the name `count` for the struct field with the count of each value.
- The count can only be 0 or positive, so the method uses a `UInt32` type for `count`.

In [11]:
colors.select(pl.col("color").value_counts().alias("data")).schema

Schema([('data', Struct({'color': String, 'count': UInt32}))])

- Pass the `sort` parameter a value of `True` to sort the value counts in descending order.
- The value with the greatest number of occurrences will appear first.

In [12]:
colors.select(pl.col("color").value_counts(sort=True).alias("data"))

data
struct[2]
"{""Red"",3}"
"{""Green"",2}"
"{""Blue"",2}"
"{""Orange"",1}"


- The `struct.unnest` method extracts each struct field to a separate column.
- The `DataFrame` supports a top-level `unnest` method as well.

In [13]:
colors.select(pl.col("color").value_counts(sort=True).alias("data")).select(
    pl.col("data").struct.unnest()
)

color,count
str,u32
"""Red""",3
"""Green""",2
"""Blue""",2
"""Orange""",1


In [14]:
colors.select(pl.col("color").value_counts(sort=True).alias("data")).unnest(
    pl.col("data")
)

color,count
str,u32
"""Red""",3
"""Green""",2
"""Blue""",2
"""Orange""",1


### Further Reading
- https://docs.pola.rs/user-guide/expressions/structs/#encountering-the-data-type-struct
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.value_counts.html

## Rename Struct Fields
- The `struct.rename_fields` method renames the struct fields.
- The method accepts a list of new field names and applies them to the existing fields in order.
- Access the `DataFrame`'s `schema` attribute to see the order of the struct fields.

In [15]:
colors = pl.read_csv("colors.csv")
colors.head(3)

color
str
"""Red"""
"""Green"""
"""Orange"""


In [16]:
counts = colors.select(pl.col("color").value_counts().alias("data"))
counts

data
struct[2]
"{""Red"",3}"
"{""Green"",2}"
"{""Blue"",2}"
"{""Orange"",1}"


In [17]:
counts.schema

Schema([('data', Struct({'color': String, 'count': UInt32}))])

In [18]:
counts = counts.select(pl.col("data").struct.rename_fields(["Hue", "Occurrences"]))

In [19]:
counts

data
struct[2]
"{""Red"",3}"
"{""Green"",2}"
"{""Blue"",2}"
"{""Orange"",1}"


In [20]:
counts.schema

Schema([('data', Struct({'Hue': String, 'Occurrences': UInt32}))])

### Further Reading
- https://docs.pola.rs/user-guide/expressions/structs/#renaming-individual-fields-of-a-struct
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.struct.rename_fields.html

## Using Structs to Work with Nested Data
- Structs can contain other structs. Polars will utilize structs for nested data.
- JSON (JavaScript Object Notation) is another common data transfer format.
- The syntax is inspired by the JavaScript language.
- An "object" in this domain can be thought of as similar to a Python dictionary.
- Each object in the `ecommerce_orders.json` dataset has two top-level keys.
- Polars extracts the `order_id` and `shipping` keys to their own columns.
- The rows in the `shipping` column store structs with 2 fields.

In [21]:
orders = pl.read_json("ecommerce_orders.json")
orders.head(2)

order_id,shipping
str,struct[2]
"""a7c3e947-e16b-458e-9af3-e2f83a…","{""USPS Priority"",{""7794 Melissa Knolls Suite 837"",""Barnesfurt"",""16133""}}"
"""698e111f-2d5f-4211-a686-4ee785…","{""USPS Priority"",{""5643 Henderson Hollow"",""North Hannahbury"",""20421""}}"


- Polars supports nested structs. One field's value can be another struct.
- Each row in `shipping` is a struct that contains an `address` field.
- The `address` field stores a struct with `street`, `city`, and `zip` fields.

In [22]:
orders.schema

Schema([('order_id', String),
        ('shipping',
         Struct({'method': String, 'address': Struct({'street': String, 'city': String, 'zip': String})}))])

- Let's start by extracting the `method` and `address` fields to new columns.

In [23]:
orders.unnest(pl.col("shipping"))

orders.select(pl.col("order_id"), pl.col("shipping").struct.unnest())

order_id,method,address
str,str,struct[3]
"""a7c3e947-e16b-458e-9af3-e2f83a…","""USPS Priority""","{""7794 Melissa Knolls Suite 837"",""Barnesfurt"",""16133""}"
"""698e111f-2d5f-4211-a686-4ee785…","""USPS Priority""","{""5643 Henderson Hollow"",""North Hannahbury"",""20421""}"
"""e9911fc0-3883-45db-8e36-770394…","""DHL Express""","{""305 Roy Mountains"",""Port Jennifer"",""55220""}"
"""e93537b0-c187-45c8-bbfd-34e76f…","""UPS Ground""","{""556 Michael Fort"",""Kristinfort"",""33956""}"
"""f01d311f-ba49-4602-b462-e4cfd7…","""UPS Ground""","{""78038 Alec Ridges"",""Lake Nicolebury"",""91068""}"
…,…,…
"""b1b569b0-166f-4dc7-82c8-7487d5…","""USPS Priority""","{""608 Mcdonald Shore"",""Port Kimberlyshire"",""19714""}"
"""da97d9b9-1dc8-4b16-a5c9-202c09…","""USPS Priority""","{""14420 Matthew Harbors"",""East Daniel"",""96207""}"
"""5fff92c6-2fd2-4992-8ca0-506a1b…","""FedEx""","{""9285 Justin Dam Apt. 796"",""Shannonton"",""78317""}"
"""b6f19a70-5bb5-4976-9a9c-75e239…","""DHL Express""","{""983 Darren Walk"",""Payneborough"",""40260""}"


- Next, we want to pull out the values from the `address` structs.

In [24]:
orders.unnest(pl.col("shipping")).unnest(pl.col("address"))

orders.select(pl.col("order_id"), pl.col("shipping").struct.unnest()).select(
    pl.col("order_id"), pl.col("method"), pl.col("address").struct.unnest()
)

order_id,method,street,city,zip
str,str,str,str,str
"""a7c3e947-e16b-458e-9af3-e2f83a…","""USPS Priority""","""7794 Melissa Knolls Suite 837""","""Barnesfurt""","""16133"""
"""698e111f-2d5f-4211-a686-4ee785…","""USPS Priority""","""5643 Henderson Hollow""","""North Hannahbury""","""20421"""
"""e9911fc0-3883-45db-8e36-770394…","""DHL Express""","""305 Roy Mountains""","""Port Jennifer""","""55220"""
"""e93537b0-c187-45c8-bbfd-34e76f…","""UPS Ground""","""556 Michael Fort""","""Kristinfort""","""33956"""
"""f01d311f-ba49-4602-b462-e4cfd7…","""UPS Ground""","""78038 Alec Ridges""","""Lake Nicolebury""","""91068"""
…,…,…,…,…
"""b1b569b0-166f-4dc7-82c8-7487d5…","""USPS Priority""","""608 Mcdonald Shore""","""Port Kimberlyshire""","""19714"""
"""da97d9b9-1dc8-4b16-a5c9-202c09…","""USPS Priority""","""14420 Matthew Harbors""","""East Daniel""","""96207"""
"""5fff92c6-2fd2-4992-8ca0-506a1b…","""FedEx""","""9285 Justin Dam Apt. 796""","""Shannonton""","""78317"""
"""b6f19a70-5bb5-4976-9a9c-75e239…","""DHL Express""","""983 Darren Walk""","""Payneborough""","""40260"""


- We can rename the columns after unnesting or rename the struct fields before unnesting.

In [25]:
orders.select(pl.col("order_id"), pl.col("shipping").struct.unnest()).select(
    pl.col("order_id"),
    pl.col("method"),
    pl.col("address")
    .struct.rename_fields(["Street", "City", "Zip"])
    .struct.unnest(),
)

order_id,method,Street,City,Zip
str,str,str,str,str
"""a7c3e947-e16b-458e-9af3-e2f83a…","""USPS Priority""","""7794 Melissa Knolls Suite 837""","""Barnesfurt""","""16133"""
"""698e111f-2d5f-4211-a686-4ee785…","""USPS Priority""","""5643 Henderson Hollow""","""North Hannahbury""","""20421"""
"""e9911fc0-3883-45db-8e36-770394…","""DHL Express""","""305 Roy Mountains""","""Port Jennifer""","""55220"""
"""e93537b0-c187-45c8-bbfd-34e76f…","""UPS Ground""","""556 Michael Fort""","""Kristinfort""","""33956"""
"""f01d311f-ba49-4602-b462-e4cfd7…","""UPS Ground""","""78038 Alec Ridges""","""Lake Nicolebury""","""91068"""
…,…,…,…,…
"""b1b569b0-166f-4dc7-82c8-7487d5…","""USPS Priority""","""608 Mcdonald Shore""","""Port Kimberlyshire""","""19714"""
"""da97d9b9-1dc8-4b16-a5c9-202c09…","""USPS Priority""","""14420 Matthew Harbors""","""East Daniel""","""96207"""
"""5fff92c6-2fd2-4992-8ca0-506a1b…","""FedEx""","""9285 Justin Dam Apt. 796""","""Shannonton""","""78317"""
"""b6f19a70-5bb5-4976-9a9c-75e239…","""DHL Express""","""983 Darren Walk""","""Payneborough""","""40260"""


- Let's say we wanted to concatenate the address (street, city, and zip) in a single new column.

In [26]:
address = pl.col("shipping").struct.field("address")

orders.select(
    pl.format(
        "{}, {} {}",
        address.struct.field("street"),
        address.struct.field("city"),
        address.struct.field("zip"),
    )
)

street
str
"""7794 Melissa Knolls Suite 837,…"
"""5643 Henderson Hollow, North H…"
"""305 Roy Mountains, Port Jennif…"
"""556 Michael Fort, Kristinfort …"
"""78038 Alec Ridges, Lake Nicole…"
…
"""608 Mcdonald Shore, Port Kimbe…"
"""14420 Matthew Harbors, East Da…"
"""9285 Justin Dam Apt. 796, Shan…"
"""983 Darren Walk, Payneborough …"


### Further Reading
- https://docs.pola.rs/user-guide/io/json/
- https://docs.pola.rs/api/python/stable/reference/api/polars.read_json.html
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.struct.field.html
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.struct.unnest.html

## Using Structs to Identify Duplicates across Columns
- Structs are useful for detecting duplicate combinations of values across several columns.
- For example, let's locate the orders that have the same `method` and the same `city`.

In [27]:
orders = pl.read_json("ecommerce_orders.json").unnest("shipping").unnest("address")
orders.head(2)

order_id,method,street,city,zip
str,str,str,str,str
"""a7c3e947-e16b-458e-9af3-e2f83a…","""USPS Priority""","""7794 Melissa Knolls Suite 837""","""Barnesfurt""","""16133"""
"""698e111f-2d5f-4211-a686-4ee785…","""USPS Priority""","""5643 Henderson Hollow""","""North Hannahbury""","""20421"""


- The `filter` method filters rows based on a condition being met.
- The `is_duplicated` method returns `true` for every value that occurs more than once in the column, including the first occurrence.

In [28]:
orders.filter(pl.col("zip").is_duplicated())

order_id,method,street,city,zip
str,str,str,str,str
"""8be1c16e-ba26-4004-a148-5ecbc5…","""USPS Priority""","""77986 Harrison Dam Suite 134""","""New Dawn""","""16509"""
"""97a4837f-4a53-40ac-aa4e-b4cd33…","""DHL Express""","""16845 Booth Ridge Apt. 023""","""North Robert""","""95268"""
"""dc35dba1-f832-4edf-a6ed-1d13d3…","""UPS Ground""","""0825 Sara Light Suite 341""","""Richardmouth""","""16509"""
"""f755a120-f673-495d-bd52-037e17…","""UPS Ground""","""676 Jennifer Groves Apt. 357""","""Chamberschester""","""25537"""
"""4a31736c-3088-4c41-8d53-f145ef…","""DHL Express""","""38560 Kara Summit Suite 150""","""New Diane""","""25537"""
"""fd2c3308-6aae-4d88-9c68-62d8f0…","""USPS Priority""","""627 Cisneros Estate Suite 860""","""North Timothy""","""95268"""


- The `pl.struct` function returns an expression that creates a struct from specified column values.

In [29]:
orders.select(
    pl.struct("method", "city"),
    pl.struct("method", "city").is_duplicated().alias("is_duplicate"),
)

method,is_duplicate
struct[2],bool
"{""USPS Priority"",""Barnesfurt""}",false
"{""USPS Priority"",""North Hannahbury""}",false
"{""DHL Express"",""Port Jennifer""}",false
"{""UPS Ground"",""Kristinfort""}",false
"{""UPS Ground"",""Lake Nicolebury""}",false
…,…
"{""USPS Priority"",""Port Kimberlyshire""}",false
"{""USPS Priority"",""East Daniel""}",false
"{""FedEx"",""Shannonton""}",false
"{""DHL Express"",""Payneborough""}",false


- Invoke the `is_duplicated` method on a struct to identify duplicates based on all struct values.
- We can use this strategy to identify duplicate rows based on matching values across multiple columns.
- Each row below shares its combination of `method` and `city` with at least one other row.

In [30]:
orders.filter(pl.struct("method", "city").is_duplicated()).sort("method", "city")

order_id,method,street,city,zip
str,str,str,str,str
"""83879b15-f040-4284-994c-dc143f…","""FedEx""","""82332 David Course Suite 689""","""Andrewview""","""88079"""
"""495e999a-ee22-43c4-af17-0d53c4…","""FedEx""","""385 Tara Prairie Suite 793""","""Andrewview""","""96705"""
"""895ab7bf-a1be-4ba6-a245-aa0b97…","""FedEx""","""62286 Smith Mills""","""Johnsonshire""","""79703"""
"""5dfee7f6-b1dd-4aca-be80-ce6f70…","""FedEx""","""251 Johnson Ways Apt. 044""","""Johnsonshire""","""85790"""
"""4effbab2-0cc2-4049-9f13-7f88d4…","""FedEx""","""5939 Nguyen Manor Apt. 606""","""Lake Jared""","""44754"""
"""4015e987-e134-446c-9bdb-cd448f…","""FedEx""","""0961 Austin Village Apt. 778""","""Lake Jared""","""37370"""
"""8b131ad3-9b6a-4a56-820d-3d0089…","""UPS Ground""","""91698 Steele Glen""","""Kellybury""","""77654"""
"""4e16ee8e-1cc1-4827-9ccd-149fdb…","""UPS Ground""","""59504 Henderson Corner Apt. 85…","""Kellybury""","""64349"""
"""f0df3a2c-6d27-4315-9cdc-72a2b1…","""UPS Ground""","""58325 Kenneth Land""","""North Melissa""","""87081"""
"""b227dd55-d217-4d2b-bf5a-e04539…","""UPS Ground""","""5465 Herman Turnpike Suite 050""","""North Melissa""","""11868"""


### Further Reading
- https://docs.pola.rs/user-guide/expressions/structs/#identifying-duplicate-rows
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.struct.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_duplicated.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.filter.html