# Structs

## Intro to Structs
- The `Struct` is a composite data type that is similar to a Python dictionary. It holds key-value pairs.
- The struct's keys are called **fields** (e.g., "the struct's `name` field has a value of `Boris`").
- Each field is a unique identifier for a corresponding value.
- A `Struct` enables a row entry to store multiple pieces of information.

### Create a Column of Structs
- Let's instantiate a `DataFrame` with a single column of structs.
- We'll pass the constructor a dictionary with the column names as keys and the column values as values.
- For our single column, we'll pass a list of dictionaries.

- The visual output will not include the struct's keys/fields (`name`, `calories`, and `price`). Polars still stores them internally.
- The shape of the `DataFrame` is 2 rows by 1 column.
- Each `DataFrame` row holds 1 value, the struct.
- The `struct[3]` type indicates that each struct holds 3 key-value pairs.

### The First Entry is the Source of Truth
- Polars uses the first dictionary in the list as the source of truth for the key-value pairs.
- The next example's list has a dictionary with 4 key-value pairs.
- But Polars uses the first dictionary (3 key-value pairs) to infer the schema (fields and types).

- The `schema` attribute will print out the columns and their associated types.
- The `Candies` column stores `Struct` values.
- Each `Struct` holds `name`, `calories`, and `price` fields.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/structs/
- https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Struct.html#polars.datatypes.Struct

## The struct.field and unnest Methods
- Polars nests struct methods within a `struct` attribute/namespace.
- The `field` method extracts the value for a given key in each struct.
- Polars will keep the struct key as the column name. Use `alias` to rename.

- The `struct.unnest` method extracts the struct's key-value pairs into separate columns.

- The `DataFrame` has a complementary `unnest` method that accepts the name(s) of the struct column(s) to unnest.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/structs/#extracting-individual-values-of-a-struct
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.struct.field.html
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.struct.unnest.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unnest.html

## The value_counts Method
- A core principle of Polars is that an expression applied to a column produces exactly one column.
- The one-to-one design creates consistency and predictability in method output.
- Say we wanted to count the number of occurrences of each distinct value in a column.
- There are two pieces of data to track: each unique value and the number of times it occurs.
- The `value_counts` method returns a column of structs. The struct data type supports multiple fields.

- Use `schema` to see the names and data types of the struct fields.
- Polars used the column name (`color`) for the struct field.
- Polars chose the name `count` for the struct field with the count of each value.
- The count can only be 0 or positive, so the method uses a `UInt32` type for `count`.

- Pass the `sort` parameter a value of `True` to sort the value counts in descending order.
- The value with the greatest number of occurrences will appear first.

- The `struct.unnest` method extracts each struct field to a separate column.
- The `DataFrame` supports a top-level `unnest` method as well.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/structs/#encountering-the-data-type-struct
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.value_counts.html

## Rename Struct Fields
- The `struct.rename_fields` method renames the struct fields.
- The method accepts a list of new field names.
- Access the `DataFrame`'s `schema` attribute to see the order of the struct fields.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/structs/#renaming-individual-fields-of-a-struct
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.struct.rename_fields.html

## Using Structs to Work with Nested Data
- Structs can contain other structs. Polars will utilize structs for nested data.
- JSON (JavaScript Object Notation) is another common data transfer format.
- The syntax is inspired by the JavaScript language.
- An "object" in this domain can be thought of as similar to a Python dictionary.
- Each object in the `ecommerce_orders.json` dataset has two top-level keys.
- Polars extracts the `order_id` and `shipping` keys to their own columns.
- The rows in the `shipping` column store structs with 2 fields.

- Polars supports nested structs. One field's value can be another struct.
- Each row in `shipping` is a struct that contains an `address` field.
- The `address` field stores a struct with `street`, `city`, and `zip` fields.

- Let's start by extracting the `method` and `address` fields to new columns.

- Next, we want to pull out the values from the `address` structs.

- We can rename the columns after unnesting or rename the struct fields before unnesting.

- Let's say we wanted to concatenate the address (street, city, and zip) in a single new column.

### Further Reading
- https://docs.pola.rs/user-guide/io/json/
- https://docs.pola.rs/api/python/stable/reference/api/polars.read_json.html
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.struct.field.html
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.struct.unnest.html

## Using Structs to Identify Duplicates across Columns
- Structs are useful for detecting duplicate combinations of values across several columns.
- For example, let's locate the orders that have the same `method` and the same `city`.

- The `filter` method filters rows based on a condition being met.
- The `is_duplicated` method returns `True` for every value that appears more than once in a column.

- The `pl.struct` function returns an expression that creates a struct from specified column values.

- Invoke the `is_duplicated` method on a struct to identify duplicates based on all struct values.
- We can use this strategy to identify duplicate rows based on matching values across multiple columns.
- The rows below have duplicate values in both `method` and `city`.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/structs/#identifying-duplicate-rows
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.struct.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_duplicated.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.filter.html