# LazyFrames

In [None]:
import polars as pl

## Eager vs. Lazy
- The word **eager** means "wanting to do something".
- The word **lazy** means "unwilling to do work".
- The terms are applicable to programming!
- All of the code we've written so far has been eager. Polars executes the code immediately.

## Lazy Evaluation
- Polars supports a lazy API that delays execution of query logic until we request it explicitly.
- If we construct a complex, multi-step query, Polars can optimize its execution because it knows all future steps.
- The Polars docs recommend using the lazy API once you've determined the correct code:
>The lazy API should be preferred unless you are either interested in the intermediate results or are doing exploratory work and don't know yet what your query is going to look like.

## Intro to LazyFrames
- A `LazyFrame` is a Polars object that represents a future computation.
- A `LazyFrame` is less like a table and more like a sequence of instructions for a query.
- A `LazyFrame` defers the execution of the query until we explicitly request it.
- By deferring execution, the `LazyFrame` can reason about the steps and optimize the query.
- The `collect` method executes the optimized query plan and returns the resulting `DataFrame`.

## Introducing the Dataset
- The `education_costs` dataset stores the tuitions for various college programs.

In [None]:
pl.read_csv("education_costs.csv")

## NOTE: LazyFrames and Visual Graphs
The next lesson creates a LazyFrame and renders it in visual form. If you'd like to see the graph in your Notebook, you'll need to install the GraphViz application. Note that this software is **optional**. The video will show you what the graph looks like.

You can download the GraphViz installer here:
https://graphviz.org/download/

## The scan_csv Function and the collect Method
- When we chain methods, each method returns a new `DataFrame` that the following method operates on.
- Our goal: find the average tuition, grouped by `Level`, for all American universities with a tuition above 50,000.
- The code below:
    - imports a complete CSV
    - creates a `DataFrame` with a subset of 5 of the 7 columns
    - filters rows based on two columns' (`Tuition`, `Country`) values
    - groups the data by the values in the `Level` column
    - calculates the average tuition for all groups

In [None]:
pl.read_csv("education_costs.csv").select(
    pl.col("Program", "Level", "University", "Tuition", "Country")
).filter(pl.col("Tuition") > 50000, pl.col("Country") == "USA").group_by("Level").agg(
    pl.col("Tuition").mean()
)

- The `pl.scan_csv` function creates a `LazyFrame`.
- The `LazyFrame` is a query plan: a strategy, a sequence of steps.
- The plan includes a step to read in the CSV file. Polars will _not_ read the file yet.
- The Ruff formatter may wrap complex method chains in parentheses.

In [None]:
education = (
    pl.scan_csv("education_costs.csv")
    .select(pl.col("Program", "Level", "University", "Tuition", "Country"))
    .filter(pl.col("Tuition") > 50000, pl.col("Country") == "USA")
    .group_by("Level")
    .agg(pl.col("Tuition").mean())
)

In [None]:
type(education)

- The standard text output of a `LazyFrame` is the unoptimized version of the query plan.
- The GraphViz utility needs to be installed to render the query plan in Jupyter Notebook.
- Read the query plan in reverse order, from the **bottom** to the **top**.
- Each box represents a stage/step.
- Polars identifies the steps as:
    - scanning 7 columns in the CSV (`*` means "all", `*/7` means "all of seven")
    - selecting 5 of those 7 columns
    - filtering the rows by the `Tuition > 50000` condition
    - filtering the rows by the `Country == "USA"` condition

In [None]:
education

- Let's follow the recommended instructions and invoke `show_graph` to see the optimized version.
- First up, let's pass the `optimized` parameter a value of `False`. This is the non-optimized query plan.
- The return of `show_graph` is the same as the printed output of the `LazyFrame`.

In [None]:
education.show_graph(optimized=False)

- Now, let's pass the `optimized` parameter a value of `True` to see the optimized query.
- Polars recognizes it only needs to read 5 columns of 7 from the CSV file (**projection pushdown** optimization).
- Polars recognizes it only needs to import the rows that fit the filter criteria (**predicate pushdown** optimization).

In [None]:
education.show_graph(optimized=True)

- Invoke the `collect` method on a `LazyFrame` to execute the query and gather the results in a `DataFrame`.
- The resulting `DataFrame` will be identical to the original eager evaluation...
- ...but the query/calculation will likely be much faster/more efficient.

In [None]:
education.collect()

- Note that the `collect` method result is not cached.
- In other words, Polars will re-execute the query every time we call `collect` on the `LazyFrame`.

In [None]:
education.collect()

### Further Reading
- https://docs.pola.rs/user-guide/concepts/lazy-api/
- https://docs.pola.rs/user-guide/concepts/lazy-api/#previewing-the-query-plan
- https://docs.pola.rs/api/python/stable/reference/api/polars.scan_csv.html
- https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.show_graph.html
- https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.collect.html
- https://docs.pola.rs/user-guide/lazy/using/#using-the-lazy-api-from-a-file
- https://docs.pola.rs/user-guide/lazy/optimizations/
- https://docs.pola.rs/user-guide/lazy/query-plan/#graphviz-visualization
- https://docs.pola.rs/user-guide/lazy/sources_sinks/#scan

## A Matter of Time
- The `%%timeit` magic runs a cell repeatedly to calculate its average runtime.
- The complementary `%%time` magic times the single execution of a cell.
- The lazy evaluation should generally execute faster than the eager equivalent.
- The difference between the two may seem small but grows with the size of the dataset.
- `μs` is the symbol for microseconds (one millionth of a second).

In [None]:
%%timeit

education = (
    pl.read_csv("education_costs.csv")
    .select(pl.col("Program", "Level", "University", "Tuition", "Country"))
    .filter(pl.col("Tuition") > 50000, pl.col("Country") == "USA")
    .group_by("Level")
    .agg(pl.col("Tuition").mean())
)
education

In [None]:
%%timeit

education = (
    pl.scan_csv("education_costs.csv")
    .select(pl.col("Program", "Level", "University", "Tuition", "Country"))
    .filter(pl.col("Tuition") > 50000, pl.col("Country") == "USA")
    .group_by("Level")
    .agg(pl.col("Tuition").mean())
    .collect()
)
education

In [None]:
699 / 838

## Convert a DataFrame to a LazyFrame

In [None]:
education = (
    pl.read_csv("education_costs.csv")
    .select(pl.col("Program", "Level", "University", "Tuition", "Country"))
    .filter(pl.col("Tuition") > 50000, pl.col("Country") == "USA")
    .group_by("Level")
    .agg(pl.col("Tuition").mean())
)

education = pl.read_csv("education_costs.csv")
education.head(2)

- Invoke the `lazy` method on a `DataFrame` to convert it into a `LazyFrame`.
- In this example, Polars has already imported the CSV, but we can still potentially gain efficiencies for _future_ steps.
- Computation will not occur until we invoke the `collect` method on the `LazyFrame`.

In [None]:
education.lazy()

In [None]:
type(education.lazy())

- The optimized evaluation will select 3 of 7 columns, then perform two consecutive filters in one traversal.
- The query plan no longer includes the import/read CSV step.

In [None]:
education.lazy().select(
    pl.col("Program", "Level", "University", "Tuition", "Country")
).filter(pl.col("Tuition") > 50000, pl.col("Country") == "USA").group_by("Level").agg(
    pl.col("Tuition").mean()
).show_graph()

- Invoke the `collect` method to run the computations and return the `DataFrame`.

In [None]:
education.lazy().select(
    pl.col("Program", "Level", "University", "Tuition", "Country")
).filter(pl.col("Tuition") > 50000, pl.col("Country") == "USA").group_by("Level").agg(
    pl.col("Tuition").mean()
).collect()

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.lazy.html
- https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.collect.html

## LazyFrame Limitations
- The query optimizer must be able to infer the schema of the potential `DataFrame` at every step.
- If an operation results in a dynamic/unpredictable schema, the lazy API will not support it.

In [None]:
education = pl.read_csv("education_costs.csv")
education.head(2)

- The `pivot` method transforms a tall dataset into a wide one.
- Say we wanted to calculate the average tuition of every program (row values), organized by degree (column values).
- The `pivot` method returns a pivot table with dynamic columns, one per unique `Level` value.
- The `pivot` method requires reading through the rows of the `DataFrame`.

In [None]:
education.pivot(
    on="Level", index="Program", values="Tuition", aggregate_function="mean"
).filter(
    pl.col("Master").is_not_null(),
    pl.col("Bachelor").is_not_null(),
    pl.col("PhD").is_not_null(),
)

- A `LazyFrame` does not read the data so it cannot know what values it will encounter.
- A `LazyFrame` cannot infer the 3 column names in advance; they will vary based on the dataset's values.
- As a result, the `LazyFrame` does not support the `pivot` method.

In [None]:
# education.lazy().pivot(
#     on="Level", index="Program", values="Tuition", aggregate_function="mean"
# ).filter(
#     pl.col("Master").is_not_null(),
#     pl.col("Bachelor").is_not_null(),
#     pl.col("PhD").is_not_null(),
# )

- To solve the problem, convert back and forth between a `LazyFrame` and a `DataFrame`.

In [None]:
education.pivot(
    on="Level", index="Program", values="Tuition", aggregate_function="mean"
).lazy().filter(
    pl.col("Master").is_not_null(),
    pl.col("Bachelor").is_not_null(),
    pl.col("PhD").is_not_null(),
).collect()

- Even with this example, Polars can optimize the query by only traversing the rows once for the 3 filters.

In [None]:
education.pivot(
    on="Level", index="Program", values="Tuition", aggregate_function="mean"
).lazy().filter(
    pl.col("Master").is_not_null(),
    pl.col("Bachelor").is_not_null(),
    pl.col("PhD").is_not_null(),
).show_graph()

### Further Reading
- https://docs.pola.rs/user-guide/transformations/pivot/#lazy
- https://docs.pola.rs/user-guide/lazy/schemas/#the-lazy-api-must-know-the-schema
- https://docs.pola.rs/user-guide/lazy/schemas/#dealing-with-operations-not-available-in-the-lazy-api