# Lazy mode 1: Introducing lazy mode
By the end of this lecture you will be able to:
- create a `LazyFrame` from a CSV file
- explain the difference between a `DataFrame` and a `LazyFrame`
- print the optimized query plan

Lazy mode is crucial to taking full advantage of Polars because it enables query optimisation and streaming of large datasets. We introduce lazy mode in this lesson and revisit it throughout the course.

## Code or queries?
Data analysis often involves multiple steps:
- loading data from a file or database
- transforming the data
- grouping by a column
- ...

We call the set of steps a **query**.

We can write some lines of code that carry out a query step-by-step in eager mode.

There are two problems with this approach:
- Each line of code is not aware of what the others are doing.
- Each line of code may materialise a full intermediate dataframe, which costs time and memory.

We can instead write the steps as an integrated query in lazy mode.

With an integrated query:
- a query optimizer can identify efficiencies
- a query engine can minimise the memory usage and produce a single output

## So what are eager and lazy modes?

**Eager mode**: each line of code is run as soon as it is encountered.

**Lazy mode**: each line is added to a query plan and the query plan is optimized.

In [None]:
import polars as pl

In [None]:
csv_file = "../data/titanic.csv"

## `DataFrames` and `LazyFrames`
We **read** a CSV in eager mode with `pl.read_csv`. This creates a **`DataFrame`**.

In [None]:
df_eager = pl.read_csv(csv_file)
df_eager.head(2)

We **scan** a CSV in lazy mode with `pl.scan_csv`. This creates a **`LazyFrame`**.

In [None]:
df_lazy = pl.scan_csv(csv_file)
df_lazy

When we scan a CSV, Polars:
- opens the file
- gets the column names from the header row
- infers the type of each column from the first 100 rows (by default)

We can get the dtype schema of a `LazyFrame`. This is a mapping from column names to dtypes.

In [None]:
df_lazy.schema

We cannot get the shape of the `LazyFrame` as Polars does not know how many rows there are from a CSV scan.

We evaluate a lazy query by calling `collect`. We learn more about this in the next lecture.

### Creating a LazyFrame from data
We can also create a `LazyFrame` directly from a constructor with some data.

In [None]:
pl.LazyFrame({"values":[0,1,2]})

Or we can call `.lazy` on a `DataFrame`.

In [None]:
pl.DataFrame({"values":[0,1,2]}).lazy()

### What's the difference between a `DataFrame` and a `LazyFrame`?

If we print a `DataFrame` we see data...

In [None]:
df_eager.head(2)

...but if we print a `LazyFrame` we see a **query plan**

**Key message: a method on a `DataFrame` acts on the data. A method on a `LazyFrame` acts on the query plan**.

## Operations on a `DataFrame` and a `LazyFrame` 
To show the difference between operations on a `DataFrame` and a `LazyFrame` we rename the `PassengerId` column to `Id` using `rename`.

On a `DataFrame` we see the first column is renamed...

In [None]:
(
    df_eager
    .rename({"PassengerId":"Id"})
    .head(2)
)    

while on a `LazyFrame` we see that a `RENAME` step is added to the query plan

In [None]:
(
    df_lazy
    .rename({"PassengerId":"Id"})
)    

## Chaining or re-assigning?
In this course we typically run operations with method chaining like this:

In [None]:
(
    pl.scan_csv(csv_file)
    .rename({"PassengerId":"Id"})
)    

However, we can also do operations by re-assigning the variable in each step:

In [None]:
df_lazy = pl.scan_csv(csv_file)
df_lazy = df_lazy.rename({"PassengerId":"Id"})

The two methods are equivalent.

## Query optimisation
Polars creates a *naive query plan* from your query.

Polars passes the naive query plan to its **query optimizer**. The query optimizer looks for more efficient ways to arrive at the output you want.

The `explain` method returns the optimized plan as a string. We wrap it in a `print` call so that the line breaks are rendered correctly.

In [None]:
print(
    pl.scan_csv(csv_file)
    .explain()
)

The query plan is always read bottom-to-top. In this simple case the query plan shows that we:
- scan the CSV file
- select all 12 of the columns (*/12*)

and the output is a `DataFrame`

## What query optimizations are applied?
Query optimizations aren't magic. Most optimizations could be implemented by users in a well-written query if the user:
- knows the optimization exists 
- remembers to implement the optimization and 
- implements the optimization correctly!

Optimizations applied by Polars include:
- `projection pushdown`: limit the number of columns read to those required
- `predicate pushdown`: apply filter conditions as early as possible
- `combine predicates`: combine multiple filter conditions
- `slice pushdown`: limit rows processed when limited rows are required
- `common subplan elimination`: run duplicated transformations on the same data once and then re-use
- `common subexpression elimination`: duplicated expressions are cached and re-used

We see how most of these optimisations arise later in the course.

### Common subexpression elimination
Here we see how common subexpression elimination works. Polars identifies where the same expression is calculated more than once and caches the first output so that it can be re-used.

In this example we have a lazy query where we scan the Titanic CSV file. We then:
- use `select` to output a subset of columns
- create a first expression which has the mean age minus one standard deviation
- create a second expression which has the mean age
- create a third expression which has the mean age plus one standard deviation
- evaluate the query with `.collect`

In [None]:
(
    pl.scan_csv(csv_file)
    .select(
        (pl.col("Age").mean() - pl.col("Age").std()).alias("minus_one_std"),
        pl.col("Age").mean().alias("mean"),
        (pl.col("Age").mean() + pl.col("Age").std()).alias("plus_one_std"),
    )
    .collect()
)              

In this query we use the `pl.col("Age").mean()` and `pl.col("Age").std()` expressions repeatedly. If we print the optimised query plan with `.explain` we can see that Polars applies the common subexpression elimination optimisation.

In [None]:
print(
    pl.scan_csv(csv_file)
    .select(
        (pl.col("Age").mean() - pl.col("Age").std()).alias("minus_one_std"),
        pl.col("Age").mean().alias("mean"),
        (pl.col("Age").mean() + pl.col("Age").std()).alias("plus_one_std"),
    )
    .explain()
)               

This query plan has two blocks separated by `FROM`.

Within the upper `SELECT` block we see that the repeated expressions are referred to by names of the form `__POLARS_CSER_X`, with one code for the mean expression and one for the standard deviation expression. This shows that Polars has identified each of these as a shared sub-expression across the three expressions in the `SELECT` block.

Polars also implements other optimisations such as fast-path algorithms on sorted data (separate from the query optimiser). We learn more about these later in the course.

## Exercises

In the exercises you will develop your understanding of:
- creating a `LazyFrame` from a CSV file
- getting metadata from a `LazyFrame`
- printing the query plans

### Exercise 1
Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [None]:
df = pl.<blank>

Check to see which of the following metadata you can get from a `LazyFrame`:
- number of rows
- column names
- schema

Create a lazy query where you scan the Titanic CSV file and then select the `Name` and `Age` columns.

In [None]:
(
    pl.scan_csv(csv_file)
    <blank>
)

Print out the optimised query plan for this query

## Solutions

### Solution to Exercise 1

Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [None]:
df = pl.scan_csv(csv_file)

A `LazyFrame` does not know the number of rows in a CSV, so accessing `shape` raises an `AttributeError`.

In [None]:
df.shape

A `LazyFrame` does know the column names. As we will see in the I/O section, Polars reads the first row of the CSV file to get the column names in `pl.scan_csv`.

In [None]:
df.columns

In [None]:
df.schema

Create a lazy query where you scan the Titanic CSV file and then select the `Name` and `Age` columns.

In [None]:
(
    pl.scan_csv(csv_file)
    .select("Name","Age")
)   

Print out the optimised query plan for this query

In [None]:
print(
    pl.scan_csv(csv_file)
    .select("Name","Age")
    .explain()
)   