# Lazy mode 2: evaluating queries

By the end of this lecture you will be able to:
- trigger evaluation of a `LazyFrame`
- evaluate a `LazyFrame` in streaming mode
- convert a `DataFrame` to a `LazyFrame`

We can also evaluate a `LazyFrame` and profile how long each part of the query takes. We cover this in the lecture on `LazyGroupby` in Section 6.

In [4]:
import polars as pl

In [5]:
csv_file = "../data/titanic.csv"

Create a `LazyFrame` with `pl.scan_csv`

In [6]:
df = pl.scan_csv(csv_file)
df

## Triggering evaluation of a `LazyFrame`


When we trigger evaluation we convert a `LazyFrame` to a `DataFrame`.

### Full evaluation

To trigger evaluation of the full output we call `.collect` 

In [7]:
(
    df
    .collect()
    .head()
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs.…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. Wil…","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


### Partial evaluation

To trigger evaluation on a limited number of rows we call `.fetch`. The argument is the number of rows Polars should aim to read from the source, so a query that filters rows may return fewer rows than requested

In [8]:
(
    df
    .fetch(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


### When do you use `fetch` instead of `collect`?

The `fetch` method is useful during development and debugging to avoid running the query on a large dataset.

## Evaluating larger-than-memory queries in streaming mode
By default when we evaluate a `LazyFrame` Polars works with the entire `DataFrame` in memory. If the query requires more memory than we have available we may be able to evaluate the query in *streaming* mode.

In streaming mode Polars processes the query **in chunks** instead of all-at-once. This allows Polars to work with datasets that are larger than memory.
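
To make the idea of chunked processing concrete, here is a conceptual sketch in plain Python. It is not how Polars implements streaming; it only shows that an aggregate such as a mean can be computed while holding one chunk of rows in memory at a time. The function name `chunked_mean`, the chunk size and the sample data are all hypothetical.

```python
import csv
import io
from itertools import islice

# Hypothetical in-memory stand-in for a large CSV file
csv_text = "Age\n22\n38\n26\n35\n35\n"


def chunked_mean(lines, column, chunk_size=2):
    """Mean of a numeric column, reading chunk_size rows at a time."""
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    while True:
        # Only this chunk of rows is held in memory
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count


mean_age = chunked_mean(io.StringIO(csv_text), "Age")
```

Only the running total and the row count persist between chunks, which is why this kind of aggregation fits in memory regardless of file size.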

We tell Polars to use streaming with the `streaming` argument to `collect` or `fetch` 

In [9]:
(
    pl.scan_csv(csv_file)
    .collect(streaming=True)
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


Streaming is not supported for all operations. However, many key operations such as `filter`, `groupby` and `join` support streaming. If streaming is not possible then Polars will run the query without streaming.

> We return to streaming when we look at input from CSV and Parquet files in the Input/Output section. However, if you want to reduce memory consumption in your queries go ahead and try it out on your own data with `streaming=True`. In this blog post I suggest that turning streaming mode on can be a good default for many use cases (this is what I do for my ML pipelines): https://www.rhosignal.com/posts/polars-dont-fear-streaming/

## Turning a `DataFrame` into a `LazyFrame`

In some cases we have a `DataFrame` and want to convert it to a `LazyFrame`.

We may want to save intermediate values from a query. In that case we trigger evaluation to create a `DataFrame` that we can save, then convert back to lazy mode.

Our query might also contain a transformation that can only be done in eager mode, such as a `pivot`. Here we trigger evaluation, do the pivot and then convert back to lazy mode.

We convert a `DataFrame` to a `LazyFrame` with `lazy`

In [None]:
df_eager = pl.read_csv(csv_file)
df_lazy = df_eager.lazy()
df_lazy

## Limits of lazy mode
There are operations that cannot be done in lazy mode (whether in Polars or other lazy frameworks such as SQL databases). One limitation is that Polars must know the column names and dtypes at each step of the query plan.

For example we cannot `pivot` in lazy mode as the column names are data-dependent following a pivot

In [10]:
(
    pl.read_csv(csv_file)
    .pivot(index="Pclass",columns="Sex",values="Age",aggregate_function="mean")
)

Pclass,male,female
i64,f64,f64
3,26.507589,21.75
1,41.281386,34.611765
2,30.740707,28.722973


In these cases I recommend:
- starting queries in lazy mode as far as possible
- evaluating with `collect` when a non-lazy method is required
- calling the non-lazy method
- calling `lazy` on the output to continue in lazy mode

In [None]:
(
    pl.scan_csv(csv_file)
    .collect()
    .pivot(index="Pclass",columns="Sex",values="Age",aggregate_function="mean")
    .lazy()
)

## Exercises

In the exercises you will develop your understanding of:
- triggering full evaluation of a query
- triggering partial evaluation of a query
- triggering evaluation in streaming mode
- converting from eager to lazy mode

### Exercise 1
Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [None]:
(
    pl.<blank>
)

Continue with your code from the first part in subsequent parts of this exercise.

Use the `fetch` method on the `LazyFrame` and count how many rows `fetch` returns by default

Check to see which of the following metadata you can get from a `LazyFrame`:
- number of rows
- column names

### Exercise 2: converting between eager and lazy mode
Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [None]:
(
    <blank>
)


Convert the `LazyFrame` to a `DataFrame`

Convert the `DataFrame` to a `LazyFrame`

## Solutions

### Solution to Exercise 1

Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [None]:
(
    pl.scan_csv(csv_file)
)

Use the `fetch` method on the `LazyFrame` and count how many rows `fetch` returns by default

In [None]:
(
    pl.scan_csv(csv_file)
    .fetch()
    .shape
)

We discuss the notification about common subplan elimination in the lecture on streaming CSVs 

A `LazyFrame` does not know the number of rows in a CSV, so it has no `.shape` attribute. The cell below is commented out because it would raise an error

In [None]:
# (
#     pl.scan_csv(csv_file)
#     .shape
# )

A `LazyFrame` does know the column names. In `pl.scan_csv` Polars reads the first row of the CSV file to get the column names

In [None]:
(
    pl.scan_csv(csv_file)
    .columns
)

### Solution to Exercise 2

Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [None]:
(
    pl.scan_csv(csv_file)
)


Convert the `LazyFrame` to a `DataFrame`

In [None]:
(
    pl.scan_csv(csv_file)
    .collect()
    .head(3)
)


Convert the `DataFrame` to a `LazyFrame`

In [None]:
(
    pl.scan_csv(csv_file)
    .collect()
    .lazy()
    .head(3)
)
