## Date and time ranges
By the end of this lecture you will be able to:
- create a vertical datetime, date or time range
- create a lazy datetime, date or time range
- create a horizontal datetime, date or time range

In [None]:
from datetime import datetime,date,time,timedelta

import polars as pl

In [None]:
start_date = date(2020,1,1)
end_date = date(2020,1,2)
interval = timedelta(hours=3)

## Vertical ranges

### Datetime

As we have already seen, we can create a datetime range with `pl.datetime_range`

In [None]:
pl.datetime_range(
    start=start_date,
    end=end_date,
    interval=interval,
    eager=True
).head(3)

### Date range
We can also create a date range with `pl.date_range`

In [None]:
pl.date_range(
    start=start_date,
    end=end_date,
    eager=True
)

A date range defaults to a one-day interval, but other intervals can be specified.

### Time range
We can also create a time range with `pl.time_range`

In [None]:
start_time = time(0)
end_time = time(12)
interval = timedelta(hours=3)
pl.time_range(
    start=start_time,
    end=end_time,
    interval=interval,
    eager=True
)

## Lazy datetime ranges
In all of these examples we set `eager=True`. With this argument Polars evaluates the range immediately and returns a `Series`.

If we instead set `eager=False` (the default), Polars does not evaluate the range. Instead it returns an expression that is only evaluated when it is used in a query.

In [None]:
pl.datetime_range(
    start=start_date,
    end=end_date,
    interval=interval,
)

The type of this output is a Polars expression (`pl.Expr`)

In [None]:
type(
    pl.datetime_range(
        start=start_date,
        end=end_date,
        interval=interval,
        eager=False
    )
)

The `eager=False` mode is primarily useful for building datetime ranges inside lazy queries.

In my own pipelines I typically find that creating a date range is not something I need lazy mode for, as the memory requirement is small.

A lazy date range could allow for lazy queries like the following (contrived) example

In [None]:
start_date = datetime(2020,1,1)
end_date = datetime(2020,1,1,9)
interval = timedelta(hours=3)
(
    # Create a lazy frame with some data
    pl.LazyFrame(
        {
            "index":pl.arange(0,4,eager=True)
        }
    )
    .with_columns(
        # Add a datetime column
        pl.datetime_range(
            start=start_date,
            end=end_date,
            interval=interval,
            eager=False
        ).alias("datetime")
    )
    # Evaluate the query
    .collect()
)

## Horizontal datetime ranges
With `pl.datetime_range` we get a vertical range in a `Series`.

We can also get a datetime range as a list in every row with `pl.datetime_ranges`.

To show this we first create a `DataFrame` with:
- an `id` column
- a column for the start of the date range in that row
- a column for the end of the date range in that row

In [None]:
df = pl.DataFrame(
    {
        "id":["A","B"],
        "start": [datetime(2022, 1, 1), datetime(2022, 1, 2)],
        "end": datetime(2022, 1, 3),
    }
)
df

We can now create a column with the date range in each row from `start` to `end`

In [None]:
(
    df
    .with_columns(
        pl.datetime_ranges("start","end",interval="1mo").alias("datetime_range")
    )
)

We will see a use case for horizontal ranges in the exercises.

## Exercises
In the exercises you will:
- create a vertical time range
- create a horizontal datetime range
- join these ranges to other dataframes

### Exercise 1 
We have a short hourly temperature record with a gap at 2 am

In [None]:
df_weather = (
    pl.DataFrame(
        {
            "time": [time(0), time(1), time(3)], 
            "temperature": [12.0, 11, 9]
        }
    )
)
df_weather

We want to create an hourly `DataFrame` with no time gaps.

First create a `DataFrame` where the `time` column has no gaps

In [None]:
df_time = pl.DataFrame(
            {
                "time":<blank>
            }
        )
df_time   

Now do a left join of `df_weather` to `df_time`

Fill the gaps in the `temperature` column with linear interpolation

### Exercise 2
Our client is a bike shop and wants to look at sales during their Summer and Halloween sale periods.

The client provides you with the following data for the start and end of each sale period

In [None]:
df_sales_periods = pl.DataFrame(
    {
        "sale":["Summer","Halloween"],
        "start": [date(2015, 6, 1), date(2015, 10, 15)],
        "end": [date(2015, 9, 1),date(2015, 11, 15)]
    }
)
df_sales_periods

Add a `date` column that has the range of dates between `start` and `end` on each row

Expand the list column to have a row for each element of the list

The bike sales data is in the following `DataFrame`

In [None]:
df_sales = pl.read_parquet("../data/bike_sales.parquet")
df_sales.head(2)

Join the sale periods to the full sales dataframe. Ensure that only rows that fall inside either the Summer or Halloween sale period are kept

In [None]:
(
    df_sales
    .join(
    <blank>
)

Aggregate the data by sale period and get the total cost and revenue for each sale period. Sort by revenue

## Solutions

### Solution to exercise 1

We have an hourly temperature record with a gap at 2 am

In [None]:
df_weather = (
    pl.DataFrame(
        {
            "time": [time(0), time(1), time(3)], 
            "temperature": [12.0, 11, 9]
        }
    )
)
df_weather

We want to create a `DataFrame` with no time gaps.

First create a `DataFrame` with a `time` column that has no gaps

In [None]:
df_time = pl.DataFrame(
            {
                "time":pl.time_range(time(0),time(3),eager=True)
            }
        )
df_time

Now do a left join of the original `DataFrame` to `df_time`

In [None]:
(
    df_time
    .join(
        df_weather,on="time",how="left"
    )
)

Fill the gaps in the `temperature` column with linear interpolation

In [None]:
(
    df_time
    .join(
        df_weather,on="time",how="left"
    )
    .with_columns(
        pl.col("temperature").interpolate()
    )
)

Note - Polars has an `upsample` method that can also fill gaps in a time series. However, `upsample` only works in eager mode.

In my time-series forecasting pipelines I use an approach based on this exercise: I use `datetime_range` to create a gap-free time series and then do a left join of the data to it. The advantage of this approach is that it works in lazy mode and can use the streaming engine for large datasets.

### Solution to exercise 2
Our client is a bike shop and wants to look at sales during their Summer and Halloween sale periods.

The client provides you with the following data for the start and end of each sale period

In [None]:
df_sales_periods = pl.DataFrame(
    {
        "sale":["Summer","Halloween"],
        "start": [date(2015, 6, 1), date(2015, 10, 15)],
        "end": [date(2015, 9, 1),date(2015, 11, 15)]
    }
)
df_sales_periods

Add a `date` column that has the range of dates between `start` and `end` on each row

In [None]:
(
    df_sales_periods
    .with_columns(
        date = pl.date_ranges("start","end")
    )
)

Expand the list column to have a row for each element of the list

In [None]:
(
    df_sales_periods
    .with_columns(
        date = pl.date_ranges("start","end")
    )
    .explode("date")
)

The bike sales data is in the following `DataFrame`

In [None]:
df_sales = pl.read_parquet("../data/bike_sales.parquet")
df_sales.head(2)

Join the sale periods to the full sales data. Ensure that only rows that fall inside either the Summer or Halloween sale period are kept

In [None]:
(
    df_sales
    .join(
    (
        df_sales_periods
        .with_columns(
            date = pl.date_ranges("start","end")
        )
        .explode("date")
    ),
    on="date",
    how="inner"
    )
)

Aggregate the data by sale period and get the total cost and revenue for each sale period. Sort by revenue

In [None]:
(
    df_sales
    .join(
    (
        df_sales_periods
        .with_columns(
            date = pl.date_ranges("start","end")
        )
        .explode("date")
    ),
    on="date",
    how="inner"
    )
    .group_by("sale")
    .agg(
        pl.col("cost","revenue").sum()
    )
    .sort("revenue")
)