# Filtering time series
By the end of this lecture you will be able to:
- filter by datetimes
- filter by a date range
- filter on a duration

In [None]:
from datetime import datetime,date,time

import polars as pl

In [None]:
csv_file = "../data/nyc_trip_data_1k.csv"

In [None]:
df = pl.read_csv(csv_file,try_parse_dates=True)
df.head(2)

## Filtering by datetimes
We filter on datetimes by comparing the column with Python's built-in `datetime.datetime` object

In [None]:
(
    df
    .filter(
        pl.col("pickup") < datetime(2022, 1, 1, 1, 0),
    )
)

We can also compare a `pl.Datetime` column with a `datetime.date` in an inequality

In [None]:
df.filter(
    pl.col("pickup") < date(2022, 1, 2),
).head()

To filter on a date given as a string we must first parse the string, which requires providing its format

In [None]:
df.filter(
    pl.col("pickup") < pl.lit("2022-01-02").str.strptime(pl.Date, format="%Y-%m-%d")
).head()

In the exercises we see how to filter by a time (e.g. after 2 PM).

## Filtering on a datetime range

We filter by a datetime range using the `is_between` expression

In [None]:
df.filter(
    pl.col("pickup").is_between(
        datetime(2021, 12, 31),datetime(2022, 1, 2)
    )
).head(2)

## Filtering datetimes in lazy mode
We can filter datetimes in lazy mode

In [None]:
print(
    pl.scan_csv(csv_file,try_parse_dates=True)
    .filter(
        pl.col("pickup") < date(2022, 1, 2),
    )
    .explain()
)

Just as with non-datetime filters, the query optimiser pushes the `filter` down into `SELECTION`, so the filter is applied while the CSV file is being read.

The datetime is expressed in microseconds in the query plan.

## Filtering on a duration

We create a duration for the length of the taxi trip

In [None]:
(
    df
    .select(["pickup","dropoff"])
    .with_columns(
        (pl.col("dropoff") - pl.col("pickup")).alias("duration")
    )
    .head()
)

To filter on a duration we use `pl.duration` (note this function is different from the dtype `pl.Duration`)

In [None]:
(
    df
    .select(["pickup","dropoff"])
    .with_columns(
        (pl.col("dropoff") - pl.col("pickup")).alias("duration")
    )
    .filter(pl.col("duration") < pl.duration(minutes=10))
    .head(3)
)

We can also filter on a duration by extracting it in the desired time unit. This is more expensive, as it requires converting every value in the left-hand column rather than the single duration on the right-hand side.

In [None]:
(
    df
    .select(["pickup","dropoff"])
    .with_columns(
        (pl.col("dropoff") - pl.col("pickup")).alias("duration")
    )
    .filter(pl.col("duration").dt.total_minutes() < 10)
    .head(3)
)

# Exercises
In the exercises you will develop your understanding of:
- filtering by a date
- filtering by a datetime
- filtering by a time
- filtering by a duration

### Exercise 1
Create a `DataFrame` with a daily interval that starts on 1st January 2020 and ends on 31st January 2020

In [None]:
(
    <blank>
)

Find all dates on or after 15th January

Find all dates between 15th and 20th January including the start date but excluding the end date. 

For a reminder on how to manage the bounds see the Lecture in Section 2 "Filtering rows 2" or the API docs for `is_between`

### Exercise 2
Read the NYC taxi dataset with automatic date parsing

In [None]:
(
    <blank>
    .head(3)
)

Filter to get all the records with a pickup on or after 22:00:00 (10 PM).

Expand the following collapsed cell if you want a hint.

In [None]:
# Hint: cast the pickup column to a pl.Time dtype first

Add a column that calculates the difference in pickup time between successive rows called `pickup_delta`

Filter to find all records that started less than 3 minutes after the previous pickup

## Solutions

### Solution to exercise 1

Create a `DataFrame` with a daily interval that starts on 1st January 2020 and ends on 31st January 2020

In [None]:
pl.Config.set_tbl_rows(4)
start = date(2020,1,1)
stop = date(2020,1,31)
df = pl.DataFrame(
    {
        "date":pl.date_range(start,stop,interval="1d",eager=True)
    }
)
df

Find all dates on or after 15th January

In [None]:
(
    df
    .filter(
        pl.col("date") >= date(2020,1,15)
    )
)

Find all dates between 15th and 20th January including the start date but excluding the end date. 

For a reminder on how to manage the bounds see the Lecture in Section 2 "Filtering rows 2" or the API docs for `is_between`

In [None]:
(
    df
    .filter(
        pl.col("date").is_between(date(2020,1,15), date(2020,1,20),closed="left")
    )
)

### Solution to exercise 2
Read the NYC taxi dataset with automatic date parsing

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .head(3)
)

Filter to get all the records with a pickup on or after 22:00:00 (10 PM).

Expand the following collapsed cell if you want a hint.

In [None]:
# Hint: cast the pickup column to a pl.Time dtype first

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .filter(
        pl.col("pickup").cast(pl.Time) >= time(22)
    )
    .head(3)
)

Add a column that calculates the difference in pickup time between successive rows called `pickup_delta`

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .with_columns(
        (pl.col("pickup").diff()).alias("pickup_delta")
    )
    .head(3)
)

Filter to find all records that started less than 3 minutes after the previous pickup

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .with_columns(
        (pl.col("pickup").diff()).alias("pickup_delta")
    )
    .filter(
        pl.col("pickup_delta") < pl.duration(minutes=3)
    )
)