## Extracting datetime components
By the end of this lecture you will be able to:
- extract date components from a datetime dtype
- extract week-of-year and day-of-year from a datetime dtype
- extract time components from a datetime dtype


In [None]:
from datetime import datetime

import polars as pl

In [None]:
csv_file = "../data/nyc_trip_data_1k.csv"

In [None]:
df = pl.read_csv(csv_file,try_parse_dates=True)
df.head()

## Extracting date and time
We extract the date from a `pl.Datetime` dtype by casting it to `pl.Date`

In [None]:
(
    df
    .with_columns(
        pl.col("pickup").cast(pl.Date)
    )
).head(3)

We can also use the `dt.date` expression to get the date from a `pl.Datetime`

In [None]:
(
    df
    .select('pickup')
    .with_columns(
        pl.col("pickup").dt.date()
    )
).head(3)

We extract the time from a `pl.Datetime` dtype by casting it to `pl.Time` or using `dt.time`

In [None]:
(
    df
    .select('pickup')
    .with_columns(
        pl.col("pickup").cast(pl.Time).alias('cast_time'),
        pl.col("pickup").dt.time().alias('dt.time')

    )
).head(3)

Note that the `dt.date` and `dt.time` methods give a different result from `cast(pl.Date)` and `cast(pl.Time)` when a timezone is specified!

In the example below:
- we first tell Polars that the `pickup` column is in the New York timezone
- we then extract the time from both the original (no-timezone) `pickup` column and the new `local_datetime` column

In [None]:
(
    df
    # We only need the pickup column, so select only pickup
    .select("pickup")
    # Add a local_datetime column that has a timezone specified
    .with_columns(
       pl.col("pickup").dt.replace_time_zone("America/New_York").alias("local_datetime")
    )
    .with_columns(
        # These columns hold times, so name them accordingly
        pl.col("pickup").dt.time().alias("pickup_time"),
        pl.col("local_datetime").dt.time().alias("local_datetime_time"),
        pl.col("local_datetime").cast(pl.Time).alias("cast_local_datetime_time")
    )
).head(1)

We see in the first row that the `dt.time` expression takes the local time (00:04:14) from the datetime. However, `cast(pl.Time)` takes the time from the underlying UTC timestamp, which is 05:04:14.
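
The five-hour gap comes from New York's winter offset to UTC. The following standard-library sketch, independent of Polars, reproduces the two clock times. The date and the fixed UTC-5 offset are assumptions made for illustration; real code would use a named time zone so that daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for "America/New_York" in winter
new_york_winter = timezone(timedelta(hours=-5))

# Hypothetical pickup value carrying the wall-clock time quoted in the text
naive_pickup = datetime(2022, 1, 1, 0, 4, 14)

# Attach the time zone without shifting the wall-clock time
local_pickup = naive_pickup.replace(tzinfo=new_york_winter)

# The same instant expressed in UTC
utc_pickup = local_pickup.astimezone(timezone.utc)

local_time = local_pickup.time()
utc_time = utc_pickup.time()
print(local_time)  # 00:04:14
print(utc_time)    # 05:04:14
```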

## Extracting datetime features
We use expressions in the `dt` namespace to extract date features

In [None]:
(
    df
    .select(
        pl.col("pickup"),
        pl.col("pickup").dt.quarter().alias("quarter"),
        pl.col("pickup").dt.month().alias("month"),
        pl.col("pickup").dt.day().alias("day"),
        pl.col("pickup").dt.hour().alias("hour"),
        pl.col("pickup").dt.minute().alias("minute"),
        pl.col("pickup").dt.second().alias("second"),
        pl.col("pickup").dt.millisecond().alias("millisecond"),
        pl.col("pickup").dt.microsecond().alias("microsecond"),
        pl.col("pickup").dt.nanosecond().alias("nanosecond"),
    )
    .sample(5)
    .sort("pickup")
)
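
Most of these components are read directly off the calendar and clock fields, but the quarter is derived from the month. As a rough cross-check outside Polars, the sketch below computes the quarter with the standard library. The helper name `quarter_of` and the sample datetimes are hypothetical.

```python
from datetime import datetime


def quarter_of(ts: datetime) -> int:
    """Calendar quarter (1-4) derived from the month number."""
    return (ts.month - 1) // 3 + 1


# Hypothetical sample datetimes, not taken from the taxi dataset
samples = [
    datetime(2022, 1, 15, 8, 30),
    datetime(2022, 6, 30, 23, 59),
    datetime(2022, 10, 1),
]

quarters = [quarter_of(ts) for ts in samples]
print(quarters)  # [1, 2, 4]
```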

For the year there is both `year` and `iso_year`. 

- The `year` is the literal year from the calendar year
- The `iso_year` is the year according to the ISO 8601 week-date definition, in which every year is made up of whole weeks (52 or 53)

For datetimes in the first or last few days of a calendar year these values may differ (see the first row below)

In [None]:
(
    df
    .select(
        pl.col("pickup"),
        pl.col("pickup").dt.year().alias("year"),
        pl.col("pickup").dt.iso_year().alias("iso_year"),
    )
    .sort("pickup")
    .head(3)
)
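
The difference between the calendar year and the ISO year can be cross-checked with the standard library's `date.isocalendar()`, which follows the same ISO 8601 week-date rules. This is a sketch on hand-picked dates, not on the taxi dataset.

```python
from datetime import date

# Hand-picked dates around year boundaries (illustrative, not from the dataset)
days = [
    date(2021, 1, 1),    # Friday: still in ISO week 53 of 2020
    date(2022, 1, 1),    # Saturday: still in ISO week 52 of 2021
    date(2022, 1, 3),    # Monday: first day of ISO week 1 of 2022
    date(2024, 12, 30),  # Monday: already in ISO week 1 of 2025
]

# Map each date to its (iso_year, iso_week) pair
iso_parts = {d: tuple(d.isocalendar())[:2] for d in days}

for d, (iso_year, iso_week) in iso_parts.items():
    print(d, "calendar year:", d.year, "ISO year:", iso_year, "ISO week:", iso_week)
```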

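The dtype for the `year` and `iso_year` columns is a signed 32-bit integer. The dtypes of the other columns depend on the Polars version: older releases return unsigned 32-bit integers, while recent releases return smaller signed integers for most components. Check the schema of the output to confirm.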

## Ordinal week and day numbers

We can also extract week and day features:
- `.dt.week` gives the <a href="https://en.wikipedia.org/wiki/ISO_week_date" target="_blank">ISO week of the year</a>
- `.dt.weekday` gives the ISO day of week where Monday = 1 and Sunday = 7
- `.dt.day` gives the day of month from 1-31
- `.dt.ordinal_day` gives the day of year from 1-365/366

In [None]:
(
    df
    .select(
        pl.col("pickup"),
        pl.col("pickup").dt.week().alias("week"),
        pl.col("pickup").dt.weekday().alias("weekday"),
        pl.col("pickup").dt.day().alias("day_of_month"),
        pl.col("pickup").dt.ordinal_day().alias("ordinal_day"),
    )
    .head(2)
    .sort("pickup")
)

In the ISO system the first two days of 2022 are in week 52 of 2021.
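
The weekday and day-of-year numbers can be sanity-checked with the standard library. In this sketch `isoweekday()` uses the ISO numbering (Monday = 1 through Sunday = 7) and `tm_yday` gives the day of the year. The dates are chosen for illustration only.

```python
from datetime import date

saturday = date(2022, 1, 1)
monday = date(2022, 1, 3)

# ISO weekday numbers: Monday = 1 ... Sunday = 7
weekday_numbers = (saturday.isoweekday(), monday.isoweekday())
print(weekday_numbers)  # (6, 1)

# Day of year: the last day is 365 in a common year and 366 in a leap year
last_day_common = date(2022, 12, 31).timetuple().tm_yday
last_day_leap = date(2020, 12, 31).timetuple().tm_yday
print(last_day_common, last_day_leap)  # 365 366
```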

## Extracting datetime components in lazy mode
We run the same query in lazy mode to see where datetime extraction appears in the optimized query plan

In [None]:
print(
    pl.scan_csv(csv_file,try_parse_dates=True)
    .select(
        pl.col("pickup"),
        pl.col("pickup").dt.week().alias("week"),
        pl.col("pickup").dt.weekday().alias("weekday"),
        pl.col("pickup").dt.day().alias("day_of_month"),
        pl.col("pickup").dt.ordinal_day().alias("ordinal_day"),
    )
    .explain()
)

The datetime extraction happens in a `SELECT...FROM` block in the optimized query plan above.

This means that Polars first reads in the datetime column from the CSV and then does the conversion once the column is in a `DataFrame` in memory.


## Exercises
In the exercises you will develop your understanding of:
- extracting datetime components
- extracting ordinal components
- doing these operations in lazy mode

### Exercise 1
Count the number of records for each date (by pickup)

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    <blank>
)

### Exercise 2

Add a `day_of_year` column containing the ordinal day of the year for each pickup

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    <blank>
)


Continue by counting how many records there are for each day-of-year

Add columns with the day-of-week and hour of the day based on the pickup time

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .select("pickup")
    <blank>
    .head()
)

Continue by counting the number of records for each (day-of-week,hour-of-the-day) pair.

Sort the output from largest number of records to smallest

Do the count of records by (day-of-week,hour-of-the-day) again, but this time extract the day-of-week & hour-of-the-day **inside the `group_by`**

Do the same operation but this time in lazy mode

## Solutions

### Solution to exercise 1
Count the number of records for each date (by pickup).

This can be done either with `group_by` (first cell) or `value_counts` (second cell)

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .group_by(
        pl.col("pickup").cast(pl.Date)
    )
    .len()    
)

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .with_columns(
        pl.col("pickup").cast(pl.Date)
    )
    ["pickup"]
    .value_counts()
)

### Solution to exercise 2
Add a `day_of_year` column containing the ordinal day of the year for each pickup

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .with_columns(
        pl.col("pickup").dt.ordinal_day().alias("day_of_year")
    )
)

Count how many records there are for each day-of-year

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .with_columns(
        pl.col("pickup").dt.ordinal_day().alias("day_of_year")
    )
    ["day_of_year"]
    .value_counts()
)

Add columns with the day-of-week and hour of the day based on the pickup time

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .select("pickup")
    .with_columns(
        pl.col("pickup").dt.weekday().alias("day_of_week"),
        pl.col("pickup").dt.hour().alias("hour")
    )
    .head(3)
)

Count the number of records for each (day-of-week,hour-of-the-day) pair.

Sort the output from largest number of records to smallest

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .select("pickup")
    .with_columns(
        pl.col("pickup").dt.weekday().alias("day_of_week"),
        pl.col("pickup").dt.hour().alias("hour")
    )
    .group_by("day_of_week","hour")
    .len()
    .sort("len",descending=True)
)
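
To make explicit what this grouped count computes, here is a rough standard-library equivalent on a handful of made-up pickup times. The `pickups` list is hypothetical, and `Counter.most_common()` plays the role of the descending sort.

```python
from collections import Counter
from datetime import datetime

# Hypothetical pickup datetimes, not taken from the taxi dataset
pickups = [
    datetime(2022, 1, 1, 0, 4, 14),
    datetime(2022, 1, 1, 0, 32, 1),
    datetime(2022, 1, 1, 9, 15, 0),
    datetime(2022, 1, 3, 9, 45, 30),
]

# Count records per (ISO weekday, hour) pair
pair_counts = Counter((ts.isoweekday(), ts.hour) for ts in pickups)

# Largest counts first; ties keep their first-seen order
ranked = pair_counts.most_common()
print(ranked)  # [((6, 0), 2), ((6, 9), 1), ((1, 9), 1)]
```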

Do the count of records by (day-of-week,hour-of-the-day) again, but this time extract the day-of-week & hour-of-the-day inside the `group_by`

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .select("pickup")
    .group_by(
        pl.col("pickup").dt.weekday().alias("day_of_week"),
        pl.col("pickup").dt.hour().alias("hour")
    )
    .len()
    .sort("len",descending=True)
)

Do the same operation in lazy mode

In [None]:
(
    pl.scan_csv(csv_file,try_parse_dates=True)
    .select("pickup","dropoff")
    .group_by(
        pl.col("pickup").dt.weekday().alias("day_of_week"),
        pl.col("pickup").dt.hour().alias("hour")
    )
    .agg(
        pl.col("dropoff").count().alias("count")
    )
    .sort("count",descending=True)
    .collect()
)

Depending on the Polars version, `len` may not be available on a `LazyGroupBy`, in which case we must use `agg`. I recommend using `agg` for any `group_by` to make the conversion to lazy mode easier.