## Introduction to datetime dtypes
By the end of this lecture you will be able to:
- explain the difference between Polars datetime dtypes
- extract the integer representation underlying datetime dtypes

Time series analysis is easier if you have a good understanding of the datetime dtypes and their underlying representation. We get to know the dtypes here.

Time series dtypes behave in some ways like a categorical dtype: an underlying integer representation maps to a more interpretable datetime representation. I recommend that you do the String and categorical dtypes lecture in Section 3 before this lecture.

In [None]:
from datetime import date,datetime

import polars as pl
import pandas as pd

Before looking at the dtypes we create a datetime range in Polars with `pl.datetime_range`. 

In this example we create an hourly datetime range where we specify the start and stop dates with Python `datetime.date` objects

In [None]:
pl.Config.set_tbl_rows(4)
start = date(2022,1,1)
stop = date(2022,1,2)
df = pl.DataFrame(
    {
        'date':pl.datetime_range(
            start = start,
            end = stop,
            interval='1h',
            eager=True
        ),
    }
)
df

The dtype of this column is `datetime[μs]`. This means it has a `pl.Datetime` dtype where the underlying representation is microseconds since the start of the Unix epoch on 1st January 1970.
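As a rough cross-check of what "microseconds since the epoch" means, we can compute the value for the first row by hand with the Python standard library. This is only a sketch: it does not use Polars, and the variable names are illustrative.

```python
from datetime import datetime, timedelta

# The Unix epoch and the first datetime in the range above
epoch = datetime(1970, 1, 1)
first = datetime(2022, 1, 1)

# Whole number of microseconds between the two
micros_since_epoch = (first - epoch) // timedelta(microseconds=1)
print(micros_since_epoch)  # 1640995200000000
```

This is the kind of integer that sits underneath each value in a `datetime[μs]` column.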

## Datetime dtypes
As well as `pl.Datetime`, Polars has `pl.Date`, `pl.Time` and `pl.Duration` dtypes. In this table we set out the key characteristics of each.


| dtype | Example | Time unit | Internal dtype |
|---|---|---|---|
| `pl.Datetime` | 2020-01-01 01:00:00 | Microseconds since Unix epoch | 64-bit signed integer |
| `pl.Date` | 2020-01-01 | Days since Unix epoch | 32-bit signed integer |
| `pl.Time` | 01:00:00 | Nanoseconds since midnight | 64-bit signed integer |
| `pl.Duration` | 1d 1h | Microseconds | 64-bit signed integer |


> In Pandas and NumPy, datetimes typically use nanoseconds rather than microseconds. We cover conversion from Pandas and NumPy below.

In the `DataFrame` below we create a date range at 6-hour intervals. We then cast this date range to the other datetime dtypes to see how it changes.



In [None]:
start = datetime(2020,1,1)
stop = datetime(2020,1,2)
interval = "6h"
pl.Config.set_tbl_rows(5)
df_datetimes = (
    pl.DataFrame(
        {
            # Create a date range
            "datetime":pl.datetime_range(start,stop,interval=interval,eager=True)
        }
    ).with_columns(
        # Cast it to other dtypes
        pl.col("datetime").cast(pl.Date).alias("date"),
        pl.col("datetime").cast(pl.Time).alias("time"),        
    )

)
df_datetimes

We see that casting the datetime to `pl.Date` extracts the date, and similarly casting to `pl.Time` extracts the time of day.

To get a `pl.Duration` we subtract successive values in the column of datetimes with the `diff` expression.

In [None]:
df_datetimes = (
    df_datetimes
    .with_columns(
        pl.col("datetime").diff().alias("duration"),
    )
)
# Display the result so we can see the new duration column
df_datetimes

### Integer representations
Internally, each datetime dtype is stored as an integer count of the time unit listed in the table above. When a time series operation behaves in an unexpected way, it can be useful to look at these underlying integers.

We get the underlying integer representations with the `to_physical` expression.

In [None]:
df_datetimes_physical = (
    df_datetimes
    .select(
        pl.col("datetime").to_physical().name.suffix("_us"),
        pl.col("date").to_physical().name.suffix("_days"),
        pl.col("duration").to_physical().name.suffix("_us"),
        pl.col("time").to_physical().name.suffix("_ns"),            
    )
)
df_datetimes_physical

With a signed 64-bit integer we can represent a datetime range of roughly 584,000 years at microsecond resolution, or about 292,000 years either side of the Unix epoch!
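As a rough check of the range a 64-bit integer covers, this sketch divides the number of 64-bit values by the number of ticks in a year for each time unit. It uses plain Python arithmetic, and the 365.2425-day year is an approximation.

```python
# Number of distinct values a 64-bit integer can hold
n_values = 2**64

# Average length of a Gregorian year in seconds (365.2425 days)
seconds_per_year = 365.2425 * 24 * 60 * 60

# Total span in years for each time unit
ticks_per_second = {"ms": 1e3, "us": 1e6, "ns": 1e9}
span_years = {
    unit: n_values / ticks / seconds_per_year
    for unit, ticks in ticks_per_second.items()
}
for unit, years in span_years.items():
    print(f"{unit}: about {years:,.0f} years")
```

Under these assumptions the span is a little over 584 thousand years at microseconds and a little over 584 years at nanoseconds, which is one reason microseconds make a comfortable default.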

### Changing the underlying time unit & conversion from Pandas/NumPy
In Polars a `pl.Datetime` is represented as microseconds by default. However, in Pandas and NumPy the underlying representation is typically nanoseconds.

In this example we create a *Pandas* `DataFrame` and check the dtype

In [None]:
df_datetimes_pandas = (
    pd.DataFrame(
        {
            # Create a Pandas date range
            "datetime":pd.date_range(start,stop,freq="6h",)
        }
    )
)
df_datetimes_pandas.dtypes

We see the dtype is `datetime64[ns]`.

If we convert this Pandas `DataFrame` to a Polars `DataFrame`, we still have a `pl.Datetime` with nanoseconds as the underlying representation.

In [None]:
(
    pl.from_pandas(
        df_datetimes_pandas
    )
    .head(2)
)

This nanosecond representation can stop you from joining to another Polars `DataFrame` that has a microsecond representation. To address this we can cast the `pl.Datetime` column from nanoseconds to microseconds with `dt.cast_time_unit`

In [None]:
(
    pl.from_pandas(
        df_datetimes_pandas
    )
    .with_columns(
        pl.col("datetime").dt.cast_time_unit("us")
    )
    .head(2)
)

### Timestamp
The integer representation of a datetime is sometimes referred to as the timestamp. 

In Polars we have a `.dt.timestamp` expression that gives the integer representation in a given unit - similar to `to_physical` above. We can also change the time unit of the integers.

In this example we get the integer representation of the datetime column in the default microseconds and in nanoseconds

In [None]:
(
    df_datetimes
    .select(
        pl.col("datetime"),
        pl.col("datetime").to_physical().alias("datetime_to_phys"),
        pl.col("datetime").dt.timestamp().alias("timestamp_us"),
        pl.col("datetime").dt.timestamp(time_unit="ns").alias("timestamp_ns"),
    )
)

There is also a `.dt.epoch` expression that works like `.dt.timestamp` and additionally accepts seconds (`"s"`) and days (`"d"`) as the time unit.

## Exercises
In the exercises you will develop your understanding of:
- creating a date range
- converting datetime dtypes
- extracting the integer representation
 
### Exercise 1
Create a `DataFrame` with a column called `datetime` that has datetimes from the start of 2020 to 30th June 2022 at 6-monthly intervals

Extend your query by copying your existing code in each subsequent part of this exercise.

Create this date range again but including the end date and excluding the start date

Add columns that encode the `datetime` column as a:
- date
- time

Add three new columns that have the physical representation for the `datetime`, `date` and `time` columns. Each new column name should end with `_physical`.

Challenge: do this as a single expression inside an additional `with_columns`

Add a new column that calculates the differences between the `datetime` column entries

## Solutions

### Solution to Exercise 1

Create a `DataFrame` with a column called `datetime` that has datetimes from the start of 2020 to 30th June 2022 at 6-monthly intervals

In [None]:
start = datetime(2020,1,1)
stop = datetime(2022,6,30)
df = (
    pl.DataFrame(
        {
            "datetime": pl.datetime_range(
                start=start,
                end=stop,
                interval="6mo",
                eager=True,
            )
        }
    )
)
df

Create this date range again but including the end date and excluding the start date

In [None]:
df = (
    pl.DataFrame(
        {
            "datetime": pl.datetime_range(
                start=start,
                end=stop,
                interval="6mo",
                closed="right",
                eager=True,
            )
        }
    )
)
df

Add columns that encode the `datetime` column as a:
- date
- time

In [None]:
df = (
    pl.DataFrame(
        {
            "datetime": pl.datetime_range(
                start=start,
                end=stop,
                interval="6mo",
                closed="right",
                eager=True,
            )
        }
    )
    .with_columns(
        [
            pl.col("datetime").cast(pl.Date).alias("date"),
            pl.col("datetime").cast(pl.Time).alias("time"),
        ]
    )
)
df

Add three new columns that have the physical representation for the `datetime`, `date` and `time` columns. 

In [None]:
df = (
    pl.DataFrame(
        {
            "datetime": pl.datetime_range(
                start=start,
                end=stop,
                interval="6mo",
                closed="right",
                eager=True,
            )
        }
    )
    .with_columns(
        [
            pl.col("datetime").cast(pl.Date).alias("date"),
            pl.col("datetime").cast(pl.Time).alias("time"),
        ]
    )
    .with_columns(
        pl.all().to_physical().name.suffix("_physical")
    )
)
df

Add a new column that calculates the differences between the `datetime` column entries

In [None]:
df = (
    pl.DataFrame(
        {
            "datetime": pl.datetime_range(
                start=start,
                end=stop,
                interval="6mo",
                closed="right",
                eager=True,
            )
        }
    )
    .with_columns(
        [
            pl.col("datetime").cast(pl.Date).alias("date"),
            pl.col("datetime").cast(pl.Time).alias("time"),
        ]
    )
    .with_columns(
        pl.all().to_physical().name.suffix("_physical"),
        pl.col("datetime").diff().alias("duration"),
    )
)
df