# Parsing datetime strings
By the end of this lecture you will be able to:
- parse datetime strings from a file
- convert datetime strings into time series dtypes
- save datetime dtypes to a file


In [None]:
import polars as pl

In this lecture we work with a 1,000 row extract of the NYC taxi dataset

In [None]:
csv_file = "../data/nyc_trip_data_1k.csv"

## Reading datetime strings from a CSV
Polars does not try to parse datetimes from strings by default

In [None]:
pl.Config.set_fmt_str_lengths(100)
df = (
    pl.read_csv(
        csv_file,
    )
)
df.head(2)

Polars will try to do this if we set the `try_parse_dates` argument

In [None]:
df = (
    pl.read_csv(
        csv_file,
        try_parse_dates=True
    )
)
df.head(2)

If you want to see the range of datetime string regex patterns supported by `try_parse_dates` [see the Rust code](https://github.com/pola-rs/polars/blob/master/polars/polars-time/src/chunkedarray/utf8/patterns.rs).

For more control we can also pass the `dtypes` argument

In [None]:
df = (
    pl.read_csv(
        csv_file,
        dtypes={
            "pickup":pl.Datetime,
            "dropoff":pl.Datetime
        }
    )
)
df.head(2)

## Reading datetime strings from a CSV in lazy mode
We can apply the `try_parse_dates` and `dtypes` arguments in lazy mode

In [None]:
print(
    pl.scan_csv(csv_file,try_parse_dates=True)
    .explain()
)

### Other file types
CSV files store all data as strings and so do not preserve datetime dtypes. However, IPC (Arrow) and Parquet files store the dtypes. If the `DataFrame` is saved with datetime dtypes for these file formats it will be loaded with datetime dtypes.

For JSON there is no `try_parse_dates` argument and the conversion from strings to datetime must be done manually after the JSON is read.

## Parsing dates manually

We convert date strings to datetime dtypes using `.str.strptime` (string-parse-time).

First we read the CSV again without automatic date parsing

In [None]:
df = pl.read_csv(csv_file)
df.head(2)

To parse the date strings with `str.strptime` for this data we pass:
- the target dtype e.g. `pl.Datetime` or `pl.Date`
- the format of the string (including any literal characters such as the `T` before the time)
- the number of decimal places (6) in the fractional seconds, given as `%.6f` inside the format

In [None]:
(
    df
    .with_columns(
        pl.col("pickup").str.strptime(pl.Datetime, format="%Y-%m-%dT%H:%M:%S%.6f"),
        pl.col("dropoff").str.strptime(pl.Datetime, format="%Y-%m-%dT%H:%M:%S%.6f"),
    )
    .head(2)
)

The format follows the convention of the `strftime` module in the Rust `chrono` crate: https://docs.rs/chrono/latest/chrono/format/strftime/index.html

There are also some short-cut formats e.g. `%F` for `%Y-%m-%d` and `%T` for `%H:%M:%S`

In [None]:
(
    df
    .with_columns(
        pl.col("pickup").str.strptime(pl.Datetime, format="%FT%T%.6f"),
    )
    .head(2)
)

It is easy to get the formats wrong - pay particular attention to uppercase and lowercase letters

## Saving datetimes
If we write a datetime dtype to IPC or Parquet file types the dtype will be preserved.

If we write to a CSV then the datetime is converted back to a string

In [None]:
df = pl.read_csv(csv_file)
df_formatted = (
    df
    .with_columns(
        pl.col("pickup").str.strptime(pl.Datetime, format="%Y-%m-%dT%H:%M:%S%.6f"),
        pl.col("dropoff").str.strptime(pl.Datetime, format="%Y-%m-%dT%H:%M:%S%.6f"),
    )
)
df_formatted.head(2)

If we want to adjust the formatting of the `pl.Datetime`/`pl.Date`/`pl.Time` before saving it we can use the corresponding arguments in `write_csv`.

In this example we replace the empty space between the date and time with a `T`

In [None]:
# A literal T separates the date and time (%T would instead expand to %H:%M:%S)
df_formatted.write_csv("test.csv", datetime_format="%Y-%m-%dT%H:%M:%S")

## Duration dtype
We cannot write a `pl.Duration` type to CSV directly.

Instead we convert the duration to an integer.

In this example we extract the trip length in whole seconds with `dt.total_seconds` and name the column to reflect this

In [None]:
(
    df_formatted
    .with_columns(
        ((pl.col("dropoff")-pl.col("pickup")).dt.total_seconds()).alias("trip_length_seconds")
    )
).head(2)

## Exercises
In the exercises you will develop your understanding of:
- manually converting datetime strings to a datetime dtype
- writing datetime dtypes to a CSV

### Exercise 1

Parse the dates

Convert the `date` strings to `pl.Date` dtype

In [None]:
df = pl.DataFrame(
    {'date':['31-01-2020','28-02-2020','31-03-2020']}
)
(
    df
    .with_columns(
        pl.col('date')<blank>
    )
)

With YMD format

In [None]:
df = pl.DataFrame(
    {'date':['2020-01-31','2020-02-28','2020-03-31']}
)
(
    df
    .with_columns(
        pl.col('date')<blank>
    )
)

With forward-slashes

In [None]:
df = pl.DataFrame(
    {'date':['31/01/2020','28/02/2020','31/03/2020']}
)
(
    df
    .with_columns(
        pl.col('date')<blank>
    )
)

With month names.

Recall the formats are here: https://docs.rs/chrono/latest/chrono/format/strftime/index.html

In [None]:
df = pl.DataFrame({
    'date': ["27 July 2020", "31 December 2020"]
})
(
    df
    .with_columns(
        pl.col('date')<blank>
    )
)

### Exercise 2 

Parse the datetimes

Convert the `date` column from string to `pl.Datetime` dtype

In [None]:
df = pl.DataFrame(
    {'date':['31-01-2020 00:00:00','28-02-2020 00:00:00','31-03-2020 00:00:00']}
)
(
    df
    .with_columns(
        pl.col('date')<blank>
    )
)

Convert to `pl.Datetime` preserving the milliseconds

Hint: find formats for fractional seconds: https://docs.rs/chrono/latest/chrono/format/strftime/index.html

In [None]:
df = pl.DataFrame(
    {'date':['31-01-2020 00:00:00.500','31-01-2020 00:00:00.600','31-01-2020 00:00:00.700']}
)
(
    df
    .with_columns(
        pl.col('date')<blank>
    )
)

Convert strings with AM/PM to `pl.Datetime` dtype

In [None]:
df = pl.DataFrame(
    {'date':['01-01-2020 01:00 AM','01-02-2020 01:00 AM','01-03-2020 02:00 AM']}
)
(
    df
    .with_columns(
        pl.col('date')<blank>
    )
)

### Exercise 3 

Parse datetimes from a CSV.

Read in the NYC taxi dataset from the CSV file. Use `read_csv` to parse the dates automatically

In [None]:
dfNYC = pl.read_csv(csv_file,<blank>)

Change the pickup and dropoff columns to be `pl.Date` (and not `pl.Datetime`)

Challenge: do this in a single expression using `with_columns`

In [None]:
dfNYC = (
    pl.read_csv(csv_file,<blank>)
    <blank>
)

Count how many trips had a pickup on each date with the output sorted by the number of trips

## Solutions

### Solution to exercise 1

In [None]:
df = pl.DataFrame(
    {'date':['31-01-2020','28-02-2020','31-03-2020']}
)
(
    df
    .with_columns(
        pl.col('date').str.strptime(pl.Date,format='%d-%m-%Y')
    )
)

With YMD format

In [None]:
df = pl.DataFrame(
    {'date':['2020-01-31','2020-02-28','2020-03-31']}
)
(
    df
    .with_columns(
        pl.col('date').str.strptime(pl.Date,format='%Y-%m-%d')
    )
)

With forward-slashes

In [None]:
df = pl.DataFrame(
    {'date':['31/01/2020','28/02/2020','31/03/2020']}
)
(
    df
    .with_columns(
        pl.col('date').str.strptime(pl.Date,format='%d/%m/%Y')
    )
)

With month names

In [None]:
df = pl.DataFrame({
    'date': ["27 July 2020", "31 December 2020"]
})
(
    df
    .with_columns(
        pl.col('date').str.strptime(pl.Date, format='%d %B %Y')
    )
)

### Solution to exercise 2 - Datetimes

Convert the `date` column from string to `pl.Datetime` dtype

In [None]:
df = pl.DataFrame(
    {'date':['31-01-2020 00:00:00','28-02-2020 00:00:00','31-03-2020 00:00:00']}
)
(
    df
    .with_columns(
        pl.col('date').str.strptime(pl.Datetime,format='%d-%m-%Y %H:%M:%S')
    )
)

Convert to `pl.Datetime` preserving the milliseconds

In [None]:
df = pl.DataFrame(
    {'date':['31-01-2020 00:00:00.500','31-01-2020 00:00:00.600','31-01-2020 00:00:00.700']}
)
(
    df
    .with_columns(
        pl.col('date').str.strptime(pl.Datetime,format='%d-%m-%Y %H:%M:%S%.3f')
    )
)

Convert strings with AM/PM to `pl.Datetime` dtype

In [None]:
df = pl.DataFrame(
    {'date':['01-01-2020 01:00 AM','01-02-2020 01:00 AM','01-03-2020 02:00 AM']}
)
(
    df
    .with_columns(
        pl.col('date').str.strptime(pl.Datetime,format='%d-%m-%Y %I:%M %p')
    )
)

### Solution to exercise 3 - read from CSV
Read in the NYC taxi dataset from the CSV file. Use `read_csv` to parse the dates automatically

In [None]:
dfNYC = pl.read_csv(csv_file,try_parse_dates=True)

Change the pickup and dropoff columns to be `pl.Date` (and not `pl.Datetime`)

In [None]:
dfNYC = (
    dfNYC
    .with_columns(
        pl.col(pl.Datetime).cast(pl.Date)
    )
)

Count how many trips had a pickup on each date. Sort the output

In [None]:
dfNYC["pickup"].value_counts(sort=True)