<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>Valérie Roy</span>
<span><img src="media/ensmp-25-alpha.png" /></span>
</div>

#  time-related data in *pandas*

In [None]:
import numpy as np
import pandas as pd

a large share of real-world datasets  
involve **time-related** data

**3 data types** to talk about **temporality**
   - **date** or **time** (often called a datetime)  
     is a particular point in time
     e.g. **just now** or 1970-01-01T00:00 UTC
   - **time duration**: 
     e.g. **3 hours** (deltas)
   - **time period**: it is an **interval of time**  
     it is thus a **date** plus a **duration**

## date and time intervals in **numpy**

### date in **numpy** : *numpy.datetime64*

   - dates in *pandas* are based on *numpy.datetime64*
   - the format is **'year-month-day hour:minute:second'**
   - the numbers are **zero-padded** ($09$ and not $9$)

In [None]:
# first course is from 14:00 to 17:00 
np_beg = np.datetime64('2019-09-04 14:00:00')
np_end = np.datetime64('2019-09-04 17:00:00')
np_beg
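a minimal sketch of how the string you pass determines the resolution that *numpy* stores: a date-only string gives a day-resolution value, a full timestamp gives a second-resolution one. The variable names below are illustrative and not part of the course code.

```python
import numpy as np

# a date-only string is stored with day resolution
day = np.datetime64('2019-09-04')
# a full timestamp is stored with second resolution
second = np.datetime64('2019-09-04 14:00:00')

print(day.dtype)     # datetime64[D]
print(second.dtype)  # datetime64[s]

# the ISO 8601 form with a 'T' separator denotes the same instant
iso = np.datetime64('2019-09-04T14:00:00')
print(iso == second)  # True
```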

### **time duration** in **numpy** : *numpy.timedelta64*

In [None]:
# course duration can be obtained
# by a simple subtraction
np_duration = np_end - np_beg
np_duration

In [None]:
# same course the day after:
# we shift the start by one day
np_day = np.timedelta64(1, 'D')

# beginning of course next day
np_beg_day2 = np_beg + np_day
np_beg_day2

In [None]:
# end of course next day
# is for example
np_beg_day2 + np_duration
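as a complement, here is a small sketch of two other things you can do with *numpy.timedelta64* values: convert a duration to a plain number of hours, and shift a date by a whole array of durations at once. The names `hours` and `starts` are made up for this illustration.

```python
import numpy as np

np_beg = np.datetime64('2019-09-04 14:00:00')
np_end = np.datetime64('2019-09-04 17:00:00')
np_duration = np_end - np_beg

# dividing by a unit timedelta gives a float count of that unit
hours = np_duration / np.timedelta64(1, 'h')
print(hours)  # 3.0

# arithmetic broadcasts: the same course start on 3 consecutive days
starts = np_beg + np.arange(3) * np.timedelta64(1, 'D')
print(starts)
```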

## temporality in **pandas**

### dates in **pandas** : *Timestamp*

In [None]:
# same logic using pandas

pd_beg = pd.Timestamp('2019-09-04 14:00:00')
pd_end = pd.Timestamp('2019-09-04 17:00:00')

pd_beg


- you can parse dates written in a **custom format**
- using *pandas.to_datetime()* with the *format* parameter
- for example the format used above is `%Y-%m-%d %H:%M:%S`
- `%Y` is year (2019), ... [see complete list in the Python doc](https://docs.python.org/3.7/library/datetime.html#strftime-and-strptime-behavior)

In [None]:
# using a specific format for setting a date

pd.to_datetime('2019|10|04 14;00;07', format='%Y|%m|%d %H;%M;%S') 
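the same format codes also work in the other direction: a sketch, assuming you want to print a *Timestamp* in a layout of your choice with *strftime()*. The output layout chosen here is arbitrary.

```python
import pandas as pd

# parse with a custom format...
ts = pd.to_datetime('2019|10|04 14;00;07', format='%Y|%m|%d %H;%M;%S')

# ...and format back to a string with the same kind of codes
print(ts.strftime('%d/%m/%Y at %Hh%M'))  # 04/10/2019 at 14h00

# individual fields are available as attributes
print(ts.year, ts.month, ts.day)  # 2019 10 4
```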

* as a side note, be aware that down the software stack, at the OS level,
  a particular point in time is almost always encoded by
  the number of seconds elapsed since the 'epoch' of 1970-01-01T00:00 UTC

In [None]:
# if you give a number it's considered 
# to be a duration since the epoch
# (counted in nanoseconds by default)
pd.Timestamp(0) # the Unix epoch; that was a Thursday, can you check it?
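one possible way to check the weekday, and to see which unit a bare integer is counted in. This is a sketch; the variable names are only for illustration.

```python
import pandas as pd

epoch = pd.Timestamp(0)
print(epoch.day_name())  # Thursday

# a bare integer is read as nanoseconds since the epoch;
# pass unit= to count in seconds instead
one_hour_ns = pd.Timestamp(3600 * 10**9)
one_hour_s = pd.Timestamp(3600, unit='s')
print(one_hour_ns == one_hour_s)  # True
```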

### time duration in **pandas**  : *Timedelta*

- it is a **time interval**
- so a **duration** between **two** dates
- with no mention of a precise **date**

In [None]:
# can also use + and -
pd_duration = pd_end - pd_beg
pd_duration

In [None]:
pd_beg + pd_duration == pd_end

In [None]:
#pd.Timedelta?
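a *Timedelta* can also be built directly, without subtracting two dates. The sketch below shows three equivalent spellings of the same 3-hour duration, and two ways to turn it into a plain number.

```python
import pandas as pd

# three equivalent ways to build a 3-hour duration
d1 = pd.Timedelta(hours=3)
d2 = pd.Timedelta('3 hours')
d3 = pd.Timedelta(3, unit='h')
print(d1 == d2 == d3)  # True

# convert to plain numbers
print(d1.total_seconds())             # 10800.0
print(d1 / pd.Timedelta(minutes=1))   # 180.0
```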

### time period in **pandas**  : *Period*

   - a **period** is a **date** and a **duration**

In [None]:
# make a period from a starting date and a duration
pd_period = pd_beg.to_period(pd_duration)
pd_period

In [None]:
# to_period also accepts frequency aliases, e.g.
# (recent pandas versions prefer the lowercase spelling '3h')
pd_beg.to_period('3H')
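to make the "interval" nature of a *Period* concrete, here is a sketch using a daily period (the `'D'` alias is chosen for this illustration): the period has a start and an end, and the original date falls between them.

```python
import pandas as pd

ts = pd.Timestamp('2019-09-04 14:00:00')

# the daily period that contains ts
period = ts.to_period('D')

# a period knows where it starts and where it ends
print(period.start_time)  # 2019-09-04 00:00:00
print(period.end_time)    # 2019-09-04 23:59:59.999999999

# the original timestamp lies inside the interval
print(period.start_time <= ts <= period.end_time)  # True

# adding an integer moves to the next period of the same length
next_period = period + 1
print(next_period.start_time)  # 2019-09-05 00:00:00
```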

## columns of dates in a *DataFrame*

### in an already created **dataframe**

   - you have a **dataset** with a column of **dates**

In [None]:
df = pd.DataFrame(
    {'time': ['2019/12/25 23:59', '2019/12/31 23:59'],
     'holidays': ['Christmas', 'New Year']})

   - the **time** is a **simple** python **string** 

In [None]:
type(df.loc[0, 'time'])   

   - you can **convert** these **strings** into **datetime objects**
   - with the *pandas.to_datetime* function

In [None]:
# str objects are not at all convenient,
# let's replace that column with datetime objects instead
df['time'] = pd.to_datetime(df['time'])

   - this allowed us to replace *str* values with *numpy.datetime64* ones

In [None]:
df.dtypes
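once the column holds datetimes, the `.dt` accessor exposes date-specific attributes and methods on the whole column. A small sketch on the same two-row dataframe:

```python
import pandas as pd

df = pd.DataFrame(
    {'time': ['2019/12/25 23:59', '2019/12/31 23:59'],
     'holidays': ['Christmas', 'New Year']})
df['time'] = pd.to_datetime(df['time'])

# extract fields from every date in the column
years = df['time'].dt.year.tolist()
weekdays = df['time'].dt.day_name().tolist()
print(years)     # [2019, 2019]
print(weekdays)  # ['Wednesday', 'Tuesday']
```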

   - now we can use this column for **indexing** the dataframe 

In [None]:
df.set_index('time')
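with dates as the index, rows can be selected with date strings. The sketch below rebuilds the same dataframe and assumes you want all rows of one day, then a range of days; the names `christmas` and `after` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'time': pd.to_datetime(['2019/12/25 23:59', '2019/12/31 23:59']),
     'holidays': ['Christmas', 'New Year']}).set_index('time')

# all rows that fall on a given day
christmas = df.loc['2019-12-25']
print(christmas)

# a slice between two days; the end day is included
after = df.loc['2019-12-26':'2019-12-31']
print(after)
```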

### creating **date** type while reading the **csv** file

In [None]:
# imagine you have a raw csv file 
# we simulate this use case by storing
# our dataframe with no index

df.to_csv('holidays.csv', index=False)

   - *read_csv()* can do this conversion **on the fly**
   - thanks to the **parse_dates** parameter

In [None]:
df = pd.read_csv('holidays.csv', parse_dates=['time'])

In [None]:
df.dtypes

In [None]:
df.head()

   - we can also **index** the dataframe by **date** while we read the csv file

In [None]:
df = pd.read_csv('holidays.csv', parse_dates=['time'], index_col='time')

In [None]:
df.head()

In [None]:
df.index.name

**unusual date formats**

In [None]:
# in the file holidays-custom.csv, dates are written with |
# e.g. 2019|12|25 23:59,Christmas
!cat holidays-custom.csv

- for an **unusual date format**, indicate the parser function **to use** 

In [None]:
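# the parser receives the raw strings of the 'time' column
def my_date_parser(d):
    return pd.to_datetime(d, format='%Y|%m|%d %H:%M')

# note: date_parser is deprecated as of pandas 2.0,
# where date_format='%Y|%m|%d %H:%M' can be passed instead
df = pd.read_csv('holidays-custom.csv', 
                 parse_dates=['time'],
                 index_col='time', 
                 date_parser=my_date_parser)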

In [None]:
df.head()
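an alternative that does not depend on the *read_csv()* date options is to read the column as plain strings and convert it afterwards. The sketch below assumes the file has the two columns `time` and `holidays`, and simulates its content with an in-memory buffer.

```python
import io
import pandas as pd

# stand-in for holidays-custom.csv (content assumed for this example)
raw = io.StringIO(
    "time,holidays\n"
    "2019|12|25 23:59,Christmas\n"
    "2019|12|31 23:59,New Year\n")

# read the dates as plain strings first...
df = pd.read_csv(raw)
# ...then convert them with the explicit format and index by date
df['time'] = pd.to_datetime(df['time'], format='%Y|%m|%d %H:%M')
df = df.set_index('time')
print(df)
```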

### when dates are **wrong** you can **ignore** or **coerce**

   - by default, you get an error

In [None]:
try:
    pd.to_datetime('30/02/2019')
except ValueError as e:
    print(e)

   - you ignore the error

In [None]:
# the invalid input is returned unchanged, as a str
# (errors='ignore' is deprecated in recent pandas versions)
pd.to_datetime('30/02/2019', errors='ignore')

   - you coerce the error

In [None]:
pd.to_datetime('30/02/2019', errors='coerce') # this is Not a Time

   - the result is the *pandas* **object** *pandas.NaT* (Not a Time)
   - the usual **NaN** methods (*isna()*, *dropna()*, *fillna()*) also work on **NaT values**
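a short sketch of this on a column that mixes a valid and an invalid date. The day-first format and the replacement date are chosen for the illustration only.

```python
import pandas as pd

# one valid date and one impossible date (30th of February)
s = pd.to_datetime(pd.Series(['25/12/2019', '30/02/2019']),
                   format='%d/%m/%Y', errors='coerce')

# the invalid entry became NaT, which isna() detects
print(s.isna().tolist())  # [False, True]

# drop the missing dates, or replace them with a default
cleaned = s.dropna()
filled = s.fillna(pd.Timestamp('2019-03-01'))
print(cleaned)
print(filled)
```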