In [1]:
import pandas as pd

# Time series / date functionality in Pandas
* Pandas was originally developed for financial modelling (by Wes McKinney at AQR Capital)
* Handling time series is therefore an integral part of the package
* We're going to look at two different concepts in Pandas
    - 1) Timestamps
    - 2) Timedeltas

## 1) Timestamp objects
* Pandas has built-in `Timestamp` objects
* An array of `Timestamp` objects is a `DatetimeIndex`
* The dtype of such an array is `datetime64[ns]`
* There are two main ways of creating a Timestamp or a DatetimeIndex:
    - 1) `pd.to_datetime()`
    - 2) `pd.date_range()`

### 1.1) `pd.to_datetime()`

- `pd.to_datetime()` accepts many different string formats and converts them into a Timestamp

In [2]:
# Convert today's date into a pd.Timestamp

today = '29 March 2021'

In [3]:
type(today)

str

In [4]:
today = pd.to_datetime(today)

In [8]:
pd.to_datetime('10/03/2021', dayfirst=True)

Timestamp('2021-03-10 00:00:00')
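To make the point about formats concrete, here is a minimal sketch. The variable names are made up for illustration; the calls only use `pd.to_datetime()` with its `format` and `dayfirst` parameters.

```python
import pandas as pd

# Several spellings of the same date; each string is parsed on its own
iso = pd.to_datetime('2021-03-29')
long_form = pd.to_datetime('29 March 2021')
explicit = pd.to_datetime('29.03.2021', format='%d.%m.%Y')

# All three represent the same Timestamp
print(iso == long_form == explicit)

# Ambiguous numeric dates are read month-first unless dayfirst=True is passed
month_first = pd.to_datetime('10/03/2021')
day_first = pd.to_datetime('10/03/2021', dayfirst=True)
print(month_first.month, day_first.month)  # 10 3
```

Passing an explicit `format` is usually the safer choice when you know how the dates are written, because nothing has to be guessed.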

- You can also pass a `pd.Series` or a `pd.DataFrame` into `pd.to_datetime()` if the values are convertible to Timestamps.

In [12]:
# Create a list of dates, e.g. [today, tomorrow]

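# Use one consistent format; dayfirst=True states that the day comes before the month
datetime_index = pd.to_datetime(['29/03/2021', '30/03/2021'], dayfirst=True)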

In [13]:
datetime_index

DatetimeIndex(['2021-03-29', '2021-03-30'], dtype='datetime64[ns]', freq=None)

In [14]:
datetime_index[0]

Timestamp('2021-03-29 00:00:00')
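The remark about passing a `pd.Series` or a `pd.DataFrame` can be sketched as follows. The data and variable names are invented for illustration; the DataFrame variant assumes the columns are called `year`, `month` and `day`, which is what `pd.to_datetime()` expects when it assembles dates from columns.

```python
import pandas as pd

# A Series of date strings becomes a Series with a datetime64 dtype
dates_as_text = pd.Series(['2021-03-29', '2021-03-30'])
dates = pd.to_datetime(dates_as_text)

# A DataFrame is assembled row by row from its 'year', 'month' and 'day' columns
parts = pd.DataFrame({'year': [2021, 2021], 'month': [3, 3], 'day': [29, 30]})
assembled = pd.to_datetime(parts)

print(dates)
print(assembled)
```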

In [15]:
# Convert it to a pd.DatetimeIndex


### 1.2) `pd.date_range()`

What if you want to create a range of dates?
You can use `pd.date_range()` to create a DatetimeIndex (array of Timestamps):

In [16]:
# On which date did you start the bootcamp?

start_date = '15 March 2021'

In [17]:
# On which date are you going to graduate?

end_date = '10 June 2021'

In [18]:
# Create a DatetimeIndex from start to end

dt_range = pd.date_range(start=start_date, end=end_date, freq='D')

In [19]:
dt_range

DatetimeIndex(['2021-03-15', '2021-03-16', '2021-03-17', '2021-03-18',
               '2021-03-19', '2021-03-20', '2021-03-21', '2021-03-22',
               '2021-03-23', '2021-03-24', '2021-03-25', '2021-03-26',
               '2021-03-27', '2021-03-28', '2021-03-29', '2021-03-30',
               '2021-03-31', '2021-04-01', '2021-04-02', '2021-04-03',
               '2021-04-04', '2021-04-05', '2021-04-06', '2021-04-07',
               '2021-04-08', '2021-04-09', '2021-04-10', '2021-04-11',
               '2021-04-12', '2021-04-13', '2021-04-14', '2021-04-15',
               '2021-04-16', '2021-04-17', '2021-04-18', '2021-04-19',
               '2021-04-20', '2021-04-21', '2021-04-22', '2021-04-23',
               '2021-04-24', '2021-04-25', '2021-04-26', '2021-04-27',
               '2021-04-28', '2021-04-29', '2021-04-30', '2021-05-01',
               '2021-05-02', '2021-05-03', '2021-05-04', '2021-05-05',
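               '2021-05-06', '2021-05-07', '2021-05-08', '2021-05-09',
               '2021-05-10', '2021-05-11', '2021-05-12', '2021-05-13',
               '2021-05-14', '2021-05-15', '2021-05-16', '2021-05-17',
               '2021-05-18', '2021-05-19', '2021-05-20', '2021-05-21',
               '2021-05-22', '2021-05-23', '2021-05-24', '2021-05-25',
               '2021-05-26', '2021-05-27', '2021-05-28', '2021-05-29',
               '2021-05-30', '2021-05-31', '2021-06-01', '2021-06-02',
               '2021-06-03', '2021-06-04', '2021-06-05', '2021-06-06',
               '2021-06-07', '2021-06-08', '2021-06-09', '2021-06-10'],
              dtype='datetime64[ns]', freq='D')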

In [20]:
# Convert dt_range into a pd.Series
pd.Series(dt_range)

0    2021-03-15
1    2021-03-16
2    2021-03-17
3    2021-03-18
4    2021-03-19
        ...    
83   2021-06-06
84   2021-06-07
85   2021-06-08
86   2021-06-09
87   2021-06-10
Length: 88, dtype: datetime64[ns]
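`pd.date_range()` is not limited to daily steps between two fixed dates. A small sketch with the same bootcamp dates, assuming you want a fixed number of dates, business days only, or one date per week (the variable names are illustrative):

```python
import pandas as pd

# Give a number of periods instead of an end date
two_weeks = pd.date_range(start='2021-03-15', periods=14, freq='D')

# Business days only (Monday to Friday; public holidays are not excluded)
working_days = pd.date_range(start='2021-03-15', end='2021-06-10', freq='B')

# One Timestamp per week, anchored on Mondays
mondays = pd.date_range(start='2021-03-15', end='2021-06-10', freq='W-MON')

print(len(two_weeks), len(working_days), len(mondays))  # 14 64 13
```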

### 1.3) This is all nice and fine, but why do we care?

The reason we care about Timestamps in pandas is that they let us perform time-related operations on the data.
Several things you can do with a Timestamp:

- extract the hour
- extract the day
- extract the month
- extract the year
- slice a DataFrame or Series if a DatetimeIndex is its index
- calculate time differences

How could that be useful?
Let us look at the data for this week and think about what we could do with this functionality:

- Extract weekdays from the Timestamp
- Extract the hour of the day from the Timestamp
- Create subsets of the data
- ...

In [21]:
# Load this week's training data

df = pd.read_csv('./data/train.csv', parse_dates=True, index_col=0)
# parse_dates=True will try to interpret the index_col as a pd.DatetimeIndex

df.head()

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [22]:
# Inspect the type of the df.index

type(df.index)

pandas.core.indexes.datetimes.DatetimeIndex

In [23]:
# Until which date do we have data?

df.tail()

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336
2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241
2012-12-19 21:00:00,4,0,1,1,13.94,15.91,61,15.0013,4,164,168
2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129
2012-12-19 23:00:00,4,0,1,1,13.12,16.665,66,8.9981,4,84,88


In [24]:
df.index.max()

Timestamp('2012-12-19 23:00:00')

In [25]:
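# Slice the pd.DataFrame by one day (a whole year works the same way: df.loc['2011'])
# Use .loc for row selection; recent pandas versions treat df['2011-01-01'] as a column lookup

df.loc['2011-01-01']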

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1
2011-01-01 05:00:00,1,0,0,2,9.84,12.88,75,6.0032,0,1,1
2011-01-01 06:00:00,1,0,0,1,9.02,13.635,80,0.0,2,0,2
2011-01-01 07:00:00,1,0,0,1,8.2,12.88,86,0.0,1,2,3
2011-01-01 08:00:00,1,0,0,1,9.84,14.395,75,0.0,1,7,8
2011-01-01 09:00:00,1,0,0,1,13.12,17.425,76,0.0,8,6,14


In [28]:
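# Select a single row of the pd.DataFrame by date and hour
# (the 'hour' row in the output suggests the cell that adds that column had already been run)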
df.loc['2011-01-01 01']

season         1.000
holiday        0.000
workingday     0.000
weather        1.000
temp           9.020
atemp         13.635
humidity      80.000
windspeed      0.000
casual         8.000
registered    32.000
count         40.000
hour           1.000
Name: 2011-01-01 01:00:00, dtype: float64
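Partial-string indexing with `.loc` also works for coarser periods and for ranges. A minimal sketch on a made-up hourly frame (the name `toy` and its values are illustrative, not part of the course data):

```python
import pandas as pd

# Three days of hourly observations starting on 1 January 2011
index = pd.date_range(start='2011-01-01', periods=72, freq='h')
toy = pd.DataFrame({'count': range(72)}, index=index)

one_day = toy.loc['2011-01-02']                   # all 24 hours of 2 January
whole_month = toy.loc['2011-01']                  # every row in January 2011
span = toy.loc['2011-01-01 22':'2011-01-02 01']   # both endpoints are included

print(len(one_day), len(whole_month), len(span))  # 24 72 4
```

Unlike ordinary Python slices, label-based slices on a DatetimeIndex include the end point.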

In [29]:
# Extract information about the date or time

df['hour'] = df.index.hour

In [30]:
df.shape

(10886, 12)

In [31]:
df.head()

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,hour
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,0
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,1
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32,2
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13,3
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1,4


In [32]:
df.index.day_name()

Index(['Saturday', 'Saturday', 'Saturday', 'Saturday', 'Saturday', 'Saturday',
       'Saturday', 'Saturday', 'Saturday', 'Saturday',
       ...
       'Wednesday', 'Wednesday', 'Wednesday', 'Wednesday', 'Wednesday',
       'Wednesday', 'Wednesday', 'Wednesday', 'Wednesday', 'Wednesday'],
      dtype='object', name='datetime', length=10886)
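Once such components are columns, they can drive ordinary pandas operations such as `groupby`. A minimal sketch on invented data (the frame `toy` and its `count` values are assumptions for illustration only):

```python
import pandas as pd

# Two days of hourly observations; 1 January 2011 is a Saturday
index = pd.date_range(start='2011-01-01', periods=48, freq='h')
toy = pd.DataFrame({'count': range(48)}, index=index)

# Turn parts of the timestamp into regular columns
toy['hour'] = toy.index.hour
toy['weekday'] = toy.index.day_name()

# Average count per weekday
per_weekday = toy.groupby('weekday')['count'].mean()
print(per_weekday)
```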

In [33]:
# Can we use these datetime functionalities if the timestamps are a column rather than the index?
# Yes: on a Series they are available through the .dt accessor
df_reindexed = df.reset_index()
df_reindexed.head()

,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,hour
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,1
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32,2
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13,3
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1,4


In [34]:
df_reindexed.datetime[0]

Timestamp('2011-01-01 00:00:00')

In [35]:
df_reindexed.datetime.dt.day

0         1
1         1
2         1
3         1
4         1
         ..
10881    19
10882    19
10883    19
10884    19
10885    19
Name: datetime, Length: 10886, dtype: int64

In [36]:
# df_reindexed[df_reindexed.datetime == '2011-01-01']
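A comparison such as `datetime == '2011-01-01'` on a column matches only the timestamp at midnight, not the whole day. One way to select a full day is sketched below, assuming the timestamps live in a Series called `stamps` (a made-up name):

```python
import pandas as pd

stamps = pd.Series(pd.date_range(start='2011-01-01', periods=48, freq='h'),
                   name='datetime')

# Exact comparison: only 2011-01-01 00:00:00 matches
midnight_only = stamps[stamps == '2011-01-01']

# Strip the time of day first, then compare, to get every hour of that date
first_day = stamps[stamps.dt.normalize() == '2011-01-01']

print(len(midnight_only), len(first_day))  # 1 24
```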

### 1.4) `df.between_time()`

In [37]:
df.between_time(start_time='22:00', end_time='05:00')

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,hour
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16,0
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40,1
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32,2
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13,3
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...
2012-12-19 03:00:00,4,0,1,1,10.66,13.635,75,8.9981,0,5,5,3
2012-12-19 04:00:00,4,0,1,1,9.84,12.120,75,8.9981,1,6,7,4
2012-12-19 05:00:00,4,0,1,1,10.66,14.395,75,6.0032,2,29,31,5
2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129,22
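Because `start_time` is later than `end_time` in the call above, the window wraps around midnight, and both boundaries are included by default. A minimal sketch on one invented day of hourly data (the frame `toy` is illustrative):

```python
import pandas as pd

index = pd.date_range(start='2011-01-01', periods=24, freq='h')
toy = pd.DataFrame({'count': range(24)}, index=index)

# 22:00 to 05:00 selects the late evening and the early morning hours
night = toy.between_time(start_time='22:00', end_time='05:00')
print(night.index.hour.tolist())  # [0, 1, 2, 3, 4, 5, 22, 23]
```

The rows keep their original order, which is why the early-morning hours come before 22:00 in the output.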


## 2) Timedelta objects

* Pandas has built-in `Timedelta` objects
* An array of `Timedelta` objects is a `TimedeltaIndex`
* The dtype of such an array is `timedelta64[ns]`
* There are three ways of creating a Timedelta or a TimedeltaIndex:
    - 1) `pd.to_timedelta()`
    - 2) `pd.timedelta_range()`
    - 3) Subtract two pd.Timestamp objects
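The three options listed above can be sketched as follows; the variable names and values are chosen for illustration only.

```python
import pandas as pd

# 1) pd.to_timedelta(): from a string, or from a number plus a unit
one_and_a_half_days = pd.to_timedelta('1 days 12:00:00')
ninety_minutes = pd.to_timedelta(90, unit='min')

# 2) pd.timedelta_range(): a regularly spaced TimedeltaIndex
steps = pd.timedelta_range(start='0 days', periods=4, freq='6h')

# 3) Subtracting two Timestamps
gap = pd.Timestamp('2021-06-10') - pd.Timestamp('2021-03-15')

print(one_and_a_half_days, ninety_minutes, gap, sep=' | ')
print(steps)
```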

In [38]:
# Calculate the Timedelta between the last and the first observation of our data

time_range = df.index.max() - df.index.min()

In [39]:
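# .seconds is only what is left over after whole days (23 h = 82800 s);
# use time_range.days or time_range.total_seconds() for the full span
time_range.seconds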

82800
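The value 82800 is only the sub-day part of the Timedelta. The sketch below rebuilds the same span from the first and last timestamps shown earlier and compares the relevant attributes (`span` is an illustrative name):

```python
import pandas as pd

span = pd.Timestamp('2012-12-19 23:00:00') - pd.Timestamp('2011-01-01 00:00:00')

print(span)                  # 718 days 23:00:00
print(span.days)             # 718
print(span.seconds)          # 82800 (the 23 hours beyond the whole days)
print(span.total_seconds())  # 62118000.0 (the whole span in seconds)
```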

## 3) Other concepts: `resample()`, `shift()` and `rolling()`

You will meet these concepts later in the course.
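As a brief preview, here is a minimal sketch of what each of the three methods does on an invented hourly Series (the name `counts` and its values are assumptions, not course data):

```python
import pandas as pd

index = pd.date_range(start='2011-01-01', periods=48, freq='h')
counts = pd.Series(range(48), index=index, name='count')

daily_total = counts.resample('D').sum()            # aggregate hours into days
previous_hour = counts.shift(1)                     # each row gets the value of the row before
three_hour_mean = counts.rolling(window=3).mean()   # moving average over three rows

print(daily_total)
print(previous_hour.head(3))
print(three_hour_mean.head(4))
```

`shift()` and `rolling()` leave missing values at the start of the Series, because the first rows have no earlier observations to draw on.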