# TimeSeries Operations

In this lesson we'll explore time shifting and resampling (grouping), two of the most common operations on time series.

In [1]:
import pandas as pd
import numpy as np

## Time Shifting

In [2]:
ts = pd.Series(
    np.random.randn(10) * 10 + 500,
    index=pd.date_range(start='2018-01-01', periods=10, freq='D'))

In [3]:
ts

2018-01-01    500.951582
2018-01-02    496.084777
2018-01-03    490.357086
2018-01-04    495.445451
2018-01-05    493.706927
2018-01-06    505.912024
2018-01-07    490.610043
2018-01-08    497.658551
2018-01-09    480.753560
2018-01-10    487.750870
Freq: D, dtype: float64

In [4]:
ts.shift(1)

2018-01-01           NaN
2018-01-02    500.951582
2018-01-03    496.084777
2018-01-04    490.357086
2018-01-05    495.445451
2018-01-06    493.706927
2018-01-07    505.912024
2018-01-08    490.610043
2018-01-09    497.658551
2018-01-10    480.753560
Freq: D, dtype: float64

In [5]:
pd.DataFrame({
    'Original': ts,
    'Shift (1)': ts.shift(1),
    'Shift (2)': ts.shift(2)
})

Unnamed: 0,Original,Shift (1),Shift (2)
2018-01-01,500.951582,,
2018-01-02,496.084777,500.951582,
2018-01-03,490.357086,496.084777,500.951582
2018-01-04,495.445451,490.357086,496.084777
2018-01-05,493.706927,495.445451,490.357086
2018-01-06,505.912024,493.706927,495.445451
2018-01-07,490.610043,505.912024,493.706927
2018-01-08,497.658551,490.610043,505.912024
2018-01-09,480.75356,497.658551,490.610043
2018-01-10,487.75087,480.75356,497.658551
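`shift` is not limited to positive lags. As a minimal sketch (the values below are made up so the result is easy to check by eye), a negative period pulls future values back, and `fill_value` replaces the `NaN` that shifting leaves behind:

```python
import pandas as pd

# Small deterministic series (made-up values, not the random data used above)
s = pd.Series(
    [10.0, 20.0, 30.0, 40.0],
    index=pd.date_range(start='2018-01-01', periods=4, freq='D'))

lead = s.shift(-1)                        # each row now holds the *next* day's value
lag_filled = s.shift(1, fill_value=0.0)   # lag by one day, fill the gap with 0.0 instead of NaN

print(pd.DataFrame({
    'Original': s,
    'Lead (-1)': lead,
    'Lag (1), filled': lag_filled
}))
```

With a negative shift the missing value appears at the end of the series instead of the beginning.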


These operations are usually employed to compare a time series with its own previous values. For example, calculating the percent change over the previous period:

In [6]:
df = pd.DataFrame({
    'Original': ts,
    'Shifted': ts.shift(1)
})
df

Unnamed: 0,Original,Shifted
2018-01-01,500.951582,
2018-01-02,496.084777,500.951582
2018-01-03,490.357086,496.084777
2018-01-04,495.445451,490.357086
2018-01-05,493.706927,495.445451
2018-01-06,505.912024,493.706927
2018-01-07,490.610043,505.912024
2018-01-08,497.658551,490.610043
2018-01-09,480.75356,497.658551
2018-01-10,487.75087,480.75356


In [7]:
(df['Original'] / df['Shifted']) - 1

2018-01-01         NaN
2018-01-02   -0.009715
2018-01-03   -0.011546
2018-01-04    0.010377
2018-01-05   -0.003509
2018-01-06    0.024721
2018-01-07   -0.030246
2018-01-08    0.014367
2018-01-09   -0.033969
2018-01-10    0.014555
Freq: D, dtype: float64

You can see how much sales grew or shrank relative to the previous day.

This is a particularly silly example, because there's a pandas method specially intended for percentage changes: [`pct_change()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pct_change.html), so we don't even need `shift`:

In [8]:
ts.pct_change()

2018-01-01         NaN
2018-01-02   -0.009715
2018-01-03   -0.011546
2018-01-04    0.010377
2018-01-05   -0.003509
2018-01-06    0.024721
2018-01-07   -0.030246
2018-01-08    0.014367
2018-01-09   -0.033969
2018-01-10    0.014555
Freq: D, dtype: float64
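To convince ourselves that both approaches agree, here is a small sketch on made-up values. It also shows `diff()`, the absolute-change counterpart of `pct_change()`, and the `periods` argument for comparing against a value further back:

```python
import numpy as np
import pandas as pd

# Made-up daily values, chosen so the percentages are easy to verify
s = pd.Series(
    [100.0, 110.0, 99.0, 120.0],
    index=pd.date_range(start='2018-01-01', periods=4, freq='D'))

manual = s / s.shift(1) - 1      # the shift-based formula
builtin = s.pct_change()         # the dedicated method

# equal_nan=True because the first entry is NaN in both results
same_pct = np.allclose(manual, builtin, equal_nan=True)
same_diff = np.allclose(s - s.shift(1), s.diff(), equal_nan=True)

two_day = s.pct_change(periods=2)   # change versus the value two days earlier

print(same_pct, same_diff)
print(two_day)
```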

Shifting also accepts a `freq` argument. In that case the values stay where they are and the timestamps themselves are moved, which also works for offsets smaller than the original frequency:

In [9]:
ts.shift(1, freq='15Min')

2018-01-01 00:15:00    500.951582
2018-01-02 00:15:00    496.084777
2018-01-03 00:15:00    490.357086
2018-01-04 00:15:00    495.445451
2018-01-05 00:15:00    493.706927
2018-01-06 00:15:00    505.912024
2018-01-07 00:15:00    490.610043
2018-01-08 00:15:00    497.658551
2018-01-09 00:15:00    480.753560
2018-01-10 00:15:00    487.750870
Freq: D, dtype: float64
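The difference between the two kinds of shift is easy to miss, so here is a short side-by-side sketch on made-up values: without `freq` the data slides along a fixed index, with `freq` the index moves and no `NaN` is introduced.

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range(start='2018-01-01', periods=3, freq='D'))

moved_values = s.shift(1)                # index unchanged, values slide down, NaN appears
moved_index = s.shift(1, freq='15min')   # values unchanged, every timestamp moves +15 minutes

# The freq-based shift is the same as adding a Timedelta to the index
expected_index = s.index + pd.Timedelta(minutes=15)

print(moved_values)
print(moved_index)
```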

## Time Frequency

We'll now see how to change the frequency of our indexes. These are raw adjustments that directly change the frequency of the index, without aggregating any data:

In [10]:
ts = pd.Series(
    np.random.randn(10) * 10 + 500,
    index=pd.date_range(start='2018-01-01', periods=10, freq='H'))
ts

2018-01-01 00:00:00    515.467395
2018-01-01 01:00:00    516.777442
2018-01-01 02:00:00    501.168494
2018-01-01 03:00:00    510.984121
2018-01-01 04:00:00    496.795909
2018-01-01 05:00:00    481.157866
2018-01-01 06:00:00    508.577581
2018-01-01 07:00:00    512.112786
2018-01-01 08:00:00    507.675569
2018-01-01 09:00:00    506.803990
Freq: H, dtype: float64

In [11]:
ts.asfreq('45min') # careful: only timestamps that also exist in the original index keep a value, the rest become NaN

2018-01-01 00:00:00    515.467395
2018-01-01 00:45:00           NaN
2018-01-01 01:30:00           NaN
2018-01-01 02:15:00           NaN
2018-01-01 03:00:00    510.984121
2018-01-01 03:45:00           NaN
2018-01-01 04:30:00           NaN
2018-01-01 05:15:00           NaN
2018-01-01 06:00:00    508.577581
2018-01-01 06:45:00           NaN
2018-01-01 07:30:00           NaN
2018-01-01 08:15:00           NaN
2018-01-01 09:00:00    506.803990
Freq: 45T, dtype: float64

In [12]:
ts.asfreq('45Min', method='ffill')

2018-01-01 00:00:00    515.467395
2018-01-01 00:45:00    515.467395
2018-01-01 01:30:00    516.777442
2018-01-01 02:15:00    501.168494
2018-01-01 03:00:00    510.984121
2018-01-01 03:45:00    510.984121
2018-01-01 04:30:00    496.795909
2018-01-01 05:15:00    481.157866
2018-01-01 06:00:00    508.577581
2018-01-01 06:45:00    508.577581
2018-01-01 07:30:00    512.112786
2018-01-01 08:15:00    507.675569
2018-01-01 09:00:00    506.803990
Freq: 45T, dtype: float64

In [13]:
ts.asfreq('45Min', method='bfill')

2018-01-01 00:00:00    515.467395
2018-01-01 00:45:00    516.777442
2018-01-01 01:30:00    501.168494
2018-01-01 02:15:00    510.984121
2018-01-01 03:00:00    510.984121
2018-01-01 03:45:00    496.795909
2018-01-01 04:30:00    481.157866
2018-01-01 05:15:00    508.577581
2018-01-01 06:00:00    508.577581
2018-01-01 06:45:00    512.112786
2018-01-01 07:30:00    507.675569
2018-01-01 08:15:00    506.803990
2018-01-01 09:00:00    506.803990
Freq: 45T, dtype: float64
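One way to think about `asfreq` is as a reindex onto a freshly generated date range. The sketch below uses made-up values and checks that reading. The hourly index is written as `'60min'` rather than `'H'`, on the assumption that newer pandas releases prefer the lowercase spelling of that alias.

```python
import pandas as pd

# Ten hourly observations with made-up values 1.0 .. 10.0
hourly = pd.Series(
    [float(v) for v in range(1, 11)],
    index=pd.date_range(start='2018-01-01', periods=10, freq='60min'))

# The index asfreq would build: first to last timestamp, in 45 minute steps
target = pd.date_range(start=hourly.index[0], end=hourly.index[-1], freq='45min')

via_asfreq = hourly.asfreq('45min')
via_reindex = hourly.reindex(target)

via_asfreq_ffill = hourly.asfreq('45min', method='ffill')
via_reindex_ffill = hourly.reindex(target, method='ffill')

print(via_asfreq.equals(via_reindex))
print(via_asfreq.notna().sum())   # only timestamps on a full hour keep a value
```

Seen this way, the `NaN` values are no surprise: 00:45, 01:30 and 02:15 simply do not exist in the hourly index.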

In [14]:
ts.asfreq?

Signature: ts.asfreq(freq, method=None, how=None, normalize=False, fill_value=None)
Docstring:
Convert TimeSeries to specified frequency.

Optionally provide filling method to pad/backfill missing values.

Returns the original data conformed to a new index with the specified
frequency. ``resample`` is more appropriate if an operation, such as
summarization, is necessary to represent the data at the new frequency.

Parameters
----------
freq : DateOffset object, or string
method : {'backfill'/'bfill', 'pad'/'ffill'}, default None
    Method to use for filling holes in reindexed Series (note this
    does not fill NaNs that already were present):

    * 'pad' / 'ffill': propagate last valid observation forward to next
     

In these examples, we've gone from a "less frequent" index to a "more frequent" index. But we could go the other way:

In [15]:
ts = pd.Series(
    np.random.randn(20) * 10 + 500,
    index=pd.date_range(start='2018-01-01', periods=20, freq='30min'))
ts

2018-01-01 00:00:00    502.855957
2018-01-01 00:30:00    498.178670
2018-01-01 01:00:00    481.810078
2018-01-01 01:30:00    506.937012
2018-01-01 02:00:00    503.455202
2018-01-01 02:30:00    493.525074
2018-01-01 03:00:00    477.763373
2018-01-01 03:30:00    509.070876
2018-01-01 04:00:00    513.217378
2018-01-01 04:30:00    494.092888
2018-01-01 05:00:00    495.190586
2018-01-01 05:30:00    515.384765
2018-01-01 06:00:00    507.185273
2018-01-01 06:30:00    485.879794
2018-01-01 07:00:00    488.165561
2018-01-01 07:30:00    507.492558
2018-01-01 08:00:00    500.645558
2018-01-01 08:30:00    498.414551
2018-01-01 09:00:00    506.548212
2018-01-01 09:30:00    491.627167
Freq: 30T, dtype: float64

In [16]:
ts.asfreq('2H')

2018-01-01 00:00:00    502.855957
2018-01-01 02:00:00    503.455202
2018-01-01 04:00:00    513.217378
2018-01-01 06:00:00    507.185273
2018-01-01 08:00:00    500.645558
Freq: 2H, dtype: float64

In [17]:
ts.asfreq('2H25min')

2018-01-01 00:00:00    502.855957
2018-01-01 02:25:00           NaN
2018-01-01 04:50:00           NaN
2018-01-01 07:15:00           NaN
Freq: 145T, dtype: float64

In [18]:
ts.asfreq('2H25min', method='ffill')

2018-01-01 00:00:00    502.855957
2018-01-01 02:25:00    503.455202
2018-01-01 04:50:00    494.092888
2018-01-01 07:15:00    488.165561
Freq: 145T, dtype: float64
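Note that going to a lower frequency with `asfreq` only selects observations; everything in between is dropped. A minimal sketch with made-up values (using `'120min'` in place of `'2H'`, on the assumption that the uppercase alias may be deprecated in newer pandas):

```python
import pandas as pd

# Twenty half-hourly observations with made-up values 0.0 .. 19.0
half_hourly = pd.Series(
    [float(v) for v in range(20)],
    index=pd.date_range(start='2018-01-01', periods=20, freq='30min'))

picked = half_hourly.asfreq('120min')   # keeps 00:00, 02:00, 04:00, ...
every_fourth = half_hourly.iloc[::4]    # plain positional slicing gives the same rows

dropped = len(half_hourly) - len(picked)

print(picked)
print(dropped, 'observations were discarded')
```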

But what if you want to do some more "advanced" filling, for example, filling the new frequency's values with the mean? For that, we'll use resampling:

### Resampling

Resampling a timeseries is converting it to another time frequency. If you're going from high frequency to low frequency, the process is called "downsampling", and it involves an aggregation process. For example, you have daily sales data, and you want to aggregate it by month. You'll be "grouping" your daily sales per month, and you need to decide the aggregation operation to perform. For example, `sum` to get the total sales per month, or `mean` to get the average sale. Let's use an example:

In [39]:
all_days_2018 = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D')
ts = pd.Series(
    np.random.randn(20) * 10 + 500,
    index=np.random.choice(all_days_2018, size=20))

ts.sort_index(inplace=True)
ts

2018-01-03    508.679819
2018-01-09    497.012917
2018-01-25    492.489636
2018-02-07    493.097634
2018-02-10    501.020459
2018-03-01    511.552410
2018-03-06    511.169199
2018-04-06    496.049118
2018-04-13    495.219019
2018-04-20    513.710480
2018-04-30    495.490997
2018-05-31    493.725044
2018-07-04    494.377761
2018-07-05    497.025863
2018-07-14    511.863808
2018-08-05    500.289235
2018-08-06    502.710773
2018-08-27    493.514904
2018-10-14    485.721787
2018-10-27    509.466544
dtype: float64

January sales:

In [40]:
ts['2018-01']

2018-01-03    508.679819
2018-01-09    497.012917
2018-01-25    492.489636
dtype: float64

In [41]:
ts['2018-01'].sum()

1498.1823719972317

February sales:

In [42]:
ts['2018-02']

2018-02-07    493.097634
2018-02-10    501.020459
dtype: float64

In [43]:
ts['2018-02'].sum()

994.1180937119402

**Downsampling**: We'll now use `resample` to "group" the sales monthly (downsampling our TimeSeries), and calculate the total sales per month:

In [44]:
ts.resample('M').sum() # 'M' labels each group with the month end

2018-01-31    1498.182372
2018-02-28     994.118094
2018-03-31    1022.721610
2018-04-30    2000.469613
2018-05-31     493.725044
2018-06-30       0.000000
2018-07-31    1503.267431
2018-08-31    1496.514912
2018-09-30       0.000000
2018-10-31     995.188332
Freq: M, dtype: float64

The parameter `M` means "month end frequency". We could instead choose "Month Start":

In [45]:
ts.resample('MS').sum()

2018-01-01    1498.182372
2018-02-01     994.118094
2018-03-01    1022.721610
2018-04-01    2000.469613
2018-05-01     493.725044
2018-06-01       0.000000
2018-07-01    1503.267431
2018-08-01    1496.514912
2018-09-01       0.000000
2018-10-01     995.188332
Freq: MS, dtype: float64
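Months without any sales show up as `0.000000` in the totals above, which can be misleading. The sketch below, on made-up data with an empty February, shows two ways to tell "no data" apart from "zero sales": the `min_count` argument of `sum`, and a plain `groupby` on the month, which skips empty months altogether.

```python
import pandas as pd

# Made-up sales: two in January, none in February, one in March
sales = pd.Series(
    [10.0, 20.0, 5.0],
    index=pd.to_datetime(['2018-01-03', '2018-01-25', '2018-03-06']))

totals = sales.resample('MS').sum()                 # empty month sums to 0.0
totals_nan = sales.resample('MS').sum(min_count=1)  # empty month becomes NaN

# Grouping by month period only produces groups that actually contain data
by_period = sales.groupby(sales.index.to_period('M')).sum()

print(totals)
print(totals_nan)
print(by_period)
```

Which behaviour is right depends on the data: for sales, an empty month may genuinely mean zero, while for sensor readings it usually means "unknown".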

This of course yields the same results, but the index contains the first day of each month. More correctly speaking, in this example we're collecting sales for _"the period January 2018"_. Pandas also has a `Period` type, which we can use with the `kind` parameter:

In [48]:
monthly_sales = ts.resample('M', kind='period').sum() # this is the correct period, but isn't supported as well in other libraries
monthly_sales

2018-01    1498.182372
2018-02     994.118094
2018-03    1022.721610
2018-04    2000.469613
2018-05     493.725044
2018-06       0.000000
2018-07    1503.267431
2018-08    1496.514912
2018-09       0.000000
2018-10     995.188332
Freq: M, dtype: float64

In [49]:
monthly_sales.index

PeriodIndex(['2018-01', '2018-02', '2018-03', '2018-04', '2018-05', '2018-06',
             '2018-07', '2018-08', '2018-09', '2018-10'],
            dtype='period[M]', freq='M')

As you can see, the Index is a `PeriodIndex`. Each entry in the index is of type `pd.Period`: 

In [28]:
monthly_sales.index[0]

Period('2018-01', 'M')
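As far as we can tell, recent pandas releases (2.2 and later) deprecate the `kind` argument of `resample` and suggest converting the index explicitly. A minimal sketch of that alternative on made-up data, which also shows that a `Period` knows the start and end of the span it covers:

```python
import pandas as pd

# Made-up sales: two in January, one in February
sales = pd.Series(
    [10.0, 20.0, 5.0],
    index=pd.to_datetime(['2018-01-03', '2018-01-25', '2018-02-07']))

# Resample on timestamps first, then convert the resulting index to periods
monthly = sales.resample('MS').sum().to_period('M')

first = monthly.index[0]
print(monthly)
print(first.start_time, first.end_time)   # the time span the period stands for
```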

Periods support basic arithmetic operations, which makes them convenient for expressing these time ranges:

In [29]:
pd.Period('2018-01') + 5

Period('2018-06', 'M')

In [30]:
pd.Period('2018-01', freq='H') + 9

Period('2018-01-01 09:00', 'H')
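A few more `Period` operations in the same spirit, sketched with explicit `freq` arguments so nothing depends on inference: building a range of periods, and converting a period to a finer frequency.

```python
import pandas as pd

start = pd.Period('2018-01', freq='M')

later = start + 5                                   # five months later
quarter_one = pd.period_range(start='2018-01', periods=3, freq='M')

# Convert the month to a daily period; how='end' picks the last day of the month
last_day = start.asfreq('D', how='end')

print(later)
print(quarter_one)
print(last_day)
```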

**Upsampling**: With upsampling we'll convert a low-frequency time series to a higher frequency time series. We'll add more "time points". Let's use an example:

We'll start with 3 months of sales, only 3 data points:

In [31]:
ts = pd.Series(
    np.random.randn(3) * 10 + 500,
    index=pd.date_range(start='2018-01-01', periods=3, freq='MS'))
ts

2018-01-01    503.586684
2018-02-01    488.363833
2018-03-01    496.702451
Freq: MS, dtype: float64

We'll now `resample` it to "Semi Month Start" frequency, that is, the 1st and the 15th of each month:

In [32]:
ts.resample('SMS').asfreq()

2018-01-01    503.586684
2018-01-15           NaN
2018-02-01    488.363833
2018-02-15           NaN
2018-03-01    496.702451
Freq: SMS-15, dtype: float64

And as you can see, we have a few missing values, because we don't have data for those specific time periods. What can you do with that missing data? One option is to fill it with previous data:

In [33]:
ts.resample('SMS').ffill()

2018-01-01    503.586684
2018-01-15    503.586684
2018-02-01    488.363833
2018-02-15    488.363833
2018-03-01    496.702451
Freq: SMS-15, dtype: float64
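Forward filling is not the only option. Another one is to interpolate between the known points. The sketch below uses made-up monthly values and `interpolate(method='time')`, which weights by the actual time elapsed, so the uneven gaps (14 days from the 1st to the 15th, then the rest of the month) are taken into account:

```python
import pandas as pd

# Made-up monthly values, chosen so the interpolated numbers come out round
monthly = pd.Series(
    [100.0, 131.0, 159.0],
    index=pd.date_range(start='2018-01-01', periods=3, freq='MS'))

upsampled = monthly.resample('SMS').asfreq()    # NaN on the 15th of each month
filled = upsampled.interpolate(method='time')   # linear in elapsed time between known points

print(filled)
```

January has 31 days and the value rises by 31 over the month, so the 15th (14 days in) lands on 114.0; February 15 sits exactly halfway between 131.0 and 159.0.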