# Resampling

Data often arrives at one frequency when the analysis needs another. Resampling changes the frequency of a time series; the aim is to do so without distorting the data, for example by leaking future values into the past.

In [1]:
import pandas as pd
import numpy as np

In [2]:
rng = pd.date_range('1/1/2011', periods = 72, freq = 'H')
ts = pd.Series(list(range(len(rng))), index = rng)

In [3]:
converted = ts.asfreq('45Min', method = 'ffill')

### What does the above code do to the size and content of your Series?

In [4]:
converted[1:10]

2011-01-01 00:45:00    0
2011-01-01 01:30:00    1
2011-01-01 02:15:00    2
2011-01-01 03:00:00    3
2011-01-01 03:45:00    3
2011-01-01 04:30:00    4
2011-01-01 05:15:00    5
2011-01-01 06:00:00    6
2011-01-01 06:45:00    6
Freq: 45T, dtype: int64

### Take a look at the specs for .asfreq(). What are your options for filling in missing data?

In [5]:
ts[1:10]

2011-01-01 01:00:00    1
2011-01-01 02:00:00    2
2011-01-01 03:00:00    3
2011-01-01 04:00:00    4
2011-01-01 05:00:00    5
2011-01-01 06:00:00    6
2011-01-01 07:00:00    7
2011-01-01 08:00:00    8
2011-01-01 09:00:00    9
Freq: H, dtype: int64

In [6]:
print(ts.shape)
print(converted.shape)

(72,)
(95,)


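_Reply:_ This conversion increased the frequency from hourly to every 45 minutes, which required adding rows: the Series grew from 72 to 95 entries. With `.asfreq()`, the new rows can be filled with a forward fill (`method = 'ffill'`), a backward fill (`method = 'bfill'`), or a constant (`fill_value`); with none of these, they are left as `NaN`. Interpolation is also possible, but as a separate step (for example with `.interpolate()`). With time series, though, backfills and interpolation tend to move future data into the past.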
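To compare the fill options side by side, here is a minimal sketch on a four-point hourly series. The values are arbitrary illustration data, not the notebook's `ts`, and the lowercase aliases `'h'` and `'30min'` are used because recent pandas releases prefer them.

```python
import pandas as pd

# Four hourly observations; the values are arbitrary illustration data.
idx = pd.date_range("2011-01-01", periods=4, freq="h")
s = pd.Series([0, 10, 20, 30], index=idx)

# Upsample to 30 minutes with each fill option that asfreq() accepts.
no_fill = s.asfreq("30min")                   # new rows are NaN
forward = s.asfreq("30min", method="ffill")   # carry the last known value forward
backward = s.asfreq("30min", method="bfill")  # pull the next known value backward
constant = s.asfreq("30min", fill_value=-1)   # fixed placeholder in the new rows

comparison = pd.DataFrame(
    {"none": no_fill, "ffill": forward, "bfill": backward, "fill_value": constant}
)
print(comparison)
```

In the 00:30 row, the `bfill` column already shows the value that was only observed at 01:00, which is the lookahead problem described in the reply.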

### How can you go to less frequent rather than more frequent?

In [7]:
converted = ts.asfreq('3H')

In [8]:
converted[1:10]

2011-01-01 03:00:00     3
2011-01-01 06:00:00     6
2011-01-01 09:00:00     9
2011-01-01 12:00:00    12
2011-01-01 15:00:00    15
2011-01-01 18:00:00    18
2011-01-01 21:00:00    21
2011-01-02 00:00:00    24
2011-01-02 03:00:00    27
Freq: 3H, dtype: int64

In [9]:
ts[1:10]

2011-01-01 01:00:00    1
2011-01-01 02:00:00    2
2011-01-01 03:00:00    3
2011-01-01 04:00:00    4
2011-01-01 05:00:00    5
2011-01-01 06:00:00    6
2011-01-01 07:00:00    7
2011-01-01 08:00:00    8
2011-01-01 09:00:00    9
Freq: H, dtype: int64

In [10]:
# Let's try the more flexible .resample()
ts.resample('2H').mean()[1:10]

2011-01-01 02:00:00     2.5
2011-01-01 04:00:00     4.5
2011-01-01 06:00:00     6.5
2011-01-01 08:00:00     8.5
2011-01-01 10:00:00    10.5
2011-01-01 12:00:00    12.5
2011-01-01 14:00:00    14.5
2011-01-01 16:00:00    16.5
2011-01-01 18:00:00    18.5
Freq: 2H, dtype: float64
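The difference between the two downsampling approaches is easier to see on a tiny series. The sketch below uses six hourly points (toy data, not the notebook's `ts`): `.asfreq()` keeps only the values that sit on the new timestamps, while `.resample()` aggregates everything that falls into each bin.

```python
import pandas as pd

idx = pd.date_range("2011-01-01", periods=6, freq="h")
s = pd.Series(range(6), index=idx)

picked = s.asfreq("2h")             # keeps the values at 00:00, 02:00, 04:00
averaged = s.resample("2h").mean()  # averages each 2-hour bin: (0+1)/2, (2+3)/2, (4+5)/2
totals = s.resample("2h").sum()     # sums each 2-hour bin

print(pd.DataFrame({"asfreq": picked, "mean": averaged, "sum": totals}))
```

With `.asfreq()`, the observations at 01:00, 03:00, and 05:00 are simply discarded; with `.resample()`, they contribute to the bin that contains them.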

In [11]:
# What's particularly useful is that we can use resample to even out irregular time series
# .iloc makes the positional selection explicit
irreg_ts = ts.iloc[np.random.choice(len(ts), size = 10, replace = False)]

In [12]:
irreg_ts

2011-01-01 01:00:00     1
2011-01-03 23:00:00    71
2011-01-03 20:00:00    68
2011-01-03 19:00:00    67
2011-01-02 23:00:00    47
2011-01-02 16:00:00    40
2011-01-03 12:00:00    60
2011-01-02 08:00:00    32
2011-01-03 03:00:00    51
2011-01-01 04:00:00     4
dtype: int64

In [13]:
irreg_ts.asfreq('D')

2011-01-01 01:00:00    1
Freq: D, dtype: int64

### Why didn't that work?

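_Reply:_ The data was not in chronological order. The first and last index labels both fall on January 1 (01:00 and 04:00), so the range that `.asfreq()` saw between them had room for only one daily timestamp.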
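One way to diagnose this before changing frequency is to check whether the index is sorted. The sketch below rebuilds the unsorted sample from the positions printed above (so it is repeatable) and inspects the first and last labels; the variable names are illustrative only.

```python
import pandas as pd

idx = pd.date_range("2011-01-01", periods=72, freq="h")
ts = pd.Series(range(len(idx)), index=idx)

# Positions copied from the unsorted sample printed above.
irreg = ts.iloc[[1, 71, 68, 67, 47, 40, 60, 32, 51, 4]]

print(irreg.index.is_monotonic_increasing)  # is the index in chronological order?
print(irreg.index[0], irreg.index[-1])      # first and last labels as stored

# Sorting restores chronological order before any frequency change.
irreg_sorted = irreg.sort_index()
print(irreg_sorted.index.is_monotonic_increasing)
print(irreg_sorted.index[0], irreg_sorted.index[-1])
```

Checking `is_monotonic_increasing` is cheap, so it can reasonably be done whenever a series comes from sampling, merging, or an external file.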

In [14]:
irreg_ts = irreg_ts.sort_index()
irreg_ts

2011-01-01 01:00:00     1
2011-01-01 04:00:00     4
2011-01-02 08:00:00    32
2011-01-02 16:00:00    40
2011-01-02 23:00:00    47
2011-01-03 03:00:00    51
2011-01-03 12:00:00    60
2011-01-03 19:00:00    67
2011-01-03 20:00:00    68
2011-01-03 23:00:00    71
dtype: int64

In [15]:
irreg_ts.asfreq('D')

2011-01-01 01:00:00    1.0
2011-01-02 01:00:00    NaN
2011-01-03 01:00:00    NaN
Freq: D, dtype: float64

In [16]:
irreg_ts.resample('D').count()

2011-01-01    2
2011-01-02    3
2011-01-03    5
Freq: D, dtype: int64
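The two results above differ because `.asfreq('D')` looks up exact timestamps (01:00 on each day, counted from the first label), while `.resample('D')` groups every observation into the calendar day it falls in. A minimal sketch, rebuilding the sorted sample from the positions printed above and computing a few daily aggregates at once:

```python
import pandas as pd

idx = pd.date_range("2011-01-01", periods=72, freq="h")
ts = pd.Series(range(len(idx)), index=idx)

# Positions copied from the sorted sample printed above.
irreg = ts.iloc[[1, 4, 32, 40, 47, 51, 60, 67, 68, 71]]

# asfreq looks up exact timestamps: only 2011-01-01 01:00 has an observation.
exact = irreg.asfreq("D")

# resample groups every observation into the calendar day it falls in.
daily = irreg.resample("D").agg(["count", "mean", "min", "max"])

print(exact)
print(daily)
```

For irregular data, an aggregation over bins usually uses every observation, whereas an exact lookup depends on observations happening to land on the new timestamps.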

# Try

(1) What if you want to go to a higher frequency, but you don't want to back fill or forward fill? Why might you want to do that?

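_Reply:_ You don't have to fill anything in; you could, instead, simply leave the method blank and allow pandas to fill the new rows with `NaN`. That keeps the gaps visibly missing instead of presenting filled-in values as if they were observations.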

(2) What is the difference between .resample() and .asfreq()?

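_Reply:_ Resampling is the more systematic method: `.resample()` groups the observations into time bins and returns a Resampler object that you then aggregate or fill. `.asfreq()` only reindexes to the new timestamps, keeping the values that fall exactly on them, which makes it, as the video indicates, a tool for 'fast and loose changes' in frequency.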

(3) How can I forward-fill only a few days? (hint: .fillna())

_Reply:_ see code below

(4) What are some helpful functions to use with a Resampler object?

_Reply:_ One of the things you can do is apply the `fillna()` method (or the equivalent `ffill()` and `bfill()`), with a `limit`, to forward- or backfill data for limited amounts of time. You can also run different data aggregation operations, such as finding the means, sums, and counts of observations within each time bin.
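The replies to (1), (3), and (4) can be sketched together on a small, hand-made series with gaps. The timestamps and values below are made up for illustration, and the limit of 2 is arbitrary. The sketch uses `.ffill(limit=...)` on the Resampler, which should give the same result as the `fillna(method = 'ffill', limit = ...)` call named in the hint.

```python
import pandas as pd

# Three observations with gaps between them (illustration data).
stamps = pd.to_datetime(["2011-01-01 01:00", "2011-01-01 06:00", "2011-01-01 08:00"])
obs = pd.Series([1, 6, 8], index=stamps)

hourly = obs.resample("h")

as_is = hourly.asfreq()         # (1) no filling: NaN wherever no observation exists
capped = hourly.ffill(limit=2)  # (3) carry each value forward at most 2 rows
counts = hourly.count()         # (4) an aggregation: observations per hourly bin

print(pd.DataFrame({"asfreq": as_is, "ffill_limit_2": capped, "count": counts}))
```

The `count` column makes it possible to tell real observations from filled rows after the fact, which is one reason to compute it alongside any fill.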

In [17]:
rng = pd.date_range('1/1/2011', periods = 72, freq = 'H')
ts = pd.Series(list(range(len(rng))), index = rng)
irreg_ts = ts.iloc[np.random.choice(len(ts), size = 10, replace = False)]  # .iloc selects by position
irreg_ts = irreg_ts.sort_index()
irreg_ts

2011-01-01 01:00:00     1
2011-01-01 15:00:00    15
2011-01-01 23:00:00    23
2011-01-02 13:00:00    37
2011-01-02 14:00:00    38
2011-01-03 08:00:00    56
2011-01-03 12:00:00    60
2011-01-03 15:00:00    63
2011-01-03 16:00:00    64
2011-01-03 21:00:00    69
dtype: int64

In [18]:
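# Forward-fill each observation by at most 5 hourly rows.
# Recent pandas releases deprecate Resampler.fillna; .ffill(limit = 5) is the equivalent call there.
irreg_ts.resample('H').fillna(method = 'ffill', limit = 5)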

2011-01-01 01:00:00     1.0
2011-01-01 02:00:00     1.0
2011-01-01 03:00:00     1.0
2011-01-01 04:00:00     1.0
2011-01-01 05:00:00     1.0
2011-01-01 06:00:00     1.0
2011-01-01 07:00:00     NaN
2011-01-01 08:00:00     NaN
2011-01-01 09:00:00     NaN
2011-01-01 10:00:00     NaN
2011-01-01 11:00:00     NaN
2011-01-01 12:00:00     NaN
2011-01-01 13:00:00     NaN
2011-01-01 14:00:00     NaN
2011-01-01 15:00:00    15.0
2011-01-01 16:00:00    15.0
2011-01-01 17:00:00    15.0
2011-01-01 18:00:00    15.0
2011-01-01 19:00:00    15.0
2011-01-01 20:00:00    15.0
2011-01-01 21:00:00     NaN
2011-01-01 22:00:00     NaN
2011-01-01 23:00:00    23.0
2011-01-02 00:00:00    23.0
2011-01-02 01:00:00    23.0
2011-01-02 02:00:00    23.0
2011-01-02 03:00:00    23.0
2011-01-02 04:00:00    23.0
2011-01-02 05:00:00     NaN
2011-01-02 06:00:00     NaN
                       ... 
2011-01-02 16:00:00    38.0
2011-01-02 17:00:00    38.0
2011-01-02 18:00:00    38.0
2011-01-02 19:00:00    38.0
2011-01-02 20:00:00 

In [19]:
irreg_ts.resample('H')

DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
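As the output above suggests, a Resampler on its own holds no results; it describes how the data will be binned until an aggregation or fill is applied. One way to see the bins is to iterate over the object, which yields a label and the sub-series for each bin. A small sketch, rebuilding the second sample from the positions printed in cell 17 and binning by day:

```python
import pandas as pd

idx = pd.date_range("2011-01-01", periods=72, freq="h")
ts = pd.Series(range(len(idx)), index=idx)

# Positions copied from the sorted sample printed in cell 17.
irreg = ts.iloc[[1, 15, 23, 37, 38, 56, 60, 63, 64, 69]]

# Iterating a Resampler yields (bin label, sub-series) pairs.
groups = {label: group.tolist() for label, group in irreg.resample("D")}

for label, values in groups.items():
    print(label.date(), values)
```

Looking at the raw groups like this can help when an aggregate looks surprising and it is unclear which observations ended up in which bin.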