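## Resampling
Resampling is the process of converting a time series from one frequency to another. Aggregating higher-frequency data to a lower frequency is called downsampling, while converting lower-frequency data to a higher frequency is called upsampling. Not all resampling falls into one of these two categories. For example, converting W-WED (weekly on Wednesday) to W-FRI is neither downsampling nor upsampling.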
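
As a minimal sketch of the W-WED to W-FRI case, the block below re-bins a small Wednesday-anchored weekly series onto weeks that end on Friday. The values 1, 2, 3 are made up for illustration. The number of rows stays the same; only the labels move.

```python
import pandas as pd

# Hypothetical weekly series labeled on Wednesdays (values are illustrative only).
wed = pd.Series([1, 2, 3],
                index=pd.date_range('2000-01-05', periods=3, freq='W-WED'))

# Re-bin onto weeks that end on Friday; each bin still holds exactly one value.
fri = wed.resample('W-FRI').sum()

print(fri)
```

Because every Friday-ending week contains exactly one Wednesday, nothing is aggregated and nothing needs filling.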

In [6]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np

In [4]:
rng = pd.date_range('1/1/2000',periods=100,freq='D')

In [8]:
ts = Series(np.random.randn(len(rng)), index = rng)

In [10]:
ts.resample('M').mean()

2000-01-31    0.047860
2000-02-29    0.055687
2000-03-31    0.137204
2000-04-30   -0.262742
Freq: M, dtype: float64

In [12]:
ts.resample('M', kind='period').mean()

2000-01    0.047860
2000-02    0.055687
2000-03    0.137204
2000-04   -0.262742
Freq: M, dtype: float64
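
The kind keyword may not be available in every pandas version (an assumption worth checking against your installed release). One way to get period-labeled monthly means without it is to group by the monthly period of each timestamp. This is a sketch of an alternative technique, not the call used above, and it uses deterministic stand-in data instead of random numbers.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=100, freq='D')
# Deterministic stand-in for the random data used above.
ts = pd.Series(np.arange(100.0), index=rng)

# Group the daily values by the monthly period each timestamp falls in.
monthly = ts.groupby(ts.index.to_period('M')).mean()

print(monthly)
```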

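Parameters of the resample method. The how keyword belongs to the older call style; the new syntax is a method call such as .resample(...).mean().

Parameter | Description
---|---
freq | String or DateOffset indicating the desired resampled frequency, e.g. 'M', '5min', or Second(15)
how = 'mean' | Function name or array function producing the aggregated value, e.g. 'mean', 'ohlc', np.max. Defaults to 'mean'. Other common values: 'first', 'last', 'median', 'ohlc', 'max', 'min'
axis = 0 | Axis to resample on; defaults to axis = 0
closed = None | In downsampling, which end of each interval is closed (inclusive), 'right' or 'left'. Defaults to 'left' for all frequencies except 'M', 'A', 'Q', 'BM', 'BA', 'BQ', and 'W', which default to 'right'
label = None | In downsampling, how to label the aggregated result, with the 'right' or 'left' bin edge. For example, the five-minute interval from 9:30 to 9:35 could be labeled 9:30 or 9:35. The default follows the same rule as closed
loffset = None | Time adjustment applied to the bin labels, e.g. '-1s' to shift the labels one second earlier
limit = None | When forward or backward filling, the maximum number of periods to fill
kind = None | Aggregate to periods ('period') or timestamps ('timestamp'); defaults to the type of index the time series has
convention = 'start' | When resampling periods, the convention ('start' or 'end') for converting the low-frequency period to high frequency. Defaults to 'start'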
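
To illustrate how an aggregation name from the table maps onto the method-call style, here is a small sketch that computes open/high/low/close values per bin. The 12-point minute series is an assumption chosen so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Open/high/low/close of each 5-minute bin, written in the method-call style.
bars = ts.resample('5min').ohlc()

print(bars)
```

The result is a DataFrame with one row per bin and the four columns open, high, low, and close.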

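## Downsampling
Aggregating data to a regular, lower frequency is a very common time series task. The data being aggregated need not have a fixed frequency; the desired frequency defines the bin edges used to slice the time series into pieces to aggregate. For example, to convert to monthly frequency ('M' or 'BM'), the data must be chopped into one-month intervals. Each interval is half-open: a data point can belong to only one interval, and the union of the intervals must make up the whole time frame. There are two things to think about when downsampling with resample:

- Which side of each interval is closed.
- How to label each aggregated bin, with either the start or the end of the interval.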

In [14]:
rng = pd.date_range('1/1/2000',periods=12,freq='T')

ts=Series(np.arange(12),index=rng)

ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int64

In [16]:
# Suppose we want to aggregate this data into five-minute chunks by taking the sum of each group:
ts.resample('5min').sum()

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int64
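
The two choices described earlier (which side of each interval is closed, and which edge labels the bin) can be compared on the same 12-point series. This is a minimal sketch; the series is rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Left-closed bins labeled by their left edge: [00:00, 00:05), [00:05, 00:10), ...
left = ts.resample('5min', closed='left', label='left').sum()

# Right-closed bins labeled by their right edge: (23:55, 00:00], (00:00, 00:05], ...
right = ts.resample('5min', closed='right', label='right').sum()

print(left)
print(right)
```

With right-closed bins the value at 00:00 falls into a bin of its own, so the result has four rows instead of three.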

In [18]:
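# Shift the result index by a fixed amount with loffset, which accepts a string or a date offset.
# The labels here are the left bin edges, so subtracting one second moves each label to just before its bin starts.
ts.resample('5min', loffset='-1s').sum()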

1999-12-31 23:59:59    10
2000-01-01 00:04:59    35
2000-01-01 00:09:59    21
Freq: 5T, dtype: int64
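
The loffset keyword is not accepted by every pandas version (it was removed in later releases). A version-independent way to get the same labels, sketched below, is to aggregate first and then shift the result index directly.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Aggregate into 5-minute bins, then move every label one second earlier.
result = ts.resample('5min').sum()
result.index = result.index + pd.Timedelta(seconds=-1)

print(result)
```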

## Upsampling and interpolation
When converting from a low frequency to a higher frequency, no aggregation is needed.

In [28]:
frame = DataFrame(np.random.randn(2, 4),
                  index=pd.date_range('1/1/2000', periods=2, freq='W-WED'),
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])

frame

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01-05 | 1.809102 | -1.630959 | -1.006704 | 0.121054 |
| 2000-01-12 | 1.474755 | -0.673744 | 1.629237 | 0.017543 |


In [23]:
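# Resampling to daily frequency introduces missing values by default.
# Note that resample('D') on its own only returns a Resampler object; the daily frame, with NaN for the new days, appears once a method such as asfreq() is called on it.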

df_daily=frame.resample('D')

df_daily

<pandas.core.resample.DatetimeIndexResampler object at 0x7f3f05243a50>
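
A minimal sketch of materializing the daily frame follows. It uses a deterministic stand-in for the random frame above, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Stand-in for the random weekly frame (values 0..7 instead of random numbers).
frame = pd.DataFrame(np.arange(8.0).reshape(2, 4),
                     index=pd.date_range('2000-01-05', periods=2, freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])

# asfreq() converts to daily frequency without filling: new days hold NaN.
daily = frame.resample('D').asfreq()

print(daily)
```

The two original Wednesdays keep their values, and the six days between them are all NaN.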

In [29]:
# Filling and interpolation when resampling work the same way as in fillna and reindex

frame.resample('D').ffill()


| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01-05 | 1.809102 | -1.630959 | -1.006704 | 0.121054 |
| 2000-01-06 | 1.809102 | -1.630959 | -1.006704 | 0.121054 |
| 2000-01-07 | 1.809102 | -1.630959 | -1.006704 | 0.121054 |
| 2000-01-08 | 1.809102 | -1.630959 | -1.006704 | 0.121054 |
| 2000-01-09 | 1.809102 | -1.630959 | -1.006704 | 0.121054 |
| 2000-01-10 | 1.809102 | -1.630959 | -1.006704 | 0.121054 |
| 2000-01-11 | 1.809102 | -1.630959 | -1.006704 | 0.121054 |
| 2000-01-12 | 1.474755 | -0.673744 | 1.629237 | 0.017543 |
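
Forward filling can also be capped so that only a certain number of new rows after each observation are filled. The sketch below uses a deterministic stand-in frame and a limit of 2, both chosen for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the random weekly frame (values 0..7 instead of random numbers).
frame = pd.DataFrame(np.arange(8.0).reshape(2, 4),
                     index=pd.date_range('2000-01-05', periods=2, freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])

# Forward-fill at most two consecutive new rows after each observed value.
limited = frame.resample('D').ffill(limit=2)

print(limited)
```

Here January 6 and 7 carry the January 5 values, while January 8 through 11 stay NaN.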


In [27]:
# Note that the new date index need not overlap with the old one at all.
# (The values below differ from the frame shown above, presumably because this cell, In [27], ran before the frame was regenerated in In [28].)

frame.resample('W-THU').ffill()


| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01-06 | 3.984161 | -0.603598 | 1.110562 | -0.420589 |
| 2000-01-13 | 1.016859 | 0.313414 | -2.413059 | -1.038709 |


## Resampling with periods


In [31]:
frame = DataFrame(np.random.randn(24, 4),
                  index=pd.period_range('1-2000', '12-2001', freq='M'),
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame[:5]

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01 | -0.165357 | 0.144137 | -1.021694 | -0.353892 |
| 2000-02 | -0.180151 | 0.756981 | 0.503628 | 0.868642 |
| 2000-03 | 0.828445 | 0.148229 | 0.315159 | -0.744768 |
| 2000-04 | 0.913425 | -0.924731 | -0.337093 | -0.800243 |
| 2000-05 | 1.058702 | 1.541956 | 0.078934 | 0.454098 |


In [34]:
annual_frame = frame.resample('A-DEC').mean()
annual_frame


| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2001 | -0.131635 | 0.070121 | -0.594448 | 0.207032 |


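Upsampling is more nuanced, because you must decide which end of each span in the new frequency receives the original value, just as with the asfreq method. The convention argument defaults to 'start' but can also be set to 'end':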
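
The start/end choice can be seen in isolation on a single Period with asfreq. This sketch converts a quarter to monthly frequency instead of a year to quarterly; that substitution is an assumption made to keep the frequency strings simple, and the idea is the same.

```python
import pandas as pd

# A quarterly period in a year that ends in December (illustrative choice).
q = pd.Period('2000Q1', freq='Q-DEC')

# Convert to monthly frequency, anchoring at the first or the last month of the quarter.
first_month = q.asfreq('M', how='start')
last_month = q.asfreq('M', how='end')

print(first_month, last_month)
```

With 'start' the value lands on the first sub-period of the span, and with 'end' on the last one.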



In [36]:
annual_frame.resample('Q-DEC').ffill()


| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000Q1 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2000Q2 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2000Q3 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2000Q4 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2001Q1 | -0.131635 | 0.070121 | -0.594448 | 0.207032 |
| 2001Q2 | -0.131635 | 0.070121 | -0.594448 | 0.207032 |
| 2001Q3 | -0.131635 | 0.070121 | -0.594448 | 0.207032 |
| 2001Q4 | -0.131635 | 0.070121 | -0.594448 | 0.207032 |


In [38]:
annual_frame.resample('Q-DEC', convention='end').ffill()


| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000Q4 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2001Q1 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2001Q2 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2001Q3 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2001Q4 | -0.131635 | 0.070121 | -0.594448 | 0.207032 |


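Since periods refer to time spans, the rules for upsampling and downsampling are more rigid:

- In downsampling, each source period must fall entirely within a single target period (the source frequency is a subperiod of the target).
- In upsampling, each target period must fall entirely within a single source period (the source frequency is a superperiod of the target).

If these rules are not satisfied, an exception is raised. This mainly affects the quarterly, annual, and weekly frequencies. For example, the time spans defined by Q-MAR only line up with A-MAR, A-JUN, A-SEP, and A-DEC: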

In [39]:
annual_frame.resample('Q-MAR').ffill()


| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000Q4 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2001Q1 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2001Q2 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2001Q3 | 0.077144 | -0.190804 | -0.157145 | 0.171407 |
| 2001Q4 | -0.131635 | 0.070121 | -0.594448 | 0.207032 |
| 2002Q1 | -0.131635 | 0.070121 | -0.594448 | 0.207032 |
| 2002Q2 | -0.131635 | 0.070121 | -0.594448 | 0.207032 |
| 2002Q3 | -0.131635 | 0.070121 | -0.594448 | 0.207032 |
