## Time Series

- Timestamps: specific instants in time
- Fixed periods (the day 2021-1-1, the month 2021-1, the year 2021)
- Intervals of time (a start and an end)
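
As a rough illustration of the three notions above, the sketch below maps them onto pandas objects. The choice of `pd.Timestamp`, `pd.Period`, and `pd.Interval` is an assumption made for this example; the notes do not name these types at this point.

```python
import pandas as pd

# A timestamp: one specific instant.
instant = pd.Timestamp("2021-01-01 12:30")

# A fixed period: the whole month of January 2021.
month = pd.Period("2021-01", freq="M")

# An interval: an explicit start and end.
span = pd.Interval(pd.Timestamp("2021-01-01"), pd.Timestamp("2021-03-31"))

print(instant.hour)       # hour component of the instant
print(month.start_time)   # first instant covered by the period
print(span.length)        # duration between start and end
```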

### Date and Time Data Types

- Python's standard library offers three modules for working with dates and times: datetime, time, and calendar

In [3]:
from datetime import datetime
now = datetime.now()
now

datetime.datetime(2021, 11, 23, 12, 34, 44, 50577)

In [4]:
now.year, now.month, now.day

(2021, 11, 23)
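
The cells above only exercise `datetime`. A minimal sketch of the other two modules follows; the specific calls shown (`calendar.monthrange`, `calendar.isleap`, `time.time`) are picked here for illustration and are not taken from the notes.

```python
import calendar
import time

# calendar: weekday of the first day (Monday is 0) and number of days in November 2021.
first_weekday, n_days = calendar.monthrange(2021, 11)
print(first_weekday, n_days)

# calendar: leap-year check.
print(calendar.isleap(2020))

# time: seconds since the Unix epoch, as a float.
epoch_seconds = time.time()
print(epoch_seconds > 0)
```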

- timedelta is the type that stores a span of time, such as the difference between two datetime objects

In [5]:
delta = datetime(2017,1,7) - datetime(2011, 6, 24, 8, 15)
delta

datetime.timedelta(days=2023, seconds=56700)

In [8]:
delta.days, delta.seconds

(2023, 56700)
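
Note that `delta.seconds` holds only the part of the duration left over after whole days, not the full length. A short sketch of the difference, reusing the same two dates:

```python
from datetime import datetime

delta = datetime(2017, 1, 7) - datetime(2011, 6, 24, 8, 15)

# .seconds is only the leftover part of the last day (0 to 86399).
print(delta.seconds)

# .total_seconds() converts the whole duration, days included.
total = delta.total_seconds()
print(total)
```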

- Adding or subtracting a timedelta (or a multiple of one) shifts a datetime forward or backward

In [9]:
from datetime import timedelta
start = datetime(2011, 1, 7)
start + timedelta(12)

datetime.datetime(2011, 1, 19, 0, 0)

In [10]:
start - 2 * timedelta(12)

datetime.datetime(2010, 12, 14, 0, 0)
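
In `timedelta(12)` the positional argument is a number of days. Keyword arguments make the unit explicit; the sketch below uses `weeks` and `hours`, values chosen only for illustration.

```python
from datetime import datetime, timedelta

start = datetime(2011, 1, 7)

# Keyword arguments name the unit; mixed units are summed.
shifted = start + timedelta(weeks=1, hours=6)
print(shifted)

# Subtracting two datetimes gives back a timedelta.
print(shifted - start)
```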

### Converting Between String and Datetime

- Use str() or the datetime method strftime() to convert a datetime to a string

%Y -> Four-digit year

%y -> Two-digit year

%m -> Two-digit month [01, 12] 

%d -> Two-digit day [01, 31]

%H -> Hour (24-hour clock) [00, 23]

%I -> Hour (12-hour clock) [01, 12]

%M -> Two-digit minute [00, 59]

%S -> Second [00, 61] (seconds 60, 61 account for leap seconds)

%w -> Weekday as integer [0 (Sunday), 6]


In [14]:
stamp = datetime(2011, 1, 3)
str(stamp) , stamp.strftime("%Y-%m-%d")

('2011-01-03 00:00:00', '2011-01-03')
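
To see a few more of the format codes listed above in action, here is a small sketch. The time of day (22:15) is added here for illustration so that the 12-hour and 24-hour codes differ.

```python
from datetime import datetime

stamp = datetime(2011, 1, 3, 22, 15)

short = stamp.strftime("%y/%m/%d")     # two-digit year
clock24 = stamp.strftime("%H:%M:%S")   # 24-hour clock
clock12 = stamp.strftime("%I:%M")      # 12-hour clock
weekday = stamp.strftime("%w")         # 0 is Sunday

print(short, clock24, clock12, weekday)
```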

- Use datetime.strptime() to convert a string to a datetime

In [16]:
value = '2011-03-01'
datetime.strptime(value, "%Y-%m-%d")

datetime.datetime(2011, 3, 1, 0, 0)

In [17]:
datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(x,'%m/%d/%Y') for x in datestrs]

[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
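
`strptime` raises `ValueError` when the text does not match the given format. One way to cope with mixed inputs is to try several formats in turn; `parse_with_formats` below is a hypothetical helper written for this sketch, not part of the standard library.

```python
from datetime import datetime

def parse_with_formats(text, formats):
    """Return the first successful strptime result, or None if no format fits."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # strptime raises ValueError when the text does not match fmt.
            continue
    return None

candidates = ["%Y-%m-%d", "%m/%d/%Y"]
print(parse_with_formats("7/6/2011", candidates))
print(parse_with_formats("2011-03-01", candidates))
print(parse_with_formats("not a date", candidates))
```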

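datetime.strptime() works well when the format is known, but specifying the format every time is tedious. The parse function of the third-party package dateutil can recognize many common formats automatically.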

- pip install python-dateutil

In [20]:
from dateutil.parser import parse

parse('2011-01-03') , parse('Jan 31, 1997 10:15 PM')

(datetime.datetime(2011, 1, 3, 0, 0), datetime.datetime(1997, 1, 31, 22, 15))

- In many locales the day comes before the month; pass dayfirst=True to indicate this

In [23]:
parse('6/12/2011',dayfirst=True)

datetime.datetime(2011, 12, 6, 0, 0)
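
For comparison, the sketch below parses the same string with and without `dayfirst=True`, which shows that the default reading of an ambiguous date puts the month first.

```python
from dateutil.parser import parse

text = "6/12/2011"

month_first = parse(text)                # default: month before day
day_first = parse(text, dayfirst=True)   # day before month

print(month_first.month, month_first.day)
print(day_first.month, day_first.day)
```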

- pandas provides the to_datetime method, which quickly converts a list of date strings into a DatetimeIndex

In [25]:
import pandas as pd

datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']
pd.to_datetime(datestrs)
            

DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
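
`to_datetime` also copes with missing entries: a `None` in the input becomes `NaT` (Not a Time), the missing-value marker for timestamps. A minimal sketch, with the `None` entry added here for illustration:

```python
import pandas as pd

datestrs = ["2011-07-06 12:00:00", "2011-08-06 00:00:00"]

# A missing entry (None) becomes NaT in the resulting DatetimeIndex.
idx = pd.to_datetime(datestrs + [None])
print(idx)

# pd.isna flags the NaT positions.
missing = pd.isna(idx)
print(missing)
```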

### Time Series Basics

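- Most often, the timestamps serve as the index of a Series or DataFrame
- The index can be built from datetime objects or from date strings (converted with pd.to_datetime)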

In [2]:
import numpy as np
import pandas as pd
from datetime import datetime

dates = [datetime(2021,1,2),
         datetime(2021,1,5),
         datetime(2021,1,9),
         datetime(2021,1,13),
         datetime(2021,1,21)]
ts = pd.Series(np.random.randn(5),index=dates)
ts

2021-01-02    0.635659
2021-01-05    0.866713
2021-01-09   -0.986443
2021-01-13   -0.855726
2021-01-21   -1.555767
dtype: float64

In [3]:
ts.index

DatetimeIndex(['2021-01-02', '2021-01-05', '2021-01-09', '2021-01-13',
               '2021-01-21'],
              dtype='datetime64[ns]', freq=None)

In [7]:
stamp = ts.index[0]
stamp, stamp.year, stamp.day, stamp.hour, stamp.minute, stamp.second

(Timestamp('2021-01-02 00:00:00'), 2021, 2, 0, 0, 0)
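
When the dates start out as strings, they need converting before pandas treats them as timestamps. The sketch below contrasts a raw string index with one passed through `pd.to_datetime`; the values are made up for illustration.

```python
import pandas as pd

date_strings = ["2021-01-02", "2021-01-05", "2021-01-09"]
values = [1.0, 2.0, 3.0]

# Plain strings stay an ordinary index of labels.
raw = pd.Series(values, index=date_strings)
print(type(raw.index).__name__)

# Converting first yields a DatetimeIndex of Timestamp objects.
ts = pd.Series(values, index=pd.to_datetime(date_strings))
print(type(ts.index).__name__)
print(ts.index[0].year)
```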

### Indexing, Selection, Subsetting

In [8]:
ts

2021-01-02    0.635659
2021-01-05    0.866713
2021-01-09   -0.986443
2021-01-13   -0.855726
2021-01-21   -1.555767
dtype: float64

In [9]:
# Use a Timestamp as the index key

stamp=ts.index[2]
ts[stamp]

-0.9864425754189368

In [12]:
# Use a date string as the index key
ts['1/2/2021'], ts['20210102']

(0.6356590383348519, 0.6356590383348519)

In [13]:
# Build a longer time series with pd.date_range()
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2019',periods=1000))
longer_ts

2019-01-01    0.221904
2019-01-02   -0.369424
2019-01-03    0.214481
2019-01-04   -0.188063
2019-01-05    0.579601
                ...   
2021-09-22   -0.503608
2021-09-23   -1.385307
2021-09-24    0.619705
2021-09-25    0.997681
2021-09-26    0.647184
Freq: D, Length: 1000, dtype: float64

In [15]:
# Select by year only
longer_ts['2019']

2019-01-01    0.221904
2019-01-02   -0.369424
2019-01-03    0.214481
2019-01-04   -0.188063
2019-01-05    0.579601
                ...   
2019-12-27   -0.022774
2019-12-28    2.039316
2019-12-29   -0.663657
2019-12-30    0.206417
2019-12-31    1.635915
Freq: D, Length: 365, dtype: float64

In [17]:
# Select by year and month
longer_ts['2019-05']

2019-05-01    0.927582
2019-05-02   -0.705121
2019-05-03    1.045496
2019-05-04    1.402669
2019-05-05    0.437554
2019-05-06    0.288062
2019-05-07   -0.148887
2019-05-08   -0.609939
2019-05-09    0.069485
2019-05-10   -1.374335
2019-05-11    1.678401
2019-05-12    0.271153
2019-05-13   -0.731405
2019-05-14    0.692798
2019-05-15   -0.838299
2019-05-16   -0.779959
2019-05-17   -1.863190
2019-05-18   -1.134650
2019-05-19    1.498687
2019-05-20   -0.764551
2019-05-21    0.885190
2019-05-22   -1.247698
2019-05-23    0.963204
2019-05-24    0.699194
2019-05-25    1.296661
2019-05-26   -0.151037
2019-05-27   -0.565532
2019-05-28   -1.330525
2019-05-29    0.806819
2019-05-30    0.347592
2019-05-31   -0.060094
Freq: D, dtype: float64
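
The same partial-string selections can also be written with `.loc`, which makes it explicit that the lookup is label-based. In this sketch the random values are replaced by `np.arange` so the result is reproducible; that substitution is an assumption of the example.

```python
import numpy as np
import pandas as pd

longer_ts = pd.Series(np.arange(1000, dtype=float),
                      index=pd.date_range("2019-01-01", periods=1000))

# A year string selects every day of that year.
year_2019 = longer_ts.loc["2019"]

# A year-month string selects every day of that month.
may_2019 = longer_ts.loc["2019-05"]

print(len(year_2019), len(may_2019))
```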

In [22]:
# Slice with a datetime object
longer_ts[datetime(2021,1,1):]

2021-01-01   -0.211341
2021-01-02    1.354876
2021-01-03    0.422720
2021-01-04    1.247645
2021-01-05   -0.909123
                ...   
2021-09-22   -0.503608
2021-09-23   -1.385307
2021-09-24    0.619705
2021-09-25    0.997681
2021-09-26    0.647184
Freq: D, Length: 269, dtype: float64

In [23]:
longer_ts['2021-1-1':'2021-3-31']

2021-01-01   -0.211341
2021-01-02    1.354876
2021-01-03    0.422720
2021-01-04    1.247645
2021-01-05   -0.909123
                ...   
2021-03-27   -0.441707
2021-03-28   -0.547188
2021-03-29    0.969223
2021-03-30    2.180465
2021-03-31   -1.650817
Freq: D, Length: 90, dtype: float64
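
Slice bounds do not have to be dates that exist in the index: on a sorted index, pandas returns whatever falls inside the range. The sketch below applies this to the irregular five-date series from earlier (with `np.arange` values substituted for the random ones) and adds `truncate`, which performs a similar cut.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2021-01-02", "2021-01-05", "2021-01-09",
                        "2021-01-13", "2021-01-21"])
ts = pd.Series(np.arange(5, dtype=float), index=dates)

# Neither bound is in the index; only 2021-01-09 lies between them.
window = ts["2021-01-06":"2021-01-11"]
print(window)

# truncate drops everything after the given date.
head = ts.truncate(after="2021-01-09")
print(head)
```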

In [26]:
#DataFrame
dates = pd.date_range('1/1/2021',periods=100)
long_df = pd.DataFrame(np.random.randn(100,4),
                      index = dates,
                      columns = ['台北','台中','高雄','花蓮'])
long_df

| | 台北 | 台中 | 高雄 | 花蓮 |
|---|---|---|---|---|
| 2021-01-01 | 1.657568 | -1.352675 | 0.634564 | 2.109804 |
| 2021-01-02 | 0.594663 | 0.309041 | -1.310166 | 0.853669 |
| 2021-01-03 | -1.293925 | -1.652481 | 1.920713 | -1.114137 |
| 2021-01-04 | -0.158570 | 2.326904 | 1.312809 | -2.104593 |
| 2021-01-05 | -1.212333 | 1.373365 | 2.175019 | -0.842572 |
| ... | ... | ... | ... | ... |
| 2021-04-06 | -0.246304 | -0.123489 | 0.015097 | 0.106791 |
| 2021-04-07 | 0.747633 | 0.843869 | -0.830307 | 1.916031 |
| 2021-04-08 | 0.104077 | 0.693688 | -1.108173 | -0.921798 |
| 2021-04-09 | 1.874833 | 0.770297 | 1.825950 | 0.327793 |


In [29]:
# Select rows with loc
long_df.loc['2021-03']

| | 台北 | 台中 | 高雄 | 花蓮 |
|---|---|---|---|---|
| 2021-03-01 | -0.971071 | -1.122776 | 0.230993 | -0.635563 |
| 2021-03-02 | -0.671686 | -0.667776 | 0.666699 | 1.520706 |
| 2021-03-03 | 1.5458 | 0.623324 | -0.788679 | -0.768983 |
| 2021-03-04 | 1.802388 | -0.067553 | -0.789496 | -1.529197 |
| 2021-03-05 | 0.15391 | -0.559144 | 0.87258 | -0.49222 |
| 2021-03-06 | 0.179281 | 0.136172 | -1.703999 | 0.623671 |
| 2021-03-07 | 0.553433 | -1.255597 | 0.429705 | 1.046209 |
| 2021-03-08 | -1.405561 | 0.051523 | -0.789161 | -1.650783 |
| 2021-03-09 | 1.446504 | -2.181151 | -0.2343 | 0.14459 |
| 2021-03-10 | 0.514304 | 0.183046 | 1.449759 | 0.368212 |
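
With a DataFrame, `loc` can combine a date selection on the rows with a column selection. A minimal sketch, again swapping the random values for `np.arange` so the shapes are easy to check:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2021-01-01", periods=100)
long_df = pd.DataFrame(np.arange(400, dtype=float).reshape(100, 4),
                       index=dates,
                       columns=["台北", "台中", "高雄", "花蓮"])

# A year-month string selects every row in March 2021.
march = long_df.loc["2021-03"]

# A date slice (both ends included) plus a list of columns.
subset = long_df.loc["2021-03-01":"2021-03-10", ["台北", "高雄"]]

print(march.shape, subset.shape)
```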


### Time Series with Duplicate Indices

In [36]:
dates = pd.DatetimeIndex(['1/1/2021','1/2/2021','1/2/2021','1/2/2021','1/3/2021'])
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts

2021-01-01    0
2021-01-02    1
2021-01-02    2
2021-01-02    3
2021-01-03    4
dtype: int64

In [37]:
# Check whether the index contains duplicates
dup_ts.index.is_unique

False

In [38]:
dup_ts['1/3/2021'], dup_ts['1/2/2021']

(4,
 2021-01-02    1
 2021-01-02    2
 2021-01-02    3
 dtype: int64)

In [39]:
# Aggregate the duplicated timestamps with groupby()
grouped = dup_ts.groupby(level=0)
grouped.mean(), grouped.count()

(2021-01-01    0.0
 2021-01-02    2.0
 2021-01-03    4.0
 dtype: float64,
 2021-01-01    1
 2021-01-02    3
 2021-01-03    1
 dtype: int64)
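
Aggregating is one option; another is to keep a single row per timestamp. The sketch below keeps the first occurrence using `Index.duplicated`. This alternative is not from the notes and is shown only as one possible approach.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(["2021-01-01", "2021-01-02", "2021-01-02",
                          "2021-01-02", "2021-01-03"])
dup_ts = pd.Series(np.arange(5), index=dates)

# duplicated() marks every repeat after the first occurrence.
is_repeat = dup_ts.index.duplicated(keep="first")

# Keep only the rows that are not repeats.
deduped = dup_ts[~is_repeat]
print(deduped)
print(deduped.index.is_unique)
```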