## Common Data Types and Operations

### Agenda

- Strings
- Numeric types
- Time types
- Categorical types

In [40]:
import pandas as pd
import numpy as np

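### Strings

Strings are one of the most commonly used data types. String operations in Pandas are almost identical to native Python string operations; the only difference is that they are called through the `.str` accessor, which applies the operation to every element of the column.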

In [41]:
df = pd.DataFrame({'A': ['one', 'two', 'three#', 'four'],
                   'B': np.random.randn(4)})

df

|   | A | B |
|---|---|---|
| 0 | one | 0.653237 |
| 1 | two | 1.020075 |
| 2 | three# | 0.098127 |
| 3 | four | -1.01887 |


Character replacement

In [42]:
# regex=False treats '#' as a literal character rather than a regular expression
df['A'] = df['A'].str.replace('#', '', regex=False)
df

|   | A | B |
|---|---|---|
| 0 | one | 0.653237 |
| 1 | two | 1.020075 |
| 2 | three | 0.098127 |
| 3 | four | -1.01887 |
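
As a further illustration of the `.str` accessor, the sketch below applies a few other string methods element by element. The Series `s` and the variable names are hypothetical and chosen only to mirror column `A`; this is a minimal example, not an exhaustive tour of the accessor.

```python
import pandas as pd

# Hypothetical sample data, mirroring column A above.
s = pd.Series(['one', 'two', 'three#', 'four'])

upper = s.str.upper()                          # element-wise upper-casing
lengths = s.str.len()                          # length of each string
has_o = s.str.contains('o', regex=False)       # literal substring test
cleaned = s.str.replace('#', '', regex=False)  # literal replacement

print(upper.tolist())
print(lengths.tolist())
print(has_o.tolist())
print(cleaned.tolist())
```

Each call returns a new Series with the same index as `s`, so the result can be assigned straight back to a DataFrame column.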


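### Numeric Types

For numeric data, the most common operations are calculations: mainly operations on a single column and operations between columns.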

In [43]:
df['C'] = np.random.randn(4)

df

|   | A | B | C |
|---|---|---|---|
| 0 | one | 0.653237 | -0.253951 |
| 1 | two | 1.020075 | 0.156008 |
| 2 | three | 0.098127 | 0.346802 |
| 3 | four | -1.01887 | -1.311769 |


Taking the absolute value

In [44]:
df['B'] = np.abs(df['B'])

df

|   | A | B | C |
|---|---|---|---|
| 0 | one | 0.653237 | -0.253951 |
| 1 | two | 1.020075 | 0.156008 |
| 2 | three | 0.098127 | 0.346802 |
| 3 | four | 1.01887 | -1.311769 |


Arithmetic between columns

In [45]:
df['D'] = df['B'] + df['C']

df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | one | 0.653237 | -0.253951 | 0.399286 |
| 1 | two | 1.020075 | 0.156008 | 1.176083 |
| 2 | three | 0.098127 | 0.346802 | 0.444929 |
| 3 | four | 1.01887 | -1.311769 | -0.292899 |
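
The two kinds of calculation described in this section can be sketched side by side. The DataFrame `nums` and its values are hypothetical, picked so that the results are exact and reproducible rather than random.

```python
import pandas as pd

# Hypothetical fixed values so the results are reproducible.
nums = pd.DataFrame({'B': [1.5, -2.0, 0.25], 'C': [0.5, 4.0, -0.25]})

# Single-column operations
nums['B_abs'] = nums['B'].abs()   # Series method for the absolute value
nums['B_sq'] = nums['B'] ** 2     # element-wise power

# Column-to-column operations, aligned row by row on the index
nums['diff'] = nums['B'] - nums['C']
nums['prod'] = nums['B'] * nums['C']

print(nums)
```

Because the arithmetic is aligned on the index, two columns of the same DataFrame always combine row by row without any explicit loop.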


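### Time Types

Pandas provides powerful functions for dates, times, and time series, such as frequency conversion and time zone conversion.

Generating a date range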

In [46]:
df['E'] = pd.date_range('20200101', periods=4)

df

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | one | 0.653237 | -0.253951 | 0.399286 | 2020-01-01 |
| 1 | two | 1.020075 | 0.156008 | 1.176083 | 2020-01-02 |
| 2 | three | 0.098127 | 0.346802 | 0.444929 | 2020-01-03 |
| 3 | four | 1.01887 | -1.311769 | -0.292899 | 2020-01-04 |


Creating a timestamp

In [47]:
df['F'] = pd.Timestamp('20190101')

df

|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | one | 0.653237 | -0.253951 | 0.399286 | 2020-01-01 | 2019-01-01 |
| 1 | two | 1.020075 | 0.156008 | 1.176083 | 2020-01-02 | 2019-01-01 |
| 2 | three | 0.098127 | 0.346802 | 0.444929 | 2020-01-03 | 2019-01-01 |
| 3 | four | 1.01887 | -1.311769 | -0.292899 | 2020-01-04 | 2019-01-01 |


Subtracting datetimes

In [48]:
df['G'] = df['E'] - df['F']

df

|   | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 0 | one | 0.653237 | -0.253951 | 0.399286 | 2020-01-01 | 2019-01-01 | 365 days |
| 1 | two | 1.020075 | 0.156008 | 1.176083 | 2020-01-02 | 2019-01-01 | 366 days |
| 2 | three | 0.098127 | 0.346802 | 0.444929 | 2020-01-03 | 2019-01-01 | 367 days |
| 3 | four | 1.01887 | -1.311769 | -0.292899 | 2020-01-04 | 2019-01-01 | 368 days |
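
The frequency and time zone conversions mentioned at the start of this section are not shown in the cells above, so here is a minimal sketch of both, together with extracting whole days from a timedelta column like `G`. The hourly series `ts`, the choice of `Asia/Shanghai`, and all variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical hourly series covering two days (values 0..47).
idx = pd.date_range('2020-01-01', periods=48, freq=pd.Timedelta(hours=1))
ts = pd.Series(range(48), index=idx)

# Frequency conversion: aggregate hourly values into daily totals.
daily = ts.resample('D').sum()

# Time zone conversion: label the naive index as UTC, then convert it.
ts_utc = ts.tz_localize('UTC')
ts_shanghai = ts_utc.tz_convert('Asia/Shanghai')

# Timedelta values expose their components through the .dt accessor.
delta = pd.Series(pd.date_range('20200101', periods=4)) - pd.Timestamp('20190101')
days = delta.dt.days

print(daily)
print(ts_shanghai.index[0])
print(days.tolist())
```

`tz_localize` only attaches a time zone to naive timestamps, while `tz_convert` shifts already-aware timestamps to another zone, which is why the two calls are used in that order.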


### Categorical Types

A Pandas DataFrame can contain categorical data.

Creating a column and converting it to categorical data

In [49]:
df["H"] = ["very good", "good", "good", "very bad"]
df["H"] = df["H"].astype("category")
df["H"]

0    very good
1         good
2         good
3     very bad
Name: H, dtype: category
Categories (3, object): [good, very bad, very good]

Setting the full set of categories

In [50]:
df["H"] = df["H"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["H"]

0    very good
1         good
2         good
3     very bad
Name: H, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

Sorting follows the order in which the categories were defined, not the alphabetical order of the strings

In [51]:
df.sort_values(by="H")

|   | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| 3 | four | 1.01887 | -1.311769 | -0.292899 | 2020-01-04 | 2019-01-01 | 368 days | very bad |
| 1 | two | 1.020075 | 0.156008 | 1.176083 | 2020-01-02 | 2019-01-01 | 366 days | good |
| 2 | three | 0.098127 | 0.346802 | 0.444929 | 2020-01-03 | 2019-01-01 | 367 days | good |
| 0 | one | 0.653237 | -0.253951 | 0.399286 | 2020-01-01 | 2019-01-01 | 365 days | very good |
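
If the categories carry a real ranking, one option is to declare them as ordered, which additionally allows comparisons. The sketch below uses `pd.CategoricalDtype` with `ordered=True`; this is a variation on the cells above rather than what they do, and the names `levels`, `ratings`, and `cat` are hypothetical.

```python
import pandas as pd

levels = ["very bad", "bad", "medium", "good", "very good"]
ratings = pd.Series(["very good", "good", "good", "very bad"])

# ordered=True declares that the categories have a meaningful order.
cat = ratings.astype(pd.CategoricalDtype(categories=levels, ordered=True))

sorted_cat = cat.sort_values()   # follows `levels`, not alphabetical order
at_least_good = cat >= "good"    # comparisons work on ordered categoricals
codes = cat.cat.codes            # integer position of each value in `levels`

print(sorted_cat.tolist())
print(at_least_good.tolist())
print(codes.tolist())
```

The integer codes make the sorting behaviour visible: values are ordered by their position in `levels`, so "very bad" (code 0) comes before "good" (code 3).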


When grouping by a categorical column (`groupby`), categories that contain no rows are still shown in the result

In [52]:
df.groupby("H").size()

H
very bad     1
bad          0
medium       0
good         2
very good    1
dtype: int64
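
Whether empty categories appear is controlled by the `observed` parameter of `groupby`, and its default has changed across pandas versions, so passing it explicitly may be the safer choice. The sketch below contrasts the two settings on a hypothetical single-column DataFrame named `frame`.

```python
import pandas as pd

levels = ["very bad", "bad", "medium", "good", "very good"]
frame = pd.DataFrame({
    "H": pd.Categorical(["very good", "good", "good", "very bad"],
                        categories=levels)
})

# observed=False keeps categories that never occur, with a count of 0.
all_levels = frame.groupby("H", observed=False).size()

# observed=True reports only the categories present in the data.
seen_only = frame.groupby("H", observed=True).size()

print(all_levels)
print(seen_only)
```

Keeping the empty categories is useful for reports that need a fixed set of rows, while `observed=True` avoids zero-count groups when there are many unused categories.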