[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zhujisheng/learn_python/blob/master/08.科学计算与作图/5.Pandas数据处理.ipynb)

["Python in Practice" (《Python应用实战》) video course](https://study.163.com/course/courseMain.htm?courseId=1209533804&share=2&shareId=400000000624093)

# Data Processing with Pandas

Difficulty: ★★★☆☆

# Basic Concepts

- Panel data sheet

|Year|Income (×10,000)|Expenses (×10,000)|
|-----|-----|-----|
|1999|326.34|403.23|
|2000|235.34|213.23|
|……|……|……|
|2019|4098.12|3213.42|

- Pandas is a library for data processing and analysis
- Pandas provides ready-made Series and DataFrame objects that make working with tabular data simpler

- Installation

  `pip install pandas`

- [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)

## DataFrame and Series

#### DataFrame

- A DataFrame is a two-dimensional table, much like a table in a relational database, made up of labeled rows and columns

![DataFrame](images/dataframe.JPG)

In [None]:
import numpy as np
import pandas as pd

values = [
            [1985, np.nan, "Biking",   68],
            [1984, 3,      "Dancing",  83],
            [1992, 0,      np.nan,    112]
         ]
people = pd.DataFrame(values,
                     columns=["birthyear", "children", "hobby", "weight"],
                     index=["alice", "bob", "charles"]
                     )
people

#### Series

- A Series is a single column of values with an index; a DataFrame is made up of several Series that share the same index

![Series](images/series.JPG)

In [None]:
people.birthyear

In [None]:
people["birthyear"]
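
To make the relationship between the two objects concrete, here is a minimal sketch that assembles a DataFrame from two Series sharing one index. The names `birthyear`, `weight`, and `df` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

index = ["alice", "bob", "charles"]
birthyear = pd.Series([1985, 1984, 1992], index=index, name="birthyear")
weight = pd.Series([68, 83, 112], index=index, name="weight")

# Combine the Series column-wise; rows are aligned by index label
df = pd.concat([birthyear, weight], axis=1)

# Selecting a column gives back a Series that carries the same index
column = df["birthyear"]
print(type(column).__name__)
print(column.index.equals(df.index))
```

Both `df.birthyear` and `df["birthyear"]` return the same Series. The bracket form also works for column names that are not valid Python identifiers or that clash with DataFrame attributes.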

## Working with DataFrames

#### Accessing a single value in a DataFrame

In [None]:
people["birthyear"]["alice"]

In [None]:
people.loc["alice"]["birthyear"]

In [None]:
people.iloc[0]["birthyear"]
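
The three cells above each chain two lookups. This works for reading, but assigning through a chained lookup may not update the original table. As a sketch of the single-step alternatives, the block below rebuilds the same `people` table so it can run on its own, then uses `.loc[row, column]`, `.at`, and `.iat`.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    [[1985, np.nan, "Biking", 68],
     [1984, 3, "Dancing", 83],
     [1992, 0, np.nan, 112]],
    columns=["birthyear", "children", "hobby", "weight"],
    index=["alice", "bob", "charles"],
)

# Label-based lookup of row and column in one step
by_label = people.loc["alice", "birthyear"]

# .at is the scalar-only counterpart of .loc
fast_label = people.at["alice", "birthyear"]

# .iat is the scalar-only counterpart of .iloc (integer positions)
by_position = people.iat[0, 0]

# Assigning through a single indexer modifies the DataFrame itself
people.loc["alice", "weight"] = 70
```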

#### Adding and removing columns (Series) in a DataFrame

In [None]:
people["height"] = [172, 168, 184]
people

In [None]:
people["age"] = 2020 - people["birthyear"]
people

In [None]:
del people["hobby"]
people

In [None]:
people.pop("children")  # removes the column in place and returns it as a Series
people
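
Both `del` and `pop` change the DataFrame in place. When the original table should stay intact, `DataFrame.drop` returns a modified copy instead. The sketch below uses a reduced `people` table so it is self-contained; `without_weight` and `weights` are illustrative names.

```python
import pandas as pd

people = pd.DataFrame(
    {"birthyear": [1985, 1984, 1992], "weight": [68, 83, 112]},
    index=["alice", "bob", "charles"],
)

# drop returns a new DataFrame; at this point `people` still has both columns
without_weight = people.drop(columns=["weight"])
columns_after_drop = list(people.columns)

# pop removes the column from `people` itself and returns it as a Series
weights = people.pop("weight")
columns_after_pop = list(people.columns)
```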

#### Getting an overview of a DataFrame

In [None]:
people.head(2)

In [None]:
people.tail(1)

In [None]:
people.info()

In [None]:
people.describe()

## Time Series

- A time series is a Series whose index consists of timestamps
- Pandas provides a rich set of functions for working with time series

In [None]:
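# Build a time series of 12 hourly readings
# (lowercase 'h' is the hourly alias; uppercase 'H' is deprecated since pandas 2.2)
dates = pd.date_range('2019/10/29 5:30pm', periods=12, freq='h')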
temperatures = [4.4,5.1,6.1,6.2,6.1,6.1,5.7,5.2,4.7,4.1,3.9,3.5]

temp_series = pd.Series(temperatures, dates)
temp_series

In [None]:
# Plot the time series
%matplotlib inline
import matplotlib.pyplot as plt

temp_series.plot(kind="bar")
plt.show()

In [None]:
# Downsample: take the minimum of each 2-hour window
temp_series_freq_2H = temp_series.resample("2h").min()
temp_series_freq_2H.plot(kind="bar")
plt.show()
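
Downsampling groups the timestamps into fixed windows and reduces each window with an aggregate function. As a sketch of how the windows are formed, the block below rebuilds the temperature series and computes several aggregates at once with `agg`; the name `summary` is illustrative.

```python
import pandas as pd

dates = pd.date_range("2019-10-29 17:30", periods=12, freq="h")
temperatures = [4.4, 5.1, 6.1, 6.2, 6.1, 6.1, 5.7, 5.2, 4.7, 4.1, 3.9, 3.5]
temp_series = pd.Series(temperatures, index=dates)

# Each 2-hour window is labeled by its left edge (16:00, 18:00, ...)
summary = temp_series.resample("2h").agg(["min", "max", "mean"])
print(summary)
```

The windows are aligned to even hours rather than to the first reading. The first window (16:00 to 18:00) therefore holds only the 17:30 sample, and the last one holds only the 04:30 sample.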

In [None]:
# Upsample to 15-minute intervals, filling the gaps by quadratic interpolation (requires SciPy)

temp_series_freq_15min = temp_series.resample("15min").interpolate(method="quadratic")

temp_series.plot(label="Period: 1 hour")
temp_series_freq_15min.plot(label="Period: 15 minutes")
plt.legend()
plt.show()
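
Upsampling first creates the new, finer timestamps with missing values and then fills them in. The quadratic method above depends on SciPy. As a minimal sketch that needs only pandas, the block below applies linear interpolation to a two-point series; `hourly` and `quarter_hourly` are illustrative names.

```python
import pandas as pd

hourly = pd.Series(
    [4.0, 6.0],
    index=pd.date_range("2019-10-29 17:00", periods=2, freq="h"),
)

# Resample to 15 minutes, then draw a straight line between known points
quarter_hourly = hourly.resample("15min").interpolate(method="linear")
print(quarter_hourly)
```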

In [None]:
# Time zone handling: mark the naive timestamps as Shanghai time, then convert to Paris time
temp_series_sh = temp_series.tz_localize("Asia/Shanghai")
temp_series_paris = temp_series_sh.tz_convert("Europe/Paris")
temp_series_paris
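
The two calls do different jobs: `tz_localize` attaches a time zone to naive timestamps without shifting the clock time, while `tz_convert` re-expresses the same instants in another zone. The sketch below checks this on a two-point series; `naive`, `shanghai`, and `paris` are illustrative names.

```python
import pandas as pd

naive = pd.Series(
    [4.4, 5.1],
    index=pd.date_range("2019-10-29 17:30", periods=2, freq="h"),
)

# Localizing keeps the wall-clock time and adds the zone
shanghai = naive.tz_localize("Asia/Shanghai")

# Converting changes the wall-clock time but not the instant
paris = shanghai.tz_convert("Europe/Paris")

print(shanghai.index[0])
print(paris.index[0])
same_instant = shanghai.index[0] == paris.index[0]
```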

## Other Features

- Reading and writing data files
- Grouping data (groupby)
- Pivot tables
- Merging and joining tables
- Multi-indexing
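
The features listed above are not covered in detail here. As a brief sketch of what each one looks like, the block below runs them on a small table of invented sales records. The data and all variable names are hypothetical and serve only as an illustration.

```python
import io
import pandas as pd

# Hypothetical sales records, used only for this illustration
csv_text = """city,year,product,sales
Beijing,2018,A,10
Beijing,2019,A,12
Beijing,2019,B,7
Shanghai,2018,A,8
Shanghai,2019,B,15
"""

# Reading: read_csv accepts a file path or any file-like object
sales = pd.read_csv(io.StringIO(csv_text))

# Grouping: total sales per city
per_city = sales.groupby("city")["sales"].sum()

# Pivot table: cities as rows, years as columns, summed sales as values
pivot = sales.pivot_table(index="city", columns="year", values="sales", aggfunc="sum")

# Merging: attach a region to each record through the shared "city" column
regions = pd.DataFrame({"city": ["Beijing", "Shanghai"], "region": ["North", "East"]})
merged = sales.merge(regions, on="city", how="left")

# Multi-indexing: index by (city, year), then select all rows for one city
indexed = sales.set_index(["city", "year"]).sort_index()
beijing = indexed.loc["Beijing"]

# Writing: to_csv returns the CSV text when no path is given
out_text = sales.to_csv(index=False)
```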